This article is based on the Cerebras research article 'TotalEnergies and Cerebras Create Massively Scalable Stencil Algorithm'. All credit for this research goes to the researchers.
Many High-Performance Computing (HPC) applications rely on stencil methods. Stencil computations are a class of algorithms that update elements in a multidimensional grid based on neighboring values using a fixed pattern – the stencil. They’re used to solve various partial differential equations (PDEs), such as those in fluid mechanics, weather forecasting, and seismic imaging.
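To make the idea concrete, here is a minimal sketch of a stencil sweep in plain Python. It uses a simple 5-point averaging stencil on a small 2D grid, which is far smaller than the high-order stencil discussed later; the function name `stencil_step` and the grid values are illustrative assumptions, not taken from the research.

```python
def stencil_step(grid):
    """Apply one 5-point averaging stencil sweep to a 2D grid (list of lists).

    Boundary cells are left unchanged; each interior cell becomes the
    average of its four nearest neighbors.
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # copy so every update reads only old values
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return new


# Illustrative 4x4 grid: a "hot" top edge, everything else starts at zero.
grid = [[1.0] * 4] + [[0.0] * 4 for _ in range(3)]
grid = stencil_step(grid)
```

Every interior update applies the same fixed pattern of neighbor reads, which is what makes this a stencil computation.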
One of the most distinguishing features of stencil algorithms, particularly the high-order schemes discussed later in this piece, is that the computation reads every variable in memory but uses each one in only a few arithmetic operations. In other words, each update needs a large amount of input data while spending little time computing the result. Furthermore, stencil algorithms access input data in a nearest-neighbor pattern, which maps poorly onto DRAM, where reads are most efficient over large contiguous memory regions. These access patterns are well known to be a poor fit for hierarchical memory architectures. Those designs are ideal for computation-intensive applications like dense linear algebra or graphics rendering, where each element of data read is used in a significant number of floating-point operations.
On conventional architectures, stencil algorithms, by contrast, tend to be computationally light and memory-bound: maximum performance is limited by the speed at which data can be transported and read from memory. Being memory-bound has several consequences:
- Increasing the processor unit’s clock speed will not improve the situation.
- When attempting to remedy the problem by adding more processing power, scaling concerns develop.
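The first consequence can be illustrated with the standard roofline bound, where attainable throughput is the smaller of the compute peak and the memory bandwidth multiplied by the arithmetic intensity (FLOPs per byte moved). The sketch below uses made-up device numbers chosen only for illustration; they do not describe any real processor.

```python
def attainable_flops(peak_flops, mem_bandwidth, intensity):
    """Roofline bound: min(compute peak, bandwidth * arithmetic intensity)."""
    return min(peak_flops, mem_bandwidth * intensity)


# Hypothetical device (assumed values, not measurements):
peak = 10e12      # 10 TFLOP/s compute peak
bandwidth = 1e12  # 1 TB/s memory bandwidth
intensity = 2.0   # FLOPs performed per byte read, low as in a stencil kernel

base = attainable_flops(peak, bandwidth, intensity)
faster_clock = attainable_flops(2 * peak, bandwidth, intensity)
more_bandwidth = attainable_flops(peak, 2 * bandwidth, intensity)
```

In this toy model, doubling the compute peak leaves the bound unchanged, while doubling the bandwidth doubles it, which is the behavior expected of a memory-bound kernel.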
The data transfer speed between processor units is crucial to the performance of memory-bound applications. A common way of increasing processing power is to couple several devices together with an interconnect. However, the interconnect offers less bandwidth than a device's internal fabric, so transferring data to another device causes delays, which is the source of the scaling concerns noted above.
There is, however, another option: accept the data-transfer needs and design the algorithm to take full advantage of Cerebras' hardware bandwidth. The work presents a novel way to implement a stencil algorithm on the Cerebras CS-2 System, which is powered by a Wafer-Scale Engine (WSE) that packs 850,000 cores onto a single piece of silicon. The algorithm was written in the Cerebras Software Language (CSL), part of the Cerebras Software Development Kit. The WSE's extraordinarily high memory bandwidth (20 petabytes per second), combined with highly efficient neighbor-to-neighbor communication and a sophisticated algorithm implementation, delivers impressive results.
TotalEnergies created the test problem (Minimod) as a public benchmark to evaluate the performance of new hardware solutions. This research focuses on solving the isotropic acoustic kernel in a constant-density domain. The equation is discretized with a finite-difference (FD) scheme using a 25-point stencil. This means that along every dimension, every point in the discretized space communicates with four neighbors on each side: the center point plus 4 neighbors in each of the 6 directions gives 25 points.
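The shape of such a stencil can be sketched in a few lines of Python. The offsets below follow directly from the description above (radius 4 along each axis). The coefficients are the standard 8th-order central-difference weights for a second derivative; treating them as the ones used in the benchmark is an assumption, and the helper names are hypothetical.

```python
from fractions import Fraction

RADIUS = 4
# Standard 8th-order central-difference weights for d^2/dx^2 (center, then k=1..4).
# Assumed here for illustration; the source only states the stencil size.
COEFFS = [Fraction(-205, 72), Fraction(8, 5), Fraction(-1, 5),
          Fraction(8, 315), Fraction(-1, 560)]


def stencil_offsets(radius=RADIUS):
    """Return the (dx, dy, dz) offsets of a 3D star stencil of given radius."""
    offsets = [(0, 0, 0)]
    for axis in range(3):
        for k in range(1, radius + 1):
            for sign in (-1, 1):
                off = [0, 0, 0]
                off[axis] = sign * k
                offsets.append(tuple(off))
    return offsets


def laplacian(f, x, y, z, h=1):
    """Approximate the Laplacian of f at (x, y, z) on a grid with spacing h."""
    total = 3 * COEFFS[0] * f(x, y, z)  # center weight, once per axis
    for k in range(1, RADIUS + 1):
        d = k * h
        total += COEFFS[k] * (f(x + d, y, z) + f(x - d, y, z)
                              + f(x, y + d, z) + f(x, y - d, z)
                              + f(x, y, z + d) + f(x, y, z - d))
    return total / (h * h)


def paraboloid(x, y, z):
    """Test function whose exact Laplacian is 6 everywhere."""
    return x * x + y * y + z * z


num_points = len(stencil_offsets())
approx = laplacian(paraboloid, 2, 3, 5)
```

Using exact fractions makes it easy to check the weights: the scheme reproduces the Laplacian of a quadratic exactly, and each update touches 25 grid values to do so.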
For this post, the Wafer-Scale Engine (WSE) that powers the CS-2 can be thought of as an on-wafer, fully distributed memory machine. The method is based on custom-designed localized broadcast patterns that can send, receive, and compute data simultaneously at the hardware level. Moving data between Processing Elements (PEs) that are close together can thus be done quickly. The 3D domain is mapped onto the 2D grid of PEs by collapsing the third dimension: two dimensions are spread across the PEs, and each PE holds the full column of points along the third.
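One way to picture this mapping is to ask, for a single grid point, how many of its stencil neighbors live on the same PE and how many must arrive from other PEs. The sketch below assumes the layout described above (x and y index the PE, z stays in that PE's local memory); the function names are hypothetical and this is not the CSL implementation.

```python
def owner_pe(x, y, z):
    """PE that owns grid point (x, y, z); z is collapsed into local memory."""
    return (x, y)


def classify_accesses(radius=4):
    """Count stencil neighbors that are local to a PE vs. held by other PEs."""
    home = owner_pe(0, 0, 0)
    local = remote = 0
    for axis in range(3):
        for k in range(1, radius + 1):
            for sign in (-1, 1):
                off = [0, 0, 0]
                off[axis] = sign * k
                if owner_pe(*off) == home:
                    local += 1   # neighbor along z: same PE, plain memory read
                else:
                    remote += 1  # neighbor along x or y: arrives over the fabric
    return local, remote


local, remote = classify_accesses()
```

Under these assumptions, 8 of the 24 neighbors are local reads and 16 come from nearby PEs, which is why fast neighbor-to-neighbor communication matters so much for this algorithm.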
The comparisons in the paper are made at the accelerator level (i.e., one CS-2 or one A100), ignoring any host communication. The A100s in the TotalEnergies cluster each have 40 GB of on-device memory. As the problem size grows, so does the number of processing elements used. The test problem is repeated 1,000 times.
The WSE-2 beats the A100 by more than 220x at the largest problem size. The WSE-2 takes the same amount of time regardless of the size of the problem, which is consistent with it being compute-bound rather than memory-bound. The WSE-2's weak scaling efficiency is nearly ideal, at better than 98 percent at all scales. Both of these results are astounding to experienced HPC practitioners.
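For readers less familiar with the metric, weak scaling efficiency compares the run time of a baseline problem with the run time when the problem size and the processor count grow together; ideal scaling keeps the time constant, giving an efficiency of 1.0. The sketch below uses invented timings purely for illustration; they are not the measurements from the paper.

```python
def weak_scaling_efficiency(t_base, t_scaled):
    """Efficiency = baseline time / time at larger scale (1.0 is ideal)."""
    return t_base / t_scaled


def speedup(t_reference, t_candidate):
    """How many times faster the candidate is than the reference."""
    return t_reference / t_candidate


# Hypothetical run times in seconds, keyed by relative problem/processor scale.
timings = {1: 1.00, 4: 1.01, 16: 1.02}
efficiencies = {scale: weak_scaling_efficiency(timings[1], t)
                for scale, t in timings.items()}
```

With these made-up numbers, every efficiency stays above 0.98, the kind of near-flat curve the article describes.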
A roofline-model analysis is also performed, and it demonstrates that the implementation is compute-bound. The WSE-2 achieves a total throughput of 503 TFLOPs, a stunning figure for a single device. The findings of this study hold a lot of promise for HPC applications on the WSE-2. The authors are currently working on more complex applications, both stencil-based and hybridized with Machine Learning (ML), especially given the WSE-2's already demonstrated capabilities for ML workloads.