Amazon Researchers Propose ‘MiCS,’ An Artificial Intelligence (AI) System That Attains High Training Throughput and Near-Linear Scalability on The Cloud by Only Using Data Parallelism

Gigantic models are models with billions or even trillions of parameters. Because of significant communication overhead, current general-purpose frameworks for training such models do not scale effectively on public cloud platforms. In a recent study, researchers from Johns Hopkins University, Peking University, and Amazon Web Services proposed MiCS, whose primary goal is to reduce communication overhead by minimizing the communication scale. The study also offers experimental evidence that the training infrastructure must be taken into account when designing architectures, especially for large deep neural networks trained on the public cloud. Experiments on AWS V100 and A100 GPU instances show how unequally distributing the model weights reduces inter-node communication overhead. Because most gradient exchange then occurs within a node, training can proceed more quickly, with the gain depending on the model's size. The project is part of ongoing efforts to make large-scale training more efficient.

For deep neural networks, test loss scales logarithmically with the quantity of input data and the number of network parameters. For this reason, research and industry efforts over the last several years have focused on building high-capacity neural networks that can be adapted to various downstream tasks, for example through supervised fine-tuning. Training compute has grown accordingly, almost doubling every six months, to meet the demands of training such massive networks. As large-scale deep networks have become more widespread, parameter sharding approaches such as ZeRO and GShard have been proposed for training them. Proof-of-concept frameworks are typically developed on on-premise GPU stations with high-bandwidth communication primitives. Industrial applications, however, typically run on the public cloud, where the restrictions on, and limited availability of, architectural components pose additional technical hurdles.

The public cloud uses software-defined, reusable components that make managing compute instances simple. However, inter-node bandwidth in cloud virtual machine clusters is usually 12 to 24 times lower than the intra-node bandwidth between GPUs provided by interconnects such as NVIDIA NVLink and NVSwitch. As a result, distributed gradient synchronization becomes a significant training bottleneck for large deep networks. MiCS holds that model parameters should be kept as close as possible to the GPUs that use them in order to reduce inter-node communication. This is achieved by reducing the size of the model partition and by giving preference to GPUs within the same node. When several nodes are needed to hold the full set of parameters, the weights are divided across the smallest possible number of nodes. The researchers also modify the gradient accumulation approach to account for the unequal weight distribution. As a result, the differences in actual communication cost are mirrored at the algorithmic level.
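The placement rule described above (shard the model over as few GPUs as possible, and prefer GPUs in the same node) can be illustrated with a small sketch. This is not the MiCS implementation: the function names, the `usable_fraction` memory headroom, and the assumption that consecutive ranks sit on the same node are all hypothetical choices made for illustration.

```python
import math


def min_partition_group_size(model_state_gb: float,
                             gpu_memory_gb: float,
                             gpus_per_node: int,
                             usable_fraction: float = 0.8) -> int:
    """Smallest number of GPUs over which to shard the model states.

    Assumption: only `usable_fraction` of each GPU's memory is available
    for model states (the rest is left for activations and buffers).
    """
    usable_per_gpu = gpu_memory_gb * usable_fraction
    gpus_needed = math.ceil(model_state_gb / usable_per_gpu)

    if gpus_needed <= gpus_per_node:
        # Keep the whole group inside one node: pick the smallest divisor
        # of gpus_per_node that is large enough, so groups never straddle
        # a node boundary.
        for size in range(gpus_needed, gpus_per_node + 1):
            if gpus_per_node % size == 0:
                return size

    # More than one node is required: use the fewest whole nodes.
    nodes_needed = math.ceil(gpus_needed / gpus_per_node)
    return nodes_needed * gpus_per_node


def partition_groups(world_size: int, group_size: int) -> list[list[int]]:
    """Split ranks 0..world_size-1 into consecutive groups.

    Assumption: consecutive ranks are placed on the same node, so
    consecutive groups keep sharded-parameter traffic as local as possible.
    """
    if world_size % group_size != 0:
        raise ValueError("world_size must be a multiple of group_size")
    return [list(range(start, start + group_size))
            for start in range(0, world_size, group_size)]


if __name__ == "__main__":
    # Hypothetical numbers: 100 GB of model states, 40 GB GPUs, 8 per node.
    size = min_partition_group_size(100, 40, 8)
    print("partition group size:", size)
    print("groups on 16 GPUs:", partition_groups(16, size))
```

Under these assumptions, each group holds one full copy of the model states, so parameter gathering stays inside a group, and only gradient synchronization between replicas of the same shard has to cross group boundaries.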

The paper reports experiments conducted in 100 Gbps and 400 Gbps network environments, comparing performance across deep networks of different sizes and across different GPU counts. MiCS improves throughput by up to 2.82 times in the 100 Gbps configurations and by up to 2.21 times in the 400 Gbps configurations. Researchers from Google Cloud previously advocated a similar strategy in a GCP blog post.

This article is written as a research summary by Marktechpost Staff based on the research paper 'MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud'. All credit for this research goes to the researchers on this project. Check out the paper, AWS article and reference article.
