Microsoft And The University Of California, Merced Introduce ZeRO-Offload, A Novel Heterogeneous Deep Learning Training Technology To Train Multi-Billion-Parameter Models On A Single GPU


We are progressing towards an era of technology that is heavily dependent on Deep Learning (DL) models. As these models' sizes grow exponentially, training them becomes prohibitively expensive, and access to large-scale model training is limited by the nature of state-of-the-art system technologies. Only a small number of AI researchers and institutions have the resources to train deep learning models with more than a billion parameters. For example, training a 10-billion-parameter model requires a DGX-2-equivalent node with 16 NVIDIA V100 cards costing over $100K, which is beyond the reach of many data scientists and even many academic institutions.

To make this process more accessible, a team of researchers at the University of California, Merced, and Microsoft has developed ZeRO-Offload, a new heterogeneous deep learning training technology that enables data scientists to train multi-billion-parameter models on a single GPU without model refactoring. It is a GPU-CPU hybrid deep learning training technology with high compute efficiency and near-linear throughput scalability.

The main challenges in large-scale model training are the memory footprint of the model states (parameters, gradients, and optimizer states) and the lack of research on exploiting CPU compute. Previous attempts to reduce GPU memory requirements through heterogeneous deep learning training exist, but they target activation memory on small CNN-based models.

Traditional data parallelism is the community standard for scaling deep learning training to multiple GPUs, but it requires replicating data and computation on every device, which makes it unsuitable for heterogeneous training of deep learning models. ZeRO-Offload, on the other hand, exploits both CPU memory and compute for offloading and scales efficiently to multiple GPUs by working together with ZeRO-powered data parallelism. ZeRO-Offload also maintains a single copy of the optimizer states in CPU memory irrespective of the data-parallel degree, which yields near-linear scalability of up to 128 GPUs.

The ZeRO-Offload design is based on three principles: efficiency, scalability, and usability. The researchers identified a unique data partitioning and an optimal computation strategy between CPU and GPU devices: the gradients, optimizer states, and optimizer computation are offloaded to the CPU, while the parameters and the forward and backward computation remain on the GPU. With minimal communication and limited CPU computation, this approach yields a roughly tenfold increase in trainable model size, enabling the training of a 13-billion-parameter model on a single NVIDIA V100 GPU at 40 TFLOPS.
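The memory benefit of this partitioning can be illustrated with a rough accounting sketch. The per-parameter byte counts below follow the commonly cited layout for mixed-precision Adam training (fp16 weights and gradients plus fp32 optimizer copies); they are illustrative assumptions, not exact figures from the paper, and activations and buffers are ignored:

```python
# Rough per-parameter memory accounting for mixed-precision Adam training.
# Byte counts are the commonly assumed layout; illustrative only.

GB = 1024 ** 3

def model_state_bytes(n_params, offload=False):
    """Return (gpu_bytes, cpu_bytes) for the model states only."""
    fp16_params = 2 * n_params   # weights used in forward/backward
    fp16_grads  = 2 * n_params   # gradients
    fp32_states = 12 * n_params  # fp32 param copy + Adam momentum + variance

    if not offload:
        return fp16_params + fp16_grads + fp32_states, 0
    # ZeRO-Offload: keep fp16 parameters on the GPU; move gradients,
    # optimizer states, and the optimizer update itself to the CPU.
    return fp16_params, fp16_grads + fp32_states

n = 13_000_000_000  # 13B parameters
gpu_all, _ = model_state_bytes(n)
gpu_off, cpu_off = model_state_bytes(n, offload=True)
print(f"all-on-GPU : {gpu_all / GB:.0f} GB of GPU memory")
print(f"offloaded  : {gpu_off / GB:.0f} GB GPU + {cpu_off / GB:.0f} GB CPU")
```

Under these assumptions, the fp16 parameters of a 13B model (about 24 GB) fit on a single 32 GB V100 once the remaining roughly 170 GB of gradient and optimizer state lives in CPU memory, which is consistent with the order-of-magnitude jump in trainable model size the authors report.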

ZeRO-Offload is available as part of DeepSpeed, an open-source PyTorch library, and can be added to existing training pipelines by changing just a few lines of code. Its computational and memory efficiency and its ease of use make large-scale model training accessible even to researchers and data scientists working with a single GPU. The paper, 'ZeRO-Offload: Democratizing Billion-Scale Model Training', is available on arXiv.
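As a rough sketch of the "few lines of code" claim, a DeepSpeed user typically enables offloading in the JSON-style config and wraps the existing PyTorch model with `deepspeed.initialize`. The config keys below follow DeepSpeed's documented ZeRO stage-2 schema but may differ across versions, and the model, data loader, and hyperparameter values are placeholders:

```python
# Hedged sketch: config keys follow DeepSpeed's documented schema for
# ZeRO stage 2 with CPU offload; exact names may vary across versions.
ds_config = {
    "train_batch_size": 8,                       # placeholder value
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},  # enable ZeRO-Offload
    },
}

def train(model, data_loader):
    """Wrap an existing PyTorch model; the training loop is unchanged
    apart from calling the DeepSpeed engine instead of the raw model."""
    import deepspeed  # assumed installed: pip install deepspeed
    engine, _, _, _ = deepspeed.initialize(
        model=model, model_parameters=model.parameters(), config=ds_config
    )
    for batch in data_loader:
        loss = engine(batch)
        engine.backward(loss)  # engine handles fp16 loss scaling
        engine.step()          # optimizer update runs on the CPU side
```

The training loop itself is untouched; only the engine wrapper and the config entry change, which is what makes the integration a few-line edit to an existing pipeline.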



