In a recent announcement by Microsoft, it has been unveiled that an addition is being made to their open-source DeepSpeed, an AI training library that optimizes memory usage and trains very large deep learning models. With the help of ZeRO-Infinity, Microsoft achieved the benchmark of training a model with 32 trillion parameters on a cluster of GPUs. Moreover, it also showed fine-tuning of a 1 trillion parameter model on a single GPU.
This new edition has come with a plethora of features, as claimed by Microsoft:
- The latest iteration of the Zero Redundancy Optimizer is a bundle of memory optimization techniques.
- Several new strategies have been introduced by this model that would look into the problems of both memory and bandwidth simultaneously when a deep learning model is being trained.
- It would also come with a new offload engine for exploiting the benefits of CPU, and Non-Volatile Memory (NVM) express memory.
- The memory-centric tiling would handle the large operations without model parallelism.
- To reduce the bandwidth costs, bandwidth-centric portioning would come into play.
- Scheduling data communication would be easier than before with an overlap-centric design.
All these features would better equip the system’s capabilities to go beyond the GPU memory wall. This would further facilitate the training of models with tens of trillions of parameters with more minor clusters of GPUs. Not only this, an order of enormous magnitude as compared to the state-of-the-art systems would be made available.
Deep learning research is an ever-evolving arena. Recent trends have shown that to reach superhuman performance, the larger models need to be trained on a large amount of data with the largest models available. But these models are not easy to train and require an extensive cluster of GPUs. Mostly, model developers prefer to make use of transfer learning that can fine-tune a large pre-trained model. This saves tons of computing resources. However, some models are simply too large to fine-tune on a single machine. For both of these examples, code refactoring is required to utilize the distributed training frameworks fully.
Looking at these, Microsoft, as a part of their AI at Scale program, launched DeepSpeed as well as the Zero Redundancy Optimizer. The latter has been improved significantly by the company by introducing additional partitioning of the model state, offloading the data, and computing from the GPU to the CPU of any training machine.
This new iteration ZeRO-Infinity brings in a new scheme of things and looks into both the problems of training large deep learning models: memory size and memory bandwidth. The new offload infinity engine uses CPU and NVMe to increase the amount of memory available to store model parameters and activations. Prior to this, full offload for the entire model could not be done to these locations. Memory-centric tiling, which is also another significant feature, would reduce the memory footprint in the large model layers by consolidating them into smaller tiles. These would then be used sequentially. As claimed by Microsoft, the large deep learning models would then be trained without using model parallelism.
The team at Microsoft put the model to test several times before it was released; GPT-like Transformer models of different sizes were experimented with to validate the model’s ability to scale. The ZeRO-Infinity handled models 40 times larger than the traditional ones by making use of the same hardware. The model could also achieve a speedup two times greater than the previous versions that worked on the 64 GPU cluster. With these experiments, Microsoft has also potentially opened up the road to training 100 trillion parameter models.