DeepSpeed, Microsoft’s deep learning optimization library, aims to make distributed training easy, efficient, and effective. It is a key part of AI at Scale, Microsoft’s initiative to enable next-generation AI capabilities.
DeepSpeed can train deep learning (DL) models with over a hundred billion parameters on the current generation of GPU clusters, and Microsoft says it offers up to ten times better performance than the previous state of the art. Early adopters of DeepSpeed have already produced Turing-NLG, a language model (LM) with over 17 billion parameters.
Microsoft Corp. has released a new version of DeepSpeed that enables the creation of even larger DL models. Over multiple iterations, the maximum size of the models it can train has grown from over a hundred billion parameters to more than a trillion, about five times the size of the world’s current largest model. The new version also speeds up work on smaller projects.
Parameters are, at a high level, the insights that an AI model learns from processing data. They are what allow AI models to improve their speed and accuracy over time. As a rule of thumb, the more parameters a neural network has, the more capable it tends to be. By supporting more parameters, DeepSpeed lets a model make better use of the data it ingests and thereby produce higher-quality results.
DeepSpeed was built to address a basic constraint: developers can equip their neural networks with only as many parameters as their AI training infrastructure can handle. In other words, hardware limitations are an obstacle to building bigger and better models. By making the AI training process more hardware-efficient, DeepSpeed lets developers build more sophisticated AI software without buying more infrastructure.
According to Microsoft, the tool can train a trillion-parameter LM using 800 of Nvidia Corp.’s previous-generation V100 graphics cards. Without DeepSpeed, the company says, the same task would take 4,000 of Nvidia’s current-generation A100 graphics cards 100 days to complete, even though the A100 is up to 20 times faster than the V100. The company claims that even if the available hardware is reduced to a single V100 chip, DeepSpeed could still train a language model with up to 13 billion parameters.
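To get a feel for why these model sizes strain hardware, the rough arithmetic can be sketched in a few lines of Python. The sketch assumes the commonly cited figure of about 16 bytes of model state per parameter for mixed-precision Adam training (fp16 weights and gradients plus fp32 optimizer states) and a 32 GB V100. Neither number comes from this article, activations are ignored, and the function names are purely illustrative.

```python
import math

# Assumption (not from the article): mixed-precision Adam keeps roughly
# 16 bytes of model state per parameter, activations excluded.
BYTES_FP16_WEIGHTS = 2
BYTES_FP16_GRADS = 2
BYTES_FP32_OPTIMIZER = 12  # fp32 master weights + Adam momentum + variance
BYTES_PER_PARAM = BYTES_FP16_WEIGHTS + BYTES_FP16_GRADS + BYTES_FP32_OPTIMIZER

V100_MEMORY_GB = 32  # assumption: the 32 GB V100 variant


def model_state_gb(num_params):
    """Approximate model-state memory in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM / 1e9


def min_gpus_for_states(num_params, gpu_memory_gb=V100_MEMORY_GB):
    """Fewest GPUs whose combined memory could hold the model states."""
    return math.ceil(model_state_gb(num_params) / gpu_memory_gb)


if __name__ == "__main__":
    for label, n in [("13 billion", 13 * 10**9), ("1 trillion", 10**12)]:
        print(label, "parameters:", model_state_gb(n), "GB of model state,",
              "at least", min_gpus_for_states(n), "GPUs by memory alone")
```

Under these assumptions, a 13-billion-parameter model needs far more memory for its training state than one V100 provides, which gives a sense of the gap DeepSpeed has to close.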
These improvements over the older version are made possible by several new technologies in the latest version of DeepSpeed, such as:
- ZeRO-Offload: By making creative use of the memory attached to the servers’ central processing units (CPUs), it increases the number of parameters that AI training servers can handle.
- 3D parallelism: To increase hardware efficiency, this technique distributes work among the training servers. According to a blog post by Microsoft executives Rangan Majumder and Junhua Wang, 3D parallelism adapts to the varying needs of workload requirements to power extremely large models, achieving near-perfect memory-scaling and throughput-scaling efficiency.
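The two ideas in the list above can be illustrated with a minimal Python sketch. The first part builds a configuration dictionary that asks for optimizer state to be offloaded to CPU memory. The key names follow DeepSpeed's JSON configuration schema as currently documented, but they have changed between releases, so treat them as an assumption and check them against the version you use. The second part is a purely hypothetical helper, `build_3d_grid`, that shows how a pool of GPUs can be split along three axes (pipeline stages, data-parallel replicas, and model shards). It is not DeepSpeed's own topology code.

```python
import json
from itertools import product

# Part 1: ZeRO-Offload-style configuration.
# Assumption: key names match DeepSpeed's documented config schema;
# the batch size and ZeRO stage are illustrative values only.
ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        # Keep optimizer states in CPU memory instead of GPU memory.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}


# Part 2: a hypothetical 3D layout (not DeepSpeed's actual rank mapping).
def build_3d_grid(pipe, data, model):
    """Map each global rank to (pipeline stage, data replica, model shard)."""
    coords = product(range(pipe), range(data), range(model))
    return {rank: coord for rank, coord in enumerate(coords)}


if __name__ == "__main__":
    print(json.dumps(ds_config, indent=2))
    # 2 pipeline stages x 2 data replicas x 2 model shards = 8 GPUs.
    for rank, coord in build_3d_grid(pipe=2, data=2, model=2).items():
        print("rank", rank, "-> (stage, replica, shard) =", coord)
```

The point of the grid is that the total GPU count is the product of the three degrees of parallelism, so each axis can be tuned separately to fit the workload and the hardware.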