Transformer-based language models have made rapid progress in many natural language processing (NLP) applications, thanks to the availability of large datasets, large-scale compute, and advanced algorithms and software for training these models.
High-performing language models need many parameters, a lot of data, and a lot of training time to develop a richer, more nuanced understanding of language. As a result, they generalize well as effective zero-shot or few-shot learners, with high accuracy across many NLP tasks and datasets.
However, training such models is problematic for two reasons:
- The parameters of these models no longer fit into the memory of even the largest GPU.
- The algorithms, software, and hardware stack must be optimized together. Otherwise, the sheer number of compute operations required can result in unrealistically long training times.
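To make the first point concrete, here is a rough back-of-the-envelope sketch. It assumes 2 bytes per parameter (16-bit weights) and counts only the weights, ignoring gradients, optimizer state, and activations, so the real footprint during training is considerably larger. The function names are illustrative, and the 530 billion parameter count and 80 GB GPU size are the figures quoted later in this article.

```python
import math

# Figures quoted in the article.
MT_NLG_PARAMS = 530_000_000_000
A100_MEMORY_GB = 80

def weight_memory_gb(num_params: int, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, in decimal gigabytes.

    bytes_per_param=2 is an assumption (16-bit weights).
    """
    return num_params * bytes_per_param / 1e9

def min_gpus_for_weights(num_params: int, gpu_memory_gb: int,
                         bytes_per_param: int = 2) -> int:
    """Smallest number of GPUs whose combined memory holds the weights."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / gpu_memory_gb)

print(weight_memory_gb(MT_NLG_PARAMS), "GB of weights")
print(min_gpus_for_weights(MT_NLG_PARAMS, A100_MEMORY_GB), "GPUs needed for weights alone")
```

Even under these optimistic assumptions, the weights are more than ten times larger than a single 80 GB GPU, which is why the model has to be split across devices.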
Microsoft and NVIDIA present the Megatron-Turing Natural Language Generation model (MT-NLG), powered by DeepSpeed and Megatron, which they describe as the largest and most powerful monolithic transformer language model trained to date, with 530 billion parameters. MT-NLG is the successor to Turing NLG 17B and Megatron-LM, and it has three times as many parameters as the largest existing model of its kind. It performs a range of natural language tasks with high accuracy, including completion prediction, reading comprehension, commonsense reasoning, natural language inference, and word sense disambiguation.
To train the model efficiently, the team combines GPU-accelerated training infrastructure with a distributed learning software stack. They also built a high-quality natural language training corpus containing hundreds of billions of tokens, and jointly developed training recipes to improve optimization efficiency and stability.
The model is trained with mixed precision on the Selene supercomputer, which is built on the NVIDIA DGX SuperPOD architecture. The supercomputer has 560 DGX A100 servers, connected by HDR InfiniBand in a full fat-tree topology. Each DGX A100 contains eight NVIDIA A100 80GB Tensor Core GPUs, connected to each other via NVLink and NVSwitch.
According to the researchers, only an architecture of this kind can achieve parallelism across thousands of GPUs and train a model with hundreds of billions of parameters in a reasonable amount of time.
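The hardware figures above can be tallied with a short sketch. The `Cluster` class and its field names are hypothetical helpers introduced only for illustration. The server count, GPUs per server, and memory per GPU come from the description above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cluster:
    """Hypothetical summary of a GPU cluster (illustrative only)."""
    servers: int
    gpus_per_server: int
    gpu_memory_gb: int

    @property
    def total_gpus(self) -> int:
        return self.servers * self.gpus_per_server

    @property
    def total_gpu_memory_gb(self) -> int:
        return self.total_gpus * self.gpu_memory_gb

# 560 DGX A100 servers, each with eight 80 GB A100 GPUs.
selene = Cluster(servers=560, gpus_per_server=8, gpu_memory_gb=80)
print(selene.total_gpus, "GPUs in total")
print(selene.total_gpu_memory_gb, "GB of aggregate GPU memory")
```

This counts every server in the machine. The article does not say how many of them were used for any particular training run.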
They explain that existing parallelism strategies, such as data, pipeline, or tensor-slicing parallelism, are each insufficient on their own to train this model. To overcome this challenge, they combined Megatron-LM with DeepSpeed, a deep learning optimization library for PyTorch, to build an efficient and scalable 3D parallel system that brings together data, pipeline, and tensor-slicing parallelism.
Megatron-LM's tensor slicing scales the model within a node, while DeepSpeed's pipeline parallelism scales the model across nodes. For the 530-billion-parameter MT-NLG, each model replica spans 280 A100 GPUs, with 8-way tensor slicing within a node and 35-way pipeline parallelism across nodes. DeepSpeed's data parallelism then scales training out to thousands of GPUs.
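The arithmetic behind this layout can be sketched in a few lines. This is not the Megatron-LM or DeepSpeed API. The function names and the rank ordering (tensor index varying fastest, so that each tensor-parallel group of 8 stays inside one 8-GPU node) are assumptions made for illustration only.

```python
from typing import NamedTuple

TENSOR_PARALLEL = 8     # 8-way tensor slicing within a node
PIPELINE_PARALLEL = 35  # 35-way pipeline parallelism across nodes

class Coord(NamedTuple):
    data: int      # which model replica
    pipeline: int  # which pipeline stage inside the replica
    tensor: int    # which tensor slice inside the stage

def gpus_per_replica(tensor: int = TENSOR_PARALLEL,
                     pipeline: int = PIPELINE_PARALLEL) -> int:
    """GPUs needed to hold one full copy of the model."""
    return tensor * pipeline

def rank_to_coord(rank: int, tensor: int = TENSOR_PARALLEL,
                  pipeline: int = PIPELINE_PARALLEL) -> Coord:
    """Map a global GPU rank to (data, pipeline, tensor) coordinates.

    Assumed ordering: tensor index varies fastest, then pipeline, then data.
    """
    return Coord(data=rank // (tensor * pipeline),
                 pipeline=(rank // tensor) % pipeline,
                 tensor=rank % tensor)

def data_parallel_degree(total_gpus: int, tensor: int = TENSOR_PARALLEL,
                         pipeline: int = PIPELINE_PARALLEL) -> int:
    """Number of model replicas that fit on total_gpus GPUs."""
    replica = gpus_per_replica(tensor, pipeline)
    if total_gpus % replica != 0:
        raise ValueError("total_gpus must be a multiple of the replica size")
    return total_gpus // replica

print(gpus_per_replica())       # GPUs per model replica
print(rank_to_coord(8))         # first GPU of the second pipeline stage
print(data_parallel_degree(560 * 8))  # replicas if all 560 servers were used
```

With these numbers, one replica occupies 8 × 35 = 280 GPUs, that is, 35 nodes of 8 GPUs each. The last line is an upper bound that assumes the whole 560-server machine is dedicated to one job, which the article does not state.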
MT-NLG achieves the best results on numerous natural language tasks. For example, determining the relationship between two sentences is usually a difficult problem for language models in few-shot settings, yet MT-NLG improves on it while being trained on fewer tokens than earlier models. In other words, larger models learn faster.
The team hopes that their innovations will aid in the development of existing and future AI models, making them cheaper and faster to train.
Other Source: https://www.ithome.com.tw/news/147225