Researchers From NVIDIA, Stanford University and Microsoft Research Propose Efficient Trillion-Parameter Language Model Training on GPU Clusters

Large-scale transformer-based language models have produced significant gains in natural language processing (NLP) in recent years. However, training such models is difficult: no single GPU has enough memory to hold their exponentially growing parameter counts, and even if the parameters could fit on a single GPU, limited compute power would make training times impractically long without model parallelism.

In a paper by NVIDIA, Stanford University, and Microsoft Research, a research team proposes a new parallelization schedule that improves throughput by more than 10 percent with a comparable memory footprint. The paper demonstrates that such parallelism strategies can be composed to achieve high aggregate throughput when training large models with nearly a trillion parameters.

The team first introduces methods combining data parallelism with tensor model parallelism and pipeline model parallelism to ensure that large models are trained efficiently. In data parallelism, every worker holds a copy of the entire model, and the input dataset is sharded across workers. The workers periodically aggregate their gradients to maintain a consistent version of the weights.
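The data-parallel pattern described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: a toy linear model stands in for a transformer, two in-process "workers" stand in for GPUs, and the gradient averaging stands in for an all-reduce collective.

```python
import numpy as np

def local_gradient(weights, inputs, targets):
    # Gradient of mean squared error for a toy linear model:
    # d/dw mean((X w - y)^2) = 2 X^T (X w - y) / n
    residual = inputs @ weights - targets
    return 2 * inputs.T @ residual / len(targets)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(8, 4)), rng.normal(size=8)
w = np.zeros(4)

# Shard the global batch across two "workers"; each holds a full weight copy
shards = [(X[:4], y[:4]), (X[4:], y[4:])]
grads = [local_gradient(w, xs, ys) for xs, ys in shards]

# "All-reduce": average the shard gradients so every replica
# applies the identical update and the weights stay consistent
avg_grad = np.mean(grads, axis=0)
w -= 0.1 * avg_grad
```

With equal shard sizes, the averaged shard gradients equal the gradient over the full batch, which is why data parallelism preserves the single-worker training dynamics.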

In pipeline parallelism, the model's layers are partitioned across multiple devices. Since pipelining schemes must ensure that inputs see consistent weight versions across their forward and backward passes, the researchers consider two scheduling approaches: a default schedule and a schedule with interleaved stages.

The default schedule has a large memory footprint, since it requires the stashed intermediate activations of every in-flight microbatch to be kept in memory. The team therefore opts for a modified PipeDream-Flush schedule, which is far more memory-efficient. The interleaved-stages schedule can further reduce the pipeline bubble size, but at the cost of extra communication.
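The trade-off between the two schedules can be quantified with the standard pipeline-bubble analysis: the idle "bubble" fraction of an iteration shrinks as more microbatches flow through the pipeline, and interleaving multiple model chunks per device shrinks it further. The formula below follows that analysis; the specific parameter values are illustrative, not from the paper.

```python
def bubble_fraction(p, m, v=1):
    # Pipeline bubble as a fraction of ideal iteration time:
    # (p - 1) / (v * m), where p = pipeline stages (devices),
    # m = microbatches per batch, v = interleaved model chunks per device
    return (p - 1) / (v * m)

print(bubble_fraction(8, 32))       # non-interleaved schedule -> 0.21875
print(bubble_fraction(8, 32, v=2))  # interleaved stages       -> 0.109375
```

Interleaving with v chunks per device divides the bubble by v, but each microbatch now crosses device boundaries v times as often, which is the extra communication the paper notes.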

In tensor model parallelism, individual model layers are partitioned across multiple devices. The method uses a partitioning strategy inspired by the Megatron project for transformer layers, the bedrock of modern language models.
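The Megatron-style split of a transformer MLP block can be sketched as follows. This is a simplified NumPy illustration with ReLU standing in for GeLU and plain addition standing in for the all-reduce: the first weight matrix is split by columns (so the elementwise nonlinearity applies independently per partition), the second by rows, and one reduction recovers the serial result.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(2, 4))        # input activations
A = rng.normal(size=(4, 8))        # first MLP weight
B = rng.normal(size=(8, 4))        # second MLP weight
relu = lambda t: np.maximum(t, 0)  # elementwise nonlinearity (stand-in for GeLU)

# Serial reference: Z = f(X A) B
Z_ref = relu(X @ A) @ B

# Split across two "devices": A by columns, B by rows.
# Because f is elementwise, each device applies it independently.
A1, A2 = A[:, :4], A[:, 4:]
B1, B2 = B[:4, :], B[4:, :]
Z1 = relu(X @ A1) @ B1
Z2 = relu(X @ A2) @ B2

# A single all-reduce (here: a sum) restores the full output
Z = Z1 + Z2
assert np.allclose(Z, Z_ref)
```

Splitting the first matrix column-wise is the key design choice: it keeps the nonlinearity local to each device, so the block needs only one reduction at the end instead of one after every matrix multiply.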

The team tested the combined pipeline, tensor model, and data parallelism approach to evaluate whether it improves communication and computation performance when training GPT models ranging from a billion to a trillion parameters.

The results demonstrate that the proposed composition of tensor, pipeline, and data parallelism enables training iterations on a model with approximately 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs, achieving per-GPU throughput of 52 percent of peak, well above the 36 percent obtained by previous approaches on similar-sized models. The proposed method can thus scale to thousands of GPUs and achieves a two-order-of-magnitude increase over existing systems.
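The reported utilization figure can be sanity-checked with back-of-the-envelope arithmetic, assuming the cluster uses NVIDIA A100 GPUs with a 312 TFLOP/s half-precision tensor-core peak (an assumption on our part, not stated in the text above):

```python
# Back-of-the-envelope check of the reported per-GPU utilization
aggregate_pflops = 502        # reported aggregate throughput (PFLOP/s)
num_gpus = 3072
peak_tflops_per_gpu = 312     # assumed A100 FP16/BF16 tensor-core peak

per_gpu_tflops = aggregate_pflops * 1000 / num_gpus
utilization = per_gpu_tflops / peak_tflops_per_gpu
print(f"{per_gpu_tflops:.0f} TFLOP/s per GPU, {utilization:.0%} of peak")
```

Roughly 163 TFLOP/s per GPU works out to about 52 percent of the assumed peak, consistent with the figure reported above.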


