Mixture of experts (MoE) is a promising deep learning architecture that can keep training cost sublinear in the number of model parameters. MoE architectures take an ensemble-style approach: the modeling task is broken into sub-tasks, an expert model is trained for each, and a gating model learns which experts to trust for a given input and combines their predictions.
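The gate-and-combine step can be sketched in a few lines of plain Python. This is a minimal illustration, not DeepSpeed's implementation: the expert functions, gate logits, and `moe_output` helper below are all hypothetical, and real MoE layers operate on tensors rather than scalars.

```python
import math

def softmax(xs):
    # numerically stable softmax over the gate's logits
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_output(x, experts, gate_logits, top_k=2):
    """Combine expert predictions weighted by the gating model.

    `experts` is a list of callables (one per expert) and `gate_logits`
    are the gating model's scores for this input. Only the top-k experts
    are actually evaluated, which is what keeps MoE compute sublinear in
    the total parameter count.
    """
    weights = softmax(gate_logits)
    top = sorted(range(len(experts)), key=lambda i: weights[i], reverse=True)[:top_k]
    norm = sum(weights[i] for i in top)  # renormalize over the selected experts
    return sum(weights[i] / norm * experts[i](x) for i in top)

# Toy usage: three "experts", with the gate strongly preferring the first.
experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x]
y = moe_output(3.0, experts, gate_logits=[2.0, 1.0, -3.0], top_k=2)
```

With `top_k=2`, the third expert is never evaluated; the output is a weighted blend of the first two, dominated by the expert the gate scored highest.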
Training huge dense models pushes the availability and capability of hardware resources to the limit. What is needed, then, is an architecture that considerably reduces training cost compared to a quality-equivalent dense model.
Microsoft’s DeepSpeed-MoE meets precisely this requirement, making massive MoE model inference up to 4.5 times faster and nine times cheaper. This facilitates model scaling and paves the way for models that learn from more data, powering various disciplines including computer vision, speech recognition, and natural language processing.
Bottlenecks faced by MoE-based models in real-world scenarios:
- Limited Scope: In natural language processing, MoE-based models have generally been limited to encoder-decoder architectures and sequence-to-sequence tasks. Applying MoE to auto-regressive natural language generation (NLG) remains largely unexplored, and there has been little research into its applicability in other disciplines.
- Massive Memory Requirements: While MoE models use less compute to reach the same quality as dense models, they require far more parameters. In other words, compared to quality-equivalent dense models, MoE-based models have substantially poorer “parameter efficiency.” The larger model size and reduced parameter efficiency hamper both training and inference.
- Limited Inference Performance: Fast inference of MoE-based models is considerably more difficult due to their enormous size and low parameter efficiency. On the one hand, the larger parameter count necessitates more GPUs, and existing multi-GPU inference technology is not designed for MoE-based models. On the other hand, because inference is frequently memory-bandwidth bound, MoE-based models may require a 10x increase in achievable memory bandwidth to reach the same inference latency as dense models.
Researchers from Microsoft propose solutions to tackle these shortcomings:
- Auto-regressive NLG tasks are now included in the scope of MoE-based models. On models like GPT-3 and MT-NLG, the approach shows a 5x reduction in training cost while maintaining the same model quality. At the same time, it opens up the possibility of achieving next-generation model quality within the constraints of current-generation hardware resources.
- A new MoE architecture called Pyramid-Residual MoE (PR-MoE) improves the parameter efficiency of MoE-based models. PR-MoE is a hybrid dense-and-MoE model built on residual connections. It can reduce MoE model parameter size by up to 3x while maintaining model quality without increasing compute requirements. The researchers also created a distilled variant of PR-MoE called Mixture-of-Students (MoS), which shrinks the MoE model by up to 3.7x while maintaining the same model quality.
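The residual idea behind PR-MoE can be sketched as follows. This is an illustrative scalar sketch under assumed names (`pr_moe_layer`, `dense_mlp`, `gate` are all hypothetical), not the paper's implementation: every input passes through a shared dense MLP, and a single gated expert adds a residual correction on top, so only one expert needs to be routed per token.

```python
def pr_moe_layer(x, dense_mlp, experts, gate):
    """One residual-MoE layer (sketch).

    The dense path is always taken; the gate picks exactly one expert
    whose output is added as a residual correction. The "pyramid" part
    of PR-MoE refers to giving later layers more experts than earlier
    ones, e.g. expert counts like [32, 32, 64, 64, 128, 128] across the
    network depth (illustrative numbers).
    """
    expert_idx = gate(x)                      # gating picks one expert index
    return dense_mlp(x) + experts[expert_idx](x)

# Toy usage with scalar "layers":
dense_mlp = lambda x: 0.5 * x
experts = [lambda x: x, lambda x: -x]
gate = lambda x: 0 if x > 0 else 1            # trivial input-dependent routing
out = pr_moe_layer(2.0, dense_mlp, experts, gate)
```

Because the always-on dense path carries the bulk of the representation, a single routed expert suffices where a plain MoE layer would route to two, which is one way to read the parameter-efficiency gain.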
- The researchers present DeepSpeed-MoE, a highly optimized MoE inference system that scales inference workloads efficiently across hundreds of GPUs. It serves trillion-parameter MoE models at ultra-fast inference latencies (under 25 ms) and, compared to conventional MoE inference systems, provides a 7.3x reduction in inference latency and cost.
The researchers hope that the innovations and infrastructure presented in their recent paper offer a viable route to addressing the training-cost challenges of today’s large-scale deep learning models, and that they serve as a starting point for training and serving the next generation of AI models at scale.