Alibaba Releases Qwen1.5-MoE-A2.7B: A Small MoE Model with only 2.7B Activated Parameters yet Matching the Performance of State-of-the-Art 7B models like Mistral 7B

The Mixture of Experts (MoE) architecture has become increasingly popular since the release of the Mixtral model. Building on the study of MoE models, a team of researchers from the Qwen team at Alibaba Cloud has introduced Qwen1.5-MoE-A2.7B, an MoE model in Qwen1.5, the improved version of the team's Qwen series of Large Language Models (LLMs).

Qwen1.5-MoE-A2.7B performs on par with heavyweight 7B models like Mistral 7B and Qwen1.5-7B despite having only 2.7 billion activated parameters. Compared with Qwen1.5-7B, its activated parameter count is about one-third as large, and the team reports a 75% reduction in training costs. It also exhibits a 1.74-fold increase in inference speed, demonstrating notable gains in resource efficiency without sacrificing performance.

The Qwen1.5-MoE-A2.7B architecture combines several optimizations. A significant one is the use of fine-grained experts, which permits a higher number of experts without increasing the total number of parameters. The model uses 64 experts instead of the conventional 8, which greatly increases model capacity.
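To see why finer experts need not add parameters, consider the arithmetic below. It is a minimal sketch, not the model's actual configuration: the layer sizes are hypothetical, and the feed-forward block is simplified to two weight matrices without biases. The assumption is that each coarse expert is split into 8 narrower ones, turning 8 experts into 64.

```python
def ffn_params(d_model: int, d_hidden: int) -> int:
    """Weight count of a simplified feed-forward block:
    one d_model x d_hidden matrix plus one d_hidden x d_model matrix."""
    return 2 * d_model * d_hidden

# Hypothetical sizes chosen for illustration only.
D_MODEL = 1024
D_HIDDEN = 4096

# Conventional setup: 8 experts, each a full-width feed-forward block.
coarse_experts = 8
coarse_total = coarse_experts * ffn_params(D_MODEL, D_HIDDEN)

# Fine-grained setup: each expert is split into 8 narrower segments,
# giving 64 experts whose hidden width is 1/8 of the original.
split = 8
fine_experts = coarse_experts * split
fine_total = fine_experts * ffn_params(D_MODEL, D_HIDDEN // split)

print(fine_experts, coarse_total, fine_total)
```

Under these assumptions both totals are identical. Only the granularity changes: the router can pick among 64 small experts instead of 8 large ones.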

The initialization stage has a strong influence on the model’s performance. Qwen1.5-MoE-A2.7B achieves better performance and faster convergence by repurposing existing models for initialization and adding randomness during that step. It also uses a generalized MoE routing paradigm that combines shared experts with routed experts selected by the router. This arrangement gives more flexibility in constructing the routing mechanism and contributes to the model’s overall efficiency.
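The combination of shared and routed experts can be sketched as follows. This is an illustrative toy layer, not the Qwen implementation: the class name `TinyMoELayer`, the sizes, the ReLU activation, the softmax gate, and the choice to leave the top-k gate weights unnormalized are all assumptions made for the example. Shared experts process every token, while each token is sent only to its `top_k` highest-scoring routed experts.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class TinyMoELayer:
    """Toy MoE layer with always-on shared experts and top-k routed experts."""

    def __init__(self, d_model, d_hidden, n_shared, n_routed, top_k, seed=0):
        rng = np.random.default_rng(seed)
        n_total = n_shared + n_routed
        # Experts 0..n_shared-1 are shared; the remaining ones are routed.
        self.w_in = rng.normal(0.0, 0.02, (n_total, d_model, d_hidden))
        self.w_out = rng.normal(0.0, 0.02, (n_total, d_hidden, d_model))
        # The router scores routed experts only.
        self.router = rng.normal(0.0, 0.02, (d_model, n_routed))
        self.n_shared = n_shared
        self.top_k = top_k

    def _expert(self, idx, x):
        # Simplified two-matrix feed-forward block with ReLU.
        return np.maximum(x @ self.w_in[idx], 0.0) @ self.w_out[idx]

    def __call__(self, tokens):
        out = np.zeros_like(tokens)
        gates = softmax(tokens @ self.router)  # shape: (n_tokens, n_routed)
        for t, x in enumerate(tokens):
            # Shared experts see every token, with no gating.
            for s in range(self.n_shared):
                out[t] += self._expert(s, x)
            # Routed experts: keep only the top_k gate scores for this token.
            for r in np.argsort(gates[t])[-self.top_k:]:
                out[t] += gates[t, r] * self._expert(self.n_shared + r, x)
        return out

# Usage with hypothetical sizes: 2 shared experts, 6 routed, 2 routed per token.
layer = TinyMoELayer(d_model=16, d_hidden=8, n_shared=2, n_routed=6, top_k=2)
tokens = np.random.default_rng(1).normal(size=(4, 16))
y = layer(tokens)
print(y.shape)
```

In this sketch each token activates 4 of the 8 experts (2 shared plus 2 routed), so the compute per token is well below the layer's total parameter count. That is the sense in which an MoE model's activated parameters are fewer than its total parameters.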

Evaluations on a range of benchmark datasets highlight the model’s competitive performance. It compares well across domains such as multilingualism, coding, language comprehension, and mathematics, both against other MoE models with similar parameter counts and against top-performing 7B base models.

The model is particularly attractive because of its inference speed and training cost-effectiveness. Compared to conventional 7B models, it achieves a 75% decrease in training costs by considerably lowering the count of non-embedding parameters. Furthermore, because of the integrated shared experts and optimized MoE architecture, its inference speed is increased by 1.74 times.

In conclusion, Qwen1.5-MoE-A2.7B marks a notable step forward in model efficiency. It demonstrates the potential of MoE architectures by matching the performance of 7B models with a fraction of the activated parameters, while delivering notable savings in training expenses and inference time.

Check out the models on Hugging Face and GitHub. All credit for this research goes to the researchers of this project.

Tanya Malhotra is a final year undergrad from the University of Petroleum & Energy Studies, Dehradun, pursuing BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with good analytical and critical thinking, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.
