Microsoft Enhances its Translator, a Microsoft Azure Cognitive Service, with Z-code Mixture of Experts (MoE) models, to Boost Efficiency and Quality

This article is based on the research paper 'Scalable and Efficient MoE Training for Multitask Multilingual Models' and the Microsoft article 'Microsoft Translator enhanced with Z-code Mixture of Experts models'.

Recent advancements in machine translation (MT) are breaking language barriers and bringing people worldwide together. However, human language is so flexible that MT is considered one of the most difficult problems in artificial intelligence.

Microsoft researchers have been working on enhancing existing AI methods to build multilingual, large-scale language models that can be used in multiple settings. Their current work significantly improves the quality of production translation models by adopting a more holistic, human-centric approach to learning and understanding. 

They have recently enhanced Translator, a Microsoft Azure Cognitive Service, with Z-code Mixture of Experts models. Z-code is part of a broader effort to support the creation of AI systems that can speak, see, hear, and understand.

The Z-code models are based on a Mixture of Experts (MoE) architecture, which allows different portions of a model to learn distinct tasks. This enables a single model to translate between multiple languages at the same time. A Z-code MoE model has more parameters than a conventional model but dynamically chooses which ones to use for each input. During training, subsets of the parameters (experts) can specialize in different tasks. At runtime, the model activates only the experts relevant to the input, which is more computationally efficient than using all of the model’s parameters.
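To make the routing idea concrete, here is a minimal sketch of a sparse MoE layer in plain Python. It assumes a softmax gate with top-k expert selection, which is a common MoE formulation; the article does not describe the exact router Z-code uses. The class name `TinyMoELayer`, the random weights, and the sizes are hypothetical and chosen only for illustration.

```python
import math
import random

def softmax(scores):
    """Convert raw gate scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

class TinyMoELayer:
    """Hypothetical toy layer: a gate picks top_k experts per input."""

    def __init__(self, dim, num_experts, top_k=1, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

        self.top_k = top_k
        self.gate = rand_matrix(num_experts, dim)  # one score row per expert
        self.experts = [rand_matrix(dim, dim) for _ in range(num_experts)]

    def forward(self, x):
        probs = softmax(matvec(self.gate, x))
        # Keep only the top_k highest-scoring experts for this input.
        chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[: self.top_k]
        norm = sum(probs[i] for i in chosen)
        out = [0.0] * len(x)
        for i in chosen:
            weight = probs[i] / norm  # renormalize over the chosen experts
            expert_out = matvec(self.experts[i], x)
            out = [o + weight * e for o, e in zip(out, expert_out)]
        return out, chosen

layer = TinyMoELayer(dim=4, num_experts=8, top_k=2)
output, used = layer.forward([0.5, -1.0, 0.25, 2.0])
print("experts used:", used, "of", len(layer.experts))
```

In this toy setup, only two of the eight expert matrices are multiplied for a given input, even though all eight contribute to the total parameter count. That is the sense in which an MoE model can grow in size without a matching growth in per-input compute.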

Transfer learning is a technique that permits effective knowledge sharing across related languages, and the team uses it in the new Z-code MoE models. During training, the models use both parallel and monolingual data. This extends high-quality machine translation beyond high-resource languages and improves quality for low-resource languages that have limited training data. Because both high-resource and low-resource languages benefit, this strategy can have a positive impact on AI fairness.

For research purposes, the team trained translation systems with 200 billion parameters that handle 100 language pairs. While such huge systems improved translation quality, they were also harder to deploy cost-effectively in production.

For production deployment, the team chose to train a set of 5-billion-parameter models, which are 80 times larger than the currently deployed models. They trained one multilingual model per set of languages, with each model capable of serving up to 20 language pairs. In other words, each model can replace up to 20 existing systems. As a result, the models maximize transfer learning across languages while remaining deployable at a low runtime cost.
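The figures above imply some simple arithmetic, sketched below. The constants come from the article; the implied size of the current models and the helper `models_needed` are derived for illustration only and are not figures or functions stated by Microsoft.

```python
import math

PRODUCTION_PARAMS = 5_000_000_000  # 5 billion parameters per production model
SIZE_RATIO = 80                    # "80 times larger" than current models
PAIRS_PER_MODEL = 20               # each model serves up to 20 language pairs

# Implied size of a currently deployed model (derived, not stated in the article).
implied_current_params = PRODUCTION_PARAMS / SIZE_RATIO

def models_needed(num_language_pairs, pairs_per_model=PAIRS_PER_MODEL):
    """Hypothetical helper: fewest multilingual models that cover the given pairs."""
    return math.ceil(num_language_pairs / pairs_per_model)

print(f"implied current model size: {implied_current_params:,.0f} parameters")
print("models needed for 100 language pairs:", models_needed(100))
```

Under these assumptions, the current models would be around 62.5 million parameters each, and covering 100 language pairs would take at least five multilingual models instead of 100 separate bilingual systems.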


The researchers used human evaluation to compare the quality of the new MoE models against the current production system. Their findings show that the Z-code MoE systems perform better than individual bilingual systems.

Training huge models with billions of parameters is difficult. To make it more manageable, the researchers worked with the Microsoft DeepSpeed team to create a high-performance system for training Z-code MoE models at very large scale. This enables them to scale and deploy Z-code models for translation more efficiently.

The researchers also collaborated with NVIDIA to develop faster engines for deploying the new Z-code MoE models on GPUs. To implement MoE layers efficiently on a single V100 GPU, NVIDIA wrote custom CUDA kernels built on the CUTLASS and FasterTransformer libraries. Compared with standard GPU (PyTorch) runtimes, this approach delivered up to 27x throughput improvements. They further leveraged the dynamic batching capability of NVIDIA Triton Inference Server to combine multiple requests into a single large batch for increased throughput. This allowed them to ship large models at reduced runtime cost.
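The idea behind dynamic batching can be sketched in a few lines of Python. This is not Triton's API or implementation: the functions `dynamic_batches` and `fake_translate_batch` are hypothetical stand-ins, and the sketch ignores timing concerns such as how long a server waits for more requests before dispatching a batch.

```python
from collections import deque

def dynamic_batches(requests, max_batch_size):
    """Group queued requests into batches of at most max_batch_size, in order."""
    queue = deque(requests)
    batches = []
    while queue:
        batch = []
        while queue and len(batch) < max_batch_size:
            batch.append(queue.popleft())
        batches.append(batch)
    return batches

def fake_translate_batch(batch):
    """Stand-in for one model call over a whole batch (just uppercases text)."""
    return [text.upper() for text in batch]

pending = ["hello", "good morning", "thank you", "see you soon", "welcome"]
batches = dynamic_batches(pending, max_batch_size=4)
results = [out for batch in batches for out in fake_translate_batch(batch)]
print(len(pending), "requests served with", len(batches), "model calls")
```

Serving five requests with two model calls instead of five is where the throughput gain comes from: a GPU processes a batch of inputs far more efficiently than the same inputs one at a time.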

Document Translation is a feature that translates entire documents, or volumes of documents, in a variety of file formats while retaining their original formatting. Customers using Document Translation can now request access to Z-code models. The team states that Z-code models will be rolled out in stages to all customers and to other Translator products.




Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a keen interest in the applications of artificial intelligence across various fields. She is passionate about exploring new advancements in technology and their real-life applications.