Machine translation (MT) is the process of employing artificial intelligence (AI) to automatically translate text from one language (the source) into another (the target). The ultimate goal is a universal translation system that allows everyone to access information and communicate more effectively. There is a long road ahead before this vision becomes reality.
Most MT systems in use today are bilingual models, which require labeled examples for each language pair and task. Such models are, however, unsuitable for languages with insufficient training data, and maintaining a separate model for every pair makes the approach impossible to scale to practical applications such as Facebook, where billions of users post in hundreds of languages every day.
To address this problem and move toward a universal translator, the MT field must transition from bilingual to multilingual models. In multilingual machine translation, a single model processes numerous languages. The ideal outcome would be a single model that translates across as many languages as possible while making effective use of the available linguistic resources.
Facebook AI has brought about a major leap in MT by revealing a multilingual model that surpassed the best specially trained bilingual models and set new standards by winning WMT, a prestigious MT competition. The model proved effective at translating both low-resource and high-resource languages.
The effectiveness of this model has demonstrated that advancement in two areas is required to achieve good performance on both high- and low-resource languages:
- Large-Scale Data Mining
- Scaling Model Capacity
Large-Scale Data Mining:
To train the WMT 2021 model, the researchers built two multilingual directions: any-to-English and English-to-any. To circumvent the limitations of hand-translated training data, they applied parallel data mining techniques that find translation pairs within large web-crawl data sets.
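The general idea behind such mining can be sketched as follows. Assuming a multilingual encoder that maps sentences from different languages into a shared embedding space (the encoder is mocked with toy vectors here; the function names and threshold are illustrative, not Facebook's actual pipeline), candidate translation pairs are sentences whose embeddings are highly similar:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mine_pairs(src_embs, tgt_embs, threshold=0.9):
    """Pair each source sentence with its most similar target
    sentence, keeping only pairs above a similarity threshold."""
    pairs = []
    for i, s in enumerate(src_embs):
        sims = [cosine(s, t) for t in tgt_embs]
        j = int(np.argmax(sims))
        if sims[j] >= threshold:
            pairs.append((i, j, sims[j]))
    return pairs

# Toy embeddings standing in for a multilingual encoder's output:
# src rows 0 and 1 align with tgt rows 1 and 0; src row 2 has no match.
src = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
tgt = np.array([[0.0, 1.0], [1.0, 0.1]])
print(mine_pairs(src, tgt))  # only the two confident pairs survive
```

Production systems refine this with margin-based scoring and approximate nearest-neighbor search, but the core operation is the same similarity comparison.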
It is well known that the monolingual data available for any language vastly exceeds the parallel data. It therefore becomes crucial to leverage readily available monolingual data to improve the performance of MT systems. One of the most successful techniques for doing so is back translation: a reverse (target-to-source) model translates monolingual target-language text back into the source language, producing synthetic parallel pairs on which the forward model can then be trained.
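The back-translation loop can be sketched in a few lines. This is a minimal illustration, with the reverse model mocked by a lookup table (a real system would use a trained target-to-source MT model):

```python
def mock_reverse_model(sentence):
    """Stand-in for a trained target->source MT model."""
    table = {"der Hund schläft": "the dog sleeps",
             "die Katze rennt": "the cat runs"}
    return table[sentence]

def back_translate(monolingual_target):
    """Turn monolingual target-language text into synthetic
    (source, target) pairs for training the forward model."""
    synthetic_pairs = []
    for tgt_sentence in monolingual_target:
        src_sentence = mock_reverse_model(tgt_sentence)  # synthetic source
        # The forward model trains on (synthetic source, real target):
        # the target side stays human-written, so output quality is preserved.
        synthetic_pairs.append((src_sentence, tgt_sentence))
    return synthetic_pairs

corpus = ["der Hund schläft", "die Katze rennt"]
for src, tgt in back_translate(corpus):
    print(src, "->", tgt)
```

The key design point is that only the source side is machine-generated; the model still learns to produce fluent, human-written target text.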
Scaling Model Capacity:
To boost the capacity of the multilingual model, its size was scaled from 15 billion to 52 billion parameters. These scaling efforts would have been in vain without Facebook's GPU memory-saving solution, Fully Sharded Data Parallel (FSDP), which enables large-scale training up to 5x faster than conventional techniques.
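The core idea behind fully sharded training is that each worker stores only a 1/world_size slice of the parameters and reassembles the full tensor (an all-gather) only when it is needed. The single-process NumPy simulation below illustrates that memory-saving idea; it is a conceptual sketch, not the actual PyTorch FSDP API:

```python
import numpy as np

world_size = 4
full_params = np.arange(12, dtype=np.float64)  # pretend model weights

# Shard: each worker keeps only its slice -> 1/world_size memory each.
shards = np.array_split(full_params, world_size)
per_worker = max(s.size for s in shards)
print(f"each worker stores {per_worker} of {full_params.size} values")

def all_gather(shards):
    """Reassemble the full parameter vector from the shards,
    as FSDP does transiently before a layer's forward/backward pass."""
    return np.concatenate(shards)

restored = all_gather(shards)
assert np.array_equal(restored, full_params)
```

In real FSDP the gathered parameters are freed again after each layer runs, so peak memory stays close to the sharded footprint.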
Multilingual models must balance parameter sharing against per-language specialization, because languages naturally compete for model capacity, and scaling the model size in proportion to the number of languages is computationally unsustainable.
Instead, a conditional computation approach was adopted, which activates only a part of the model for each training example: Sparsely Gated Mixture-of-Experts (MoE) models, in which each token is routed to the top-k expert feed-forward blocks according to a learned gating function.
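Top-k expert routing can be sketched as follows. This is a minimal single-token illustration (random weights, no training loop), not the model's actual architecture: a learned gate scores the experts, only the k best are evaluated, and their outputs are mixed by the normalized gate weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, k = 8, 4, 2

# One tiny feed-forward "expert" per slot; only the top-k run per token.
experts = [rng.standard_normal((d_model, d_model)) * 0.1
           for _ in range(n_experts)]
gate_w = rng.standard_normal((d_model, n_experts)) * 0.1  # learned gating weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_layer(token):
    scores = token @ gate_w            # one gate logit per expert
    topk = np.argsort(scores)[-k:]     # indices of the k best experts
    weights = softmax(scores[topk])    # renormalize over the chosen experts
    # Sparse compute: evaluate only the selected experts.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, topk))

token = rng.standard_normal(d_model)
out = moe_layer(token)
print(out.shape)  # same shape as the input token
```

Because each token touches only k of the n experts, total parameter count can grow with the number of experts while per-token compute stays roughly constant.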
The researchers hope that their success at WMT will pave the way toward a single universal translation system that provides high-quality translations for all. The next step is to adapt these techniques to languages beyond those featured in the WMT competition.
Low-resource translation remains a “last mile” problem for MT and the most prominent open challenge for the subfield today.