Typical English-centric multilingual models previously used for translation rely on a two-step process that pivots through English. This approach can fail to preserve a sentence's actual meaning, because English data acts as a bridge between the two languages. For instance, to translate Chinese to French, prior models would train on Chinese-to-English and English-to-French data, because English training data is the most widely available. The new model introduced by Facebook AI trains directly on Chinese-to-French data to better preserve meaning.
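As a rough illustration of how an English pivot can lose meaning, consider a toy word-level sketch. The three dictionaries below are hypothetical stand-ins for trained models, not output from any real system. They use the fact that Chinese and French both distinguish informal and formal "you", while English does not.

```python
# Toy word-level "translators". These dictionaries are illustrative
# assumptions, not real model output.
ZH_TO_EN = {"你": "you", "您": "you"}   # informal and formal forms collapse in English
EN_TO_FR = {"you": "vous"}             # the English side cannot say which French form to pick
ZH_TO_FR = {"你": "tu", "您": "vous"}   # direct data keeps the distinction


def pivot_translate(word: str) -> str:
    """Two-step translation: Chinese -> English -> French."""
    return EN_TO_FR[ZH_TO_EN[word]]


def direct_translate(word: str) -> str:
    """One-step translation: Chinese -> French."""
    return ZH_TO_FR[word]


for word in ("你", "您"):
    print(word, "pivot:", pivot_translate(word), "direct:", direct_translate(word))
```

In this sketch the pivot route maps both Chinese forms to the same French word, while the direct route keeps them apart. Real systems operate on whole sentences, so the effect is subtler, but the information bottleneck is the same.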
Facebook AI introduces M2M-100, the first single massive multilingual machine translation (MMT) model that can translate between any pair of 100 languages without depending on English data. M2M-100 is trained on about 2,200 language directions, roughly 10x more than previous English-centric multilingual models.
One of the toughest challenges in building a many-to-many MMT model is curating large volumes of quality sentence pairs (also called parallel sentences) for arbitrary translation directions not involving English. On top of that, the volume of data required for training grows quadratically with the number of supported languages.
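The quadratic growth can be seen by counting directions. The sketch below assumes a "direction" is an ordered (source, target) pair of distinct languages; the function names are illustrative only.

```python
def num_directions(n_languages: int) -> int:
    """Ordered (source, target) pairs among n languages, source != target."""
    return n_languages * (n_languages - 1)


def num_english_centric_directions(n_languages: int) -> int:
    """Directions with English on one side: X -> en and en -> X."""
    return 2 * (n_languages - 1)


for n in (10, 50, 100):
    print(n, num_directions(n), num_english_centric_directions(n))
```

For 100 languages this gives 9,900 possible directions, against 198 that involve English. That gap is why covering every direction with mined data is so costly, and why M2M-100 trains on a chosen subset of about 2,200.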
Mining Large Data: Bridge-Strategy And Back-Translation
Building a more diverse many-to-many MMT data set was made possible by combining complementary data mining resources that have been years in the making, including CCAligned, CCMatrix, and LASER. Facebook AI also created a new LASER 2.0 and improved fastText language identification, both of which improve mining quality, and open sourced the training and evaluation scripts.
But these tools alone were not enough to gather training data for every arbitrary pair among 100 different languages. So the following strategies were developed:
- Mining prioritizes the languages with the most translation requests and the directions with the highest-quality and largest-quantity data, avoiding statistically rare translations like Icelandic-Nepali or Sinhala-Javanese.
- The languages are then divided into 14 groups based on geography and on cultural and linguistic similarities. For example, one group includes Bengali, Hindi, Marathi, Nepali, Tamil, and Urdu, as these are spoken in India and are more likely to be translated into one another.
- Further, to connect the languages of different groups, a small number of bridge languages are identified in each group, such as Hindi, Bengali, and Tamil for Indo-Aryan languages. Parallel training data is then mined for all possible combinations of these bridge languages.
- In addition to this, back-translation is used to supplement the training of directions that have already been mined.
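The grouping and bridge steps above can be sketched as a way of choosing which directions to mine. This is a minimal illustration, not the authors' pipeline: the Indic group and its bridges come from the text, while the other two groups, their bridges, and the choice to mine every direction inside a group are assumptions made for the example.

```python
from itertools import permutations

# Toy grouping by ISO 639-1 code. Only the "indic" entries come from the
# article; "romance" and "germanic" are hypothetical additions.
GROUPS = {
    "indic": ["bn", "hi", "mr", "ne", "ta", "ur"],
    "romance": ["es", "fr", "it", "pt"],
    "germanic": ["de", "nl", "sv"],
}
BRIDGES = {
    "indic": ["hi", "bn", "ta"],
    "romance": ["es", "fr"],   # hypothetical
    "germanic": ["de"],        # hypothetical
}


def select_directions(groups, bridges):
    """Return the set of (source, target) directions chosen for mining."""
    directions = set()
    # Assumption: every direction inside a group is mined.
    for langs in groups.values():
        directions.update(permutations(langs, 2))
    # Bridge languages connect the groups: mine all bridge-to-bridge directions.
    all_bridges = [lang for langs in bridges.values() for lang in langs]
    directions.update(permutations(all_bridges, 2))
    return directions


selected = select_directions(GROUPS, BRIDGES)
n_langs = sum(len(langs) for langs in GROUPS.values())
print(len(selected), "of", n_langs * (n_langs - 1), "possible directions")
```

In this toy setup, Hindi-French is selected because both are bridge languages, while Marathi-Italian is not, since neither language is a bridge and they sit in different groups. The selected set stays far smaller than the full set of directions while every language remains connected to the rest through its group's bridges.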
To improve further, the model needs to incorporate the latest research and the more specialized computation architectures necessary to bring it to production.