The Baidu team is excited to present ERNIE-M, a new multilingual model capable of understanding 96 languages, trained with a new method that improves cross-lingual transferability on data-sparse languages. ERNIE-M delivers new state-of-the-art results on five cross-lingual downstream tasks and tops the XTREME leaderboard.
Most natural language processing innovations are spawned in high-resource languages such as English and Chinese. Meanwhile, more than 6,500 languages worldwide have scarce data resources, which makes it hard for machines to understand them, limits AI's democratization, and leaves thousands of low-resource languages behind.
Although training a separate model for each language might be possible, multilingual model research has seen significant advancements over the past few years. Cross-lingual models learn a shared, language-agnostic representation across multiple languages and enable transfer learning from high-resource languages to low-resource ones.
The existing approach, which has proven effective, trains a model on different monolingual datasets to learn semantic representations and captures semantic alignment across languages on parallel corpora. However, parallel corpora are limited in size, which restricts the model's performance.
In a new paper, the team proposes a novel cross-lingual pre-training method that can learn semantic alignment across multiple languages on monolingual corpora.
The Key to ERNIE-M: Cross-lingual Alignment and Back-translation
The training of ERNIE-M consists of two stages:
- In the first stage, Cross-attention Masked Language Modeling (CAMLM) aligns cross-lingual semantic representations on a small parallel corpus. In CAMLM, the model learns multilingual semantic representations by restoring the MASK tokens in the input sentences.
- In the second stage, Back-translation Masked Language Modeling (BTMLM) aligns cross-lingual semantics on monolingual corpora. The team uses BTMLM to train the model to generate pseudo-parallel sentences from monolingual sentences. The generated pairs are then used as the model's input to further align the cross-lingual semantics, thus enhancing the multilingual representation.
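The second stage can be illustrated with a minimal, toy sketch of the pseudo-parallel construction: mask slots are appended after a monolingual sentence, a trained model (stubbed out here) fills them with pseudo target-language tokens, and the resulting pair is fed back as aligned input. All names below (`build_btmlm_input`, `generate_pseudo_parallel`, `toy_predict`) are illustrative assumptions, not Baidu's actual API.

```python
MASK = "[MASK]"

def build_btmlm_input(src_tokens, n_pseudo):
    """Append mask placeholders after a monolingual sentence; the model
    is expected to fill them with pseudo target-language tokens."""
    return src_tokens + [MASK] * n_pseudo

def generate_pseudo_parallel(src_tokens, n_pseudo, predict_tokens):
    """Sketch of the BTMLM idea: generate pseudo target tokens for the
    appended mask slots, then reuse (source, pseudo-target) as an
    aligned pair for further training."""
    masked = build_btmlm_input(src_tokens, n_pseudo)
    pseudo_target = predict_tokens(masked, n_pseudo)  # stand-in for the model
    return src_tokens, pseudo_target

# Toy stand-in for the model trained with CAMLM in stage one.
def toy_predict(masked, n_pseudo):
    return ["pseudo_%d" % i for i in range(n_pseudo)]

pair = generate_pseudo_parallel(["I", "love", "music"], 2, toy_predict)
```

In the real method the predictions come from the model pre-trained with CAMLM in stage one, so the pseudo tokens inherit the alignment learned from the parallel corpus.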
The team utilized five cross-lingual evaluation benchmarks to test the efficacy of ERNIE-M:
- XNLI for cross-lingual natural language inference,
- MLQA for cross-lingual question answering,
- CoNLL for cross-lingual named entity recognition,
- PAWS-X for cross-lingual paraphrase identification,
- Tatoeba for cross-lingual retrieval.
The team evaluated ERNIE-M in two formats:
- a cross-lingual transfer setting, fine-tuning the model on an English training set and evaluating it on each foreign-language test set;
- a multilingual fine-tuning setting, fine-tuning the model on the concatenation of the training sets in all languages and evaluating it on each language's test set.
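The difference between the two settings is just which training data is used; a minimal sketch (with a hypothetical `training_data` helper) makes the contrast concrete:

```python
def training_data(setting, datasets):
    """datasets: dict mapping language code -> list of training examples.
    Cross-lingual transfer trains on English only; multilingual
    fine-tuning trains on the concatenation of all languages."""
    if setting == "cross-lingual":
        return list(datasets["en"])
    if setting == "multilingual":
        return [ex for lang in sorted(datasets) for ex in datasets[lang]]
    raise ValueError("unknown setting: %s" % setting)
```

In both settings the model is then evaluated separately on every language's test set, so the cross-lingual setting measures pure zero-shot transfer from English.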
Cross-lingual Sentence Retrieval: The goal is to extract parallel sentences from bilingual corpora. ERNIE-M can retrieve results in multiple languages, such as French, English, and German, using only Chinese queries. This technology can bridge the gap between information expressed in different languages, helping people find more valuable information. ERNIE-M achieved an accuracy of 87.9% on a subset of the Tatoeba dataset covering 36 languages.
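A common way to perform such retrieval (a generic sketch, not necessarily ERNIE-M's exact procedure) is to embed every sentence into the shared multilingual vector space and return the candidate whose embedding is most cosine-similar to the query's:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(query_vec, candidate_vecs):
    """Return the index of the candidate embedding closest to the query.
    In practice the vectors would come from the multilingual encoder;
    here they are plain lists of floats for illustration."""
    return max(range(len(candidate_vecs)),
               key=lambda i: cosine(query_vec, candidate_vecs[i]))
```

Because the model maps semantically equivalent sentences in different languages to nearby points, a Chinese query vector lands closest to its French, English, or German translation.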
Cross-lingual Natural Language Inference: This task determines whether the relationship between two input sentences is a contradiction, entailment, or neutral. ERNIE-M achieved 82.0% accuracy when trained on the English set only and 84.2% when trained on all training sets.
Cross-lingual Question Answering: Question answering is a classic NLP task used to test a machine's ability to provide an automated answer in natural language. ERNIE-M was fine-tuned on English data and achieved an accuracy of 55.3%.
Named entity recognition task: NER seeks to locate and classify named entities in text. ERNIE-M was evaluated on the CoNLL-2002 and CoNLL-2003 datasets, covering Dutch, English, Spanish, and German. The model was fine-tuned on English data and evaluated on Spanish, Dutch, and German, achieving an average F1 score of 81.6%.
Cross-lingual paraphrase identification: The team evaluated ERNIE-M on PAWS-X, a paraphrase identification dataset spanning seven languages. ERNIE-M achieved an accuracy of 89.5% when trained on the English set only and 91.8% when trained on all training sets.
ERNIE-M has wide-ranging applications and implications. The code and pre-trained models will be made publicly available soon.