Multilingual Neural Machine Translation (MNMT) reduces deployment costs by allowing a single system to translate sentences between several source and target languages.
Gauging the efficacy of models developed for massive MNMT requires large amounts of test data, but producing such material is expensive, so high-quality test sets are scarce. This is especially true for test sets covering 100+ languages, and the scarcity is a roadblock to the development of such models.
While some multilingual benchmark test sets already exist, broader language coverage is needed to advance the field.
New Microsoft research introduces NTREX-128, a dataset of “News Text References of English into X Languages,” which significantly expands multilingual test coverage from English into 128 target languages. The NTREX-128 benchmark consists of 123 documents (1,997 sentences, roughly 42k words) translated from English into 128 languages. The data mirrors the WMT19 test set and is fully compatible with SacreBLEU.
The team has open-sourced their work to serve as a new standard against which massively multilingual machine translation models can be judged.
To generate this dataset, the team distributed the original English WMT19 test set to expert human translators. They believed that the test data quality must be sufficient for it to be of any use. Therefore, they mostly focused on two criteria:
- Reference translations should not be crafted from post-edited MT output
- Translations must be produced by native speakers of the corresponding target language who are also fluent in English.
Before delivering the test set files, the translation provider ran quality assurance as part of their translation process. After receiving the files, the team used the Appraise framework’s implementation of source-based direct assessment (src-DA) to distribute them for human review. They hired a third-party company to handle the annotation so that the evaluation would be free of bias.
Ultimately, they obtain segment-level quality scores from the judgments of bilingual annotators fluent in both the source and target languages. The quality of the semantic transfer from the source into the target language is expressed as a score from 0 to 100. Although this emphasizes adequacy at the expense of fluency, recent research suggests this trade-off is acceptable.
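To illustrate how such segment-level src-DA judgments roll up into a system-level number, here is a small sketch in plain Python. The score values and the mean-based aggregation are assumptions for illustration, not the paper’s exact procedure.

```python
from statistics import mean

# Hypothetical segment-level src-DA scores (0-100) from bilingual annotators,
# keyed by segment id. Each list holds one score per annotator.
segment_scores = {
    "seg-001": [92, 88, 95],
    "seg-002": [70, 75],
    "seg-003": [100, 97, 99],
}

# Average across annotators per segment, then across segments, for a
# simple system-level quality estimate.
per_segment = {seg: mean(scores) for seg, scores in segment_scores.items()}
system_score = mean(per_segment.values())
print(f"system-level src-DA: {system_score:.1f}")  # prints 87.6
```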
The recent success of embedding-based automatic evaluation metrics like COMET motivated the researchers to experiment with the NTREX-128 dataset, comparing COMET-src scores for the authentic translation direction with scores produced in the reverse direction. They also examined COMET-src’s performance on languages it was never trained on as a supplementary concern.
Their results suggest that even though COMET-src can be used for quality estimation of test data, its applicability is constrained by the following issues:
- For a sizable minority of language pairs, COMET-src scores on translationese input are higher than those for the corresponding authentic source data.
- While relative comparisons of COMET-src scores work for all language pairs, there exists a minority of languages for which the scores appear broken. The fact that COMET has never encountered samples of training data for these languages is one possible explanation for this.
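The first issue above can be made concrete with a small sketch: given COMET-src scores for both the authentic direction and the reverse (translationese-sourced) direction of each language pair, flag the pairs where translationese input scores higher. The scores below are invented for illustration.

```python
# Hypothetical COMET-src scores per language pair, as
# (authentic direction, reverse/translationese direction).
scores = {
    "en-de": (0.62, 0.55),
    "en-fi": (0.48, 0.53),   # translationese outscores authentic: suspicious
    "en-sw": (0.30, 0.41),   # translationese outscores authentic: suspicious
}

# Flag pairs where the reverse direction outscores the authentic one.
suspicious = [pair for pair, (auth, rev) in scores.items() if rev > auth]
print(suspicious)  # prints ['en-fi', 'en-sw']
```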
Check out the Paper and Github. All credit for this research goes to the researchers on this project.
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Bhubaneswar. She is a Data Science enthusiast with a keen interest in applications of artificial intelligence across various fields, and is passionate about exploring new advancements in technology and their real-life applications.