Amazon Researchers Release CoCoA-MT: A Dataset and Benchmark for Controlling Formality in Machine Translation

Neural machine translation (NMT) models have steadily improved over the years, and their quality is now quite close to that of human translators. Typically, an MT system is expected to provide a single translation for an input segment. However, there are numerous situations where more than one translation is correct.

The correct translation may depend on factors such as the relationship between the speakers, the intended audience, or attributes of the speaker(s). Honorifics present unique challenges, especially when translating from English into languages with formality markers. For instance, a translator working with English inputs may need to choose between different registers (degrees of formality) in the final product, such as tu and vous in French or tú and usted in Spanish.

Large labeled datasets have traditionally been used for training NMT models with formality control. Previous efforts were limited to a few languages because of the time and resources required to produce high-quality labeled translations for various languages.

To aid in the development of more accurate NMT systems capable of controlling formality, researchers at Amazon’s AWS AI Lab have released CoCoA-MT, a multidomain dataset with phrase-level annotations of formality and grammatical gender for six language pairs: English (EN) into French (FR), German (DE), Hindi (HI), Italian (IT), Japanese (JA), and Spanish (ES). Starting from a general NMT system and a small amount of manually labeled data, they produced MT systems whose output formality can be controlled.

For this work, expert translators were asked to create both formal and informal renditions of content written in English. The translators were directed to make only the minimal changes needed between the formal and informal versions (e.g., changing verb inflections, swapping pronouns). Using the translators’ additional annotations marking the formality level in each sentence, the team built a segment-level metric for gauging formality accuracy.
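Because the two renditions differ only minimally, the formality-marking words can be located with a plain token diff. The following is a minimal sketch of that idea, not the dataset's annotation procedure: the function name `contrastive_spans` and the Spanish sentence pair are illustrative assumptions, and whitespace tokenization is a simplification.

```python
from difflib import SequenceMatcher


def contrastive_spans(formal, informal):
    """Return the tokens that differ between a formal and an informal translation.

    Hypothetical helper for illustration; tokens are split on whitespace only.
    """
    formal_tokens = formal.split()
    informal_tokens = informal.split()
    matcher = SequenceMatcher(a=formal_tokens, b=informal_tokens, autojunk=False)

    formal_only, informal_only = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Collect whatever was replaced, deleted, or inserted on each side.
            formal_only.extend(formal_tokens[i1:i2])
            informal_only.extend(informal_tokens[j1:j2])
    return formal_only, informal_only


# Illustrative pair (not taken from CoCoA-MT): only the verb and pronoun change.
formal_markers, informal_markers = contrastive_spans(
    "¿Cómo está usted hoy?",
    "¿Cómo estás tú hoy?",
)
print(formal_markers, informal_markers)
```

A diff like this only works when the two versions stay close to each other, which is one reason a minimal-edit instruction to translators is useful.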

They also introduced a highly accurate reference-based automatic metric for distinguishing formal from informal system hypotheses (output translations), designed for use with the CoCoA-MT dataset. Finally, they propose training formality-controlled models via transfer learning on contrastive labeled data.
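To make the idea of a reference-based formality metric concrete, here is a minimal sketch. It assumes each test segment comes with the formality-marking phrases from its formal and informal references, and it labels a system output by which set of phrases it contains. The function names, the "neutral" fallback, and the example data are assumptions for illustration; the exact definition in the paper and its evaluation scripts may differ.

```python
import re


def tokenize(text):
    # Lowercased word tokens; punctuation is dropped. A deliberate simplification.
    return re.findall(r"\w+", text.lower())


def contains_phrase(tokens, phrase):
    # True if the phrase occurs as a contiguous token sequence.
    p = tokenize(phrase)
    n = len(p)
    return n > 0 and any(tokens[i:i + n] == p for i in range(len(tokens) - n + 1))


def classify(hypothesis, formal_phrases, informal_phrases):
    """Label one system output as 'formal', 'informal', or 'neutral'."""
    tokens = tokenize(hypothesis)
    has_formal = any(contains_phrase(tokens, p) for p in formal_phrases)
    has_informal = any(contains_phrase(tokens, p) for p in informal_phrases)
    if has_formal and not has_informal:
        return "formal"
    if has_informal and not has_formal:
        return "informal"
    return "neutral"  # no markers matched, or markers from both registers


def formality_accuracy(examples, target):
    """Fraction of outputs whose label equals the requested formality."""
    labels = [classify(hyp, formal, informal) for hyp, formal, informal in examples]
    return sum(label == target for label in labels) / len(labels)


# Illustrative data (not from CoCoA-MT): (system output, formal phrases, informal phrases)
examples = [
    ("¿Cómo está usted hoy?", ["está usted"], ["estás tú"]),
    ("¿Cómo estás hoy?", ["está", "usted"], ["estás", "tú"]),
]
print(formality_accuracy(examples, target="formal"))
```

Matching on whole tokens rather than raw substrings avoids false hits such as a short pronoun appearing inside a longer word.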

Their findings show that the proposed strategy benefits all six language pairs and holds up well across multiple datasets. The researchers conducted experiments showing that transfer learning on CoCoA-MT is economical relative to non-contrastive curated data and complements automatically labeled data, yielding high targeted accuracy while maintaining generic translation quality.

The team has open-sourced the CoCoA-MT dataset together with the Sockeye 3 baseline models and evaluation scripts to support further work on controlling multiple features simultaneously (formality and grammatical gender).


Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a keen interest in the applications of artificial intelligence across various fields. She is passionate about exploring new advancements in technology and their real-life applications.
