Amazon Research Introduces MTGenEval: A New Benchmark For Evaluating Gender Bias In Machine Translation

It has been a long-held goal of the field of computer science to develop software capable of translating written text between languages. The last decade has seen machine translation emerge as a practical and widely used productivity tool. As their popularity grows, it’s becoming more crucial to check that they’re objective, fair, and truthful.

It is challenging to evaluate the effectiveness of systems in terms of gender and quality because existing benchmarks lack variation in gender phenomena (e.g., focusing on professions), sentence structure (e.g., utilizing templates to generate sentences), or language coverage.

To this purpose, a new work by Amazon presents MTGenEval, a new benchmark for evaluating gender bias in machine translation. The MT-GenEval evaluation set is comprehensive and realistic, and it supports translation from English into eight widely spoken (but sometimes understudied) languages: Arabic, French, German, Hindi, Italian, Portuguese, Russian, and Spanish. The benchmark provides 2,400 parallel phrases for training and development and 1,150 evaluation data segments per language pair. 

MTGenEval is well-balanced thanks to the inclusion of human-created gender counterfactuals, which give it realism and diversity in addition to a large range of settings for disambiguation. 

Generally, the test sets are generated artificially, which includes heavy biases. In contrast, MT-GenEval data is based on real-world data collected from Wikipedia and contains professionally made reference translations in each language. 

Learning how gender is expressed in several languages can help spot common areas where translations fail. It’s true that some English terms, like “she” (female) or “brother,” has no room for ambiguity when it comes to describing their gender (male gender). Nouns, adjectives, verbs, and other parts of speech can be marked for gender in many languages, including those included in MT-GenEval.

A machine translation model must not only translate but also accurately express the genders of words that lack gender in the input when translating from a language with no or restricted gender (like English) into a language with extensive grammatical gender (like Spanish).

However, in practice, input texts are rarely so straightforward, and the term that disambiguates a person’s gender may be quite distant, perhaps even in a different phrase, from the words that represent gender in the translation. We have found that machine translation models are prone to rely on gender preconceptions (such as translating “beautiful” as female and “handsome” as male regardless of context) when faced with ambiguity in these situations.

Even while there have been isolated incidents when translations have failed to accurately reflect the intended gender, there has been no means to statistically assess these occurrences in actual, complicated input text until now. 

The researchers searched English Wikipedia articles for candidate text segments that included at least one gendered word within a three-sentence range. To guarantee that the segments were useful for gauging gender accuracy, human annotators removed any sentences that did not specifically refer to people.

The annotators then produced counterfactuals for the segments in which the participants’ gender was switched from female to male or male to female to ensure gender parity in the test set.

Every segment in the test set has both a correct translation with the right genders and a contrastive translation, which differs from the correct translation solely in terms that are gender specific, allowing evaluation accuracy of the gender translation. This study introduces a simple metric of accuracy, which involves considering all the gendered words in the contrasting reference for a given translation with the desired gender. The translation is indicated as inaccurate if it includes any of the gendered words from the contrastive reference and as correct otherwise. Their finding shows that their automatic metric reasonably matched those of human annotators with F scores above 80% in each of the eight target languages.

In addition to this linguistic evaluation, the team also develop a metric to compare machine translation quality between male and female outputs. This gender disparity in quality is measured by comparing the BLEU scores of male and female samples from the same balanced dataset.

MT-GenEval is a significant improvement over previous methods for assessing machine translation’s fidelity for gender thanks to its substantial curation and annotation. The team hopes their work will encourage other academics to focus on increasing gender translation accuracy for complicated, real-world inputs in various languages.

Check out the Paper and Amazon Blog. All Credit For This Research Goes To Researchers on This Project. Also, don’t forget to join our Reddit page and discord channel, where we share the latest AI research news, cool AI projects, and more.

🐝 Join the Fastest Growing AI Research Newsletter Read by Researchers from Google + NVIDIA + Meta + Stanford + MIT + Microsoft and many others...