This AI Paper from Cohere AI Reveals Aya: Bridging Language Gaps in NLP with the World’s Largest Multilingual Dataset

Datasets are an integral part of the field of Artificial Intelligence (AI), especially when it comes to language modeling. The ability of Large Language Models (LLMs) to respond to instructions efficiently is attributed to the fine-tuning of pre-trained models, which has led to recent advances in Natural Language Processing (NLP). This process of Instruction Fine-Tuning (IFT) requires annotated and well-constructed datasets.

However, most of the datasets now in existence are in the English language. A team of researchers from Cohere AI in recent research have aimed to close the language gap by creating a human-curated dataset of instruction-following that is available in 65 languages. In order to achieve this, the team has worked with native speakers of numerous languages throughout the world, gathering real examples of instructions and completions in diverse linguistic contexts.

The team has shared that it hopes to add to the largest multilingual collection to date in addition to this language-specific dataset. This includes translating current datasets into 114 languages and producing 513 million instances through the use of templating techniques. The goal of this strategy is to improve the diversity and inclusivity of the data that is accessible for training language models.

Naming it as the Aya initiative, the team has shared the development and public release of four essential materials as a component of the project. The components are the Aya Annotation Platform, which makes annotation easier; Aya Dataset, which is the human-curated dataset for instruction-following; Aya Collection, which is the large multilingual dataset covering 114 languages; and Aya Evaluation Suite, which is a tool or framework for evaluating the effectiveness of language models trained on the Aya datasets.

The team has summarized their primary contributions as follows.

  1. Aya UI, or the Aya Annotation Platform: A powerful annotation tool has been developed that supports 182 languages, including dialects, and makes it easier to gather high-quality multilingual data in an instruction-style manner. It has been operating for eight months, registering 2,997 users from 119 countries speaking 134 different languages, indicating a broad and international user base. 
  1. The Aya Dataset – The world’s largest dataset of over 204K examples in 65 languages has been compiled for human-annotated multilingual instruction fine-tuning.
  1. Aya Collection – Instruction-style templates have been gathered from proficient speakers and have been used on 44 carefully selected datasets that addressed tasks such as open-domain question answering, machine translation, text classification, text generation, and paraphrasing. 513 million released examples have covered 114 languages, making it the largest open-source collection of multilingual instruction-finetuning (IFT) data. 
  1. Aya Evaluation – A varied test suite for multilingual open-ended generation quality has been curated and made available. It includes the English original prompts as well as 250 human-written prompts for each of the seven languages, 200 automatically translated yet human-selected prompts for 101 languages (114 dialects), and human-edited prompts for six languages.
  1. Open source – The annotation platform’s code, as well as the Aya Dataset, Aya Collection, and Aya Evaluation Suite, have been made all fully open-sourced under a permissive Apache 2.0 license.

In conclusion, the Aya initiative has been positioned as a useful case study in participatory research as well as dataset creation.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and Google News. Join our 37k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and LinkedIn Group.

If you like our work, you will love our newsletter..

Don’t Forget to join our Telegram Channel

Tanya Malhotra is a final year undergrad from the University of Petroleum & Energy Studies, Dehradun, pursuing BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with good analytical and critical thinking, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.

🐝 Join the Fastest Growing AI Research Newsletter Read by Researchers from Google + NVIDIA + Meta + Stanford + MIT + Microsoft and many others...