Amazon Kickstarts Natural Language Understanding By Open-Sourcing ‘MASSIVE’ Speech Dataset

This article is based on the Amazon article 'Amazon releases 51-language dataset for language understanding'. All credit for this research goes to the Amazon researchers.


To scale natural language understanding to every spoken language on Earth, Amazon has announced the release of its open-source ‘MASSIVE’ speech dataset. The goal in curating the dataset is to help researchers build virtual assistants that generalize to some of the world’s most underrepresented languages. Alongside the dataset, Amazon has also published open-source modeling code to help developers create more capable virtual assistants.

Recent breakthroughs in speech recognition and natural language understanding (NLU) have paved the way for voice-activated digital assistants such as Siri, Bixby, and Google Assistant. The primary shortcoming of these assistants is that they are available in only a handful of widely spoken languages. MASSIVE is a step toward multilingual NLU models that adapt smoothly to languages with scarce training data, so that people around the world can use conversational AI systems like Alexa in their native languages.

The Multilingual Amazon SLURP for Slot Filling, Intent Classification, and Virtual-assistant Evaluation, or MASSIVE for short, is a parallel dataset of one million labeled utterances spanning 51 languages, including several that lack properly labeled data, along with open-source code demonstrating how to perform massively multilingual NLU modeling. Alexa is currently available in seven languages; the company’s long-term vision is to support the more than 7,000 languages spoken around the world.
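To make the idea of a “parallel dataset” concrete, here is a minimal illustrative sketch in Python. The field names, intent label, and translations below are invented for illustration and do not reflect the actual MASSIVE schema; the point is that each record carries the same intent and slot labels across every language, differing only in surface text.

```python
# Hypothetical sketch of one record in a parallel NLU dataset
# (field names and labels are illustrative, not the real MASSIVE schema).
parallel_record = {
    "id": 0,
    "intent": "alarm_set",  # shared label across all languages
    "translations": {
        "en-US": "wake me up at nine am on friday",
        "de-DE": "weck mich am freitag um neun uhr",
        "fr-FR": "réveille-moi vendredi à neuf heures",
    },
}

# Because every language shares one label set, a model trained on the
# English utterances can be evaluated directly on any other language.
languages = sorted(parallel_record["translations"])
print(languages)  # ['de-DE', 'en-US', 'fr-FR']
```

This shared-label structure is what lets researchers measure zero-shot cross-lingual transfer: train on one language’s utterances, test on another’s.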

Professional translators produced the dataset by translating the English-only SLURP dataset into 50 varied languages that lacked labeled data. Natural language understanding (NLU) is a branch of natural language processing (NLP) concerned with converting human language into a machine-readable representation. According to Amazon, MASSIVE will be especially effective for improving spoken-language understanding, in which audio is first transcribed into text before NLU is performed.
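The two NLU tasks named in the dataset’s title, intent classification and slot filling, operate on utterances whose spans are tagged with slot labels. As a rough sketch, assuming a hypothetical bracket-style annotation of the form "[slot : value]" (a common convention in SLU datasets, though not necessarily MASSIVE’s exact format), slot extraction might look like this:

```python
import re

def parse_slots(annotated: str) -> dict:
    """Extract slot-name/value pairs from a bracket-annotated utterance.

    Assumes a hypothetical "[slot : value]" annotation style.
    """
    return {
        name.strip(): value.strip()
        for name, value in re.findall(r"\[([^:\]]+):([^\]]+)\]", annotated)
    }

annotated = "wake me up at [time : nine am] on [date : friday]"
print(parse_slots(annotated))  # {'time': 'nine am', 'date': 'friday'}
```

An intent classifier would then assign the whole utterance a label such as "alarm_set", while the slot filler recovers the structured arguments the assistant needs to act on the request.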

Amazon is also launching a new competition, Massively Multilingual NLU 2022 (MMNLU-22), that will use the MASSIVE dataset to encourage researchers to design models that adapt readily to new languages and to spur more third-party apps for Alexa. The competition comprises two tasks. Its results will be presented in December at Massively Multilingual NLU 2022, an EMNLP 2022 workshop held in Abu Dhabi and online, which will also feature invited talks and oral and poster sessions for submitted papers on multilingual natural language processing.

Amazon envisions products like Alexa and Echo reaching all customers and devices. With the dataset, the modeling code, and the competition, it aims to become a key player in the global multilingual NLU community.




Khushboo Gupta is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Goa. She is passionate about the fields of Machine Learning, Natural Language Processing, and Web Development. She enjoys learning more about the technical field by participating in several challenges.