MLCommons Releases Both A Multilingual Speech Dataset And A Large 30,000 Hour Diverse English Dataset To Drive Democratization of Machine Learning

The MLCommons Association, an open engineering community dedicated to making machine learning more accessible to everyone, has released free datasets and tools to help democratize machine learning. The two major new datasets are the People’s Speech Dataset and the Multilingual Spoken Words Corpus (MSWC). Organizations can use these openly licensed datasets to build better artificial intelligence models.

About the MLCommons Association:

The goal of MLCommons is to level the AI development playing field. Smaller businesses are clearly at a disadvantage when developing voice recognition models because the most comprehensive datasets have always come with hefty license fees. Furthermore, tech behemoths like Google LLC and Apple Inc. may amass vast amounts of free training data via devices like cell phones. 

The MLCommons Association is centered on collaborative engineering work that builds tools for the entire machine learning industry. It pursues this work through benchmarks and metrics, public datasets, and best practices. MLCommons collaborates with its 50+ founding member partners, which include global technology providers, academics, and researchers.

The People’s Speech Dataset:

The People’s Speech Dataset is a supervised conversational dataset with 30,000 hours of data. It is one of the world’s most comprehensive English-language speech datasets, and it is free to use for academic and commercial purposes. The dataset aims to make speech technology, such as voice assistants and transcription, more accessible to everyone while also allowing the machine learning community to innovate. Researchers from Baidu, Factored, Harvard University, Intel, Landing AI, and NVIDIA have contributed to the dataset.

Multilingual Spoken Words Corpus (MSWC):

The MSWC is a large audio speech dataset with over 340,000 keywords in 50 languages and 23.4 million examples. Previous keyword datasets were frequently limited to a single language because they relied on manual effort to collect and validate thousands of utterances for each keyword. According to MLCommons, the MSWC can be used to train machine learning models for applications such as call centers and smart devices.

The MLCommons Association invites people to take part in the new DataPerf benchmark suite, which measures and fosters data-centric AI research.

What is DataPerf?

DataPerf promotes data-centric AI innovation by measuring dataset quality for popular machine learning tasks and the impact of improving those datasets. Understanding and improving datasets receives less attention than understanding and improving models, and DataPerf encourages and tracks progress in this neglected area. Traditionally, AI research has focused on improving model architectures and making them publicly available. Engineering and maintaining datasets, by contrast, has lagged behind and is frequently laborious and ad hoc.

The MLCommons Association is a strong supporter of Data-Centric AI (DCAI), a discipline that focuses on systematically engineering data for AI systems by creating software tools and engineering practices that make dataset development and curation more efficient. Open datasets and tools, such as DataPerf, help foster machine learning innovation and support the DCAI movement.
