Open data sets and benchmarks are key drivers of recent advances in the field of Artificial Intelligence. We are steadily progressing towards a scenario where AI-powered systems communicate with us in a genuinely human-like manner. Facebook AI is set to release a new large-scale, open-source dataset called Multilingual LibriSpeech (MLS), designed to help researchers working on Automatic Speech Recognition (ASR). ASR is a technology that converts spoken words into text, allowing humans to speak with a computer interface in a way that resembles normal human conversation.
MLS aims to help the speech research community work in languages beyond English, enabling people worldwide to benefit from improvements in AI-powered services. Currently, MLS contains more than 50,000 hours of audio in eight languages: English, German, Dutch, French, Spanish, Italian, Portuguese, and Polish. It also provides language-model training data, pre-trained models, and baselines, which help researchers compare different ASR systems. MLS leverages public-domain audiobooks from the LibriVox project and can therefore offer an extensive data set with a non-restrictive license.
The major steps involved in preparing the MLS data set are: downloading the audiobooks, audio segmentation and pseudo-label generation, downloading text sources for the audiobook data, transcript retrieval, and creating validation and test splits. MLS is a read-speech data set that builds on the widely used LibriSpeech ASR benchmark. The researchers first segmented the audio and aligned it with the audiobooks' text, making it possible to retrieve the best-matching transcript for each audio segment.
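To illustrate the transcript-retrieval idea, the sketch below slides a window over the book text and keeps the span with the smallest word-level edit distance to an ASR pseudo-label. The function names and the simple windowing strategy are assumptions for illustration only, not the actual alignment procedure used to build MLS:

```python
def word_edit_distance(a, b):
    # Levenshtein distance computed over word tokens, not characters.
    a, b = a.split(), b.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (wa != wb)))  # substitution
        prev = cur
    return prev[-1]

def best_matching_transcript(pseudo_label, book_words, window):
    # Hypothetical helper: slide a fixed-size window over the book text
    # and return the span closest to the ASR pseudo-label.
    best, best_dist = None, float("inf")
    for start in range(len(book_words) - window + 1):
        candidate = " ".join(book_words[start:start + window])
        d = word_edit_distance(pseudo_label, candidate)
        if d < best_dist:
            best, best_dist = candidate, d
    return best, best_dist
```

In practice the window length would be derived from the segment duration, and the search constrained by the alignment, but the matching criterion is the same.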
The researchers used Facebook AI's open-source wav2letter@anywhere framework to carry out streaming inference and alignment. MLS also provides subsets with limited labeled data for all the included languages. The team prepared language-modeling data using public-domain books from the Project Gutenberg digital library. To create the language-model corpus, the researchers filtered out books that overlapped with the development and test sets and normalized language-specific text. The baseline models were trained and decoded with a 5-gram language model for each language. All models were trained on 32GB Nvidia V100 GPUs: 64 GPUs for the English, German, Dutch, Spanish, and French models, and 16 GPUs for the Italian, Portuguese, and Polish models.
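The idea behind an n-gram language model like the 5-gram models used for decoding can be sketched by counting n-grams over the normalized text corpus. The minimal version below uses plain add-alpha smoothing and hypothetical function names; production systems rely on dedicated toolkits with far more sophisticated smoothing and pruning:

```python
from collections import Counter

def train_ngram_counts(corpus, n=5):
    # Count n-grams and their (n-1)-gram prefixes, with sentence padding.
    grams, prefixes = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            grams[gram] += 1
            prefixes[gram[:-1]] += 1
    return grams, prefixes

def ngram_prob(grams, prefixes, context, word, vocab_size, alpha=1.0):
    # Add-alpha smoothed conditional probability P(word | context).
    gram = tuple(context) + (word,)
    return (grams[gram] + alpha) / (prefixes[tuple(context)] + alpha * vocab_size)
```

During decoding, such a model rescores the acoustic model's candidate word sequences, favoring hypotheses that look like fluent text in the target language.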
When the team compared evaluation results on the LibriSpeech test set for models trained on MLS's English subset against those trained on LibriSpeech itself, they observed a 20% relative improvement in word error rate. MLS will prove a valuable tool for researchers training ASR systems: its English dataset alone is almost 47 times larger than the data present in LibriSpeech. Most importantly, MLS provides an extensive multilingual data set with a non-restrictive license, which should boost open, collaborative research and improve speech recognition techniques in multiple languages worldwide. The researchers believe that such an extensive transcribed dataset will open new avenues in ASR and Text-To-Speech (TTS) research. Detailed information on the pre-trained monolingual models and step-by-step instructions to reproduce the results can be found on GitHub. The data set will be freely available at OpenSLR.
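Word error rate, the metric behind that 20% figure, is the word-level Levenshtein distance between the hypothesis and the reference transcript, divided by the reference length. A minimal sketch:

```python
def wer(reference, hypothesis):
    # Word error rate: word-level edit distance / number of reference words.
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + (r[i - 1] != h[j - 1]))  # substitution
    return d[-1][-1] / len(r)
```

A 20% relative improvement means the new model's WER is 20% lower than the baseline's (e.g. dropping from 10.0 to 8.0, figures here purely illustrative).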