Facebook AI Releases ‘VoxPopuli’, A Large-Scale Open Multilingual Speech Corpus For AI Translations in NLP Systems

Speech recognition and translation technologies are now in wide use, and in principle these AI systems can be built for many different languages. In practice, they are available for only a handful of widely spoken languages such as English or Mandarin, and there is still plenty to do before they work for the 6,500+ other human languages.

Facebook AI is releasing VoxPopuli, a collection of 400,000 hours of audio recordings in 23 languages, to help accelerate the development of new NLP systems. The VoxPopuli data set also includes transcribed speeches in 15 languages, along with oral interpretations into 17 target languages, for over 1,800 hours in total.

VoxPopuli provides 9,000 to 18,000 hours of unlabeled speech per language. It is an important supplement to existing corpora, and it arrives at a time when technology has progressed enough to translate from other languages into English faster than ever before.

Facebook’s research team collected data in 23 languages from publicly available European Parliament event recordings and built pipelines to segment the recordings by speaker or at silences. They aligned the segments with transcripts, filtered out inaccurate alignments, and provided speech recognition baselines for semi-supervised ASR under challenging conditions.
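To make the two pipeline steps above more concrete, here is a minimal Python sketch of silence-based segmentation and of filtering out poorly aligned transcripts. It is an illustration under stated assumptions, not the VoxPopuli pipeline itself: the article does not say how silence or inaccurate alignments were detected, so the energy threshold, the character error rate (CER) check, and all function and parameter names below are hypothetical choices.

```python
from typing import List, Tuple


def split_on_silence(
    samples: List[float],
    frame_len: int = 4,
    energy_threshold: float = 0.01,
    min_silence_frames: int = 2,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of non-silent segments.

    Assumption: a frame counts as silent when its mean squared amplitude
    is below energy_threshold. A segment is closed after at least
    min_silence_frames consecutive silent frames; shorter pauses are bridged.
    """
    segments = []
    seg_start = None       # start sample of the segment being built
    last_voiced_end = 0    # end sample of the most recent voiced frame
    silent_run = 0         # consecutive silent frames seen so far

    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        if energy >= energy_threshold:
            if seg_start is None:
                seg_start = start
            last_voiced_end = start + len(frame)
            silent_run = 0
        elif seg_start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                segments.append((seg_start, last_voiced_end))
                seg_start = None
                silent_run = 0

    if seg_start is not None:
        segments.append((seg_start, last_voiced_end))
    return segments


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance between the strings, divided by len(reference)."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1] / max(len(reference), 1)


def filter_alignments(
    pairs: List[Tuple[str, str]], max_cer: float = 0.25
) -> List[Tuple[str, str]]:
    """Keep (transcript, recognized_text) pairs whose CER is at most max_cer.

    Assumption: the recognized text comes from some ASR system run on the
    aligned audio segment; a high CER suggests the alignment is wrong.
    """
    return [(ref, hyp) for ref, hyp in pairs if char_error_rate(ref, hyp) <= max_cer]
```

For example, `split_on_silence` applied to two bursts of speech separated by a long pause returns one index range per burst, and `filter_alignments` drops a pair such as `("hello", "world")` while keeping `("hello", "hallo")`. A production pipeline would normally rely on a trained voice activity detector and speaker diarization rather than a fixed energy threshold.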

The research shows that the larger amount of unlabeled data and the broader language coverage in VoxPopuli help improve self-supervised models. Both contribute to quality and robustness, making the resulting models more reliable for training neural networks that translate speech. The automatically aligned data, judged high quality by its score on a translation benchmark, also yields improvements for English and many other languages.

GitHub: https://github.com/facebookresearch/voxpopuli

Paper: https://arxiv.org/abs/2101.00390

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform has over 2 million monthly views, illustrating its popularity among audiences.
