Speech recognition and translation technologies are now in wide use, and in principle these AI systems can be built for many different languages. So far, however, they are available only for a handful of widely spoken languages such as English and Mandarin, and there is still plenty to do before they work for the 6,500+ other human languages.
Facebook AI is releasing VoxPopuli, a collection of 400,000 hours of audio recordings in 23 languages, to help accelerate the development of new NLP systems. The VoxPopuli data set also includes transcribed speeches in 15 languages, along with oral interpretations and written translations into 17 target languages, totaling more than 1,800 hours.
VoxPopuli is a new data set that provides 9,000 to 18,000 hours of unlabeled speech per language. It is an important supplement to existing corpora, and it arrives at a time when speech technology has progressed enough that translation from other languages into English is faster than ever.
Facebook’s research team collected the data in 23 languages from publicly available recordings of European Parliament events and built pipelines to segment the recordings by speaker or by silence. The team aligned the segments with transcripts, filtered out inaccurate alignments, and provided speech recognition baselines for semi-supervised ASR under challenging conditions.
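To make the segmentation and filtering steps more concrete, here is a minimal Python sketch. It is not Facebook's pipeline: the function names, the frame-energy silence detector, and the character error rate (CER) filter with its 0.2 threshold are all assumptions chosen for illustration. The sketch splits audio wherever a long enough run of low-energy frames occurs, then keeps only transcript pairs whose text roughly matches what a recognizer produced.

```python
from typing import List, Tuple


def segment_by_silence(samples: List[float], sample_rate: int, frame_ms: int = 20,
                       energy_threshold: float = 1e-3,
                       min_silence_frames: int = 5) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of segments separated by silence.

    Hypothetical helper: a frame is "silent" when its mean squared amplitude
    falls below energy_threshold (an illustrative value, not a tuned one).
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    segments: List[Tuple[int, int]] = []
    start = None          # sample index where the current segment began
    last_voiced_end = 0   # end of the most recent non-silent frame
    silent_run = 0        # consecutive silent frames inside a segment
    for begin in range(0, len(samples), frame_len):
        frame = samples[begin:begin + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        if energy >= energy_threshold:
            if start is None:
                start = begin
            last_voiced_end = begin + len(frame)
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                # The pause is long enough: close the segment.
                segments.append((start, last_voiced_end))
                start = None
                silent_run = 0
    if start is not None:
        segments.append((start, last_voiced_end))
    return segments


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def filter_alignments(pairs: List[Tuple[str, str]],
                      max_cer: float = 0.2) -> List[Tuple[str, str]]:
    """Keep (transcript, recognizer_output) pairs at or below max_cer.

    The 0.2 default is an assumed cutoff for this sketch.
    """
    return [(ref, hyp) for ref, hyp in pairs
            if character_error_rate(ref, hyp) <= max_cer]
```

In practice the thresholds would need tuning per recording condition, and segmenting by speaker requires a separate diarization step that this sketch does not attempt.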
The research shows that the larger volume of unlabeled data and the broader language coverage in VoxPopuli help improve self-supervised models. The additional data improves the quality and robustness of these models, making the data set more reliable for training neural networks to translate speech. The automatically aligned data was also judged to be high quality based on translation benchmark scores, for English and for many other languages.