Voice-to-text processing has advanced significantly in recent years, to the point that failures in AI-powered speech recognition systems are now little more than curious outliers. However, most modern speech recognition models depend on large supervised training datasets. Obtaining such data is straightforward for widely spoken languages such as English and Chinese, but difficult for most of the roughly 8,000 languages spoken worldwide, the low-resource ones. To address this problem, a research team from Carnegie Mellon University built a speech recognition pipeline that needs no audio for the target language. The approach, called ASR2K, covers 1909 languages without target-language audio and, when given 10,000 raw text utterances, reaches 45 percent character error rate (CER) and 69 percent word error rate (WER) on the CMU Wilderness dataset. The work is described in the paper ‘ASR2K: Speech Recognition for Around 2000 Languages Without Audio.’
The pipeline assumes only that it has access to raw text datasets or a set of n-gram statistics for the target language. It has three components: an acoustic model, a pronunciation model, and a language model. The acoustic model recognizes the phonemes of the target language, including languages it has never seen. The pronunciation model is a grapheme-to-phoneme (G2P) model that predicts the phoneme sequence for a given grapheme sequence. In contrast to the conventional pipeline, both the acoustic and pronunciation models are multilingual and need no supervision in the target language: they are first trained on supervised datasets from high-resource languages, and the linguistic knowledge they acquire is then applied to low-resource languages without further supervision.
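To make the role of the pronunciation model concrete, here is a minimal sketch of turning a word list into a pronunciation lexicon. It is not the team's G2P model: the `TOY_G2P` table, `toy_g2p`, and `build_lexicon` are hypothetical names invented for illustration, and a hand-written lookup table stands in for the learned multilingual model.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical grapheme-to-phoneme table, for illustration only.
# The real system uses a learned multilingual G2P model instead.
TOY_G2P: Dict[str, str] = {
    "a": "a", "b": "b", "k": "k", "o": "o", "t": "t", "sh": "ʃ",
}


def toy_g2p(word: str) -> List[str]:
    """Map a word to phonemes by greedy longest-match lookup."""
    phonemes: List[str] = []
    i = 0
    while i < len(word):
        # Try two-letter graphemes such as "sh" before single letters.
        for width in (2, 1):
            chunk = word[i:i + width]
            if chunk in TOY_G2P:
                phonemes.append(TOY_G2P[chunk])
                i += len(chunk)
                break
        else:
            # Unknown grapheme: skip it rather than guess a phoneme.
            i += 1
    return phonemes


def build_lexicon(
    words: Iterable[str],
    g2p: Callable[[str], List[str]] = toy_g2p,
) -> Dict[str, List[str]]:
    """Return a word -> phoneme-sequence lexicon for the distinct words."""
    return {word: g2p(word) for word in sorted(set(words))}


if __name__ == "__main__":
    for word, phones in build_lexicon(["bat", "shot", "bat"]).items():
        print(word, " ".join(phones))
```

In a full pipeline, a lexicon like this is what links the phonemes emitted by the acoustic model to the words scored by the language model.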
The language model is built from the raw text dataset or from n-gram statistics. The pronunciation model encodes the approximate pronunciation of each word, producing a lexical graph. When a raw text dataset is available, a traditional n-gram language model can also be estimated by counting n-grams in the text. The language model and the pronunciation model are then combined into a weighted finite-state transducer (WFST) decoder. The team applied the proposed method to 1909 languages using Crúbadán, a large collection of n-gram statistics for endangered languages.
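Estimating an n-gram language model by counting can be sketched in a few lines. The example below is a bigram model with add-one smoothing, which is an assumption made here for simplicity; the source does not say which n-gram order or smoothing scheme the team used, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

BOS, EOS = "<s>", "</s>"  # sentence boundary markers


def count_bigrams(utterances: Iterable[str]) -> Tuple[Counter, Counter, Set[str]]:
    """Count bigram contexts and bigrams over whitespace-tokenized text."""
    contexts: Counter = Counter()
    bigrams: Counter = Counter()
    vocab: Set[str] = {EOS}  # tokens the model is allowed to predict
    for utterance in utterances:
        words = utterance.split()
        vocab.update(words)
        tokens = [BOS] + words + [EOS]
        contexts.update(tokens[:-1])                   # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))   # adjacent token pairs
    return contexts, bigrams, vocab


def bigram_prob(prev: str, word: str, contexts: Counter,
                bigrams: Counter, vocab: Set[str]) -> float:
    """P(word | prev) with add-one (Laplace) smoothing."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))


if __name__ == "__main__":
    contexts, bigrams, vocab = count_bigrams(["a b", "a c"])
    print(bigram_prob(BOS, "a", contexts, bigrams, vocab))
```

When only precomputed n-gram statistics are available, as with Crúbadán, the counting step is skipped and the provided counts take the place of `contexts` and `bigrams`.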
The method was evaluated on 129 languages across two datasets, Common Voice (34 languages) and CMU Wilderness (95 languages). With Crúbadán statistics alone, it achieved 50% CER and 74% WER on the Wilderness dataset; using 10,000 raw text utterances improved these figures to 45% CER and 69% WER. The work stands out as the first attempt to build an audio-free speech recognition pipeline for thousands of languages. The team’s paper and related code are due to be published at the 23rd INTERSPEECH Conference in South Korea.
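For readers unfamiliar with the two metrics, CER and WER are conventionally computed as the edit distance between the reference transcript and the system output, divided by the reference length, over characters and words respectively. The following is a minimal sketch of that standard definition, not the team's evaluation script; counting spaces as characters in CER is an assumption made here.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, deletions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items to match an empty hypothesis
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; spaces count as characters in this sketch."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    ref, hyp = "the cat sat", "the cat sit"
    print(f"WER={wer(ref, hyp):.2f} CER={cer(ref, hyp):.2f}")
```

Because one wrong character makes the whole word wrong, WER is usually higher than CER on the same output, which matches the pattern in the reported 45% CER and 69% WER.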
This article is a research summary written by Marktechpost Staff based on the research paper 'ASR2K: Speech Recognition for Around 2000 Languages without Audio'. All credit for this research goes to the researchers on this project.
Khushboo Gupta is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Goa. She is passionate about Machine Learning, Natural Language Processing, and Web Development, and enjoys deepening her technical knowledge by taking part in challenges.