Meta AI Researchers Build the First AI-Powered Speech Translation System for a Primarily Oral Language, Hokkien, Under the Universal Speech Translator (UST) Project

Although over half of the world’s 7,000+ living languages are predominantly oral and lack a standardized writing system, recent advances in AI translation have concentrated primarily on written languages. This is largely because conventional machine translation methods require substantial written content to train a model. To address this gap, Meta has created the first AI-powered Hokkien translation system. Hokkien is spoken widely across the Chinese diaspora, but no standardized written form of the language exists. Released as part of Meta’s Universal Speech Translator (UST) project, this open-source translation tool enables Hokkien and English speakers to communicate with each other.

Compared with standard machine translation systems, developing this speech-only translation system came with its own set of difficulties, spanning data collection, model creation, and evaluation. One of the main challenges Meta researchers faced was gathering enough data. Hokkien is a low-resource language, meaning far fewer training materials are readily available for it than for high-resource languages like French or English. Furthermore, collecting and annotating data for training is laborious because human English-to-Hokkien translators are scarce.

Instead, the researchers used Mandarin as an intermediate language to create pseudo-labels and human translations: English (or Hokkien) speech was first translated into Mandarin text, and then into Hokkien (or English). Leveraging a closely related high-resource language in this way significantly improved the model’s performance. Using speech mining, the researchers could also encode Hokkien speech into the same semantic embedding space as other languages without needing a written representation of the language; a pretrained speech encoder made this possible. Because Hokkien and English utterances share this semantic space, English speech could be synthesized from English text and paired with Hokkien speech to produce parallel Hokkien-English data.
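The speech-mining idea above can be sketched as a nearest-neighbor search in the shared embedding space: utterances from the two languages are paired when their encoder embeddings are sufficiently similar. The sketch below is a toy illustration with hand-made 2-D vectors, not Meta's actual pipeline; the function names and the similarity threshold are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mine_parallel_speech(hokkien_embs, english_embs, threshold=0.9):
    """Pair each Hokkien utterance with its nearest English utterance in the
    shared semantic space, keeping only pairs above the threshold.
    (Hypothetical helper; real mining runs at scale with approximate search.)"""
    pairs = []
    for i, h in enumerate(hokkien_embs):
        j, score = max(
            ((j, cosine(h, e)) for j, e in enumerate(english_embs)),
            key=lambda t: t[1],
        )
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs

# Toy 2-D "embeddings" standing in for speech-encoder outputs.
hok = [[1.0, 0.1], [0.0, 1.0]]
eng = [[0.9, 0.2], [0.1, 0.8], [1.0, 1.0]]
print(mine_parallel_speech(hok, eng))
```

Each returned `(i, j, score)` triple nominates Hokkien utterance `i` and English utterance `j` as a mined parallel pair for training.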

Many speech translation systems rely on speech-to-text or transcription-based methods. However, producing transcribed text as the translation output is not viable here, because primarily oral languages lack standardized written forms. Meta therefore concentrated on translating speech directly to speech. On the architectural side, building on Meta’s earlier work, speech-to-unit translation (S2UT) was employed to convert input speech into a sequence of discrete acoustic units, from which the output waveforms are then generated. UnitY was used as a two-pass decoding mechanism, where the first pass generates text in a related language (Mandarin) and the second pass generates the acoustic units.
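The two-pass structure described above can be illustrated with a minimal sketch. The lookup tables below are toy stand-ins for the real neural decoders, and every name here is hypothetical; the point is only the control flow: speech features are first decoded into Mandarin text, and that text then conditions a second decoder that emits discrete acoustic units, which a vocoder would finally render as a waveform.

```python
# Toy stand-in for pass 1: speech -> text in a related high-resource
# language (Mandarin in Meta's system). A real system uses a neural decoder.
FIRST_PASS = {"hokkien_utterance_01": "你好吗"}

# Toy stand-in for pass 2: intermediate text -> discrete acoustic units.
# A vocoder would turn these unit IDs into an output waveform.
SECOND_PASS = {"你好吗": [17, 42, 42, 8, 93]}

def translate_speech_to_units(speech_id):
    """UnitY-style two-pass decode: speech -> Mandarin text -> acoustic units."""
    mandarin_text = FIRST_PASS[speech_id]   # pass 1 (text decoder)
    units = SECOND_PASS[mandarin_text]      # pass 2 (unit decoder)
    return mandarin_text, units

text, units = translate_speech_to_units("hokkien_utterance_01")
print(text, units)
```

Conditioning the unit decoder on an intermediate text hypothesis is what lets the related written language (Mandarin) guide translation for an unwritten one (Hokkien).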

For evaluation, automatic speech recognition (ASR) was first used to convert the translated speech into text, and BLEU scores were then computed between the transcribed text and a human reference translation. Once more, the lack of a standard writing system for languages like Hokkien was a barrier. The researchers therefore created a method that converts Hokkien speech into a standardized phonetic notation known as Tâi-lô, making automatic evaluation possible and allowing the translation quality of different approaches to be compared via BLEU.
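To make the scoring step concrete, here is a simplified sentence-level BLEU over Tâi-lô token sequences: the geometric mean of clipped n-gram precisions (n = 1..4) times a brevity penalty. This is a pedagogical sketch without smoothing, not the tooling Meta used in practice, and the Tâi-lô strings are invented examples.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped n-gram
    precisions times a brevity penalty (no smoothing)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())   # clipped matches
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * geo_mean

# Toy Tâi-lô romanizations: ASR output of translated speech vs. reference.
hyp = "lí hó bô lí hó".split()
ref = "lí hó bô lí hó".split()
print(bleu(hyp, ref))  # 1.0 for an exact match
```

In the actual pipeline, both the ASR transcript of the system's Hokkien output and the human reference would first be normalized into Tâi-lô before tokens are compared.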

Meta envisions bringing people together across geographical and linguistic barriers, even in the metaverse, through spoken communication. In its current state, the method enables communication between Hokkien and English speakers, though the model can only translate one complete sentence at a time and remains a work in progress. The researchers acknowledge that a great deal of work is still needed to expand UST to new languages. Nonetheless, speaking fluently with people in any language has long been a goal, and Meta’s work is a step in that direction. They have open-sourced not just the Hokkien translation models but also the evaluation datasets and research papers to encourage further research.

Hokkien direct speech-to-speech translation | SpeechMatrix | Reference Article
