AI is used for a wide range of speech recognition and understanding tasks, from powering smart speakers to designing aids for people who are deaf or have speech impairments. But these speech understanding systems often fail to perform well in the everyday situations where we need them most: when multiple people are speaking simultaneously or when there is a lot of background noise. Even advanced noise-suppression techniques are often ineffective against, say, the sound of the ocean on a beach trip or the background chatter of a bustling street market.
Humans understand speech better than AI in these situations because we use not just our ears but also our eyes. We might watch someone’s mouth move, for example, and intuitively know that the voice we’re hearing must be coming from that person. That is why Meta AI is developing new conversational AI systems that, like us, can discern subtle relationships between what they see and what they hear in conversation.
Audio-Visual Hidden Unit BERT (AV-HuBERT) is a state-of-the-art self-supervised framework for understanding speech that learns by seeing and hearing people talk, with the goal of developing more versatile and robust speech recognition technologies. It is the first system to jointly model speech and lip movements from unlabeled data: untranscribed video. When trained with the same number of transcriptions, AV-HuBERT is 75 percent more accurate than the best audio-visual speech recognition systems, which use both the sound and the image of the speaker to understand what the person is saying.
Notably, this technique addresses a significant barrier to teaching AI to do useful tasks: AV-HuBERT beats the previous best audio-visual speech recognition system using just one-tenth as much labeled data. Because large volumes of labeled data are difficult to gather for most languages, AV-HuBERT’s self-supervised approach will make it possible to build noise-robust automatic speech recognition (ASR) systems for more languages and applications.
AV-HuBERT will bring voice assistants closer to human-level speech comprehension by jointly modeling visible lip movements and spoken words. One day, this technique could let assistants on smartphones and augmented reality (AR) glasses understand what we’re saying whether we’re on a loud factory floor, at a concert, or simply talking while a jet flies overhead.
A multimodal approach to speech recognition
Because today’s speech recognition models receive only audio as input, they must estimate whether one or more people are speaking or whether a sound is simply background noise. AV-HuBERT, by contrast, learns the way people do — multimodally — through a combination of auditory and lip-movement cues. The model was trained on video recordings from the publicly available LRS3 and VoxCeleb data sets.
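As a toy illustration of the multimodal idea (not the paper’s actual architecture), one simple fusion strategy is to concatenate per-frame audio features with per-frame lip-region features before feeding the result to an encoder. The `fuse_frames` helper below is a hypothetical, pure-Python sketch of this:

```python
def fuse_frames(audio_feats, visual_feats):
    """Frame-level fusion of two modalities by concatenation.

    audio_feats and visual_feats are lists of per-frame feature
    vectors, one entry per synchronized video frame. The fused
    sequence is what a multimodal encoder would then consume.
    """
    # The two streams must be time-aligned, frame for frame.
    assert len(audio_feats) == len(visual_feats), "streams must be aligned"
    return [a + v for a, v in zip(audio_feats, visual_feats)]
```

Real systems fuse learned representations rather than raw features, but the key point survives even in this sketch: each time step carries information from both modalities, so the encoder can exploit correlations between them.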
By combining visual cues, such as the movement of the lips and teeth during speech, with auditory representation learning, AV-HuBERT can efficiently capture nuanced correlations between the two input streams, even with much smaller quantities of untranscribed video for pretraining. Once the pretrained model has learned this structure and correlation, only a small amount of labeled data is needed to fine-tune it for a new task.
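The pretraining recipe can be sketched in miniature. The hypothetical `mask_spans` helper below picks random fixed-length spans of frames to hide, HuBERT-style; the model is then trained to predict targets at those masked positions from the surrounding audio-visual context. This is an illustrative pure-Python sketch under assumed parameters, not the actual implementation:

```python
import random

def mask_spans(num_frames, mask_prob=0.3, span_len=5, seed=0):
    """Pick random fixed-length spans of frames to mask.

    Returns the set of masked frame indices. The pretraining loss is
    computed on these masked positions, forcing the model to infer
    hidden-unit targets from the unmasked audio-visual context.
    """
    rng = random.Random(seed)
    masked = set()
    for start in range(num_frames):
        # Dividing by span_len keeps the expected fraction of
        # masked frames close to mask_prob.
        if rng.random() < mask_prob / span_len:
            for i in range(start, min(start + span_len, num_frames)):
                masked.add(i)
    return masked
```

Masking contiguous spans, rather than isolated frames, matters: a single frame can often be filled in from its immediate neighbors, while a whole span forces the model to use longer-range context from both modalities.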
The AV-HuBERT method is illustrated in the animation below. Using a hybrid ResNet-Transformer architecture, it encodes masked audio and image sequences into audio-visual features in order to predict a predetermined sequence of discrete cluster assignments. These target cluster assignments are generated initially from signal-processing-based acoustic features (e.g., Mel-frequency cepstral coefficients, or MFCCs) and then iteratively refined by applying k-means clustering to features learned by the audio-visual encoder.
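The k-means step that produces those discrete targets can be illustrated with a minimal implementation. The `kmeans` function below is a hypothetical, illustrative sketch, not the paper’s code: it assigns each feature vector a cluster ID. In the scheme described above, running this over MFCC frames yields the initial hidden-unit targets, and later rounds re-cluster features taken from the audio-visual encoder itself:

```python
import math

def kmeans(points, k, iters=20):
    """Minimal k-means: map each feature vector to a discrete cluster ID.

    points: list of equal-length numeric tuples (e.g., MFCC frames).
    Returns (assignments, centroids), where assignments[i] is the
    cluster ID of points[i] — the kind of discrete target a
    masked-prediction pretraining task can learn to forecast.
    """
    # Seed centroids by sampling evenly spaced points.
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        for i, p in enumerate(points):
            assign[i] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign, centroids
```

The appeal of this trick is that the targets require no human labels at all: clustering turns continuous features into pseudo-labels, and each refinement round replaces the input features with better ones learned by the model.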
When trained on 430+ hours of labeled data, the prior state-of-the-art AV-ASR system achieves a 25.5 percent word error rate (WER) when the voice and the background noise are equally loud. With the same amount of labeled data, AV-HuBERT achieves a 3.2 percent error rate, roughly one mistake in every 30 words it hears. An audio-only speech recognition model cannot determine which speaker to transcribe when the interference is as loud as the target speech; the audio-visual model, by contrast, learns to transcribe the speech of only the person it sees speaking. In this setting, AV-HuBERT achieves a 2.9 percent WER, whereas an audio-only model without pretraining produces a 37.3 percent WER.
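Word error rate, the metric quoted throughout, is the word-level edit distance between a reference transcript and the system’s hypothesis, divided by the number of reference words. A minimal sketch:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance divided by
    reference length. A 3.2 percent WER means roughly one error per
    31 reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table: d[i][j] is the edit distance between
    # the first i reference words and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("the cat sat on the mat", "the cat sat on a mat")` is 1/6: one substituted word out of six reference words.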
When the system can see but not hear the speaker, the previous state-of-the-art model achieves a 33.6 percent WER on the standard LRS3 benchmark data set after being trained on 31,000 hours of transcribed video. AV-HuBERT outperforms this supervised state of the art with just 30 hours of labeled data, achieving a 28.6 percent WER. Furthermore, with 433 hours of labeled data, it reaches a new state-of-the-art WER of 26.9 percent.
What comes next?
AV-HuBERT will do more than just enable conversational AI systems that can be deployed in challenging conditions. Because it requires far less supervised data for training, it will also open the door to developing conversational AI models for people worldwide who do not speak languages such as English, Mandarin, or Spanish.
Because AV-HuBERT learns from both voice and lip movements, it could also help researchers develop more inclusive speech recognition models for people with speech impairments. By capturing the fine-grained correlations between sounds and lip movements, self-supervised audio-visual representations might additionally be used to detect deepfakes and other content that has been manipulated to deceive viewers. And the same representations could help generate realistic lip movements for virtual reality avatars, enabling a true sense of presence: the feeling of being there with someone even when they are on the other side of the planet.
Paper 1: https://arxiv.org/abs/2201.01763
Paper 2: https://arxiv.org/abs/2201.02184