Even in surroundings full of noisy, overlapping sounds, the human perceptual system relies heavily on visual information to reduce ambiguity in the audio and to focus attention on an active speaker in a dynamic environment.
Researchers at Facebook AI and the University of Texas at Austin have proposed a new audio-visual speech separation approach called VisualVoice, a multi-task learning framework that jointly learns audio-visual speech separation and cross-modal speaker embeddings. It uses a person’s facial appearance to help predict what their voice sounds like.
Automating this speech separation process has many practical applications, such as assistive technology for the hearing impaired or more robust transcription of spoken content in noisy internet videos.
Previous work in automatic speech separation relied entirely on the audio stream. Current work also investigates ways to leverage the audio’s close connection to the visual stream: by analyzing facial motion in concert with the emitted speech, these methods steer the audio separation module toward the parts of the mixed audio that should be separated out. Relying solely on lip movements is not sufficient, however, because lip motion can become unreliable, e.g., when a microphone occludes the mouth region or the speaker turns their head away.
Facial appearance offers a complementary cue. Attributes such as gender, age, nationality, and body weight can provide a prior for sound qualities such as tone, pitch, timbre, and basis of articulation. A framework can use this information to learn what to listen for, and so identify and separate an individual’s speech from a noisy environment more accurately.
The model takes as input a video of a target speaker in an environment with overlapping sounds and generates an isolated soundtrack for that speaker. The network uses facial appearance, lip motion, and vocal audio to perform this separation task, augmenting the conventional “mix-and-separate” model for audio-visual separation with a cross-modal contrastive loss that requires the separated voice to agree with the face. The proposed method requires no identity labels and no speaker enrollment, so it can be trained and tested on unlabelled video, which reduces cost.
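The two training ingredients described above can be illustrated with a small NumPy sketch. It is only an illustration under stated assumptions, not the authors' implementation: the function names, the additive-magnitude mixing, the ratio-mask targets, the hinge form of the loss, and the margin value are all hypothetical choices made for this example, and the embeddings are assumed to come from face and voice encoders that are not shown.

```python
import numpy as np


def mix_and_separate_targets(spec_a, spec_b, eps=1e-8):
    """Build a synthetic training example from two clean clips.

    spec_a, spec_b: magnitude spectrograms of the same shape.
    Assumption: magnitudes are simply added to form the mixture,
    which is an approximation used here for illustration only.
    Returns the mixture and the ratio mask for each speaker.
    """
    mixture = spec_a + spec_b
    mask_a = spec_a / (mixture + eps)
    mask_b = spec_b / (mixture + eps)
    return mixture, mask_a, mask_b


def l2_normalize(x, eps=1e-8):
    """Scale each embedding (last axis) to approximately unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def cross_modal_matching_loss(voice_emb, face_pos, face_neg, margin=0.5):
    """Hinge-style stand-in for a cross-modal contrastive loss.

    voice_emb: embedding of the separated voice, shape (batch, dim).
    face_pos:  embedding of the face that produced that voice.
    face_neg:  embedding of the other face in the mixture.
    The loss is zero once the voice is closer (in cosine distance)
    to its own face than to the other face by at least `margin`.
    """
    v, p, n = (l2_normalize(e) for e in (voice_emb, face_pos, face_neg))
    dist_pos = 1.0 - np.sum(v * p, axis=-1)
    dist_neg = 1.0 - np.sum(v * n, axis=-1)
    return np.maximum(dist_pos - dist_neg + margin, 0.0).mean()
```

In a setup like this, a separator would be trained to predict `mask_a` from the mixture plus the visual features of speaker A, and the matching loss would be added to the separation loss so that the separated voice stays consistent with the face. Because the supervision comes from mixing clips whose clean audio is already known, no identity labels are needed, which is in line with the article's point about training on unlabelled video.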
The proposed approach was evaluated on five benchmark datasets for audio-visual speech separation, speech enhancement, and cross-modal speaker verification. VisualVoice outperformed state-of-the-art methods on all metrics across all datasets in audio-visual speech separation and speech enhancement, including on challenging real-world videos from diverse scenarios. The team states that the model’s embedding also improves on the state of the art for unsupervised cross-modal speaker verification.
The researchers aim to explicitly model the fine-grained cross-modal attributes of faces and voices and leverage them to enhance speech separation.