Facebook AI Presents Contrastive Semi-Supervised Learning (CSL): An AI Approach For Automatic Speech Recognition (ASR) Models


Researchers from Facebook have recently introduced a Contrastive Semi-supervised Learning (CSL) approach that combines pseudo-labeling and contrastive losses to improve the stability of learned speech representations.

In recent years, semi-supervised learning has achieved remarkable results on automatic speech recognition (ASR) tasks. Pseudo-labeling is one of the most widely used methods for training ASR models: a teacher network trained on labeled audio data generates a large amount of pseudo-labeled data, which is later used to train a student network.
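The teacher-student flow can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the nearest-centroid "model", the 1-D data points, and the helper names `fit_centroids` and `predict` are hypothetical stand-ins for real ASR networks and audio.

```python
def fit_centroids(points, labels):
    """Train a toy nearest-centroid classifier: one mean value per label."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))

# Hypothetical data: a small labeled set and a larger unlabeled set.
labeled_x = [0.1, 0.3, 2.9, 3.2]
labeled_y = ["a", "a", "b", "b"]
unlabeled_x = [0.0, 0.4, 0.5, 2.5, 3.0, 3.6]

# Step 1: the teacher is trained on the labeled data only.
teacher = fit_centroids(labeled_x, labeled_y)

# Step 2: the teacher generates pseudo-labels for the unlabeled data.
pseudo_labels = [predict(teacher, x) for x in unlabeled_x]

# Step 3: the student is trained on the pseudo-labeled data.
student = fit_centroids(unlabeled_x, pseudo_labels)

print(pseudo_labels)
print(student)
```

If the teacher's centroids were poor, the pseudo-labels and therefore the student would inherit those errors.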

However, a significant drawback of pseudo-labeling is that the student model's performance suffers if the initial labeled data is not extensive or accurate enough to train a reliable teacher model.

Lately, contrastive losses have achieved remarkable results in computer vision (CV) and speech applications. Contrastive learning is a self-supervised representation learning method that uses two groups of samples (positive and negative) selected for a specific anchor within a pretext task. While the negative samples are selected randomly from a mini-batch or a memory bank, the positive samples are augmented versions of the anchor, nearby frames, or samples from the same speaker. Both positive and negative samples determine the learned representation.
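To make the role of the anchor, positive, and negative samples concrete, here is a minimal sketch of an InfoNCE-style contrastive loss for a single anchor. It is a generic formulation, not necessarily the exact loss used in the paper; the function names, the temperature value, and the 2-D vectors are all illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Negative log of the softmax score given to the positive sample."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_denominator = max_logit + math.log(
        sum(math.exp(x - max_logit) for x in logits)
    )
    return log_denominator - logits[0]

# Hypothetical 2-D representations of audio frames.
anchor = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.2]]

loss_close = info_nce(anchor, [0.9, 0.1], negatives)  # positive near the anchor
loss_far = info_nce(anchor, [0.1, 0.9], negatives)    # positive far from the anchor
print(loss_close < loss_far)
```

The loss is small when the positive lies closer to the anchor than the negatives do, which is why the choice of positives and negatives shapes the learned representation.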

Facebook’s CSL

The CSL approach by Facebook AI researchers addresses the weaknesses of the two approaches above. It uses a supervised teacher to guide the selection of positive and negative samples, bypassing the selection criteria described above. In addition, CSL takes the relative distance between label classes as a learning signal. It therefore tends to be more robust to errors in teacher-generated targets than standard pseudo-labeling methods. CSL pre-training comprises two components:

  1. An encoder that encodes input audio into latent representations
  2. A projection network that maps encoder representations into a new space suitable for applying the contrastive loss
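The two components can be pictured with a small sketch. The layer sizes, the fixed weights, the ReLU non-linearity, and the names `encode` and `project` are assumptions chosen for illustration; a real system would use learned deep networks over acoustic features.

```python
import math

def linear(x, weights):
    """Apply a bias-free linear layer; each row of weights yields one output."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]

def l2_normalize(x):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x] if norm > 0 else x

# Hypothetical fixed weights; in practice these are learned.
ENCODER_W = [
    [0.5, -0.2, 0.1],
    [0.3, 0.8, -0.5],
    [-0.4, 0.1, 0.9],
    [0.2, 0.2, 0.2],
]  # maps 3 input features to 4 latent dimensions
PROJECTION_W = [
    [1.0, 0.0, -1.0, 0.5],
    [0.0, 1.0, 0.5, -1.0],
]  # maps 4 latent dimensions to 2 contrastive dimensions

def encode(frame):
    """Encoder: input audio features -> latent representation."""
    return relu(linear(frame, ENCODER_W))

def project(latent):
    """Projection network: latent -> normalized vector for the contrastive loss."""
    return l2_normalize(linear(latent, PROJECTION_W))

frame = [0.2, 0.7, -0.1]  # hypothetical 3-dimensional acoustic feature frame
latent = encode(frame)
z = project(latent)
print(latent, z)
```

Normalizing the projected vector means the contrastive loss compares directions rather than magnitudes.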

They selected a hybrid-NN supervised teacher to generate pseudo-labels for guiding the selection of positive and negative samples for contrastive loss.
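One way to picture pseudo-label-guided selection is the sketch below: for each anchor in a mini-batch, samples that share its pseudo-label are treated as positives and all other samples as negatives. This follows the general form of a supervised contrastive loss and is an assumption, not the paper's exact formulation; the function name, temperature, and toy embeddings are hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def label_guided_contrastive_loss(embeddings, pseudo_labels, temperature=0.1):
    """Average contrastive loss over anchors that have at least one positive.

    Embeddings are assumed to be L2-normalized already.
    """
    losses = []
    n = len(embeddings)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Positives: other samples carrying the same teacher pseudo-label.
        positives = [j for j in others if pseudo_labels[j] == pseudo_labels[i]]
        if not positives:
            continue
        logits = {j: dot(embeddings[i], embeddings[j]) / temperature for j in others}
        max_logit = max(logits.values())
        log_denominator = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits.values())
        )
        per_positive = [log_denominator - logits[j] for j in positives]
        losses.append(sum(per_positive) / len(per_positive))
    return sum(losses) / len(losses) if losses else 0.0

# Hypothetical unit-length embeddings forming two clusters.
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

loss_matching = label_guided_contrastive_loss(embeddings, [0, 0, 1, 1])
loss_mismatched = label_guided_contrastive_loss(embeddings, [0, 1, 0, 1])
print(loss_matching < loss_mismatched)
```

When the pseudo-labels agree with the geometry of the embeddings the loss is near zero; when they disagree the loss is large, so training pulls same-label samples together.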

After pre-training, the team fine-tuned the models with a supervised objective such as connectionist temporal classification (CTC) loss or frame-level cross-entropy. The work builds upon earlier studies in ASR pre-training that use a contrastive loss in a supervised setup.

Research Contributions

The key contribution of this study is the use of teacher pseudo-labels to select positive and negative samples. Self-supervised pre-training methods are sensitive to the diversity of the positive and negative samples and to the criterion for choosing them, so the proposed CSL approach is more stable by comparison. Additionally, CSL facilitates reliable sampling of positive examples within and across queries in the mini-batch.

CSL imposes a softer constraint on learned representations through the contrastive loss, which improves robustness to noisy teacher pseudo-labels. Because the contrastive loss is applied over normalized representations and emphasizes hard positive and negative examples, the pre-trained models generalize better under out-of-domain conditions.

The researchers demonstrate the model's resilience through experiments on the transcription of social media videos. They used two data sources:

  • De-identified public videos in British English and Italian from Facebook
  • Recordings of crowd-sourced workers responding to artificial prompts on mobile devices

They then used a hybrid-NN ASR system as the pseudo-labeling baseline and applied supervised frame-level cross-entropy (CE) fine-tuning to all pre-trained models.

Source: https://arxiv.org/pdf/2103.05149.pdf

The best pseudo-labeling model (CE-PL) decreased the word error rate (WER) by about 36 percent for British English and 46 percent for Italian. Applying CSL pre-training in a low-resource setup with only 10 hours of labeled data reduced WER by 8% (British English) and 7% (Italian) compared to standard cross-entropy pseudo-labeling (CE-PL). The WER reduction increases to 19% with a teacher trained on only 1 hour of labels, and CSL achieved a 17% WER reduction in out-of-domain conditions.
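For readers unfamiliar with the metric, the sketch below computes WER as the word-level edit distance divided by the number of reference words, and shows how a relative reduction is derived from two WER values. The transcripts and the baseline and improved WER values are hypothetical, and treating the reported percentages as relative reductions is an assumption.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, computed one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def relative_reduction(baseline, improved):
    """Fraction by which the improved value is lower than the baseline."""
    return (baseline - improved) / baseline

# Hypothetical transcripts: one deleted word and one substituted word.
wer = word_error_rate("turn the volume down please", "turn volume down pleas")
print(wer)

# Hypothetical WER values chosen to illustrate an 8% relative reduction.
print(relative_reduction(0.25, 0.23))
```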

Paper: https://arxiv.org/pdf/2103.05149.pdf

