Transfer learning from high-resource languages is an effective way to improve end-to-end automatic speech recognition (ASR) for low-resource languages. However, pre-trained encoder-decoder models do not share language modeling (the decoder) for the target language, which makes this kind of transfer poorly suited to distant target languages. To incorporate further knowledge of the target language and enable transfer from that language, speech-to-text translation (ST) has been introduced as an auxiliary task.
In this method, cross-lingual (high-resource to low-resource) transfer learning for end-to-end ASR is improved by adding ST as an intermediate step. This turns the transfer into a two-step process and improves model performance. At present, the approach is based on an attention-based encoder-decoder architecture. However, the team intends to extend it to other end-to-end architectures, such as CTC and RNN Transducer.
Unlike earlier methods that leverage translation data, this approach does not require any modification to the ASR model architecture. Both the ST model and the target ASR model have the same attention-based encoder-decoder architecture and vocabulary. The high-resource ASR transcripts are translated into the target low-resource language to train the ST model. Instead of using text-to-text translation data, this approach leverages ST data (speech paired with translated text), which avoids speech-to-text modality adaptation in the encoder. It uses only MT pseudo-labels to train ST and does not require high-resource MT training data. The results suggest that training ST with human translations is not essential, as ST trained with machine translation (MT) pseudo-labels yields comparable gains. This overcomes the shortage of real ST data and consistently brings gains to the transfer learning.
Training Speech Translation with Pseudo-Labels
Word-level or sequence-level knowledge distillation (KD) reduces noise and simplifies the data distribution in the training set, which helps to train MT and ST models. Training end-to-end ST models is challenging, as they need to learn acoustic modeling, language modeling, and alignment simultaneously. ST labels are also more expensive to obtain, and the limited size and language coverage of existing ST corpora make training ST models even more difficult. Therefore, pseudo-labeling ASR corpora with MT has been proposed, providing a larger and more diverse dataset to train ST on at little cost. Training ST models with MT pseudo-labels can be viewed as a sequence-level KD process. Although pseudo-labels may be less accurate than real labels, real labels are difficult to learn, whereas pseudo-labels are easier for the model to fit. Experiments also show that ST models trained with pseudo-labels can perform better than those trained with real labels. MT pseudo-labeling also simplifies ST model training and allows several candidate labels to be generated with beam search, which helps reduce overfitting.
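The pseudo-labeling step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `AsrExample` and `StExample` containers, the `pseudo_label` function, and the `toy_translate_nbest` stand-in for an MT system are all hypothetical names introduced here. A real setup would plug in an actual MT model with beam search in place of the toy function.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AsrExample:
    audio_path: str   # path to a source-language utterance
    transcript: str   # human transcript in the source language


@dataclass
class StExample:
    audio_path: str   # same audio as the originating ASR example
    translation: str  # MT pseudo-label in the target language


def pseudo_label(
    corpus: List[AsrExample],
    translate_nbest: Callable[[str, int], List[str]],
    n_best: int = 1,
) -> List[StExample]:
    """Pair each utterance with up to n_best machine translations of its transcript."""
    st_corpus = []
    for ex in corpus:
        # Keeping several beam-search hypotheses per utterance gives the ST
        # model more than one acceptable label for the same audio.
        for hyp in translate_nbest(ex.transcript, n_best)[:n_best]:
            st_corpus.append(StExample(ex.audio_path, hyp))
    return st_corpus


def toy_translate_nbest(text: str, n: int) -> List[str]:
    """Hypothetical stand-in for an MT system: returns n tagged copies of the input."""
    return [f"<hyp{i}> {text}" for i in range(n)]


asr_corpus = [
    AsrExample("a.wav", "hello world"),
    AsrExample("b.wav", "good morning"),
]
st_corpus = pseudo_label(asr_corpus, toy_translate_nbest, n_best=2)
```

With `n_best=2`, each utterance contributes two ST training pairs that share the same audio, so the ST corpus is twice the size of the ASR corpus without any human translation.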
Pre-training ASR on Speech Translation
The target (low-resource) ASR model is pre-trained on source-to-target ST, instead of directly on the (multilingual) source (high-resource) ASR. The ST model is itself pre-trained on source ASR and is trained with MT pseudo-labels generated from the source ASR data. This two-step method helps decouple the transfer of language modeling (decoder) from the transfer of acoustic modeling (encoder), which makes transfer learning easier and more effective. Pre-training ST with ASR warm-starts acoustic modeling so that ST training can focus on learning language modeling and alignment. The ST model also uses additional data (MT pseudo-labels) for the target language, which helps it model the target language better. ASR and ST models use the same model architecture for easy transfer: ASR_Source → ST_Source-Target, followed by ST_Source-Target → ASR_Target.
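The two-step chain can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `EncDecModel`, `init_from`, and `train` are hypothetical, parameters are plain lists of floats, and "training" just increments them. The point is only the order of initialization, which relies on all three models sharing one architecture so that every parameter can be copied across stages.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EncDecModel:
    """Toy stand-in for an attention-based encoder-decoder model."""
    name: str
    params: Dict[str, List[float]] = field(default_factory=lambda: {
        "encoder": [0.0, 0.0],  # acoustic modeling
        "decoder": [0.0, 0.0],  # language modeling and alignment
    })
    history: List[str] = field(default_factory=list)


def init_from(name: str, parent: EncDecModel) -> EncDecModel:
    """Warm-start a new model from all of the parent's parameters."""
    # Deep copy so that later training does not modify the parent model.
    return EncDecModel(name, copy.deepcopy(parent.params), parent.history.copy())


def train(model: EncDecModel, task: str) -> EncDecModel:
    """Placeholder for real training: nudge every parameter and log the task."""
    for values in model.params.values():
        for i in range(len(values)):
            values[i] += 1.0
    model.history.append(task)
    return model


# Step 0: high-resource source ASR.
asr_source = train(EncDecModel("ASR_source"), "source ASR")
# Step 1: ST initialized from source ASR, trained on MT pseudo-labels.
st_model = train(init_from("ST_source_target", asr_source), "source-to-target ST")
# Step 2: low-resource target ASR initialized from the ST model.
asr_target = train(init_from("ASR_target", st_model), "target ASR")
```

After running the chain, `asr_target.history` lists the three stages in order, while `asr_source` keeps its own parameters unchanged because each stage starts from a copy.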