This Paper from NYU and Google Explains How Joint Speech-Text Encoders Overcome Sequence-Length Mismatch in Cross-Modal Representations

It is becoming increasingly apparent that very large models trained on massive unsupervised corpora in a single modality can achieve remarkable results. This has been demonstrated both in the audio domain, where a single model has been shown to adapt to a surprisingly wide array of acoustic tasks, and in the text domain, where language models have attained exceptional zero-shot capabilities. These achievements have prompted research into how to apply similar techniques to settings that combine two modalities, which have traditionally relied on manually paired data.

One interesting approach is to train a large encoder on both modalities so that either one can be presented as an unpaired example, with the encoder learning to map the two to similar places in representation space. In the image/text domain, such a representation has been shown to be feasible and to reach state-of-the-art performance on numerous image and text comprehension tasks with a single model.

New research by New York University and Google investigates whether the performance gains found with explicit alignments can also be achieved by applying consistency regularization to the implicit alignments learned in upsampling systems. The researchers do this by developing a method, motivated by dynamic time warping, that optimally aligns the encoder’s representations of a speech example and a text example. In the absence of an explicit alignment model, the team demonstrates that the optimal alignment is not just acquired during training but also improves as one progresses through the network’s layers.
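To make the dynamic time warping idea concrete, here is a minimal NumPy sketch of a textbook DTW alignment between a long sequence of speech frames and a shorter sequence of text embeddings. It is an illustration under stated assumptions, not the paper's implementation: the function name `dtw_align`, the squared Euclidean distance, and the three allowed moves (advance speech, advance text, or both) are all choices made for this example.

```python
import numpy as np

def dtw_align(speech, text):
    """Textbook DTW between speech frames (T, D) and text embeddings (N, D).

    Returns the minimum accumulated squared distance over monotonic
    alignments and the aligned (speech_index, text_index) pairs.
    Hypothetical helper for illustration only.
    """
    T, N = speech.shape[0], text.shape[0]
    # Pairwise squared Euclidean distances, shape (T, N).
    dist = ((speech[:, None, :] - text[None, :, :]) ** 2).sum(axis=-1)

    # acc[t, n] = best cost of aligning the first t frames with the first n tokens.
    acc = np.full((T + 1, N + 1), np.inf)
    acc[0, 0] = 0.0
    for t in range(1, T + 1):
        for n in range(1, N + 1):
            acc[t, n] = dist[t - 1, n - 1] + min(
                acc[t - 1, n - 1],  # advance both sequences
                acc[t - 1, n],      # advance speech only (token spans several frames)
                acc[t, n - 1],      # advance text only
            )

    # Backtrack from the end to recover one optimal alignment path.
    path = []
    t, n = T, N
    while t > 0 and n > 0:
        path.append((t - 1, n - 1))
        moves = [
            (acc[t - 1, n - 1], t - 1, n - 1),
            (acc[t - 1, n], t - 1, n),
            (acc[t, n - 1], t, n - 1),
        ]
        _, t, n = min(moves, key=lambda m: m[0])
    path.reverse()
    return float(acc[T, N]), path

# Toy example: each "text" embedding is held for a different number of frames.
text = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
speech = np.repeat(text, [3, 1, 2], axis=0)  # 6 frames for 3 tokens
cost, path = dtw_align(speech, text)
```

In this toy case the speech frames are exact copies of the text embeddings with uneven durations, so the best alignment recovers those durations and the accumulated cost is zero. In a training setup, a cost of this kind could serve as a consistency penalty that is minimized over alignments rather than computed at fixed positions.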

To facilitate pretraining on unpaired speech and text data, there has been a recent trend in speech recognition toward models with a joint speech and text encoder. Speech recognition poses a unique difficulty here because it involves two sequence modalities, and speech is represented by a much longer sequence than the corresponding text. As a result, comparing an encoder’s speech representation to its text representation frame by frame is difficult, even though both modalities are represented in the same embedding space.
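The length mismatch can be illustrated with a small NumPy sketch. It assumes a deliberately naive scheme in which each text embedding is repeated a near-equal number of times to match the number of speech frames, and the two sequences are then compared position by position. The helper names `upsample_by_repetition` and `framewise_mse` are hypothetical, and uniform repetition is an assumption for this example rather than the upsampler used in the paper.

```python
import numpy as np

def upsample_by_repetition(text, num_frames):
    """Stretch N text embeddings to num_frames rows by near-uniform repetition."""
    n = text.shape[0]
    # Map each frame index to a token index, spreading tokens evenly.
    idx = np.arange(num_frames) * n // num_frames
    return text[idx]

def framewise_mse(speech, text):
    """Mean squared distance between speech frames and uniformly upsampled text."""
    upsampled = upsample_by_repetition(text, speech.shape[0])
    return float(((speech - upsampled) ** 2).sum(axis=-1).mean())

text = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

# Speech whose tokens last exactly two frames each matches the uniform guess.
even_speech = np.repeat(text, [2, 2, 2], axis=0)
even_loss = framewise_mse(even_speech, text)

# Speech with uneven token durations is penalized even though its content is identical.
uneven_speech = np.repeat(text, [3, 1, 2], axis=0)
uneven_loss = framewise_mse(uneven_speech, text)
```

Here `even_loss` is zero while `uneven_loss` is positive, although both speech sequences carry exactly the same content as the text. The penalty comes entirely from timing, which is why a direct frame-wise comparison is a poor fit when token durations vary.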

Finally, the work demonstrates that, in both monolingual and multilingual settings, significant word error rate (WER) improvements can be achieved over strong semi-supervised baselines without any learned alignment model. The key change is to modify the consistency regularization criterion so that it encourages consistency under some alignment rather than under a direct frame-wise comparison. Based on their findings, it appears that tolerating misalignment is all that’s needed to enforce consistency in cross-modal representations.


Check out the Paper. All credit for this research goes to the researchers on this project.

Dhanshree Shenwai is a Computer Science Engineer with experience in FinTech companies covering the Financial, Cards & Payments, and Banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone's life easier in today’s evolving world.
