Many AI research projects have been striving to build systems that detect and interpret speech merely by listening and engaging with others, much like babies learn their first language. This requires assessing not just what someone says but also a variety of cues in how those words are delivered, such as speaker identity, emotion, hesitation, and interruptions. Furthermore, the AI system must recognize and interpret sounds that overlap with the speech signal, such as laughter, coughing, passing vehicles, or birdsong, to comprehend a situation as fully as a person would.
Facebook AI is therefore releasing HuBERT, a new approach to learning self-supervised speech representations, to help model this kind of rich lexical and non-lexical information in audio. HuBERT matches or outperforms state-of-the-art techniques for speech recognition, generation, and compression.
The model learns the structure of spoken input by predicting the correct cluster for masked audio segments, with the clusters produced by an offline k-means step. By alternating between clustering and prediction, HuBERT refines its learned discrete representations over time; the procedure is simple and stable. HuBERT's learned representations are also of high quality, making them easy to integrate into a variety of downstream speech applications.
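The offline clustering step can be sketched as follows. This is an illustrative, simplified example (not Facebook AI's code): plain k-means is run over frame-level feature vectors, and each frame's cluster index becomes its discrete prediction target. In practice the first iteration clusters MFCC features and later iterations cluster the model's own hidden states; here the features are random toy data.

```python
import numpy as np

def kmeans_labels(features, k=4, iters=20, seed=0):
    """Assign each feature frame to one of k clusters (plain Lloyd's k-means)."""
    rng = np.random.default_rng(seed)
    # initialize centroids from k randomly chosen frames
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # distance of every frame to every centroid, then nearest-centroid assignment
        d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = features[labels == c].mean(axis=0)
    return labels

# toy "frame features": 100 frames of 13-dim vectors (stand-in for MFCCs)
rng = np.random.default_rng(1)
frames = rng.normal(size=(100, 13))
targets = kmeans_labels(frames, k=4)
print(targets.shape)  # one discrete cluster target per frame
```

Each refinement round would re-run this clustering on features taken from an intermediate layer of the partially trained model, yielding progressively better targets.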
How HuBERT Works
HuBERT is inspired by Facebook AI's DeepCluster method for self-supervised visual learning. To capture the sequential structure of speech, it applies a masked prediction loss over sequences, similar to Google's BERT (Bidirectional Encoder Representations from Transformers), and uses an offline clustering step to produce noisy labels for masked-language-model-style pretraining. Concretely, HuBERT consumes masked continuous speech features and predicts predetermined cluster assignments. The predictive loss is applied only over the masked regions, forcing the model to learn good high-level representations of the unmasked inputs in order to infer the targets of the masked ones.
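The key property of the loss — it is computed only over masked frames — can be shown with a minimal numpy sketch. This is an assumption-laden illustration, not the actual HuBERT implementation: `logits` stands in for the model's per-frame cluster scores, and `targets` for the k-means cluster indices.

```python
import numpy as np

def masked_prediction_loss(logits, targets, mask):
    """Mean cross-entropy over masked frames only.

    logits:  (T, K) unnormalized per-frame scores over K clusters
    targets: (T,)   k-means cluster index for each frame
    mask:    (T,)   True where the input frame was masked
    """
    # numerically stable log-softmax over the K cluster classes
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    per_frame = -log_probs[np.arange(len(targets)), targets]
    return per_frame[mask].mean()  # only masked frames contribute

T, K = 8, 4
rng = np.random.default_rng(0)
logits = rng.normal(size=(T, K))
targets = rng.integers(0, K, size=T)
mask = np.array([True, True, False, False, True, False, False, False])
loss = masked_prediction_loss(logits, targets, mask)
```

Because unmasked frames contribute nothing to the loss, the model cannot score well by merely copying local acoustic detail; it has to use the surrounding unmasked context to infer what the masked targets should be.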
HuBERT learns both acoustic and language models from continuous inputs. First, the model must encode unmasked audio inputs into meaningful continuous latent representations, which corresponds to the classical acoustic modeling problem. Second, the model must capture the long-range temporal relations between the learned representations in order to reduce prediction error.
One important insight motivating this work is that the consistency of the k-means mapping from audio inputs to discrete targets matters as much as its correctness: as long as similar inputs map to the same target, the model can focus on modeling the sequential structure of the input data.
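A small, hedged illustration of this point: the cluster IDs themselves are arbitrary, so permuting them leaves the partition of frames (which frames share a target) unchanged, and the model would learn the same sequential structure. The helper below is hypothetical and only checks whether two label sequences group the frames identically.

```python
import numpy as np

def same_partition(a, b):
    """True if label sequences a and b group the frames identically,
    i.e. b is a consistent one-to-one relabeling of a."""
    pairs = {tuple(x) for x in zip(a, b)}
    # consistent relabeling: each ID in `a` pairs with exactly one ID in `b`, and vice versa
    return len({p[0] for p in pairs}) == len(pairs) == len({p[1] for p in pairs})

labels = np.array([0, 0, 1, 2, 1, 0])
perm = {0: 2, 1: 0, 2: 1}                    # arbitrary relabeling of cluster IDs
relabeled = np.array([perm[x] for x in labels])

print(same_partition(labels, relabeled))                      # same grouping, different IDs
print(same_partition(labels, np.array([0, 1, 1, 2, 1, 0])))   # genuinely different grouping
```

In other words, "noisy but consistent" cluster targets are still useful pretraining signal, even when the IDs do not correspond to any true phonetic category.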
HuBERT is pretrained on standard datasets: the 960 hours of LibriSpeech and the 60,000 hours of Libri-Light. It matches or improves upon state-of-the-art wav2vec 2.0 performance on all fine-tuning subsets of 10 minutes, 1 hour, 10 hours, 100 hours, and 960 hours.
A significant benefit of speech representation learning is that it allows direct language modeling of speech signals without any lexical resources (no supervised labels, text corpora, or lexicons). This makes it possible to model non-lexical information, such as a dramatic pause, an urgent interruption, or background noise.
Facebook AI has taken the first steps toward synthesizing speech from learned speech representations (from CPC, wav2vec 2.0, and HuBERT) in its Generative Spoken Language Modeling (GSLM) work.
HuBERT can help AI researchers develop NLP systems that are trained exclusively on audio rather than text. This would bring the full expressiveness of spontaneous oral language to existing NLP systems, allowing an AI voice assistant to speak with the nuance and affect of a real person. Learning speech representations efficiently, without depending on large labeled datasets, is therefore essential for the AI community to build more inclusive applications that span dialects and languages that exist only in spoken form.