In Recent Speech Processing Research, Meta AI Researchers Explain Their Study on Similarities Between Deep Learning Models and the Human Brain

This article is written as a summary by Marktechpost Staff based on the research paper 'Toward a realistic model of speech processing in the brain with self-supervised learning'. All credit for this research goes to the researchers on this project. Check out the paper.

Over the last decade, the performance of deep neural networks has skyrocketed. Algorithms for object categorization, text translation, and speech recognition are approaching human-level performance. Their internal representations have repeatedly been shown to align with those of the brain, suggesting that they converge toward brain-like computations.

However, this convergence should not hide the significant differences that remain between these deep learning models and the brain. The comparisons made so far rely on models trained with (1) massive amounts of data, (2) supervised labels that are rare in human experience, (3) textual rather than raw sensory input, and/or (4) very large memory. These disparities underscore the need for architectures and learning objectives that can account for both behavior and brain responses under these four constraints.

Researchers at Meta AI proposed in a recent study that the latest self-supervised architectures trained on raw sensory input are promising candidates. The team focused on Wav2Vec 2.0, an architecture that stacks convolutional and transformer layers to predict a quantized version of the latent representations of the speech waveform. Wav2Vec 2.0 was trained on 600 hours of speech, which is roughly comparable to what infants are exposed to during the early stages of language acquisition.
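
To make the idea of "predicting the quantized latent" more concrete, here is a minimal sketch of a contrastive objective of the kind used in the Wav2Vec 2.0 literature: the model's context vector for a masked time step should be more similar to the true quantized target than to a set of distractors. This is an illustration only and not the researchers' training code. The function names, the toy vectors, and the temperature value are assumptions made for this example.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of picking the true target among candidates.

    context:     model output at a masked time step (hypothetical toy vector)
    target:      true quantized latent for that step
    distractors: quantized latents taken from other time steps
    temperature: illustrative value, assumed for this sketch
    """
    candidates = [target] + list(distractors)
    logits = np.array([cosine(context, q) for q in candidates]) / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[0])  # index 0 is the true target

# Toy, hand-made vectors (not real speech features).
target = np.array([1.0, 0.0, 0.0, 0.0])
distractors = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
good_context = np.array([0.9, 0.1, 0.0, 0.1])  # close to the true target
poor_context = np.array([0.1, 0.9, 0.1, 0.0])  # close to a distractor

loss_good = contrastive_loss(good_context, target, distractors)
loss_poor = contrastive_loss(poor_context, target, distractors)
print(loss_good, loss_poor)
```

A context vector that points toward the true target gets a loss near zero, while one that points toward a distractor is penalized heavily. No labels are involved, which is what makes the objective self-supervised.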

The researchers compared this model to the brain activity of 412 healthy individuals (351 English speakers, 28 French speakers, and 33 Mandarin speakers), recorded with functional magnetic resonance imaging (fMRI) while they listened to audiobooks in their native language for around an hour.


The researchers compared brain activity to each layer of the Wav2Vec 2.0 model, as well as to several variants: a random (untrained) Wav2Vec 2.0 model, a model trained on 600 h of non-speech sounds, a model trained on 600 h of non-native speech, a model trained on 600 h of native speech, and a model trained directly on speech-to-text in the participants’ native language.
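
A common way to compare a network layer with brain recordings is to fit a linear map from the layer's activations to each voxel's response and then measure how well that map predicts held-out data. The sketch below illustrates the idea with ridge regression and Pearson correlation on synthetic data. It is not the authors' pipeline: the function names, the regularization strength, the train/test split, and the random data are all assumptions for illustration, and real fMRI analyses also need steps omitted here, such as accounting for the hemodynamic delay.

```python
import numpy as np

def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)

def pearson_per_column(A, B):
    """Pearson correlation between matching columns of A and B."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    denom = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0)
    return (A * B).sum(axis=0) / denom

def brain_score(activations, voxels, alpha=1.0, train_fraction=0.8):
    """Fit on the first part of the recording, score on the held-out rest.

    Returns one correlation per voxel.
    """
    split = int(len(activations) * train_fraction)
    W = ridge_fit(activations[:split], voxels[:split], alpha)
    predicted = activations[split:] @ W
    return pearson_per_column(predicted, voxels[split:])

# Synthetic stand-ins for one layer's activations and for voxel responses.
rng = np.random.default_rng(0)
n_samples, n_features, n_voxels = 500, 16, 5
layer_activations = rng.standard_normal((n_samples, n_features))
true_map = rng.standard_normal((n_features, n_voxels))
noise = 0.5 * rng.standard_normal((n_samples, n_voxels))
voxel_responses = layer_activations @ true_map + noise

# A control "layer" that carries no information about the voxels.
unrelated_activations = rng.standard_normal((n_samples, n_features))

related_scores = brain_score(layer_activations, voxel_responses)
unrelated_scores = brain_score(unrelated_activations, voxel_responses)
print(related_scores.mean(), unrelated_scores.mean())
```

Repeating this scoring for every layer and every model variant (untrained, non-speech, non-native, native) gives the kind of layer-by-layer and variant-by-variant comparison described above.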

The experiments yielded four main findings. First, Wav2Vec 2.0 uses self-supervised learning to acquire latent representations of the speech waveform that are similar to those seen in the human brain. Second, the functional hierarchy of the transformer layers aligns with the cortical hierarchy of speech processing in the brain, revealing the whole-brain organization of speech processing in unprecedented detail. Third, the model learns sound-generic, speech-specific, and language-specific representations that converge with those of the human brain. Fourth, behavioral comparisons with the results of a speech sound discrimination task performed by 386 additional participants support this shared language specialization.


Human infants learn to communicate with little to no supervision. Young brains need only a few hundred hours of speech to learn to put words together in the language(s) of their social group. Researchers from Meta AI recently examined whether self-supervised learning on a limited amount of speech is enough to produce a model that is functionally equivalent to speech perception in the human brain. They used three curated datasets of French, English, and Mandarin to train multiple variants of Wav2Vec 2.0, then compared the models’ activations to the brain activity of a large population of French, English, and Mandarin speakers recorded with fMRI while passively listening to audio stories. The findings show that this self-supervised model learns (1) representations that map linearly onto a remarkably distributed set of brain regions, (2) a hierarchy that corresponds to the cortical hierarchy, and (3) language-specific features.

Nitish is a computer science undergraduate with a keen interest in the field of deep learning. He has done various projects related to deep learning and closely follows new advances in the field.