Meta AI Introduces ‘Data2vec’ : A Self-Supervised Algorithm That Works For Speech, Computer Vision, and NLP

Many critical recent developments in AI have been enabled by self-supervised learning. Machines learn by directly observing the world rather than being explicitly instructed through labeled images, text, audio, and other data sources.

However, whereas people appear to learn in similar ways regardless of how they obtain information — whether, through sight or hearing, self-supervised learning algorithms currently learn from images, speech, text, and other modalities in quite different ways. 

This disparity has been a critical impediment to deploying breakthroughs in self-supervised learning more generally.

It’s challenging to push multiple modalities forward simultaneously since a great algorithm built for, say, comprehending photos can’t be easily applied to another modality, such as writing. 

This is why Meta AI created data2vec, the first high-performance self-supervised algorithm that works across various modalities. Data2vec exceeded the previous best single-purpose algorithms for computer vision and voice, and it is competitive on NLP tasks when applied independently to audio, pictures, and text.


It also symbolizes a new paradigm of holistic self-supervised learning, in which further research enhances several rather than just one modality.

It also doesn’t use contrastive learning or reconstructed input examples.

In addition to accelerating AI advancement, data2vec moves us closer to creating machines that can learn about different parts of the environment around them in real-time. 

It will create more adaptive AI, which will perform jobs beyond today’s systems’ capabilities.

What is data2vec, and how does it work?

Much of AI is still focused on supervised learning, which only works with data that has been tagged.

While academics have put in a lot of effort to create large-scale labeled data sets for English speech and text, this is not possible for the dozens of languages spoken on the planet. 

Self-supervision allows computers to learn about the environment by looking at it and determining the structure of images, voice, or text.

It’s simply more scalable to have machines that don’t need to be explicitly taught to identify photos or understand spoken language. 

Today’s self-supervised learning research almost typically focuses on a single modality. As a result, researchers specializing in one modality often adopt a totally different strategy than those specializing in another.

Researchers train algorithms to fill in blanks in sentences in the case of the text. On the other hand, speech models must learn an inventory of essential speech sounds, like forecasting missing sounds. In computer vision, models are frequently taught to assign comparable representations to a color image of a cow, and the same image flipped upside down, allowing them to correlate the two far more closely than they would with an unrelated image like a duck. 

For each modality, algorithms anticipate distinct units: pixels or visual tokens for images, words for the text, and learned sound inventories for voice. Because a collection of pixels differs significantly from an audio waveform or a passage of text, algorithm creation has been related to a particular modality. This means that algorithms in each modality continue to work differently. 

Data2vec makes this easier by teaching models to anticipate their own representations of the incoming data, regardless of mode. Instead of predicting visual tokens, phrases, or sounds, a single algorithm may work with completely different sorts of input by focusing on these representations — the layers of a neural network.

This eliminates the learning task’s reliance on modality-specific targets.

It was necessary to define a robust normalization of the features for the job that would be trustworthy in different modalities to directly predict representations. 

The method starts by computing target representations from an image, a piece of text, or a voice utterance using a teacher network.

After that, a portion of the input was masked and repeated with a student network, which predicts the teacher’s latent representations. Even though it only has a partial view of the data, the student model must predict accurate input data.

The instructor network is identical to the student network, except with somewhat out-of-date weights. 

The method was tested on the primary ImageNet computer vision benchmark, and it outperformed existing processes for a variety of model sizes. It surpassed wav2vec 2.0 and HuBERT, two previous Meta AI self-supervised voice algorithms. It was put through its paces on the popular GLUE benchmark suite for text, and it came out on par with RoBERTa, a reimplementation of BERT. 

In the direction of machines that learn by seeing the world around them

While self-supervised learning has made significant progress in computer vision, movies, and other individual modalities, the primary idea behind this technique is to learn more broadly:

AI should learn to perform various tasks, including entirely new ones.

A machine that can recognize animals from its training data and adapt to recognize new creatures if we describe their appearance.

Data2vec shows that the same self-supervised algorithm may perform well in various modalities, often outperforming the best-known algorithms. 

This paves the path for more widespread self-supervised learning, bringing us closer to a day where AI can learn about complex subjects like soccer or multiple ways to bake bread using movies, articles, and audio recordings.

Data2vec will help us get closer to a world where computers just need a small amount of labeled data to complete jobs. Data2vec is a crucial step toward more generic AI because it is difficult, if not impossible, to collect annotated samples — for example, to train speech recognition models for thousands of languages.