Audio-visual (AV) learning refers to training models on content that contains both sound and visual information. The natural correspondence between visual observations and their accompanying sounds provides a strong self-supervision signal for learning video representations, which is why the massive volume of online video has become a valuable resource for self-supervised learning in the research community.
However, online videos frequently provide imperfectly aligned audio-visual signals, for example due to overdubbed audio, and models trained on such uncurated videos have been shown to learn poorer representations as a result of this misalignment. Existing techniques therefore typically rely on manually curated datasets built around a predetermined taxonomy of semantic concepts, where audio-visual correspondence is highly likely.
To close this gap, researchers from Seoul National University, NVIDIA, and Microsoft have released an automatic dataset curation pipeline and a large video dataset for self-supervised audio-visual learning, termed ACAV100M (Automatically Curated Audio-Visual dataset). The dataset is built from a massive pool of uncurated web videos: starting from 140 million full-length videos, the researchers selected 100 million segments with the best audio-visual correspondence.
The researchers formulated data collection as a constrained optimization problem: find the subset of videos that maximizes the total mutual information between their audio and visual channels. Mutual information (MI) measures how much knowing one variable reduces uncertainty about the other, so a subset with maximal MI is likely to contain many videos with genuine audio-visual correspondence, making it a good dataset for self-supervised learning.
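For intuition, here is a toy illustration (not the authors' code) of MI between two discrete signals, computed from their joint counts. When the visual labels track the audio labels, knowing one determines the other and MI is high; when the visual channel carries no information about the audio (e.g., every clip overdubbed with the same track), MI is zero:

```python
from collections import Counter
from math import log

def mutual_info(xs, ys):
    """MI (in nats) between two equally long label sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(c / n * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

audio           = [0, 0, 1, 1, 2, 2]
visual_aligned  = [0, 0, 1, 1, 2, 2]   # visuals match the audio exactly
visual_constant = [0, 0, 0, 0, 0, 0]   # same overdub-like label on every clip

mutual_info(audio, visual_aligned)   # = log 3 ≈ 1.10 nats: audio determines visual
mutual_info(audio, visual_constant)  # = 0.0: knowing the audio tells us nothing
```

A subset dominated by the first kind of video scores high under this objective, which is exactly what the curation procedure is optimizing for.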
In principle, one could estimate audio-visual MI for each video separately and select the videos with the highest scores. However, such instance-level estimates are fragile in the presence of real-world noise. Rather than measuring MI at the instance level, the team therefore uses a set-based MI estimate, which quantifies the information shared by two clustering assignments of the dataset.
The researchers cluster the videos according to their audio and visual signals and then compute MI from a contingency table that encodes the overlap between the audio and visual clusters. The findings show that this set-based approach is more robust to real-world noise, allowing it to produce datasets with better audio-visual correspondence than instance-based MI estimates.
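The contingency-table step can be sketched as follows. This is a hypothetical, simplified version of the selection criterion, not the authors' implementation: for each candidate subset, count how its audio clusters overlap with its visual clusters, compute MI from that table, and prefer the subset with the higher score:

```python
from math import log

def mi_from_table(table):
    """MI (in nats) of the joint distribution encoded by a contingency table
    whose entry [i][j] counts videos in audio cluster i and visual cluster j."""
    n = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    return sum(c / n * log(c * n / (row_sums[i] * col_sums[j]))
               for i, row in enumerate(table)
               for j, c in enumerate(row) if c > 0)

def contingency(audio_labels, visual_labels, k):
    """Build the k-by-k audio/visual cluster overlap table for a subset."""
    table = [[0] * k for _ in range(k)]
    for a, v in zip(audio_labels, visual_labels):
        table[a][v] += 1
    return table

# Two hypothetical candidate subsets with k = 3 clusters per modality: in the
# first, visual clusters track audio clusters; in the second they do not.
corresponding = contingency([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2], 3)
overdubbed    = contingency([0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1], 3)

# The pipeline keeps the subset whose contingency table yields the higher MI.
best = max([corresponding, overdubbed], key=mi_from_table)
```

Because the score is computed over the whole set at once, a few noisy videos perturb the table only slightly, which is one way to see why the set-based estimate is more robust than scoring each video in isolation.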
To assess its efficiency for self-supervised audio-visual learning, the researchers compare their dataset against video datasets widely used in self-supervised learning. They use contrastive learning to pre-train identical models on the various datasets, then run linear evaluation on conventional benchmarks. According to the findings, models pre-trained on the automatically curated dataset outperform models pre-trained on existing datasets that require human annotation or manual verification. This demonstrates that the datasets generated by the proposed method provide the audio-visual correspondence required by self-supervised methods.
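The article does not spell out the contrastive objective; a common instantiation for audio-visual pre-training is an InfoNCE-style loss, in which each audio clip should be more similar to its own video's visual embedding than to the other clips in the batch. A minimal sketch, assuming unit-normalized embeddings (function and parameter names are illustrative, not from the paper):

```python
from math import exp, log

def info_nce(audio_embs, visual_embs, temperature=0.1):
    """InfoNCE-style loss: each audio embedding is scored against every visual
    embedding in the batch, with cross-entropy against its matched pair."""
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))
    loss = 0.0
    for i, a in enumerate(audio_embs):
        sims = [dot(a, v) / temperature for v in visual_embs]
        # log-sum-exp over all candidates minus the matched (diagonal) similarity
        loss += log(sum(exp(s) for s in sims)) - sims[i]
    return loss / len(audio_embs)

# Toy unit embeddings: when matched pairs agree the loss is near zero,
# and swapping the pairs drives it up.
matched = [[1.0, 0.0], [0.0, 1.0]]
info_nce(matched, matched)                  # small: pairs agree
info_nce(matched, list(reversed(matched)))  # large: pairs are swapped
```

Minimizing a loss of this shape is what makes audio-visual correspondence in the training data matter: misaligned pairs supply contradictory "positives", which is consistent with the reported gains from automatic curation.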