This research summary is based on the paper 'MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound'.
We humans navigate the environment using all of our senses. Researchers at the Allen Institute propose MERLOT Reserve, a model that learns to represent videos over time and across several modalities, including audio, subtitles, and video frames. It was trained with a new learning objective on more than 20 million YouTube videos.
MERLOT Reserve is a new, cutting-edge approach to answering questions about videos. Given a video and a question, it reliably selects the correct answer from a set of multiple-choice options, making its prediction by jointly reasoning over the video's visual frames, its subtitles, and its audio.
MERLOT Reserve raises the bar on the popular visual question answering benchmarks VCR and TVQA, and it can handle a wide range of video comprehension tasks without any human-labeled data. The paper on this work has been accepted for presentation at CVPR '22.
Empirical findings show that the model learns strong video representations from all of its component modalities. When finetuned, it achieves a new state of the art on both VCR and TVQA, exceeding previous efforts by 5% and 7%, respectively. Audio pretraining benefits both tasks, even VCR, a task centered around images with no provided sound.
The learning objective also transfers to zero-shot settings. The model achieves competitive results on four video comprehension tasks and outperforms supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark. Further analysis and discussion of these findings and their implications can be found in the paper.
MERLOT, the predecessor model, learns multimodal neural script knowledge from video frames and subtitles; this work adds audio to the mix. MERLOT Reserve decodes a video by encoding each modality separately first, then jointly. Each video frame is encoded by a Vision Transformer, audio by an Audio Spectrogram Transformer, and subtitles by a word-embedding table. A Joint Transformer Encoder then fuses all modalities together and across time.
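The encode-separately-then-jointly pipeline can be sketched as follows. This is a minimal illustration, not the paper's implementation: the encoder functions below are hypothetical stand-ins (random vectors in place of the real Vision Transformer, Audio Spectrogram Transformer, and word-embedding lookup), and only the shape of the joint input sequence is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32  # shared hidden size (illustrative)

def encode_frame(frame):
    # Stand-in for the Vision Transformer.
    return rng.normal(size=D)

def encode_audio(spectrogram):
    # Stand-in for the Audio Spectrogram Transformer.
    return rng.normal(size=D)

def encode_subtitle(words):
    # Stand-in for the word-embedding table.
    return rng.normal(size=D)

def joint_sequence(segments):
    """Build the joint encoder's input: one visual token and one
    language/audio token per time segment, concatenated across time.
    The real model feeds this sequence to a Joint Transformer Encoder."""
    tokens = []
    for seg in segments:
        tokens.append(encode_frame(seg["frame"]))
        if seg.get("audio") is not None:
            tokens.append(encode_audio(seg["audio"]))
        else:
            tokens.append(encode_subtitle(seg["subtitle"]))
    return np.stack(tokens)

# Two time segments: one with subtitles, one with audio.
video = [{"frame": "f0", "audio": None, "subtitle": ["hello"]},
         {"frame": "f1", "audio": "spectrogram", "subtitle": None}]
seq = joint_sequence(video)  # one visual + one language/audio token per segment
```

Because audio and subtitles occupy the same slot in the sequence, the same architecture handles videos with or without subtitles.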
This design lets the model handle video tasks (with or without subtitles) and image-based tasks like VCR at the same time. The model is trained with a new contrastive objective: a region of the video, with its frames, text, and audio aligned in time, is MASKed out, and the model must maximize the similarity between its prediction for the MASKed region and an independent encoding of that region's text and audio. As a result, the new objective enables both learning to fuse audio and learning from audio.
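The contrastive objective can be sketched as an InfoNCE-style loss. This is a simplified illustration under stated assumptions: the vectors are random stand-ins for the joint encoder's output at the MASK position and for the independent target/distractor encodings, and the temperature value is made up, not taken from the paper.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def contrastive_loss(masked_repr, target_repr, distractors, temperature=0.05):
    """InfoNCE-style loss: the joint encoder's output at the MASKed
    position should be more similar to the independent encoding of the
    true text/audio snippet than to encodings of other snippets."""
    candidates = [target_repr] + distractors
    logits = np.array([cosine(masked_repr, c) for c in candidates]) / temperature
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    return -log_probs[0]  # the true target sits at index 0

rng = np.random.default_rng(0)
d = 16
target = rng.normal(size=d)
# Pretend the joint encoder nearly recovers the masked region's encoding.
masked = target + 0.1 * rng.normal(size=d)
distractors = [rng.normal(size=d) for _ in range(7)]
loss = contrastive_loss(masked, target, distractors)  # small when prediction matches target
```

In practice the distractors come from other regions in the batch, so the model is pushed to distinguish the masked snippet from many alternatives at once.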
When finetuned, MERLOT Reserve achieves cutting-edge VCR and TVQA performance for its model size. On TVQA, it gains an additional performance boost from audio, which no previous work has been able to capture.
Beyond SOTA finetuning, the model can be used for several zero-shot video comprehension tasks. A question such as "What is the person doing?" is rewritten as the statement "The person is MASK." The model then predicts the answer from a set of given alternatives (e.g., "cooking popcorn," "eating popcorn"). The rewriting can be done manually or automatically with a language model.
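The zero-shot recipe above can be sketched in a few lines. This is a hypothetical illustration, not the paper's code: `rewrite` hard-codes the example template, and random vectors stand in for the model's MASK-position output and for the independent label encodings; only the scoring idea is shown.

```python
import numpy as np

rng = np.random.default_rng(1)
LABELS = ["cooking popcorn", "eating popcorn", "watching a movie"]
# Stand-ins for independent encodings of each candidate answer.
label_vecs = {lab: rng.normal(size=8) for lab in LABELS}

def rewrite(question):
    # "What is the person doing?" -> "The person is MASK."
    # (done by hand, or automatically with a language model)
    return "The person is MASK."

def answer(statement_with_mask, candidates, mask_repr):
    """Score each candidate by the similarity between the model's
    output at the MASK position and the candidate's encoding."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    scores = {c: cos(mask_repr, label_vecs[c]) for c in candidates}
    return max(scores, key=scores.get)

# Pretend the model's MASK representation is close to "cooking popcorn".
mask_repr = label_vecs["cooking popcorn"] + 0.05 * rng.normal(size=8)
best = answer(rewrite("What is the person doing?"), LABELS, mask_repr)
```

Because no task-specific head is trained, any multiple-choice question that fits such a template can be answered this way.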
MERLOT Reserve was trained on YT-Temporal-1B, building on MERLOT's success at training on videos at scale across a variety of themes, such as documentaries, how-to videos, and vlogs. A dataset of 20 million videos covering over a billion frames was collected for this study, making its size comparable to similar efforts in the image-only domain (such as JFT-3B).
The GitHub repository includes several data-preparation utilities, which help avoid spurious correlations during model pretraining. Experiment with the provided examples to learn more about MERLOT Reserve's question-answering capabilities.