Google AI Introduces ‘MV-GPT,’ A New Generative Pre-Training Framework For Multimodal Video Captioning

Multimodal video captioning systems use video frames and speech to generate natural language descriptions of videos. Such systems are stepping stones toward the long-term objective of developing multimodal conversational systems that effortlessly communicate with users while perceiving their environments via multimodal input streams.

In contrast to video understanding tasks, where the primary challenge lies in processing and understanding multimodal input videos, multimodal video captioning adds the challenge of producing grounded captions. The most prevalent method for this task is to jointly train an encoder-decoder network on manually annotated data. However, annotating grounded captions for videos is labor-intensive and often impractical, so large-scale manually annotated datasets are scarce. Previous work such as VideoBERT and CoMVT pre-trains models on unlabelled videos using automatic speech recognition (ASR) transcripts. As a result, only the video encoder is transferred to downstream tasks.


Researchers introduce a novel pre-training framework for multimodal video captioning in “End-to-End Generative Pre-training for Multimodal Video Captioning,” presented at CVPR 2022. The framework, called Multimodal Video Generative Pre-Training (MV-GPT), trains a multimodal video encoder and a sentence decoder on unlabelled videos, using a future utterance as the target text and a novel bi-directional generation task.

Experiments show that MV-GPT transfers effectively to multimodal video captioning, achieving state-of-the-art performance on various benchmarks. In addition, the multimodal video encoder is competitive on multiple video understanding tasks, including VideoQA, text-video retrieval, and action recognition.

Future Utterance as an Additional Text Signal

Each training video clip for multimodal video captioning is typically associated with two texts: (1) a speech transcript aligned with the clip as part of the multimodal input stream, and (2) a target caption, which is frequently manually annotated. The encoder learns to combine information from the transcript with the visual content, and the target captions are used to train the decoder for generation. In the case of unlabelled videos, however, each video clip comes with only an ASR transcript and no manually annotated target caption. Moreover, the same text cannot serve as both the encoder input and the decoder target, as the generation target would then be trivial.

MV-GPT circumvents this difficulty by using a future utterance as an additional text signal, enabling joint pre-training of the encoder and decoder. However, training a model simply to generate future utterances is not ideal, because such utterances are frequently not grounded in the input content.
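The pairing described above can be sketched in a few lines. This is an illustrative mock-up, not the paper's actual data pipeline: the segment format, field names, and the helper `make_pretraining_pairs` are all assumptions for the example.

```python
# Hypothetical sketch: mining (input transcript, target future utterance)
# pairs from time-ordered ASR segments of an unlabelled video.

def make_pretraining_pairs(asr_segments):
    """asr_segments: time-ordered list of dicts with 'start', 'end', 'text'.

    Each clip's own transcript becomes the encoder's text input, and the
    *next* utterance becomes the decoder's generation target, so the
    target is never identical to the input.
    """
    pairs = []
    for current, future in zip(asr_segments, asr_segments[1:]):
        pairs.append({
            "clip_span": (current["start"], current["end"]),  # frames to sample
            "input_transcript": current["text"],              # encoder text input
            "target_utterance": future["text"],               # decoder target
        })
    return pairs


segments = [
    {"start": 0.0, "end": 4.0, "text": "heat the oil"},
    {"start": 4.0, "end": 8.0, "text": "add the onions"},
    {"start": 8.0, "end": 12.0, "text": "stir until golden"},
]
pairs = make_pretraining_pairs(segments)
```

With three ASR segments this yields two training pairs, each supervising the decoder with the utterance that follows the clip.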

Bi-directional Generation Loss

The problem of generating non-grounded text is mitigated by formulating a bi-directional generation loss that includes both forward and backward generation. Forward generation produces future utterances given the visual frames and their corresponding transcripts, allowing the model to learn to fuse the visual content with the corresponding transcript. Backward generation takes the video’s visual frames and the future utterance and trains the model to generate a transcript, which is more grounded in the visual content. The bi-directional generation loss in MV-GPT thus trains both the encoder and the decoder to handle text with a strong visual component.
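The two directions of the objective can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: `decode_nll` is a hypothetical stand-in for a transformer decoder's negative log-likelihood, scored here by simple token overlap so the example runs without any ML framework.

```python
# Illustrative sketch of MV-GPT's bi-directional generation objective.
# `decode_nll` and the list-of-tokens inputs are hypothetical stand-ins.

def decode_nll(context, target_tokens):
    """Toy negative log-likelihood of target_tokens given a fused context.

    A real decoder would score token probabilities; here we just count
    how many target tokens are missing from the context.
    """
    overlap = sum(1 for t in target_tokens if t in context)
    return float(len(target_tokens) - overlap)

def bidirectional_generation_loss(frames, transcript, future_utterance):
    # Forward generation: fuse frames + current transcript, then
    # generate the future utterance.
    forward_loss = decode_nll(frames + transcript, future_utterance)

    # Backward generation: fuse frames + future utterance, then
    # generate the (visually grounded) current transcript.
    backward_loss = decode_nll(frames + future_utterance, transcript)

    # Summing both directions lets each one shape the shared encoder
    # and decoder during pre-training.
    return forward_loss + backward_loss


loss = bidirectional_generation_loss(
    frames=["pan", "oil"],
    transcript=["heat", "the", "oil"],
    future_utterance=["add", "the", "onions"],
)
```

The key design point is that the same encoder and decoder are reused in both directions, so neither component can be trained while ignoring the visual frames.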


Results of Multimodal Video Captioning

Using the same model architecture and the YouCook2 benchmark with standard evaluation metrics, the researchers compare MV-GPT to existing pre-training approaches. Although all pre-training techniques improve captioning performance, pre-training the decoder jointly with the encoder proves essential for the best model performance.

A model pre-trained with MV-GPT was then applied to four captioning benchmarks: YouCook2, MSR-VTT, ViTT, and ActivityNet-Captions. The model achieves state-of-the-art performance by significant margins on all four, with relative improvements of over 12 percent on the Meteor metric across all four benchmarks.

The researchers present MV-GPT, a new generative pre-training framework for multimodal video captioning. Its bi-directional generative objective pre-trains a multimodal encoder and a caption decoder using utterances sampled at different times from unlabelled videos. The pre-trained model achieves state-of-the-art performance on multiple video captioning benchmarks and other video understanding tasks, including VideoQA, video retrieval, and action classification.

This article is written as a summary by Marktechpost Staff based on the paper 'End-to-End Generative Pre-training for Multimodal Video Captioning'. All credit for this research goes to the researchers on this project. Check out the paper and post.
