Google AI Introduces a New Approach to Video-Text Learning Called Iterative Co-Tokenization, Which Can Efficiently Fuse Spatial, Temporal, and Language Information for VideoQA

Given that video touches on a wide range of topics related to people’s daily lives, it has emerged as a dominant form of media. Real-world video applications, including video captioning, video content analysis, and video question-answering (VideoQA), have gained popularity due to the prevalence of video content today. They rely on models that can relate text or spoken words to video footage. VideoQA is particularly difficult because it calls for understanding both temporal information, such as how things move and interact, and semantic information, such as the objects in a scene. Additionally, processing every frame of a video to learn spatio-temporal information can be computationally expensive, since videos often contain hundreds of frames.

In their recent work titled “Video Question Answering with Iterative Video-Text Co-Tokenization,” researchers at Google AI have developed a novel method for learning from videos and text. This iterative co-tokenization approach can efficiently fuse spatial, temporal, and linguistic information for VideoQA. It analyzes the video through multiple streams, each with its own backbone model operating at a different scale, to produce video representations that capture different properties, such as high spatial resolution or long temporal extent. The model then applies the co-tokenization module to fuse the text and video streams into efficient joint representations. Compared to earlier methods, the model requires at least 50% less computation while delivering better performance.
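To make the multi-stream idea concrete, below is a minimal PyTorch sketch of how separate backbones at different spatial and temporal scales could produce complementary video features that are later fused with the question text. The backbone choices, resolutions, and dimensions here are illustrative assumptions, not the paper’s exact configuration.

```python
# Illustrative sketch: two video streams at different spatial/temporal scales,
# each with its own (placeholder) backbone, producing features to fuse with text.
import torch
import torch.nn as nn

class MultiStreamVideoEncoder(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # Stream A: high spatial resolution, short temporal extent (placeholder backbone).
        self.spatial_stream = nn.Conv3d(3, dim, kernel_size=(2, 16, 16), stride=(2, 16, 16))
        # Stream B: lower spatial resolution, longer temporal extent.
        self.temporal_stream = nn.Conv3d(3, dim, kernel_size=(4, 8, 8), stride=(4, 8, 8))

    def forward(self, high_res_clip, long_clip):
        # Each stream yields a grid of spatio-temporal features, flattened into tokens.
        a = self.spatial_stream(high_res_clip).flatten(2).transpose(1, 2)  # (B, Na, dim)
        b = self.temporal_stream(long_clip).flatten(2).transpose(1, 2)     # (B, Nb, dim)
        return torch.cat([a, b], dim=1)  # concatenated multi-scale video features

# Toy usage: a short high-resolution clip and a longer low-resolution clip.
encoder = MultiStreamVideoEncoder()
high_res = torch.randn(1, 3, 8, 224, 224)   # (batch, channels, frames, H, W)
long_low = torch.randn(1, 3, 32, 112, 112)
video_feats = encoder(high_res, long_low)
print(video_feats.shape)  # (1, Na + Nb, 512)
```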

The model’s primary objective is to produce features from both text and video that allow the two modalities to interact. A second objective is to do this efficiently, which is crucial for video because inputs span tens to hundreds of frames. The model tokenizes the joint video-language inputs into a smaller set of tokens that jointly and efficiently represent both modalities. During tokenization, both modalities are used to create a combined compact representation, which is then fed to a transformer layer to produce the next-level representation. A key challenge is that the video features frequently do not correspond directly to the text features. The researchers address this by adding two learnable linear layers that unify the dimensions of the visual and textual features before tokenization. Because a single tokenization step would prevent any further interaction between the two modalities, the new feature representation is made to interact with the video input features again, producing another set of tokenized features that are then passed to the next transformer layer. This iterative process creates new features, or tokens, that progressively refine the joint representation of both modalities. In the final stage, the features are fed to a decoder, which generates the text output.
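The sketch below illustrates the iterative co-tokenization loop described above, assuming a simple attention-based tokenizer: learnable linear layers align the video and text dimensions, a small set of learned tokens attends to the fused features, a transformer layer refines them, and the loop repeats so the tokens can interact with the video features again. Layer sizes, the number of tokens, and the tokenizer itself are assumptions for illustration, not the authors’ exact module.

```python
# Minimal sketch of iterative co-tokenization (illustrative, not the paper's module).
import torch
import torch.nn as nn

class IterativeCoTokenizer(nn.Module):
    def __init__(self, video_dim=512, text_dim=768, dim=512, num_tokens=16, iterations=3):
        super().__init__()
        # Learnable linear layers that align video and text feature dimensions.
        self.video_proj = nn.Linear(video_dim, dim)
        self.text_proj = nn.Linear(text_dim, dim)
        # A small set of learned tokens that summarize the joint representation.
        self.tokens = nn.Parameter(torch.randn(1, num_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.transformer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.iterations = iterations

    def forward(self, video_feats, text_feats):
        v = self.video_proj(video_feats)   # (B, Nv, dim)
        t = self.text_proj(text_feats)     # (B, Nt, dim)
        tokens = self.tokens.expand(v.size(0), -1, -1)
        for _ in range(self.iterations):
            # Tokenization: the compact tokens attend to the fused video-text features.
            vt = torch.cat([v, t], dim=1)
            fused, _ = self.attn(query=tokens, key=vt, value=vt)
            # Transformer layer produces the next-level joint representation.
            tokens = self.transformer(fused)
        return tokens  # compact joint tokens, to be fed to a text decoder

# Toy usage: many video tokens and a 12-token question reduced to 16 joint tokens.
cotok = IterativeCoTokenizer()
joint = cotok(torch.randn(1, 2352, 512), torch.randn(1, 12, 768))
print(joint.shape)  # torch.Size([1, 16, 512])
```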


The model is pre-trained before being fine-tuned on the target benchmarks. Pre-training uses the HowTo100M dataset, whose videos can be automatically annotated with text derived from speech recognition. On the MSRVTT-QA, MSVD-QA, and IVQA benchmarks, the iterative video-language co-tokenization model outperforms other state-of-the-art models while remaining relatively small. Moreover, iterative co-tokenization yields significant computation savings for video-text learning tasks. This method of learning from video and text emphasizes joint learning across modalities and tackles the important and challenging problem of video question-answering. It achieves these results with modest model sizes and can be improved further with larger models and more data. Google Research hopes this effort will spur additional research in vision-language learning to enable more fluid interaction with vision-based media.
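For readers who want a picture of the two-stage training flow, here is a hypothetical skeleton: pre-train the encoder-decoder on (video, speech-recognition text) pairs from HowTo100M, then fine-tune on question-answer pairs from a VideoQA benchmark. The model class, data loaders, and hyperparameters are placeholders rather than the paper’s actual setup.

```python
# Hypothetical two-stage training skeleton: the same text-generation objective drives
# both pre-training on (video, ASR text) pairs and fine-tuning on VideoQA data.
import torch

def run_stage(model, loader, optimizer, epochs):
    """Generic training loop: the model is assumed to return a text-generation loss."""
    model.train()
    for _ in range(epochs):
        for video, text_in, text_target in loader:
            loss = model(video, text_in, labels=text_target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Example flow (placeholder objects, shown only to outline the two stages):
# model = VideoTextCoTokenizationModel()                       # hypothetical full model
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# run_stage(model, howto100m_loader, optimizer, epochs=5)      # pre-training on HowTo100M
# run_stage(model, msrvtt_qa_loader, optimizer, epochs=10)     # fine-tuning on MSRVTT-QA
```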

This Article is written as a research summary article by Marktechpost Staff based on the research paper 'Video Question Answering with Iterative Video-Text Co-Tokenization'. All Credit For This Research Goes To Researchers on This Project. Check out the paper and reference article.


Khushboo Gupta is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Goa. She is passionate about the fields of Machine Learning, Natural Language Processing, and Web Development. She enjoys learning more about the technical field by participating in several challenges.