Google AI Introduces VideoPrism: A General-Purpose Video Encoder that Tackles Diverse Video Understanding Tasks with a Single Frozen Model

Google researchers introduce VideoPrism, a new encoder model aimed at comprehensive understanding of diverse video content. Existing video understanding models have struggled with tasks that involve complex systems and motion-centric reasoning, and they have performed poorly across different benchmarks. The researchers aimed to develop a general-purpose video encoder that can tackle a wide range of video understanding tasks with minimal adaptation.

Existing video understanding models have made significant progress but still fall short. Some models learn from the text associated with videos, while others focus solely on video signals, and relying on either source alone limits how well a model captures both appearance and motion cues. VideoPrism instead integrates both video and text modalities during pretraining. It introduces a two-stage pretraining framework that combines contrastive learning with masked video modeling. This method enables the model to learn semantic representations from both video-text pairs and video-only data.
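To make the contrastive stage concrete, here is a minimal sketch of a symmetric video-text contrastive loss in plain Python. It is not taken from the VideoPrism code: the function names, the temperature value of 0.07, and the toy two-dimensional embeddings are all assumptions chosen for illustration. The idea it shows is that matching video-text pairs should score higher than mismatched pairs within a batch.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    video_embs[i] and text_embs[i] are assumed to describe the same clip.
    The temperature is a hypothetical default, not a value from the paper.
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)
    # logits[i][j]: scaled similarity between video i and text j
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Video-to-text direction: the correct text for video i is at index i
    v2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-video direction: same computation on the transposed matrix
    logits_t = [list(col) for col in zip(*logits)]
    t2v = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)


# Toy batch: two clips whose embeddings point in different directions
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = contrastive_loss(videos, aligned_texts)
swapped_loss = contrastive_loss(videos, swapped_texts)
```

In this toy batch the aligned pairing yields a much smaller loss than the swapped pairing, which is the signal that pulls matching video and text embeddings together during training.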

VideoPrism’s architecture is based on the Vision Transformer (ViT) with modifications for space-time factorization. During pretraining, the model first aligns video and text embeddings through contrastive learning and then continues training on video-only data using masked video modeling. This two-stage approach is augmented with global-local distillation and token shuffling techniques to improve model performance. Extensive evaluations across various video understanding tasks demonstrate that VideoPrism achieves state-of-the-art performance on 30 out of 33 benchmarks, showcasing its robust generalizability and effectiveness in capturing both appearance and motion cues.
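The masking step in the second stage can be sketched as follows. This is only an illustration under stated assumptions: the article names masked video modeling and token shuffling but does not describe their details, so the 0.75 mask ratio, the 4-frame by 2x2-patch token grid, the helper names, and the reading of token shuffling as a random permutation of token order are all hypothetical.

```python
import random


def split_visible_masked(num_tokens, mask_ratio, rng):
    """Randomly split token indices into visible and masked sets.

    The encoder would see only the visible tokens; the model is then
    trained to predict targets for the masked positions.
    """
    num_masked = int(round(num_tokens * mask_ratio))
    order = list(range(num_tokens))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def shuffle_tokens(tokens, rng):
    """Return the tokens in a random order, plus the permutation used.

    Assumption: token shuffling is illustrated here as a plain random
    permutation, so the original order can still be recovered from perm.
    """
    perm = list(range(len(tokens)))
    rng.shuffle(perm)
    return [tokens[i] for i in perm], perm


# Hypothetical clip: 4 frames, each split into a 2x2 grid of patches
frames, patches_per_frame = 4, 4
num_tokens = frames * patches_per_frame

rng = random.Random(0)  # fixed seed so the sketch is reproducible
visible, masked = split_visible_masked(num_tokens, mask_ratio=0.75, rng=rng)

# Stand-in token values; a real model would hold embedding vectors here
tokens = [f"tok{i}" for i in visible]
shuffled, perm = shuffle_tokens(tokens, rng)
```

With these assumed settings, 12 of the 16 tokens are hidden and 4 remain visible. Returning the permutation alongside the shuffled tokens keeps the step reversible, which matters if predictions later need to be matched back to their original positions.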

In summary, VideoPrism is the Google researchers' answer to the challenge of building a foundational video model for comprehensive video understanding. Its two-stage pretraining framework combines contrastive learning with masked video modeling, resulting in a model that excels across a wide range of video understanding tasks.


Check out the Paper. All credit for this research goes to the researchers of this project.


Pragati Jhunjhunwala is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She is a tech enthusiast with a keen interest in the scope of software and data science applications, and she is always reading about developments in different fields of AI and ML.
