Tencent AI Lab Introduces GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation

Researchers from Tencent AI Lab and The University of Sydney address video understanding and generation scenarios by presenting GPT4Video, a unified multimodal framework that equips LLMs with the capability for both video understanding and generation. GPT4Video takes an instruction-following approach integrated with the Stable Diffusion generative model, handling video generation scenarios effectively and securely.

Previous researchers have developed multimodal language models that handle visual inputs and text outputs; for example, some have focused on learning a joint embedding space for multiple modalities. Interest has also grown in enabling multimodal language models to follow instructions, leading to MultiInstruct, the first multimodal instruction-tuning benchmark dataset. LLMs have revolutionized natural language processing, text-to-image/video generation has been explored with various techniques, and recent work has also addressed the safety concerns of LLMs.

To endow LLMs with robust multimodal capabilities, the GPT4Video framework is designed as a universal, versatile system that gives LLMs advanced video understanding and generation proficiencies. GPT4Video emerged as a response to the limitations of current MLLMs, which, despite their adeptness at processing multimodal inputs, exhibit deficiencies in generating multimodal outputs. GPT4Video addresses this gap by enabling LLMs not only to interpret but also to generate rich multimodal content.

GPT4Video’s architecture is composed of three integral components:

  • A video understanding module that employs a video feature extractor and a video abstractor to encode and align video information with the LLM’s word embedding space.
  • An LLM body that uses the structure of LLaMA and employs Parameter-Efficient Fine-Tuning (PEFT) methods, specifically LoRA, while keeping the original pre-trained parameters intact.
  • A video generation part that conditions the LLM to generate prompts for a model from the Text-to-Video Model Gallery, guided by a meticulously constructed instruction-following dataset.
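The three components above can be sketched as a simple pipeline: encode video features, align them with the LLM's embedding space, let the LLM emit a tagged text-to-video prompt, and route that prompt to a gallery model. The sketch below is a minimal illustration under assumed names (`extract_video_features`, `video_abstractor`, the `<video_prompt>` tag, and `dispatch_to_gallery` are all hypothetical, not the authors' actual API).

```python
# Minimal sketch of GPT4Video's three-part pipeline.
# All function names and the <video_prompt> tag are illustrative
# assumptions, not the real GPT4Video interface.
import re

def extract_video_features(frames):
    # Stand-in for a frozen video feature extractor: here we just
    # average each frame's "features" for illustration.
    return [sum(f) / len(f) for f in frames]

def video_abstractor(features, llm_dim=4):
    # Stand-in for the video abstractor that projects video features
    # into the LLM's word-embedding space (here: tile to llm_dim).
    return [[x] * llm_dim for x in features]

def llm_generate(video_tokens, user_prompt):
    # Stand-in for the LoRA-tuned LLaMA body. A real model would
    # attend over video_tokens alongside text; here we emit a canned
    # response that wraps a text-to-video prompt in a special tag.
    return f"Sure! <video_prompt>{user_prompt}, cinematic</video_prompt>"

def dispatch_to_gallery(llm_output):
    # Routes the tagged prompt to a model from the Text-to-Video
    # Model Gallery (e.g., a Stable Diffusion-based generator).
    match = re.search(r"<video_prompt>(.*?)</video_prompt>", llm_output)
    return match.group(1) if match else None

frames = [[0.1, 0.2], [0.3, 0.4]]
tokens = video_abstractor(extract_video_features(frames))
output = llm_generate(tokens, "a cat surfing")
print(dispatch_to_gallery(output))  # → a cat surfing, cinematic
```

The key design point this sketch mirrors is that the LLM itself is never trained to synthesize pixels; it only learns (via the instruction-following dataset) to emit well-formed prompts, which makes the generation backend swappable.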

GPT4Video shows remarkable abilities in understanding and generating videos, surpassing Valley by 11.8% on the video question answering task and outperforming NExT-GPT by 2.3% on text-to-video generation. The model equips LLMs with video generation capabilities without additional training parameters and can work with a variety of video generation models.

In conclusion, GPT4Video is a powerful framework that enhances LLMs with advanced video understanding and generation functions. The release of a specialized multimodal instruction dataset promises to catalyze future research in the field. While the framework currently specializes in the video modality, the authors plan to expand to other modalities, such as image and audio, in future updates.


Check out the Paper and Project. All credit for this research goes to the researchers of this project.


Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
