Baidu AI Researchers Introduce VideoGen: A New Text-to-Video Generation Approach That Can Generate High-Definition Video With High Frame Fidelity

Text-to-image (T2I) generation systems like DALL-E2, Imagen, Cogview, Latent Diffusion, and others have come a long way in recent years. Text-to-video (T2V) generation, on the other hand, remains a difficult problem: it requires both high-quality visual content and temporally smooth, realistic motion that matches the text. In addition, large-scale datasets of text-video pairs are hard to come by.

Recent research from Baidu Inc. introduces VideoGen, a method for creating a high-quality, temporally smooth video from a textual description. To guide T2V generation, the researchers first generate a high-quality reference image using a T2I model. They then use a cascaded latent video diffusion module that generates a sequence of high-resolution, smooth latent representations conditioned on the reference image and the text description. When necessary, they also employ a flow-based approach to upsample the latent representation sequence in time. Finally, a trained video decoder converts the sequence of latent representations into an actual video.
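The four stages described above can be sketched as a simple composition of functions. This is only a structural outline under stated assumptions, not Baidu's implementation: every name here (`generate_video`, `t2i_model`, `latent_video_diffusion`, `temporal_upsampler`, `video_decoder`, `upsample_in_time`) is hypothetical, and the toy stand-ins at the bottom exist only to show the data flow.

```python
from typing import Callable, List, TypeVar

Image = TypeVar("Image")
Latent = TypeVar("Latent")
Video = TypeVar("Video")


def generate_video(
    prompt: str,
    t2i_model: Callable[[str], Image],
    latent_video_diffusion: Callable[[Image, str], List[Latent]],
    temporal_upsampler: Callable[[List[Latent]], List[Latent]],
    video_decoder: Callable[[List[Latent]], Video],
    upsample_in_time: bool = False,
) -> Video:
    """Chain the stages in the order the article describes (all names hypothetical)."""
    # 1. A T2I model produces a reference image from the text.
    reference_image = t2i_model(prompt)
    # 2. The latent video diffusion module is conditioned on BOTH the
    #    reference image and the text, and returns a latent sequence.
    latents = latent_video_diffusion(reference_image, prompt)
    # 3. Temporal upsampling is optional ("when necessary").
    if upsample_in_time:
        latents = temporal_upsampler(latents)
    # 4. The decoder sees only latents, never the text.
    return video_decoder(latents)


# Toy stand-ins (not real models) that make the data flow visible.
toy_t2i = lambda prompt: f"image({prompt})"
toy_diffusion = lambda image, prompt: [f"z{i}" for i in range(3)]
toy_upsampler = lambda zs: [z for z in zs for _ in range(2)]  # doubles the frame count
toy_decoder = lambda zs: {"frames": len(zs)}

result = generate_video(
    "a cat surfing", toy_t2i, toy_diffusion, toy_upsampler, toy_decoder,
    upsample_in_time=True,
)
print(result)  # {'frames': 6}
```

Note that only the diffusion stage receives the prompt directly; the decoder's signature takes latents alone, which mirrors the article's point that the decoder does not depend on the text.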

Creating a reference image with a T2I model has two distinct advantages.

  1. The visual quality of the resulting video improves. By relying on the T2I model, the proposed method draws on the much larger dataset of image-text pairs, which is more diverse and information-rich than the dataset of video-text pairs. Compared to Imagen Video, which uses image-text pairs for joint training, this method is more efficient during the training phase.
  2. The cascaded latent video diffusion model can be guided by the reference image, allowing it to focus on learning video dynamics rather than visual content. The team believes this is an additional advantage over methods that only reuse the T2I model parameters.

The team also notes that their video decoder does not need the textual description to produce a video from the latent representation sequence. This lets them train the video decoder on a larger data pool that includes both video-text pairs and unlabeled (unpaired) videos. As a result, the additional high-quality video data improves the smoothness and realism of motion in the generated video.
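To make the data-pooling point concrete, here is a minimal sketch of how a text-free decoder could draw on both kinds of data. It assumes a clip is just a sequence of frames; the function name `build_decoder_training_pool` and the data layout are hypothetical and are not taken from the paper.

```python
from typing import Iterable, List, Sequence, Tuple

Clip = Sequence  # hypothetical: a clip is any sequence of frames


def build_decoder_training_pool(
    paired: Iterable[Tuple[Clip, str]],
    unpaired: Iterable[Clip],
) -> List[Clip]:
    """Pool clips for decoder training.

    Captions are dropped because the decoder maps latents to frames
    without using text, so unlabeled clips are equally usable.
    """
    pool = [clip for clip, _caption in paired]
    pool.extend(unpaired)
    return pool


pool = build_decoder_training_pool(
    paired=[(["f1", "f2"], "a dog running")],
    unpaired=[["f3", "f4"], ["f5", "f6"]],
)
print(len(pool))  # 3
```

A text-conditioned component, by contrast, could only use the single captioned clip in this toy example.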

The findings suggest that VideoGen significantly improves on previous text-to-video generation methods in both qualitative and quantitative evaluations.


Check out the Paper and Project. All credit for this research goes to the researchers on this project.
