New NVIDIA Research Turns the Stable Diffusion LDM into an Efficient and Expressive Text-to-Video Model with Resolution up to 1280 x 2048

Thanks to recent advances in the underlying modeling methods, generative models of images have attracted unprecedented interest. Today's most effective models are based on generative adversarial networks, autoregressive transformers, and diffusion models. Diffusion models (DMs) are particularly attractive because their training objective is robust and scalable and because they tend to need fewer parameters than their transformer-based counterparts. While the image domain has made tremendous strides, video modeling has lagged behind, mainly because training on video data is computationally expensive and because large-scale, general, publicly accessible video datasets are scarce.

Although there is a wealth of research on video synthesis, most efforts, including earlier video DMs, produce only low-resolution and often short videos. The researchers behind this work instead apply video models to real-world problems and generate long, high-resolution videos. They concentrate on two relevant real-world video generation problems: (i) text-guided video synthesis for creative content production and (ii) video synthesis of high-resolution real-world driving data, which has great potential as a simulation engine for autonomous driving. To do this, they rely on latent diffusion models (LDMs), which reduce the heavy computational load of learning from high-resolution images.

Figure 1: Temporal video fine-tuning

The approach turns pre-trained image diffusion models into generators of temporally coherent videos. Initially, the model generates a batch of samples that are independent of one another. After temporal video fine-tuning, the samples are temporally aligned and form coherent videos.

Researchers from LMU Munich, NVIDIA, Vector Institute, the University of Toronto, and the University of Waterloo propose Video LDMs, which extend LDMs to high-resolution video generation, a particularly compute-intensive task. In contrast to earlier work on DMs for video generation, their Video LDMs are first pre-trained on images only (or reuse existing pre-trained image LDMs), which lets them take advantage of huge image datasets. They then turn the LDM image generator into a video generator by adding a temporal dimension to the latent space DM, keeping the pre-trained spatial layers fixed, and training only the temporal layers on encoded image sequences, that is, videos (Fig. 1). To establish temporal consistency in pixel space, they fine-tune the LDM's decoder in a similar way (Fig. 2).
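The idea of keeping the spatial layers fixed while training only the temporal layers can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration and not the authors' implementation: scalars stand in for latent frames, `temporal_layer` is a hypothetical layer that blends each frame with the clip average, and the loss is a made-up smoothness penalty, not the diffusion objective.

```python
# Toy illustration only: scalars stand in for latent frames, and the loss is a
# made-up temporal-smoothness penalty, not the diffusion training objective.

def spatial_layer(frames, w_spatial):
    """Per-frame (image) layer: every frame is processed independently."""
    return [w_spatial * x for x in frames]

def temporal_layer(frames, alpha):
    """Hypothetical temporal layer: blend each frame with the clip average."""
    mean = sum(frames) / len(frames)
    return [(1.0 - alpha) * y + alpha * mean for y in frames]

def smoothness_loss(frames):
    """Sum of squared differences between consecutive frames."""
    return sum((b - a) ** 2 for a, b in zip(frames, frames[1:]))

def train_temporal_only(frames, w_spatial, alpha, lr=0.01, steps=50):
    """Gradient descent on alpha only; w_spatial is frozen and never updated."""
    for _ in range(steps):
        y = spatial_layer(frames, w_spatial)
        # For z_t = (1 - alpha) * y_t + alpha * mean(y), consecutive differences
        # scale by (1 - alpha), so d(loss)/d(alpha) = -2 * (1 - alpha) * loss(y).
        grad_alpha = -2.0 * (1.0 - alpha) * smoothness_loss(y)
        alpha -= lr * grad_alpha
    return w_spatial, alpha

frames = [0.2, 0.9, 0.1, 0.8]  # independent, temporally inconsistent "frames"
w_spatial, alpha = train_temporal_only(frames, w_spatial=1.5, alpha=0.0)

before = smoothness_loss(temporal_layer(spatial_layer(frames, 1.5), 0.0))
after = smoothness_loss(temporal_layer(spatial_layer(frames, w_spatial), alpha))
print(f"alpha={alpha:.3f}, loss before={before:.3f}, loss after={after:.5f}")
```

In this toy setup only `alpha` changes during training, which mirrors the division of labor described above: the image prior stays untouched, and the added temporal parameters alone learn to make neighboring frames agree.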

Figure 2: Top: During temporal decoder fine-tuning, video sequences are processed with a frozen encoder, which handles frames independently, while the decoder is trained to produce temporally coherent reconstructions across frames. A video-aware discriminator is also used. Bottom: In latent diffusion models (LDMs), a diffusion model is trained in latent space. It synthesizes latent features, which the decoder then converts into images.

They also temporally align pixel-space and latent DM upsamplers, which are widely used for image super-resolution, turning them into time-consistent video super-resolution models that further increase the spatial resolution. Because the approach builds on LDMs, it can produce globally coherent, long videos with modest memory and compute. For synthesis at very high resolutions, the video upsampler only has to operate locally, which keeps training and compute demands low. They test the method on real driving scene videos at a resolution of 512 x 1024, where it achieves state-of-the-art video quality, and they synthesize videos that are several minutes long.

Additionally, they turn Stable Diffusion, a powerful text-to-image LDM, into a text-to-video model with a resolution of up to 1280 x 2048. Since only the temporal alignment layers need to be trained in this setting, a relatively small training set of captioned videos is sufficient. By transferring the learned temporal layers to differently fine-tuned text-to-image LDMs, they present the first instance of personalized text-to-video generation. They anticipate that their work will pave the way for more efficient digital content creation and autonomous driving simulation.
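To make the compute argument concrete, the short sketch below counts how many spatial positions a latent grid would have for the two frame sizes mentioned in this article. The downsampling factor of 8 is an assumption taken from the publicly released Stable Diffusion autoencoder; the article does not state which factor the driving model uses, and the helper names are hypothetical.

```python
def latent_size(height, width, factor=8):
    """Spatial size of the latent grid for an assumed downsampling factor."""
    if height % factor or width % factor:
        raise ValueError("height and width must be divisible by the factor")
    return height // factor, width // factor

def position_reduction(height, width, factor=8):
    """How many times fewer spatial positions the latent grid has than the frame."""
    latent_h, latent_w = latent_size(height, width, factor)
    return (height * width) / (latent_h * latent_w)

for h, w in [(512, 1024), (1280, 2048)]:
    latent_h, latent_w = latent_size(h, w)
    print(f"{h} x {w} frame -> {latent_h} x {latent_w} latent grid, "
          f"{position_reduction(h, w):.0f}x fewer positions per frame")
```

Under this assumption, each frame has 64 times fewer spatial positions in latent space than in pixel space, and the saving applies to every frame of a clip.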
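Transferring temporal layers to another image checkpoint can be pictured as a merge of two parameter dictionaries. The sketch below is only an illustration under stated assumptions: checkpoints are plain Python dicts, the `temporal.` and `spatial.` name prefixes are a hypothetical naming convention, and the function `transfer_temporal_layers` is not part of any released code.

```python
def transfer_temporal_layers(video_ckpt, image_ckpt, temporal_prefix="temporal."):
    """Combine temporal weights from a video model with spatial weights from an image model.

    Both checkpoints are plain dicts mapping parameter names to values. The
    "temporal." prefix is a hypothetical naming convention for this sketch.
    """
    temporal = {k: v for k, v in video_ckpt.items() if k.startswith(temporal_prefix)}
    spatial = {k: v for k, v in image_ckpt.items() if not k.startswith(temporal_prefix)}
    # The spatial parameter names must match, or the temporal layers would not fit.
    expected = {k for k in video_ckpt if not k.startswith(temporal_prefix)}
    if set(spatial) != expected:
        raise KeyError("image checkpoint does not match the video model's spatial layers")
    return {**spatial, **temporal}

# Video model: frozen base spatial weights plus trained temporal weights.
video_ckpt = {
    "spatial.conv.weight": 0.10,
    "spatial.attn.weight": 0.20,
    "temporal.attn.weight": 0.70,
}
# Personalized image model (for example, a DreamBooth-style fine-tune of the same base).
personalized_ckpt = {
    "spatial.conv.weight": 0.15,
    "spatial.attn.weight": 0.25,
}

merged = transfer_temporal_layers(video_ckpt, personalized_ckpt)
print(merged)
```

The merged dictionary keeps the personalized spatial weights and the trained temporal weights, which is only meaningful when both checkpoints derive from the same base architecture.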

The following are their contributions: 

(i) They provide an efficient method for training LDM-based video generation models with high resolution and long-term consistency. Their key insight is to turn pre-trained image DMs into video generators by inserting temporal layers that learn to align images in a temporally consistent way (Figs. 1 and 2).

(ii) They further temporally fine-tune super-resolution DMs, which are widely used in the literature.

(iii) They achieve state-of-the-art high-resolution video synthesis performance on real driving scene videos and can generate videos that are several minutes long.

(iv) They turn the publicly accessible Stable Diffusion text-to-image LDM into a powerful and expressive text-to-video LDM.

(v) They show that the learned temporal layers can be combined with other image model checkpoints (such as DreamBooth).

Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.
