Google AI Extends Imagen To Imagen-Video, A Text-To-Video Cascaded Diffusion Model To Generate High-Quality Video With Strong Temporal Consistency And Deep Language Understanding

In recent years (or even months), we have seen tremendous growth in generative model research. In particular, text-to-image models such as DALL-E 2, Imagen, or Parti have reached performance levels that were unimaginable a short time ago.

Text-to-video, on the other hand, is still in its early days. Generating videos from text is considerably harder than generating images: first, videos also carry temporal information and therefore require frame-to-frame coherence; second, the computational cost is much higher. As a result, most existing works fail to reach high resolutions or to generalize across many domains.

These two issues were addressed by researchers from Google, who proposed Imagen Video, a text-to-video model based on cascaded diffusion models that generates high-definition videos with high frame fidelity, strong temporal consistency, and deep language understanding.


Imagen Video comprises 7 sub-models that perform text-conditional video generation, spatial super-resolution, and temporal super-resolution. With the entire cascade, Imagen Video generates high-definition 1280×768 videos at 24 frames per second. 


Very briefly, as probably all of you already know, diffusion models are based on the idea of adding noise to an image in a forward process and learning to reconstruct it through a denoising backward process.
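The forward process has a convenient closed form: a noisy sample at any timestep can be drawn directly from the clean input. Below is a minimal numpy sketch of that step, using a linear noise schedule as an illustrative assumption (the paper uses its own schedules; `forward_noise` is a hypothetical helper, not an API from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_noise(x0, t, betas):
    """Sample x_t ~ q(x_t | x_0) in closed form.

    x0    : clean image (any numpy array)
    t     : timestep index
    betas : noise schedule (array of per-step variances)
    """
    alphas = 1.0 - betas
    alpha_bar = np.prod(alphas[: t + 1])              # cumulative signal retention
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise
    return xt, noise                                  # the denoiser learns to predict `noise`

# Toy usage: at a late timestep, almost all signal is gone and x_t is near-pure noise.
betas = np.linspace(1e-4, 0.02, 1000)                 # assumed linear schedule
x0 = np.ones((8, 8))
xt, eps = forward_noise(x0, 999, betas)
```

The backward process then trains a network to undo this corruption one step at a time, which is the part the Imagen Video sub-models implement.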

Cascaded diffusion models are one way of extending diffusion models to higher resolutions. The model first creates a low-resolution image or video and then incrementally boosts its resolution using a series of super-resolution diffusion models. When modeling extremely high-dimensional problems, cascading keeps each sub-model relatively simple.
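The data flow of such a cascade can be sketched in a few lines. This is a toy stand-in, not the paper's implementation: each real stage is a full conditional diffusion model, while here `sr_stage` just upsamples and perturbs, purely to show how each stage consumes only the previous stage's output.

```python
import numpy as np

rng = np.random.default_rng(0)

def sr_stage(low_res, scale=2):
    """Stand-in for one super-resolution diffusion stage.

    In the real cascade this is a denoising model conditioned on
    `low_res`; here we nearest-neighbour upsample and add a small
    'refinement' perturbation just to illustrate the chaining.
    """
    up = low_res.repeat(scale, axis=1).repeat(scale, axis=2)
    return up + 0.01 * rng.standard_normal(up.shape)

# Each stage only ever sees the previous stage's output, which is why
# every sub-model can be trained independently on (low-res, high-res) pairs.
x = np.zeros((8, 24, 40))          # toy base sample: 8 frames at 24x40
for _ in range(2):                 # two chained SR stages, 2x spatial each
    x = sr_stage(x)
```

After two 2x stages the toy sample grows from 24x40 to 96x160 per frame; the real model chains more stages (and temporal ones) to reach its final resolution.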

The following figure illustrates the components of Imagen Video: 1 frozen text encoder (T5-XXL), 1 base video diffusion model, 3 SSR (spatial super-resolution) models, and 3 TSR (temporal super-resolution) models. The TSR models increase temporal resolution by filling in the gaps between input frames, whereas the SSR models increase the spatial resolution of all input frames at generation time.
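The effect of interleaving TSR and SSR stages is easy to see with a little bookkeeping. The multipliers and stage order below are toy assumptions for illustration, not the paper's actual factors (the real cascade ends at 1280x768 and 24 fps):

```python
# Hypothetical stage order: (name, frame multiplier, spatial multiplier).
# TSR stages multiply the frame count; SSR stages multiply height and width.
stages = [
    ("base", 1, 1),
    ("TSR", 2, 1),
    ("SSR", 1, 2),
    ("TSR", 2, 1),
    ("SSR", 1, 2),
    ("TSR", 2, 1),
    ("SSR", 1, 2),
]

frames, h, w = 16, 24, 40          # toy base-model output
for name, f_mul, s_mul in stages:
    frames, h, w = frames * f_mul, h * s_mul, w * s_mul

print(frames, h, w)                # 128 frames at 192x320 under these toy multipliers
```

The point is structural: temporal and spatial resolution are raised by separate specialist models rather than by one monolithic network.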

One advantage of cascaded models is the ability to train each diffusion model individually, enabling the training of all 7 models in parallel.


To capture dependencies between video frames, diffusion models for video generation commonly use a 3D U-Net architecture in which temporal attention and convolution layers are interleaved with spatial attention and convolution layers. As opposed to frame-autoregressive methods, each denoising model in this work operates on several video frames concurrently and generates entire blocks of frames at once, which is crucial for the temporal coherence of the resulting video.
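This interleaving is often implemented as factorized space-time attention: attend over spatial positions within each frame, then over frames at each spatial position. The sketch below shows the data-layout gymnastics with plain numpy; it is a generic illustration of the pattern, not the paper's exact block (which also includes convolutions and learned projections).

```python
import numpy as np

def attention(q, k, v):
    """Plain scaled dot-product attention with a softmax over the last axis."""
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights = weights / weights.sum(-1, keepdims=True)
    return weights @ v

def factorized_block(x):
    """Factorized space-time attention over x of shape (T, N, C):
    T frames, N spatial positions per frame, C channels."""
    # Spatial attention: each frame attends over its own N positions.
    x = x + attention(x, x, x)
    # Temporal attention: swap axes so each position attends across T frames.
    xt = x.swapaxes(0, 1)                    # (N, T, C)
    xt = xt + attention(xt, xt, xt)
    return xt.swapaxes(0, 1)                 # back to (T, N, C)

out = factorized_block(np.zeros((4, 6, 8)))  # 4 frames, 6 positions, 8 channels
```

Factorizing this way keeps the cost at roughly O(N²)+O(T²) per block instead of O((NT)²) for full joint attention over all frames and positions.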


The base video model mixes information over time via temporal attention. The SSR and TSR models, in contrast, replace temporal attention with temporal convolutions. The temporal convolutions let Imagen Video maintain local temporal consistency during upsampling, while the base model's temporal attention lets it model long-range temporal relationships.
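Why convolutions suffice for the upsampling stages becomes clear from their receptive field: a temporal convolution only mixes neighbouring frames. Here is a simplified sketch with a single shared 1-D filter (real models learn per-channel filters and stack many layers; the kernel values are illustrative):

```python
import numpy as np

def temporal_conv(x, kernel):
    """Depthwise 1-D convolution over the time axis with 'same' edge padding.

    x      : (T, N, C) video tokens
    kernel : (K,) temporal filter, shared across positions and channels
             (a simplification of learned per-channel filters)
    """
    pad = len(kernel) // 2
    xp = np.pad(x, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j, w in enumerate(kernel):
            out[i] += w * xp[i + j]
    return out

# A 3-tap smoothing kernel mixes only adjacent frames: local temporal
# consistency, unlike attention, which can relate any two frames.
x = np.arange(5, dtype=float).reshape(5, 1, 1)   # frame values 0..4
y = temporal_conv(x, np.array([0.25, 0.5, 0.25]))
```

Frame 1 here only sees frames 0-2, so a convolution is cheap and sufficient for keeping an upsampled clip locally smooth, whereas long-range structure must already be present in the base model's output.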

Spatial attention and convolutions are used in the models as well. The base model and the first two spatial super-resolution models feature spatial attention; however, to reduce memory and compute costs at higher resolutions, the authors switch to fully convolutional architectures in the remaining models.


Some astounding frames extracted from videos generated by Imagen Video are shown below.


Another cool characteristic of Imagen Video is that it was jointly trained on images and videos. Images were packed into sequences of the same length as videos, and the temporal branches of all the models were bypassed for them. This strategy transfers knowledge from images to videos: training on photos lets the model learn about many image styles (such as drawings and paintings), whereas natural video data alone only teaches it the dynamics of natural settings. Below is an illustration of the results this strategy has produced.
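The bypass trick amounts to a conditional branch inside each block. The sketch below uses hypothetical stand-in layers (simple lambdas, not the paper's networks) just to show how one architecture can serve both image and video batches:

```python
import numpy as np

def block(x, is_video, temporal_fn, spatial_fn):
    """One hypothetical network block under joint image/video training.

    Images are packed as length-T 'videos' of independent frames, and
    the temporal branch is simply skipped for them.
    """
    x = spatial_fn(x)
    if is_video:                  # temporal mixing only makes sense for real clips
        x = temporal_fn(x)
    return x

spatial = lambda x: x * 2.0                 # stand-in for a spatial layer
temporal = lambda x: x + x.mean(axis=0)     # stand-in for temporal mixing

imgs = np.ones((4, 3, 3))                   # 4 independent images packed like frames
out_img = block(imgs, is_video=False, temporal_fn=temporal, spatial_fn=spatial)
out_vid = block(imgs, is_video=True, temporal_fn=temporal, spatial_fn=spatial)
```

For the image batch the frames never interact, so the spatial weights are trained on image data for free while the temporal weights are only ever updated by real video.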

Check out the Paper and Project. All credit for this research goes to the researchers on this project.

Leonardo Tanzi is currently a Ph.D. Student at the Polytechnic University of Turin, Italy. His current research focuses on human-machine methodologies for smart support during complex interventions in the medical domain, using Deep Learning and Augmented Reality for 3D assistance.