This article is written as a summary by Marktechpost Research Staff based on the paper 'CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers'. All credit for this research goes to the researchers of this project. Check out the paper and the GitHub repository.
Thanks to DALL-E, we have recently seen how pretrained transformers have revolutionized text-to-image generation. So, why not use them for text-to-video generation? Today, this idea faces several challenges, mainly concerning the availability of suitable datasets and how they are used. First, collecting huge amounts of high-quality text-image pairs from the Internet is relatively easy, while the same is not currently true for text-video data. Second, variable video lengths lead to a challenging problem: to create the training samples for a text-to-video deep learning model, researchers typically split each video into several clips with a fixed number of frames, disrupting the alignment between the text and the corresponding part of the video.

Let's make this clearer with an example. Consider a video in which a woman is drinking a glass of water, and suppose we want the model to learn how to generate a video from the text "drinking". If the video is split into four clips in which the woman respectively (1) holds the glass, (2) lifts the glass, (3) drinks from the glass, and (4) puts the glass down, the model will be confused while learning the meaning of "drinking": all four clips are associated with the same original text.
In this paper, a group of researchers from Tsinghua University in Beijing proposes CogVideo, a large-scale pretrained text-to-video generative model. They build CogVideo on top of a pretrained text-to-image model (CogView2) to exploit the knowledge it learned during text-to-image pretraining. At the same time, their idea is to ensure text-video alignment through a multi-frame-rate hierarchical training approach.
Multi-frame-rate Hierarchical Training
First, following the VQ-VAE framework, the researchers tokenize each video frame into image tokens (i.e., discrete codes for parts of an image). Each training sample consists of 5 frames of 400 tokens each. During training, the transformer receives the frame tokens, the text, and a frame rate token as input. In the figure, B stands for "Begin-of-image" and is simply a separator token inherited from CogView2. The frame rate token conditions the generation of the frames so that each training sample includes the complete action described in the text, which mitigates the text-video alignment problem described above. Specifically, for each text-video pair, the lowest frame rate among a predefined set is selected, as long as it is still possible to sample at least 5 frames from the original video at that rate. After generating the key frames according to the text, the researchers trained a frame interpolation model to recursively insert transition frames and make the resulting video more coherent; the frame rate can also be varied during this process if needed. In the end, CogVideo generates 480×480 videos.
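The frame-rate selection rule above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the frame-rate set and the source fps are assumptions, and only the rule "pick the lowest rate that still yields at least 5 frames" comes from the paper.

```python
# Sketch of multi-frame-rate sampling: pick the lowest frame rate from a
# predefined set that still lets us sample 5 frames, so the 5 sampled
# frames span as much of the described action as possible.

FRAME_RATES = [1, 2, 4, 8]   # frames per second (assumed set, for illustration)
FRAMES_PER_SAMPLE = 5

def pick_frame_rate(duration_s: float) -> int:
    """Return the lowest rate at which 5 frames fit inside the clip."""
    for rate in sorted(FRAME_RATES):
        if duration_s * rate >= FRAMES_PER_SAMPLE:
            return rate
    return max(FRAME_RATES)  # very short clip: fall back to the highest rate

def sample_frame_indices(duration_s: float, src_fps: int) -> list[int]:
    """Indices of the 5 training frames within the source video."""
    rate = pick_frame_rate(duration_s)
    step = src_fps // rate   # source frames between consecutive samples
    return [i * step for i in range(FRAMES_PER_SAMPLE)]

# A 6-second clip at 24 fps is sampled at the lowest rate (1 fps),
# so the 5 frames cover 4 seconds of the action rather than a fraction
# of a second.
```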
A model that performs text-to-video generation must be able to infer both spatial and temporal correlations between text and video. As briefly discussed in the introduction, collecting high-quality text-video pairs is complex, expensive, and time-consuming. Fortunately, learning spatial semantics can be eased by exploiting image data, which is why the researchers rely on the text-to-image model CogView2. Moreover, they propose a technique called dual-channel attention. The pretrained CogView2 model includes transformer layers that implement a spatial-attention mechanism, whose purpose is to analyze the spatial features of every single frame. CogVideo adds a new temporal attention channel at each transformer layer. During training, all the parameters inherited from CogView2 are frozen, and only the parameters of the temporal channels are trainable. The purpose of these channels is to capture temporal relationships among different frames.
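A minimal PyTorch sketch of the dual-channel idea, under stated assumptions: this is an illustration rather than CogVideo's actual code, the mixing weight `alpha` is a simplification of how the two channels are combined, and full (non-Swin) attention is used for brevity. Only the structure, a frozen pretrained spatial channel plus a new trainable temporal channel, comes from the paper.

```python
# Dual-channel attention sketch: a frozen spatial channel (inherited from
# the pretrained text-to-image model) attends within each frame, while a
# new trainable temporal channel attends across frames at each position.
import torch
import torch.nn as nn

class DualChannelAttention(nn.Module):
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Freeze the pretrained spatial channel; only temporal weights train.
        for p in self.spatial.parameters():
            p.requires_grad = False
        # Zero-init mixing weight (assumed): at step 0 the layer behaves
        # exactly like the pretrained spatial-only model.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens_per_frame, dim)
        b, f, t, d = x.shape
        # Spatial channel: tokens attend within their own frame.
        xs = x.reshape(b * f, t, d)
        spatial_out, _ = self.spatial(xs, xs, xs)
        spatial_out = spatial_out.reshape(b, f, t, d)
        # Temporal channel: each spatial position attends across frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * t, f, d)
        temporal_out, _ = self.temporal(xt, xt, xt)
        temporal_out = temporal_out.reshape(b, t, f, d).permute(0, 2, 1, 3)
        return spatial_out + self.alpha * temporal_out
```

Because the spatial parameters are frozen, the optimizer only updates the temporal channel, which is exactly what makes it cheap to reuse the text-to-image pretraining.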
In particular, the authors implemented a Swin attention mechanism extended to work in temporal scenarios. An essential finding is that Swin attention allows distant regions of different frames to be generated in parallel. The figure shows that the image token in the red box can be generated using the yellow and green image tokens. This means that while generating the gray image tokens of the i-th frame, it is possible to simultaneously produce the image token in the red box, which further accelerates the generation process proposed in this paper.
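The parallelism argument can be illustrated with a toy "wavefront" schedule. This is an illustration only, not the paper's exact 3D Swin attention schedule: the `lag` parameter is a hypothetical stand-in for the window-induced dependency distance. The point it demonstrates is the one above: because each token only depends on a bounded neighborhood of the previous frame, tokens in different frames can share a generation step.

```python
# Toy wavefront schedule: with a bounded attention window, token `pos`
# of frame `frame` becomes generable a fixed number of steps (`lag`,
# hypothetical) after the corresponding region of the previous frame.
# Tokens assigned the same stage can be generated in parallel.

def generation_stage(frame: int, pos: int, lag: int = 4) -> int:
    """Auto-regressive step at which this token can be generated."""
    return pos + frame * lag

# E.g. the first token of frame 1 shares a stage with a later token of
# frame 0, so the two frames are partially generated simultaneously.
```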