Meet Phenaki: A Machine Learning-Based Model That Generates Videos From Text Prompts Using C-ViViT As Its Video Encoder

Text-to-image generation is a hot topic in the AI domain, mainly thanks to the open-source release of Stable Diffusion. Do you want to see an image of “a teddy bear sleeping in a medieval bed drawn in Van Gogh style”? No problem! You can pass in a detailed prompt, and Stable Diffusion will generate a realistic image for you.

The X-to-Y generation madness using diffusion models is not limited to a single task. You can go from text to image, text to speech, image to image, and the list goes on. Diffusion models are the dark horse of generative modeling.

Let’s go back to visual applications. We saw image generation from a description works fine nowadays. But how about generating videos? Is it possible to watch “a teddy bear swimming under the water with colorful fishes”? Phenaki is here to answer that question. 

Essentially, a video is a set of images displayed consecutively to simulate movement. So, does that mean we can just use deep learning image generation methods to produce a video? Unfortunately, no. This is a far more complicated problem.

First of all, the computational requirement is much higher. State-of-the-art text-to-image models are already pushing the limits of available compute, so applying the same approach to the much larger task of generating many frames is not practical. More importantly, there are not enough high-quality text-video datasets available, and an adequately sized dataset is a crucial requirement for training a deep neural network.
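To get a rough feel for the gap in compute, the back-of-the-envelope sketch below counts patch tokens for a single image and for a short clip. The resolution, patch size, and frame count are illustrative assumptions, not figures from the Phenaki paper, and `num_tokens` is a hypothetical helper.

```python
def num_tokens(frames, height, width, patch=8):
    """Count patch tokens when every frame is split into patch x patch tiles.

    All sizes used here are illustrative assumptions.
    """
    if height % patch or width % patch:
        raise ValueError("height and width must be divisible by patch")
    return frames * (height // patch) * (width // patch)


image_tokens = num_tokens(frames=1, height=128, width=128)
video_tokens = num_tokens(frames=48, height=128, width=128)

# Full self-attention compares every token with every other token,
# so its cost grows with the square of the sequence length.
attention_ratio = (video_tokens / image_tokens) ** 2

print(image_tokens, video_tokens, attention_ratio)
```

Under these assumptions, a 48-frame clip has 48 times as many tokens as one image, and naive full attention over all of them would cost roughly 48 squared times as much. This is one reason compressing the video before modeling it matters.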

Moreover, one may argue that a generated video must be based on a series of prompts or a plot that recounts what happens over time, because a single brief text prompt is insufficient to describe a video thoroughly. “A teddy bear swimming under the water with colorful fishes” can generate a nice image, but to make it work for video generation, we would need something much longer and more detailed.

Given all these problems, the authors of Phenaki had a challenging task ahead of them: story-based conditional video generation. Phenaki is the first paper to explore this promising application.

Since there is no story-based dataset to draw from, the standard deep learning strategy of simply learning the task from data is not feasible. Instead, Phenaki uses a model designed specifically to generate a video from a given story.

Structure of Phenaki. Source:

Relying on existing video encoders was not an option, because they can either only decode fixed-size videos or encode frames separately, without sharing information across time. To tackle this issue, the authors propose and use C-ViViT.

C-ViViT is an encoder-decoder structure with unique capabilities. It can exploit temporal information in videos by compressing them in temporal and spatial dimensions while staying auto-regressive in time. This structure allows C-ViViT to encode and decode variable-length videos.
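The idea of compressing in space and time while staying auto-regressive in time can be illustrated with a small sketch. The patch sizes, the helper names `token_grid` and `causal_time_mask`, and the exact grouping of frames are assumptions made for illustration, not the authors' implementation.

```python
def token_grid(frames, height, width, t_patch=2, s_patch=8):
    """Return (time_steps, rows, cols) of the compressed token grid.

    Assumption: t_patch consecutive frames and an s_patch x s_patch
    spatial tile are merged into a single token.
    """
    if frames % t_patch or height % s_patch or width % s_patch:
        raise ValueError("dimensions must be divisible by the patch sizes")
    return (frames // t_patch, height // s_patch, width // s_patch)


def causal_time_mask(time_steps):
    """mask[current][past] is True when step `current` may attend to step `past`."""
    return [[past <= current for past in range(time_steps)]
            for current in range(time_steps)]


short = token_grid(frames=8, height=64, width=64)
long = token_grid(frames=16, height=64, width=64)

mask_short = causal_time_mask(short[0])
mask_long = causal_time_mask(long[0])

# Because no step looks into the future, the first steps of the long clip
# see exactly what they would see if the clip ended there.
prefix_matches = all(
    mask_long[i][:short[0]] == mask_short[i] for i in range(short[0])
)
print(short, long, prefix_matches)
```

The causal mask is what makes variable length natural in this sketch: the attention pattern for the early time steps does not depend on how many steps follow, so the same encoder can be applied to clips of different durations.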

Furthermore, a bi-directional transformer is used on top of C-ViViT to generate video from text inputs. The text-to-video problem is modeled as a sequence-to-sequence problem: predicting video tokens conditioned on text embeddings.
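A toy sketch of this token-prediction view is given below. It assumes a masked, iterative fill-in procedure with a linear schedule, and it replaces the real transformer with a random stand-in, so `toy_predictor`, `generate_tokens`, the vocabulary size, and the schedule are all hypothetical choices for illustration only.

```python
import random

MASK = -1          # placeholder id for a video token not yet predicted
VOCAB_SIZE = 1024  # assumed codebook size, for illustration only


def toy_predictor(text_embedding, tokens, rng):
    """Hypothetical stand-in for the bi-directional transformer.

    It sees the whole token sequence at once and returns a
    (token_id, confidence) guess for every position.
    """
    return [(rng.randrange(VOCAB_SIZE), rng.random()) for _ in tokens]


def generate_tokens(text_embedding, length, steps=4, seed=0):
    """Fill in all video tokens over a few parallel prediction steps."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        preds = toy_predictor(text_embedding, tokens, rng)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Assumed linear schedule: after `step` steps, this many are filled.
        target_filled = (length * step) // steps
        n_new = target_filled - (length - len(masked))
        # Commit the most confident guesses among still-masked positions.
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:n_new]
        for i in best:
            tokens[i] = preds[i][0]
    return tokens


video_tokens = generate_tokens(text_embedding=[0.1, 0.2], length=12)
print(video_tokens)
```

In a full system, the resulting token ids would be passed to the C-ViViT decoder to produce frames. Here they are random numbers that only demonstrate the control flow.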

Example outputs of Phenaki. Source:

This was a brief summary of Phenaki, the first story-based conditional video generation model. Progress in deep learning-based generative models has accelerated sharply in recent months, and Phenaki is one of the latest studies in this domain. You can find links below if you want to learn more about Phenaki.

This article is written as a research summary by Marktechpost Staff based on the research paper 'Phenaki: Variable Length Video Generation From Open Domain Textual Descriptions'. All credit for this research goes to the researchers on this project. Check out the paper and code.

Ekrem Çetinkaya received his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis about image denoising using deep convolutional networks. He is currently pursuing a Ph.D. degree at the University of Klagenfurt, Austria, and working as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.