Artificial Intelligence (AI) field has been busy with handling the burst in generative models for the last couple of months. The open-source release of stable diffusion was the spark that lighted the never-ending flame of generative text-to-image models.
Despite the huge success of image generation models, generating temporally consistent long videos still remains a challenge. Phenaki was the most successful example of text-to-video, and even it can fail to maintain consistency in certain scenarios.
Video prediction models have come a long way in recent years thanks to the advancement in neural networks and GPUs. They can produce diverse samples of complex frames close to their original concept/video. We now have models that can generate short videos based on the previous frames.
The same cannot be said for long videos, unfortunately. Sliding short-context windows can indeed be used to predict frames using existing prediction models, and you will probably get an impressive result at first sight. However, these videos will lack temporal consistency as the window size is not long enough to consider long-term dependencies among frames.
Having a temporally consistent video is important for a pleasing visual experience. Imagine we are watching a predicted scene, then we zoom into a certain part of it, and when we zoom out, the scene is changed entirely because we don’t have a temporally consistent model. This would be annoying to look at.
The other important aspect would be a strong imagination from the prediction model. We would like to see a different setup when we change the scene, not having the same objects everywhere. So, the ideal video prediction we want should be consistent across time and have strong imagination for new scenes. But how close can we get to this ideal scenario? Time to meet TECO.
TECO is a vector-quantized latent dynamics model that can effectively model long-term dependencies using efficient transformers in a compact representation space. It shows strong performance on a variety of difficult video prediction tasks. This is made possible thanks to its capability of understanding long-term temporal dependencies in the video.
TECO uses an efficient representation of frames and the proper use of transformers to enable temporal consistency among frames. The efficient representation vectors ensure TECO can significantly reduce computational and memory requirements.
It starts with a generative adversarial network (GAN) model trained to spatially compress the video data. This has already been done in the literature, and it was shown to boost the efficiency of video prediction models. However, even after moving the video into latent space, previous methods were still limited to modeling short sequences due to the extremely high costs of transformer layers. TECO finds a smart solution around this problem to enable the use of transformers for longer video sequences, therefore maintaining temporal consistency. Moreover, to train the model in an efficient way, a custom loss function named DropLoss is used.
To demonstrate the performance of TECO, the authors introduced three challenging video datasets to better measure the long-term consistency. They were built on top of existing benchmarks. TECO showed strong temporal consistency in the experiments and achieved high-quality frame generation.
This was a brief summary of TECO. You can find more information in the links below if you want to learn more about it.
Check out the paper, project, and code. All Credit For This Research Goes To Researchers on This Project. Also, don’t forget to join our Reddit page and discord channel, where we share the latest AI research news, cool AI projects, and more.
Ekrem Çetinkaya received his B.Sc. in 2018, and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis about image denoising using deep convolutional networks. He received his Ph.D. degree in 2023 from the University of Klagenfurt, Austria, with his dissertation titled "Video Coding Enhancements for HTTP Adaptive Streaming Using Machine Learning." His research interests include deep learning, computer vision, video encoding, and multimedia networking.