The current media environment is filled with visual effects and video editing. As a result, as video-centric platforms have gained popularity, demand for more user-friendly and effective video editing tools has skyrocketed. However, because video data is temporal, editing in the format is still difficult and time-consuming. Modern machine learning models have shown considerable promise in enhancing editing, although techniques frequently compromise spatial detail and temporal consistency. The emergence of potent diffusion models trained on huge datasets recently caused a sharp increase in the quality and popularity of generative techniques for picture synthesis. Simple users may produce detailed pictures using text-conditioned models like DALL-E 2 and Stable Diffusion with only a text prompt as input. Latent diffusion models effectively synthesize pictures in a perceptually constrained environment. They research generative models suitable for interactive applications in video editing due to the development of diffusion models in picture synthesis. Current techniques either propagate adjustments using methodologies that calculate direct correspondences or, by finetuning on each unique video, re-pose existing picture models.
They try to avoid costly per-movie training and correspondence calculations for quick inference for every video. They suggest a content-aware video diffusion model with a configurable structure trained on a sizable dataset of paired text-image data and uncaptioned movies. They use monocular depth estimations to represent structure and pre-trained neural networks to anticipate embeddings to represent content. Their method gives several potent controls on the creative process. They first train their model, much like image synthesis models, so the inferred films’ content, such as their look or style, correspond to user-provided pictures or text cues (Fig. 1).
Figure 1: Video Synthesis With Guidance We introduce a method based on latent video diffusion models that synthesises videos (top and bottom) directed by text- or image-described content while preserving the original video’s structure (middle).
To choose how closely the model resembles the supplied structure, they apply an information-obscuring technique to the structure representation inspired by the diffusion process. To regulate the temporal consistency in created clips, they modify the inference process using a unique guiding technique influenced by classifier-free guidance.
In summary, they provide the following contributions:
• By adding temporal layers to an image model that has already been trained and by training on pictures and videos, they extend latent diffusion models to video production.
• They provide a model that adjusts films based on sample texts or pictures that are structure and content-aware. Without further per-video training or pre-processing, the complete editing procedure is done at the inference time.
• They exhibit complete mastery of consistency in terms of time, substance, and structure. They demonstrate for the first time how inference-time control over temporal consistency is made possible by concurrently training on image and video data. Training on several degrees of detail in the representation enables picking the preferred configuration during inference, ensuring structural consistency.
• They demonstrate in user research that their technique is preferable over several alternative approaches.
• By focusing on a small group of photos, they show how the trained model may be further modified to produce more accurate movies of a particular subject.
More details can be found on their project website along with interactive demos.
Check out the Paper and Project Page. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 14k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.
Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence from the Indian Institute of Technology(IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing and is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.