Alibaba Group and Ant Group Researchers Introduce VideoComposer: An AI Model That Enables Combining Multiple Modalities Like Text, Sketch, Style, and Even Motion to Drive Video Generation

Current visual generative models, particularly diffusion-based models, have made tremendous leaps in automating content generation. Thanks to advances in computation, data scalability, and architectural design, designers can now generate realistic images or videos from a textual prompt. To achieve such fidelity and diversity, these methods typically train a robust text-conditioned diffusion model on massive video-text and image-text datasets. Despite these remarkable advancements, a major obstacle remains: the synthesis system’s poor degree of controllability, which severely limits its usefulness.

Most current approaches enable tunable creation by introducing conditions beyond text, such as segmentation maps, inpainting masks, or sketches. Composer expands on this idea with a generative paradigm based on compositionality that can compose an image under a wide range of input conditions, achieving extraordinary flexibility. While Composer excels at handling multi-level conditions in the spatial dimension, it struggles with video generation due to the unique characteristics of video data. The difficulty stems from the complex temporal structure of videos, which must accommodate a wide range of temporal dynamics while preserving coherence between individual frames. Combining appropriate temporal conditions with spatial cues therefore becomes critical for controllable video synthesis.

These considerations inspired Alibaba Group and Ant Group researchers to develop VideoComposer, which provides enhanced spatial and temporal controllability for video synthesis. This is accomplished by first decomposing a video into its constituent elements: textual conditions, spatial conditions, and, crucially, temporal conditions. A latent diffusion model then reconstructs the input video under the influence of these elements. In particular, to explicitly capture inter-frame dynamics and provide direct control over internal motions, the team introduces video-specific motion vectors as a form of temporal guidance during video synthesis.
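Motion vectors of this kind describe, block by block, how content moves between adjacent frames, similar to the motion fields stored in compressed video streams. As a rough illustration of the concept only (not the authors' implementation), here is a minimal exhaustive block-matching sketch in NumPy; the function name and parameters are illustrative:

```python
import numpy as np

def block_motion_vectors(prev, curr, block=4, search=2):
    """Estimate per-block motion vectors by exhaustive block matching.

    For each block of `curr`, find the offset (dy, dx) into `prev`
    (within +/- `search` pixels) that minimizes the absolute difference.
    Returns an array of shape (H // block, W // block, 2).
    """
    H, W = prev.shape
    mv = np.zeros((H // block, W // block, 2), dtype=np.int64)
    for by in range(H // block):
        for bx in range(W // block):
            y, x = by * block, bx * block
            ref = curr[y:y + block, x:x + block]
            best, best_dyx = np.inf, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if yy < 0 or xx < 0 or yy + block > H or xx + block > W:
                        continue  # candidate window falls outside the frame
                    cand = prev[yy:yy + block, xx:xx + block]
                    err = np.abs(cand - ref).sum()
                    if err < best:
                        best, best_dyx = err, (dy, dx)
            mv[by, bx] = best_dyx
    return mv
```

In VideoComposer, a sequence of such motion maps serves as one of the temporal conditions the diffusion model is guided by, giving direct control over how content should move.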

In addition, they introduce a unified spatiotemporal condition encoder (STC-encoder) that employs cross-frame attention to capture spatiotemporal relations within sequential inputs, improving the cross-frame consistency of the output videos. The STC-encoder also acts as an interface through which control signals from a wide range of condition sequences can be used in a unified and effective way. VideoComposer is thus adaptable enough to compose a video under various conditions while keeping the synthesis quality consistent.
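Cross-frame attention lets each spatial token attend to tokens from every frame in the sequence, which is what gives such an encoder its temporal awareness. The following is a minimal single-head sketch in NumPy with identity projections, illustrating the mechanism rather than the paper's actual STC-encoder; all names are made up for this example:

```python
import numpy as np

def cross_frame_attention(x):
    """Cross-frame self-attention over frame tokens.

    x: array of shape (T, N, D) -- T frames, N tokens per frame, D channels.
    Time and space are flattened so every token can aggregate information
    from tokens in every frame, not just its own.
    """
    T, N, D = x.shape
    tokens = x.reshape(T * N, D)
    # Scaled dot-product attention (single head, no learned projections).
    scores = tokens @ tokens.T / np.sqrt(D)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # rows sum to 1
    out = attn @ tokens
    return out.reshape(T, N, D)
```

In a real model the queries, keys, and values would come from learned linear projections, and the output would feed subsequent layers; the key point is that the attention span covers all frames at once, which is what enforces cross-frame consistency.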

Importantly, unlike conventional approaches, the team was able to steer motion patterns with relatively simple hand-drawn strokes, such as an arrow indicating the moon’s trajectory. The researchers present extensive qualitative and quantitative evidence demonstrating VideoComposer’s effectiveness. The findings show that the method attains remarkable levels of creativity across a range of downstream generative tasks.



Check Out The Paper, Github, and Project. Don’t forget to join our 23k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com
