Mora: A New Multi-Agent Framework that Incorporates Several Advanced Visual AI Agents to Replicate Generalist Video Generation Demonstrated by Sora

Researchers from Lehigh University and Microsoft have introduced Mora, a new multi-agent framework for video generation. Image and text synthesis have progressed significantly in recent years, but video generation remains relatively unexplored. Existing models struggle to produce videos longer than 10 seconds, which limits their practical utility. Closed-source models like OpenAI's Sora also present a barrier to innovation and replication within the academic community. The paper aims to replicate and extend Sora's capabilities across a range of video generation tasks.

Models like Pika and Gen-2 have demonstrated notable performance, but they are limited in producing longer videos and lack the abilities shown by Sora. Unlike these models, Mora relies on collaboration among advanced visual AI agents to achieve generalist video generation. Mora decomposes video generation into several subtasks, such as prompt selection, text-to-image generation, image-to-video generation, and video-to-video editing, and assigns each one to a specialized agent. By designing how these agents collaborate, Mora aims to replicate and extend the video generation capabilities demonstrated by Sora.
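To make the decomposition concrete, here is a minimal Python sketch of how subtask agents could be chained into a text-to-video pipeline. It is an illustration only, not Mora's actual code: the `Artifact`, `Agent`, and `run_pipeline` names are hypothetical, and each agent is a placeholder string transformation standing in for a real generative model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Artifact:
    """Hypothetical container for data passed between agents."""
    kind: str     # "text", "image", or "video"
    payload: str  # placeholder description instead of real media


@dataclass(frozen=True)
class Agent:
    """Hypothetical agent: one input kind in, one output kind out."""
    name: str
    accepts: str
    produces: str
    run: Callable[[str], str]

    def __call__(self, artifact: Artifact) -> Artifact:
        # Reject inputs of the wrong kind so mis-ordered pipelines fail early.
        if artifact.kind != self.accepts:
            raise TypeError(
                f"{self.name} expects {self.accepts}, got {artifact.kind}"
            )
        return Artifact(self.produces, self.run(artifact.payload))


def run_pipeline(agents: List[Agent], artifact: Artifact) -> Artifact:
    """Feed the output of each agent into the next one."""
    for agent in agents:
        artifact = agent(artifact)
    return artifact


# Placeholder agents named after the subtasks listed in the article.
# The lambdas are stand-ins; real agents would call generative models.
prompt_selection = Agent("prompt_selection", "text", "text", lambda p: p.strip())
text_to_image = Agent("text_to_image", "text", "image", lambda p: f"image({p})")
image_to_video = Agent("image_to_video", "image", "video", lambda p: f"video({p})")
video_to_video = Agent("video_to_video", "video", "video", lambda p: f"edited({p})")

# One assumed composition for text-to-video generation.
TEXT_TO_VIDEO = [prompt_selection, text_to_image, image_to_video]

result = run_pipeline(TEXT_TO_VIDEO, Artifact("text", "  a cat surfing  "))
print(result.kind, result.payload)
```

In this sketch, other tasks would reuse the same agents in a different order. For example, appending `video_to_video` to the list would model an editing step after generation.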

Mora’s multi-agent framework enables a structured yet flexible approach to video generation. By employing AI agents specialized in different aspects of the generation process, Mora can tackle diverse tasks, including text-to-video generation, text-conditional image-to-video generation, extending generated videos, video-to-video editing, connecting videos, and simulating digital worlds. Each agent is responsible for a specific input-output transformation, which helps keep the resulting videos coherent and high in quality. Experimental results show that Mora performs competitively, with metrics indicating that its videos closely resemble those produced by Sora. A performance gap between Mora and Sora remains, particularly in holistic assessments, but Mora’s open-source nature and multi-agent architecture offer significant advantages in accessibility, extensibility, and innovation potential.

In conclusion, the paper presents Mora, an open-source, multi-agent framework for advancing video generation technology. By aiming to replicate and extend the capabilities of leading video generation models like Sora, Mora offers an accessible approach to video generation and related tasks. Its multi-agent design illustrates the potential for collaborative AI systems to extend the limits of visual synthesis, opening up possibilities for innovation and application in various fields. While Mora is competitive on specific tasks, further refinement and optimization are needed to close the overall performance gap with Sora.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

Pragati Jhunjhunwala is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Kharagpur. She is a tech enthusiast with a keen interest in the scope of software and data science applications, and she is always reading about developments in different fields of AI and ML.