Meet LEO: A Groundbreaking Embodied Multi-Modal Agent for Advanced 3D World Interaction and Task Solving

Generalist agents are AI systems that can handle multiple tasks or domains without significant reprogramming or retraining. They aim to generalize knowledge and skills across domains, showing flexibility and adaptability in solving different problems. Simulations for training or research purposes often involve 3D environments, and generalist agents in these simulations can adapt to different scenarios, learn from experience, and perform tasks within the virtual space. For instance, in training simulations for pilots or surgeons, these agents can replicate various scenarios and respond accordingly.

The challenges for generalist agents in 3D worlds lie in handling the complexity of three-dimensional spaces, learning robust representations that generalize across diverse environments, and making decisions considering the multi-dimensional nature of the surroundings. These agents often use techniques from reinforcement learning, computer vision, and spatial reasoning to navigate and interact effectively within these environments.

Researchers at the Beijing Institute for General Artificial Intelligence, CMU, Peking University, and Tsinghua University propose a generalist agent called LEO, built on an LLM-based architecture. LEO is an embodied, multi-modal, multitasking agent that can perceive, ground, reason, plan, and act using a shared model architecture and shared weights. It perceives through an egocentric 2D image encoder for the embodied view and an object-centric 3D point cloud encoder for the third-person global perspective.
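To make the two-encoder design concrete, here is a minimal sketch of how text, an egocentric image, and per-object 3D tokens could be laid out as one sequence for an LLM. It is an illustration only, not LEO's code: the `Token` class, the `build_sequence` function, and the ordering of modalities are all assumptions, and the string payloads stand in for the embedding vectors a real model would use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    kind: str     # "text", "image", or "object" (hypothetical labels)
    payload: str  # placeholder; a real model would store an embedding vector


def build_sequence(instruction: str, has_ego_image: bool,
                   object_names: List[str]) -> List[Token]:
    """Lay out all modalities as one flat token sequence."""
    # One text token per whitespace-separated word (a stand-in for a tokenizer).
    seq = [Token("text", word) for word in instruction.split()]
    # The egocentric 2D view contributes a single image token in this sketch.
    if has_ego_image:
        seq.append(Token("image", "ego_view"))
    # The 3D encoder contributes one object-centric token per observed entity.
    seq.extend(Token("object", name) for name in object_names)
    return seq


seq = build_sequence("find the red chair", True, ["chair", "table", "lamp"])
kinds = [t.kind for t in seq]
print(kinds)
```

Because every modality ends up as tokens in the same sequence, a single set of model weights can consume all of them, which is the property the article attributes to LEO.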

Because it uses autoregressive training objectives, LEO can be trained with task-agnostic inputs and outputs. The 3D encoder generates an object-centric token for each observed entity, and this encoder design can be flexibly adapted to tasks with various embodiments. LEO's training builds on two principles: 3D vision-language alignment and 3D vision-language-action learning. To obtain the training data, the team curated and generated an extensive dataset of object-level and scene-level multi-modal tasks. The dataset's scale and complexity demand a deep understanding of, and interaction with, the 3D world.
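An autoregressive objective scores each target token given everything that came before it. The sketch below computes the average negative log-likelihood for a single sequence. Scoring only the response tokens, and not the prompt, is an assumption made here because it is a common choice; the article does not say how LEO handles it. The function name `response_nll` and the example numbers are hypothetical.

```python
import math
from typing import List


def response_nll(token_probs: List[float], is_response: List[bool]) -> float:
    """Mean negative log-likelihood over response tokens only.

    token_probs[i] is the probability the model assigned to the correct
    token at position i, given all earlier tokens in the sequence.
    """
    losses = [-math.log(p) for p, r in zip(token_probs, is_response) if r]
    if not losses:
        raise ValueError("no response tokens to score")
    return sum(losses) / len(losses)


# Two prompt tokens (ignored by the loss) followed by two response tokens.
probs = [0.9, 0.8, 0.5, 0.25]
mask = [False, False, True, True]
loss = response_nll(probs, mask)
print(round(loss, 4))
```

The same loss applies whether the response is an answer, a caption, or a sequence of action tokens, which is what makes the inputs and outputs task-agnostic.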

The team also proposed scene-graph-based prompting and refinement methods, along with Object-centric Chain-of-Thought (O-CoT), to improve the quality of the generated data, greatly expand its scale and diversity, and reduce LLM hallucination. The team extensively evaluated LEO and demonstrated its proficiency in diverse tasks, including embodied navigation and robotic manipulation. They also observed consistent performance gains simply from scaling up the training data.
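As a rough illustration of scene-graph-based prompting, the sketch below turns a small scene graph into a text prompt for an LLM. Everything in it is hypothetical: the function `scene_graph_to_prompt`, the prompt wording, and the sample scene are not taken from the paper. Dropping relations that mention unknown objects is one simple way to keep the prompt grounded, and the closing instruction to list relevant objects first is only a loose nod to the object-centric chain-of-thought idea.

```python
from typing import Dict, List, Tuple


def scene_graph_to_prompt(objects: Dict[str, Dict[str, str]],
                          relations: List[Tuple[str, str, str]],
                          task: str) -> str:
    """Serialize a scene graph and a task into a plain-text prompt."""
    lines = ["Scene objects:"]
    for obj_id in sorted(objects):
        attrs = ", ".join(f"{k}={v}" for k, v in sorted(objects[obj_id].items()))
        lines.append(f"- {obj_id} ({attrs})")
    lines.append("Relations:")
    for subj, rel, obj in relations:
        # Skip relations that refer to objects missing from the scene.
        if subj in objects and obj in objects:
            lines.append(f"- {subj} {rel} {obj}")
    lines.append(f"Task: {task}")
    lines.append("First list the relevant objects, then answer.")
    return "\n".join(lines)


objects = {"chair-1": {"color": "red"}, "table-2": {"material": "wood"}}
relations = [("chair-1", "next to", "table-2"), ("chair-1", "under", "lamp-9")]
prompt = scene_graph_to_prompt(objects, relations, "Where is the chair?")
print(prompt)
```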

The results show that LEO's responses incorporate rich, informative spatial relations and are precisely grounded in the 3D scenes. The responses refer to concrete objects that are present in the scenes, as well as concrete actions involving those objects. According to the team, the results demonstrate that 3D vision-language understanding and embodied action can be learned jointly, which lets LEO bridge the gap between the two.


Check out the Paper and Project. All credit for this research goes to the researchers of this project.


Arshad is an intern at MarktechPost. He is currently pursuing an Int. MSc in Physics at the Indian Institute of Technology Kharagpur. He believes that understanding things at a fundamental level leads to new discoveries, which in turn lead to advances in technology. He is passionate about understanding nature at a fundamental level with the help of tools such as mathematical models, ML models, and AI.
