Researchers from Stanford University and FAIR Meta Unveil CHOIS: A Groundbreaking AI Method for Synthesizing Realistic 3D Human-Object Interactions Guided by Language

Researchers from Stanford University and FAIR Meta have introduced CHOIS, a system that generates synchronized motion for humans and objects within a 3D scene. It takes as input sparse object waypoints, the initial states of the object and the human, and a textual description. From these, it produces realistic and controllable motions for both entities in the specified 3D environment.

The availability of large-scale, high-quality motion capture datasets such as AMASS has driven growing interest in generative human motion modeling, including action-conditioned and text-conditioned synthesis. While prior works used VAE formulations to generate diverse human motion from text, CHOIS focuses on human-object interactions. Unlike existing approaches that often center on hand motion synthesis, CHOIS considers the full-body motion that precedes object grasping and predicts object motion based on human movements, which makes it suitable for interactive 3D scene simulations.

CHOIS addresses the need to synthesize realistic human behaviors in 3D environments, which is important for computer graphics, embodied AI, and robotics. It generates synchronized human and object motion from language descriptions, initial states, and sparse object waypoints. In doing so, it tackles three challenges: generating realistic motion, accommodating clutter in the environment, and synthesizing interactions from language descriptions. The result is a system for controllable human-object interactions in diverse 3D scenes.

The model uses a conditional diffusion approach to generate synchronized object and human motion based on language descriptions, object geometry, and initial states. During training, an object geometry loss supervises the predicted object transformations, and contact constraints are not explicitly enforced at this stage. Instead, guidance terms are applied during the sampling process to encourage realistic human-object contact.
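To make the idea of applying constraints at sampling time more concrete, here is a minimal toy sketch in Python. It is not the CHOIS implementation: the denoiser is a hypothetical stand-in for a learned network, the state is a single 3D hand position rather than a full motion sequence, and the contact cost, noise schedule, and every function name and parameter are assumptions chosen for illustration. It shows one common pattern, in which the gradient of a cost is used to nudge the predicted clean sample at each reverse step.

```python
import math
import random

# Hypothetical prior hand position that the stand-in "network" drifts toward.
PRIOR_HAND = [0.0, 0.9, 0.0]

def alpha_bar(t, num_steps):
    """Cosine-style cumulative schedule: 1.0 at t=0, close to 0 at t=num_steps."""
    return math.cos(0.5 * math.pi * t / num_steps) ** 2

def toy_denoiser(x_t, ab_t):
    """Stand-in for a learned model that predicts the clean sample from x_t.
    It only blends the noisy input with a fixed prior (an assumption)."""
    return [ab_t * x + (1.0 - ab_t) * p for x, p in zip(x_t, PRIOR_HAND)]

def contact_gradient(hand, contact_point):
    """Gradient of the assumed contact cost ||hand - contact_point||^2."""
    return [2.0 * (h - c) for h, c in zip(hand, contact_point)]

def sample_hand_position(contact_point, num_steps=50, guidance_scale=0.25, seed=0):
    """Deterministic (DDIM-style) reverse process with guidance on the
    predicted clean sample. guidance_scale=0 disables the constraint."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(3)]  # start from pure noise
    for t in range(num_steps, 0, -1):
        ab_t = alpha_bar(t, num_steps)
        ab_prev = alpha_bar(t - 1, num_steps)
        x0_hat = toy_denoiser(x, ab_t)
        # Guidance: move the clean prediction down the contact-cost gradient.
        grad = contact_gradient(x0_hat, contact_point)
        x0_hat = [v - guidance_scale * g for v, g in zip(x0_hat, grad)]
        # Recover the implied noise, then step to the previous noise level.
        eps_hat = [(xi - math.sqrt(ab_t) * x0i) / math.sqrt(1.0 - ab_t)
                   for xi, x0i in zip(x, x0_hat)]
        x = [math.sqrt(ab_prev) * x0i + math.sqrt(1.0 - ab_prev) * ei
             for x0i, ei in zip(x0_hat, eps_hat)]
    return x

if __name__ == "__main__":
    contact = [0.3, 1.0, 0.2]
    print("unguided:", sample_hand_position(contact, guidance_scale=0.0))
    print("guided:  ", sample_hand_position(contact, guidance_scale=0.25))
```

In this toy setup the guidance scale trades off the denoiser's prediction against the constraint: because the cost is quadratic, a scale of 0.5 moves the clean prediction all the way onto the contact point, while smaller values blend the two. A real system would apply such terms to full human and object motion sequences with costs defined on hand-object contact.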

CHOIS is evaluated against baselines and ablations and performs better on condition matching and contact accuracy, with less hand-object penetration and less foot floating. On the FullBodyManipulation dataset, the object geometry loss improves the model’s results. CHOIS also outperforms baselines and ablations on the 3D-FUTURE dataset, demonstrating that it generalizes to new objects. In human perceptual studies, participants judged CHOIS’s results to align better with the text input and to show higher interaction quality than the baseline. Quantitative metrics, including position and orientation errors, measure how far the generated results deviate from the ground truth motion.
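As an illustration of what position and orientation errors can look like in practice, the following Python sketch computes a mean Euclidean position error and a geodesic rotation angle between a predicted and a ground-truth pose. The exact definitions used in the CHOIS evaluation are not given here, so these formulas, the function names, and the input layout (lists of 3D points, 3x3 rotation matrices as nested lists) are assumptions.

```python
import math

def position_error(pred_positions, gt_positions):
    """Mean Euclidean distance between matching predicted and ground-truth 3D points."""
    if len(pred_positions) != len(gt_positions) or not pred_positions:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(math.dist(p, g) for p, g in zip(pred_positions, gt_positions))
    return total / len(pred_positions)

def orientation_error_deg(r_pred, r_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_pred^T R_gt) equals the sum of elementwise products.
    trace = sum(r_pred[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against values slightly outside [-1, 1] from rounding.
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))

def rot_z(angle_deg):
    """Helper for the example: rotation about the z-axis by angle_deg."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

if __name__ == "__main__":
    pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
    print(position_error(pred, gt))                   # mean of 0 and 2
    print(orientation_error_deg(rot_z(0), rot_z(90)))
```

Both quantities are averaged over frames in a typical evaluation, so lower values mean the generated object trajectory stays closer to the captured one.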

In conclusion, CHOIS is a system that generates realistic human-object interactions based on language descriptions and sparse object waypoints. It uses an object geometry loss during training and applies guidance terms during sampling to make the results more realistic. The interaction module learned by CHOIS can be integrated into a pipeline for synthesizing long-term interactions given language and 3D scenes. Overall, CHOIS significantly improves the generation of realistic human-object interactions that align with the provided language descriptions.

Future research could explore enhancing CHOIS by integrating additional supervision, like object geometry loss, to improve the matching of generated object motion with input waypoints. Investigating advanced guidance terms for enforcing contact constraints may lead to more realistic results. Extending evaluations to diverse datasets and scenarios will test CHOIS’s generalization capabilities. Further human perceptual studies can provide deeper insights into generated interactions. Applying the learned interaction module to generate long-term interactions based on object waypoints from 3D scenes would also expand CHOIS’s applicability.


Check out the Paper and Project. All credit for this research goes to the researchers of this project.


Hello, my name is Adnan Hassan. I am a consulting intern at Marktechpost and soon to be a management trainee at American Express. I am currently pursuing a dual degree at the Indian Institute of Technology, Kharagpur. I am passionate about technology and want to create new products that make a difference.
