Meta AI Introduces GenAug: A New System That Uses Text2Image Models To Enable Robots To Transfer Behaviors Zero-Shot From A Simple Demonstrated Scene To Unseen Scenes of Varying Complexity

Robot learning techniques have the potential to generalize across a wide range of tasks, settings, and objects. Unfortunately, these methods require large, diverse datasets, which are difficult and costly to collect in real-world robotics settings. Generalization in robot learning therefore requires access to priors or data beyond the robot's immediate environment.

Data augmentation is a useful tool for improving model generalization, but most methods operate in low-level visual space, altering the data with transformations such as color jitter, Gaussian blurring, and cropping. These operations cannot capture significant semantic variations in an image, such as distracting elements, different backgrounds, or changes in object appearance.
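For contrast, here is a minimal sketch of this kind of low-level, pixel-space augmentation using torchvision; the specific transform parameters are illustrative assumptions, not values taken from the paper:

```python
# Conventional low-level augmentation: pixel statistics change,
# but the scene's semantics (objects, background, distractors) do not.
import torch
from torchvision import transforms

low_level_aug = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
    transforms.RandomResizedCrop(size=(224, 224), scale=(0.8, 1.0)),
])

frame = torch.rand(3, 256, 256)      # stand-in for a robot camera frame
augmented = low_level_aug(frame)     # same scene, slightly perturbed pixels
```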

GenAug is a semantic data augmentation framework developed by the University of Washington and Meta AI that uses pre-trained text-to-image generative models to facilitate imitation learning on real robots. Pre-trained generative models are trained on far larger and more varied data than any on-robot dataset, and this research uses them to supplement the data available for training real robots in the real world. The work rests on the intuition that the way a task is accomplished in one environment should largely transfer to the same task in other environments, despite differences in scene layout, background, and object appearance.

A small amount of on-robot experience provides demonstrations of the required behavior, while a generative model can render vastly different visual scenes, with varied backgrounds and object appearances, under which that same behavior remains valid. Because these generative models are trained on realistic data, the generated scenes look both realistic and diverse. In this way, a large amount of semantically augmented data can be produced cheaply from a limited number of demonstrations, exposing the learning agent to far more varied settings than the on-robot demonstration data alone.

Given a dataset of image-action examples collected on a real robot system, GenAug generates “augmented” RGBD images depicting entirely new yet realistic surroundings, reflecting the visual realism and complexity of scenes a robot may encounter in the real world. Specifically, for robots performing tabletop manipulation tasks, GenAug uses language prompts together with a generative model to alter object textures and shapes and to add new distractor objects and background scenes that remain physically consistent with the original scene.
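As a rough illustration of the augmentation step, the sketch below uses an off-the-shelf Stable Diffusion inpainting model from the diffusers library to repaint everything outside a masked object region and reuses the original action label for the new frame. The model checkpoint, prompt, file names, and object mask are assumptions for illustration only; GenAug's actual pipeline additionally enforces depth and physical consistency for RGBD scenes.

```python
# Hedged sketch: semantic augmentation of a demonstration frame with an
# off-the-shelf text-to-image inpainting model (not GenAug's exact pipeline).
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

rgb = Image.open("demo_frame.png").convert("RGB").resize((512, 512))

# White pixels in the mask are repainted; black pixels are kept, so the
# task-relevant object stays fixed and the recorded action remains valid.
object_mask = np.zeros((512, 512), dtype=np.uint8)
object_mask[180:360, 200:320] = 255                # assumed object bounding box
repaint_mask = Image.fromarray(255 - object_mask)  # repaint everything else

prompt = "a cluttered kitchen countertop with a wooden cutting board"
augmented = pipe(prompt=prompt, image=rgb, mask_image=repaint_mask).images[0]
augmented.save("demo_frame_aug.png")
# The original action label is reused unchanged for the augmented frame.
```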

The researchers demonstrate that training on this semantically augmented dataset greatly improves the generalization capabilities of imitation learning methods, even though the dataset is built from only ten real-world demonstrations collected in a single, simple scene. According to the findings, GenAug improves generalization performance by 40% over traditional augmentation methods, allowing the robot to operate in scenes and with objects it has never seen before.
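A minimal behavior-cloning sketch over such image-action pairs might look like the following; the random placeholder data, the small convolutional policy, and the seven-dimensional action space are assumptions for illustration, not the architecture used in the paper.

```python
# Hedged sketch: imitation learning (behavior cloning) on a mix of original
# and augmented (image, action) pairs.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder tensors: N frames (3x224x224) paired with 7-DoF actions.
images = torch.rand(100, 3, 224, 224)
actions = torch.rand(100, 7)
loader = DataLoader(TensorDataset(images, actions), batch_size=16, shuffle=True)

policy = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 7),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

for epoch in range(10):
    for frames, target_actions in loader:
        loss = nn.functional.mse_loss(policy(frames), target_actions)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```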

The team plans to apply GenAug to other areas of robot learning, such as behavior cloning and reinforcement learning, and to tackle more difficult manipulation problems. The researchers also see it as a fascinating direction for future work to investigate whether a combination of language models and vision-language models could serve as even better scene generators.


Check out the Paper and Project. All credit for this research goes to the researchers on this project.
