OpenVLA: A 7B-Parameter Open-Source VLA Setting New State-of-the-Art for Robot Manipulation Policies

A major weakness of current robotic manipulation policies is their inability to generalize beyond their training data. While these policies, trained for specific skills or language instructions, can adapt to new conditions like different object positions or lighting, they often fail when faced with scene distractors or new objects, and they struggle to follow unseen task instructions. On the other hand, existing foundation models for vision and language, such as CLIP, SigLIP, and Llama 2, generalize much better. This ability comes from training them on large-scale datasets from the internet. However, the largest robotic manipulation datasets contain only 100K to 1M examples, making it challenging to match this level of pretraining in robotics.

The paper discusses three existing lines of work in the field. The first is Visually-Conditioned Language Models (VLMs), which are trained on huge datasets from the internet to generate natural language from images and prompts, and are used in tasks like visual question answering and object localization. The second, Generalist Robot Policies, involves training multi-task “generalist” robot policies on large and diverse datasets so that they work across different robots. For instance, Octo can control multiple robots and easily adapt to new setups. The last is Vision-Language-Action Models (VLAs). VLMs have already been used in robotics for tasks like visual state representations, object detection, and high-level planning; VLAs go a step further and fine-tune a VLM to predict robot actions directly.

Researchers from Stanford University, UC Berkeley, Toyota Research Institute, Google DeepMind, and MIT have proposed OpenVLA, a 7B-parameter open-source VLA that sets a new state of the art for robot manipulation policies. OpenVLA is built on a pre-trained visually-conditioned language model backbone that captures visual details at various levels. It is fine-tuned on a large, diverse dataset of 970k robot manipulation trajectories from the Open X-Embodiment dataset. OpenVLA outperforms the previous leading model, the 55B-parameter RT-2-X, by 16.5% in absolute success rate across 29 tasks on the WidowX and Google Robot platforms.

OpenVLA can also be fine-tuned effectively across 7 different manipulation tasks, where fine-tuned OpenVLA policies perform better than fine-tuned pretrained policies like Octo. To train OpenVLA, the pre-trained Prismatic-7B VLM backbone is fine-tuned to predict robot actions. This prediction task is set up as a “vision-language” task, where an input observation image and a natural language task instruction are mapped to a sequence of predicted robot actions. To make this possible, each dimension of the robot’s actions is discretized into one of 256 bins, and the bin width is chosen to uniformly divide the interval between the 1st and 99th percentile of the actions in the training data. Using percentiles rather than the minimum and maximum keeps outlier actions from stretching the interval and coarsening the bins.
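
The binning scheme described above can be illustrated with a short Python sketch. This is not the authors' code: the function names (percentile, fit_bin_edges, discretize, undiscretize), the clipping of out-of-range values to the first or last bin, and the use of bin centers for decoding are assumptions made for illustration. Only the 256 bins and the 1st-to-99th-percentile interval come from the description above.

```python
from bisect import bisect_right
from typing import List, Sequence

N_BINS = 256  # bins per action dimension, as stated in the article


def percentile(values: Sequence[float], q: float) -> float:
    """Linearly interpolated percentile of values, with q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def fit_bin_edges(actions: Sequence[Sequence[float]]) -> List[List[float]]:
    """For each action dimension, return N_BINS + 1 uniformly spaced edges
    spanning the 1st to 99th percentile of the training actions."""
    edges = []
    for dim_values in zip(*actions):  # iterate over dimensions
        low, high = percentile(dim_values, 1), percentile(dim_values, 99)
        width = (high - low) / N_BINS
        edges.append([low + i * width for i in range(N_BINS + 1)])
    return edges


def discretize(action: Sequence[float], edges: List[List[float]]) -> List[int]:
    """Map a continuous action to one bin index (0..N_BINS-1) per dimension.
    Assumption: values outside the percentile range are clipped to the
    first or last bin."""
    tokens = []
    for value, dim_edges in zip(action, edges):
        idx = bisect_right(dim_edges, value) - 1
        tokens.append(max(0, min(N_BINS - 1, idx)))
    return tokens


def undiscretize(tokens: Sequence[int], edges: List[List[float]]) -> List[float]:
    """Map bin indices back to continuous values.
    Assumption: each bin is decoded to its center."""
    return [(e[t] + e[t + 1]) / 2 for t, e in zip(tokens, edges)]


if __name__ == "__main__":
    # Toy 2-D "training set": dimension 0 spans 0..1, dimension 1 spans -1..1.
    train = [[i / 100.0, -1.0 + i / 50.0] for i in range(101)]
    edges = fit_bin_edges(train)
    tokens = discretize([0.5, 0.0], edges)
    print(tokens, undiscretize(tokens, edges))
```

With 256 bins, the reconstruction error for an in-range value is at most half a bin width per dimension. How the resulting bin indices are mapped onto the language model's token vocabulary is a separate step that this sketch does not cover.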

The researchers found that both versions of Diffusion Policy match or outperform the generalist policies Octo and OpenVLA on simpler single-instruction tasks such as “Put Carrot in Bowl” and “Pour Corn into Pot”. However, on more complex fine-tuning tasks that involve multiple objects and require language instructions, the pre-trained generalist policies perform better. OpenX pretraining helps Octo and OpenVLA adapt to these diverse tasks, where understanding language is important. OpenVLA is the only approach that achieves at least a 50% success rate across all tested tasks, making it a strong default choice for imitation learning tasks, especially those involving a variety of language instructions.

In conclusion, the researchers have introduced OpenVLA, a state-of-the-art, open-source vision-language-action model that shows strong out-of-the-box performance for controlling different types of robots. It can be easily adapted to new robot setups via parameter-efficient fine-tuning techniques and is the only approach that achieves at least a 50% success rate across all tested tasks. However, it has several limitations. For now, OpenVLA supports only single-image observations, so future work includes extending it to support multiple image and proprioceptive inputs as well as observation history.

