DeepMind Researchers Develop ‘Gato’: A Multi-Modal, Multi-Task, Multi-Embodiment AI Generalist Policy Tool That Can Perform Over 600 Tasks

This Article Is Based On The Research Paper 'A Generalist Agent'. All Credit For This Research Goes To The Researchers

Using a single neural sequence model for all tasks has several advantages. It removes the need to hand-craft a policy model with appropriate inductive biases for each domain. And because a sequence model can consume any data that can be serialized into a flat sequence, it increases both the amount and the diversity of data available for training.

Furthermore, the performance of such models continues to improve even at the current frontier of data, compute, and model scale. Generic models that make better use of computation have historically tended to overtake more specialized, domain-specific methods.

In a recent study, DeepMind researchers described the current iteration of Gato, a general-purpose agent instantiated as a single, large transformer sequence model. With just one set of weights, Gato can hold a conversation, caption images, stack blocks with a real robot arm, play Atari games, navigate simulated 3D environments, follow instructions, and more.

No agent can be expected to excel at every imaginable control task, especially tasks far outside its training distribution. The researchers instead tested the hypothesis that it is possible to train an agent that is generally capable across a wide range of tasks, and that this general agent can then be adapted to perform even more tasks with little additional data.

The researchers hypothesized that such an agent can be obtained by scaling data and model parameters while continually broadening the training distribution to cover the behaviors of interest. In this setting, natural language can act as a common grounding across otherwise incompatible embodiments, allowing for combinatorial generalization to new behaviors.


Training focused on the operating point of model scale that still allows real-time control of real-world robots, which for Gato is currently about 1.2B parameters. As hardware and model architectures improve, this operating point will naturally increase the feasible model size, pushing generalist models further up the scaling-law curve.

Gato’s design principle is to train on as much relevant data as possible, spanning modalities such as images, text, button presses, and other discrete and continuous observations and actions. To make this multimodal data processable by a single model, the researchers serialize all of it into a flat sequence of tokens. From this representation, Gato can be trained and sampled in the same way as a standard large-scale language model.
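To make the idea of a flat token sequence more concrete, here is a minimal Python sketch of how text, continuous observations, and discrete actions might be mapped into one list of integers. It is a toy under stated assumptions, not the researchers' implementation: the function names, the whitespace "tokenizer", the vocabulary size, the bin count, and the mu-law constants are all illustrative choices that do not come from this article.

```python
import math

# All constants below are illustrative assumptions, not values from the article.
TEXT_VOCAB_SIZE = 32000   # hypothetical size of the text vocabulary
NUM_BINS = 1024           # hypothetical number of bins for continuous values


def mu_law(x, mu=100.0, m=256.0):
    """Compress a real value with a mu-law style curve (roughly into [-1, 1])."""
    return math.copysign(math.log(abs(x) * mu + 1.0) / math.log(m * mu + 1.0), x)


def tokenize_continuous(values):
    """Map floats to integer tokens placed after the text vocabulary range."""
    tokens = []
    for v in values:
        squashed = max(-1.0, min(1.0, mu_law(v)))          # clip to [-1, 1]
        bin_index = int((squashed + 1.0) / 2.0 * NUM_BINS)  # uniform binning
        bin_index = min(NUM_BINS - 1, bin_index)            # keep the top edge in range
        tokens.append(TEXT_VOCAB_SIZE + bin_index)
    return tokens


def tokenize_text(text, vocab):
    """Toy whitespace tokenizer; a real system would use a subword tokenizer."""
    return [vocab[word] for word in text.split()]


def serialize_timestep(instruction, observation, action, vocab):
    """Concatenate every modality of one timestep into a single flat token list."""
    return (
        tokenize_text(instruction, vocab)
        + tokenize_continuous(observation)
        + [int(a) for a in action]  # discrete actions are already integers
    )


if __name__ == "__main__":
    toy_vocab = {"stack": 0, "the": 1, "red": 2, "block": 3}
    sequence = serialize_timestep(
        "stack the red block", [0.0, 0.5, -2.0], [3, 1], toy_vocab
    )
    print(sequence)
```

Once every modality lives in one integer sequence like this, the same next-token prediction setup used for language models can in principle be applied to the whole stream. In this toy version, text ids and discrete action ids share the same low integer range, so only position in the sequence distinguishes them.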


Transformer sequence models work well as multi-task, multi-embodiment policies, including for real-world text, vision, and robotics tasks. They also show promise for few-shot learning of out-of-distribution tasks. In the future, rather than training from scratch, such models could serve as a default starting point for learning new behaviors through prompting or fine-tuning. The researchers expect that scaling parameters, data, and compute will improve performance across all tasks, including dialogue, and that better hardware and network architectures will allow larger models to be trained while preserving real-time robot control.