Meet PaLM-E: A New 562-Billion Parameter Embodied Multimodal Language Model That Performs Tasks Such As Robotic Manipulation Planning, Visual QA

Strong reasoning abilities are displayed by large language models (LLMs) in a variety of fields, including conversation, step-by-step reasoning, math problem-solving, and code authoring. Although training LLMs on vast amounts of textual data can produce representations related to their physical environment, connecting those representations to real-world visual and physical sensor modalities is crucial to solving a wider range of grounded real-world problems in computer vision and robotics.

Previous work interfaces the output of LLMs with learned robotic policies and affordance functions to make decisions, but it is constrained in that way. The limitation of prior work is that the LLM only receives textual input, which is insufficient for many tasks where the geometric configuration of the scene is crucial. Moreover, their research demonstrates that cutting-edge visual language models trained on common vision-language tasks like visual question answering (VQA) cannot directly resolve robotic reasoning problems. In this study researchers from Google and TU Berlin suggest embodied language models, which directly include continuous inputs from an embodied agent’s sensor modalities and allow the language model to draw more accurate conclusions for sequential decision-making in the actual world. They develop PaLM-E which is a single big embodied multimodal model that displays positive transfer and can solve a range of embodied reasoning problems from different observation modalities on numerous embodiments. 

PaLM-E LLM exhibhits positive transfer where knowledge or skills from a learner’s first language (L1) can be applied to their second language (L2) learning, resulting in faster and more effective acquisition of the L2. For example, if a learner’s L1 has a similar grammar structure to the L2 they are learning, they may be able to use their knowledge of L1 grammar to understand and apply the rules of L2 grammar more quickly. Similarly, if a learner’s L1 and L2 share cognates (words that have a similar spelling and meaning in both languages), they may be able to quickly expand their L2 vocabulary by recognizing and remembering these cognates. Positive transfer can be contrasted with negative transfer, which occurs when knowledge or skills from a learner’s L1 interfere with their ability to acquire their L2. For example, if the grammar structure of a learner’s L1 is vastly different from that of their L2, they may struggle to apply L2 grammar rules correctly, even if they understand them intellectually.

Similar to how language tokens are processed by the self-attention layers of a Transformer-based LLM, inputs like pictures and state estimations are also incorporated into the same latent embedding as language tokens. They begin by injecting the continuous inputs through an encoder into a pre-trained LLM. These encoders have received end-to-end training to produce sequential judgments in natural language, which the embodied agent may understand by configuring low-level rules or responding to an embodied query. By contrasting various input representations (such as standard vs. object-centric ViT encodings for visual input), freezing vs. finetuning the language model while training the encoders, and examining whether co-training on multiple tasks enables to transfer, they assess the approach in a range of contexts.

They test the technique on three robotic manipulation domains (two of which are closed-loop in the real world), common visual-language tasks like VQA and picture captioning, and language tasks, to determine the breadth of the approach. According to their findings, multi-task training enhances performance compared to training models for single tasks. They demonstrate how this transfer between tasks may result in great data efficiency for robotics tasks, including exhibiting one-shot or zero-shot generalization to novel item combinations or unknown objects and considerably enhancing learning performance from small numbers of training samples. To their knowledge, the 540B PaLM LLM and the 22B Vision Transformer (ViT) are combined to create the biggest vision-language model that has ever been published, scaling PaLM-E up to 562B parameters.

Without using task-specific finetuning, PaLM-E-562B achieves state-of-the-art performance on the OK-VQA benchmark. They also discover that PaLM-E-562B displays a wide range of skills despite having been trained on only single-image examples, including zero-shot multimodal chain-of-thought (CoT) few-shot prompting, OCR-free arithmetic reasoning, and multiimage reasoning. Zero-shot CoT, initially a language-only notion, has, to their knowledge, yet to be shown using an end-to-end model on multimodal data with task-specific programs.

To summarize their primary contributions, they (1) suggest and show how embodied data may be included in training a multimodal big language model to create a generalist, transfer-learned, multi-embodiment decision-making agent. They demonstrate that, even though state-of-the-art general-purpose visual-language models do not effectively address embodied reasoning issues out of the box (zero-shot), it is possible to train a general-purpose visual-language model that is both an effective embodied reasoner and competent. In researching the optimal training of such models,

They (3) provide fresh architectural concepts, including entity-labeling multimodal tokens and neural scene representations. Last but not least, they (4) demonstrate that PaLM-E is also a quantitatively skilled vision and language generalist, in addition to their concentration on PaLM-E as an embodied reasoner, and (5) show that expanding the language model size enables multimodal finetuning with less catastrophic forgetting. Various demos can be found on their project website.

Check out the Paper and Github. All Credit For This Research Goes To the Researchers on This Project. Also, donÔÇÖt forget to join our 15k+ ML SubRedditDiscord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence from the Indian Institute of Technology(IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing and is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.

­čÉŁ Join the Fastest Growing AI Research Newsletter Read by Researchers from Google + NVIDIA + Meta + Stanford + MIT + Microsoft and many others...