Meet Cheetor: A Transformer-Based Multimodal Large Language Model (MLLM) That Effectively Handles a Wide Variety of Interleaved Vision-Language Instructions and Achieves State-of-the-Art Zero-Shot Performance

Through instruction tuning on collections of language tasks phrased as instructions, large language models (LLMs) have recently demonstrated an exceptional ability to act as general-purpose models for diverse tasks. By fine-tuning on a variety of tasks in a single instruction-response format, instruction tuning unlocks much of the zero-shot generalizability of LLMs to novel task instructions. This result has spurred a fresh wave of research on extending text-only instruction-following models to multimodal ones, a long-standing goal in numerous real-world applications. To this end, Flamingo and BLIP-2 equip LLMs with a frozen visual encoder to comprehend visual inputs. Follow-up efforts such as LLaVA, MiniGPT-4, and InstructBLIP further enhance instruction-following capability by fine-tuning on multimodal instruction-following datasets.

However, these Multimodal Large Language Models (MLLMs) primarily concentrate on vision-language instructions that include only a single picture as the visual context and have limited instruction variety, which constrains the applicability of such instruction-following assistants. In contrast, people often express their needs in real life through a series of related messages and visuals. For instance, people may need models to refer to several sources of multimodal knowledge (such as visually rich websites, textbooks, and class slides) to respond to an open-domain inquiry. Together, these references and the query form an interleaved vision-language instruction, in which multiple pictures and texts are semantically related.
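To make the idea of an interleaved vision-language instruction concrete, here is a minimal Python sketch of how such an instruction might be represented. The class names, field names, file names, and the `<image>` placeholder token are all hypothetical and chosen for illustration; they are not the data format used by the benchmark or the model.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ImageRef:
    """Placeholder for one image in the context (hypothetical type)."""
    path: str


@dataclass
class InterleavedInstruction:
    """One instruction whose context mixes text segments and images."""
    context: List[Union[str, ImageRef]]  # ordered text and image segments
    question: str                        # the instruction to follow
    response: str = ""                   # expected answer, if known

    def num_images(self) -> int:
        """Count the images that appear in the context."""
        return sum(isinstance(seg, ImageRef) for seg in self.context)

    def to_prompt(self, image_token: str = "<image>") -> str:
        """Flatten to text, marking each image position with a placeholder."""
        parts = [image_token if isinstance(seg, ImageRef) else seg
                 for seg in self.context]
        return " ".join(parts + [self.question])


# Illustrative example: two class slides plus an open-domain question.
example = InterleavedInstruction(
    context=[
        "Slide 1 introduces the water cycle.", ImageRef("slide1.png"),
        "Slide 2 shows condensation.", ImageRef("slide2.png"),
    ],
    question="Which slide depicts cloud formation?",
)
print(example.num_images())
print(example.to_prompt())
```

The point of the sketch is that image positions are preserved in order relative to the text, which is what distinguishes an interleaved instruction from a single-image prompt.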

To aid research in interleaved vision-language instruction following, researchers from Zhejiang University, National University of Singapore, and Nanyang Technological University developed I4 (semantically Interconnected, Interleaved Image-Text Instruction-Following), a comprehensive large-scale benchmark of 31 tasks with varied instructions in a unified instruction-response format, covering 20 different scenarios. I4 has three crucial traits. (1) All instructions comprise sequences of interrelated pictures and words, such as storyboards with scripts and textbooks with diagrams. This is known as an interleaved vision-language context. (2) The instructions are diverse and sophisticated; the tasks range from conversational embodied activities to identifying discrepancies in surveillance photos to predicting speech for comics. (3) The benchmark covers various instruction-following scenarios, including cartoons, commercial imagery, driving footage, recipe instructions, and more.

The researchers systematically assess contemporary MLLMs on the proposed benchmark and find that they struggle to carry out such sophisticated multimodal instructions. Although present MLLMs mostly concentrate on building sophisticated ways to create more varied and higher-quality instruction tuning data, the researchers contend that the Visual Prompt Generator (VPG) plays a crucial role in understanding complicated instructions. To adapt LLMs to visual inputs, existing approaches propose several VPGs (such as linear projection, Resampler, and Q-former) that extract relevant visual cues from the rich image information encoded by vision backbones (such as ViT).

Existing approaches train the VPG on millions of image-caption pairs by requiring the frozen LLM to generate captions conditioned on the extracted visual cues. Although this is efficient, web-crawled captions typically describe only a small portion of the image’s foreground. As a result, the VPG may fail to extract the precise information needed for some tasks, because it is only taught to extract the obvious information that typical captions mention. This problem worsens in I4, where the tasks require the VPG to attend to specific visual details in relation to other images in the context (for example, to convey the fine differences between two photos).

To address this critical issue of the VPG in existing MLLMs, they propose a lightweight Controllable Knowledge Re-Injection (CLORI) module that uses the sophisticated reasoning capabilities of LLMs to control the VPG (i.e., the Q-former) so that it re-extracts the missing visual information conditioned on instruction-specific semantics. More precisely, they first use the Q-former to provide task-independent visual cues that give the LLM essential information about the pictures. They then derive instruction-specific conditions from the language model to control the Q-former and conditionally extract specific information from the pictures. This conditional visual information is then re-injected into the LLM.
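The two-pass idea described above (a generic extraction pass followed by a condition-guided pass) can be illustrated with a toy cross-attention sketch in plain Python. This is not the authors' implementation: the function names, the tiny 2-dimensional vectors, and in particular the choice to add the condition vector to each query are assumptions made purely for illustration.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries: List[Vector], patches: List[Vector]) -> List[Vector]:
    """Each query attends over all image patches and returns a weighted mix."""
    dim = len(patches[0])
    scale = math.sqrt(dim)
    outputs = []
    for q in queries:
        weights = softmax([dot(q, p) / scale for p in patches])
        mixed = [sum(w * p[d] for w, p in zip(weights, patches))
                 for d in range(dim)]
        outputs.append(mixed)
    return outputs


def reinject(queries: List[Vector], patches: List[Vector],
             condition: Vector) -> List[Vector]:
    """Two-pass extraction: generic cues first, then condition-guided cues."""
    generic = cross_attend(queries, patches)  # task-independent pass
    # Assumption: the condition simply shifts each query. The real module
    # may combine queries and conditions differently.
    guided = [[qi + ci for qi, ci in zip(q, condition)] for q in queries]
    specific = cross_attend(guided, patches)  # instruction-specific pass
    return generic + specific                 # both sets are passed onward


patches = [[1.0, 0.0], [0.0, 1.0]]  # two toy image patches
queries = [[0.0, 0.0]]              # one toy query, initially uninformative
condition = [0.0, 4.0]              # stand-in for an LLM-derived condition
tokens = reinject(queries, patches, condition)
print(tokens)
```

In this toy setup the unconditioned query spreads its attention evenly over both patches, while the conditioned query concentrates on the second patch, which mirrors the intuition of re-extracting details that the generic pass glossed over.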

To train the CLORI module, they introduce a cross-attention-guided counterfactual image training strategy (CAGIT). Using internal cross-attention maps, they first determine the regions of a picture that the Q-former has largely disregarded. After that, they use ChatGPT and SAM to identify the editing targets and produce suitable editing descriptions. Next, they use Blended Diffusion to make local edits to the original image according to the editing descriptions, creating a counterfactual image. An inter-image discriminative pre-training task then requires the model to describe the minute differences between the counterfactual image and the original image. Since the edited regions are selected from the most neglected places, the CLORI module must extract the missing visual information based on the counterfactual image and the task instruction.
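The first step, locating the regions the Q-former has largely disregarded, can be sketched as ranking image patches by the total attention they receive. The function name, the toy attention matrix, and the rule of summing attention over queries are assumptions for illustration only; the article does not specify how the attention maps are aggregated.

```python
from typing import List


def most_neglected_patches(attn: List[List[float]], k: int) -> List[int]:
    """Return indices of the k patches with the least total attention.

    attn[q][p] is the attention weight that query q places on patch p.
    """
    num_patches = len(attn[0])
    # Total attention each patch receives across all queries.
    received = [sum(row[p] for row in attn) for p in range(num_patches)]
    # Sort patch indices from least attended to most attended.
    ranked = sorted(range(num_patches), key=lambda p: received[p])
    return ranked[:k]


# Toy attention map: 2 queries over 4 patches (each row sums to 1).
attn = [
    [0.70, 0.20, 0.05, 0.05],
    [0.10, 0.60, 0.10, 0.20],
]
neglected = most_neglected_patches(attn, k=2)
print(neglected)
```

Patches selected this way would be the candidates for local editing, so that describing the edit forces the model to look where it previously did not.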

They propose Cheetor, a Transformer-based MLLM that can effectively build holistic semantics from a variety of complex vision-language instructions thanks to controllable knowledge re-injection. The lightweight CLORI module can be efficiently tuned using the CAGIT technique with fewer than 1 million image-text pairs. Tuning finishes in several hours on a single A100 GPU, without the need for enormous multimodal instruction tuning data. While being computation- and data-efficient, their model performs notably better than previous MLLMs on the challenging I4 benchmark. Additionally, they assess Cheetor on the MME benchmark, where their model also performs well.

Their contributions are summarized as follows: (1) They construct I4, a thorough benchmark for interleaved vision-language instruction following, consisting of 31 tasks that cover a wide range of real-world settings. (2) They provide a lightweight controllable knowledge re-injection (CLORI) module that, in response to LLM-generated conditions, complementarily re-injects instruction-specific visual information into the LLM. (3) Using just 30k pictures, they successfully train the CLORI module with a cross-attention-guided counterfactual image training technique. (4) Their Cheetor achieves state-of-the-art performance on the challenging I4 benchmark at a cost of 7 A100 GPU hours, even without high-quality multimodal instruction tuning data.

Check out the Paper and GitHub. All credit for this research goes to the researchers on this project.

Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.