Salesforce AI Research Introduces BLIP-2: A Generic And Efficient Vision-Language Pre-Training Strategy That Bootstraps From Frozen Image Encoders And Frozen Large Language Models (LLMs)

Research on vision-language pretraining (VLP) has advanced quickly in the past few years. Progressively larger pre-trained models have continually pushed the state of the art on numerous downstream tasks. However, because they are trained end-to-end with large-scale models and datasets, most cutting-edge vision-language models incur a substantial computation cost during pretraining.

Vision-language research sits at the intersection of vision and language, so vision-language models are naturally expected to build on the readily available unimodal models from the vision and natural language communities.

A recent work by Salesforce researchers introduces BLIP-2 (Bootstrapping Language-Image Pre-training), a general and compute-efficient VLP technique that uses frozen unimodal models for pretraining. The technique bootstraps from off-the-shelf pre-trained vision and language models. Large language models (LLMs), in particular, provide strong language generation and zero-shot transfer capabilities. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, such as visual question answering, image captioning, and image-text retrieval.

To use pre-trained unimodal models for VLP, cross-modal alignment must be made possible. The unimodal pre-trained models remain frozen during pretraining to save on computing costs and prevent catastrophic forgetting. However, freezing them makes vision-language alignment particularly difficult because LLMs have not seen any images during their unimodal pretraining. This study demonstrates that the image-to-text generation loss used by previous approaches in this context is insufficient to close the modality gap.

Flamingo is one of the earlier systems that relied on an image-to-text generative loss; according to the researchers, such a loss alone is not enough to bridge the modality gap. To achieve effective vision-language alignment with frozen unimodal models, they propose a Querying Transformer (Q-Former) pre-trained with a novel two-stage pretraining technique. Q-Former is a lightweight transformer that extracts visual features from a frozen image encoder using a set of trainable query vectors. It acts as an information bottleneck between the frozen image encoder and the frozen LLM, feeding the most useful visual features to the LLM so that it can generate the desired text.
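The bottleneck idea can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a single cross-attention step with no learned projections, self-attention, or feed-forward layers, and the sizes (`NUM_QUERIES`, `NUM_PATCHES`, `DIM`) are toy values chosen only for illustration. The point is that a small, fixed number of query vectors summarizes a larger set of frozen image features.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, image_features):
    """Single-head cross-attention: each query becomes a weighted average of image features."""
    dim = len(queries[0])
    keys_t = [list(col) for col in zip(*image_features)]  # transpose to dim x patches
    scores = matmul(queries, keys_t)  # num_queries x patches
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, image_features), weights


random.seed(0)
NUM_QUERIES, NUM_PATCHES, DIM = 4, 16, 8  # toy sizes, assumed for illustration

# Stand-in for the frozen image encoder's output: fixed input, never updated.
image_features = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]
# Trainable query vectors: the only values in this sketch that training would update.
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

visual_summary, attention = cross_attend(queries, image_features)
print(len(visual_summary), len(visual_summary[0]))  # 4 8
```

However many patch features the image encoder emits, the output always has one row per query, which is what lets the module act as a fixed-size bottleneck in front of the LLM.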

In the first pretraining stage, they perform vision-language representation learning, which pushes the Q-Former to learn the visual representations most relevant to the text. In the second pretraining stage, the team connects the Q-Former's output to a frozen LLM and performs vision-to-language generative learning, training the Q-Former so that the LLM can interpret its visual representation.
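One way to picture the second stage is sketched below. It rests on assumptions the article does not spell out: the Q-Former output is mapped to the LLM's embedding width by a small trainable linear layer and then placed in front of the text embeddings as a visual prompt. All names and sizes here (`proj_weight`, `LLM_DIM`, and so on) are hypothetical, and random numbers stand in for real model outputs.

```python
import random


def linear(rows, weight, bias):
    """Apply y = x W + b to each row; weight has shape in_dim x out_dim."""
    return [
        [sum(x * w for x, w in zip(row, col)) + b for col, b in zip(zip(*weight), bias)]
        for row in rows
    ]


random.seed(0)
NUM_QUERIES, QFORMER_DIM, LLM_DIM, NUM_TEXT_TOKENS = 4, 8, 12, 5  # toy sizes (assumed)

# Stand-in for the Q-Former's output: one vector per query.
query_outputs = [[random.gauss(0, 1) for _ in range(QFORMER_DIM)] for _ in range(NUM_QUERIES)]
# Stand-in for the frozen LLM's embeddings of the text tokens.
text_embeddings = [[random.gauss(0, 1) for _ in range(LLM_DIM)] for _ in range(NUM_TEXT_TOKENS)]

# Hypothetical trainable projection from Q-Former width to LLM width.
proj_weight = [[random.gauss(0, 0.02) for _ in range(LLM_DIM)] for _ in range(QFORMER_DIM)]
proj_bias = [0.0] * LLM_DIM

# Project the visual summary, then put it in front of the text as a soft prompt.
visual_prompt = linear(query_outputs, proj_weight, proj_bias)
llm_input = visual_prompt + text_embeddings

# Which components would receive gradient updates in this sketch.
trainable = {"image_encoder": False, "q_former": True, "projection": True, "llm": False}
print(len(llm_input), len(llm_input[0]))  # 9 12
```

Because the image encoder and the LLM stay frozen, the generative loss in this setup can only adjust the small modules in between, which is where the compute savings come from.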

The lightweight Q-Former and the use of frozen unimodal models make BLIP-2 more compute-efficient than existing state-of-the-art methods. On zero-shot VQAv2, BLIP-2 outperforms Flamingo by 8.7% while using 54x fewer trainable parameters.

The findings demonstrate that BLIP-2 is a general approach that can harvest more advanced unimodal models for improved VLP performance. Powered by LLMs such as FlanT5, BLIP-2 can perform zero-shot image-to-text generation that follows natural language instructions, which enables new capabilities such as visual knowledge reasoning and visual conversation. BLIP-2 can therefore readily benefit from further progress in LLMs and pre-trained vision models. The researchers believe this is a crucial step toward building an intelligent multimodal conversational AI.

Check out the Paper and GitHub. All credit for this research goes to the researchers on this project.

Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a keen interest in the applications of artificial intelligence across various fields. She is passionate about exploring new advancements in technology and their real-life applications.