Salesforce AI Open-Sources ‘LAVIS,’ A Deep Learning Library For Language-Vision Research/Applications

Recent years have seen remarkable progress in sophisticated language-vision models. Real-world applications rely heavily on multimodal content, particularly language-vision data, which combines text, images, and videos.

However, training and evaluating these models across tasks and datasets requires domain knowledge, and they are not always accessible to new researchers and practitioners. This is primarily because preparing the necessary experimental setup is laborious and time-consuming, regardless of the model, dataset, or evaluation task involved.

Salesforce researchers have developed LAVIS (short for LAnguage-VISion), an open-source library for training and evaluating state-of-the-art language-vision models on a rich family of common tasks and datasets and for off-the-shelf inference on customized language-vision data. This will make the emerging language-vision intelligence and capabilities available to a wider audience, encourage practical adoption, and reduce repetitive efforts in future development.

LAVIS is an all-inclusive, modular, and future-proof language-vision library that works with standard tasks, data sets, and cutting-edge models. LAVIS’s overarching goal is to offer data scientists, machine learning engineers, and academics a streamlined means to examine, troubleshoot, and clarify their multimodal data.

LAVIS's most notable features are:

  1. Standardized, modular interfaces. The library's essential components are organized in a unified, modular structure. This facilitates rapid development, easy integration of new or external components, and off-the-shelf use of individual components. Thanks to the modular design, model inference, including multimodal feature extraction, is straightforward.
  2. Full support for image-text and video-text tasks and datasets. LAVIS supports more than ten popular language-vision tasks spanning more than 20 public datasets. By standardizing these tasks and datasets, the team has created a thorough and consistent benchmark for evaluating language-vision models.
  3. Reproducible state-of-the-art language-vision models. Over 30 fine-tuned model checkpoints built on the ALBEF, BLIP, CLIP, and ALPRO base models are available through the library. These models achieve competitive performance across tasks as measured by standard metrics. Training and evaluation scripts and configurations are also included to make language-vision research and applications more reproducible.
  4. Rich, practical resources and tooling. The library ships with supplementary materials that help students and researchers overcome common hurdles in language-vision work. These include a GUI-based dataset browser for previewing downloaded datasets, automatic download tools for preparing the supported datasets, and dataset cards detailing sources, supported tasks, standard metrics, and leaderboards.
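The modular access to individual components and the checkpoint zoo can be sketched as follows. This assumes `pip install salesforce-lavis` and network access for the checkpoint download, and it uses a blank placeholder image and a made-up caption in place of real data:

```python
# Browsing the model zoo and extracting multimodal features with LAVIS (sketch).
# Assumes `pip install salesforce-lavis`; the image/text inputs are placeholders.
import torch
from PIL import Image
from lavis.models import model_zoo, load_model_and_preprocess

# model_zoo lists the architecture/checkpoint pairs shipped with the library.
print(model_zoo)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip_feature_extractor", model_type="base", is_eval=True, device=device
)

raw_image = Image.new("RGB", (224, 224), color="white")  # placeholder image
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text = txt_processors["eval"]("a photo of a plain white square")

# mode="multimodal" fuses both inputs; "image" or "text" give unimodal embeddings.
features = model.extract_features(
    {"image": image, "text_input": [text]}, mode="multimodal"
)
print(features.multimodal_embeds.shape)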

According to the team, extending the library's current selection of language-vision models, tasks, and datasets is a top priority for future releases. They also plan to add stronger parallelism support for scalable training and inference.

This article is written as a research summary by Marktechpost Staff based on the research paper 'LAVIS: A Library for Language-Vision Intelligence'. All credit for this research goes to the researchers on this project. Check out the paper, GitHub link, and reference article.


Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a keen interest in the applications of artificial intelligence across various fields. She is passionate about exploring new technological advancements and their real-life applications.
