Meet ViperGPT: A Python Framework that Combines Vision and Language Models Using Code Generation to Achieve State-of-the-Art Results

Neural Module Networks, a pioneering line of earlier work, aimed to decompose tasks into simpler modules. By training end to end with modules that were rearranged for different problems, each module was expected to learn its proper function and become reusable. In practice, however, several problems made this approach hard to apply. Program generation either required reinforcement learning from scratch or relied on hand-tuned natural language parsers, which made it difficult to optimize, and in both cases it was severely restricted to a narrow domain. Learning the perceptual models jointly with the program generator made training even harder and frequently failed to produce the intended modular structure.

Consider a query such as “How many muffins can each child eat for it to be fair?” (see Figure 1, top). To answer it, one would find the children and the muffins in the image, count how many of each there are, and then divide, reasoning that “fair” implies an equal split. People routinely compose several steps like these to understand the visual world. Yet end-to-end models, which do not inherently use this compositional reasoning, remain the dominant approach in computer vision. Although the field has made significant progress on individual tasks such as object detection and depth estimation, end-to-end approaches to complex tasks must still learn to perform every step implicitly within a neural network’s forward pass.

Figure 1: ViperGPT generates a program from a visual input and a query, then runs it with the Python interpreter to produce the result. The figure shows the generated code along with the values of intermediate variables during execution. By composing pretrained modules, ViperGPT produces answers to open-world queries that are both accurate and interpretable.
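The muffin query can be sketched as a short Python program. This is only an illustration of the find, count, and divide steps described above, not the code ViperGPT actually generates: `Detection`, `find`, and the fake image contents are hypothetical stand-ins for a pretrained detector and a real photo.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    box: tuple  # (x1, y1, x2, y2) in pixels


def find(image, label):
    """Hypothetical stand-in for a pretrained object detector.

    Here an "image" is just a list of precomputed detections.
    """
    return [d for d in image if d.label == label]


def muffins_per_child(image):
    # Step 1: locate the relevant objects.
    children = find(image, "child")
    muffins = find(image, "muffin")
    if not children:
        raise ValueError("no children found in the image")
    # Step 2: "fair" is read as an equal split, so plain integer division
    # does the job; no learned model is needed for the arithmetic.
    return len(muffins) // len(children)


# Made-up scene: two children and eight muffins.
fake_image = (
    [Detection("child", (0, 0, 10, 10))] * 2
    + [Detection("muffin", (20, 20, 25, 25))] * 8
)
print(muffins_per_child(fake_image))  # 4
```

Because the counting and the division are separate, visible steps, changing the number of muffins or children changes only the detector output, not the logic.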

This approach not only fails to take advantage of the progress made on fundamental vision tasks at each step, it also ignores the fact that computers can readily perform mathematical operations (such as division) without machine learning. Neural models cannot be relied on to generalize systematically to different numbers of muffins or children. End-to-end models also produce fundamentally opaque decisions, since there is no way to inspect the result of each step to pinpoint a failure. As models grow more data- and compute-hungry, this approach becomes increasingly untenable. Ideally, one would recombine existing models in new ways to solve new tasks without additional training. What prevents building similar modular solutions for more complex tasks?

In this study, researchers from Columbia University introduce ViperGPT, a framework that circumvents these constraints by using code-generating large language models (such as GPT-3 Codex) to flexibly compose vision models according to any textual query that specifies the task. For each query, it generates a specialized program that takes images or videos as arguments and returns the answer to the query for that image or video. The researchers show that generating these programs only requires giving Codex an API that exposes various visual capabilities (such as locate and compute depth), just as one might provide to an engineer. Thanks to its prior training on code, the model can reason about how to use these functions and implement the necessary logic.
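A minimal sketch of that loop, under stated assumptions, might look as follows. The API text, the prompt layout, the entry-point name `execute_command`, and `fake_code_model` (a canned stand-in for a code-generation model such as Codex) are all hypothetical and chosen for illustration; the real framework's prompt and function signatures may differ.

```python
# Hypothetical API specification shown to the code model as plain text.
API_SPEC = '''
def locate(image, object_name):
    """Return the bounding boxes of all objects matching object_name."""

def compute_depth(image, box):
    """Return the estimated depth of the region inside box."""
'''


def build_prompt(api_spec, query):
    # The model sees the API, then the query, and completes the function body.
    return f"{api_spec}\n# Query: {query}\ndef execute_command(image):\n"


def fake_code_model(prompt):
    """Stand-in for a code-generation model: returns a canned program."""
    return (
        "def execute_command(image):\n"
        "    boxes = locate(image, 'muffin')\n"
        "    return len(boxes)\n"
    )


def locate(image, object_name):
    """Toy implementation: an "image" is a list of (name, box) pairs."""
    return [box for name, box in image if name == object_name]


def run_query(image, query):
    prompt = build_prompt(API_SPEC, query)
    source = fake_code_model(prompt)
    # Run the generated source in a namespace that exposes only the API.
    namespace = {"locate": locate}
    exec(source, namespace)
    return namespace["execute_command"](image)


toy_image = [("muffin", (0, 0, 5, 5)), ("child", (9, 9, 20, 20)),
             ("muffin", (6, 0, 11, 5)), ("muffin", (12, 0, 17, 5))]
print(run_query(toy_image, "How many muffins are there?"))  # 3
```

Executing model-written code with `exec` is acceptable in a toy example; a real deployment would need sandboxing, which this sketch does not attempt.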

Their findings show that this simple approach delivers strong zero-shot performance (i.e., without training on task-specific images). The approach has several advantages:

  1. It is interpretable, since every step is explicit as a function call in the code with inspectable intermediate values.
  2. It is logical, since it explicitly uses Python’s built-in logical and mathematical operators.
  3. It is flexible, since any vision or language module can be incorporated simply by adding its specification to the API.
  4. It is compositional, breaking tasks into smaller subtasks that are carried out step by step.
  5. It adapts to advances in the field, since improvements to any of the modules it uses translate directly into better performance.
  6. It is training-free: no model needs to be retrained or fine-tuned for each new task.
  7. It is general, since it unifies all tasks in a single system.
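Two of these properties, inspectable intermediate values and easy module swapping, can be illustrated with a small sketch. The registry, the `traced` decorator, and the toy modules below are hypothetical and are not part of the ViperGPT library; they only show one way such behavior could be arranged in plain Python.

```python
import functools

TRACE = []    # (module name, result) pairs recorded during a run
MODULES = {}  # name -> callable; adding a capability means adding an entry


def traced(fn):
    """Record every module call so intermediate values can be inspected."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        TRACE.append((fn.__name__, result))
        return result
    return wrapper


def register(fn):
    MODULES[fn.__name__] = traced(fn)
    return fn


@register
def count_objects(scene, label):
    # Toy perception module: the "scene" is a list of labels.
    return sum(1 for item in scene if item == label)


@register
def fair_share(total, people):
    # Ordinary arithmetic, no learning involved.
    return total // people


def program(scene):
    kids = MODULES["count_objects"](scene, "child")
    muffins = MODULES["count_objects"](scene, "muffin")
    return MODULES["fair_share"](muffins, kids)


scene = ["child", "child"] + ["muffin"] * 8
answer = program(scene)
print(answer)  # 4
print(TRACE)   # [('count_objects', 2), ('count_objects', 8), ('fair_share', 4)]
```

Replacing `MODULES["count_objects"]` with a better counter would improve the answer without touching `program`, which is the sense in which module improvements carry over directly.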

In summary, their contributions are as follows:

  • They propose a simple framework, with the advantages listed above, for answering complex visual queries by integrating code-generation models into vision through an API and the Python interpreter.
  • They achieve state-of-the-art zero-shot results on visual grounding, image question answering, and video question answering, showing that this interpretability helps rather than hurts performance.
  • To encourage research in this direction, they develop a Python library that enables rapid development of programs for visual tasks, which will be open-sourced upon publication.

Check out the Paper, Code and Project. All credit for this research goes to the researchers on this project.

Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.
