Salesforce AI Researchers Introduce Mask-free OVIS: An Open-Vocabulary Instance Segmentation Mask Generator

Instance segmentation is the computer vision task of identifying and separating individual objects in an image, including multiple objects of the same class, by treating each as a distinct entity. Over the past few years, the number of instance segmentation techniques has grown significantly, driven by rapid advances in deep learning. Convolutional neural networks (CNNs) and architectures such as Mask R-CNN are commonly used for the task. The defining characteristic of these techniques is that they combine object detection with pixel-wise segmentation, identifying objects and generating a mask for each instance in an image, which leads to a better understanding of the overall scene.
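To make the link between detection and pixel-wise segmentation concrete, here is a minimal sketch of how two objects of the same class can be kept apart as separate instances. The data layout (a list of dictionaries with a label and a binary mask) and the helper name `mask_to_box` are assumptions chosen for illustration; they are not taken from Mask R-CNN or from the Salesforce work.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive
Mask = List[List[int]]           # binary mask, one row per image row


def mask_to_box(mask: Mask) -> Optional[Box]:
    """Return the tightest box around the nonzero pixels, or None if empty."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        return None
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


# Two objects of the same class, each with its own mask (illustrative data).
instances = [
    {"label": "cat", "mask": [[1, 1, 0, 0, 0],
                              [1, 1, 0, 0, 0],
                              [0, 0, 0, 0, 0]]},
    {"label": "cat", "mask": [[0, 0, 0, 0, 0],
                              [0, 0, 0, 1, 1],
                              [0, 0, 0, 1, 1]]},
]

boxes = [mask_to_box(inst["mask"]) for inst in instances]
```

Both entries carry the label "cat", yet each has its own mask and its own box, which is what distinguishes instance segmentation from per-class (semantic) segmentation.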

However, existing detection models are limited in the number of base categories they can identify. For example, a detection model trained on the COCO dataset can detect approximately 80 categories. Handling additional categories requires further human annotation, which is laborious and time-consuming. To counter this, Open Vocabulary (OV) methods leverage image-caption pairs and vision-language models to learn new categories. However, the supervision available for base categories differs greatly from that available for novel categories, which often leads to overfitting on base categories and poor generalization to novel ones. There is therefore a strong need for a method that lets detection models learn new categories with little human intervention, making them more practical and scalable for real-world applications.

To address this issue, researchers at Salesforce AI have devised a method that generates bounding-box and instance-mask annotations from an image-caption pair. Their proposed method, the Mask-free OVIS pipeline, uses weak supervision in the form of pseudo-mask annotations derived from a vision-language model to learn both base and novel categories. This approach eliminates the need for laborious human annotation and addresses the issue of overfitting. According to the researchers' experimental evaluations, the method surpasses existing state-of-the-art open-vocabulary instance segmentation models. The work was accepted at the Computer Vision and Pattern Recognition Conference (CVPR) in 2023.

The pipeline consists of two main stages: pseudo-mask generation and open-vocabulary instance segmentation. In the first stage, a pseudo-mask annotation is created for the object of interest from the image-caption pair. The object’s name serves as a text prompt for a pre-trained vision-language model, which localizes the object. An iterative masking process with GradCAM then refines the pseudo-mask so that it covers the entire object accurately. In the second stage, a weakly-supervised segmentation (WSS) network is trained to select the proposal with the highest overlap with the GradCAM activation map, using the previously generated bounding boxes. Finally, a Mask R-CNN model is trained on the generated pseudo annotations, completing the pipeline.
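The proposal-selection step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the article does not say how overlap is measured, so the sketch assumes intersection-over-union between a box and the thresholded activation map. The function names, the 0.5 threshold, and the toy activation map are all hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive


def overlap_score(activation: List[List[float]], box: Box,
                  threshold: float = 0.5) -> float:
    """IoU between a box and the region where activation >= threshold.

    Assumption: IoU on a thresholded map stands in for the overlap
    measure, which the article does not specify.
    """
    x0, y0, x1, y1 = box
    hot_total = sum(1 for row in activation for v in row if v >= threshold)
    hot_inside = sum(
        1
        for y in range(y0, y1)
        for x in range(x0, x1)
        if activation[y][x] >= threshold
    )
    box_area = (x1 - x0) * (y1 - y0)
    union = box_area + hot_total - hot_inside
    return hot_inside / union if union else 0.0


def select_proposal(activation: List[List[float]], proposals: List[Box],
                    threshold: float = 0.5) -> Box:
    """Pick the proposal whose box best overlaps the activated region."""
    return max(proposals, key=lambda b: overlap_score(activation, b, threshold))


# Toy 6x6 activation map with a 3x3 hot region (rows 1-3, columns 2-4).
activation = [[0.0] * 6 for _ in range(6)]
for y in range(1, 4):
    for x in range(2, 5):
        activation[y][x] = 0.9

proposals = [(0, 0, 6, 6), (2, 1, 5, 4), (0, 0, 3, 3)]
best = select_proposal(activation, proposals)
```

In this toy case the tight box around the hot region wins over the whole-image box, because IoU penalizes a box for covering area outside the activated region.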

The pipeline thus eliminates the need for human involvement by using pre-trained vision-language models and weakly supervised models to automatically generate pseudo-mask annotations, which can be employed as additional training data. To evaluate their pipeline, the researchers conducted several experiments on widely used datasets such as MS-COCO and OpenImages. The findings showed that training with pseudo-annotations yields strong performance in detection and instance segmentation tasks, surpassing other methods that depend on human annotations. The vision-language guided approach to pseudo-annotation generation devised by the Salesforce researchers paves the way for more advanced and precise instance segmentation models that do not require human annotators.
