Playing Where’s Waldo? in 3D: OpenMask3D is an AI Model That Can Segment Instances in 3D with Open-Vocabulary Queries

Image segmentation has come a long way in the last decade, thanks to advances in neural networks. It is now possible to segment multiple objects in complex scenes in a matter of milliseconds, with fairly accurate results. In 3D, however, the analogous task of instance segmentation still has a way to go before it catches up with 2D image segmentation performance.

3D instance segmentation has emerged as a critical task with significant applications in fields such as robotics and augmented reality. The objective of 3D instance segmentation is to predict object instance masks and their corresponding categories in a 3D scene. While notable progress has been made in this field, existing methods predominantly operate under a closed-set paradigm, where the set of object categories is limited and closely tied to the datasets used for training. 

This limitation poses two fundamental problems. First, closed-vocabulary approaches struggle to understand scenes beyond the object categories encountered during training, leading to potential difficulties in recognizing novel objects or misclassifying them. Second, these methods are inherently limited in their capacity to handle free-form queries, impeding their effectiveness in scenarios that require understanding and acting upon specific object properties or descriptions.

Open-vocabulary approaches are proposed to tackle these challenges. These approaches can handle free-form queries and enable zero-shot learning of object categories not present in the training data. By adopting a more flexible and expansive approach, open-vocabulary methods offer several advantages in tasks such as scene understanding, robotics, augmented reality, and 3D visual search. 

Enabling open-vocabulary 3D instance segmentation can significantly enhance the flexibility and practicality of applications that rely on understanding and manipulating complex 3D scenes. Let’s meet OpenMask3D, the promising 3D instance segmentation model.

OpenMask3D can segment instances of objects.

OpenMask3D aims to overcome the limitations of closed-vocabulary approaches. It tackles the task of predicting 3D object instance masks and computing mask-feature representations while reasoning beyond a predefined set of concepts. OpenMask3D operates on RGB-D sequences and leverages the corresponding 3D reconstructed geometry to achieve its objectives. 

It uses a two-stage pipeline consisting of a class-agnostic mask proposal head and a mask-feature aggregation module. For each proposed mask, OpenMask3D selects the frames in which that instance is most clearly visible and extracts CLIP features from crops of the mask in those views. The resulting feature representation is aggregated across multiple views and associated with each 3D instance mask. This instance-based feature computation equips OpenMask3D with the capability to retrieve object instance masks based on their similarity to any given text query, enabling open-vocabulary 3D instance segmentation and surpassing the limitations of closed-vocabulary paradigms.
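The multi-view aggregation step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each mask already has per-view embedding vectors (which in OpenMask3D would come from CLIP image crops of the mask's best views; here they are random placeholders), and it simply averages and L2-normalizes them so that cosine similarity later reduces to a dot product.

```python
import numpy as np

def aggregate_mask_feature(view_features: np.ndarray) -> np.ndarray:
    """Average per-view embeddings for one instance mask, then
    L2-normalize the result.

    view_features: shape (num_views, embed_dim). In OpenMask3D these
    would be CLIP image embeddings of crops around the frames where
    the mask is most visible; here we use placeholders.
    """
    mean = view_features.mean(axis=0)
    return mean / np.linalg.norm(mean)

# Toy example: 5 views, 512-dim embeddings (the CLIP ViT-B/32 size).
rng = np.random.default_rng(0)
feat = aggregate_mask_feature(rng.normal(size=(5, 512)))
print(feat.shape)
```

Averaging before normalizing is one common design choice for fusing view-level embeddings; the key property is that every mask ends up with a single unit-length vector in the same embedding space as the text encoder.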

Overview of OpenMask3D.

By computing a mask feature per object instance, OpenMask3D can retrieve object instance masks based on similarity to any given query, making it capable of performing open-vocabulary 3D instance segmentation. Moreover, OpenMask3D preserves information about the novel and long-tail objects better than trained or fine-tuned counterparts. It also surpasses the limitations of a closed-vocabulary paradigm, enabling the segmentation of object instances based on free-form queries related to object properties such as semantics, geometry, affordances, and material properties.
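The retrieval step this paragraph describes amounts to ranking per-mask features by cosine similarity against the embedding of a free-form text query. The sketch below is a hedged illustration under that assumption: `query_emb` stands in for a CLIP text embedding (here synthesized as a noisy copy of one mask feature so the expected winner is known), and `retrieve_masks` is a hypothetical helper, not an OpenMask3D API.

```python
import numpy as np

def retrieve_masks(query_emb: np.ndarray, mask_feats: np.ndarray, top_k: int = 2):
    """Rank 3D instance masks by cosine similarity to a text embedding.

    mask_feats: shape (num_masks, dim), assumed L2-normalized.
    query_emb: embedding of a free-form query (placeholder here).
    Returns the indices and scores of the top_k most similar masks.
    """
    q = query_emb / np.linalg.norm(query_emb)
    scores = mask_feats @ q           # cosine similarity via dot product
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

# Toy setup: 4 unit-norm mask features; the query is mask 2 plus noise.
rng = np.random.default_rng(1)
feats = rng.normal(size=(4, 512))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
query = feats[2] + 0.01 * rng.normal(size=512)
idx, scores = retrieve_masks(query, feats)
print(idx[0])  # mask 2 ranks first
```

Because both the mask features and the query live in the same CLIP embedding space, this one similarity computation is what lets the model answer queries about semantics, geometry, affordances, or materials without any category-specific training.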

Check out the Paper and Project. Don’t forget to join our 25k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at

🚀 Check Out 100s of AI Tools in AI Tools Club

Ekrem Çetinkaya received his B.Sc. in 2018, and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis about image denoising using deep convolutional networks. He received his Ph.D. degree in 2023 from the University of Klagenfurt, Austria, with his dissertation titled "Video Coding Enhancements for HTTP Adaptive Streaming Using Machine Learning." His research interests include deep learning, computer vision, video encoding, and multimedia networking.

🐝 Join the Fastest Growing AI Research Newsletter Read by Researchers from Google + NVIDIA + Meta + Stanford + MIT + Microsoft and many others...