Meet SegGPT: A Generalist Model that Performs Arbitrary Segmentation Tasks in Images or Videos Via in-Context Inference

Segmentation is one of the most fundamental problems in computer vision. It seeks to locate and reorganize meaningful concepts at the pixel level, such as foreground, category, and object instance. Researchers have made considerable strides in recent years on a variety of segmentation tasks, including foreground segmentation, interactive segmentation, semantic segmentation, instance segmentation, and panoptic segmentation. These specialist segmentation models, however, are restricted to particular tasks, classes, granularities, data formats, etc. A new model must be trained when adapting to a new setting, such as segmenting a novel concept or segmenting objects in videos rather than images. This calls for time-consuming annotation work and is not sustainable across a large number of segmentation tasks.

In this study, the authors' goal is to train a single model that can handle an unlimited variety of segmentation tasks. The main difficulties lie in two areas: (1) incorporating vastly different data types into training, such as part, semantic, instance, panoptic, person, medical image, aerial image, etc.; and (2) designing a generalizable training scheme that, unlike traditional multi-task learning, is flexible in task definition and can handle tasks outside its training domain. To overcome these issues, researchers from the Beijing Academy of Artificial Intelligence, Zhejiang University and Peking University introduce SegGPT, a generalist model for segmenting anything in context.

They unify many segmentation tasks in a generalist in-context learning framework and treat segmentation as a generic format for visual perception. The framework can handle various segmentation data types by converting them to the same image format. Using a random colour mapping for each data sample, SegGPT training is formulated as an in-context colouring problem. The goal is to colour only the corresponding regions, such as classes, object instances, or parts, according to the context. Because the colours are assigned at random, the model is compelled to consult the contextual examples to carry out the given task instead of relying on specific colours. This makes the training approach more flexible and generic.
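The random colour mapping described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (`random_colour_map`, `colourise`, `make_in_context_sample`), the use of nested lists for masks, and the RGB colour space are all assumptions made for illustration. The point it shows is that a prompt mask and a target mask share one freshly drawn colour map per sample, so a colour carries no fixed meaning across samples.

```python
import random


def random_colour_map(labels, rng):
    """Draw a distinct random RGB colour for every label id (hypothetical helper)."""
    colours = {}
    used = set()
    for label in sorted(labels):
        colour = (rng.randrange(256), rng.randrange(256), rng.randrange(256))
        # Redraw on the rare collision so two labels never share a colour.
        while colour in used:
            colour = (rng.randrange(256), rng.randrange(256), rng.randrange(256))
        used.add(colour)
        colours[label] = colour
    return colours


def colourise(mask, colours):
    """Turn a 2D grid of label ids into a 2D grid of RGB tuples."""
    return [[colours[label] for label in row] for row in mask]


def make_in_context_sample(prompt_mask, target_mask, rng):
    """Colour the prompt and target masks with ONE shared random colour map.

    Assumption: sharing the map within a sample, and redrawing it for every
    sample, is what forces a model to read the colours from the context.
    """
    labels = {label for row in prompt_mask + target_mask for label in row}
    colours = random_colour_map(labels, rng)
    return colourise(prompt_mask, colours), colourise(target_mask, colours)


if __name__ == "__main__":
    rng = random.Random(0)
    prompt = [[0, 1], [1, 2]]   # toy label grids, e.g. 0 = sky, 1 = car, 2 = road
    target = [[2, 2], [0, 1]]
    prompt_rgb, target_rgb = make_in_context_sample(prompt, target, rng)
    print(prompt_rgb)
    print(target_rgb)
```

Calling `make_in_context_sample` again on the same masks draws a new colour map, so the only reliable cue for which colour belongs to which region is the prompt pair itself.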

The rest of the training setup is kept simple, using a standard ViT and a straightforward smooth-l1 loss. After training, SegGPT can use in-context inference to perform various segmentation tasks in images or videos given a few examples, such as object instance, stuff, part, contour, text, etc. The authors also propose a simple but effective context ensemble technique, the feature ensemble, which helps the model take advantage of the multi-example prompting setting. By tuning a customized prompt for a specialized use case, such as in-domain ADE20K semantic segmentation, SegGPT can also readily serve as a specialist model without any change to the model parameters.
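For readers unfamiliar with the smooth-l1 loss mentioned above, here is a minimal pure-Python sketch of its standard definition. The function name `smooth_l1`, the flat-list inputs, and the threshold parameter `beta` (default 1.0) are assumptions for illustration. The article does not state which threshold or reduction SegGPT uses.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean smooth-l1 loss between two equal-length sequences of numbers.

    Quadratic for small errors (|d| < beta), linear for large ones, so
    outlier pixels influence the loss less than under a squared error.
    `beta` is an assumed hyperparameter, not a value from the article.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        if d < beta:
            total += 0.5 * d * d / beta
        else:
            total += d - 0.5 * beta
    return total / len(pred)


if __name__ == "__main__":
    # One small error (quadratic branch) and one large error (linear branch).
    print(smooth_l1([0.0, 0.0], [0.5, 2.0]))
```

In an in-context colouring setup, `pred` and `target` would be the predicted and ground-truth colour values of the pixels being coloured, flattened into one sequence.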

These are their primary contributions. 

(1) For the first time, they show a single generalist model that can automatically complete a wide range of segmentation tasks. 

(2) For various tasks, such as few-shot semantic segmentation, video object segmentation, semantic segmentation, and panoptic segmentation, they assess the pre-trained SegGPT directly, i.e., without fine-tuning. 

(3) Both qualitatively and quantitatively, their results demonstrate strong capabilities in segmenting in-domain and out-of-domain targets. Nevertheless, the study does not claim to achieve new state-of-the-art results or to outperform existing specialist approaches across all benchmarks, since the authors believe a general-purpose model may not be able to handle certain tasks as well.

Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.
