Researchers From MIT and Cornell Develop STEGO (Self-Supervised Transformer With Energy-Based Graph Optimization): A Novel AI Framework That Distills Unsupervised Features Into High-Quality Discrete Semantic Labels

This article is based on the research paper 'Unsupervised Semantic Segmentation by Distilling Feature Correspondences'. All credit for this research goes to the researchers of this paper.

Unsupervised semantic segmentation seeks to discover and localize semantically meaningful categories within image corpora without any annotation. The motivation is practical: creating annotated training data is difficult, and that cost frequently outweighs the superior accuracy of supervised semantic segmentation methods. To extract meaningful categories from the training data without any annotation, algorithms must learn features for every pixel that are both semantically relevant and compact enough to form discrete clusters. A team of researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), Google, and Cornell University has achieved this by creating a machine learning model named STEGO (Self-supervised Transformer with Energy-based Graph Optimization) that surpasses previous methods by decoupling feature learning from cluster compactification.

STEGO is built around a frozen backbone, which serves both as a source of learning feedback and as input to the segmentation head that predicts the distilled features. This segmentation head is a simple feed-forward network with a ReLU activation function. Unlike earlier studies, the approach does not retrain or fine-tune the backbone, which makes the algorithm more efficient. To capture global image information, STEGO applies global average pooling over the spatial dimensions of the backbone’s features. Then, based on cosine similarity in the backbone’s feature space, a lookup table of each image’s K-Nearest Neighbours is computed.
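The nearest-neighbour lookup described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `knn_lookup`, the `(N, C, H, W)` feature layout, and the choice to exclude each image from its own neighbour list are all assumptions made for the example.

```python
import numpy as np

def knn_lookup(features, k):
    """Build a K-Nearest-Neighbour lookup table from backbone feature maps.

    features: array of shape (N, C, H, W), one feature map per image
              (hypothetical layout chosen for this sketch).
    Returns an (N, k) integer array; row i holds the indices of the k images
    most similar to image i, ordered from most to least similar.
    """
    n = features.shape[0]
    if not 0 < k < n:
        raise ValueError("k must be between 1 and N - 1")

    # Global average pooling: collapse the spatial dimensions to one vector per image.
    pooled = features.mean(axis=(2, 3))

    # L2-normalise so that a dot product equals cosine similarity.
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    pooled = pooled / np.clip(norms, 1e-12, None)

    # Pairwise cosine similarity between all images.
    sim = pooled @ pooled.T

    # Assumption for this sketch: an image is not counted as its own neighbour.
    np.fill_diagonal(sim, -np.inf)

    # Sort each row by descending similarity and keep the first k indices.
    return np.argsort(-sim, axis=1)[:, :k]
```

For a large corpus, the dense `N x N` similarity matrix would become expensive; a batched or approximate nearest-neighbour search would likely be needed, which this sketch does not attempt.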

The CocoStuff dataset served as the primary training dataset. Many of its images are crowded with small entities that are difficult to resolve at a feature resolution of (40, 40). To handle small objects better while keeping training fast, the training images were five-cropped before the KNNs were computed. This lets the network examine the images in greater detail, and it also improves the quality of the KNNs: global image embeddings are calculated for each crop, so the network can resolve finer details and has five times as many images from which to find closely matching KNNs. Five-cropping boosted performance on both the Cityscapes and CocoStuff datasets. Clustering and CRF refinement steps form the final components of the design.
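To make the five-cropping step concrete, here is a minimal NumPy sketch. It assumes the common five-crop layout (four corners plus the centre) and takes the crop size as a parameter; the article does not state the crop size or layout the authors used, so the function name `five_crop` and its arguments are hypothetical.

```python
import numpy as np

def five_crop(image, crop_h, crop_w):
    """Return five crops of an image: four corners and the centre.

    image: array of shape (H, W) or (H, W, C).
    crop_h, crop_w: crop size (an assumption; not specified in the article).
    """
    h, w = image.shape[:2]
    if crop_h > h or crop_w > w:
        raise ValueError("crop size must not exceed the image size")

    # Top-left origin of the centre crop.
    top_c = (h - crop_h) // 2
    left_c = (w - crop_w) // 2

    # (top, left) origins: top-left, top-right, bottom-left, bottom-right, centre.
    origins = [
        (0, 0),
        (0, w - crop_w),
        (h - crop_h, 0),
        (h - crop_h, w - crop_w),
        (top_c, left_c),
    ]
    return [image[t:t + crop_h, l:l + crop_w] for t, l in origins]
```

Each crop would then be passed through the backbone like any other image, which is how a corpus of N images yields 5N candidates for the nearest-neighbour search.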


STEGO can serve as a stepping stone into the world of modern self-supervised visual backbones, which can then be developed to produce cutting-edge unsupervised semantic segmentation algorithms. According to the team’s experiments, STEGO outperforms the previous state of the art, PiCIE, on the CocoStuff and Cityscapes datasets on both the linear probe and unsupervised clustering metrics. DINO’s weights, learned through self-supervision on ImageNet, are enough to handle both settings simultaneously, even though the backbone is not fine-tuned on these datasets. By examining how STEGO outperformed simply clustering the features from unmodified DINO, MoCoV2, and ImageNet-supervised ResNet50 backbones, the team could draw a clear conclusion about the benefits of training a segmentation head to distill feature correspondences. However, much work remains to be done when it comes to labeling.