Researchers From Facebook AI Research and UIUC Propose ‘MaskFormer’, A Mask Classification Model That Simplifies The Landscape Of Effective Approaches To Semantic And Panoptic Segmentation Tasks

'Per-Pixel Classification is Not All You Need for Semantic Segmentation'

Source: https://bowenc0221.github.io/maskformer/

In recent years, semantic segmentation has become an important tool for computer vision. One common formulation is per-pixel classification, where the goal is to partition an image into regions of different categories by predicting a class label for every pixel, using deep learning techniques such as Fully Convolutional Networks (FCNs). Mask classification is an alternative formulation that separates the partitioning and classifying aspects of segmentation. Instead of classifying individual pixels, mask-based methods predict a set of binary masks, each associated with a single class prediction.
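The difference between the two formulations is easiest to see in their output shapes. The following NumPy sketch is purely illustrative (the array names, sizes, and random values are assumptions, not the paper's code): a per-pixel model emits one class distribution per pixel, while a mask classification model emits N binary masks, each paired with exactly one class label.

```python
import numpy as np

H, W, K = 4, 4, 3   # image height, width, number of classes (illustrative)
N = 5               # number of predicted masks in the mask-classification view

rng = np.random.default_rng(0)

# Per-pixel classification: one K-way class distribution per pixel.
per_pixel_logits = rng.standard_normal((H, W, K))
semantic_map = per_pixel_logits.argmax(axis=-1)    # (H, W) map of class ids

# Mask classification: N binary masks, each tied to a single class label.
masks = rng.random((N, H, W)) > 0.5                # N binary region masks
mask_classes = rng.integers(0, K, size=N)          # one class per mask

print(semantic_map.shape, masks.shape, mask_classes.shape)
```

Because each mask carries its own class label, the same N outputs can describe either semantic regions or individual object instances.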

Source: https://arxiv.org/pdf/2107.06278.pdf

An important observation is that the general concept of mask classification applies at both the semantic and instance levels. In fact, before FCNs, some of the most effective segmentation methods, such as O2P and SDS, were mask-based approaches built on this same perspective. Given this, a natural question arises: can a single mask classification approach solve both kinds of segmentation task, and can it outperform the per-pixel classification techniques currently used for semantic segmentation?

To address these questions, researchers from Facebook AI Research (FAIR) and the University of Illinois at Urbana-Champaign (UIUC) propose MaskFormer, a simple model that seamlessly converts any existing per-pixel classification model into a mask classification model. Using the set prediction mechanism proposed in DETR, a Transformer decoder computes a set of pairs, each consisting of a class prediction and a mask embedding vector. The binary mask predictions are then obtained via a dot product between the mask embeddings and per-pixel embeddings produced by an underlying fully convolutional network. The resulting model solves semantic- and instance-level segmentation tasks in a unified manner, trained with the same loss: a classification loss plus a per-pixel binary mask loss for each predicted mask. Because every mask comes with a single class prediction, it is straightforward to convert MaskFormer's output into the task-dependent prediction formats used by other models.
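The dot-product step and a simple semantic inference can be sketched as follows. All tensor names and sizes here are illustrative assumptions rather than the authors' implementation: N decoder queries yield class logits (over K classes plus a "no object" category) and mask embeddings, the mask embeddings are dotted with per-pixel embeddings to form N mask logit maps, and the final semantic map is obtained by marginalizing class probabilities over the masks.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

N, C, K = 6, 8, 3   # queries, embedding dim, classes (illustrative)
H, W = 4, 4
rng = np.random.default_rng(0)

# Per-query outputs of the Transformer decoder.
class_logits = rng.standard_normal((N, K + 1))   # K classes + "no object"
mask_embed = rng.standard_normal((N, C))         # one mask embedding per query

# Per-pixel embeddings from the underlying fully convolutional network.
pixel_embed = rng.standard_normal((C, H, W))

# Binary mask logits via dot product: (N, C) x (C, H, W) -> (N, H, W).
mask_logits = np.einsum("nc,chw->nhw", mask_embed, pixel_embed)

# Semantic inference by marginalizing over masks:
# p(class k at pixel p) = sum_n p_n(k) * sigmoid(mask_logits[n, p]).
class_prob = softmax(class_logits)[:, :K]        # drop the "no object" slot
mask_prob = 1.0 / (1.0 + np.exp(-mask_logits))   # per-pixel sigmoid
semantic_prob = np.einsum("nk,nhw->khw", class_prob, mask_prob)
semantic_map = semantic_prob.argmax(axis=0)      # (H, W) map of class ids

print(mask_logits.shape, semantic_map.shape)
```

For instance-level tasks, one would instead keep the N masks separate and assign each its most likely class, which is what makes the same set of outputs usable for both semantic and panoptic segmentation.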

Key features of MaskFormer:

  1. Better results while being more efficient.
  2. Unified view of semantic- and instance-level segmentation tasks.
  3. Supports major semantic segmentation datasets: ADE20K, Cityscapes, COCO-Stuff, Mapillary Vistas.
  4. Supports all Detectron2 models.
  5. Same exact model, loss, and training procedure across tasks.

The researchers evaluated MaskFormer on five semantic segmentation datasets with varying numbers of categories: Cityscapes (19 classes), Mapillary Vistas (65 classes), ADE20K (150 classes), COCO-Stuff-10K (171 classes), and ADE20K-Full (847 classes). MaskFormer performed on par with per-pixel classification models on Cityscapes, which has only a few distinct classes, and demonstrated clearly superior performance on the datasets with larger vocabularies.

MaskFormer simplifies the landscape of effective approaches to semantic and panoptic segmentation while achieving excellent results. It outperforms per-pixel classification baselines when the number of classes is large, and surpasses the current state-of-the-art models in both semantic segmentation (ADE20K) and panoptic segmentation (COCO).

Paper: https://arxiv.org/pdf/2107.06278.pdf

Project: https://bowenc0221.github.io/maskformer/

Github: https://github.com/facebookresearch/MaskFormer