Sigma: Changing AI Perception with Multi-Modal Semantic Segmentation through a Siamese Mamba Network for Enhanced Environmental Understanding

In AI, the search for machines that can comprehend their environment with near-human accuracy has driven significant advances in semantic segmentation. This field, integral to AI’s perception capabilities, involves assigning a semantic label to each pixel in an image, enabling a detailed understanding of the scene. However, conventional segmentation techniques often falter under less-than-ideal conditions, such as poor lighting or occlusions, making the pursuit of more robust methods a high priority.

One emerging solution to this challenge is multi-modal semantic segmentation, which combines traditional visual data with additional information sources, such as thermal imaging and depth sensing. This approach offers a more nuanced view of the environment, allowing for improved performance where a single data modality may fail. For instance, while RGB data provides detailed colour information, thermal imaging can detect entities based on heat signatures, and depth sensing offers a 3D perspective of the scene.

Despite the promise of multi-modal segmentation, existing methodologies, primarily convolutional neural networks (CNNs) and Vision Transformers (ViTs), have notable limitations. CNNs are restricted by their local receptive field, limiting their ability to grasp the broader context of an image. ViTs can capture global context, but only at a prohibitive computational cost that grows quadratically with input size, making them less viable for real-time applications. These challenges highlight the need for an approach that harnesses multi-modal data efficiently.

Researchers from the Robotics Institute at Carnegie Mellon University and the School of Future Technology at the Dalian University of Technology introduced Sigma to address these problems. Sigma leverages a Siamese Mamba network architecture, incorporating Mamba, a Selective Structured State Space Model, to balance global contextual understanding against computational efficiency. Unlike traditional methods, it offers a global receptive field with linear complexity, enabling faster and more accurate segmentation across diverse conditions.
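The linear-complexity claim can be illustrated with a toy state-space scan. The sketch below is a simplified illustration with a scalar hidden state and fixed parameters, not the actual Mamba or Sigma implementation; in Mamba the parameters are input-dependent ("selective") and the state is high-dimensional.

```python
def ssm_scan(x, a, b, c):
    """Toy 1-D state-space scan: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    Each step does constant work, so the whole sequence costs O(len(x)),
    unlike self-attention's O(len(x)^2) pairwise comparisons. In Mamba,
    a, b, c vary with the input; here they are fixed scalars for clarity.
    """
    h = 0.0
    ys = []
    for x_t in x:
        h = a * h + b * x_t  # update hidden state from previous state + input
        ys.append(c * h)     # emit output from the current state
    return ys
```

Because the state `h` summarizes everything seen so far, every output position depends on the entire preceding sequence, which is how a global receptive field is achieved without quadratic cost.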

On the challenging RGB-Thermal and RGB-Depth segmentation tasks, Sigma consistently outperformed existing state-of-the-art models. For instance, in experiments on the MFNet and PST900 datasets for RGB-T segmentation, Sigma demonstrated superior accuracy, with mean Intersection over Union (mIoU) scores exceeding those of comparable methods. Its design achieved these results with significantly fewer parameters and lower computational demands, highlighting its potential for real-time applications and devices with limited processing power.
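For readers unfamiliar with the metric, mIoU averages per-class intersection-over-union across classes. A minimal reference computation on flattened label lists (this is a generic illustration, not the paper's evaluation code):

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union for flat per-pixel label lists.

    For each class c: IoU_c = |pred==c AND target==c| / |pred==c OR target==c|.
    The mean is taken over classes present in either prediction or target.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both pred and target
            ious.append(inter / union)
    return sum(ious) / len(ious)
```

For example, `mean_iou([0, 0, 1, 1], [0, 1, 1, 1], 2)` averages an IoU of 1/2 for class 0 and 2/3 for class 1, giving 7/12.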

The Siamese encoder extracts features from different data modalities, which are then intelligently fused using a novel Mamba fusion mechanism. This process ensures that essential information from each modality is retained and effectively integrated. The subsequent decoding phase employs a channel-aware Mamba decoder, further refining the segmentation output by focusing on the most relevant features across the fused data. This layered approach enables Sigma to produce remarkably accurate segmentations, even when traditional methods struggle.
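The data flow described above can be sketched schematically. In the sketch below, `encode`, `fuse`, and `decode` are illustrative placeholders standing in for the paper's Mamba-based modules; the toy callables in the usage example are hypothetical and exist only to show the shape of the pipeline.

```python
def siamese_segment(rgb, thermal, encode, fuse, decode):
    """Schematic of Sigma's pipeline: a weight-shared (Siamese) encoder,
    a cross-modal fusion step, and a decoder. In the paper these are
    Mamba-based modules; here they are ordinary callables."""
    f_rgb = encode(rgb)          # the same encoder (shared weights) ...
    f_thermal = encode(thermal)  # ... is applied to each modality
    fused = fuse(f_rgb, f_thermal)  # retain and integrate both modalities
    return decode(fused)            # refine fused features into labels

# Toy usage with stand-in modules on a 2-"pixel" input:
labels = siamese_segment(
    [0.2, 0.8], [0.5, 0.1],
    encode=lambda xs: [2 * x for x in xs],            # toy shared encoder
    fuse=lambda a, b: [x + y for x, y in zip(a, b)],  # toy additive fusion
    decode=lambda xs: [int(x > 1.0) for x in xs],     # toy thresholding "decoder"
)
```

Sharing encoder weights across modalities is what makes the architecture Siamese: both inputs are mapped into a common feature space before fusion.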

In conclusion, Sigma advances semantic segmentation by introducing a powerful multi-modal approach that leverages the strengths of different data types to enhance AI’s environmental perception. By combining depth and thermal modalities with RGB data, Sigma achieves state-of-the-art accuracy with notable efficiency. Its success underscores the potential of multi-modal data fusion and paves the way for future innovations.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
