Comprehensive Analysis of the Performance of Vision State Space Models (VSSMs), Vision Transformers, and Convolutional Neural Networks (CNNs)

Deep learning models such as Convolutional Neural Networks (CNNs) and Vision Transformers have achieved great success in many visual tasks, including image classification, object detection, and semantic segmentation. However, their ability to handle shifts in the input data remains a major concern, especially in security-critical applications. Many works have evaluated the robustness of CNNs and Transformers against common corruptions, domain shifts, information drops, and adversarial attacks. These studies show that a model’s design affects how well it copes with such changes, and that robustness varies across architectures. A major drawback of transformers is that their computational cost scales quadratically with input size, which makes them costly for complex tasks.
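To make the quadratic scaling concrete, here is a small arithmetic sketch. It assumes global self-attention over non-overlapping square patches and a patch size of 16 pixels; both are illustrative choices, not details taken from the paper, and the function names are hypothetical.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches (tokens) in a square image."""
    per_side = image_size // patch_size
    return per_side * per_side


def attention_pairs(num_tokens: int) -> int:
    """Global self-attention scores every token against every token."""
    return num_tokens * num_tokens


# Assumed patch size of 16; doubling the image side quadruples the token count.
for size in (224, 448, 896):
    tokens = num_patches(size, patch_size=16)
    pairs = attention_pairs(tokens)
    print(f"{size}x{size} image: {tokens} tokens, {pairs:,} attention scores")
```

Under these assumptions, each doubling of the image side multiplies the number of attention scores by 16, which is why high-resolution inputs become expensive for global attention.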

The paper covers two related topics: the Robustness of Deep Learning Models (RDLM) and State Space Models (SSMs). RDLM concerns how well a traditionally trained model maintains its performance when faced with natural and adversarial changes in the data distribution. In real-world situations, deep learning models often encounter corrupted data, such as noise, blur, and compression artifacts, as well as intentional perturbations designed to trick the model. These issues can significantly harm performance, so evaluating models under such conditions is important for ensuring that they are reliable and robust. SSMs, on the other hand, are a promising approach for modeling sequential data in deep learning. These models transform a one-dimensional sequence using an implicit latent state.
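The idea of transforming a sequence through a latent state can be sketched with a scalar, discrete linear recurrence: h_t = a*h_(t-1) + b*x_t and y_t = c*h_t. This is a minimal illustration only. The function name and the fixed parameters a, b, and c are assumptions made for the example; models in practice use higher-dimensional states and learned parameters.

```python
from typing import List


def ssm_scan(xs: List[float], a: float, b: float, c: float) -> List[float]:
    """Run a scalar discrete linear SSM over a 1-D sequence.

    State update: h_t = a * h_{t-1} + b * x_t
    Output:       y_t = c * h_t
    """
    h = 0.0  # latent state, never exposed directly
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


# An impulse at t=0 decays geometrically through the latent state.
print(ssm_scan([1.0, 0.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0))
# -> [2.0, 1.0, 0.5, 0.25]
```

The loop visits each element once, so the cost grows linearly with sequence length rather than quadratically.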

Researchers from MBZUAI UAE, Linkoping University, and ANU Australia have introduced a comprehensive analysis of the performance of VSSMs, Vision Transformers, and CNNs. The analysis examines how these models handle various challenges in classification, detection, and segmentation tasks, and provides valuable insights into their robustness and suitability for real-world applications. The evaluations are divided into three parts, each focusing on an important area of model robustness. The first part, Occlusions and Information Loss, evaluates the robustness of VSSMs against occlusions and against information loss along scanning directions. The other two parts are Common Corruptions and Adversarial Attacks.
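One way to picture the first part is to zero out image patches, either at random positions or as a contiguous run along a scan order. The sketch below is a hypothetical illustration: the helper names, the row-major scan order, and zero-filling are all assumptions, not the paper's protocol. It also assumes the image height and width are divisible by the patch size.

```python
import random
from typing import List, Set

Image = List[List[float]]


def drop_patches(image: Image, patch: int, dropped: Set[int]) -> Image:
    """Return a copy of `image` with the listed patches (row-major ids) zeroed.

    Assumes height and width are divisible by `patch`.
    """
    cols = len(image[0]) // patch  # patches per row
    out = [row[:] for row in image]
    for r, row in enumerate(out):
        for c in range(len(row)):
            if (r // patch) * cols + (c // patch) in dropped:
                row[c] = 0.0
    return out


def random_drop(num_patches: int, ratio: float, seed: int = 0) -> Set[int]:
    """Pick a random subset of patch ids to drop."""
    rng = random.Random(seed)
    return set(rng.sample(range(num_patches), int(num_patches * ratio)))


def sequential_drop(num_patches: int, ratio: float, start: int = 0) -> Set[int]:
    """Pick a contiguous run of patch ids along a row-major scan."""
    count = int(num_patches * ratio)
    return {(start + i) % num_patches for i in range(count)}


image = [[1.0] * 4 for _ in range(4)]  # 4x4 image, four 2x2 patches
occluded = drop_patches(image, patch=2, dropped=sequential_drop(4, 0.5))
print(occluded)  # the top two patches (rows 0 and 1) are zeroed
```

Feeding such occluded inputs to each model at increasing drop ratios, and recording accuracy, gives a simple robustness curve for comparison.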

In the second part, the robustness of VSSM-based classification models is tested against Common Corruptions that reflect real-world situations. These include global corruptions such as noise, blur, weather, and digital distortions at different intensity levels, and fine-grained corruptions such as object attribute editing and background changes. The evaluation is then extended to VSSM-based detection and segmentation models to assess their robustness in dense prediction tasks. The third and last part analyzes the robustness of VSSMs against Adversarial Attacks in both white-box and black-box settings. This analysis gives insights into the ability of VSSMs to resist adversarial perturbations at various frequency levels.
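As a rough illustration of a global corruption applied at different intensity levels, the sketch below adds Gaussian noise to pixel values in [0, 1]. The five noise levels and the function name are hypothetical and chosen only for the example; they are not the settings used in the paper.

```python
import random
from typing import List

# Hypothetical standard deviations for severities 1..5 (not from the paper).
SIGMAS = [0.05, 0.10, 0.15, 0.20, 0.25]


def gaussian_noise(pixels: List[float], severity: int, seed: int = 0) -> List[float]:
    """Add zero-mean Gaussian noise and clip the result back to [0, 1]."""
    if not 1 <= severity <= len(SIGMAS):
        raise ValueError("severity must be between 1 and 5")
    rng = random.Random(seed)
    sigma = SIGMAS[severity - 1]
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]


clean = [0.5] * 8
for level in (1, 3, 5):
    noisy = gaussian_noise(clean, severity=level)
    mean_shift = sum(abs(n - c) for n, c in zip(noisy, clean)) / len(clean)
    print(f"severity {level}: mean absolute change {mean_shift:.3f}")
```

Evaluating a model on the same images at each severity level shows how quickly its accuracy degrades as the corruption grows.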

Based on the evaluation across all three parts, here are the key findings:

  • In the first part, ConvNext and VSSM models handle sequential information loss along the scanning direction better than ViT and Swin models. When patches are dropped, VSSMs show the highest robustness, although Swin models perform better under extreme information loss.
  • Under global corruptions, VSSM models experience the smallest average performance drop compared to Swin and ConvNext models. For fine-grained corruptions, VSSM models outperform all transformer-based variants and either match or outperform the ConvNext models.
  • For adversarial attacks, smaller VSSM models are more robust to white-box attacks than their Swin Transformer counterparts. VSSM models retain above 90% robustness under strong low-frequency perturbations, but their performance drops quickly under high-frequency attacks.
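For readers unfamiliar with the white-box setting mentioned in the findings above, the sketch below shows the Fast Gradient Sign Method (FGSM) on a toy logistic classifier. The summary does not say which attacks the researchers used, so FGSM is chosen here only as a well-known example; the model, weights, and input are hypothetical and unrelated to VSSMs.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(w: List[float], b: float, x: List[float]) -> float:
    """Probability of class 1 under a logistic model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def fgsm(w: List[float], b: float, x: List[float], y: int, eps: float) -> List[float]:
    """One FGSM step: move each input by eps in the direction that raises the loss.

    For binary cross-entropy on a logistic model, d(loss)/dx_i = (p - y) * w_i.
    White-box means the attacker can read w and b to compute this gradient.
    """
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


w, b = [2.0, -1.0], 0.0           # toy weights, assumed known to the attacker
x, y = [0.5, 0.2], 1              # clean input with true label 1
x_adv = fgsm(w, b, x, y, eps=0.3)
print(predict(w, b, x) > 0.5)      # True: clean input classified as class 1
print(predict(w, b, x_adv) > 0.5)  # False: perturbed input flips the prediction
```

A black-box attacker, by contrast, would not have access to the weights and would need to estimate the gradient or transfer a perturbation crafted on another model.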

In conclusion, the researchers thoroughly evaluated the robustness of Vision State-Space Models (VSSMs) under various natural and adversarial disturbances, showing their strengths and weaknesses compared to transformers and CNNs. The experiments revealed the capabilities and limitations of VSSMs in handling occlusions, common corruptions, and adversarial attacks, as well as their ability to adapt to changes in object-background composition in complex visual scenes. This study will guide future research to enhance the reliability and effectiveness of visual perception systems in real-world situations.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.