Researchers from China Propose Vision Mamba (Vim): A New Generic Vision Backbone With Bidirectional Mamba Blocks

Recent research advances have drawn considerable interest to the state space model (SSM). Modern SSMs, which derive from the classic state space model, benefit from parallel training and excel at capturing long-range dependencies. SSM-based methods such as linear state-space layers (LSSL), the structured state-space sequence model (S4), diagonal state spaces (DSS), and S4D process sequence data across many tasks and modalities. These methods excel at modeling long-range dependencies, and their use of convolutional and near-linear computation makes them efficient on lengthy sequences.
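To make the core idea concrete, here is a minimal sketch of the recurrence at the heart of a discrete-time linear SSM. The scalar parameters `a`, `b`, `c` are illustrative stand-ins; real SSMs like S4 and Mamba use learned matrices and structured parameterizations, but the same recurrence is what lets them carry information across long sequences.

```python
# Toy scalar state space model:
#   h_t = a * h_{t-1} + b * x_t   (state update)
#   y_t = c * h_t                 (readout)
# Parameters are illustrative, not from any real SSM implementation.

def ssm_scan(x, a=0.5, b=1.0, c=1.0):
    """Run the SSM recurrence over a sequence x and return the outputs y."""
    h = 0.0
    y = []
    for x_t in x:
        h = a * h + b * x_t   # old state decays, new input enters
        y.append(c * h)
    return y

# An impulse at t=0 still influences every later output, decaying geometrically:
print(ssm_scan([1.0, 0.0, 0.0, 0.0]))  # [1.0, 0.5, 0.25, 0.125]
```

Because the state `h` summarizes the entire history in constant memory, each step costs O(1), which is why SSMs scale linearly with sequence length where attention scales quadratically.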

Inspired by Mamba’s achievements in language modeling, it is natural to ask whether that success can be transferred from language to vision, specifically by designing a generic and efficient visual backbone with the advanced SSM method. Mamba faces two obstacles in this setting, however: its lack of positional awareness and its unidirectional modeling.

A recent study from Huazhong University of Science and Technology, Horizon Robotics, and the Beijing Academy of Artificial Intelligence proposes the Vision Mamba (Vim) block to overcome these obstacles. It combines bidirectional SSMs for data-dependent global visual context modeling with position embeddings for location-aware visual recognition. Before applying Vim, the input image’s patches are linearly projected into vectors. Vim blocks treat image patches as sequence data, allowing effective visual representation compression using the proposed bidirectional selective state space. The position embedding in its blocks gives Vim awareness of spatial information, making it even more reliable for dense prediction tasks.
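The pipeline above can be sketched in a few lines. This is a hedged toy version, not the authors' implementation: patches are scalars after a 1-d projection, the "selective scan" is replaced by a plain linear recurrence, and the position embeddings are zeros to keep the trace readable. The point is the shape of the computation: project patches, add positions, then scan the token sequence in both directions and combine.

```python
# Illustrative sketch of the Vim processing flow (names, shapes, and the
# simple recurrence are assumptions; the real model uses learned matrices
# and a hardware-aware selective scan).

def project_patches(patches, weight):
    """Linear projection: each flattened patch -> an embedding vector."""
    return [[sum(w * p for w, p in zip(row, patch)) for row in weight]
            for patch in patches]

def ssm_scan(xs, a=0.5):
    """Toy stand-in for the selective scan: a simple linear recurrence."""
    h, ys = 0.0, []
    for x in xs:
        h = a * h + x
        ys.append(h)
    return ys

def vim_block(tokens):
    """Bidirectional modeling: forward scan plus backward scan."""
    fwd = ssm_scan(tokens)
    bwd = ssm_scan(tokens[::-1])[::-1]
    return [f + b for f, b in zip(fwd, bwd)]

patches = [[2.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # three flattened 2-pixel patches
weight = [[0.5, 0.5]]                           # projection to 1-d embeddings
pos = [0.0, 0.0, 0.0]                           # position embeddings (zeros here)
tokens = [v[0] + p for v, p in zip(project_patches(patches, weight), pos)]
print(vim_block(tokens))  # [2.0, 0.5, 0.25]
```

The backward scan is what gives every patch access to context from both earlier and later positions in the flattened sequence, which a unidirectional Mamba cannot provide.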

The researchers train the Vim model on the ImageNet dataset for supervised image classification. With this pretrained Vim as a backbone, they perform sequential visual representation learning for downstream dense prediction tasks such as semantic segmentation, object detection, and instance segmentation. Like Transformers, Vim improves its visual representation when pretrained on massive amounts of unsupervised visual data, and thanks to Mamba’s improved efficiency, this large-scale pretraining can be accomplished at lower computational cost.

Since Vim is a pure-SSM-based approach that models images sequentially, it shows more promise as a general and efficient backbone than previous SSM-based models for vision applications. Enabled by bidirectional compression modeling with positional awareness, Vim is the first pure-SSM-based model to tackle dense prediction tasks.

With only subquadratic-time computation and linear memory complexity, the proposed Vim achieves the modeling power of ViT without requiring attention. In batch inference extracting features from images at 1248×1248 resolution, Vim is 2.8 times faster than DeiT and saves 86.8% of GPU memory. Extensive experiments on ImageNet classification and downstream dense prediction tasks show that Vim outperforms the well-established, heavily optimized plain vision Transformer, DeiT. Thanks to Mamba’s fast, hardware-aware design, Vim also outperforms the self-attention-based DeiT on high-resolution computer vision applications such as video segmentation, computational pathology, medical image segmentation, and aerial image analysis.
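A rough back-of-the-envelope calculation shows why high resolution favors Vim. Assuming 16×16 patches (a common ViT/DeiT choice; the patch size is an assumption here, not stated in the text), the number of tokens grows with image area, attention cost grows with the square of the token count, and an SSM scan grows only linearly.

```python
# Scaling sketch behind the efficiency claim: attention touches ~L^2 token
# pairs, an SSM scan touches ~L tokens. Patch size 16 is an assumption.

def num_tokens(resolution, patch=16):
    """Number of patch tokens for a square image at the given resolution."""
    side = resolution // patch
    return side * side

for res in (224, 1248):
    L = num_tokens(res)
    print(f"{res}x{res}: {L} tokens, attention pairs ~{L * L:,}, scan steps ~{L:,}")
```

At 224×224 the gap is modest (196 tokens), but at 1248×1248 the sequence grows to 6,084 tokens and the quadratic attention term explodes to roughly 37 million pairs, which is consistent with the reported speed and memory advantages at that resolution.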

The team believes that future work can use Vim’s bidirectional SSM modeling with position embeddings to tackle unsupervised tasks such as masked image modeling pretraining, and that combining Vim with Mamba’s similar architecture enables multimodal tasks such as CLIP-style pretraining. With the pretrained Vim weights, analyzing long videos, high-resolution medical images, and remote sensing imagery as downstream tasks becomes straightforward.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.


Dhanshree Shenwai is a Computer Science Engineer with solid experience in FinTech companies covering the financial, cards & payments, and banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone’s life easier in today’s evolving world.
