Microsoft AI Proposes ‘FocalNets’ Where Self-Attention is Completely Replaced by a Focal Modulation Module, Enabling To Build New Computer Vision Systems For high-Resolution Visual Inputs More Efficiently

Human eyes allow us to see finely and coarsely objects by quickly adjusting their focal points to allow us to observe our surroundings from all angles. In the area of computer vision, simulating this behavior using a neural network is still a work in progress because it is challenging to create a model that can effectively concentrate on varied granularities of visual inputs for various tasks.

Transformers and Vision Transformers have allowed for recent NLP and vision technology advancements. Transformers’ self-attention (SA) mechanism, which enables each inquiry token to gather information from others nimbly, makes them stand out in terms of vision. Their capacity for generalization also aids in analyzing incoming signals from the visual environment, which are frequently continuous and have arbitrary granularity and scope. However, SA is often employed to model over a defined set of prepared tokens with a particular scope and granularity.

Microsoft Research made a stride in this direction by constructing neural networks with focal modulation and created the FocalNets architecture family. With a 3x smaller model size and training data size, FocalNet achieves new state-of-the-art (SoTA) on one of the most challenging vision tasks: COCO object identification. It surpassed all previous Transformer models for the first time in the past two years, which is a significant accomplishment. In contrast to the Transformer, FocalNet can find and identify things in images and videos, displaying an exciting interpretable learning behavior.

The core of FocalNets is the focal modulation mechanism. It is a simple element-wise multiplication (like the focussing operator) that enables the modulator-based interaction of the model with the input. The computation of this modulator involves a two-step focal aggregation process. The first process, known as focal contextualization, extracts contexts from local to global ranges at various levels of granularity. The second process includes gated aggregation to condense all context features at various granularity levels into the modulator.

The distinctions between self-attention and focused modulation can be seen as two independent processes that allow AI models to focus on particular portions of their information selectively. While the focused modulation begins with aggregation followed by interaction, self-attention begins with interaction, substantially speeding up the process with much lighter activities. 

The team discovered that focal modulation automatically develops an interpretable representation and distinguishes the primary object from the background clutter using the usual supervised training of FocalNet and Vision Transformers (ViT) on ImageNet. Without specialized dense pixel-level supervision, it learns to separate objects, and the focus areas it chooses are consistent with the human-generated annotation in the picture classification challenge. Both the modulation map and the attention map learn to focus. However, the emphasis areas that are chosen are very different. In contrast, ViT’s attention maps’ chosen focus areas are less significant and could draw attention to spuriously connected regions.

On tasks like ImageNet classification, zero-shot classification on 20 datasets on ICinW, etc., FocalNet was evaluated against well-known vision backbone networks like Vision Transformers (ViT), Swin Transformers, and ConvNeXt for evaluation reasons. According to the study’s findings, FocalNet routinely surpasses competitors. The dense visual prediction tasks with high-resolution picture input can benefit significantly from its attention-free design of focused modulation. This design enables the model to observe a larger scope at different granularities and avoids the taxing cost of token-to-token contact.

With FocalNets, Microsoft hopes to make it easier for the AI research community to create new computer vision systems for high-resolution visual inputs. The team hopes that by sharing the results of their experiments, the scientific community will realize the full potential of FocalNets and promote the spread of focused modulation. The team has also made their research paper, a GitHub repository with the PyTorch codebase, and a HuggingFace demo available to encourage additional study.

This Article is written as a research summary article by Marktechpost Staff based on the research paper 'Focal Modulation Networks'. All Credit For This Research Goes To Researchers on This Project. Check out the paper, code and reference article.
Please Don't Forget To Join Our ML Subreddit

Khushboo Gupta is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology(IIT), Goa. She is passionate about the fields of Machine Learning, Natural Language Processing and Web Development. She enjoys learning more about the technical field by participating in several challenges.

🐝 Join the Fastest Growing AI Research Newsletter Read by Researchers from Google + NVIDIA + Meta + Stanford + MIT + Microsoft and many others...