Vision Transformers have shown great promise on various computer vision tasks. Their ability to capture both short- and long-range visual dependencies through self-attention is exciting, but it comes with quadratic computational overhead. Some recent work has attempted to reduce this cost by applying either coarse-grained global attention or fine-grained local attention; however, either restriction alone can cripple the modeling power of the original self-attention mechanism.
Microsoft researchers have developed a new self-attention mechanism for vision transformers, called focal self-attention, which powers their Focal Transformer. It lets each token attend to its closest surrounding tokens at fine granularity and to faraway tokens at coarse granularity. This enables capture of both short- and long-range visual dependencies efficiently and effectively.
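To make the idea concrete, here is a minimal NumPy sketch of that attention pattern: each query token attends to its fine-grained local window plus a coarse, average-pooled summary of the whole feature map. The function name, the `window` and `pool` parameters, and the single-head, no-projection setup are illustrative assumptions for clarity, not the paper's actual API or implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def focal_attention(feat, window=1, pool=2):
    """Sketch of focal self-attention over an (H, W, C) feature map.

    Each query token attends to:
      (a) fine-grained keys: its neighbors within `window` cells, and
      (b) coarse-grained keys: `pool` x `pool` average-pooled tokens
          summarizing the whole map (the long-range part).
    Hyperparameter names here are illustrative, not the paper's.
    """
    H, W, C = feat.shape
    # Coarse tokens: average-pool the map (crop to a multiple of `pool`).
    Hp, Wp = H // pool, W // pool
    coarse = feat[:Hp * pool, :Wp * pool]
    coarse = coarse.reshape(Hp, pool, Wp, pool, C).mean(axis=(1, 3))
    coarse = coarse.reshape(-1, C)  # (Hp * Wp, C)

    out = np.zeros_like(feat)
    for i in range(H):
        for j in range(W):
            q = feat[i, j]  # query token, shape (C,)
            # Fine-grained keys: local window clipped to the map borders.
            i0, i1 = max(0, i - window), min(H, i + window + 1)
            j0, j1 = max(0, j - window), min(W, j + window + 1)
            fine = feat[i0:i1, j0:j1].reshape(-1, C)
            # Attend jointly over short-range (fine) and long-range (coarse) keys.
            keys = np.concatenate([fine, coarse], axis=0)
            attn = softmax(keys @ q / np.sqrt(C))
            out[i, j] = attn @ keys  # values == keys in this simplified sketch
    return out
```

The key efficiency point is that the number of keys per query is the window size plus the pooled-map size, rather than the full `H * W` of standard self-attention, while distant regions still contribute through the coarse tokens.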
The Focal Transformer, the new variant of Vision Transformer proposed in the paper, is a more effective multi-scale transformer model for image classification, object detection and segmentation than prior SoTA methods. Extensive experimental results show that focal attention generalizes to other vision tasks as well, modeling local-global interactions for various types of visual input.
Focal Transformers achieve superior performance over state-of-the-art vision transformers on a range of public benchmarks. Using Focal Transformers as backbones, the researchers obtain consistent and substantial improvements over the current state of the art across six different object detection methods trained with standard 1x and 3x schedules.
- Focal Transformer (FT) introduces a new self-attention mechanism for ViTs
- Each token attends to its closest surrounding tokens at fine granularity and to distant tokens at coarse granularity
- It captures both short- and long-range visual dependencies
Microsoft AI has open-sourced its Focal Transformer. Below are the links.