Researchers at UC Berkeley and Google Research have proposed a conceptually simple yet powerful backbone architecture that incorporates self-attention for various computer vision tasks, including image classification, object detection, and instance segmentation.
In recent years, deep convolutional backbone architectures have enabled significant progress in image classification, object detection, and instance segmentation. Convolution operations effectively capture local information, but vision tasks such as object detection, instance segmentation, and keypoint detection also demand the modeling of long-range dependencies.
Convolution-based architectures require stacking many layers to aggregate locally captured filter responses into a global representation. Although stacking more layers improves these backbones' performance, an explicit mechanism for modeling global dependencies could be a more robust and scalable solution that does not require as many layers.
Therefore, the researchers have proposed a simpler solution that leverages self-attention. The proposed architecture, BoTNet (Bottleneck Transformer Network), enables hybrid models that use both convolutions and self-attention.
Designing the novel architecture
In designing the proposed architecture, the team replaced the spatial 3 × 3 convolution layers in ResNet's (residual neural network) final three bottleneck blocks with Multi-Head Self-Attention (MHSA). They state that this simple design change allows the ResNet bottleneck blocks to be viewed as Transformer blocks.
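The swapped-in operation can be sketched as multi-head self-attention applied to a flattened 2D feature map. This is a minimal NumPy illustration of the computation, not the authors' implementation: BoTNet's MHSA layer also uses relative position encodings, which are omitted here, and the projection matrices `Wq`, `Wk`, `Wv` are hypothetical inputs.

```python
import numpy as np

def mhsa_2d(x, Wq, Wk, Wv, num_heads):
    """Global multi-head self-attention over a 2D feature map.

    x: (H, W, C) feature map; Wq/Wk/Wv: (C, C) projection matrices.
    Each spatial position attends to every other position, so the
    attention matrix has (H*W) x (H*W) entries per head.
    """
    H, W, C = x.shape
    head_dim = C // num_heads
    seq = x.reshape(H * W, C)               # flatten the spatial grid to a sequence
    q, k, v = seq @ Wq, seq @ Wk, seq @ Wv  # linear projections
    heads = []
    for h in range(num_heads):
        sl = slice(h * head_dim, (h + 1) * head_dim)
        scores = q[:, sl] @ k[:, sl].T / np.sqrt(head_dim)
        scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
        attn = np.exp(scores)
        attn /= attn.sum(axis=-1, keepdims=True)      # softmax over all positions
        heads.append(attn @ v[:, sl])
    return np.concatenate(heads, axis=-1).reshape(H, W, C)

rng = np.random.default_rng(0)
C = 8
x = rng.standard_normal((4, 4, C))
Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
y = mhsa_2d(x, Wq, Wk, Wv, num_heads=2)
print(y.shape)  # (4, 4, 8): output keeps the feature-map shape, like the 3x3 conv it replaces
```

Because the output shape matches the input shape, such a layer can slot into a bottleneck block in place of the spatial convolution, with the surrounding 1 × 1 convolutions and residual connection unchanged.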
This computation primitive also benefits computer vision tasks, as the self-attention mechanism can learn a rich hierarchy of associative features across long sequences. For instance, in segmentation tasks, modeling long-range dependencies, such as collecting and associating scene information from a large neighborhood, can help to learn relationships across objects.
Previous deep learning architectures employed self-attention outside the backbone. BoTNet, by contrast, applies both convolutions and self-attention within its backbone architecture, using global self-attention over a 2D feature map.
Typically, landmark computer vision backbone architectures use multiple layers of 3×3 convolutions, and ResNet's leaner bottleneck blocks are widely used to reduce computational cost. After conducting several experiments, the team found that BoTNet can serve as a drop-in replacement for any ResNet backbone with a more efficient compute step-time. In this sense, replacing convolutions with self-attention is more efficient than simply stacking more convolutions.
The researchers state that the memory and computation required for self-attention scale quadratically with the spatial dimensions, leading to overheads in training and inference. Therefore, they employed a hybrid design in which convolutions efficiently learn abstract, low-resolution feature maps from large images, and the global self-attention mechanism then processes and aggregates the information in those feature maps.
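The motivation for this placement can be checked with simple arithmetic: the attention matrix has (H·W)² entries, so applying self-attention only after the convolutional stages have downsampled the input keeps the cost manageable. The feature-map sizes below are illustrative values for a 224×224 input, not figures from the paper.

```python
def attention_entries(h, w):
    """Number of entries in the (H*W) x (H*W) self-attention matrix."""
    n = h * w
    return n * n

# Illustrative feature-map sizes for a 224x224 input:
early = attention_entries(112, 112)  # high-resolution map after the stem
late = attention_entries(14, 14)     # low-resolution map in a late stage

print(late)          # 38416 entries per head: cheap at low resolution
print(early // late) # 4096: attention on the stem output would cost 4096x more
```

The quadratic blow-up at early stages is why the hybrid design reserves self-attention for the final, heavily downsampled blocks.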
Evaluating BoTNet on the COCO Instance Segmentation benchmark validation set, the team noted that its 49.7 percent Box AP and 44.4 percent Mask AP outperform the existing best single-model, single-scale ResNet results. Additionally, the BoTNet design enables models to achieve strong results on the ImageNet image classification benchmark, attaining 84.7 percent top-1 accuracy while being 2.33x faster than traditional EfficientNet models in compute time on TPU-v3 hardware.
The team hopes that their work improves the understanding of architecture design in the field and that, in the future, it will serve as a strong baseline for studies on leveraging self-attention models for computer vision tasks.