Google AI Introduces A Multi-Axis Approach for Vision Transformer and MLP Models

Following the release of AlexNet in 2012, convolutional neural networks established themselves as the primary machine learning architecture for computer vision. More recently, attention mechanisms have been widely incorporated into vision models: by emphasizing some parts of the input while downplaying others, they allow a network to zero in on key details. The Vision Transformer (ViT) opened a new, convolution-free landscape of model architectures for computer vision. ViT applies a Transformer encoder to image patches, treating them like a sequence of words, and delivers impressive image recognition performance when trained on sufficiently large datasets.

Studies have even shown that neither convolutions nor attention is strictly necessary for adequate performance.

A new Google study introduces a multi-axis approach that is both simple and effective. It adapts naturally to varying input sizes while maintaining high flexibility and low complexity. The team has developed two backbone models, one for high-level and one for low-level vision tasks:

  1. MaxViT: Multi-Axis Vision Transformer: It outperforms previous work on challenging tasks such as image classification, object detection, segmentation, quality assessment, and generation. 
  2. MAXIM: Multi-Axis MLP for Image Processing: It uses a UNet-like architecture to achieve competitive performance on low-level imaging tasks such as denoising, deblurring, dehazing, deraining, and low-light enhancement. 

In contrast to ViT’s full self-attention (each pixel attends to every other pixel), this novel method uses multi-axis attention to decompose attention into two sparse forms: local (block) attention and sparse global (grid) attention. Compared to the original attention used in ViT, this simplified version generalizes better, performing well on a variety of vision tasks and particularly on high-resolution visual prediction. The researchers have built two foundational instantiations of this multi-axis attention strategy, MaxViT and MAXIM, tailored to high-level and low-level tasks, respectively.
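To make the decomposition concrete, here is a minimal NumPy sketch (the helper names are hypothetical, not from the papers) of the two sparse attention layouts: `block_partition` groups tokens into non-overlapping local windows, while `grid_partition` groups strided tokens so that each group spans the whole image.

```python
import numpy as np

def block_partition(x, w):
    # Local "block" attention layout: split the (H, W) feature map into
    # non-overlapping w x w windows; attention then runs inside each one.
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, w * w, C)

def grid_partition(x, g):
    # Sparse "grid" attention layout: form a g x g grid of strided tokens,
    # so every group sees pixels spread across the whole image.
    H, W, C = x.shape
    x = x.reshape(g, H // g, g, W // g, C)
    return x.transpose(1, 3, 0, 2, 4).reshape(-1, g * g, C)

x = np.random.rand(8, 8, 3)
print(block_partition(x, 4).shape)  # 4 windows of 16 tokens: (4, 16, 3)
print(grid_partition(x, 4).shape)   # 4 groups of 16 strided tokens: (4, 16, 3)
```

With a fixed window or grid size, attention inside each group costs a constant amount per group, so total cost grows linearly with the number of pixels, versus quadratically for full self-attention.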

MaxViT begins with a single MaxViT block, created by combining multi-axis attention with the MBConv block (introduced by EfficientNet and refined in EfficientNetV2). Regardless of input resolution, this single block can encode both local and global visual information. The team obtained the uniform MaxViT architecture by simply stacking attention and convolutional blocks in a hierarchical layout (following ResNet and CoAtNet). Compared with previous hierarchical approaches, MaxViT stands out for its strong model capacity across a variety of tasks and its ability to “see” globally throughout the whole network, including in the earlier, high-resolution stages.
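As a rough illustration of this composition (not the paper's implementation: the MBConv and attention operators are reduced to toy stand-ins), a single block chains convolution, local window mixing, and sparse grid mixing, each with a residual connection:

```python
import numpy as np

def mbconv(x):
    # Toy stand-in for MBConv: a 3x3 mean filter built from padded shifts.
    H, W, C = x.shape
    p = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    return sum(p[i:i + H, j:j + W] for i in range(3) for j in range(3)) / 9.0

def window_mix(x, w, local=True):
    # Toy stand-in for attention: average tokens within each w x w local
    # window (local=True) or across each sparse w x w grid (local=False).
    H, W, C = x.shape
    s = x.reshape(H // w, w, W // w, w, C) if local else x.reshape(w, H // w, w, W // w, C)
    m = s.mean(axis=(1, 3) if local else (0, 2), keepdims=True)
    return np.broadcast_to(m, s.shape).reshape(H, W, C)

def maxvit_block(x, w=4):
    # One MaxViT-style block: MBConv, then block (local) attention,
    # then grid (sparse global) attention, each with a residual.
    x = x + mbconv(x)
    x = x + window_mix(x, w, local=True)
    x = x + window_mix(x, w, local=False)
    return x

y = maxvit_block(np.random.rand(8, 8, 3))
print(y.shape)  # (8, 8, 3)
```

Because the block preserves spatial shape, copies of it can be stacked within a stage and interleaved with downsampling between stages, which is how the hierarchical backbone is assembled.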

For low-level, image-to-image prediction tasks, the researchers propose MAXIM, a generic UNet-like architecture, as the second backbone. Built on the gated multi-layer perceptron (gMLP) network, MAXIM explores parallel designs for both local and global approaches. MAXIM’s cross-gating block applies interactions between multiple input signals; because it relies only on the inexpensive gated MLP operators to mix those inputs, it is an efficient building block.

In addition, MAXIM’s gated MLP and cross-gating blocks have complexity linear in image size, making the model well suited to processing high-resolution images.
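A hedged sketch of why this is linear (hypothetical names; a simplified gMLP spatial gating unit, not MAXIM's exact blocks): the token-mixing weight matrix acts within fixed-size blocks, so its size does not grow with the image, and the total cost scales linearly with image area.

```python
import numpy as np

def spatial_gating(x, W, b):
    # gMLP-style spatial gating unit: split the channels in half, mix one
    # half along the token axis, and use it to gate the other half.
    u, v = np.split(x, 2, axis=-1)               # each (..., n, d/2)
    v = np.einsum('mn,...nd->...md', W, v) + b   # token-mixing projection
    return u * v

# Applied within fixed-size local blocks (the multi-axis idea), the mixing
# matrix W stays (w*w, w*w) however large the image is, so total cost is
# linear in image area.
w = 4
x = np.random.rand(16, w * w, 6)   # 16 blocks of 4x4 tokens, 6 channels
W = np.random.rand(w * w, w * w)
b = np.zeros((w * w, 1))
y = spatial_gating(x, W, b)
print(y.shape)  # (16, 16, 3)
```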

The findings show that MaxViT performs well across a variety of vision tasks. With only ImageNet-1K training, MaxViT achieves 86.5% top-1 accuracy; with ImageNet-21K (14M images, 21k classes) pre-training, MaxViT achieves 88.7% top-1 accuracy.

MaxViT provides good performance across a wide range of downstream tasks. The MaxViT backbone achieves 53.4 AP for object detection and segmentation on the COCO dataset, outperforming previous foundational models while using only roughly 60% of their computational cost. In addition to its strength in image classification, the MaxViT building block also shows promise in image generation, outperforming the state-of-the-art HiT model with fewer parameters and attaining better FID and IS scores on the ImageNet-1K unconditional generation task.

MAXIM demonstrates state-of-the-art results on 15 of 20 tested datasets for image processing tasks such as denoising, deblurring, deraining, dehazing, and low-light enhancement, with fewer or comparable parameters and FLOPs than competing models. MAXIM’s image restorations are also cleaner and more detailed than those of competing methods.

Paper 1: MaxViT: Multi-Axis Vision Transformer

Paper 2: MAXIM: Multi-Axis MLP for Image Processing
