Training efficiency has become a significant factor for deep learning as the neural network models, and training data size grows. GPT-3 is an excellent example to show how critical training efficiency factor could be as it takes weeks of training with thousands of GPUs to demonstrate remarkable capabilities in few-shot learning.
To address this problem, the Google AI team introduce two families of neural networks for image recognition. First is EfficientNetV2, consisting of CNN (Convolutional neural networks) with a small-scale dataset for faster training efficiency such as ImageNet1k (with 1.28 million images). Second is a hybrid model called CoAtNet, which combines convolution and self-attention to achieve higher accuracy on large-scale datasets such as ImageNet21 (with 13 million images) and JFT (with billions of images). As per the research report by Google, EfficientNetV2 and CoAtNet both are 4 to 10 times faster while achieving state-of-the-art and 90.88% top-1 accuracy on the well-established ImageNet dataset.
EfficientNetV2: Models for faster training and smaller in size
EfficientNetV2 builds upon the previous EfficientNet architecture. The Google AI team studied training speed bottlenecks on modern TPUs/GPUs to enhance and improve the original model. They found the following:
- Training with huge image sizes results in higher memory usage, leading to slow speed on TPUs/GPUs.
- Depthwise convolutions are inefficient on TPUs/GPUs since they have low hardware utilization.
- The uniform compound scaling approach that is commonly used and scales up every stage of convolutional networks equally is sub-optimal.
Google’s research team proposes both a training-aware neural architecture search (NAS) where the training speed is included in the optimization goal and a scaling method that scales different stages in a non-uniform manner to address these issues as described above.
Google’s research team evaluates the EfficientNetV2 models on ImageNet and few other transfer learning datasets. On ImageNet, EfficientNetV2 models outperform previous models with about 5–11x faster training speed and up to 6.8x smaller model size.
CoAtNet: Faster Speed and Higher Accuracy Models for Large-Scale Image Recognition
In CoAtNet (CoAtNet: Marrying Convolution and Attention for All Data Sizes), the research team studied ways to combine convolution and self-attention to develop fast and accurate neural networks for large-scale image recognition. By combining convolution and self-attention, the proposed hybrid models can achieve both greater capacity and better generalization.
The research group found two critical insights related to the finding of COAtNet:
- The depthwise convolution and self-attention can be naturally unified via simple relative attention.
- Vertically stacking convolution layers and attention layers in a way that considers their capacity and computation required in each stage (resolution) is surprisingly effective in improving generalization, capacity, and efficiency.
Based on the above insights, the Google research team created a family of hybrid models consisting of both convolution and attention, called CoAtNets.
As per Google research article data, CoAtNet model outperforms ViT models and its variants across a number of datasets, such as ImageNet1K, ImageNet21K, and JFT. While compared to convolutional networks, CoAtNet shows similar performance behavior on a small-scale dataset like ImageNet1K.
Paper (CoAtNet): https://arxiv.org/abs/2106.04803
Paper (EfficientNetV2): https://arxiv.org/abs/2104.00298