Google Brain researchers announced a two-billion-parameter deep-learning computer-vision (CV) model. Trained on three billion images, the model achieved a new state-of-the-art result of 90.45 percent top-1 accuracy on ImageNet.
The ViT-G/14 model is based on Google’s recent work on Vision Transformers (ViT). ViT-G/14 beat prior state-of-the-art systems on several benchmarks, including ImageNet, ImageNet-v2, and VTAB-1k. For example, on the few-shot image-recognition task, the accuracy gain was more than five percentage points. The researchers also trained several smaller versions of the model to look for a scaling law for the architecture, observing that performance follows a power-law function, similar to Transformer models used for natural-language processing (NLP) applications.
The Transformer architecture, first introduced by Google researchers in 2017, has quickly become the most popular design for NLP deep-learning models, with OpenAI’s GPT-3 being one of the most well-known. Scaling laws for these models were described in a study released by OpenAI last year. By training several similar models of various sizes and varying the amount of training data and compute, OpenAI derived a power-law function for estimating a model’s accuracy. Furthermore, OpenAI found that larger models not only perform better but are also more compute-efficient.
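The key property of such a scaling law is that test error falls as a power of the scaled quantity, so it appears as a straight line on a log-log plot. A minimal sketch of this idea, using synthetic data with an illustrative exponent (not the coefficients OpenAI actually reported):

```python
import numpy as np

# Hypothetical power-law relationship: error = a * compute^(-b).
# The values 2.5 and 0.3 below are illustrative, not from the paper.
compute = np.logspace(3, 7, 5)       # synthetic compute budgets
error = 2.5 * compute ** (-0.3)      # synthetic errors on a power law

# On a log-log scale a power law is linear, so a straight-line fit
# in log space recovers the exponent b as the negative slope.
slope, intercept = np.polyfit(np.log(compute), np.log(error), 1)
print(round(-slope, 3))  # → 0.3, the exponent used to generate the data
```

The same log-space fit is how a practitioner would check whether a family of trained models actually follows a power law: if the points deviate from a line, the power-law form does not hold in that regime.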
Most state-of-the-art CV deep-learning models, unlike NLP models, employ a convolutional neural network (CNN) architecture. The architecture rose to prominence after a CNN model won the ImageNet competition in 2012. With Transformers’ recent success in the NLP field, researchers have begun to look at how well they perform on vision problems; for example, OpenAI has constructed an image-generation system based on GPT-3. Google has been very active in this field, training a 600M-parameter ViT model in late 2020 using their proprietary JFT-300M dataset.
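The way ViT applies a Transformer to images is by splitting each image into fixed-size patches and treating each flattened patch as a token, analogous to a word in NLP; the "/14" in ViT-G/14 refers to its 14x14-pixel patch size. A minimal sketch of that patch-tokenization step (the 224x224 input size is illustrative):

```python
import numpy as np

# Sketch of ViT-style patch tokenization (input size is illustrative).
# A 224x224 RGB image with 14x14 patches yields 16*16 = 256 tokens.
image = np.zeros((224, 224, 3), dtype=np.float32)
P = 14                              # patch size (the "/14" in ViT-G/14)
H, W, C = image.shape
n_h, n_w = H // P, W // P

# Reshape into (num_patches, patch_pixels): each row is one "token"
# that the Transformer then processes like a word embedding.
patches = (image.reshape(n_h, P, n_w, P, C)
                .transpose(0, 2, 1, 3, 4)
                .reshape(n_h * n_w, P * P * C))
print(patches.shape)  # → (256, 588): 256 tokens of 14*14*3 values each
```

In the real model each 588-dimensional patch vector is then linearly projected to the Transformer's hidden dimension before attention is applied.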
The new ViT-G/14 model was pre-trained on JFT-3B, an upgraded version of the dataset with approximately three billion images. The research team modified the ViT architecture to improve memory efficiency, allowing the model to fit into a single TPUv3 core. To evaluate the performance of ViT-G/14 and the smaller models, the researchers used few-shot and fine-tuning transfer learning on the pre-trained models. The findings were used to derive scaling laws, similar to those found for NLP models:
- Scaling up compute, model size, and data together improves accuracy, following a power-law function.
- Accuracy can be a bottleneck in smaller models.
- Large models benefit from larger datasets.
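Few-shot transfer evaluation of the kind described above typically fits a simple linear classifier ("linear probe") on frozen features from the pre-trained model, using only a handful of labeled examples per class. A sketch of that protocol, with synthetic Gaussian features standing in for what a pre-trained ViT would actually emit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for frozen pre-trained features: 5 classes,
# 10 labeled examples ("shots") each, 64-dimensional features.
n_classes, shots, dim = 5, 10, 64
means = rng.normal(size=(n_classes, dim)) * 3.0   # class centroids
X = np.concatenate([m + rng.normal(size=(shots, dim)) for m in means])
y = np.repeat(np.arange(n_classes), shots)

# Linear probe: least-squares fit of a weight matrix against
# one-hot class targets, with the feature extractor left frozen.
Y = np.eye(n_classes)[y]
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
pred = (X @ W).argmax(axis=1)
print((pred == y).mean())  # → 1.0 on these well-separated features
```

Because the backbone stays frozen, few-shot accuracy directly measures the quality of the pre-trained representation, which is why it is a natural axis along which to study scaling.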
ViT-G/14 currently ranks #1 on the ImageNet leaderboard. The next eight highest-scoring models were also created by Google researchers, while the tenth was created by Facebook. In addition, Google released the code and weights for last year’s 600M-parameter ViT model on GitHub.
ViT model code and weights on GitHub: https://github.com/google-research/vision_transformer