Facebook AI has developed a new technique called Data-efficient image Transformers (DeiT) for training computer vision models that leverage Transformers, an architecture that has unlocked dramatic advances across many areas of artificial intelligence.
DeiT requires far less data and far fewer computing resources to produce a high-performance image classification model. Training a DeiT model on a single 8-GPU server for three days, Facebook AI achieved 84.2 percent top-1 accuracy on the ImageNet benchmark without any external training data. The result is competitive with cutting-edge convolutional neural networks (CNNs), which have been the principal approach to image classification until now.
By showing that Transformers can be trained efficiently for image classification using only regular academic data sets, the researchers expect to extend Transformers to new use cases and make this work more accessible to researchers and engineers who lack access to large-scale systems for training massive AI models.
Image classification, that is, understanding the main content of an image, is easy for humans but hard for machines. It is particularly difficult for convolution-free Transformers like DeiT because these systems have few built-in statistical priors about images. As a result, they typically have to “see” a lot of example images to learn to classify different objects. DeiT, however, can be trained effectively with approximately 1.2 million images rather than hundreds of millions.
- The first key ingredient of DeiT is its training strategy. The researchers used data augmentation, optimization, and regularization to simulate training on a much larger data set, as is commonly done for CNNs.
- They also modified the Transformer architecture to allow native distillation, a process in which one neural network (the student) learns from the output of another network (the teacher). They used a CNN as the teacher model for the Transformer.
- Distillation can hamper the performance of the student, because the student model learns from two sources that may diverge: a labeled data set (strong supervision) and the teacher. To alleviate this, the researchers introduced a distillation token: a learned vector that flows through the network along with the transformed image data and cues the model for its distillation output, which can differ from its class output. This improved distillation method is specific to Transformers.
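The two sources of supervision described above can be sketched as a loss with two terms. The sketch below is a minimal illustration in plain Python, not Facebook AI's implementation. It assumes a "hard" distillation variant, in which the teacher's top prediction is treated as a second label, and an equal 0.5/0.5 weighting of the two terms. The function names `softmax_cross_entropy` and `hard_distillation_loss` are hypothetical, and the logits are assumed to come from two separate output heads: one reading the class token and one reading the distillation token.

```python
import math


def softmax_cross_entropy(logits, target):
    """Cross-entropy for one example, given raw logits and a target class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def hard_distillation_loss(class_logits, distill_logits, teacher_logits, label):
    """Combine supervision from the data set label and from the teacher.

    class_logits:   student output read from the class token (assumed head)
    distill_logits: student output read from the distillation token (assumed head)
    teacher_logits: output of the CNN teacher for the same image
    label:          ground-truth class index from the labeled data set
    """
    # The teacher's most confident class serves as the target for the distillation head.
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    loss_label = softmax_cross_entropy(class_logits, label)
    loss_teacher = softmax_cross_entropy(distill_logits, teacher_label)
    # Equal weighting is an assumption made for this sketch.
    return 0.5 * loss_label + 0.5 * loss_teacher


# Example: the label says class 0, while the teacher prefers class 2.
loss = hard_distillation_loss(
    class_logits=[10.0, 0.0, 0.0],
    distill_logits=[0.0, 0.0, 10.0],
    teacher_logits=[0.0, 0.0, 5.0],
    label=0,
)
```

Because each head has its own target, the class output can follow the label while the distillation output follows the teacher, so the two signals do not have to be reconciled in a single prediction.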
DeiT is an important step toward using Transformers to advance computer vision. It will also help democratize AI research by showing that developers with limited access to data and computing resources can train or use these new models.