The Facebook AI Research team has developed a new computer vision model called ConViT. ConViT combines two widely used architectures, convolutional neural networks (CNNs) and Transformer-based models, to overcome some important limitations of each approach on its own. By drawing on both techniques, the model can outperform existing architectures when training data is scarce, while achieving similar performance in the large-data regime.
AI researchers often build certain assumptions, known as inductive biases, into machine learning models to help them train. These biases are widely used because they let a model reach solutions that generalize well from less data. CNNs rely on two of them: locality (pixels near one another are related) and weight sharing (different parts of an image are processed the same way regardless of their absolute position). Self-attention-based vision models, such as Data-efficient image Transformers (DeiT) and Detection Transformers (DETR), do not have these built-in biases. When trained on large datasets, they can match the performance of CNNs without relying on explicit convolutional layers. However, they often struggle on small datasets: nothing in the architecture tells them in advance what an image contains or where in space that information sits, so they must learn meaningful representations from the data alone.
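To make the two convolutional biases concrete, here is a minimal pure-Python sketch of a one-dimensional convolution. It is an illustration only: the `conv1d` helper, the three-tap kernel, and the toy signal are all hypothetical, and real vision models work on two-dimensional images with learned filters. Each output depends only on a small window of the input (locality), and the same kernel is reused at every position (weight sharing), so shifting the input simply shifts the output.

```python
def conv1d(signal, kernel):
    """Slide one kernel over the signal (no padding, stride 1)."""
    k = len(kernel)
    return [
        # Locality: each output looks at only k neighboring inputs.
        # Weight sharing: the same kernel values are used for every i.
        sum(kernel[t] * signal[i + t] for t in range(k))
        for i in range(len(signal) - k + 1)
    ]


kernel = [-1.0, 0.0, 1.0]            # hypothetical edge-detecting filter
signal = [0, 0, 1, 1, 0, 0, 0, 0]    # a short "bump"
shifted = [0, 0, 0, 0, 1, 1, 0, 0]   # the same bump moved right by 2

out = conv1d(signal, kernel)
out_shifted = conv1d(shifted, kernel)

# The response to the shifted input is the original response, shifted by 2.
print(out)
print(out_shifted)
```

A plain self-attention layer has neither property built in: every patch may attend to every other patch, and nothing forces nearby patches to matter more than distant ones.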
AI researchers are therefore faced with a trade-off. On the one hand, CNNs can achieve high performance even when given minimal data (a high floor), but their strong inductive biases may limit them when large quantities of data are available (a low ceiling). Transformers, in contrast, have minimal inductive biases, which can hold them back on small datasets, yet the same flexibility lets them outperform CNNs when data is plentiful.
Facebook AI researchers plan to present their answer to this trade-off at ICML 2021. They start from a simple question: is it possible to design models that benefit from inductive biases when helpful but aren’t limited by them when better solutions can be learned? In other words, can we get the best of both worlds? Their approach is to initialize the ConViT model with a ‘soft’ convolutional inductive bias, which the model can learn to ignore if necessary.
The goal of ConViT is to modify vision Transformers so that their layers are encouraged to act convolutionally. Rather than hard-coding this behavior, the researchers introduce a soft inductive bias that lets the model decide for itself whether to remain convolutional. They do so with gated positional self-attention (GPSA), in which the model learns gating parameters that control how much standard content-based attention is used relative to position-based attention that is initialized to behave like a convolution.
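The gating idea can be illustrated with a small sketch. The pure-Python example below is not the authors' implementation. It assumes the gate is a single scalar passed through a sigmoid, and that the positional scores simply decay with the squared distance between patches laid out in one dimension. The names `gated_attention` and `local_position_scores` and the `sharpness` parameter are hypothetical, chosen only for illustration.

```python
import math


def softmax(row):
    """Turn a row of scores into attention weights that sum to 1."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def local_position_scores(n_patches, sharpness=2.0):
    """Assumed positional scores: highest for a patch itself, lower with distance."""
    return [
        [-sharpness * (i - j) ** 2 for j in range(n_patches)]
        for i in range(n_patches)
    ]


def gated_attention(content_scores, position_scores, gate):
    """Blend content-based and position-based attention with a scalar gate.

    sigmoid(gate) near 1 -> mostly positional (convolution-like) attention.
    sigmoid(gate) near 0 -> mostly content-based attention.
    """
    g = sigmoid(gate)
    blended = []
    for c_row, p_row in zip(content_scores, position_scores):
        c_attn = softmax(c_row)
        p_attn = softmax(p_row)
        blended.append([(1.0 - g) * c + g * p for c, p in zip(c_attn, p_attn)])
    return blended


n = 4
content = [[0.0] * n for _ in range(n)]   # toy case: content attention is uniform
position = local_position_scores(n)

conv_like = gated_attention(content, position, gate=4.0)   # gate mostly open
free = gated_attention(content, position, gate=-4.0)       # gate mostly closed

print([round(w, 3) for w in conv_like[1]])
print([round(w, 3) for w in free[1]])
```

With a strongly positive gate, each patch attends mostly to itself and its immediate neighbors, much like a convolution. With a strongly negative gate, the content scores dominate and the layer behaves like ordinary self-attention. Because the gate is a trainable value in this sketch, training can move it in either direction, which is the sense in which the bias is soft.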
The ConViT model outperforms the recent Data-efficient image Transformers (DeiT) model of equivalent size and FLOPs. The researchers hope that their approach will encourage the community to explore other ways of moving from hard inductive biases to soft ones.