Facebook AI has developed Multiscale Vision Transformers (MViT), a new family of Transformer architectures for visual representation learning built around hierarchical, multiscale representations. MViT is also the first such system trained entirely from scratch on video recognition datasets, such as Kinetics-400, while achieving state-of-the-art performance across transfer learning tasks in video classification and human action localization.
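The hierarchical idea can be sketched in a few lines: early stages keep high spatial resolution with few channels, and each later stage trades resolution for channel capacity. The numbers below are illustrative assumptions, not MViT's published configuration.

```python
def multiscale_stages(resolution=56, channels=96, num_stages=4):
    """Hypothetical sketch of a multiscale hierarchy in the spirit of MViT:
    spatial resolution shrinks stage by stage while channel width grows."""
    stages = []
    for _ in range(num_stages):
        stages.append((resolution, channels))
        resolution //= 2   # spatial resolution halves...
        channels *= 2      # ...while channel capacity doubles
    return stages

print(multiscale_stages())
# [(56, 96), (28, 192), (14, 384), (7, 768)]
```

This resolution-for-channels trade is what lets the model spend compute on fine spatial detail early and on rich semantics late.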
MViT models offer a fast new way to recognize objects and actions in images and videos. MViT performs competitively on datasets such as Kinetics and ImageNet, and it transfers well to downstream tasks such as action recognition on Charades and AVA (Atomic Visual Actions). By applying MViT to real-world images and videos, machines may become better at analyzing uncurated visual data.
MViT also advances the Transformer backbone itself. Typical Vision Transformers use attention mechanisms to determine which tokens to focus on; MViT replaces these with pooling attention, which reduces the visual resolution by pooling the query, key, and value projections.
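A minimal single-head sketch of this pooling attention, in numpy rather than the authors' actual implementation: pooling the queries shrinks the output resolution, while pooling the keys and values cuts the cost of the attention matrix. The average-pooling helper and the stride values are illustrative assumptions.

```python
import numpy as np

def pool_tokens(x, h, w, stride):
    """Hypothetical helper: average-pool a token sequence viewed as an h x w grid."""
    n, d = x.shape
    grid = x.reshape(h, w, d)
    ph, pw = h // stride, w // stride
    pooled = grid[:ph * stride, :pw * stride].reshape(ph, stride, pw, stride, d).mean(axis=(1, 3))
    return pooled.reshape(ph * pw, d), ph, pw

def pooling_attention(x, h, w, stride_q=2, stride_kv=2):
    """Single-head pooling attention: pool Q to lower the output resolution,
    pool K and V to shrink the attention matrix."""
    q, hq, wq = pool_tokens(x, h, w, stride_q)   # fewer query tokens -> coarser output grid
    k, _, _ = pool_tokens(x, h, w, stride_kv)
    v, _, _ = pool_tokens(x, h, w, stride_kv)
    d = x.shape[1]
    scores = q @ k.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over pooled keys
    return weights @ v, hq, wq

x = np.random.randn(16 * 16, 64)        # a 16x16 grid of 64-dim tokens
out, hq, wq = pooling_attention(x, 16, 16)
# the output grid is 8x8: the query pooling halved the spatial resolution
```

In the real model the pooling is learned (strided convolutions) and multi-headed, but the effect is the same: each stage's attention operates at, and emits, a reduced resolution.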
MViT drastically improves video-understanding performance while requiring no specialized external pretraining: it trains from scratch in a single stage. It also greatly surpasses state-of-the-art benchmark performance across recognition tests such as ImageNet, Kinetics-400, Kinetics-600, and AVA.
The MViT model also provides a way to learn temporal cues without being misled by spurious spatial biases. This is a significant breakthrough that could be useful in many AI applications, such as robotics and autonomous vehicles.