Facebook AI has built a new architecture for video understanding called TimeSformer. The architecture is based purely on Transformers, which have become the dominant approach for many natural language processing (NLP) applications, such as machine translation and general language understanding.
TimeSformer achieves the best reported numbers on multiple challenging action recognition benchmarks, including the Kinetics-400 action recognition dataset. Compared with modern 3D convolutional neural networks, it is nearly three times faster to train and requires less than one-tenth the compute for inference.
Moreover, the scalability of TimeSformer facilitates the training of much larger models on much longer video clips. It paves the way for AI systems to understand more complex human actions in videos, like activities involving multiple atomic steps. It will prove to be beneficial for many AI applications that require an understanding of complex human behaviors.
Traditional video classification models use 3D convolutional filters. TimeSformer, by contrast, is built on the self-attention mechanism used in Transformer models, making it possible to capture space-time dependencies over the entire video. To apply Transformers to video, the model interprets the input video as a time-space sequence of image patches extracted from the individual frames. This format is quite similar to that used in NLP, where Transformers deduce each word’s meaning by comparing it with all the other words in the sentence; this mechanism is known as self-attention. TimeSformer captures each patch’s semantics by explicitly comparing it with the other patches in the video, making it possible to capture both short-term dependencies between neighboring patches and long-range correlations between distant patches.
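The patch-sequence representation can be sketched as follows. This is a minimal illustration with assumed dimensions (16×16 patches, 224×224 frames), not TimeSformer's actual implementation, which additionally applies learned linear projections and positional embeddings to each patch:

```python
import numpy as np

def video_to_patches(video, patch_size=16):
    """Decompose a video into a sequence of non-overlapping image patches.

    video: (T, H, W, C) array of frames.
    Returns: (T * N, patch_size * patch_size * C), where N is the
    number of patches per frame.
    """
    T, H, W, C = video.shape
    assert H % patch_size == 0 and W % patch_size == 0
    ph, pw = H // patch_size, W // patch_size
    # Split each frame into a grid of patches, then flatten each patch
    # into a single vector (one "token" per patch).
    patches = video.reshape(T, ph, patch_size, pw, patch_size, C)
    patches = patches.transpose(0, 1, 3, 2, 4, 5)  # (T, ph, pw, p, p, C)
    return patches.reshape(T * ph * pw, patch_size * patch_size * C)

# Example: 8 frames of 224x224 RGB -> 8 * 14 * 14 = 1568 patch tokens,
# each of dimension 16 * 16 * 3 = 768.
video = np.zeros((8, 224, 224, 3), dtype=np.float32)
tokens = video_to_patches(video)
print(tokens.shape)  # (1568, 768)
```

The resulting token sequence plays the same role as a sequence of word embeddings in an NLP Transformer.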
TimeSformer keeps computational cost low in two ways: first, by decomposing the video into a small set of non-overlapping patches; second, by applying a form of self-attention, called divided space-time attention, that avoids exhaustive comparison between all pairs of patches. In temporal attention, each patch is compared only with the patches at the same spatial location in the other frames. In spatial attention, each patch is compared only with the patches within the same frame. The researchers found that divided space-time attention is both more efficient and more accurate than joint space-time attention.
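A rough sketch of the divided attention pattern, assuming the patch tokens from above. This toy version uses plain unparameterized attention (no learned query/key/value projections, no multiple heads) purely to show the two axes of comparison; the key point is that each patch attends to T + N others rather than all T × N:

```python
import numpy as np

def attention(x):
    """Plain single-head self-attention over the sequence axis.
    x: (batch, seq, dim) -> (batch, seq, dim). Learned weights omitted."""
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def divided_space_time_attention(tokens, T, N):
    """tokens: (T * N, D) patch embeddings for T frames, N patches per frame.
    Temporal attention first (each patch attends to the same spatial
    location in the other frames), then spatial attention (each patch
    attends within its own frame)."""
    D = tokens.shape[-1]
    x = tokens.reshape(T, N, D)
    # Temporal attention: N sequences of length T, one per spatial location.
    x = attention(x.transpose(1, 0, 2)).transpose(1, 0, 2)
    # Spatial attention: T sequences of length N, one per frame.
    x = attention(x)
    return x.reshape(T * N, D)

# Example with small illustrative dimensions.
T, N, D = 4, 9, 8
out = divided_space_time_attention(np.random.randn(T * N, D), T, N)
print(out.shape)  # (36, 8)
```

Joint space-time attention would instead compare every patch with all T × N patches in the clip, which is what makes it more expensive than the divided scheme.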
The scalability of TimeSformer allows it to operate on extremely long clips and perform super-long-range temporal modeling. This differs significantly from current 3D CNNs, which are limited to processing clips that are at most a few seconds long. With TimeSformer, the researchers could train on far longer video clips — up to several minutes long. This development may accelerate research on teaching machines to understand complex long-form actions in videos, which is crucial for many AI applications geared toward human behavior understanding, such as an AI assistant.
The low inference cost of TimeSformer is a significant step toward supporting future real-time video processing applications, such as AR/VR or intelligent assistants that provide services based on video taken from wearable cameras. The approach’s reduced cost will also allow more researchers to tackle video analysis problems, expediting progress in this area. TimeSformer can likewise prove to be an essential step toward supporting applications requiring real-time or on-demand processing of video.