Recently, computer vision (CV) research driven by deep learning has made significant progress in classifying video clips taken from the Internet and analyzing the human actions in them. These video-based tasks can be quite challenging, as they require an understanding of the interactions between humans, objects, and the context within a given scene. They also require reasoning over long temporal intervals.
In the paper Unified Graph Structured Models for Video Understanding, a team of researchers from Google Research proposes a message-passing graph neural network (MPNN) that explicitly models these spatio-temporal relations, can use either implicitly or explicitly captured representations of objects, and generalizes previous structured models for video understanding.
The paper focuses on spatio-temporal action recognition and video scene graph parsing, two tasks that require reasoning about interactions between actors, objects, and their environment in both space and time.
The proposed MPNN method builds structured representations of videos by representing each one as a graph of actors, objects, and contextual elements in a scene. It models spatial and temporal interactions coherently within this graph, and is then applied to action recognition and scene graph prediction to understand the interactions between the graph's elements.
MPNN is a relatively flexible model that can operate on a directed or undirected graph. Its inference consists of two phases: a message-passing phase, in which messages are computed by applying spatial and temporal message-passing functions, and a final readout phase, in which a readout function uses the updated node features for classification.
Within the message-passing phase, once the messages are computed, an update function aggregates the messages each node receives to update that node's latent state.
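The message, update, and readout steps described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, the tanh message and update forms, the sum aggregation, and the random weights are all assumptions chosen only to show the flow of one message-passing round followed by a readout.

```python
import numpy as np

def message_passing_step(h, edges, w_msg, w_upd):
    """One round of message passing over directed (src, dst) edges."""
    messages = np.zeros_like(h)
    for src, dst in edges:
        # Message function (assumed form): transform the sender's state.
        messages[dst] += np.tanh(h[src] @ w_msg)
    # Update function (assumed form): combine each node's own state
    # with the sum of the messages it received.
    return np.tanh(h @ w_upd + messages)

def readout(h, w_out):
    """Readout: map updated node features to class probabilities (softmax)."""
    logits = h @ w_out
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
num_nodes, dim, num_classes = 4, 8, 3
h0 = rng.normal(size=(num_nodes, dim))      # initial latent states
edges = [(0, 1), (1, 0), (2, 1), (3, 0)]    # hypothetical directed edges
w_msg = rng.normal(size=(dim, dim)) * 0.1
w_upd = rng.normal(size=(dim, dim)) * 0.1
w_out = rng.normal(size=(dim, num_classes)) * 0.1

h1 = message_passing_step(h0, edges, w_msg, w_upd)
probs = readout(h1, w_out)                  # one distribution per node
```

In this toy graph, nodes 2 and 3 have no incoming edges, so their updated states depend only on their own previous states. An undirected graph can be expressed by listing each edge in both directions.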
MPNN models scene context by including the features from each spatial position in the feature map in the graph. The researchers also added an implicit object model, which enables the network to encode information about the scene and relevant objects without any extra supervision.
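One way to picture how every spatial position of a feature map can contribute to the graph is the sketch below. It is an illustration under stated assumptions, not the authors' code: the helper names `context_nodes` and `actor_node`, the mean pooling over a box (a stand-in for a proper region-pooling operator), and the box coordinates are all hypothetical.

```python
import numpy as np

def context_nodes(feature_map):
    """Turn an (H, W, C) feature map into H*W context nodes of size C."""
    h, w, c = feature_map.shape
    return feature_map.reshape(h * w, c)

def actor_node(feature_map, box):
    """Assumed stand-in for region pooling: average features inside a box.

    box is (y0, x0, y1, x1) in feature-map cells, end-exclusive.
    """
    y0, x0, y1, x1 = box
    return feature_map[y0:y1, x0:x1].mean(axis=(0, 1))

# Toy feature map with 4 x 5 spatial positions and 3 channels.
feature_map = np.arange(4 * 5 * 3, dtype=float).reshape(4, 5, 3)

ctx = context_nodes(feature_map)             # 20 context nodes
actor = actor_node(feature_map, (0, 0, 2, 2))

# Node 0 is the actor; nodes 1..20 are the context positions.
nodes = np.vstack([actor[None, :], ctx])
# Every context node sends a message to the actor node.
edges = [(i, 0) for i in range(1, len(nodes))]
```

Because the context nodes come straight from the feature map, no object boxes or labels are needed to build them, which is consistent with the "no extra supervision" point above.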
MPNN models temporal interactions by connecting foreground nodes in a keyframe with all other foreground nodes in neighboring keyframes. The sampling rate between keyframes can be set to one or more, which makes it possible to cover a wider temporal interval in a more computationally efficient manner.
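The temporal connectivity described above can be sketched as follows. This is a hedged illustration rather than the paper's method: it assumes "sampling" means a stride between connected keyframes, and the function name `temporal_edges` and its `window` and `stride` parameters are hypothetical.

```python
def temporal_edges(foreground, window, stride=1):
    """Connect foreground nodes across neighboring keyframes.

    foreground: list of lists, the foreground node ids in each keyframe.
    window:     number of neighboring keyframes to connect on each side.
    stride:     assumed sampling step between connected keyframes (>= 1).
    """
    if stride < 1:
        raise ValueError("stride must be at least 1")
    edges = []
    for t, nodes in enumerate(foreground):
        for k in range(1, window + 1):
            for t2 in (t - k * stride, t + k * stride):
                if 0 <= t2 < len(foreground):
                    # Fully connect this keyframe's nodes to the neighbor's.
                    edges.extend((u, v) for u in nodes for v in foreground[t2])
    return edges

# Five keyframes with hypothetical foreground node ids.
foreground = [[0, 1], [2], [3, 4], [5], [6]]

dense = temporal_edges(foreground, window=1, stride=1)
strided = temporal_edges(foreground, window=1, stride=2)
wide = temporal_edges(foreground, window=2, stride=1)
```

In this toy example, `strided` links keyframes two steps apart using the same number of edges as `dense`, while `wide` reaches the same distance with stride 1 at twice the edge count. That is the trade-off behind covering a longer interval more cheaply.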
The researchers evaluated MPNN on scene graph classification (SGCls), predicate classification (PredCls), and spatio-temporal action detection tasks. They used the Action Genome dataset for video scene graph classification and prediction, and the AVA and UCF101-24 datasets for spatio-temporal action recognition.
In video scene graph classification, the proposed spatio-temporal graph-structured model achieved substantial improvements. The model also showed substantial improvements in spatio-temporal action detection on the AVA dataset.