Researchers from InnoPeak Technology Propose GCF-Net: Gated Clip Fusion Network for Video Action Recognition

Researchers from InnoPeak Technology in Palo Alto, California, introduce the Gated Clip Fusion Network (GCF-Net), which boosts existing video action classifiers at a tiny computational overhead.

Most of the recent accuracy gains in video action recognition have come from newly designed CNN architectures such as 3D-CNNs. These models are trained by applying a deep CNN to a single clip of fixed temporal length, so each video segment is processed by the 3D-CNN module separately. The resulting clip descriptor is therefore local, and the inter-clip relationships are left implicit. In addition, the 3D-CNN has a very limited temporal receptive field (e.g., merely 16 frames).
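The sketch below illustrates this clip-based setup: a video is split into fixed-length clips and a 3D-CNN backbone is run on each clip independently, producing purely local descriptors. The 16-frame clip length and the r3d_18 backbone are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch (not the authors' code) of per-clip 3D-CNN processing.
import torch
import torchvision.models.video as video_models

backbone = video_models.r3d_18(weights=None)   # any clip-based 3D-CNN backbone
backbone.fc = torch.nn.Identity()              # keep the 512-d clip descriptor

video = torch.randn(1, 3, 128, 112, 112)       # (batch, C, T, H, W), 128 frames
clip_len = 16
clips = video.split(clip_len, dim=2)           # 8 non-overlapping 16-frame clips

with torch.no_grad():
    # Each clip is processed separately, so every descriptor only "sees"
    # 16 frames: the limited receptive field discussed above.
    clip_descriptors = torch.stack([backbone(c) for c in clips], dim=1)

print(clip_descriptors.shape)                  # (1, 8, 512)
```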

The standard practice of directly averaging the clip-level outputs into a video-level prediction is likely to fail because it lacks a mechanism for extracting and integrating the information that is actually relevant to the video.
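For reference, the naive baseline looks roughly like the following: every clip is classified on its own and the clip-level logits are uniformly averaged. The shapes and the `clip_logits` tensor are assumptions for illustration.

```python
# Sketch of the naive averaging baseline discussed above.
import torch

num_clips, num_classes = 8, 400
clip_logits = torch.randn(1, num_clips, num_classes)   # (batch, clips, classes)

video_logits = clip_logits.mean(dim=1)                 # uniform weight for every clip
video_pred = video_logits.argmax(dim=-1)
# Irrelevant or background clips contribute exactly as much as the clips that
# actually contain the action, which is why this averaging can fail.
```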

Aiming to fill this research gap, the researchers propose a lightweight network that boosts the accuracy of existing clip-based action classifiers.

The GCF-Net leverages two strategies: 

  1. It explicitly models the interdependencies between video clips to widen the receptive field of the local clip descriptors.
  2. It estimates each clip’s importance to the action event and selects a relevant subset of clips for the video-level analysis.

The research presents a novel Bi-directional Inter-Clip Fusion method that uses short- and long-range video segments to model inter-clip relationships and generate better clip representations. Conventional methods rely on the 3D-CNN to generate each local feature separately, with only minimal local temporal knowledge. The team reports that, compared with these earlier methods, the proposed fusion produces a better clip representation with a broader receptive field (see the sketch below).
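A minimal sketch of the idea, assuming the bi-directional fusion can be approximated by a standard non-causal (bidirectional) multi-head self-attention layer over the sequence of clip descriptors; the paper's exact layer design may differ.

```python
# Illustrative inter-clip fusion: every clip attends to every other clip.
import torch
import torch.nn as nn

class InterClipFusion(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, clip_feats):                 # (batch, clips, dim)
        # Attention runs over short- and long-range clips alike, so the
        # fused descriptor has a video-wide receptive field.
        fused, _ = self.attn(clip_feats, clip_feats, clip_feats)
        return self.norm(clip_feats + fused)       # residual + layer norm

fusion = InterClipFusion()
fused_descriptors = fusion(torch.randn(1, 8, 512))  # (1, 8, 512)
```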

A Gated Clip-Wise Attention module is introduced to further suppress irrelevant clips and improve the video-level prediction accuracy. As a byproduct, the module’s attention weights can be used to locate the time interval of an action event in a video (i.e., the interval covered by the relevant clips).
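The following sketch assumes a simple gating formulation: a small gating head scores each fused clip descriptor, the scores weight the clip-level logits, and the same weights can be read out to localize the action in time. This is an illustration, not the paper's exact gating design.

```python
# Illustrative gated clip-wise attention head.
import torch
import torch.nn as nn

class GatedClipAttention(nn.Module):
    def __init__(self, dim=512, num_classes=400):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())  # per-clip gate in [0, 1]
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, fused_feats):                  # (batch, clips, dim)
        weights = self.gate(fused_feats)             # (batch, clips, 1)
        clip_logits = self.classifier(fused_feats)   # (batch, clips, classes)
        # Irrelevant clips receive small weights and are suppressed in the
        # video-level prediction.
        video_logits = (weights * clip_logits).sum(1) / weights.sum(1).clamp(min=1e-6)
        return video_logits, weights.squeeze(-1)     # weights localize the action in time

head = GatedClipAttention()
video_logits, clip_weights = head(torch.randn(1, 8, 512))
```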

The proposed GCF-Net, which models the inter-clip relationships and clip-wise importance at a much finer granularity, yields a significant gain in video action recognition accuracy over traditional methods that analyze all clips or only randomly/centrally selected clips.

Experiments show that GCF-Net yields large accuracy gains on two action datasets at a small additional cost in MFLOPs. With the same training data and backbone network, the new method improves the accuracy of a state-of-the-art action classifier by 11.49% and 3.67% on a large benchmark video dataset.

Source Paper: https://arxiv.org/pdf/2102.01285.pdf