Researchers from InnoPeak Technology in Palo Alto, California, introduce the Gated Clip Fusion Network (GCF-Net), which boosts existing video action classifiers at a tiny computational overhead.
Most of the recent accuracy gains in video action recognition have come from newly designed CNN architectures such as 3D-CNNs. These models are trained on single clips of fixed temporal length, and each video segment is processed by the 3D-CNN module separately. The resulting clip descriptor is therefore local, and inter-clip relationships are left implicit; the 3D-CNN itself has a very limited temporal receptive field (e.g., merely 16 frames).
The standard approach of directly averaging the clip-level outputs into a video-level prediction is prone to failure because it lacks a mechanism to extract and integrate the information that is actually relevant to representing the video.
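To make this baseline concrete, here is a minimal PyTorch sketch (not the paper's code; module names and tensor shapes are illustrative assumptions) of the standard clip-based pipeline, where a 3D-CNN scores each clip in isolation and the clip-level outputs are simply averaged:

```python
import torch
import torch.nn as nn

class ClipAverageBaseline(nn.Module):
    """Standard clip-based pipeline: score clips independently, then average."""
    def __init__(self, clip_backbone: nn.Module, feature_dim: int, num_classes: int):
        super().__init__()
        self.backbone = clip_backbone          # any 3D-CNN mapping one clip to a feature vector
        self.classifier = nn.Linear(feature_dim, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, num_clips, channels, frames, height, width)
        b, n = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1))       # (b * n, feature_dim), each clip processed separately
        logits = self.classifier(feats).view(b, n, -1)   # (b, n, num_classes)
        return logits.mean(dim=1)                        # naive average over clips -> video-level prediction
```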
Aiming to fill this research gap, the researchers propose a lightweight network that boosts the accuracy of existing clip-based action classifiers.
The GCF-Net leverages two strategies:
- It explicitly models the interdependencies between video clips to widen the receptive field of the local clip descriptors.
- It estimates each clip's importance to the action event and selects a relevant subset of clips for video-level analysis.
The research presents a novel Bi-directional Inter-Clip Fusion method that uses short- and long-range video segments to model inter-clip relationships and generate better clip representations. Conventional methods rely on a 3D-CNN that produces each local feature separately, with only minimal temporal context. Compared with these earlier methods, the team reports that the proposed fusion yields clip representations with a much broader receptive field.
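As a rough illustration of the idea (the paper defines the exact fusion block; this sketch assumes a plain bidirectional self-attention layer and illustrative dimensions), each local clip descriptor can attend to every other clip in the video, both earlier and later, so the fused representation carries short- and long-range context:

```python
import torch
import torch.nn as nn

class BiDirectionalClipFusion(nn.Module):
    """Illustrative inter-clip fusion: unmasked self-attention over clip descriptors."""
    def __init__(self, feature_dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(feature_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feature_dim)

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (batch, num_clips, feature_dim) local descriptors from the 3D-CNN
        fused, _ = self.attn(clip_feats, clip_feats, clip_feats)  # no mask -> attends in both directions
        return self.norm(clip_feats + fused)                      # residual: local feature + inter-clip context
```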
A Gated Clip-Wise Attention module is introduced to further suppress irrelevant clips and improve video-level prediction accuracy. As a byproduct, the module's attention weights can be used to locate the time interval of an action event in a video (i.e., the span covered by the relevant clips).
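A hedged sketch of such a gating head is shown below (the two-layer gate, layer sizes, and class count are assumptions, not the paper's exact design): it scores each fused clip descriptor, normalizes the scores into clip-wise weights, and uses the weighted sum for the video-level prediction, while the weights themselves indicate roughly when the action occurs:

```python
import torch
import torch.nn as nn

class GatedClipWiseAttention(nn.Module):
    """Illustrative clip-wise gating: weight clips, pool, then classify the video."""
    def __init__(self, feature_dim: int = 512, num_classes: int = 400):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(feature_dim, feature_dim // 4),
            nn.ReLU(),
            nn.Linear(feature_dim // 4, 1),
        )
        self.classifier = nn.Linear(feature_dim, num_classes)

    def forward(self, fused_feats: torch.Tensor):
        # fused_feats: (batch, num_clips, feature_dim) from the inter-clip fusion step
        weights = torch.softmax(self.gate(fused_feats).squeeze(-1), dim=1)  # (batch, num_clips)
        video_feat = (weights.unsqueeze(-1) * fused_feats).sum(dim=1)       # down-weights irrelevant clips
        return self.classifier(video_feat), weights  # weights also localize the action in time
```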

By modeling inter-clip relationships and clip-wise importance at a much finer granularity, the proposed GCF-Net yields a significant gain in video action recognition accuracy over traditional methods that analyze all clips or only randomly/centrally selected clips.
Experiments show that GCF-Net delivers large accuracy gains on two action datasets at a small additional cost in MFLOPs. With the same training data and backbone network, it improves the accuracy of the state-of-the-art action classifier by 11.49% and 3.67% on a large benchmark video dataset.
Source Paper: https://arxiv.org/pdf/2102.01285.pdf