Video content continues to dominate Internet traffic, accounting for more than 80% of the total share in 2022. Although each new generation of video codecs brings significant performance improvements, the fundamentals behind video compression algorithms have not changed considerably in recent decades.
Video codecs are built from deeply considered, well-engineered components, but those components are hard-coded. As a result, they are difficult to adapt to expanding demand and increasingly diverse video use cases, including social media sharing, object identification, and VR streaming.
Deep learning-based applications have become the de facto solutions for many industries and research disciplines. Moreover, the increasing hardware capabilities of end devices and advances in GPUs have significantly accelerated deep learning-based solutions and made it possible to deploy them in a wide range of use cases.
This is where learned video compression comes into play: an end-to-end, learning-based video compression approach that can outperform existing video codecs. The authors present several contributions to video compression with machine learning.
The first significant alteration is in the motion compensation step. Traditionally, video codecs can only predict temporal patterns in the form of motion, which prevents them from capturing certain other types of temporal redundancy. Consider a scenario where someone is gazing at the camera before turning their face to the right. Traditional video codecs cannot predict the profile view in this situation. The proposed approach can, however, because it learns arbitrary spatio-temporal patterns rather than limiting itself to motion alone.
This change in motion prediction also brings another approach: learned state propagation. Traditionally, prior knowledge in the video is propagated through a reference frame and optical flow maps represented in raw pixel space. The biggest downside of this approach is its inability to capture long-term dependencies. The proposed method instead propagates arbitrary states that the model learns on its own, leading to far stronger knowledge retention.
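The contrast between the two propagation schemes can be illustrated with a toy sketch. This is not the paper's actual model (which uses a learned deep recurrent update); the function names and the moving-average update below are hypothetical stand-ins, chosen only to show that a carried state can retain information older than the last decoded frame.

```python
# Toy sketch (hypothetical names): a traditional codec's inter prediction only
# sees the previous decoded frame, while a learned codec carries an arbitrary
# state forward and updates it with a learned function.

def traditional_propagate(state, decoded_frame):
    # The only retained knowledge is the last decoded frame itself.
    return decoded_frame

def learned_propagate(state, decoded_frame, alpha=0.9):
    # Stand-in for a learned recurrent update: a simple per-pixel exponential
    # moving average, purely to illustrate that the state can blend
    # information across many frames.
    return [[alpha * s + (1 - alpha) * d for s, d in zip(srow, drow)]
            for srow, drow in zip(state, decoded_frame)]

# Toy 2x2 "frames": dark, bright, dark again.
frames = [[[0.0, 0.0], [0.0, 0.0]],
          [[1.0, 1.0], [1.0, 1.0]],
          [[0.0, 0.0], [0.0, 0.0]]]

state = frames[0]
for f in frames[1:]:
    state = learned_propagate(state, f)

# After seeing the bright frame, the learned state still remembers it even
# though the latest frame is dark; a pure reference-frame buffer would not.
print(state[0][0])
```

With `traditional_propagate`, the final state would be identically zero; the recurrent update keeps a trace of the earlier bright frame, which is the kind of long-term retention the learned states enable.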
Traditional video codecs must choose how to divide the available bandwidth between motion and residual, but the ideal trade-off differs from frame to frame, and because the two are compressed independently there is no simple way to balance them. The proposed method instead compresses the motion-compensation and residual information jointly through the same bottleneck. This removes remaining overlap between the two signals and lets the network decide how to allocate bitrate based on the complexity of each frame.
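A minimal sketch of the difference, under strong simplifying assumptions: real codecs use entropy coding rather than the uniform scalar quantizer below, and the paper's joint bottleneck is a learned encoder, not a shared quantizer. The helper names are hypothetical; the point is only that separate bottlenecks fix the bit split in advance, while a shared code can spend bits wherever a given frame needs them.

```python
# Hedged sketch (hypothetical helpers): separate vs joint bottlenecks for
# motion and residual signals.

def quantize(values, step):
    # Uniform scalar quantizer; coarser steps stand in for fewer bits.
    return [round(v / step) * step for v in values]

def separate_bottlenecks(motion, residual, motion_step, residual_step):
    # Fixed split: the two step sizes are chosen in advance and are
    # identical for every frame, regardless of its content.
    return quantize(motion, motion_step) + quantize(residual, residual_step)

def joint_bottleneck(motion, residual, step):
    # One shared code: in the real model a learned encoder mixes both
    # signals, so the network itself decides the per-frame allocation.
    return quantize(motion + residual, step)

code = joint_bottleneck([0.3, -1.2], [0.05, 0.0], step=0.5)
print(code)
```

In the joint case, a frame with large motion and a flat residual (as above) naturally spends its quantization budget on the motion components, with no hand-tuned split.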
Another change is made in the motion representation. Traditionally, a hierarchical block structure describes the optical flow, with all pixels in a block sharing the same motion vector, and the motion vectors are quantized to a fixed sub-pixel precision. This representation can be compressed efficiently but fails to capture complex, fine motion. The proposed method, in contrast, distributes the bandwidth flexibly: less important regions are represented with very few bits, while more important areas can have arbitrarily complex motion boundaries and arbitrary flow precision. The example can be seen below:
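The limitation of the block-based scheme can be demonstrated with a toy example. The function below is an illustrative stand-in, not a real codec's motion search: it assigns one quarter-pel-quantized vector per block, so distinct per-pixel motions inside the block are collapsed into a single average.

```python
# Sketch (toy values): block-based motion forces one quantized vector onto a
# whole block, losing fine motion detail; a dense learned flow map can give
# every pixel its own vector at arbitrary precision.

def block_motion(flow, block_size, precision=0.25):
    # One motion vector per block, quantized to quarter-pel precision.
    h, w = len(flow), len(flow[0])
    out = [[0.0] * w for _ in range(h)]
    for by in range(0, h, block_size):
        for bx in range(0, w, block_size):
            # Average the true flow over the block, then quantize it.
            vals = [flow[y][x]
                    for y in range(by, min(by + block_size, h))
                    for x in range(bx, min(bx + block_size, w))]
            q = round((sum(vals) / len(vals)) / precision) * precision
            for y in range(by, min(by + block_size, h)):
                for x in range(bx, min(bx + block_size, w)):
                    out[y][x] = q
    return out

# A 2x2 patch whose pixels move by different horizontal amounts.
true_flow = [[0.1, 0.9],
             [0.1, 0.9]]
approx = block_motion(true_flow, block_size=2)
print(approx)  # every pixel forced to the same quantized vector
```

Here the distinct motions of 0.1 and 0.9 pixels both collapse to the block average; a per-pixel learned flow would keep the motion boundary between the two columns intact.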
Moreover, the proposed method uses a multi-flow representation instead of the single flow map used in traditional codecs. This way, the scene can be composed as a mixture of multiple simple flows, which helps preserve occluded content. Finally, an ML-driven spatial rate control algorithm assigns different bitrates to different spatial locations within each frame.
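The multi-flow idea can be sketched as a per-pixel weighted mixture of several warped candidates. The masks below are hand-picked for illustration; in the actual model they would be produced by the network, and the candidates would come from warping the reference with each flow map.

```python
# Sketch (hypothetical weights): with multiple flow maps, the prediction is a
# per-pixel mixture of several simply-warped candidates, letting background
# occluded in one candidate be filled in from another.

def compose(candidates, weights):
    # candidates: predicted frames of identical shape; weights: per-pixel
    # mixing coefficients for each candidate, summing to 1 at each pixel.
    h, w = len(candidates[0]), len(candidates[0][0])
    return [[sum(c[y][x] * wt[y][x] for c, wt in zip(candidates, weights))
             for x in range(w)] for y in range(h)]

# Two candidate predictions of a 1x2 frame: one warps the moving foreground
# correctly, the other preserves the static background.
fg = [[1.0, 1.0]]
bg = [[0.0, 0.2]]
# Masks pick the foreground on the left pixel, the background on the right.
w_fg = [[1.0, 0.0]]
w_bg = [[0.0, 1.0]]

pred = compose([fg, bg], [w_fg, w_bg])
print(pred)  # [[1.0, 0.2]]
```

A single flow map would have to choose one candidate everywhere; the mixture recovers the occluded background pixel (0.2) while still tracking the foreground.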
To conclude, the authors propose an end-to-end learned video compression algorithm in which several well-designed, learning-based components replace the traditional building blocks of video compression. The resulting video codec competes with traditional video codecs in encoding performance.
This article is a research summary written by Marktechpost staff based on the research paper 'Learned Video Compression'. All credit for this research goes to the researchers on this project.
Ekrem Çetinkaya received his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis about image denoising using deep convolutional networks. He is currently pursuing a Ph.D. degree at the University of Klagenfurt, Austria, and working as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.