Advances in display technologies and the ever-growing popularity of video content have created a significant demand for video compression to save on storage and bandwidth costs.
Compression works by exploiting the similarity among video frames: since a typical video contains around 30 frames per second, most of the content is nearly identical from one frame to the next. The compression algorithm therefore encodes only the residual information between frames.
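The idea of residual coding can be illustrated with a toy sketch (a hypothetical simplification, not the paper's actual codec): the decoder reconstructs the current frame from a prediction plus a transmitted residual, and when consecutive frames are similar the residual is mostly zeros, which is cheap to encode.

```python
# Toy residual coding between two frames, with each frame flattened
# to a list of pixel values (illustrative only).

def residual(prev_frame, cur_frame):
    """Residual = what the prediction (here, the previous frame) misses."""
    return [c - p for p, c in zip(prev_frame, cur_frame)]

def reconstruct(prev_frame, res):
    """Decoder side: prediction + residual recovers the current frame."""
    return [p + r for p, r in zip(prev_frame, res)]

prev = [10, 10, 10, 10]   # previous frame
cur  = [10, 11, 10, 12]   # current frame differs in only two pixels

res = residual(prev, cur)
print(res)                             # mostly zeros -> cheap to encode
assert reconstruct(prev, res) == cur   # lossless round trip
```

Real codecs predict far more aggressively than "copy the previous frame," but the principle is the same: the fewer bits the residual needs, the better the compression.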
Video compression is about finding the sweet spot in the trade-off between visual quality and file size. Videos need to be served to millions of clients, and not all of them have the network capacity to receive the highest-quality version.
Although existing video compression methods save significant bandwidth, their advancement is still driven by hand-crafted heuristics. Even the latest state-of-the-art video codec, Versatile Video Coding (VVC), shares components with codecs from two decades ago.
Video compression is also among the problems being tackled with neural networks. Neural video compression has gained momentum recently, with recent methods reaching performance on par with traditional video codecs.
Despite achieving impressive compression performance, neural video compression methods struggle to produce "realistic" outputs. They can generate output video that is close to the input, but it lacks realism. For example, the hair of people in video compressed by a neural model often looks slightly off.
Adding a realism constraint to the network aims to ensure the output is indistinguishable from real images while staying close to the input video. The main challenge is ensuring that the network generalizes well to unseen content.
This is the problem this paper tries to solve. The authors carefully construct a generative neural video compression technique that excels at detail synthesis and preservation. This is achieved by using a generative adversarial network (GAN) and placing particular emphasis on the GAN loss function.
In video compression, specific frames are selected as key frames (I-frames) that serve as a base for reconstructing subsequent frames. These frames are allocated higher bitrates and therefore retain better detail. The same holds for the proposed method, which synthesizes the dependent frames (P-frames) from the available I-frame using a three-step strategy.
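The I-frame/P-frame layout can be sketched as follows (a generic illustration of group-of-pictures structure, not the paper's specific configuration; the frame counts and bit budgets are hypothetical):

```python
# Toy GOP (group of pictures) layout: every `gop_size`-th frame is an
# I-frame with a larger bit budget; the rest are P-frames coded
# relative to earlier frames.

def frame_types(num_frames, gop_size):
    return ['I' if i % gop_size == 0 else 'P' for i in range(num_frames)]

def bit_budget(ftype, i_bits=100_000, p_bits=20_000):
    # Hypothetical budgets: I-frames get more bits, hence better detail.
    return i_bits if ftype == 'I' else p_bits

types = frame_types(8, gop_size=4)
print(types)  # ['I', 'P', 'P', 'P', 'I', 'P', 'P', 'P']
```

Because I-frames carry more bits, they are the natural place to spend effort synthesizing fine detail that later P-frames can reuse.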
First, it synthesizes essential details within the I-frame, which will be used as a base for upcoming frames. This is done by using a combination of convolutional neural network (CNN) and GAN components. The discriminator in the GAN component is responsible for ensuring I-frame level details.
Second, the synthesized details are propagated to where they are needed. A powerful optical flow method (UFlow) is used to predict motion between frames. The P-frame component has two auto-encoder parts, one predicting the optical flow and one encoding the residual information; together they propagate details from the previous step as sharply as possible.
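Propagation via flow boils down to backward warping: for each pixel in the new frame, the flow says where that content came from in the previous reconstruction. A minimal sketch, assuming a nearest-neighbor warp and a hand-made flow field (the paper uses learned flow from UFlow and differentiable warping; this is a toy stand-in):

```python
# Nearest-neighbor backward warp of a 2D frame (list of rows).
# flow[y][x] = (dy, dx): displacement pointing back into prev_frame.

def warp(prev_frame, flow):
    h, w = len(prev_frame), len(prev_frame[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dy, dx = flow[y][x]
            sy = min(max(y + dy, 0), h - 1)  # clamp to frame borders
            sx = min(max(x + dx, 0), w - 1)
            out[y][x] = prev_frame[sy][sx]
    return out

prev = [[1, 2],
        [3, 4]]
# A constant flow of (0, 1) pulls every pixel from its right neighbor,
# i.e. the content shifts one pixel to the left.
shift_left = [[(0, 1), (0, 1)],
              [(0, 1), (0, 1)]]
print(warp(prev, shift_left))  # [[2, 2], [4, 4]]
```

The more accurate the flow, the less residual information must be transmitted, which is why a strong optical flow network matters here.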
Finally, another auto-encoder determines when to synthesize new details. Since new content can appear in P-frames, existing details can become irrelevant, and propagating them would then distort visual quality. Whenever this happens, the network should synthesize new details instead; the residual auto-encoder component handles this.
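The propagate-vs-synthesize decision can be pictured as a per-pixel blend (a hypothetical simplification; in the actual method the residual auto-encoder learns this behavior rather than applying an explicit mask):

```python
# Toy per-pixel blend between warped detail and newly synthesized detail.
# mask[y][x] in [0, 1]: 1 -> use synthesized detail, 0 -> keep warped.

def blend(warped, synthesized, mask):
    return [
        [m * s + (1 - m) * w
         for w, s, m in zip(wrow, srow, mrow)]
        for wrow, srow, mrow in zip(warped, synthesized, mask)
    ]

warped      = [[5.0, 5.0]]
synthesized = [[9.0, 9.0]]
mask        = [[0.0, 1.0]]   # new content appeared only at the right pixel
print(blend(warped, synthesized, mask))  # [[5.0, 9.0]]
```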
The authors state that two components are crucial to this method: conditioning the residual generator on a latent obtained from the warped previous reconstruction, and leveraging accurate flow from an optical flow network. The proposed method is evaluated both objectively and subjectively, and in both cases it outperforms existing neural video compression methods.
This was a summary of the paper “Neural Video Compression using GANs for Detail Synthesis and Propagation” from the Google Research group. You can check the links below if you are interested in learning more details.
This article was written as a research summary by Marktechpost staff based on the research paper "Neural Video Compression using GANs for Detail Synthesis and Propagation". All credit for this research goes to the researchers on this project. Check out the paper.
Ekrem Çetinkaya received his B.Sc. in 2018, and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis about image denoising using deep convolutional networks. He received his Ph.D. degree in 2023 from the University of Klagenfurt, Austria, with his dissertation titled "Video Coding Enhancements for HTTP Adaptive Streaming Using Machine Learning." His research interests include deep learning, computer vision, video encoding, and multimedia networking.