HKUST AI Researchers Propose A Novel Neural Framework For Video Inpainting By Overfitting A CNN (Convolutional Neural Network) Without Explicit Guidance


Video inpainting is the process of filling in missing regions of a video sequence so that the result is spatially and temporally coherent. It is useful in video editing for removing unwanted objects such as watermarks or logos. As multimedia becomes more prevalent, demand has grown for tools that can fill these holes so that footage stays usable regardless of where it was recorded. Combined with semi-automatic object removal, video inpainting also reduces laborious manual work, since masks do not have to be labeled by hand on every image.

Existing approaches still struggle to produce visually pleasing videos with long-range consistency. Patch-based optimization methods tend to break down on complex motion and cannot synthesize new content. More recent flow-guided methods propagate context information along optical flow and achieve temporally consistent results where patch-based methods fail. However, optical flow inside a missing region is hard to estimate because the occluded regions shift constantly and the motion is complicated. Deep models trained on large video datasets have shown more promising performance, but collecting those datasets is time-consuming and laborious.
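To make the idea of flow-guided propagation concrete, here is a minimal toy sketch. It is not taken from any of the methods mentioned; the function name `propagate_with_flow`, the nested-list frame layout, and the integer per-pixel flow offsets are all assumptions for illustration. It also assumes the flow inside the hole is already known, which is exactly the part that is hard to obtain in practice.

```python
def propagate_with_flow(frame, hole, neighbor, neighbor_hole, flow):
    """Fill hole pixels of `frame` by copying from `neighbor` along `flow`.

    frame, neighbor: 2D lists of pixel values with the same size.
    hole, neighbor_hole: 2D lists, 1 where a pixel is missing, 0 where known.
    flow[y][x]: hypothetical integer (dy, dx) offset from (y, x) in `frame`
    to the matching pixel in `neighbor`.

    Returns the filled frame and a mask of the pixels that are still missing.
    """
    height, width = len(frame), len(frame[0])
    filled = [row[:] for row in frame]
    still_missing = [row[:] for row in hole]
    for y in range(height):
        for x in range(width):
            if not hole[y][x]:
                continue  # known pixel, nothing to fill
            dy, dx = flow[y][x]
            sy, sx = y + dy, x + dx
            inside = 0 <= sy < height and 0 <= sx < width
            # Copy only when the source exists and is itself a known pixel.
            if inside and not neighbor_hole[sy][sx]:
                filled[y][x] = neighbor[sy][sx]
                still_missing[y][x] = 0
    return filled, still_missing
```

Pixels whose flow points outside the image, or into the neighbor's own hole, stay unfilled. That is one way to see why shifting occlusions and unreliable flow limit this family of methods.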

Researchers from The Hong Kong University of Science and Technology (HKUST) propose a new internal learning method for video inpainting that addresses these issues through implicit long-range propagation. The study shows that explicit correspondences between frames, such as optical flow, are not required. Instead, the intrinsic properties of natural videos and of convolutional neural networks can handle the information propagation implicitly. The researchers examine several video properties and handle two particularly hard cases by imposing regularization. The method restores missing regions using cross-frame correlation and enforces temporal consistency through gradient constraints. In practice, they overfit a convolutional neural network (CNN) to the input video itself, training it on the known (unmasked) pixels so that it propagates information from those pixels into the masked regions.
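The training signal described here can be illustrated with a toy sketch. This is not the authors' implementation: the function names `masked_l1` and `gradient_l1`, the nested-list frame representation, and the use of horizontal finite differences as the "gradient" are assumptions made for illustration. The first function penalizes errors only on known pixels, which is what lets a network be fitted to a video that has holes. The second compares image gradients between a prediction and a reference frame, as one simple form of gradient constraint.

```python
def masked_l1(pred, target, hole):
    """Mean absolute error over known pixels only (where hole == 0)."""
    total, count = 0.0, 0
    for p_row, t_row, h_row in zip(pred, target, hole):
        for p, t, h in zip(p_row, t_row, h_row):
            if not h:  # skip masked pixels: no ground truth exists there
                total += abs(p - t)
                count += 1
    return total / count if count else 0.0


def horizontal_gradient(frame):
    """Finite difference between each pixel and its right-hand neighbor."""
    return [[row[x + 1] - row[x] for x in range(len(row) - 1)] for row in frame]


def gradient_l1(pred, reference):
    """Mean absolute difference between the gradients of two frames."""
    grad_pred = horizontal_gradient(pred)
    grad_ref = horizontal_gradient(reference)
    diffs = [abs(a - b)
             for row_a, row_b in zip(grad_pred, grad_ref)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs) if diffs else 0.0


# Example: a 2x2 frame whose bottom-right pixel is masked out.
pred = [[1.0, 2.0], [3.0, 9.0]]
target = [[1.0, 4.0], [3.0, 0.0]]
hole = [[0, 0], [0, 1]]
reconstruction = masked_l1(pred, target, hole)  # averages over 3 known pixels
smoothness = gradient_l1(pred, target)
```

In an actual training loop these terms would be computed on network outputs with an autodiff framework and combined with weights. The weighting and the exact form of the regularizers are not specified in this article.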

The research team first evaluated the proposed method on the DAVIS dataset, where it achieved state-of-the-art performance both quantitatively and qualitatively and received positive feedback from users. They then applied the technique to other video domains, including autonomous driving scenes, old films, and animations, where it produced promising results and was again preferred by viewers.

In summary, the paper proposes a new way to perform video inpainting that propagates information without explicit guidance such as optical flow. With the help of anti-ambiguity regularization, the method handles long-term occlusion and complex motion, which other methods have struggled with. The researchers also extend the approach to settings where only a single mask is used and to high-resolution 4K videos.