In image and video editing, accurate mattes are essential for separating the foreground from the background. However, real-world scenes often contain effects such as shadows or smoke that are tied to the subject, and computer vision techniques generally ignore these scene effects.
Google researchers presented their new approach to matte generation at CVPR 2021. Their method decomposes a video into layers called omnimattes, each capturing a subject together with the scene effects associated with it. A state-of-the-art segmentation model can extract a mask for a person, but not the effects correlated with that person; the proposed method isolates and extracts these additional details as well, such as shadows cast on the ground.
Omnimattes are a novel kind of image mask that can capture partially transparent effects such as reflections, splashes, or tire smoke. They differ from traditional mattes in that they allow soft-edge transitions rather than only sharp ones. Because an omnimatte is an RGBA image, it can be manipulated with widely available image-editing tools, for example to insert text into a video underneath a smoke trail.
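The "text underneath a smoke trail" use case reduces to standard back-to-front RGBA compositing. A minimal NumPy sketch with toy 2x2 frames (the `over` helper and all pixel values are illustrative, not the paper's code):

```python
import numpy as np

def over(bg_rgb, fg_rgba):
    """Standard 'over' compositing of an RGBA layer onto an RGB image ([0, 1] values)."""
    alpha = fg_rgba[..., 3:4]
    return fg_rgba[..., :3] * alpha + bg_rgb * (1.0 - alpha)

background = np.zeros((2, 2, 3))   # toy black background frame
text = np.ones((2, 2, 4))          # opaque white "text" layer
smoke = np.full((2, 2, 4), 0.5)    # semi-transparent gray smoke omnimatte

# Insert the text *underneath* the smoke: composite text first, smoke last,
# so the smoke's soft alpha still partially covers the text.
frame = over(over(background, text), smoke)
```

Because the smoke layer carries soft alpha values rather than a hard binary mask, the text shows through it naturally.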
The omnimatte system is shown above. In a preprocessing step, the user chooses the subjects and specifies an output layer for each one. A segmentation mask for each subject is then generated with an off-the-shelf segmentation network such as Mask R-CNN, and off-the-shelf tools estimate the camera transformations relative to a background reference frame. Noise images, sampled from random noise and defined in that same reference frame, provide image features that are random but consistently track the background over time, making them a natural input for a convolutional neural network (CNN) to learn to reconstruct the background colors.
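The key property of these noise images is that the same noise pixels follow the same background content as the camera moves. A toy NumPy sketch, where a simple integer translation stands in for the estimated camera transform (the real system warps by a full homography):

```python
import numpy as np

rng = np.random.default_rng(0)

# One noise image defined once in the background's reference frame.
bg_noise = rng.random((64, 64))

def noise_for_frame(dx, dy, size=32):
    """Crop the shared background noise at the camera offset (dx, dy).

    A stand-in for warping by the per-frame camera transform: identical
    noise values keep tracking identical background content over time.
    """
    return bg_noise[dy:dy + size, dx:dx + size]

# Two frames related by a small 4-pixel camera pan.
n0 = noise_for_frame(0, 0)
n1 = noise_for_frame(4, 0)

# Their overlapping regions contain exactly the same noise features.
assert np.array_equal(n0[:, 4:], n1[:, :-4])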
In the example below, there is one layer for the person, one for the dog, and a separate layer for the stationary background. When merged with conventional alpha blending, the layers reproduce the input video.
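Reproducing the input from the layers is just the over operator applied back-to-front across the layer stack. A minimal NumPy sketch (layer contents are toy values, not the paper's data):

```python
from functools import reduce
import numpy as np

def over(bg_rgb, fg_rgba):
    """Composite an RGBA layer over an RGB image (values in [0, 1])."""
    alpha = fg_rgba[..., 3:4]
    return fg_rgba[..., :3] * alpha + bg_rgb * (1.0 - alpha)

h, w = 4, 4
background = np.full((h, w, 3), 0.3)   # stationary background layer (RGB)
dog = np.zeros((h, w, 4))
dog[2:, :2] = [0.5, 0.3, 0.1, 1.0]     # opaque dog region
person = np.zeros((h, w, 4))
person[:2, 2:] = [0.8, 0.7, 0.6, 1.0]  # opaque person region

# Merge back-to-front; the composite reproduces the full frame.
frame = reduce(over, [dog, person], background)
```

Keeping each subject in its own RGBA layer is what makes per-layer edits (removal, retiming, restyling) possible before the final merge.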
The video below is decomposed into three layers. The children's initially unsynchronized jumps are aligned simply by adjusting the playback rate of their layers, which produces realistic retiming for the splashes and reflections in the water.
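Adjusting a layer's playback rate amounts to resampling its frame indices. A toy NumPy sketch using nearest-frame resampling (the function name and rate are illustrative; the real system can also interpolate):

```python
import numpy as np

def retime(layer_frames, rate):
    """Resample a layer's frames at a new playback rate (nearest frame).

    layer_frames: (T, H, W, 4) RGBA omnimatte sequence; rate > 1 speeds it up.
    """
    t = len(layer_frames)
    idx = np.minimum((np.arange(t) * rate).astype(int), t - 1)
    return layer_frames[idx]

# A tiny 8-frame clip whose pixel values equal the frame index.
clip = np.arange(8, dtype=float).reshape(8, 1, 1, 1) * np.ones((8, 1, 1, 4))
faster = retime(clip, 1.5)  # play this layer 1.5x faster
```

Because each child occupies a separate layer, only that layer is resampled; the background and the other layers keep their original timing.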
The proposed model can automatically generate visual effects for videos in a self-supervised manner, without any manual labels. It works on real-world footage with interactions between different types of subjects (cars, animals, and people), and handles complex effects ranging from semitransparent elements such as smoke and reflections to fully opaque objects attached to the subject.