Meta AI Researchers Upgrade Their Machine Learning-Based Image Segmentation Models For Better Virtual Backgrounds in Video Calls And Metaverse

When video chatting with colleagues, coworkers, or family, many of us have grown accustomed to using virtual backgrounds and background filters. These features offer more control over our surroundings: they reduce distractions, preserve the privacy of the people around us, and even liven up our virtual presentations and get-togethers. However, background filters don’t always work as expected or perform well for everyone.

Image segmentation is a computer vision technique for separating the different components of a photo or video. It is widely used for backdrop blurring, virtual backgrounds, and other augmented reality (AR) effects. Despite advanced algorithms, highly accurate person segmentation remains challenging.

A model used for image segmentation must be highly consistent and lag-free; inefficient algorithms deliver a poor user experience. During a video conference, for instance, artifacts generated by erroneous segmentation output can easily distract people using virtual background programs. More importantly, segmentation errors may expose parts of a person’s physical environment that the backdrop effect was meant to hide.

Recent research by Meta AI and Reality Labs presents upgraded AI models for image segmentation. The new segmentation models are now in production for real-time video calling in Spark AR on surfaces across Portal, Messenger, and Instagram.

These models are more efficient, robust, and versatile, improving the quality and consistency of background filter effects. The upgraded segmentation models can now handle multiple people and their full bodies, as well as people partially occluded by objects such as a sofa, desk, or table. Beyond video calling, better segmentation can also open new dimensions in augmented and virtual reality (AR/VR) by fusing virtual settings with real-world people and objects.

The researchers use FBNetV3 as the backbone of their model: an encoder-decoder architecture that fuses layers sharing the same spatial resolution. The resulting design was found via neural architecture search and is highly tuned for on-device performance. This heavyweight encoder paired with a lightweight decoder outperforms symmetric architectures in terms of quality.
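The asymmetric design described above can be illustrated with a minimal PyTorch sketch: a comparatively heavy convolutional encoder feeding a light decoder that fuses feature maps at matching spatial resolutions. The class name, channel counts, and layer choices here are illustrative assumptions, not Meta’s actual FBNetV3-based architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeavyEncoderLightDecoder(nn.Module):
    """Toy asymmetric segmentation net: heavy encoder, light decoder."""

    def __init__(self, num_classes: int = 1):
        super().__init__()
        # Heavy encoder: three downsampling stages with growing channel counts.
        self.enc1 = self._block(3, 32, stride=2)    # 1/2 resolution
        self.enc2 = self._block(32, 64, stride=2)   # 1/4 resolution
        self.enc3 = self._block(64, 128, stride=2)  # 1/8 resolution
        # Light decoder: 1x1 projections plus bilinear upsampling, fusing
        # encoder features that share the same spatial resolution.
        self.reduce = nn.Conv2d(128, 32, 1)
        self.proj2 = nn.Conv2d(64, 32, 1)
        self.proj1 = nn.Conv2d(32, 32, 1)
        self.head = nn.Conv2d(32, num_classes, 1)

    @staticmethod
    def _block(cin, cout, stride):
        return nn.Sequential(
            nn.Conv2d(cin, cout, 3, stride=stride, padding=1),
            nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
            nn.Conv2d(cout, cout, 3, padding=1),
            nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        f1 = self.enc1(x)
        f2 = self.enc2(f1)
        f3 = self.enc3(f2)
        y = self.reduce(f3)
        y = F.interpolate(y, size=f2.shape[-2:], mode="bilinear", align_corners=False)
        y = y + self.proj2(f2)  # fuse at 1/4 resolution
        y = F.interpolate(y, size=f1.shape[-2:], mode="bilinear", align_corners=False)
        y = y + self.proj1(f1)  # fuse at 1/2 resolution
        y = F.interpolate(y, scale_factor=2, mode="bilinear", align_corners=False)
        return self.head(y)

model = HeavyEncoderLightDecoder().eval()
mask_logits = model(torch.randn(1, 3, 128, 128))
print(mask_logits.shape)  # torch.Size([1, 1, 128, 128])
```

Note how almost all parameters sit in the encoder; the decoder is just a few 1×1 convolutions, which is what makes this style of network cheap to run on-device.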

The team used Meta AI’s ClusterFit model to select a diverse range of samples from the dataset, covering gender, skin tone, age, body pose, motion, background complexity, and people count, among other attributes. To increase the volume of training data, they employed an offline high-capacity PointRend model to generate pseudo ground-truth labels for the unannotated data.
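The pseudo-labeling step can be sketched as follows: a high-capacity "teacher" model labels unannotated frames offline so they can augment the training set, analogous to the PointRend teacher described above. The `teacher` here is a tiny stand-in, and the confidence threshold and ignore-index convention are illustrative assumptions.

```python
import torch

def pseudo_label(teacher, images, conf_threshold=0.9, ignore_index=255):
    """Produce per-pixel pseudo ground-truth labels from a teacher model,
    marking low-confidence pixels with `ignore_index` so a training loss
    can skip them."""
    teacher.eval()
    with torch.no_grad():
        probs = torch.sigmoid(teacher(images))   # (N, 1, H, W) person probability
    labels = (probs > 0.5).long().squeeze(1)     # hard 0/1 labels, (N, H, W)
    # Confidence is the probability of the predicted class.
    confidence = torch.max(probs, 1 - probs).squeeze(1)
    labels[confidence < conf_threshold] = ignore_index
    return labels

# Tiny stand-in teacher for demonstration only (not a real PointRend model).
teacher = torch.nn.Conv2d(3, 1, kernel_size=1)
labels = pseudo_label(teacher, torch.randn(2, 3, 8, 8))
print(labels.shape)  # torch.Size([2, 8, 8])
```

In practice the teacher is far larger than the deployed model, which is why this labeling runs offline rather than on-device.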

Real-time models normally include a tracking mode that relies on temporal information, which can cause frame-to-frame prediction inconsistencies, so metrics computed on static images don’t correctly reflect a model’s quality in real time. The team therefore created a quantitative video evaluation framework that computes metrics at each frame the model’s inference reaches, measuring the models’ quality in real time. Optimizing against this framework considerably improved temporal consistency.
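A per-frame video evaluation of this kind might look like the sketch below: mask IoU is computed at every frame, alongside a simple temporal-consistency score (IoU between consecutive predicted masks). The exact metrics in Meta’s framework are not public, so both metrics here are illustrative assumptions.

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return 1.0 if union == 0 else float(inter) / float(union)

def evaluate_video(pred_masks, gt_masks):
    """Per-frame accuracy plus a frame-to-frame consistency score."""
    per_frame_iou = [iou(p, g) for p, g in zip(pred_masks, gt_masks)]
    # Consistency: how similar each predicted mask is to the previous one.
    temporal = [iou(p0, p1) for p0, p1 in zip(pred_masks, pred_masks[1:])]
    return {
        "mean_iou": float(np.mean(per_frame_iou)),
        "temporal_consistency": float(np.mean(temporal)) if temporal else 1.0,
    }

# Toy example: three identical frames give perfect scores on both axes.
m = np.zeros((4, 4), dtype=bool)
m[1:3, 1:3] = True
report = evaluate_video([m, m, m], [m, m, m])
print(report)  # {'mean_iou': 1.0, 'temporal_consistency': 1.0}
```

The key idea is that the second metric penalizes flicker even when every individual frame scores well against the ground truth.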

The researchers evaluated their model’s performance across different groups of people. For this, they used metadata from more than 100 classes (in more than 30 categories) to select evaluation videos, including three perceived skin tones and two apparent gender categories. The results show the model performs accurately across both perceived skin-tone and apparent binary gender categories. Despite some slight discrepancies between categories, the model consistently performs well across all subcategories.

A standard deep learning model resamples an input image to a small square image before feeding it to the network. This resampling introduces distortions, and because images have varying aspect ratios, the distortions themselves vary. The presence of these varying distortions means the network learns low-level features that are not robust to different aspect ratios, and the effect is amplified in segmentation applications. The team applies Detectron2’s aspect ratio-dependent resampling to solve this problem: the method groups images with similar aspect ratios and resizes each group to the same size.
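The grouping step can be sketched as simple aspect-ratio bucketing: each image is assigned to the nearest of a few aspect-ratio buckets, and batches are then formed within a bucket so every image in a batch can be resized to one shared target size. The bucket boundaries below are illustrative; Detectron2’s grouping uses its own scheme.

```python
def bucket_by_aspect_ratio(sizes, buckets=(0.75, 1.0, 1.33)):
    """Assign each (width, height) pair to the nearest aspect-ratio bucket.

    Returns a dict mapping bucket -> list of image indices, so a batch
    sampler can draw each batch from a single bucket.
    """
    groups = {b: [] for b in buckets}
    for i, (w, h) in enumerate(sizes):
        ratio = w / h
        nearest = min(buckets, key=lambda b: abs(b - ratio))
        groups[nearest].append(i)
    return groups

# Portrait, landscape, wide, and square images land in separate buckets.
sizes = [(640, 480), (480, 640), (1280, 720), (512, 512)]
print(bucket_by_aspect_ratio(sizes))
# {0.75: [1], 1.0: [3], 1.33: [0, 2]}
```

Because no batch mixes very different shapes, the resize distortion within a batch stays small and roughly uniform.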

Further, aspect ratio-dependent resampling requires padding when batching images whose aspect ratios are merely comparable rather than identical. Zero padding, however, introduces artifacts, and as the network grows deeper, these artifacts spread to other locations. To get rid of them, the team uses replicate padding instead.
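The difference between the two padding modes is easy to see on a tiny example: zero padding surrounds the image with an artificial dark border, while replicate padding extends the edge pixels outward. The tensor below is a stand-in "image" chosen for illustration.

```python
import torch
import torch.nn.functional as F

x = torch.ones(1, 1, 2, 2)  # a tiny all-ones "image" (N, C, H, W)

# Zero padding inserts a border of zeros around the content.
zero_pad = F.pad(x, (1, 1, 1, 1), mode="constant", value=0.0)
# Replicate padding copies the nearest edge pixel outward instead.
repl_pad = F.pad(x, (1, 1, 1, 1), mode="replicate")

print(zero_pad[0, 0])  # ones surrounded by a border of zeros
print(repl_pad[0, 0])  # all ones: the border repeats the edge pixels
```

The artificial zero border is exactly the kind of spurious edge a convolutional network can latch onto and propagate inward, which is why replicate padding avoids the artifacts described above.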

The researchers state that producing smooth and unambiguous boundaries is crucial for AR segmentation applications. According to them, it is also important to add a boundary-weighted loss to the standard cross-entropy loss for segmentation. To that end, they adopt Boundary IoU’s method of extracting boundary regions for both ground truth and prediction, and compute a cross-entropy loss within these regions. A model trained with this boundary cross-entropy greatly outperforms the baseline. Their findings suggest that in addition to producing a clearer boundary region in the final mask output, the new models yield fewer false positives.
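One way to realize such a loss is sketched below: a thin band around the ground-truth mask contour is extracted via morphological dilation and erosion (a simple stand-in for Boundary IoU’s boundary-region extraction), and a cross-entropy term restricted to that band is added to the standard loss. The band width and loss weight are illustrative assumptions, not the values Meta used.

```python
import torch
import torch.nn.functional as F

def boundary_band(mask: torch.Tensor, width: int = 3) -> torch.Tensor:
    """Binary band of pixels near the mask boundary.
    `mask` is (N, 1, H, W) with values in {0, 1}."""
    pad = width // 2
    dilated = F.max_pool2d(mask, width, stride=1, padding=pad)
    eroded = 1.0 - F.max_pool2d(1.0 - mask, width, stride=1, padding=pad)
    return dilated - eroded  # 1 inside the band, 0 elsewhere

def boundary_ce_loss(logits, target, band_weight: float = 1.0):
    """Standard BCE plus BCE restricted to the boundary band."""
    base = F.binary_cross_entropy_with_logits(logits, target)
    band = boundary_band(target)
    per_pixel = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    boundary = (per_pixel * band).sum() / band.sum().clamp(min=1.0)
    return base + band_weight * boundary

# Toy target: a square person-like blob in an 8x8 frame.
target = torch.zeros(1, 1, 8, 8)
target[:, :, 2:6, 2:6] = 1.0
logits = torch.randn(1, 1, 8, 8)
loss = boundary_ce_loss(logits, target)
```

By weighting pixels near the contour more heavily, errors that blur or shift the silhouette cost more than errors deep inside or far outside the mask.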

These models are trained offline with PyTorch before being deployed to production on the Spark AR platform. The team used PyTorch Lite to optimize on-device deep learning model inference.
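A typical export path for the lite interpreter looks like the sketch below: the trained model is scripted with TorchScript, passed through PyTorch’s mobile optimizer, and saved in the `.ptl` format the lite runtime loads. The one-layer model and file name are placeholders; Meta’s actual deployment pipeline inside Spark AR is not public.

```python
import os
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

# Placeholder for a trained segmentation network.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 1, 3, padding=1)).eval()

scripted = torch.jit.script(model)            # TorchScript graph
optimized = optimize_for_mobile(scripted)     # mobile-specific graph passes
optimized._save_for_lite_interpreter("segmentation_model.ptl")

print(os.path.exists("segmentation_model.ptl"))  # True
```

The resulting `.ptl` file is what a mobile app bundles and runs with the lite interpreter instead of the full PyTorch runtime.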

Related Paper: https://arxiv.org/pdf/2103.16562.pdf

Reference: https://ai.facebook.com/blog/creating-better-virtual-backdrops-for-video-calling-remote-presence-and-ar/