AI Researchers at Huawei Propose a Novel Decoupled Multi-task Learning with Cyclical Self-Regulation (DML-CSR) for Face Parsing

This research summary is based on the paper 'Decoupled Multi-task Learning with Cyclical Self-Regulation for Face Parsing'

Face parsing is a fine-grained semantic segmentation problem that assigns a pixel-wise label to each facial component, such as the eyes, nose, and mouth. Many high-level applications, such as face swapping, face editing, and facial makeup, require detailed analysis of semantic facial components. Methods based on Fully Convolutional Networks (FCNs) have shown promising results on fully supervised face parsing, benefiting from the learning capacity of deep Convolutional Neural Networks (CNNs) and the labor invested in pixel-level annotations.

Nonetheless, because convolutional kernels are local, FCNs cannot capture the global contextual information needed to semantically analyze the face components in an image. To address this, most region-based face parsing algorithms learn global information by incorporating CNN features into variants of Conditional Random Fields (CRFs). These methods, however, do not take into account the relationships between different objects.

Previously, the EAGRNet approach was introduced to model a region-level graph representation over a face image by propagating information across all vertices of the graph. Although EAGRNet achieves state-of-the-art performance by reasoning across non-local regions to obtain global dependencies between distant facial components, it still suffers from spatial inconsistency and boundary confusion. In EAGRNet, the PSP module uses an average pooling layer to capture the global context prior, which leads to an inconsistent spatial topology.

EAGRNet also incorporates additional binary edge cues into its context embedding to improve parsing results. In crowded scenes, however, it struggles to handle the boundaries between highly irregular facial components (such as hair and cloth) and to detect distinct boundaries between multiple face instances. Furthermore, learning a good face parsing model requires precise pixel-level annotations, yet careless manual labeling errors are unavoidable in training datasets.

EAGRNet is trained with the typical fully supervised learning approach, which treats all pixels in the ground truth equally and therefore cannot detect label noise. Overlooking such flawed annotations limits model generalization and holds back further performance gains.

In a recent paper, Huawei researchers developed an end-to-end face parsing system based on Decoupled Multi-task Learning with Cyclical Self-Regulation (DML-CSR). Given a facial image as input, a ResNet-101 pre-trained on ImageNet serves as the backbone and extracts features from several layers. The multi-task model then performs three tasks: face parsing, binary edge detection, and category edge detection.

The three tasks share the backbone's low-level weights but have no high-level interactions. As a result, the extra edge detection tasks can be decoupled from face parsing at inference time. To deal with the spatial inconsistency caused by pooling, the team designs a Dynamic Dual Graph Convolutional Network (DDGCN) in the face parsing branch to capture long-range contextual information.
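The separation of the edge tasks from face parsing can be pictured with a toy sketch. Everything in it is hypothetical (the class name `ToyDecoupledParser`, the stand-in head functions, and the arithmetic); it is not the DML-CSR architecture. It only illustrates that the three heads read the same shared features without feeding into each other, so the two edge heads can be skipped at inference.

```python
from typing import Callable, Dict, List

Features = List[float]


class ToyDecoupledParser:
    """Toy stand-in for a shared backbone with three independent heads.

    Hypothetical illustration only; not the DML-CSR architecture.
    """

    def __init__(self) -> None:
        # Each head maps shared features to its own output.
        # No head reads another head's output.
        self.heads: Dict[str, Callable[[Features], Features]] = {
            "parsing": lambda f: [2.0 * x for x in f],
            "binary_edge": lambda f: [1.0 if x > 0.5 else 0.0 for x in f],
            "category_edge": lambda f: [x + 1.0 for x in f],
        }
        # Count how often each head actually runs.
        self.calls: Dict[str, int] = {name: 0 for name in self.heads}

    def backbone(self, image: Features) -> Features:
        # Stand-in for shared feature extraction (ResNet-101 in the paper).
        return [x / 255.0 for x in image]

    def forward(self, image: Features, training: bool) -> Dict[str, Features]:
        shared = self.backbone(image)
        # Training runs all three tasks; inference keeps only face parsing.
        active = list(self.heads) if training else ["parsing"]
        outputs: Dict[str, Features] = {}
        for name in active:
            self.calls[name] += 1
            outputs[name] = self.heads[name](shared)
        return outputs


model = ToyDecoupledParser()
train_out = model.forward([0.0, 255.0], training=True)
infer_out = model.forward([0.0, 255.0], training=False)
print(sorted(train_out))
print(sorted(infer_out))
```

After one training pass and one inference pass, the call counters show that the parsing head ran twice while each edge head ran only once, which is where the inference savings in this toy setup come from.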

The proposed DDGCN has no extra pooling operation, and it dynamically fuses the global context extracted by GCNs in both the spatial and feature spaces. To resolve boundary confusion in both single-face and multi-face scenarios, the proposed category-aware edge detection module uses more semantic information than the binary edge detection module used in EAGRNet.
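To make the difference between binary and category-aware edge information concrete, here is a minimal sketch of one plausible way to derive both kinds of edge map from a parsing label map. The function name `edge_maps`, the 4-neighbour rule, and the `-1` marker for non-edge pixels are assumptions made for illustration; the paper's own label generation may differ.

```python
from typing import List, Tuple

Grid = List[List[int]]


def edge_maps(labels: Grid, non_edge: int = -1) -> Tuple[Grid, Grid]:
    """Return (binary_edges, category_edges) for a 2D label map.

    Assumption: a pixel lies on an edge when any 4-connected neighbour
    carries a different label. The paper's exact rule may differ.
    """
    h, w = len(labels), len(labels[0])
    binary = [[0] * w for _ in range(h)]
    category = [[non_edge] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and labels[ny][nx] != labels[y][x]:
                    binary[y][x] = 1                 # edge or not
                    category[y][x] = labels[y][x]    # edge, tagged with its class
                    break
    return binary, category


# A 2x2 region of class 1 surrounded by class 0.
labels = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
binary, category = edge_maps(labels)
for row in category:
    print(row)
```

In the binary map, the pixels on both sides of the boundary look identical. The category map records which class each boundary pixel belongs to, which is the kind of extra semantic signal a category-aware edge task can exploit.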

To address noisy labels in training datasets, the team introduces a cyclical learning scheduler, inspired by self-training, that performs cyclical self-regulation (CSR). The proposed CSR combines two strategies: a self-ensemble strategy, which aggregates a series of historical models into a new, more reliable model, and a self-distillation strategy, which uses the soft labels produced by the aggregated model to guide subsequent model learning.

Each CSR iteration alternates between these two strategies, improving model generalization by correcting noisy labels during training. In this way, CSR improves the reliability of both the model and the labels within a cyclical training schedule without adding computational cost.
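The two ingredients of CSR can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the historical models are combined by a uniform running average of their weights, and that soft labels are blended with the annotated labels using a hypothetical mixing factor `alpha`. The function names and all numbers are made up for the example.

```python
from typing import List


def running_average(aggregate: List[float], new: List[float], cycle: int) -> List[float]:
    """Fold the model from cycle `cycle` (1-based) into a uniform average of all models so far."""
    return [(a * (cycle - 1) + n) / cycle for a, n in zip(aggregate, new)]


def blend_targets(one_hot: List[float], soft: List[float], alpha: float) -> List[float]:
    """Mix the annotated label with the aggregated model's soft prediction."""
    return [(1.0 - alpha) * o + alpha * s for o, s in zip(one_hot, soft)]


# Self-ensemble: aggregate the weights of three hypothetical per-cycle checkpoints.
checkpoints = [[1.0, 4.0], [2.0, 2.0], [3.0, 0.0]]
aggregate = [0.0, 0.0]
for cycle, weights in enumerate(checkpoints, start=1):
    aggregate = running_average(aggregate, weights, cycle)

# Self-distillation: one pixel annotated as class 1, while the aggregated
# model leans towards class 0. The blended target softens the annotation.
annotated = [0.0, 1.0, 0.0]
soft_prediction = [0.5, 0.25, 0.25]
target = blend_targets(annotated, soft_prediction, alpha=0.5)

print(aggregate)
print(target)
```

Alternating these two steps over several cycles is the general idea: the averaged model supplies more stable soft labels, and those soft labels in turn reduce the influence of individual mislabeled pixels on the next round of training.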

The approach achieves new state-of-the-art performance on the Helen (93.8 percent overall F1 score), LaPa (92.4 percent mean F1), and CelebAMask-HQ (86.1 percent mean F1) datasets. Because the edge prediction modules can be detached from the network at inference, the method also uses fewer computational resources than EAGRNet, cutting inference time from 89ms to 31ms (roughly 2.9 times faster) while delivering significantly better performance.
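For readers unfamiliar with the metric, the sketch below shows how a per-class F1 score and its mean can be computed from predicted and ground-truth pixel labels. It is a generic illustration with made-up data and a hypothetical function name `per_class_f1`; each benchmark has its own protocol for which classes are counted, so it is not the evaluation script behind the numbers above.

```python
from typing import Dict, List


def per_class_f1(pred: List[int], truth: List[int], classes: List[int]) -> Dict[int, float]:
    """Compute F1 = 2*TP / (2*TP + FP + FN) for each class over flattened pixel labels."""
    scores: Dict[int, float] = {}
    for c in classes:
        tp = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        fp = sum(1 for p, t in zip(pred, truth) if p == c and t != c)
        fn = sum(1 for p, t in zip(pred, truth) if p != c and t == c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


# Six pixels, two classes; one class-1 pixel is mispredicted as class 2.
truth = [1, 1, 1, 2, 2, 2]
pred = [1, 1, 2, 2, 2, 2]
scores = per_class_f1(pred, truth, classes=[1, 2])
mean_f1 = sum(scores.values()) / len(scores)
print(scores, mean_f1)
```

Here class 1 scores 0.8 (two true positives, one miss) and class 2 scores 6/7 (three true positives, one false alarm), so a single wrong pixel lowers both classes' scores.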

Conclusion

Huawei researchers have published DML-CSR, a decoupled multi-task learning technique for face parsing with cyclical self-regulation. Extensive experiments on Helen, CelebAMask-HQ, and LaPa show that the proposed strategy is effective, with DML-CSR outperforming other approaches on all three datasets. According to the researchers, DML-CSR is a valuable strategy for training a trustworthy face parsing model on a large-scale dataset.

Paper: https://arxiv.org/pdf/2203.14448v1.pdf

GitHub: https://github.com/deepinsight/insightface/tree/master/parsing/dml_csr