Google Researchers Propose a Novel Method, Called Constrained Instance reWeighting (CIW), To Reduce the Effect of Noisy Labels in Deep Neural Networks

Deep neural networks (DNNs) have delivered substantial performance advances in a wide range of real-world applications, from image recognition to genomics. Modern DNNs, however, frequently have significantly more trainable parameters than training instances, resulting in overparameterized networks that readily overfit to noisy or corrupted labels. As a result, training with noisy labels often degrades the trained model’s performance on clean test data. Unfortunately, noisy labels arise in many real-world settings for several reasons, including manual annotation errors and inconsistencies and the use of intrinsically noisy label sources (e.g., the internet or automated labels from an existing system).

Google researchers propose a new method, called Constrained Instance reWeighting (CIW), that dynamically assigns importance weights to individual instances and class labels in a mini-batch to reduce the effect of potentially noisy examples. The researchers define a family of constrained optimization problems that yield simple solutions for these importance weights. The optimization problems are solved per mini-batch, eliminating the need to store and update importance weights over the full dataset.

Training a machine learning model aims to minimize a loss function that measures how well the current parameters fit the training data. In each training step, this loss is approximated as the (weighted) sum of the losses of the individual instances in the mini-batch being processed. In standard training, each instance is treated equally when updating the model parameters, which amounts to assigning uniform weights across the mini-batch.
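As a minimal sketch of the paragraph above, the mini-batch loss can be written as a weighted sum of per-instance losses; with uniform weights it reduces to the ordinary mean. The function name `weighted_batch_loss` and the sample loss values are illustrative assumptions, not taken from the original work.

```python
def weighted_batch_loss(losses, weights=None):
    """Weighted sum of per-instance losses for one mini-batch.

    If no weights are given, every instance gets weight 1/B,
    which corresponds to standard (uniformly weighted) training.
    """
    batch_size = len(losses)
    if weights is None:
        weights = [1.0 / batch_size] * batch_size
    return sum(w * l for w, l in zip(weights, losses))


# Hypothetical per-instance losses; the last one is large,
# as the loss of a mislabeled example might be.
losses = [0.2, 0.4, 0.3, 2.7]
uniform_loss = weighted_batch_loss(losses)  # same as the mean of the losses
```

With uniform weights, the single high-loss instance contributes as much to the update as any clean instance, which is the behavior instance reweighting is meant to change.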

Uniform weighting gives noisy instances the same influence on the update as clean ones. To address this, the research team presents a family of constrained optimization problems that assign importance weights to the individual instances in a mini-batch, limiting the effect of those likely to be noisy. It turns out that simple formulae for the instance weights can be derived for several divergence measures. The final loss used to update the model parameters is the weighted sum of the individual instance losses. This approach is known as Constrained Instance reWeighting (CIW). The smoothness or peakiness of the weights can be controlled through the choice of divergence and a corresponding hyperparameter.
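The article does not give the formulae themselves. As a hedged sketch, assume the divergence is the KL divergence from uniform weights: minimizing the weighted loss under that constraint leads to weights proportional to exp(-loss / λ), i.e. a softmax over negative losses. The name `ciw_weights_kl`, the parameter `lam`, and the loss values below are assumptions for illustration only.

```python
import math


def ciw_weights_kl(losses, lam):
    """Instance weights proportional to exp(-loss / lam), normalized to sum to 1.

    Assumed closed form for a KL-divergence constraint around uniform weights;
    other divergences would give different formulae.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    min_loss = min(losses)  # shift for numerical stability; cancels after normalization
    scores = [math.exp(-(l - min_loss) / lam) for l in losses]
    total = sum(scores)
    return [s / total for s in scores]


losses = [0.2, 0.4, 0.3, 2.7]  # hypothetical; the last instance looks noisy
peaky = ciw_weights_kl(losses, lam=0.5)     # small lam: high-loss instance nearly ignored
smooth = ciw_weights_kl(losses, lam=100.0)  # large lam: close to uniform weights
weighted_loss = sum(w * l for w, l in zip(peaky, losses))
```

In this sketch the hyperparameter `lam` plays the role described in the text: small values give peaky weights concentrated on low-loss instances, and large values recover nearly uniform weighting.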

To illustrate how the method works, consider a noisy version of the Two Moons dataset, which consists of randomly sampled points from two classes arranged in the shape of two half-moons. The researchers corrupt 30% of the labels and train a multilayer perceptron for binary classification on the resulting data. They use a standard binary cross-entropy loss and SGD with momentum as the optimizer. In the image below (left panel), the data points are shown together with a desirable decision boundary, drawn as a dotted line separating the two classes. The red points in the upper half-moon and the green points in the lower half-moon are the noisy data points.
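A setup of this kind can be approximated with the standard library alone. The sketch below is an assumption about the general recipe, not the researchers' code: the helper `make_noisy_two_moons`, the moon geometry, the sample size, and the choice to flip labels uniformly at random are all hypothetical.

```python
import math
import random


def make_noisy_two_moons(n_per_class=100, flip_fraction=0.3, seed=0):
    """Sample two interleaved half-moons and flip a fraction of the labels."""
    rng = random.Random(seed)
    points, clean_labels = [], []
    for _ in range(n_per_class):
        t = rng.uniform(0.0, math.pi)
        points.append((math.cos(t), math.sin(t)))  # upper moon, class 0
        clean_labels.append(0)
    for _ in range(n_per_class):
        t = rng.uniform(0.0, math.pi)
        points.append((1.0 - math.cos(t), 0.5 - math.sin(t)))  # lower moon, class 1
        clean_labels.append(1)

    # Pick the instances whose binary label will be flipped.
    n_total = len(points)
    flipped = set(rng.sample(range(n_total), round(flip_fraction * n_total)))
    noisy_labels = [1 - y if i in flipped else y for i, y in enumerate(clean_labels)]
    return points, clean_labels, noisy_labels


points, clean_labels, noisy_labels = make_noisy_two_moons()
num_flipped = sum(c != n for c, n in zip(clean_labels, noisy_labels))
```

Keeping the clean labels alongside the noisy ones makes it possible to check later whether low instance weights actually land on the corrupted points.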

The baseline model, trained with the binary cross-entropy loss, assigns uniform weights to the instances in each mini-batch. It eventually overfits to the noisy instances, resulting in a poor decision boundary (middle panel in the figure below).

The CIW approach reweights the instances in each mini-batch according to their loss values (right panel in the figure below). It assigns larger weights to clean instances, which lie on the correct side of the decision boundary, and smaller weights to noisy instances, which have higher loss values. The smaller weights prevent the model from overfitting to the noisy samples, allowing the CIW-trained model to converge to a suitable decision boundary.

Figure 1: An illustration of the decision boundary as training progresses for the baseline and the proposed CIW method on the Two Moons dataset

Class reWeighting with Constraints

Instance reweighting assigns lower weights to instances with higher losses. The researchers take this idea further by assigning importance weights to all possible class labels. Standard training uses the one-hot label vector as the class weights, assigning a weight of 1 to the labeled class and 0 to all other classes. For potentially mislabeled instances, however, it is reasonable to assign non-zero weights to classes that could be the true label. These class weights are obtained as solutions to constrained optimization problems in which a hyperparameter controls how far the class weights may deviate from the one-hot label distribution, as measured by a divergence of choice. The researchers again derive simple formulae for the class weights for several divergence measures. The combined method is called Constrained Instance and Class reWeighting (CICW).
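The article does not say which divergences are used for the class weights. As one worked illustration, assume the total-variation distance: the constraint TV(v, y) ≤ γ then simply means the labeled class must keep at least 1 − γ of the mass, so the loss-minimizing weights move γ from the given label to the class with the lowest loss. The function `class_weights_tv`, the parameter `gamma`, and the example losses are hypothetical.

```python
def class_weights_tv(class_losses, label, gamma):
    """Class weights within total-variation distance gamma of the one-hot label.

    Minimizes sum_j v_j * class_losses[j] over the probability simplex.
    Total variation is an assumed choice of divergence for this sketch.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must be in [0, 1]")
    weights = [0.0] * len(class_losses)
    weights[label] = 1.0
    best = min(range(len(class_losses)), key=lambda j: class_losses[j])
    if best != label:
        # Move gamma mass from the given label to the lowest-loss class.
        weights[label] = 1.0 - gamma
        weights[best] = gamma
    return weights


# Hypothetical per-class losses for one instance labeled as class 0;
# the model finds class 2 far more plausible.
v = class_weights_tv([2.3, 1.9, 0.1], label=0, gamma=0.2)
```

Setting `gamma` to 0 recovers the one-hot label of standard training, while larger values let the model's own prediction override a label it strongly disagrees with.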

Using Mixup with Instance Weights

The researchers suggest combining the derived instance weights with mixup, a popular strategy for regularizing models and improving prediction accuracy. Mixup takes a pair of examples from the original dataset and forms a random convex combination of them to create a new artificial example. The model is trained by minimizing the loss on these mixed-up data points. Vanilla mixup ignores the individual instance losses, which can be problematic for noisy data because it treats clean and noisy examples identically.
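A minimal sketch of vanilla mixup on a single pair follows. Drawing the mixing coefficient from a Beta(α, α) distribution is the common convention and is an assumption here, as are the function name `mixup_pair` and the toy inputs.

```python
import random


def mixup_pair(x_i, y_i, x_j, y_j, alpha=1.0, rng=random):
    """Return a random convex combination of two (features, one-hot label) examples."""
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix, lam


rng = random.Random(0)
x_mix, y_mix, lam = mixup_pair([0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [0.0, 1.0], rng=rng)
```

Note that neither the choice of pair nor the coefficient depends on the losses, so a mislabeled example is mixed in exactly as often and as strongly as a clean one.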

Because a high instance weight obtained with the CIW approach is more likely to indicate a clean example, the researchers use the instance weights to perform biased sampling for mixup, and they also use the weights in the convex combinations (instead of the random convex combinations of vanilla mixup). As a result, the mixed-up examples are biased toward clean data points. They call this variant CICW-Mixup.
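One way this could look in code is sketched below. The exact sampling and mixing rules are not given in the article, so the choices here are assumptions: partners are sampled with probability proportional to their instance weight, and the mixing coefficient is w_i / (w_i + w_j). The function `weighted_mixup_batch` and the toy data are hypothetical.

```python
import random


def weighted_mixup_batch(xs, ys, weights, rng=random):
    """Mix each example with a partner sampled in proportion to its instance weight.

    Expects strictly positive weights (e.g. normalized instance weights).
    """
    indices = list(range(len(xs)))
    mixed = []
    for i in indices:
        # Biased sampling: high-weight (likely clean) examples are chosen more often.
        j = rng.choices(indices, weights=weights, k=1)[0]
        # Weight-based mixing coefficient instead of a random one.
        coeff = weights[i] / (weights[i] + weights[j])
        x_mix = [coeff * a + (1.0 - coeff) * b for a, b in zip(xs[i], xs[j])]
        y_mix = [coeff * a + (1.0 - coeff) * b for a, b in zip(ys[i], ys[j])]
        mixed.append((x_mix, y_mix))
    return mixed


# Toy batch: the third example has a low weight, as a noisy example might.
xs = [[0.0], [1.0], [10.0]]
ys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
weights = [0.45, 0.45, 0.10]
mixed = weighted_mixup_batch(xs, ys, weights, rng=random.Random(0))
```

Under these assumptions a low-weight example is both rarely picked as a partner and contributes little to any mixture it appears in, which is what skews the mixed-up examples toward clean data.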

They find that the proposed CICW outperforms several existing approaches and matches the results of dynamic mixup, which maintains importance weights over the whole training set while using mixup. Compared with these approaches, combining their importance weights with mixup in CICW-Mixup yields much better performance, especially at higher noise rates (shown by the lines that lie above and to the right in the graphs below).

The methods compared include standard Cross-Entropy Loss (CE), Active-Passive Normalized Loss, Bi-tempered Loss, the proposed CICW, Mixup, Dynamic Mixup, and the proposed CICW-Mixup.