Researchers at MIT and Amazon Study Pervasive Label Errors in Test Sets that Destabilize Machine Learning Benchmarks

Large labeled datasets are crucial for successful supervised machine learning (ML) across domains such as image classification, sentiment analysis, and audio classification. However, ML datasets are not perfectly labeled. The processes used to build them often involve automatic labeling or crowdsourcing, both of which are inherently error-prone.

Prior work has mostly focused on label noise in the training sets of ML datasets. Few studies concentrate on label errors in test sets, yet such errors have a range of potential consequences. No prior study has looked at systematic error across the most-cited ML test sets.

Benchmark test datasets are used to evaluate ML models and validate theoretical findings. If label errors occur extensively, they could undermine the framework by which we measure machine learning progress. Label errors in test sets could also lead practitioners to incorrect conclusions about a model's performance.

Researchers at MIT and Amazon introduce a study that identifies and systematically analyzes label errors in 10 commonly used datasets spanning computer vision (CV), natural language processing (NLP), and audio processing. The team's methodology aims to characterize how prevalent label errors are in the test data of benchmarks chosen to measure ML progress. They then analyze the practical consequences of these errors, especially their effects on model selection.

The team used confident learning (CL) to algorithmically identify putative label errors in test sets at scale. They then validated these candidates through human evaluation, estimating an average error rate of 3.4% across the datasets. For example, they found roughly 6% errors in the ImageNet validation set and over 10% errors in QuickDraw.
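To make the idea of flagging putative label errors concrete, here is a minimal sketch of a simplified confident-learning-style rule. It is not the authors' implementation: the function names are hypothetical, the predicted probabilities are assumed to come from a model that did not train on the examples being scored, and the rule shown (per-class thresholds equal to the mean self-confidence of each class) is only the core thresholding step.

```python
from typing import List, Sequence


def class_thresholds(labels: Sequence[int],
                     pred_probs: Sequence[Sequence[float]]) -> List[float]:
    """Per-class threshold: mean predicted probability of class k
    over the examples whose given label is k."""
    n_classes = len(pred_probs[0])
    sums = [0.0] * n_classes
    counts = [0] * n_classes
    for y, p in zip(labels, pred_probs):
        sums[y] += p[y]
        counts[y] += 1
    # A class with no labeled examples gets threshold 1.0 (assumption).
    return [sums[k] / counts[k] if counts[k] else 1.0
            for k in range(n_classes)]


def find_putative_label_errors(labels: Sequence[int],
                               pred_probs: Sequence[Sequence[float]]) -> List[int]:
    """Return indices whose confidently predicted class differs from the given label."""
    thresholds = class_thresholds(labels, pred_probs)
    flagged = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes whose probability reaches that class's own threshold.
        confident = [k for k in range(len(p)) if p[k] >= thresholds[k]]
        if not confident:
            continue  # no confident guess, so the example is not flagged
        guess = max(confident, key=lambda k: p[k])
        if guess != y:
            flagged.append(i)
    return flagged


# Toy data (invented for illustration): example 2 is labeled 0 but looks like class 1.
toy_labels = [0, 0, 0, 1, 1, 1]
toy_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
             [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
toy_flagged = find_putative_label_errors(toy_labels, toy_probs)
```

Examples flagged this way are only candidates. In the study, candidates like these were passed to human reviewers for validation before being counted as errors.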

Test set errors are prominent across common benchmark datasets. Errors are estimated using confident learning (CL) and validated by human workers.

Mechanical Turk validation confirms the existence of pervasive label errors and categorizes the types of label issues.

Using a simple algorithmic and crowdsourcing pipeline, they found that label errors are pervasive in the test sets of popular benchmarks that are widely used in ML research.

They provide a cleaned and corrected version of each test set, in which humans have corrected a large fraction of the label errors. The team hopes that future research on these benchmarks will use this improved test data instead of the original erroneous labels.

The implications of the pervasive test set label errors

The researchers report that higher-capacity models perform better on the subset of incorrectly labeled test data when accuracy is measured against the original (erroneous) labels. When accuracy on this same subset is measured against the corrected labels, however, these models perform worse than their simpler counterparts.

ImageNet top-1 original accuracy (top panel) and corrected accuracy (bottom panel) vs Noise Prevalence.

An intuitive hypothesis is that a high-capacity model more closely fits all statistical patterns present in the data. This includes patterns related to systematic label errors, which models with more limited capacity are less capable of closely approximating.
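The comparison behind this finding, accuracy on the originally mislabeled subset scored against the original labels versus the corrected labels, can be sketched as follows. This is an illustrative sketch only: the function name and the toy predictions are assumptions, not data or code from the study.

```python
from typing import Sequence, Tuple


def mislabeled_subset_accuracies(predictions: Sequence[int],
                                 original_labels: Sequence[int],
                                 corrected_labels: Sequence[int]) -> Tuple[float, float]:
    """Accuracy on examples whose label was corrected, scored two ways:
    against the original labels and against the corrected labels."""
    subset = [i for i, (o, c) in enumerate(zip(original_labels, corrected_labels))
              if o != c]
    if not subset:
        raise ValueError("no examples had their label corrected")
    n = len(subset)
    acc_original = sum(predictions[i] == original_labels[i] for i in subset) / n
    acc_corrected = sum(predictions[i] == corrected_labels[i] for i in subset) / n
    return acc_original, acc_corrected


# Toy labels (invented): examples 2, 3, and 5 were relabeled by reviewers.
original = [0, 1, 1, 0, 2, 2]
corrected = [0, 1, 0, 1, 2, 1]

# Hypothetical predictions from a larger and a smaller model.
big_model_preds = [0, 1, 1, 0, 2, 1]
small_model_preds = [0, 1, 0, 1, 2, 2]

big_scores = mislabeled_subset_accuracies(big_model_preds, original, corrected)
small_scores = mislabeled_subset_accuracies(small_model_preds, original, corrected)
```

In this toy case the larger model agrees more often with the erroneous original labels, while the smaller model agrees more often with the corrected ones, which mirrors the pattern described above.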

For commonly used benchmark datasets, they estimated how prevalent originally mislabeled test data must be before benchmark rankings become unstable. They note that even a slight increase in the prevalence of test label errors can cause model selection based on standard test accuracy to choose the wrong model.

Evaluation of how benchmarks of popular pre-trained models change

The team randomly and incrementally removed correctly labeled examples, one at a time, until only the original set of mislabeled test data (with corrected labels) was left. They operationalized averaging over all orderings of removal by linearly interpolating from benchmark accuracy on the corrected test set to accuracy on the erroneously labeled subset. This allowed them to directly estimate the noise prevalence at which benchmark rankings change. For instance, ResNet-18 outperformed ResNet-50 once enough correctly labeled examples had been removed to raise the prevalence of originally mislabeled test examples by just 6%.
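The interpolation described above can be sketched in a few lines. This is a sketch under stated assumptions, not the authors' code: the function names are hypothetical and the accuracy figures in the usage example are invented for illustration. The key observation is that expected accuracy is linear in the noise prevalence p, moving from the accuracy on the full corrected test set (at the dataset's original prevalence) to the accuracy on the originally mislabeled subset (at p = 1).

```python
from typing import Optional


def interpolated_accuracy(acc_full: float, acc_err_subset: float,
                          base_prevalence: float, prevalence: float) -> float:
    """Expected corrected-label accuracy at a given noise prevalence,
    linearly interpolated between the full test set and the mislabeled subset."""
    t = (prevalence - base_prevalence) / (1.0 - base_prevalence)
    return acc_full + t * (acc_err_subset - acc_full)


def crossover_prevalence(full_a: float, err_a: float,
                         full_b: float, err_b: float,
                         base_prevalence: float) -> Optional[float]:
    """Noise prevalence at which models A and B swap rank, or None if
    their interpolated accuracies never cross between base_prevalence and 1."""
    gap_start = full_a - full_b  # accuracy gap on the full corrected test set
    gap_end = err_a - err_b      # accuracy gap on the mislabeled subset
    if gap_start == gap_end:
        return None              # parallel lines: the gap never changes
    t = gap_start / (gap_start - gap_end)
    if not 0.0 <= t <= 1.0:
        return None
    return base_prevalence + t * (1.0 - base_prevalence)


# Invented numbers: model A leads on the full set, model B on the mislabeled subset.
p_cross = crossover_prevalence(full_a=0.80, err_a=0.40,
                               full_b=0.78, err_b=0.50,
                               base_prevalence=0.05)
```

With these invented numbers, the rankings flip at a noise prevalence of roughly 0.21. Below that point model A scores higher, and above it model B does.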

Benchmark ranking comparison of 34 models pre-trained on ImageNet and 13 pre-trained on CIFAR-10.

The team states that future work will involve a rigorous analysis to disambiguate and understand the contribution of each possible cause of this behavior and its effect on benchmarking stability.

