The rise of Machine Learning (ML) has brought new challenges related to the availability and effectiveness of datasets for training and testing ML models. This is commonly referred to as the “data bottleneck,” and it is hindering the progress and deployment of ML models in various fields. In response, a platform and community called DataPerf has been developed to create competitions and leaderboards for data and data-centric AI algorithms.
One of the major issues with datasets is their quality. Public training and testing datasets are typically built from readily available sources such as web scrapes, forums, and Wikipedia, or through crowdsourcing. However, these sources often suffer from bias, unbalanced distributions, and low quality. For example, visual data is often biased towards wealthier regions, leading to skewed results. These quality problems in turn create quantity problems: when a large portion of the data is low-quality, more data is needed to reach the same accuracy, which drives up the size and computational cost of models. As public data sources become exhausted, model accuracy may even plateau, slowing progress. Improving the quality of training and testing data is therefore crucial for the AI community to advance.
DataPerf seeks to address these challenges by providing a platform for developing leaderboards for data and data-centric AI algorithms. The platform is inspired by ML leaderboards, and it aims to have an impact on data-centric AI research similar to the one those leaderboards have had on ML model research. The platform uses Dynabench, a benchmarking tool for data, data-centric algorithms, and models.
DataPerf version 0.5 currently offers five challenges that focus on five common data-centric tasks across four different application domains. These challenges aim to benchmark and enhance the performance of data-centric algorithms and models. Each challenge comes with design documents that outline the problem, model, quality target, rules, and submission guidelines. The Dynabench platform includes a live leaderboard, an online evaluation framework, and the tracking of submissions over time.
The first two challenges focus on training data selection, where participants design a strategy for selecting the best training set from a large candidate pool of weakly labeled training images or automatically extracted clips of spoken words. The third challenge focuses on training data cleaning, where participants design a strategy for choosing samples to relabel from a noisy training set, with the current version targeting image classification. The fourth challenge focuses on training dataset valuation, where participants design a strategy for selecting the best training set from multiple data sellers based on limited information exchanged between buyers and sellers. Lastly, the fifth challenge, called Adversarial Nibbler, focuses on designing safe-looking prompts that lead to unsafe image generations in the multimodal text-to-image domain.
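To make the data cleaning task more concrete, here is a minimal sketch of one common heuristic for choosing which samples to relabel under a fixed budget: rank each sample by how little probability a model assigns to the label currently on record. This is an illustration only, not the DataPerf baseline or its submission interface. The function name `rank_for_relabeling`, its parameters, and the toy probabilities are all hypothetical, and the probabilities are assumed to come from some classifier trained separately.

```python
from typing import List, Sequence


def rank_for_relabeling(
    probs: Sequence[Sequence[float]],
    noisy_labels: Sequence[int],
    budget: int,
) -> List[int]:
    """Return up to `budget` sample indices, most suspicious first.

    probs[i][c] is an (assumed) model probability that sample i has class c.
    noisy_labels[i] is the possibly wrong label currently on record.
    """
    if len(probs) != len(noisy_labels):
        raise ValueError("probs and noisy_labels must have the same length")
    if budget < 0:
        raise ValueError("budget must be non-negative")

    # Suspicion = 1 - probability the model gives to the label on record.
    suspicion = [1.0 - p[y] for p, y in zip(probs, noisy_labels)]

    # Sort indices from most to least suspicious; ties keep original order.
    order = sorted(range(len(suspicion)), key=lambda i: suspicion[i], reverse=True)
    return order[:budget]


if __name__ == "__main__":
    # Toy example: four samples, two classes (hypothetical numbers).
    probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.05, 0.95]]
    noisy_labels = [0, 0, 1, 1]
    print(rank_for_relabeling(probs, noisy_labels, budget=2))  # [1, 2]
```

In the toy example, samples 1 and 2 carry labels the model disagrees with, so they are the ones sent for relabeling. A real entry would also have to account for the model itself being trained on noisy labels, which this sketch ignores.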
In short, DataPerf provides a platform and community for developing competitions and leaderboards for data and data-centric AI algorithms. By benchmarking and improving the quality of training and test data, it aims to ease the data bottleneck and to encourage new approaches to it. Ultimately, DataPerf’s efforts could help overcome the limitations of existing datasets and enable the development of more accurate and reliable machine-learning models in various domains.
Check out the Project and Reference Article. All credit for this research goes to the researchers on this project.
Niharika is a technical consulting intern at Marktechpost. She is a third-year undergraduate, currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She is a highly enthusiastic individual with a keen interest in machine learning, data science, and AI, and an avid reader of the latest developments in these fields.