Data-centric AI is an emerging field that focuses on engineering data to build AI applications with off-the-shelf machine learning (ML) models. Previous efforts have primarily focused on model-centric AI, which assumes a static setting: 1) data collection and engineering are already complete, and 2) the primary goal is to keep improving ML models to achieve high performance on test sets. However, real-world AI applications face increasingly complex circumstances that model-centric AI cannot fully address. For example, researchers must devote significant effort to data preparation, including data labeling and error detection.
Meanwhile, they must monitor data to detect distribution drift and update models in real time. Tackling these difficulties from a model standpoint alone leads to suboptimal solutions. As a result, many initiatives now focus on data-centric approaches, or on merging model-centric and data-centric practices, to develop and democratize AI systems. Although the concept of data-centric AI is new, several pioneering studies have already made essential contributions to data engineering. One important direction is active learning (AL). The goal of AL is to reduce manual labeling effort while preserving, or even improving, the performance of ML models.
It is well known that ML models are extremely data-hungry. As a result, to reach performance (for example, accuracy) that meets application requirements, practitioners typically must label a large amount of data during data collection. This process is highly time-consuming and labor-intensive, and it frequently becomes the bottleneck in developing ML applications. To address the issue, AL uses selection strategies to pick the most representative yet diverse training samples from a large pool of training data. The selected samples are then sent to an oracle (e.g., human annotators) for labeling. After that, only these labeled subsets are used to train ML models.
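The select, label, and train loop described above can be sketched in a few lines. This is a minimal illustration, not the ALaaS implementation: it swaps in least-confidence sampling, a common uncertainty-based strategy, in place of the representativeness and diversity criteria mentioned in the text. The names `least_confidence_select`, `active_learning_round`, `toy_predict_proba`, and `toy_oracle` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def least_confidence_select(probs: Sequence[Sequence[float]], budget: int) -> List[int]:
    """Return indices of the `budget` samples whose top-class probability is lowest."""
    # Uncertainty score: 1 - max class probability (higher means less confident).
    scores = [1.0 - max(p) for p in probs]
    ranked = sorted(range(len(probs)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]


def active_learning_round(
    pool: Sequence[float],
    predict_proba: Callable[[float], Sequence[float]],
    oracle: Callable[[float], int],
    budget: int,
) -> Tuple[List[Tuple[float, int]], List[float]]:
    """Run one AL round: score the pool, query the oracle, return (labeled, remaining)."""
    probs = [predict_proba(x) for x in pool]
    chosen = set(least_confidence_select(probs, budget))
    # Only the selected samples are sent to the oracle for labeling.
    labeled = [(x, oracle(x)) for i, x in enumerate(pool) if i in chosen]
    remaining = [x for i, x in enumerate(pool) if i not in chosen]
    return labeled, remaining


# Toy stand-ins (assumptions): a "model" that treats x in [0, 1] as P(class 1),
# and an "oracle" that labels by thresholding at 0.5.
def toy_predict_proba(x: float) -> List[float]:
    return [1.0 - x, x]


def toy_oracle(x: float) -> int:
    return int(x >= 0.5)


if __name__ == "__main__":
    pool = [0.05, 0.48, 0.9, 0.55, 0.2]
    labeled, remaining = active_learning_round(pool, toy_predict_proba, toy_oracle, budget=2)
    print(labeled)    # the two samples closest to the decision boundary
    print(remaining)  # the rest of the pool stays unlabeled
```

In a real workflow the labeled samples would be added to the training set, the model retrained, and the round repeated until the labeling budget runs out.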
In this way, users can still obtain a competitive ML model while significantly reducing labeling and training costs. However, employing AL techniques is a complex undertaking. Applying AL to AI application development involves more than searching for, selecting, and implementing AL algorithms. Users must also build a customized backend to execute the AL pipeline in their own environment (e.g., a private cluster or AWS). In other words, they must perform repetitive engineering work with boilerplate code. Furthermore, users must weigh efficiency and cost, as AL frequently operates on large datasets, and some AL techniques (e.g., committee-based ones) require running many ML models for data selection.
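To see why committee-based selection is costly, consider a minimal sketch of query-by-committee with vote entropy, one well-known committee-based strategy. It is an illustration under stated assumptions rather than the method used in the paper: the committee members here are hypothetical toy threshold classifiers, and `vote_entropy` and `committee_select` are names introduced for this example.

```python
import math
from collections import Counter
from typing import Callable, Hashable, List, Sequence


def vote_entropy(votes: Sequence[Hashable]) -> float:
    """Entropy of the committee's label votes; 0 when all members agree."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def committee_select(
    pool: Sequence[float],
    committee: Sequence[Callable[[float], int]],
    budget: int,
) -> List[int]:
    """Return indices of the `budget` samples the committee disagrees on most."""
    # Every member predicts on every sample, so selection alone costs
    # len(pool) * len(committee) model inferences.
    scores = [vote_entropy([member(x) for member in committee]) for x in pool]
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]


if __name__ == "__main__":
    # Hypothetical committee: three classifiers with different decision thresholds.
    committee = [lambda x, t=t: int(x >= t) for t in (0.3, 0.5, 0.7)]
    pool = [0.1, 0.4, 0.6, 0.9]
    print(committee_select(pool, committee, budget=2))  # indices of the contested samples
```

With a pool of millions of samples and a committee of deep models, the inference count in `committee_select` dominates the cost, which is why an efficient backend matters.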
Although some open-source AL tools lower the barrier to deploying AL, they are inefficient. Without careful design, the AL process becomes slow and incurs additional cost. To address these difficulties, the researchers propose an efficient backend for AL. Their Active-Learning-as-a-Service (ALaaS) system (see Figure below) can efficiently run AL strategies on large datasets by using multiple or distributed devices. It adopts a server-client architecture to carry out AL tasks. As a result, the system is simple to install on both laptops and public clouds.
The architecture of ALaaS. The system has a server-client architecture that is simple to deploy. It also supports a variety of AL methods, model zoos, and serving engines.
After installation, users can launch the system with a simple configuration file written from the provided templates. The system then runs AL tasks as an efficient pipeline. Meanwhile, acceleration techniques such as data caching and batching are applied to speed up the AL process. Furthermore, the system is designed for accessibility and modularity, so that non-experts can easily apply the AL strategies stored in its AL zoo, and experts can contribute more advanced AL strategies for more scenarios. Experiments show that ALaaS outperforms all the baselines in terms of latency and throughput. Additional ablation studies demonstrate the effectiveness of the design and reveal further insights. Code, along with documentation, is available on GitHub.
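The two acceleration techniques named above, data caching and batching, can be illustrated with a small standard-library sketch. It does not use or reproduce the ALaaS API: `make_batches`, `load_sample`, `score_batch`, and `run_pipeline` are hypothetical stand-ins, and the stages run sequentially here, whereas a pipelined system would overlap them.

```python
from functools import lru_cache
from typing import Dict, Iterable, Iterator, List, Sequence


def make_batches(items: Iterable, batch_size: int) -> Iterator[List]:
    """Group items into lists of at most `batch_size` elements."""
    batch: List = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


@lru_cache(maxsize=1024)
def load_sample(uri: str) -> str:
    """Hypothetical stand-in for an expensive download-and-preprocess step."""
    return uri.upper()


def score_batch(samples: Sequence[str]) -> List[float]:
    """Hypothetical stand-in for one batched model inference call."""
    return [1.0 / (1 + len(s)) for s in samples]


def run_pipeline(uris: Sequence[str], batch_size: int) -> Dict[str, float]:
    """Load each sample (cached), then score the samples one batch at a time."""
    scores: Dict[str, float] = {}
    for batch in make_batches(uris, batch_size):
        samples = [load_sample(u) for u in batch]  # repeated URIs hit the cache
        for uri, score in zip(batch, score_batch(samples)):
            scores[uri] = score
    return scores


if __name__ == "__main__":
    print(run_pipeline(["a", "bb", "a", "ccc"], batch_size=2))
    print(load_sample.cache_info())  # the repeated URI is served from the cache
```

Batching amortizes per-call overhead across many samples, while caching avoids repeating the load step when the same data is requested more than once.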
This article is written as a research summary by Marktechpost Staff based on the research paper 'Active-Learning-as-a-Service: An Efficient MLOps System for Data-Centric AI'. All credit for this research goes to the researchers on this project. Check out the paper and GitHub link.
Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.