High-quality training datasets are essential for building new AI models. However, many commonly used public datasets contain labeling errors, which makes it challenging to train robust models, particularly for novel tasks. Researchers typically work around these shortcomings with a variety of data quality control procedures, but there is no centralized repository of examples showing how to apply these strategies.
Meta AI researchers have recently released Mephisto, a new platform to collect, share, and iterate on the most promising approaches to gathering training datasets for AI models. With Mephisto, researchers can exchange novel collection strategies in a reusable, iterable format. The platform also lets them swap out components and quickly obtain exactly the annotations they need, lowering the barrier to custom task creation.
The team has codified in Mephisto the common pathways for taking a complex annotation activity from concept to data collection. Beyond improving dataset quality, Mephisto also improves the experience of the researchers and annotators who create the datasets.
Researchers and technologists can use the same code to collect data across different research domains, crowdsourcing providers, and server configurations. This is achieved through a set of plug-and-play abstractions that handle the heavy lifting of getting a data-gathering operation started. Mephisto also includes workflow guides for the entire process, from concept to completion.
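To make the idea of plug-and-play abstractions concrete, here is a minimal sketch of how interchangeable components can share one driver. The class and method names below (Blueprint, CrowdProvider, and the concrete implementations) are hypothetical simplifications for illustration, not Mephisto's actual API.

```python
from abc import ABC, abstractmethod

# Hypothetical plug-and-play abstractions; names and interfaces are
# illustrative only, not Mephisto's real classes.

class Blueprint(ABC):
    """Defines what the annotation task looks like."""
    @abstractmethod
    def render_task(self, item: str) -> str: ...

class CrowdProvider(ABC):
    """Defines where workers come from (a crowd marketplace, a local pool, etc.)."""
    @abstractmethod
    def recruit(self) -> list[str]: ...

class SimpleFormBlueprint(Blueprint):
    """A task rendered as a plain HTML form."""
    def render_task(self, item: str) -> str:
        return f"<form><p>Label this item: {item}</p></form>"

class LocalProvider(CrowdProvider):
    """A stand-in worker pool for local testing."""
    def recruit(self) -> list[str]:
        return ["worker_1", "worker_2"]

def run_collection(blueprint: Blueprint, provider: CrowdProvider,
                   items: list[str]) -> dict:
    """The same driver code works with any combination of components."""
    assignments = {}
    for worker in provider.recruit():
        assignments[worker] = [blueprint.render_task(i) for i in items]
    return assignments

assignments = run_collection(SimpleFormBlueprint(), LocalProvider(), ["photo_001"])
```

Swapping `LocalProvider` for a real crowdsourcing backend, or `SimpleFormBlueprint` for a richer task definition, would leave `run_collection` unchanged, which is the point of the abstraction.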
For example, a researcher can start by looking for an existing task that resembles the data they wish to collect. They can then modify the code to change the initial design, such as the types of annotations returned or the data displayed to workers. At this point, researchers might be working with anything from simple HTML forms to complex model-invoking tasks.
Mephisto makes it easy to test and iterate locally before piloting. It provides a simple way to launch small pilot batches and review the results across a larger pool of workers once the task is complete. This makes it straightforward to spot flaws in the task design, or workers who are knowingly submitting invalid data. Researchers can then apply a variety of existing quality control approaches to improve task quality, or create their own heuristics tailored to the data they are collecting.
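One common quality-control heuristic of the kind described above is to seed the task with "gold" questions whose answers are already known, and flag workers whose accuracy on them is too low. The sketch below is a self-contained illustration of that idea, assuming a simple dict-based submission format; it is not part of Mephisto's API.

```python
# Hypothetical quality-control heuristic: flag workers whose accuracy on
# "gold" questions (items with known-correct labels) falls below a threshold.
# Illustrative sketch only, not Mephisto's actual implementation.

GOLD_ANSWERS = {"q1": "cat", "q2": "dog"}  # items with known-correct labels

def flag_low_quality_workers(submissions: dict[str, dict[str, str]],
                             threshold: float = 0.8) -> set[str]:
    """Return worker IDs whose gold-question accuracy is below the threshold."""
    flagged = set()
    for worker, answers in submissions.items():
        gold = [(q, a) for q, a in answers.items() if q in GOLD_ANSWERS]
        if not gold:
            continue  # no gold questions answered; cannot judge this worker
        correct = sum(1 for q, a in gold if GOLD_ANSWERS[q] == a)
        if correct / len(gold) < threshold:
            flagged.add(worker)
    return flagged

flagged = flag_low_quality_workers({
    "worker_1": {"q1": "cat", "q2": "dog"},   # perfect on gold questions
    "worker_2": {"q1": "cat", "q2": "bird"},  # 50% gold accuracy: flagged
})
```

A flagged worker's submissions might be rejected, re-reviewed, or excluded from the final dataset, depending on the researcher's policy.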
Once pilots show good-quality results, the full task is launched and its progress is tracked while in flight. The resulting dataset can then be bundled and distributed along with the complete code, allowing others to collect similar data.
At present, Mephisto includes key privacy protections such as masking worker identities. The researchers plan to add more capabilities in the future, including reporting worker statistics on contributions to a dataset, adding cautions about fair pay and protections that encourage responsible worker treatment, and highlighting projects that expressly try to debias datasets.
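Masking worker identity can be done by replacing each platform worker ID with a keyed pseudonym before a dataset is shared. The sketch below shows one standard way to do this with an HMAC; it is an illustrative assumption, not Mephisto's actual implementation, and the key name is hypothetical.

```python
import hashlib
import hmac

# Hypothetical sketch of worker-identity masking: replace each raw worker ID
# with a keyed hash before the dataset is released. Keeping the key private
# means outsiders cannot recompute or reverse the mapping.
SECRET_KEY = b"rotate-me-before-release"  # held only by the dataset maintainer

def mask_worker_id(worker_id: str) -> str:
    """Deterministically pseudonymize a worker ID with HMAC-SHA256."""
    digest = hmac.new(SECRET_KEY, worker_id.encode("utf-8"), hashlib.sha256)
    return "anon_" + digest.hexdigest()[:12]

# The released record carries the pseudonym, never the raw platform ID.
record = {"worker": mask_worker_id("platform_worker_A123"), "label": "cat"}
```

Because the mapping is deterministic, per-worker statistics (such as contribution counts) can still be computed over the masked IDs without revealing who the workers are.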
Collecting training data is an important part of AI research. The team hopes that open-sourcing their work will allow other researchers to build comparable datasets or extend existing ones.