Researchers at MIT Startup ‘DataCebo’ Introduce Synthetic Data Metrics: An Open-Source Python Library That Evaluates Synthetic Data By Comparing It To The Real Data It Is Meant To Mimic

Synthetic Data Metrics (SDMetrics) is a new tool developed by DataCebo, a startup born out of MIT’s Computer Science & Artificial Intelligence Laboratory (CSAIL) in 2020. This open-source Python library was created to help businesses evaluate tabular synthetic data by comparing it with the real data it is meant to mimic. The library includes a wide range of metrics covering statistics, efficiency, and data privacy. It also provides reports that users can generate to summarize the results and share them with their team. Because SDMetrics is model-agnostic, it can be used with any synthetic data, regardless of how that data was produced.

When working with tabular synthetic data, it is vital to have metrics that measure how the synthetic data compares to the real data. Each metric assesses a distinct aspect of the data, such as coverage or correlation, and lets the user determine which properties were preserved and which were lost when the synthetic data was generated. For example, the CategoryCoverage and RangeCoverage metrics measure whether an enterprise’s synthetic data covers the same range of possible values as the real data. To compare correlations, the researchers said, one can use the CorrelationSimilarity metric in SDMetrics. More than 30 metrics are currently available, and more are in development.
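To make the idea of coverage and correlation metrics concrete, here is a minimal standard-library Python sketch. It is not the SDMetrics API or its implementation: the function names (`category_coverage`, `range_coverage`, `correlation_similarity`) and the sample columns are hypothetical, and the formulas are simplified assumptions about what such metrics plausibly compute, with each score falling between 0 (worst) and 1 (best).

```python
from math import sqrt


def category_coverage(real, synthetic):
    """Fraction of distinct real categories that also appear in the synthetic column."""
    real_categories = set(real)
    if not real_categories:
        raise ValueError("real column is empty")
    return len(real_categories & set(synthetic)) / len(real_categories)


def range_coverage(real, synthetic):
    """Share of the real [min, max] interval that the synthetic values span."""
    r_min, r_max = min(real), max(real)
    width = r_max - r_min
    if width == 0:
        raise ValueError("real column has zero range")
    # Portion of the real range left uncovered at the low and high ends.
    missing_low = max((min(synthetic) - r_min) / width, 0.0)
    missing_high = max((r_max - max(synthetic)) / width, 0.0)
    return max(1.0 - (missing_low + missing_high), 0.0)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_similarity(real_x, real_y, synth_x, synth_y):
    """1 minus half the absolute gap between the real and synthetic correlations."""
    gap = abs(pearson(real_x, real_y) - pearson(synth_x, synth_y))
    return 1.0 - gap / 2.0


# Hypothetical columns, invented purely for illustration.
real_plan = ["basic", "plus", "premium", "plus"]
synth_plan = ["basic", "plus", "plus", "basic"]  # "premium" never generated
real_age = [20, 30, 40, 60]
synth_age = [25, 35, 45, 50]  # narrower than the real range

print(category_coverage(real_plan, synth_plan))
print(range_coverage(real_age, synth_age))
print(correlation_similarity([1, 2, 3, 4], [2, 4, 6, 8],
                             [1, 2, 3, 4], [8, 6, 4, 2]))
```

In this toy data the synthetic plan column misses one of three real categories and the synthetic ages span only part of the real range, so both coverage scores fall below 1. The last call compares a perfectly positive real correlation with a perfectly negative synthetic one, which is the worst case for the similarity score.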

The SDMetrics library is part of the Synthetic Data Vault (SDV) Project, which was launched at MIT’s Data to AI Lab in 2016. DataCebo was founded in 2020, after four years of research, with the primary goal of developing the project further. The Vault, an ecosystem of libraries for synthetic data generation, was created to help businesses build data models that support the development of new software and applications within the business. According to DataCebo, although much work is being done on synthetic data, particularly for self-driving cars and images, little is being done to help businesses benefit from it. The purpose of SDV is to ensure that businesses have access to packages for creating synthetic data when real data is not easily accessible or when using it would put data privacy at risk.

The SDV libraries draw on several graphical modeling and deep learning approaches, including Copulas, CTGAN, and DeepEcho. Models created with Copulas, which has been downloaded more than a million times, are being used by large banks, insurance companies, and businesses that focus on clinical trials. CTGAN, a neural network-based model, has been downloaded more than 500,000 times. According to DataCebo’s creators, data sets with multiple tables and time-series data are also supported.

