Meet FedTabDiff: An Innovative Federated Diffusion-based Generative AI Model Tailored for the High-Quality Synthesis of Mixed-Type Tabular Data

One of the main difficulties researchers face when generating realistic tabular data is maintaining privacy, especially in sensitive domains like finance and healthcare. As data volumes grow and data analysis becomes central to every field, privacy concerns increasingly cause hesitancy in deploying AI models, making privacy preservation more important than ever. Real-world financial datasets pose particular challenges: mixed attribute types, implicit relationships between attributes, and imbalanced distributions.

Researchers from the University of St. Gallen (Switzerland), Deutsche Bundesbank (Germany), and the International Computer Science Institute (USA) have introduced FedTabDiff, a method to generate high-fidelity mixed-type tabular data without centralized access to the original datasets, ensuring privacy and compliance with regulations such as the EU's General Data Protection Regulation and the California Privacy Rights Act.


Traditional methods like anonymization and the removal of sensitive attributes often fail to provide adequate privacy in high-stakes domains. FedTabDiff instead relies on synthetic data: new records generated through a generative process that captures the inherent properties of the real data. The researchers leverage Denoising Diffusion Probabilistic Models (DDPMs), which have been highly successful at generating synthetic images, and adapt them to a federated setting for tabular data generation.
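To make the DDPM idea concrete, the forward process gradually corrupts a data record with Gaussian noise according to a variance schedule; in closed form, the noised sample at step t is a scaled mix of the original record and fresh noise. The sketch below is illustrative only (the schedule values and function names are assumptions, not taken from the paper):

```python
import numpy as np

def forward_diffuse(x0, t, betas):
    """Sample x_t from q(x_t | x_0) in closed form:
    q(x_t | x_0) = N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I),
    where alpha_bar_t is the cumulative product of (1 - beta_s).
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]
    noise = np.random.randn(*x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

# Toy example: one row of numeric tabular features.
x0 = np.array([0.5, -1.2, 0.3])
betas = np.linspace(1e-4, 0.02, 1000)  # a common linear noise schedule (assumption)
x_noisy = forward_diffuse(x0, t=999, betas=betas)  # near-pure noise at the final step
```

A trained denoising network then learns the reverse process, iteratively removing this noise to turn random samples back into realistic records.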

FedTabDiff incorporates DDPMs into a federated learning framework, allowing multiple entities to collaboratively train a generative model while respecting data privacy and locality. The DDPMs use a Gaussian diffusion model: a forward process perturbs data incrementally with Gaussian noise, and a learned reverse process restores it. The federated learning component uses a synchronous update scheme with weighted averaging for model aggregation. The architecture consists of a central FinDiff model maintained by a trusted entity and decentralized FinDiff models contributed by individual clients; federated optimization computes a weighted average over the decentralized model updates, driving the collaborative learning process. The researchers evaluated the model using standard metrics of fidelity, utility, privacy, and coverage.
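The weighted-averaging aggregation step described above can be sketched in a few lines. This is a minimal FedAvg-style illustration, assuming clients are weighted by their local dataset sizes (the function name and weighting choice are assumptions for illustration, not the paper's exact specification):

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Aggregate client model parameters by a weighted average.

    client_weights: list of flattened parameter vectors, one per client
    client_sizes:   number of local training samples per client
    """
    sizes = np.asarray(client_sizes, dtype=float)
    coeffs = sizes / sizes.sum()                 # per-client weights summing to 1
    stacked = np.stack(client_weights)           # shape: (num_clients, num_params)
    return (coeffs[:, None] * stacked).sum(axis=0)

# Three clients with unequal dataset sizes; client 3 counts double.
w = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
avg = federated_average(w, client_sizes=[10, 10, 20])
# avg = 0.25*w1 + 0.25*w2 + 0.5*w3 = [3.5, 4.5]
```

In each synchronous round, the central model would be replaced by this weighted average of the clients' locally updated models, so no raw records ever leave a client.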

FedTabDiff shows strong performance on both financial and medical datasets, demonstrating its effectiveness in diverse scenarios. Compared with non-federated FinDiff models, it performed better on all four metrics. The approach balances privacy preservation against fidelity, keeping deviations from the original data under control so the synthetic data does not become unrealistic. Empirical evaluations on real-world datasets demonstrate FedTabDiff's potential for responsible, privacy-preserving AI applications in domains like finance and healthcare.


Check out the Paper. All credit for this research goes to the researchers of this project.
