In recent years, the use of machine learning (ML) techniques in the medical field has grown significantly. Thanks to the high performance of deep learning models, machines can detect and classify diseases precisely, sometimes even outperforming specialists. To do this, a model must be trained on data such as medical images, which requires access to patients’ personal information. The use of this personal data raises privacy concerns. As a result, one of the most significant barriers to Learning Health Systems (LHS) research and development is the lack of access to electronic health record (EHR) patient data.
Thanks to advances in synthetic data generation, synthetic patient data have recently been accepted as an alternative for testing new processes that involve EHR data. It is now possible to generate synthetic patient data that can be used to develop an ML-enabled LHS and be shared between research communities without restriction. Synthetic datasets for cardiovascular disease, and even cancer, have already appeared. In this context, a research team from California proposed a new, reproducible process that uses synthetic patients to build an ML-based risk prediction LHS.
Concretely, the authors carried out a simulation study. In this study, a risk prediction LHS is built by training an XGBoost base model for different target diseases, such as lung cancer or stroke, on existing electronic health record (EHR) data. The study follows two steps. In the first step, a new ML-enabled LHS process was proposed to build a risk prediction LHS for lung cancer in synthetic patients. In the second step, a different target disease, stroke, was used to check whether the new LHS process is effective at building accurate risk prediction LHS for various target diseases. The authors proposed a high-level, data-centric, ML-enabled LHS design for risk prediction. First, the ML model is built from the initial EHR patient data. Next, LHS learning cycles continually use up-to-date patient data to improve the ML model and quickly release a new model that doctors can use to make risk predictions.
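The learning cycle described above (build an initial model, then repeatedly add new patients, retrain, and release a new model) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the patient records, the `make_batch` generator, the `LearningHealthSystem` class, and the batch size are all hypothetical, and a simple one-feature threshold classifier stands in for the XGBoost model used in the paper.

```python
import random
from dataclasses import dataclass, field

# One hypothetical patient record: (risk_score, has_disease).
Patient = tuple[float, bool]


def make_batch(n: int, rng: random.Random) -> list[Patient]:
    """Generate n made-up patients; sick patients tend to score higher."""
    batch = []
    for _ in range(n):
        sick = rng.random() < 0.2
        score = rng.gauss(0.7 if sick else 0.3, 0.15)
        batch.append((score, sick))
    return batch


def fit_threshold(data: list[Patient]) -> float:
    """Stand-in for model training (NOT XGBoost): pick the most accurate cutoff."""
    candidates = [i / 100 for i in range(101)]

    def accuracy(t: float) -> float:
        return sum((s >= t) == y for s, y in data) / len(data)

    return max(candidates, key=accuracy)


@dataclass
class LearningHealthSystem:
    data: list[Patient] = field(default_factory=list)
    versions: list[float] = field(default_factory=list)

    def learning_cycle(self, new_patients: list[Patient]) -> float:
        """Add new patients, retrain on all data so far, release a new version."""
        self.data.extend(new_patients)
        model = fit_threshold(self.data)
        self.versions.append(model)
        return model


rng = random.Random(0)
lhs = LearningHealthSystem()
for cycle in range(5):
    threshold = lhs.learning_cycle(make_batch(300, rng))
    print(f"cycle {cycle + 1}: {len(lhs.data)} patients, threshold {threshold:.2f}")
```

The first call to `learning_cycle` plays the role of the initial model build, and each later call simulates one update with newly arrived patients. In a real setting, `fit_threshold` would be replaced by the actual training step.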
The ML-enabled LHS was initialized with a dataset of 30,000 synthetic Synthea patients, and an XGBoost model was used for lung cancer risk prediction. Then, four more datasets of 30,000 patients each were generated. These were added one after another to the previously updated dataset to simulate the arrival of new patients, resulting in datasets of 60,000, 90,000, 120,000, and 150,000 patients. A new XGBoost model was built at each step. The results show that performance improves as the data size increases, reaching 0.936 recall and 0.962 AUC (area under the curve) on the 150,000-patient dataset. The effectiveness of the new ML-enabled LHS process was verified by building XGBoost models for stroke risk prediction on the same Synthea patient populations.
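For readers less familiar with the two metrics reported here, the sketch below computes recall and AUC from scratch. It is a minimal illustration, not the evaluation code from the paper: the labels, scores, and the 0.5 decision cutoff are made-up assumptions, and the pairwise AUC loop is chosen for clarity rather than speed.

```python
def recall(y_true: list[int], y_pred: list[int]) -> float:
    """Fraction of actual positives that the model flagged as positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(y_true)
    return true_pos / actual_pos


def roc_auc(y_true: list[int], scores: list[float]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, which matches the area under the ROC curve.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = disease) and predicted risk scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cutoff

print("recall =", recall(y_true, y_pred))
print("AUC    =", roc_auc(y_true, scores))
```

In this toy example, three of the four sick patients score above the cutoff, so recall is 0.75. Of the 16 positive-negative pairs, 15 are ranked correctly, so AUC is 0.9375. Recall depends on the chosen cutoff, while AUC summarizes ranking quality across all cutoffs.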
This paper introduced, for the first time, a process for building an ML model from synthetic medical data. The study demonstrated the effectiveness of this new LHS approach, which can handle different diseases using other EHR data. The proposed model continues to learn from newly generated patient data to improve its performance, reaching risk predictions above 95% for recall and precision. Finally, the authors note that because real data differ from synthetic data, ML models trained on real data can be further optimized through hyperparameter tuning.
This article is a research summary written by Marktechpost staff, based on the research paper 'Simulation of a machine learning enabled learning health system for risk prediction using synthetic patient data'. All credit for this research goes to the researchers on this project. Check out the paper.
Mahmoud is a PhD researcher in machine learning. He also holds a
bachelor's degree in physical science and a master's degree in
telecommunications and networking systems. His current areas of
research concern computer vision, stock market prediction and deep
learning. He produced several scientific articles about person
re-identification and the study of the robustness and stability of deep