Stanford Researchers Developed a Machine Learning Model Called POPDx to Predict Rare Diseases, Including Diseases That Aren’t Present in The Training Data

A rare disease affects a small proportion of the population. Most rare diseases are genetic and therefore last throughout a person's life, even if symptoms do not appear immediately. Many rare disorders manifest early in life; approximately 30% of children with rare diseases die before age five.

In recent years, life sciences companies have made commendable advances in rare diseases, but significant challenges remain. Artificial intelligence and machine learning (AI/ML) have opened several opportunities for intelligent intervention which, if used well, can significantly improve the rare disease treatment journey. In particular, AI/ML can help identify and diagnose patients faster and more accurately.

A large amount of data is usually required to train machine learning models. Biobanks are large databases that contain genetic and health information from many patients, and their usefulness depends on the quantity and quality of the data they hold. Incomplete data is a frequent problem in patient datasets. To overcome this issue, Stanford researchers developed a model capable of predicting a comprehensive set of diagnosis codes (also known as phenotype codes) for all patients in the UK Biobank. The UK Biobank is an extensive biomedical database and research resource in the United Kingdom that includes detailed genetic and health data from half a million UK participants. It has contributed significantly to advances in modern medicine and treatment and has enabled several scientific discoveries that have improved human health.

The research team developed POPDx (Population-based Objective Phenotyping by Deep Extrapolation), a bilinear machine learning framework for disease recognition that simultaneously estimates the probabilities of 1538 phenotype codes for each person. For development and evaluation, the team extracted phenotypic and health-related data from 392,246 individuals in the UK Biobank. They assessed POPDx against other automated multi-phenotype recognition methods and report that it outperforms them in predicting rare diseases. Unlike most models, POPDx does not need many training examples per disease: it draws on prior knowledge about diseases and can therefore predict diseases that do not appear in the training data at all. This is especially useful because data on rare diseases is scarce.
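
To make the idea of a bilinear model concrete, here is a minimal sketch in Python with NumPy. It is not the POPDx implementation: the dimensions, the random inputs, and the names `bilinear_probabilities`, `weights`, and `label_embeddings` are assumptions chosen for illustration. It only shows how scoring each patient against an embedding of each disease lets a model produce probabilities for diseases it has no training examples for, as long as an embedding for that disease exists.

```python
import numpy as np


def sigmoid(z):
    """Map raw scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def bilinear_probabilities(patients, weights, label_embeddings):
    """Score every patient against every label embedding.

    patients:         (n_patients, n_features) patient feature vectors
    weights:          (n_features, embed_dim)  learned projection (hypothetical)
    label_embeddings: (n_labels, embed_dim)    one vector per phenotype code
    Returns an (n_patients, n_labels) matrix of independent probabilities.
    """
    projected = patients @ weights            # patients in embedding space
    scores = projected @ label_embeddings.T   # bilinear form x^T W e
    return sigmoid(scores)


# Toy data; all sizes and values are made up for illustration.
rng = np.random.default_rng(0)
n_patients, n_features, embed_dim = 4, 10, 6
n_seen, n_unseen = 5, 2

patients = rng.normal(size=(n_patients, n_features))
weights = rng.normal(scale=0.1, size=(n_features, embed_dim))

# Embeddings for labels seen in training and for labels never seen.
# In practice these would come from prior knowledge about the diseases.
seen_labels = rng.normal(size=(n_seen, embed_dim))
unseen_labels = rng.normal(size=(n_unseen, embed_dim))
all_labels = np.vstack([seen_labels, unseen_labels])

probs = bilinear_probabilities(patients, weights, all_labels)
print(probs.shape)  # (4, 7): one probability per patient per label
```

Because the weights act on the embedding space rather than on individual labels, adding a new disease only requires a new row in `label_embeddings`, not a new output unit.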

The POPDx model searches for relationships between the patient's data and disease information, making probabilistic decisions using natural language processing and the Human Disease Ontology. The team framed the task as multi-label classification, since a patient can have more than one disease. Because most ML models rely on large datasets, POPDx's solid performance on diseases with few or no training examples is compelling, and its ability to recognize rare diseases gives clinicians and researchers a better starting point for studying them. Another problem the team faced was missing data on individual patients. To address it, they used each patient's background information and records to predict diseases the patient might have.
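
Multi-label classification differs from ordinary classification in that each patient gets an independent yes/no target for every disease code. The sketch below, in Python with NumPy, shows one common way to set this up: multi-hot target vectors scored with binary cross-entropy. The article does not say which loss the team used, so the loss choice, the helper names `multi_hot` and `binary_cross_entropy`, and the toy codes are assumptions for illustration only.

```python
import numpy as np


def multi_hot(code_lists, n_codes):
    """Turn per-patient lists of code indices into 0/1 target rows."""
    targets = np.zeros((len(code_lists), n_codes))
    for row, codes in enumerate(code_lists):
        for code in codes:
            targets[row, code] = 1.0
    return targets


def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over all (patient, code) pairs."""
    probs = np.clip(probs, eps, 1.0 - eps)  # avoid log(0)
    losses = -(targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs))
    return losses.mean()


# Three hypothetical patients and four hypothetical disease codes:
# the first has two diagnoses, the second has one, the third has none.
patient_codes = [[0, 2], [1], []]
targets = multi_hot(patient_codes, n_codes=4)

# A model that is maximally unsure predicts 0.5 everywhere.
probs = np.full(targets.shape, 0.5)
loss = binary_cross_entropy(probs, targets)
print(targets)
print(loss)
```

Each column is treated as its own binary problem, so a patient with several diagnoses, or with none, needs no special handling.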

POPDx could improve disease prediction in settings where training data is limited or missing, which makes it a significant step forward for rare disease research.


Check out the Paper and Reference Article. All credit for this research goes to the researchers on this project.

Niharika is a technical consulting intern at Marktechpost. She is a third-year undergraduate pursuing her B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She is a highly enthusiastic individual with a keen interest in machine learning, data science, and AI, and an avid reader of the latest developments in these fields.
