Researchers From UF Health and NVIDIA Build World’s Largest Clinical Language Generator, ‘SynGatorTron’, To Develop Better AI For Rare Disease Research and Clinical Trials

Neural networks that generate synthetic clinical data are in high demand because researchers can use their output to train other AI models in healthcare. Synthetic data lets developers create simulated patient records that are statistically realistic yet free of sensitive protected health information. NVIDIA and UF Health, the University of Florida’s academic health institution, have teamed up to build such a model.

SynGatorTron is a language model developed by the team that can construct synthetic patient profiles based on the health records it has learned from. With 5 billion parameters, it is the largest language generator currently available in the healthcare field.

A common misconception is that synthetic data is directly linked to real people. It is not; it only reproduces characteristics comparable to those of genuine patients. Researchers can therefore build tools, models, and tasks on this synthetic data without raising privacy risks. These can then be applied to real-world data to answer clinical questions, find correlations, and investigate patient outcomes. SynGatorTron-generated data can also supplement small datasets covering patients with rare diseases or from minority groups, which helps reduce model bias.

Because it does not describe an actual patient, an AI-generated doctor’s letter may seem unrealistic at first, and accurate clinical analysis based on it may seem impossible. However, both real and synthetic data are highly valuable for training an AI model. Combining diverse forms of clinical records addresses data sparsity and privacy concerns, and will democratize the ability to build all types of applications that rely on such data. This is the driving force behind the creation of synthetic data. Once the model is ready, researchers outside UF Health will be able to fine-tune SynGatorTron with localized data and use it in their own AI projects.

One of the primary benefits is that the synthetic training datasets closely resemble genuine medical notes but are not linked to specific patients, so they can be shared among research institutes more easily without raising privacy issues. Consider a scenario in which the healthcare community can simulate demographic features without relying on actual patients. Researchers could then investigate whether realistic datasets can be generated to answer questions that would otherwise go unanswered because of data access limitations or a lack of information on the patients of interest.

Researchers working on a deep learning model to examine a rare disease, or the effects of a medication on a specific population, could use SynGatorTron to supplement the limited number of real medical records available with additional training data.