A New Google Study Presents the Personal Health Large Language Model (PH-LLM): A Version of Gemini Fine-Tuned for Text Understanding and Reasoning over Numerical Time-Series Personal Health Data

Large language models (LLMs) are flexible tools for language generation that have demonstrated excellent performance across a wide variety of areas. Their potential in medical education, research, and clinical practice is considerable, pointing to a future in which natural language serves as an interface. Enhanced with healthcare-specific data, LLMs perform well in medical question answering, detailed EHR analysis, differential diagnosis from medical images, standardized assessment of mental functioning, and delivery of psychological interventions. Their success on these tasks shows that they can extract valuable signals from clinical data gathered at medical facilities, which supports the case for their wider use in healthcare.

Wearable technologies can monitor important aspects of human health and well-being that traditional clinical visits miss, such as sleep, physical activity, stress, and cardiometabolic health, as measured through physiological responses and behavior. A major benefit for health monitoring is that these continuous, longitudinal data are acquired passively and offer direct signals of physiology and behavior. Statistics on adverse health outcomes, morbidity, and disability-adjusted life years provide evidence that these factors significantly influence overall health, yet they have not been thoroughly integrated into clinical practice or included in the standard datasets used for medical question answering. Reasons for the low uptake include that such data is often collected without context, is computationally expensive to store and analyze, and is often difficult to interpret. As a result, even medically tuned LLMs or general foundation LLMs may be unable to use this data when reasoning about individualized health behaviors and making recommendations based on them.

A new Google study presents the Personal Health Large Language Model (PH-LLM), a fine-tuned version of Gemini designed to carry out a number of tasks relevant to setting and achieving specific individual health goals. The researchers found that PH-LLM can take passively acquired objective data from wearables and turn it into specific insights, possible reasons for observed behaviors, and suggestions to improve exercise and sleep hygiene. Gemini Ultra 1.0 already exhibits aggregate performance comparable to that of fitness experts. After fine-tuning from that model, PH-LLM showed a marked improvement in its use of domain knowledge and in its personalization of relevant user data for sleep insights.
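To make the idea of handing passively collected wearable data to a text model more concrete, here is a minimal sketch that renders daily sleep summaries as a plain-text prompt. The field names (`sleep_minutes`, `resting_hr`, `awakenings`), the sample values, and the prompt wording are all assumptions for illustration, not the format used in the study.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class DailySleep:
    # Hypothetical fields; the study's actual feature set is not reproduced here.
    date: str
    sleep_minutes: int
    resting_hr: int
    awakenings: int


def render_prompt(days: list[DailySleep]) -> str:
    """Turn numeric daily summaries into plain text that an LLM could read."""
    lines = ["Daily sleep summary:"]
    for d in days:
        hours, minutes = divmod(d.sleep_minutes, 60)
        lines.append(
            f"- {d.date}: slept {hours}h {minutes:02d}m, "
            f"resting HR {d.resting_hr} bpm, {d.awakenings} awakenings"
        )
    avg_minutes = mean(d.sleep_minutes for d in days)
    lines.append(f"Average sleep: {avg_minutes / 60:.1f} hours per night.")
    lines.append("Task: list insights, possible causes, and recommendations.")
    return "\n".join(lines)


# Made-up example values, not data from the study.
days = [
    DailySleep("2024-06-01", 412, 58, 3),
    DailySleep("2024-06-02", 365, 61, 5),
]
prompt = render_prompt(days)
print(prompt)
```

The three requested outputs in the last prompt line mirror the insights, possible reasons, and suggestions described above; a real system would likely include far more context than this sketch does.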

The study also demonstrates that PH-LLM can correctly answer technical multiple-choice questions in the sleep and fitness domains, which is consistent with its strong performance on the long-form case studies.

PH-LLM can employ a multimodal encoder to predict subjective sleep outcomes, and specialist models can use high-resolution time-series health behavior data as input tokens. Key use cases for LLMs in personal health features on wearable devices include open-ended long-form case studies, which are difficult to evaluate automatically. Here, the team used 857 case studies collected from consenting participants to assess sleep quality and fitness readiness for a workout, and paired the case studies with strict evaluation rubrics. Human experts, Gemini Ultra 1.0, and PH-LLM all achieved very high average performance across the case study responses, demonstrating the Gemini model family's strong reasoning and knowledge skills. Because it contextualizes key sleep aspects better for these tasks, PH-LLM can draw on relevant user and domain knowledge and improve its responses on the sleep insight and etiology sections of the case studies.

To support model optimization, the researchers also built tools for automated case study evaluation and showed that these can serve as scalable proxies for human experts when evaluating LLM performance. The top AutoEval models achieved agreement with expert raters that was comparable to inter-rater concordance among the experts themselves, and these models ranked the sources of case study responses in a way that was consistent with human experts. By parallelizing automatic evaluation across model replicas, the team obtained a substantial improvement in rating speed relative to humans.
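The comparison between automated and expert ratings can be illustrated with a small sketch. It uses Cohen's kappa, one common chance-corrected agreement statistic, as a stand-in; the metrics actually reported in the study are not reproduced here, and the ratings below are made-up values on a 1 to 5 scale.

```python
from collections import Counter


def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Chance-corrected agreement between two raters scoring the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and equal in length")
    n = len(a)
    # Fraction of items on which the two raters gave the same score.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each rater's score frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical score throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings for eight case study responses.
expert_1 = [5, 4, 4, 3, 5, 2, 4, 5]
expert_2 = [5, 4, 3, 3, 5, 2, 4, 4]
auto_eval = [5, 4, 4, 3, 4, 2, 4, 5]

human_vs_human = cohens_kappa(expert_1, expert_2)
auto_vs_human = cohens_kappa(expert_1, auto_eval)
print(f"expert vs expert: {human_vs_human:.2f}")
print(f"AutoEval vs expert: {auto_vs_human:.2f}")
```

An automated rater is a plausible proxy when its agreement with an expert falls in the same range as the agreement between two experts, which is the kind of comparison described above.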

To capture the subjective experience of a user, the researchers incorporate longitudinal time-series sensor features. They assessed PH-LLM's ability to predict patient-reported outcomes (PROs) for sleep disturbance and impairment, obtained from validated survey instruments, from passive sensor readouts. The results show that adequate model performance requires native multimodal data integration.
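One general way to integrate sensor data natively, rather than as text, is to project a vector of sensor features into a few embedding-sized vectors ("soft tokens") that are placed in front of the text token embeddings. The sketch below shows only that data flow. The dimensions, the random weights standing in for a trained adapter, and every function name are hypothetical; the study's actual encoder is not reproduced here.

```python
import random


def make_projection(n_features, n_tokens, d_model, seed=0):
    """Random weights standing in for a trained adapter (illustrative only)."""
    rng = random.Random(seed)
    return [
        [rng.gauss(0.0, 0.1) for _ in range(n_features)]
        for _ in range(n_tokens * d_model)
    ]


def encode_sensors(features, weights, n_tokens, d_model):
    """Linearly project sensor features, then split the result into tokens."""
    flat = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return [flat[i * d_model:(i + 1) * d_model] for i in range(n_tokens)]


def build_input(sensor_tokens, text_embeddings):
    """Prefix the soft tokens to the text embeddings as one input sequence."""
    return sensor_tokens + text_embeddings


# Hypothetical standardized features, e.g. sleep duration, resting HR, HRV.
features = [-0.8, 1.2, -0.3]
N_TOKENS, D_MODEL = 4, 8
weights = make_projection(len(features), N_TOKENS, D_MODEL)
sensor_tokens = encode_sensors(features, weights, N_TOKENS, D_MODEL)

# Stand-in for the embeddings of a 5-token text prompt.
text_embeddings = [[0.0] * D_MODEL for _ in range(5)]
sequence = build_input(sensor_tokens, text_embeddings)
print(len(sequence), len(sequence[0]))
```

In a trained system the projection weights would be learned so that the soft tokens help the model predict the survey outcomes; here they are random and carry no meaning.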

This work has several limitations. First, the case study rubric ratings were significantly skewed, which made it hard to distinguish between different models and between expert opinions. Second, although certain sections of the case studies and certain rubric principles did show substantial differentiation, additional training of expert raters to improve inter-rater reliability, or adjudication of existing responses, could strengthen the signal of model performance. Third, there were still instances of confabulation or inaccurate referencing of user data, even though the model improved at referencing and integrating user data into insights. Addressing and preventing these issues is essential for these technologies to be safely and effectively integrated into user-facing features.

Despite these limitations, the study shows that the Gemini models have substantial health knowledge and that fine-tuning Gemini Ultra 1.0 can improve its performance on many personal health outcomes. The findings pave the way for LLMs to help people reach their health goals by providing tailored information and suggestions. To enhance predictive power, the researchers hope future studies will have large datasets containing paired outcome data, making it possible to learn non-linear interactions among features.


Check out the Paper. All credit for this research goes to the researchers of this project.

Dhanshree Shenwai is a Computer Science Engineer with solid experience at FinTech companies covering the financial, cards and payments, and banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone's life easier in today's evolving world.
