Google DeepMind Research Introduces AMIE (Articulate Medical Intelligence Explorer): A Large Language Model (LLM) Based Research AI System for Diagnostic Medical Reasoning and Conversations

Dialogue between doctor and patient is critical to effective and compassionate care. The medical interview has been described as “the most powerful, sensitive, and versatile instrument available to the physician,” and in some settings clinical history-taking is thought to account for 60-80% of diagnoses.

Advances in general-purpose large language models (LLMs) have shown that AI systems can reason, plan, and draw on relevant context well enough to hold natural conversations. This progress brings fully interactive conversational AI within reach and opens up new possibilities for AI in healthcare. AI systems used in medical care would need to understand clinical language, gather information intelligently even under uncertainty, and hold conversations with patients and their caregivers that are both natural and diagnostically useful.

Although LLMs can encode clinical knowledge and accurately answer single-turn medical questions, their conversational abilities have been tuned for domains other than healthcare. Previous research on health-related LLMs has not compared AI systems’ abilities with those of experienced doctors, nor thoroughly analyzed their capacity to take a patient’s medical history and engage in diagnostic dialogue.

Researchers at Google Research and Google DeepMind have developed an artificial intelligence system called AMIE (Articulate Medical Intelligence Explorer), designed to take a patient’s medical history and discuss possible diagnoses in conversation. AMIE was built on several real-world datasets: multiple-choice medical question-answering, expert-vetted long-form medical reasoning, summaries of notes from electronic health records (EHRs), and large-scale transcribed medical conversations. Its training task mixture combined medical question-answering, reasoning, summarization, and dialogue generation tasks.

However, two major obstacles make passively collecting and transcribing real-world dialogues from in-person clinical visits impractical as the sole way to train LLMs for medical conversations: (1) real-world conversation data is incomplete and hard to scale, because it does not cover all medical conditions and scenarios; (2) it is often noisy, containing slang, jargon, sarcasm, interruptions, grammatical errors, and implicit references. Relying on such data alone could constrain AMIE’s knowledge, capabilities, and applicability.

To overcome these limitations, the team devised a self-play-based simulated learning environment for diagnostic medical dialogues in a virtual care setting. This allowed them to scale AMIE’s knowledge and capabilities across many medical conditions and settings. In addition to the static corpus of medical QA, reasoning, summarization, and real-world dialogue data, the researchers used this environment to iteratively refine AMIE with a continually updated set of simulated dialogues.
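The general idea of growing a dialogue corpus through self-play can be illustrated with a minimal sketch. This is not AMIE's actual pipeline: the `ReplyFn` callable, the `simulate_dialogue` and `self_play_round` helpers, and the canned `toy_reply` function are all hypothetical stand-ins, and the fine-tuning step that would normally happen between rounds is omitted.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker role, utterance)

# Hypothetical stand-in for a model call:
# (role, condition, history so far) -> next utterance.
ReplyFn = Callable[[str, str, List[Turn]], str]


def simulate_dialogue(reply: ReplyFn, condition: str, max_turns: int = 4) -> List[Turn]:
    """Let one reply function play both roles, alternating doctor and patient."""
    history: List[Turn] = []
    for i in range(max_turns):
        role = "doctor" if i % 2 == 0 else "patient"
        history.append((role, reply(role, condition, history)))
    return history


def self_play_round(reply: ReplyFn, conditions: List[str],
                    corpus: List[List[Turn]]) -> List[List[Turn]]:
    """Return a new corpus: the existing dialogues plus one simulated dialogue per condition."""
    simulated = [simulate_dialogue(reply, c) for c in conditions]
    return corpus + simulated


def toy_reply(role: str, condition: str, history: List[Turn]) -> str:
    """Canned responses so the sketch runs without any model (assumption for illustration)."""
    if role == "doctor":
        return f"Question {len(history) // 2 + 1}: can you tell me more about your symptoms?"
    return f"I have been having symptoms consistent with {condition}."


static_corpus: List[List[Turn]] = []  # stands in for the static real-world data
corpus = self_play_round(toy_reply, ["migraine", "asthma"], static_corpus)
```

In a real system the reply function would be the current model, and each round's enlarged corpus would feed the next round of fine-tuning, so the simulated dialogues change as the model improves.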

To evaluate diagnostic conversational medical AI, they created a pilot evaluation rubric with both clinician- and patient-centered criteria covering history-taking, diagnostic reasoning, communication skills, and empathy.

The team designed and ran a blinded, remote Objective Structured Clinical Examination (OSCE) study with 149 case scenarios from clinical providers in India, the UK, and Canada. This allowed them to compare AMIE with primary care physicians (PCPs) in a balanced and randomized way during consultations with validated patient actors. AMIE showed higher diagnostic accuracy than the PCPs across several metrics, including top-1 and top-3 accuracy of the differential diagnosis list. AMIE was rated better than the PCPs on 28 of 32 assessment axes from the specialist physician perspective and on 24 of 26 axes from the patient actor perspective.
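Top-k accuracy for a differential diagnosis (DDx) list is the fraction of cases in which the ground-truth diagnosis appears among the first k entries. The sketch below illustrates the metric on made-up cases. It assumes case-insensitive exact string matching, which is a simplification for illustration: it is not the study's matching procedure, and the function name and data are hypothetical.

```python
from typing import Sequence


def top_k_accuracy(ddx_lists: Sequence[Sequence[str]],
                   truths: Sequence[str], k: int) -> float:
    """Fraction of cases whose ground-truth diagnosis is in the first k DDx entries."""
    if len(ddx_lists) != len(truths):
        raise ValueError("ddx_lists and truths must have the same length")
    if not truths:
        raise ValueError("at least one case is required")
    hits = sum(
        truth.lower() in [d.lower() for d in ddx[:k]]  # simplified exact match
        for ddx, truth in zip(ddx_lists, truths)
    )
    return hits / len(truths)


# Made-up example cases (not data from the study).
ddx_lists = [
    ["migraine", "tension headache", "cluster headache"],
    ["gastritis", "peptic ulcer", "GERD"],
    ["asthma", "COPD", "bronchitis"],
    ["cellulitis", "deep vein thrombosis", "gout"],
]
truths = ["migraine", "GERD", "pneumonia", "deep vein thrombosis"]

top1 = top_k_accuracy(ddx_lists, truths, 1)  # only the first case matches at rank 1
top3 = top_k_accuracy(ddx_lists, truths, 3)  # three of four cases match within rank 3
```

Because top-3 accepts a match anywhere in the first three entries, it can never be lower than top-1 on the same cases.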

In their paper, the team highlights important limitations and outlines next steps for translating AMIE into real-world clinical use. One key limitation is that the study used a text-chat interface, which PCPs were not accustomed to using for remote consultations, although it allows potentially large-scale interaction between patients and LLMs specialized for diagnostic dialogue.


Check out the Paper and Blog. All credit for this research goes to the researchers of this project.


Dhanshree Shenwai is a computer science engineer with experience in FinTech companies across the financial, cards and payments, and banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone’s life easier.
