Exploring the Ability of Large Language Models (LLMs) to Reason About Medical Questions: Insights from an Artificial Intelligence (AI) Study in Denmark

The field of natural language processing has transformed dramatically in the past few years. The change is apparent even in how text is represented: deep contextualized representations have replaced simple static word vectors. The Transformer architecture, whose computations map naturally onto parallel hardware, is the fundamental driving force behind this shift. Large language models (LLMs), which are essentially pre-trained Transformer language models, greatly expand what systems can accomplish with text. Substantial resources have been devoted to scaling these LLMs, training models with hundreds of billions of parameters on enormous text corpora. Thanks to this progress in artificial intelligence, researchers can now build systems with a deeper grasp of language than ever before.

Although LLMs have achieved remarkable success, their performance in real-world situations that call for sharp reasoning abilities and subject-matter expertise remains largely uncharted territory. To find out more, a team of researchers from the Technical University of Denmark and the University of Copenhagen collaborated with the Copenhagen University Hospital to investigate whether GPT-3.5 (Codex and InstructGPT) can answer and reason about challenging real-world medical questions. The researchers chose two widely used multiple-choice medical exam datasets, USMLE and MedMCQA, and a medical abstract-based dataset named PubMedQA. The team looked into different prompting scenarios, including zero-shot and few-shot prompting (prepending question-answer examples to the question), direct or Chain-of-Thought (CoT) prompting, and retrieval augmentation, which involves inserting excerpts from Wikipedia into the prompt.

In the zero-shot setting, the researchers examined direct prompts and zero-shot CoT. In contrast to the direct prompt, which obtains the answer in a single completion step, the zero-shot CoT framework employs a two-step prompting technique: the first step issues a reasoning prompt with a CoT cue (e.g., "Let's think step by step"), and the second appends the generated reasoning and uses an extractive prompt to pull out the final answer. Few-shot learning was the second prompt-engineering variation the researchers explored. The team tried inserting triplets of questions, explanations, and answers, as well as plain question-answer pairs. The zero-shot prompt's template was reused for each shot, but the generated explanation was swapped out for the one provided with the example.
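The two-step CoT scheme and the few-shot shot formatting described above can be sketched as follows. The exact prompt wording and the `generate` callback are assumptions (any LLM completion call, such as an API client, could be plugged in); this is a minimal illustration, not the paper's actual implementation.

```python
def build_reasoning_prompt(question: str, options: dict) -> str:
    """Step 1: pose the question with a chain-of-thought cue."""
    choices = "\n".join(f"{k}) {v}" for k, v in options.items())
    return f"Question: {question}\n{choices}\nAnswer: Let's think step by step."

def build_extraction_prompt(reasoning_prompt: str, reasoning: str) -> str:
    """Step 2: append the generated reasoning and ask for the final answer."""
    return f"{reasoning_prompt} {reasoning}\nTherefore, the answer is"

def answer_with_cot(question: str, options: dict, generate) -> str:
    """Run both steps with a caller-supplied completion function."""
    p1 = build_reasoning_prompt(question, options)
    reasoning = generate(p1)                    # free-text chain of thought
    p2 = build_extraction_prompt(p1, reasoning)
    return generate(p2).strip()                 # e.g. a letter such as "B"

def format_shot(question: str, options: dict, explanation: str, answer: str) -> str:
    """Few-shot variant: reuse the zero-shot template per shot, but splice in
    the dataset-provided explanation and gold answer instead of a generated one."""
    p = build_reasoning_prompt(question, options)
    return f"{p} {explanation}\nTherefore, the answer is {answer}\n\n"
```

A few-shot prompt is then just several `format_shot` outputs concatenated in front of the test question's reasoning prompt.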

LLMs can memorize specific bits of knowledge tucked away in their training data. However, models often fail to use this information successfully when making predictions. A common remedy is to ground predictions in an external knowledge source. The team adopted this strategy, investigating whether the language model's accuracy improves when it is provided with more context. Wikipedia excerpts served as the knowledge base for this experiment.
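A minimal sketch of this retrieval-augmentation idea: score a pool of (hypothetical) Wikipedia excerpts against the question and prepend the best ones to the prompt. The word-overlap scorer below is a deliberate simplification standing in for a proper retriever such as BM25; the prompt layout is likewise an assumption.

```python
def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Rank passages by crude word overlap with the question (stand-in for a
    real retriever) and return the top k."""
    q_words = set(question.lower().split())
    scored = sorted(passages,
                    key=lambda p: -len(q_words & set(p.lower().split())))
    return scored[:k]

def augment_prompt(question: str, passages: list[str], k: int = 2) -> str:
    """Insert the retrieved excerpts as context lines ahead of the question."""
    context = "\n".join(f"Context: {p}" for p in retrieve(question, passages, k))
    return f"{context}\nQuestion: {question}\nAnswer:"
```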

After multiple experimental evaluations, the researchers found that zero-shot InstructGPT substantially outperformed the fine-tuned BERT baselines. CoT prompting proved to be an effective strategy, producing both better results and more interpretable predictions. On the three datasets, Codex 5-shot CoT with 100 sampled completions performed at a level comparable to human performance. Although InstructGPT and Codex are still prone to errors (mainly knowledge gaps and logical errors), these can be mitigated by sampling many completions and merging them.
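The sample-and-merge step amounts to a majority vote over independently sampled completions (often called self-consistency). A minimal sketch, where `sample` is any caller-supplied stochastic completion function (an assumption, not the paper's code):

```python
from collections import Counter

def majority_vote(sample, prompt: str, n: int = 5) -> str:
    """Sample n completions (temperature > 0 in practice) and keep the
    most frequent final answer."""
    answers = [sample(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

Because independent reasoning chains rarely agree on the same wrong answer, aggregating many completions tends to cancel out one-off logical slips.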

In a nutshell, LLMs can comprehend difficult medical topics well, frequently recalling expert-domain knowledge and carrying out nontrivial reasoning. While this is an important first step, there is still a long way to go: deploying LLMs in clinical settings will require more reliable methods and even higher performance. The researchers have so far identified one type of bias, namely that the order of the answer choices influences the predictions. However, there may be many more such biases, including ones concealed in the training data, that could affect the test results. The team's current work focuses on this area.

Check out the Paper and Github. All credit for this research goes to the researchers on this project.