Fine-Tuning LLaMA on Medical Papers: Meet PMC-LLaMA, a Model that Achieves High Performance on Biomedical QA Benchmarks

The development of large language models (LLMs) such as OpenAI’s ChatGPT and GPT-4 has reshaped artificial intelligence across many fields, including natural language processing, computer vision, and biomedicine. Unfortunately, the specifics of ChatGPT’s training procedure and the architectures of its variants remain undisclosed. LLaMA, by contrast, is an open-source foundational language model; its poor performance on applications requiring extensive domain knowledge is hypothesized to stem from a lack of domain-specific data during pre-training.

Many studies have explored adapting open-source LLMs for specialized purposes. For instance, Alpaca and Vicuna focus on expanding a model’s interactive capability by training it on automatically generated instruction-following examples.

A recent work by Shanghai Jiao Tong University and Shanghai AI Laboratory takes a different tack, infusing domain knowledge into a pre-trained LLaMA to steer the foundational model toward a medical-specific corpus. The researchers introduce PMC-LLaMA, a publicly available language model developed by fine-tuning LLaMA-7B on 4.8 million medical academic papers. The team believes that medical discussion and consulting applications would benefit more from a foundational language model with a medical focus.

The team began with the S2ORC dataset, which contains 81.1M English academic papers, and filtered them by their PubMed Central (PMC) IDs. This yields approximately 4.9M papers, totaling over 75B tokens, that are highly related to medical knowledge. They fine-tune LLaMA-7B on these freely available PMC papers by optimizing an autoregressive generation objective, first presented in GPT-2. To speed up training, they employ the bf16 (Brain Floating Point) data format and the Fully Sharded Data Parallel (FSDP) acceleration approach.
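The autoregressive objective mentioned above is the standard causal language-modeling loss: each token is predicted from the tokens before it, and training minimizes the summed negative log-likelihood of the true next tokens. The sketch below illustrates that objective in plain Python; the toy uniform "model" is purely for illustration and stands in for what would be a softmax over LLaMA's vocabulary.

```python
import math

def causal_lm_loss(tokens, next_token_prob):
    """Summed negative log-likelihood of each token given its prefix.

    `next_token_prob(prefix, token)` stands in for the model's softmax
    output; a real LLaMA computes it with causal self-attention.
    """
    loss = 0.0
    for t in range(1, len(tokens)):
        prefix, target = tokens[:t], tokens[t]
        loss += -math.log(next_token_prob(prefix, target))
    return loss

# Toy "model": uniform distribution over a 4-token vocabulary.
VOCAB_SIZE = 4
uniform = lambda prefix, token: 1.0 / VOCAB_SIZE

tokens = [2, 0, 3, 1]
loss = causal_lm_loss(tokens, uniform)  # 3 predictions, each -log(1/4)
```

In actual pre-training this loss is averaged over long batched sequences; the bf16 format and FSDP sharding mentioned above only change how the computation is executed, not the objective itself.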

The team tests PMC-LLaMA by performing three types of fine-tuning on downstream medical QA datasets: full fine-tuning, parameter-efficient fine-tuning, and data-efficient fine-tuning. The experiments show that, after instruction tuning, PMC-LLaMA outperforms LLaMA and other LLaMA-based instruction-tuned models in the medical domain.
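To give a sense of why parameter-efficient fine-tuning matters, the sketch below compares trainable-parameter counts for a LoRA-style low-rank adapter versus full fine-tuning of a single projection matrix. LoRA is one common parameter-efficient technique; the article does not state which method PMC-LLaMA's experiments used, so treat this as an illustration of the general idea, with the 4096 hidden size taken from LLaMA-7B.

```python
def lora_param_counts(d_in, d_out, rank):
    """Compare trainable parameters for one weight matrix.

    Full fine-tuning updates the entire d_in x d_out weight W.
    A LoRA-style adapter freezes W and trains only two small
    matrices A (d_in x rank) and B (rank x d_out), applying the
    layer as W + A @ B.
    """
    full = d_in * d_out
    lora = d_in * rank + rank * d_out
    return full, lora

# One 4096x4096 projection (LLaMA-7B hidden size) with rank 8:
full, lora = lora_param_counts(4096, 4096, 8)
# lora is a small fraction of full, which is why parameter-efficient
# fine-tuning fits on far less GPU memory than full fine-tuning.
```

Because only the low-rank factors receive gradients, optimizer state and gradient memory shrink by the same factor, which is what makes this regime practical on modest hardware.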

A shortcoming of PMC-LLaMA is that the model has not yet seen every token in the 4.8 million papers, as it has only been trained for five epochs so far. In the future, the team plans to gradually train PMC-LLaMA models with more parameters, continue training PMC-LLaMA, and update the base model on its Hugging Face page.

Check out the Research Paper and Code.
