Are You Doing Retrieval-Augmented Generation (RAG) for Biomedicine? Meet MedCPT: A Contrastive Pre-trained Transformer Model for Zero-Shot Biomedical Information Retrieval

MedCPT is contrastively pre-trained on 255 million query-article pairs from real, anonymized PubMed click logs. It achieves state-of-the-art (SOTA) results on various biomedical IR tasks, outperforming baselines such as OpenAI's embedding models.

Information Retrieval (IR) models have the ability to sort and rank documents on the basis of user queries, facilitating efficient and effective information access. One of the most exciting applications of IR is in the field of biomedicine, where it can be used to search relevant scientific literature and help medical professionals make evidence-based decisions.

However, most existing IR systems in this field are keyword-based, so they may miss relevant articles that do not share the exact keywords of the query. Dense retrievers, meanwhile, are typically trained on general-domain data and may not perform well on domain-specific tasks. Domain-specific datasets are also scarce, which restricts the development of generalizable models.

To address these issues, the authors of this paper have introduced MedCPT, an IR model trained on 255M query-article pairs from anonymized PubMed search logs. Traditional IR models have a discrepancy between their retriever and re-ranker modules, which affects their performance. According to the authors, MedCPT is the first IR model that integrates these two components using contrastive learning. This ensures that the re-ranking process aligns more closely with the characteristics of the retrieved articles, making the entire system more effective.
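To make the idea of contrastive learning on query-article pairs concrete, here is a minimal sketch of one common form of the objective, a softmax loss with in-batch negatives. It is an illustration only: the article does not spell out MedCPT's exact loss, and the function names and toy vectors below are assumptions, not the authors' code.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def in_batch_contrastive_loss(query_vecs, article_vecs):
    """Mean cross-entropy over a batch of (query, clicked article) pairs.

    Article i is treated as the positive for query i; every other
    article in the batch serves as a negative for that query.
    """
    assert len(query_vecs) == len(article_vecs)
    total = 0.0
    for i, q in enumerate(query_vecs):
        scores = [dot(q, d) for d in article_vecs]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(scores)
        log_denom = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_denom - scores[i]
    return total / len(query_vecs)

# Hypothetical 2-d embeddings: queries matched with their own articles...
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_contrastive_loss(queries, [[5.0, 0.0], [0.0, 5.0]])
# ...versus the same articles paired with the wrong queries.
shuffled = in_batch_contrastive_loss(queries, [[0.0, 5.0], [5.0, 0.0]])
print(aligned, shuffled)
```

The loss is small when each query scores its own article above the rest of the batch and large otherwise, which is the signal that pulls matching query and article embeddings together during training.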

As mentioned above, MedCPT consists of a first-stage retriever and a second-stage re-ranker. The retriever is a bi-encoder, which scales well because the documents can be encoded offline and only the user query needs to be encoded at inference time. It then uses a nearest neighbor search to identify the documents that are most similar to the encoded query. The re-ranker, which is a cross-encoder, further refines the ranking of the top articles returned by the retriever and generates the final article ranking.
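The retrieve-then-re-rank flow described above can be sketched in a few lines. This is a toy illustration, not MedCPT itself: `encode` is a hypothetical bag-of-words stand-in for a dense encoder, `cross_score` is a hypothetical stand-in for a cross-encoder that reads the query and document together, and the three articles are made up.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def encode(text):
    """Toy stand-in for a dense encoder: L2-normalized bag-of-words vector."""
    counts = Counter(tokenize(text))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()}

def similarity(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())

def retrieve(query, index, k):
    """First stage: encode the query once, then brute-force nearest neighbor search."""
    q = encode(query)
    ranked = sorted(index, key=lambda item: similarity(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def bigrams(tokens):
    return set(zip(tokens, tokens[1:]))

def cross_score(query, doc):
    """Toy stand-in for a cross-encoder: looks at query and document jointly
    by counting the word pairs (bigrams) they share."""
    return len(bigrams(tokenize(query)) & bigrams(tokenize(doc)))

def rerank(query, docs):
    """Second stage: re-score only the retrieved candidates."""
    return sorted(docs, key=lambda d: cross_score(query, d), reverse=True)

articles = [
    "aspirin allergy attack heart rate",
    "low dose aspirin after a heart attack in adults",
    "knee surgery recovery exercises",
]
# Document vectors are computed once, offline.
index = [(doc, encode(doc)) for doc in articles]

query = "aspirin heart attack"
candidates = retrieve(query, index, k=2)
final = rerank(query, candidates)
print(candidates)
print(final)
```

The design point is that the expensive pairwise scorer only ever sees the `k` candidates from the cheap first stage, never the whole collection. In this toy run the first stage favors the short article that merely contains the query words, and the second stage promotes the article that contains the phrase "heart attack".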

Although the re-ranker is computationally expensive, the overall MedCPT architecture remains efficient, since only one query encoding and a nearest neighbor search are required before the re-ranking step. MedCPT was evaluated on a wide range of zero-shot biomedical IR tasks. The following are the results:

  • MedCPT achieved state-of-the-art document retrieval performance on three out of five biomedical tasks in the BEIR benchmark. It outperformed much larger models such as Google’s GTR-XXL (4.8B) and OpenAI’s cpt-text-XL (175B).
  • The MedCPT article encoder outperformed other models such as SPECTER and SciNCL on the RELISH article similarity task. It also achieved SOTA performance on the MeSH prediction task in SciDocs.
  • The MedCPT query encoder encoded biomedical and clinical sentences effectively.

In conclusion, MedCPT is the first information retrieval model that integrates a pair of retriever and re-ranker modules. This architecture balances efficiency and performance, and MedCPT achieves SOTA performance on numerous biomedical tasks while outperforming many larger models. The model could be used for a range of biomedical applications, such as recommending related articles, retrieving similar sentences, and searching relevant documents, making it a valuable asset for both biomedical knowledge discovery and clinical decision support.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.