McGill University, Facebook & Mila Introduce MeDAL: An NLP Pre-training Dataset for Medical Abbreviation Disambiguation With 14M Articles

At the EMNLP 2020 (Empirical Methods in Natural Language Processing) conference, a Montreal-based research team introduced a large medical text dataset designed to improve medical abbreviation disambiguation.

Correct terminology, and deep learning models that can interpret it, play a significant role in medicine and healthcare. However, publicly available pre-training data has been scarce in this field due to privacy restrictions, and medical text is rife with non-standard abbreviations. The patient-safety organization Institute for Safe Medication Practices (ISMP) has listed more than 55,000 medical abbreviations that may not be interpreted correctly.

The researchers from McGill University, Facebook CIFAR AI Chair, and Mila – Quebec Artificial Intelligence Institute recently introduced MeDAL: Medical Dataset for Abbreviation Disambiguation for Natural Language Understanding, which helps resolve contradictory, ambiguous, and potentially dangerous abbreviations in the medical and healthcare field. An example of what it does is shown below.


As shown in the figure, the model takes an input text, searches for the abbreviation, finds its possible expansions, and then outputs the most suitable one.
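The idea can be sketched in a few lines of Python. Note that the expansion dictionary and the word-overlap scoring below are hypothetical illustrations of the disambiguation task, not the actual MeDAL models, which learn the mapping from context with neural networks.

```python
# Minimal sketch of context-based abbreviation disambiguation.
# The dictionary and scoring rule are hypothetical, for illustration only.

AMBIGUOUS = {
    "RA": ["rheumatoid arthritis", "right atrium", "room air"],
}

def disambiguate(sentence: str, abbreviation: str) -> str:
    """Pick the expansion that shares the most words with the context."""
    context = set(sentence.lower().split())

    def overlap(expansion: str) -> int:
        return len(context & set(expansion.lower().split()))

    return max(AMBIGUOUS[abbreviation], key=overlap)

print(disambiguate("the patient's joints showed arthritis consistent with RA", "RA"))
# -> rheumatoid arthritis
```

A learned model replaces the crude overlap score with contextual representations, but the input/output contract is the same: a sentence plus an abbreviation in, the most suitable expansion out.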

MeDAL is created from PubMed abstracts released in the 2019 annual baseline. It is a large dataset of medical texts built for the medical abbreviation disambiguation task and intended for pre-training natural language understanding models. The dataset comprises approximately 14M articles with, on average, three abbreviations per article. Pre-training on MeDAL is observed to improve both model performance and convergence speed when fine-tuning on downstream medical tasks.

Existing medical abbreviation disambiguation methods focus only on improving performance on the disambiguation task itself. The proposed approach instead uses abbreviation disambiguation as a pre-training task for transfer learning to other clinical tasks, so disambiguation serves as a stepping stone rather than an end in itself. Unlike existing medical abbreviation disambiguation datasets, the team's dataset is large enough for effective pre-training.
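The transfer-learning setup can be sketched as follows: an encoder is first trained with a disambiguation head on MeDAL, then the same encoder is reused with a new head for a clinical task. The layer sizes, heads, and training loop below are illustrative assumptions, not the configuration from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the transfer-learning setup: encoder weights
# learned on abbreviation disambiguation are reused for a clinical task.
VOCAB, EMB, HID, NUM_EXPANSIONS = 1000, 32, 64, 50

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.lstm = nn.LSTM(EMB, HID, batch_first=True)

    def forward(self, x):
        _, (h, _) = self.lstm(self.emb(x))
        return h[-1]  # final hidden state as the text representation

encoder = Encoder()
disambiguation_head = nn.Linear(HID, NUM_EXPANSIONS)  # pre-training task
# ... pre-train encoder + disambiguation_head on MeDAL ...

mortality_head = nn.Linear(HID, 2)         # downstream clinical task
tokens = torch.randint(0, VOCAB, (4, 16))  # fake batch of token ids
logits = mortality_head(encoder(tokens))   # reuse the pre-trained encoder
print(logits.shape)                        # torch.Size([4, 2])
```

Only the small task head is replaced between stages; the encoder, which holds most of the parameters, carries over what it learned from the 14M abstracts.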

Evaluation tasks

The models’ performance was evaluated on mortality prediction and diagnosis prediction using LSTM, LSTM + self-attention, and transformer models.

  • In the mortality prediction task, all three models performed better after pre-training on MeDAL. 
  • In the diagnosis prediction task, the performance of both the LSTM and the LSTM + self-attention models increased by more than 70 percent.

The detailed results are available in the research paper. They suggest that pre-training on the MeDAL dataset can improve models’ language understanding capabilities in the medical and healthcare domain.
