Microsoft Researchers Introduce ‘FS-Mol’, a Few-Shot Learning Molecular Dataset, to Bring Deep Learning to Early-Stage Drug Discovery

The discovery, design, and testing phases of the drug development process are iterative. Historically, drugs were sourced from plants and found through trial and error. Today, drug research takes place in a lab, with each iteration of custom-designed chemicals yielding a more promising candidate. While much safer and more effective, this process takes a long time and costs a lot of money.

It can take over ten years to bring a single medicine from concept to market, at a cost of between $1 billion and $2 billion. Much of that effort goes into repeated cycles of developing and synthesizing new candidate molecules, testing them, and determining which molecular features to improve before starting the process again. The laboratory steps of synthesis and in vitro testing of molecular behavior are inherently slow.

Computational modeling is one way to speed up the drug-development process. Compounds can be prioritized in silico even if they aren’t physically available, so that only those most likely to succeed are synthesized and measured. To enable such a speedup, a machine learning model must be able to predict chemical properties correctly, in particular whether a proposed drug molecule will be active, that is, able to alter the protein target associated with the disease.

ML is known to be particularly good at spotting patterns in images and text when millions of data points are available. In the early phases of the drug-discovery process, however, only a few dozen molecules are likely to have been measured in a laboratory. Because data generation is expensive and can be restricted for ethical reasons, small datasets are the norm in drug research.

While a single drug research project offers little data from which an ML model can extract patterns, data from tens of thousands of previous projects is available in public and proprietary databases. By having the ML model learn from the combination of these many related datasets, this data can be put to use for molecular property prediction.

FS-Mol: A Few-Shot Learning Dataset of Molecules was developed by the Machine Intelligence team at Microsoft Research Cambridge in partnership with Novartis to address the problem of molecule-protein interaction prediction given a small amount of data. The goal is to help the ML and computational chemistry communities work together to solve this complex problem.

The researchers created a few-shot learning dataset for protein-ligand binding prediction, along with a principled strategy for using such data in few-shot learning. Because no comparable benchmark existed, they also built an open-source evaluation framework that lets ML researchers evaluate their work and helps drug development professionals determine which computational modeling approaches are most promising for their specific goals.

Few-shot learning is well established in the computer vision and reinforcement learning communities. It consists of training an ML model on data from a set of related tasks before adapting it to a new task of interest using only a few relevant data points. The model is thereby primed to pick up new information, similar to how a human brain learns to recognize an object it has seen only once, so access to millions of data points for each new task is not required.

A few-shot learner is first pretrained on an assortment of available datasets. The hope is that, with a diverse set of training tasks, at least some will be similar to the eventual testing task of interest. In drug discovery, one such task is predicting whether molecules bind to a specific protein. After pre-training, the few-shot learner is fine-tuned on a small amount of labeled training data, the support set, which consists of a handful of measurements of synthesized molecules against the protein target. The resulting model is then assessed on its ability to make predictions for held-out test data points.
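
The split between a small labeled set used for adaptation and held-out points used for assessment can be sketched as follows. This is a minimal illustration, not the FS-Mol evaluation code: the function name `split_support_query`, the toy labels, and the support sizes are all assumptions made for the example.

```python
import numpy as np

def split_support_query(num_points, support_size, rng):
    """Randomly split one task's data points into a support and a query set.

    Hypothetical helper: the support indices would be used to adapt the
    model, the query indices to assess its predictions.
    """
    if not 0 < support_size < num_points:
        raise ValueError("support_size must leave at least one query point")
    shuffled = rng.permutation(num_points)
    return shuffled[:support_size], shuffled[support_size:]

# Toy task with 12 measured molecules (labels are made up for illustration).
task_labels = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1])
rng = np.random.default_rng(0)

for support_size in (2, 4, 8):
    support_idx, query_idx = split_support_query(len(task_labels), support_size, rng)
    # A real pipeline would adapt a model on the support points here
    # and score its predictions on the query points.
    print(support_size, len(support_idx), len(query_idx))
```

Repeating the split with several support sizes shows how performance changes as more measurements become available for the new task.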

There are several ways to pre-train a few-shot learning model. Because the best approach to predicting molecule-protein interaction from a small quantity of data is unknown, it is important to compare the options. The Microsoft Research and Novartis teams compared several methodologies, including meta-learning, pre-training approaches, and multitask training, to determine which is most useful.

Meta-learning techniques are trained with the explicit goal of producing a model that adapts well from only a few examples. Model-agnostic meta-learning, for example, optimizes an objective that measures how well a model performs after it has been specialized to a new task. Another meta-learning method is prototypical networks, which predict the label of a new example by assessing which examples in the support set it is most similar to.
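
The prediction step of a prototypical network can be sketched as below. In the standard formulation, each class is summarized by the mean embedding of its support examples (its prototype), and a query is assigned to the class with the nearest prototype. The sketch assumes the embeddings have already been computed by some encoder, uses squared Euclidean distance as an assumption (other distances are possible), and the function name `prototypical_predict` is hypothetical.

```python
import numpy as np

def prototypical_predict(support_emb, support_labels, query_emb):
    """Classify query embeddings by their nearest class prototype.

    Assumes embeddings come from an already-trained encoder (not shown).
    Returns predicted labels and per-class probabilities.
    """
    support_emb = np.asarray(support_emb, dtype=float)
    support_labels = np.asarray(support_labels)
    query_emb = np.asarray(query_emb, dtype=float)

    classes = np.unique(support_labels)
    # One prototype per class: the mean of that class's support embeddings.
    prototypes = np.stack(
        [support_emb[support_labels == c].mean(axis=0) for c in classes]
    )
    # Squared Euclidean distance from every query to every prototype.
    diffs = query_emb[:, None, :] - prototypes[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=-1)
    # Softmax over negative distances gives class probabilities.
    logits = -sq_dists
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return classes[probs.argmax(axis=1)], probs

# Toy 2-D embeddings: two inactive (0) and two active (1) support molecules.
support = [[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [5.0, 4.0]]
labels = [0, 0, 1, 1]
queries = [[0.2, 0.4], [4.5, 3.9]]
predicted, probabilities = prototypical_predict(support, labels, queries)
print(predicted)
```

During meta-training, the encoder would be updated so that this distance-based prediction works well across many tasks; only the inference step is shown here.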

Pre-training techniques try to prepare an ML model for specialization by teaching it to identify the most significant properties of its inputs. Multitask training is one such strategy: a model is trained to predict labels for molecules drawn from numerous tasks simultaneously. In self-supervised pre-training, models are trained to recover information that has been removed from or altered in the input.
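
One common way to set up multitask training is a shared model with one output per task and a loss that skips the molecule-task pairs that were never measured. The sketch below shows only that masked loss, under stated assumptions: binary active/inactive labels, one logit per task, and a hypothetical function name `masked_multitask_bce`. It is not taken from the FS-Mol code.

```python
import numpy as np

def masked_multitask_bce(logits, labels, mask):
    """Mean binary cross-entropy over the measured (molecule, task) pairs.

    logits, labels, mask: arrays of shape (num_molecules, num_tasks).
    mask is 1 where a label was measured and 0 where it is missing.
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)
    mask = np.asarray(mask, dtype=float)
    # Numerically stable cross-entropy computed directly from logits.
    per_entry = (
        np.maximum(logits, 0.0)
        - logits * labels
        + np.log1p(np.exp(-np.abs(logits)))
    )
    # Unmeasured entries contribute nothing to the loss.
    return float((per_entry * mask).sum() / max(mask.sum(), 1.0))

# Three molecules, two tasks; each molecule was measured on only some tasks.
example_logits = np.array([[2.0, -1.0], [0.5, 0.0], [-3.0, 1.5]])
example_labels = np.array([[1, 0], [0, 0], [0, 1]])
example_mask = np.array([[1, 0], [1, 1], [0, 1]])
print(masked_multitask_bce(example_logits, example_labels, example_mask))
```

The mask matters because most molecules in a collection of drug-discovery tasks are measured against only a few protein targets.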

Such approaches can be compared fairly only if all few-shot learners are given the same testing problem and have access to the same information during pre-training. However, there was no well-defined set of tasks or clear testing strategy prior to this effort. The researchers therefore created a dataset and testing procedure that mirror the real-world challenges of early-stage drug development.

The data was collected from ChEMBL, a publicly accessible database, after which it was thoroughly cleaned and filtered, and activity labels were carefully assigned based on measured values. A good pre-training setup must be accompanied by thorough testing, and great care was taken to ensure that the tasks used for pre-training did not reappear among the testing tasks. The team concentrated on tasks describing drug compounds interacting with certain classes of enzymes so that overall results could be broken down by enzyme class.
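
Turning measured values into binary activity labels usually comes down to a threshold. The sketch below is purely illustrative: the function name `assign_activity_labels`, the toy measurements, the convention that higher values mean more active, and the use of the task median as the threshold are all assumptions, not the rule described in the article.

```python
import numpy as np

def assign_activity_labels(measurements, threshold):
    """Label a molecule active (1) if its measurement meets the threshold.

    Assumes higher measured values indicate stronger activity.
    """
    measurements = np.asarray(measurements, dtype=float)
    return (measurements >= threshold).astype(int)

# Made-up measurements for five molecules in one task.
values = np.array([4.2, 5.1, 6.3, 7.0, 5.8])
threshold = float(np.median(values))  # hypothetical per-task threshold
activity = assign_activity_labels(values, threshold)
print(threshold, activity)
```

A per-task threshold keeps the two classes reasonably balanced, which matters when a task contains only a few dozen molecules.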

The researchers also took a number of models that are standard in the pharmaceutical industry and trained them on the testing task’s support set data, treating them the same as the few-shot learning models during testing. They tested both the pretrained few-shot approaches and these non-pretrained baselines across a variety of tasks while providing them with varying amounts of support set data.

Models can do well even without pre-training if they have access to enough data at test time, but only pretrained models can make effective predictions when given just a few data points. The results show an improvement over a completely uninformed classifier that assigns labels to new query molecules at random. While self-supervised pre-training and multitask techniques did not outperform models without pre-training, meta-learning approaches did.
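
One way to express improvement over a random classifier is to subtract the random baseline from a ranking metric. The sketch below assumes average precision as the metric, whose expected value for a random ranking is approximately the fraction of active molecules in the query set. The metric choice and both function names are assumptions made for this example, and tied scores are not given special treatment.

```python
import numpy as np

def average_precision(labels, scores):
    """Average precision of a ranking: mean precision at each active molecule."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")  # highest score first
    ranked = labels[order]
    hits = np.cumsum(ranked)
    precision_at_k = hits / np.arange(1, len(ranked) + 1)
    return float((precision_at_k * ranked).sum() / ranked.sum())

def improvement_over_random(labels, scores):
    """Average precision minus the fraction of positives in the query set."""
    labels = np.asarray(labels)
    return average_precision(labels, scores) - float(labels.mean())

# Toy query set: two active and two inactive molecules, perfectly ranked.
query_labels = [1, 1, 0, 0]
query_scores = [0.9, 0.8, 0.3, 0.1]
delta = improvement_over_random(query_labels, query_scores)
print(delta)
```

A value near zero means the model ranks molecules no better than chance, which makes scores comparable across tasks with different proportions of active molecules.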

The researchers found that prototypical networks are particularly helpful in the early stages of drug development, when only a limited quantity of data is available. This method had not previously been applied in this setting, and it opens up several interesting directions that are more specific to molecular property prediction.


The research shows not only that early-stage drug development is well-posed as a few-shot learning problem, but also that pre-training and, in particular, meta-learning approaches can significantly improve the quality of molecular property predictions. By sharing the dataset and evaluation framework together with these baseline results, the researchers have given the drug-discovery community access to state-of-the-art ML research on a realistic task. This kind of approach can help shorten the time it takes for a medicine to go from concept to market by reducing the need for synthesis and, as a result, in vitro testing of vast numbers of molecules.


