Microsoft Researchers Introduce Syntheseus: A Machine Learning Benchmarking Python Library for End-to-End Retrosynthetic Planning

Interest in computer-automated molecular design has resurged over the last five years, driven by advances in machine learning, especially generative models. While these methods help find compounds with the right properties more quickly, they often produce molecules that are difficult to synthesize in a wet lab because they do not account for synthesizability. This motivates efficient computer-aided synthesis planning (CASP) algorithms, which check whether an input molecule can be synthesized by performing retrosynthesis, that is, by constructing synthesis routes for it.

In recent years, the intersection of chemistry and machine learning has drawn considerable attention. However, putting state-of-the-art reaction models to practical use poses significant challenges. These models are notoriously difficult to run because they differ in their dependencies and in their assumptions about inputs and outputs. Moreover, their codebases are designed primarily to replicate benchmark results and lack readily callable entry points, which further complicates the process.

Researchers from Microsoft, the University of Cambridge, Jagiellonian University, and Johannes Kepler University examine the widely used metrics for both single-step and multi-step retrosynthesis. It remains unclear how the metrics used to benchmark single-step and multi-step methods in isolation relate to the performance of an end-to-end retrosynthesis pipeline, and previous research has compared models and applied metrics inconsistently. By thoroughly re-evaluating and analyzing previous work, this research aims to define best practices for evaluating retrosynthesis algorithms. The team introduces a Python library, SYNTHESEUS, that makes it easy for researchers to assess their methods consistently.

There are two main constraints on evaluation in retrosynthesis. First, although experimental validation is vital, researchers developing algorithms should not be required to carry out synthesis in the lab, which is costly, time-consuming, and requires significant expertise. Second, because the field is split between single-step and multi-step work, most studies examine only one part of the retrosynthesis pipeline rather than the whole. Real-world adoption, however, hinges on how well the pipeline works from beginning to end.

The team integrated eight free and open-source reaction models behind one consistent interface, with seven of them sharing the same conda environment. With the intricacies of these codebases neatly tucked away, comparing different kinds of models is as simple as a for loop.
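To make the "for loop" idea concrete, here is a minimal sketch of what a shared model interface enables. It does not use the actual Syntheseus API: the `ReactionModel` protocol, the `predict` method, and both toy models are hypothetical stand-ins invented for illustration.

```python
from typing import Protocol


class ReactionModel(Protocol):
    """Hypothetical common interface (an assumption, not the Syntheseus API)."""

    name: str

    def predict(self, product_smiles: str, num_results: int) -> list[str]:
        ...


class ToySplitModel:
    """Toy stand-in: 'predicts' reactants by splitting the product string."""

    name = "toy-split"

    def predict(self, product_smiles: str, num_results: int) -> list[str]:
        candidates = [
            f"{product_smiles[:i]}.{product_smiles[i:]}"
            for i in range(1, len(product_smiles))
        ]
        return candidates[:num_results]


class ToyIdentityModel:
    """Toy stand-in: returns the product itself as the only candidate."""

    name = "toy-identity"

    def predict(self, product_smiles: str, num_results: int) -> list[str]:
        return [product_smiles][:num_results]


def run_all(models: list[ReactionModel], product_smiles: str, num_results: int) -> dict[str, list[str]]:
    # Because every model exposes the same method, comparison is a plain loop.
    results = {}
    for model in models:
        results[model.name] = model.predict(product_smiles, num_results)
    return results


results = run_all([ToySplitModel(), ToyIdentityModel()], "CCO", num_results=2)
for name, predictions in results.items():
    print(name, predictions)
```

The point of the sketch is the design choice rather than the toy predictions: once each codebase is wrapped behind the same call signature, the evaluation code no longer needs to know which model it is running.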

To compare the published figures with those generated in this evaluation, the team used the USPTO-50K dataset, since all the models they investigate report results on it. Due to its modest size, USPTO-50K may not give a true picture of the full data distribution. Consequently, the team assessed the out-of-distribution generalization of the model checkpoints trained on USPTO-50K using the proprietary Pistachio dataset, which contains over 15.6 million raw reactions and 3.4 million samples after preprocessing. For newcomers to SYNTHESEUS, default weights trained on USPTO-50K are downloaded and cached automatically, so there’s no need to search for model weights when getting started. Users can come back later to retrain on a bigger and/or internal dataset.

Chemformer, GLN, Graph2Edits, LocalRetro, MEGAN, MHNreact, and RootAligned are among the well-established single-step models re-evaluated in this work. In the case of RetroKNN, the researchers received the code directly from the developers. For every model they used the specified checkpoint where one with the proper data split was available, and otherwise trained a new model using the original training code.

They calculated the Mean Reciprocal Rank (MRR) and top-k accuracy (k ≤ 50), evaluating every model with n = 100 outputs. All of the models were run with a consistent batch size of 1. Although any of the models could easily handle bigger batches, the batch size during search is normally fixed at one, since search is not usually parallelized and the batch size cannot be freely set. Consequently, the maximum number of model calls that can be executed during a search with a given time budget is directly related to speed at a batch size of 1.
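The two metrics above, and the link between model speed and the number of calls a search can afford, can be sketched in a few lines. This is an illustrative implementation under simple assumptions, not the evaluation code from the paper: predictions are treated as already-canonical strings compared by exact match, and `max_model_calls` ignores any search overhead beyond the model call itself. All inputs below are made-up toy values.

```python
def reciprocal_rank(predictions: list[str], truth: str) -> float:
    """1 / rank of the first correct prediction, or 0 if it never appears."""
    for rank, prediction in enumerate(predictions, start=1):
        if prediction == truth:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(all_predictions: list[list[str]], truths: list[str]) -> float:
    ranks = [reciprocal_rank(p, t) for p, t in zip(all_predictions, truths)]
    return sum(ranks) / len(ranks)


def top_k_accuracy(all_predictions: list[list[str]], truths: list[str], k: int) -> float:
    """Fraction of samples whose ground truth is among the first k predictions."""
    hits = sum(t in p[:k] for p, t in zip(all_predictions, truths))
    return hits / len(truths)


def max_model_calls(time_budget_s: float, seconds_per_call: float) -> int:
    """Upper bound on calls at batch size 1, assuming the model call dominates runtime."""
    return int(time_budget_s // seconds_per_call)


# Toy data: three test reactions, each with a ranked list of predictions.
toy_predictions = [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "D"]]
toy_truths = ["A", "A", "A"]

mrr = mean_reciprocal_rank(toy_predictions, toy_truths)
top1 = top_k_accuracy(toy_predictions, toy_truths, k=1)
top2 = top_k_accuracy(toy_predictions, toy_truths, k=2)
calls = max_model_calls(time_budget_s=600, seconds_per_call=2.5)
print(mrr, top1, top2, calls)
```

In the toy data the reciprocal ranks are 1, 1/2, and 0, so the MRR is 0.5, and a model that takes 2.5 seconds per call can be queried at most 240 times in a 600-second budget.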

Two of the models (RootAligned and Chemformer) use a Transformer decoder to predict the reactants’ SMILES from scratch, while the other models predict a graph rewrite to apply to the product. The former type performs well on top-1 accuracy across datasets, but is outperformed at larger k by the graph-transformation-based models. The findings suggest that transformation-based models cover the data distribution more comprehensively because they are more explicitly grounded in the set of transformations that occur in the training data.

Furthermore, many of the USPTO-50K values presented here exceed the figures reported in the literature, especially top-k accuracy for k > 1, which is affected by deduplication. This also bears on some of the model rankings, for instance the previously claimed result that GLN has worse top-1 accuracy than LocalRetro.

The model ranking on Pistachio is surprisingly similar to that on USPTO-50K, even though all results are significantly worse. For example, none of the models exceeds 55% top-50 accuracy on Pistachio, whereas on USPTO-50K they reach nearly 100%. For template-based models this is due to inadequate template coverage, but some of the template-free models evaluated here also fail to generalize better than their template-based counterparts.

Overall, RetroKNN ranks first or near-first on all metrics across both datasets and is among the fastest models in the re-evaluation. However, current single-step metrics give a helpful but incomplete picture of how well single-step models perform, so the researchers caution the reader against treating this as a definitive recommendation.
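Why deduplication changes top-k accuracy for k > 1 is easy to see in a small sketch. The example below is illustrative only: it assumes predictions are strings that have already been canonicalized so that duplicates compare equal, and the prediction list is made up.

```python
def deduplicate(predictions: list[str]) -> list[str]:
    """Drop repeated predictions while keeping the original ranking order."""
    seen = set()
    unique = []
    for prediction in predictions:
        if prediction not in seen:
            seen.add(prediction)
            unique.append(prediction)
    return unique


def hit_at_k(predictions: list[str], truth: str, k: int) -> bool:
    return truth in predictions[:k]


# Toy ranked output in which the model repeats its top candidate.
raw_predictions = ["X", "X", "Y", "X", "Z"]
truth = "Z"

hit_raw = hit_at_k(raw_predictions, truth, k=3)
hit_dedup = hit_at_k(deduplicate(raw_predictions), truth, k=3)
print(hit_raw, hit_dedup)
```

Without deduplication the repeated candidate occupies two of the top three slots and the correct answer is missed; after deduplication it moves up to rank 3 and counts as a top-3 hit. Top-1 accuracy is unaffected, since removing later duplicates never changes the first prediction.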

The researchers also conducted search experiments combining several single-step models and search algorithms. Because their main focus is correcting existing results, outlining best practices, and showcasing SYNTHESEUS, they present only preliminary multi-step results. Still, the framework developed in this research should pave the way for determining the best end-to-end pipeline.

The multi-step results track when the first solution is discovered and the maximum number of non-overlapping routes that can be recovered from the search graph. With the exception of Chemformer, GLN, and MHNreact, the vast majority of models, paired with any of the search algorithms, find multiple independent routes for the bulk of targets. RootAligned achieves encouraging results even though, because of its high processing cost, it averages fewer than 30 model calls.
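As a rough illustration of the second metric, the sketch below counts the largest set of mutually non-overlapping routes among a handful of candidates. It rests on assumptions not stated in the article: each route is represented as a set of hypothetical reaction identifiers, "non-overlapping" is taken to mean sharing no reaction, and the search is a brute force that is only practical for small inputs. It is not how Syntheseus computes the metric.

```python
from itertools import combinations


def max_non_overlapping_routes(routes: list[set[str]]) -> int:
    """Size of the largest subset of routes that pairwise share no reaction.

    Brute force over subsets, from largest to smallest; exponential in the
    number of routes, so suitable only for small illustrative inputs.
    """
    route_sets = [frozenset(route) for route in routes]
    for size in range(len(route_sets), 0, -1):
        for subset in combinations(route_sets, size):
            if all(a.isdisjoint(b) for a, b in combinations(subset, 2)):
                return size
    return 0


# Toy routes with made-up reaction identifiers.
toy_routes = [
    {"r1", "r2"},
    {"r2", "r3"},
    {"r3", "r4"},
    {"r5"},
]

num_independent = max_non_overlapping_routes(toy_routes)
print(num_independent)
```

Here the first and second routes share `r2`, and the second and third share `r3`, so at most three routes (the first, third, and fourth) can be chosen without overlap.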


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

Dhanshree Shenwai is a Computer Science Engineer and has a good experience in FinTech companies covering Financial, Cards & Payments and Banking domain with keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements in today’s evolving world making everyone's life easy.
