Researchers at IBM Propose MoLFormer-XL: A Pretrained AI Model That Infers the Structure of Molecules From Simple Representations

Recent technological developments have led to the widespread adoption of large pretrained models for a variety of tasks. These models, which could previously summarise texts and translate between languages, can now be used for more complex tasks like answering questions, writing code, and even composing music. Molecular biology is another domain where large pretrained models have demonstrated remarkable performance. Machine learning algorithms can now be trained to infer the shapes and specific characteristics of molecules, which yields precise and quick predictions of molecular properties. This is particularly helpful in the development of new drugs and new materials.

Although some supervised machine learning algorithms have shown promising results, the enormous chemical space and the scarcity of labels make supervised learning difficult. Chemists can obtain these labels through simulations or laboratory tests, but this is a labor-intensive and expensive procedure that can take years. Recently, researchers have attempted to address this problem with unsupervised transformer-based language models that are pretrained on a large unannotated corpus. Such models have achieved state-of-the-art performance on many downstream natural language processing tasks.

MoLFormer-XL, a pretrained AI model that infers the structure of molecules from simple representations, was recently introduced by IBM researchers to address the bottleneck of limited annotated data about molecular shapes. The pretrained model makes it considerably simpler and faster to screen molecules for new applications or to create them from scratch. MoLFormer-XL is part of the MoLFormer family of foundation models for molecular discovery. It was pretrained on 1.1 billion unlabelled molecules drawn from the PubChem and ZINC datasets. The benefit of using simple chemical representations is that they allow a transformer to extract enough detail to deduce a molecule’s structure and function.

To predict molecular behavior from a molecule’s structure, existing molecular models rely heavily on Graph Neural Networks. The main disadvantage of graph models is that they frequently need sophisticated mechanisms and extensive simulations to represent atomic interactions within molecules accurately. This restricts the size of molecular datasets, which in turn limits a model’s ability to make broader predictions. MoLFormer-XL, in contrast, is pretrained on a dataset of 1.1 billion molecules, where each molecule is represented as a string in SMILES (Simplified Molecular Input Line Entry System) notation. Each SMILES string carries a great deal of information about the underlying chemical structure by describing how the atoms in molecules targeted for drug and material development are organized.
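To make the idea of molecules as strings concrete, here is a minimal Python sketch that splits SMILES strings into tokens a language model could consume. The regular expression and the `tokenize_smiles` helper are illustrative assumptions, not the tokenizer used by MoLFormer-XL; the pattern covers only common SMILES symbols.

```python
import re

# Illustrative pattern (an assumption, not MoLFormer-XL's tokenizer):
# bracket atoms, two-letter halogens, then single-character symbols.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%[0-9]{2}|[A-Za-z]|[0-9]|[()=#+\-\\/.:@*~$]")

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into atom, bond, ring, and branch tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Guard against silently dropping characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in SMILES: {smiles!r}")
    return tokens

# Well-known molecules written as SMILES strings.
examples = {
    "ethanol": "CCO",
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
    "L-alanine": "C[C@H](N)C(=O)O",
}

for name, smiles in examples.items():
    print(name, tokenize_smiles(smiles))
```

Lowercase letters such as `c` denote aromatic atoms, digits open and close rings, and parentheses mark branches, so even this flat token sequence encodes the molecule's connectivity.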


MoLFormer-XL was trained to focus on the interactions between the atoms depicted in each SMILES string using a rotary positional embedding, which encodes each character’s position relative to the others. According to the researchers, this additional molecular context allowed the model to learn structural characteristics that greatly simplified the learning of downstream tasks. MoLFormer-XL can also predict a molecule’s solubility, antiviral activity, and other biophysical and physiological characteristics, such as its capacity to pass the blood-brain barrier.
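The relative-position embedding described above can be illustrated with a small sketch. It assumes the embedding follows the general rotary position embedding (RoPE) recipe, in which pairs of vector components are rotated by an angle proportional to the token's position; the function names, the base of 10000, and the toy vectors are all assumptions for illustration and are not taken from the MoLFormer-XL code.

```python
import math

def rotary_embed(vec: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Rotate each consecutive pair of components by an angle that grows with pos."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (hypothetical values).
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.9]

# The attention score depends only on the offset between positions:
# positions (3, 1) and (10, 8) both have an offset of 2.
score_a = dot(rotary_embed(q, 3), rotary_embed(k, 1))
score_b = dot(rotary_embed(q, 10), rotary_embed(k, 8))
print(math.isclose(score_a, score_b, rel_tol=1e-9, abs_tol=1e-9))
```

Because the score between two tokens depends on how far apart they are rather than where they sit in the string, a model using this scheme can attend to the spacing between atoms in a SMILES string regardless of the string's absolute layout.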

Researchers at IBM are hopeful that MoLFormer-XL will soon be a useful tool for discovering novel molecules with desired properties, thanks to its capacity to efficiently learn the structures of such a wide range of molecules. After several experimental evaluations, the researchers concluded that MoLFormer-XL outperformed other supervised and self-supervised graph neural networks and language models on ten molecular property benchmarks and achieved notable results on the other two. However, the primary reason for this performance is the model's size, which comes at the cost of computational efficiency. The model requires significant computational resources and training time, which the researchers tried to optimize wherever possible. MoLFormer-XL’s performance offers encouraging evidence that large-scale molecular language models can capture enough chemical and structural information to predict a variety of molecular properties.

Check out the Paper and IBM Blog. All credit for this research goes to the researchers on this project.

Khushboo Gupta is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Goa. She is passionate about Machine Learning, Natural Language Processing, and Web Development. She enjoys learning more about the technical field by participating in several challenges.