NVIDIA Introduces BioNeMo Framework For Training And Deploying Large Biomolecular Language Models At Supercomputing Scale

Motifs in nucleotide and protein sequences have been conserved through evolution because of their significance to the structure or function of the molecule. It is believed that function and evolutionary relationships can be inferred from the discovery of motifs in protein and nucleotide sequences.

At GTC 2022, NVIDIA released the NVIDIA BioNeMo framework for training and deploying large biomolecular language models at supercomputing scale. NVIDIA BioNeMo enables scientists to uncover previously unknown patterns and insights in biological sequences, which they can then relate to biological properties, functions, and even human health conditions.

This tool is included in the NVIDIA Clara Discovery suite of AI frameworks, apps, and models used in the pharmaceutical industry. NVIDIA BioNeMo isn’t just a framework for building language models; it also offers a cloud API service that can be used with a growing library of pre-trained AI models.

Today, researchers generally train small neural networks that require custom preprocessing in order to apply NLP models to biological data. With BioNeMo, they can scale their LLMs to billions of parameters that capture molecular structure, protein solubility, and other properties.

BioNeMo builds on the NVIDIA NeMo Megatron framework to train large-scale, self-supervised language models. It is purpose-built for the molecular sciences and supports data in the SMILES notation for chemical structures and FASTA sequence strings for amino acids and nucleic acid sequences. Data formats for chemistry, proteins, DNA, and RNA will all be supported within the LLM framework.
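To make the two notations above concrete, here is a short, framework-agnostic sketch (plain Python, not BioNeMo code; the molecule and protein fragment are generic examples, not BioNeMo sample data) showing what SMILES and FASTA inputs look like and how a FASTA record might be parsed:

```python
# Hypothetical illustration of the two text formats mentioned above;
# the sequences below are generic examples, not BioNeMo sample data.

# SMILES: a line notation that encodes a molecular graph as a string.
caffeine_smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"  # caffeine

# FASTA: a '>' header line followed by one or more sequence lines.
fasta_record = """>example_protein partial sequence
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
APILSRVGDGTQDNLSGAEKAVQ"""

def parse_fasta(text: str) -> dict[str, str]:
    """Parse a FASTA-formatted string into a {header: sequence} dict."""
    records: dict[str, str] = {}
    header = None
    for line in text.strip().splitlines():
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            records[header] += line.strip()
    return records

records = parse_fasta(fasta_record)
for header, seq in records.items():
    print(header, len(seq))
```

In practice, a training pipeline would read many such records from `.smi` and `.fasta` files before tokenization; this sketch only shows the shape of the data.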

The NVIDIA BioNeMo LLM service provides four pretrained language models for developers building digital biology and chemistry applications. These models are optimized for inference, and early access is provided via a cloud API hosted on NVIDIA DGX Foundry.

The ESM-1 protein LLM transforms amino acid sequences into representations that can be used to predict a wide range of protein properties and functions.
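Protein LLMs of this kind typically treat each amino acid residue as a token before embedding it. The following is a rough, illustrative sketch of that preprocessing step; the vocabulary ordering and special tokens here are hypothetical, not ESM-1's actual vocabulary:

```python
# Illustrative character-level tokenization of an amino acid sequence,
# the kind of preprocessing a protein LLM performs before embedding.
# The vocabulary ordering and special tokens are hypothetical, not ESM-1's.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
SPECIALS = ["<pad>", "<cls>", "<eos>", "<unk>"]

# Token-to-id mapping: special tokens first, then the residues.
vocab = {tok: i for i, tok in enumerate(SPECIALS + list(AMINO_ACIDS))}

def encode(sequence: str) -> list[int]:
    """Map a protein sequence to token ids, bracketed by <cls>/<eos>."""
    ids = [vocab["<cls>"]]
    ids += [vocab.get(res, vocab["<unk>"]) for res in sequence.upper()]
    ids.append(vocab["<eos>"])
    return ids

print(encode("MKTAY"))
```

The model's embedding layer then maps these integer ids to dense vectors, from which downstream heads predict properties and functions.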

Through the BioNeMo service, users can access OpenFold, an open-source AI pipeline for developing cutting-edge protein modeling tools.

A generative chemistry model called MegaMolBART has been trained on 1.4 billion molecules and is capable of performing tasks such as reaction prediction and molecular optimization.

The ProtT5 model broadens the capabilities of protein LLMs like ESM-1b to include the generation of protein sequences.

According to the team, researchers using the BioNeMo LLM service will soon be able to fine-tune the LLMs and employ techniques such as p-tuning to customize them for higher accuracy on their applications in a matter of hours.

A growing number of researchers in the pharmaceutical and biotechnology industries are relying on NVIDIA BioNeMo to facilitate their work. NVIDIA is collaborating with the Broad Institute of MIT and Harvard to build next-generation DNA language models on the BioNeMo framework. Peptone, meanwhile, will use the framework to model intrinsically disordered proteins, making BioNeMo a key piece of infrastructure for integrating data-driven protein design into its design-build-test cycle.

Tool: https://www.nvidia.com/en-us/gpu-cloud/bionemo/

Reference: https://blogs.nvidia.com/blog/2022/09/20/bionemo-large-language-models-drug-discovery/

Note: Thanks to Protopia AI for the thought leadership/educational article above.

Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Bhubaneswar. She is a Data Science enthusiast and has a keen interest in the scope of application of artificial intelligence in various fields. She is passionate about exploring new advancements in technologies and their real-life applications.