Baidu and BioMap AI Research Open-Source HelixFold-Single: An End-To-End MSA-Free Protein Structure Prediction Pipeline

It can predict protein structures within seconds

Proteins play crucial roles in organisms and are involved in nearly all biological processes. Because a protein's function is closely tied to its structure, studying protein structures and functions can significantly advance the life sciences.

AI-based protein structure prediction technologies have significantly improved prediction accuracy in recent years. AlphaFold2, for example, is an AI-based protein structure prediction pipeline that has achieved close to experimental precision. Multiple Sequence Alignments (MSAs) and templates are the major inputs these sophisticated algorithms use to extract co-evolutionary information from homologous sequences. However, searching protein databases for MSAs and templates is computationally expensive and typically takes several hours per protein.

Using only primary protein sequences, researchers from Baidu Inc. and BioMap test the limits of fast protein structure prediction. They propose HelixFold-Single, an end-to-end MSA-free protein structure prediction pipeline. A large-scale protein language model (PLM) serves as the model's foundation, while the second crucial component consists of the folding-related modules from AlphaFold2.

The researchers claim that the co-evolutionary knowledge needed for MSA-free prediction can be learned by a large-scale PLM rather than extracted from MSAs and templates. Large-scale language models have achieved great success in natural language processing in recent years, and protein sequences can be modeled in a similar way. Their learning capacity increases significantly as the number of model parameters grows.

PLMs have been adopted in recent work to improve the performance of many downstream tasks, including secondary structure and function prediction. Through self-supervised learning on large collections of unlabeled proteins, PLMs can capture long-range relationships along protein sequences and enhance downstream protein-related tasks.

To capture this domain knowledge, the PLM encodes a primary sequence into a single-sequence representation and a pairwise residue-residue representation. These representations are then processed by the EvoFormer and Structure Module from AlphaFold2, which learn geometric information and predict atom coordinates. Wiring the two components together yields an end-to-end differentiable model.
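To make the data flow concrete, here is a minimal numpy sketch of this wiring. All function names, dimensions, and the toy computations are hypothetical stand-ins (the real model uses learned transformer layers); the sketch only shows the shapes of the single and pair representations and how they feed a coordinate-predicting head.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): sequence length and channel sizes.
L, d_single, d_pair = 64, 384, 128

def plm_encode(seq_len):
    """Stand-in for the large-scale PLM: returns a per-residue (single)
    representation and a pairwise residue-residue representation."""
    single = rng.standard_normal((seq_len, d_single))
    # Toy pair representation built from an outer sum of two projections;
    # the real model learns this mapping end to end.
    a = single @ rng.standard_normal((d_single, d_pair))
    b = single @ rng.standard_normal((d_single, d_pair))
    pair = a[:, None, :] + b[None, :, :]          # shape (L, L, d_pair)
    return single, pair

def fold(single, pair):
    """Stand-in for EvoFormer + Structure Module: maps the representations
    to 3D coordinates, one point per residue."""
    w = rng.standard_normal((single.shape[1], 3))
    coords = single @ w                            # placeholder geometry head
    return coords

single, pair = plm_encode(L)
coords = fold(single, pair)
print(single.shape, pair.shape, coords.shape)
```

Because every step here is a differentiable array operation, the same wiring in a deep-learning framework would let gradients from the structure loss flow all the way back into the PLM, which is what "end-to-end differentiable" means in the text above.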

There are two stages of training in HelixFold-Single. 

  1. In the first stage, the large-scale PLM is trained on billions of unlabeled primary sequences using a masked language prediction task.
  2. In the second stage, the entire model is trained on experimentally determined protein structures, augmented with structures predicted by AlphaFold2.
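The stage-1 objective can be illustrated with a small sketch of masked language prediction on an amino-acid sequence. The mask token, mask rate, and function name below are illustrative assumptions, not details from the paper; the idea is simply that some residues are hidden and the model is trained to recover them.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "#"  # hypothetical mask token

def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Build a stage-1 pretraining example: hide a fraction of residues.
    Returns the masked sequence and a dict of position -> original residue,
    which serves as the prediction target for the PLM."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, aa in enumerate(seq):
        if rng.random() < mask_rate:
            targets[i] = aa        # label the model must predict
            masked.append(MASK)
        else:
            masked.append(aa)
    return "".join(masked), targets

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # illustrative sequence
masked, targets = mask_sequence(seq)
```

Because the targets come from the sequence itself, no structural labels are needed at this stage, which is why billions of unlabeled sequences can be used.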

The researchers tested their method on the CASP14 and CAMEO datasets, comparing it against AlphaFold2 and RoseTTAFold. The results show that on proteins with enough homologous sequences, HelixFold-Single achieves accuracy comparable to those approaches. 

The team states that HelixFold-Single outperforms MSA-based techniques in prediction efficiency and could be used for protein-related tasks requiring a large number of predictions. They also examine HelixFold-Single's performance on targets with varying numbers of homologous sequences. The outcomes suggest that HelixFold-Single can make precise structure predictions for the most widely researched proteins.

This article is written as a research summary by Marktechpost staff based on the research paper 'HelixFold-Single: MSA-Free Protein Structure Prediction by Using Protein Language Model as an Alternative'. All credit for this research goes to the researchers on this project. Check out the paper and code.
