Researchers Introduce High-Performance Deep Learning Toolbox for Genome-Scale Prediction of Protein Structure and Function

Gene functional annotation, the task of correctly inferring a gene’s function from its sequence, is critical in the biological sciences. Dramatic breakthroughs in next-generation sequencing technologies have fueled exponential growth in the number of sequenced genomes, creating a growing bottleneck in accurate gene annotation. Computational techniques can help relieve this bottleneck by taking advantage of the vast amount of sequence data now available.

Researchers from the Georgia Institute of Technology and the Department of Energy’s Oak Ridge National Laboratory are using supercomputing and new deep learning technologies to predict the structures and functions of thousands of proteins with unknown activities. The team used the Summit supercomputer to accurately identify protein structures and activities across the entire genomes of several species. Their deep learning-based algorithms predict protein structure and function from DNA sequences, speeding up discoveries that could advance biotechnology, biosecurity, bioenergy, and solutions to pollution and climate change.

Proteins are central to answering many scientific questions about human, environmental, and global health. A protein’s function depends on its three-dimensional structure, and different amino acid sequences can fold into the same structure because many amino acids have overlapping physical properties. Incorporating structure into functional analysis can therefore reveal information about protein function that purely sequence-based methods may miss.

In theory, high-performance computing (HPC) could greatly aid genome annotation on a massive scale. However, unlike other computationally demanding fields such as astrophysics, bioinformatics and computational biology applications have rarely used HPC or even code acceleration with general-purpose graphics processing units (GPUs).

The research group is concentrating on species that are crucial to DOE missions. They modeled the whole proteomes (all the proteins encoded in an organism’s genome) of four bacteria, each with about 5,000 proteins. Two of these bacteria have been found to produce key compounds for the plastics industry. The other two are known to break down and transform metals. The structural information can help scientists develop novel synthetic biology techniques and strategies for reducing the spread of pollutants such as mercury in the environment.

The team also created models of the 24,000 proteins at work in sphagnum moss, which is essential for storing massive amounts of carbon in peat bogs. Peat bogs store more carbon than all of the world’s forests combined. These findings could help scientists determine which genes are most crucial for sphagnum’s ability to absorb carbon and endure climate change. The researchers started by investigating genes that allow sphagnum moss to endure rising temperatures, comparing its DNA sequences with those of the model organism Arabidopsis, a well-studied plant in the mustard family. Being able to see the structures of the proteins adds another layer to the process, allowing them to zero in on the most promising gene candidates for testing.

Determining function by looking for similarities between sequences is only part of the challenge. Proteins are made up of amino acids that are translated from DNA sequences. Over the course of evolution, some of these sequences mutate, replacing one amino acid with another that has similar properties. Such substitutions do not always change the protein’s function.
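To make the point about substitutions concrete, here is a minimal Python sketch. It is not part of the researchers' pipeline, and every name in it is hypothetical. It assumes two already-aligned, gap-free sequences of equal length and uses a deliberately coarse, illustrative grouping of amino acids by physicochemical class. It compares plain sequence identity with a looser measure that treats same-class substitutions as matches.

```python
# Illustrative only: a coarse grouping of the 20 standard amino acids by
# broad physicochemical class. This grouping is an assumption made for the
# example, not the scheme used by the researchers.
CLASSES = {
    "hydrophobic": "AVLIMC",
    "aromatic": "FWY",
    "polar": "STNQ",
    "positive": "KRH",
    "negative": "DE",
    "special": "GP",
}
CLASS_OF = {aa: name for name, members in CLASSES.items() for aa in members}


def _check(a: str, b: str) -> None:
    """Require two non-empty, equal-length (pre-aligned, gap-free) sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")


def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned positions that hold the identical residue."""
    _check(a, b)
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def percent_class_similarity(a: str, b: str) -> float:
    """Percentage of aligned positions whose residues share a class."""
    _check(a, b)
    same = sum(CLASS_OF[x] == CLASS_OF[y] for x, y in zip(a, b))
    return 100.0 * same / len(a)


# Two made-up 10-residue fragments that differ at most positions,
# mostly through same-class substitutions (K->R, V->I, D->E, ...).
seq_a = "MKVLDEASFG"
seq_b = "MRILEDGTYP"

print(percent_identity(seq_a, seq_b))          # 20.0
print(percent_class_similarity(seq_a, seq_b))  # 90.0
```

In this toy case the two fragments are only 20% identical, yet 90% of positions keep the same broad chemical character, which is the kind of signal that identity-only comparisons can miss.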

Although physical investigations and methods such as X-ray crystallography will still be required to establish protein structure and function, deep learning is changing the paradigm by rapidly reducing the large field of candidate genes to the most interesting few for further investigation.

Sequence Alignments from Deep-Learning of Structural Alignments, or SAdLSA, is one of the tools in the deep learning pipeline. It is trained in the same manner as earlier deep learning models for protein structure prediction. Even when two sequences share as little as 10% similarity, SAdLSA can compare them because it has implicitly learned protein structure.

When combined with AlphaFold, which provides a 3D structural model of the protein, SAdLSA makes it possible to study the active site and work out which amino acids are doing the chemistry and how they contribute to function. In predicting the structures of unknown proteins, DeepMind’s tool, AlphaFold 2, achieved accuracy comparable to X-ray crystallography.

The team believes that these tools, which combine structure-based and deep learning-based approaches, can help us learn more about proteins of unknown function, whose sequences do not match any other sequence in the entire database of known proteins. This opens up a wealth of new information and the potential for bioengineering to address national priorities.


