A New Diffusion-based Generative Model that Designs Protein Backbone Structures via a Procedure that Mirrors the Native Folding Process

Proteins have been intensively explored as a therapeutic medium due to their central role in biology, and they comprise a fast-rising proportion of approved medicines. Proteins are essential for life: they take part in virtually every biological activity, from transmitting signals between neurons to identifying tiny intruders and activating the immune response, from generating energy for cells to moving molecules along cellular highways. Misfolded or malfunctioning proteins, on the other hand, are responsible for some of the most challenging diseases in human medicine, including Alzheimer’s disease, Parkinson’s disease, Huntington’s disease, and cystic fibrosis.

Deep generative models have recently been proposed for protein structure design. However, because protein structure is highly complex, these models are frequently used to predict constraints (such as pairwise distances between residues) that must then be substantially post-processed to yield structures. This complicates the design pipeline, and noise in the predicted constraints can be amplified during post-processing, resulting in unrealistic shapes (assuming the constraints are even satisfiable in the first place). Other generative approaches learn to build a 3D point cloud representing a protein structure, using complicated equivariant network designs or loss functions.

Such equivariant designs ensure that the probability density from which the protein structures are sampled is invariant under translation and rotation. However, translation- and rotation-equivariant architectures are frequently also symmetric under reflection, which violates essential structural features of proteins such as chirality. Intuitively, this point-cloud formulation is also quite unlike how proteins fold biologically, namely by twisting into energetically favorable configurations. The researchers therefore propose a generative model, inspired by the in vivo protein folding process, that operates on inter-residue angles in protein backbones rather than on Cartesian atom coordinates (see Figure below).
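To see why an angle-based internal representation sidesteps equivariance concerns, note that a dihedral angle computed from four backbone atoms is unchanged by any rotation or translation of the whole structure. The sketch below is a standard signed-dihedral computation (an illustration, not code from the paper):

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (radians) defined by four consecutive atoms."""
    b0 = p0 - p1                      # bond vector into the central bond
    b1 = p2 - p1                      # central bond
    b2 = p3 - p2                      # bond vector out of the central bond
    b1 = b1 / np.linalg.norm(b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    return np.arctan2(np.dot(np.cross(b1, v), w), np.dot(v, w))
```

Because the result depends only on relative atom positions, applying any rigid-body motion to all four points leaves the angle unchanged, and chirality is preserved since a reflection flips the sign.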


This formulation treats each residue as its own reference frame, shifting the equivariance requirements away from the neural network and onto the coordinate system itself. The researchers use a denoising diffusion probabilistic model (diffusion model, for short) with a vanilla transformer parameterization and no equivariance restrictions for generation. Diffusion models train a neural network to start from noise and repeatedly “denoise” it to produce data samples. Such models have proven highly successful across a wide range of input modalities, from images to audio, and are easier to train, with better mode coverage, than approaches such as generative adversarial networks (GANs).
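In practice, a diffusion model is usually trained by predicting the noise added at a random timestep and minimizing a mean-squared error between true and predicted noise. The toy sketch below illustrates the shape of that standard DDPM objective; the linear beta schedule and the dummy zero-predicting "denoiser" are our assumptions for illustration, not the paper's transformer:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)      # cumulative signal retention

def eps_theta(x_t, t):
    # Stand-in for the learned denoiser (the paper uses a vanilla transformer).
    # Predicting zero noise makes the loss reduce to E[||eps||^2] ~ 1.
    return np.zeros_like(x_t)

def simple_loss(x0, rng):
    """One Monte Carlo sample of the simplified DDPM training objective."""
    t = rng.integers(T)
    eps = rng.standard_normal(x0.shape)
    # Closed-form forward process: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.mean((eps - eps_theta(x_t, t)) ** 2)
```

With a real network, eps_theta would be trained to drive this loss toward zero; sampling then runs the learned reverse process from pure noise back to t = 0.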

They perform diffusion on six angles per residue, as shown in the bottom-center diagram: three dihedral torsion angles (orange) and three bond angles (green). They begin with an empirically observed backbone defined by angles x0 and repeatedly add Gaussian noise via the forward noising process q until the angles are indistinguishable from a wrapped Gaussian at xT. These samples are used to learn the “reverse” denoising process p.

Paper: https://arxiv.org/pdf/2209.15611v1.pdf
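Because the six quantities are angles, the noised values live on a torus, so the forward process wraps them back into [-π, π) rather than letting them drift along the real line; this is why the terminal distribution is a wrapped Gaussian rather than an ordinary one. A minimal sketch of such a wrapped forward noising process (the schedule values here are our assumption, not taken from the paper):

```python
import numpy as np

def wrap(x):
    """Map angles onto the half-open interval [-pi, pi)."""
    return (x + np.pi) % (2.0 * np.pi) - np.pi

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed noise schedule
alpha_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0): closed-form Gaussian noising, then wrap."""
    eps = rng.standard_normal(x0.shape)
    return wrap(np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps)
```

By the final step, alpha_bar[T-1] is nearly zero, so xT carries essentially no information about the original backbone x0, matching the description above.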

They present a set of validations quantitatively showing that unconditional sampling from their model directly generates realistic protein backbones, from reproducing the natural distribution of protein inter-residue angles to producing overall structures with appropriate arrangements of multiple structural building-block motifs. They show that the generated backbones are diverse and designable, making them biologically plausible protein structures. Their findings highlight the potential of biologically inspired problem formulations and mark a crucial step toward designing novel proteins and protein-based therapeutics.

This article is a research summary written by Marktechpost staff based on the research pre-print paper 'Protein structure generation via folding diffusion'. All credit for this research goes to the researchers on this project. Check out the paper and GitHub link.


Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest lies in image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.