Meet Graphein: a Python Library for Geometric Deep Learning and Network Analysis on Protein Structures and Interaction Networks

Geometric deep learning applies deep learning to data with an underlying non-Euclidean structure, such as graphs or manifolds. These techniques have been used to tackle a range of problems in computational and structural biology and have shown considerable promise in drug discovery and design. However, geometric deep learning frameworks that include graph-representation functionality and built-in datasets have so far focused largely on small molecules, and featurization strategies and computational analysis of small-molecule graphs are a well-developed area of study. The same attention has not yet been paid to data preparation for geometric deep learning in structural biology and interactomics.

The function of proteins is inextricably linked to their underlying molecular structure, which is substantially more complicated than that of small molecules. Protein graphs can be constructed at different levels of granularity, ranging from atomic-scale graphs resembling small-molecule graphs to graphs at the level of individual residues. The relational structure of the data can be captured through spatial relationships or higher-order intramolecular interactions that have no counterpart in small-molecule graphs. Furthermore, many biological processes are facilitated by interactions between biomolecular entities, frequently through direct physical contact governed by their 3D structure. Researchers therefore need finer control over the data engineering process and the featurization of structural data.

More work is needed to investigate how different graph representations of biological structures affect machine learning models and to combine structural and interaction data. Graphein is a tool that addresses these problems by giving researchers flexibility, reducing the time needed for data preparation, and facilitating reproducible research. Proteins assemble into intricate three-dimensional structures to perform their biological tasks. Decades of structural biology research, together with recent advances in protein structure prediction, have expanded the body of experimentally determined and modeled protein structures, and this data has enormous potential to guide future studies. However, the best way to represent this data in machine learning studies remains an open question. Grid-structured representations of protein structures are frequently processed with 3D Convolutional Neural Networks (3DCNNs), and sequence-based approaches are also widely used.

However, these representations fail to capture relational information about intramolecular interactions and the internal chemistry of biomolecular structures. They are also computationally costly because they convolve across large regions of space, and computational constraints frequently force the considered volume to be restricted to a region of interest, sacrificing global structural information. For instance, the volume is often limited to a window centered on a binding pocket, discarding information about allosteric sites elsewhere on the protein and about potential conformational rearrangements that contribute to molecular recognition, both of which are key considerations in data-driven drug discovery.

Additionally, 3D volumetric representations lack translational and rotational invariance, a shortcoming frequently addressed with computationally expensive data augmentation. Graphs are substantially less susceptible to these issues because they are inherently invariant to translation and rotation. Where positional information is needed, structural descriptors can still be exploited using architectures such as Equivariant Neural Networks (ENNs), which guarantee that geometric transformations applied to their inputs correspond to well-defined transformations of their outputs. Proteins and biological interaction networks can be naturally represented as graphs at various levels of granularity. Residue-level graphs represent protein structures with amino acid residues as nodes and relationships between them as edges, often based on intramolecular interactions or Euclidean distance cutoffs.
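The two ideas above, residue-level graphs built from distance cutoffs and their invariance to rigid motion, can be sketched in a few lines of plain Python. This is an illustrative toy, not Graphein's actual API; the coordinates and the 8 Å cutoff are assumptions chosen for the example.

```python
import math

def build_residue_graph(coords, cutoff=8.0):
    """Connect residues whose C-alpha atoms lie within `cutoff` angstroms.

    coords: dict mapping residue id -> (x, y, z) coordinates (toy data).
    Returns the edge set as frozensets of residue-id pairs.
    """
    edges = set()
    ids = list(coords)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if math.dist(coords[a], coords[b]) <= cutoff:
                edges.add(frozenset((a, b)))
    return edges

# Hypothetical C-alpha coordinates for four residues (illustrative values only).
coords = {"A1": (0.0, 0.0, 0.0), "A2": (3.8, 0.0, 0.0),
          "A3": (7.6, 0.0, 0.0), "A4": (20.0, 0.0, 0.0)}

g = build_residue_graph(coords)  # A1-A2, A2-A3, A1-A3 fall under the cutoff

# Rigidly rotate 90 degrees about z and translate every residue; the edge set
# is unchanged, illustrating the translational/rotational invariance of graphs.
moved = {k: (-y + 5.0, x - 2.0, z + 1.0) for k, (x, y, z) in coords.items()}
assert build_residue_graph(moved) == g
```

The same construction scales to atom-level graphs by swapping residue coordinates for atom coordinates and tightening the cutoff.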

Atom-level graphs represent protein structure much as small-molecule graph representations do, with nodes denoting individual atoms and edges denoting the relationships between them, frequently chemical bonds or, again, distance-based cutoffs. The graph structure can be enriched by attaching numerical features to nodes, edges, and the graph as a whole. Node features might encode, for example, a residue's chemical properties or atom type, secondary structure assignments, or solvent accessibility metrics. Edge features include bond or interaction types as well as distances, and graph-level features include functional annotations and sequence-based descriptors. Structural information can also be superimposed on protein nodes in interaction networks to provide a multi-scale perspective of biological systems and function.
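To make the three feature levels concrete, here is a minimal sketch of a residue graph carrying node, edge, and graph-level attributes. The feature names, values, and the `hydrophobic_fraction` helper are illustrative assumptions for this example, not Graphein's actual schema.

```python
# Toy residue graph with features at all three levels (illustrative values).
graph = {
    "nodes": {
        # Node features: residue identity, chemistry, secondary structure.
        "A1": {"residue": "MET", "hydrophobic": True, "sec_struct": "H"},
        "A2": {"residue": "LYS", "hydrophobic": False, "sec_struct": "H"},
    },
    "edges": {
        # Edge features: interaction kind and distance in angstroms.
        ("A1", "A2"): {"kind": "peptide_bond", "distance": 3.8},
    },
    # Graph-level features: descriptors of the whole structure.
    "graph": {"num_residues": 2, "annotation": "toy example"},
}

def hydrophobic_fraction(g):
    """A toy graph-level descriptor derived from node features."""
    nodes = g["nodes"].values()
    return sum(n["hydrophobic"] for n in nodes) / len(nodes)

print(hydrophobic_fraction(graph))  # half of the residues are hydrophobic
```

In practice such features would be computed from the structure itself (e.g. solvent accessibility from DSSP) rather than written by hand.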

Graphein serves as a bridge between structural interactomics and geometric deep learning. Graph representations of proteins have already been used successfully in structural biology and machine learning research, and web servers for computing protein structure graphs exist. However, those servers lack fine-grained control over graph construction and feature sets, public APIs for high-throughput programmatic access, easy integration of data modalities, and compatibility with deep learning libraries; this gap motivated the creation of Graphein. The package is open source, and the code is available on GitHub.

Check out the Paper and GitHub. All credit for this research goes to the researchers on this project.

Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.
