This AI Method from MIT and IBM Research Improves the Training and Inference Performance of Deep Learning Models on Large Graphs

Graphs, an abstract, non-linear data structure, are frequently used to represent relationships between entities, such as social media connections, hierarchies, and financial transactions. Because the amount of data represented as graphs is very large, researchers need faster and more computationally efficient algorithms to conduct deep learning on graph-structured data. A neural network that operates directly on graph structures is known as a Graph Neural Network (GNN). GNNs have become increasingly common in recent years, particularly in domains such as social networks and recommender systems.

In contrast to ordinary neural networks, GNNs make constructing mini-batches highly computationally expensive, because multi-hop graph neighborhoods expand exponentially with the number of network layers. This makes improving GNNs’ training and inference performance quite challenging. To address such issues, MIT researchers collaborated with IBM Research to develop a new technique called SALIENT (SAmpling, sLIcing, and data movemeNT). By tackling three main bottlenecks, their method significantly shortens the runtime of GNNs on huge datasets, even at the scale of billions. Additionally, the approach scales well as computational capacity is increased from one to sixteen graphics processing units (GPUs).
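
To make the exponential growth concrete, here is a small Python sketch of the worst-case arithmetic. It is an illustration only, not part of SALIENT: the function name and the assumption that every node has at most `max_degree` neighbors are ours.

```python
def neighborhood_upper_bound(max_degree: int, num_layers: int) -> int:
    """Worst-case number of nodes within `num_layers` hops of one target node,
    assuming (hypothetically) that every node has at most `max_degree` neighbors."""
    # Hop k can reach at most max_degree**k nodes; hop 0 is the target itself.
    return sum(max_degree ** k for k in range(num_layers + 1))


for layers in (1, 2, 3):
    print(layers, neighborhood_upper_bound(10, layers))
# prints:
# 1 11
# 2 111
# 3 1111
```

Under this assumption, each added layer multiplies the work needed to assemble a single target node's neighborhood by roughly the node degree, which is why mini-batch construction can dominate the runtime.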

When the researchers began to examine the difficulties that current systems encounter when scaling cutting-edge machine learning techniques for graphs to large datasets, at the scale of billions, the need for SALIENT became even more apparent. Most existing research achieves satisfactory performance on smaller datasets that readily fit into GPU memory. The team’s goal is to build a system that can handle graphs large enough to represent, for example, the entire Bitcoin network. They also want the system to be as efficient and streamlined as possible, to keep up with the rate at which new data is generated virtually every day.

To build SALIENT, the team first applied basic optimization techniques to components that fit into existing machine-learning frameworks, such as PyTorch Geometric and the Deep Graph Library (DGL). The main goal of developing a method that integrates readily into current GNN architectures was to make it simple for domain experts to apply this work to their specialized fields, accelerating model training and producing inference results more quickly. One design change the team made was to keep all hardware resources, including CPUs, data links, and GPUs, busy at all times. For instance, while the CPU samples the graph and prepares mini-batches of data, the GPU can train the machine-learning model or carry out inference.
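
The idea of keeping the CPU and GPU busy at the same time can be sketched as a producer-consumer pipeline. The Python example below is a simplified illustration using only the standard library, not SALIENT's actual implementation: `prepare_batches` and `train` are hypothetical names, the "batches" are plain lists standing in for sampled subgraphs, and summing a batch stands in for a training step.

```python
import queue
import threading


def prepare_batches(num_batches, out_queue):
    """Producer (CPU side): builds mini-batches and hands them off."""
    for i in range(num_batches):
        batch = [i * 10 + j for j in range(4)]  # stand-in for a sampled subgraph
        out_queue.put(batch)  # blocks while the queue is full
    out_queue.put(None)  # sentinel: no more batches


def train(num_batches, queue_depth=2):
    """Consumer (GPU side): processes batches as soon as they are ready."""
    batches = queue.Queue(maxsize=queue_depth)
    worker = threading.Thread(target=prepare_batches, args=(num_batches, batches))
    worker.start()
    results = []
    while True:
        batch = batches.get()
        if batch is None:
            break
        results.append(sum(batch))  # stand-in for one training step
    worker.join()
    return results


print(train(3))  # prints [6, 46, 86]
```

The bounded queue lets the producer run a few batches ahead without unbounded memory growth. In CPython, threads running pure-Python code do not execute in parallel, so a real system would need the batch preparation to run in native code or in separate processes to get genuine overlap.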

These straightforward adjustments allowed the researchers to raise GPU utilization from 10 to 30 percent, which yielded a 1.4x to 2x performance improvement over open-source benchmark code. However, the research team believed they could do even better, so they set out to examine the bottlenecks that arise at the start of the data pipeline and the algorithms for graph sampling and mini-batch preparation.

GNNs differ significantly from other neural networks. They carry out a neighborhood aggregation operation, which computes information about a node from its neighboring nodes. However, as the number of layers in a GNN grows, so does the number of nodes the network must reach, which can push the limits of a computer. Some neighborhood sampling techniques use randomization to gain a little efficiency, but this is insufficient. To address this, the team sped up the sampling procedure roughly threefold by using a combination of data structures and algorithmic improvements.
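
For readers unfamiliar with randomized neighborhood sampling, the following Python sketch shows the general technique of capping the number of neighbors drawn per node at each hop (a "fanout"). It is a generic illustration and not the team's optimized sampler: the function name, the dictionary-based adjacency list, and the toy graph are all assumptions made for this example.

```python
import random


def sample_neighborhood(adj, seeds, fanouts, rng):
    """Return the node list for each hop, sampling at most fanouts[k]
    neighbors per node at hop k (without replacement)."""
    layers = [list(seeds)]
    frontier = list(seeds)
    for fanout in fanouts:
        next_nodes = set()
        for node in frontier:
            neighbors = adj.get(node, [])
            k = min(fanout, len(neighbors))
            next_nodes.update(rng.sample(neighbors, k))
        frontier = sorted(next_nodes)
        layers.append(frontier)
    return layers


# Toy graph: each of 50 nodes links to the next 8 nodes (wrapping around).
adj = {n: [(n + d) % 50 for d in range(1, 9)] for n in range(50)}
layers = sample_neighborhood(adj, seeds=[0], fanouts=[3, 2], rng=random.Random(0))
print([len(layer) for layer in layers])  # hop sizes, at most 1, 3 and 6 here
```

With fanouts of 3 and 2, the second hop holds at most 6 nodes instead of the up to 64 that a full two-hop expansion could visit in this toy graph. That bounds the cost per batch, although the sampling itself still has to be done for every mini-batch.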

To address the third and final bottleneck, the team incorporated a prefetching step that pipelines the transfer of mini-batch data between the CPU and GPU. The team also found and fixed a performance issue in a well-known PyTorch module, resulting in a runtime of 16.5 seconds per epoch for SALIENT. The team attributes these results to their meticulous attention to detail. Simply by carefully examining the factors that affect performance while a GNN is being trained, they resolved a significant number of performance problems. Their approach is now bottlenecked only by GPU computation, which is what one would expect of an ideal system.
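
Prefetching in general means starting the transfer of the next batch before work on the current one begins. The Python sketch below illustrates that pattern with the standard library only. It is not SALIENT's code: `transfer` is a hypothetical stand-in for a CPU-to-GPU copy, and `prefetching` is a name chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def transfer(batch):
    """Stand-in for copying a mini-batch from CPU memory to GPU memory."""
    return tuple(batch)


def prefetching(batches):
    """Yield transferred batches, starting the transfer of the next batch
    before the caller begins working on the current one."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for batch in batches:
            upcoming = pool.submit(transfer, batch)  # next transfer starts now
            if pending is not None:
                yield pending.result()  # caller computes while `upcoming` runs
            pending = upcoming
        if pending is not None:
            yield pending.result()


for device_batch in prefetching([[1, 2], [3, 4], [5, 6]]):
    print(sum(device_batch))  # stand-in for the GPU compute step
# prints 3, then 7, then 11
```

Keeping one transfer in flight means that, as long as a transfer takes no longer than a compute step, the compute step rarely has to wait for data.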

Thanks to MIT and IBM’s SALIENT, researchers will now be able to handle graphs at a scale not seen before. Looking ahead, the team wants to apply the GNN training system both to existing algorithms for predicting the properties of each node and to more difficult tasks such as recognizing deeper subgraph patterns. Detecting financial crimes would be one practical application. The U.S. Air Force Research Laboratory, the U.S. Air Force Artificial Intelligence Accelerator, and the MIT-IBM Watson AI Lab provided funding for the team’s research. Their work was also presented at the MLSys 2022 conference.


Khushboo Gupta is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Goa. She is passionate about machine learning, natural language processing, and web development, and enjoys deepening her technical knowledge by taking part in challenges.