Graph Neural Networks (GNNs) have become the de facto models for representation learning on graph-structured data, and as a result they are increasingly deployed in production systems. These models pass messages along the edges of the given graph, with nonlinearities between layers, updating the node embeddings iteratively. With l layers, the computed embedding of a seed vertex incorporates information from its l-hop neighborhood. To be useful in production, GNN models must be trained on billion-scale graphs, and even on distributed systems this training can take days or weeks. Minibatch training is the more effective approach at this scale, although it is more difficult for GNNs than for Deep Neural Networks (DNNs) in general.
Because node embeddings in a GNN depend recursively on their neighbors' embeddings, an l-layer model's dependency spans each node's l-hop neighborhood. Real-world graphs typically have small diameters, so when l is large this l-hop neighborhood may well span the entire graph, a problem known as the Neighborhood Explosion Phenomenon (NEP). To address it, researchers have proposed sampling a subgraph of the batch nodes' l-hop neighborhoods. These strategies fall into three main types: node-based, layer-based, and subgraph-based. Node-based sampling techniques sample each node's neighborhood separately and iteratively.
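To see why the l-hop neighborhood explodes, consider a short sketch on a hypothetical random graph (the graph, its size, and its average degree are illustrative assumptions, not taken from the paper): the set of vertices an l-layer GNN depends on grows roughly by a factor of the average degree per hop.

```python
import random

random.seed(0)

# Hypothetical toy graph: 1000 vertices, ~10 out-neighbors each.
n, avg_deg = 1000, 10
adj = {v: random.sample(range(n), avg_deg) for v in range(n)}

def l_hop_neighborhood(seeds, l):
    """Return the set of vertices within l hops of the seeds,
    i.e., everything an l-layer GNN's output at the seeds depends on."""
    frontier, seen = set(seeds), set(seeds)
    for _ in range(l):
        nxt = {u for v in frontier for u in adj[v]} - seen
        seen |= nxt
        frontier = nxt
    return seen

# The dependency set grows by roughly a factor of avg_deg per hop,
# quickly approaching the whole graph.
for l in range(1, 5):
    print(l, len(l_hop_neighborhood([0], l)))
```

Even with a modest average degree of 10, three or four hops already cover most of this small graph; on a billion-scale graph with much higher degrees, the blow-up is far more severe.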
It was discovered that node-based methods sample subgraphs that are too shallow, with a low ratio of edges to nodes. Layer-based sampling techniques were therefore proposed, in which sampling is done collectively for each layer. Subgraph sampling methods, in contrast, typically use the same subgraph for all layers rather than the recursive layer-by-layer scheme used by node- and layer-based methods. Among these sampling techniques, some cache historical embeddings to lower the variance of the estimated approximate embeddings, while others take embedding magnitudes into account.
There are also techniques for caching popular vertices; most of these are orthogonal and can be combined with other sampling algorithms. Node-based sampling methods are affected by the NEP the most, but by ensuring each vertex has k sampled neighbors, where k is the sampling algorithm's only hyperparameter, they guarantee a good approximation for each embedding while keeping the fanout bounded. Layer-based sampling methods do not suffer as much from the NEP, because the number of vertices sampled per layer is itself a hyperparameter. However, they cannot guarantee that each vertex's approximation is sufficient, and their hyperparameters are difficult to reason about: the number of nodes to sample at each layer depends dramatically on the graph structure. Subgraph sampling methods typically have higher bias than their node- and layer-based counterparts. The authors therefore concentrate on node- and layer-based sampling techniques in this paper and combine their benefits.
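The trade-off between the two families can be sketched in a few lines. The graph, the fanout k, and the layer budget below are hypothetical illustrations, not the paper's algorithms: node-based sampling guarantees every seed min(k, deg) neighbors but lets the frontier grow with the number of seeds, while layer-based sampling caps the frontier at a fixed budget but cannot promise each seed a good sample.

```python
import random

random.seed(0)

# Hypothetical random graph; degrees, k, and the budget are illustrative.
n, avg_deg = 200, 10
adj = {v: random.sample(range(n), avg_deg) for v in range(n)}

def neighbor_sample(seeds, k):
    """Node-based: pick up to k neighbors independently for each seed,
    so every seed is guaranteed min(k, deg) sampled neighbors."""
    return {v: random.sample(adj[v], min(k, len(adj[v]))) for v in seeds}

def layer_sample(seeds, budget):
    """Layer-based: pick one fixed-size vertex set for the whole layer.
    The frontier is capped at `budget`, but a given seed may end up with
    few (or even zero) sampled neighbors."""
    candidates = sorted({u for v in seeds for u in adj[v]})
    chosen = set(random.sample(candidates, min(budget, len(candidates))))
    return {v: [u for u in adj[v] if u in chosen] for v in seeds}

seeds = list(range(10))
node_frontier = {u for vs in neighbor_sample(seeds, 5).values() for u in vs}
layer_frontier = {u for vs in layer_sample(seeds, 20).values() for u in vs}
# node_frontier can be as large as k * len(seeds); layer_frontier never
# exceeds the budget, regardless of how many seeds there are.
print(len(node_frontier), len(layer_frontier))
```

This also hints at why the layer budget is hard to tune: the right value depends on how much the seeds' neighborhoods overlap, which varies with the graph structure.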
The following is a list of this work’s main contributions:
- They propose LABOR, a new algorithm that uses Poisson sampling to combine the benefits of neighbor sampling and layer sampling. By correlating the sampling procedures of the given seed nodes, a significant reduction in the number of vertices sampled across seeds is achieved, and with it a reduction in computation, memory, and communication. Additionally, LABOR shares the same hyperparameters as neighbor sampling, so it can be used as a drop-in replacement.
- They experimentally verify that their proposed sampling algorithm, LABOR, outperforms both neighbor sampling and layer sampling approaches.
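The core idea of correlating the seeds' samples can be illustrated with a heavily simplified, hypothetical sketch; this is an assumption-laden toy in the spirit of the paper's Poisson-sampling approach, not the authors' actual algorithm. Each candidate neighbor t draws a single uniform number r_t that is shared across all seeds, and seed s keeps t whenever r_t <= k / deg(s). Each seed still expects about k neighbors, but because the randomness is shared, seeds with overlapping neighborhoods agree on which neighbors to sample, shrinking the union of sampled vertices.

```python
import random

random.seed(0)

# Hypothetical toy graph where 20 seeds all share the same 30 neighbors,
# chosen to make the effect of correlated sampling easy to see.
adj = {s: list(range(100, 130)) for s in range(20)}

def correlated_poisson_sample(seeds, k):
    """Toy sketch of correlated Poisson sampling: one shared uniform r_t
    per candidate neighbor t; seed s keeps t iff r_t <= k / deg(s).
    Per-seed expected fanout is ~k, but the shared randomness makes
    overlapping seeds pick the same neighbors."""
    r = {}   # shared randomness, keyed by neighbor id
    out = {}
    for s in seeds:
        nbrs = adj[s]
        p = min(1.0, k / len(nbrs))
        out[s] = [t for t in nbrs if r.setdefault(t, random.random()) <= p]
    return out

sampled = correlated_poisson_sample(list(range(20)), k=10)
union = {t for vs in sampled.values() for t in vs}
# With independent per-seed sampling, 20 seeds would likely touch all 30
# shared neighbors; here the union stays close to k vertices.
print(len(union))
```

Because every seed here has the same degree and shares the same r_t values, all 20 seeds end up with the identical sampled set, which is exactly the reduction in computation, memory, and communication the contribution describes.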
The implementation of this work is available as a pending pull request to the popular Deep Graph Library (DGL) on GitHub.
This article is written as a research summary by Marktechpost staff based on the research paper '(LA)YER-NEIGH(BOR) SAMPLING: DEFUSING NEIGHBORHOOD EXPLOSION IN GNNS'. All credit for this research goes to the researchers on this project. Check out the paper and GitHub link.
Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence from the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.