Amazon Researchers Propose ‘ALLIE’: A Novel Framework to Address the Challenges of Active Learning on Large-Scale Imbalanced Graphs

This research summary article is based on the paper 'ALLIE: Active learning on large-scale imbalanced graphs'

Graph-structured data arises in social network analysis, financial fraud detection, molecular design, search engines, and recommender systems. Graph Neural Networks (GNNs), as opposed to classic pointwise or pairwise models, have recently emerged as state-of-the-art models on these types of datasets due to their capacity to learn and aggregate complex relationships across (K-hop) neighborhoods.

Despite these advantages, GNNs, like other deep learning models, require a considerable quantity of labeled data for supervised training. In many fields, obtaining enough labeled data is time-consuming, labor-intensive, and expensive, which limits the use of GNNs.

Active Learning (AL) is a promising technique for obtaining labels faster and at lower cost, and for training models more efficiently. AL dynamically queries candidate samples for labeling so as to maximize the model’s performance within a limited labeling budget. Recent advances in AL on graphs have proven beneficial on various benchmark datasets, such as citation graphs and gene networks.

However, there has been little research into AL approaches for large-scale imbalanced settings (for example, detecting a small fraction of false reviews on an e-commerce website). This motivates research into how to query the most “informative” data in order to lower the training cost of GNNs and mitigate the effect of imbalance.

It’s not easy to train GNNs with AL on imbalanced graphs. Because positive samples are rare, standard AL methods are unlikely to select them and therefore fail to learn the full data distribution. Finding abusive reviews on a shopping website, for example, can be modeled as a binary classification problem in which positive samples (i.e., abusive reviews) account for a very small proportion of the labeled data.

When an AL model is trained to sample reviews for labeling, it will mostly return non-abusive reviews, so the newly labeled data yields only a modest improvement in model performance. Most AL sampling strategies that balance the class distribution come from natural language processing and computer vision, and they assume independent and identically distributed data. Because graph data has a rich relational structure with extensive links between samples, these methods are not directly applicable to it.

Building an AL method for large-scale graph data is also difficult. Popular social media platforms (such as Facebook and Snapchat) have hundreds of millions of monthly active users, while online e-commerce sites (such as Amazon and Walmart) contain millions of products and process billions of transactions. At this scale, searching through all of the unlabeled samples in the graph is impractical, as AL techniques’ computational complexity grows exponentially with the size of the unlabeled set. As a result, reducing the search space for AL algorithms on large-scale graphs is crucial.

To address these two issues, Amazon researchers propose Active Learning on Large-scale ImbalancEd graphs (ALLIE), which combines AL on graphs with reinforcement learning for accurate and efficient node classification. Using several uncertainty measures as criteria, ALLIE can pick informative unlabeled samples for labeling. Furthermore, the method prioritizes the classification of less confident and “under-represented” samples.
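To make the idea of uncertainty-based selection concrete, here is a minimal sketch. It is not ALLIE itself: ALLIE feeds such measures to a learned policy network, whereas this example simply ranks nodes by entropy. The node ids, probabilities, and function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def margin(probs):
    """Gap between the two most likely classes (smaller = less certain)."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]

def select_for_labeling(predictions, budget):
    """Return the ids of the `budget` unlabeled nodes with the highest entropy."""
    ranked = sorted(predictions, key=lambda node: entropy(predictions[node]), reverse=True)
    return ranked[:budget]

# Hypothetical classifier outputs [P(benign), P(abusive)] for four unlabeled nodes.
predictions = {
    "n1": [0.99, 0.01],
    "n2": [0.55, 0.45],
    "n3": [0.80, 0.20],
    "n4": [0.50, 0.50],
}

queried = select_for_labeling(predictions, budget=2)
print(queried)  # the two nodes the classifier is least sure about: ['n4', 'n2']
```

A fixed ranking like this tends to ignore class imbalance, which is the gap the learned, imbalance-aware policy is meant to close.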


To scale the approach to very large graphs, the researchers add a graph coarsening mechanism to ALLIE that groups similar nodes into clusters. Because each cluster provides a more compact representation of its nodes, the search space for the AL algorithm shrinks. According to the authors, this is the first study to address the imbalance problem in active learning on large-scale graphs.
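The effect of coarsening on the search space can be illustrated with a small sketch. The clustering below is a simple breadth-first grouping chosen for readability, not the coarsening algorithm used in the paper, and the toy graph, the `max_cluster_size` parameter, and the highest-degree representative rule are all assumptions.

```python
from collections import deque

def coarsen(adjacency, max_cluster_size):
    """Greedily group neighboring nodes into clusters of bounded size via BFS."""
    assigned = {}   # node -> cluster index
    clusters = []   # list of member lists
    for start in adjacency:
        if start in assigned:
            continue
        cluster_id = len(clusters)
        members = []
        queue = deque([start])
        assigned[start] = cluster_id
        while queue:
            node = queue.popleft()
            members.append(node)
            for neighbor in adjacency[node]:
                # Only claim a neighbor if the cluster still has room for it.
                if neighbor not in assigned and len(members) + len(queue) < max_cluster_size:
                    assigned[neighbor] = cluster_id
                    queue.append(neighbor)
        clusters.append(members)
    return clusters

# Toy graph: two triangles joined by the edge c-d (hypothetical data).
adjacency = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e", "f"],
    "e": ["d", "f"],
    "f": ["d", "e"],
}

clusters = coarsen(adjacency, max_cluster_size=3)
# One candidate per cluster: here, the member with the most neighbors.
representatives = [max(members, key=lambda n: len(adjacency[n])) for members in clusters]
print(clusters)         # [['a', 'b', 'c'], ['d', 'e', 'f']]
print(representatives)  # ['c', 'd']
```

In this toy case a selection step would consider two cluster-level candidates instead of six individual nodes.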

The team’s contributions are as follows:

Imbalance-aware reinforcement learning-based graph policy network: The team uses reinforcement learning to find a representative subset of the unlabeled dataset by maximizing the classifier’s performance. The queried nodes are more representative of the minority class.

Graph coarsening strategy to handle large-scale graph data: Existing approaches rarely consider scalability, making them inefficient in real-world scenarios. The researchers use graph coarsening to shrink the action space of the policy network and thereby reduce running time.

Robust learning for more accurate node classification: The researchers build a node classifier with focal loss, which down-weights well-classified samples, unlike traditional approaches that do not distinguish between majority and minority classes when optimizing the objective function.
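The down-weighting effect of a focal-style loss can be seen in a few lines. This is a sketch of the standard binary focal loss, not the paper's exact classifier; the values gamma=2.0 and alpha=0.25 are common defaults assumed here, not settings reported for ALLIE.

```python
import math

def cross_entropy(prob_positive, label):
    """Plain binary cross-entropy for one sample."""
    p_t = prob_positive if label == 1 else 1.0 - prob_positive
    return -math.log(max(p_t, 1e-12))

def focal_loss(prob_positive, label, gamma=2.0, alpha=0.25):
    """Binary focal loss: cross-entropy scaled by alpha_t * (1 - p_t) ** gamma."""
    p_t = prob_positive if label == 1 else 1.0 - prob_positive
    alpha_t = alpha if label == 1 else 1.0 - alpha
    return alpha_t * (1.0 - p_t) ** gamma * cross_entropy(prob_positive, label)

# A well-classified positive (p_t = 0.95) versus a hard positive (p_t = 0.30).
easy_ratio = focal_loss(0.95, 1) / cross_entropy(0.95, 1)
hard_ratio = focal_loss(0.30, 1) / cross_entropy(0.30, 1)
print(easy_ratio, hard_ratio)
```

The easy sample keeps a far smaller share of its cross-entropy loss than the hard one, so training gradients are dominated by the samples the classifier still gets wrong.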

ALLIE was tested on both balanced and imbalanced datasets. The balanced datasets are based on publicly available citation graphs, while the imbalanced dataset comes from a private e-commerce site. The researchers report node classification performance on both kinds of data.

According to the results, ALLIE improved by an average of 2.39 percent in Macro F1 and 2.71 percent in Micro F1 over the best baseline on the balanced graph datasets. On the e-commerce website dataset, ALLIE improved performance on the positive classes (abusive users and reviews) by an average of 4.75 percent in Precision, 1.96 percent in Recall, and 3.45 percent in F1 over the best baseline (relative improvements of 10.54 percent, 3.7 percent, and 7.71 percent, respectively). The team also carried out a detailed ablation study to highlight the importance of each component of ALLIE. According to additional tests, ALLIE outperforms baselines across a variety of initial training set sizes and query budgets.
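For readers unfamiliar with the two F1 variants, the sketch below shows how they diverge under class imbalance, which helps explain why positive-class metrics are reported separately for the imbalanced dataset. The labels are made up for illustration and are unrelated to the paper's data.

```python
def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label classification."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    total_tp = total_fp = total_fn = 0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    macro = sum(per_class) / len(per_class)      # every class counts equally
    micro_denom = 2 * total_tp + total_fp + total_fn
    micro = 2 * total_tp / micro_denom if micro_denom else 0.0  # every sample counts equally
    return macro, micro

# Hypothetical imbalanced case: 9 benign samples, 1 abusive, and a model that
# predicts "benign" for everything.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
macro_f1, micro_f1 = f1_scores(y_true, y_pred)
print(round(macro_f1, 4), round(micro_f1, 4))  # 0.4737 0.9
```

Micro F1 looks strong because the majority class dominates the count, while Macro F1 exposes that the minority class is never detected.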


In a recent study, Amazon researchers present ALLIE, a novel active learning framework for large-scale imbalanced graphs. ALLIE uses a graph policy network to query candidate nodes for labeling by maximizing the GNN classifier’s long-term performance. Compared with numerous state-of-the-art approaches, ALLIE can better deal with an imbalanced data distribution thanks to two balancing mechanisms. ALLIE also has a graph coarsening module, making it scalable for large-scale applications. ALLIE’s high performance is demonstrated by experiments on three benchmark datasets and a real-world retail website dataset.