In a paper published at NeurIPS 2020, Google AI researchers proposed TracIn, a simple, scalable approach to estimating training-data influence. The quality of a machine learning model's training data can significantly affect its performance. Influence, the degree to which a given training example affects the model and its predictive performance, is a useful measure of data quality. Although a few methods have been proposed recently to quantify influence, their use in products has been limited by the resources needed to run them at scale or by the additional burdens they place on training. TracIn, by contrast, traces the training process to capture changes in prediction as the optimizer visits individual training examples.
Deep learning models are usually trained with the stochastic gradient descent (SGD) algorithm, which makes multiple passes over the data and modifies the model parameters to locally reduce the loss with each visit. TracIn effectively finds mislabeled examples and outliers in various datasets and helps explain predictions in terms of training examples by assigning an influence score to each training example.
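The local loss reduction that SGD performs on each visited example can be sketched as follows. This is a minimal illustration, not code from the paper; the linear model, squared loss, and data are hypothetical:

```python
import numpy as np

def sgd_step(w, x, y, lr=0.1):
    """One SGD update for a linear model with squared loss.

    Loss: 0.5 * (w.x - y)^2; gradient w.r.t. w: (w.x - y) * x.
    """
    error = np.dot(w, x) - y
    grad = error * x
    return w - lr * grad  # move parameters against the gradient

# Visiting a single training example locally reduces its loss.
w = np.zeros(2)
x, y = np.array([1.0, 2.0]), 3.0
loss_before = 0.5 * (np.dot(w, x) - y) ** 2
w = sgd_step(w, x, y)
loss_after = 0.5 * (np.dot(w, x) - y) ** 2
```

TracIn's key idea is to attribute exactly these per-visit loss changes back to the training examples that caused them.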
The researchers describe two types of relevant training examples: proponents, which reduce the loss on a test example, and opponents, which increase it. Computing influence exactly is impractical because the test examples are unknown at training time and the learning algorithm visits many training points at once. The TracIn method overcomes these limitations by using the checkpoints output by the learning algorithm as a sketch of the training process and applying pointwise loss gradients. The TracIn score reduces to the dot product of the loss gradients of the test and training examples, weighted by the learning rate and summed across checkpoints. Alternatively, if the test example has no label, the influence on the prediction score can be examined instead.
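The checkpoint-based score described above can be sketched directly. This is a simplified illustration with hypothetical two-dimensional gradient vectors; in practice the gradients come from a real model's checkpoints:

```python
import numpy as np

def tracin_influence(train_grads, test_grads, lrs):
    """TracIn influence of one training example on one test example:
    the learning-rate-weighted dot product of their loss gradients,
    summed over the saved checkpoints."""
    return sum(lr * np.dot(g_tr, g_te)
               for lr, g_tr, g_te in zip(lrs, train_grads, test_grads))

# Hypothetical loss gradients at three checkpoints (2-D parameter space).
train_grads = [np.array([1.0, 0.0]), np.array([0.5, 0.5]), np.array([0.2, 0.1])]
test_grads  = [np.array([0.8, 0.2]), np.array([0.4, 0.4]), np.array([0.1, 0.3])]
lrs = [0.1, 0.1, 0.05]  # learning rate in effect at each checkpoint

score = tracin_influence(train_grads, test_grads, lrs)
# A positive score marks a proponent; a negative score marks an opponent.
```

For an unlabeled test example, the same computation can be run with the gradient of the prediction score in place of the loss gradient.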
The researchers demonstrate the utility of TracIn by calculating the loss gradient vectors for the training data and a test sample in a specific classification task and then leveraging a standard k-nearest-neighbor library to retrieve proponents and opponents. The breakdown of a test example's loss into the influences of individual training examples suggests that the loss of any gradient-descent-trained neural model can be viewed as a sum of similarities in gradient space. TracIn can thus be used as a similarity function within a clustering algorithm.
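Because the single-checkpoint score is just a weighted dot product, retrieving proponents and opponents amounts to a nearest-neighbor search in gradient space. A minimal sketch with hypothetical gradients, using plain NumPy in place of a dedicated k-NN library:

```python
import numpy as np

# Hypothetical loss gradients at one checkpoint, flattened to vectors.
train_grads = np.array([
    [ 0.9,  0.1],   # example 0
    [-0.7,  0.3],   # example 1
    [ 0.2, -0.8],   # example 2
])
test_grad = np.array([1.0, 0.0])
lr = 0.1

# Single-checkpoint TracIn score for every training example at once.
scores = lr * train_grads @ test_grad
order = np.argsort(-scores)            # descending influence
proponents = [i for i in order if scores[i] > 0]
opponents  = [i for i in order if scores[i] < 0]
```

With multiple checkpoints, the per-checkpoint gradients can be concatenated (each scaled by the square root of its learning rate) so that an off-the-shelf inner-product nearest-neighbor library recovers the same ranking.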
TracIn can also be used to identify outliers, which exhibit high self-influence, that is, a large influence of a training example on its own loss.
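Self-influence is the TracIn score of an example paired with itself, which reduces to the learning-rate-weighted squared norm of its own loss gradient. A minimal single-checkpoint sketch with hypothetical gradients:

```python
import numpy as np

# Hypothetical per-example loss gradients at one checkpoint.
grads = np.array([
    [0.1, 0.2],
    [0.2, 0.1],
    [3.0, 4.0],   # outlier: its loss stays high, so its gradient stays large
])
lr = 0.1

# Self-influence per example: lr * ||grad||^2, summed over checkpoints
# (one checkpoint here).
self_influence = lr * np.sum(grads * grads, axis=1)

# Flag the example with the highest self-influence as a candidate outlier.
outliers = np.argsort(-self_influence)[:1]
```

Intuitively, a mislabeled or rare example keeps incurring a large gradient every time it is visited, so its self-influence accumulates faster than that of well-fit examples.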
TracIn is task-independent and can be applied to a variety of models; its only requirement is that the model be trained with SGD. TracIn is, in fact, a relatively easy-to-implement, scalable method for computing the influence of training examples on individual predictions and for finding rare and mislabeled training examples. Code examples can be found via the GitHub link in the paper.