Researchers From Stanford Introduce Disruptive Attention Consistency Method to Catapult Computer Vision Performance with Limited Datasets

In many real-world computer vision problems, such as healthcare, labeled training data can be scarce, leading to the development of machine learning models that learn partially incorrect representations and overfit to their training set. This can create a challenge for researchers working with small datasets, who need to ensure that the learned representations match human understanding, are generalizable to unseen data, and are not biased. In healthcare, interpretability is especially important to justify predictions and decisions.

To evaluate the representations learned by a model, attention or attribution map methods can be used to highlight regions in the input signal that are discriminative for the model’s predictions. These methods have become one of the main methods for analyzing the interpretability of neural networks and verifying that the model did not leverage bias present in the data. However, for a poorly optimized machine learning model, attention map results from different attention computation methods can vary greatly. More generally, attention maps become more dissimilar when the machine learning model has been poorly optimized or the task becomes more challenging, which increases the chance of overfitting.

To address this challenge, Stanford researchers proposed a method that enforces consistency between attention maps computed using different methods, resulting in improved representations learned by the model and increased classification performance on unseen data. Specifically, an attention consistency loss function is designed for two state-of-the-art attention map methods: Grad-CAM and Guided Backpropagation. The loss function is defined as the negative sum of the correlation between the attention maps, with a low loss indicating that the attention maps highlight similar regions of the input. Unsupervised training is possible, as the loss function does not require any training labels.

The proposed method, called ATCON, is shown to improve not only the quality of the attention maps but also the classification performance. Results are demonstrated in video clip event classification with a dataset curated for this project, consisting of clips extracted from continuous video recordings of hospital patients in their rooms. Improvement is also shown in image classification with PASCAL VOC and SVHN when the size of the training set is reduced. Attention consistency improves the quality of attention maps, as shown through qualitative analysis on the video dataset and quantitative analysis on PASCAL by computing the overlap between thresholded attention maps and ground truth bounding boxes. The benefits of the method are demonstrated for multiple network architectures: ResNet 50, Inception-v3, and a 3D 18 layers ResNet.

The method is compared with baselines, including layer attention consistency and few-shot learning multi-label classification. For the video dataset, the proposed method is shown to be able to leverage the state-of-the-art self-supervised method SimCLR to further boost performance. 

Below, a Figure from the authors shows the qualitative results of the proposed method compared to previous state-of-the-art.

Figure 1: Demonstration of the attention consistency method. Comparison between a baseline machine learning model and the same model plus the proposed unsupervised attention consistency fine tuning (ATCON). The first column indicates the ground truth (GT) label, the label predicted by the baseline, and by the proposed method. The second column shows the input frames. The two middle columns show Grad-CAM and Guided Backpropagation (GB) for the baseline, and the last two columns show the same attention maps for the proposed method. 

The improved attention maps may help end users better understand model predictions, and ease the deployment of machine learning systems into the real world. Overall, the proposed method addresses the challenge of learning correct representations with limited labeled data and ensures that attention maps are consistent, making them a valuable tool for evaluating the representations of machine learning models.

Check out the Paper, Github and video. All Credit For This Research Goes To Florian Dubost and Ali Mirzazadeh, Stanford researchers, and their collaborators Maxwell Pike, Krish Maniar, Max Zuo, Christopher Lee-Messer and Daniel Rubin at Stanford and Georgia Tech.

Jean-marc is a successful AI business executive .He leads and accelerates growth for AI powered solutions and started a computer vision company in 2006. He is a recognized speaker at AI conferences and has an MBA from Stanford.

↗ Step by Step Tutorial on 'How to Build LLM Apps that can See Hear Speak'