This Artificial Intelligence Paper Proposes ‘SuperGlue,’ A Graph Neural Network That Simultaneously Performs Context Aggregation, Matching, And Filtering of Local Features for Wide-Baseline Pose Estimation

Imagine you have two pictures of the same scene taken from different angles. Most of the objects appear in both pictures; you just see them from different viewpoints. In computer vision, objects are described by local features such as edges and corners, and matching these features is critical for many applications. But what would it take to match features between two pictures?

Finding correspondences between images is a prerequisite for estimating 3D structure and camera poses in computer vision tasks such as simultaneous localization and mapping (SLAM) and structure-from-motion (SfM). This is done by matching local features, which is tricky because of changes in lighting conditions, occlusion, blur, and similar effects.

Traditionally, these pipelines are split into two parts. The front end extracts visual features from the images. The back end performs bundle adjustment and pose estimation. Feature matching sits between the two: once the features have been extracted, matching them is classically modeled as a linear assignment problem, and the resulting correspondences are handed to the back end.

As in many other domains, deep neural networks have played a crucial role in feature matching in recent years. They have been used to learn better sparse detectors and local descriptors from data using convolutional neural networks (CNNs).

However, these networks were usually one component of the feature matching pipeline rather than a complete matching solution. What if a single neural network could perform context aggregation, matching, and filtering in one architecture? Time to introduce SuperGlue.

Figure: SuperGlue is a middle-end matcher.

SuperGlue approaches feature matching differently. It learns the matching process from pre-existing local features using a graph neural network. This contrasts with existing approaches, which first learn task-agnostic features and then match them using heuristics and simple methods. Because the matcher itself is learned end to end, SuperGlue has a strong advantage over these methods. It is a learnable middle-end that sits between feature extraction and pose estimation, so it can be used to improve existing pipelines.
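To make the "heuristics and simple methods" concrete, here is a minimal sketch of one common hand-crafted baseline: mutual nearest-neighbor matching with a ratio test. This is not SuperGlue and not code from the paper. The function names, the Euclidean distance, and the 0.8 ratio threshold are assumptions chosen for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptors given as lists of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_with_ratio(desc, candidates, ratio):
    """Return the index of the nearest candidate, or None if the match is ambiguous.

    The match is rejected when the best distance is not clearly smaller than
    the second-best distance (ratio test).
    """
    dists = sorted((euclidean(desc, c), j) for j, c in enumerate(candidates))
    best_dist, best_j = dists[0]
    if len(dists) > 1 and best_dist > ratio * dists[1][0]:
        return None
    return best_j


def mutual_nn_matches(descs_a, descs_b, ratio=0.8):
    """Keep pairs (i, j) only if i picks j and j picks i (mutual check)."""
    a_to_b = [nearest_with_ratio(d, descs_b, ratio) for d in descs_a]
    b_to_a = [nearest_with_ratio(d, descs_a, ratio) for d in descs_b]
    return [(i, j) for i, j in enumerate(a_to_b)
            if j is not None and b_to_a[j] == i]


# Toy descriptors (hypothetical values): the third descriptor in image A is
# equally close to both descriptors in image B, so the ratio test drops it.
descs_a = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
descs_b = [[0.9, 0.1], [0.1, 0.9]]
matches = mutual_nn_matches(descs_a, descs_b)
```

Each descriptor is matched in isolation here, with no knowledge of the other keypoints in the scene. That missing context is what a learned matcher is meant to supply.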

So how does SuperGlue achieve this? It views the feature-matching problem as a partial assignment between two sets of local features: each keypoint matches at most one keypoint in the other image, and some keypoints have no match at all. The classical linear assignment formulation is relaxed into an optimal transport problem, which can be solved in a differentiable way. SuperGlue uses a graph neural network (GNN) to predict the costs of this transport problem.
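The optimal transport view can be sketched with a few Sinkhorn iterations. The example below is a simplified illustration under stated assumptions, not the authors' implementation. It takes a small, hand-written score matrix in place of the scores a GNN would predict. It adds one extra "dustbin" row and column to absorb unmatched keypoints, and alternately normalizes rows and columns in log space. The function names, the dustbin score of 1.0, and the iteration count are hypothetical choices.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sinkhorn_with_dustbin(scores, dustbin_score=0.0, iterations=100):
    """Soft partial assignment for an M x N score matrix.

    Returns an (M+1) x (N+1) matrix. The last row and column are dustbins
    that collect the mass of keypoints without a match.
    """
    m, n = len(scores), len(scores[0])
    # Augment the scores with a dustbin column and a dustbin row.
    aug = [list(row) + [dustbin_score] for row in scores]
    aug.append([dustbin_score] * (n + 1))
    # Target marginals (in log space): every keypoint carries mass 1,
    # and each dustbin can absorb all keypoints of the other image.
    log_a = [0.0] * m + [math.log(n)]
    log_b = [0.0] * n + [math.log(m)]
    u = [0.0] * (m + 1)
    v = [0.0] * (n + 1)
    for _ in range(iterations):
        # Row normalization, then column normalization.
        u = [log_a[i] - logsumexp([aug[i][j] + v[j] for j in range(n + 1)])
             for i in range(m + 1)]
        v = [log_b[j] - logsumexp([aug[i][j] + u[i] for i in range(m + 1)])
             for j in range(n + 1)]
    return [[math.exp(aug[i][j] + u[i] + v[j]) for j in range(n + 1)]
            for i in range(m + 1)]


# Toy scores (hypothetical): 2 keypoints in image A, 3 in image B.
# The third keypoint in B has no good partner.
scores = [[2.0, 0.0, 0.0],
          [0.0, 2.0, 0.0]]
assignment = sinkhorn_with_dustbin(scores, dustbin_score=1.0)
```

Each real row and column of `assignment` sums to approximately 1, so every keypoint distributes one unit of mass between candidate matches and the dustbin. Every step is a differentiable operation, which is what allows a network that predicts the scores to be trained through the matching step.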

Transformers have achieved massive success in natural language processing and, more recently, in computer vision. SuperGlue uses Transformer-style attention, both within an image (self-attention) and across the two images (cross-attention), to leverage the spatial relationships of the keypoints together with their visual appearance.
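As a rough illustration of attention-based context aggregation, the sketch below computes single-head dot-product attention between two sets of keypoint features. It is deliberately simplified and makes several assumptions: there are no learned projections or MLPs, the features are used directly as queries, keys, and values, and the update is a plain residual addition. The function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_messages(queries, sources):
    """For each query feature, return a weighted average of the source features.

    Weights come from scaled dot products between the query and each source.
    """
    dim = len(queries[0])
    messages = []
    for q in queries:
        logits = [sum(qk * sk for qk, sk in zip(q, s)) / math.sqrt(dim)
                  for s in sources]
        weights = softmax(logits)
        messages.append([sum(w * s[k] for w, s in zip(weights, sources))
                         for k in range(dim)])
    return messages


def attention_layer(feats_a, feats_b, cross):
    """One aggregation step for both images.

    cross=False: each keypoint attends to keypoints in its own image.
    cross=True:  each keypoint attends to keypoints in the other image.
    """
    src_a, src_b = (feats_b, feats_a) if cross else (feats_a, feats_b)
    msg_a = attention_messages(feats_a, src_a)
    msg_b = attention_messages(feats_b, src_b)
    # Simplified residual update (assumption: no learned MLP here).
    new_a = [[x + m for x, m in zip(f, msg)] for f, msg in zip(feats_a, msg_a)]
    new_b = [[x + m for x, m in zip(f, msg)] for f, msg in zip(feats_b, msg_b)]
    return new_a, new_b


# Toy features (hypothetical): alternate a self step and a cross step.
feats_a = [[1.0, 0.0], [0.0, 1.0]]
feats_b = [[1.0, 1.0]]
feats_a, feats_b = attention_layer(feats_a, feats_b, cross=False)
feats_a, feats_b = attention_layer(feats_a, feats_b, cross=True)
```

After a cross step, each keypoint's feature depends on the keypoints in the other image, so the final features carry information about both views before any matching scores are computed.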

SuperGlue is trained end to end, using image pairs as training data. Priors for pose estimation are learned from a large labeled dataset, which lets SuperGlue reason about the underlying 3D scene.

SuperGlue can be applied to many multiple-view geometry problems that require high-quality feature correspondences. According to the paper, it runs in real time on a modern GPU and works with both classical and learned features. You can find more information about SuperGlue at the links below.

Check out the paper, project, and code. All credit for this research goes to the researchers on this project.

Ekrem Çetinkaya received his B.Sc. in 2018, and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis about image denoising using deep convolutional networks. He received his Ph.D. degree in 2023 from the University of Klagenfurt, Austria, with his dissertation titled "Video Coding Enhancements for HTTP Adaptive Streaming Using Machine Learning." His research interests include deep learning, computer vision, video encoding, and multimedia networking.