Researchers From China Introduce DeepMatcher: A Deep Transformer-based Network for Robust and Accurate Local Feature Matching

Several popular geometric computer vision systems, such as Simultaneous Localization and Mapping (SLAM) and Structure-from-Motion (SfM), rely on local feature matching to function. Detector-based matching is widely understood to proceed in two steps:

  1. Detecting and describing a set of sparse key points using a technique such as SIFT, ORB, or a learning-based equivalent
  2. Establishing point-to-point correspondences using nearest-neighbor search or more advanced matching algorithms.

The matching search space is reduced when a feature detector is used, which explains the general effectiveness of detector-based matching. However, such a pipeline struggles to construct trustworthy correspondences for image pairs with significant viewpoint changes. The main reason is that the detectors cannot extract repeatable keypoints in such scenarios.

Many studies have tried to create correspondences directly from the original images by extracting visual descriptors on dense grids across an image. While researchers want to build a deep local feature matcher for such detector-free approaches, studies highlight the following issues standing in the way:

  1. A convolutional neural network (CNN) is often used as the foundational feature extractor in detector-free approaches, followed by Transformer layers that capture long-range dependencies for creating trustworthy correspondences. Deep feature interaction in the later stages suffers from the gap between the Transformer's global receptive field and the CNN's local neighborhood. 
  2. Conflicts arise in scenes with recurrent geometric patterns or symmetric structures due to the CNN's translation invariance. To address this problem, conventional detector-free techniques apply absolute position encodings before the Transformer layers. Nevertheless, this position information fades as the Transformer stack deepens. 
  3. These studies suggest that network depth matters more than network width for matching performance. 

A new study supported by the National Natural Science Foundation of China introduces DeepMatcher. This deep local feature-matching network generates features that are more human-intuitive and easier to match, yielding accurate correspondences with reduced computational complexity. 

Initially, the researchers used a convolutional neural network (CNN) to produce pixel tokens with enhanced properties. Then they applied a Feature Transition Module (FTM) to help bridge the gap between CNN’s locally aggregated feature extraction and Transformer’s global receptive field feature extraction. They constructed a deep network using a Slimming Transformer (SlimFormer) that improves long-range global context modeling within and across images. 
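The front of this pipeline can be sketched in PyTorch. The module name below follows the paper, but its internals and the stand-in backbone are simplified assumptions, not the authors' implementation; the point is only how CNN feature maps become pixel tokens for the Transformer stage.

```python
import torch
import torch.nn as nn

class FeatureTransitionModule(nn.Module):
    """Hypothetical FTM sketch: smooth the CNN's locally aggregated
    feature map before flattening it into Transformer-ready tokens."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.GELU())

    def forward(self, fmap):                     # fmap: (B, C, H, W)
        fmap = self.proj(fmap)
        return fmap.flatten(2).transpose(1, 2)   # tokens: (B, H*W, C)

# Stand-in CNN extractor producing a 1/8-resolution feature map.
backbone = nn.Sequential(
    nn.Conv2d(1, 64, 7, stride=8, padding=3), nn.ReLU(),
)
ftm = FeatureTransitionModule(64)

img = torch.randn(1, 1, 256, 256)                # one grayscale image
tokens = ftm(backbone(img))                      # (1, 32*32, 64) pixel tokens
print(tokens.shape)                              # torch.Size([1, 1024, 64])
```

Each token now carries locally aggregated CNN context, which is what the subsequent SlimFormer layers consume for global interaction.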

For robust long-range global context aggregation, SlimFormer uses vector-based attention to handle pixel tokens efficiently with linear complexity. Additionally, each SlimFormer encodes relative position to represent relative distance information, which strengthens the network's representational capacity, especially at greater layer depths. To further mimic human behavior, SlimFormer employs a layer-scale strategy that lets the network adaptively weight the message exchange from the residual block, so the network acquires new matching information each time an image pair is scanned. 
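A minimal sketch of linear-complexity attention with a learnable layer-scale residual, in the spirit of the description above. This is an illustrative variant (a global-context "efficient attention" formulation), not the authors' exact SlimFormer, and the module name is our own label.

```python
import torch
import torch.nn as nn

class SlimAttentionSketch(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # Layer scale: lets the network adaptively weight the residual message.
        self.scale = nn.Parameter(1e-2 * torch.ones(dim))

    def forward(self, x, source):
        # x, source: (B, N, C). Self-attention when source is x;
        # cross-attention when source holds the other image's tokens.
        q = self.q(x).softmax(dim=-1)        # normalize over channels
        k = self.k(source).softmax(dim=1)    # normalize over tokens
        v = self.v(source)
        context = k.transpose(1, 2) @ v      # (B, C, C) summary: O(N) in tokens
        msg = q @ context                    # (B, N, C) per-token message
        return x + self.scale * msg          # layer-scaled residual update

attn = SlimAttentionSketch(64)
x = torch.randn(2, 1024, 64)
y = torch.randn(2, 1024, 64)
out = attn(x, y)                             # cross-image message passing
print(out.shape)                             # torch.Size([2, 1024, 64])
```

The key property is that attention cost grows linearly with the number of tokens (the `(C, C)` context matrix replaces the quadratic token-to-token score matrix), which is what makes stacking many such layers into a deep network affordable.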

DeepMatcher learns discriminative features and builds dense matches at the coarse level using the Coarse Matches Module (CMM), which repeatedly interleaves self- and cross-SlimFormer layers. Finally, the researchers treat match refinement as a hybrid classification/regression problem and develop the Fine Matches Module (FMM) to predict confidence and offset simultaneously, leading to reliable and precise matches.
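Coarse match selection from dense features can be sketched as follows. Detector-free matchers commonly score all token pairs with a dual-softmax and keep mutual nearest neighbors; this is a generic illustration of that idea, and the paper's CMM may differ in detail.

```python
import torch

def coarse_matches(feat0, feat1, temperature=0.1, threshold=0.2):
    # feat0: (N, C), feat1: (M, C) coarse-level descriptors for two images.
    sim = feat0 @ feat1.T / temperature
    conf = sim.softmax(dim=0) * sim.softmax(dim=1)      # dual-softmax confidence
    # Keep mutual nearest neighbors above the confidence threshold.
    mask = (conf == conf.max(dim=1, keepdim=True).values) \
         & (conf == conf.max(dim=0, keepdim=True).values) \
         & (conf > threshold)
    return mask.nonzero()                               # (K, 2) index pairs

torch.manual_seed(0)
f = torch.nn.functional.normalize(torch.randn(100, 64), dim=1)
matches = coarse_matches(f, f)   # identical descriptor sets -> identity matches
print(matches.shape[0])
```

A fine-level stage would then refine each coarse pair, in this case by jointly predicting a match confidence (classification) and a sub-grid positional offset (regression), mirroring the hybrid formulation described above.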

Check out the Paper and Github. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our Reddit Page, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology(IIT), Bhubaneswar. She is a Data Science enthusiast and has a keen interest in the scope of application of artificial intelligence in various fields. She is passionate about exploring the new advancements in technologies and their real-life application.
