Look at the images above. Can you tell the difference? It is like trying to tell twins apart. Maybe one has slightly shorter hair? Or does he? Computer vision systems face a similar problem. This research focuses on geometric vision tasks such as 3D reconstruction, where methods frequently struggle to decide whether two images depict the same 3D surface in the real world or two distinct surfaces that look strikingly similar. Getting this wrong can produce erroneous 3D models. This task is called “visual disambiguation”.
To tackle it, researchers at Cornell created a novel dataset called “Doppelgangers,” which comprises pairs of images that either show the same surface (positives) or two distinct yet visually similar surfaces (negatives). Constructing the Doppelgangers dataset was itself challenging, since even humans can struggle to tell such pairs apart. The approach leverages existing image annotations in the Wikimedia Commons image database to automatically generate a substantial set of labelled image pairs.
We can summarise the pipeline shown in the image above as follows:
(a) Given a pair of images, keypoints and matches are extracted with feature-matching methods. Note that in this example the images form a negative pair (doppelgangers) showing opposite sides of the Arc de Triomphe. The feature matches are concentrated in the upper part of the structure, which is dominated by repetitive elements, rather than in the lower section featuring the sculptures.
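To make this matching step concrete, here is a minimal NumPy sketch of nearest-neighbour descriptor matching with a Lowe-style ratio test. The function name, the toy random descriptors, and the 0.75 ratio are illustrative assumptions; the paper relies on standard feature-matching pipelines rather than this hand-rolled matcher.

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.75):
    """Nearest-neighbour matching with a ratio test.

    desc_a, desc_b: (N, D) and (M, D) arrays of local feature
    descriptors. Returns a list of (i, j) index pairs.
    """
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        order = np.argsort(dists)
        best, second = order[0], order[1]
        # Keep the match only if the best distance is clearly
        # smaller than the runner-up (ratio test).
        if dists[best] < ratio * dists[second]:
            matches.append((i, int(best)))
    return matches

# Toy example: random descriptors plus one planted near-duplicate.
rng = np.random.default_rng(0)
desc_a = rng.normal(size=(5, 128))
desc_b = rng.normal(size=(5, 128))
desc_b[2] = desc_a[0] + 0.01 * rng.normal(size=128)  # near-duplicate of desc_a[0]
print(match_descriptors(desc_a, desc_b))
```

Random 128-dimensional descriptors are almost equidistant from one another, so only the planted near-duplicate survives the ratio test.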
(b) Binary masks for keypoints and matches are then created. The image pair and the masks are aligned using an affine transformation estimated from the identified matches.
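A rough sketch of this step: rasterise keypoint locations into binary masks and fit a 2D affine transform to the matched coordinates by least squares. Both helpers below are hypothetical stand-ins for illustration, not the paper's implementation.

```python
import numpy as np

def keypoint_mask(points, shape):
    """Rasterise 2D keypoint locations into a binary mask of the given (H, W) shape."""
    mask = np.zeros(shape, dtype=np.uint8)
    for x, y in points:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < shape[0] and 0 <= xi < shape[1]:
            mask[yi, xi] = 1
    return mask

def estimate_affine(src, dst):
    """Least-squares 2D affine transform mapping src points to dst points.

    src, dst: (N, 2) arrays of matched keypoint coordinates.
    Returns a 2x3 matrix A such that dst ~= [x, y, 1] @ A.T.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    X = np.hstack([src, np.ones((len(src), 1))])   # (N, 3) homogeneous coords
    A, *_ = np.linalg.lstsq(X, dst, rcond=None)    # (3, 2) solution
    return A.T                                     # (2, 3) affine matrix

# Toy check: points related by a pure translation of (10, 5).
src = np.array([[0, 0], [4, 0], [0, 3], [4, 3]], dtype=float)
dst = src + np.array([10.0, 5.0])
A = estimate_affine(src, dst)
```

For this translated point set, the recovered matrix is the identity rotation with a (10, 5) translation column.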
(c) The classifier takes the concatenation of the images and binary masks as input and outputs a probability, indicating how likely it is that the given pair constitutes a positive match.
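To make the classifier's input and output concrete, here is a toy stand-in: the aligned images and masks are concatenated channel-wise, then mapped to a single probability by a linear layer and a sigmoid. The real model is a deep network; the shapes, weights, and function name here are purely illustrative.

```python
import numpy as np

def disambiguation_probability(img_a, img_b, mask_a, mask_b, weights, bias):
    """Toy stand-in for the learned classifier.

    Concatenates the aligned image pair and binary masks channel-wise,
    then maps the stack to one probability with a linear layer + sigmoid.
    Only the input/output interface mirrors the paper's classifier.
    """
    # Channel-wise concatenation: (H, W, 3 + 3 + 1 + 1) = (H, W, 8)
    stack = np.concatenate(
        [img_a, img_b, mask_a[..., None], mask_b[..., None]], axis=-1
    )
    logit = float(np.sum(stack * weights) + bias)
    return 1.0 / (1.0 + np.exp(-logit))     # sigmoid -> probability

# Toy inputs: two random 8x8 RGB images, empty masks, zero weights.
H, W = 8, 8
rng = np.random.default_rng(1)
img_a, img_b = rng.random((H, W, 3)), rng.random((H, W, 3))
mask_a, mask_b = np.zeros((H, W)), np.zeros((H, W))
weights, bias = np.zeros((H, W, 8)), 0.0
p = disambiguation_probability(img_a, img_b, mask_a, mask_b, weights, bias)
```

With zero weights the sigmoid sits at 0.5, i.e. the untrained toy model is maximally uncertain; training would push positives toward 1 and doppelgangers toward 0.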
However, training a deep network directly on the raw image pairs yielded unsatisfactory results. To address this, a specialized network architecture was designed that incorporates valuable information in the form of local features and 2D correspondences, improving performance on the visual disambiguation task.
On the Doppelgangers test set, the proposed method handles difficult disambiguation cases and outperforms both baseline approaches and alternative network designs by a significant margin. The study also shows that the learned classifier can serve as a simple pre-processing filter on scene graphs in structure-from-motion pipelines such as COLMAP.
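The filtering idea can be sketched as pruning scene-graph edges whose classifier probability falls below a cutoff before reconstruction. The function below and the 0.8 threshold are hypothetical illustrations, not COLMAP's API or the paper's actual setting.

```python
def filter_scene_graph(edges, classify, threshold=0.8):
    """Drop image-pair edges the disambiguation classifier deems
    likely doppelgangers before running SfM (e.g. COLMAP).

    edges: iterable of (img_i, img_j) pairs. classify(i, j) returns the
    probability that the pair shows the same physical surface.
    `threshold` is a hypothetical cutoff chosen for illustration.
    """
    return [(i, j) for i, j in edges if classify(i, j) >= threshold]

# Toy usage with a stubbed classifier: drop the likely-doppelganger edge.
probs = {("a", "b"): 0.95, ("a", "c"): 0.10, ("b", "c"): 0.85}
kept = filter_scene_graph(probs.keys(), lambda i, j: probs[(i, j)])
print(kept)  # [('a', 'b'), ('b', 'c')]
```

Pruning the low-probability edge ("a", "c") prevents the SfM pipeline from fusing two look-alike surfaces into one incorrect model.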
Overall, these findings highlight the potential of this approach to improve the reliability and precision of computer vision systems in tasks related to 3D reconstruction and visual disambiguation. This research contributes valuable insights and tools to the field of computer vision, with promising applications in real-world scenarios requiring accurate surface recognition and reconstruction.
Check out the Paper and Project. All credit for this research goes to the researchers on this project.
Janhavi Lande is an Engineering Physics graduate from IIT Guwahati, class of 2023. She is an upcoming data scientist and has been working in ML/AI research for the past two years. She is most fascinated by this ever-changing world and its constant demand for humans to keep up with it. In her pastime she enjoys traveling, reading, and writing poems.