Researchers at ETH Zurich & Microsoft Introduce ‘PixLoc’: A Neural Network For Feature Alignment With A 3D Model Of The Environment


Estimating the camera pose of an image in a known scene is a classic 3D geometry task that many learning-based algorithms have recently tackled. Most of these techniques train networks to regress geometric quantities such as poses or 3D points directly. While this can be precise enough for the task, there is no guarantee that such models generalize beyond the scenes they were trained on.

Researchers at ETH Zurich & Microsoft have developed an end-to-end solution for camera pose estimation. Rather than teaching a deep network to regress geometric quantities or to encode a 3D map, as previous approaches did, the team goes Back to the Feature: they show that learning robust and generic features is sufficient for accurate localization when combined with classical image alignment against existing 3D maps. The resulting trainable algorithm, ‘PixLoc,’ localizes images in 3D with the help of a CNN (convolutional neural network). In other words, PixLoc is a scene-agnostic neural network that estimates an accurate 6-DoF pose from an image and a 3D model.


Because the pose is recovered by classical geometric optimization, the network does not need to learn pose regression itself. Instead, it only has to extract features suited to alignment, which makes it accurate in any scene. PixLoc is trained end-to-end, from pixels to pose: the team unrolls the direct alignment and supervises only the resulting poses.
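To make the idea of direct feature alignment concrete, here is a minimal sketch in NumPy. It is not the authors' implementation: the "feature map" is a smooth analytic function standing in for CNN features, and the pose is simplified to a 2D image-space translation rather than a full 6-DoF pose, but the Gauss-Newton loop minimizing a feature-metric residual is the same mechanism the article describes.

```python
import numpy as np

def feat(p):
    # Smooth 2-channel "feature map", an analytic stand-in for CNN features.
    x, y = p[..., 0], p[..., 1]
    return np.stack([np.sin(0.1 * x) + np.cos(0.07 * y),
                     np.cos(0.05 * x) * np.sin(0.09 * y)], axis=-1)

def feat_jac(p):
    # Analytic Jacobian d feat / d p, shape (..., 2 channels, 2 coords).
    x, y = p[..., 0], p[..., 1]
    J = np.empty(p.shape[:-1] + (2, 2))
    J[..., 0, 0] = 0.1 * np.cos(0.1 * x)
    J[..., 0, 1] = -0.07 * np.sin(0.07 * y)
    J[..., 1, 0] = -0.05 * np.sin(0.05 * x) * np.sin(0.09 * y)
    J[..., 1, 1] = 0.09 * np.cos(0.05 * x) * np.cos(0.09 * y)
    return J

def align(points, target, t0, iters=50):
    # Gauss-Newton on the feature-metric residual r_i = F(p_i + t) - f_i:
    # adjust the translation t until query features match the reference ones.
    t = t0.astype(float).copy()
    for _ in range(iters):
        p = points + t
        r = (feat(p) - target).reshape(-1)      # stacked residuals, (N*2,)
        J = feat_jac(p).reshape(-1, 2)          # stacked Jacobians, (N*2, 2)
        delta = np.linalg.solve(J.T @ J + 1e-9 * np.eye(2), -J.T @ r)
        t += delta
        if np.linalg.norm(delta) < 1e-10:
            break
    return t

rng = np.random.default_rng(0)
points = rng.uniform(0, 50, size=(200, 2))      # projected 3D points (assumed)
t_true = np.array([2.0, -3.0])                  # ground-truth offset
target = feat(points + t_true)                  # reference features
t_est = align(points, target, t0=np.zeros(2))
print(np.round(t_est, 4))                       # converges near [2, -3]
```

In PixLoc, the same least-squares machinery runs over a full 6-DoF pose with features produced by a CNN, and the whole pipeline is differentiated through the unrolled iterations so that only a pose loss is needed for training.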

The proposed formulation yields simple yet accurate localization models that can compete with more complex state-of-the-art approaches trained per scene. PixLoc can also serve as a lightweight post-processing step that refines the poses estimated by any existing system.

According to the researchers, PixLoc ‘is the first end-to-end trainable approach capable of being deployed into new scenes widely differing from its training data without retraining or finetuning.’