3D object reconstruction is a significant computer vision problem with applications in AR/VR, such as telepresence and the generation of 3D models for gaming. Emerging techniques for photorealistic 3D reconstruction could seamlessly mix real objects with virtual ones on ordinary smartphones, laptops, and, eventually, augmented reality glasses. Current 3D reconstruction methods rely on learning models of various object categories, but they are limited by a lack of data sets that pair videos of real-world objects with accurate 3D reconstructions. Because models need such examples to produce adequate reconstructions, researchers have typically fallen back on synthetic objects that only approximate real ones.
Facebook AI has released Common Objects in 3D (CO3D), a large-scale data set of real videos of common object categories with 3D annotations. CO3D contains 1.5 million frames from nearly 19,000 videos, capturing objects from 50 categories of the widely used MS-COCO data set. It is meant to offer greater accuracy and coverage than previous alternatives and to support research in this field.
Facebook AI is also releasing its work on a novel method called NeRFormer. It learns to synthesize images of an object from different viewpoints by observing videos in the CO3D data set (rather than just stills). The method marries two recent machine learning contributions, Transformers and Neural Radiance Fields, and is reported to boost accuracy by up to 17% over its nearest competitors when generating new views of objects.
To gather a large-scale real-life data set of common objects in the wild annotated with 3D shapes, Facebook AI researchers devised a photogrammetric approach requiring only object-centric multiview images. Such data can be effectively gathered by crowdsourcing “turntable” videos captured with consumer smartphones.
To achieve this, they crowdsourced object-centric videos on Amazon Mechanical Turk (AMT). Each AMT task asked a worker to select an object from a given category, place it on a solid surface, and record a video while moving all the way around it. The researchers selected 50 MS-COCO categories comprising stationary objects with well-defined shapes, which are good candidates for successful 3D reconstruction.
COLMAP, a mature photogrammetry framework, supplies the 3D annotations: it tracks the camera in each video and reconstructs a dense point cloud of the object. To ensure high-quality 3D annotations, the researchers use an active learning algorithm to filter out videos whose reconstruction accuracy is too low.
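The camera tracking and dense point cloud described above correspond roughly to COLMAP's standard command-line pipeline. The sketch below only assembles those commands in Python; it is an illustration, not the CO3D team's actual configuration. The function name `colmap_commands`, the workspace layout, and the choice of exhaustive matching are assumptions made for this example.

```python
from pathlib import Path


def colmap_commands(image_dir, workspace):
    """Build the usual COLMAP sparse + dense reconstruction commands.

    Hypothetical helper: the paths and the matcher choice are assumptions,
    not settings reported for CO3D.
    """
    images = str(image_dir)
    ws = Path(workspace)
    database = str(ws / "database.db")
    sparse = ws / "sparse"
    dense = ws / "dense"
    return [
        # 1. Detect local features in every frame.
        ["colmap", "feature_extractor",
         "--database_path", database, "--image_path", images],
        # 2. Match features between frames.
        ["colmap", "exhaustive_matcher", "--database_path", database],
        # 3. Structure from motion: camera poses plus a sparse point cloud.
        ["colmap", "mapper", "--database_path", database,
         "--image_path", images, "--output_path", str(sparse)],
        # 4. Undistort the images for multi-view stereo.
        ["colmap", "image_undistorter", "--image_path", images,
         "--input_path", str(sparse / "0"), "--output_path", str(dense)],
        # 5. Estimate per-image depth maps, then fuse them into a dense cloud.
        ["colmap", "patch_match_stereo", "--workspace_path", str(dense)],
        ["colmap", "stereo_fusion", "--workspace_path", str(dense),
         "--output_path", str(dense / "fused.ply")],
    ]
```

With COLMAP installed and the `sparse` and `dense` directories created, each command could be passed to `subprocess.run(cmd, check=True)` in order. The dense steps generally need a CUDA-capable GPU.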
Apart from releasing the CO3D data set, Facebook AI also proposes NeRFormer, a novel deep architecture that learns by differentiably rendering a neural radiance field (NeRF) of the object. The properties of the field are predicted by analyzing the video content while marching along the rays used to render it. Thus, once NeRFormer has learned the common structure of a category, it can synthesize new views of an unseen object given only its known views.
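To make "marching along the rays" concrete, here is a minimal sketch of the standard NeRF volume-rendering rule for a single ray. It is not NeRFormer itself: the densities and colors are hard-coded inputs rather than Transformer predictions, and the function name `render_ray` is an assumption for illustration.

```python
import math


def render_ray(densities, colors, deltas):
    """Composite samples along one ray, front to back.

    densities: volume density (sigma) at each sample
    colors:    RGB triple at each sample
    deltas:    distance between consecutive samples
    Returns the rendered pixel color and the per-sample weights.
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha          # contribution to the pixel
        weights.append(weight)
        for k in range(3):
            pixel[k] += weight * rgb[k]
        transmittance *= 1.0 - alpha            # light left for later samples
    return pixel, weights


# Empty space, then an opaque red surface that hides a green sample behind it.
pixel, weights = render_ray(
    densities=[0.0, 1e6, 5.0],
    colors=[(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    deltas=[0.1, 0.1, 0.1],
)
```

In this example the pixel comes out red: the first sample has zero density and contributes nothing, and the opaque second sample absorbs all remaining light. Every step is differentiable with respect to the densities and colors, which is what lets a network producing them be trained from the rendering error alone.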
The CO3D data set is the first of its kind, and it is already making an impact on real-life 3D object reconstruction. It provides the training data NeRFormer needs to tackle new-view synthesis (NVS). Photorealistic NVS brings the researchers one step closer to fully immersive AR/VR effects, in which people can connect across environments by sharing or recollecting experiences virtually.