3D asset creation for augmented reality (AR), virtual reality (VR), robotics, and gaming faces a common challenge. 3D diffusion models, which simplify the otherwise complex asset creation process, have surged in popularity, but they come with a hitch: they require ground-truth 3D models or point clouds for training, which are hard to obtain for real images. Moreover, latent 3D diffusion approaches often end up with a latent space that is complex and difficult to denoise on diverse 3D datasets, which makes high-quality rendering a hurdle.
Existing solutions tackle this challenge but often demand substantial manual work and lengthy optimization. A team of researchers from Adobe Research and Stanford has been working to make 3D generation faster, more realistic, and more general. Their recent paper introduces DMV3D, a single-stage, category-agnostic diffusion model. It can generate 3D Neural Radiance Fields (NeRFs) from either a text prompt or a single input image through direct model inference, significantly cutting the time needed to create 3D objects.
The key contributions of DMV3D are threefold. First, a single-stage diffusion framework that uses a multi-view 2D image diffusion model for 3D generation. Second, a multi-view denoiser based on the Large Reconstruction Model (LRM) that reconstructs noise-free triplane NeRFs from noisy multi-view images. Third, a general probabilistic approach to high-quality text-to-3D generation and single-image reconstruction that runs as fast direct model inference, taking only about 30 seconds on a single A100 GPU.
DMV3D integrates 3D NeRF reconstruction and rendering into its denoiser, creating a 2D multi-view image diffusion model trained without direct 3D supervision. This removes the need to separately train 3D NeRF encoders for latent-space diffusion and avoids per-asset optimization. The researchers use a sparse set of four multi-view images surrounding an object, which is enough to describe a 3D object without significant self-occlusions.
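To picture the four-view setup, here is a minimal sketch that places cameras evenly around an object at the origin. The function name `ring_camera_positions`, the radius, and the elevation angle are illustrative assumptions, not values taken from the paper.

```python
import math

def ring_camera_positions(num_views=4, radius=2.0, elevation_deg=20.0):
    """Place cameras evenly in azimuth on a ring around the origin.

    All parameter defaults are hypothetical, chosen only for illustration.
    """
    elev = math.radians(elevation_deg)
    positions = []
    for i in range(num_views):
        azim = 2.0 * math.pi * i / num_views  # 90 degrees apart for 4 views
        x = radius * math.cos(elev) * math.cos(azim)
        y = radius * math.cos(elev) * math.sin(azim)
        z = radius * math.sin(elev)
        positions.append((x, y, z))
    return positions

cams = ring_camera_positions()
for i, (x, y, z) in enumerate(cams):
    print(f"view {i}: ({x:+.3f}, {y:+.3f}, {z:+.3f})")
```

With four views spaced a quarter turn apart, each side of the object is seen by at least one camera, which matches the intuition behind using a sparse surrounding set.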
To handle the challenging task of sparse-view 3D reconstruction, the researchers leverage large transformer models. Building on the recent Large Reconstruction Model (LRM), they introduce a joint reconstruction and denoising model that can handle the various noise levels of the diffusion process. This model serves as the multi-view image denoiser within a multi-view image diffusion framework.
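The role of a reconstruct-then-render denoiser inside a sampling loop can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `toy_reconstruct_and_render` is a hypothetical stand-in that replaces the transformer and NeRF renderer with a per-pixel average across views, the noise schedule is arbitrary, and the deterministic DDIM-style update is an assumed choice of sampler.

```python
import numpy as np

def toy_reconstruct_and_render(noisy_views, t):
    """Hypothetical stand-in for the learned denoiser.

    A real model would reconstruct a triplane NeRF from the noisy views and
    render it at each viewpoint. Here the "3D representation" is just the
    per-pixel mean over views, "rendered" by copying it back to every view.
    """
    shared = noisy_views.mean(axis=0, keepdims=True)        # "reconstruct"
    return np.repeat(shared, noisy_views.shape[0], axis=0)  # "render"

def sample_views(denoiser, num_views=4, size=8, num_steps=50, seed=0):
    """Deterministic DDIM-style sampling with a clean-image-predicting denoiser."""
    rng = np.random.default_rng(seed)
    # Assumed schedule: cumulative signal level, high at t=0 and low at t=T-1.
    alpha_bar = np.linspace(0.999, 0.01, num_steps)
    x = rng.standard_normal((num_views, size, size, 3))  # start from noise
    for t in range(num_steps - 1, -1, -1):
        x0_hat = denoiser(x, t)  # predicted clean multi-view images
        # Noise implied by the current sample and the clean prediction.
        eps_hat = (x - np.sqrt(alpha_bar[t]) * x0_hat) / np.sqrt(1.0 - alpha_bar[t])
        ab_prev = alpha_bar[t - 1] if t > 0 else 1.0
        # Step to the previous, less noisy level.
        x = np.sqrt(ab_prev) * x0_hat + np.sqrt(1.0 - ab_prev) * eps_hat
    return x

views = sample_views(toy_reconstruct_and_render)
print(views.shape)
```

Because every clean prediction passes through one shared representation, the final views agree with each other by construction. In this toy version that simply means all four output images are identical, whereas a real renderer would produce different but mutually consistent views of the same object.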
Trained on large-scale datasets comprising synthetic renderings and real captures, DMV3D generates 3D assets in a single stage in approximately 30 seconds on a single A100 GPU and achieves state-of-the-art results in single-image 3D reconstruction. By bridging 2D and 3D generative models, the work unifies 3D reconstruction and generation and offers a fresh perspective on 3D generation tasks. The implications extend beyond immediate applications, opening the door to foundation models that tackle a range of problems in 3D vision and graphics.
Niharika is a technical consulting intern at Marktechpost. She is a third-year undergraduate pursuing her B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She is highly enthusiastic about machine learning, data science, and AI, and an avid reader of the latest developments in these fields.