Researchers from ByteDance and UCSD Propose a Multi-View Diffusion Model that is Able to Generate a Set of Multi-View Images of an Object/Scene from Any Given Text

Despite being a crucial stage in the contemporary gaming and media industry’s pipeline, creating 3D content is time-consuming, requiring skilled designers to put in hours or even days of effort to produce a single 3D item. Thus, a system that allows non-professional users to create 3D material easily is quite valuable. Three categories of existing 3D object creation techniques exist template-based generation pipelines, 3D generative models, and 2D-lifting techniques. Both template-based generators and 3D generative models may now rarely generalize to arbitrary object production due to the restricted number of accessible 3D models and the significant data complexity. Their created material is often limited to a few categories, most of which are common objects with straightforward topologies and textures from the outside world. 

However, in business, popular 3D assets frequently combine intricate, creative, and perhaps unrealistic structures and styles (Ske). Pre-trained 2D generation models may be used for 3D generation, according to recent research on 2D-lifting techniques. The common representations include Dreamfusion and Magic3D systems, which use 2D diffusion models as supervision for the improvement of a 3D representation, such as NeRF, using score distillation sampling (SDS). These 2D models, developed using large-scale 2D picture datasets, have outstanding generalizability and can produce hypothetical and unseen situations, the specifics of which may be defined by text input, making them effective tools for producing aesthetic 3D assets. 

However, these models can only offer single-view supervision, and the produced assets are readily affected by the multi-view consistency problem since they only have 2D knowledge. Because of this, the generation is highly unsteady, and the products frequently have serious artifacts. There are issues with 2D-lifting methods since score distillation is difficult without thorough multi-view knowledge or 3D awareness. These consist of (1) The Janus problem with many faces. The system regularly recreates content represented by the text prompt. (2) Content bleeds across distinct viewpoints. Examples are shown in Figure 1. There are several potential causes for the multifaceted problem. For instance, some items, like blades, may be almost undetectable at particular angles. 

Figure 1 illustrates typical 2D-lifting approaches for 3D-generating multiview consistency issues. On the left, you can see “A bald eagle carved out of wood,” which has two faces. Right: “a DSLR image of a plate of fried chicken and waffles with maple syrup on them,” where the chicken slowly transforms into a waffle.

However, from other perspectives, important aspects of a character or animal may be obscured or self-occluded. A 2D diffusion model can only evaluate these things from some possible perspectives the way humans can, which causes it to provide redundant and inconsistent material. Researchers from ByteDance and UCSD suggest multi-view diffusion models as a solution to these issues, which concurrently produce a collection of multi-view pictures that are coherent with one another. They mostly maintain the architecture design of 2D image diffusion for multi-image generation. This enables us to inherit the generalizability of previously learned 2D diffusion models for transfer learning. They produce a collection of multi-view pictures from an actual 3D dataset, called obverse, to guarantee the multi-view consistency of their model. 

They discover that the model can attain high consistency and generalizability by concurrently training it on real photos and multi-view images. They also use multi-view score distillation to apply these models to 3D creation. In contrast to single-view 2D diffusion models, their model’s multi-view supervision turns out to be far more stable. They can also still produce hypothetical, hidden 3D contents using pure 2D diffusion models. They use their multi-view diffusion model, which they adapted from DreamBooth and DreamBooth3D, to extract identification data from a set of supplied photographs, and it exhibits strong multi-view consistency following such a few-show fine-tuning. Their model, MVDream, effectively builds 3D Nerf models without the Janus problem when included in the 3D creation process. It either outperforms or equals the diversity found in other cutting-edge techniques.


Check out the Paper and Project Page. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 29k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

If you like our work, you will love our newsletter..

Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence from the Indian Institute of Technology(IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing and is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.

[Announcing Gretel Navigator] Create, edit, and augment tabular data with the first compound AI system trusted by EY, Databricks, Google, and Microsoft