Three-dimensional (3D) modeling has become critical in various fields, such as architecture and engineering. 3D models are computer-generated objects or environments that can be manipulated, animated, and rendered from different perspectives to provide a realistic visual representation of the physical world. Creating 3D models can be time-consuming and costly, especially for complex objects. However, recent advancements in computer vision and machine learning have made it possible to generate 3D models or scenes from a single input image.
3D scene generation involves using artificial intelligence algorithms to learn the underlying structure and geometrical properties of an object or environment from a single image. The process typically comprises two stages: the first involves extracting the object’s shape and structure, and the second consists in generating the object’s texture and appearance.
In recent years, this technology has become a hot topic in the research community. The classic approach for 3D scene generation involves learning the features or characteristics of a scene presented in two dimensions. In contrast, novel approaches exploit differentiable rendering, which allows the computation of gradients or derivatives of rendered images with respect to the input geometry parameters.
However, all these techniques, often developed to address this task for specific categories of objects, provide 3D scenes with limited variances, such as terrain representations with minor changes.
A novel approach for 3D scene generation has been proposed to address this limitation.
Its goal is to create natural scenes that possess unique features resulting from the interdependence between their constituent geometry and appearance. The distinctive nature of these features makes it challenging for the model to learn common figures’ characteristics.
In similar cases, the exemplar-based paradigm is employed, which involves the manipulation of a suitable exemplar model to construct a richer target model. Therefore the exemplar model should have similar characteristics to the target model for this technique to be effective.
However, having different exemplar scenes with specific characteristics makes it difficult to have ad hoc designs for every scene type.
Therefore, the proposed approach utilizes a patch-based algorithm, which was used long before deep learning. The pipeline is presented in the figure below.
Specifically, a multi-scale generative patch-based framework is adopted, which employs a Generative Patch Nearest-Neighbor (GPNN) module to maximize the bidirectional visual summary between the input and output.
This approach utilizes Plenoxels, a grid-based radiance field known for its impressive visual effects, to represent the input scene. While its regular structure and simplicity benefit patch-based algorithms, certain essential designs must be implemented. Specifically, the exemplar pyramid is constructed through a coarse-to-fine training process of Plenoxels on images of the input scene rather than simply downsampling a high-resolution pre-trained model. Additionally, the high-dimensional, unbounded, and noisy features of the Plenoxels-based exemplar at each level are transformed into well-defined and compact geometric and appearance features to enhance robustness and efficiency in subsequent patch matching.
Furthermore, this study employs diverse representations for the synthesis process within the generative nearest neighbor module. The patch matching and blending operate simultaneously at each level to progressively synthesize an intermediate value-based scene, which will ultimately be transformed into a coordinate-based equivalent.
Finally, using patch-based algorithms with voxels can lead to high computational demands. Therefore, an exact-to-approximate patch nearest-neighbor field (NNF) module is utilized in the pyramid, which maintains the search space within a manageable range while making minimal compromises on visual summary optimality.
The results obtained by this model are reported below for a few random images.
This was the summary of a novel AI framework to enable high-variance image-to-3D scene generation. If you are interested, you can learn more about this technique in the links below.
Check out the Paper and Project. Don’t forget to join our 21k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com
Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He is currently working in the Christian Doppler Laboratory ATHENA and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.