When it comes to the spatial understanding of a scene when observed from any viewpoint or orientation, most geometry and learning-based approaches for 3D view synthesis fail to extrapolate to infer unobserved parts of the scene. The inability of these models to learn a prior over scenes is their fundamental limitation. Popular models like NeRF do not learn a scene prior, and therefore it cannot extrapolate views. Even conditional auto-encoder models can extrapolate views of simple objects, but they overfit to viewpoints and produce blurry renderings.
A prior learned model for spatial understanding of a scene may be used for unconditional or conditional inference. A good use case of unconditional inference is to generate realistic scenes and pass through them without any input observations, relying on the prior distribution over scenes. Similarly, with conditional inference there too are different types of problems. Example, plausible scene completions may be sampled by inverting scene observations back to the learned scene prior. Therefore, a generative model for scenes would be a practical solution for tackling a wide range of machine learning and computer vision problems, including model-based reinforcement learning.
Through this research, Apple researchers have introduced Generative Scene Networks (GSN), a generative model of scenes that allows view synthesis of a freely moving camera in an open environment. Their contributions in this model include:
- The research group introduced ‘the first generative model for unconstrained scene-level radiance fields’.
- Their research showed ‘that decomposing a latent code into a grid of locally conditioned radiance fields results in an expressive and robust scene representation.’ This outperforms strong baselines such as pure coarse encoding or global illumination methods.
- They infer ‘observations from arbitrary cameras given a sparse set of observations by inverting GSN’
- In their research, they also showed that ‘GSN can be trained on multiple scenes to grasp a rich scene fast, while rendered trajectories are smooth and consistent, maintaining scene coherence.’
Generative Scene Networks (GSN) can learn to decompose scenes into a set of multiple local radiance fields that can be rendered from a free-moving camera. The model may then be used as either prior or given sparse 2D observations complete the scene. Hence, this research, by Apple researchers has made great strides in generative modeling for complex, realistic indoor scenes.