Researchers at Google AI Introduce an Efficient Neural Volumetric Representation that Enables Real-Time View Synthesis

Source: https://arxiv.org/pdf/2103.14645.pdf

View synthesis is a computer vision (CV) technique that uses observed images to recover a 3D scene representation that can render the scene from novel unobserved viewpoints. Recently, it has seen significant progress resulting from using neural volumetric representations.

Neural Radiance Fields (NeRF) can render photorealistic novel views with fine geometric details and realistic view-dependent appearance. It represents a scene as a continuous volumetric function, parameterized by a multilayer perceptron (MLP) that maps a continuous 3D position (together with a viewing direction) to the volume density and view-dependent emitted radiance at that location.
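The mapping described above can be sketched as a tiny MLP query. This is a minimal illustration, not the paper's actual network: the weights are random placeholders, and a real NeRF uses a much deeper and wider architecture with positional encoding.

```python
import numpy as np

# Illustrative sketch of NeRF's core mapping: an MLP takes a 3D position
# (plus a viewing direction) and returns a volume density sigma and a
# view-dependent RGB radiance. Weights are random placeholders.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 16)) * 0.1      # position -> hidden features
W2 = rng.normal(size=(16 + 3, 4)) * 0.1  # hidden + direction -> (sigma, rgb)

def query_nerf(position, direction):
    h = np.tanh(position @ W1)                 # hidden features from position
    out = np.concatenate([h, direction]) @ W2  # raw network outputs
    sigma = np.log1p(np.exp(out[0]))           # softplus -> density >= 0
    rgb = 1.0 / (1.0 + np.exp(-out[1:]))       # sigmoid -> color in [0, 1]
    return sigma, rgb

sigma, rgb = query_nerf(np.array([0.1, 0.2, 0.3]), np.array([0.0, 0.0, 1.0]))
print(sigma, rgb)  # density and view-dependent color at this point
```

Because the density and color are queried per 3D sample, rendering a single pixel requires many such MLP evaluations, which is the bottleneck the paper addresses.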

However, rendering NeRF is slow and computationally heavy, limiting its use for interactive view synthesis. It also makes it impossible to display a recovered 3D model in a standard web browser. 

Google researchers address this problem of rendering a trained NeRF in real-time while preserving its capability to represent minute geometric details and convincing view-dependent effects.

Their approach accelerates NeRF’s rendering process by three orders of magnitude, resulting in a rendering time of 12 milliseconds per frame on a single GPU. They precompute and store a trained NeRF into a sparse 3D voxel grid data structure called a Sparse Neural Radiance Grid (SNeRG). Each active voxel in a SNeRG contains opacity, diffuse color, and a learned feature vector that encodes view-dependent effects. 
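A hedged sketch of the SNeRG data layout follows: a sparse 3D grid where only active voxels store content. The fields match the article's description (opacity, diffuse RGB, learned feature vector); the feature dimension and opacity threshold are illustrative choices, not the paper's exact values.

```python
import numpy as np

FEATURE_DIM = 4  # illustrative; the paper's feature size may differ

class SparseNeuralRadianceGrid:
    def __init__(self):
        self.voxels = {}  # (i, j, k) -> per-voxel content

    def set_voxel(self, ijk, opacity, diffuse_rgb, features):
        if opacity > 1e-3:  # only store voxels that actually contribute
            self.voxels[ijk] = {
                "opacity": float(opacity),
                "diffuse": np.asarray(diffuse_rgb, dtype=np.float32),
                "features": np.asarray(features, dtype=np.float32),
            }

    def lookup(self, ijk):
        # Empty space costs no storage: missing voxels are fully transparent.
        return self.voxels.get(
            ijk,
            {"opacity": 0.0,
             "diffuse": np.zeros(3, np.float32),
             "features": np.zeros(FEATURE_DIM, np.float32)},
        )

grid = SparseNeuralRadianceGrid()
grid.set_voxel((4, 2, 7), 0.9, [0.8, 0.1, 0.1], [0.2, -0.3, 0.5, 0.0])
print(len(grid.voxels))                   # only the active voxel is stored
print(grid.lookup((0, 0, 0))["opacity"])  # empty space reads back as 0.0
```

Storing only active voxels is what makes the representation both compact on disk and fast to ray-march, since empty space can be skipped entirely.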

To render this representation, they first accumulate the diffuse colors and feature vectors along each ray. They then pass the accumulated feature vector through a lightweight MLP to produce a view-dependent residual, which is added to the accumulated diffuse color.
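The two-step rendering procedure above can be sketched as follows. This is an illustrative toy, not the paper's exact math: the compositing weights follow standard front-to-back alpha blending, and the per-pixel MLP weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(4 + 3, 3)) * 0.1  # (features + view dir) -> RGB residual

def render_pixel(samples, view_dir):
    """samples: list of (alpha, diffuse_rgb, features) along one ray."""
    transmittance = 1.0
    diffuse = np.zeros(3)
    features = np.zeros(4)
    for alpha, rgb, feat in samples:  # front-to-back alpha compositing
        weight = transmittance * alpha
        diffuse += weight * np.asarray(rgb)
        features += weight * np.asarray(feat)
        transmittance *= 1.0 - alpha
    # One cheap MLP evaluation per pixel, not one per 3D sample.
    residual = np.tanh(np.concatenate([features, view_dir]) @ W)
    return np.clip(diffuse + residual, 0.0, 1.0)

samples = [(0.6, [0.7, 0.2, 0.1], [0.1, 0.0, -0.2, 0.3]),
           (0.9, [0.1, 0.1, 0.8], [0.0, 0.4, 0.1, -0.1])]
print(render_pixel(samples, np.array([0.0, 0.0, 1.0])))
```

The key saving is visible in the loop structure: the expensive network runs once per pixel on the already-accumulated features, instead of once for every sample along the ray.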

Key Modifications to NeRF

Recent studies recommend discretized volumetric representations as one of the most promising approaches to improving NeRF's efficiency. The researchers extend this approach with a deferred neural rendering technique for modeling view-dependent effects, enabling trained NeRF models to be visualized in real time on commodity hardware with minimal quality degradation.

The team introduces two necessary modifications to NeRF that allow it to be effectively baked into this sparse voxel representation:

  1. They designed a “deferred” NeRF architecture. The original NeRF architecture represents view-dependent effects with an MLP that runs once per 3D sample. However, the modified architecture instead represents them with an MLP that only runs once per pixel.
  2. They regularize NeRF’s predicted opacity field during training to encourage sparsity. Rendering time and storage required for a volumetric representation heavily depend on opacity’s sparsity within a provided scene. Therefore, the regularizer penalizes predicted density to make NeRF’s opacity field more sparse, thus improving both the storage cost and rendering time for the resulting SNeRG. 
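The sparsity regularizer in step 2 can be sketched as a penalty on the densities sampled during training, pushing empty space toward zero opacity. The Cauchy-style form and the `lambda_s` and `c` values below are illustrative assumptions, not the paper's exact hyperparameters.

```python
import numpy as np

def sparsity_loss(densities, lambda_s=1e-4, c=0.5):
    # Penalize nonzero predicted densities so the opacity field stays
    # sparse; a sparser field means fewer active voxels to store and march.
    densities = np.asarray(densities, dtype=np.float64)
    return lambda_s * np.mean(np.log1p((densities / c) ** 2))

dense_field = np.full(100, 5.0)   # opaque everywhere
sparse_field = np.zeros(100)      # empty everywhere
print(sparsity_loss(dense_field) > sparsity_loss(sparse_field))  # True
```

This term is added to NeRF's usual reconstruction loss during training, trading a small amount of reconstruction freedom for a much cheaper baked representation.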

The team demonstrates the proposed method’s ability to increase the rendering speed of NeRF so that frames can be rendered in real-time while retaining NeRF’s ability to represent fine geometric details and convincing view-dependent effects. Furthermore, this representation is compact and requires less than 90 MB on average to represent a scene.

Figure 1: Comparison of ray-marching procedures for NeRF and SNeRG.

The researchers compared the proposed approach to contemporary techniques for accelerating NeRF on three criteria: render-time performance, storage cost, and rendering quality. They found that:

  • Removing the view-dependence MLP had only a small impact on runtime performance. 
  • Removing the sparsity loss resulted in increased memory usage.
  • Replacing the proposed “deferred” rendering with NeRF’s per-sample shading resulted in prohibitively long render times.

The team states that, after fine-tuning, the proposed SNeRG model’s rendering quality is competitive with the neural model. The storage ablation study validates that the compressed SNeRG representations are small enough to be quickly loaded on a web page and displayed at over 30 frames per second on a laptop GPU.

Figure 2: Visualization of sparsity loss and visibility culling.
Figure 3: Impact of fine-tuning (FT) the view-dependent appearance network.

The team hopes that their approach will help in adopting such neural scene representations across various vision and graphics applications.

Paper: https://arxiv.org/pdf/2103.14645.pdf
