Meet GPS-Gaussian: A New Artificial Intelligence Approach for Synthesizing Novel Views of a Character in a Real-Time Manner

An essential function of multi-view camera systems is novel view synthesis (NVS), which generates photorealistic images from new viewpoints using source images. Human NVS in particular could significantly benefit applications that need real-time efficiency and consistent 3D appearance, such as holographic communication, stage performances, and 3D/4D immersive scene capture for sports broadcasting. Prior efforts have created new views through weighted blending, but these usually rely on input views that are very dense or on very accurate proxy geometry. Rendering high-fidelity images for NVS under sparse-view camera settings remains a major challenge.

Implicit representations, notably Neural Radiance Fields (NeRF), have recently shown outstanding performance in several NVS tasks. Despite advances in acceleration strategies, NVS methods that use implicit representations are still slow because they must query densely sampled points in scene space. Conversely, the real-time, high-speed rendering capabilities of explicit representations, especially point clouds, have attracted sustained attention. When combined with neural networks, point-based graphics provide an explicit representation that is both realistic and more efficient than NeRF on the human NVS task.

New research from the Harbin Institute of Technology and Tsinghua University proposes a generalizable 3D Gaussian Splatting approach that regresses Gaussian parameters in a feed-forward manner instead of relying on per-subject optimization. Drawing inspiration from successful learning-based human reconstruction approaches like PIFu, the researchers aim to learn Gaussian representations from large collections of 3D human scan models covering various human topologies, clothing styles, and pose-dependent deformations. With these learned human priors, the generalizable Gaussian model can render human appearances rapidly.

The researchers present 2D Gaussian parameter maps (position, color, scaling, rotation, opacity) defined on the source-view image planes as an alternative to unstructured point clouds. With these parameter maps, the method can depict a character using pixel-wise parameters, where each foreground pixel corresponds to a specific Gaussian point. This also makes it possible to use cost-effective 2D convolutional networks instead of 3D operators. To lift the 2D parameter maps to 3D Gaussian points, the method estimates depth maps for both source views using two-view stereo, which serves as a learnable un-projection technique. The unprojected Gaussian points from both source views represent the character, and the novel view image is then rendered by splatting. The significant self-occlusions in human characters make this depth estimation a challenging problem for existing cascaded cost volume approaches. Hence, the team proposes jointly training the Gaussian parameter regression with an iterative stereo matching-based depth estimation module on large-scale data. Minimizing the rendering loss of the Gaussian module corrects artifacts that the depth estimation may introduce, which improves the accuracy of the 3D Gaussian positions. This joint training benefits both modules and makes training more stable.
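
The un-projection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simple pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, works in camera space only (no extrinsics), and carries just a color per pixel rather than the full set of Gaussian parameters. The function name `unproject_to_gaussians` is made up for illustration.

```python
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


def unproject_to_gaussians(
    depth: List[List[float]],
    color: List[List[Vec3]],
    mask: List[List[bool]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Dict[str, Vec3]]:
    """Lift each foreground pixel to one camera-space 3D point.

    Assumes a pinhole camera: a pixel (u, v) with depth d maps to
    ((u - cx) * d / fx, (v - cy) * d / fy, d).
    """
    gaussians: List[Dict[str, Vec3]] = []
    for v, depth_row in enumerate(depth):
        for u, d in enumerate(depth_row):
            # Skip background pixels and invalid (non-positive) depths.
            if not mask[v][u] or d <= 0.0:
                continue
            position = ((u - cx) * d / fx, (v - cy) * d / fy, d)
            # One point per foreground pixel, paired with that pixel's color.
            gaussians.append({"position": position, "color": color[v][u]})
    return gaussians


if __name__ == "__main__":
    # Toy 2x2 source view: constant depth, one background pixel.
    depth = [[4.0, 4.0], [4.0, 4.0]]
    color = [
        [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
        [(0.0, 0.0, 1.0), (1.0, 1.0, 1.0)],
    ]
    mask = [[True, False], [True, True]]
    for g in unproject_to_gaussians(depth, color, mask, 2.0, 2.0, 0.5, 0.5):
        print(g)
```

In a full pipeline, the points from both source views would additionally be transformed into a shared world frame using each camera's pose, and every point would carry scaling, rotation, and opacity read from the corresponding parameter maps before splatting.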

In practice, the team achieved 2K novel views at frame rates above 25 FPS using a single state-of-the-art graphics card. Thanks to the method’s broad generalizability and fast rendering, an unseen character can be rendered immediately without optimization or fine-tuning.

As the paper highlights, some factors can still affect the method’s efficacy, even though GPS-Gaussian synthesizes high-quality images. For example, precise foreground matting is an essential preprocessing step. In addition, the method cannot adequately handle large disparities in which a target region is completely invisible in one view but visible in another, as can happen in a 6-camera setup. The researchers believe this difficulty can be addressed by using temporal information.

Check out the Paper and Project. All credit for this research goes to the researchers of this project.

Dhanshree Shenwai is a Computer Science Engineer with experience in FinTech companies covering the Financial, Cards & Payments, and Banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone’s life easier in today’s evolving world.
