This AI Research Presents Drivable 3D Gaussian Avatars (D3GA): The First 3D Controllable Model for Human Bodies Rendered with Gaussian Splats

The Impressionist art movement was founded in the nineteenth century by the Anonymous Society of Painters, Sculptors, Printmakers, etc., and is characterized by “short, broken brushstrokes that barely convey forms.” Recent studies now render human subjects as realistically as possible in photographs, a challenge that the impressionists avoided. 

Since monocular techniques lack accuracy, creating drivable (i.e., can be animated to generate new content) photorealistic humans currently requires extensive multi-view data. Furthermore, current methods necessitate intricate pre-processing, such as accurate 3D registrations. But to get those registrations, you must use iterative processes that aren’t easy to incorporate into end-to-end workflows. Other approaches that do not require accurate registrations are based on neuronal radiance fields (NeRFs). They either have trouble rendering clothing animations (with certain exceptions) or are too slow for real-time rendering.

Researchers from Meta Reality Labs Research, Technical University of Darmstadt, and Max Planck Institute for Intelligent Systems represent the 3D human appearance and deformations in a canonical space using 3D Gaussians rather than radiance fields. Gaussian splats are used as a modern replacement for those quick brushstrokes so that the avatars’ anatomy and aesthetics match those of live, repositionable characters. Gaussian splats don’t call for any hacks involving the sampling of camera rays. Points in a drivable NeRF are often transformed from the canonical space to the observation space using linear blend skinning (LBS). D3GA, on the other hand, models humans using 3D Gaussian volumes as volumetric primitives and hence requires a mapping from volumes to canonical space.

In place of LBS, the researchers employ cages, another well-established deformation model well-suited to volume transformations. The deformation gradient generated by deforming cages in canonical space directly applies to the 3D Gaussian representation. This approach is built on a compositional structure that allows us to represent the torso, face, and clothing separately using cages. The lingering mystery concerns pinpointing the cue that causes those cage distortions to occur. The current state-of-the-art in drivable avatars demands dense input signals like RGB-D pictures or even multi-camera setups, which might not be acceptable for low-bandwidth connections in telepresence applications. The team used a more condensed input based on the human stance, which includes quaternion representations of skeletal joint angles and 3D facial key points. They use nine high-quality multi-view sequences to train person-specific models that can be driven with fresh poses from any subject. They cover many body forms, motions, and clothes (not restricted to tight-fitting).

The method produces high-quality output, outperforming the state-of-the-art with equivalent input and competing favorably with methods employing more information, such as FFD meshes or pictures, during testing. As a bonus, the proposed technique does not necessitate ground truth geometry to achieve promising results in geometry and appearance modeling for dynamic sequences, which reduces the processing time needed for the data. 


Check out the Paper and Project Page. All credit for this research goes to the researchers of this project. Also, don’t forget to join our 33k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

If you like our work, you will love our newsletter..