Researchers at Google Develop BlazePose GHUM Holistic: A Lightweight Neural Network Pipeline that Predicts 3D Landmarks and Pose of the Human Body on-Device

Motion capture and interactive video games are only two examples of the many applications made possible by accurate real-time inference of the human body skeleton. Making this inference run on a consumer’s phone or laptop, without specialized sensors, would greatly democratize the technology for non-professional users.

Over the last ten years, there have been numerous advances in estimating 3D landmarks or volumetric representations of the human body. Most of these approaches either require specialized lab equipment, are too computationally expensive to run on mobile devices, or do not capture the body topology in enough detail.

To address these issues, Google researchers developed BlazePose GHUM Holistic, a compact neural network pipeline that predicts the 3D landmarks and pose of the human body, including the hands, on-device from a single monocular image. It is available to developers and creators via MediaPipe and runs in real time at 15 FPS on most modern mobile phones and browsers.

The method addresses problems the researchers found in existing interactive motion capture systems. To overcome the difficulty of collecting diverse 3D ground truth for the human body, they developed a novel method based on fitting a statistical 3D human model (GHUM) to a variety of 2D annotations. To improve accuracy further, the team proposed using depth-ordering annotations as additional fitting supervision.
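To illustrate how depth-ordering annotations could act as a supervision signal, here is a minimal sketch that assumes a pairwise hinge (ranking) loss. The article does not say which loss the researchers used, so the function name `depth_ordering_loss`, the `margin` parameter, and the pair format are all hypothetical.

```python
from typing import Sequence, Tuple


def depth_ordering_loss(
    depths: Sequence[float],
    closer_pairs: Sequence[Tuple[int, int]],
    margin: float = 0.05,
) -> float:
    """Average hinge penalty over annotated (closer, farther) landmark pairs.

    depths[i] is the predicted distance of landmark i from the camera
    (larger means farther away). Each pair (a, b) records that an
    annotator marked landmark a as closer to the camera than landmark b.
    This is an illustrative ranking loss, not the paper's formulation.
    """
    if not closer_pairs:
        return 0.0
    total = 0.0
    for a, b in closer_pairs:
        # No penalty once landmark a is at least `margin` closer than b.
        total += max(0.0, margin - (depths[b] - depths[a]))
    return total / len(closer_pairs)


# Hypothetical example: three landmarks and two ordering annotations.
predicted = [1.0, 1.3, 0.9]
annotations = [(0, 1), (1, 2)]  # 0 is closer than 1; 1 is closer than 2
loss = depth_ordering_loss(predicted, annotations)
```

In this example the first pair is already ordered correctly and contributes nothing, while the second pair is ordered wrongly and is penalized. Annotations of this kind only require a human to say which of two points is nearer, which is far easier to collect than metric depth.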

The usual topology for on-device body landmark prediction does not include hands and fingers, which hampers the development of a comprehensive motion capture system for the body. To overcome this limitation, the researchers use BlazePose’s palm prediction as a prior to crop high-resolution hand regions from the original image. They then run a retrained hand-tracking model to predict 3D hand landmarks for each hand in a single feed-forward pass.
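The cropping step can be sketched as follows. This is a simplified illustration, assuming the palm prediction is available as a normalized center point and size; the helper names `hand_crop_box` and `crop`, the `scale` factor, and the axis-aligned square box are assumptions rather than details from the paper.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]


def hand_crop_box(
    palm_center: Tuple[float, float],
    palm_size: float,
    image_width: int,
    image_height: int,
    scale: float = 2.5,
) -> Box:
    """Return (left, top, right, bottom) pixel bounds of a square hand crop.

    palm_center is (x, y) in normalized [0, 1] image coordinates and
    palm_size is the palm extent as a fraction of the image width.
    `scale` (a hypothetical value) enlarges the box so the fingers fit.
    The box is clamped to the image borders.
    """
    cx = palm_center[0] * image_width
    cy = palm_center[1] * image_height
    half = 0.5 * scale * palm_size * image_width
    left = max(0, int(round(cx - half)))
    top = max(0, int(round(cy - half)))
    right = min(image_width, int(round(cx + half)))
    bottom = min(image_height, int(round(cy + half)))
    return left, top, right, bottom


def crop(image: Sequence[Sequence[int]], box: Box) -> List[List[int]]:
    """Cut the box out of an image stored as a list of pixel rows."""
    left, top, right, bottom = box
    return [list(row[left:right]) for row in image[top:bottom]]


# Hypothetical example: a 10x10 image with the palm at the center.
image = [[x + 10 * y for x in range(10)] for y in range(10)]
box = hand_crop_box((0.5, 0.5), palm_size=0.2, image_width=10,
                    image_height=10, scale=2.0)
hand_region = crop(image, box)
```

Because the box is computed in normalized coordinates and applied to the original image, the hand model receives full-resolution pixels even if the body model ran on a downscaled input.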

To enable expressive use cases such as 3D avatars, the team also addressed the problem of going beyond precise 3D representations of the human body to high-level semantic understanding and mapping.

Source: https://arxiv.org/pdf/2206.11678.pdf

To validate the GHUM lifter architecture, the team ran experiments on a held-out test set of 10,000 in-the-wild images with extremely challenging poses, whose GHUM fits had been curated to remove errors. After comparing results with various state-of-the-art (SOTA) approaches, the researchers concluded that BlazePose GHUM Holistic provides a simple and meaningful representation of the human body that can be used out of the box for a variety of applications. The 3D landmarks make it possible to move from 2D image space to a real-world coordinate system. Pose estimation provides a high-level interpretation of the 3D landmarks, and the additional hand points supply finer detail where needed.
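As one illustration of why landmarks in a real-world coordinate system are useful, the sketch below computes the angle at a joint from three 3D points, the kind of measurement a fitness application might need. The function name `joint_angle_degrees` and the sample coordinates (assumed to be in meters) are hypothetical and do not come from the paper.

```python
import math
from typing import Sequence


def joint_angle_degrees(
    a: Sequence[float], b: Sequence[float], c: Sequence[float]
) -> float:
    """Angle at point b between segments b->a and b->c, in degrees."""
    u = [a[i] - b[i] for i in range(3)]
    v = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("landmarks must not coincide with the joint")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_angle))


# Hypothetical shoulder, elbow, and wrist positions in meters.
shoulder = (0.0, 0.30, 0.0)
elbow = (0.0, 0.0, 0.0)
wrist = (0.25, 0.0, 0.0)
elbow_angle = joint_angle_degrees(shoulder, elbow, wrist)
```

The same calculation on 2D image coordinates would change with the camera viewpoint, whereas an angle computed from 3D landmarks does not depend on where the camera is placed.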

Conclusion

Google researchers unveiled BlazePose GHUM Holistic, a lightweight neural network pipeline for 3D human body landmarks and pose estimation, designed for real-time on-device inference. It enables motion capture from a single RGB image and supports applications such as fitness tracking, avatar control, and AR/VR effects. The primary contributions are a new technique for acquiring 3D ground truth data, updated 3D body tracking with additional hand landmarks, and full-body pose estimation from a monocular image. The team is eager to optimize the model further in upcoming releases.

This article is a summary written by Marktechpost Staff based on the research paper 'BlazePose GHUM Holistic: Real-time 3D Human Landmarks and Pose Estimation'. All credit for this research goes to the researchers on this project.


Nitish is a computer science undergraduate with a keen interest in deep learning. He has done various projects related to deep learning and closely follows new advancements in the field.