Pose detection is a critical step toward better understanding the human body in videos and images. Many developers have already experimented with 2D pose estimation using existing models.
TensorFlow has just launched its first 3D model in the TF.js pose-detection API. Growing interest in 3D pose estimation from the TensorFlow.js community opens up new design opportunities for applications such as fitness, medical analysis, and motion capture, among many others. A great example is using 3D motion to drive a character animation in the browser.
The community demo uses multiple models powered by MediaPipe and TensorFlow.js (namely FaceMesh, BlazePose, and Hand Pose). It even works without installing an app, since a webpage is all you need to enjoy the experience.
This is in contrast to 2D data, which can be obtained via human annotation while preserving a good level of human diversity in the dataset. Collecting 3D data is harder: it requires either a lab setup or specialized hardware for manual scans, which introduces additional challenges such as maintaining environmental variety. Some researchers instead build completely synthetic datasets, which brings its own challenge of domain adaptation.
The proposed approach uses a 3D statistical human body model called GHUM to obtain pose ground truth. During this process, researchers fit the GHUM model to the data and extended it with real-world keypoint coordinates in metric space. The goal of the fitting is alignment with the 2D image evidence, which includes a semantic segmentation alignment term as well as shape and pose regularization terms.
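To make the shape of such a fitting objective concrete, here is a minimal sketch: a 2D reprojection term plus simple quadratic regularizers on pose and shape parameters. The pinhole `project` function, the focal length `f`, and the regularizer weights are illustrative assumptions, not the actual GHUM formulation.

```javascript
// Assumed focal length in pixels for this toy pinhole camera.
const f = 500;

// Perspective projection of a 3D point [x, y, z] to 2D pixel coordinates.
function project([x, y, z]) {
  return [(f * x) / z, (f * y) / z];
}

// Toy fitting loss: squared pixel distance between projected model
// keypoints and detected 2D keypoints, plus regularizers that keep the
// pose and shape parameters close to the statistical prior (weights
// here are arbitrary for illustration).
function fittingLoss(points3d, evidence2d, poseParams, shapeParams) {
  let data = 0;
  points3d.forEach((p, i) => {
    const [u, v] = project(p);
    data += (u - evidence2d[i][0]) ** 2 + (v - evidence2d[i][1]) ** 2;
  });
  const reg = (params) => params.reduce((s, t) => s + t * t, 0);
  return data + 0.1 * reg(poseParams) + 0.1 * reg(shapeParams);
}

// A perfectly aligned, unregularized fit has zero loss.
console.log(fittingLoss([[0.2, 0.4, 2.0]], [[50, 100]], [], [])); // 0
```

In the real pipeline the data term also includes semantic segmentation alignment, and optimization runs over the full GHUM parameterization rather than flat arrays.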
Due to the nature of 3D-to-2D projection, multiple points in 3D can project onto the same 2D point (i.e., the same X and Y but different Z), so the fitting results can be ambiguous, yielding several plausible body poses for a given input image or video frame. To make the annotation process more efficient, researchers asked annotators to provide the depth order between pose skeleton edges only where they were certain. This task proved easier than true depth annotation, showing high consistency between annotators (98% on cross-validation) and reducing errors in the GHUM reconstructions from 25% to 3%.
BlazePose GHUM takes a two-step approach to human body pose prediction. The model is trained on cropped images and predicts 3D positions in relative coordinates of a metric space, with the origin at the subject’s hip center.
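The hip-centered coordinate convention can be sketched as follows. The keypoint values and helper names here are illustrative, not actual model output.

```javascript
// Midpoint of the two hip keypoints, used as the coordinate origin.
function hipCenter(leftHip, rightHip) {
  return leftHip.map((v, i) => (v + rightHip[i]) / 2);
}

// Re-express every keypoint relative to the hip-center origin.
function toHipRelative(keypoints, leftHip, rightHip) {
  const origin = hipCenter(leftHip, rightHip);
  return keypoints.map((p) => p.map((v, i) => v - origin[i]));
}

// Illustrative keypoints in meters (not real model output).
const leftHip = [-0.1, 0.0, 0.05];
const rightHip = [0.1, 0.0, -0.05];
const nose = [0.0, -0.6, 0.1];

console.log(toHipRelative([nose], leftHip, rightHip)[0]);
// The hip center here is [0, 0, 0], so the nose keeps its coordinates.
```

Because the output is relative rather than absolute, the prediction is independent of where the subject stands in the scene; recovering absolute position requires additional camera information.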
MediaPipe vs. TF.js runtime
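The pose-detection API lets you choose between the two runtimes via the detector configuration. The option names below (`runtime`, `modelType`, `solutionPath`) follow the published `@tensorflow-models/pose-detection` API; treat the exact values as assumptions if your package version differs.

```javascript
// Config for the MediaPipe runtime: runs the model via the MediaPipe
// Pose solution (solutionPath points at the assumed CDN location).
const mediapipeConfig = {
  runtime: 'mediapipe',
  modelType: 'full',
  solutionPath: 'https://cdn.jsdelivr.net/npm/@mediapipe/pose',
};

// Config for the TF.js runtime: runs the model directly in TensorFlow.js.
const tfjsConfig = {
  runtime: 'tfjs',
  modelType: 'full',
};

// In a browser app you would then create the detector and estimate poses:
//   const detector = await poseDetection.createDetector(
//       poseDetection.SupportedModels.BlazePose, tfjsConfig);
//   const poses = await detector.estimatePoses(video);
//   // poses[0].keypoints3D holds the hip-centered metric coordinates.
```

Both runtimes expose the same detector interface, so switching between them is a one-line configuration change.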
Live Demo: https://storage.googleapis.com/tfjs-models/demos/pose-detection/index.html?model=blazepose