TensorFlow recently launched its latest pose-detection model, MoveNet, along with a new pose-detection API in TensorFlow.js. MoveNet comes in two variants:
- Lightning is intended for latency-critical applications.
- Thunder targets applications requiring high accuracy.
Both models run faster than real time (i.e., 30+ frames per second) on most modern desktops, laptops, and phones. The models run entirely client-side in the browser using TensorFlow.js, with no server calls after the initial page load.
Human pose estimation has come a long way; however, it hasn’t surfaced in many applications, mainly because more focus has been placed on making pose models larger and more accurate than on making them faster and easily deployable everywhere. With MoveNet, TensorFlow aims to design and optimize a model that utilizes the best aspects of state-of-the-art architectures while keeping inference times as low as possible. As a result, the model can deliver accurate keypoints across a wide variety of poses, environments, and hardware setups.
The researchers collaborated with IncludeHealth, a digital health and performance company, to understand whether MoveNet can help unlock remote care for patients. IncludeHealth has an interactive web application that guides a patient through various routines from the comfort of their own home. These routines are digitally built and prescribed by physical therapists to test balance, strength, and range of motion.
TensorFlow has released MoveNet early to IncludeHealth, accessible through the new pose-detection API. The model is trained on fitness, dance, and yoga poses. IncludeHealth has integrated the model into their application and benchmarked MoveNet against other available pose detectors.
MoveNet is based on a bottom-up estimation model that uses heatmaps to localize human key points. Its architecture consists of two key components:
- Feature extractor: a MobileNetV2 backbone with an attached feature pyramid network (FPN), which allows for a high-resolution, semantically rich feature map output.
- Set of prediction heads: four prediction heads attached to the feature extractor predict the person center heatmap, keypoint regression field, person keypoint heatmap, and 2D per-keypoint offset field.
The model operates in the following sequence of steps:
Step 1: Person center heatmap: predicts the geometric center of each person; the location with the highest score is chosen for the subsequent steps.
Step 2: Keypoint regression field: predicts a complete set of keypoints for a person, which is used to group keypoints into person instances. The keypoint regression output is sliced at the pixel corresponding to the chosen person center, producing an initial set of keypoints for that person.
Step 3: Person keypoint heatmap: predicts the location of all keypoints, independent of person instances. Each pixel is multiplied by a weight inversely proportional to its distance from the corresponding regressed keypoint, so that keypoints from background people are not picked up.
Step 4: 2D per-keypoint offset field: predicts local offsets from each output feature map pixel to the precise sub-pixel location of each keypoint. The coordinates with the highest heatmap value in each keypoint channel are selected, and the local 2D offset predictions are added to them to produce refined estimates.
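The four-step decoding above can be sketched end to end on toy data. Everything below (the 4×4 map size, two keypoints, and all heatmap values) is an illustrative assumption, not MoveNet's actual tensors:

```python
import math

H = W = 4      # toy feature map size (MoveNet's is much larger)
NUM_KP = 2     # toy keypoint count (MoveNet predicts 17)

def argmax2d(hm):
    """(y, x) of the highest-scoring cell in a 2D map."""
    return max(((hm[y][x], y, x) for y in range(H) for x in range(W)))[1:]

# Step 1: person center heatmap -> pick the highest-scoring center.
center_hm = [[0.0] * W for _ in range(H)]
center_hm[1][2] = 0.9
cy, cx = argmax2d(center_hm)

# Step 2: keypoint regression field, sliced at the center pixel, gives an
# initial (y, x) estimate for every keypoint of this person instance.
regression_at_center = [(1.0, -1.0), (2.0, 0.0)]   # (dy, dx) per keypoint
initial = [(cy + dy, cx + dx) for dy, dx in regression_at_center]

# Step 3: per-keypoint heatmaps; each pixel is down-weighted by its
# distance to the regressed estimate, so a stronger response from a
# background person does not win the argmax.
kp_hm = [[[0.0] * W for _ in range(H)] for _ in range(NUM_KP)]
kp_hm[0][2][1] = 0.8   # keypoint 0 of the chosen person
kp_hm[0][3][3] = 0.9   # stronger response from a background person
kp_hm[1][3][2] = 0.7   # keypoint 1 of the chosen person

refined = []
for k in range(NUM_KP):
    ry, rx = initial[k]
    weighted = [[kp_hm[k][y][x] / (1.0 + math.hypot(y - ry, x - rx))
                 for x in range(W)] for y in range(H)]
    refined.append(argmax2d(weighted))

# Step 4: 2D per-keypoint offset field adds a sub-pixel correction at the
# chosen cell (constant offsets here purely for illustration).
offsets = [(0.25, -0.1), (0.0, 0.3)]
final = [(y + oy, x + ox) for (y, x), (oy, ox) in zip(refined, offsets)]
```

Note how the 0.9 background response in step 3 loses to the weaker 0.8 response once both are divided by their distance to the regressed estimate; that distance weighting is what keeps the decoder locked onto one person instance.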
The team trained the model on COCO and Active (an internal Google dataset). While COCO is the standard benchmark dataset for detection, it is not well suited to this use case because it contains relatively few images with challenging poses or significant motion blur. Active was built by labeling keypoints (adopting COCO’s standard 17 body keypoints) on yoga, fitness, and dance videos from YouTube.
The results reveal that the model trained on the Active dataset shows a significant performance boost compared to similar architectures trained on only COCO.
Initially, the bottleneck MobileNetV2 layers were selected for FPN’s lateral connections. The team reduced the number of convolution filters in each prediction head to speed up execution on the output feature maps. MoveNet was profiled continuously, uncovering and removing particularly slow ops. Moreover, the outputs from all prediction heads were packed into a single output tensor, ensuring fast execution with TensorFlow.js, since there is only one download from GPU to CPU.
Using 192×192 inputs to the model (256×256 for Thunder) speeds up inference significantly. To counteract the lower resolution, they applied intelligent cropping based on detections from the previous frame, which ensures that the model devotes its attention and resources to the main subject rather than the background.
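One way to picture the cropping idea: from the previous frame's keypoints, derive a padded square region to feed the model on the next frame. The function name, padding factor, and confidence threshold below are assumptions for illustration, not the library's actual implementation:

```python
def crop_from_keypoints(keypoints, frame_w, frame_h,
                        min_score=0.3, pad=1.5):
    """keypoints: list of (x, y, score) from the previous frame.
    Returns a square crop (x0, y0, side) for the next inference."""
    good = [(x, y) for x, y, s in keypoints if s >= min_score]
    if not good:
        # subject lost: fall back to the largest centered square
        side = float(min(frame_w, frame_h))
        return (frame_w - side) / 2, (frame_h - side) / 2, side
    xs = [x for x, _ in good]
    ys = [y for _, y in good]
    cx, cy = (min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2
    # pad the tight bounding box so the subject stays in view as it moves
    side = pad * max(max(xs) - min(xs), max(ys) - min(ys))
    side = min(side, frame_w, frame_h)   # never larger than the frame
    # clamp the square so it stays inside the frame
    x0 = min(max(cx - side / 2, 0.0), frame_w - side)
    y0 = min(max(cy - side / 2, 0.0), frame_h - side)
    return x0, y0, side
```

Keeping the crop square preserves the model's input aspect ratio, so keypoints aren't distorted when the crop is resized to the 192×192 or 256×256 input.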
While operating on a high-FPS camera stream, both Lightning and Thunder apply a non-linear filter to the incoming keypoint predictions. The tuned filter suppresses high-frequency noise and outliers from the model, smoothing keypoint visualizations with minimal lag in all conditions.
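The article doesn't name the filter, but an adaptive low-pass filter in the spirit of the One Euro filter matches the described behavior: strong smoothing when a keypoint is nearly still, light smoothing during fast motion so lag stays low. A minimal per-coordinate sketch, with illustrative (not MoveNet's tuned) constants:

```python
import math

class OneEuroFilter:
    """One-euro-style low-pass filter for a single keypoint coordinate.
    The cutoff frequency grows with the estimated speed, trading
    smoothness for responsiveness only when the signal moves fast."""

    def __init__(self, freq=30.0, min_cutoff=1.0, beta=0.05, d_cutoff=1.0):
        self.freq = freq              # frames per second of the stream
        self.min_cutoff = min_cutoff  # smoothing strength at rest
        self.beta = beta              # how much speed loosens the smoothing
        self.d_cutoff = d_cutoff      # cutoff for the derivative estimate
        self.prev_x = None
        self.prev_dx = 0.0

    def _alpha(self, cutoff):
        tau = 1.0 / (2.0 * math.pi * cutoff)
        return 1.0 / (1.0 + tau * self.freq)

    def __call__(self, x):
        if self.prev_x is None:       # first sample passes through unchanged
            self.prev_x = x
            return x
        # low-pass-filtered derivative of the signal
        dx = (x - self.prev_x) * self.freq
        a_d = self._alpha(self.d_cutoff)
        dx_hat = a_d * dx + (1 - a_d) * self.prev_dx
        # cutoff rises with speed -> less smoothing during fast motion
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff)
        x_hat = a * x + (1 - a) * self.prev_x
        self.prev_x, self.prev_dx = x_hat, dx_hat
        return x_hat
```

Because the cutoff rises with speed, small jitter around a stationary keypoint is heavily damped while a genuine fast movement is tracked closely, which matches the "least lag in all conditions" behavior described above.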