A new study by researchers at Facebook AI and the University of Notre Dame proposes a novel real-time six-degrees-of-freedom (6DoF) 3D face pose estimation technique, named img2pose, that works without face detection or landmark localization.
6DoF refers to the freedom of movement of a rigid body in 3D space along six independent axes. In addition to the yaw, pitch, and roll rotations captured by 3DoF, 6DoF face pose estimation adds the front/back, up/down, and left/right translation components. The proposed technique directly estimates the 6DoF 3D pose of every face, even in very crowded images, without a separate face detection step.
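Concretely, a 6DoF pose can be written as three rotation angles plus a 3D translation vector. The sketch below (a minimal illustration, not the paper's code) composes the three rotations into a rotation matrix; the axis conventions and sample values are assumptions.

```python
import numpy as np

def euler_to_rotation(yaw, pitch, roll):
    """Compose a 3x3 rotation matrix from yaw (y-axis), pitch (x-axis),
    and roll (z-axis) angles, given in radians."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])
    return Ry @ Rx @ Rz

# A 6DoF face pose: three rotation angles plus a 3D translation.
pose = {"rotation": (0.1, -0.05, 0.0), "translation": (0.0, 0.0, 50.0)}
R = euler_to_rotation(*pose["rotation"])
```

The translation part is what 6DoF adds over a rotation-only 3DoF pose: it places the face in 3D space relative to the camera.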
Current face processing pipelines consist of two steps. The first positions a bounding box around each face in the photo. The second is typically facial landmark detection, which localizes specific facial features such as the eye centers and the tip of the nose. This process works well for many face-based reasoning tasks but suffers from high compute costs, especially in SOTA models. Moreover, landmark detectors tend to be optimized for a specific face detector and must be re-tuned whenever that detector is updated.
The Notre Dame and Facebook researchers argue that estimating the 6DoF rigid transformation of a face is a simpler problem than facial landmark detection, and that a 6DoF pose also carries more information than a face bounding box label.
The proposed method estimates the 6DoF pose for each face in a given image, capturing both its 3D rotation and its 3D translation. Since the 6DoF face pose can be converted to an extrinsic camera matrix that projects a 3D face onto the 2D image plane, the predicted 3D face poses can be further used to obtain accurate 2D face bounding boxes. Face detection thus becomes a by-product of the process, with reduced computational overhead.
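This pose-to-box conversion can be sketched with a standard pinhole camera model. The intrinsic matrix values and the toy 3D face points below are illustrative assumptions, not the paper's actual calibration or face mesh.

```python
import numpy as np

def pose_to_extrinsic(R, t):
    """Stack rotation and translation into a 3x4 extrinsic matrix [R|t]."""
    return np.hstack([R, np.asarray(t, dtype=float).reshape(3, 1)])

def project_points(points_3d, K, extrinsic):
    """Project Nx3 points onto the image plane with a pinhole camera model."""
    homo = np.hstack([points_3d, np.ones((len(points_3d), 1))])
    cam = (K @ extrinsic @ homo.T).T
    return cam[:, :2] / cam[:, 2:3]  # divide by depth

def bbox_from_projection(points_2d):
    """Axis-aligned bounding box around the projected face vertices."""
    x_min, y_min = points_2d.min(axis=0)
    x_max, y_max = points_2d.max(axis=0)
    return x_min, y_min, x_max, y_max

# Toy example: a frontal face (identity rotation) 5 units from the camera,
# with assumed intrinsics (focal length 500, principal point at 320x240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
extrinsic = pose_to_extrinsic(np.eye(3), [0.0, 0.0, 5.0])
face_3d = np.array([[-0.1, -0.15, 0.0], [0.1, -0.15, 0.0],
                    [-0.1, 0.15, 0.0], [0.1, 0.15, 0.0]])
pts_2d = project_points(face_3d, K, extrinsic)
box = bbox_from_projection(pts_2d)  # -> (310.0, 225.0, 330.0, 255.0)
```

Because the box is derived from the projected 3D shape rather than predicted directly, detection falls out of the pose estimate for free.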
By replacing training for face bounding box detection with 6DoF pose estimation, all 3D face shapes in an input image can be aligned. Because the pose aligns a 3D shape of known geometry to a face region in the image, the generated face bounding boxes can be adjusted in size and shape to match specific research needs.
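One simple way to realize this adjustability is to rescale a projected box about its center. This helper is a hypothetical illustration, not a function from the paper.

```python
def expand_bbox(x_min, y_min, x_max, y_max, scale=1.2):
    """Grow (or shrink) a bounding box about its center, e.g. to include
    more of the forehead and chin for a particular downstream task."""
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    w, h = (x_max - x_min) * scale, (y_max - y_min) * scale
    return cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0

wider_box = expand_bbox(0.0, 0.0, 10.0, 10.0, scale=2.0)
# -> (-5.0, -5.0, 15.0, 15.0)
```

With landmark-derived boxes such post-hoc resizing is brittle; here it is principled, since the underlying 3D geometry is known.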
The img2pose model is built on a small and fast ResNet-18 backbone and is trained on the WIDER FACE training set with a mix of weakly supervised and human-annotated ground-truth pose labels. Two datasets, AFLW2000-3D and BIWI, were used to evaluate img2pose. It outperformed SOTA face pose estimators while running in real time, and surpassed models of comparable complexity on landmark detection despite not being optimized on bounding box labels.
The team believes the proposed direct multi-face approach is the first to estimate the 6DoF rigid transformation of 3D faces without face detection or facial landmark localization. The method is expected to improve accuracy in related tasks, such as object and keypoint detection, in the future.