Computer vision tasks have reached exceptional accuracy thanks to advances in machine learning models trained on photos. Building on these advances, 3D object understanding has great potential to power a broader range of applications, such as robotics, augmented reality, autonomy, and image retrieval.
In early 2020, Google released MediaPipe Objectron, a model designed for real-time 3D object detection on mobile devices. It was trained on a fully annotated, real-world 3D dataset and could predict an object’s 3D bounding box.
Still, understanding objects in 3D remained a significant challenge because large real-world datasets are scarce compared to those available for 2D tasks. To advance 3D object understanding, the research community needs object-centric video datasets that capture more of an object’s 3D structure and match the data formats used for common vision tasks, such as video or camera streams.
With this in mind, Google released the Objectron dataset, a collection of short, object-centric video clips capturing a broad set of familiar objects from multiple angles. Each video clip is accompanied by augmented reality session metadata that includes sparse point clouds and camera poses. The data contains manually annotated 3D bounding boxes that describe each object’s position, orientation, and dimensions. The dataset consists of about 15,000 annotated video clips and over 4 million annotated images, collected from a geo-diverse sample.
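An annotation of this form (position, orientation, and dimensions) determines the eight corners of the object’s 3D bounding box. The sketch below shows the geometry for the simplified case of a single yaw rotation; the field names are illustrative, not the dataset’s actual schema, and the real annotations use a full 3x3 rotation matrix.

```python
import math

def box_corners(center, dimensions, yaw):
    """Return the 8 corners of a 3D bounding box given its center
    (x, y, z), dimensions (width, height, depth), and a yaw rotation
    about the vertical axis. Illustrative only: Objectron annotations
    carry a full 3x3 rotation matrix, not a single yaw angle."""
    cx, cy, cz = center
    w, h, d = dimensions
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    for dx in (-w / 2, w / 2):
        for dy in (-h / 2, h / 2):
            for dz in (-d / 2, d / 2):
                # Rotate the corner offset about the vertical (y) axis,
                # then translate by the box center.
                rx = c * dx + s * dz
                rz = -s * dx + c * dz
                corners.append((cx + rx, cy + dy, cz + rz))
    return corners

# Axis-aligned unit-ish box centered at the origin: yaw = 0 leaves the
# offsets unrotated, so the corners are just center +/- half-extents.
corners = box_corners((0.0, 0.0, 0.0), (2.0, 2.0, 2.0), 0.0)
```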
A 3D Object Detection Solution
Along with the dataset, Google also shared a 3D object detection solution for four categories of objects: shoes, chairs, mugs, and cameras. The models are released in MediaPipe, Google’s open-source framework for customizable machine learning solutions for live and streaming media. MediaPipe also powers machine learning solutions such as on-device, real-time hand, iris, and body pose tracking.
These new versions use a two-stage architecture:
- The first stage employs a TensorFlow object detection model to find the 2D crop of the object.
- The second stage uses the cropped image from stage 1 to estimate the object’s 3D bounding box while simultaneously computing the object’s 2D crop for the next frame, so the detector does not need to run on every frame.
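The two-stage flow above can be sketched as follows. The `detect_2d` and `lift_to_3d` functions are hypothetical stand-ins for the two model stages, not MediaPipe’s actual API; the point is the control flow, where the heavy detector runs once and stage 2’s predicted crop carries tracking forward.

```python
def detect_2d(frame):
    """Stage 1 stand-in: a 2D object detector that returns a crop
    region. In MediaPipe this role is played by a TensorFlow object
    detection model."""
    # Hypothetical fixed crop: (x, y, width, height) in pixels.
    return (10, 20, 100, 100)

def lift_to_3d(frame, crop):
    """Stage 2 stand-in: estimate the 3D bounding box from the 2D
    crop. Returns the estimated box plus a crop for the next frame,
    so stage 1 need not run again."""
    # Hypothetical output: 9 keypoints (box center + 8 corners).
    box = [(0.5, 0.5)] + [(0.0, 0.0)] * 8
    next_crop = crop  # simplified: reuse the same region
    return box, next_crop

def track(frames):
    """Run the detector only when no crop is available; afterwards the
    crop predicted by stage 2 keeps the pipeline running in real time."""
    crop = None
    boxes = []
    for frame in frames:
        if crop is None:
            crop = detect_2d(frame)  # heavy detector, run sparingly
        box, crop = lift_to_3d(frame, crop)
        boxes.append(box)
    return boxes
```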