Google AI Releases ‘Objectron Dataset’ Consisting Of 15,000 Annotated Videos And 4M Annotated Images

Computer vision tasks have reached exceptional accuracy with recent advances in machine learning models trained on photos. Building on these advances, 3D object understanding holds great potential to power a broader range of applications, such as robotics, augmented reality, autonomy, and image retrieval.

In early 2020, Google released MediaPipe Objectron, a model designed for real-time 3D object detection on mobile devices. The model was trained on a fully annotated, real-world 3D dataset and could predict objects' 3D bounding boxes.

Still, understanding objects in 3D remained a major challenge due to the lack of large real-world datasets compared with 2D tasks. There is a strong need for object-centric video datasets to empower the research community to advance 3D object understanding: such datasets must capture more of an object's 3D structure and match the data format used for many vision tasks, such as video or camera streams.

Keeping the above in mind, Google released the Objectron dataset, a collection of short, object-centric video clips capturing a broader set of common objects from multiple angles. Each video clip is accompanied by augmented-reality session metadata that includes sparse point clouds and camera poses. The data also contains manually annotated 3D bounding boxes that describe each object's position, orientation, and dimensions. The dataset consists of about 15,000 annotated video clips with over 4 million annotated images, collected from a geo-diverse sample.
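An annotation of this form (position, orientation, dimensions) fully determines an oriented 3D box. As a minimal sketch, the snippet below computes the eight corner points of such a box with NumPy; the center, rotation, and scale values are hypothetical, not taken from the dataset itself.

```python
import numpy as np

def box_corners(center, rotation, scale):
    """Return the 8 corners of an oriented 3D bounding box.

    center:   (3,) object position
    rotation: (3, 3) rotation matrix giving the object's orientation
    scale:    (3,) box dimensions (width, height, depth)
    """
    # Unit-cube corners centered at the origin: every combination of +/-0.5
    unit = np.array([[x, y, z] for x in (-0.5, 0.5)
                               for y in (-0.5, 0.5)
                               for z in (-0.5, 0.5)])
    # Scale to the box dimensions, rotate, then translate to the center
    return unit * scale @ rotation.T + center

# Hypothetical annotation values for illustration
center = np.array([0.1, -0.2, 1.5])
theta = np.deg2rad(30.0)
rotation = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                     [0.0,           1.0, 0.0],
                     [-np.sin(theta), 0.0, np.cos(theta)]])  # yaw rotation
scale = np.array([0.2, 0.3, 0.2])

corners = box_corners(center, rotation, scale)
print(corners.shape)  # (8, 3)
```

The projection of these eight corners into each video frame, using the accompanying camera poses, is what yields the 2D box annotations seen in the images.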

A 3D Object Detection Solution

Along with the dataset, Google also shared a 3D object detection solution for four categories of objects: shoes, chairs, mugs, and cameras. The models are released in MediaPipe, Google's open-source framework for customizable machine-learning solutions for live and streaming media. MediaPipe also powers solutions such as on-device real-time hand, iris, and body-pose tracking.

These new releases use a two-stage architecture.

  • The first stage employs a TensorFlow object-detection model to locate the object and compute its 2D crop.
  • The second stage uses the cropped image from stage 1 to estimate the 3D bounding box, while simultaneously computing the object's 2D crop for the next frame, so the detector does not need to run on every frame.
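The two-stage loop can be sketched as follows. Note that `detect_2d_crop` and `estimate_3d_box` are hypothetical stand-ins for the TensorFlow detector and the 3D lifting model, not actual MediaPipe APIs; the point is the control flow, where stage 1 runs only when no tracked crop is available.

```python
import numpy as np

def detect_2d_crop(frame):
    """Stage 1 stand-in: a 2D object detector returning a crop region.

    A real pipeline would run a TensorFlow object-detection model here;
    this stub returns a fixed (x, y, width, height) box in pixels.
    """
    return (40, 30, 128, 128)

def estimate_3d_box(crop):
    """Stage 2 stand-in: lift the cropped image to a 3D bounding box.

    Returns dummy 8x3 corner coordinates plus a crop region for the
    next frame, so the 2D detector need not run on every frame.
    """
    corners_3d = np.zeros((8, 3))   # placeholder 3D box corners
    next_crop = (40, 30, 128, 128)  # tracked region for the next frame
    return corners_3d, next_crop

def run_pipeline(frames):
    """Run the two-stage loop: detect once, then track via stage 2."""
    boxes = []
    crop_region = None
    for frame in frames:
        if crop_region is None:
            crop_region = detect_2d_crop(frame)  # stage 1 (first frame only)
        x, y, w, h = crop_region
        crop = frame[y:y + h, x:x + w]
        corners_3d, crop_region = estimate_3d_box(crop)  # stage 2
        boxes.append(corners_3d)
    return boxes

frames = [np.zeros((480, 640, 3), dtype=np.uint8) for _ in range(3)]
boxes = run_pipeline(frames)
print(len(boxes))  # 3
```

This split keeps the heavy 2D detector off the per-frame critical path, which is what makes real-time mobile inference feasible.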

Consultant Intern: He is currently pursuing the third year of his B.Tech in Mechanical Engineering at the Indian Institute of Technology (IIT), Goa. He is motivated by his vision to bring remarkable changes to society through his knowledge and experience. A machine learning enthusiast with a keen interest in robotics, he tries to stay up to date with the latest advancements in artificial intelligence and deep learning.
