Understanding indoor 3D scenes is becoming increasingly important in augmented reality, robotics, photography, games, and real estate. Many state-of-the-art scene interpretation algorithms are now driven by modern machine learning approaches. Depth estimation, 3D reconstruction, instance segmentation, object detection, and other methods address distinct aspects of the problem.
The majority of these studies are made possible by a range of real and synthetic RGB-D datasets released in recent years. Even though commercially available RGB-D sensors, such as the Microsoft Kinect, have made collecting such datasets possible, capturing data at significant scale with ground truth remains a challenge.
Furthermore, practically all earlier datasets, such as SUN RGB-D or ScanNet, were gathered with capture devices that are no longer representative of today's technology. The resulting lack of diversity in data, combined with the gap in depth-sensing technology, makes it difficult to bring the groundbreaking research of the last decade into day-to-day applications.
Apple has announced iPads and iPhones that include a LiDAR scanner, ushering in a new era of depth sensor availability and accessibility. This is the first large-scale dataset acquired with Apple's LiDAR scanner on mobile devices, and the largest RGB-D dataset, in terms of number of sequences and scene diversity, gathered in people's homes. It helps bridge the domain gap between existing datasets and widely available mobile depth sensors.
The collection, dubbed ARKitScenes, contains 5,048 RGB-D sequences, more than three times the size of the largest indoor dataset previously available. These sequences cover 1,661 distinct scenes. For every sequence, the dataset also provides estimated ARKit camera poses as well as the LiDAR scanner-based ARKit scene reconstruction. Beyond this raw and processed data, the dataset offers high-quality ground truth and illustrates its utility in two downstream supervised learning tasks: 3D object detection and color-guided depth upsampling. For 3D object detection, ARKitScenes is the largest RGB-D dataset annotated with oriented 3D bounding boxes for 17 room-defining furniture categories.
In addition, ARKitScenes takes advantage of high-resolution ground truth scene geometry captured with a professional stationary laser scanner (Faro Focus S70). The high-quality laser scans are registered to the mobile RGB-D frames captured with an iPad Pro using a novel technique. This is the first dataset to provide high-quality ground truth depth registered to frames from a widely used depth sensor.
The researchers collected data using two essential devices: the 2020 iPad Pro and the Faro Focus S70. ARKit is used to collect several sensor outputs from the 2020 iPad Pro, including IMU, RGB (for both the Wide and Ultra Wide cameras), and the dense depth map from the LiDAR scanner. This data was gathered using the official ARKit SDK. The data collection app runs ARKit world tracking and scene reconstruction during the capture.
This provides direct feedback on tracking robustness and reconstruction quality to the operators, who are not computer vision experts. In addition to the handheld iPad Pro, the team used a Faro Focus S70 stationary laser scanner on a tripod to acquire high-resolution XYZRGB point clouds of the area.
Real-world residences, rented for a full day, served as data collection venues. The homeowners consented to making this data public in order to aid the study and development of indoor 3D scene understanding. Before the captures began, the operator was instructed to remove any personally identifiable information. Data was collected in three major European cities: London, Newcastle, and Warsaw.
When choosing residences for data collection, the team considered two factors: the household's socioeconomic status (SES) and the location of the property in the city. The houses in the dataset come from rural, suburban, and urban areas in each of the cities named, and homes from all three SES groups were included: low, medium, and high.
After a house is selected for data collection, it is separated into several scenes, and the following steps are carried out. First, precise XYZRGB point clouds of the area are captured using the Faro Focus S70 stationary laser scanner mounted on a tripod. Tripod positions are chosen to maximize surface coverage; on average, four laser scans are recorded per room. Second, up to three video sequences are shot with the iPad Pro in an attempt to capture all surfaces in each room.
The team tries to keep the environment entirely static during the data collection process, ensuring that no items move or change their appearance. However, because data gathering for a venue takes an average of six hours and many venues are lit by sunlight, the lighting situation can change during that period, potentially resulting in inconsistencies in illumination between sequences and scans.
In a one-time offline step, all XYZRGB point clouds from the stationary laser scanner are spatially registered into a common coordinate system using the proprietary software Faro Scene, which for most scenes fully automatically estimates a 6DoF rigid body transformation for each scan, transforming it into a common venue coordinate system. A single venue (typically a house or apartment) can contain multiple distinct scenes.
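Applying such a 6DoF rigid body transform is a simple linear map. The sketch below (plain NumPy, not the Faro Scene internals) illustrates how a scan's points would be mapped into the shared venue coordinate system; the function name and example values are assumptions for illustration:

```python
import numpy as np

def apply_rigid_transform(points, R, t):
    """Apply a 6DoF rigid body transform (rotation R, translation t)
    to an (N, 3) point cloud, mapping scan coordinates into a
    common venue coordinate system."""
    return points @ R.T + t

# Example: rotate 90 degrees about the Z axis and shift 2 m along X.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
t = np.array([2.0, 0.0, 0.0])

scan = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0]])
venue = apply_rigid_transform(scan, R, t)
print(venue)  # [[2. 1. 0.], [1. 0. 0.]]
```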
The method for computing the ground truth 6DoF pose of the iPad Pro's RGB cameras with respect to the venue coordinate system requires rendering synthetic views from the laser scan of the venue. Rendering these XYZRGB point clouds from novel viewpoints presents its own set of challenges: far geometry must be correctly occluded by near geometry, and geometry that cannot be guaranteed to have a direct line of sight from the novel viewpoint must be rejected.
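A minimal way to satisfy the occlusion requirement is z-buffering: when several points project to the same pixel, keep only the nearest one. The sketch below is a simplified illustration of that idea under a pinhole-camera assumption; it omits the line-of-sight rejection the method requires, and the function name and setup are assumptions:

```python
import numpy as np

def render_point_cloud(points, colors, K, width, height):
    """Splat an (N, 3) point cloud with (N, 3) colors through a
    pinhole camera with intrinsics K, keeping only the nearest
    point per pixel so near geometry occludes far geometry."""
    depth = np.full((height, width), np.inf)
    image = np.zeros((height, width, 3))
    mask = points[:, 2] > 0           # keep points in front of the camera
    pts, cols = points[mask], colors[mask]
    proj = (K @ pts.T).T
    u = (proj[:, 0] / proj[:, 2]).astype(int)
    v = (proj[:, 1] / proj[:, 2]).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    for x, y, z, c in zip(u[inside], v[inside], pts[inside, 2], cols[inside]):
        if z < depth[y, x]:           # nearest point wins the pixel
            depth[y, x] = z
            image[y, x] = c
    return image, depth
```

For example, two points on the same camera ray end up on the same pixel, and the closer one determines both the stored depth and the pixel color.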
The team manually annotates oriented 3D bounding boxes for 17 categories of room-defining furniture using a custom tool. Annotation is performed on the ARKit scene reconstruction, which yields a colored scene mesh. The labeling tool also lets annotators view real-time projections of the 3D bounding boxes onto video frames, allowing for more accurate annotation.
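Projecting an oriented 3D box onto a video frame reduces to projecting its eight corners with the camera pose and intrinsics. The sketch below shows this under a standard pinhole model; the helper names and camera convention are assumptions, not details of the paper's tool:

```python
import numpy as np

def box_corners(center, size, R):
    """Eight corners of an oriented 3D bounding box given its center,
    size (extents along each axis), and rotation R, in world coordinates."""
    half = np.asarray(size) / 2.0
    signs = np.array([[sx, sy, sz] for sx in (-1, 1)
                      for sy in (-1, 1) for sz in (-1, 1)])
    return (signs * half) @ R.T + np.asarray(center)

def project_points(pts_world, R_cam, t_cam, K):
    """Project world points into pixel coordinates given camera
    extrinsics (R_cam, t_cam) and intrinsics K."""
    pts_cam = pts_world @ R_cam.T + t_cam
    proj = (K @ pts_cam.T).T
    return proj[:, :2] / proj[:, 2:3]
```

Drawing the projected corners (and edges between them) over each frame is what gives annotators immediate visual feedback on box placement.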
The ARKitScenes venues are divided into three groups: 80% for training, 10% for validation, and 10% for a held-out test set that will not be released. The 5,048 released sequences belong to the training and validation sets. Because the split is decided per venue, all laser scans and iPad sequences from a venue fall into the same split.
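A venue-grouped split like this can be sketched as follows. This is a hypothetical helper (the actual split procedure is not described in detail); `sequences` is assumed to map a sequence ID to its venue ID:

```python
import random

def split_by_venue(sequences, train=0.8, val=0.1, seed=0):
    """Assign each sequence to train/val/test so that all sequences
    from one venue land in the same split."""
    venues = sorted(set(sequences.values()))
    rng = random.Random(seed)
    rng.shuffle(venues)
    n_train = int(len(venues) * train)
    n_val = int(len(venues) * val)
    groups = {v: "train" for v in venues[:n_train]}
    groups.update({v: "val" for v in venues[n_train:n_train + n_val]})
    groups.update({v: "test" for v in venues[n_train + n_val:]})
    return {seq: groups[venue] for seq, venue in sequences.items()}
```

Splitting at the venue level rather than the sequence level prevents near-duplicate views of the same rooms from leaking between training and evaluation.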
The researchers subsampled the dataset by keeping a single frame every two seconds, with the goal of improving run-time while preserving significant variation between frames. As a result, they used 39k frames from the train split to train the models and 5.6k frames from the validation split to evaluate them. The validation split was further filtered manually to include only frames free of depth aggressors that are difficult to detect automatically, such as specular or translucent objects. The train and validation splits were obtained from separate residences.
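Keeping one frame every two seconds can be sketched as a single pass over the frame timestamps. The function name and data layout below are assumptions, not the paper's code:

```python
def subsample_frames(timestamps, interval=2.0):
    """Return indices of frames to keep: a frame is kept only if at
    least `interval` seconds have passed since the last kept frame.
    `timestamps` is a list of capture times in seconds, sorted ascending."""
    kept, last = [], None
    for i, t in enumerate(timestamps):
        if last is None or t - last >= interval:
            kept.append(i)
            last = t
    return kept

# A 30 fps sequence thinned this way keeps roughly one frame in sixty.
print(subsample_frames([0.0, 0.5, 1.0, 2.0, 3.9, 4.0]))  # [0, 3, 5]
```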
ARKitScenes is the largest indoor RGB-D dataset ever collected with a mobile device, as well as the first collection captured with Apple's LiDAR scanner. The researchers demonstrated how the dataset can be used for two downstream computer vision applications: 3D object detection and color-guided depth upsampling. Thanks to ARKitScenes, the research community will be able to push the boundaries of the current state of the art and build solutions that generalize better to real-world conditions.