Stony Brook University Researchers Introduce ‘StandardSim’: A Large-Scale Photorealistic Synthetic Dataset Featuring Annotations For Semantic Segmentation, Instance Segmentation, Depth Estimation, And Object Detection

Autonomous checkout is a rapidly advancing technology that has the potential to transform the way people shop in physical stores. It typically uses cameras and other sensors to perceive a shopping environment and determine what a shopper has purchased. In systems where cameras are the only sensors available, computer vision is critical for interpreting this data. Because vision-only autonomous checkout is still a relatively new concept, no benchmarks or dedicated tasks have yet emerged for it.

In a recent study, researchers from Stony Brook University and Standard Cognition hypothesize that understanding retail settings requires not only domain-specific data but also a new computer vision task: detecting changes in retail scenes over time. They therefore introduce a new dataset, StandardSim, along with this novel change detection task.

While prior work has addressed item detection in retail environments by introducing the dense object detection task and the SKU-110K dataset, it does not provide semantic descriptions of objects beyond bounding boxes. Other large-scale synthetic datasets have been built for indoor spaces, but they do not transfer well to retail because their objects are not as varied or as densely packed.

Furthermore, many datasets do not include a variety of viewpoints, with scenes being depicted from the perspective of a human navigating the scene rather than from the ceiling or a corner. As a result, when models trained on these datasets are applied to real-world retail situations, they lack the knowledge required to recognize small objects from the perspective of ceiling cameras.

The change detection task is designed to mimic a shopper’s behavior in a retail environment by giving a model two images of a scene, one before and one after a sequence of interactions with the scene’s objects. When a shopper interacts with objects, they may be taken, added, or shifted. Each image pair in the dataset depicts a random interaction and is annotated in the same way as a segmentation task, with each pixel labeled as take, put, shift, or no change.
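To make the annotation format concrete, here is a minimal sketch of a per-pixel change label. The four class names come from the description above; the integer IDs, the toy 4x6 grid, and the helper name `class_fractions` are assumptions made for illustration and are not the dataset's actual encoding.

```python
from collections import Counter

# Hypothetical class IDs; the dataset's real encoding may differ.
NO_CHANGE, TAKE, PUT, SHIFT = 0, 1, 2, 3
CLASS_NAMES = {NO_CHANGE: "no change", TAKE: "take", PUT: "put", SHIFT: "shift"}

# A toy 4x6 label map for one before/after image pair.
# Most pixels are unchanged; one item was taken and one was put.
label = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 0],
    [0, 0, 0, 0, 0, 0],
]

def class_fractions(label_map):
    """Return the fraction of pixels assigned to each change class."""
    counts = Counter(pixel for row in label_map for pixel in row)
    total = sum(counts.values())
    return {CLASS_NAMES[c]: counts.get(c, 0) / total for c in CLASS_NAMES}

fractions = class_fractions(label)
for name, frac in fractions.items():
    print(f"{name}: {frac:.2%}")
```

Even in this toy example most pixels fall into the no-change class, which hints at why sparse changes make the task hard for a segmentation-style model.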

Objects are often small, and changes are sparse. Because the task resembles segmentation, the researchers adapt a popular state-of-the-art segmentation model, DeepLabv3, to set a baseline for the task, and show that StandardSim is a very challenging benchmark because of the sparse changes and small object sizes.

With these issues in mind, the team presents StandardSim, a large-scale synthetic dataset built from highly accurate store models that includes annotations for depth estimation, object detection, instance segmentation, and a novel task called change detection. The dataset contains over 25,000 images from 2,134 different scenes. Each scene includes multiple camera views, allowing for multi-view reconstruction and shape estimation. StandardSim provides annotations for more tasks than earlier datasets, filling a gap in the retail domain.

In addition to change detection, the team focuses on monocular depth estimation, since depth provides key cues about object movement over time. They use StandardSim to test the performance of the state-of-the-art monocular depth estimation model Dense Prediction Transformer, which is based on MiDaS. The study finds that the model’s error is significantly larger on StandardSim, implying that the dataset is a demanding new benchmark for monocular depth estimation.


The dataset has more images and annotations for more tasks than existing datasets for retail and change detection. Other large-scale synthetic datasets may have more images, but they lack change detection annotations and samples structured for change detection. StandardSim also offers scenes that are better suited to autonomous checkout.

Blender is at the heart of the team’s data generation pipeline. The pipeline uses Blender’s Python interface to dynamically adjust the store, the products, and the camera positions. Photorealistic RGB, RGB with randomized textures, RGB with blank textures, z-depth, segmentation masks, and surface normals are all rendered with Blender’s Cycles engine.

Each store model is a faithful reproduction of a real retail establishment. To build these replica assets, the team uses a lidar-based Matterport device to take a 3D scan of the store, then has an in-house asset creator model the store in Blender using the 3D scan as a reference. This ensures that the 3D model’s size and structure are highly accurate and that it has the high-quality textures and meshes required for photorealistic rendering from multiple perspectives. The store model labels which meshes are shelves, but all of the shelves are left empty.

The team chose a popular semantic segmentation model, DeepLabv3, to benchmark on the dataset, since change detection requires a pixel-level understanding of where changes occur in a scene. A ResNet-50 backbone was chosen because it offers a fair balance of accuracy and computational efficiency. They initialize the encoder with COCO pre-trained weights and fine-tune the entire network on the training set until it converges.

The team applies several data augmentations tailored to the change detection task. First, they randomly reverse the change order, that is, they swap the before and after images while keeping the same label. Second, they randomly flip images and labels left to right. Third, to mimic real camera images, they add random noise to pixel values. Finally, they simulate lighting variations that could occur in real-time camera feeds.
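The augmentations above can be sketched roughly as follows. This is a dependency-free illustration, not the authors' code: plain nested lists stand in for grayscale images, and the function name `augment_pair`, the 50 percent probabilities, the Gaussian noise model, and the brightness range used for lighting are all assumptions.

```python
import random

def augment_pair(before, after, label, rng, noise_std=2.0):
    """Augment one change detection sample (hypothetical helper).

    `before` and `after` are grayscale images given as nested lists of
    pixel values in [0, 255]; `label` is a per-pixel class map of equal size.
    """
    # 1. Randomly swap the before and after images; the label is kept
    #    unchanged, following the description in the text.
    if rng.random() < 0.5:
        before, after = after, before

    # 2. Randomly flip both images and the label left to right together,
    #    so pixels and annotations stay aligned.
    if rng.random() < 0.5:
        before = [row[::-1] for row in before]
        after = [row[::-1] for row in after]
        label = [row[::-1] for row in label]

    def perturb(img):
        # 3. Lighting variation, assumed here to be a global brightness scale.
        gain = rng.uniform(0.8, 1.2)
        # 4. Per-pixel Gaussian noise (assumed noise model), clamped to [0, 255].
        return [
            [min(255.0, max(0.0, p * gain + rng.gauss(0.0, noise_std))) for p in row]
            for row in img
        ]

    return perturb(before), perturb(after), label

# Usage on a toy 2x3 sample where one pixel changed (class 1).
rng = random.Random(0)
before = [[10, 20, 30], [40, 50, 60]]
after = [[10, 20, 30], [40, 0, 60]]
label = [[0, 0, 0], [0, 1, 0]]
aug_before, aug_after, aug_label = augment_pair(before, after, label, rng)
```

Noise and lighting are applied to each image independently here, so the two views of an unchanged pixel no longer match exactly, which is the kind of nuisance variation a change detector has to ignore.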

The final model obtains 36.15 percent IoU on the validation set and 36.04 percent IoU on the test set, demonstrating that change detection, like fine parts segmentation, is a challenging task. According to a per-class breakdown of IoU, the model performs best on the put class and struggles the most on the shift class. The consistency of the results between the test and validation sets also suggests that the test set is not biased relative to the validation set.
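For readers unfamiliar with the metric, the following is a minimal sketch of per-class IoU (intersection over union) on per-pixel class maps. The class IDs and toy maps are assumptions, and the article does not state exactly how the authors aggregate per-class scores into the reported figure; averaging over the classes that appear is shown here as one common choice.

```python
def per_class_iou(pred, target, num_classes):
    """Return a list with the IoU of each class, or None if the class
    appears in neither the prediction nor the target."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                # Pixel counts once toward both intersection and union.
                intersection[p] += 1
                union[p] += 1
            else:
                # A mismatch adds to the union of both classes involved.
                union[p] += 1
                union[t] += 1
    return [i / u if u else None for i, u in zip(intersection, union)]

# Toy 2x4 maps with hypothetical IDs: 0 = no change, 1 = take, 2 = put, 3 = shift.
target = [[0, 0, 1, 1],
          [0, 2, 2, 0]]
pred = [[0, 1, 1, 1],
        [0, 2, 0, 0]]

ious = per_class_iou(pred, target, num_classes=4)
present = [v for v in ious if v is not None]
mean_iou = sum(present) / len(present)
print(ious, mean_iou)
```

Because small objects cover few pixels, a handful of mislabeled pixels can move a class's IoU substantially, which is consistent with the low scores reported for this task.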


The authors of this paper examine why, despite significant breakthroughs in computer vision, progress in autonomous checkout systems has been modest. They attribute the slow progress to the absence of retail datasets, and they propose StandardSim, a large-scale open synthetic dataset with annotations for a range of computer vision tasks. They use the dataset to evaluate state-of-the-art models for change detection and monocular depth estimation and compare the results with those on other datasets. They find that StandardSim’s retail domain is distinct and difficult in comparison to other datasets, and they identify other applications for it.