Microsoft AI Research Releases ‘ORBIT’ Dataset: A Real-World Few-Shot Dataset for Teachable Object Recognition


Object recognition algorithms have come a long way in recent years, but they still require training datasets containing thousands of high-quality, annotated examples for every object category.

Few-shot learning addresses this huge demand for datasets by training models to recognize entirely new things from only a few examples. Meta-learning algorithms, in particular, which ‘learn to learn’ utilizing episodic training, can potentially reduce the number of training examples required to train a model. The majority of few-shot learning research, on the other hand, has been driven by benchmark datasets that lack the substantial variability that applications confront when deployed in the real world.

To bridge this gap, Microsoft researchers, in collaboration with the City, University of London, present the ORBIT dataset and few-shot benchmark that helps learn new objects from a small number of high-variation samples. This new benchmark dataset includes a total of 2,687,934 frames, containing 3,822 videos of 486 objects captured on mobile phones by 77 people who are blind or have low vision.

The dataset and benchmark establish a new baseline for testing ML models in few-shot, high-variation learning scenarios, which will aid in the development of models that perform better in real-world scenarios.

The ORBIT dataset and benchmark are based on teachable object recognizers, which are real-world applications for blind and low-vision people. By taking just a few short films of objects meaningful to them, a person may teach a system to recognize them. Later, these videos are used to train a personalized object recognizer. This would allow a blind person to teach an object recognizer their house keys or a favorite outfit and then have the phone recognize them. Typical object recognizers are unable to identify such items since they are not included in popular object recognition training datasets.


Teachable object identification is an excellent example of a scenario with few shots and a lot of variety. People can only film a few short videos to “teach” a new object; therefore, it’s few-shot. Because each person only has a few things, the videos they shoot of these objects will differ in quality, blur, object centrality, and other aspects.

Unlike traditional computer vision benchmarks, the teachable object recognition benchmark measures performance depending on human input. In simple words, trained ML models are given only those items and videos which are associated with a single user.

Its performance is measured by how effectively it recognizes those objects. This procedure is carried out for each test user in a group. The outcome includes a set of measurements that more accurately reflect how well a teachable object recognizer would perform in the actual world for a single user.

The researchers used widely recognized few-shot learning algorithms to examine their datasets. They explain that even while few-shot models have reached saturation on existing few-shot benchmarks, they only achieve 50-55 percent accuracy on the teachable object recognition benchmark. Furthermore, there is a high variation across users. These findings highlight the need for algorithms to be more resistant to high-variation (or “noisy”) data.

Beyond object identification, creating teachable object recognizers poses issues for machine learning. Uncertainty quantification is a machine learning technique that can help with this problem. Furthermore, the difficulties in developing teachable object recognition systems extend beyond ML algorithmic advancements, making this a topic fit for multi-disciplinary collaboration.

The researchers hope that their contributions will help shape the next generation of recognition tools for the blind and low-vision communities and increase the resilience of computer vision systems in a variety of other applications.