Top Open-Source Computer Vision Datasets in 2022

Computer vision is a branch of artificial intelligence (AI) that enables computers and systems to extract usable information from digital images, videos, and other visual inputs, and to take action or make recommendations based on that information. Just as artificial intelligence gives computers the ability to think, computer vision gives them the ability to see, observe, and understand.

Human vision has a head start over machine vision because it has been around longer. Over a lifetime, people learn how to tell objects apart, judge how far away they are, tell whether they are moving, and notice whether something in an image is wrong.

Instead of retinas, optic nerves, and a visual cortex, computers are trained to perform equivalent tasks with cameras, data, and algorithms, and in far less time. A system trained to inspect products or monitor a production asset can evaluate thousands of products or processes per minute and catch defects too subtle for a person to notice, so it can quickly outperform people.

Computer vision is used in the energy, utility, manufacturing, and automotive industries, and the market continues to grow.

The top open-source datasets for computer vision projects are listed below:

ImageNet

ImageNet is an image dataset organized according to the WordNet hierarchy. WordNet contains more than 100,000 synsets, over 80,000 (about 80%) of which are nouns. ImageNet aims to provide an average of 1,000 images for each synset. The project was motivated by two needs in computer vision research: more data to enable machine learning methods that generalize, and a clearly defined North Star problem for the field.


IMDB-Wiki

One of the largest open-source datasets of face images for training, with annotations for gender and age. It contains 523,051 face photos, of which 460,723 were sourced from IMDb (covering 20,284 celebrities) and 62,328 from Wikipedia.

MS Coco

It is a large-scale dataset for object detection, segmentation, and captioning. It has 330K images (more than 200K labeled), 1.5 million object instances, 80 object categories, 91 stuff categories, 5 captions per image, and 250,000 people with keypoints.


Flickr 30k

It is a collection of images with sentence-based descriptions, used for image description and search. It comprises about 30,000 photographs, each with five captions that clearly describe the principal entities and events. The pictures were manually chosen from six different Flickr groups and tend not to feature famous people or places.

Berkeley DeepDrive

Berkeley DeepDrive serves as a foundational dataset for learning across different tasks. It features more than 50k rides and 100k driving videos. Each video is 40 seconds long at 30 fps. It covers a variety of scene types, such as city streets, residential areas, and highways, in various weather conditions and at different times of day. It can be used for lane detection, object detection, semantic segmentation, instance segmentation, multi-object tracking, and more.
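As a rough back-of-the-envelope sketch, the video figures quoted above imply the frame counts below. The constant and function names are made up for illustration, and the totals assume every video is exactly 40 seconds long at exactly 30 fps, which real recordings may not be.

```python
# Figures quoted in the article; the names are illustrative only.
VIDEO_SECONDS = 40
FRAMES_PER_SECOND = 30
NUM_VIDEOS = 100_000


def frames_per_video(seconds=VIDEO_SECONDS, fps=FRAMES_PER_SECOND):
    """Number of frames in one clip, assuming a constant frame rate."""
    return seconds * fps


def total_frames(num_videos=NUM_VIDEOS):
    """Upper-level estimate of frames across all clips (assumes uniform clips)."""
    return num_videos * frames_per_video()


if __name__ == "__main__":
    print(frames_per_video())  # 1200
    print(total_frames())      # 120000000
```

An estimate like this is mainly useful for planning storage and for deciding how sparsely to sample frames before training.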


LSUN

The Large-scale Scene Understanding (LSUN) classification dataset contains ten scene categories, including bedrooms, kitchens, outdoor churches, and dining rooms. Each category contains a large number of training images, ranging from about 120,000 to 3,000,000.

For each category, the validation data contains 300 images and the test data contains 1,000.

MPII Human Pose

The dataset includes around 25K images containing over 40K people with annotated body joints. The images were extracted from YouTube videos, and each is provided with its preceding and following unannotated frames. They were collected using an established taxonomy of everyday human activities. The dataset covers 410 human activities, and each image has an activity label.

CIFAR-10 & CIFAR-100

The CIFAR-10 dataset has 60,000 32×32 color images divided into 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images. CIFAR-100 is similar, but it has 100 classes with 600 images each.

The CIFAR-10 dataset is divided into five training batches and one test batch, each with 10,000 images. The test batch contains exactly 1,000 randomly selected images from each class. The training batches contain the remaining images in random order, so an individual training batch may contain more images from one class than another. Taken together, the training batches contain exactly 5,000 images from each class.
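The batch layout described above can be simulated with a few lines of Python. This is only a sketch that works on class labels, not on the real image files, and the function and constant names are invented for illustration. It shows why the per-class totals are exact even though a single training batch can be uneven.

```python
import random
from collections import Counter

# Figures from the article; names are illustrative.
NUM_CLASSES = 10
IMAGES_PER_CLASS = 6_000
TEST_PER_CLASS = 1_000
BATCH_SIZE = 10_000


def simulate_cifar10_split(seed=0):
    """Return (training_batches, test_batch) as lists of class labels."""
    rng = random.Random(seed)
    test_batch = []
    remaining = []
    for label in range(NUM_CLASSES):
        # Each class contributes exactly 1,000 labels to the test batch...
        test_batch.extend([label] * TEST_PER_CLASS)
        # ...and its other 5,000 labels go into the training pool.
        remaining.extend([label] * (IMAGES_PER_CLASS - TEST_PER_CLASS))
    # Random order: nothing forces a single batch to be class-balanced.
    rng.shuffle(remaining)
    training_batches = [
        remaining[i:i + BATCH_SIZE]
        for i in range(0, len(remaining), BATCH_SIZE)
    ]
    return training_batches, test_batch


training_batches, test_batch = simulate_cifar10_split()
per_batch_counts = [Counter(batch) for batch in training_batches]
overall_counts = sum(per_batch_counts, Counter())

if __name__ == "__main__":
    for index, counts in enumerate(per_batch_counts, start=1):
        print(f"batch {index}: {sorted(counts.items())}")
    print("all training batches:", sorted(overall_counts.items()))
```

Printing the per-batch counts shows how far individual batches drift from an even split, while the combined counts always come back to 5,000 per class.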

The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a “fine” label (the class to which it belongs) and a “coarse” label (the superclass to which it belongs).
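A minimal sketch of the two-level labeling follows. It covers only two superclasses as an illustrative subset, the helper name is hypothetical, and the class names should be checked against the dataset's own label files before relying on them.

```python
# Illustrative subset: two of the 20 superclasses with their five fine classes.
COARSE_TO_FINE = {
    "fish": ["aquarium fish", "flatfish", "ray", "shark", "trout"],
    "flowers": ["orchids", "poppies", "roses", "sunflowers", "tulips"],
}

# Invert the mapping so each fine label points to its superclass.
FINE_TO_COARSE = {
    fine: coarse
    for coarse, fines in COARSE_TO_FINE.items()
    for fine in fines
}


def labels_for(fine_label):
    """Return the (fine, coarse) label pair for an image of the given class."""
    return fine_label, FINE_TO_COARSE[fine_label]


if __name__ == "__main__":
    print(labels_for("shark"))
    print(labels_for("tulips"))
```

With 100 classes spread evenly over 20 superclasses, each superclass holds five fine classes, which is why both lists above have five entries.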


Kinetics

It is a collection of large-scale, high-quality datasets of URL links to up to 650,000 video clips, covering 400, 600, or 700 action classes depending on the dataset version. Each clip is roughly ten seconds long and manually annotated with a single action class. The videos show both human-object interactions, such as playing musical instruments, and human-human interactions.


CityScapes

It is a database of diverse stereo video sequences recorded in street scenes from 50 cities. It provides semantic, instance-wise, and dense pixel annotations for 30 classes grouped into 8 categories. CityScapes offers 5,000 frames with fine pixel-level annotations and 20,000 coarsely annotated frames.

Labeled Faces

Labeled Faces in the Wild (LFW) is a public benchmark for face verification, also known as pair matching. It is a collection of face images designed for studying the problem of unconstrained face recognition. The dataset contains more than 13,000 face images collected from the web, and each face is labeled with the name of the person pictured. 1,680 of the people pictured have two or more distinct photos in the dataset. The only constraint on these faces is that they were detected by the Viola-Jones face detector. Regardless of how well an algorithm performs on LFW, the result should not be taken to mean that it is suitable for any commercial application.


LabelMe

The LabelMe-12-50k dataset was derived from LabelMe and consists of 50,000 JPEG images (40,000 for training and 10,000 for testing). Each image is 256×256 pixels. In both the training and testing sets, 50% of the images show a centered object belonging to one of the 12 object classes. The remaining 50% show a randomly selected region of a randomly selected image (“clutter”).

The dataset is challenging for object recognition systems because the instances of each object class vary widely in appearance, lighting, and viewing angle. In addition, centered objects may be partially occluded, or other objects (or parts of them) may be present in the image.


Places

Scene recognition is one of the key tasks in computer vision because it provides context for object recognition. Places is a scene-centric database with 205 scene categories and 2.5 million images, each with a category label. Convolutional neural networks (CNNs) trained on it to learn deep scene features have achieved state-of-the-art performance on scene-centric benchmarks. The Places database and the trained CNNs are made available for academic research and education.

Stanford Cars dataset

The Cars dataset contains 16,185 images of 196 classes of cars. Classes are typically defined by make, model, and year, for example, 2012 Tesla Model S or 2012 BMW M3 coupe. The data is split into 8,144 training images and 8,041 testing images, with each class split roughly 50-50.

Fine-grained recognition, a growing area in computer vision, is about telling apart real-world objects that differ only in minute details of appearance. 3D object representations are helpful tools for scene understanding and multi-view object class detection. The training and testing sets of this dataset are well suited to building models that distinguish between cars. The data comes from the AI Lab at Stanford University.

Face Mask Detection

In the absence of immunization, masks are one of the few preventative measures against COVID-19. They are vital for protecting people’s health against respiratory infections. This dataset can be used to build a model that detects people who are wearing masks, not wearing them, or wearing them incorrectly.

The dataset contains 853 images belonging to three classes, with bounding boxes in the PASCAL VOC format. The classes are: with mask, without mask, and mask worn incorrectly.
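PASCAL VOC annotations are XML files with one `object` element per labeled box. The sketch below reads such a file with Python's standard library. The sample annotation, its file name, its coordinates, and the class label strings are all made up for illustration and are not taken from the dataset, and `parse_voc_boxes` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in the PASCAL VOC layout; all values are invented.
SAMPLE_ANNOTATION = """<annotation>
    <filename>example.png</filename>
    <size><width>512</width><height>366</height><depth>3</depth></size>
    <object>
        <name>with_mask</name>
        <bndbox>
            <xmin>79</xmin><ymin>105</ymin><xmax>109</xmax><ymax>142</ymax>
        </bndbox>
    </object>
    <object>
        <name>without_mask</name>
        <bndbox>
            <xmin>185</xmin><ymin>100</ymin><xmax>226</xmax><ymax>144</ymax>
        </bndbox>
    </object>
</annotation>"""


def parse_voc_boxes(xml_text):
    """Return a list of (label, (xmin, ymin, xmax, ymax)) tuples."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        label = obj.findtext("name")
        bndbox = obj.find("bndbox")
        # Corner coordinates are stored as text in four child elements.
        coords = tuple(
            int(bndbox.findtext(tag))
            for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((label, coords))
    return boxes


if __name__ == "__main__":
    for label, coords in parse_voc_boxes(SAMPLE_ANNOTATION):
        print(label, coords)
```

For real files, `ET.parse(path).getroot()` can replace `ET.fromstring`, and the label strings should be read from the dataset itself rather than assumed.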

Fire and Smoke Dataset

DataCluster Labs, based in India, collected this dataset. It includes more than 7,000 original fire and smoke images captured in more than 400 urban and rural areas, and it is a challenging dataset to work with. Datacluster’s computer vision experts manually review and verify each image.

Dataset Specifics

  • Dataset size: 7000+ images
  • Captured by: Over a thousand crowdsourcing participants
  • Resolution: 98% of images are HD or higher (1920×1080 and above)
  • Location: Captured in more than 400 Indian cities
  • Diversity: Various lighting conditions, including day and night, as well as different distances and vantage points
  • Device used: Captured with mobile devices in 2020–2021
  • Usage: Fire and smoke detection, intelligent cameras, fire and smoke alarm systems, etc.


Prathamesh Ingle is a Mechanical Engineer and works as a Data Analyst. He is also an AI practitioner and certified Data Scientist with an interest in applications of AI. He is enthusiastic about exploring new technologies and advancements and their real-life applications.
