University of Oxford Researchers Release ‘PASS’ Dataset With 1.4M+ Images (Free From Humans) For Self-Supervised Machine Learning

Modern machine learning could not have developed without large research datasets. For quite some time, computer vision has relied on large-scale image datasets such as ImageNet, sampled from the Internet, for pretraining models. These datasets are not always ethically or technically sound: they can contain personal information taken without consent, their licensing is often unclear, and they carry biases that can make results inaccurate or misleading. A more recent development is to use self-supervised learning for model pretraining. Instead of relying on the labels of a dataset such as ImageNet, a model is trained on the images alone, with no labels provided at all.

Researchers from the University of Oxford have proposed the ‘PASS’ dataset, a collection of more than 1.4 million images that excludes humans, other identifying information such as license plates, signatures, and handwriting, and NSFW images. The research group started from YFCC100M, a large-scale dataset of 100 million random Flickr images. To address copyright concerns, they restricted the selection to images released under the most permissive Creative Commons license (CC-BY).
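The selection criteria described above can be pictured as a simple filter over image metadata. The following is only a minimal sketch, not the researchers' actual pipeline: the `ImageRecord` type and its fields (`has_human`, `has_identifying_text`, `is_nsfw`) are hypothetical and assume that some upstream detector has already produced those flags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageRecord:
    """Hypothetical metadata for one candidate image (not the real YFCC100M schema)."""
    image_id: str
    license: str                # e.g. "CC-BY" or "CC-BY-NC"
    has_human: bool             # assumed output of a person/face detector
    has_identifying_text: bool  # assumed flag: license plates, signatures, handwriting
    is_nsfw: bool               # assumed output of an NSFW classifier


def keep(record: ImageRecord) -> bool:
    """Return True only if the image passes every exclusion rule."""
    return (
        record.license == "CC-BY"
        and not record.has_human
        and not record.has_identifying_text
        and not record.is_nsfw
    )


def filter_records(records):
    """Keep the records that satisfy all criteria, preserving input order."""
    return [r for r in records if keep(r)]


if __name__ == "__main__":
    candidates = [
        ImageRecord("a", "CC-BY", False, False, False),
        ImageRecord("b", "CC-BY", True, False, False),      # contains a person
        ImageRecord("c", "CC-BY-NC", False, False, False),  # wrong license
    ]
    print([r.image_id for r in filter_records(candidates)])
```

In practice each flag would come from its own model or manual review step; the sketch only shows how the individual criteria combine into a single keep-or-drop decision.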

In an extensive evaluation of self-supervised learning (SSL) methods, the researchers analyzed performance differences between PASS and ImageNet and identified three significant ways in which PASS differs from ImageNet:

  • Lack of humans
  • Lack of class-level curation and search
  • Lack of ‘community optimization’

In further experiments, the research group found that self-supervised approaches such as MoCo, SwAV, and DINO train well on the PASS dataset. They also found that excluding images with humans during pretraining has almost no effect on downstream task performance. Models pretrained on PASS achieve better results than their ImageNet-pretrained counterparts on 8 of 13 frozen-encoder evaluation benchmarks.
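For readers unfamiliar with the term, a frozen-encoder evaluation keeps the pretrained network fixed and fits only a lightweight classifier on top of its features. The toy sketch below illustrates that idea under loud assumptions: `frozen_encoder` is a hypothetical stand-in for a pretrained model, and a nearest-centroid classifier replaces whatever probes the benchmarks actually use.

```python
from collections import defaultdict


def frozen_encoder(image):
    """Stand-in for a pretrained network: a fixed feature map that is never updated."""
    mean = sum(image) / len(image)
    spread = max(image) - min(image)
    return (mean, spread)


def fit_centroids(images, labels):
    """Fit the only trainable part: one mean feature vector per class."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for image, label in zip(images, labels):
        for i, value in enumerate(frozen_encoder(image)):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: tuple(v / counts[label] for v in total)
        for label, total in sums.items()
    }


def predict(centroids, image):
    """Assign the class whose centroid is closest to the frozen features."""
    feats = frozen_encoder(image)

    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(feats, centroid))

    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, images, labels):
    """Fraction of images whose predicted class matches the label."""
    correct = sum(predict(centroids, x) == y for x, y in zip(images, labels))
    return correct / len(labels)
```

Because the encoder is never updated, the downstream score reflects the quality of the pretrained features rather than any task-specific fine-tuning, which is why this kind of benchmark is used to compare pretraining datasets.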




Asif Razzaq is an AI Journalist and Cofounder of Marktechpost, LLC. He is a visionary, entrepreneur and engineer who aspires to use the power of Artificial Intelligence for good.
