The development of modern machine learning could not have happened without extensive research datasets. For quite some time, computer vision has relied on large-scale image datasets such as ImageNet, sampled from the Internet, for pretraining models. These datasets are not always ethically or technically sound: they can contain personal information collected without consent, and their unclear licensing can make results built on them inaccurate or misleading. A more recent development is to use self-supervised methods for model pretraining. Instead of relying on labelled datasets such as ImageNet, we can train models on raw images without any human-provided labels at all.
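To make the label-free idea concrete, here is a minimal sketch of a contrastive (InfoNCE-style) objective, the kind of loss used by self-supervised methods mentioned later. This is an illustrative NumPy example, not code from the PASS paper; the function name and toy data are hypothetical.

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """Contrastive (InfoNCE-style) loss between two views of the same images.

    z1, z2: (batch, dim) embeddings of the same batch of images under two
    random augmentations. No labels are needed: for image i, the matching
    view in the other batch is its positive; every other image is a negative.
    """
    # L2-normalize so the dot product is a cosine similarity.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature              # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive for row i sits on the diagonal (column i).
    return -np.mean(np.diag(log_probs))

# Toy demo: two "augmented views" are the same vectors plus small noise.
rng = np.random.default_rng(0)
view_a = rng.normal(size=(8, 32))
view_b = view_a + 0.05 * rng.normal(size=(8, 32))
loss_matched = info_nce_loss(view_a, view_b)
loss_random = info_nce_loss(view_a, rng.normal(size=(8, 32)))
```

Minimizing such a loss pulls the two views of each image together while pushing other images apart, which is how an encoder can learn useful features from unlabelled data.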
Researchers from the University of Oxford have proposed the ‘PASS’ dataset, an extensive collection of 1.28M images that excludes humans, other identifying information such as license plates, signatures, or handwriting, and NSFW images. The research group started from YFCC100M, a large-scale dataset of 100 million random Flickr images. They also kept only data collected under the most permissive Creative Commons license (CC-BY) to address copyright concerns.
In an extensive evaluation of self-supervised learning (SSL) methods, the researchers analyzed performance differences between PASS and ImageNet and found that PASS differs from ImageNet in three significant ways:
- Lack of humans
- Lack of class-level curation and search
- Lack of ‘community optimization’
In further experiments, the research group also found that self-supervised approaches such as MoCo, SwAV and DINO train very well on the PASS dataset, and that excluding images with humans during pretraining does not hurt downstream task performance. Models pretrained on PASS outperform ImageNet-pretrained models on 8 of 13 frozen-encoder evaluation benchmarks.
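A frozen-encoder benchmark keeps the pretrained encoder fixed and trains only a lightweight classifier on its features. The sketch below shows that idea with a simple ridge-regression probe on synthetic "features"; it is an assumed, simplified stand-in for the evaluation protocol, not the paper's actual setup, and all names and data here are hypothetical.

```python
import numpy as np

def linear_probe(train_feats, train_labels, test_feats, l2=1e-3):
    """Frozen-encoder evaluation sketch: the encoder is never updated;
    we fit only a linear classifier on its extracted features.

    A ridge-regression probe on one-hot targets is a cheap stand-in for
    the logistic-regression probes commonly used in such benchmarks.
    """
    n_classes = train_labels.max() + 1
    targets = np.eye(n_classes)[train_labels]          # one-hot labels
    d = train_feats.shape[1]
    # Closed-form ridge solution: (X^T X + l2 I) W = X^T Y.
    weights = np.linalg.solve(
        train_feats.T @ train_feats + l2 * np.eye(d),
        train_feats.T @ targets,
    )
    return (test_feats @ weights).argmax(axis=1)

# Toy demo: synthetic features from two well-separated class clusters,
# standing in for the output of a frozen pretrained encoder.
rng = np.random.default_rng(1)
centers = rng.normal(size=(2, 16)) * 3.0
labels = rng.integers(0, 2, size=200)
feats = centers[labels] + rng.normal(size=(200, 16))
preds = linear_probe(feats[:150], labels[:150], feats[150:])
accuracy = (preds == labels[150:]).mean()
```

Because only the probe is trained, this kind of benchmark measures the quality of the pretrained representations themselves, which is why it is a natural way to compare PASS against ImageNet pretraining.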