OpenAI introduced CLIP (Contrastive Language–Image Pre-training), a neural network that efficiently learns visual concepts from natural language supervision. CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the “zero-shot” capabilities of GPT-2 and GPT-3.
The current deep-learning approach to computer vision has several significant problems:
- Typical vision datasets are labor-intensive and expensive to create, while teaching only a narrow set of visual concepts.
- Standard vision models are good at one task only and require significant effort to adapt to a new task.
- Models that perform well on benchmarks often show disappointingly poor performance on stress tests.
The team presents CLIP to address these problems. CLIP is trained on a wide variety of images paired with the wide variety of natural language supervision that is abundantly available on the internet.
Costly datasets: CLIP learns from text–image pairs that are publicly available on the internet. This reduces the need for the expensive, large labeled datasets that prior work has studied extensively.
Narrow: CLIP can be adapted to perform a wide variety of visual classification tasks without additional training examples. Given the names of a task’s visual categories, CLIP’s text encoder outputs a linear classifier over CLIP’s visual representations, and the accuracy of this classifier is often competitive with fully supervised models.
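The mechanism can be sketched in a few lines: the normalized text embeddings of the class names act as the weight rows of a linear classifier, scored against the normalized image embedding. The toy embeddings and helper names below are illustrative assumptions, not CLIP’s actual encoders or API.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def zero_shot_classify(image_embedding, class_names, text_embeddings):
    """Pick the class whose text embedding is most similar to the image.

    Each normalized text embedding acts as one weight row of a linear
    classifier over the image embedding: score_i = w_i . x.
    """
    img = normalize(image_embedding)
    scores = [sum(a * b for a, b in zip(img, normalize(text_embeddings[name])))
              for name in class_names]
    best = max(range(len(class_names)), key=lambda i: scores[i])
    return class_names[best], scores

# Hypothetical toy embeddings: the "dog" image vector is close to the "dog" text vector.
class_names = ["a photo of a dog", "a photo of a cat"]
text_embeddings = {
    "a photo of a dog": [1.0, 0.1, 0.0],
    "a photo of a cat": [0.0, 1.0, 0.2],
}
label, scores = zero_shot_classify([0.9, 0.2, 0.05], class_names, text_embeddings)
print(label)  # a photo of a dog
```

In the real model, the two encoders are learned networks and the class-name texts are typically wrapped in prompt templates before being embedded.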
Poor real-world performance: Deep learning models are often said to achieve human performance on benchmarks, yet they can fail on data from the wild. Because CLIP can be evaluated on benchmarks without training on their data, its benchmark performance is much more representative of its performance in the wild.
CLIP builds on a large body of work on zero-shot transfer, multimodal learning, and natural language supervision. Zero-data learning dates back more than a decade, but it was mostly studied in computer vision as a way of generalizing to unseen object categories.
The work most encouraging for CLIP is that of Ang Li and his co-authors at FAIR in 2016, who used natural language supervision to enable zero-shot transfer to several existing computer vision classification datasets. They fine-tuned an ImageNet CNN to predict a much broader set of visual concepts (visual n-grams) from the text of the descriptions, titles, and tags of 30 million Flickr photos, reaching 11.5% accuracy on ImageNet zero-shot.
CLIP is part of a group of papers from the past year revisiting learning visual representations from natural language supervision. This line of work uses more modern architectures such as the Transformer and includes:
- VirTex, which explored autoregressive language modeling.
- ICMLM, which investigated masked language modeling.
- ConVIRT, which studied the same contrastive objective the team used for CLIP, but in medical imaging.
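The contrastive objective shared by ConVIRT and CLIP can be sketched as a symmetric cross-entropy over the N×N similarity matrix of a batch of image–text pairs, where pair (i, i) is the positive in both directions. This pure-Python version is a simplified illustration; the function name and temperature value are assumptions, not the papers’ code.

```python
import math

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings."""
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: cosine similarity of image i with text j, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy(row, target):
        m = max(row)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in row]
        return -math.log(exps[target] / sum(exps))

    # Image-to-text direction: each image must pick out its own caption.
    loss_i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image direction: score each text against all images (transpose).
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = sum(cross_entropy(cols[j], j) for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2
```

Well-aligned pairs drive the loss toward zero, while mismatched pairs are penalized in both directions, which is what pushes matching image and text embeddings together during pre-training.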
Models such as GPT-2 and GPT-3 trained on such data can achieve compelling zero-shot performance, but they require significant training compute. To reduce the compute needed, the team focused on algorithmic ways to improve the training efficiency of their approach. Because CLIP learns a wide range of visual concepts directly from natural language, it is flexible and can perform many different tasks zero-shot. To validate this, the team measured CLIP’s zero-shot performance on more than 30 different datasets.
The team also reported several limitations of CLIP. It struggles with more complex tasks, such as predicting how close the nearest car is in a photo, and with more abstract or systematic tasks, such as counting the number of objects in an image. CLIP also generalizes poorly to images not covered by its pre-training dataset. Finally, CLIP’s zero-shot classifiers can be sensitive to wording or phrasing and sometimes require trial-and-error “prompt engineering” to perform well.
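The “prompt engineering” mentioned above amounts to wrapping raw class names in natural-language templates before embedding them (the real pipeline then averages the text embeddings across templates per class). The template strings below are illustrative assumptions in the style of CLIP’s prompts, not the exact set used by the model.

```python
# Hypothetical prompt templates in the style used for CLIP's zero-shot classifiers.
TEMPLATES = [
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a drawing of a {}.",
]

def build_prompts(class_name, templates=TEMPLATES):
    """Expand one raw class name into several natural-language prompts."""
    return [t.format(class_name) for t in templates]

print(build_prompts("dog")[0])  # a photo of a dog.
```

Because the classifier is just the text embedding of these strings, small wording changes shift the decision boundary, which is why phrasing matters and why averaging over several templates tends to help.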