Excellent visual and vision-language representations are crucial for solving computer vision problems such as image retrieval, image classification, and video understanding. That is why visual and vision-language models typically rely on curated training datasets such as ImageNet, OpenImages, and Conceptual Captions, which require expert knowledge and extensive labeling. All of these datasets need non-trivial data collection and cleaning steps, which limits their size and hinders the scale of the trained models. By contrast, NLP models are pre-trained at large scale on raw text without human labels and have achieved state-of-the-art performance on benchmarks such as GLUE and SuperGLUE.
Google researchers propose a technique to bridge this gap by using publicly available image alt-text data (the text that appears in place of an image on a webpage when the image fails to load). The team uses this alt-text data to train larger, state-of-the-art vision and vision-language models.
Alt-text describes the image it accompanies, but the text is sometimes unrelated to its paired image, making the dataset “noisy.” The researchers followed the methodology used to construct the Conceptual Captions dataset to obtain raw English alt-text data (image and alt-text pairs), but applied only minimal frequency-based filtering for cleaning instead of heavy filtering and post-processing. The resulting dataset of 1.8 billion image-text pairs is much larger, but also noisier.
ALIGN: A Large-scale Image and Noisy-Text Embedding
The team used a simple dual-encoder architecture to build larger and more powerful models quickly. The architecture learns to align the visual and language representations of image-text pairs. The image and text encoders are trained with a contrastive loss that pulls the embeddings of matched image-text pairs together while pushing those of non-matched pairs apart.
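The contrastive objective described above can be sketched as a symmetric in-batch softmax cross-entropy over a similarity matrix, where matched pairs sit on the diagonal and every other entry serves as a negative. This is an illustrative NumPy sketch, not ALIGN's actual implementation; the `temperature` value and function names here are assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(image_emb, text_emb, temperature=0.1):
    """Symmetric in-batch contrastive loss over a batch of embedding pairs.

    Matched image-text pairs lie on the diagonal of the similarity matrix;
    all other entries in the same row/column act as negatives.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature      # (batch, batch) similarity matrix
    labels = np.arange(len(logits))         # i-th image matches i-th text

    def xent(lg):
        # Softmax cross-entropy with the diagonal as the correct class.
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logprobs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logprobs[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss drives matched embeddings toward high cosine similarity relative to the in-batch negatives, which is what places images and text in a shared embedding space.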
The large-scale dataset makes it easy to scale up the model size. The resulting representations can then be used for downstream visual and vision-language tasks. Without any fine-tuning, ALIGN powers cross-modal search: image-to-text search, text-to-image search, and even search with joint image+text queries.
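Because images and text share one embedding space, all three search modes reduce to nearest-neighbor lookup by cosine similarity. The sketch below assumes precomputed embeddings; the composition of a joint image+text query by adding normalized embeddings is an illustrative choice, not necessarily ALIGN's exact scheme.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def search(query_emb, index_embs, top_k=3):
    """Return indices of the top_k index entries most similar to the query.

    Since image and text embeddings share one space, the same routine serves
    image-to-text, text-to-image, and joint-query search.
    """
    sims = l2_normalize(index_embs) @ l2_normalize(query_emb)
    return np.argsort(-sims)[:top_k]

def joint_query(image_emb, text_emb):
    # Illustrative joint image+text query: sum of the normalized embeddings,
    # renormalized. (Hypothetical composition for demonstration purposes.)
    return l2_normalize(l2_normalize(image_emb) + l2_normalize(text_emb))
```

A joint query built this way retrieves items similar to both the image and the text, which matches the intuition of refining an image query with words.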
Evaluation and Future Work
The trained ALIGN model, with EfficientNet-L2 as the image encoder and BERT-Large as the text encoder, achieves state-of-the-art performance on several image-text retrieval tasks in both zero-shot and fine-tuned settings. Furthermore, ALIGN is a robust image representation model: it outperforms CLIP and achieves a state-of-the-art top-1 accuracy of 85.5% on ImageNet.
In image classification, each class is traditionally treated as an independent ID, and the classification layer must be trained with at least a few labeled examples per class. But since class names are themselves natural language phrases, ALIGN can perform image classification without any training data.
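Zero-shot classification then amounts to embedding each class name with the text encoder and picking the class whose text embedding is closest to the image embedding. A minimal sketch, assuming the embeddings are already computed (the prompt template and function names are illustrative):

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_classify(image_emb, class_text_embs, class_names):
    """Pick the class whose text embedding is closest to the image embedding.

    class_text_embs[i] is assumed to be the text encoder's embedding of a
    prompt built from class_names[i] (e.g. "a photo of a {class name}");
    no classification layer is ever trained.
    """
    sims = l2_normalize(class_text_embs) @ l2_normalize(image_emb)
    return class_names[int(np.argmax(sims))]
```

Adding a new class costs only one forward pass through the text encoder, which is what makes the approach "zero-shot" with respect to labeled image data.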
Although this methodology offers promising results, it is crucial to analyze the data and the resulting model before using the model in practice. Harmful text in the alt-text data could reinforce those harms in the model, so efforts should be made to balance the data and prevent stereotypes from the web from being amplified. Additional analysis is also needed to ensure that the demographic distribution of people and related cultural items does not skew model performance.