CLIP learns to associate images with their text captions through a contrastive loss. Classification is then done by comparing an image against natural-language prompts describing the candidate classes.
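The training objective can be sketched as a symmetric contrastive loss over a batch of image-text pairs, loosely following the pseudocode in the CLIP paper. This is a minimal sketch: the encoder outputs below are stand-in random tensors, and the fixed temperature is a simplification of CLIP's learned temperature parameter.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text embeddings.

    image_features, text_features: [batch, dim] encoder outputs (stand-ins here;
    CLIP uses a ResNet/ViT image encoder and a Transformer text encoder).
    The i-th image and i-th text form the positive pair; every other pairing
    in the batch is a negative.
    """
    # L2-normalize so the dot product is a cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # [batch, batch] similarity matrix, scaled by the temperature
    logits = image_features @ text_features.t() / temperature

    # Matching pairs lie on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image->text and text->image), averaged
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2

# Toy usage with random embeddings standing in for encoder outputs
if __name__ == "__main__":
    imgs = torch.randn(8, 512)
    txts = torch.randn(8, 512)
    print(clip_contrastive_loss(imgs, txts))
```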
Prior work on learning visual concepts from natural language:
2013: Richard Socher and co-authors at Stanford trained a model on CIFAR-10 to make predictions in a word-embedding space; it could predict two unseen classes.
2013: DeViSE (Google) scaled this approach to ImageNet, fine-tuning an ImageNet model so it could generalize to objects outside the original 1,000 training classes.
2016: Ang Li and co-authors at FAIR fine-tuned an ImageNet CNN to predict a much wider set of visual concepts (visual n-grams) from the text accompanying web images.

Problems of current deep learning for computer vision:
The ImageNet dataset required over 25,000 workers to annotate 14 million images for 22,000 object categories. CLIP instead learns from publicly available image-text pairs found on the internet.
To apply an ImageNet model to a new task (transfer learning), you need to build a new dataset, add a new output head, and fine-tune the model (see the fine-tuning sketch below).
Even when models achieve (super-)human performance on benchmarks, they often have poor real-world performance (they overfit the benchmark).
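For contrast with CLIP's approach, here is a minimal sketch of the standard transfer-learning recipe mentioned above, assuming a torchvision ResNet-50; NUM_CLASSES and train_loader are placeholders for the new task's dataset.

```python
import torch
import torch.nn as nn
from torchvision import models

# Placeholder: suppose the new task has 10 classes and `train_loader`
# yields (images, labels) batches from the newly built dataset.
NUM_CLASSES = 10

# Start from an ImageNet-pretrained backbone
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Replace the output head (the 1000-way ImageNet classifier) with a new one
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def fine_tune_one_epoch(model, train_loader):
    """One pass over the new dataset, updating all weights."""
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```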
CLIP can be evaluated on benchmarks zero-shot, without having to train on their data (it matches the performance of the original ResNet-50 on ImageNet).
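A sketch of zero-shot classification, assuming the openai/CLIP package (https://github.com/openai/CLIP); the class names, prompt template, and image path below are illustrative, not taken from a specific benchmark.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Illustrative class names; in practice these come from the benchmark's label set
class_names = ["airplane", "automobile", "bird", "cat", "dog"]
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(prompts)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Similarity between the image and each prompted class description
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(class_names[probs.argmax().item()])
```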
CLIP is evaluated on 27 tasks to show its robustness. Its performance on Optical Character Recognition is mixed, but the semantic representations are useful: when the SST-2 dataset (Stanford Sentiment Treebank: sentiment-labeled parse trees of movie-review sentences) is rendered as images, a linear classifier on CLIP's representations matches a Continuous Bag of Words model with direct access to the text.
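The SST-2 comparison is a linear-probe setup: CLIP is frozen, features are extracted from the rendered-sentence images, and a linear classifier is trained on top. A minimal sketch, again assuming the openai/CLIP package; the rendered images, labels, and logistic-regression settings are placeholders rather than the paper's exact configuration.

```python
import numpy as np
import torch
import clip
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def extract_features(images):
    """images: list of PIL images (e.g. SST-2 sentences rendered as images)."""
    feats = []
    with torch.no_grad():
        for img in images:
            x = preprocess(img).unsqueeze(0).to(device)
            feats.append(model.encode_image(x).float().cpu().numpy())
    return np.concatenate(feats)

# Placeholders: train_images/test_images are rendered sentences,
# train_labels/test_labels are 0/1 sentiment labels.
# train_feats = extract_features(train_images)
# test_feats = extract_features(test_images)
# probe = LogisticRegression(max_iter=1000)  # regularization strength is tuned in the paper
# probe.fit(train_feats, train_labels)
# print(probe.score(test_feats, test_labels))
```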