CLIP (OpenAI)

Blogpost

CLIP learns to associate images with their accompanying text through a contrastive loss. Images are then classified by prompting the model with natural-language descriptions of the classes.

Context

Problems with current deep learning for computer vision:

  • datasets are labor-intensive to create
  • task transfer is poor

2013: Richard Socher (Stanford) trained a model on CIFAR-10 to make predictions in a word-embedding space; the model could predict two unseen classes.

2013: DeViSE scaled this approach, predicting object categories outside the original ImageNet training set.

2016: Ang Li (FAIR) fine-tuned an ImageNet CNN to predict a much wider set of visual n-grams.

Approach

  • Scaling the pre-training task is sufficient for competitive zero-shot transfer.
  • data: (image, text) pairs found on the internet
  • objective: given an image, predict which of 32,768 sampled text snippets was the one actually paired with it (see the loss sketch after this list)
  • Task transfer: say the task is to classify photos of dogs vs. cats. Given an image, CLIP assigns a probability to the prompts "a photo of a dog" and "a photo of a cat" (see figure and the usage sketch below it)
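
A minimal PyTorch sketch of this symmetric contrastive objective, in the spirit of the pseudocode in the paper. The function name, the pre-computed encoder outputs, and the fixed temperature are illustrative assumptions (CLIP actually learns the temperature).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """image_features, text_features: [batch, dim] outputs of the image/text encoders,
    already projected into the joint embedding space."""
    # L2-normalize so dot products are cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # [batch, batch] similarity matrix: entry (i, j) scores image i against text j
    logits = image_features @ text_features.t() / temperature

    # the correct pairing is the diagonal: image i goes with text i
    labels = torch.arange(logits.shape[0], device=logits.device)

    # symmetric cross-entropy: image -> text (rows) and text -> image (columns)
    loss_i = F.cross_entropy(logits, labels)
    loss_t = F.cross_entropy(logits.t(), labels)
    return (loss_i + loss_t) / 2
```

With a batch of N pairs, each image is scored against its own text plus N-1 in-batch negatives; 32,768 is the batch size used in the paper.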

clip.svg
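
A zero-shot classification sketch matching the dog-vs-cat example, assuming OpenAI's open-source clip package (github.com/openai/CLIP); the image path is a hypothetical placeholder.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# hypothetical input image; the prompts act as the natural-language "labels"
image = preprocess(Image.open("pet.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a dog", "a photo of a cat"]).to(device)

with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

print(probs)  # probability assigned to each prompt, e.g. [[p_dog, p_cat]]
```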

Costly dataset

The ImageNet dataset required over 25,000 workers to annotate 14 million images for 22,000 object categories. CLIP instead learns from text-image pairs that are publicly available on the internet.

Narrow

To perform transfer learning with an ImageNet model, you need to build a new dataset, add a new output head, and fine-tune the model (see the sketch below).
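
For contrast, a rough sketch of that conventional recipe with a torchvision ResNet-50; the two-class head and the FakeData placeholder stand in for the new task and the new dataset you would have to build.

```python
import torch
import torch.nn as nn
import torchvision
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# ImageNet-pretrained backbone with a new output head for the new task (e.g. dog vs. cat)
model = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)

# placeholder standing in for the new labelled dataset you would have to collect
train_set = datasets.FakeData(size=64, image_size=(3, 224, 224), num_classes=2,
                              transform=transforms.ToTensor())
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()
model.train()
for images, labels in train_loader:  # fine-tune the whole network on the new data
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```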

Poor real-world performance

Even if models achieve (super-)human performance on benchmarks, they often perform poorly in the real world (they overfit the benchmark).

CLIP can be evaluated on benchmarks without having to train on their data (it matches the performance of the original ResNet-50 on ImageNet).

Efficiency

  • the contrastive objective is 4x to 10x more compute-efficient than the image-to-text (caption prediction) approach taken by VirTex
  • adopting the Vision Transformer gives a further 3x gain in compute efficiency over a standard ResNet

Performance

CLIP was evaluated on 27 tasks to show its robustness. Its Optical Character Recognition performance is mixed, but the resulting semantic representations are useful: when evaluated on SST-2 (Stanford Sentiment Treebank, a sentiment-annotated corpus of movie-review sentences) rendered as images, a linear classifier on CLIP's representation matches a Continuous Bag of Words model with direct access to the text (a linear-probe sketch follows).
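
A rough sketch of such a linear probe on frozen CLIP features, again assuming the open-source clip package; CIFAR-100 is used here only as a stand-in dataset, since the rendered SST-2 images are not packaged in torchvision.

```python
import numpy as np
import torch
import clip
from torchvision.datasets import CIFAR100
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def extract_features(train):
    # freeze CLIP and use its image encoder as a fixed feature extractor
    data = CIFAR100(root=".", train=train, download=True, transform=preprocess)
    loader = torch.utils.data.DataLoader(data, batch_size=256)
    feats, labels = [], []
    with torch.no_grad():
        for images, ys in loader:
            feats.append(model.encode_image(images.to(device)).cpu().numpy())
            labels.append(ys.numpy())
    return np.concatenate(feats), np.concatenate(labels)

X_train, y_train = extract_features(train=True)
X_test, y_test = extract_features(train=False)

probe = LogisticRegression(max_iter=1000)  # the linear classifier on CLIP's representation
probe.fit(X_train, y_train)
print("linear-probe accuracy:", probe.score(X_test, y_test))
```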

Limitations

  • counting the number of objects or judging relative distances between objects (i.e. tasks that require operating on discrete symbols)
  • fine-grained classification (e.g. differences between car models)
  • needs prompt engineering (see the sketch below)
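
A sketch of the prompt engineering/ensembling the last bullet refers to: each class name is expanded into several templates and the normalized text embeddings are averaged into a zero-shot classifier. The template list and class names are illustrative, not the full set used in the paper.

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# illustrative templates and class names (the paper ensembles many more templates)
templates = ["a photo of a {}.", "a blurry photo of a {}.", "a close-up photo of a {}."]
classes = ["dog", "cat"]

with torch.no_grad():
    class_embeddings = []
    for name in classes:
        tokens = clip.tokenize([t.format(name) for t in templates]).to(device)
        emb = model.encode_text(tokens)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        class_embeddings.append(emb.mean(dim=0))       # ensemble: average over templates
    zero_shot_weights = torch.stack(class_embeddings)  # [num_classes, dim]
```

Image embeddings are then scored against zero_shot_weights exactly as in the dog-vs-cat example above.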