See the blog post “Self-supervised learning: The dark matter of intelligence”
Yann LeCun – 03/04/2021
We believe that self-supervised learning (SSL) is one of the most promising ways to build background knowledge and approximate a form of common sense in AI systems.
The term “self-supervised learning” is more accepted than the previously used term “unsupervised learning.” Unsupervised learning is an ill-defined and misleading term that suggests that the learning uses no supervision at all. In fact, self-supervised learning is not unsupervised, as it uses far more feedback signals than standard supervised and reinforcement learning methods do.
Self-supervised learning has long had great success in advancing the field of natural language processing (NLP):
NLP systems use contrastive methods by masking or substituting input words; the goal is to reconstruct the original version of the corrupted text. The general technique of training a model to restore a corrupted version of an input is called a denoising auto-encoder.
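The masking-and-reconstruction objective can be sketched in a few lines. This is a toy NumPy illustration, not an actual NLP model: `corrupt` and `masked_xent` are hypothetical names, and a stand-in of random logits plays the role of the trained model whose loss training would minimize.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, MASK_ID = 10, 10          # token ids 0..9; id 10 reserved for [MASK]

def corrupt(tokens, p=0.3):
    """Replace a random fraction p of tokens with the [MASK] id."""
    tokens = np.asarray(tokens)
    mask = rng.random(tokens.shape) < p
    return np.where(mask, MASK_ID, tokens), mask

def masked_xent(logits, targets, mask):
    """Cross-entropy over the vocabulary, scored only at masked positions."""
    logits = logits - logits.max(axis=-1, keepdims=True)   # stable softmax
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]
    return float(nll[mask].mean()) if mask.any() else 0.0

sentence = np.array([3, 1, 4, 1, 5, 9, 2, 6])
noisy, mask = corrupt(sentence)

# Stand-in for a trained model: random logits per position over the vocabulary.
logits = rng.normal(size=(len(sentence), VOCAB + 1))
loss = masked_xent(logits, sentence, mask)   # the quantity training minimizes
```

A real denoising auto-encoder would replace the random logits with a network's predictions and backpropagate this loss.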
These techniques cannot be easily extended to computer vision (CV): it is considerably more difficult to represent uncertainty in a prediction for images than for words. We cannot enumerate all possible video frames and assign a prediction score (after a softmax layer) to each of them, because there are infinitely many: the output is a high-dimensional continuous object rather than a discrete outcome.
New techniques such as SwAV are starting to beat accuracy records in vision tasks: the recent SEER project leverages SwAV and other methods to pretrain a large network on a billion random unlabeled images, showing that self-supervised learning can excel at computer vision tasks as well.
FAIR released a model family called RegNets (https://arxiv.org/abs/2003.13678): ConvNets capable of scaling to billions (potentially even trillions) of parameters, which can be optimized to fit different runtime and memory limitations.
Training an EBM consists of two parts:
Training it to produce low energy for compatible pairs (x, y), and
Finding a way to ensure that, for a particular x, the y values that are incompatible with x produce a higher energy than those that are compatible with x.
The second point is where the difficulty lies.
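A minimal sketch of the energy view, assuming a toy energy E(x, y) = ‖x − y‖² standing in for a learned energy function:

```python
import numpy as np

def energy(x, y):
    """Toy scalar energy: low when x and y are compatible (close),
    high when they are not. A real EBM learns this function."""
    return float(np.sum((x - y) ** 2))

x = np.array([1.0, 2.0, 3.0])
y_good = x + 0.01                    # a compatible y: a slight distortion of x
y_bad = np.array([5.0, -1.0, 0.0])   # an incompatible, unrelated y

# Requirement 1: compatible pairs get low energy.
# Requirement 2 (the hard part): every incompatible y gets higher energy.
e_good, e_bad = energy(x, y_good), energy(x, y_bad)
```

With a hand-picked energy this ordering is trivial; the difficulty is making a *learned* energy satisfy it for all incompatible y, not just a few.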
A well-suited architecture is the Siamese network, also known as a joint embedding architecture:
Without a specific way to ensure that the networks produce high energy when x and y differ, the two networks could collapse into always producing identical output embeddings.
There are two categories of techniques to avoid collapse: contrastive methods (the approach used in NLP) and regularization methods.
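A contrastive method on a joint embedding architecture can be sketched with a shared ("Siamese") linear encoder and a hinge loss: positive pairs are pulled together, negative pairs pushed at least a margin apart. The encoder and the `contrastive_loss` helper are illustrative assumptions, not a specific published method:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(4, 8)) / np.sqrt(8)   # one encoder shared by both branches

def embed(v):
    """Shared (Siamese) encoder: both inputs go through the same weights."""
    return W @ v

def contrastive_loss(x, y_pos, y_neg, margin=1.0):
    """Pull the positive pair's embeddings together; push the negative
    pair apart until its distance exceeds `margin` (hinge term)."""
    d_pos = np.sum((embed(x) - embed(y_pos)) ** 2)
    d_neg = np.sum((embed(x) - embed(y_neg)) ** 2)
    return float(d_pos + max(0.0, margin - d_neg))

x = rng.normal(size=8)
y_pos = x + 0.05 * rng.normal(size=8)   # a distorted view of x (compatible)
y_neg = rng.normal(size=8)              # an unrelated sample (incompatible)
loss = contrastive_loss(x, y_pos, y_neg)
```

The negative term is what prevents collapse here: without it, shrinking W toward zero would make every distance, and hence the loss, vanish.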
Another interesting avenue is latent-variable predictive models:
Latent-variable models can be trained with contrastive methods. A good example is a generative adversarial network:
Contrastive methods have a major issue: they are very inefficient to train. In high-dimensional spaces such as images, there are many ways one image can be different from another.
“Happy families are all alike; every unhappy family is unhappy in its own way.” (Tolstoy, Anna Karenina)
Non-contrastive methods applied to joint embedding architectures are the hottest topic in SSL for vision:
They use various tricks, such as making the two branches slightly different (e.g., BYOL's momentum-updated target network) or computing virtual target embeddings for groups of similar images (e.g., SwAV).
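One such trick, the momentum (EMA) target branch used by BYOL-style methods, can be sketched as follows. `ema_update` is an illustrative name, and real implementations apply it to full network parameter vectors rather than a single weight array:

```python
import numpy as np

def ema_update(target, online, tau=0.99):
    """Momentum (EMA) update of the target branch: the target network is
    never updated by gradients, only by this slow exponential average."""
    return tau * target + (1.0 - tau) * online

online_w = np.ones(4)    # weights updated by gradient descent (not shown)
target_w = np.zeros(4)   # weights of the slowly moving target branch
for _ in range(10):
    target_w = ema_update(target_w, online_w)
# target_w slowly tracks online_w; this asymmetry between the two branches
# is one of the mechanisms believed to prevent collapse.
```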
Perhaps a better alternative in the long run will be to devise non-contrastive methods with latent-variable predictive models.
The main obstacle is that they require a way to minimize the capacity of the latent variable:
A successful example of such a method is the Variational Auto-Encoder (VAE), in which the latent variable is made “fuzzy” by sampling it from a Gaussian rather than computing it deterministically, which limits its information capacity. But VAEs have not yet been shown to produce good representations for downstream visual tasks.
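The “fuzzy” latent can be sketched with the standard VAE reparameterization and the KL penalty toward a standard normal prior; the helper names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps: the code is a noisy ("fuzzy") version
    of mu, so it cannot carry unlimited information about the input."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(q(z|x) || N(0, I)); penalizing it further limits latent capacity."""
    return float(0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var))

mu = np.array([0.5, -0.2])        # encoder output: mean of the latent code
log_var = np.array([-1.0, -1.0])  # encoder output: log-variance of the code
z = reparameterize(mu, log_var)
kl = kl_to_standard_normal(mu, log_var)
```

Training minimizes reconstruction error plus this KL term, trading reconstruction fidelity against the amount of information the latent is allowed to carry.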
Another successful example is sparse modeling, but its use has so far been limited to simple architectures. No perfect recipe yet exists for limiting the capacity of latent variables.