Google Research, Brain Team
[class] token: prepend a learnable embedding whose state at the output of the Transformer encoder serves as the image representation.
Hybrid architecture: the input sequence is formed from the feature maps of a CNN, with a patch size of 1 "pixel".
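A minimal NumPy sketch of the input pipeline described above (shapes, names and the random "weights" are illustrative assumptions, not the paper's code): flatten patches, project them, prepend a learnable [class] token, add positional embeddings.

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) image into flattened (patch*patch*C)-dim vectors."""
    H, W, C = img.shape
    p = img.reshape(H // patch, patch, W // patch, patch, C)
    return p.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)

rng = np.random.default_rng(0)
img = rng.normal(size=(224, 224, 3))
d_model = 768                                      # hidden size D (assumed)

x = patchify(img)                                  # (196, 768) for 16x16 patches
E = rng.normal(size=(x.shape[1], d_model))         # patch projection (would be learned)
cls_token = rng.normal(size=(1, d_model))          # learnable [class] embedding
pos_emb = rng.normal(size=(x.shape[0] + 1, d_model))  # learnable positional embeddings

tokens = np.concatenate([cls_token, x @ E], axis=0) + pos_emb  # (197, 768)
# After the Transformer encoder, the state of tokens[0] (the [class] token)
# would serve as the image representation.
```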
Transformers lack some of the inductive biases inherent to CNNs (translation equivariance and locality) and do not generalize well when trained on insufficient amounts of data (e.g. ImageNet only).
CNNs have translation equivariance because they apply the same filter (kernel) across the entire input: the same weights are used regardless of where the kernel sits in the image. Since convolution is a sliding-window operation, the network responds to features (edges, textures) in the same way wherever they appear. In Transformers, the self-attention mechanism lets every position attend to every other position, capturing global dependencies but losing locality and requiring positional encodings to reintroduce spatial information.
Locality means that CNNs process information within a small, localized region of the input at a time (the receptive field), due to small kernel sizes such as 3x3. Important features such as edges, corners or textures are often local and do not require global context initially.
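A small NumPy check of both properties (the naive convolution helper is my own, purely illustrative): each output value depends only on a local 3x3 window (locality), and shifting the input shifts the feature map by the same amount (translation equivariance).

```python
import numpy as np

def conv2d_valid(x, k):
    """Naive 'valid' cross-correlation: each output value depends only on a
    local kh x kw window of the input (locality)."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 16))
k = rng.normal(size=(3, 3))   # small kernel -> local receptive field

# Translation equivariance: convolve-then-shift equals shift-then-convolve
# (the rows affected by np.roll wrap-around are sliced off).
shift = 2
a = conv2d_valid(x, k)[shift:, :]
b = conv2d_valid(np.roll(x, -shift, axis=0), k)[:-shift, :]
print(np.allclose(a, b))  # True: the feature map shifts with the input
```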
Current state of the art is held by Noisy Student (Xie et al., 2020) for ImageNet and by Big Transfer (BiT) (Kolesnikov et al., 2020) for the other reported datasets (VTAB, CIFAR, ...).
ViT models pre-trained on larger datasets (14M-300M images, e.g. JFT-300M) outperform ResNet-based BiT (Kolesnikov et al., 2020), by a percentage point or less.
When pre-trained on ImageNet only, larger ViT models underperform smaller ones despite heavy regularization; the full benefit of larger models only appears with JFT-300M pre-training.
ViT benefits from efficient implementations on hardware accelerators, since self-attention parallelizes well.
ViT uses 2-4x less compute to attain the same performance as ResNets.
Few-shot linear accuracy (a regularized least-squares fit on frozen representations) is used for fast on-the-fly evaluation where full fine-tuning would be too costly.
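A rough sketch of such a few-shot linear probe via closed-form ridge regression on frozen features, mapping representations to {-1, +1}^K targets (shapes and the regularization strength are made up for illustration; this is not the authors' exact setup).

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 50, 768, 10                      # few-shot examples, feature dim, classes

feats = rng.normal(size=(N, D))            # frozen ViT representations (placeholder)
labels = rng.integers(0, K, size=N)
targets = -np.ones((N, K))                 # {-1, +1}^K target vectors
targets[np.arange(N), labels] = 1.0

lam = 1e-3                                 # ridge regularization strength (assumed)
W = np.linalg.solve(feats.T @ feats + lam * np.eye(D), feats.T @ targets)

preds = np.argmax(feats @ W, axis=1)       # evaluate (here on the training split)
print("train accuracy:", (preds == labels).mean())
```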
learned positional embeddings (as used in BERT) vs. fixed sinusoidal positional encodings (as in the original Transformer)?
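A side-by-side sketch of the two options in that question (the sinusoidal formula is from the original Transformer paper; the learned variant is simply a trainable table, as in BERT and ViT; shapes are assumptions).

```python
import numpy as np

def sinusoidal_encoding(num_pos, d_model):
    """Fixed encoding: PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(num_pos)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((num_pos, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe  # deterministic, not trained

num_tokens, d_model = 197, 768             # 196 patches + [class] token (assumed)

fixed_pe = sinusoidal_encoding(num_tokens, d_model)
learned_pe = np.random.default_rng(0).normal(scale=0.02, size=(num_tokens, d_model))
# learned_pe would be a trainable parameter updated by backprop like any weight;
# ViT uses learned 1D positional embeddings and reports that 2D-aware variants
# give no significant gains.
```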