Transformer Architectures in Vision
[2018 ICML] Image Transformer
[2019 CVPR] Video Action Transformer Network
[2020 ECCV] End-to-End Object Detection with Transformers
[2021 ICLR] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
1. Transformer in Vision
Sangmin Woo
2020.10.29
[2018 ICML] Image Transformer
[2019 CVPR] Video Action Transformer Network
[2020 ECCV] End-to-End Object Detection with Transformers
[2021 ICLR] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
2. 2 / 36
Contents
[2018 ICML] Image Transformer
Niki Parmar1 Ashish Vaswani1 Jakob Uszkoreit1 Łukasz Kaiser1 Noam Shazeer1 Alexander Ku2,3 Dustin Tran4
1Google Brain, Mountain View, USA
2Department of Electrical Engineering and Computer Sciences, University of California, Berkeley
3Work done during an internship at Google Brain
4Google AI, Mountain View, USA.
[2019 CVPR] Video Action Transformer Network
Rohit Girdhar1 João Carreira2 Carl Doersch2 Andrew Zisserman2,3
1Carnegie Mellon University 2DeepMind 3University of Oxford
[2020 ECCV] End-to-End Object Detection with Transformers
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko
Facebook AI
[2021 ICLR under review] An Image is Worth 16x16 Words: Transformers for Image
Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa
Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby
Google Research, Brain Team
3. 3 / 36
Background
Attention is all you need [2017 NIPS]
• The main idea of the original architecture is to compute self-attention by
comparing a feature to all other features in the sequence.
• Features are first mapped to a query (Q) and memory (key and value, K &
V ) embedding using linear projections.
• The output for the query is computed as an attention weighted sum of
values (V), with the attention weights obtained from the product of the
query (Q) with keys (K).
• In practice, the query (Q) is the word being translated, and the keys (K)
and values (V) are linear projections of the input sequence and of the
output sequence generated so far.
• A positional encoding is also added to these representations in order to
incorporate positional information which is lost in this setup.
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
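The computation above can be sketched in NumPy (a toy illustration of mine, not the authors' code; the matrices and sizes are made up):

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n_q, n_k) logits
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # softmax over keys
    return w @ V                                     # weighted sum of values

def sinusoidal_positions(n, d):
    # Fixed sinusoidal positional encodings (Vaswani et al., 2017)
    pos = np.arange(n)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d)) + sinusoidal_positions(n, d)  # add positions
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))  # linear projections
out = attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)   # (5, 8): one updated vector per query
```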
5. 5 / 36
Image Transformer [2018 ICML]
Pixel-RNN / Pixel-CNN (van den Oord et al., 2016)
• Straightforward
• Tractable likelihood
• Simple and stable
[2] van den Oord, Aäron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. ICML, 2016.
[3] van den Oord, Aäron, Kalchbrenner, Nal, Vinyals, Oriol, Espeholt, Lasse, Graves, Alex, and Kavukcuoglu, Koray. Conditional image
generation with PixelCNN decoders. NIPS, 2016.
Pixel-RNN Pixel-CNN
6. 6 / 36
Image Transformer [2018 ICML]
Motivation
• Pixel-RNN and Pixel-CNN turned the problem into sequence modeling
problem by applying RNN or CNN to predict each next pixel given all
previously generated pixels.
• RNN is computationally heavy
• CNN is parallelizable
• CNN has a limited receptive field → long-range dependency problem →
stacking more layers enlarges it, but is expensive
• RNN has virtually unlimited receptive field
• Self-attention can achieve a better balance in the trade-off between the
virtually unlimited receptive field of the necessarily sequential PixelRNN
and the limited receptive field of the much more parallelizable PixelCNN
and its various extensions
8. 8 / 36
Image Transformer [2018 ICML]
Image Transformer
• 𝑞: single channel of one pixel
(query)
• 𝑚1, 𝑚2, 𝑚3: memory of
previously generated pixels
(key)
• 𝑝 𝑞, 𝑝1, 𝑝2, 𝑝3: position encodings
• 𝑐𝑚𝑝: first embed the query and key,
then apply a dot product.
9. 9 / 36
Image Transformer [2018 ICML]
Local Self-Attention
• Scalability issue is in the self-attention mechanism
• Restrict the positions in the memory matrix M to a local neighborhood
around the query position
• 1d vs. 2d Attention
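A 1-D version of this restriction can be sketched as follows (a toy NumPy illustration of mine; the paper actually uses block-wise 1D and 2D local attention over rasterized pixels):

```python
import numpy as np

def local_causal_attention(X, window):
    # Each query attends only to itself and the `window` previous
    # positions (a restricted local memory), instead of the full
    # O(n^2) memory of global self-attention.
    n, d = X.shape
    out = np.empty_like(X)
    for i in range(n):
        lo = max(0, i - window)
        K = V = X[lo:i + 1]                   # local memory block
        scores = X[i] @ K.T / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()                          # softmax over the window
        out[i] = w @ V
    return out

X = np.random.default_rng(0).normal(size=(16, 4))
print(local_causal_attention(X, window=3).shape)  # (16, 4)
```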
12. 12/ 36
Video Action Transformer Network
[2019 CVPR]
Action Recognition
[4] Sigurdsson, Gunnar A., et al. Hollywood in homes: Crowdsourcing data collection for activity understanding. ECCV, 2016.
14. 14/ 36
Video Action Transformer Network
[2019 CVPR]
Action Transformer
• Action Transformer unit takes as input the video feature representation
and the box proposal from RPN and maps it into query (𝑄) and memory
(𝐾&𝑉) features.
• Query (𝑄): The person being classified.
• Memory (𝐾&𝑉): Clip around the person.
• The unit processes the query (𝑄) and memory (𝐾&𝑉) to output an updated
query vector (𝑄∗).
• The intuition is that the self-attention will add context from other people
and objects in the clip to the query (𝑄) vector, to aid with the subsequent
classification.
• This unit can be stacked in multiple heads and layers, by concatenating
the output from the multiple heads at a given layer, and using the
concatenated feature as the next query.
• This updated query (𝑄∗) is then used to again attend to context features
in the following layer.
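The update described above can be sketched as a single-head unit (a toy NumPy illustration with random features, not the paper's implementation; the real unit also includes layer normalization, dropout, and a feed-forward block):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def action_transformer_unit(person, clip, Wq, Wk, Wv):
    # person: (d,)  RoI-pooled feature of the person being classified (query)
    # clip:   (n, d) spatio-temporal features of the surrounding clip (memory)
    q = person @ Wq
    K, V = clip @ Wk, clip @ Wv
    w = softmax(q @ K.T / np.sqrt(q.shape[-1]))   # attention over the clip
    context = w @ V                               # weighted clip context
    return person + context                       # residual: updated query Q*

rng = np.random.default_rng(0)
d, n = 8, 20
W = [rng.normal(size=(d, d)) for _ in range(3)]
q_star = action_transformer_unit(rng.normal(size=d),
                                 rng.normal(size=(n, d)), *W)
print(q_star.shape)  # (8,)
```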
18. 18/ 36
End-to-End Object Detection with
Transformer [2020 ECCV]
Object Detection
[5] Ren, Shaoqing, et al. Faster r-cnn: Towards real-time object detection with region proposal networks. NIPS. 2015.
19. 19/ 36
End-to-End Object Detection with
Transformer [2020 ECCV]
Faster R-CNN
[5] Ren, Shaoqing, et al. Faster r-cnn: Towards real-time object detection with region proposal networks. NIPS. 2015.
20. 20/ 36
End-to-End Object Detection with
Transformer [2020 ECCV]
Non Maximum Suppression (NMS)
NMS
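For reference, greedy NMS can be sketched in a few lines of NumPy (my own toy implementation with made-up boxes and scores):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    # boxes: (n, 4) as [x1, y1, x2, y2]; greedy suppression by score.
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the kept box with every remaining candidate
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thresh]     # drop near-duplicate boxes
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30.]])
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first
```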
21. 21/ 36
End-to-End Object Detection with
Transformer [2020 ECCV]
Motivation
• Multi-staged pipeline
• Too many hand-crafted components or heuristics (e.g., non-maximum
suppression, anchor boxes) that explicitly encode our prior knowledge
about the task
• Let’s simplify these pipelines with end-to-end philosophy!
• Let’s remove the need of heuristics with direct set prediction!
• Forces unique predictions via a bipartite matching loss between
predicted and ground-truth objects.
• Encoder-decoder architecture based on Transformer
• Transformers explicitly model all pairwise interactions between
elements in a sequence, which is particularly suitable for the specific
constraints of set prediction, such as removing duplicate
predictions.
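The bipartite matching can be illustrated with SciPy's Hungarian solver (a toy stand-in of mine for DETR's matcher; the real pairwise cost mixes class probability with box L1/GIoU terms, while here it is just random numbers):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix between 4 predicted boxes and 2 ground-truth boxes:
# cost[i, j] is the matching cost of prediction i against GT object j.
rng = np.random.default_rng(0)
cost = rng.uniform(size=(4, 2))

pred_idx, gt_idx = linear_sum_assignment(cost)   # optimal one-to-one matching
print(list(zip(pred_idx.tolist(), gt_idx.tolist())))
# Predictions left unmatched are supervised as the "no object" class.
```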
22. 22/ 36
End-to-End Object Detection with
Transformer [2020 ECCV]
DETR (DEtection TRansformer) in high-level
28. 28/ 36
End-to-End Object Detection with
Transformer [2020 ECCV]
DETR PyTorch inference code: Very simple
29. 29/ 36
An Image is Worth 16x16 words
[2021 ICLR under review]
Motivation
• Large Transformer-based models are often pre-trained on large corpora
and then fine-tuned for the task at hand: BERT uses a denoising self-
supervised pre-training task, while the GPT line of work uses language
modeling as its pre-training task
• Vision Transformer (ViT) yields modest results when trained on mid-sized
datasets such as ImageNet, achieving accuracies a few percentage
points below ResNets of comparable size. This seemingly discouraging
outcome may be expected: Transformers lack some of the inductive biases
inherent to CNNs, such as translation equivariance and locality, and
therefore do not generalize well when trained on insufficient amounts of
data.
• However, the picture changes if ViT is trained on large datasets (14M–300M
images), i.e., large-scale training trumps inductive bias.
Transformers attain excellent results when pre-trained at sufficient
scale and transferred to tasks with fewer datapoints.
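The patch-embedding step ("an image is worth 16x16 words") can be sketched in NumPy as follows (my own illustration; ViT additionally applies a learned linear projection, a class token, and position embeddings):

```python
import numpy as np

def patchify(img, p=16):
    # Split an (H, W, C) image into N = (H/p)*(W/p) flattened p x p patches,
    # the "16x16 words" fed to the Transformer encoder.
    H, W, C = img.shape
    patches = img.reshape(H // p, p, W // p, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)
    return patches

img = np.zeros((224, 224, 3))
tokens = patchify(img)            # each patch is then linearly projected
print(tokens.shape)               # (196, 768): 14*14 patches of dim 16*16*3
```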
30. 30/ 36
An Image is Worth 16x16 words
[2021 ICLR under review]
Vision Transformer (ViT)
31. 31/ 36
An Image is Worth 16x16 words
[2021 ICLR under review]
ViT vs. BiT (Alexander Kolesnikov et al. Big transfer (BiT): General visual representation learning. In ECCV, 2020.)
32. 32/ 36
An Image is Worth 16x16 words
[2021 ICLR under review]
Performance vs. pre-training samples
33. 33/ 36
An Image is Worth 16x16 words
[2021 ICLR under review]
Performance vs. cost
34. 34/ 36
An Image is Worth 16x16 words
[2021 ICLR under review]
Image Classification
35. 35/ 36
Concluding Remarks
Transformers are competent at modeling the inter-relationships
between pixels (video clips, image patches, …)
If a Transformer is pre-trained on a sufficient amount of data, it can
replace the CNN and also performs well
Transformer is a generic architecture, even more so than the MLP (I
think…)
Not only in NLP: Transformers also show astonishing results in
Vision!
But the Transformer is known to have quadratic complexity
Here is further reading that reduces the quadratic complexity to
linear complexity:
"Rethinking Attention with Performers" (ICLR 2021 under review)
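To make the quadratic-to-linear idea concrete, here is a toy NumPy sketch of kernelized (linear) attention. Note that Performers use random positive features for the feature map; the simple ReLU-based map below is my own placeholder, so this only illustrates the O(n) factorization, not the actual Performer estimator:

```python
import numpy as np

def linear_attention(Q, K, V):
    # Standard attention computes softmax(Q K^T) V at O(n^2 d) cost.
    # With a positive feature map phi, attention factorizes as
    # phi(Q) (phi(K)^T V), which costs O(n d^2): linear in sequence length n.
    phi = lambda x: np.maximum(x, 0) + 1e-6   # placeholder feature map
    KV = phi(K).T @ V                          # (d, d), built once
    Z = phi(Q) @ phi(K).sum(axis=0)            # per-query normalizer
    return (phi(Q) @ KV) / Z[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(32, 8)) for _ in range(3))
print(linear_attention(Q, K, V).shape)  # (32, 8)
```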