Attention Is All You Need.
With these simple words, the Deep Learning industry was forever changed. Transformers were initially introduced in Natural Language Processing to improve machine translation, but they demonstrated astonishing results even outside language processing. In particular, they have recently spread through the Computer Vision community, advancing the state of the art on many vision tasks. But what are Transformers? What is the mechanism of self-attention, and do we really need it? How did they revolutionize Computer Vision? Will they ever replace convolutional neural networks?
These and many other questions will be answered during the talk.
In this tech talk, we will discuss:
- A piece of history: Why did we need a new architecture?
- What is self-attention, and where does this concept come from?
- The Transformer architecture and its mechanisms
- Vision Transformers: An Image is worth 16x16 words
- Video Understanding using Transformers: the space + time approach
- The scale and data problem: Is Attention what we really need?
- The future of Computer Vision through Transformers
Speakers: Davide Coccomini, Nicola Messina
Website: https://www.aicamp.ai/event/eventdetails/W2021101110
Transformers in Vision: From Zero to Hero
1. Transformers in Vision
From Zero to Hero!
Davide Coccomini (PhD Candidate) & Nicola Messina (PhD Student), Italian National Research Council
2. What do you think when you hear the word «Transformer»?
Davide Coccomini & Nicola Messina | AICamp 2021
3. [Image-only slide]
4. The Transformer «today»
[Diagram: the full encoder-decoder Transformer. Nx encoder layers and Mx decoder layers, each built from Multi-Head Attention (with V, K, Q inputs), Add & Norm, and Feed Forward blocks. Positional Encodings are added (+) to the input and output embeddings. The encoder's output (the Memory) feeds the decoder's second Multi-Head Attention; a final Linear + Softmax produces the next-token probabilities. Example: input "The cat jumps the wall", partial output "Il gatto", prediction "Salta" (90%) | "Odia" (9%) | "Perché" (1%).]
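As a minimal sketch of the Positional Encoding block in the diagram, here is the sinusoidal encoding from "Attention Is All You Need" in NumPy (sequence length and model size are illustrative):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe

pe = positional_encoding(5, 8)  # one encoding vector per position
```

Each position gets a unique, deterministic vector, so the otherwise order-agnostic attention layers can tell token positions apart.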
5. Outline
- Some history: from RNNs to Transformers
- Transformers' attention and self-attention mechanisms
- The power of the Transformer Encoder
- From text to images: Vision Transformers
- From images to videos
- The scale and data problem
- Convolutional Neural Networks and Vision Transformers
- Some interesting real-world applications
6. History
[Timeline: 2017, Transformers introduced in NLP (text); 2020, Vision Transformers (images); 2021, Transformers for video understanding; now, a Computer Vision revolution!]
7. A step back: Recurrent Networks (RNNs)
[Diagram: a sequence-to-sequence RNN. The Encoder (E) reads "The cat jumps the wall" one token at a time, updating hidden states h0…h4 from h_start; the final sentence embedding initializes the Decoder (D), which, starting from <s>, emits "Il gatto salta il muro" through hidden states h5…h9 until <end>.]
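To make the recurrence concrete, here is a minimal, untrained Elman-style RNN encoder in NumPy (weights and embeddings are random; this only illustrates that each hidden state depends on the previous one):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding and hidden size

# Random, untrained weights: input-to-hidden and hidden-to-hidden.
W_xh = rng.normal(scale=0.1, size=(d, d))
W_hh = rng.normal(scale=0.1, size=(d, d))

def encode(token_embeddings):
    h = np.zeros(d)                      # h_start
    for x in token_embeddings:           # strictly sequential: h_t needs h_{t-1}
        h = np.tanh(x @ W_xh + h @ W_hh)
    return h                             # final sentence embedding

sentence = rng.normal(size=(5, d))       # "The cat jumps the wall" as 5 embeddings
embedding = encode(sentence)
```

The `for` loop is the whole point: token t cannot be processed before token t-1, which is exactly the bottleneck the Transformer removes.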
8. Problems
1. We forget tokens that are too far in the past
2. We must wait for the previous token before computing the next hidden state
[Diagram: the same sequence-to-sequence RNN, highlighting the strictly serial chain of hidden states h0…h9.]
9. Solving problem 1
"We forget tokens too far in the past"
Solution: add an attention mechanism.
[Diagram: at each decoding step, the decoder combines all encoder hidden states h0…h4 in a weighted sum (attention) to build a context vector, instead of relying only on the final sentence embedding.]
10. Solving problem 2
"We need to wait for the previous token to compute the next hidden state"
Solution: throw away the recurrent connections, as proposed in the 2017 paper "Attention Is All You Need".
[Diagram: the same encoder-decoder, with the recurrent connections removed; attention alone connects the tokens.]
11. Full Transformer Architecture
[Diagram: the full encoder-decoder Transformer, as on slide 4: Nx encoder and Mx decoder layers of Multi-Head Attention, Add & Norm, and Feed Forward blocks; Positional Encodings added (+) to both embeddings; the encoder Memory feeding the decoder's attention (V, K, Q); Linear + Softmax output. Example: "The cat jumps the wall" / "Il gatto" predicts "Salta" (90%) | "Odia" (9%) | "Perché" (1%).]
12. Transformer's Attention Mechanism
[Diagram: the target tokens ("Il gatto salta …") are projected by FFNs into Queries, and the source tokens ("The cat jumps the …") into Keys and Values. Each Query is dot-multiplied (∙) with every Key, the scores are normalized with a softmax, and the resulting weights aggregate the Values: target tokens "from the point of view" of the source sequence.]
13. Transformer's Attention Mechanism, from a different perspective
[Diagram: attention as a soft Lookup Table over the source sequence ("The cat jumps the … wall"). Each source token contributes a Key/Value pair (shown as arbitrary codes such as "A4", "N9", "O7"); a Query from the target (e.g. "N5" for "gatto") soft-matches against all Keys, and the output is a weighted average of the Values: the "gatto" token is built by aggregating value vectors from the source dictionary.]
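The soft-lookup view above corresponds to scaled dot-product attention, sketched here in a few lines of NumPy (shapes are illustrative; real Transformers add learned projections and multiple heads):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Soft-match queries against keys, then return a weighted
    # average of the values: Attention(Q,K,V) = softmax(QK^T/sqrt(d)) V
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (n_target, n_source)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # 3 target tokens ("Il gatto salta")
K = rng.normal(size=(5, 8))   # 5 source tokens ("The cat jumps the wall")
V = rng.normal(size=(5, 8))
out = attention(Q, K, V)      # (3, 8): one aggregated vector per target token
```

The softmax is what makes the lookup "soft": instead of retrieving one value for an exact key match, every value contributes in proportion to how well its key matches the query.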
14. Attention and Self-Attention
Self-Attention:
- Source = Target
- Keys, Queries, and Values obtained from the same sentence
- Captures intra-sequence dependencies (e.g. "I gave my dog Charlie some food" attending to itself: Who? To whom? What?)
Attention:
- Source ≠ Target
- Queries from the target, Keys and Values from the source
- Captures inter-sequence dependencies (e.g. "Ho dato da mangiare al mio cane Charlie" attending to "I gave my dog Charlie some food")
[Diagram: two Multi-Head Attention blocks, one with V, K, Q all taken from the same sequence, the other with Queries from the target and Keys/Values from the source.]
16. Full Transformer Architecture
[Diagram: the encoder-decoder Transformer once more, now annotated: the encoder's Multi-Head Self-Attention builds a Lookup Table of Key/Value pairs over the source sequence, which the decoder's Multi-Head Attention queries through the Memory before the final Linear + Softmax.]
17. Is there any problem with Transformers?
Self-attention calculation is O(n²): every token attends to every other token.
[Diagram: each word of "I gave my dog Charlie some food" attending to every word of the same sentence.]
18. [Diagram: the full attention matrix for "I gave my dog Charlie some food": n tokens require n x n = n² attention scores.]
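The quadratic growth is easy to check numerically (a trivial sketch):

```python
# Self-attention compares every token with every other token,
# so the number of attention scores grows quadratically with length.
sentence = "I gave my dog Charlie some food".split()
for n in (len(sentence), 2 * len(sentence), 4 * len(sentence)):
    print(f"{n:3d} tokens -> {n * n:5d} attention scores")
```

Doubling the sequence length quadruples the cost, which is why pixel-level attention on images quickly becomes intractable.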
19. The Power of the Transformer Encoder
- Many achievements using only the Encoder
- BERT (Devlin et al., 2018)
[Diagram: BERT pre-training. The input "<CLS> I gave my dog Charlie some food <SEP> He ate it" passes through an Embedding Layer with Positional Encoding and N Transformer Encoder layers; on top, Masked Language Modelling predicts the masked word («ate») and Next Sentence Prediction outputs {0, 1}.]
20. Transformers in Computer Vision
Can we use the self-attention mechanism in images?
21. Transformers in Computer Vision
- The Transformer works with a set of tokens; what are the tokens in images?
- Using single pixels as tokens is impossible: a 256px x 256px image has 65,536 pixels, so self-attention would require 65,536² ≈ 4.3 billion calculations.
22. Transformers in Computer Vision
- Tokens as the features from an object detector
[Diagram: an object detector proposes regions; ROI Pooling over each region produces one feature vector per object, and these vectors become the tokens.]
23. Vision Transformers (ViTs)
“An image is worth 16x16 words”
[Diagram: a 256px x 256px image is split into 16px x 16px patches; each patch goes through a Linear Projection, and the resulting patch embeddings are the tokens.]
“An Image is Worth 16x16 Words” | Dosovitskiy et al., 2020
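The patch-splitting step can be sketched in NumPy as a pure reshape (the learned Linear Projection and position embeddings of the real ViT are omitted):

```python
import numpy as np

def image_to_patches(img, patch=16):
    # Split an H x W x C image into non-overlapping patch x patch tiles
    # and flatten each tile into one token vector.
    H, W, C = img.shape
    tiles = img.reshape(H // patch, patch, W // patch, patch, C)
    tiles = tiles.transpose(0, 2, 1, 3, 4)        # (H/p, W/p, p, p, C)
    return tiles.reshape(-1, patch * patch * C)   # (num_patches, p*p*C)

img = np.arange(256 * 256 * 3).reshape(256, 256, 3)
tokens = image_to_patches(img)  # a 16x16 grid of patches -> 256 tokens of length 768
```

Treating 16x16 patches as "words" turns the 65,536-pixel image into a 256-token sequence, which self-attention can handle comfortably.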
27. TimeSformers
Combine space and time attention with Divided Space-Time Attention!
[Diagram: for frames t-δ, t, and t+δ, each patch attends over Time (the same patch position in the other frames) and over Space (the other patches of its own frame), as two separate attention steps.]
“Is Space-Time Attention All You Need for Video Understanding?” | Bertasius et al., 2021
40. Why Fourier?
It’s just a transformation!
[Image: an illustration of the Fourier Transform. Image from mriquestion.com]
41. Fourier Network
What does it transform?
[Diagram: in FNet, the input vectors go through two Fourier Transforms: one over the sequence domain (mixing the tokens) and one over the hidden domain.]
“FNet: Mixing Tokens with Fourier Transforms” | Lee-Thorp et al., 2021
42. MLP Mixer
[Diagram: the image is split into patches (1…9); a Per-patch Fully Connected layer embeds each patch; N x (Mixer Layer) blocks mix the embeddings; Global Average Pooling and a Fully-Connected head output the CLASS.]
“MLP-Mixer: An all-MLP Architecture for Vision” | Tolstikhin et al., 2021
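A single Mixer Layer can be sketched as two MLPs, one mixing across patches (tokens) and one across channels. This is a simplification: the paper uses GELU and LayerNorm, which are omitted here, and all weights below are random and illustrative:

```python
import numpy as np

def mlp(x, W1, W2):
    # Two-layer MLP (ReLU here; the paper uses GELU).
    return np.maximum(x @ W1, 0) @ W2

def mixer_layer(X, Wt1, Wt2, Wc1, Wc2):
    # X: (patches, channels). Token-mixing operates on the transposed
    # matrix so the MLP runs ACROSS patches; channel-mixing runs
    # across channels. Both use residual (skip) connections.
    X = X + mlp(X.T, Wt1, Wt2).T   # token mixing
    X = X + mlp(X, Wc1, Wc2)       # channel mixing
    return X

rng = np.random.default_rng(0)
P, C, H = 9, 8, 16                 # patches, channels, hidden width
X = rng.normal(size=(P, C))
out = mixer_layer(
    X,
    rng.normal(scale=0.1, size=(P, H)), rng.normal(scale=0.1, size=(H, P)),
    rng.normal(scale=0.1, size=(C, H)), rng.normal(scale=0.1, size=(H, C)),
)
```

The token-mixing MLP plays the role that self-attention plays in a Transformer, but with fixed (learned, not input-dependent) mixing weights.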
46. Why are they different?
Convolutional Neural Networks:
- Learn inductive biases: locality sensitive, translation invariant
- Lack global understanding
Vision Transformers:
- Able to find long-term dependencies
- Need a very large dataset for training
47. A different point of view
ViTs are both local and global!
[Plot: with a low amount of data, the ViT learns only global information; attention heads focus on farther patches (weights such as 0.7 on distant patches vs. 0.01 on close ones).]
48. A different point of view
ViTs are both local and global!
[Plot: with more data, the ViT also learns local information; higher-layer heads still focus on farther patches, while lower-layer heads focus on both farther and closer patches.]
49. A different point of view
They learn different representations!
[Plot: CKA layer-similarity structure. ViTs show similar representations through the layers; CNNs show different representations through the layers.]
“Do Vision Transformers See Like Convolutional Neural Networks?” | Raghu et al., 2021
50. A different point of view
Vision Transformers are very robust!
[Figure: ViTs stay robust under occlusion, distribution shift, adversarial perturbation, and patch permutation.]
“Intriguing Properties of Vision Transformers” | Naseer et al., 2021
56. Recap: How can we use Transformers in Vision?
- Use a pure Vision Transformer
- Improve the internal attention mechanism
- Use an alternative transformation
- Combine CNNs with Vision Transformers
59. Video: Self-Supervised Learning with DINO
Source: "Advancing the state of the art in computer vision with self-supervised Transformers and 10x more efficient training", Facebook AI
60. [Video slide, same source as slide 59.]
Notes: they plot CKA similarities between all pairs of layers across different model architectures. ViTs have a relatively uniform layer-similarity structure, with a clear grid-like pattern and large similarity between lower and higher layers. By contrast, the ResNet models show clear stages in the similarity structure, with smaller similarity between lower and higher layers.