https://telecombcn-dl.github.io/dlmm-2017-dcu/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks that until now had been addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
3. Attention Models: Motivation
[Figure: a CNN maps an image (H x W x 3) to the label "bird"]
The whole input volume is used to predict the output, despite the fact that not all pixels are equally important.
6. Encoder & Decoder for Machine Translation
Limitation 1: All the information is encoded in a single fixed-size vector, no matter the length of the input sentence.
7. Encoder & Decoder
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
Limitation 2: All output predictions are based on the final, static recurrent state of the encoder (h_T). No matter which output word is being predicted at each time step, all input words are considered in an equal way.
8. Attention Mechanism
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
The vector fed to the RNN at each time step is a weighted sum of all the annotation vectors.
9. Attention Mechanism
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
An attention weight (scalar) is predicted at each time step for each annotation vector h_j with a simple fully connected neural network.
[Figure: the annotation vector h_1 and the recurrent state z_i feed a small network that outputs the attention weight a_1]
10. Attention Mechanism
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
The same fully connected network is shared for all j.
[Figure: the annotation vector h_2 and the recurrent state z_i feed the shared network, which outputs the attention weight a_2]
11. Attention Mechanism
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
Once a relevance score (weight) is estimated for each word, the scores are normalized with a softmax function so that they sum to 1.
12. Attention Mechanism
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
Finally, a context-aware representation c_i for the output word at time step i can be defined as the attention-weighted sum of the annotation vectors:

c_i = Σ_j a_ij h_j
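As a concrete illustration, here is a minimal PyTorch sketch of this additive scoring scheme, in the spirit of the Bahdanau-style attention Cho describes. The module and parameter names (AdditiveAttention, W_h, W_z, v) are illustrative, not from the slides.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Score each annotation vector h_j against the current recurrent
    state z_i with a small fully connected network shared for all j."""
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_z = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, h, z):
        # h: (batch, T, enc_dim) annotation vectors; z: (batch, dec_dim)
        scores = self.v(torch.tanh(self.W_h(h) + self.W_z(z).unsqueeze(1)))
        a = F.softmax(scores, dim=1)    # normalized: weights sum to 1 over T
        c = (a * h).sum(dim=1)          # context vector c_i = sum_j a_ij h_j
        return c, a.squeeze(-1)
```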
13. Attention Mechanism
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
The model automatically finds the correspondence structure between two languages (alignment).
(Edge thicknesses represent the attention weights found by the attention model)
16. Attention Models
Chan et al. Listen, Attend and Spell. ICASSP 2016
Source: distill.pub
Input: Audio features; Output: Text
Attend to different parts of the input to optimize a certain output
17. Attention Models
Input: Image; Output: Text
A bird flying over a body of water
Attend to different parts of the input to optimize a certain output
18. LSTM Decoder for Image Captioning
[Figure: CNN features (dimension D) are fed to an LSTM that decodes the caption word by word: "A" → "bird" → "flying" → ... → <EOS>]
The LSTM decoder "sees" the input only at the beginning!
Vinyals et al. Show and tell: A neural image caption generator. CVPR 2015
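A minimal PyTorch sketch of this "see the image once" design, loosely following Vinyals et al.; all names (ShowAndTellDecoder, img_proj) and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ShowAndTellDecoder(nn.Module):
    """The image embedding is fed to the LSTM only once, as the input
    of the very first time step; afterwards the LSTM only sees the
    previously generated words."""
    def __init__(self, feat_dim, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)  # CNN features -> embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):
        # feats: (batch, feat_dim) CNN features; captions: (batch, T) token ids
        x = torch.cat([self.img_proj(feats).unsqueeze(1),   # image only at t = 0
                       self.embed(captions)], dim=1)
        h, _ = self.lstm(x)
        return self.out(h)    # word logits at every time step
```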
19. Attention for Image Captioning
[Figure: a CNN maps the image (H x W x 3) to a grid of features (L x D)]
Slide Credit: CS231n
20. Attention for Image Captioning
[Figure: from the L x D features and the initial state h0, the model predicts attention weights a1 over the L grid locations]
Slide Credit: CS231n
21. Attention for Image Captioning
[Figure: the attention weights a1 give z1, a weighted combination of the features (dimension D); z1 and the first word y1 feed the LSTM to compute h1, which predicts both the next attention weights a2 and the next word y2]
Slide Credit: CS231n
22. Attention for Image Captioning
[Figure: the process repeats: z2 (weighted features, dimension D) and the predicted word y2 feed the LSTM to compute h2, which predicts a3 and y3, and so on]
Slide Credit: CS231n
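The decode loop in these figures could look roughly as follows (greedy decoding; `attend` is an attention module like the sketch above, and every helper name here is hypothetical):

```python
import torch

def caption_with_attention(feats, attend, lstm_cell, embed, out, start_id, max_len=20):
    """Greedy decoding sketch: at each step, attend over the L x D
    feature grid, then feed the weighted feature summary z together
    with the previous word to an LSTM cell."""
    h = torch.zeros(1, lstm_cell.hidden_size)   # initial state h0
    c = torch.zeros(1, lstm_cell.hidden_size)
    word = torch.tensor([start_id])
    caption = []
    for _ in range(max_len):
        z, a = attend(feats.unsqueeze(0), h)    # z: (1, D) weighted features
        x = torch.cat([embed(word), z], dim=-1) # condition on word + attention
        h, c = lstm_cell(x, (h, c))             # one recurrent step
        word = out(h).argmax(dim=-1)            # next predicted word
        caption.append(word.item())
    return caption
```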
23. Attention for Image Captioning
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
24. Attention for Image Captioning
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
25. Attention for Image Captioning
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
26. Attention for Image Captioning
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
Some outputs can probably be predicted without looking at the image...
28. Attention for Image Captioning
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
Can we focus on the image only when necessary?
29. Attention for Image Captioning
Lu et al. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. CVPR 2017
[Figure: "regular" attention vs. attention with a visual sentinel]
30. Attention for Image Captioning
Lu et al. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. CVPR 2017
31. Soft Attention
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
[Figure: a CNN maps the image (H x W x 3) to a grid of features a, b, c, d (each D-dimensional); from the RNN state, a distribution over grid locations is predicted: p_a + p_b + p_c + p_d = 1]

Soft attention: summarize ALL locations into a context vector z (D-dimensional),

z = p_a·a + p_b·b + p_c·c + p_d·d

The derivative dz/dp is nice! Train with gradient descent.
Slide Credit: CS231n
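In code, this soft attention step is just a softmax followed by a weighted sum; a minimal PyTorch sketch with arbitrary dimensions:

```python
import torch
import torch.nn.functional as F

# Four D-dimensional grid features a, b, c, d (here D = 512) and their
# relevance scores, which in practice are predicted from the RNN state
feats = torch.randn(4, 512)
scores = torch.randn(4, requires_grad=True)

p = F.softmax(scores, dim=0)              # p_a + p_b + p_c + p_d = 1
z = (p.unsqueeze(1) * feats).sum(dim=0)   # z = p_a*a + p_b*b + p_c*c + p_d*d

z.sum().backward()                        # dz/dp exists: trainable end to end
print(scores.grad)                        # gradients reach the attention scores
```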
32. Soft Attention
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
Same pipeline as above: z = p_a·a + p_b·b + p_c·c + p_d·d is a differentiable function, so the model trains with gradient descent.
● Still uses the whole input!
● Constrained to a fixed grid
Slide Credit: CS231n
33. Hard Attention
Hard attention: sample a subset of the input.
[Figure: from the input image (H x W x 3), box coordinates (xc, yc, w, h) select a cropped and rescaled image (X x Y x 3)]
Sampling is not a differentiable function! Can't train with backprop :(
Other optimization strategies are needed, e.g. reinforcement learning.
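A small PyTorch sketch of why the sampling step breaks backprop, together with the usual REINFORCE-style (score-function) workaround; the reward here is a placeholder:

```python
import torch

feats = torch.randn(4, 512)                 # candidate locations to attend to
scores = torch.randn(4, requires_grad=True)
p = torch.softmax(scores, dim=0)

# Hard attention: sample ONE location instead of averaging all of them
idx = torch.multinomial(p, num_samples=1)   # discrete sample: no gradient path
z = feats[idx.item()]                       # selected feature, used downstream

# Backprop cannot flow through the sampling, so a surrogate loss is used:
# grad ≈ reward * d log p(idx) / d scores   (REINFORCE)
reward = 1.0                                # placeholder for the task reward
loss = -reward * torch.log(p[idx.item()])
loss.backward()                             # scores.grad is now populated
```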
34. Attention for Image Captioning
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
35. Hard Attention
[Figure: box coordinates (xc, yc, w, h) crop and rescale the input image (H x W x 3) to (X x Y x 3), which a CNN then classifies as "bird"]
The crop is not a differentiable function! Can't train with backprop :(
36. Spatial Transformer Networks
Jaderberg et al. Spatial Transformer Networks. NIPS 2015
[Figure: the same crop-and-rescale pipeline, ending in a CNN that predicts "bird"]
The crop is not a differentiable function! Can't train with backprop :(
Idea: make it differentiable, then train with backprop :)
37. Spatial Transformer Networks
Jaderberg et al. Spatial Transformer Networks. NIPS 2015
[Figure: input image (H x W x 3), box coordinates (xc, yc, w, h), cropped and rescaled image (X x Y x 3)]
Can we make this function differentiable?
Idea: a function mapping pixel coordinates (xt, yt) of the output to pixel coordinates (xs, ys) of the input, repeated for all pixels in the output. With box coordinates, the mapping is a translation + scale: xs = sx·xt + tx, ys = sy·yt + ty.
The network attends to the input by predicting the transformation parameters.
Slide Credit: CS231n
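PyTorch exposes exactly this differentiable sampling via F.affine_grid / F.grid_sample; a minimal sketch of a translation + scale attention crop (the concrete numbers are arbitrary):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 64, 64)   # input image as (N, C, H, W)

# Translation + scale expressed as a 2x3 affine matrix in [-1, 1] coords:
# x_s = 0.5 * x_t + 0.25,  y_s = 0.5 * y_t - 0.25
theta = torch.tensor([[[0.5, 0.0, 0.25],
                       [0.0, 0.5, -0.25]]], requires_grad=True)

grid = F.affine_grid(theta, size=(1, 3, 32, 32), align_corners=False)
crop = F.grid_sample(x, grid, align_corners=False)  # bilinear sampling

# grid_sample is differentiable w.r.t. theta, so a localisation network
# that predicts theta can be trained with plain backprop
crop.sum().backward()
print(theta.grad.shape)   # torch.Size([1, 2, 3])
```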
38. Spatial Transformer Networks
Jaderberg et al. Spatial Transformer Networks. NIPS 2015
A differentiable module, easy to incorporate in any network, anywhere!
Insert spatial transformers into a classification network and it learns to attend to and transform the input.
39. Spatial Transformer Networks
Jaderberg et al. Spatial Transformer Networks. NIPS 2015
Fine-grained classification
Also used as an alternative to RoI pooling in proposal-based detection & segmentation pipelines
40. Semantic Attention: Image Captioning
You et al. Image Captioning with Semantic Attention. CVPR 2016
41. Visual Attention: Saliency Detection
Kuen et al. Recurrent Attentional Networks for Saliency Detection. CVPR 2016
42. Visual Attention: Fixation Prediction
Cornia et al. Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model.
43. Visual Attention: Action Recognition
Sharma et al. Action Recognition Using Visual Attention. arXiv 2016
44. Deformable Convolutions
Dai et al. Deformable Convolutional Networks. arXiv 2017
Dynamic & learnable receptive field
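A quick sketch using torchvision's later DeformConv2d implementation (an assumption here: this op post-dates the slides): a plain convolution predicts per-position sampling offsets that deform the receptive field of the main convolution.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

x = torch.randn(1, 3, 32, 32)

# 2 offsets (x, y) per tap of the 3x3 kernel -> 2 * 3 * 3 = 18 channels
offset_net = nn.Conv2d(3, 18, kernel_size=3, padding=1)
deform = DeformConv2d(3, 8, kernel_size=3, padding=1)

offsets = offset_net(x)   # (1, 18, 32, 32): learned, input-dependent offsets
y = deform(x, offsets)    # (1, 8, 32, 32): receptive field follows the offsets
```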