Amaia Salvador
amaia.salvador@upc.edu
PhD Candidate
Universitat Politècnica de Catalunya
DEEP
LEARNING
WORKSHOP
Dublin City University
27-28 April 2017
Attention Models
Day 2 Lecture 12
Attention Models: Motivation
Image:
H x W x 3
bird
The whole input volume is used to predict the output...
2
Attention Models: Motivation
Image:
H x W x 3
bird
The whole input volume is used to predict the output...
...despite the fact that not all pixels are equally important
3
Attention Models: Motivation
Attention models can
relieve computational burden
Helpful when processing big
images !
4
Attention Models: Motivation
Attention models can
relieve computational burden
Helpful when processing big
images !
5
bird
Encoder & Decoder for Machine Translation
6
Limitation 1: All the information is encoded in a fixed-size vector, no matter the length of the input sentence.
Encoder & Decoder
7
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
Limitation 2: All output predictions are based on the final, static recurrent state of the encoder (h_T).
No matter which output word is being predicted at each time step, all input words are considered equally.
Attention Mechanism
8
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
The vector to be fed to the RNN at each timestep is a
weighted sum of all the annotation vectors.
Attention Mechanism
9
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
An attention weight (scalar) is predicted at each time-step for each annotation vector h_j with a simple fully connected neural network.
(Figure: the annotation vector h1 and the recurrent state z_i feed a small network that outputs the attention weight a1)
Attention Mechanism
10
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
An attention weight (scalar) is predicted at each time-step for each annotation vector h_j with a simple fully connected neural network.
(Figure: the annotation vector h2 and the recurrent state z_i feed the same network, shared for all j, to output the attention weight a2)
Attention Mechanism
11
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
Once a relevance score (weight) is estimated for each word, the scores are normalized with a softmax function so that they sum to 1.
Attention Mechanism
12
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
Finally, a context-aware representation c_i for the output word at timestep i can be defined as:
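This is the attention-weighted sum of the annotation vectors, c_i = Σ_j a_ij · h_j. Below is a minimal PyTorch sketch of the mechanism described on slides 8-12 (additive scoring with a small shared network, softmax normalization, weighted sum); layer names and sizes are illustrative, not taken from the reference.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Score every annotation vector h_j against the decoder state z_i with a small
    fully connected network (shared for all j), normalize with softmax, and pool."""
    def __init__(self, annot_dim, state_dim, hidden_dim=128):
        super().__init__()
        self.W_h = nn.Linear(annot_dim, hidden_dim, bias=False)
        self.W_z = nn.Linear(state_dim, hidden_dim, bias=False)
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, annotations, state):
        # annotations: (batch, T, annot_dim) -- encoder states h_1 .. h_T
        # state:       (batch, state_dim)    -- current decoder state z_i
        scores = self.v(torch.tanh(self.W_h(annotations) + self.W_z(state).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)            # attention weights sum to 1 over the T words
        context = (alpha * annotations).sum(dim=1)  # c_i = sum_j alpha_ij * h_j
        return context, alpha.squeeze(-1)
```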
Attention Mechanism
13
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
The model automatically finds the correspondence structure between two languages
(alignment).
(Edge thicknesses represent the attention weights found by the attention model)
Attention Mechanism
14
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
The model automatically found the correspondence structure between two
languages (alignment).
(Edge thicknesses represent the attention weights found by the attention model)
Attention Models
Attend to different parts of the input to optimize a certain output
15
Attention Models
16
Chan et al. Listen, Attend and Spell. ICASSP 2016
Source: distill.pub
Input: Audio features; Output: Text
Attend to different parts of the input to optimize a certain output
Attention Models
17
Input: Image; Output: Text
A bird flying over a body of water
Attend to different parts of the input to optimize a certain output
LSTM Decoder for Image Captioning
(Figure: CNN features of dimension D are fed to the LSTM decoder only at the first step; the chain of LSTM cells then generates “A bird flying ... <EOS>”)
The LSTM decoder “sees” the input only at the beginning!
18
Vinyals et al. Show and tell: A neural image caption generator. CVPR 2015
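A minimal sketch of such a decoder (assumed layer names and dimensions), where the CNN feature vector is injected only as the first LSTM input and every later step sees only the previous word:

```python
import torch
import torch.nn as nn

class ShowAndTellDecoder(nn.Module):
    """Captioning decoder that "sees" the image only at the beginning."""
    def __init__(self, feat_dim, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)   # map CNN features (D) to the word embedding size
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, cnn_features, captions):
        # cnn_features: (batch, D); captions: (batch, T) word indices
        img = self.img_proj(cnn_features).unsqueeze(1)   # (batch, 1, embed_dim) -- the only look at the image
        words = self.embed(captions)                     # (batch, T, embed_dim)
        hidden, _ = self.lstm(torch.cat([img, words], dim=1))
        return self.out(hidden)                          # next-word logits at every step
```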
Attention for Image Captioning
(Figure: a CNN maps the image (H x W x 3) to a grid of features of size L x D)
Slide Credit: CS231n 19
Attention for Image Captioning
(Figure: the CNN maps the image (H x W x 3) to features (L x D); the initial decoder state h0 predicts attention weights a1 over the feature grid)
Slide Credit: CS231n
20
Attention for Image Captioning
Slide Credit: CS231n
(Figure: the CNN maps the image (H x W x 3) to features (L x D); the initial state h0 predicts attention weights a1; the weighted combination of features z1 (size D) and the first word y1 feed the next state h1, which outputs new attention weights a2 and the predicted word y2)
21
Attention for Image Captioning
(Figure: the process repeats: the weighted features z2 and the previous word y2 feed the next state h2, which outputs attention weights a3 and the next word y3)
Slide Credit: CS231n 22
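A minimal sketch of one decoding step under this scheme (names and sizes are illustrative): the previous hidden state scores each of the L feature locations, the softmax weights produce the weighted feature vector z, and z plus the previous word drive the next LSTM step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionCaptionStep(nn.Module):
    def __init__(self, feat_dim, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        self.attn = nn.Linear(feat_dim + hidden_dim, 1)       # scores one grid location at a time
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm_cell = nn.LSTMCell(feat_dim + embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features, prev_word, h, c):
        # features: (batch, L, D); prev_word: (batch,); h, c: (batch, hidden_dim)
        L = features.size(1)
        h_rep = h.unsqueeze(1).expand(-1, L, -1)
        scores = self.attn(torch.cat([features, h_rep], dim=-1))   # (batch, L, 1)
        a = F.softmax(scores, dim=1)                                # attention weights over the L locations
        z = (a * features).sum(dim=1)                               # weighted combination of features, size D
        h, c = self.lstm_cell(torch.cat([z, self.embed(prev_word)], dim=-1), (h, c))
        return self.out(h), h, c, a.squeeze(-1)                     # next-word logits, new state, weights
```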
Attention for Image Captioning
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
23
Attention for Image Captioning
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
24
Attention for Image Captioning
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
25
Attention for Image Captioning
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
26
Some outputs can probably be predicted without looking at the image...
Attention for Image Captioning
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
27
Some outputs can probably be predicted without looking at the image...
Attention for Image Captioning
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
28
Can we focus on the image only when necessary?
Attention for Image Captioning
Lu et al. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. CVPR 2017
29
(Figure: “Regular” Attention vs. Attention with Sentinel)
Attention for Image Captioning
Lu et al. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. CVPR 2017
30
Soft Attention
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
A CNN maps the image (H x W x 3) to a grid of features a, b, c, d (each D-dimensional).
From the RNN state, a distribution over the grid locations is predicted: p_a, p_b, p_c, p_d, with p_a + p_b + p_c + p_d = 1.
Soft attention: summarize ALL locations
z = p_a·a + p_b·b + p_c·c + p_d·d
Context vector z (D-dimensional)
Derivative dz/dp is nice! Train with gradient descent.
Slide Credit: CS231n 31
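A tiny runnable sketch of this computation, assuming the 2x2 grid is flattened into four feature vectors; it also checks that gradients flow back through the attention weights.

```python
import torch

D = 5
features = torch.randn(4, D)                    # rows: a, b, c, d
scores = torch.randn(4, requires_grad=True)     # relevance scores coming from the RNN state
p = torch.softmax(scores, dim=0)                # p_a + p_b + p_c + p_d = 1

# Soft attention: summarize ALL locations.
z = (p.unsqueeze(1) * features).sum(dim=0)      # z = p_a*a + p_b*b + p_c*c + p_d*d

z.sum().backward()                              # dz/dp is well defined ...
print(scores.grad)                              # ... so plain gradient descent works
```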
Soft Attention
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
A CNN maps the image (H x W x 3) to a grid of features a, b, c, d (each D-dimensional).
From the RNN state, a distribution over the grid locations is predicted: p_a, p_b, p_c, p_d, with p_a + p_b + p_c + p_d = 1.
Soft attention: summarize ALL locations
z = p_a·a + p_b·b + p_c·c + p_d·d
Context vector z (D-dimensional)
Differentiable function: train with gradient descent.
● Still uses the whole input!
● Constrained to a fixed grid
Slide Credit: CS231n
32
Hard Attention
(Figure: from the input image (H x W x 3), box coordinates (xc, yc, w, h) select a cropped and rescaled image of size X x Y x 3)
Hard attention: sample a subset of the input.
Not a differentiable function! Can’t train with backprop :(
Need other optimization strategies, e.g. reinforcement learning.
33
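A toy sketch contrasting this with the soft case above: sampling one location breaks differentiability, so a REINFORCE-style (score-function) gradient is used instead. The “reward” below is only a placeholder for whatever downstream task signal is available.

```python
import torch

D = 5
features = torch.randn(4, D)
scores = torch.randn(4, requires_grad=True)
p = torch.softmax(scores, dim=0)

# Hard attention: SAMPLE one location instead of averaging all of them.
dist = torch.distributions.Categorical(probs=p)
idx = dist.sample()                    # discrete choice -- no gradient flows through it
z = features[idx]                      # only the sampled location is processed further

# Backprop cannot reach `scores` through `idx`, so use a policy-gradient surrogate:
reward = z.mean().detach()             # placeholder reward from the downstream task
loss = -dist.log_prob(idx) * reward
loss.backward()
print(scores.grad)                     # gradients arrive via log_prob, not via the sample
```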
Attention for Image Captioning
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
34
Hard Attention
(Figure: box coordinates (xc, yc, w, h) crop and rescale the input image (H x W x 3) to X x Y x 3, which a CNN then classifies as “bird”)
Not a differentiable function! Can’t train with backprop :(
35
Spatial Transformer Networks
(Figure: box coordinates (xc, yc, w, h) crop and rescale the input image (H x W x 3) to X x Y x 3, which a CNN then classifies as “bird”)
Jaderberg et al. Spatial Transformer Networks. NIPS 2015
Not a differentiable function! Can’t train with backprop :(
Make it differentiable, train with backprop :)
36
Spatial Transformer Networks
Jaderberg et al. Spatial Transformer Networks. NIPS 2015
(Figure: for each pixel of the cropped and rescaled output (X x Y x 3), its sampling location in the input image (H x W x 3) is given by the box coordinates (xc, yc, w, h))
Can we make this function differentiable?
Idea: a function mapping pixel coordinates (xt, yt) of the output to pixel coordinates (xs, ys) of the input. The mapping is given by the box coordinates (translation + scale). Repeat for all pixels in the output.
The network attends to the input by predicting the box coordinates.
Slide Credit: CS231n
37
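A sketch of this idea with PyTorch’s built-in grid generator and bilinear sampler; the box parametrization in normalized [-1, 1] coordinates is an assumed convention for illustration, not the paper’s exact one.

```python
import torch
import torch.nn.functional as F

def differentiable_crop(image, box, out_size=(64, 64)):
    """Crop-and-rescale as differentiable sampling. box = (xc, yc, w, h), all in
    normalized [-1, 1] image coordinates (assumed convention for this sketch)."""
    xc, yc, w, h = box
    B, C = image.size(0), image.size(1)
    # Affine matrix for translation + scale: maps output coords (x_t, y_t) to input coords (x_s, y_s).
    theta = torch.stack([
        torch.stack([w / 2, torch.zeros_like(w), xc]),
        torch.stack([torch.zeros_like(w), h / 2, yc]),
    ]).unsqueeze(0).expand(B, -1, -1)
    grid = F.affine_grid(theta, (B, C, *out_size), align_corners=False)  # repeat for all output pixels
    return F.grid_sample(image, grid, align_corners=False)               # bilinear sampling

# The sampler is differentiable, so gradients reach the predicted box coordinates:
image = torch.randn(1, 3, 224, 224)
box = torch.tensor([0.1, -0.2, 0.5, 0.5], requires_grad=True)
differentiable_crop(image, box).mean().backward()
print(box.grad)
```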
Spatial Transformer Networks
Jaderberg et al. Spatial Transformer Networks. NIPS 2015
Easy to incorporate in any network, anywhere !
Differentiable module
Insert spatial transformers into a
classification network and it learns
to attend and transform the input
38
Spatial Transformer Networks
Jaderberg et al. Spatial Transformer Networks. NIPS 2015
39
Fine-grained classification
Also used as an alternative to RoI pooling in proposal-based detection & segmentation pipelines
Semantic Attention: Image Captioning
You et al. Image Captioning with Semantic Attention. CVPR 2016
40
Visual Attention: Saliency Detection
Kuen et al. Recurrent Attentional Networks for Saliency Detection. CVPR 2016
41
Visual Attention: Fixation Prediction
Cornia et al. Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model.
42
Visual Attention: Action Recognition
Sharma et al. Action Recognition Using Visual Attention. arXiv 2016 43
Deformable Convolutions
Dai, Qi, Xiong, Li, Zhang et al. Deformable Convolutional Networks. arXiv Mar 2017
44
Dynamic & learnable receptive field
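Below is a heavily simplified sketch of the idea (not the paper’s implementation, and far slower than the official kernels): a regular convolution predicts 2 offsets per kernel tap and output location, the input is bilinearly sampled at the shifted positions, and the sampled maps are combined with learned weights. The offset channel layout and parameter names are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.k = k
        self.offset_pred = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)  # (dx, dy) per tap
        self.weight = nn.Conv2d(in_ch * k * k, out_ch, kernel_size=1)  # combines the k*k sampled maps

    def forward(self, x):
        B, C, H, W = x.shape
        k = self.k
        offsets = self.offset_pred(x)                                  # (B, 2*k*k, H, W), in pixels
        # Identity sampling grid in normalized [-1, 1] coordinates.
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
        base = torch.stack([xs, ys], dim=-1).to(x).unsqueeze(0).expand(B, -1, -1, -1)  # (B, H, W, 2)
        sampled = []
        for i in range(k * k):
            dy, dx = divmod(i, k)                                      # regular displacement of this tap
            reg = torch.tensor([(dx - k // 2) * 2.0 / max(W - 1, 1),
                                (dy - k // 2) * 2.0 / max(H - 1, 1)]).to(x)
            off = offsets[:, 2 * i:2 * i + 2].permute(0, 2, 3, 1)      # (B, H, W, 2) learned offset
            off = off * 2.0 / torch.tensor([max(W - 1, 1), max(H - 1, 1)]).to(x)  # pixels -> normalized
            sampled.append(F.grid_sample(x, base + reg + off, align_corners=True))
        return self.weight(torch.cat(sampled, dim=1))                  # weighted sum over taps and channels
```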
45
Questions?
