https://telecombcn-dl.github.io/2017-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
Slide 8
Neural Machine Translation (NMT)
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
[Figure] Representation or Embedding
Slide 9
Encoder-decoder
Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014).
[Figure] Language IN → RNN encoder-decoder → Language OUT
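The encoder half of this pipeline can be sketched in a few lines of NumPy: a vanilla tanh RNN reads the source tokens one by one and keeps only its final hidden state as the sentence summary. All sizes and weights below are toy assumptions for illustration, not values from the papers cited here.

```python
import numpy as np

rng = np.random.default_rng(0)
V, E, H = 10, 4, 8  # vocabulary, embedding, and hidden sizes (toy values)

# Randomly initialized toy parameters; a real model learns these by backprop.
W_emb = rng.normal(0, 0.1, (V, E))   # word embedding matrix
W_xh = rng.normal(0, 0.1, (E, H))    # input-to-hidden weights
W_hh = rng.normal(0, 0.1, (H, H))    # hidden-to-hidden weights

def encode(token_ids):
    """Vanilla tanh RNN over the source tokens; the last hidden state is kept."""
    h = np.zeros(H)
    for t in token_ids:
        h = np.tanh(W_emb[t] @ W_xh + h @ W_hh)
    return h  # summary vector h_T: a fixed-size representation of the sentence

summary = encode([1, 3, 5, 2])
print(summary.shape)  # (8,)
```

Whatever the input length, the summary has the same fixed size, which is exactly what lets the decoder be conditioned on it.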
Slide 10
Encoder-Decoder
[Figure] Front view and side view of the encoder-decoder; between encoder and decoder sits the representation of the sentence.
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014).
Slide 12
Encoder in three steps
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
[Figure] The encoder in three steps: (1) one-hot encoding, (2) continuous-space word representation (word embedding), (3) sequence summarization, yielding the representation of the sentence.
Slide 13
Encoder: (1) One hot encoding
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
[Figure] The same three-step encoder diagram, here at step (1): one-hot encoding.
Slide 14
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
Example: words.
cat: xᵀ = [1, 0, 0, ..., 0]
dog: xᵀ = [0, 1, 0, ..., 0]
house: xᵀ = [0, 0, 0, ..., 0, 1, 0, ..., 0]
Number of words, |V|?
B2: 5K
C2: 18K
LVSR: 50-100K
Wikipedia (1.6B): 400K
Crawl data (42B): 2M
Encoder: (1) One hot encoding
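One-hot encoding is just a vector of zeros with a single 1 at the word's vocabulary index, as a minimal sketch shows (the tiny vocabulary is an illustrative assumption):

```python
import numpy as np

def one_hot(index, vocab_size):
    """Return the one-hot vector x with a single 1 at the word's vocabulary index."""
    x = np.zeros(vocab_size)
    x[index] = 1.0
    return x

vocab = {"cat": 0, "dog": 1, "house": 2}
print(one_hot(vocab["dog"], len(vocab)))  # [0. 1. 0.]
```

At realistic vocabulary sizes (hundreds of thousands of words), these vectors are never materialized; the index alone is enough, as the next step shows.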
Slide 15
Encoder: (2) Word embedding
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
[Figure] The three-step encoder diagram, here at step (2): continuous-space word representation (word embedding).
Slide 16
Figure: Christopher Olah, Visualizing Representations
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of
words and phrases and their compositionality." NIPS 2013
Video: Antonio Bonafonte @ DLSL 2017
Encoder: (2) Word embedding
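Multiplying a one-hot vector by an embedding matrix W simply selects one row of W, which is why embedding lookups are implemented as plain indexing. A toy sketch (sizes and random values are illustrative assumptions, not learned embeddings):

```python
import numpy as np

rng = np.random.default_rng(42)
vocab_size, embed_dim = 5, 3
W = rng.normal(size=(vocab_size, embed_dim))  # learned in a real model

x = np.zeros(vocab_size)
x[2] = 1.0  # one-hot vector for word index 2

# The product x @ W picks out row 2 of W: a dense, continuous-space
# word representation (the embedding).
assert np.allclose(x @ W, W[2])
```

This is what turns a sparse |V|-dimensional symbol into a small dense vector in which similar words can end up close together.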
Slide 20
Decoder: (1) Recurrence
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
(1) The recurrent state z_i of the decoder is determined by:
● the summary vector h_T
● the previous output word u_{i-1}
● the previous state z_{i-1}
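The state update above can be sketched as a single tanh combination of its three inputs. The weight matrices and sizes are toy assumptions; real decoders use gated units such as GRUs or LSTMs.

```python
import numpy as np

rng = np.random.default_rng(1)
E, H = 4, 8  # embedding and hidden sizes (toy assumptions)
W_u = rng.normal(0, 0.1, (E, H))  # weights for the previous output word u_{i-1}
W_z = rng.normal(0, 0.1, (H, H))  # weights for the previous state z_{i-1}
W_h = rng.normal(0, 0.1, (H, H))  # weights for the summary vector h_T

def decoder_state(u_prev, z_prev, h_T):
    """z_i combines the previous word, the previous state, and the summary vector."""
    return np.tanh(u_prev @ W_u + z_prev @ W_z + h_T @ W_h)

z_i = decoder_state(np.zeros(E), np.zeros(H), rng.normal(size=H))
print(z_i.shape)  # (8,)
```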
Slide 21
Decoder: (2) Word Probabilities
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
(2) With z_i ready, the output of the RNN estimates a probability p_i for each word in the vocabulary:
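In practice this is a linear projection of z_i to one score per vocabulary word, followed by a softmax. A minimal sketch (toy sizes, random weights):

```python
import numpy as np

rng = np.random.default_rng(2)
H, V = 8, 10  # hidden size and vocabulary size (toy assumptions)
W_out = rng.normal(0, 0.1, (H, V))

def word_probabilities(z_i):
    """Softmax over a linear projection of the decoder state: one probability per word."""
    scores = z_i @ W_out
    exp = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return exp / exp.sum()

p = word_probabilities(rng.normal(size=H))
print(p.sum())  # probabilities over the vocabulary sum to 1
```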
Slide 22
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
(3) The word with the highest probability is predicted as the word sample u_i.
Decoder: (3) Word Sample
Slide 23
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
(3) An output sequence of words can be generated until an <EOS> (End Of Sentence) “word” is predicted.
Decoder: (3) Word Sample
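Putting the three decoder steps together gives a greedy decoding loop: update the state, pick the most probable word, stop at <EOS>. Everything below is a toy sketch with assumed sizes and random weights; for brevity the summary vector only seeds the initial state rather than feeding every step.

```python
import numpy as np

rng = np.random.default_rng(3)
V, E, H = 6, 4, 8  # toy sizes; the last vocabulary index plays the role of <EOS>
EOS = V - 1
W_emb = rng.normal(0, 0.5, (V, E))
W_u = rng.normal(0, 0.5, (E, H))
W_z = rng.normal(0, 0.5, (H, H))
W_out = rng.normal(0, 0.5, (H, V))

def greedy_decode(h_T, max_len=20):
    """Pick the most probable word at every step until <EOS> (or a length cap)."""
    z, u, words = np.tanh(h_T), EOS, []   # the summary vector seeds the state
    for _ in range(max_len):
        z = np.tanh(W_emb[u] @ W_u + z @ W_z)
        u = int(np.argmax(z @ W_out))     # argmax of softmax = argmax of scores
        if u == EOS:
            break
        words.append(u)
    return words

print(greedy_decode(rng.normal(size=H)))
```

Real systems usually replace this greedy argmax with beam search, which keeps several candidate sequences alive at once.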
Slide 26
Outline
1. Neural Machine Translation (no vision here!)
2. Image and Video Captioning
3. Visual Question Answering / Reasoning
4. Joint Embeddings
Slide 27
(Slides by Marc Bolaños): Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating
image descriptions." CVPR 2015
Image Captioning
Slide 29
Captioning: DeepImageSent
(Slides by Marc Bolaños): Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015
Multimodal Recurrent Neural Network: only takes into account image features in the first hidden state.
Slide 30
Captioning: Show & Tell
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption
generator." CVPR 2015.
Slide 31
Captioning: Show & Tell
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption
generator." CVPR 2015.
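Show & Tell keeps essentially the NMT decoder but replaces the sentence summary with a CNN image feature. A minimal sketch of that wiring (plain tanh RNN instead of an LSTM, toy sizes, random weights; the projection `W_img` standing in for the CNN-to-RNN bridge is a hypothetical name):

```python
import numpy as np

rng = np.random.default_rng(4)
F, V, E, H = 16, 6, 4, 8  # image-feature, vocab, embedding, hidden sizes (toy)
EOS = V - 1
W_img = rng.normal(0, 0.3, (F, H))  # projects the CNN feature into the RNN state
W_emb = rng.normal(0, 0.3, (V, E))
W_x = rng.normal(0, 0.3, (E, H))
W_h = rng.normal(0, 0.3, (H, H))
W_out = rng.normal(0, 0.3, (H, V))

def caption(image_feature, max_len=10):
    """The image feature initializes the state; words are then generated greedily."""
    h, u, words = np.tanh(image_feature @ W_img), EOS, []
    for _ in range(max_len):
        h = np.tanh(W_emb[u] @ W_x + h @ W_h)
        u = int(np.argmax(h @ W_out))
        if u == EOS:
            break
        words.append(u)
    return words

print(caption(rng.normal(size=F)))
```

The only change with respect to machine translation is where the conditioning vector comes from: a CNN instead of an RNN encoder.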
Slide 32
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for
dense captioning." CVPR 2016
Captioning (+ Detection): DenseCap
Slide 33
Captioning (+ Detection): DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for
dense captioning." CVPR 2016
Slide 34
Captioning (+ Detection): DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for
dense captioning." CVPR 2016
Example captions:
XAVI: “man has short hair”, “man with short hair”
AMAIA: “a woman wearing a black shirt”
BOTH: “two men wearing black glasses”
Slide 35
Captioning: Video
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015. code
Slide 36
(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016.
[Figure] A second-layer LSTM unit runs over image inputs from t = 1 to t = T; its hidden state at t = T summarizes the first chunk of data.
Captioning: Video
Slide 37
Outline
1. Neural Machine Translation (no vision here!)
2. Image and Video Captioning
3. Visual Question Answering / Reasoning
4. Joint Embeddings
Slide 41
Visual Question Answering (VQA)
Masuda, Issey, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Open-Ended Visual Question-Answering." ETSETB UPC TelecomBCN (2016).
[Figure] Image + Question → Answer
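A common VQA baseline, in the spirit of the Image + Question → Answer pipeline above, fuses an image feature with an encoded question and classifies over a fixed answer set. A toy sketch with concatenation fusion and random weights; the systems cited on these slides use more elaborate architectures:

```python
import numpy as np

rng = np.random.default_rng(5)
F, Q, A = 16, 8, 5  # image-feature, question-encoding, answer-set sizes (toy)
W = rng.normal(0, 0.3, (F + Q, A))

def answer_probs(image_feature, question_encoding):
    """Fuse the two modalities by concatenation, then softmax over a fixed answer set."""
    fused = np.concatenate([image_feature, question_encoding])
    scores = fused @ W
    exp = np.exp(scores - scores.max())  # stable softmax
    return exp / exp.sum()

p = answer_probs(rng.normal(size=F), rng.normal(size=Q))
print(int(np.argmax(p)))  # index of the most probable answer
```

Treating open-ended answering as classification over frequent answers is a simplification; truly open-ended systems generate the answer word by word instead.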
Slide 42
Noh, H., Seo, P. H., & Han, B. Image question answering using convolutional neural network with
dynamic parameter prediction. CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
Visual Question Answering (VQA)
Slide 43
Visual Question Answering: Dynamic
(Slides and Slidecast by Santi Pascual): Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic
Memory Networks for Visual and Textual Question Answering." ICML 2016
Slide 44
Visual Question Answering: Grounded
(Slides and Screencast by Issey Masuda): Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei."Visual7W: Grounded
Question Answering in Images." CVPR 2016.
Slide 45
Visual Dialog (Image Guessing Game)
Das, Abhishek, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra.
"Visual Dialog." CVPR 2017
Slide 46
Visual Dialog (Image Guessing Game)
Das, Abhishek, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra.
"Visual Dialog." CVPR 2017
Slide 47
Visual Reasoning
Johnson, Justin, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross
Girshick. "CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning."
CVPR 2017
Slide 48
Visual Reasoning
(Slides by Fran Roldan) Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Fei-Fei Li, Larry
Zitnick, Ross Girshick , “Inferring and Executing Programs for Visual Reasoning”. arXiv 2017.
[Figure] Program Generator → Execution Engine
Slide 49
Outline
1. Neural Machine Translation (no vision here!)
2. Image and Video Captioning
3. Visual Question Answering / Reasoning
4. Joint Embeddings
Slide 51
Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "DeViSE: A Deep Visual-Semantic Embedding Model." NIPS 2013
Joint Neural Embeddings
Slide 52
Socher, R., Ganjoo, M., Manning, C. D., & Ng, A., Zero-shot learning through cross-modal transfer.
NIPS 2013 [slides] [code]
Zero-shot learning: a class not present in the training set of images can still be predicted (e.g. no images of "cat" in the training set).
Joint Neural Embeddings
Slide 53
Alejandro Woodward, Víctor Campos, Dèlia Fernàndez, Jordi Torres, Xavier Giró-i-Nieto, Brendan Jou and Shih-Fu Chang (work in progress)
Joint Neural Embeddings
Foggy Day
Slide 54
Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber,
Antonio Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food
Images”. CVPR 2017
Image and text retrieval with joint embeddings.
Joint Neural Embeddings
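Once images and text live in the same embedding space, cross-modal retrieval reduces to nearest-neighbour search, typically by cosine similarity. A sketch with synthetic vectors standing in for the outputs of trained image and recipe encoders (the 0.01 noise level is an assumption that keeps matching pairs close):

```python
import numpy as np

rng = np.random.default_rng(6)
D = 8  # shared embedding dimension (toy)

# Synthetic stand-ins for trained encoders: each text embedding is placed
# very close to its matching image embedding in the joint space.
image_embs = rng.normal(size=(4, D))
text_embs = image_embs + 0.01 * rng.normal(size=(4, D))

def retrieve(query, candidates):
    """Index of the candidate with the highest cosine similarity to the query."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return int(np.argmax(c @ q))

# Text-to-image retrieval: the embedding of recipe text 2 finds image 2.
print(retrieve(text_embs[2], image_embs))  # 2
```

Running the same function with an image embedding as the query gives image-to-text retrieval: the search is symmetric because both modalities share one space.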
Slide 55
Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber,
Antonio Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food
Images”. CVPR 2017
Joint Neural Embeddings
Slide 56
Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber,
Antonio Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food
Images”. CVPR 2017
Joint Neural Embeddings
[Figure] An LSTM and a bidirectional LSTM encode the recipe text into the joint embedding.
Slide 57
Outline
1. Neural Machine Translation (no vision here!)
2. Image and Video Captioning
3. Visual Question Answering / Reasoning
4. Joint Embeddings
Slide 58
Thanks! Q&A?
Follow me at
https://imatge.upc.edu/web/people/xavier-giro
@DocXavi
/ProfessorXavi