SlideShare une entreprise Scribd logo
1  sur  71
Télécharger pour lire hors ligne
@DocXavi
Xavier Giró-i-Nieto
[http://pagines.uab.cat/mcv/]
Module 6
Deep Learning for Video:
Action Recognition
22nd March 2018
Acknowledgements
2
Víctor Campos Alberto MontesAmaia Salvador Santiago Pascual
3
Densely linked slides
Outline
4
5
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. . Large-scale video
classification with convolutional neural networks. CVPR 2014
Motivation
6Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. Large-scale video classification with convolutional neural networks.
CVPR 2014.
What is a video?
7
●
○
○
●
●
How do we work with images?
8
●
How do we work with videos ?
9
CNNs for sequences of images
10
CNN Input RGB Optical Flow Fusion
Single frame 2D CNN - Pooling
Single frame models
11
CNN CNN CNN...
Combination method
Yue-Hei Ng, Joe, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and
George Toderici. "Beyond short snippets: Deep networks for video classification." CVPR 2015
CNNs for sequences of images
12
CNN Input RGB Optical Flow Fusion
Single frame 2D CNN - Pooling + NN
Multiple frames 2D CNN - Pooling + NN
13
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. . Large-scale video
classification with convolutional neural networks. CVPR 2014
Multiple Frames
14
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. . Large-scale video
classification with convolutional neural networks. CVPR 2014
Multiple Frames
15
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. . Large-scale video
classification with convolutional neural networks. CVPR 2014
Multiple Frames
16
CNN Input RGB Optical Flow Fusion
Single frame 2D CNN - Pooling + NN
Multiple frames 2D CNN - Pooling + NN
Limitation of Feed Forward NN (as CNNs)
17Slide credit: Santi Pascual
If we have a sequence of samples...
predict sample x[t+1] knowing previous values {x[t], x[t-1], x[t-2], …, x[t-τ]}
Limitation of Feed Forward NN (as CNNs)
18Slide credit: Santi Pascual
Feed Forward approach:
● static window of size L
● slide the window time-step wise
...
...
...
x[t+1]
x[t-L], …, x[t-1], x[t]
x[t+1]
L
Limitation of Feed Forward NN (as CNNs)
19Slide credit: Santi Pascual
Feed Forward approach:
● static window of size L
● slide the window time-step wise
...
...
...
x[t+2]
x[t-L+1], …, x[t], x[t+1]
...
...
...
x[t+1]
x[t-L], …, x[t-1], x[t]
x[t+2]
L
Limitation of Feed Forward NN (as CNNs)
20Slide credit: Santi Pascual 20
Feed Forward approach:
● static window of size L
● slide the window time-step wise
x[t+3]
L
...
...
...
x[t+3]
x[t-L+2], …, x[t+1], x[t+2]
...
...
...
x[t+2]
x[t-L+1], …, x[t], x[t+1]
...
...
...
x[t+1]
x[t-L], …, x[t-1], x[t]
Limitation of Feed Forward NN (as CNNs)
...
...
...
x1, x2, …, xL
Problems for the feed forward + static window approach:
● What’s the matter increasing L? → Fast growth of num of parameters!
● Decisions are independent between time-steps!
○ The network doesn’t care about what happened at previous time-step, only present window
matters → doesn’t look good
● Cumbersome padding when there are not enough samples to fill L size
○ Can’t work with variable sequence lengths
x1, x2, …, xL, …, x2L
...
...
x1, x2, …, xL, …, x2L, …, x3L
...
...
... ...
Slide credit: Santi Pascual
Limitation of Feed Forward NN (as CNNs)
22
The hidden layers and the
output depend from previous
states of the hidden layers
Recurrent Neural Network (RNN)
CNNs for sequences of images
23
CNN Input RGB Optical Flow Fusion
Single frame 2D CNN - Pooling + NN
Multiple frames 2D CNN - Pooling + NN
Sequence of images 2D CNN - RNN
2D CNN + RNN
24
CNN CNN CNN...
RNN RNN RNN...
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko,
Trevor Darrel. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015. code
Videolectures on RNNs:
DLSL 2017, “RNN (I)”
“RNN (II)”
DLAI 2018, “RNN”
25
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko,
Trevor Darrel. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015. code
2D CNN + RNN
26
Victor Campos, Brendan Jou, Xavier Giro-i-Nieto, Jordi Torres, and Shih-Fu Chang. “Skip RNN: Learning to
Skip State Updates in Recurrent Neural Networks”, ICLR 2018.
Used Unused
2D CNN + RNN
CNNs for sequences of images
27
CNN Input RGB Optical Flow Fusion
Single frame 2D CNN - Pooling + NN
Multiple frames 2D CNN - Pooling + NN
Sequence of images 2D CNN - RNN
Sequence of clips 3D CNN - Pooling
3D CNN (C3D)
28
●
●
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning
spatiotemporal features with 3D convolutional networks" ICCV 2015
29
Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning
spatiotemporal features with 3D convolutional networks" ICCV 2015
16-frame clip
16-frame clip
16-frame clip
...
Average
4096-dimvideodescriptor
4096-dimvideodescriptor
L2 norm
3D CNN (C3D)
30
●
●
3D CNN
CNNs for sequences of images
31
CNN Input RGB Optical Flow Fusion
Single frame 2D CNN - Pooling + NN
Multiple frames 2D CNN - Pooling + NN
Sequence of images 2D CNN - RNN
Sequence of clips 3D CNN - Pooling
Sequence of clips 3D CNN - RNN
32
3D CNN + RNN
A. Montes, Salvador, A., Pascual-deLaPuente, S., and Giró-i-Nieto, X., “Temporal Activity Detection in
Untrimmed Videos with Recurrent Neural Networks”, NIPS Workshop 2016 (best poster award)
CNNs for sequences of images
33
CNN Input RGB Optical Flow Fusion
Single frame 2D CNN - Pooling + NN
Multiple frames 2D CNN - Pooling + NN
Sequence of images 2D CNN - RNN
Sequence of clips 3D CNN - Pooling
Sequence of clips 3D CNN - RNN
Two-stream 2D CNN 2D CNN Pooling
Two-streams 2D CNNs
34
Simonyan, Karen, and Andrew Zisserman. "Two-stream convolutional networks for action recognition in
videos." NIPS 2014.
Fusion
Two-streams 2D CNNs
35
Feichtenhofer, Christoph, Axel Pinz, and Andrew Zisserman. "Convolutional two-stream network fusion for video action recognition." CVPR 2016. [code]
Fusion
Two-streams 2D CNNs
36
Feichtenhofer, Christoph, Axel Pinz, and Richard Wildes. "Spatiotemporal residual networks for video action recognition." NIPS 2016. [code]
37Wang, Limin, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. "Temporal segment networks:
Towards good practices for deep action recognition." ECCV 2016.
Two-streams 2D CNNs
Two-streams 2D CNNs
38Girdhar, Rohit, Deva Ramanan, Abhinav Gupta, Josef Sivic, and Bryan Russell. "ActionVLAD: Learning spatio-temporal aggregation for action
classification." CVPR 2017.
CNNs for sequences of images
39
CNN Input RGB Optical Flow Fusion
Single frame 2D CNN - Pooling + NN
Multiple frames 2D CNN - Pooling + NN
Sequence of images 2D CNN - RNN
Sequence of clips 3D CNN - Pooling
Sequence of clips 3D CNN - RNN
Two-stream 2D CNN 2D CNN Pooling
Two-stream 2D CNN 2D CNN RNN
Two-streams 2D CNNs + RNN
40
Yue-Hei Ng, Joe, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and
George Toderici. "Beyond short snippets: Deep networks for video classification." CVPR 2015
CNNs for sequences of images
41
CNN Input RGB Optical Flow Fusion
Single frame 2D CNN - Pooling + NN
Multiple frames 2D CNN - Pooling + NN
Sequence of images 2D CNN - RNN
Sequence of clips 3D CNN - Pooling
Sequence of clips 3D CNN - RNN
Two-stream 2D CNN 2D CNN Pooling
Two-stream 2D CNN 2D CNN RNN
Two-stream Inflated 3D CNN Inflated 3D CNN Pooling
Two-streams 3D CNNs
42
Carreira, J., & Zisserman, A. . Quo vadis, action recognition? A new model and the kinetics dataset. CVPR
2017. [code]
Two-streams Inflated 3D CNNs (I3D)
43
Carreira, J., & Zisserman, A. . Quo vadis, action recognition? A new model and the kinetics dataset. CVPR
2017. [code]
NxN NxNxN
Two-streams 3D CNNs
44
Carreira, J., & Zisserman, A. . Quo vadis, action recognition? A new model and the kinetics dataset. CVPR
2017. [code]
CNNs for sequences of images
45
CNN Input RGB Optical Flow Fusion
Single frame 2D CNN - Pooling + NN
Multiple frames 2D CNN - Pooling + NN
Sequence of images 2D CNN - RNN
Sequence of clips 3D CNN - Pooling
Sequence of clips 3D CNN - RNN
Two-stream 2D CNN 2D CNN Pooling
Two-stream 2D CNN 2D CNN RNN
Two-stream Inflated 3D CNN Inflated 3D CNN Pooling
46
Action recognition
BSc
thesis
47
Action Recognition with object detection
Gkioxari, Georgia, Ross Girshick, and Jitendra Malik. "Contextual action recognition with r* cnn." In ICCV
2015. [code]
48
Sharma, Shikhar, Ryan Kiros, and Ruslan Salakhutdinov. "Action recognition using visual attention."
ICLRW 2016.
Action Recognition with attention
49
Sharma, Shikhar, Ryan Kiros, and Ruslan Salakhutdinov. "Action recognition using visual attention."
ICLRW 2016.
Action Recognition with soft attention
50Girdhar, Rohit, and Deva Ramanan. "Attentional pooling for action recognition." NIPS 2017
Action recognition with soft attention
51
Zhu, Wangjiang, Jie Hu, Gang Sun, Xudong Cao, and Yu Qiao. "A key volume mining deep framework for action recognition."
CVPR 2016.
Action Recognition with hard attention
Outline
52
53
Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint
arXiv:1212.0402.
Datasets: UCF-101
54
Datasets: HDMB51 (Brown University)
Kuehne, Hildegard, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. "HMDB: a large video database for human motion
recognition." ICCV 2011.
55
Datasets: Sports 1M (Stanford)
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. Large-scale video classification with convolutional neural networks.
CVPR 2014.
56
Schuldt, Christian, Ivan Laptev, and Barbara Caputo. "Recognizing human actions: a local SVM approach." In Pattern Recognition, 2004. ICPR
2004.
Datasets: KTH
57
Heilbron, F.C., Escorcia, V., Ghanem, B. and Niebles, J.C.,. “Activitynet: A large-scale video benchmark for human activity understanding”.
CVPR 2015.
Datasets: ActivityNet
58
Sigurdsson, G. A., Varol, G., Wang, X., Farhadi, A., Laptev, I., & Gupta, A. (2016, October). Hollywood in homes: Crowdsourcing data collection
for activity understanding. ECCV 2016. [Dataset] [Code]
Datasets: Charades (Allen AI)
59
Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., ... & Suleyman, M. (2017). The kinetics human action video
dataset. arXiv preprint arXiv:1705.06950.
Datasets: Kinectics (DeepMind)
60
(Slides by Dídac Surís) Abu-El-Haija, Sami, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra
Vijayanarasimhan. "Youtube-8m: A large-scale video classification benchmark." arXiv preprint arXiv:1609.08675 (2016). [project]
Datasets: YouTube-8M (Google)
61
(Slides by Dídac Surís) Abu-El-Haija, Sami, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra
Vijayanarasimhan. "Youtube-8m: A large-scale video classification benchmark." arXiv preprint arXiv:1609.08675 (2016). [project]
Activity Recognition: Datasets
62
Hang Zhao, Zhicheng Yan, Heng Wang, Lorenzo Torresani, Antonio Torralba, “SLAC: A Sparsely Labeled Dataset for Action Classification and
Localization” arXiv 2017 [project page]
Datasets: SLAC (MIT & Facebook)
63
Monfort, Mathew, Bolei Zhou, Sarah Adel Bargal, Alex Andonian, Tom Yan, Kandan Ramakrishnan, Lisa Brown et al. "Moments in Time
Dataset: one million videos for event understanding." arXiv preprint arXiv:1801.03150 (2018).
Datasets: Moments in Time (MIT & IBM)
64Weinzaepfel, Philippe, Xavier Martin, and Cordelia Schmid. "Human Action Localization with Sparse Spatial Supervision." (2017).
Datasets: DALY (INRIA)
DALY contains the following spatial annotations:
● bounding box around the action
● upper body pose annotation, including a
bounding box around the head
● bounding box around object(s) involved in the
action
65
Gu, C., Sun, C., Vijayanarasimhan, S., Pantofaru, C., Ross, D. A., Toderici, G., ... & Malik, J. (2017). AVA: A video dataset of spatio-temporally
localized atomic visual actions. arXiv preprint arXiv:1705.08421.
Datasets: AVA (Berkeley & Google)
Outline
66
Large-scale datasets
67
●
○
●
○
○
●
○
Tips & Tricks by Víctor Campos (2017)
Memory issues
68
●
○
●
○
○
Tips & Tricks by Víctor Campos (2017)
I/O bottleneck
69
●
●
○
○
○
Tips & Tricks by Víctor Campos (2017)
70
Questions ?
● MSc course (2017)
● BSc course (2018)
71
Deep Learning online courses by UPC:
● 1st edition (2016)
● 2nd edition (2017)
● 3rd edition (2018)
● 1st edition (2017)
● 2nd edition (2018)
Next edition Autumn 2018 Next edition Winter/Spring 2019Summer School (late June 2018)

Contenu connexe

Tendances

Convolutional Neural Network - CNN | How CNN Works | Deep Learning Course | S...
Convolutional Neural Network - CNN | How CNN Works | Deep Learning Course | S...Convolutional Neural Network - CNN | How CNN Works | Deep Learning Course | S...
Convolutional Neural Network - CNN | How CNN Works | Deep Learning Course | S...
Simplilearn
 
CVML2011: human action recognition (Ivan Laptev)
CVML2011: human action recognition (Ivan Laptev)CVML2011: human action recognition (Ivan Laptev)
CVML2011: human action recognition (Ivan Laptev)
zukun
 

Tendances (20)

Faster R-CNN: Towards real-time object detection with region proposal network...
Faster R-CNN: Towards real-time object detection with region proposal network...Faster R-CNN: Towards real-time object detection with region proposal network...
Faster R-CNN: Towards real-time object detection with region proposal network...
 
Convolutional Neural Network (CNN) - image recognition
Convolutional Neural Network (CNN)  - image recognitionConvolutional Neural Network (CNN)  - image recognition
Convolutional Neural Network (CNN) - image recognition
 
Introduction to deep learning
Introduction to deep learningIntroduction to deep learning
Introduction to deep learning
 
Generative adversarial networks
Generative adversarial networksGenerative adversarial networks
Generative adversarial networks
 
Deep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural NetworksDeep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural Networks
 
Convolution Neural Network (CNN)
Convolution Neural Network (CNN)Convolution Neural Network (CNN)
Convolution Neural Network (CNN)
 
Mask-RCNN for Instance Segmentation
Mask-RCNN for Instance SegmentationMask-RCNN for Instance Segmentation
Mask-RCNN for Instance Segmentation
 
Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)
 
Lec14 multiview stereo
Lec14 multiview stereoLec14 multiview stereo
Lec14 multiview stereo
 
ViT (Vision Transformer) Review [CDM]
ViT (Vision Transformer) Review [CDM]ViT (Vision Transformer) Review [CDM]
ViT (Vision Transformer) Review [CDM]
 
Image classification using CNN
Image classification using CNNImage classification using CNN
Image classification using CNN
 
Multiple object detection
Multiple object detectionMultiple object detection
Multiple object detection
 
Convolutional Neural Network - CNN | How CNN Works | Deep Learning Course | S...
Convolutional Neural Network - CNN | How CNN Works | Deep Learning Course | S...Convolutional Neural Network - CNN | How CNN Works | Deep Learning Course | S...
Convolutional Neural Network - CNN | How CNN Works | Deep Learning Course | S...
 
Image Segmentation
 Image Segmentation Image Segmentation
Image Segmentation
 
Understanding Convolutional Neural Networks
Understanding Convolutional Neural NetworksUnderstanding Convolutional Neural Networks
Understanding Convolutional Neural Networks
 
Video Classification: Human Action Recognition on HMDB-51 dataset
Video Classification: Human Action Recognition on HMDB-51 datasetVideo Classification: Human Action Recognition on HMDB-51 dataset
Video Classification: Human Action Recognition on HMDB-51 dataset
 
CVML2011: human action recognition (Ivan Laptev)
CVML2011: human action recognition (Ivan Laptev)CVML2011: human action recognition (Ivan Laptev)
CVML2011: human action recognition (Ivan Laptev)
 
Adversarial Attacks and Defense
Adversarial Attacks and DefenseAdversarial Attacks and Defense
Adversarial Attacks and Defense
 
Instance Segmentation - Míriam Bellver - UPC Barcelona 2018
Instance Segmentation - Míriam Bellver - UPC Barcelona 2018Instance Segmentation - Míriam Bellver - UPC Barcelona 2018
Instance Segmentation - Míriam Bellver - UPC Barcelona 2018
 
Image Segmentation (D3L1 2017 UPC Deep Learning for Computer Vision)
Image Segmentation (D3L1 2017 UPC Deep Learning for Computer Vision)Image Segmentation (D3L1 2017 UPC Deep Learning for Computer Vision)
Image Segmentation (D3L1 2017 UPC Deep Learning for Computer Vision)
 

Similaire à Deep Learning for Video: Action Recognition (UPC 2018)

The impact of visual saliency prediction in image classification
The impact of visual saliency prediction in image classificationThe impact of visual saliency prediction in image classification
The impact of visual saliency prediction in image classification
Universitat Politècnica de Catalunya
 

Similaire à Deep Learning for Video: Action Recognition (UPC 2018) (20)

Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
Deep Learning Architectures for Video - Xavier Giro - UPC Barcelona 2019
 
Neural Architectures for Video Encoding
Neural Architectures for Video EncodingNeural Architectures for Video Encoding
Neural Architectures for Video Encoding
 
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
 
Deep Video Object Segmentation - Xavier Giro - UPC Barcelona 2019
Deep Video Object Segmentation - Xavier Giro - UPC Barcelona 2019Deep Video Object Segmentation - Xavier Giro - UPC Barcelona 2019
Deep Video Object Segmentation - Xavier Giro - UPC Barcelona 2019
 
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
Video Analysis with Convolutional Neural Networks (Master Computer Vision Bar...
 
Deep Learning for Computer Vision: Video Analytics (UPC 2016)
Deep Learning for Computer Vision: Video Analytics (UPC 2016)Deep Learning for Computer Vision: Video Analytics (UPC 2016)
Deep Learning for Computer Vision: Video Analytics (UPC 2016)
 
Deep Learning from Videos (UPC 2018)
Deep Learning from Videos (UPC 2018)Deep Learning from Videos (UPC 2018)
Deep Learning from Videos (UPC 2018)
 
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
Language and Vision (D2L11 Insight@DCU Machine Learning Workshop 2017)
 
The impact of visual saliency prediction in image classification
The impact of visual saliency prediction in image classificationThe impact of visual saliency prediction in image classification
The impact of visual saliency prediction in image classification
 
[OSGeo-KR Tech Workshop] Deep Learning for Single Image Super-Resolution
[OSGeo-KR Tech Workshop] Deep Learning for Single Image Super-Resolution[OSGeo-KR Tech Workshop] Deep Learning for Single Image Super-Resolution
[OSGeo-KR Tech Workshop] Deep Learning for Single Image Super-Resolution
 
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
 
One Perceptron to Rule Them All: Language and Vision
One Perceptron to Rule Them All: Language and VisionOne Perceptron to Rule Them All: Language and Vision
One Perceptron to Rule Them All: Language and Vision
 
Action_recognition-topic.pptx
Action_recognition-topic.pptxAction_recognition-topic.pptx
Action_recognition-topic.pptx
 
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019
Neural Architectures for Still Images - Xavier Giro- UPC Barcelona 2019
 
240315_Thanh_LabSeminar[G-TAD: Sub-Graph Localization for Temporal Action Det...
240315_Thanh_LabSeminar[G-TAD: Sub-Graph Localization for Temporal Action Det...240315_Thanh_LabSeminar[G-TAD: Sub-Graph Localization for Temporal Action Det...
240315_Thanh_LabSeminar[G-TAD: Sub-Graph Localization for Temporal Action Det...
 
V2 v posenet
V2 v posenetV2 v posenet
V2 v posenet
 
On Air
On AirOn Air
On Air
 
Deep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN Barcelona
Deep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN BarcelonaDeep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN Barcelona
Deep Video Object Tracking 2020 - Xavier Giro - UPC TelecomBCN Barcelona
 
Interpretability of Convolutional Neural Networks - Eva Mohedano - UPC Barcel...
Interpretability of Convolutional Neural Networks - Eva Mohedano - UPC Barcel...Interpretability of Convolutional Neural Networks - Eva Mohedano - UPC Barcel...
Interpretability of Convolutional Neural Networks - Eva Mohedano - UPC Barcel...
 
Convolutional neural network
Convolutional neural network Convolutional neural network
Convolutional neural network
 

Plus de Universitat Politècnica de Catalunya

Generation of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in VideosGeneration of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
Universitat Politècnica de Catalunya
 

Plus de Universitat Politècnica de Catalunya (20)

Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Deep Generative Learning for All
Deep Generative Learning for AllDeep Generative Learning for All
Deep Generative Learning for All
 
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
 
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-NietoTowards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
 
The Transformer - Xavier Giró - UPC Barcelona 2021
The Transformer - Xavier Giró - UPC Barcelona 2021The Transformer - Xavier Giró - UPC Barcelona 2021
The Transformer - Xavier Giró - UPC Barcelona 2021
 
Open challenges in sign language translation and production
Open challenges in sign language translation and productionOpen challenges in sign language translation and production
Open challenges in sign language translation and production
 
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in VideosGeneration of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
 
Discovery and Learning of Navigation Goals from Pixels in Minecraft
Discovery and Learning of Navigation Goals from Pixels in MinecraftDiscovery and Learning of Navigation Goals from Pixels in Minecraft
Discovery and Learning of Navigation Goals from Pixels in Minecraft
 
Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...
 
Intepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural NetworksIntepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural Networks
 
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
 
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
 
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
 
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
 
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
 
Curriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object SegmentationCurriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object Segmentation
 
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
 
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
 

Dernier

Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
vexqp
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
vexqp
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Bertram Ludäscher
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 

Dernier (20)

Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
 
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
SR-101-01012024-EN.docx  Federal Constitution  of the Swiss ConfederationSR-101-01012024-EN.docx  Federal Constitution  of the Swiss Confederation
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 

Deep Learning for Video: Action Recognition (UPC 2018)

  • 1. @DocXavi Xavier Giró-i-Nieto [http://pagines.uab.cat/mcv/] Module 6 Deep Learning for Video: Action Recognition 22nd March 2018
  • 2. Acknowledgements 2 Víctor Campos Alberto MontesAmaia Salvador Santiago Pascual
  • 5. 5 Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. . Large-scale video classification with convolutional neural networks. CVPR 2014
  • 6. Motivation 6Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. Large-scale video classification with convolutional neural networks. CVPR 2014.
  • 7. What is a video? 7 ● ○ ○ ●
  • 8. ● How do we work with images? 8
  • 9. ● How do we work with videos ? 9
  • 10. CNNs for sequences of images 10 CNN Input RGB Optical Flow Fusion Single frame 2D CNN - Pooling
  • 11. Single frame models 11 CNN CNN CNN... Combination method Yue-Hei Ng, Joe, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. "Beyond short snippets: Deep networks for video classification." CVPR 2015
  • 12. CNNs for sequences of images 12 CNN Input RGB Optical Flow Fusion Single frame 2D CNN - Pooling + NN Multiple frames 2D CNN - Pooling + NN
  • 13. 13 Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. . Large-scale video classification with convolutional neural networks. CVPR 2014 Multiple Frames
  • 14. 14 Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. . Large-scale video classification with convolutional neural networks. CVPR 2014 Multiple Frames
  • 15. 15 Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. . Large-scale video classification with convolutional neural networks. CVPR 2014 Multiple Frames
  • 16. 16 CNN Input RGB Optical Flow Fusion Single frame 2D CNN - Pooling + NN Multiple frames 2D CNN - Pooling + NN Limitation of Feed Forward NN (as CNNs)
  • 17. 17Slide credit: Santi Pascual If we have a sequence of samples... predict sample x[t+1] knowing previous values {x[t], x[t-1], x[t-2], …, x[t-τ]} Limitation of Feed Forward NN (as CNNs)
  • 18. 18Slide credit: Santi Pascual Feed Forward approach: ● static window of size L ● slide the window time-step wise ... ... ... x[t+1] x[t-L], …, x[t-1], x[t] x[t+1] L Limitation of Feed Forward NN (as CNNs)
  • 19. 19Slide credit: Santi Pascual Feed Forward approach: ● static window of size L ● slide the window time-step wise ... ... ... x[t+2] x[t-L+1], …, x[t], x[t+1] ... ... ... x[t+1] x[t-L], …, x[t-1], x[t] x[t+2] L Limitation of Feed Forward NN (as CNNs)
  • 20. 20Slide credit: Santi Pascual 20 Feed Forward approach: ● static window of size L ● slide the window time-step wise x[t+3] L ... ... ... x[t+3] x[t-L+2], …, x[t+1], x[t+2] ... ... ... x[t+2] x[t-L+1], …, x[t], x[t+1] ... ... ... x[t+1] x[t-L], …, x[t-1], x[t] Limitation of Feed Forward NN (as CNNs)
  • 21. ... ... ... x1, x2, …, xL Problems for the feed forward + static window approach: ● What’s the matter increasing L? → Fast growth of num of parameters! ● Decisions are independent between time-steps! ○ The network doesn’t care about what happened at previous time-step, only present window matters → doesn’t look good ● Cumbersome padding when there are not enough samples to fill L size ○ Can’t work with variable sequence lengths x1, x2, …, xL, …, x2L ... ... x1, x2, …, xL, …, x2L, …, x3L ... ... ... ... Slide credit: Santi Pascual Limitation of Feed Forward NN (as CNNs)
  • 22. 22 The hidden layers and the output depend from previous states of the hidden layers Recurrent Neural Network (RNN)
  • 23. CNNs for sequences of images 23 CNN Input RGB Optical Flow Fusion Single frame 2D CNN - Pooling + NN Multiple frames 2D CNN - Pooling + NN Sequence of images 2D CNN - RNN
  • 24. 2D CNN + RNN 24 CNN CNN CNN... RNN RNN RNN... Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrel. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015. code Videolectures on RNNs: DLSL 2017, “RNN (I)” “RNN (II)” DLAI 2018, “RNN”
  • 25. 25 Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrel. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015. code 2D CNN + RNN
  • 26. 26 Victor Campos, Brendan Jou, Xavier Giro-i-Nieto, Jordi Torres, and Shih-Fu Chang. “Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks”, ICLR 2018. Used Unused 2D CNN + RNN
  • 27. CNNs for sequences of images 27 CNN Input RGB Optical Flow Fusion Single frame 2D CNN - Pooling + NN Multiple frames 2D CNN - Pooling + NN Sequence of images 2D CNN - RNN Sequence of clips 3D CNN - Pooling
  • 28. 3D CNN (C3D) 28 ● ● Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks" ICCV 2015
  • 29. 29 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks" ICCV 2015 16-frame clip 16-frame clip 16-frame clip ... Average 4096-dimvideodescriptor 4096-dimvideodescriptor L2 norm 3D CNN (C3D)
  • 31. CNNs for sequences of images 31 CNN Input RGB Optical Flow Fusion Single frame 2D CNN - Pooling + NN Multiple frames 2D CNN - Pooling + NN Sequence of images 2D CNN - RNN Sequence of clips 3D CNN - Pooling Sequence of clips 3D CNN - RNN
  • 32. 32 3D CNN + RNN A. Montes, Salvador, A., Pascual-deLaPuente, S., and Giró-i-Nieto, X., “Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks”, NIPS Workshop 2016 (best poster award)
  • 33. CNNs for sequences of images 33 CNN Input RGB Optical Flow Fusion Single frame 2D CNN - Pooling + NN Multiple frames 2D CNN - Pooling + NN Sequence of images 2D CNN - RNN Sequence of clips 3D CNN - Pooling Sequence of clips 3D CNN - RNN Two-stream 2D CNN 2D CNN Pooling
  • 34. Two-streams 2D CNNs 34 Simonyan, Karen, and Andrew Zisserman. "Two-stream convolutional networks for action recognition in videos." NIPS 2014. Fusion
  • 35. Two-streams 2D CNNs 35 Feichtenhofer, Christoph, Axel Pinz, and Andrew Zisserman. "Convolutional two-stream network fusion for video action recognition." CVPR 2016. [code] Fusion
  • 36. Two-streams 2D CNNs 36 Feichtenhofer, Christoph, Axel Pinz, and Richard Wildes. "Spatiotemporal residual networks for video action recognition." NIPS 2016. [code]
  • 37. 37Wang, Limin, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. "Temporal segment networks: Towards good practices for deep action recognition." ECCV 2016. Two-streams 2D CNNs
  • 38. Two-streams 2D CNNs 38Girdhar, Rohit, Deva Ramanan, Abhinav Gupta, Josef Sivic, and Bryan Russell. "ActionVLAD: Learning spatio-temporal aggregation for action classification." CVPR 2017.
  • 39. CNNs for sequences of images 39 CNN Input RGB Optical Flow Fusion Single frame 2D CNN - Pooling + NN Multiple frames 2D CNN - Pooling + NN Sequence of images 2D CNN - RNN Sequence of clips 3D CNN - Pooling Sequence of clips 3D CNN - RNN Two-stream 2D CNN 2D CNN Pooling Two-stream 2D CNN 2D CNN RNN
  • 40. Two-streams 2D CNNs + RNN 40 Yue-Hei Ng, Joe, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. "Beyond short snippets: Deep networks for video classification." CVPR 2015
  • 41. CNNs for sequences of images 41 CNN Input RGB Optical Flow Fusion Single frame 2D CNN - Pooling + NN Multiple frames 2D CNN - Pooling + NN Sequence of images 2D CNN - RNN Sequence of clips 3D CNN - Pooling Sequence of clips 3D CNN - RNN Two-stream 2D CNN 2D CNN Pooling Two-stream 2D CNN 2D CNN RNN Two-stream Inflated 3D CNN Inflated 3D CNN Pooling
  • 42. Two-streams 3D CNNs 42 Carreira, J., & Zisserman, A. . Quo vadis, action recognition? A new model and the kinetics dataset. CVPR 2017. [code]
  • 43. Two-streams Inflated 3D CNNs (I3D) 43 Carreira, J., & Zisserman, A. . Quo vadis, action recognition? A new model and the kinetics dataset. CVPR 2017. [code] NxN NxNxN
  • 44. Two-streams 3D CNNs 44 Carreira, J., & Zisserman, A. . Quo vadis, action recognition? A new model and the kinetics dataset. CVPR 2017. [code]
  • 45. CNNs for sequences of images 45 CNN Input RGB Optical Flow Fusion Single frame 2D CNN - Pooling + NN Multiple frames 2D CNN - Pooling + NN Sequence of images 2D CNN - RNN Sequence of clips 3D CNN - Pooling Sequence of clips 3D CNN - RNN Two-stream 2D CNN 2D CNN Pooling Two-stream 2D CNN 2D CNN RNN Two-stream Inflated 3D CNN Inflated 3D CNN Pooling
  • 47. BSc thesis 47 Action Recognition with object detection Gkioxari, Georgia, Ross Girshick, and Jitendra Malik. "Contextual action recognition with r* cnn." In ICCV 2015. [code]
  • 48. 48 Sharma, Shikhar, Ryan Kiros, and Ruslan Salakhutdinov. "Action recognition using visual attention." ICLRW 2016. Action Recognition with attention
  • 49. 49 Sharma, Shikhar, Ryan Kiros, and Ruslan Salakhutdinov. "Action recognition using visual attention." ICLRW 2016. Action Recognition with soft attention
  • 50. 50Girdhar, Rohit, and Deva Ramanan. "Attentional pooling for action recognition." NIPS 2017 Action recognition with soft attention
  • 51. 51 Zhu, Wangjiang, Jie Hu, Gang Sun, Xudong Cao, and Yu Qiao. "A key volume mining deep framework for action recognition." CVPR 2016. Action Recognition with hard attention
  • 53. 53 Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. Datasets: UCF-101
  • 54. 54 Datasets: HDMB51 (Brown University) Kuehne, Hildegard, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. "HMDB: a large video database for human motion recognition." ICCV 2011.
  • 55. 55 Datasets: Sports 1M (Stanford) Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. Large-scale video classification with convolutional neural networks. CVPR 2014.
  • 56. 56 Schuldt, Christian, Ivan Laptev, and Barbara Caputo. "Recognizing human actions: a local SVM approach." In Pattern Recognition, 2004. ICPR 2004. Datasets: KTH
  • 57. 57 Heilbron, F.C., Escorcia, V., Ghanem, B. and Niebles, J.C.,. “Activitynet: A large-scale video benchmark for human activity understanding”. CVPR 2015. Datasets: ActivityNet
  • 58. 58 Sigurdsson, G. A., Varol, G., Wang, X., Farhadi, A., Laptev, I., & Gupta, A. (2016, October). Hollywood in homes: Crowdsourcing data collection for activity understanding. ECCV 2016. [Dataset] [Code] Datasets: Charades (Allen AI)
  • 59. 59 Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., ... & Suleyman, M. (2017). The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Datasets: Kinectics (DeepMind)
  • 60. 60 (Slides by Dídac Surís) Abu-El-Haija, Sami, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. "Youtube-8m: A large-scale video classification benchmark." arXiv preprint arXiv:1609.08675 (2016). [project] Datasets: YouTube-8M (Google)
  • 61. 61 (Slides by Dídac Surís) Abu-El-Haija, Sami, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. "Youtube-8m: A large-scale video classification benchmark." arXiv preprint arXiv:1609.08675 (2016). [project] Activity Recognition: Datasets
  • 62. 62 Hang Zhao, Zhicheng Yan, Heng Wang, Lorenzo Torresani, Antonio Torralba, “SLAC: A Sparsely Labeled Dataset for Action Classification and Localization” arXiv 2017 [project page] Datasets: SLAC (MIT & Facebook)
  • 63. 63 Monfort, Mathew, Bolei Zhou, Sarah Adel Bargal, Alex Andonian, Tom Yan, Kandan Ramakrishnan, Lisa Brown et al. "Moments in Time Dataset: one million videos for event understanding." arXiv preprint arXiv:1801.03150 (2018). Datasets: Moments in Time (MIT & IBM)
  • 64. 64Weinzaepfel, Philippe, Xavier Martin, and Cordelia Schmid. "Human Action Localization with Sparse Spatial Supervision." (2017). Datasets: DALY (INRIA) DALY contains the following spatial annotations: ● bounding box around the action ● upper body pose annotation, including a bounding box around the head ● bounding box around object(s) involved in the action
  • 65. 65 Gu, C., Sun, C., Vijayanarasimhan, S., Pantofaru, C., Ross, D. A., Toderici, G., ... & Malik, J. (2017). AVA: A video dataset of spatio-temporally localized atomic visual actions. arXiv preprint arXiv:1705.08421. Datasets: AVA (Berkeley & Google)
  • 68. Memory issues 68 ● ○ ● ○ ○ Tips & Tricks by Víctor Campos (2017)
  • 69. I/O bottleneck 69 ● ● ○ ○ ○ Tips & Tricks by Víctor Campos (2017)
  • 71. ● MSc course (2017) ● BSc course (2018) 71 Deep Learning online courses by UPC: ● 1st edition (2016) ● 2nd edition (2017) ● 3rd edition (2018) ● 1st edition (2017) ● 2nd edition (2018) Next edition Autumn 2018 Next edition Winter/Spring 2019Summer School (late June 2018)