https://telecombcn-dl.github.io/2018-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
2. Object Detection
CAT, DOG, DUCK
The task of assigning a
label and a bounding box
to all objects in the image
2
3. Object Detection as Classification
Classes = [cat, dog, duck]
Cat ? NO
Dog ? NO
Duck? NO
3
4. Object Detection as Classification
Classes = [cat, dog, duck]
Cat ? NO
Dog ? NO
Duck? NO
4
5. Object Detection as Classification
Classes = [cat, dog, duck]
Cat ? YES
Dog ? NO
Duck? NO
5
6. Classes = [cat, dog, duck]
Cat ? NO
Dog ? NO
Duck? NO
6
Object Detection as Classification
7. Problem:
Too many positions & scales to test
Solution: If your classifier is fast enough, go for it
7
Object Detection as Classification
8. Object Detection with ConvNets?
Convnets are computationally demanding. We can’t test all positions & scales !
Solution: Look at a tiny subset of positions. Choose them wisely :)
8
9. Object Detection: Datasets
9
20 categories
6k training images
6k validation images
10k test images
200 categories
456k training images
60k validation + test images
80 categories
200k training images
60k val + test images
11. Region Proposals
● Find “blobby” image regions that are likely to contain objects
● “Class-agnostic” object detector
Slide Credit: CS231n 11
12. Region Proposals
Selective Search (SS) Multiscale Combinatorial Grouping (MCG)
[SS] Uijlings et al. Selective search for object recognition. IJCV 2013
[MCG] Arbeláez, Pont-Tuset et al. Multiscale combinatorial grouping. CVPR 2014 12
13. Object Detection with Convnets: R-CNN
Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014
13
15. R-CNN
Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014
15
16. R-CNN: Problems
1. Slow at test-time: need to run full forward pass of
CNN for each region proposal
2. SVMs and regressors are post-hoc: CNN features not
updated in response to SVMs and regressors
Slide Credit: CS231n 16
17. Fast R-CNN
Girshick Fast R-CNN. ICCV 2015
Solution: Share computation of convolutional layers between region proposals for an image
R-CNN Problem #1: Slow at test-time: need to run full forward pass of CNN for each region proposal
17
18. Fast R-CNN: Sharing features
Hi-res input image:
3 x 800 x 600
with region
proposal
Convolution
and Pooling
Hi-res conv features:
C x H x W
with region proposal
Fully-connected
layers
Max-pool within
each grid cell
RoI conv features:
C x h x w
for region proposal
Fully-connected layers expect
low-res conv features:
C x h x w
Slide Credit: CS231n 18Girshick Fast R-CNN. ICCV 2015
19. Fast R-CNN
Solution: Train it all together end to end
R-CNN Problem #2&3: SVMs and regressors are post-hoc. Complex training.
19Girshick Fast R-CNN. ICCV 2015
20. Fast R-CNN
Slide Credit: CS231n
R-CNN Fast R-CNN
Training Time: 84 hours 9.5 hours
(Speedup) 1x 8.8x
Test time per image 47 seconds 0.32 seconds
(Speedup) 1x 146x
mAP (VOC 2007) 66.0 66.9
Using VGG-16 CNN on Pascal VOC 2007 dataset
Faster!
FASTER!
Better!
20
21. Fast R-CNN: Problem
Slide Credit: CS231n
R-CNN Fast R-CNN
Test time per image 47 seconds 0.32 seconds
(Speedup) 1x 146x
Test time per image
with Selective Search
50 seconds 2 seconds
(Speedup) 1x 25x
Test-time speeds don’t include region proposals
21
22. Faster R-CNN
Conv
layers
Region Proposal Network
FC6
Class probabilities
FC7
FC8
RPN Proposals
RoI
Pooling
Conv5_3
RPN Proposals
22
Ren et al. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015
Learn proposals end-to-end sharing parameters with the classification network
23. Faster R-CNN
Conv
layers
Region Proposal Network
FC6
Class probabilities
FC7
FC8
RPN Proposals
RoI
Pooling
Conv5_3
RPN Proposals
Fast R-CNN
23
Ren et al. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015
Learn proposals end-to-end sharing parameters with the classification network
24. Region Proposal Network
Objectness scores
(object/no object)
Bounding Box Regression
In practice, k = 9 (3 different scales and 3 aspect ratios)
24
Ren et al. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015
25. Faster R-CNN
Ren et al. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015
R-CNN Fast R-CNN Faster R-CNN
Test time per
image
(with proposals)
50 seconds 2 seconds 0.2 seconds
(Speedup) 1x 25x 250x
mAP (VOC 2007) 66.0 66.9 66.9
Slide Credit: CS231n 25
29. One-stage methods
29
Problem:
Too many positions & scales to test
Modern detectors parallelize feature extraction across all locations.
Region classification is not slow anymore!
Previously… :
30. YOLO: You Only Look Once
30Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016
Proposal-free object detection pipeline
31. YOLO: You Only Look Once
Redmon et al. You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016 31
32. YOLO: You Only Look Once
32
Each cell predicts:
- For each bounding box:
- 4 coordinates (x, y, w, h)
- 1 confidence value
- Some number of class
probabilities
For Pascal VOC:
- 7x7 grid
- 2 bounding boxes / cell
- 20 classes
7 x 7 x (2 x 5 + 20) = 7 x 7 x 30 tensor = 1470 outputs
33. SSD: Single Shot MultiBox Detector
Liu et al. SSD: Single Shot MultiBox Detector, ECCV 2016 33
Same idea as YOLO, + several predictors at different stages in the network
36. RetinaNet
36
Matching proposal-based performance with a one-stage approach
Problem of one-stage detectors? They evaluate many candidate locations but only a few have objects
---> IMBALANCE, making learning inefficient
Key idea is to lower loss weight for well classified samples, increase it for difficult ones.
Lin et al. Focal Loss for Dense Object Detection. ICCV 2017