https://telecombcn-dl.github.io/2017-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
2. Outline
● What is object segmentation?
○ Applications
● Semantic segmentation
○ From image classification to semantic segmentation
■ Fully convolutional networks
■ Learnable Upsampling
■ Skip connections
○ FCN8s, Dilated Convolutions, U-Net, ...
● Instance segmentation
○ Simultaneous Detection and Segmentation
○ Mask R-CNN
3. Image and object segmentation
● Image Segmentation
○ Group pixels into regions that share some similar properties
● Segmenting images into meaningful objects
○ Object-level segmentation: accurate localization and recognition
Superpixels (Ren and Malik, ICCV 2003)
5. Semantic segmentation
● Label every pixel: recognize the class of every pixel
● Do not differentiate instances
Mottaghi et al, “The role of context for object detection and semantic segmentation in the wild”, CVPR 2014
6. Instance segmentation
● Detect instances, categorize and label every pixel
● Labels are class-aware and instance-aware
Arnab and Torr, “Pixelwise instance segmentation with a dynamically instantiated network”, CVPR 2017
(Figure panels: object detection, semantic segmentation, instance segmentation, ground truth)
8. Datasets for semantic/instance segmentation
SUN RGB-D
● Real indoor scenes
● 10,000 images
● 58,658 3D bounding boxes
● Dense annotations
● Instance GT
● Semantic segmentation GT
● Objects + stuff

COCO (Common Objects in Context)
● Real indoor & outdoor scenes
● 80 categories
● +300,000 images
● 2M instances
● Partial annotations
● Semantic segmentation GT
● Instance segmentation GT
● Objects, but no stuff
9. Datasets for semantic/instance segmentation
CityScapes
● Real driving scenes
● 30 categories
● +25,000 images
● 20,000 partial annotations
● 5,000 dense annotations
● Semantic segmentation GT
● Instance segmentation GT
● Depth, GPS and other metadata
● Objects and stuff

ADE20K
● Real general scenes
● +150 categories
● +22,000 images
● Semantic segmentation GT
● Instance + parts segmentation GT
● Objects and stuff
10. From classification to semantic segmentation
● Naive approach: extract a patch around a pixel, run it through a CNN trained for image classification, and classify the center pixel (e.g. CAT)
● Repeat for every pixel
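The patch-wise pipeline above can be sketched in a few lines of NumPy; the classifier here is a hypothetical stand-in (a threshold rule, not a real CNN), and the point is the redundancy: one forward pass per pixel.

```python
import numpy as np

def classify_center_pixel(patch):
    # Stand-in for a CNN classifier (hypothetical threshold rule, not a real net)
    return int(patch.mean() > 0.5)

def naive_segmentation(img, patch_size=5):
    """Slide a window over the image and classify the center pixel of each
    patch: correct in principle, but needs one forward pass per pixel."""
    r = patch_size // 2
    padded = np.pad(img, r, mode='edge')
    out = np.zeros(img.shape, dtype=int)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = classify_center_pixel(padded[i:i + patch_size,
                                                     j:j + patch_size])
    return out

img = np.zeros((8, 8))
img[2:6, 2:6] = 1.0                      # a bright square "object"
print(naive_segmentation(img))
```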
11. From classification to semantic segmentation
● A classification network becomes fully convolutional
○ Fully connected layers can be viewed as convolutions with kernels that cover the entire input region
Shelhamer, Long, Darrell, Fully Convolutional Networks for Semantic Segmentation, 2014-2016
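The FC-as-convolution equivalence can be checked numerically; a minimal sketch assuming a single-channel 8x8 input and a 10-way FC layer (toy sizes, not VGG's):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))          # single-channel feature map
W = rng.standard_normal((8 * 8, 10))     # fully connected layer: 64 -> 10

y_fc = x.reshape(-1) @ W                 # FC forward pass

# The same weights viewed as 10 convolution kernels of size 8x8:
# the kernel covers the entire input, so each "convolution" is one dot product.
kernels = W.T.reshape(10, 8, 8)
y_conv = np.array([(x * k).sum() for k in kernels])

print(np.allclose(y_fc, y_conv))
```

On a larger input the same conv layer slides and produces a spatial grid of class scores, which is exactly what dense prediction needs.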
12. From classification to semantic segmentation
● Dense prediction: fully convolutional, end to end, pixel-to-pixel network
● Problems
○ Output is smaller than input → Add upsampling layers
○ Output is very coarse → Add fine details from previous layers
Final layer is a 1x1 conv with #channels = #classes
Pixelwise loss function: per-pixel softmax cross-entropy
Credit: Shelhamer, Long
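The pixelwise loss is an ordinary classification loss applied at every pixel; a minimal NumPy sketch of per-pixel softmax cross-entropy (the function name is illustrative; 21 classes as in PASCAL VOC):

```python
import numpy as np

def pixelwise_ce(logits, labels):
    """Per-pixel softmax cross-entropy, averaged over all pixels.
    logits: (C, H, W) class scores; labels: (H, W) integer class ids."""
    z = logits - logits.max(axis=0, keepdims=True)   # for numerical stability
    logp = z - np.log(np.exp(z).sum(axis=0, keepdims=True))
    H, W = labels.shape
    rows, cols = np.arange(H)[:, None], np.arange(W)[None, :]
    return -logp[labels, rows, cols].mean()          # pick the GT class per pixel

rng = np.random.default_rng(0)
logits = rng.standard_normal((21, 4, 4))   # 21 classes as in PASCAL VOC
labels = rng.integers(0, 21, size=(4, 4))
print(pixelwise_ce(logits, labels))
```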
13. From classification to semantic segmentation
● Dense prediction: fully convolutional, end-to-end, pixel-to-pixel network
● Pipeline: conv, pool, non-linearity → learnable upsampling → pixelwise output + loss
● Final layer is a 1x1 conv with #channels = #classes
Credit: Shelhamer, Long
14. Learnable upsampling: recovering spatial shape
Upsampling: transposed convolution, also called fractionally strided convolution or ‘deconvolution’
Convolution as a matrix operation:
● I: input image, 4x4, vectorized to 16x1
● O: output image, 4x1 (later reshaped to 2x2)
● h: 3x3 kernel
● C: 4x16 matrix of weights built from h, so that O = C·I
● The backward pass is obtained by transposing C: Cᵀ
● Transposed convolution (also called fractionally strided convolution or ‘deconvolution’) swaps the forward and backward passes of a convolution
More info: Dumoulin and Visin, “A guide to convolution arithmetic for deep learning”, 2016
https://github.com/vdumoulin/conv_arithmetic
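A minimal NumPy sketch of this matrix view, assuming stride 1 and no padding (`conv_matrix` is an illustrative helper, not from the paper): building C for a 3x3 kernel on a 4x4 input gives a 4x16 matrix, and multiplying by Cᵀ maps the 4 outputs back onto a 16-pixel (4x4) grid.

```python
import numpy as np

def conv_matrix(h, in_size):
    """Build the matrix C so that C @ x.ravel() equals a valid, stride-1
    convolution (cross-correlation) of the in_size x in_size image x with h."""
    k = h.shape[0]
    out = in_size - k + 1
    C = np.zeros((out * out, in_size * in_size))
    for i in range(out):
        for j in range(out):
            row = np.zeros((in_size, in_size))
            row[i:i + k, j:j + k] = h    # kernel placed at position (i, j)
            C[i * out + j] = row.ravel()
    return C

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4))          # I: 4x4 input, vectorized to 16x1
h = rng.standard_normal((3, 3))          # h: 3x3 kernel
C = conv_matrix(h, 4)                    # C: 4x16
y = C @ x.ravel()                        # O: 4x1, reshaped to 2x2
up = (C.T @ y).reshape(4, 4)             # transposed conv: 2x2 back to 4x4
print(C.shape, up.shape)
```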
15. Learnable upsampling: recovering spatial shape
It is always possible to emulate a transposed convolution with a direct convolution (fractional
stride)
More info:
Dumoulin et al, A guide to convolution arithmetic for deep learning, 2016
https://github.com/vdumoulin/conv_arithmetic
16. Learnable upsampling: recovering spatial shape
● 1D example (figure): convolution with stride 2, transposed convolution with stride 2, and subpixel convolution with stride 1/2
● The two operators (transposed and subpixel convolution) can achieve the same result if the filters are learned
Shi et al, “Is the deconvolution layer the same as a convolutional layer?”, 2016
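A tiny sketch of the subpixel view, assuming x2 upsampling: two channels produced by a stride-1 convolution are interleaved along the spatial axis (toy values, illustrative only).

```python
import numpy as np

# Two output channels of a stride-1 convolution (toy values)...
a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

# ...interleaved along the spatial axis ("pixel shuffle") give a x2 upsampled
# signal; with learned filters this matches a stride-1/2 transposed convolution.
up = np.stack([a, b], axis=1).ravel()
print(up)   # [ 1. 10.  2. 20.  3. 30.]
```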
17. DeconvNet
● VGG-16 (conv + ReLU + max pool) followed by a mirrored VGG (unpooling + ‘deconv’ + ReLU)
● More than one upsampling layer
(Figure: normal VGG and “upside down” VGG)
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
18. Resolution: spectrum of deep features
Problem: coarse output
● Combine where (local, shallow) with what (global, deep)
fuse features into deep jet
(cf. Hariharan et al. CVPR15 “hypercolumn”)
Credit: Shelhamer, Long
19. Fine details: skip connections
● Add a 1x1 conv classifying layer on top of pool4
● Upsample the conv7 prediction x2 (initialized to bilinear, then learned), sum both, and upsample x16 for the output
● End-to-end, joint learning of semantics and location: skip connections fuse layers
Credit: Shelhamer, Long
20. Skip connections
● A multi-stream network that fuses features/predictions across layers
(Figure: input image; predictions at stride 32 (no skips), stride 16 (1 skip), stride 8 (2 skips); ground truth)
Credit: Shelhamer, Long
21. Transfer learning
● Cast ILSVRC classifiers (AlexNet, VGG, GoogLeNet) into FCNs and augment them for dense prediction: discard the classifier layer, transform FC layers into CONV layers, add a 1x1 CONV with 21 filters for scoring at each output location, and upsample
● Add skip connections
● Train for segmentation by fine-tuning all layers on PASCAL VOC 2011 with a pixelwise loss
● Metrics: pixel accuracy, mean accuracy, mean pixel intersection over union
Mean IU: per-class evaluation: the intersection of the predicted and true pixel sets for a given class, divided by their union, averaged over classes
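The mean IU metric can be computed directly from two label maps (function name is illustrative; classes absent from both prediction and ground truth are skipped to avoid division by zero):

```python
import numpy as np

def mean_iou(pred, gt, n_classes):
    """Mean intersection-over-union across classes for integer label maps."""
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                    # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([[0, 0, 1], [1, 1, 1], [2, 2, 1]])
gt   = np.array([[0, 0, 1], [1, 1, 1], [2, 2, 2]])
# class 0: 2/2, class 1: 4/5, class 2: 2/3
print(mean_iou(pred, gt, 3))   # (1 + 0.8 + 2/3) / 3 ≈ 0.822
```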
(Results tables: PASCAL test and validation sets; models based on VGG; FCN-32s = FCN VGG)
22. FCN-8s results on Pascal
SDS: Simultaneous Detection and Segmentation, Hariharan et al, ECCV 2014
24. U-Net
● A contracting path and an expansive path
● Adds convolutions in the upsampling path (“symmetric” net)
● Skip connections: concatenation of feature maps
Ronneberger et al, “U-Net: Convolutional Networks for Biomedical Image Segmentation”, arXiv 2015
Winner of the ISBI 2015 CAD Caries challenge and the ISBI 2015 Cell Tracking challenge
25. Fully convolutional DenseNets
● Adds feed-forward connections between layers
● Based on U-Nets:
○ connections between downsampling – upsampling paths
● Based on DenseNets* (for image classification):
○ each layer directly connected to every other layer
○ alleviate the vanishing-gradient problem
○ strengthen feature propagation
○ encourage feature reuse
○ substantially reduce the number of parameters.
Jégou et al, “The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation”, Dec. 2016
* Huang et al, “Densely connected convolutional networks”, arxiv Aug 2016
(Figures: dense block and complete architecture)
26. Dilated convolutions
● Systematically aggregate multiscale contextual information without losing resolution
○ Usual convolution
○ Dilated convolution
Yu, Koltun, Multi-scale context aggregation by dilated convolutions, 2016
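A 1D sketch of the idea (illustrative helper, not the paper's code): a dilated kernel has taps spaced `dilation` samples apart, so the receptive field grows without adding parameters or losing resolution through pooling.

```python
import numpy as np

def dilated_conv1d(x, h, dilation):
    """Valid 1D convolution (cross-correlation) where kernel taps are
    spaced `dilation` samples apart, enlarging the receptive field."""
    k = len(h)
    span = (k - 1) * dilation + 1        # effective kernel extent
    out = len(x) - span + 1
    return np.array([sum(h[t] * x[i + t * dilation] for t in range(k))
                     for i in range(out)])

x = np.arange(8.0)
h = np.array([1.0, 1.0, 1.0])
print(dilated_conv1d(x, h, 1))   # usual conv, receptive field 3
print(dilated_conv1d(x, h, 2))   # dilation 2, receptive field 5
```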
27. Instance segmentation
● Detect instances, categorize and label every pixel
● Labels are class-aware and instance-aware
Arnab and Torr, “Pixelwise instance segmentation with a dynamically instantiated network”, CVPR 2017
(Figure panels: object detection, semantic segmentation, instance segmentation, ground truth)
28. Instance segmentation: Multi-task cascades
Dai et al, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, arXiv 2015
● Won the COCO 2015 challenge (with ResNet)
● The entire model is learned end to end
30. Instance segmentation: Mask R-CNN
● Extension of Faster R-CNN to instance segmentation
● A Fully Convolutional Network (FCN) is added on top of the CNN features of Faster R-CNN to generate a mask (segmentation output)
● This branch runs in parallel to the classification and bounding box regression network of Faster R-CNN
● RoIAlign instead of RoIPool to properly align the extracted features with the input
He et al, Mask R-CNN, 2017
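RoIAlign avoids the coordinate rounding of RoIPool by sampling the feature map at real-valued positions with bilinear interpolation; a minimal sketch of that sampling step (helper name illustrative, single channel, no batching):

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample a 2D feature map at a real-valued (y, x) position via bilinear
    interpolation: the core of RoIAlign (no quantization, unlike RoIPool)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feat.shape[0] - 1)
    x1 = min(x0 + 1, feat.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * feat[y0, x0] + (1 - dy) * dx * feat[y0, x1]
            + dy * (1 - dx) * feat[y1, x0] + dy * dx * feat[y1, x1])

feat = np.arange(16.0).reshape(4, 4)
print(bilinear_sample(feat, 1.5, 1.5))   # average of the 4 neighbors = 7.5
```

RoIAlign samples a few such points per output bin of the RoI grid and averages (or max-pools) them, keeping masks aligned to the input at pixel precision.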
31. Instance segmentation: Mask R-CNN
● Classification and bounding box detection losses like Faster R-CNN
● A new loss term for mask prediction
● Output: C x m x m volume for mask prediction (C classes, m size of square mask)
He et al, Mask R-CNN, 2017
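A sketch of the mask loss under the paper's stated design (per-pixel sigmoid, binary cross-entropy, and only the mask of the ground-truth class contributes to the loss); function name and sizes are illustrative:

```python
import numpy as np

def mask_loss(mask_logits, gt_class, gt_mask):
    """Binary cross-entropy on the m x m mask of the ground-truth class only;
    the other C - 1 masks do not contribute (per-class sigmoid, not softmax)."""
    m = mask_logits[gt_class]            # pick one m x m slice from C x m x m
    p = 1.0 / (1.0 + np.exp(-m))         # per-pixel sigmoid
    eps = 1e-7
    return -np.mean(gt_mask * np.log(p + eps)
                    + (1 - gt_mask) * np.log(1 - p + eps))

rng = np.random.default_rng(0)
logits = rng.standard_normal((3, 4, 4))  # C = 3 classes, m = 4
gt = (rng.random((4, 4)) > 0.5).astype(float)
print(mask_loss(logits, gt_class=1, gt_mask=gt))
```

Decoupling mask and class prediction this way avoids competition between classes in the mask branch; the class label comes from the classification branch instead.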
32. Instance segmentation: Mask R-CNN
● Masks are combined with classifications and bounding boxes from Faster R-CNN
He et al, Mask R-CNN, 2017
33. Instance segmentation: Mask R-CNN
● Results on COCO dataset
● MNC and FCIS won the COCO challenges in 2015 and 2016, respectively
MNC: Dai et al, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, arXiv 2015
FCIS: Li et al., “Fully convolutional instance-aware semantic segmentation”, CVPR 2017