https://telecombcn-dl.github.io/2017-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
2. Outline
● What is object segmentation?
○ Applications
● Semantic segmentation
○ From image classification to semantic segmentation
■ Fully convolutional networks
■ Learnable Upsampling
■ Skip connections
○ FCN8s, Dilated Convolutions, U-Net, ...
● Instance segmentation
○ Simultaneous Detection and Segmentation
○ Mask R-CNN
3. Image and object segmentation
● Image Segmentation
○ Group pixels into regions that share some similar properties
● Segmenting images into meaningful objects
○ Object-level segmentation: accurate localization and recognition
Superpixels (Ren and Malik, ICCV 2003)
5. Semantic segmentation
● Label every pixel: recognize the class of every pixel
● Do not differentiate instances
Mottaghi et al, “The role of context for object detection and semantic segmentation in the wild”, CVPR 2014
6. Instance segmentation
● Detect instances, categorize and label every pixel
● Labels are class-aware and instance-aware
Arnab and Torr, “Pixelwise instance segmentation with a dynamically instantiated network”, CVPR 2017
(Figure panels: object detection, semantic segmentation, instance segmentation, ground truth)
8. Datasets for semantic/instance segmentation
SUN RGB-D
● Real indoor scenes
● 10,000 images
● 58,658 3D bounding boxes
● Dense annotations
● Instance GT
● Semantic segmentation GT
● Objects + stuff

COCO (Common Objects in Context)
● Real indoor & outdoor scenes
● 80 categories
● +300,000 images
● 2M instances
● Partial annotations
● Semantic segmentation GT
● Instance segmentation GT
● Objects, but no stuff
9. Datasets for semantic/instance segmentation
CityScapes
● Real driving scenes
● 30 categories
● +25,000 images
● 20,000 partial annotations
● 5,000 dense annotations
● Semantic segmentation GT
● Instance segmentation GT
● Depth, GPS and other metadata
● Objects and stuff

ADE20K
● Real general scenes
● +150 categories
● +22,000 images
● Semantic segmentation GT
● Instance + parts segmentation GT
● Objects and stuff
10. From classification to semantic segmentation
● Naive approach: extract a patch around a pixel, run it through a CNN trained for image classification, and classify the center pixel (e.g. CAT)
● Repeat for every pixel
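The patch-wise pipeline above can be sketched in a few lines of NumPy; the classifier here is a hypothetical stand-in (a threshold rule, not a real CNN), and the point is the redundancy: one forward pass per pixel.

```python
import numpy as np

def classify_center_pixel(patch):
    # Stand-in for a CNN classifier (hypothetical threshold rule, not a real net)
    return int(patch.mean() > 0.5)

def naive_segmentation(img, patch_size=5):
    """Slide a window over the image and classify the center pixel of each
    patch: correct in principle, but needs one forward pass per pixel."""
    r = patch_size // 2
    padded = np.pad(img, r, mode='edge')
    out = np.zeros(img.shape, dtype=int)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = classify_center_pixel(padded[i:i + patch_size,
                                                     j:j + patch_size])
    return out

img = np.zeros((8, 8))
img[2:6, 2:6] = 1.0                      # a bright square "object"
print(naive_segmentation(img))
```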
11. From classification to semantic segmentation
● A classification network becomes fully convolutional
○ Fully connected layers can be viewed as convolutions with kernels that cover the entire input region
Shelhamer, Long, Darrell, Fully Convolutional Networks for Semantic Segmentation, 2014-2016
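The FC-as-convolution equivalence can be checked numerically; a minimal sketch assuming a single-channel 8x8 input and a 10-way FC layer (toy sizes, not VGG's):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))          # single-channel feature map
W = rng.standard_normal((8 * 8, 10))     # fully connected layer: 64 -> 10

y_fc = x.reshape(-1) @ W                 # FC forward pass

# The same weights viewed as 10 convolution kernels of size 8x8:
# the kernel covers the entire input, so each "convolution" is one dot product.
kernels = W.T.reshape(10, 8, 8)
y_conv = np.array([(x * k).sum() for k in kernels])

print(np.allclose(y_fc, y_conv))
```

On a larger input the same conv layer slides and produces a spatial grid of class scores, which is exactly what dense prediction needs.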
12. From classification to semantic segmentation
● Dense prediction: fully convolutional, end to end, pixel-to-pixel network
● Problems
○ Output is smaller than input → Add upsampling layers
○ Output is very coarse → Add fine details from previous layers
Final layer is a 1x1 conv with #channels = #classes
Pixelwise loss function: per-pixel softmax cross-entropy
Credit: Shelhamer, Long
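The pixelwise loss is an ordinary classification loss applied at every pixel; a minimal NumPy sketch of per-pixel softmax cross-entropy (the function name is illustrative; 21 classes as in PASCAL VOC):

```python
import numpy as np

def pixelwise_ce(logits, labels):
    """Per-pixel softmax cross-entropy, averaged over all pixels.
    logits: (C, H, W) class scores; labels: (H, W) integer class ids."""
    z = logits - logits.max(axis=0, keepdims=True)   # for numerical stability
    logp = z - np.log(np.exp(z).sum(axis=0, keepdims=True))
    H, W = labels.shape
    rows, cols = np.arange(H)[:, None], np.arange(W)[None, :]
    return -logp[labels, rows, cols].mean()          # pick the GT class per pixel

rng = np.random.default_rng(0)
logits = rng.standard_normal((21, 4, 4))   # 21 classes as in PASCAL VOC
labels = rng.integers(0, 21, size=(4, 4))
print(pixelwise_ce(logits, labels))
```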
13. From classification to semantic segmentation
● Dense prediction: fully convolutional, end-to-end, pixel-to-pixel network
● Pipeline: conv, pool, non-linearity → learnable upsampling → pixelwise output + loss
● Final layer is a 1x1 conv with #channels = #classes
Credit: Shelhamer, Long
14. Learnable upsampling: recovering spatial shape
Upsampling: transposed convolution, also called fractionally strided convolution or ‘deconvolution’
Convolution as a matrix operation:
● I: input image, 4x4, vectorized to 16x1
● O: output image, 4x1 (later reshaped to 2x2)
● h: 3x3 kernel
● C: 4x16 matrix of weights built from h, so that O = C·I
● The backward pass is obtained by transposing C: Cᵀ
● Transposed convolution (also called fractionally strided convolution or ‘deconvolution’) swaps the forward and backward passes of a convolution
More info: Dumoulin and Visin, “A guide to convolution arithmetic for deep learning”, 2016
https://github.com/vdumoulin/conv_arithmetic
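A minimal NumPy sketch of this matrix view, assuming stride 1 and no padding (`conv_matrix` is an illustrative helper, not from the paper): building C for a 3x3 kernel on a 4x4 input gives a 4x16 matrix, and multiplying by Cᵀ maps the 4 outputs back onto a 16-pixel (4x4) grid.

```python
import numpy as np

def conv_matrix(h, in_size):
    """Build the matrix C so that C @ x.ravel() equals a valid, stride-1
    convolution (cross-correlation) of the in_size x in_size image x with h."""
    k = h.shape[0]
    out = in_size - k + 1
    C = np.zeros((out * out, in_size * in_size))
    for i in range(out):
        for j in range(out):
            row = np.zeros((in_size, in_size))
            row[i:i + k, j:j + k] = h    # kernel placed at position (i, j)
            C[i * out + j] = row.ravel()
    return C

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4))          # I: 4x4 input, vectorized to 16x1
h = rng.standard_normal((3, 3))          # h: 3x3 kernel
C = conv_matrix(h, 4)                    # C: 4x16
y = C @ x.ravel()                        # O: 4x1, reshaped to 2x2
up = (C.T @ y).reshape(4, 4)             # transposed conv: 2x2 back to 4x4
print(C.shape, up.shape)
```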
15. Learnable upsampling: recovering spatial shape
It is always possible to emulate a transposed convolution with a direct convolution (fractional
stride)
More info:
Dumoulin et al, A guide to convolution arithmetic for deep learning, 2016
https://github.com/vdumoulin/conv_arithmetic
16. Learnable upsampling: recovering spatial shape
● 1D example (figure): convolution with stride 2, transposed convolution with stride 2, and subpixel convolution with stride 1/2
● The two operators (transposed and subpixel convolution) can achieve the same result if the filters are learned
Shi et al, “Is the deconvolution layer the same as a convolutional layer?”, 2016
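A tiny sketch of the subpixel view, assuming x2 upsampling: two channels produced by a stride-1 convolution are interleaved along the spatial axis (toy values, illustrative only).

```python
import numpy as np

# Two output channels of a stride-1 convolution (toy values)...
a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

# ...interleaved along the spatial axis ("pixel shuffle") give a x2 upsampled
# signal; with learned filters this matches a stride-1/2 transposed convolution.
up = np.stack([a, b], axis=1).ravel()
print(up)   # [ 1. 10.  2. 20.  3. 30.]
```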
17. DeconvNet
● VGG-16 (conv + ReLU + max pool) followed by a mirrored VGG (unpooling + ‘deconv’ + ReLU)
● More than one upsampling layer
(Figure: normal VGG and “upside down” VGG)
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
18. Resolution: spectrum of deep features
Problem: coarse output
● Combine where (local, shallow) with what (global, deep)
fuse features into deep jet
(cf. Hariharan et al. CVPR15 “hypercolumn”)
Credit: Shelhamer, Long
19. Fine details: skip connections
● Add a 1x1 conv classifying layer on top of pool4
● Upsample the conv7 prediction x2 (initialized to bilinear, then learned), sum both, and upsample x16 for the output
● End-to-end, joint learning of semantics and location: skip connections fuse layers
Credit: Shelhamer, Long
20. Skip connections
● A multi-stream network that fuses features/predictions across layers
(Figure: input image; predictions at stride 32 (no skips), stride 16 (1 skip), stride 8 (2 skips); ground truth)
Credit: Shelhamer, Long
21. Transfer learning
● Cast ILSVRC classifiers (AlexNet, VGG, GoogLeNet) into FCNs and augment them for dense prediction: discard the classifier layer, transform FC layers into CONV layers, add a 1x1 CONV with 21 filters for scoring at each output location, and upsample
● Add skip connections
● Train for segmentation by fine-tuning all layers on PASCAL VOC 2011 with a pixelwise loss
● Metrics: pixel accuracy, mean accuracy, mean pixel intersection over union
Mean IU: per-class evaluation: the intersection of the predicted and true pixel sets for a given class, divided by their union, averaged over classes
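The mean IU metric can be computed directly from two label maps (function name is illustrative; classes absent from both prediction and ground truth are skipped to avoid division by zero):

```python
import numpy as np

def mean_iou(pred, gt, n_classes):
    """Mean intersection-over-union across classes for integer label maps."""
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                    # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([[0, 0, 1], [1, 1, 1], [2, 2, 1]])
gt   = np.array([[0, 0, 1], [1, 1, 1], [2, 2, 2]])
# class 0: 2/2, class 1: 4/5, class 2: 2/3
print(mean_iou(pred, gt, 3))   # (1 + 0.8 + 2/3) / 3 ≈ 0.822
```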
(Results tables: PASCAL test and validation sets; models based on VGG; FCN-32s = FCN VGG)
22. FCN-8s results on Pascal
SDS: Simultaneous Detection and Segmentation, Hariharan et al, ECCV 2014
24. U-Net
● A contracting path and an expansive path
● Adds convolutions in the upsampling path (“symmetric” net)
● Skip connections: concatenation of feature maps
Ronneberger et al, “U-Net: Convolutional Networks for Biomedical Image Segmentation”, arXiv 2015
Winner of the ISBI 2015 CAD Caries challenge and the ISBI 2015 Cell Tracking challenge
25. Fully convolutional DenseNets
● Adds feed-forward connections between layers
● Based on U-Nets:
○ connections between downsampling – upsampling paths
● Based on DenseNets* (for image classification):
○ each layer directly connected to every other layer
○ alleviate the vanishing-gradient problem
○ strengthen feature propagation
○ encourage feature reuse
○ substantially reduce the number of parameters.
Jégou et al, “The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation”, Dec. 2016
* Huang et al, “Densely connected convolutional networks”, arxiv Aug 2016
(Figures: dense block and complete architecture)
26. Dilated convolutions
● Systematically aggregate multiscale contextual information without losing resolution
○ Usual convolution
○ Dilated convolution
Yu, Koltun, Multi-scale context aggregation by dilated convolutions, 2016
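A 1D sketch of the idea (illustrative helper, not the paper's code): a dilated kernel has taps spaced `dilation` samples apart, so the receptive field grows without adding parameters or losing resolution through pooling.

```python
import numpy as np

def dilated_conv1d(x, h, dilation):
    """Valid 1D convolution (cross-correlation) where kernel taps are
    spaced `dilation` samples apart, enlarging the receptive field."""
    k = len(h)
    span = (k - 1) * dilation + 1        # effective kernel extent
    out = len(x) - span + 1
    return np.array([sum(h[t] * x[i + t * dilation] for t in range(k))
                     for i in range(out)])

x = np.arange(8.0)
h = np.array([1.0, 1.0, 1.0])
print(dilated_conv1d(x, h, 1))   # usual conv, receptive field 3
print(dilated_conv1d(x, h, 2))   # dilation 2, receptive field 5
```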
27. Instance segmentation
● Detect instances, categorize and label every pixel
● Labels are class-aware and instance-aware
Arnab and Torr, “Pixelwise instance segmentation with a dynamically instantiated network”, CVPR 2017
(Figure panels: object detection, semantic segmentation, instance segmentation, ground truth)
28. Instance segmentation: Multi-task cascades
Dai et al, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, arXiv 2015
● Won the COCO 2015 challenge (with ResNet)
● The entire model is learned end to end
30. Instance segmentation: Mask R-CNN
● Extension of Faster R-CNN to instance segmentation
● A Fully Convolutional Network (FCN) is added on top of the CNN features of Faster R-CNN to generate a mask (segmentation output)
● This branch runs in parallel to the classification and bounding box regression network of Faster R-CNN
● RoIAlign instead of RoIPool to properly align the extracted features with the input
He et al, Mask R-CNN, 2017
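RoIAlign avoids the coordinate rounding of RoIPool by sampling the feature map at real-valued positions with bilinear interpolation; a minimal sketch of that sampling step (helper name illustrative, single channel, no batching):

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample a 2D feature map at a real-valued (y, x) position via bilinear
    interpolation: the core of RoIAlign (no quantization, unlike RoIPool)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feat.shape[0] - 1)
    x1 = min(x0 + 1, feat.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * feat[y0, x0] + (1 - dy) * dx * feat[y0, x1]
            + dy * (1 - dx) * feat[y1, x0] + dy * dx * feat[y1, x1])

feat = np.arange(16.0).reshape(4, 4)
print(bilinear_sample(feat, 1.5, 1.5))   # average of the 4 neighbors = 7.5
```

RoIAlign samples a few such points per output bin of the RoI grid and averages (or max-pools) them, keeping masks aligned to the input at pixel precision.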
31. Instance segmentation: Mask R-CNN
● Classification and bounding box detection losses like Faster R-CNN
● A new loss term for mask prediction
● Output: C x m x m volume for mask prediction (C classes, m size of square mask)
He et al, Mask R-CNN, 2017
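A sketch of the mask loss under the paper's stated design (per-pixel sigmoid, binary cross-entropy, and only the mask of the ground-truth class contributes to the loss); function name and sizes are illustrative:

```python
import numpy as np

def mask_loss(mask_logits, gt_class, gt_mask):
    """Binary cross-entropy on the m x m mask of the ground-truth class only;
    the other C - 1 masks do not contribute (per-class sigmoid, not softmax)."""
    m = mask_logits[gt_class]            # pick one m x m slice from C x m x m
    p = 1.0 / (1.0 + np.exp(-m))         # per-pixel sigmoid
    eps = 1e-7
    return -np.mean(gt_mask * np.log(p + eps)
                    + (1 - gt_mask) * np.log(1 - p + eps))

rng = np.random.default_rng(0)
logits = rng.standard_normal((3, 4, 4))  # C = 3 classes, m = 4
gt = (rng.random((4, 4)) > 0.5).astype(float)
print(mask_loss(logits, gt_class=1, gt_mask=gt))
```

Decoupling mask and class prediction this way avoids competition between classes in the mask branch; the class label comes from the classification branch instead.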
32. Instance segmentation: Mask R-CNN
● Masks are combined with classifications and bounding boxes from Faster R-CNN
He et al, Mask R-CNN, 2017
33. Instance segmentation: Mask R-CNN
● Results on COCO dataset
● MNC and FCIS won the COCO challenges in 2015 and 2016, respectively
MNC: Dai et al, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, arXiv 2015
FCIS: Li et al., “Fully convolutional instance-aware semantic segmentation”, CVPR 2017