The first part of this dissertation analyzes spatial context in semantic image segmentation. First, we review how the literature has addressed spatial context through local features and spatial aggregation techniques. Starting from a discussion of whether context is beneficial for object recognition, we extend a Figure-Border-Ground segmentation for local feature aggregation, originally based on ground-truth annotations, to a more realistic scenario in which object proposal techniques are used instead. Whereas the Figure and Ground regions represent the object and its surround respectively, the Border is a region around the object contour, which is found to be the region with the richest contextual information for object recognition. Furthermore, we propose a new contour-based spatial aggregation technique that divides the object region into four subregions and aggregates the local features within each. Both contributions have been tested on a semantic segmentation benchmark with a combination of context-free and non-context-free local features, which allows the models to learn automatically whether context is beneficial for each semantic category.
The second part of this dissertation addresses semantic segmentation for a set of closely related images from an uncalibrated multiview scenario. State-of-the-art semantic segmentation algorithms fail to segment the objects correctly from some viewpoints when they are applied independently to each viewpoint image. The lack of large annotated datasets for multiview segmentation does not allow us to train a model that is robust to viewpoint changes. In this second part, we exploit the spatial correlation that exists between the images from different viewpoints to obtain a more robust semantic segmentation. First, we review the state-of-the-art co-clustering, co-segmentation and video segmentation techniques that aim to segment the set of images in a generic way, i.e. without considering semantics. Then, we propose a new architecture for the co-clustering framework that considers motion information and provides a multiresolution segmentation; it outperforms state-of-the-art techniques for generic multiview segmentation. Finally, the proposed multiview segmentation is combined with the semantic segmentation results, giving a method for automatic resolution selection and a coherent semantic multiview segmentation.
Visual Object Analysis using Regions and Local Features
1. Visual Object Analysis using Regions and Local Features
Carles Ventura Royo
Co-advisors
Xavier Giró i Nieto
Verónica Vilaplana Besler
Tutor
Ferran Marqués Acosta
2. Outline
• Introduction
• Part I: Context Analysis in semantic segmentation
• Part II: Multiresolution co-clustering for uncalibrated multiview segmentation
• Conclusions
3. Outline
• Introduction
• Part I: Context Analysis in semantic segmentation
  • Introduction
  • Related Work
  • Contributions
  • Experiments
  • Conclusions
• Part II: Multiresolution co-clustering for uncalibrated multiview segmentation
• Conclusions
4. Outline
• Introduction
• Part I: Context Analysis in semantic segmentation
• Part II: Multiresolution co-clustering for uncalibrated multiview segmentation
  • Introduction
  • Related Work
  • Contributions
  • Experiments
  • Conclusions
• Conclusions
12. Introduction: Local Features Aggregation
• Bag of Features (BoF) [1]
[Figure: local features are vector-quantized against a codebook to build the Bag of Features]
[1] G Csurka et al, Visual Categorization with Bags of Keypoints. ECCV’04
13. Introduction: Local Features Aggregation
• Pooling
First Order Average Pooling (O1P) [1]: $\frac{1}{N} \sum_{i=1}^{N} x_i$
Second Order Average Pooling (O2P) [2]: $\frac{1}{N} \sum_{i=1}^{N} x_i x_i^T$
$x_i$: local features
No need for a codebook, but high dimensionality
[1] Y Boureau et al, A Theoretical Analysis of Feature Pooling in Visual Recognition. ICML’10
[2] J Carreira et al, Semantic segmentation with second-order pooling. ECCV’12
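The two pooling operators above can be illustrated with a minimal pure-Python sketch. The function names and the toy feature values are assumptions made for illustration only; they do not come from the slides or from the cited implementations.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def first_order_average_pooling(features: List[Vector]) -> Vector:
    """O1P: the mean of the local feature vectors x_i."""
    n = len(features)
    dim = len(features[0])
    return [sum(x[d] for x in features) / n for d in range(dim)]


def second_order_average_pooling(features: List[Vector]) -> Matrix:
    """O2P: the mean of the outer products x_i x_i^T."""
    n = len(features)
    dim = len(features[0])
    pooled = [[0.0] * dim for _ in range(dim)]
    for x in features:
        for r in range(dim):
            for c in range(dim):
                pooled[r][c] += x[r] * x[c] / n
    return pooled


# Two toy 2-D local features (illustrative values, not from the slides).
local_features = [[1.0, 2.0], [3.0, 4.0]]
o1p = first_order_average_pooling(local_features)
o2p = second_order_average_pooling(local_features)
```

For d-dimensional local features, O1P yields a d-dimensional vector while O2P yields a symmetric d×d matrix, which is where the high dimensionality noted on the slide comes from.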
16. Introduction: Context
• Semantic context [1,2]
• Spatial context
GOAL: Analyze the influence of the spatial context in object recognition
[1] M Bar, Visual Objects in Context. Nature Reviews Neuroscience 2004
[2] A Rabinovich et al, Objects in Context. ICCV’07
18. Related Work: Ideal scenario
[Figure: ground-truth object location]
Conclusion: Aggregating the local features over three region pools (interior, border and surround) increases the performance [1]
[1] J.R.R. Uijlings et al., The Visual Extent of an Object. IJCV’12
19. Related Work: Realistic scenario
• Pipeline [1]
Input image → Generate object candidates → Rank object candidates → Predict class scores → Aggregate high-rank candidates → Semantic partition
[1] J Carreira et al, Object Recognition as Ranking Holistic Figure-Ground Hypotheses. CVPR’10
20. Related Work: Realistic scenario
• How is each class predictor trained? [1]
An SVR is used to learn the function that predicts the overlap for each class
TRAINING DATA (descriptor → overlap score):
[O2PF_1 O2PG_1] → os_1
[O2PF_2 O2PG_2] → os_2
…
[O2PF_N O2PG_N] → os_N
SVR: os = f([O2PF O2PG])
[Figure: training candidates with overlap scores 0.8179, 0.6861, 0.9013, 0.7381, 0.7105, 0.6462]
GOAL: CHANGE SPATIAL CODIFICATION
[1] J Carreira et al, Semantic segmentation with second-order pooling. ECCV’12
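The training data described above pairs a concatenated descriptor with an overlap score. The sketch below shows how such pairs could be assembled, assuming that the overlap is the intersection over union between a candidate mask and the ground-truth mask, and that masks are stored as sets of pixel coordinates. The helper names and toy values are hypothetical, and the regressor itself is left out.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]


def overlap_score(candidate: Set[Pixel], ground_truth: Set[Pixel]) -> float:
    """Intersection over union of two pixel sets (assumed overlap measure)."""
    union = candidate | ground_truth
    if not union:
        return 0.0
    return len(candidate & ground_truth) / len(union)


def concatenate(o2p_figure: List[float], o2p_ground: List[float]) -> List[float]:
    """Build the [O2PF O2PG] descriptor by concatenation."""
    return o2p_figure + o2p_ground


# Toy ground-truth object: a 2x2 square (illustrative, not from the slides).
ground_truth = {(0, 0), (0, 1), (1, 0), (1, 1)}
candidates = [
    {(0, 0), (0, 1)},                                  # covers half the object
    {(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)},  # object plus two extra pixels
]
# Placeholder pooled descriptors; in practice these would be flattened O2P outputs.
figure_descriptors = [[0.1, 0.2], [0.3, 0.4]]
ground_descriptors = [[0.5, 0.6], [0.7, 0.8]]

inputs = [concatenate(f, g) for f, g in zip(figure_descriptors, ground_descriptors)]
targets = [overlap_score(c, ground_truth) for c in candidates]
```

A regressor such as an SVR would then be fitted on `inputs` and `targets`, one regressor per class.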
22. Contributions
• Figure-Border-Ground spatial pooling in the realistic scenario
TRAINING DATA (descriptor → overlap score):
[O2PF_1 O2PB_1 O2PG_1] → os_1
[O2PF_2 O2PB_2 O2PG_2] → os_2
…
[O2PF_N O2PB_N O2PG_N] → os_N
SVR: os = f([O2PF O2PB O2PG])
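To make the three region pools concrete, here is a minimal sketch of how a candidate mask could be split into Figure, Border and Ground. The slides do not specify how the Border is built, so this sketch assumes it is the band of non-object pixels within `radius` pixels of the object; the function name and the `radius` parameter are hypothetical.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]


def figure_border_ground(
    mask: List[List[int]], radius: int = 1
) -> Tuple[Set[Pixel], Set[Pixel], Set[Pixel]]:
    """Split an image grid into Figure, Border and Ground pixel sets.

    Assumption: the Border is every non-object pixel whose row and column
    are both within `radius` of some object pixel.
    """
    height, width = len(mask), len(mask[0])
    everything = {(r, c) for r in range(height) for c in range(width)}
    figure = {(r, c) for (r, c) in everything if mask[r][c]}
    border: Set[Pixel] = set()
    for (r, c) in figure:
        for dr in range(-radius, radius + 1):
            for dc in range(-radius, radius + 1):
                p = (r + dr, c + dc)
                if p in everything and p not in figure:
                    border.add(p)
    ground = everything - figure - border
    return figure, border, ground


# Toy 5x5 image with a single object pixel in the centre.
toy_mask = [[0] * 5 for _ in range(5)]
toy_mask[2][2] = 1
figure, border, ground = figure_border_ground(toy_mask, radius=1)
```

The local features falling in each of the three sets would then be pooled separately and concatenated into the [O2PF O2PB O2PG] descriptor.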
23. Contributions
• Contour-based spatial pyramid [1]: crown-based
TRAINING DATA (descriptor → overlap score):
[O2PF_1 O2PSR1_1 O2PSR2_1 O2PSR3_1 O2PSR4_1] → os_1
[O2PF_2 O2PSR1_2 O2PSR2_2 O2PSR3_2 O2PSR4_2] → os_2
…
[O2PF_N O2PSR1_N O2PSR2_N O2PSR3_N O2PSR4_N] → os_N
SVR: os = f([O2PF O2PSR1 O2PSR2 O2PSR3 O2PSR4])
[1] S Lazebnik et al, Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. CVPR’06
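The slide does not define the crown-based division, so the following sketch rests on an assumption: the four subregions are concentric crowns obtained by measuring, for each object pixel, its 4-connected distance to the outside of the region and binning that distance into four levels. The function name and the binning rule are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Pixel = Tuple[int, int]


def crown_subregions(region: Set[Pixel], levels: int = 4) -> List[Set[Pixel]]:
    """Split a region into concentric crowns, from the contour inward."""

    def neighbours(p: Pixel) -> List[Pixel]:
        r, c = p
        return [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]

    # Breadth-first search seeded with the contour pixels (distance 1).
    distance: Dict[Pixel, int] = {}
    queue = deque()
    for p in region:
        if any(q not in region for q in neighbours(p)):
            distance[p] = 1
            queue.append(p)
    while queue:
        p = queue.popleft()
        for q in neighbours(p):
            if q in region and q not in distance:
                distance[q] = distance[p] + 1
                queue.append(q)

    # Bin the distances into `levels` crowns of equal depth range.
    deepest = max(distance.values())
    crowns: List[Set[Pixel]] = [set() for _ in range(levels)]
    for p, d in distance.items():
        index = min(levels - 1, (d - 1) * levels // deepest)
        crowns[index].add(p)
    return crowns


# Toy region: an 8x8 square, which peels into four one-pixel-wide crowns.
square = {(r, c) for r in range(8) for c in range(8)}
crowns = crown_subregions(square)
crown_sizes = [len(crown) for crown in crowns]
```

Each crown would play the role of one of the SR1 to SR4 pools in the descriptor above.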
24. Contributions
• Contour-based spatial pyramid [1]: Cartesian-based
TRAINING DATA (descriptor → overlap score):
[O2PF_1 O2PSR1_1 O2PSR2_1 O2PSR3_1 O2PSR4_1] → os_1
[O2PF_2 O2PSR1_2 O2PSR2_2 O2PSR3_2 O2PSR4_2] → os_2
…
[O2PF_N O2PSR1_N O2PSR2_N O2PSR3_N O2PSR4_N] → os_N
SVR: os = f([O2PF O2PSR1 O2PSR2 O2PSR3 O2PSR4])
[1] S Lazebnik et al, Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. CVPR’06
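As an illustration of a Cartesian-based division, the sketch below splits an object region into four quadrants. It assumes the quadrants are defined by a horizontal and a vertical axis through the region centroid; the slides only state that the region is divided into four subregions, so the axis placement and the function name are hypothetical.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]


def cartesian_subregions(region: Set[Pixel]) -> List[Set[Pixel]]:
    """Split a region into four quadrants around its centroid.

    Quadrant order: top-left, top-right, bottom-left, bottom-right.
    """
    n = len(region)
    center_row = sum(r for r, _ in region) / n
    center_col = sum(c for _, c in region) / n
    quadrants: List[Set[Pixel]] = [set(), set(), set(), set()]
    for r, c in region:
        index = (2 if r >= center_row else 0) + (1 if c >= center_col else 0)
        quadrants[index].add((r, c))
    return quadrants


# Toy region: a 4x4 square, which splits into four 2x2 quadrants.
square = {(r, c) for r in range(4) for c in range(4)}
quadrants = cartesian_subregions(square)
quadrant_sizes = [len(q) for q in quadrants]
```

Pooling the local features of each quadrant separately would give the SR1 to SR4 terms of the descriptor.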
27. Experiments: Local Features Aggregation
• Pooling
First Order Average Pooling (O1P) [1]: $\frac{1}{N} \sum_{i=1}^{N} x_i$
Second Order Average Pooling (O2P) [2]: $\frac{1}{N} \sum_{i=1}^{N} x_i x_i^T$
$x_i$: local features
No need for a codebook, but high dimensionality
[1] Y Boureau et al, A Theoretical Analysis of Feature Pooling in Visual Recognition. ICML’10
[2] J Carreira et al, Semantic segmentation with second-order pooling. ECCV’12
28. Experiments
• Ideal scenario
• Train set: train11
• Test set: val11
Features   | F [1] | F-B  | F-G [1] | F-B-G
eSIFT [1]  | 63.9  | 66.2 | 66.4    | 68.6
eMSIFT [1] | 64.8  | 68.9 | 67.7    | 70.8
[1] J Carreira et al, Semantic segmentation with second-order pooling. ECCV’12
29. Experiments
• Ideal scenario
• Train set: train11
• Test set: val11
Configuration      | F [1] | F-B  | F-B-G
Non SP             | 64.8  | 68.9 | 70.8
Crown-based SP     | 68.7  | 71.1 | 71.7
Cartesian-based SP | 67.7  | 71.6 | 72.7
[1] J Carreira et al, Semantic segmentation with second-order pooling. ECCV’12
31. Experiments
• Realistic scenario (CPMC [1])
• Train set: train11
• Test set: val11
Figure            | SP (Figure) | Border | Ground | AAC
eSIFT             | -           | -      | eSIFT  | 28.6 [2]
eSIFT             | -           | eSIFT  | eSIFT  | 34.8
eSIFT+eMSIFT+eLBP | -           | -      | eSIFT  | 37.2 [2]
eSIFT             | eSIFT       | eSIFT  | eSIFT  | 37.4
eSIFT+eMSIFT+eLBP | eSIFT       | eSIFT  | eSIFT  | 39.6
[1] J Carreira et al, Constrained parametric min-cuts for automatic object segmentation. CVPR’10
[2] J Carreira et al, Semantic segmentation with second-order pooling. ECCV’12
32. Experiments
• Realistic scenario (CPMC [1])
• Train set: trainval11/12
• Test set: test11/12
Dataset | F-G [2] | F-B-G | SP(F)-B-G
VOC11   | 38.8    | 43.8  | 40.3
VOC12   | 39.9    | 42.2  | 40.8
[1] J Carreira et al, Constrained parametric min-cuts for automatic object segmentation. CVPR’10
[2] J Carreira et al, Semantic segmentation with second-order pooling. ECCV’12
33. Experiments
• Realistic scenario (MCG [1])
• Train set: train11
• Test set: val11
Object candidates | F-G [2] | F-B-G | SP(F)-B-G
CPMC              | 37.2    | 38.9  | 39.6
MCG               | 30.9    | 34.1  | 36.1
[1] P Arbeláez et al, Multiscale combinatorial grouping. CVPR’14
[2] J Carreira et al, Semantic segmentation with second-order pooling. ECCV’12
34. Experiments: Qualitative evaluation
[Figure: side-by-side F-G and F-B-G segmentation results; class labels shown include aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, horse and motorbike]
35. Experiments: Qualitative evaluation
[Figure: side-by-side F-G and F-B-G segmentation results; class labels shown include bottle, bus, cat, chair, cow, diningtable, dog, horse, motorbike, person, pottedplant, sheep, sofa, train and tvmonitor]
37. Conclusions
• Figure-Border-Ground spatial pooling improves the original Figure-Ground pooling in both ideal and realistic scenarios
• The Border region pool carries the richest contextual information
• The Cartesian-based spatial pyramid outperforms the crown-based spatial pyramid, but both of them may result in overfitting
• Both Figure-Border-Ground pooling and Cartesian-based spatial pyramid have been validated with MCG object candidates
• Published in ICIP’15
44. Related Work: Co-clustering framework [1,2]
• Objective: Find the clusters that define the coherent regions across the different views at multiple resolutions
[Figure: input images, leaves partitions, hierarchies and co-clustered partitions]
[1] D Glasner et al, Contour-based joint clustering of multiple segmentations. CVPR’11
[2] D Varas et al, Multiresolution hierarchy co-clustering for semantic segmentation in sequences with small variations. ICCV’15
45. Related Work: Co-clustering framework [1,2]
• Objective: Find the clusters that define the coherent regions across the different views
[Figure: leaves partitions and co-clustered partitions for view 1 and view 2]
[1] D Glasner et al, Contour-based joint clustering of multiple segmentations. CVPR’11
[2] D Varas et al, Multiresolution hierarchy co-clustering for semantic segmentation in sequences with small variations. ICCV’15
47. Related Work: Co-clustering framework
• How are the values of the boundary variables chosen?
[Figure: leaves partitions for view 1 and view 2]
INTRA INTERACTIONS: Q_{1,2}, Q_{1,3}, Q_{2,3}, Q_{4,5}, Q_{5,6}
INTER INTERACTIONS: Q_{1,4}, Q_{1,5}, Q_{2,4}, Q_{2,5}, Q_{3,6}
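The grouping of boundary variables into intra and inter interactions can be reproduced with a short sketch. It assumes that regions 1 to 3 belong to view 1 and regions 4 to 6 to view 2, which is inferred from the pairs listed on the slide rather than stated there; the function and variable names are hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def split_interactions(
    region_view: Dict[int, int], pairs: List[Pair]
) -> Tuple[List[Pair], List[Pair]]:
    """Separate region pairs into intra-view and inter-view interactions."""
    intra = [(a, b) for a, b in pairs if region_view[a] == region_view[b]]
    inter = [(a, b) for a, b in pairs if region_view[a] != region_view[b]]
    return intra, inter


# Assumed assignment of regions to views (regions 1-3 in view 1, 4-6 in view 2).
region_view = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2}
# The ten region pairs listed on the slide.
pairs = [(1, 2), (1, 3), (2, 3), (4, 5), (5, 6),
         (1, 4), (1, 5), (2, 4), (2, 5), (3, 6)]
intra, inter = split_interactions(region_view, pairs)
```

Under that assumption, the split matches the two groups shown on the slide.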
59. Contribution IV: Generic global co-clustering
• All co-clustered partitions resulting from the iterative architecture are fed into a global optimization
• The reduction in the number of regions makes the global optimization feasible
60. Contribution V: Semantic global co-clustering
• Semantic information is introduced in the global optimization
62. Contribution VI: Automatic resolution selection
• We propose a method that automatically selects the resolution that best fits the semantic information
[Figure: leaves partitions (view 1, view 2, …), multiresolution co-clustering, semantic partitions and the resulting single resolution co-clustering]
64. Contribution VII: Coherent semantic partitions
[Figure: state of the art [1] compared with our results]
[1] S Zheng et al, Conditional Random Fields as Recurrent Neural Networks. ICCV’15
66. Experiments: Dataset
• Multiview dataset [1]
[1] A Kowdle et al, Multiple view object cosegmentation using appearance and stereo cues. ECCV’12
67. Experiments: Generic co-clustering
Co-segmentation techniques
Video segmentation techniques
Co-clustering techniques
• I-1S: Motion-compensated one-step iterative (baseline)
• I-2S: Two-step iterative
• UCM+I-1S: First step is replaced by a cut from a hierarchical segmentation algorithm
• I-2S+GG: Two-step iterative followed by generic global optimization
76. Conclusions
• The use of motion cues significantly improved the performance
• The new resolution parameterization allowed us to have a more uniform distribution of resolutions
• The two-step architecture improved the performance of the original one-step architecture
• Although global optimization is now feasible, there is no clear gain for generic co-clustering. However, it is useful for semantic co-clustering.
• Applying the resolution selection technique causes a small decrease in performance
• Submitted to ECCV’16 (awaiting decision)
77. Future Work
• Extending experiments to video datasets
  • VSB100 (Video Segmentation Benchmark) [1]
  • Cityscapes [2]
• Extending experiments to calibrated scenarios
• Training end-to-end CNNs for multiview semantic segmentation
[1] F Galasso et al, A Unified Video Segmentation Benchmark: Annotation, Metrics and Analysis. ICCV’13
[2] M Cordts et al, The cityscapes dataset for semantic urban scene understanding. CVPR’16
79. Conclusions
• Results achieved in the first part by considering new spatial configurations are now obsolete after the outstanding results achieved by deep learning techniques.
• Results from deep learning techniques were used in the second part.
• The proposed multiresolution co-clustering has improved state-of-the-art results, but we should consider an end-to-end deep learning approach to achieve a more significant improvement.
• Semantic segmentation techniques evolve really fast, making this field very competitive and challenging.
80. Publications
• Related to the thesis
  • C. Ventura, D. Varas, X. Giro-i-Nieto, V. Vilaplana, F. Marques. Semantically driven multiresolution co-clustering for uncalibrated multiview segmentation. Submitted to the European Conference on Computer Vision (ECCV) 2016. Under review.
  • C. Ventura, X. Giro-i-Nieto, V. Vilaplana, K. McGuinness, F. Marques, Noel E O'Connor. Improving spatial codification in semantic segmentation. International Conference on Image Processing (ICIP) 2015.
  • C. Ventura. Visual object analysis using regions and interest points. ACM international conference on Multimedia 2013.
81. Publications
• Other publications:
  • K. McGuinness, E. Mohedano, Z. Zhang, F. Hu, R. Albatal, Cathal Gurrin, N.E O'Connor, A. F. Smeaton, A. Salvador, X. Giro-i-Nieto, C. Ventura. Insight Centre for Data Analytics (DCU) at TRECVid 2014: instance search and semantic indexing tasks. TRECVID Workshop 2014.
  • C. Ventura, V. Vilaplana, X. Giro-i-Nieto, F. Marques. Improving retrieval accuracy of Hierarchical Cellular Trees for generic metric spaces. Multimedia Tools and Applications, 2014.
  • C. Ventura, X. Giro-i-Nieto, V. Vilaplana, D. Giribet, E. Carasusan. Automatic keyframe selection based on mutual reinforcement algorithm. International Workshop on Content-Based Multimedia Indexing (CBMI) 2013.
  • C. Ventura, M. Tella-Amo, X. Giro-i-Nieto. UPC at MediaEval 2013 Hyperlinking Task. MediaEval 2013.
  • C. Ventura, M. Martos, X. Giro-i-Nieto, V. Vilaplana, F. Marques. Hierarchical navigation and visual search for video keyframe retrieval. International Conference on Multimedia Modeling 2012.
86. Related Work: Realistic scenario
[Figure: input image → object segment hypotheses → ranked object segment hypotheses (class independent) with object plausibility score]
Source: J. Carreira et al., Semantic segmentation with second-order pooling
87. Related Work: Realistic scenario
Predict overlap estimate of each segment to each object class and sort segments by maximal score
Aggregate high-rank segments
Source: J. Carreira et al., Semantic segmentation with second-order pooling
88. Related Work: Realistic scenario
[Figure: TRAINING DATA with overlap scores 0.8179, 0.6861, 0.9013, 0.7381, 0.7105, 0.6462; TEST DATA: ? 0.4905]
[1] J Carreira et al, Semantic segmentation with second-order pooling. ECCV’12
89. Related Work: Co-clustering framework
• What are the contour elements?
[Figure: leaves partitions for view 1 and view 2]
Which contour elements are considered to compute Q_{1,4}?
• Contour elements of R1
• Contour elements of R4
96. Related Work: Co-clustering framework
• Multiresolution parameterization
: Number of active contours to encode leave contours
: Maximum fraction to describe the r-th coarse level
: Maximum difference between consecutive levels
= 9 = 0.5 = 0.1
4.53.6
99. Contributions
• Semantic global co-clustering
1. Class assignment to regions
2. Similarity penalizations
• Regions from same partition with different classes
3. Optimization constraints
• Regions from same partition with same class
• Regions from different partitions with different class
100. Contribution VI: Automatic resolution selection
• Some applications require a single resolution
[Figure: resolution levels l1 and l2 with classes C1, C2 and C3; question "l1 or l2?", answer shown: l1]
102. Conclusions
• Multiresolution co-clustering framework for uncalibrated multiview sequences
  • Two-step architecture
  • Global optimization
• Semantic-based co-clustering with resolution selection
• Submitted to ECCV’16 (awaiting decision)
103. Conclusions
• Part I: Improving spatial codification in semantic segmentation
  • Figure-Border-Ground in realistic scenario
  • Contour-based spatial pyramid
• Part II: Multiresolution co-clustering for uncalibrated multiview segmentation
  • Results from Part I are replaced by SoA deep learning techniques
  • Generic co-clustering for multiview sequences
  • Semantic co-clustering for multiview sequences