This master's thesis proposes a recurrent architecture to perform instance segmentation on images and videos using linguistic referring expressions. The model encodes referring expressions with BERT and visual features with an RVOS encoder, then decodes masks with a recurrent fusion of language and vision. Experiments on the RefCOCO dataset show the model outperforms baselines when incorporating referring expressions. The architecture is extended to video by adding temporal recurrence to an MAttNet+RVOS baseline, demonstrating promising initial results on the DAVIS dataset.
Recurrent Instance Segmentation with Linguistic Referring Expressions
1. Recurrent Instance Segmentation with Linguistic Referring Expressions
Alba María Herrera Palacio
ADVISORS:
Xavier Giró-i-Nieto
Carles Ventura
Carina Silberer
MASTER THESIS DISSERTATION, September 2019
3. INTRODUCTION | MOTIVATION
Natural Language Expressions
PREVIOUS WORK [1]
[1] A. Khoreva et al., Video Object Segmentation with Language Referring Expressions. ACCV 2018
4. INTRODUCTION | VIDEO OBJECT SEGMENTATION
One-shot RVOS [2]
Referring expression: “the woman”
[Diagram: the segmentation model is applied frame by frame over time, initialised either one-shot (RVOS) or with a referring expression]
[2] C. Ventura et al., RVOS: End-to-End Recurrent Network for Video Object Segmentation. CVPR 2019
5. INTRODUCTION | IMAGE SEGMENTATION WITH REFERRING EXPRESSIONS
REFERRING EXPRESSIONS
- Word or phrase
- Unambiguous
- Any form of linguistic description
Examples: “male reading book with fanny pack on waist”, “far right girl white shirt”, “woman with phone”, “left woman in blue”
7. METHODOLOGY | GENERAL RECURRENT ARCHITECTURE
[Diagram: a Referring Expressions Encoder and an Image Encoder produce a Joint Representation, which a Mask Decoder turns into a segmentation mask]
“male reading book with fanny pack on waist”
“far right girl white shirt”
“woman with phone”
“left woman in blue”
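The fusion in the diagram can be sketched in a few lines. This is a minimal, illustrative sketch only: it assumes a sentence-level language embedding (e.g. a 768-d BERT vector) is tiled over the spatial grid and concatenated with the convolutional feature map to form the joint representation; the dimensions and the exact fusion operation used in the thesis may differ.

```python
import numpy as np

# Assumed dimensions (illustrative): BERT sentence embedding and a
# small convolutional feature map from the image encoder.
LANG_DIM, FEAT_DIM, H, W = 768, 256, 8, 8

def joint_representation(lang_emb, visual_feats):
    """Tile the language embedding over the H x W grid and concatenate
    it channel-wise with the visual features (a common fusion scheme)."""
    lang_map = np.broadcast_to(lang_emb[:, None, None], (LANG_DIM, H, W))
    return np.concatenate([visual_feats, lang_map], axis=0)

lang = np.random.rand(LANG_DIM)          # e.g. a BERT sentence embedding
feats = np.random.rand(FEAT_DIM, H, W)   # e.g. image encoder output
joint = joint_representation(lang, feats)
print(joint.shape)  # (1024, 8, 8)
```

The mask decoder would then operate on this (FEAT_DIM + LANG_DIM)-channel map, so every spatial location sees both the local visual evidence and the full referring expression.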
21. METHODOLOGY | VIDEO BASELINE ARCHITECTURE
MAttNet [3] + RVOS
[3] Licheng Yu et al., MAttNet: Modular Attention Network for Referring Expression Comprehension. CVPR 2018
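The extension from images to video rests on one idea: carry a recurrent state across frames so each mask prediction can use the previous frame's context. The toy sketch below only illustrates that control flow; the state update and the thresholding "decoder" are made up for the example and stand in for the actual MAttNet + RVOS components.

```python
import numpy as np

def segment_sequence(frames, lang_emb):
    """Illustrative temporal recurrence: a hidden state h propagates
    from frame to frame, conditioned on the language embedding."""
    h = np.zeros_like(lang_emb)          # recurrent state across frames
    masks = []
    for frame in frames:
        # toy update: mix previous state, language and frame statistics
        h = np.tanh(0.5 * h + 0.5 * frame.mean() * lang_emb)
        # toy decoder: threshold the frame, shifted by the state's mean
        mask = (frame > frame.mean() + 0.1 * h.mean()).astype(np.uint8)
        masks.append(mask)
    return masks

frames = [np.random.rand(8, 8) for _ in range(3)]  # dummy video clip
masks = segment_sequence(frames, np.random.rand(4))
print(len(masks), masks[0].shape)  # 3 (8, 8)
```

In the real architecture the state would live inside the recurrent decoder, but the per-frame loop with a carried hidden state is the same.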
22. EXPERIMENTS | VIDEO DATASET
DAVIS 2017
+ EXPRESSIONS BY KHOREVA [1]
[1] A. Khoreva et al., Video Object Segmentation with Language Referring Expressions. ACCV 2018
23. EXPERIMENTS | VIDEO QUALITATIVE RESULTS
REFERRING EXPRESSIONS:
"a brown deer on the left"
"a brown deer on the right with branched horns"
[Frames 0, 2 and 4 with masks predicted by MAttNet + RVOS]
24. EXPERIMENTS | ADDITIONAL RESULTS
QUANTITATIVE
FAILURE CASE (MAttNet)
[First frame | ground truth | MAttNet mask]
REFERRING EXPRESSIONS:
"a white golf car"
"a man in a black tshirt"
"two golf sticks"
26. FUTURE WORK | FIRST STEPS
[Diagram: adding temporal recurrence on top of the existing spatial recurrence]
27. CONCLUSIONS | THESIS SCOPE
1. Follows a global trend of solving multimodal tasks with deep neural networks.
2. Of interest to both the computer vision and natural language processing communities.
3. Promising results show that the architecture learns to exploit the language information.
4. Experiments compare the referring-expression approach against single-modality baselines.
5. The image architecture is designed to extend to video by adding temporal recurrence and training on video data.
28. CONCLUSIONS | VIDEO BASELINE PUBLICATION
Workshop on Multimodal Understanding and Learning for Embodied Applications
29. Thank you for your attention!
Do you have any questions? Feel free to ask!
Special thanks to my advisors and the UPF COLT group members
for their support.