This master's thesis proposes a recurrent architecture to perform instance segmentation on images and videos using linguistic referring expressions. The model encodes referring expressions with BERT and visual features with an RVOS encoder, then decodes masks with a recurrent fusion of language and vision. Experiments on the RefCOCO dataset show the model outperforms baselines when incorporating referring expressions. The architecture is extended to video by adding temporal recurrence to an MAttNet+RVOS baseline, demonstrating promising initial results on the DAVIS dataset.
Recurrent Instance Segmentation with Linguistic Referring Expressions
1. Recurrent Instance Segmentation with Linguistic Referring Expressions
Alba María Herrera Palacio
ADVISORS:
Xavier Giró-i-Nieto
Carles Ventura
Carina Silberer
MASTER THESIS DISSERTATION, September 2019
3. INTRODUCTION | MOTIVATION
Natural Language Expressions
PREVIOUS WORK [1]
[1] A. Khoreva et al., Video Object Segmentation with Language Referring Expressions. ACCV 2018
4. INTRODUCTION | VIDEO OBJECT SEGMENTATION
One-shot RVOS [2]
Referring expression: “the woman”
[Diagram: the segmentation model is applied frame by frame over time, initialised either one-shot (RVOS) or with a referring expression]
[2] C. Ventura et al., RVOS: End-to-End Recurrent Network for Video Object Segmentation. CVPR 2019
5. INTRODUCTION | IMAGE SEGMENTATION WITH REFERRING EXPRESSIONS
REFERRING EXPRESSIONS
- Word or phrase
- Unambiguous
- Any form of linguistic description
Examples: “male reading book with fanny pack on waist”, “far right girl white shirt”, “woman with phone”, “left woman in blue”
7. METHODOLOGY | GENERAL RECURRENT ARCHITECTURE
[Diagram: a Referring Expressions Encoder and an Image Encoder produce a Joint Representation, which a Mask Decoder turns into a segmentation mask]
“male reading book with fanny pack on waist”
“far right girl white shirt”
“woman with phone”
“left woman in blue”
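The fusion in the diagram can be sketched in a few lines. This is a minimal, illustrative sketch only: it assumes a sentence-level language embedding (e.g. a 768-d BERT vector) is tiled over the spatial grid and concatenated with the convolutional feature map to form the joint representation; the dimensions and the exact fusion operation used in the thesis may differ.

```python
import numpy as np

# Assumed dimensions (illustrative): BERT sentence embedding and a
# small convolutional feature map from the image encoder.
LANG_DIM, FEAT_DIM, H, W = 768, 256, 8, 8

def joint_representation(lang_emb, visual_feats):
    """Tile the language embedding over the H x W grid and concatenate
    it channel-wise with the visual features (a common fusion scheme)."""
    lang_map = np.broadcast_to(lang_emb[:, None, None], (LANG_DIM, H, W))
    return np.concatenate([visual_feats, lang_map], axis=0)

lang = np.random.rand(LANG_DIM)          # e.g. a BERT sentence embedding
feats = np.random.rand(FEAT_DIM, H, W)   # e.g. image encoder output
joint = joint_representation(lang, feats)
print(joint.shape)  # (1024, 8, 8)
```

The mask decoder would then operate on this (FEAT_DIM + LANG_DIM)-channel map, so every spatial location sees both the local visual evidence and the full referring expression.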
21. METHODOLOGY | VIDEO BASELINE ARCHITECTURE
MAttNet [3] + RVOS
[3] Licheng Yu et al., MAttNet: Modular Attention Network for Referring Expression Comprehension. CVPR 2018
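The extension from images to video rests on one idea: carry a recurrent state across frames so each mask prediction can use the previous frame's context. The toy sketch below only illustrates that control flow; the state update and the thresholding "decoder" are made up for the example and stand in for the actual MAttNet + RVOS components.

```python
import numpy as np

def segment_sequence(frames, lang_emb):
    """Illustrative temporal recurrence: a hidden state h propagates
    from frame to frame, conditioned on the language embedding."""
    h = np.zeros_like(lang_emb)          # recurrent state across frames
    masks = []
    for frame in frames:
        # toy update: mix previous state, language and frame statistics
        h = np.tanh(0.5 * h + 0.5 * frame.mean() * lang_emb)
        # toy decoder: threshold the frame, shifted by the state's mean
        mask = (frame > frame.mean() + 0.1 * h.mean()).astype(np.uint8)
        masks.append(mask)
    return masks

frames = [np.random.rand(8, 8) for _ in range(3)]  # dummy video clip
masks = segment_sequence(frames, np.random.rand(4))
print(len(masks), masks[0].shape)  # 3 (8, 8)
```

In the real architecture the state would live inside the recurrent decoder, but the per-frame loop with a carried hidden state is the same.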
22. EXPERIMENTS | VIDEO DATASET
DAVIS 2017
+ EXPRESSIONS BY KHOREVA [1]
[1] A. Khoreva et al., Video Object Segmentation with Language Referring Expressions. ACCV 2018
23. EXPERIMENTS | VIDEO QUALITATIVE RESULTS
REFERRING EXPRESSIONS:
"a brown deer on the left"
"a brown deer on the right with branched horns"
[Frames 0, 2 and 4 with masks predicted by MAttNet + RVOS]
24. EXPERIMENTS | ADDITIONAL RESULTS
QUANTITATIVE
FAILURE CASE (MAttNet)
[First frame | ground truth | MAttNet mask]
REFERRING EXPRESSIONS:
"a white golf car"
"a man in a black tshirt"
"two golf sticks"
26. FUTURE WORK | FIRST STEPS
[Diagram: adding temporal recurrence on top of the existing spatial recurrence]
27. CONCLUSIONS | THESIS SCOPE
1. Follows a global trend of solving multimodal tasks with deep neural networks.
2. Of interest to both the computer vision and natural language processing communities.
3. Promising results show that the architecture learns to exploit the language information.
4. Experiments compare the referring-expression approach against single-modality baselines.
5. The image architecture is designed to extend to video by adding temporal recurrence and training on video data.
28. CONCLUSIONS | VIDEO BASELINE PUBLICATION
Workshop on Multimodal Understanding and Learning for Embodied Applications
29. Thank you for your attention!
Do you have any questions? Feel free to ask!
Special thanks to my advisors and the UPF COLT group members
for their support.