Dividing and Aggregating Network for Multi-view Action Recognition [Poster at ECCV 2018]
Dongang Wang¹, Wanli Ouyang¹,², Wen Li³, and Dong Xu¹
¹ School of Electrical and Information Engineering, The University of Sydney
² SenseTime Computer Vision Research Group, The University of Sydney
³ Computer Vision Laboratory, ETH Zurich
Problem
❑ Training: labeled videos from multiple views.
❑ Test: new samples from the known views (the cross-subject setting).
❑ Test: a new sample from an unknown view (the cross-view setting).
Motivation
❑ It is well known that feature variations caused by viewpoint changes can hurt classification accuracy.
❑ We want to learn view-specific representations instead of extracting view-invariant features with global codebooks or dictionaries.
❑ The view-specific features can help each other, because different feature extractors may have different activation areas.
Training details
❑ Backbone: temporal segment network (TSN) [1].
❑ Training has two stages. Stage 1: train the basic multi-branch module, i.e., the feature extractors for each view. Stage 2: fine-tune the extractors after adding the view classifier and the message passing modules (a minimal sketch of this schedule follows the list).
❑ For the cross-subject setting, the number of branches equals the total number of views. For the cross-view setting, the number of branches equals the total number of views minus one, since one view is held out for testing.
❑ Each branch duplicates part of inception_5b.
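To make the two-stage schedule concrete, the PyTorch sketch below builds one optimizer per stage. It is a minimal sketch under our own assumptions: the `shared` and `branches` submodule names follow the multi-branch sketch in the Modules section below, and the learning rates are illustrative, not the authors' settings.

```python
# A hedged sketch of the two-stage training schedule, assuming a model that
# exposes `shared` (the shared CNN) and `branches` (the per-view extractors).
# Hyperparameters are placeholders, not the authors' values.
from torch.optim import SGD

def build_stage_optimizers(model):
    # Stage 1: train only the shared CNN and the per-view feature extractors.
    stage1 = SGD(
        list(model.shared.parameters()) + list(model.branches.parameters()),
        lr=1e-3, momentum=0.9,
    )
    # Stage 2: after adding the view classifier and message-passing modules,
    # fine-tune the whole network at a lower learning rate.
    stage2 = SGD(model.parameters(), lr=1e-4, momentum=0.9)
    return stage1, stage2
```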
Modules
❑ Basic Multi-branch Module
This part extracts view-independent features with the shared CNN and then extracts view-specific features in each CNN branch (a structural sketch follows). It is trained first, so that each branch acquires the basic knowledge of its view.
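The sketch below illustrates this shared-trunk-plus-branches layout in PyTorch. The layer sizes are placeholders standing in for the BN-Inception blocks (inception_5a/5b) used in the poster; treat it as a structural sketch, not the authors' implementation.

```python
# A structural sketch of the basic multi-branch module: a shared CNN yields
# view-independent features; per-view branches (duplicates of the last block,
# as with inception_5b in the poster) yield view-specific features f_v.
import copy
import torch.nn as nn

class MultiBranchNet(nn.Module):
    def __init__(self, num_views: int, feat_dim: int = 128):
        super().__init__()
        # Stands in for the TSN trunk up to the inception_5a output.
        self.shared = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        # One branch per view, each a duplicate of the same head block.
        head = nn.Sequential(nn.Flatten(), nn.Linear(32 * 4 * 4, feat_dim), nn.ReLU())
        self.branches = nn.ModuleList(copy.deepcopy(head) for _ in range(num_views))

    def forward(self, x):
        shared = self.shared(x)                              # view-independent feature
        return [branch(shared) for branch in self.branches]  # view-specific features f_v
```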
❑ Message Passing Module
Treating the view-specific feature as f_v and the refined feature as h_v for each view v, we model their relationship with a conditional random field (CRF). The solution is

h_v = f_v + \sum_{u \neq v} W_{u,v} h_u,

where the W_{u,v}'s are the parameters of fully connected layers learned between each pair of branches. We implement the message passing with two fully connected layers (a sketch follows).
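The snippet below is a hedged sketch of this update rule. For brevity it models each message W_{u,v} as a single linear map (the poster uses two fully connected layers), and the iteration count and all names are our own assumptions.

```python
# A sketch of the message-passing update h_v = f_v + sum_{u != v} W_{u,v} h_u.
import torch
import torch.nn as nn

def message_passing(features, pairwise, num_iters=2):
    """features: list of V tensors f_v, each of shape (batch, dim).
    pairwise: dict mapping (u, v), u != v, to an nn.Linear acting as W_{u,v}."""
    V = len(features)
    refined = list(features)  # initialize each h_v with f_v
    for _ in range(num_iters):
        refined = [
            features[v] + sum(pairwise[(u, v)](refined[u]) for u in range(V) if u != v)
            for v in range(V)
        ]
    return refined

# Example: three views with 128-dimensional features.
V, dim = 3, 128
pairwise = {(u, v): nn.Linear(dim, dim, bias=False)
            for u in range(V) for v in range(V) if u != v}
h = message_passing([torch.randn(4, dim) for _ in range(V)], pairwise)
```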
❑ View-prediction-guided Fusion Module
This module contains two stages. First, we combine the scores from all the view-specific classifiers C_{u,v} for the i-th video to form the view-specific scores

S_v^i = \frac{1}{V} \sum_{u=1}^{V} C_{u,v}(h_u^i).

Then we use the view prediction scores to generate the final action classification score

T^i = \sum_{v=1}^{V} p_v^i S_v^i,

where the p_v^i's are the view classification scores for the i-th video, generated from the view-independent features (a sketch of this fusion follows).
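The function below is a minimal sketch of this fusion under our own tensor layout: a (V, V, num_classes) array of classifier outputs and a length-V vector of view prediction scores. Names and shapes are assumptions for illustration.

```python
# A sketch of view-prediction-guided fusion:
# S_v = (1/V) * sum_u C_{u,v}(h_u), then T = sum_v p_v * S_v.
import torch

def fuse(class_scores: torch.Tensor, view_probs: torch.Tensor) -> torch.Tensor:
    """class_scores: (V, V, num_classes); entry [u, v] holds C_{u,v}(h_u).
    view_probs: (V,); the view prediction scores p_v (e.g., a softmax output)."""
    S = class_scores.mean(dim=0)               # (V, num_classes): scores S_v
    T = (view_probs[:, None] * S).sum(dim=0)   # (num_classes,): final score T
    return T

# Example with 3 views and 60 action classes (as in NTU RGB+D).
T = fuse(torch.randn(3, 3, 60), torch.softmax(torch.randn(3), dim=0))
```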
References
[1] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Van Gool, L.: Temporal segment networks: towards good practices for deep action recognition. In: ECCV 2016
[2] Shahroudy, A., Liu, J., Ng, T.T., Wang, G.: NTU RGB+D: A large scale dataset for 3D human activity analysis. In: CVPR 2016
[3] Wang, J., Nie, X., Xia, Y., Wu, Y., Zhu, S.C.: Cross-view action modeling, learning and recognition. In: CVPR 2014
Contribution
❑ We design a multi-branch network for multi-view action recognition. The network is trained on RGB videos and the extracted dense optical flow, following the two-stream CNN scheme of TSN [1].
❑ A conditional random field (CRF) is introduced to pass messages among the view-specific features from different branches.
❑ A view-prediction-guided fusion method is proposed to combine the action classification scores from multiple branches, using the view prediction scores as the combination weights.
Experiment results
Accuracy on the NTU RGB+D dataset [2]:

Methods          Modalities  Cross-Subject  Cross-View
STA-Hands        Pose+RGB    82.50%         88.60%
Baradel et al.   Pose+RGB    84.80%         90.60%
TSN              RGB         84.93%         85.36%
DA-Net (Ours)    RGB         88.12%         91.96%

Accuracy on the Northwestern-UCLA Multiview Action dataset [3]:

Methods          Cross-Subject  Cross-View
MST-AOG          81.6%          73.3%
Kong et al.      81.1%          77.2%
TSN              90.3%          80.6%
DA-Net (Ours)    92.1%          84.2%

We also performed an ablation study of each module on the NTU RGB+D dataset under the cross-view setting:

Method              RGB-stream  Flow-stream  Two-stream
Basic multi-branch  73.9%       87.7%        89.8%
DA-Net (w/o msg.)   74.1%       88.4%        90.7%
DA-Net (w/o fus.)   74.5%       88.6%        90.9%
DA-Net              75.3%       88.9%        92.0%
[Figure: the view-prediction-guided fusion module. View-specific scores S_1, ..., S_V from the classifiers C_{u,v} are weighted by the view prediction scores p_1, ..., p_V to produce the final action class score Y.]
[Figure: overview of DA-Net. Multi-view input videos pass through a shared CNN (up to the inception_5a output) into per-view CNN branches, each duplicating inception_5b (1x1 and 3x3 convolutions plus pooling). Message passing between branches (e.g., from branches A and C to branch B) produces refined view-specific features; a view classifier and the view-specific classifiers (u, v) feed the score fusion that outputs the final action class score Y.]