Step zhedong

1
STEP: Spatio-Temporal Progressive Learning
for Video Action Detection
Xitong Yang1,2 Xiaodong Yang2 Ming-Yu Liu2
Fanyi Xiao2,3 Larry Davis1 Jan Kautz2
1University of Maryland, College Park 2NVIDIA 3University of California, Davis

2
About Me (Xitong Yang, 杨希桐)
► Education
► 2016 – Present: Ph.D., University of Maryland, College Park; Prof. Larry Davis
► 2014 – 2016: M.S., University of Rochester; Prof. Jiebo Luo
► 2010 – 2014: B.E., Beijing Institute of Technology
► Internship
► 2018, 2019: NVIDIA; Xiaodong Yang, Ming-Yu Liu, Sifei Liu, Jan Kautz
► 2017: Honda Research Institute; Yi-Ting Chen, Teruhisa Misu
► 2016: PARC East; Sriganesh Madhvanath, Raja Bala
► Research Interest
► Computer vision, video understanding

3
Spatio-temporal Action Detection
Time
LongJump

4
Object Detection
► Two-stage methods
► Fast / Faster R-CNN
► One-stage methods
► SSD
Faster R-CNN (Ren et al, NeurIPS 2015)
SSD (Liu et al, ECCV 2016)

5
Object Detection Pipeline
source: https://www.saagie.com/fr/blog/object-detection-part1
Proposals / Anchors
Classification: object recognition
Regression: bounding box refinement
Post-processing

6
From Object Detection to Action Detection
► Use optical flow as additional input
► From frame-level prediction to clip-level prediction
► Process long sequences (use 3D CNNs)
► Replicate 2D proposals over time to obtain 3D proposals
Two-stream R-CNN (Peng et al, ECCV 2016)
Kalogeiton et al, ICCV 2017
I3D + Faster R-CNN (Girdhar et al, 2018)
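The last bullet above, replicating 2D proposals over time to obtain 3D proposals, can be sketched in a few lines. This is a minimal illustration, not code from any of the cited detectors; the function name `replicate_boxes` and the `[x1, y1, x2, y2]` box layout are assumptions made for the example.

```python
import numpy as np

def replicate_boxes(boxes_2d, num_frames):
    """Tile (N, 4) boxes [x1, y1, x2, y2] along a new time axis.

    Returns an array of shape (N, num_frames, 4): every frame of a
    tubelet starts with the same box, so the actor is implicitly
    assumed not to move within the clip.
    """
    boxes_2d = np.asarray(boxes_2d, dtype=np.float32)
    return np.repeat(boxes_2d[:, None, :], num_frames, axis=1)

# Two hypothetical frame-level proposals, replicated over a 6-frame clip.
proposals = np.array([[10, 20, 50, 80], [0, 0, 32, 32]])
tubelets = replicate_boxes(proposals, num_frames=6)
print(tubelets.shape)
```

Because every frame shares one box, such a proposal cannot follow an actor who moves during the clip, which is the spatial displacement problem raised on the following slides.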

8
Challenges
Time
► Extended two-stage methods
✕ Effective temporal modeling
► Spatial displacement over time

10
Challenges
✕ Efficient detection
► Thousands of proposals
► Processing long sequences

11
Spatio-TEmporal Progressive Learning
(STEP)

12
What is STEP
► Goals of STEP
✓ Effective temporal modeling
► Adapt to spatial displacement
✓ Efficient detection
► Use a small number of proposals

13
What is STEP
► STEP = progressive learning + spatial refinement + temporal extension
Figure legend: Initial Proposal, Refined Tubelet, Extended Tubelet (axes: Step, Time). Successive builds highlight progressive learning, spatial refinement, and temporal extension.

17
Our Approach: STEP
s=1: anchors

21
Our Approach: STEP
s=1: temporal extension

23
Our Approach: STEP
s=1: spatial refinement

32
Our Approach: STEP
► STEP
✓ Effective temporal modeling
► Adaptive temporal extension
✓ Efficient detection
► Use only 11 proposals on UCF101-24 and 34 on AVA
► Progressively increase the sequence length
✓ Generic learning framework for video understanding
► Instantiate with different backbones / refinement schedule
Figure legend: Initial Proposal, Refined Tubelet, Extended Tubelet (axes: Step, Time)

33
Related Work: Iterative Methods in Vision
Iterative pose estimation (Carreira et al, CVPR16)
Object detection: Grid-CNN (Najibi et al, CVPR16)
Recurrent image generation: DRAW (Gregor et al, ICML15)
Object detection: Cascade R-CNN (Cai et al, CVPR18)

34
Model Details
► Spatial refinement
► Two branches for classification & regression
Figure labels: Convolutional Features, Proposals, RoI Pool, Global Branch, Local Branch, Temporal Modeling, Classification, Regression
Action detection
Classification
• Temporal information
• Context
• Interaction
• …
Regression
• Precise localization
• Bounding box of the actor
• …
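The regression branch refines each box of a tubelet. The slides do not give the box parameterization, so the sketch below assumes the common R-CNN-style offsets `(dx, dy, dw, dh)` relative to the box center and size; the function name `apply_deltas` is hypothetical.

```python
import numpy as np

def apply_deltas(boxes, deltas):
    """Refine boxes [x1, y1, x2, y2] with offsets [dx, dy, dw, dh].

    Assumed parameterization (R-CNN style, not confirmed by the slides):
    the center shifts by (dx * w, dy * h) and the size scales by
    (exp(dw), exp(dh)). Works on any leading shape, e.g. (K, 4).
    """
    boxes = np.asarray(boxes, dtype=np.float64)
    deltas = np.asarray(deltas, dtype=np.float64)
    w = boxes[..., 2] - boxes[..., 0]
    h = boxes[..., 3] - boxes[..., 1]
    cx = boxes[..., 0] + 0.5 * w
    cy = boxes[..., 1] + 0.5 * h
    new_cx = cx + deltas[..., 0] * w
    new_cy = cy + deltas[..., 1] * h
    new_w = w * np.exp(deltas[..., 2])
    new_h = h * np.exp(deltas[..., 3])
    return np.stack(
        [new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
         new_cx + 0.5 * new_w, new_cy + 0.5 * new_h],
        axis=-1,
    )

# A 3-frame tubelet whose boxes start identical; per-frame offsets
# shift the later frames to the right to follow a moving actor.
tubelet = np.tile([10.0, 10.0, 30.0, 50.0], (3, 1))
deltas = np.zeros((3, 4))
deltas[:, 0] = [0.0, 0.1, 0.2]
refined = apply_deltas(tubelet, deltas)
print(refined)
```

Predicting a separate offset for each frame is what lets a refined tubelet adapt to spatial displacement instead of staying a replicated box.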

35
Model Details
► Spatial refinement
► Two branches for classification & regression
► Temporal extension
► Linear extrapolation / location anticipation
Figure labels: Convolutional Features, Proposals, RoI Pool, Global Branch, Local Branch, Temporal Modeling, Classification, Regression
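The linear extrapolation option for temporal extension can be illustrated as follows. This is a sketch under stated assumptions: each box coordinate is fitted with a least-squares line over the frame index and evaluated on extra frames at both ends. The slides do not specify the fitting procedure, and the function name `extrapolate_tubelet` is hypothetical.

```python
import numpy as np

def extrapolate_tubelet(tubelet, extend):
    """Extend a (K, 4) tubelet by `extend` frames on each side.

    Assumption: every coordinate moves linearly in time, so a line is
    fitted per coordinate and evaluated outside the observed range.
    """
    tubelet = np.asarray(tubelet, dtype=np.float64)
    k = tubelet.shape[0]
    t = np.arange(k)
    # coeffs has shape (2, 4): row 0 holds slopes, row 1 intercepts.
    coeffs = np.polyfit(t, tubelet, deg=1)
    t_before = np.arange(-extend, 0)
    t_after = np.arange(k, k + extend)
    before = np.outer(t_before, coeffs[0]) + coeffs[1]
    after = np.outer(t_after, coeffs[0]) + coeffs[1]
    return np.concatenate([before, tubelet, after], axis=0)

# A 3-frame tubelet drifting 2 pixels to the right per frame.
tubelet = np.array([[10, 20, 50, 80],
                    [12, 20, 52, 80],
                    [14, 20, 54, 80]])
extended = extrapolate_tubelet(tubelet, extend=2)
print(extended.shape)
```

Location anticipation, the other option listed, would replace the fitted line with a learned prediction of the boxes in the adjacent frames; it is not sketched here.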

36
Model Details
► Progressive learning
► Joint training
Figure labels: Backbone, Proposals, RoI Pool, S1, P1, L0, L1, Classification, Regression (axis: Time)

37
Model Details
Figure labels: Backbone, RoI Pool, S1, S2, T1, P1, P2, L0, L1, L2 (axis: Time)
► Joint training

38
Model Details
Figure labels: Backbone, RoI Pool, S1, S2, S3, T1, T2, P1, P2, P3, L0, L1, L2, L3 (axis: Time)
► Joint training
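The diagram labels on this slide (S1 to S3, T1, T2) suggest three refinement stages with a temporal extension between consecutive stages. The control flow below is a sketch of that reading only; `progressive_detect`, `toy_refine`, and `toy_extend` are hypothetical names, and the toy functions stand in for the learned modules.

```python
def progressive_detect(proposals, refine_fns, extend_fn):
    """Run one refinement per step, extending tubelets between steps.

    proposals:  list of tubelets, each a list of [x1, y1, x2, y2] boxes.
    refine_fns: one callable per step (assumed reading of S1, S2, S3).
    extend_fn:  callable applied before every step except the first
                (assumed reading of T1, T2).
    """
    tubelets = list(proposals)
    for step, refine in enumerate(refine_fns):
        if step > 0:
            tubelets = [extend_fn(t) for t in tubelets]
        tubelets = [refine(t) for t in tubelets]
    return tubelets

def toy_refine(tubelet):
    # Placeholder: a real model would regress new boxes here.
    return [list(box) for box in tubelet]

def toy_extend(tubelet):
    # Placeholder: pad one frame on each side by copying the end boxes.
    return [tubelet[0]] + tubelet + [tubelet[-1]]

initial = [[[0, 0, 10, 10], [0, 0, 10, 10]]]  # one 2-frame proposal
out = progressive_detect(initial, [toy_refine] * 3, toy_extend)
print(len(out[0]))  # tubelet length grows 2 -> 4 -> 6
```

The number of tubelets stays fixed across steps while their length grows, so later steps see more temporal context without adding proposals.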

39
Model Details
► The problem of distribution shift over different steps
► Our training strategies
► Increasing IoU thresholds for 3 steps (0.2 → 0.35 → 0.5)
► Separate header networks for different steps
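The increasing thresholds can be read as a per-step rule for labeling a tubelet as a positive training sample. The sketch below assumes that tubelet overlap is the mean per-frame IoU, which the slides do not state; the names `box_iou`, `tubelet_iou`, and `is_positive` are hypothetical.

```python
STEP_IOU_THRESHOLDS = (0.2, 0.35, 0.5)  # values from the slide

def box_iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def tubelet_iou(tubelet, gt_tubelet):
    """Assumed overlap measure: mean IoU over aligned frames."""
    ious = [box_iou(p, g) for p, g in zip(tubelet, gt_tubelet)]
    return sum(ious) / len(ious)

def is_positive(tubelet, gt_tubelet, step):
    """Label a tubelet positive using the threshold of the given step (1-based)."""
    return tubelet_iou(tubelet, gt_tubelet) >= STEP_IOU_THRESHOLDS[step - 1]

# A coarse tubelet with IoU 1/3 against the ground truth counts as
# positive at step 1 but not at the stricter steps 2 and 3.
pred = [[0, 0, 10, 10]] * 3
gt = [[5, 0, 15, 10]] * 3
labels = [is_positive(pred, gt, s) for s in (1, 2, 3)]
print(labels)  # [True, False, False]
```

A loose threshold at step 1 keeps enough positives among coarse proposals, while the stricter later thresholds match the better-localized tubelets those steps receive.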

41
Experiment Setup
► Dataset
► UCF101-24
► A subset of the UCF-101 dataset consisting of videos from 24 action classes with corresponding bounding box annotations.
► AVA
► Complex actions (60 classes) and scenes sourced from movies. Annotations are provided at 1-second intervals.
► Evaluation
► Frame-mAP at IoU=0.5
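For reference, frame-level average precision at IoU = 0.5 for a single class can be computed as sketched below. This assumes greedy matching of score-sorted detections and all-point interpolation of the precision-recall curve; the official evaluation scripts of each benchmark may differ in details. The names `box_iou` and `frame_ap` are hypothetical. Frame-mAP is the mean of this value over classes.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def frame_ap(detections, ground_truth, iou_thr=0.5):
    """AP for one class.

    detections:   list of (frame_id, score, box).
    ground_truth: dict mapping frame_id to a list of boxes.
    """
    num_gt = sum(len(v) for v in ground_truth.values())
    if num_gt == 0 or not detections:
        return 0.0
    matched = {f: [False] * len(v) for f, v in ground_truth.items()}
    tp = []
    for frame_id, _, box in sorted(detections, key=lambda d: -d[1]):
        gts = ground_truth.get(frame_id, [])
        ious = [box_iou(box, g) for g in gts]
        best = int(np.argmax(ious)) if ious else -1
        if best >= 0 and ious[best] >= iou_thr and not matched[frame_id][best]:
            matched[frame_id][best] = True  # each GT box matches once
            tp.append(1.0)
        else:
            tp.append(0.0)
    cum_tp = np.cumsum(tp)
    recall = cum_tp / num_gt
    precision = cum_tp / np.arange(1, len(tp) + 1)
    # All-point interpolation: make precision monotonically decreasing.
    mrec = np.concatenate([[0.0], recall, [1.0]])
    mpre = np.concatenate([[0.0], precision, [0.0]])
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))

gt = {0: [[0, 0, 10, 10]], 1: [[0, 0, 10, 10]]}
dets = [(0, 0.9, [0, 0, 10, 10]),      # true positive
        (1, 0.8, [50, 50, 60, 60]),    # false positive
        (1, 0.7, [1, 0, 10, 10])]      # true positive (IoU 0.9)
ap = frame_ap(dets, gt)
print(round(ap, 4))
```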

42
Qualitative Results: Progressive Learning
UCF101-24
AVA

43
Qualitative Results: Progressive Learning
Steps

44
Ablation Study
Spatial Refinement | Temporal Extension | Number of Proposals
► Improvement obtained by more steps

45
Ablation Study
Spatial Refinement | Temporal Extension | Number of Proposals
► Improvement obtained by more steps
► Performance saturates after 3 steps

46
Ablation Study
► Improvement obtained by more proposals
► More inference time
Number of Proposals
Plot: frame-mAP (%) and seconds per batch vs. number of initial proposals (11, 34, 83, 132)

47
Ablation Study
Plot: frame-mAP (%) and seconds per batch vs. number of initial proposals (11, 34, 83, 132), with ACT marked
► Improvement obtained by more proposals
► More inference time
► Achieve SOTA using only 11 proposals
Number of Proposals

48
Ablation Study
Plot: frame-mAP (%) vs. step (1, 2, 3)
Values shown: 51.5, 60.7, 62.6
Legend: w/o temporal extension (K = 6); w/o temporal extension (K = 30); w/ temporal extrapolation; w/ temporal anticipation
Number of Proposals

49
Ablation Study
Plot: frame-mAP (%) vs. step (1, 2, 3)
Values shown: 51.5, 60.7, 62.6; 53.1, 61.8, 63.4
Number of Proposals
► Long-range temporal context benefits action classification

50
Ablation Study
Plot: frame-mAP (%) vs. step (1, 2, 3), with K = 6 → 18 → 30
Values shown: 51.5, 60.7, 62.6; 53.1, 61.8, 63.4; 51.5, 62.8, 65.5; 51.5, 62.5, 66.7
► Long-range temporal context benefits action classification
► Adaptive temporal extension is more effective (and more efficient)
Number of Proposals

51
Comparison with SOTA
► UCF101-24
► VGG16 backbone
► Two-stream fusion
► K = 6 → 18 → 30
► AVA (v2.1)
► I3D backbone
► K = 12 → 12 → 36
* RGB + Flow
(Updated result on arXiv: 20.2%)

52
Qualitative Results: UCF101-24

54
Conclusion
► Spatio-TEmporal Progressive learning for action detection
► A novel framework for effective temporal modeling on long sequences
► A simple, fully end-to-end action detector (without external human detectors)
► Codes: https://github.com/NVlabs/STEP
