STEP is a framework for video action detection built on progressive learning, spatial refinement, and temporal extension. It aims to model temporal information effectively while detecting actions efficiently with a small number of proposals. Starting from initial proposals, it refines their spatial boundaries and extends the resulting tubelets in time over successive steps. Experiments on the UCF101-24 and AVA datasets show state-of-the-art performance with only 11 proposals on UCF101-24 (34 on AVA). Ablation studies validate the importance of temporal modeling and adaptive temporal extension.
1. 1
STEP: Spatio-Temporal Progressive Learning
for Video Action Detection
Xitong Yang1,2 Xiaodong Yang2 Ming-Yu Liu2
Fanyi Xiao2,3 Larry Davis1 Jan Kautz2
1University of Maryland, College Park 2NVIDIA 3University of California, Davis
2. 2
About Me (Xitong Yang, 杨希桐)
► Education
► 2016 – Present: Ph.D., University of Maryland, College Park; Prof. Larry Davis
► 2014 – 2016: M.S., University of Rochester; Prof. Jiebo Luo
► 2010 – 2014: B.E., Beijing Institute of Technology
► Internship
► 2018, 2019: NVIDIA; Xiaodong Yang, Ming-Yu Liu, Sifei Liu, Jan Kautz
► 2017: Honda Research Institute; Yi-Ting Chen, Teruhisa Misu
► 2016: PARC East; Sriganesh Madhvanath, Raja Bala
► Research Interest
► Computer vision, video understanding
6. 6
From Object Detection to Action Detection
► Use optical flow as additional input
► From frame-level prediction to clip-level prediction
► Process long sequences (use 3D CNNs)
► Replicate 2D proposals over time to obtain 3D proposals
Two-stream R-CNN
(Peng et al, ECCV 2016)
Kalogeiton et al, ICCV 2017
I3D + Faster R-CNN
(Girdhar et al, 2018)
12. 12
► Goals of STEP
✓ Effective temporal modeling
► Adapt to spatial displacement
✓ Efficient detection
► Use a small number of proposals
What is STEP
13. 13
What is STEP
► STEP = progressive learning + spatial refinement + temporal extension
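[Figure: Initial Proposal → Refined Tubelet → Extended Tubelet, plotted over Step and Time; the slide highlights progressive learning, spatial refinement, and temporal extension in turn]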
32. 32
Our Approach: STEP
► STEP
✓ Effective temporal modeling
► Adaptive temporal extension
✓ Efficient detection
► Use only 11 (34) proposals on UCF101-24 (AVA)
► Progressively increase the sequence length
✓ Generic learning framework for video understanding
► Instantiate with different backbones / refinement schedule
[Figure: Initial Proposal → Refined Tubelet → Extended Tubelet, plotted over Step and Time]
33. 33
Related Work: Iterative Methods in Vision
Iterative pose estimation
(Carreira et al, CVPR16)
Object detection
G-CNN (Najibi et al, CVPR16)
Recurrent image generation
DRAW (Gregor et al, ICML15)
Object detection
Cascade R-CNN (Cai et al, CVPR18)
34. 34
Model Details
[Figure: architecture diagram. Labels: Proposals, Convolutional Features, RoI Pool, Global Branch, Local Branch, Temporal Modeling, Classification, Regression]
► Spatial refinement
► Two branches for classification & regression
Action detection = Classification + Regression
► Classification: temporal information, context, interaction, …
► Regression: precise localization, bounding box of the actor, …
35. 35
► Temporal extension
► Linear extrapolation / location anticipation
Model Details
[Figure labels: B^s_{-1}, B^s_{+1}, B^s]
► Spatial refinement
► Two branches for classification & regression
[Figure: architecture diagram. Labels: Proposals, Convolutional Features, RoI Pool, Global Branch, Local Branch, Temporal Modeling, Classification, Regression]
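The linear extrapolation option for temporal extension can be illustrated with a small sketch. It assumes a tubelet is a list of per-frame boxes (x1, y1, x2, y2) and that motion is estimated from the first and last box; the function name `extrapolate_tubelet` and the constant-velocity model are illustrative assumptions, not the authors' implementation.

```python
def extrapolate_tubelet(boxes, num_new):
    """Extend a tubelet by `num_new` boxes on each side via linear extrapolation.

    boxes: list of (x1, y1, x2, y2) tuples, one per frame, length >= 2.
    Assumption: each coordinate moves at a constant per-frame velocity,
    estimated from the first and last box of the tubelet.
    """
    if len(boxes) < 2:
        raise ValueError("need at least two boxes to estimate motion")
    first, last = boxes[0], boxes[-1]
    span = len(boxes) - 1
    # Average per-frame displacement of each coordinate.
    velocity = [(l - f) / span for f, l in zip(first, last)]
    # Boxes for earlier frames, ordered from farthest to nearest.
    before = [
        tuple(f - k * v for f, v in zip(first, velocity))
        for k in range(num_new, 0, -1)
    ]
    # Boxes for later frames, ordered from nearest to farthest.
    after = [
        tuple(l + k * v for l, v in zip(last, velocity))
        for k in range(1, num_new + 1)
    ]
    return before + list(boxes) + after


tubelet = [
    (10.0, 10.0, 50.0, 90.0),
    (12.0, 10.0, 52.0, 90.0),
    (14.0, 10.0, 54.0, 90.0),
]
extended = extrapolate_tubelet(tubelet, num_new=2)
```

Here an actor drifting 2 pixels per frame to the right yields two extrapolated boxes on each side that continue the same drift. Location anticipation, the other option on the slide, would instead predict the new boxes with a learned regressor.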
36. 36
Model Details
► Progressive learning
► Joint training
[Figure: progressive learning diagram over Time. Labels: Backbone, Proposals, RoI Pool, S1, P1, L0, L1, Classification, Regression]
39. 39
Model Details
► The problem of distribution shift over different steps
► Our training strategies
► Increasing IoU thresholds for 3 steps (0.2 → 0.35 → 0.5)
► Separate header networks for different steps
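The increasing IoU thresholds can be sketched as a step-dependent rule for labeling training proposals. The thresholds come from the slide; the single-box IoU, the helper names `box_iou` and `is_positive`, and the 0-based step index are assumptions made for illustration (a tubelet-level overlap would likely aggregate over frames).

```python
# One IoU threshold per step, as listed on the slide.
STEP_IOU_THRESHOLDS = (0.2, 0.35, 0.5)


def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_positive(proposal, ground_truth, step):
    """Label a proposal as positive at `step` (0-based, hypothetical convention)."""
    return box_iou(proposal, ground_truth) >= STEP_IOU_THRESHOLDS[step]


gt = (0.0, 0.0, 10.0, 10.0)
loose = (0.0, 0.0, 10.0, 3.0)  # overlaps the ground truth with IoU 0.3
labels = [is_positive(loose, gt, s) for s in range(3)]
```

A coarse proposal with IoU 0.3 counts as a positive only at the first step, which matches the idea that later steps see better-localized tubelets and can be trained against a stricter criterion.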
41. 41
Experiment Setup
► Dataset
► UCF101-24
► A subset of the UCF-101 dataset consisting of videos from 24 action classes with corresponding bounding-box annotations.
► AVA
► Complex actions (60 classes) and scenes sourced from movies. Annotations are provided at 1-second intervals.
► Evaluation
► Frame-mAP at IoU=0.5
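As a rough illustration of the metric, frame-mAP can be computed by scoring per-frame detections against per-frame ground truth for each class and averaging the per-class average precision. The sketch below uses non-interpolated AP and greedy matching; the data layout and function names are assumptions, and official evaluation scripts may differ in details such as interpolation.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truth, iou_thresh=0.5):
    """Non-interpolated AP for one class.

    detections: list of (frame_id, score, box).
    ground_truth: dict mapping frame_id -> list of boxes.
    """
    num_gt = sum(len(boxes) for boxes in ground_truth.values())
    if num_gt == 0:
        return 0.0
    matched = {f: [False] * len(b) for f, b in ground_truth.items()}
    hits = []
    # Visit detections from highest to lowest score.
    for frame_id, _score, box in sorted(detections, key=lambda d: -d[1]):
        best_iou, best_idx = 0.0, -1
        for i, gt_box in enumerate(ground_truth.get(frame_id, [])):
            iou = box_iou(box, gt_box)
            if iou > best_iou:
                best_iou, best_idx = iou, i
        # A detection is a true positive if it overlaps an unmatched
        # ground-truth box in the same frame by at least iou_thresh.
        if best_idx >= 0 and best_iou >= iou_thresh and not matched[frame_id][best_idx]:
            matched[frame_id][best_idx] = True
            hits.append(1)
        else:
            hits.append(0)
    # Sum the precision at each true-positive rank, normalized by the GT count.
    ap, true_pos = 0.0, 0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            true_pos += 1
            ap += true_pos / rank
    return ap / num_gt


def frame_map(per_class):
    """per_class: dict mapping class name -> (detections, ground_truth)."""
    aps = [average_precision(d, g) for d, g in per_class.values()]
    return sum(aps) / len(aps)


gt_run = {0: [(0.0, 0.0, 10.0, 10.0)], 1: [(0.0, 0.0, 10.0, 10.0)]}
det_run = [
    (0, 0.9, (0.0, 0.0, 10.0, 10.0)),    # exact match
    (1, 0.8, (20.0, 20.0, 30.0, 30.0)),  # no overlap, false positive
    (1, 0.7, (0.0, 0.0, 10.0, 9.0)),     # IoU 0.9, true positive
]
score = frame_map({"run": (det_run, gt_run)})
```

In this toy example the ranked detections are hit, miss, hit, so the AP is (1/1 + 2/3) / 2.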
45. 45
Ablation Study
Spatial Refinement Temporal Extension
► Improvement obtained by more steps
► Performance saturates after 3 steps
Number of Proposals
46. 46
Ablation Study
Spatial Refinement Temporal Extension
► Improvement obtained by more proposals
► More inference time
[Figure: frame-mAP (%) and seconds per batch vs. number of initial proposals (11, 34, 83, 132)]
Number of Proposals
47. 47
Ablation Study
Spatial Refinement Temporal Extension
[Figure: frame-mAP (%) and seconds per batch vs. number of initial proposals (11, 34, 83, 132), with ACT marked for comparison]
► Improvement obtained by more proposals
► More inference time
► Achieve SOTA using only 11 proposals
Number of Proposals
50. 50
Ablation Study
Spatial Refinement Temporal Extension
[Figure: frame-mAP (%) at steps 1, 2, 3 (K = 6 → 18 → 30)]
Series values at steps 1, 2, 3, in slide order: 51.5, 60.7, 62.6; 53.1, 61.8, 63.4; 51.5, 62.8, 65.5; 51.5, 62.5, 66.7
Legend, in slide order: w/o temporal extension (K = 6); w/o temporal extension (K = 30); w/ temporal extrapolation; w/ temporal anticipation
► Long-range temporal context benefits action classification
► Adaptive temporal extension is more effective (and more efficient)
Number of Proposals
51. 51
Comparison with SOTA
► UCF101-24
► VGG16 backbone
► Two-stream fusion
► K = 6 → 18 → 30
► AVA (v2.1)
► I3D backbone
► K = 12 → 12 → 36
* RGB + Flow
(Updated result on arxiv: 20.2%)
54. 54
Conclusion
► Spatio-TEmporal Progressive learning for action detection
► A novel framework for effective temporal modeling on long sequences
► A simple, fully end-to-end action detector (without external human detectors)
► Code: https://github.com/NVlabs/STEP