ECCV2010: Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification
1. Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification
Juan Carlos Niebles, Chih-Wei Chen, Li Fei-Fei
Computer Science Dept., Stanford University
2. Recognizing Human Activities
Applications:
• Motion analysis: biomechanics, psychology studies
• Interactions with objects
• Detect unusual behavior: smart surveillance
• Temporal structure & causality
• Judge sports automatically
• Provide cooking assistance
• Video game interfaces
3. Activity landscape
Temporal scale (seconds):
• Snapshot (10⁻¹ s): Catch
• Atomic action (10⁰ s): Run
• Activities (10¹ s): High Jump
• Events (10³ s): Football
• Long-term event (10⁷⁻⁸ s): Construction of a building
4. Activity landscape
Prior work along the temporal scale:
• Snapshot (10⁻¹ s): Thurau & Hlavac 2008; Gupta et al 2009; Ikizler & Duygulu 2009; Ikizler-Cinbis et al 2009; Yao & Fei-Fei 2010a,b; Yang, Wang & Mori 2010
• Atomic action (10⁰ s): Bobick & Davis 2001; Efros et al 2003; Schuldt et al 2004; Alper & Shah 2005; Dollar et al 2005; Blank et al 2005; Niebles et al 2006; Laptev et al 2008; Wang & Mori 2008; Rodriguez et al 2008; Wang & Mori 2009; Gupta et al 2009; Liu et al 2009; Marszalek et al 2009
• Activities (10¹ s): Ramanan & Forsyth 2003; Laxton et al 2007; Ikizler & Forsyth 2008; Gupta et al 2009; Choi & Savarese 2009
• Events / long-term events (10³–10⁷⁻⁸ s): Sridhar et al 2010; Kuettel 2010
5. Activity landscape
Activities (temporal scale ~10¹ seconds):
• Composition of simple motions
• Non-periodic
• Longer duration than atomic actions
6. Activity landscape – related datasets
• Snapshot: Actions in still images [Ikizler 2009]; PPMI [Yao & Fei-Fei 2010]; UIUC Sports [Li & Fei-Fei 2007]
• Atomic action: KTH [Schuldt et al 2004]; Hollywood [Laptev et al 2008]; UCF Sports [Rodriguez et al 2008]; Ballet [Yang et al 2009]
• Activities: New Olympic Sports Dataset
7. Activity landscape
Possible approaches:
• Pose-based recognition — computationally intensive (Ferrari et al 2008; Ramanan & Forsyth 2003; Nazli & Forsyth 2008; […])
• HMM, CRF — limited to simple action recognition (Sminchisescu 2006; Blank et al 2005)
• Bag of features — fails when actions are complex (Laptev et al 2008; Niebles et al 2006; Efros et al 2003; Liu et al 2009; […])
8. Our proposal – decompose activities into simpler motion segments
1. Simple motions are easier to describe computationally
2. Can leverage temporal context
3. The human visual system seems to rely on decomposition for understanding [Zacks et al, Nature Neuroscience 2001; Tversky et al, JEP 2006]
18. A model for complex activities
Activity Model, laid out on a normalized time axis [0, 1].
Model properties:
• Uses a standard time range [0, 1]; each motion segment has an anchor location on this axis
• The model is formed by a few simple motions (motion segments)
• Local motion appearance per segment
• Encodes temporal order
• Temporal flexibility: each anchor carries a temporal location uncertainty
• Multiple temporal scales: segments range from shorter to longer
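The model properties above can be sketched as a small data structure. This is a minimal illustration with invented field names (the paper's notation differs): each motion segment carries an anchor location in [0, 1], a temporal location uncertainty, a duration scale, and an appearance descriptor.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MotionSegment:
    # Field names are hypothetical, chosen to mirror the slide's bullets.
    anchor: float            # preferred temporal location in the normalized [0, 1] range
    uncertainty: float       # how far the segment may drift from its anchor
    scale: float             # segment duration as a fraction of the video (shorter/longer)
    appearance: List[float]  # local motion appearance descriptor

@dataclass
class ActivityModel:
    segments: List[MotionSegment]  # ordered by anchor, encoding temporal order

# A toy three-segment model on the normalized [0, 1] axis.
model = ActivityModel(segments=[
    MotionSegment(anchor=0.2, uncertainty=0.15, scale=0.3, appearance=[0.5, 0.5]),
    MotionSegment(anchor=0.5, uncertainty=0.10, scale=0.2, appearance=[0.8, 0.2]),
    MotionSegment(anchor=0.8, uncertainty=0.05, scale=0.4, appearance=[0.1, 0.9]),
])
assert all(0.0 <= s.anchor <= 1.0 for s in model.segments)
```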
26. Recognition
Query video [0, 1] vs. Activity Model [0, 1].
Match Motion Segment 1:
• Consider a candidate location
• Matching score for this segment: built on spatio-temporal interest points with HOG/HOF descriptors [Laptev et al, 2005]
27. Recognition
Match Motion Segment 1:
• Consider a candidate location
• Matching score for this segment: descriptors are vector-quantized into a codebook of 1000 spatio-temporal words
28. Recognition
Match Motion Segment 1:
• Consider a candidate location
• Appearance feature: histogram of video words in the candidate window
29. Recognition
Match Motion Segment 1:
• Consider a candidate location
• Appearance similarity score: chi-square kernel SVM over the word histograms
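A sketch of the two appearance pieces named on these slides: a normalized histogram of quantized video words, and a chi-square kernel over such histograms. The paper trains an SVM with this kernel; the sketch below only computes the kernel value itself, and the word IDs and `gamma` are toy assumptions.

```python
import math

def word_histogram(word_ids, vocab_size):
    # Normalized histogram of quantized spatio-temporal words
    # falling inside a candidate temporal window.
    h = [0.0] * vocab_size
    for w in word_ids:
        h[w] += 1.0
    n = sum(h)
    return [c / n for c in h] if n else h

def chi_square_kernel(h1, h2, gamma=1.0):
    # k(h1, h2) = exp(-gamma * 0.5 * sum_i (h1_i - h2_i)^2 / (h1_i + h2_i)),
    # skipping bins that are empty in both histograms.
    dist = 0.0
    for a, b in zip(h1, h2):
        if a + b > 0:
            dist += (a - b) ** 2 / (a + b)
    return math.exp(-gamma * 0.5 * dist)

h_model = word_histogram([0, 1, 1, 2], vocab_size=4)
h_query = word_histogram([0, 1, 2, 2], vocab_size=4)
score = chi_square_kernel(h_model, h_query)
assert 0.0 < score <= 1.0  # identical histograms would give exactly 1.0
```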
32. Recognition
Match Motion Segment 1:
• Consider a candidate location
• Temporal location feature: the distance between h_1 and the anchor location
33. Recognition
Match Motion Segment 1:
• Consider a candidate location
• Temporal location disagreement score: a 2nd-order polynomial of the distance between the candidate location and the anchor
37. Recognition
Query video [0, 1] vs. Activity Model [0, 1].
• Matching score for all segments: combine the per-segment appearance similarity and temporal location scores across all motion segments
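The per-segment scores can be combined into a whole-video matching score by treating each segment's temporal placement as latent: pick the best candidate location per segment and sum. A minimal sketch; the function names and toy scoring callbacks are invented, and the paper's actual formulation weighs the appearance and temporal terms with learned parameters.

```python
def match_video(model_segments, candidate_locations, appearance_score, temporal_score):
    # Total matching score: for each motion segment, choose the candidate
    # location maximizing appearance similarity + temporal agreement
    # (segment placements are latent), then sum over segments.
    total = 0.0
    for seg in model_segments:
        total += max(
            appearance_score(seg, h) + temporal_score(h, seg["anchor"])
            for h in candidate_locations
        )
    return total

# Toy example: appearance is held constant so only temporal agreement matters.
segs = [{"anchor": 0.2}, {"anchor": 0.8}]
score = match_video(
    segs,
    candidate_locations=[0.1, 0.2, 0.5, 0.8, 0.9],
    appearance_score=lambda seg, h: 0.0,
    temporal_score=lambda h, a: -(h - a) ** 2,
)
assert score == 0.0  # both anchors appear among the candidate locations
```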
39. Learning from weakly labeled data
Positive examples vs. negative examples:
• YouTube videos
• Class label per video collected on Amazon Mechanical Turk
• No annotation of temporal segments
40. Learning from weakly labeled data
Positive and negative examples are used to train the Activity Model [0, 1].
41. Learning
Goal — learn:
• Motion segment appearance
• Temporal arrangement
A max-margin framework, optimizing a discriminative loss via coordinate descent [Felzenszwalb et al 2008].
Output: the Activity Model over [0, 1].
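The coordinate descent above can be sketched on a toy problem. This is only an illustration of the alternation, not the paper's implementation: a single scalar weight `w` stands in for the full model, each video is reduced to a list of candidate placement features (the latent variables), and a subgradient step stands in for the solver. All names and values are invented.

```python
def best_latent(w, candidates):
    # Latent step: pick the candidate placement with the highest score.
    return max(candidates, key=lambda f: w * f)

def learn(positives, negatives, w=0.0, lr=0.1, n_iters=20, C=1.0):
    # Coordinate descent in the spirit of latent max-margin learning
    # [Felzenszwalb et al 2008]: alternate between (1) inferring latent
    # placements with the model fixed and (2) updating the model with
    # the placements fixed.
    for _ in range(n_iters):
        # (1) Latent step: best placement per positive example.
        pos_feats = [best_latent(w, cands) for cands in positives]
        # (2) Weight step: one subgradient step on the hinge losses
        # max(0, 1 - w*f_pos) and max(0, 1 + w*f_neg), plus 0.5*w^2.
        grad = w
        for f in pos_feats:
            if w * f < 1:              # positive violates the margin
                grad -= C * f
        for cands in negatives:
            f = best_latent(w, cands)  # hardest negative placement
            if w * f > -1:             # negative violates the margin
                grad += C * f
        w -= lr * grad
    return w

# Positives contain a high-scoring placement; negatives do not.
w = learn(positives=[[0.2, 1.0], [0.5, 0.9]], negatives=[[-1.0, -0.4]])
assert w > 0.0  # learned weight separates positives from negatives
```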
48. Experiment I: Simple Actions
• KTH dataset [Schuldt et al 2004]: walking, jogging, running, boxing, hand-waving, hand-clapping

Per-class accuracy of our model:
Action class     Our model
walking          94.4%
running          79.5%
jogging          78.2%
hand-waving      99.9%
hand-clapping    96.5%
boxing           99.2%

Overall accuracy comparison (bar chart): Ours; Wang et al 2009; Laptev et al 2008; Wong et al 2007; Schuldt et al 2004.
49. Experiment II: Proof of concept
• Activities synthesized from Weizmann [Blank 2005] atomic actions (wave, jump, jumping-jacks)
• 6 classes
• Ours: 100%  •  Bag-of-features: 17%
Learned Activity Model (figure): segments labeled waving, jumping jacks, and the transition from jump to jumping jacks, at multiple temporal scales (shorter to longer).
50. Experiment III: Olympic Sports Dataset
• YouTube videos with per-video class labels from AMT
• 16 classes, ~100 videos each: http://vision.stanford.edu/Datasets/OlympicSports
• Classes: high-jump, long-jump, triple-jump, pole-vault, discus, hammer, javelin, shot put, basketball lay-up, bowling, tennis-serve, platform, springboard, snatch, clean-jerk, vault
52. Learned model: High Jump
Activity Model [0, 1] — segments (shorter to longer scales): Start running, Run, Take off, Landing & stand up, Run.
Annotation: a shorter segment has larger location uncertainty.
53. Learned model: High Jump
Segments: Start running, Run, Take off, Landing & stand up, Run.
Annotation: a long segment has small location uncertainty.
55. Learned Model: Clean and Jerk
Activity Model [0, 1] — segments: Hold weight while crouching, Lift weight to shoulders, Hold weight on shoulders, Hold weight while crouching, Transition to upright position.
Annotation: a short segment with low location uncertainty — it had high location consistency in training.
56. Learned Model: Clean and Jerk
Segments: Hold weight while crouching, Lift weight to shoulders, Hold weight on shoulders, Hold weight while crouching, Transition to upright position.
Annotation: segments that encode similar appearance have overlapping possible locations.
57. Matched Sequences
Long Jump, sequences 1 & 2 — matched segments: Run, Take off, Stand up.
Remarks:
• Matching is tolerant to variations in the exact temporal location of motion segments.
• Query videos can have different lengths.
Long Jump Model [0, 1]
58. Matched Sequences
Vault, sequences 1 & 2 — matched segments: Run, Up in the air, Landing.
Low matching score example: good temporal alignment, but poor appearance match.
Vault Model [0, 1]
59. Classifying Olympic Sports
Average classification accuracy (bar chart):
Our Method          72.1%
Laptev et al 2008   62.0%
61. Conclusions
• Temporal context and structure are useful for activity recognition.
• New Olympic Sports Dataset (16 classes, ~100 videos/class).
Future directions:
• Explore richer temporal structures
• Introduce semantics for more meaningful decomposition
62. Thank you!
Juan Carlos Niebles — graduate student, Princeton/Stanford.
Thanks to Bangpeng Yao, Barry Chai, Jia Deng, Hao Su, Olga Russakovsky, and all Stanford Vision Lab members.