CVPR2010: modeling mutual context of object and human pose in human-object interaction activities
1. Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities
Bangpeng Yao and Li Fei-Fei
Computer Science Department, Stanford University
{bangpeng,feifeili}@cs.stanford.edu
3. Human-Object Interaction
Holistic image based classification (Previous talk: Grouplet)
[Image labels: Playing bassoon; Playing saxophone. Vs. Playing saxophone]
Detailed understanding and reasoning
Grouplet is a generic feature for structured objects, or interactions of groups of objects.
HOI activity: Tennis Forehand
Caltech101:
• Berg & Malik, 2005: 48%
• Grauman & Darrell, 2005: 59%
• Gehler & Nowozin, 2009: 77%
• OURS: 62%
6. Human-Object Interaction
Holistic image based classification
Detailed understanding and reasoning
• Human pose estimation
• Object detection
HOI activity: Tennis Forehand
[Image labels: Head, Torso, Tennis racket]
7. Outline
• Background and Intuition
• Mutual Context of Object and Human Pose
– Model Representation
– Model Learning
– Model Inference
• Experiments
• Conclusion
9. Human pose estimation & Object detection
Human pose estimation is challenging.
Challenges: difficult part appearance; self-occlusion; image regions that look like a body part.
• Felzenszwalb & Huttenlocher, 2005
• Ren et al, 2005
• Ramanan, 2006
• Ferrari et al, 2008
• Yang & Mori, 2008
• Andriluka et al, 2009
• Eichner & Ferrari, 2009
11. Human pose estimation & Object detection
Human pose estimation is facilitated, given that the object is detected.
12. Human pose estimation & Object detection
Object detection is challenging.
Challenges: small, low-resolution, partially occluded objects; image regions similar to the detection target.
• Viola & Jones, 2001
• Lampert et al, 2008
• Divvala et al, 2009
• Vedaldi et al, 2009
16. Context in Computer Vision
Previous work – Use context cues to facilitate object detection:
Helpful, but only moderately: with context outperforms without context by ~3-4%.
• Hoiem et al, 2006
• Rabinovich et al, 2007
• Oliva & Torralba, 2007
• Heitz & Koller, 2008
• Desai et al, 2009
• Divvala et al, 2009
• Murphy et al, 2003
• Shotton et al, 2006
• Harzallah et al, 2009
• Li, Socher & Fei-Fei, 2009
• Marszalek et al, 2009
• Bao & Savarese, 2010
• Viola & Jones, 2001
• Lampert et al, 2008
17. Context in Computer Vision
Previous work – Use context cues to facilitate object detection:
Helpful, but only moderately: with context outperforms without context by ~3-4%.
• Hoiem et al, 2006
• Rabinovich et al, 2007
• Oliva & Torralba, 2007
• Heitz & Koller, 2008
• Desai et al, 2009
• Divvala et al, 2009
• Murphy et al, 2003
• Shotton et al, 2006
• Harzallah et al, 2009
• Li, Socher & Fei-Fei, 2009
• Marszalek et al, 2009
• Bao & Savarese, 2010
Our approach – Two challenging tasks serve as mutual context of each other:
[Figure panels: Without context; With mutual context]
18. Outline
• Background and Intuition
• Mutual Context of Object and Human Pose
– Model Representation
– Model Learning
– Model Inference
• Experiments
• Conclusion
19. Mutual Context Model Representation
• A: Activity (Tennis forehand, Croquet shot, Volleyball smash)
• O: Object (Tennis racket, Croquet mallet, Volleyball)
• H: Human pose
• $P_1, P_2, \dots, P_N$: Body parts
• $f_O, f_1, f_2, \dots, f_N$: Image evidence
Intra-class variations
• More than one H for each A;
• Unobserved during training.
P: $l_P$: location; $\theta_P$: orientation; $s_P$: scale.
f: Shape context. [Belongie et al, 2002]
20. Mutual Context Model Representation
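Markov Random Field
$\Psi = \sum_{e \in E} w_e \phi_e$, where $w_e$ is the clique weight and $\phi_e$ is the clique potential.
• $\phi_e(A, O)$, $\phi_e(A, H)$, $\phi_e(O, H)$: Frequency of co-occurrence between A, O, and H.
[Graphical model: nodes A, O, H, body parts $P_1, P_2, \dots, P_N$, image evidence $f_O, f_1, f_2, \dots, f_N$; edges (A, O), (A, H), (O, H) labeled]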
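The clique weights and clique potentials on this slide combine into one score per candidate labeling. A minimal sketch of that weighted sum follows; the edge names, the dictionary layout, and all numbers are hypothetical and chosen only for illustration, not taken from the talk.

```python
from typing import Dict


def model_score(weights: Dict[str, float], potentials: Dict[str, float]) -> float:
    """Return the sum over edges e of w_e * phi_e for one candidate labeling."""
    unmatched = set(weights) ^ set(potentials)
    if unmatched:
        raise ValueError(f"edges without a weight/potential pair: {sorted(unmatched)}")
    return sum(weights[e] * potentials[e] for e in weights)


# Hypothetical co-occurrence potentials for one (A, O, H) assignment.
potentials = {"A-O": 0.5, "A-H": 0.25, "O-H": 0.75}
# Hypothetical clique weights; a zero weight switches an edge off.
weights = {"A-O": 2.0, "A-H": 4.0, "O-H": 0.0}
score = model_score(weights, potentials)
```

Keeping weights and potentials in separate mappings mirrors the split on the slide: potentials come from the data, weights are learned afterwards.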
22. Mutual Context Model Representation
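Markov Random Field
$\Psi = \sum_{e \in E} w_e \phi_e$, where $w_e$ is the clique weight and $\phi_e$ is the clique potential.
• $\phi_e(A, O)$, $\phi_e(A, H)$, $\phi_e(O, H)$: Frequency of co-occurrence between A, O, and H.
• $\phi_e(O, P_n)$, $\phi_e(H, P_n)$, $\phi_e(P_m, P_n)$: Spatial relationship among object and body parts: $\mathrm{bin}(l_O - l_{P_n})$ (location), $\mathrm{bin}(\theta_O - \theta_{P_n})$ (orientation), and a term in $s_O$ and $s_{P_n}$ (size).
• Learn structural connectivity among the body parts and the object (edges obtained by structure learning).
[Graphical model: nodes A, O, H, $P_1, P_2, \dots, P_N$, $f_O, f_1, f_2, \dots, f_N$; edges $(H, P_n)$, $(O, P_n)$, $(P_m, P_n)$ labeled]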
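The binned location and orientation terms can be sketched as below. This is only an illustration under stated assumptions: taking the difference between part and object, the 20-pixel cell size, the clamped 5x5 grid, and the 8 angular bins are all hypothetical choices, and the size term is left out.

```python
import math


def location_bin(obj_xy, part_xy, cell=20.0, half_width=2):
    """Map the offset (part - object) to a grid cell, clamped to the grid edge."""
    def clamp(v):
        return max(-half_width, min(half_width, v))

    dx = clamp(math.floor((part_xy[0] - obj_xy[0]) / cell))
    dy = clamp(math.floor((part_xy[1] - obj_xy[1]) / cell))
    return (dx, dy)


def orientation_bin(obj_theta, part_theta, n_bins=8):
    """Map the orientation difference (radians) to one of n_bins angular bins."""
    diff = (part_theta - obj_theta) % (2 * math.pi)
    return int(diff / (2 * math.pi / n_bins)) % n_bins


# Hypothetical racket and lower-arm configurations (pixels, radians).
loc = location_bin((100.0, 100.0), (130.0, 95.0))
ori = orientation_bin(0.0, math.pi / 2 + 0.1)
```

A potential would then be looked up from a table indexed by these bins, with the table entries estimated from training data.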
23. Mutual Context Model Representation
Markov Random Field
$\Psi = \sum_{e \in E} w_e \phi_e$, where $w_e$ is the clique weight and $\phi_e$ is the clique potential.
• $\phi_e(A, O)$, $\phi_e(A, H)$, $\phi_e(O, H)$: Frequency of co-occurrence between A, O, and H.
• $\phi_e(O, P_n)$, $\phi_e(H, P_n)$, $\phi_e(P_m, P_n)$: Spatial relationship among object and body parts: $\mathrm{bin}(l_O - l_{P_n})$ (location), $\mathrm{bin}(\theta_O - \theta_{P_n})$ (orientation), and a term in $s_O$ and $s_{P_n}$ (size).
• Learn structural connectivity among the body parts and the object.
• $\phi_e(O, f_O)$ and $\phi_e(P_n, f_{P_n})$: Discriminative part detection scores. Shape context + AdaBoost [Andriluka et al, 2009] [Belongie et al, 2002] [Viola & Jones, 2001]
24. Outline
• Background and Intuition
• Mutual Context of Object and Human Pose
– Model Representation
– Model Learning
– Model Inference
• Experiments
• Conclusion
28. Model Learning
Input: [Images: cricket shot, cricket bowling]
Model: $\Psi = \sum_{e \in E} w_e \phi_e$ over A, O, H, $P_1, P_2, \dots, P_N$, $f_O, f_1, f_2, \dots, f_N$
Goals:
• Hidden human poses (hidden variables)
• Structural connectivity (structure learning)
• Potential parameters and potential weights (parameter estimation)
29. Model Learning
Model: $\Psi = \sum_{e \in E} w_e \phi_e$ over A, O, H, $P_1, P_2, \dots, P_N$, $f_O, f_1, f_2, \dots, f_N$
Approach: [Images: croquet shot]
Goals:
• Hidden human poses
• Structural connectivity
• Potential parameters
• Potential weights
30. Model Learning
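Model: $\Psi = \sum_{e \in E} w_e \phi_e$ over A, O, H, $P_1, P_2, \dots, P_N$, $f_O, f_1, f_2, \dots, f_N$
Approach: Hill-climbing
$\max_{E} \sum_{e \in E} w_e \phi_e - \frac{|E|^2}{2\sigma^2}$
First term: joint density of the model. Second term: Gaussian prior on the number of edges.
Goals:
• Hidden human poses
• Structural connectivity
• Potential parameters
• Potential weights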
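Hill-climbing over edge sets, with a penalty that grows with the number of edges, might look like the sketch below. It makes a strong simplifying assumption that is not from the talk: each candidate edge contributes a fixed, additive gain to the fit term, whereas in the full model the fit would be re-evaluated after each change. The edge names and gains are hypothetical.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]


def objective(edges: Set[Edge], gain: Dict[Edge, float], sigma: float) -> float:
    """Fit term minus |E|^2 / (2 sigma^2), the log of a Gaussian prior on edge count."""
    fit = sum(gain[e] for e in edges)
    return fit - len(edges) ** 2 / (2.0 * sigma ** 2)


def hill_climb(candidates: Dict[Edge, float], sigma: float) -> Set[Edge]:
    """Greedily add or remove one edge at a time until no move improves the objective."""
    current: Set[Edge] = set()
    best = objective(current, candidates, sigma)
    while True:
        # Each neighbor differs from the current edge set by exactly one edge.
        neighbors = [current ^ {e} for e in candidates]
        if not neighbors:
            return current
        top = max(neighbors, key=lambda s: objective(s, candidates, sigma))
        top_score = objective(top, candidates, sigma)
        if top_score <= best:
            return current
        current, best = top, top_score


# Hypothetical per-edge gains between the object and two body parts.
gains = {("O", "P1"): 3.0, ("O", "P2"): 1.0, ("P1", "P2"): 0.2}
sparse = hill_climb(gains, sigma=1.0)
denser = hill_climb(gains, sigma=2.0)
```

A smaller sigma penalizes extra edges more strongly, so the search stops at a sparser structure.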
31. Model Learning
Model: $\Psi = \sum_{e \in E} w_e \phi_e$ over A, O, H, $P_1, P_2, \dots, P_N$, $f_O, f_1, f_2, \dots, f_N$
Approach:
• Maximum likelihood: $\phi_e(A, O)$, $\phi_e(A, H)$, $\phi_e(O, H)$, $\phi_e(H, P_n)$, $\phi_e(O, P_n)$, $\phi_e(P_m, P_n)$
• Standard AdaBoost: $\phi_e(O, f_O)$, $\phi_e(P_n, f_{P_n})$
Goals:
• Hidden human poses
• Structural connectivity
• Potential parameters
• Potential weights
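For a discrete co-occurrence potential, the maximum-likelihood estimate is a normalized count. The sketch below shows this for activity-object pairs; treating the potential as a joint relative frequency is an assumption made here, and the training labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple

Pair = Tuple[Hashable, Hashable]


def cooccurrence_frequencies(pairs: Iterable[Pair]) -> Dict[Pair, float]:
    """Maximum-likelihood estimate of a joint table: count of each pair / total count."""
    counts = Counter(pairs)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical (activity, object) labels from four training images.
labels = [
    ("tennis forehand", "tennis racket"),
    ("tennis forehand", "tennis racket"),
    ("tennis forehand", "tennis racket"),
    ("croquet shot", "croquet mallet"),
]
freqs = cooccurrence_frequencies(labels)
```

The same counting applies to (A, H) and (O, H) pairs once a pose label is available for each training image.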
32. Model Learning
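Model: $\Psi = \sum_{e \in E} w_e \phi_e$ over A, O, H, $P_1, P_2, \dots, P_N$, $f_O, f_1, f_2, \dots, f_N$
Approach: Max-margin learning
$\min_{w, \xi} \frac{1}{2} \sum_r \|w_r\|_2^2 + \beta \sum_i \xi_i$
s.t. $\forall i, r$ where $y(r) \neq y(c_i)$: $w_{c_i} \cdot x_i - w_r \cdot x_i \geq 1 - \xi_i$; $\forall i$, $\xi_i \geq 0$
Notations
• $x_i$: Potential values of the i-th image.
• $w_r$: Potential weights of the r-th pose.
• $y(r)$: Activity of the r-th pose.
• $\xi_i$: A slack variable for the i-th image.
Goals:
• Hidden human poses
• Structural connectivity
• Potential parameters
• Potential weights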
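To make the max-margin objective concrete, the sketch below evaluates it for a given set of weights, using for each image the smallest slack that satisfies its margin constraints. It evaluates the objective only and does not solve the optimization. The trade-off constant `beta`, the pose index list `c`, and all numbers are assumptions introduced for illustration.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def max_margin_objective(
    w: List[List[float]],       # w[r]: potential weights of the r-th pose
    pose_activity: List[str],   # pose_activity[r]: activity y(r) of the r-th pose
    x: List[List[float]],       # x[i]: potential values of the i-th image
    c: List[int],               # c[i]: pose index assigned to the i-th image
    beta: float = 1.0,          # assumed trade-off between margin and slack
) -> float:
    reg = 0.5 * sum(dot(wr, wr) for wr in w)
    slack_total = 0.0
    for xi, ci in zip(x, c):
        own = dot(w[ci], xi)
        # Only poses belonging to a different activity impose a margin constraint.
        violations = [
            1.0 - (own - dot(w[r], xi))
            for r in range(len(w))
            if pose_activity[r] != pose_activity[ci]
        ]
        # Smallest non-negative slack that satisfies every constraint for image i.
        slack_total += max([0.0] + violations)
    return reg + beta * slack_total


# Two hypothetical poses from different activities, two images.
w = [[1.0, 0.0], [0.0, 1.0]]
activities = ["tennis forehand", "croquet shot"]
x = [[2.0, 0.0], [0.5, 0.5]]
c = [0, 1]
value = max_margin_objective(w, activities, x, c, beta=1.0)
```

Poses that share an activity with the image's own pose add no constraint, so several poses per activity do not compete with each other.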
35. Outline
• Background and Intuition
• Mutual Context of Object and Human Pose
– Model Representation
– Model Learning
– Model Inference
• Experiments
• Conclusion
37. Model Inference
I
The learned models
Head detection, torso detection, and tennis racket detection
Compositional Inference [Chen et al, 2007]
$(A_1, H_1, O_1^*, \{P_{1,n}^*\}_n)$: Layout of the object and body parts.
38. Model Inference
I
The learned models
Output: $(A_1, H_1, O_1^*, \{P_{1,n}^*\}_n), \dots, (A_K, H_K, O_K^*, \{P_{K,n}^*\}_n)$
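One natural way to turn the K per-model results into a single prediction, assumed here rather than stated on the slide, is to keep the highest-scoring one. The `Hypothesis` record, its field names, and the scores below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]


@dataclass
class Hypothesis:
    activity: str                  # A_k
    pose_id: int                   # H_k
    object_box: Box                # O_k*
    part_boxes: Dict[str, Box] = field(default_factory=dict)  # P*_{k,n}
    score: float = 0.0             # model score for this hypothesis


def best_hypothesis(hypotheses: List[Hypothesis]) -> Hypothesis:
    """Return the hypothesis with the largest score."""
    if not hypotheses:
        raise ValueError("at least one hypothesis is required")
    return max(hypotheses, key=lambda h: h.score)


candidates = [
    Hypothesis("tennis forehand", 0, (10.0, 10.0, 40.0, 40.0), score=1.7),
    Hypothesis("tennis serve", 1, (12.0, 5.0, 38.0, 30.0), score=2.4),
    Hypothesis("croquet shot", 2, (50.0, 60.0, 70.0, 90.0), score=0.3),
]
winner = best_hypothesis(candidates)
```

The winning record carries the activity label together with the object and part layout, so all three tasks are read off one result.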
39. Outline
• Background and Intuition
• Mutual Context of Object and Human Pose
– Model Representation
– Model Learning
– Model Inference
• Experiments
• Conclusion
40. Dataset and Experiment Setup
Sport data set: 6 classes
180 training (supervised with object and part locations) & 120 testing images
Tasks:
• Object detection;
• Pose estimation;
• Activity classification.
Classes: Cricket defensive shot, Cricket bowling, Croquet shot, Tennis forehand, Tennis serve, Volleyball smash [Gupta et al, 2009]
45. Human Pose Estimation Results
Method                  Torso  Upper Leg   Lower Leg   Upper Arm   Lower Arm   Head
Ramanan, 2006           .52    .22  .22    .21  .28    .24  .28    .17  .14    .42
Andriluka et al, 2009   .50    .31  .30    .31  .27    .18  .19    .11  .11    .45
Our full model          .66    .43  .39    .44  .34    .44  .40    .27  .29    .58
46. Human Pose Estimation Results
Method                  Torso  Upper Leg   Lower Leg   Upper Arm   Lower Arm   Head
Ramanan, 2006           .52    .22  .22    .21  .28    .24  .28    .17  .14    .42
Andriluka et al, 2009   .50    .31  .30    .31  .27    .18  .19    .11  .11    .45
Our full model          .66    .43  .39    .44  .34    .44  .40    .27  .29    .58
[Figure panels: Tennis serve model; Our estimation result; Andriluka et al, 2009; Volleyball smash model; Our estimation result; Andriluka et al, 2009]
47. Human Pose Estimation Results
Method                  Torso  Upper Leg   Lower Leg   Upper Arm   Lower Arm   Head
Ramanan, 2006           .52    .22  .22    .21  .28    .24  .28    .17  .14    .42
Andriluka et al, 2009   .50    .31  .30    .31  .27    .18  .19    .11  .11    .45
Our full model          .66    .43  .39    .44  .34    .44  .40    .27  .29    .58
One pose per class      .63    .40  .36    .41  .31    .38  .35    .21  .23    .52
[Figure panels: four estimation results]
49. Activity Classification Results
[Bar chart: Classification accuracy]
• Our model: 83.3%
• Gupta et al, 2009: 78.9%
• Bag-of-words (SIFT+SVM): 52.5%
Annotations: "No scene information"; "Scene is critical!!"
[Example images: Cricket shot, Tennis forehand]
50. Conclusion
Human-Object Interaction: Grouplet representation vs. Mutual context model
Next Steps
• Pose estimation & Object detection on PPMI images.
• Modeling multiple objects and humans.
51. Acknowledgment
• Stanford Vision Lab reviewers:
– Barry Chai (1985-2010)
– Juan Carlos Niebles
– Hao Su
• Silvio Savarese, U. Michigan
• Anonymous reviewers