Articulated human pose estimation by deep learning
1. Articulated Human Pose Estimation by Deep Learning
Wei Yang
Supervisor: Xiaogang Wang, Wanli Ouyang
wyang@ee.cuhk.edu.hk
2. Outline
• Introduction
• Regression by Convolutional Neural Network
• Deformable Convolutional Neural Networks
• Discussion and Future work
2016/8/11 2
3. Introduction
Articulated body pose estimation
“recovers the pose of an articulated body, which consists of
joints and rigid parts using image-based observations.”
6. Classic Approaches
Fischler & Elschlager 1973
Felzenszwalb & Huttenlocher 2005
Pictorial Structure
• Unary Templates
• Pairwise Springs
Yang & Ramanan 2011
Mixtures of “mini-parts”
• Mixture type 𝑚𝑖 of part 𝑖
• Unary template for part 𝑖 with mixture 𝑚𝑖
• Pairwise springs between part 𝑖 with mixture 𝑚𝑖 and part 𝑗 with mixture 𝑚𝑗
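The "unary templates plus pairwise springs" scoring of a pictorial structure can be sketched as follows; the two parts, the match scores, the rest offset, and the spring constant are all toy values for illustration, not taken from the papers.

```python
import numpy as np

# Toy pictorial-structure scoring: unary template scores minus
# quadratic "spring" deformation costs. All values are illustrative.
locs = {"torso": np.array([5.0, 5.0]), "head": np.array([5.0, 2.0])}
unary = {"torso": 0.9, "head": 0.8}                  # template match scores
rest = {("torso", "head"): np.array([0.0, -3.0])}    # preferred head offset

def spring_cost(pi, pj, r, k=0.5):
    # penalty grows quadratically with deviation from the rest offset
    d = (pj - pi) - r
    return k * float(d @ d)

def ps_score(locs, unary, rest):
    s = sum(unary.values())
    s -= sum(spring_cost(locs[i], locs[j], r) for (i, j), r in rest.items())
    return s
```

When the head sits exactly at the spring's rest offset the deformation cost vanishes and the score is just the sum of the unary terms; shifting the head away lowers the score quadratically.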
7. Deep Learning Methods
Multi-source Deep Learning
• Candidate estimations
• Deep model uses multi-source
including appearance score, mixture
type, and deformation.
Ouyang et al. 2014
DeepPose
• Reasons about pose in a holistic fashion
• Refines the joint predictions using higher-resolution sub-images
Toshev & Szegedy 2014
8. We propose to study pose estimation in two ways
• Holistic View
–Regression of joint locations by convolutional neural
networks (CNNs)
• Local information
–Deformable Convolutional Neural Networks
10. Formulation
• Image: 𝐼
• Part locations: 𝐩 = {𝑝_𝑖}_{𝑖=1}^𝑃 = {(𝑥_𝑖, 𝑦_𝑖)}_{𝑖=1}^𝑃
• Location of part 𝑖: 𝑝_𝑖 = (𝑥_𝑖, 𝑦_𝑖)
• Regressor: 𝜓(𝐼; 𝜃) = 𝐩, learned by a deep CNN
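Concretely, the regressor maps an image to a 2𝑃-dimensional vector of stacked joint coordinates. A minimal sketch, where a random linear map stands in for the CNN 𝜓; the value 𝑃 = 14, the 8×8 input size, and the weights are assumptions for illustration only.

```python
import numpy as np

P = 14                       # number of body parts (illustrative)
rng = np.random.default_rng(0)

def encode(parts):
    """Stack P (x, y) joint locations into a 2P-dim regression target."""
    return np.asarray(parts, dtype=float).reshape(-1)

def decode(y):
    """Inverse: regressor output vector -> P (x, y) part locations."""
    return y.reshape(-1, 2)

# Linear stand-in for psi(I; theta): flattened 8x8 image times weights.
theta = rng.standard_normal((2 * P, 64))
def psi(image):
    return theta @ image.reshape(-1)

pred = decode(psi(rng.standard_normal((8, 8))))  # P predicted (x, y) pairs
```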
11. Basic Architecture of the CNN Regressor
• AlexNet
– Krizhevsky, Sutskever, and Hinton, NIPS 2012
– The first time a deep model was shown to be effective on a large-scale computer vision task.
12. Normalize Scale of Human Body
• Size of the CNN input is fixed
• Simple warping changes the aspect ratio of people
• People appear at different scales of an image
Pipeline: 1. Original image → 2. Human detection [Ouyang et al. CVPR 2014] → 3. Crop by bounding box → 4. Pad with mean RGB value
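The crop-and-pad step might look like the sketch below; the bounding box is assumed to come from the human detector, and the nearest-neighbor warp is a stand-in for whatever interpolation was actually used.

```python
import numpy as np

def crop_and_pad(image, bbox, out_size=227):
    """Crop the detected person box, pad it to a square with the
    image's mean RGB value, then resize (nearest neighbor here)
    to the fixed CNN input size."""
    x0, y0, x1, y1 = bbox
    crop = image[y0:y1, x0:x1]
    h, w = crop.shape[:2]
    side = max(h, w)
    # fill with the mean RGB value rather than zeros
    canvas = np.empty((side, side, 3), dtype=image.dtype)
    canvas[:] = image.reshape(-1, 3).mean(axis=0).astype(image.dtype)
    oy, ox = (side - h) // 2, (side - w) // 2
    canvas[oy:oy + h, ox:ox + w] = crop
    # nearest-neighbor warp to the fixed CNN input size
    idx = np.arange(out_size) * side // out_size
    return canvas[idx][:, idx]
```

Padding to a square before warping is what preserves the person's aspect ratio; cropping to the detection first is what normalizes scale across images.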
20. Motivation
• Local image patches are able to capture:
– Part presence
– Pairwise part spatial relationships
Number of mixture types for each pair: 6
• Lower arm, 1 neighbor: 6^1 = 6 relationships
• Upper arm, 2 neighbors: 6^2 = 36 relationships
[Chen & Yuille NIPS 2014]
21. Tree-structured Relational Graph
• 𝑇 = (𝑉, 𝐸)
– 𝑉: positions of body parts
– 𝐸: pairwise relationships between parts
• 𝐩 = {𝑝_𝑖} = {(𝑥_𝑖, 𝑦_𝑖)}
– 𝑝_𝑖: pixel location of part 𝑖
• 𝐭 = {𝑡_𝑖𝑗, 𝑡_𝑗𝑖 | (𝑖, 𝑗) ∈ 𝐸}
– Pairwise relationship, defined by relative position
– 𝑡_𝑖𝑗 ∈ {1, …, 𝑇_𝑖𝑗}
– In experiments: 13 types for each pair (𝑖, 𝑗) ∈ 𝐸
22. Formulation
𝐹(𝐩, 𝐭 | 𝐼; 𝝎, 𝜃) = Σ_{𝑖∈𝑉} 𝜔_𝑖 ⋅ 𝐴_𝑖(𝑝_𝑖 | 𝐼; 𝜃) + Σ_{(𝑖,𝑗)∈𝐸} 𝑅(𝑝_𝑖, 𝑝_𝑗, 𝑡_𝑖𝑗, 𝑡_𝑗𝑖 | 𝐼; 𝜃)
• 𝐴_𝑖: part presence term
• 𝑅: pairwise term, combining the pairwise relationship map (weighted by 𝜔_𝑖𝑗) and the pairwise deformation cost (weighted by 𝝎_𝑖𝑗^{𝑡_𝑖𝑗})
• 𝜔_𝑖, 𝜔_𝑖𝑗, 𝝎_𝑖𝑗^{𝑡_𝑖𝑗} are currently learned by a latent structural SVM
Inference: (𝐩*, 𝐭*) = arg max_{𝐩,𝐭} 𝐹(𝐩, 𝐭 | 𝐼; 𝝎, 𝜃)
• Tree-structured, so it can be solved efficiently by dynamic programming
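For a chain of parts (the simplest tree), the dynamic-programming inference can be sketched as below. The unary and pairwise score tables are toy values; a full tree version would pass messages from leaves to the root in the same way.

```python
import numpy as np

def max_sum_chain(unary, pairwise):
    """MAP inference on a chain of parts by dynamic programming.
    unary[i][l]: score of part i at discrete location l.
    pairwise[i][l, m]: score of part i at l with part i+1 at m."""
    n = len(unary)
    msg = np.asarray(unary[-1], dtype=float)
    back = [None] * n
    for i in range(n - 2, -1, -1):          # pass messages leaf -> root
        s = np.asarray(pairwise[i], float) + msg[None, :]
        back[i + 1] = s.argmax(axis=1)      # best child state per parent state
        msg = np.asarray(unary[i], float) + s.max(axis=1)
    best = [int(msg.argmax())]              # backtrack root -> leaf
    for i in range(1, n):
        best.append(int(back[i][best[-1]]))
    return float(msg.max()), best
```

The cost is linear in the number of parts and quadratic in the number of candidate locations per part, which is what makes tree-structured inference tractable.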
23. Learning parameters 𝜃
Derive the type label for each patch:
• Use the relative position 𝑑_𝑖𝑗 to represent the pairwise relations
• Cluster the relative positions {𝑑_𝑖𝑗^𝑛}_{𝑛=1}^𝑁 over the whole training set
• Type label 𝑡_𝑖𝑗^𝑛: cluster index
• Mean relative position 𝑟_𝑖𝑗^{𝑡_𝑖𝑗}: cluster center
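This clustering step can be sketched with plain k-means; the experiments use 13 types per pair, but here 𝑇 = 3 and the toy points are purely illustrative, and the initialization/iteration details are assumptions.

```python
import numpy as np

def derive_types(rel_pos, T=3, iters=20, seed=0):
    """Cluster relative positions d_ij over the training set.
    The cluster index becomes the type label t_ij and the cluster
    center the mean relative position r_ij. Plain k-means sketch."""
    rng = np.random.default_rng(seed)
    centers = rel_pos[rng.choice(len(rel_pos), T, replace=False)]
    for _ in range(iters):
        # assign each sample to its nearest center
        labels = np.linalg.norm(rel_pos[:, None] - centers[None], axis=2).argmin(1)
        for t in range(T):
            if (labels == t).any():
                centers[t] = rel_pos[labels == t].mean(0)  # recompute r_ij
    return labels, centers
```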
24. Casting Full Connections into Convolutions
[Figure: part presence map of the elbow and pairwise relationship maps]
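The equivalence behind this casting is that a fully connected layer applied to a C-dimensional feature at one spatial position computes the same thing as a 1×1 convolution; sliding it over an H×W feature map makes the classifier fully convolutional. A small numerical check (all sizes and weights are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
C_in, C_out, H, W = 8, 5, 4, 6
Wfc = rng.standard_normal((C_out, C_in))   # FC weights, reused as 1x1 kernels
feat = rng.standard_normal((C_in, H, W))

# "1x1 convolution": one matrix multiply per spatial location, batched
conv_out = np.einsum('oc,chw->ohw', Wfc, feat)

# the same result by explicitly looping the FC layer over positions
fc_out = np.stack([[Wfc @ feat[:, y, x] for x in range(W)] for y in range(H)])
fc_out = fc_out.transpose(2, 0, 1)         # (H, W, C_out) -> (C_out, H, W)
```

Because the convolutional form has no fixed input size, the network can produce a dense score map over an image of arbitrary resolution in a single forward pass.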
25. PCP and PDJ on LSP dataset and FLIC dataset
Dataset  Method         Torso  Head  U.Leg  L.Leg  U.Arm  L.Arm  Mean PCP
LSP      DCNN           92.5   85.1  82.7   76.3   70.2   55.9   74.8
LSP      Ouyang et al.  85.8   83.1  76.5   72.2   63.3   46.6   68.6
FLIC     DCNN           87.0   98.8  -      -      96.5   84.0   91.1

[Figure: PDJ curves on LSP and FLIC]
27. Future Work
• Build end-to-end system to estimate human pose
• Consider combining local information and holistic view
• Beyond tree structure
30. Data Augmentation
• The amount of training data in existing datasets is insufficient to train deep CNNs
– Statistics of existing datasets (below)
– Number of parameters of AlexNet: 60 million
• Data augmentation is an effective way to prevent overfitting

Dataset       # Training images  # Testing images  Type
PARSE         100                205               Full body
LSP           1,000              1,000             Full body
LSP extended  10,000             -                 Full body
FLIC          3,987              1,016             Upper body
MPII          28,821             11,701            Full body
31. Data Augmentation (cont.)
• Random padding
• Rotating
– ±[2.5◦, 5◦, 7.5◦, 10◦, 15◦, 20◦]
• Flipping
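On the annotation side, these augmentations might be implemented as in the sketch below; the two-joint `swap` index map is a made-up example, not the real left/right joint ordering of any dataset.

```python
import numpy as np

ANGLES = [2.5, 5, 7.5, 10, 15, 20]   # degrees, applied with both signs

def rotate_points(pts, deg, center):
    """Rotate (x, y) joint annotations by deg degrees about center."""
    t = np.deg2rad(deg)
    R = np.array([[np.cos(t), -np.sin(t)],
                  [np.sin(t),  np.cos(t)]])
    return (np.asarray(pts, float) - center) @ R.T + center

def flip_points(pts, width, swap):
    """Mirror x-coordinates and swap left/right joint indices."""
    out = np.asarray(pts, float).copy()
    out[:, 0] = width - 1 - out[:, 0]
    return out[swap]
```

Note that flipping must also relabel the joints (left wrist becomes right wrist), which is why `swap` reorders the rows after mirroring.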
32. Evaluation Metrics
• Percentage of Correct Parts (PCP)
– Measures the percentage of correctly localized body parts
– A candidate body part is counted as correct if both of its segment endpoints lie within
50% of the ground-truth segment length of the corresponding annotated endpoints
• Percentage of Detected Joints (PDJ)
– Measures performance with a curve of the percentage of correctly localized joints as
the localization precision threshold varies; the threshold is normalized by a scale
defined as the distance between the left shoulder and right hip
– Invariant to scale
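The PCP rule can be written down directly; this sketch checks a single limb, with the 50% threshold exposed as a parameter.

```python
import numpy as np

def part_correct(pred, gt, thresh=0.5):
    """PCP for one limb: both predicted endpoints must lie within
    thresh * (ground-truth limb length) of the matching GT endpoints."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    limb_len = np.linalg.norm(gt[1] - gt[0])
    endpoint_err = np.linalg.norm(pred - gt, axis=1)
    return bool((endpoint_err <= thresh * limb_len).all())
```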
Editor's notes
Good afternoon everyone. It’s my honor to take my screening test here.
My name is Wei Yang. I’m from the IVP lab.
My talk is about articulated pose estimation with Deep learning methods.
We will first have a brief introduction about the task we address.
Then we discuss two methods based on deep learning.
Finally we conclude this talk and discuss future work.
According to Wikipedia, the goal of articulated pose estimation is to “recover the joint positions of articulated limbs,” as we show here for a man playing baseball.
There are lots of applications where being able to estimate human pose is useful. For example, pose estimation is helpful for recognizing actions. It also helps to parse clothing in fashion photographs. Recently, pose estimation has been successfully applied in human tracking and gaming systems.
However, in unconstrained images, human pose estimation can be a very hard problem because people can appear with a variety of poses, clothing, and body shapes. In the slides, you can see some very interesting and unusual examples that demonstrate how flexible the human pose is.
A classic approach for human pose estimation is to model the human as a set of parts, such as a head, torso, arm, and leg part. In 3D, these parts can be modeled as cylinders.
Pictorial structures use 2D part models, where geometric relations between parts are encoded by springs.
However, capturing the whole range of appearances using pictorial structures is still quite difficult. A big problem is that even projections of a simple cylinder into 2D yield many different appearances, so one usually has to explicitly evaluate many different possible in-plane orientations and foreshortenings in order to find a good match for a part template.
Yang and Ramanan propose mini-parts to approximate these transformations. In this case the mini-parts are tuned to represent near-vertical and near-horizontal limbs.
Recently, state-of-the-art performance on pose estimation has been achieved by deep learning methods.
1. Ouyang et al. [19] propose a multi-source deep model for constructing the non-linear representation from multiple information sources, i.e., mixture type, appearance score, and deformations.
2. DeepPose [26] estimates body part locations with a regressor in a holistic manner, then refines the joint predictions by using higher-resolution sub-images with a convolutional neural network. However, this method suffers from inaccuracy in the high-precision region.
We propose to study pose estimation in two ways
First, we study regression of joint locations by CNNs. We want to know how accurate this method is, and what the limitations of using only a single CNN regressor are.
We also study methods based on local image patches, and in future work we plan to incorporate deformable graphical models into the network.
Let I denote an image, and p denote part locations in an image. We want to learn a regression model that, given an image, outputs all the part locations. This is a non-trivial problem, and we use a CNN as our regression model because of its strong representation ability.
We adopt AlexNet as the basic network structure. This structure was proposed in 2012. It won the ImageNet competition by a large margin, and it was the first time a deep model was shown to be effective on a large-scale computer vision task.
The original input size of AlexNet is 227×227. We first simply warp the images to this size.
However, we found that the performance is very bad for two reasons:
1. This simple warping method changes the aspect ratio of people.
2. People appear at different scales of an image.
To keep the aspect ratio while making people in different images have the same scale, we first detect the rough location of the human, then crop the image with the detected bounding box. Finally, we do padding and warping. Note that we use mean RGB values instead of zeros to perform padding.
Since existing pose datasets are relatively small, we start by removing the last two fully connected layers.
The evaluation metric here is the Percentage of Correct Parts, the higher the better.
However, the performance is far below the baseline method.
This shows that the model complexity is not high enough to model this complex problem.
Then we increase the complexity by adding two fully connected layers. We achieve 56.9 mean PCP, which is still not better than the baseline method.
We observe that the location of one part may help to locate another part. For example, the location of the elbows may help to locate the wrists. Hence we add another fully connected layer after the original output layer. We also add two layers as a second branch to increase the variation of the model. Finally, we sum the outputs from the two branches to get the final prediction.
This time we achieve 60.9 mean PCP, which is comparable with Yang’s method.
Here is the summary of the experimental results. By further doing data fusion on the test set, we finally get 62.5 mean PCP.
This is the visualized results on LSP dataset. We can see that this method has limitations in high precision regions, such as lower arms and lower legs.
It is worth mentioning that this method is very fast, since predictions can be obtained by batched forward propagation.
Here we provided some failure cases. The failures are mainly caused by articulation, fore-shortening, self occlusion or occlusions caused by clothing, and cluttered background or overlapping people.
As mentioned before, although this method is very fast, it still has limitations; for example, it gives only one prediction per image. Hence we turn to another kind of method, based on local image information.
We observe that local image patches are not only able to capture part presence, but also able to reason pairwise spatial relationships.
For example, the patch centered at the wrist can predict the relative position of the elbow, and the patch centered at the elbow can reliably predict the positions of the shoulder and wrist.
We use mixture model to define different types of spatial relationships. The right panel shows typical spatial relationships the wrist can have with its neighbor elbow.
The left panel shows the typical spatial relationships the elbow can have with its two neighbors, say shoulder and wrist.
Based on this observation, we can define the human pose as a tree-structured graph, where each node denotes the position of a part, and the edges denote the pairwise spatial relationships.
We define the score function of part locations p and pairwise relation types t. It is computed by summing the unary appearance term and the pairwise relationship term. The unary term is the part presence map, indicating the probability that part i appears at each location of the image. The pairwise term consists of two parts: the first is the pairwise relationship map, and the second is the deformation cost. θ denotes the parameters, which are learned by the CNN.
Inference is to find the positions and mixture types that maximize this score. As the relational graph is tree-structured, this can be efficiently solved by dynamic programming.
Here we talk about how to learn θ. Given an image, we want to produce a score map indicating its probability of a specific type. This is done by learning a multi-class classifier on local image patches. First we need to derive the type label for each patch.
Then we use two convolutional layers with 1×1 kernels to replace the original fully connected layers. The network then becomes a fully convolutional network and can perform convolutions on input images of arbitrary size; the output is the score map for each type, as we want.
Then we can easily compute the part presence map and the pairwise relationship maps, as the figure illustrates.
For example, to compute the part presence map of the elbow, we just add together all the score maps associated with elbow–shoulder and elbow–wrist.
To compute pairwise relationship maps, we need to perform marginalization.
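A toy reading of these last two notes: given per-type score maps for the elbow's relations to its two neighbors, the presence map sums everything, and the relationship map for one neighbor marginalizes out the other. The tensor shapes, the random values, and sum-marginalization (rather than any normalization the actual model may use) are all assumptions for illustration.

```python
import numpy as np

# One score map per combination of the elbow's relation type to the
# shoulder (T1 types) and to the wrist (T2 types); shapes illustrative.
rng = np.random.default_rng(2)
T1, T2, H, W = 3, 3, 4, 4
scores = rng.random((T1, T2, H, W))

# Part presence map: add all score maps associated with the elbow.
presence = scores.sum(axis=(0, 1))           # shape (H, W)

# Pairwise relationship maps for elbow-shoulder: marginalize out
# the wrist types, keeping one map per shoulder-relation type.
rel_shoulder = scores.sum(axis=1)            # shape (T1, H, W)
```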