Articulated human pose estimation by deep learning
1. Articulated Human Pose Estimation by Deep Learning
Wei Yang
Supervisor: Xiaogang Wang, Wanli Ouyang
wyang@ee.cuhk.edu.hk
2. Outline
• Introduction
• Regression by Convolutional Neural Network
• Deformable Convolutional Neural Networks
• Discussion and Future work
2016/8/11 2
3. Introduction
Articulated body pose estimation
“recovers the pose of an articulated body, which consists of
joints and rigid parts using image-based observations.”
6. Classic Approaches
Fischler & Elschlager 1973
Felzenszwalb & Huttenlocher 2005
Pictorial Structure
• Unary Templates
• Pairwise Springs
Yang & Ramanan 2011
Mixtures of “mini-parts”
• Mixture type 𝑚𝑖 of part 𝑖
• Unary template for part 𝑖 with mixture 𝑚𝑖
• Pairwise springs between part 𝑖 with mixture 𝑚𝑖 and part 𝑗 with mixture 𝑚𝑗
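The "unary templates plus pairwise springs" scoring of a pictorial structure can be sketched as follows; the two parts, the match scores, the rest offset, and the spring constant are all toy values for illustration, not taken from the papers.

```python
import numpy as np

# Toy pictorial-structure scoring: unary template scores minus
# quadratic "spring" deformation costs. All values are illustrative.
locs = {"torso": np.array([5.0, 5.0]), "head": np.array([5.0, 2.0])}
unary = {"torso": 0.9, "head": 0.8}                  # template match scores
rest = {("torso", "head"): np.array([0.0, -3.0])}    # preferred head offset

def spring_cost(pi, pj, r, k=0.5):
    # penalty grows quadratically with deviation from the rest offset
    d = (pj - pi) - r
    return k * float(d @ d)

def ps_score(locs, unary, rest):
    s = sum(unary.values())
    s -= sum(spring_cost(locs[i], locs[j], r) for (i, j), r in rest.items())
    return s
```

When the head sits exactly at the spring's rest offset the deformation cost vanishes and the score is just the sum of the unary terms; shifting the head away lowers the score quadratically.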
7. Deep Learning Methods
Multi-source Deep Learning
• Candidate estimations
• Deep model uses multi-source
including appearance score, mixture
type, and deformation.
Ouyang et al. 2014
DeepPose
• Reasons about pose in a holistic fashion
• Refines the joint predictions using higher-resolution sub-images
Toshev & Szegedy 2014
8. We propose to study pose estimation in two ways
• Holistic View
–Regression of joint locations by convolutional neural
networks (CNNs)
• Local information
–Deformable Convolutional Neural Networks
10. Formulation
• Image: 𝐼
• Part locations: 𝐩 = {𝑝_𝑖}_{𝑖=1}^𝑃 = {(𝑥_𝑖, 𝑦_𝑖)}_{𝑖=1}^𝑃
• Location of part 𝑖: 𝑝_𝑖 = (𝑥_𝑖, 𝑦_𝑖)
• Regressor: 𝜓(𝐼; 𝜃) = 𝐩, learned by a deep CNN
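Concretely, the regressor maps an image to a 2𝑃-dimensional vector of stacked joint coordinates. A minimal sketch, where a random linear map stands in for the CNN 𝜓; the value 𝑃 = 14, the 8×8 input size, and the weights are assumptions for illustration only.

```python
import numpy as np

P = 14                       # number of body parts (illustrative)
rng = np.random.default_rng(0)

def encode(parts):
    """Stack P (x, y) joint locations into a 2P-dim regression target."""
    return np.asarray(parts, dtype=float).reshape(-1)

def decode(y):
    """Inverse: regressor output vector -> P (x, y) part locations."""
    return y.reshape(-1, 2)

# Linear stand-in for psi(I; theta): flattened 8x8 image times weights.
theta = rng.standard_normal((2 * P, 64))
def psi(image):
    return theta @ image.reshape(-1)

pred = decode(psi(rng.standard_normal((8, 8))))  # P predicted (x, y) pairs
```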
11. Basic Architecture of the CNN Regressor
• AlexNet
– Krizhevsky, Sutskever, and Hinton, NIPS 2012
– The first time a deep model was shown to be effective on a large-scale computer vision task.
12. Normalize Scale of Human Body
• Size of the CNN input is fixed
• Simple warping changes the aspect ratio of people
• People appear at different scales of an image
Pipeline: 1. Original image → 2. Human detection [Ouyang et al. CVPR 2014] → 3. Crop by bounding box → 4. Pad with mean RGB value
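The crop-and-pad step might look like the sketch below; the bounding box is assumed to come from the human detector, and the nearest-neighbor warp is a stand-in for whatever interpolation was actually used.

```python
import numpy as np

def crop_and_pad(image, bbox, out_size=227):
    """Crop the detected person box, pad it to a square with the
    image's mean RGB value, then resize (nearest neighbor here)
    to the fixed CNN input size."""
    x0, y0, x1, y1 = bbox
    crop = image[y0:y1, x0:x1]
    h, w = crop.shape[:2]
    side = max(h, w)
    # fill with the mean RGB value rather than zeros
    canvas = np.empty((side, side, 3), dtype=image.dtype)
    canvas[:] = image.reshape(-1, 3).mean(axis=0).astype(image.dtype)
    oy, ox = (side - h) // 2, (side - w) // 2
    canvas[oy:oy + h, ox:ox + w] = crop
    # nearest-neighbor warp to the fixed CNN input size
    idx = np.arange(out_size) * side // out_size
    return canvas[idx][:, idx]
```

Padding to a square before warping is what preserves the person's aspect ratio; cropping to the detection first is what normalizes scale across images.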
20. Motivation
• Local image patches are able to capture:
– Part presence
– Pairwise part spatial relationships
Number of mixture types for each pair: 6
• Lower arm, 1 neighbor: 6^1 = 6 relationships
• Upper arm, 2 neighbors: 6^2 = 36 relationships
[Chen & Yuille NIPS 2014]
21. Tree-structured Relational Graph
• 𝑇 = (𝑉, 𝐸)
– 𝑉: positions of body parts
– 𝐸: pairwise relationships between parts
• 𝐩 = {𝑝_𝑖} = {(𝑥_𝑖, 𝑦_𝑖)}
– 𝑝_𝑖: pixel location of part 𝑖
• 𝐭 = {𝑡_𝑖𝑗, 𝑡_𝑗𝑖 | (𝑖, 𝑗) ∈ 𝐸}
– Pairwise relationship, defined by relative position
– 𝑡_𝑖𝑗 ∈ {1, …, 𝑇_𝑖𝑗}
– In experiments: 13 types for each pair (𝑖, 𝑗) ∈ 𝐸
22. Formulation
𝐹(𝐩, 𝐭 | 𝐼; 𝝎, 𝜃) = Σ_{𝑖∈𝑉} 𝜔_𝑖 ⋅ 𝐴_𝑖(𝑝_𝑖 | 𝐼; 𝜃) + Σ_{(𝑖,𝑗)∈𝐸} 𝑅(𝑝_𝑖, 𝑝_𝑗, 𝑡_𝑖𝑗, 𝑡_𝑗𝑖 | 𝐼; 𝜃)
• 𝐴_𝑖: part presence term
• 𝑅: pairwise term, combining the pairwise relationship map (weighted by 𝜔_𝑖𝑗) and the pairwise deformation cost (weighted by 𝝎_𝑖𝑗^{𝑡_𝑖𝑗})
• 𝜔_𝑖, 𝜔_𝑖𝑗, 𝝎_𝑖𝑗^{𝑡_𝑖𝑗} are currently learned by a latent structural SVM
Inference: (𝐩*, 𝐭*) = arg max_{𝐩,𝐭} 𝐹(𝐩, 𝐭 | 𝐼; 𝝎, 𝜃)
• Tree-structured, so it can be solved efficiently by dynamic programming
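For a chain of parts (the simplest tree), the dynamic-programming inference can be sketched as below. The unary and pairwise score tables are toy values; a full tree version would pass messages from leaves to the root in the same way.

```python
import numpy as np

def max_sum_chain(unary, pairwise):
    """MAP inference on a chain of parts by dynamic programming.
    unary[i][l]: score of part i at discrete location l.
    pairwise[i][l, m]: score of part i at l with part i+1 at m."""
    n = len(unary)
    msg = np.asarray(unary[-1], dtype=float)
    back = [None] * n
    for i in range(n - 2, -1, -1):          # pass messages leaf -> root
        s = np.asarray(pairwise[i], float) + msg[None, :]
        back[i + 1] = s.argmax(axis=1)      # best child state per parent state
        msg = np.asarray(unary[i], float) + s.max(axis=1)
    best = [int(msg.argmax())]              # backtrack root -> leaf
    for i in range(1, n):
        best.append(int(back[i][best[-1]]))
    return float(msg.max()), best
```

The cost is linear in the number of parts and quadratic in the number of candidate locations per part, which is what makes tree-structured inference tractable.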
23. Learning parameters 𝜃
Derive the type label for each patch:
• Use the relative position 𝑑_𝑖𝑗 to represent the pairwise relations
• Cluster the relative positions {𝑑_𝑖𝑗^𝑛}_{𝑛=1}^𝑁 over the whole training set
• Type label 𝑡_𝑖𝑗^𝑛: cluster index
• Mean relative position 𝑟_𝑖𝑗^{𝑡_𝑖𝑗}: cluster center
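This clustering step can be sketched with plain k-means; the experiments use 13 types per pair, but here 𝑇 = 3 and the toy points are purely illustrative, and the initialization/iteration details are assumptions.

```python
import numpy as np

def derive_types(rel_pos, T=3, iters=20, seed=0):
    """Cluster relative positions d_ij over the training set.
    The cluster index becomes the type label t_ij and the cluster
    center the mean relative position r_ij. Plain k-means sketch."""
    rng = np.random.default_rng(seed)
    centers = rel_pos[rng.choice(len(rel_pos), T, replace=False)]
    for _ in range(iters):
        # assign each sample to its nearest center
        labels = np.linalg.norm(rel_pos[:, None] - centers[None], axis=2).argmin(1)
        for t in range(T):
            if (labels == t).any():
                centers[t] = rel_pos[labels == t].mean(0)  # recompute r_ij
    return labels, centers
```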
24. Casting Full Connections into Convolutions
[Figure: part presence map of the elbow and pairwise relationship maps]
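The equivalence behind this casting is that a fully connected layer applied to a C-dimensional feature at one spatial position computes the same thing as a 1×1 convolution; sliding it over an H×W feature map makes the classifier fully convolutional. A small numerical check (all sizes and weights are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
C_in, C_out, H, W = 8, 5, 4, 6
Wfc = rng.standard_normal((C_out, C_in))   # FC weights, reused as 1x1 kernels
feat = rng.standard_normal((C_in, H, W))

# "1x1 convolution": one matrix multiply per spatial location, batched
conv_out = np.einsum('oc,chw->ohw', Wfc, feat)

# the same result by explicitly looping the FC layer over positions
fc_out = np.stack([[Wfc @ feat[:, y, x] for x in range(W)] for y in range(H)])
fc_out = fc_out.transpose(2, 0, 1)         # (H, W, C_out) -> (C_out, H, W)
```

Because the convolutional form has no fixed input size, the network can produce a dense score map over an image of arbitrary resolution in a single forward pass.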
25. PCP and PDJ on LSP dataset and FLIC dataset
Dataset  Method         Torso  Head  U.Leg  L.Leg  U.Arm  L.Arm  Mean PCP
LSP      DCNN           92.5   85.1  82.7   76.3   70.2   55.9   74.8
LSP      Ouyang et al.  85.8   83.1  76.5   72.2   63.3   46.6   68.6
FLIC     DCNN           87.0   98.8  -      -      96.5   84.0   91.1

[Figure: PDJ curves on LSP and FLIC]
27. Future Work
• Build end-to-end system to estimate human pose
• Consider combining local information and holistic view
• Beyond tree structure
30. Data Augmentation
• The amount of training data in existing datasets is insufficient to train deep CNNs
– Statistics of existing datasets (below)
– Number of parameters of AlexNet: 60 million
• Data augmentation is an effective way to prevent overfitting

Dataset       # Training images  # Testing images  Type
PARSE         100                205               Full body
LSP           1,000              1,000             Full body
LSP extended  10,000             -                 Full body
FLIC          3,987              1,016             Upper body
MPII          28,821             11,701            Full body
31. Data Augmentation (cont.)
• Random padding
• Rotating
– ±[2.5◦, 5◦, 7.5◦, 10◦, 15◦, 20◦]
• Flipping
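On the annotation side, these augmentations might be implemented as in the sketch below; the two-joint `swap` index map is a made-up example, not the real left/right joint ordering of any dataset.

```python
import numpy as np

ANGLES = [2.5, 5, 7.5, 10, 15, 20]   # degrees, applied with both signs

def rotate_points(pts, deg, center):
    """Rotate (x, y) joint annotations by deg degrees about center."""
    t = np.deg2rad(deg)
    R = np.array([[np.cos(t), -np.sin(t)],
                  [np.sin(t),  np.cos(t)]])
    return (np.asarray(pts, float) - center) @ R.T + center

def flip_points(pts, width, swap):
    """Mirror x-coordinates and swap left/right joint indices."""
    out = np.asarray(pts, float).copy()
    out[:, 0] = width - 1 - out[:, 0]
    return out[swap]
```

Note that flipping must also relabel the joints (left wrist becomes right wrist), which is why `swap` reorders the rows after mirroring.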
32. Evaluation Metrics
• Percentage of Correct Parts (PCP)
– Measures the percentage of correctly localized body parts
– A candidate body part is counted as correct if both of its segment endpoints lie within
50% of the ground-truth segment length of the corresponding annotated endpoints
• Percentage of Detected Joints (PDJ)
– Measures performance with a curve of the percentage of correctly localized joints as
the localization precision threshold varies; the threshold is normalized by a scale
defined as the distance between the left shoulder and right hip
– Invariant to scale
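The PCP rule can be written down directly; this sketch checks a single limb, with the 50% threshold exposed as a parameter.

```python
import numpy as np

def part_correct(pred, gt, thresh=0.5):
    """PCP for one limb: both predicted endpoints must lie within
    thresh * (ground-truth limb length) of the matching GT endpoints."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    limb_len = np.linalg.norm(gt[1] - gt[0])
    endpoint_err = np.linalg.norm(pred - gt, axis=1)
    return bool((endpoint_err <= thresh * limb_len).all())
```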
Editor's notes
Good afternoon everyone. It’s my honor to take my screening test here.
My name is Wei Yang. I’m from the IVP lab.
My talk is about articulated pose estimation with Deep learning methods.
We will first have a brief introduction about the task we address.
Then we discuss two methods based on deep learning.
Finally we conclude this talk and discuss future work.
According to Wikipedia, the goal of articulated pose estimation is to “recover the joint positions of articulated limbs,” as we show here for a man playing baseball.
There are lots of applications where being able to estimate human pose is useful. For example, pose estimation is helpful for recognizing actions. It also helps to parse clothing in fashion photographs. Recently, pose estimation has been successfully applied in human tracking and gaming systems.
However, in unconstrained images, human pose estimation can be a very hard problem because people can appear with a variety of poses, clothing, and body shapes. In the slides, you can see some very interesting and unusual examples that demonstrate how flexible the human pose is.
A classic approach for human pose estimation is to model the human as a set of parts, such as a head, torso, arm, and leg part. In 3D, these parts can be modeled as cylinders.
Pictorial structures use 2D part models, where geometric relations between parts are encoded by springs.
However, capturing the whole range of appearances using pictorial structures is still quite difficult. A big problem is that even projections of a simple cylinder into 2D yield many different appearances, so one usually has to explicitly evaluate many different possible in-plane orientations and foreshortenings in order to find a good match for a part template.
Yang and Ramanan propose mini-parts to approximate these transformations. In this case the mini-parts are tuned to represent near-vertical and near-horizontal limbs.
Recently, state-of-the-art performance on pose estimation has been achieved by deep learning methods.
1. Ouyang et al. [19] propose a multi-source deep model for constructing the non-linear representation from multiple information sources, i.e., mixture type, appearance score, and deformations.
2. DeepPose [26] estimates body part locations with a regressor in a holistic manner, then refines the joint predictions by using higher-resolution sub-images with a convolutional neural network. However, this method suffers from inaccuracy in the high-precision region.
We propose to study pose estimation in two ways
First, we study regression of joint locations by CNNs. We want to know how accurate this method is, and what the limitations of using only a single CNN regressor are.
We also study methods based on local image patches, and in future work we plan to incorporate deformable graphical models into the network.
Let I denote an image, and p denote part locations in an image. We want to learn a regression model that, given an image, outputs all the part locations. This is a non-trivial problem, and we use a CNN as our regression model because of its strong representation ability.
We adopt AlexNet as the basic network structure. This structure was proposed in 2012. It won the ImageNet competition by a large margin, and it was the first time a deep model was shown to be effective on a large-scale computer vision task.
The original input size of AlexNet is 227×227. We first simply warp the images to this size.
However, we found that the performance is very bad for two reasons:
1. This simple warping method changes the aspect ratio of people.
2. People appear at different scales of an image.
To keep the aspect ratio while making people in different images have the same scale, we first detect the rough location of the human, then crop the image with the detected bounding box. Finally, we do padding and warping. Note that we use mean RGB values instead of zeros to perform padding.
Since existing pose datasets are relatively small, we start by removing the last two fully connected layers.
The evaluation metric here is the Percentage of Correct Parts, the higher the better.
However, the performance is far below the baseline method.
This shows that the model complexity is not high enough to model this complex problem.
Then we increase the complexity by adding two fully connected layers. We achieve 56.9 mean PCP, which is still not better than the baseline method.
We observe that the location of one part may help to locate another part. For example, the location of the elbows may help to locate the wrists. Hence we add another fully connected layer after the original output layer. We also add two layers as a second branch to increase the variation of the model. Finally, we sum the outputs from the two branches to get the final prediction.
This time we achieve 60.9 mean PCP, which is comparable with Yang’s method.
Here is the summary of the experimental results. By further doing data fusion on the test set, we finally get 62.5 mean PCP.
This is the visualized results on LSP dataset. We can see that this method has limitations in high precision regions, such as lower arms and lower legs.
It is worth mentioning that this method is very fast, since predictions can be obtained by batched forward propagation.
Here we provided some failure cases. The failures are mainly caused by articulation, fore-shortening, self occlusion or occlusions caused by clothing, and cluttered background or overlapping people.
As mentioned before, although this method is very fast, it still has limitations; for example, it gives only one prediction per image. Hence we turn to another kind of method, based on local image information.
We observe that local image patches are not only able to capture part presence, but also able to reason pairwise spatial relationships.
For example, the patch centered at the wrist can predict the relative position of the elbow, and the patch centered at the elbow can reliably predict the positions of the shoulder and wrist.
We use mixture model to define different types of spatial relationships. The right panel shows typical spatial relationships the wrist can have with its neighbor elbow.
The left panel shows the typical spatial relationships the elbow can have with its two neighbors, say shoulder and wrist.
Based on this observation, we can define the human pose as a tree-structured graph, where each node denotes the position of a part, and the edges denote the pairwise spatial relationships.
We define the score function of part locations p and pairwise relation types t. It is computed by summing the unary appearance term and the pairwise relationship term. The unary term is the part presence map, indicating the probability that part i appears at each location of the image. The pairwise term consists of two parts: the first is the pairwise relationship map, and the second is the deformation cost. θ denotes the parameters, which are learned by the CNN.
Inference is to find the positions and mixture types that maximize this score. As the relational graph is tree-structured, this can be efficiently solved by dynamic programming.
Here we talk about how to learn θ. Given an image, we want to produce a score map indicating its probability of a specific type. This is done by learning a multi-class classifier on local image patches. First we need to derive the type label for each patch.
Then we use two convolutional layers with 1×1 kernels to replace the original fully connected layers. The network then becomes a fully convolutional network and can perform convolutions on input images of arbitrary size; the output is the score map for each type, as we want.
Then we can easily compute the part presence map and the pairwise relationship maps, as the figure illustrates.
For example, to compute the part presence map of the elbow, we just add together all the score maps associated with elbow–shoulder and elbow–wrist.
To compute pairwise relationship maps, we need to perform marginalization.
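A toy reading of these last two notes: given per-type score maps for the elbow's relations to its two neighbors, the presence map sums everything, and the relationship map for one neighbor marginalizes out the other. The tensor shapes, the random values, and sum-marginalization (rather than any normalization the actual model may use) are all assumptions for illustration.

```python
import numpy as np

# One score map per combination of the elbow's relation type to the
# shoulder (T1 types) and to the wrist (T2 types); shapes illustrative.
rng = np.random.default_rng(2)
T1, T2, H, W = 3, 3, 4, 4
scores = rng.random((T1, T2, H, W))

# Part presence map: add all score maps associated with the elbow.
presence = scores.sum(axis=(0, 1))           # shape (H, W)

# Pairwise relationship maps for elbow-shoulder: marginalize out
# the wrist types, keeping one map per shoulder-relation type.
rel_shoulder = scores.sum(axis=1)            # shape (T1, H, W)
```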