2D Pose Estimation Using Active Shape Models and
Learned Entropy Field Approximations
Ian Dewancker
Department of Computer Science
University of British Columbia
Image Understanding II Project, April 23 2012
idewanck@cs.ubc.ca
Abstract
This report describes the implementation of a framework for exploring a full-body, generative 2D pose
estimation approach in single, broadcast-quality frames. A training dataset is manually constructed
from image frames taken from NBA footage. Poses can be generated using an active shape model
consisting of a simple linear combination requiring only a small number of parameters. To guide the
search for an optimal pose, an energy function is proposed that compares image entropy with a learned
entropy field estimate computed from the current pose hypothesis. Current results are presented and
directions for future work are discussed.
1 Introduction
Pose estimation, that is, understanding human body configurations in images, is an important problem
in computer vision, with applications ranging from surveillance and human-computer interaction to
motion analysis. The problem is challenging because of the high-dimensional search space of body
poses, the presence of ambiguous poses, and the need to identify the human body in each image.
Determining pose in video sequences is likely a key stepping stone towards solving the more difficult
problem of general action recognition.
This report outlines a generative framework for 2D pose search in NBA video frames. By 2D pose
we mean that the joints are estimated only in image space, not in 3D. To create a representative
model of the pose space, an application was developed to annotate images quickly, producing a
low-resolution training set of about 150 poses. Following the core ideas behind active shape models,
principal component analysis is applied to a mean-adjusted set of training poses to find the modes
of greatest variation. These modes form the basis of a simple linear model that can be used to
generate poses. A bounding box centered on each player's chest is assumed to be given; the joint
positions are represented as scaled offsets from the chest point.
Next, a method for computing the likelihood of the image data is presented. Image entropy is
explored as a texture similarity measure. Two regression methods, neural networks and radial basis
functions, are explored to learn a function that takes the current 2D pose and generates a 2D entropy
field (an estimate of the expected entropy signature for a given pose) that can then be directly
compared with the entropy of the image. Further steps are discussed to improve this entropy
estimation, speed up the computation, and finally perform the search using a prior on the pose space
with a suitable stochastic method.
2 Related Work
There has been a fair amount of research effort spent investigating this problem, and as a result a
diverse array of approaches exists. One approach formulates the problem as a regression from 2D
silhouette data to 3D joint poses using radial basis functions [2]. The drawback of this approach is
that it requires very clean segmentation between the human (foreground) and the background, which
is very challenging in real scenes. Another recent approach uses histograms of oriented gradients
(HOG) to detect limbs and body parts in images and then estimate a likely pose [5]. One drawback
of this approach is that it requires fairly high-resolution images of the person's body (roughly
500×500 pixels) to successfully detect the limbs and return an estimated pose.

Other ideas have come from looking at sequences of frames, computing a motion signature from
the optical flow of the image, and matching this signature to a database of templates [11]. The
template database contains known poses for each motion signature, so the estimated pose can be
matched from these examples. Finally, detection and tracking of sports players and pedestrians has
been well studied, with fairly robust solutions in existence [6].
3 Problem Definition
At a high level, we can view the task of pose estimation as an inference problem, where we are
trying to estimate the most likely pose X (a vector representing the 2D pose of a player) given the
image data I (the pixel data within a provided bounding box in a single video frame). The pose
estimation problem then amounts to finding the most likely pose under the posterior, as shown in
Equation 2.

For this framework, we present the following definition of the pose space. Poses exist as elements
of a 32-dimensional space, where each pose is represented as 16 pairs of normalized 2D co-ordinates,
one per joint, as shown in the figure below. The center chest point (highlighted in red in the figure)
represents the point (0, 0) and every other point is defined as an offset from this point; for example,
the head could be located at (0, −0.2). To normalize the image co-ordinates of the joints, a scaling
relative to the height of the provided bounding box is performed, that is, every joint location offset
in pixel co-ordinates is divided by the height of the bounding box.
X = \begin{bmatrix} \mathrm{head}_u & \mathrm{head}_v & \cdots & \mathrm{rfoot}_u & \mathrm{rfoot}_v \end{bmatrix}^\top, \quad I = \text{Image Data}   (1)

\arg\max_X P(X \mid I) = \arg\max_X \, \alpha \, P(I \mid X) \cdot P(X)   (2)
Figure 1: 2D Pose Vector and Problem Definition
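To make the representation concrete, a minimal sketch of this chest-relative, height-normalized encoding is given below; the function name and data layout are illustrative assumptions, not taken from the project code.

```python
import numpy as np

def normalize_pose(joints_px, chest_px, bbox_height):
    """Convert 16 (u, v) joint positions in pixels into a 32-D pose vector X.

    Each joint is expressed as an offset from the chest point and divided by
    the bounding-box height, matching Equation 1 and Figure 1.
    """
    joints_px = np.asarray(joints_px, dtype=float)    # shape (16, 2), pixel co-ordinates
    chest_px = np.asarray(chest_px, dtype=float)      # chest (u, v) in pixels
    offsets = (joints_px - chest_px) / float(bbox_height)
    return offsets.reshape(-1)                        # (head_u, head_v, ..., rfoot_u, rfoot_v)

# Example: a head annotated 40 px above the chest in a 200 px tall bounding box
# yields an offset of (0, -0.2), as in the text.
```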
We approach the problem with a generative model and look to design a solution that samples
hypotheses from the pose space and evaluates their likelihood given the observed data I. An
alternative approach in machine learning is to use discriminative models, which are designed to
directly map image information and features computed from the data I into likely pose estimates [1].
Here we pursue a generative approach and outline steps towards a full solution. There are two
important components in a generative approach: a model to generate poses from the pose space,
and a method of evaluating how well a generated pose fits the data (the likelihood).
4 Training Data Generation
To better understand and represent the pose space, an application was developed to generate example
poses from frames of data. The application was written in Python using a game library and allowed
the user to quickly click to generate skeletons fit to the image. The user could scale skeletons
and manipulate joint positions with the mouse.
Figure 2: Screenshot of Pose Training Application
The training examples were saved in plaintext format to a file. Players were chosen that were
relatively isolated in the frame, that is, not occluded by other players; however, self-occlusion was
present. Approximately 150 poses were created, but more could easily be generated.
5 Active Shape Models
A well-studied approach for modeling shape variation comes from active shape models [10]. A
parameterized linear model can be estimated by using principal component analysis (PCA) to find
the main axes, or principal components, of the data cloud. This can be done fairly simply by taking
a matrix M consisting of the N training poses, each offset by the mean training vector, as shown in
Equations 3 and 4.

The principal components can then be computed by taking the SVD of M. The matrix U holds in
its columns a suitable basis for the pose space of our training set. We take only the first k columns
of U and store them in a new matrix Φ. New poses can then be generated by adding a linear
combination of the principal components to the mean pose, as in Equation 5.
\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i   (3)

M = \begin{bmatrix} (x_1 - \bar{x}) & (x_2 - \bar{x}) & \cdots & (x_N - \bar{x}) \end{bmatrix}, \quad U S V^\top = \mathrm{SVD}(M), \quad \Phi = \begin{bmatrix} u_1 & \cdots & u_k \end{bmatrix}   (4)

X = \bar{x} + \Phi b   (5)
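As a rough illustration of Equations 3–5, the following numpy sketch fits the shape model and generates new poses; the variable names and the (N, 32) data layout are assumptions, not details from the project code.

```python
import numpy as np

def fit_shape_model(poses, k=3):
    """Fit a linear active-shape-style model from training poses.

    poses : array of shape (N, 32), one 32-D pose vector per row.
    Returns the mean pose and the first k principal modes (Equations 3 and 4).
    """
    mean_pose = poses.mean(axis=0)                 # Equation 3
    M = (poses - mean_pose).T                      # columns are x_i - mean (Equation 4)
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    Phi = U[:, :k]                                 # first k modes of variation
    return mean_pose, Phi

def generate_pose(mean_pose, Phi, b):
    """Generate a new pose X = mean + Phi @ b (Equation 5)."""
    return mean_pose + Phi @ np.asarray(b, dtype=float)

# Example usage with random stand-in data (150 training poses):
# poses = np.random.randn(150, 32)
# mean_pose, Phi = fit_shape_model(poses, k=3)
# X = generate_pose(mean_pose, Phi, [0.5, -0.2, 0.0])
```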
The figure below shows the mean pose offset along the first 3 modes. The top row of the figure
shows the mean pose modified by the first mode only, the middle row by the second, and the bottom
row by the third. Intuitively, these modes adjust the limbs and posture of the skeletons along the
dimensions of largest variance learned from the training set.
Figure 3: Variation along first 3 shape modes
6 Image Likelihood
With a shape model complete and capable of generating poses, the next step was to investigate
possible methods for determining the likelihood of an image given a pose, that is, a method of
computing P(I | X). One idea that has been explored is edge matching using an algorithm known
as chamfer matching; however, edge data is often noisy, which leads to poses being drawn to areas
dense with edges. Another idea is to build a colour model for players and poses from the training
data so that a pose could be matched to the image, but this is also fragile, since players have very
different skin and jersey colours, and colour is very sensitive to illumination changes. What was
sought was a measure that captured slightly more information than edges, but was less sensitive
than colour to player and illumination changes.
6.1 Image Entropy
Entropy is a statistical measure of randomness that can be used to characterize the texture of an
image. It measures the variability within a set of measurements; if we take the measurements to be
the grayscale intensities in a 9×9 window around each pixel, we can compute the entropy of the
grayscale histogram after binning the values into 12 bins, as in Equation 6.

Intuitively, players should exhibit high-entropy patterns around the edges of their limbs and the text
on their jerseys, and lower-entropy patterns on the interior areas of their skin and non-text jersey
regions, as shown in the figure below.
H(Y) = -\sum_{i=1}^{N} p(y_i) \log_2 p(y_i)   (6)
Figure 4: Image Entropy Calculation and Example
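A minimal sketch of this local entropy computation (9×9 window, 12 bins) is shown below; the function name and the normalization of intensities to [0, 1] are my own choices, not details from the project code.

```python
import numpy as np

def local_entropy(gray, window=9, bins=12):
    """Compute a per-pixel entropy map from a grayscale image (Equation 6).

    gray : 2-D float array with values in [0, 1].
    For each pixel, the grayscale values in a window x window neighbourhood
    are binned into a histogram and the Shannon entropy of that histogram
    is stored at the pixel.
    """
    pad = window // 2
    padded = np.pad(gray, pad, mode="reflect")
    H = np.zeros_like(gray)
    for r in range(gray.shape[0]):
        for c in range(gray.shape[1]):
            patch = padded[r:r + window, c:c + window]
            hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
            p = hist / hist.sum()
            p = p[p > 0]                       # ignore empty bins (0 log 0 = 0)
            H[r, c] = -np.sum(p * np.log2(p))
    return H
```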
6.2 Learning the Entropy Field Approximation
What was required next was a method to generate an entropy field, which we define to be a 2D
scalar field produced from a pose vector, that could be compared against the entropy of the actual
image. In other words, what should the entropy image look like for a given skeleton pose? Two
common regression methods were experimented with, with varying degrees of success, to estimate
the entropy field H_{u,v}(X); note that the 32-dimensional vector X is augmented with 2 more
dimensions to capture the (u, v) co-ordinates of the 2D entropy field.
The first regression model was a single-layer neural network with 10000 internal nodes. A neural
network is a parametric regression approach that learns a set of weights combining inputs into
internal nodes and internal nodes into outputs through activation functions, as shown in Equation
7 [9]. Several parameters were experimented with, including the number of internal nodes and the
type of activation function. The best results for the example are shown in the figure below.
H_{u,v}(X) = \sigma\left( \sum_{j=0}^{10000} w^{(2)}_{kj} \, h\left( \sum_{i=0}^{34} w^{(1)}_{ji} x_i \right) \right)   (7)
Radial basis functions, or RBFs, are also universal function approximators; however, they take a
non-parametric approach, storing centers taken from the training data and using them to interpolate
new values [8] [7]. We used Gaussian basis functions and 5000 centers to produce the results shown
in the figure below. As can be seen, the RBF does a better job than the neural net, but it is still not
great.
H_{u,v}(X) = \sum_{i=1}^{5000} w_i \, \phi(\lVert X - c_i \rVert)   (8)
Figure 5: From left to right: Skeleton Pose, Ground Truth Entropy, ANN Entropy, RBF Entropy
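For illustration, a minimal Gaussian RBF interpolator in the spirit of Equation 8 might look like the following sketch; the kernel width, the regularization term, and the way centers are chosen here are assumptions rather than the project's actual settings.

```python
import numpy as np

def fit_rbf(centers, targets, gamma=1.0):
    """Solve for weights w so that sum_i w_i * phi(||c_j - c_i||) = targets[j]."""
    d2 = np.sum((centers[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * d2)                                   # Gaussian basis functions
    return np.linalg.solve(K + 1e-6 * np.eye(len(centers)), targets)

def rbf_predict(x, centers, weights, gamma=1.0):
    """Evaluate H_{u,v}(X) at a query point x (Equation 8)."""
    d2 = np.sum((centers - x) ** 2, axis=-1)
    return np.dot(weights, np.exp(-gamma * d2))

# Centers are 34-D training examples (pose + field co-ordinates), targets are
# their ground-truth entropy values; placeholder data shown here.
centers = np.random.rand(500, 34)
targets = np.random.rand(500)
w = fit_rbf(centers, targets)
value = rbf_predict(np.random.rand(34), centers, w)
```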
While some good initial experimentation took place, the problem with this framework was that the
target function is complicated and the computation time needed to train these systems grew
increasingly large. With the RBF estimate, structure is clearly present; however, for the method to
work well, the estimate should closely match the ground truth so that a simple comparison can be
run between the estimate and the actual image.
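As one possible instance of such a comparison, the energy of a pose hypothesis could simply be the squared difference between the estimated field and the measured image entropy over a grid, as in the sketch below; the report does not specify this form, so the function names and grid resolution are illustrative.

```python
import numpy as np

def pose_energy(pose, image_entropy, predict_field):
    """Energy of a pose hypothesis: mismatch between the predicted entropy
    field and the measured image entropy over a (u, v) grid.

    predict_field(pose, u, v) is any regressor for H_{u,v}(X), e.g. the
    RBF or neural network estimators discussed above.
    """
    rows, cols = image_entropy.shape
    predicted = np.zeros_like(image_entropy)
    for r in range(rows):
        for c in range(cols):
            u, v = c / cols, r / rows            # normalized field co-ordinates
            predicted[r, c] = predict_field(pose, u, v)
    return np.sum((predicted - image_entropy) ** 2)
```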
A more computationally effective alternative might have been to use an array of small neural nets
at representative co-ordinates in the entropy field, train them to produce expected values at those
grid nodes, and then interpolate the rest of the field from these reference values, instead of
regressing over the entire field. Unfortunately, time ran out to implement and test this idea.
7 Conclusions and Future Work
While work on the image likelihood function proved promising, a prior on the poses would still be
required to help restrict the search to poses that are close to those in the training data. A typical
approach to this problem is to use a Gaussian mixture model. Similar to RBFs, a Gaussian Mixture
Model (GMM) is a weighted sum of M Gaussian densities, as given by Equations 9 and 10; its
parameters are estimated from training data using the iterative Expectation-Maximization (EM)
algorithm [9].
P(X) = \sum_{i=1}^{M} w_i \, g(X \mid \mu_i, \Sigma_i)   (9)

g(X \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{D/2} \lvert \Sigma_i \rvert^{1/2}} \exp\left( -\frac{1}{2} (X - \mu_i)^\top \Sigma_i^{-1} (X - \mu_i) \right)   (10)
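A pose prior of this form could, for example, be fit with scikit-learn as sketched below; the number of mixture components and the choice of library are assumptions rather than decisions made in the project.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# poses: (N, 32) array of training pose vectors (placeholder data here).
poses = np.random.randn(150, 32)

gmm = GaussianMixture(n_components=5, covariance_type="full")
gmm.fit(poses)                                   # EM estimation of w_i, mu_i, Sigma_i

def log_prior(X):
    """log P(X) under the fitted mixture (Equations 9 and 10)."""
    return gmm.score_samples(np.atleast_2d(X))[0]
```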
This would provide the full definition of the posterior, and from there an optimization search could
be performed. Because the gradient of this function would be difficult or impossible to compute, it
seems likely that a stochastic approach such as simulated annealing would work well for this problem.
Neighbours could be generated by adding random Gaussian offsets to the current pose parameters,
and the temperature schedule could be tuned experimentally.
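A sketch of such a search over the shape-model parameters b is given below, assuming an energy that combines the entropy-field mismatch with the negative log of the GMM prior; the linear cooling schedule and step size are illustrative choices, not part of the report.

```python
import numpy as np

def simulated_annealing(energy, b0, n_iters=5000, step=0.1, t0=1.0, t_min=1e-3):
    """Minimize energy(b) over shape-model parameters b by simulated annealing.

    energy(b) would be, e.g., the entropy-field mismatch minus log P(X) under
    the GMM prior, with X = mean_pose + Phi @ b.
    """
    b = np.asarray(b0, dtype=float)
    e = energy(b)
    best_b, best_e = b.copy(), e
    for i in range(n_iters):
        t = max(t0 * (1.0 - i / n_iters), t_min)              # linear cooling schedule
        candidate = b + np.random.normal(scale=step, size=b.shape)
        e_new = energy(candidate)
        # Metropolis acceptance: always accept improvements, sometimes accept worse moves.
        if e_new < e or np.random.rand() < np.exp((e - e_new) / t):
            b, e = candidate, e_new
            if e < best_e:
                best_b, best_e = b.copy(), e
    return best_b, best_e
```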
To properly evaluate any approach, much more training data would have to be produced, so the
efforts of this project focused on experimenting with potentially useful approaches. Ideally, a full
evaluation of a solution would report the accuracy of estimated poses compared with the ground
truth. It is unfortunate that I ran out of time to fully evaluate a method, but the preliminary results
were promising and the framework is now written for further experimentation.

A more intelligent approach could also be taken with the active shape models. Instead of simply
finding the mean of the entire training set, some unsupervised clustering could be performed
beforehand and then k separate shape models could be constructed. This could potentially retain
more information about the variance in the pose clusters associated with various actions (jumping,
running, shooting, etc.). Included after the report is all the source code I wrote for the project.
References
[1] Ng, A. & Jordan, M. (2001) On Discriminative vs. Generative Classifiers. NIPS.
[2] Moeslund, T., Hilton, A. & Kruger, V. (2006) A Survey of Advances in Vision-Based Human
Motion Capture and Analysis. Computer Vision and Image Understanding.
[3] Lee, M. & Cohen, I. (2004) Proposal Maps Driven MCMC for Estimating Human Body Pose in
Static Images. Computer Vision and Pattern Recognition.
[4] Rogez, G. & Rihan, J. (2008) Randomized Trees for Human Pose Detection. Computer Vision
and Pattern Recognition.
[5] Ferrari, V., Marin-Jimenez, M. & Zisserman, A. (2008) Progressive Search Space Reduction for
Human Pose Estimation. Computer Vision and Pattern Recognition.
[6] Lu, W.-L., Okuma, K. & Little, J. J. (2009) Tracking and Recognizing Actions of Multiple
Hockey Players Using the Boosted Particle Filter. Image and Vision Computing.
[7] Press, W., Teukolsky, S., Vetterling, W. & Flannery, B. (2007) Numerical Recipes: The Art of
Scientific Computing. Cambridge University Press.
[8] Duda, R., Hart, P. & Stork, D. (2003) Pattern Classification, Second Edition. Wiley.
[9] Bishop, C. (2006) Pattern Recognition and Machine Learning. Springer.
[10] Cootes, T. F. & Taylor, C. J. (2004) Statistical Models of Appearance for Computer Vision.
[11] Efros, A., Berg, A., Mori, G. & Malik, J. (2003) Recognizing Action at a Distance. International
Conference on Computer Vision.