This is a presentation that I gave to my research group. It is about probabilistic extensions to Principal Components Analysis, as proposed by Tipping and Bishop.
Probabilistic PCA, EM, and more
1. Principal Components Analysis, Expectation Maximization, and more
Harsh Vardhan Sharma (1,2)
1 Statistical Speech Technology Group
Beckman Institute for Advanced Science and Technology
2 Dept. of Electrical & Computer Engineering
University of Illinois at Urbana-Champaign
Group Meeting: December 01, 2009
2. Material for this presentation derived from:
Probabilistic Principal Component Analysis.
Tipping and Bishop, Journal of the Royal Statistical Society, Series B (1999) 61:3, 611-622.
Mixtures of Principal Component Analyzers.
Tipping and Bishop, Proceedings of the Fifth International Conference on Artificial Neural Networks (1997), 13-18.
3. Outline
1 Principal Components Analysis
2 Basic Model / Model Basics
3 A brief digression - Inference and Learning
4 Probabilistic PCA
5 Expectation Maximization for Probabilistic PCA
6 Mixture of Principal Component Analyzers
4. PCA
Outline
1 Principal Components Analysis
2 Basic Model / Model Basics
3 A brief digression - Inference and Learning
4 Probabilistic PCA
5 Expectation Maximization for Probabilistic PCA
6 Mixture of Principal Component Analyzers
5. PCA
standard PCA in 1 slide
A well-established technique for dimensionality reduction
Most common derivation of PCA → the linear projection maximizing the variance in the projected space:
1: Organize the observed data $\{y_i \in \mathbb{R}^p\}_{i=1}^N$ in a $p \times N$ matrix $X$ after subtracting the mean $\bar{y} = \frac{1}{N} \sum_{i=1}^N y_i$.
2: Obtain the $k$ principal axes $\{w_j \in \mathbb{R}^p\}_{j=1}^k$ :: the $k$ eigenvectors of the data-covariance matrix $S_y = \frac{1}{N} \sum_{i=1}^N (y_i - \bar{y})(y_i - \bar{y})^T$ corresponding to the $k$ largest eigenvalues ($k < p$).
3: The $k$ principal components of $y_i$ are $x_i = W^T (y_i - \bar{y})$, where $W = (w_1, \ldots, w_k)$. The components of $x_i$ are then uncorrelated, and the projection-covariance matrix $S_x$ is diagonal with the $k$ largest eigenvalues of $S_y$.
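As a concrete companion to these three steps, here is a minimal numpy sketch (my own illustration, not part of the original slides; function and variable names are made up):

```python
import numpy as np

def pca(Y, k):
    """Standard PCA: Y is an (N, p) array of observations, k the target dimension."""
    y_bar = Y.mean(axis=0)                  # sample mean
    Yc = Y - y_bar                          # mean-subtracted data
    S_y = Yc.T @ Yc / Y.shape[0]            # p x p data-covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S_y)  # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :k]             # k principal axes (largest eigenvalues)
    X = Yc @ W                              # (N, k) principal components
    return X, W, eigvals[::-1][:k]
```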
6. PCA
Things to think about
Assumptions behind standard PCA
1 Linearity
the problem is one of changing the basis: we have measurements in a particular basis and want to view the data in a basis that best expresses it. We restrict ourselves to bases that are linear combinations of the measurement basis.
2 Large variances = important structure
we believe the data has high SNR ⇒ the dynamics of interest are assumed to lie along the directions of largest variance; lower-variance directions pertain to noise.
3 Principal components are orthogonal
decorrelation-based dimensionality reduction removes redundancy in the original data representation.
7. PCA
Things to think about
Limitations of standard PCA
1 Decorrelation is not always the best approach
Useful only when first- and second-order statistics are sufficient statistics for revealing all dependencies in the data (e.g., Gaussian-distributed data).
2 The linearity assumption is not always justifiable
Not valid when the data structure is captured by a nonlinear function of the dimensions in the measurement basis.
3 Non-parametricity
No probabilistic model for the observed data. (Advantages of the probabilistic extension coming up!)
4 Calculation of the data covariance matrix
When p and N are very large, difficulties arise in terms of computational complexity and data scarcity.
8. PCA
Things to think about
Handling the decorrelation and linearity caveats
Example solutions: Independent Components Analysis (imposing a more general notion of statistical dependency), kernel PCA (nonlinearly transforming the data to a more appropriate naive basis).
9. PCA
Things to think about
Handling the non-parametricity caveat: motivation for probabilistic PCA
probabilistic perspective can provide a log-likelihood measure for
comparison with other density-estimation techniques.
Bayesian inference methods may be applied (e.g., for model
comparison).
pPCA can be utilized as a constrained Gaussian density model:
potential applications → classification, novelty detection.
multiple pPCA models can be combined as a probabilistic mixture.
standard PCA uses a naive way to access covariance (squared distance from the observed data); pPCA defines a proper covariance structure whose parameters can be estimated via EM.
10. PCA
Things to think about
Handling the computational caveat: motivation for EM-based PCA
computing the sample covariance itself is $O(Np^2)$.
data scarcity: we often don't have enough data for the sample covariance to be full-rank.
computational complexity: direct diagonalization is $O(p^3)$.
standard PCA doesn't deal properly with missing data; EM algorithms can estimate ML values of the missing data.
EM-based PCA: doesn't require computing the sample covariance, and has $O(kNp)$ complexity.
11. Basic Model
Outline
1 Principal Components Analysis
2 Basic Model / Model Basics
3 A brief digression - Inference and Learning
4 Probabilistic PCA
5 Expectation Maximization for Probabilistic PCA
6 Mixture of Principal Component Analyzers
12. Basic Model
PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA → standard PCA

$$y_m - \mu = C x_m + \epsilon_m$$

where, for $m = 1, \ldots, N$:
$x_m \in \mathbb{R}^k \sim \mathcal{N}(0, Q)$ – (hidden) state vector
$y_m \in \mathbb{R}^p$ – output/observable vector
$C \in \mathbb{R}^{p \times k}$ – observation/measurement matrix
$\epsilon_m \in \mathbb{R}^p \sim \mathcal{N}(0, R)$ – zero-mean white Gaussian noise

So we have $y_m \sim \mathcal{N}(\mu,\ W = C Q C^T + R = C C^T + R)$.
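To make the generative story concrete, a small numpy sketch of sampling from this model (my own illustration; the dimensions and names are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, N = 10, 3, 5000

mu = rng.normal(size=p)
C = rng.normal(size=(p, k))             # observation matrix, rank k
R = np.diag(rng.uniform(0.1, 0.5, p))   # noise covariance (diagonal here)

X = rng.normal(size=(N, k))             # latent states, with Q = I
E = rng.multivariate_normal(np.zeros(p), R, size=N)
Y = mu + X @ C.T + E                    # y_m = mu + C x_m + eps_m

# empirically, cov(Y) approaches W = C C^T + R as N grows
W = C @ C.T + R
print(np.abs(np.cov(Y.T) - W).max())    # small deviation for large N
```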
13. Basic Model
PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA → standard PCA

$$y_m - \mu = C x_m + \epsilon_m$$

The restriction to a zero-mean noise source is not a loss of generality.
All of the structure in Q can be moved into C, so we can take $Q = I_{k \times k}$.
R in general cannot be restricted, since the $y_m$ are observed and cannot be whitened/rescaled.
Assumed: C is of rank k; Q and R are always full rank.
14. Inference, Learning
Outline
1 Principal Components Analysis
2 Basic Model / Model Basics
3 A brief digression - Inference and Learning
4 Probabilistic PCA
5 Expectation Maximization for Probabilistic PCA
6 Mixture of Principal Component Analyzers
15. Inference, Learning
Latent Variable Models and Probability Computations
Case 1 :: we know what the hidden states are and just want to estimate them (we can write down C a priori based on the physics of the problem).
Estimating the states given the observations and a model → inference.
Case 2 :: we have observation data, but the observation process is mostly unknown and there is no explicit model for the "causes".
Learning a few parameters that model the data well (in the ML sense) → learning.
16. Inference, Learning
Latent Variable Models and Probability Computations
Inference

$y_m \sim \mathcal{N}(\mu,\ W = C C^T + R)$ gives us

$$P(x_m \mid y_m) = \frac{P(y_m \mid x_m) \cdot P(x_m)}{P(y_m)} = \frac{\mathcal{N}(\mu + C x_m, R)\big|_{y_m} \cdot \mathcal{N}(0, I)\big|_{x_m}}{\mathcal{N}(\mu, W)\big|_{y_m}}$$

Therefore,

$$x_m \mid y_m \sim \mathcal{N}\big(\beta (y_m - \mu),\ I - \beta C\big)$$

where $\beta = C^T W^{-1} = C^T \big(C C^T + R\big)^{-1}$.
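This posterior is one linear solve in numpy; a minimal sketch (my own, with hypothetical names):

```python
import numpy as np

def infer_posterior(y, mu, C, R):
    """Posterior of x given y, under y = mu + C x + eps, x ~ N(0, I), eps ~ N(0, R)."""
    k = C.shape[1]
    W = C @ C.T + R                    # marginal covariance of y
    beta = np.linalg.solve(W, C).T     # beta = C^T W^{-1} (W is symmetric)
    mean = beta @ (y - mu)             # posterior mean
    cov = np.eye(k) - beta @ C         # posterior covariance
    return mean, cov
```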
17. Inference, Learning
Latent Variable Models and Probability Computations
Learning, via Expectation Maximization
Given a likelihood function $L(\theta;\ Y = \{y_m\}_m,\ X = \{x_m\}_m)$, where $\theta$ is the parameter vector, $Y$ is the observed data and $X$ represents the unobserved latent variables or missing values, the maximum likelihood estimate (MLE) of $\theta$ is obtained iteratively as follows:

Expectation step:
$$\tilde{Q}\big(\theta \mid \theta^{(u)}\big) = \mathbb{E}_{X \mid Y, \theta^{(u)}}\big[\log L(\theta;\ Y, X)\big]$$

Maximization step:
$$\theta^{(u+1)} = \arg\max_{\theta}\ \tilde{Q}\big(\theta \mid \theta^{(u)}\big)$$
18. Inference, Learning
Latent Variable Models and Probability Computations
Learning, via Expectation Maximization, for linear Gaussian models
Use the solution to the inference problem to estimate the unknown latent variables / missing values $X$, given $Y$ and $\theta^{(u)}$. Then use this fictitious "complete" data to solve for $\theta^{(u+1)}$.

Expectation step: Obtain the conditional latent sufficient statistics $\langle x_m \rangle^{(u)}$ and $\langle x_m x_m^T \rangle^{(u)}$ from
$$x_m \mid y_m \sim \mathcal{N}\big(\beta^{(u)} (y_m - \mu),\ I - \beta^{(u)} C^{(u)}\big)$$
where $\beta^{(u)} = C^{(u)T} W^{(u)-1} = C^{(u)T} \big(C^{(u)} C^{(u)T} + R^{(u)}\big)^{-1}$.

Maximization step: Choose $C$, $R$ to maximize the joint likelihood of $X$, $Y$.
19. pPCA
Outline
1 Principal Components Analysis
2 Basic Model / Model Basics
3 A brief digression - Inference and Learning
4 Probabilistic PCA
5 Expectation Maximization for Probabilistic PCA
6 Mixture of Principal Component Analyzers
20. pPCA
PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA → standard PCA

$$y_m - \mu = C x_m + \epsilon_m$$

The restriction to a zero-mean noise source is not a loss of generality.
All of the structure in Q can be moved into C, so we can take $Q = I_{k \times k}$.
R in general cannot be restricted, since the $y_m$ are observed and cannot be whitened/rescaled.
Assumed: C is of rank k; Q and R are always full rank.
So we have $y_m \sim \mathcal{N}(\mu,\ W = C Q C^T + R = C C^T + R)$.
21. pPCA
PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA → standard PCA

$k < p$: we are looking for a more parsimonious representation of the observed data.

:: linear Gaussian model → Factor Analysis ::

R needs to be restricted: otherwise the learning procedure could explain all the structure in the data as noise (i.e., obtain maximal likelihood by choosing $C = 0$ and $R = W =$ data-sample covariance).
Since $y_m \sim \mathcal{N}(\mu,\ W = C C^T + R)$, we can do no better than having the model covariance equal the data-sample covariance.
22. pPCA
PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA → standard PCA

Factor Analysis ≡ restricting R to be diagonal

$x_m \equiv \{x_{mi}\}_{i=1}^k$ – factors explaining the correlations between the observation variables $y_m \equiv \{y_{mj}\}_{j=1}^p$
$\{y_{mj}\}_{j=1}^p$ are conditionally independent given $\{x_{mi}\}_{i=1}^k$
$\epsilon_j$ – variability unique to a particular $y_{mj}$
$R = \mathrm{diag}(r_{jj})$ – the "uniquenesses"
different from standard PCA, which effectively treats covariance and variance identically
23. pPCA
PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA → standard PCA

Probabilistic PCA ≡ constraining R to $\sigma^2 I$

Noise: $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$
$y_m \mid x_m \sim \mathcal{N}\big(\mu + C x_m,\ \sigma^2 I\big)$
$y_m \sim \mathcal{N}\big(\mu,\ W = C C^T + \sigma^2 I\big)$
$x_m \mid y_m \sim \mathcal{N}\big(\beta (y_m - \mu),\ I - \beta C\big)$, where $\beta = C^T W^{-1} = C^T \big(C C^T + \sigma^2 I\big)^{-1}$
$x_m \mid y_m \sim \mathcal{N}\big(\kappa (y_m - \mu),\ \sigma^2 M^{-1}\big)$, where $\kappa = M^{-1} C^T = \big(C^T C + \sigma^2 I\big)^{-1} C^T$
24. pPCA
PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA → standard PCA

$$\beta = C^T W^{-1} = C^T \big(C C^T + \sigma^2 I_{p \times p}\big)^{-1}$$
$$\kappa = M^{-1} C^T = \big(C^T C + \sigma^2 I_{k \times k}\big)^{-1} C^T$$

It can be shown (by applying the Woodbury matrix identity to M) that
1 $I - \beta C = \sigma^2 M^{-1}$
2 $\beta = \kappa$

Then, by letting $\sigma^2 \to 0$, we obtain $x_m \mid y_m \to \delta\big(x_m - (C^T C)^{-1} C^T (y_m - \mu)\big)$, which is standard PCA.
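Both identities are easy to verify numerically; a quick sketch (my own, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
p, k, sigma2 = 8, 3, 0.4
C = rng.normal(size=(p, k))

W = C @ C.T + sigma2 * np.eye(p)   # p x p marginal covariance
M = C.T @ C + sigma2 * np.eye(k)   # k x k
beta = np.linalg.solve(W, C).T     # C^T W^{-1}
kappa = np.linalg.solve(M, C.T)    # M^{-1} C^T

print(np.allclose(np.eye(k) - beta @ C, sigma2 * np.linalg.inv(M)))  # True
print(np.allclose(beta, kappa))                                      # True
```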
25. pPCA
closed-form ML learning
The log-likelihood for the pPCA model:
$$L(\theta;\ Y) = -\frac{N}{2}\Big[\,p \log(2\pi) + \log|W| + \mathrm{trace}\big(W^{-1} S_y\big)\Big]$$
where
$W = C C^T + \sigma^2 I$
$S_y = \frac{1}{N} \sum_{m=1}^N (y_m - \mu)(y_m - \mu)^T$

The ML estimates of $\mu$ and $S_y$ are the sample mean and the sample covariance matrix, respectively:
$\hat{\mu} = \frac{1}{N} \sum_{m=1}^N y_m$
$\hat{S}_y = \frac{1}{N} \sum_{m=1}^N (y_m - \hat{\mu})(y_m - \hat{\mu})^T$
26. pPCA
closed-form ML learning
ML estimates of $C$ and $\sigma^2$, i.e. $\hat{C}$ and $\hat{\sigma}^2$:
$$\hat{C} = U_k \big(\Lambda_k - \sigma^2 I\big)^{1/2} V$$
maps the latent space (containing X) to the principal subspace of Y
columns of $U_k \in \mathbb{R}^{p \times k}$ – principal eigenvectors of $S_y$
$\Lambda_k \in \mathbb{R}^{k \times k}$ – diagonal matrix of the corresponding eigenvalues of $S_y$
$V \in \mathbb{R}^{k \times k}$ – arbitrary rotation matrix, can be set to $I_{k \times k}$
$$\hat{\sigma}^2 = \frac{1}{p - k} \sum_{r = k+1}^{p} \lambda_r$$
the variance lost in the projection process, averaged over the number of dimensions projected out/away
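A minimal numpy sketch of these closed-form estimates (my own illustration; it takes the eigendecomposition route and sets $V = I$):

```python
import numpy as np

def ppca_ml(Y, k):
    """Closed-form ML estimates for pPCA (Tipping & Bishop); Y is an (N, p) array."""
    N, p = Y.shape
    mu = Y.mean(axis=0)
    Yc = Y - mu
    S_y = Yc.T @ Yc / N
    lam, U = np.linalg.eigh(S_y)     # eigenvalues in ascending order
    lam, U = lam[::-1], U[:, ::-1]   # reorder to descending
    sigma2 = lam[k:].mean()          # average discarded variance
    C = U[:, :k] @ np.diag(np.sqrt(lam[:k] - sigma2))  # V = I
    return mu, C, sigma2
```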
27. EM for pPCA
Outline
1 Principal Components Analysis
2 Basic Model / Model Basics
3 A brief digression - Inference and Learning
4 Probabilistic PCA
5 Expectation Maximization for Probabilistic PCA
6 Mixture of Principal Component Analyzers
28. EM for pPCA
EM-based ML learning for linear Gaussian models
Expectation step:
$$x_m \mid y_m \sim \mathcal{N}\big(\beta^{(u)} (y_m - \mu),\ I - \beta^{(u)} C^{(u)}\big)$$
where $\beta^{(u)} = C^{(u)T} W^{(u)-1} = C^{(u)T} \big(C^{(u)} C^{(u)T} + R^{(u)}\big)^{-1}$.

Maximization step: Choose $C$, $R$ to maximize the joint likelihood of $X$, $Y$.
29. EM for pPCA
EM-based ML learning for probabilistic PCA
Expectation step:
$$x_m \mid y_m \sim \mathcal{N}\big(\beta^{(u)} (y_m - \mu),\ I - \beta^{(u)} C^{(u)}\big)$$
where $\beta^{(u)} = C^{(u)T} W^{(u)-1} = C^{(u)T} \big(C^{(u)} C^{(u)T} + \sigma^{2(u)} I\big)^{-1}$.

Maximization step: Choose $C$, $\sigma^2$ to maximize the joint likelihood of $X$, $Y$.
30. EM for pPCA
EM-based ML learning for probabilistic PCA
Expectation step:
$$x_m \mid y_m \sim \mathcal{N}\big(\kappa^{(u)} (y_m - \mu),\ \sigma^{2(u)} M^{(u)-1}\big)$$
where $\kappa^{(u)} = M^{(u)-1} C^{(u)T} = \big(C^{(u)T} C^{(u)} + \sigma^{2(u)} I\big)^{-1} C^{(u)T}$.

Maximization step: Choose $C$, $\sigma^2$ to maximize the joint likelihood of $X$, $Y$.
31. EM for pPCA
EM-based ML learning for probabilistic PCA
Expectation step: Compute, for $m = 1, \ldots, N$:
$$\langle x_m \rangle = M^{(u)-1} C^{(u)T} (y_m - \hat{\mu})$$
$$\langle x_m x_m^T \rangle = \sigma^{2(u)} M^{(u)-1} + \langle x_m \rangle \langle x_m \rangle^T$$

Maximization step: Set
$$C^{(u+1)} = \Big[\sum_{m=1}^N (y_m - \hat{\mu}) \langle x_m \rangle^T\Big] \Big[\sum_{m=1}^N \langle x_m x_m^T \rangle\Big]^{-1}$$
$$\sigma^{2(u+1)} = \frac{1}{Np} \sum_{m=1}^N \Big[\|y_m - \hat{\mu}\|^2 - 2 \langle x_m \rangle^T C^{(u+1)T} (y_m - \hat{\mu}) + \mathrm{trace}\big(\langle x_m x_m^T \rangle C^{(u+1)T} C^{(u+1)}\big)\Big]$$
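Putting the two steps together, a compact numpy sketch of the full EM loop (my own illustration; the random initialization and fixed iteration count are assumptions, since the paper doesn't prescribe them):

```python
import numpy as np

def ppca_em(Y, k, n_iter=100, seed=0):
    """EM for pPCA. Y is an (N, p) array; returns mu, C, sigma2."""
    rng = np.random.default_rng(seed)
    N, p = Y.shape
    mu = Y.mean(axis=0)
    Yc = Y - mu
    C = rng.normal(size=(p, k))   # random initialization (assumption)
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior sufficient statistics
        M = C.T @ C + sigma2 * np.eye(k)
        Minv = np.linalg.inv(M)
        Xmean = Yc @ C @ Minv                      # row m is <x_m>^T
        Sxx = N * sigma2 * Minv + Xmean.T @ Xmean  # sum_m <x_m x_m^T>
        # M-step
        C_new = (Yc.T @ Xmean) @ np.linalg.inv(Sxx)
        sigma2 = (np.sum(Yc**2)
                  - 2 * np.sum((Yc @ C_new) * Xmean)
                  + np.trace(Sxx @ C_new.T @ C_new)) / (N * p)
        C = C_new
    return mu, C, sigma2
```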
32. EM for pPCA
EM-based ML learning for probabilistic PCA
$$C^{(u+1)} = \hat{S}_y C^{(u)} \Big(\sigma^{2(u)} I + M^{(u)-1} C^{(u)T} \hat{S}_y C^{(u)}\Big)^{-1}$$
$$\sigma^{2(u+1)} = \frac{1}{p}\, \mathrm{trace}\Big(\hat{S}_y - \hat{S}_y C^{(u)} M^{(u)-1} C^{(u+1)T}\Big)$$

Convergence: the only stable local extremum is the global maximum, at which the true principal subspace is found. The paper doesn't discuss any initialization scheme(s).

Complexity: we only require terms of the form $SC$ and $\mathrm{trace}(S)$.
Computing $SC$ as $\sum_m y_m (y_m^T C)$ is $O(kNp)$ and more efficient than $\big(\sum_m y_m y_m^T\big) C$, which is equivalent to finding $S$ explicitly ($O(Np^2)$). Very efficient for $k \ll p$.
We require $\mathrm{trace}(S)$, not $S$ itself ⇒ computing only the variance along each coordinate is sufficient.
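The $O(kNp)$ point is just associativity of the matrix product; a quick numpy sketch (my own):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, k = 10000, 500, 5
Yc = rng.normal(size=(N, p))           # mean-subtracted data
C = rng.normal(size=(p, k))

SC_fast = Yc.T @ (Yc @ C) / N          # O(kNp): never forms the p x p covariance
SC_slow = (Yc.T @ Yc / N) @ C          # O(Np^2): forms S explicitly
print(np.allclose(SC_fast, SC_slow))   # True

trace_S = np.sum(Yc**2) / N            # trace(S): per-coordinate variances only
```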
33. Mixture of pPCAs
Outline
1 Principal Components Analysis
2 Basic Model / Model Basics
3 A brief digression - Inference and Learning
4 Probabilistic PCA
5 Expectation Maximization for Probabilistic PCA
6 Mixture of Principal Component Analyzers
34. Mixture of pPCAs
Mixture of pPCAs :: the model
Log-likelihood of the observed data:
$$L\big(\theta = \{\mu_r, C_r, \sigma_r^2\}_{r=1}^M;\ Y\big) = \sum_{m=1}^N \log \sum_{r=1}^M \pi_r \cdot p(y_m \mid r)$$
where, for $m = 1, \ldots, N$ and $r = 1, \ldots, M$:
$y_m \mid r \sim \mathcal{N}\big(\mu_r,\ W_r = C_r C_r^T + \sigma_r^2 I\big)$ :: a single pPCA model
$\{\pi_r\}$ with $\pi_r \geq 0$ and $\sum_{r=1}^M \pi_r = 1$ :: mixture weights
$M$ independent latent variables $x_{mr}$ for each $y_m$.
35. Mixture of pPCAs
Mixture of pPCAs :: EM-based ML learning
Stage 1: New estimates of the component-specific $\pi$, $\mu$

Expectation step: each component's responsibility for generating an observation:
$$R_{mr}^{(u+1)} = P^{(u)}(r \mid y_m) = \frac{p^{(u)}(y_m \mid r) \cdot \pi_r^{(u)}}{\sum_{r'=1}^M p^{(u)}(y_m \mid r') \cdot \pi_{r'}^{(u)}}$$

Maximization step:
$$\pi_r^{(u+1)} = \frac{1}{N} \sum_{m=1}^N R_{mr}^{(u+1)}$$
$$\hat{\mu}_r^{(u+1)} = \frac{\sum_{m=1}^N R_{mr}^{(u+1)}\, y_m}{\sum_{m=1}^N R_{mr}^{(u+1)}}$$
36. Mixture of pPCAs
Mixture of pPCAs :: EM-based ML learning
Stage 2: New estimates of the component-specific $C$, $\sigma^2$

Expectation step: Compute, for $m = 1, \ldots, N$ and $r = 1, \ldots, M$:
$$\langle x_{mr} \rangle = M_r^{(u)-1} C_r^{(u)T} \big(y_m - \hat{\mu}_r^{(u+1)}\big)$$
$$\langle x_{mr} x_{mr}^T \rangle = \sigma_r^{2(u)} M_r^{(u)-1} + \langle x_{mr} \rangle \langle x_{mr} \rangle^T$$

Maximization step: Set, for $r = 1, \ldots, M$:
$$C_r^{(u+1)} = \Big[\sum_{m=1}^N R_{mr}^{(u+1)} \big(y_m - \hat{\mu}_r^{(u+1)}\big) \langle x_{mr} \rangle^T\Big] \Big[\sum_{m=1}^N R_{mr}^{(u+1)} \langle x_{mr} x_{mr}^T \rangle\Big]^{-1}$$
$$\sigma_r^{2(u+1)} = \frac{1}{\pi_r^{(u+1)} N p} \sum_{m=1}^N R_{mr}^{(u+1)} \Big[\big\|y_m - \hat{\mu}_r^{(u+1)}\big\|^2 - 2 \langle x_{mr} \rangle^T C_r^{(u+1)T} \big(y_m - \hat{\mu}_r^{(u+1)}\big) + \mathrm{trace}\big(\langle x_{mr} x_{mr}^T \rangle C_r^{(u+1)T} C_r^{(u+1)}\big)\Big]$$
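A compact numpy/scipy sketch of one full two-stage sweep (my own illustration; the dense Gaussian densities in the responsibility step are used for clarity, not efficiency, and all names are made up):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mppca_em_step(Y, pi, mu, C, sigma2):
    """One EM sweep for a mixture of pPCAs.
    Y: (N, p); pi: (M,); mu: (M, p); C: list of (p, k) arrays; sigma2: (M,)."""
    N, p = Y.shape
    M = len(pi)
    # Stage 1 E-step: responsibilities R[m, r], normalized in log space
    logR = np.stack([
        np.log(pi[r]) + multivariate_normal.logpdf(
            Y, mu[r], C[r] @ C[r].T + sigma2[r] * np.eye(p))
        for r in range(M)], axis=1)
    logR -= logR.max(axis=1, keepdims=True)
    R = np.exp(logR)
    R /= R.sum(axis=1, keepdims=True)
    # Stage 1 M-step: mixture weights and means
    pi_new = R.mean(axis=0)
    mu_new = (R.T @ Y) / R.sum(axis=0)[:, None]
    # Stage 2: per-component C_r and sigma_r^2
    C_new, s2_new = [], np.empty(M)
    for r in range(M):
        k = C[r].shape[1]
        Yc = Y - mu_new[r]
        Minv = np.linalg.inv(C[r].T @ C[r] + sigma2[r] * np.eye(k))
        Xm = Yc @ C[r] @ Minv                      # row m is <x_mr>^T
        Sxx = sigma2[r] * R[:, r].sum() * Minv + (R[:, r] * Xm.T) @ Xm
        Cr = (R[:, r] * Yc.T) @ Xm @ np.linalg.inv(Sxx)
        s2_new[r] = (R[:, r] @ np.sum(Yc**2, axis=1)
                     - 2 * R[:, r] @ np.sum((Yc @ Cr) * Xm, axis=1)
                     + np.trace(Sxx @ Cr.T @ Cr)) / (pi_new[r] * N * p)
        C_new.append(Cr)
    return pi_new, mu_new, C_new, s2_new
```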
37. Mixture of pPCAs
Mixture of pPCAs :: EM-based ML learning
$$C_r^{(u+1)} = \hat{S}_{y_r}^{(u+1)} C_r^{(u)} \Big(\sigma_r^{2(u)} I + M_r^{(u)-1} C_r^{(u)T} \hat{S}_{y_r}^{(u+1)} C_r^{(u)}\Big)^{-1}$$
$$\sigma_r^{2(u+1)} = \frac{1}{p}\, \mathrm{trace}\Big(\hat{S}_{y_r}^{(u+1)} - \hat{S}_{y_r}^{(u+1)} C_r^{(u)} M_r^{(u)-1} C_r^{(u+1)T}\Big)$$
where
$$\hat{S}_{y_r}^{(u+1)} = \frac{1}{\pi_r^{(u+1)} N} \sum_{m=1}^N R_{mr}^{(u+1)} \big(y_m - \hat{\mu}_r^{(u+1)}\big)\big(y_m - \hat{\mu}_r^{(u+1)}\big)^T$$
$$M_r^{(u)} = \sigma_r^{2(u)} I + C_r^{(u)T} C_r^{(u)}$$
38. Mixture of pPCAs
Mixture of pPCAs :: EM-based ML learning
$$C_r^{(u+1)} = \hat{S}_{y_r}^{(u+1)} C_r^{(u)} \Big(\sigma_r^{2(u)} I + M_r^{(u)-1} C_r^{(u)T} \hat{S}_{y_r}^{(u+1)} C_r^{(u)}\Big)^{-1}$$
$$\sigma_r^{2(u+1)} = \frac{1}{p}\, \mathrm{trace}\Big(\hat{S}_{y_r}^{(u+1)} - \hat{S}_{y_r}^{(u+1)} C_r^{(u)} M_r^{(u)-1} C_r^{(u+1)T}\Big)$$

Complexity: we only require terms of the form $SC$ and $\mathrm{trace}(S)$.
Computing $SC$ as $\sum_m y_m (y_m^T C)$ is $O(kNp)$ and more efficient than $\big(\sum_m y_m y_m^T\big) C$, which is equivalent to finding $S$ explicitly ($O(Np^2)$). Very efficient for $k \ll p$.
We require $\mathrm{trace}(S)$, not $S$ itself ⇒ computing only the variance along each coordinate is sufficient.