This is a presentation that I gave to my research group. It is about probabilistic extensions to Principal Components Analysis, as proposed by Tipping and Bishop.
Probabilistic PCA, EM, and more
1. Principal Components Analysis, Expectation Maximization, and more
Harsh Vardhan Sharma (1,2)
1 Statistical Speech Technology Group
Beckman Institute for Advanced Science and Technology
2 Dept. of Electrical & Computer Engineering
University of Illinois at Urbana-Champaign
Group Meeting: December 01, 2009
2. Material for this presentation derived from:
Probabilistic Principal Component Analysis.
Tipping and Bishop, Journal of the Royal Statistical Society, Series B (1999) 61:3, 611-622.
Mixtures of Principal Component Analyzers.
Tipping and Bishop, Proceedings of the Fifth International Conference on Artificial Neural Networks (1997), 13-18.
3. Outline
1 Principal Components Analysis
2 Basic Model / Model Basics
3 A brief digression - Inference and Learning
4 Probabilistic PCA
5 Expectation Maximization for Probabilistic PCA
6 Mixture of Principal Component Analyzers
4. PCA
Outline
1 Principal Components Analysis
2 Basic Model / Model Basics
3 A brief digression - Inference and Learning
4 Probabilistic PCA
5 Expectation Maximization for Probabilistic PCA
6 Mixture of Principal Component Analyzers
5. PCA
standard PCA in 1 slide
A well-established technique for dimensionality reduction
Most common derivation of PCA → the linear projection maximizing the variance in the projected space:
1: Organize the observed data $\{y_i \in \mathbb{R}^p\}_{i=1}^N$ in a $p \times N$ matrix $X$ after subtracting the mean $\bar{y} = \frac{1}{N} \sum_{i=1}^N y_i$.
2: Obtain the $k$ principal axes $\{w_j \in \mathbb{R}^p\}_{j=1}^k$ :: the $k$ eigenvectors of the data-covariance matrix $S_y = \frac{1}{N} \sum_{i=1}^N (y_i - \bar{y})(y_i - \bar{y})^T$ corresponding to the $k$ largest eigenvalues ($k < p$).
3: The $k$ principal components of $y_i$ are $x_i = W^T (y_i - \bar{y})$, where $W = (w_1, \ldots, w_k)$. The components of $x_i$ are then uncorrelated, and the projection-covariance matrix $S_x$ is diagonal with the $k$ largest eigenvalues of $S_y$.
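As a concrete companion to these three steps, here is a minimal numpy sketch (my own illustration, not part of the original slides; function and variable names are made up):

```python
import numpy as np

def pca(Y, k):
    """Standard PCA: Y is an (N, p) array of observations, k the target dimension."""
    y_bar = Y.mean(axis=0)                  # sample mean
    Yc = Y - y_bar                          # mean-subtracted data
    S_y = Yc.T @ Yc / Y.shape[0]            # p x p data-covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S_y)  # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :k]             # k principal axes (largest eigenvalues)
    X = Yc @ W                              # (N, k) principal components
    return X, W, eigvals[::-1][:k]
```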
6. PCA
Things to think about
Assumptions behind standard PCA
1 Linearity
the problem is one of changing the basis: we have measurements in a particular basis and want to view the data in a basis that best expresses it. We restrict ourselves to bases that are linear combinations of the measurement basis.
2 Large variances = important structure
we believe the data has high SNR ⇒ the dynamics of interest are assumed to lie along the directions of largest variance; lower-variance directions pertain to noise.
3 Principal components are orthogonal
decorrelation-based dimensionality reduction removes redundancy in the original data representation.
7. PCA
Things to think about
Limitations of standard PCA
1 Decorrelation is not always the best approach
Useful only when first- and second-order statistics are sufficient statistics for revealing all dependencies in the data (e.g., Gaussian-distributed data).
2 The linearity assumption is not always justifiable
Not valid when the data structure is captured by a nonlinear function of the dimensions in the measurement basis.
3 Non-parametricity
No probabilistic model for the observed data. (Advantages of the probabilistic extension coming up!)
4 Calculation of the data covariance matrix
When p and N are very large, difficulties arise in terms of computational complexity and data scarcity.
8. PCA
Things to think about
Handling the decorrelation and linearity caveats
Example solutions: Independent Components Analysis (imposing a more general notion of statistical dependency), kernel PCA (nonlinearly transforming the data to a more appropriate naive basis).
9. PCA
Things to think about
Handling the non-parametricity caveat: motivation for probabilistic PCA
probabilistic perspective can provide a log-likelihood measure for
comparison with other density-estimation techniques.
Bayesian inference methods may be applied (e.g., for model
comparison).
pPCA can be utilized as a constrained Gaussian density model:
potential applications → classification, novelty detection.
multiple pPCA models can be combined as a probabilistic mixture.
standard PCA uses a naive way to access covariance (squared distance from the observed data); pPCA defines a proper covariance structure whose parameters can be estimated via EM.
10. PCA
Things to think about
Handling the computational caveat: motivation for EM-based PCA
computing the sample covariance itself is $O(Np^2)$.
data scarcity: we often don't have enough data for the sample covariance to be full-rank.
computational complexity: direct diagonalization is $O(p^3)$.
standard PCA doesn't deal properly with missing data; EM algorithms can estimate ML values of the missing data.
EM-based PCA: doesn't require computing the sample covariance, and has $O(kNp)$ complexity.
11. Basic Model
Outline
1 Principal Components Analysis
2 Basic Model / Model Basics
3 A brief digression - Inference and Learning
4 Probabilistic PCA
5 Expectation Maximization for Probabilistic PCA
6 Mixture of Principal Component Analyzers
12. Basic Model
PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA → standard PCA

$$y_m - \mu = C x_m + \epsilon_m$$

where, for $m = 1, \ldots, N$:
$x_m \in \mathbb{R}^k \sim \mathcal{N}(0, Q)$ – (hidden) state vector
$y_m \in \mathbb{R}^p$ – output/observable vector
$C \in \mathbb{R}^{p \times k}$ – observation/measurement matrix
$\epsilon_m \in \mathbb{R}^p \sim \mathcal{N}(0, R)$ – zero-mean white Gaussian noise

So we have $y_m \sim \mathcal{N}(\mu,\ W = C Q C^T + R = C C^T + R)$.
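To make the generative story concrete, a small numpy sketch of sampling from this model (my own illustration; the dimensions and names are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, N = 10, 3, 5000

mu = rng.normal(size=p)
C = rng.normal(size=(p, k))             # observation matrix, rank k
R = np.diag(rng.uniform(0.1, 0.5, p))   # noise covariance (diagonal here)

X = rng.normal(size=(N, k))             # latent states, with Q = I
E = rng.multivariate_normal(np.zeros(p), R, size=N)
Y = mu + X @ C.T + E                    # y_m = mu + C x_m + eps_m

# empirically, cov(Y) approaches W = C C^T + R as N grows
W = C @ C.T + R
print(np.abs(np.cov(Y.T) - W).max())    # small deviation for large N
```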
13. Basic Model
PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA → standard PCA

$$y_m - \mu = C x_m + \epsilon_m$$

The restriction to a zero-mean noise source is not a loss of generality.
All of the structure in Q can be moved into C, so we can take $Q = I_{k \times k}$.
R in general cannot be restricted, since the $y_m$ are observed and cannot be whitened/rescaled.
Assumed: C is of rank k; Q and R are always full rank.
14. Inference, Learning
Outline
1 Principal Components Analysis
2 Basic Model / Model Basics
3 A brief digression - Inference and Learning
4 Probabilistic PCA
5 Expectation Maximization for Probabilistic PCA
6 Mixture of Principal Component Analyzers
15. Inference, Learning
Latent Variable Models and Probability Computations
Case 1 :: we know what the hidden states are and just want to estimate them (we can write down C a priori based on the physics of the problem).
Estimating the states given the observations and a model → inference.
Case 2 :: we have observation data, but the observation process is mostly unknown and there is no explicit model for the "causes".
Learning a few parameters that model the data well (in the ML sense) → learning.
16. Inference, Learning
Latent Variable Models and Probability Computations
Inference

$y_m \sim \mathcal{N}(\mu,\ W = C C^T + R)$ gives us

$$P(x_m \mid y_m) = \frac{P(y_m \mid x_m) \cdot P(x_m)}{P(y_m)} = \frac{\mathcal{N}(\mu + C x_m, R)\big|_{y_m} \cdot \mathcal{N}(0, I)\big|_{x_m}}{\mathcal{N}(\mu, W)\big|_{y_m}}$$

Therefore,

$$x_m \mid y_m \sim \mathcal{N}\big(\beta (y_m - \mu),\ I - \beta C\big)$$

where $\beta = C^T W^{-1} = C^T \big(C C^T + R\big)^{-1}$.
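This posterior is one linear solve in numpy; a minimal sketch (my own, with hypothetical names):

```python
import numpy as np

def infer_posterior(y, mu, C, R):
    """Posterior of x given y, under y = mu + C x + eps, x ~ N(0, I), eps ~ N(0, R)."""
    k = C.shape[1]
    W = C @ C.T + R                    # marginal covariance of y
    beta = np.linalg.solve(W, C).T     # beta = C^T W^{-1} (W is symmetric)
    mean = beta @ (y - mu)             # posterior mean
    cov = np.eye(k) - beta @ C         # posterior covariance
    return mean, cov
```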
17. Inference, Learning
Latent Variable Models and Probability Computations
Learning, via Expectation Maximization
Given a likelihood function $L(\theta;\ Y = \{y_m\}_m,\ X = \{x_m\}_m)$, where $\theta$ is the parameter vector, $Y$ is the observed data and $X$ represents the unobserved latent variables or missing values, the maximum likelihood estimate (MLE) of $\theta$ is obtained iteratively as follows:

Expectation step:
$$\tilde{Q}\big(\theta \mid \theta^{(u)}\big) = \mathbb{E}_{X \mid Y, \theta^{(u)}}\big[\log L(\theta;\ Y, X)\big]$$

Maximization step:
$$\theta^{(u+1)} = \arg\max_{\theta}\ \tilde{Q}\big(\theta \mid \theta^{(u)}\big)$$
18. Inference, Learning
Latent Variable Models and Probability Computations
Learning, via Expectation Maximization, for linear Gaussian models
Use the solution to the inference problem to estimate the unknown latent variables / missing values $X$, given $Y$ and $\theta^{(u)}$. Then use this fictitious "complete" data to solve for $\theta^{(u+1)}$.

Expectation step: Obtain the conditional latent sufficient statistics $\langle x_m \rangle^{(u)}$ and $\langle x_m x_m^T \rangle^{(u)}$ from
$$x_m \mid y_m \sim \mathcal{N}\big(\beta^{(u)} (y_m - \mu),\ I - \beta^{(u)} C^{(u)}\big)$$
where $\beta^{(u)} = C^{(u)T} W^{(u)-1} = C^{(u)T} \big(C^{(u)} C^{(u)T} + R^{(u)}\big)^{-1}$.

Maximization step: Choose $C$, $R$ to maximize the joint likelihood of $X$, $Y$.
19. pPCA
Outline
1 Principal Components Analysis
2 Basic Model / Model Basics
3 A brief digression - Inference and Learning
4 Probabilistic PCA
5 Expectation Maximization for Probabilistic PCA
6 Mixture of Principal Component Analyzers
20. pPCA
PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA → standard PCA

$$y_m - \mu = C x_m + \epsilon_m$$

The restriction to a zero-mean noise source is not a loss of generality.
All of the structure in Q can be moved into C, so we can take $Q = I_{k \times k}$.
R in general cannot be restricted, since the $y_m$ are observed and cannot be whitened/rescaled.
Assumed: C is of rank k; Q and R are always full rank.
So we have $y_m \sim \mathcal{N}(\mu,\ W = C Q C^T + R = C C^T + R)$.
21. pPCA
PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA → standard PCA

$k < p$: we are looking for a more parsimonious representation of the observed data.

:: linear Gaussian model → Factor Analysis ::

R needs to be restricted: otherwise the learning procedure could explain all the structure in the data as noise (i.e., obtain maximal likelihood by choosing $C = 0$ and $R = W =$ data-sample covariance).
Since $y_m \sim \mathcal{N}(\mu,\ W = C C^T + R)$, we can do no better than having the model covariance equal the data-sample covariance.
22. pPCA
PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA → standard PCA

Factor Analysis ≡ restricting R to be diagonal

$x_m \equiv \{x_{mi}\}_{i=1}^k$ – factors explaining the correlations between the observation variables $y_m \equiv \{y_{mj}\}_{j=1}^p$
$\{y_{mj}\}_{j=1}^p$ are conditionally independent given $\{x_{mi}\}_{i=1}^k$
$\epsilon_j$ – variability unique to a particular $y_{mj}$
$R = \mathrm{diag}(r_{jj})$ – the "uniquenesses"
different from standard PCA, which effectively treats covariance and variance identically
23. pPCA
PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA → standard PCA

Probabilistic PCA ≡ constraining R to $\sigma^2 I$

Noise: $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$
$y_m \mid x_m \sim \mathcal{N}\big(\mu + C x_m,\ \sigma^2 I\big)$
$y_m \sim \mathcal{N}\big(\mu,\ W = C C^T + \sigma^2 I\big)$
$x_m \mid y_m \sim \mathcal{N}\big(\beta (y_m - \mu),\ I - \beta C\big)$, where $\beta = C^T W^{-1} = C^T \big(C C^T + \sigma^2 I\big)^{-1}$
$x_m \mid y_m \sim \mathcal{N}\big(\kappa (y_m - \mu),\ \sigma^2 M^{-1}\big)$, where $\kappa = M^{-1} C^T = \big(C^T C + \sigma^2 I\big)^{-1} C^T$
24. pPCA
PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA → standard PCA

$$\beta = C^T W^{-1} = C^T \big(C C^T + \sigma^2 I_{p \times p}\big)^{-1}$$
$$\kappa = M^{-1} C^T = \big(C^T C + \sigma^2 I_{k \times k}\big)^{-1} C^T$$

It can be shown (by applying the Woodbury matrix identity to M) that
1 $I - \beta C = \sigma^2 M^{-1}$
2 $\beta = \kappa$

Then, by letting $\sigma^2 \to 0$, we obtain $x_m \mid y_m \to \delta\big(x_m - (C^T C)^{-1} C^T (y_m - \mu)\big)$, which is standard PCA.
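Both identities are easy to verify numerically; a quick sketch (my own, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
p, k, sigma2 = 8, 3, 0.4
C = rng.normal(size=(p, k))

W = C @ C.T + sigma2 * np.eye(p)   # p x p marginal covariance
M = C.T @ C + sigma2 * np.eye(k)   # k x k
beta = np.linalg.solve(W, C).T     # C^T W^{-1}
kappa = np.linalg.solve(M, C.T)    # M^{-1} C^T

print(np.allclose(np.eye(k) - beta @ C, sigma2 * np.linalg.inv(M)))  # True
print(np.allclose(beta, kappa))                                      # True
```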
25. pPCA
closed-form ML learning
The log-likelihood for the pPCA model:
$$L(\theta;\ Y) = -\frac{N}{2}\Big[\,p \log(2\pi) + \log|W| + \mathrm{trace}\big(W^{-1} S_y\big)\Big]$$
where
$W = C C^T + \sigma^2 I$
$S_y = \frac{1}{N} \sum_{m=1}^N (y_m - \mu)(y_m - \mu)^T$

The ML estimates of $\mu$ and $S_y$ are the sample mean and the sample covariance matrix, respectively:
$\hat{\mu} = \frac{1}{N} \sum_{m=1}^N y_m$
$\hat{S}_y = \frac{1}{N} \sum_{m=1}^N (y_m - \hat{\mu})(y_m - \hat{\mu})^T$
26. pPCA
closed-form ML learning
ML estimates of $C$ and $\sigma^2$, i.e. $\hat{C}$ and $\hat{\sigma}^2$:
$$\hat{C} = U_k \big(\Lambda_k - \sigma^2 I\big)^{1/2} V$$
maps the latent space (containing X) to the principal subspace of Y
columns of $U_k \in \mathbb{R}^{p \times k}$ – principal eigenvectors of $S_y$
$\Lambda_k \in \mathbb{R}^{k \times k}$ – diagonal matrix of the corresponding eigenvalues of $S_y$
$V \in \mathbb{R}^{k \times k}$ – arbitrary rotation matrix, can be set to $I_{k \times k}$
$$\hat{\sigma}^2 = \frac{1}{p - k} \sum_{r = k+1}^{p} \lambda_r$$
the variance lost in the projection process, averaged over the number of dimensions projected out/away
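A minimal numpy sketch of these closed-form estimates (my own illustration; it takes the eigendecomposition route and sets $V = I$):

```python
import numpy as np

def ppca_ml(Y, k):
    """Closed-form ML estimates for pPCA (Tipping & Bishop); Y is an (N, p) array."""
    N, p = Y.shape
    mu = Y.mean(axis=0)
    Yc = Y - mu
    S_y = Yc.T @ Yc / N
    lam, U = np.linalg.eigh(S_y)     # eigenvalues in ascending order
    lam, U = lam[::-1], U[:, ::-1]   # reorder to descending
    sigma2 = lam[k:].mean()          # average discarded variance
    C = U[:, :k] @ np.diag(np.sqrt(lam[:k] - sigma2))  # V = I
    return mu, C, sigma2
```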
27. EM for pPCA
Outline
1 Principal Components Analysis
2 Basic Model / Model Basics
3 A brief digression - Inference and Learning
4 Probabilistic PCA
5 Expectation Maximization for Probabilistic PCA
6 Mixture of Principal Component Analyzers
28. EM for pPCA
EM-based ML learning for linear Gaussian models
Expectation step:
$$x_m \mid y_m \sim \mathcal{N}\big(\beta^{(u)} (y_m - \mu),\ I - \beta^{(u)} C^{(u)}\big)$$
where $\beta^{(u)} = C^{(u)T} W^{(u)-1} = C^{(u)T} \big(C^{(u)} C^{(u)T} + R^{(u)}\big)^{-1}$.

Maximization step: Choose $C$, $R$ to maximize the joint likelihood of $X$, $Y$.
29. EM for pPCA
EM-based ML learning for probabilistic PCA
Expectation step:
$$x_m \mid y_m \sim \mathcal{N}\big(\beta^{(u)} (y_m - \mu),\ I - \beta^{(u)} C^{(u)}\big)$$
where $\beta^{(u)} = C^{(u)T} W^{(u)-1} = C^{(u)T} \big(C^{(u)} C^{(u)T} + \sigma^{2(u)} I\big)^{-1}$.

Maximization step: Choose $C$, $\sigma^2$ to maximize the joint likelihood of $X$, $Y$.
30. EM for pPCA
EM-based ML learning for probabilistic PCA
Expectation step:
$$x_m \mid y_m \sim \mathcal{N}\big(\kappa^{(u)} (y_m - \mu),\ \sigma^{2(u)} M^{(u)-1}\big)$$
where $\kappa^{(u)} = M^{(u)-1} C^{(u)T} = \big(C^{(u)T} C^{(u)} + \sigma^{2(u)} I\big)^{-1} C^{(u)T}$.

Maximization step: Choose $C$, $\sigma^2$ to maximize the joint likelihood of $X$, $Y$.
31. EM for pPCA
EM-based ML learning for probabilistic PCA
Expectation step: Compute, for $m = 1, \ldots, N$:
$$\langle x_m \rangle = M^{(u)-1} C^{(u)T} (y_m - \hat{\mu})$$
$$\langle x_m x_m^T \rangle = \sigma^{2(u)} M^{(u)-1} + \langle x_m \rangle \langle x_m \rangle^T$$

Maximization step: Set
$$C^{(u+1)} = \Big[\sum_{m=1}^N (y_m - \hat{\mu}) \langle x_m \rangle^T\Big] \Big[\sum_{m=1}^N \langle x_m x_m^T \rangle\Big]^{-1}$$
$$\sigma^{2(u+1)} = \frac{1}{Np} \sum_{m=1}^N \Big[\|y_m - \hat{\mu}\|^2 - 2 \langle x_m \rangle^T C^{(u+1)T} (y_m - \hat{\mu}) + \mathrm{trace}\big(\langle x_m x_m^T \rangle C^{(u+1)T} C^{(u+1)}\big)\Big]$$
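Putting the two steps together, a compact numpy sketch of the full EM loop (my own illustration; the random initialization and fixed iteration count are assumptions, since the paper doesn't prescribe them):

```python
import numpy as np

def ppca_em(Y, k, n_iter=100, seed=0):
    """EM for pPCA. Y is an (N, p) array; returns mu, C, sigma2."""
    rng = np.random.default_rng(seed)
    N, p = Y.shape
    mu = Y.mean(axis=0)
    Yc = Y - mu
    C = rng.normal(size=(p, k))   # random initialization (assumption)
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior sufficient statistics
        M = C.T @ C + sigma2 * np.eye(k)
        Minv = np.linalg.inv(M)
        Xmean = Yc @ C @ Minv                      # row m is <x_m>^T
        Sxx = N * sigma2 * Minv + Xmean.T @ Xmean  # sum_m <x_m x_m^T>
        # M-step
        C_new = (Yc.T @ Xmean) @ np.linalg.inv(Sxx)
        sigma2 = (np.sum(Yc**2)
                  - 2 * np.sum((Yc @ C_new) * Xmean)
                  + np.trace(Sxx @ C_new.T @ C_new)) / (N * p)
        C = C_new
    return mu, C, sigma2
```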
32. EM for pPCA
EM-based ML learning for probabilistic PCA
$$C^{(u+1)} = \hat{S}_y C^{(u)} \Big(\sigma^{2(u)} I + M^{(u)-1} C^{(u)T} \hat{S}_y C^{(u)}\Big)^{-1}$$
$$\sigma^{2(u+1)} = \frac{1}{p}\, \mathrm{trace}\Big(\hat{S}_y - \hat{S}_y C^{(u)} M^{(u)-1} C^{(u+1)T}\Big)$$

Convergence: the only stable local extremum is the global maximum, at which the true principal subspace is found. The paper doesn't discuss any initialization scheme(s).

Complexity: we only require terms of the form $SC$ and $\mathrm{trace}(S)$.
Computing $SC$ as $\sum_m y_m (y_m^T C)$ is $O(kNp)$ and more efficient than $\big(\sum_m y_m y_m^T\big) C$, which is equivalent to finding $S$ explicitly ($O(Np^2)$). Very efficient for $k \ll p$.
We require $\mathrm{trace}(S)$, not $S$ itself ⇒ computing only the variance along each coordinate is sufficient.
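The $O(kNp)$ point is just associativity of the matrix product; a quick numpy sketch (my own):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, k = 10000, 500, 5
Yc = rng.normal(size=(N, p))           # mean-subtracted data
C = rng.normal(size=(p, k))

SC_fast = Yc.T @ (Yc @ C) / N          # O(kNp): never forms the p x p covariance
SC_slow = (Yc.T @ Yc / N) @ C          # O(Np^2): forms S explicitly
print(np.allclose(SC_fast, SC_slow))   # True

trace_S = np.sum(Yc**2) / N            # trace(S): per-coordinate variances only
```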
33. Mixture of pPCAs
Outline
1 Principal Components Analysis
2 Basic Model / Model Basics
3 A brief digression - Inference and Learning
4 Probabilistic PCA
5 Expectation Maximization for Probabilistic PCA
6 Mixture of Principal Component Analyzers
34. Mixture of pPCAs
Mixture of pPCAs :: the model
Log-likelihood of the observed data:
$$L\big(\theta = \{\mu_r, C_r, \sigma_r^2\}_{r=1}^M;\ Y\big) = \sum_{m=1}^N \log \sum_{r=1}^M \pi_r \cdot p(y_m \mid r)$$
where, for $m = 1, \ldots, N$ and $r = 1, \ldots, M$:
$y_m \mid r \sim \mathcal{N}\big(\mu_r,\ W_r = C_r C_r^T + \sigma_r^2 I\big)$ :: a single pPCA model
$\{\pi_r\}$ with $\pi_r \geq 0$ and $\sum_{r=1}^M \pi_r = 1$ :: mixture weights
$M$ independent latent variables $x_{mr}$ for each $y_m$.
35. Mixture of pPCAs
Mixture of pPCAs :: EM-based ML learning
Stage 1: New estimates of the component-specific $\pi$, $\mu$

Expectation step: each component's responsibility for generating an observation:
$$R_{mr}^{(u+1)} = P^{(u)}(r \mid y_m) = \frac{p^{(u)}(y_m \mid r) \cdot \pi_r^{(u)}}{\sum_{r'=1}^M p^{(u)}(y_m \mid r') \cdot \pi_{r'}^{(u)}}$$

Maximization step:
$$\pi_r^{(u+1)} = \frac{1}{N} \sum_{m=1}^N R_{mr}^{(u+1)}$$
$$\hat{\mu}_r^{(u+1)} = \frac{\sum_{m=1}^N R_{mr}^{(u+1)}\, y_m}{\sum_{m=1}^N R_{mr}^{(u+1)}}$$
36. Mixture of pPCAs
Mixture of pPCAs :: EM-based ML learning
Stage 2: New estimates of the component-specific $C$, $\sigma^2$

Expectation step: Compute, for $m = 1, \ldots, N$ and $r = 1, \ldots, M$:
$$\langle x_{mr} \rangle = M_r^{(u)-1} C_r^{(u)T} \big(y_m - \hat{\mu}_r^{(u+1)}\big)$$
$$\langle x_{mr} x_{mr}^T \rangle = \sigma_r^{2(u)} M_r^{(u)-1} + \langle x_{mr} \rangle \langle x_{mr} \rangle^T$$

Maximization step: Set, for $r = 1, \ldots, M$:
$$C_r^{(u+1)} = \Big[\sum_{m=1}^N R_{mr}^{(u+1)} \big(y_m - \hat{\mu}_r^{(u+1)}\big) \langle x_{mr} \rangle^T\Big] \Big[\sum_{m=1}^N R_{mr}^{(u+1)} \langle x_{mr} x_{mr}^T \rangle\Big]^{-1}$$
$$\sigma_r^{2(u+1)} = \frac{1}{\pi_r^{(u+1)} N p} \sum_{m=1}^N R_{mr}^{(u+1)} \Big[\big\|y_m - \hat{\mu}_r^{(u+1)}\big\|^2 - 2 \langle x_{mr} \rangle^T C_r^{(u+1)T} \big(y_m - \hat{\mu}_r^{(u+1)}\big) + \mathrm{trace}\big(\langle x_{mr} x_{mr}^T \rangle C_r^{(u+1)T} C_r^{(u+1)}\big)\Big]$$
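A compact numpy/scipy sketch of one full two-stage sweep (my own illustration; the dense Gaussian densities in the responsibility step are used for clarity, not efficiency, and all names are made up):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mppca_em_step(Y, pi, mu, C, sigma2):
    """One EM sweep for a mixture of pPCAs.
    Y: (N, p); pi: (M,); mu: (M, p); C: list of (p, k) arrays; sigma2: (M,)."""
    N, p = Y.shape
    M = len(pi)
    # Stage 1 E-step: responsibilities R[m, r], normalized in log space
    logR = np.stack([
        np.log(pi[r]) + multivariate_normal.logpdf(
            Y, mu[r], C[r] @ C[r].T + sigma2[r] * np.eye(p))
        for r in range(M)], axis=1)
    logR -= logR.max(axis=1, keepdims=True)
    R = np.exp(logR)
    R /= R.sum(axis=1, keepdims=True)
    # Stage 1 M-step: mixture weights and means
    pi_new = R.mean(axis=0)
    mu_new = (R.T @ Y) / R.sum(axis=0)[:, None]
    # Stage 2: per-component C_r and sigma_r^2
    C_new, s2_new = [], np.empty(M)
    for r in range(M):
        k = C[r].shape[1]
        Yc = Y - mu_new[r]
        Minv = np.linalg.inv(C[r].T @ C[r] + sigma2[r] * np.eye(k))
        Xm = Yc @ C[r] @ Minv                      # row m is <x_mr>^T
        Sxx = sigma2[r] * R[:, r].sum() * Minv + (R[:, r] * Xm.T) @ Xm
        Cr = (R[:, r] * Yc.T) @ Xm @ np.linalg.inv(Sxx)
        s2_new[r] = (R[:, r] @ np.sum(Yc**2, axis=1)
                     - 2 * R[:, r] @ np.sum((Yc @ Cr) * Xm, axis=1)
                     + np.trace(Sxx @ Cr.T @ Cr)) / (pi_new[r] * N * p)
        C_new.append(Cr)
    return pi_new, mu_new, C_new, s2_new
```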
37. Mixture of pPCAs
Mixture of pPCAs :: EM-based ML learning
$$C_r^{(u+1)} = \hat{S}_{y_r}^{(u+1)} C_r^{(u)} \Big(\sigma_r^{2(u)} I + M_r^{(u)-1} C_r^{(u)T} \hat{S}_{y_r}^{(u+1)} C_r^{(u)}\Big)^{-1}$$
$$\sigma_r^{2(u+1)} = \frac{1}{p}\, \mathrm{trace}\Big(\hat{S}_{y_r}^{(u+1)} - \hat{S}_{y_r}^{(u+1)} C_r^{(u)} M_r^{(u)-1} C_r^{(u+1)T}\Big)$$
where
$$\hat{S}_{y_r}^{(u+1)} = \frac{1}{\pi_r^{(u+1)} N} \sum_{m=1}^N R_{mr}^{(u+1)} \big(y_m - \hat{\mu}_r^{(u+1)}\big)\big(y_m - \hat{\mu}_r^{(u+1)}\big)^T$$
$$M_r^{(u)} = \sigma_r^{2(u)} I + C_r^{(u)T} C_r^{(u)}$$
38. Mixture of pPCAs
Mixture of pPCAs :: EM-based ML learning
$$C_r^{(u+1)} = \hat{S}_{y_r}^{(u+1)} C_r^{(u)} \Big(\sigma_r^{2(u)} I + M_r^{(u)-1} C_r^{(u)T} \hat{S}_{y_r}^{(u+1)} C_r^{(u)}\Big)^{-1}$$
$$\sigma_r^{2(u+1)} = \frac{1}{p}\, \mathrm{trace}\Big(\hat{S}_{y_r}^{(u+1)} - \hat{S}_{y_r}^{(u+1)} C_r^{(u)} M_r^{(u)-1} C_r^{(u+1)T}\Big)$$

Complexity: we only require terms of the form $SC$ and $\mathrm{trace}(S)$.
Computing $SC$ as $\sum_m y_m (y_m^T C)$ is $O(kNp)$ and more efficient than $\big(\sum_m y_m y_m^T\big) C$, which is equivalent to finding $S$ explicitly ($O(Np^2)$). Very efficient for $k \ll p$.
We require $\mathrm{trace}(S)$, not $S$ itself ⇒ computing only the variance along each coordinate is sufficient.