The document discusses the EM algorithm and its application in Probabilistic Latent Semantic Analysis (pLSA). It begins by introducing the parameter estimation problem and comparing frequentist and Bayesian approaches. It then describes the EM algorithm, which iteratively computes lower bounds to the log-likelihood function. Finally, it applies the EM algorithm to pLSA by modeling documents and words as arising from a mixture of latent topics.
1. EM algorithm and its application in Probabilistic Latent Semantic Analysis (pLSA)
Duc-Hieu Tran
tdh.net [at] gmail.com
Nanyang Technological University
July 27, 2010
2. The parameter estimation problem
Outline
The parameter estimation problem
EM algorithm
Probabilistic Latent Semantic Analysis
Reference
3. The parameter estimation problem
Introduction
Given the prior probabilities $P(\omega_i)$ and the class-conditional densities $p(x|\omega_i)$
$\implies$ optimal classifier:
$$P(\omega_j|x) \propto p(x|\omega_j)P(\omega_j)$$
decide $\omega_i$ if $P(\omega_i|x) > P(\omega_j|x)$, $\forall j \neq i$
In practice, $p(x|\omega_i)$ is unknown and must be estimated from training samples
(e.g., assume $p(x|\omega_i) \sim \mathcal{N}(\mu_i, \Sigma_i)$).
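A minimal sketch of this decision rule, assuming a hypothetical two-class problem with known Gaussian class-conditional densities (all parameters below are made up for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical two-class problem: known priors and Gaussian
# class-conditional densities p(x|omega_i) ~ N(mu_i, Sigma_i).
priors = np.array([0.6, 0.4])                         # P(omega_i)
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]  # mu_i
covs = [np.eye(2), np.eye(2)]                         # Sigma_i

def classify(x):
    # unnormalized posteriors: P(omega_j|x) proportional to p(x|omega_j) P(omega_j)
    scores = [multivariate_normal.pdf(x, mean=m, cov=c) * p
              for m, c, p in zip(means, covs, priors)]
    return int(np.argmax(scores))  # decide the omega_i with the largest posterior

print(classify(np.array([1.8, 1.5])))  # -> 1, the class whose mean is closer
```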
4. The parameter estimation problem
Frequentist vs. Bayesian schools
Frequentist
parameters – quantities whose values are fixed but unknown.
the best estimate of their values – the one that maximizes the
probability of obtaining the observed samples.
Bayesian
parameters – random variables having some known prior distribution.
observation of the samples converts this to a posterior density;
revising our opinion about the true values of the parameters.
5. The parameter estimation problem
Examples
training samples: $S = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$
frequentist: maximum likelihood
$$\max_\theta \prod_i p(y^{(i)}|x^{(i)}; \theta)$$
Bayesian: $P(\theta)$ – prior, e.g., $\theta \sim \mathcal{N}(0, I)$
$$P(\theta|S) \propto \left[ \prod_{i=1}^m P(y^{(i)}|x^{(i)}, \theta) \right] P(\theta)$$
$$\theta_{MAP} = \arg\max_\theta P(\theta|S)$$
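A hedged sketch contrasting the two estimates on a problem with closed-form answers (an assumed example, not from the slides): the mean of a 1-D Gaussian with known variance, with a $\mathcal{N}(0, \tau^2)$ prior on $\theta$ for the MAP version.

```python
import numpy as np

# Assumed example: x_i ~ N(theta, sigma^2) with sigma^2 known,
# prior theta ~ N(0, tau^2). Both estimates have closed forms.
rng = np.random.default_rng(0)
sigma2, tau2 = 1.0, 1.0
x = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=20)

# frequentist: theta_ML = arg max_theta prod_i p(x_i; theta) = sample mean
theta_ml = x.mean()

# Bayesian: the posterior is Gaussian, so theta_MAP is the posterior mean,
# shrunk toward the prior mean 0
theta_map = (len(x) / sigma2 * x.mean()) / (len(x) / sigma2 + 1.0 / tau2)

print(theta_ml, theta_map)  # MAP is pulled slightly toward 0
```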
6. EM algorithm
Outline
The parameter estimation problem
EM algorithm
Probabilistic Latent Semantic Analysis
Reference
7. EM algorithm
An estimation problem
training set of $m$ independent samples: $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$
goal: fit the parameters of a model $p(x, z)$ to the data
the likelihood:
$$\ell(\theta) = \sum_{i=1}^m \log p(x^{(i)}; \theta) = \sum_{i=1}^m \log \sum_z p(x^{(i)}, z; \theta)$$
explicitly maximizing $\ell(\theta)$ might be difficult.
$z$ – latent random variable
if $z^{(i)}$ were observed, then maximum likelihood estimation would be easy.
strategy: repeatedly construct a lower bound on $\ell$ (E-step) and
optimize that lower bound (M-step).
8. EM algorithm
EM algorithm (1)
digression: Jensen’s inequality.
$f$ – convex function: $E[f(X)] \geq f(E[X])$
for each $i$, $Q_i$ – a distribution over $z$: $\sum_z Q_i(z) = 1$, $Q_i(z) \geq 0$
$$\ell(\theta) = \sum_i \log p(x^{(i)}; \theta) = \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) = \sum_i \log \sum_{z^{(i)}} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \quad (1)$$
applying Jensen's inequality to the concave function $\log$:
$$\ell(\theta) \geq \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \quad (2)$$
More detail . . .
9. EM algorithm
EM algorithm (2)
for any set of distributions $Q_i$, formula (2) gives a lower bound on $\ell(\theta)$
how to choose $Q_i$?
strategy: make the inequality hold with equality at our particular
value of $\theta$.
require:
$$\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c$$
$c$ – a constant not depending on $z^{(i)}$
choose: $Q_i(z^{(i)}) \propto p(x^{(i)}, z^{(i)}; \theta)$
we know $\sum_z Q_i(z^{(i)}) = 1$, so
$$Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_z p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)}|x^{(i)}; \theta)$$
10. EM algorithm
EM algorithm (3)
$Q_i$ – posterior distribution of $z^{(i)}$ given $x^{(i)}$ and the parameter $\theta$
EM algorithm: repeat until convergence
E-step: for each $i$,
$$Q_i(z^{(i)}) := p(z^{(i)}|x^{(i)}; \theta)$$
M-step:
$$\theta := \arg\max_\theta \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$$
The algorithm will converge, since $\ell(\theta^{(t)}) \leq \ell(\theta^{(t+1)})$
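To make the recipe concrete, a minimal sketch for one classic instance (an assumed example, not from the slides): a two-component 1-D Gaussian mixture with unit variances, where $z^{(i)}$ is the unobserved component label.

```python
import numpy as np
from scipy.stats import norm

# Assumed toy data: a mixture of N(-2, 1) and N(3, 1).
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

pi = np.array([0.5, 0.5])   # mixing weights P(z = k)
mu = np.array([-1.0, 1.0])  # component means (variances fixed at 1 for brevity)

for _ in range(50):  # repeat until convergence (fixed iteration count here)
    # E-step: Q_i(z) = p(z|x_i; theta), the posterior responsibilities
    lik = np.stack([pi[k] * norm.pdf(x, mu[k], 1.0) for k in range(2)], axis=1)
    q = lik / lik.sum(axis=1, keepdims=True)
    # M-step: the lower bound is maximized in closed form
    pi = q.mean(axis=0)
    mu = (q * x[:, None]).sum(axis=0) / q.sum(axis=0)

print(pi, mu)  # approaches the generating weights (0.4, 0.6) and means (-2, 3)
```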
11. EM algorithm
EM algorithm (4)
Digression: coordinate ascent algorithm.
$$\max_\alpha W(\alpha_1, \ldots, \alpha_m)$$
loop until convergence:
for $i \in \{1, \ldots, m\}$: $\alpha_i := \arg\max_{\hat{\alpha}_i} W(\alpha_1, \ldots, \hat{\alpha}_i, \ldots, \alpha_m)$
EM algorithm as coordinate ascent:
$$J(Q, \theta) = \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$$
$$\ell(\theta) \geq J(Q, \theta)$$
EM algorithm can be viewed as coordinate ascent on J
E-step: maximize w.r.t Q
M-step: maximize w.r.t θ
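A small sketch of plain coordinate ascent on an assumed concave quadratic $W(\alpha) = -\frac{1}{2}\alpha^T A \alpha + b^T \alpha$, where each inner arg max has a closed form:

```python
import numpy as np

# Assumed objective: W(alpha) = -0.5 alpha^T A alpha + b^T alpha,
# concave since A is positive definite.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -1.0])
alpha = np.zeros(2)

for _ in range(100):  # loop until converge
    for i in range(len(alpha)):
        # arg max over alpha_i alone: set dW/dalpha_i = 0 and solve
        alpha[i] = (b[i] - A[i] @ alpha + A[i, i] * alpha[i]) / A[i, i]

print(alpha, np.linalg.solve(A, b))  # coordinate ascent reaches the exact optimum
```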
12. Probabilistic Latent Semantic Analysis
Outline
The parameter estimation problem
EM algorithm
Probabilistic Latent Semantic Analysis
Reference
13. Probabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis (1)
set of documents $D = \{d_1, \ldots, d_N\}$
set of words $W = \{w_1, \ldots, w_M\}$
set of unobserved classes $Z = \{z_1, \ldots, z_K\}$
conditional independence assumption:
$$P(d_i, w_j|z_k) = P(d_i|z_k)P(w_j|z_k) \quad (3)$$
so,
$$P(w_j|d_i) = \sum_{k=1}^K P(z_k|d_i)P(w_j|z_k) \quad (4)$$
$$P(d_i, w_j) = P(d_i) \sum_{k=1}^K P(w_j|z_k)P(z_k|d_i)$$
More detail . . .
14. Probabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis (2)
$n(d_i, w_j)$ – number of occurrences of word $w_j$ in document $d_i$
Likelihood:
$$L = \prod_{i=1}^N \prod_{j=1}^M [P(d_i, w_j)]^{n(d_i, w_j)} = \prod_{i=1}^N \prod_{j=1}^M \left[ P(d_i) \sum_{k=1}^K P(w_j|z_k)P(z_k|d_i) \right]^{n(d_i, w_j)}$$
log-likelihood $= \log(L)$
$$= \sum_{i=1}^N \sum_{j=1}^M \left[ n(d_i, w_j) \log P(d_i) + n(d_i, w_j) \log \sum_{k=1}^K P(w_j|z_k)P(z_k|d_i) \right]$$
15. Probabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis (3)
maximize w.r.t. $P(w_j|z_k)$, $P(z_k|d_i)$
equivalent to maximizing (the $\log P(d_i)$ term does not depend on these parameters):
$$\sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \log \sum_{k=1}^K P(w_j|z_k)P(z_k|d_i)$$
$$= \sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \log \sum_{k=1}^K Q_k(z_k) \frac{P(w_j|z_k)P(z_k|d_i)}{Q_k(z_k)}$$
$$\geq \sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \sum_{k=1}^K Q_k(z_k) \log \frac{P(w_j|z_k)P(z_k|d_i)}{Q_k(z_k)}$$
choose
$$Q_k(z_k) = \frac{P(w_j|z_k)P(z_k|d_i)}{\sum_{l=1}^K P(w_j|z_l)P(z_l|d_i)} = P(z_k|d_i, w_j)$$
More detail . . .
16. Probabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis (4)
equivalent to maximizing (w.r.t. $P(w_j|z_k)$, $P(z_k|d_i)$)
$$\sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \sum_{k=1}^K P(z_k|d_i, w_j) \log \frac{P(w_j|z_k)P(z_k|d_i)}{P(z_k|d_i, w_j)}$$
equivalent to maximizing (the $\log P(z_k|d_i, w_j)$ term is fixed at its E-step value):
$$\sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \sum_{k=1}^K P(z_k|d_i, w_j) \log[P(w_j|z_k)P(z_k|d_i)]$$
17. Probabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis (5)
EM algorithm
E-step: update
$$P(z_k|d_i, w_j) = \frac{P(w_j|z_k)P(z_k|d_i)}{\sum_{l=1}^K P(w_j|z_l)P(z_l|d_i)}$$
M-step: maximize w.r.t. $P(w_j|z_k)$, $P(z_k|d_i)$
$$\sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \sum_{k=1}^K P(z_k|d_i, w_j) \log[P(w_j|z_k)P(z_k|d_i)]$$
subject to
$$\sum_{j=1}^M P(w_j|z_k) = 1, \quad k \in \{1, \ldots, K\}$$
$$\sum_{k=1}^K P(z_k|d_i) = 1, \quad i \in \{1, \ldots, N\}$$
18. Probabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis (6)
Solution of the maximization problem in the M-step:
$$P(w_j|z_k) = \frac{\sum_{i=1}^N n(d_i, w_j)P(z_k|d_i, w_j)}{\sum_{m=1}^M \sum_{n=1}^N n(d_n, w_m)P(z_k|d_n, w_m)}$$
$$P(z_k|d_i) = \frac{\sum_{j=1}^M n(d_i, w_j)P(z_k|d_i, w_j)}{n(d_i)}$$
where $n(d_i) = \sum_{j=1}^M n(d_i, w_j)$
More detail . . .
19. Probabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis (7)
All together
E-step:
$$P(z_k|d_i, w_j) = \frac{P(w_j|z_k)P(z_k|d_i)}{\sum_{l=1}^K P(w_j|z_l)P(z_l|d_i)}$$
M-step:
$$P(w_j|z_k) = \frac{\sum_{i=1}^N n(d_i, w_j)P(z_k|d_i, w_j)}{\sum_{m=1}^M \sum_{n=1}^N n(d_n, w_m)P(z_k|d_n, w_m)}$$
$$P(z_k|d_i) = \frac{\sum_{j=1}^M n(d_i, w_j)P(z_k|d_i, w_j)}{n(d_i)}$$
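A sketch of the full loop on a hypothetical random term-document count matrix (sizes, names, and initialization are assumptions): `pw_z[k, j]` stands for $P(w_j|z_k)$, `pz_d[i, k]` for $P(z_k|d_i)$, and `pz_dw[i, k, j]` for $P(z_k|d_i, w_j)$.

```python
import numpy as np

# Assumed toy corpus: n[i, j] = n(d_i, w_j), random counts.
rng = np.random.default_rng(42)
N, M, K = 8, 12, 3  # documents, words, latent classes
n = rng.integers(0, 5, size=(N, M)).astype(float)

# random row-stochastic initialization
pw_z = rng.random((K, M)); pw_z /= pw_z.sum(axis=1, keepdims=True)  # P(w_j|z_k)
pz_d = rng.random((N, K)); pz_d /= pz_d.sum(axis=1, keepdims=True)  # P(z_k|d_i)

for _ in range(100):
    # E-step: P(z_k|d_i,w_j) proportional to P(w_j|z_k) P(z_k|d_i)
    pz_dw = pz_d[:, :, None] * pw_z[None, :, :]        # shape (N, K, M)
    pz_dw /= pz_dw.sum(axis=1, keepdims=True) + 1e-12
    # M-step: the closed-form updates above
    nq = n[:, None, :] * pz_dw                         # n(d_i,w_j) P(z_k|d_i,w_j)
    pw_z = nq.sum(axis=0) / (nq.sum(axis=(0, 2), keepdims=True)[0] + 1e-12)
    pz_d = nq.sum(axis=2) / (n.sum(axis=1, keepdims=True) + 1e-12)

print(pw_z.sum(axis=1), pz_d.sum(axis=1))  # each row sums to ~1
```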
20. Reference
Outline
The parameter estimation problem
EM algorithm
Probabilistic Latent Semantic Analysis
Reference
21. Reference
R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification,
Wiley-Interscience, 2001.
T. Hofmann, "Unsupervised learning by probabilistic latent semantic
analysis," Machine Learning, vol. 42, 2001, pp. 177–196.
A. Ng, "Machine Learning (CS229)," course notes, Stanford University.
22. Appendix
Generative model for word/document co-occurrence
select a document $d_i$ with probability (w.p.) $P(d_i)$
pick a latent class $z_k$ w.p. $P(z_k|d_i)$
generate a word $w_j$ w.p. $P(w_j|z_k)$
$$P(d_i, w_j) = \sum_{k=1}^K P(d_i, w_j|z_k)P(z_k) = \sum_{k=1}^K P(w_j|z_k)P(d_i|z_k)P(z_k)$$
$$= \sum_{k=1}^K P(w_j|z_k)P(z_k|d_i)P(d_i) = P(d_i) \sum_{k=1}^K P(w_j|z_k)P(z_k|d_i)$$
$$P(d_i, w_j) = P(w_j|d_i)P(d_i) \implies P(w_j|d_i) = \sum_{k=1}^K P(z_k|d_i)P(w_j|z_k)$$
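A sketch of this generative process with hypothetical probability tables (all numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
p_d = np.array([0.5, 0.5])                  # P(d_i)
p_z_d = np.array([[0.8, 0.2],               # P(z_k|d_i), one row per document
                  [0.3, 0.7]])
p_w_z = np.array([[0.6, 0.3, 0.1],          # P(w_j|z_k), one row per class
                  [0.1, 0.2, 0.7]])

def sample_pair():
    d = rng.choice(2, p=p_d)        # select a document w.p. P(d_i)
    z = rng.choice(2, p=p_z_d[d])   # pick a latent class w.p. P(z_k|d_i)
    w = rng.choice(3, p=p_w_z[z])   # generate a word w.p. P(w_j|z_k)
    return d, w                     # z remains unobserved

print([sample_pair() for _ in range(5)])
```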
23. Appendix
$$P(w_j|d_i) = \sum_{k=1}^K P(z_k|d_i)P(w_j|z_k)$$
since $\sum_{k=1}^K P(z_k|d_i) = 1$, $P(w_j|d_i)$ is a convex combination of the $P(w_j|z_k)$
≈ each document is modelled as a mixture of topics
Return
24. Appendix
$$P(z_k|d_i, w_j) = \frac{P(d_i, w_j|z_k)P(z_k)}{P(d_i, w_j)} \quad (5)$$
$$= \frac{P(w_j|z_k)P(d_i|z_k)P(z_k)}{P(d_i, w_j)} \quad (6)$$
$$= \frac{P(w_j|z_k)P(z_k|d_i)}{P(w_j|d_i)} \quad (7)$$
$$= \frac{P(w_j|z_k)P(z_k|d_i)}{\sum_{l=1}^K P(w_j|z_l)P(z_l|d_i)} \quad (8)$$
From (5) to (6) by the conditional independence assumption (3). From (7) to
(8) by (4). Return
25. Appendix
Lagrange multipliers $\tau_k$, $\rho_i$:
$$H = \sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \sum_{k=1}^K P(z_k|d_i, w_j) \log[P(w_j|z_k)P(z_k|d_i)] + \sum_{k=1}^K \tau_k \left(1 - \sum_{j=1}^M P(w_j|z_k)\right) + \sum_{i=1}^N \rho_i \left(1 - \sum_{k=1}^K P(z_k|d_i)\right)$$
$$\frac{\partial H}{\partial P(w_j|z_k)} = \frac{\sum_{i=1}^N P(z_k|d_i, w_j)n(d_i, w_j)}{P(w_j|z_k)} - \tau_k = 0$$
$$\frac{\partial H}{\partial P(z_k|d_i)} = \frac{\sum_{j=1}^M n(d_i, w_j)P(z_k|d_i, w_j)}{P(z_k|d_i)} - \rho_i = 0$$
26. Appendix
from $\sum_{j=1}^M P(w_j|z_k) = 1$:
$$\tau_k = \sum_{j=1}^M \sum_{i=1}^N P(z_k|d_i, w_j)n(d_i, w_j)$$
from $\sum_{k=1}^K P(z_k|d_i, w_j) = 1$:
$$\rho_i = n(d_i)$$
$\implies$ the M-step updates for $P(w_j|z_k)$, $P(z_k|d_i)$. Return
27. Appendix
Applying Jensen's inequality
$f(x) = \log(x)$, a concave function:
$$f\left(E_{z^{(i)} \sim Q_i}\left[\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}\right]\right) \geq E_{z^{(i)} \sim Q_i}\left[f\left(\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}\right)\right]$$
Return