The document discusses the EM algorithm and its application in Probabilistic Latent Semantic Analysis (pLSA). It begins by introducing the parameter estimation problem and comparing frequentist and Bayesian approaches. It then describes the EM algorithm, which iteratively computes lower bounds to the log-likelihood function. Finally, it applies the EM algorithm to pLSA by modeling documents and words as arising from a mixture of latent topics.
1. EM algorithm and its application in Probabilistic Latent Semantic Analysis (pLSA)
Duc-Hieu Tran
tdh.net [at] gmail.com
Nanyang Technological University
July 27, 2010
2. The parameter estimation problem
Outline
The parameter estimation problem
EM algorithm
Probabilistic Latent Semantic Analysis
Reference
3. The parameter estimation problem
Introduction
Given the prior probabilities $P(\omega_i)$ and the class-conditional densities $p(x|\omega_i)$
$\implies$ optimal classifier:
$$P(\omega_j|x) \propto p(x|\omega_j)P(\omega_j)$$
decide $\omega_i$ if $P(\omega_i|x) > P(\omega_j|x)$, $\forall j \neq i$
In practice, $p(x|\omega_i)$ is unknown and must be estimated from training samples
(e.g., assume $p(x|\omega_i) \sim \mathcal{N}(\mu_i, \Sigma_i)$).
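A minimal sketch of this decision rule, assuming a hypothetical two-class problem with known Gaussian class-conditional densities (all parameters below are made up for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical two-class problem: known priors and Gaussian
# class-conditional densities p(x|omega_i) ~ N(mu_i, Sigma_i).
priors = np.array([0.6, 0.4])                         # P(omega_i)
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]  # mu_i
covs = [np.eye(2), np.eye(2)]                         # Sigma_i

def classify(x):
    # unnormalized posteriors: P(omega_j|x) proportional to p(x|omega_j) P(omega_j)
    scores = [multivariate_normal.pdf(x, mean=m, cov=c) * p
              for m, c, p in zip(means, covs, priors)]
    return int(np.argmax(scores))  # decide the omega_i with the largest posterior

print(classify(np.array([1.8, 1.5])))  # -> 1, the class whose mean is closer
```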
4. The parameter estimation problem
Frequentist vs. Bayesian schools
Frequentist
parameters – quantities whose values are fixed but unknown.
the best estimate of their values – the one that maximizes the
probability of obtaining the observed samples.
Bayesian
parameters – random variables having some known prior distribution.
observation of the samples converts this to a posterior density;
revising our opinion about the true values of the parameters.
5. The parameter estimation problem
Examples
training samples: $S = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$
frequentist: maximum likelihood
$$\max_\theta \prod_i p(y^{(i)}|x^{(i)}; \theta)$$
Bayesian: $P(\theta)$ – prior, e.g., $\theta \sim \mathcal{N}(0, I)$
$$P(\theta|S) \propto \left[ \prod_{i=1}^m P(y^{(i)}|x^{(i)}, \theta) \right] P(\theta)$$
$$\theta_{MAP} = \arg\max_\theta P(\theta|S)$$
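A hedged sketch contrasting the two estimates on a problem with closed-form answers (an assumed example, not from the slides): the mean of a 1-D Gaussian with known variance, with a $\mathcal{N}(0, \tau^2)$ prior on $\theta$ for the MAP version.

```python
import numpy as np

# Assumed example: x_i ~ N(theta, sigma^2) with sigma^2 known,
# prior theta ~ N(0, tau^2). Both estimates have closed forms.
rng = np.random.default_rng(0)
sigma2, tau2 = 1.0, 1.0
x = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=20)

# frequentist: theta_ML = arg max_theta prod_i p(x_i; theta) = sample mean
theta_ml = x.mean()

# Bayesian: the posterior is Gaussian, so theta_MAP is the posterior mean,
# shrunk toward the prior mean 0
theta_map = (len(x) / sigma2 * x.mean()) / (len(x) / sigma2 + 1.0 / tau2)

print(theta_ml, theta_map)  # MAP is pulled slightly toward 0
```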
6. EM algorithm
Outline
The parameter estimation problem
EM algorithm
Probabilistic Latent Semantic Analysis
Reference
7. EM algorithm
An estimation problem
training set of $m$ independent samples: $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$
goal: fit the parameters of a model $p(x, z)$ to the data
the likelihood:
$$\ell(\theta) = \sum_{i=1}^m \log p(x^{(i)}; \theta) = \sum_{i=1}^m \log \sum_z p(x^{(i)}, z; \theta)$$
explicitly maximizing $\ell(\theta)$ might be difficult.
$z$ – latent random variable
if $z^{(i)}$ were observed, then maximum likelihood estimation would be easy.
strategy: repeatedly construct a lower bound on $\ell$ (E-step) and
optimize that lower bound (M-step).
8. EM algorithm
EM algorithm (1)
digression: Jensen’s inequality.
$f$ – convex function: $E[f(X)] \geq f(E[X])$
for each $i$, $Q_i$ – a distribution over $z$: $\sum_z Q_i(z) = 1$, $Q_i(z) \geq 0$
$$\ell(\theta) = \sum_i \log p(x^{(i)}; \theta) = \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) = \sum_i \log \sum_{z^{(i)}} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \quad (1)$$
applying Jensen's inequality to the concave function $\log$:
$$\ell(\theta) \geq \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \quad (2)$$
More detail . . .
9. EM algorithm
EM algorithm (2)
for any set of distributions $Q_i$, formula (2) gives a lower bound on $\ell(\theta)$
how to choose $Q_i$?
strategy: make the inequality hold with equality at our particular
value of $\theta$.
require:
$$\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c$$
$c$ – a constant not depending on $z^{(i)}$
choose: $Q_i(z^{(i)}) \propto p(x^{(i)}, z^{(i)}; \theta)$
we know $\sum_z Q_i(z^{(i)}) = 1$, so
$$Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_z p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)}|x^{(i)}; \theta)$$
10. EM algorithm
EM algorithm (3)
$Q_i$ – posterior distribution of $z^{(i)}$ given $x^{(i)}$ and the parameter $\theta$
EM algorithm: repeat until convergence
E-step: for each $i$,
$$Q_i(z^{(i)}) := p(z^{(i)}|x^{(i)}; \theta)$$
M-step:
$$\theta := \arg\max_\theta \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$$
The algorithm will converge, since $\ell(\theta^{(t)}) \leq \ell(\theta^{(t+1)})$
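To make the recipe concrete, a minimal sketch for one classic instance (an assumed example, not from the slides): a two-component 1-D Gaussian mixture with unit variances, where $z^{(i)}$ is the unobserved component label.

```python
import numpy as np
from scipy.stats import norm

# Assumed toy data: a mixture of N(-2, 1) and N(3, 1).
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

pi = np.array([0.5, 0.5])   # mixing weights P(z = k)
mu = np.array([-1.0, 1.0])  # component means (variances fixed at 1 for brevity)

for _ in range(50):  # repeat until convergence (fixed iteration count here)
    # E-step: Q_i(z) = p(z|x_i; theta), the posterior responsibilities
    lik = np.stack([pi[k] * norm.pdf(x, mu[k], 1.0) for k in range(2)], axis=1)
    q = lik / lik.sum(axis=1, keepdims=True)
    # M-step: the lower bound is maximized in closed form
    pi = q.mean(axis=0)
    mu = (q * x[:, None]).sum(axis=0) / q.sum(axis=0)

print(pi, mu)  # approaches the generating weights (0.4, 0.6) and means (-2, 3)
```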
11. EM algorithm
EM algorithm (4)
Digression: coordinate ascent algorithm.
$$\max_\alpha W(\alpha_1, \ldots, \alpha_m)$$
loop until convergence:
for $i \in \{1, \ldots, m\}$: $\alpha_i := \arg\max_{\hat{\alpha}_i} W(\alpha_1, \ldots, \hat{\alpha}_i, \ldots, \alpha_m)$
EM algorithm as coordinate ascent:
$$J(Q, \theta) = \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$$
$$\ell(\theta) \geq J(Q, \theta)$$
EM algorithm can be viewed as coordinate ascent on J
E-step: maximize w.r.t Q
M-step: maximize w.r.t θ
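A small sketch of plain coordinate ascent on an assumed concave quadratic $W(\alpha) = -\frac{1}{2}\alpha^T A \alpha + b^T \alpha$, where each inner arg max has a closed form:

```python
import numpy as np

# Assumed objective: W(alpha) = -0.5 alpha^T A alpha + b^T alpha,
# concave since A is positive definite.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -1.0])
alpha = np.zeros(2)

for _ in range(100):  # loop until converge
    for i in range(len(alpha)):
        # arg max over alpha_i alone: set dW/dalpha_i = 0 and solve
        alpha[i] = (b[i] - A[i] @ alpha + A[i, i] * alpha[i]) / A[i, i]

print(alpha, np.linalg.solve(A, b))  # coordinate ascent reaches the exact optimum
```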
12. Probabilistic Latent Semantic Analysis
Outline
The parameter estimation problem
EM algorithm
Probabilistic Latent Semantic Analysis
Reference
13. Probabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis (1)
set of documents $D = \{d_1, \ldots, d_N\}$
set of words $W = \{w_1, \ldots, w_M\}$
set of unobserved classes $Z = \{z_1, \ldots, z_K\}$
conditional independence assumption:
$$P(d_i, w_j|z_k) = P(d_i|z_k)P(w_j|z_k) \quad (3)$$
so,
$$P(w_j|d_i) = \sum_{k=1}^K P(z_k|d_i)P(w_j|z_k) \quad (4)$$
$$P(d_i, w_j) = P(d_i) \sum_{k=1}^K P(w_j|z_k)P(z_k|d_i)$$
More detail . . .
14. Probabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis (2)
$n(d_i, w_j)$ – number of occurrences of word $w_j$ in document $d_i$
Likelihood:
$$L = \prod_{i=1}^N \prod_{j=1}^M [P(d_i, w_j)]^{n(d_i, w_j)} = \prod_{i=1}^N \prod_{j=1}^M \left[ P(d_i) \sum_{k=1}^K P(w_j|z_k)P(z_k|d_i) \right]^{n(d_i, w_j)}$$
log-likelihood $= \log(L)$
$$= \sum_{i=1}^N \sum_{j=1}^M \left[ n(d_i, w_j) \log P(d_i) + n(d_i, w_j) \log \sum_{k=1}^K P(w_j|z_k)P(z_k|d_i) \right]$$
15. Probabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis (3)
maximize w.r.t. $P(w_j|z_k)$, $P(z_k|d_i)$
equivalent to maximizing (the $\log P(d_i)$ term does not depend on these parameters):
$$\sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \log \sum_{k=1}^K P(w_j|z_k)P(z_k|d_i)$$
$$= \sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \log \sum_{k=1}^K Q_k(z_k) \frac{P(w_j|z_k)P(z_k|d_i)}{Q_k(z_k)}$$
$$\geq \sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \sum_{k=1}^K Q_k(z_k) \log \frac{P(w_j|z_k)P(z_k|d_i)}{Q_k(z_k)}$$
choose
$$Q_k(z_k) = \frac{P(w_j|z_k)P(z_k|d_i)}{\sum_{l=1}^K P(w_j|z_l)P(z_l|d_i)} = P(z_k|d_i, w_j)$$
More detail . . .
16. Probabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis (4)
equivalent to maximizing (w.r.t. $P(w_j|z_k)$, $P(z_k|d_i)$)
$$\sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \sum_{k=1}^K P(z_k|d_i, w_j) \log \frac{P(w_j|z_k)P(z_k|d_i)}{P(z_k|d_i, w_j)}$$
equivalent to maximizing (the $\log P(z_k|d_i, w_j)$ term is fixed at its E-step value):
$$\sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \sum_{k=1}^K P(z_k|d_i, w_j) \log[P(w_j|z_k)P(z_k|d_i)]$$
17. Probabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis (5)
EM algorithm
E-step: update
$$P(z_k|d_i, w_j) = \frac{P(w_j|z_k)P(z_k|d_i)}{\sum_{l=1}^K P(w_j|z_l)P(z_l|d_i)}$$
M-step: maximize w.r.t. $P(w_j|z_k)$, $P(z_k|d_i)$
$$\sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \sum_{k=1}^K P(z_k|d_i, w_j) \log[P(w_j|z_k)P(z_k|d_i)]$$
subject to
$$\sum_{j=1}^M P(w_j|z_k) = 1, \quad k \in \{1, \ldots, K\}$$
$$\sum_{k=1}^K P(z_k|d_i) = 1, \quad i \in \{1, \ldots, N\}$$
18. Probabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis (6)
Solution of the maximization problem in the M-step:
$$P(w_j|z_k) = \frac{\sum_{i=1}^N n(d_i, w_j)P(z_k|d_i, w_j)}{\sum_{m=1}^M \sum_{n=1}^N n(d_n, w_m)P(z_k|d_n, w_m)}$$
$$P(z_k|d_i) = \frac{\sum_{j=1}^M n(d_i, w_j)P(z_k|d_i, w_j)}{n(d_i)}$$
where $n(d_i) = \sum_{j=1}^M n(d_i, w_j)$
More detail . . .
19. Probabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis (7)
All together
E-step:
$$P(z_k|d_i, w_j) = \frac{P(w_j|z_k)P(z_k|d_i)}{\sum_{l=1}^K P(w_j|z_l)P(z_l|d_i)}$$
M-step:
$$P(w_j|z_k) = \frac{\sum_{i=1}^N n(d_i, w_j)P(z_k|d_i, w_j)}{\sum_{m=1}^M \sum_{n=1}^N n(d_n, w_m)P(z_k|d_n, w_m)}$$
$$P(z_k|d_i) = \frac{\sum_{j=1}^M n(d_i, w_j)P(z_k|d_i, w_j)}{n(d_i)}$$
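A sketch of the full loop on a hypothetical random term-document count matrix (sizes, names, and initialization are assumptions): `pw_z[k, j]` stands for $P(w_j|z_k)$, `pz_d[i, k]` for $P(z_k|d_i)$, and `pz_dw[i, k, j]` for $P(z_k|d_i, w_j)$.

```python
import numpy as np

# Assumed toy corpus: n[i, j] = n(d_i, w_j), random counts.
rng = np.random.default_rng(42)
N, M, K = 8, 12, 3  # documents, words, latent classes
n = rng.integers(0, 5, size=(N, M)).astype(float)

# random row-stochastic initialization
pw_z = rng.random((K, M)); pw_z /= pw_z.sum(axis=1, keepdims=True)  # P(w_j|z_k)
pz_d = rng.random((N, K)); pz_d /= pz_d.sum(axis=1, keepdims=True)  # P(z_k|d_i)

for _ in range(100):
    # E-step: P(z_k|d_i,w_j) proportional to P(w_j|z_k) P(z_k|d_i)
    pz_dw = pz_d[:, :, None] * pw_z[None, :, :]        # shape (N, K, M)
    pz_dw /= pz_dw.sum(axis=1, keepdims=True) + 1e-12
    # M-step: the closed-form updates above
    nq = n[:, None, :] * pz_dw                         # n(d_i,w_j) P(z_k|d_i,w_j)
    pw_z = nq.sum(axis=0) / (nq.sum(axis=(0, 2), keepdims=True)[0] + 1e-12)
    pz_d = nq.sum(axis=2) / (n.sum(axis=1, keepdims=True) + 1e-12)

print(pw_z.sum(axis=1), pz_d.sum(axis=1))  # each row sums to ~1
```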
20. Reference
Outline
The parameter estimation problem
EM algorithm
Probabilistic Latent Semantic Analysis
Reference
21. Reference
R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification,
Wiley-Interscience, 2001.
T. Hofmann, "Unsupervised learning by probabilistic latent semantic
analysis," Machine Learning, vol. 42, 2001, pp. 177–196.
A. Ng, "Machine Learning (CS229)," course notes, Stanford University.
22. Appendix
Generative model for word/document co-occurrence
select a document $d_i$ with probability (w.p.) $P(d_i)$
pick a latent class $z_k$ w.p. $P(z_k|d_i)$
generate a word $w_j$ w.p. $P(w_j|z_k)$
$$P(d_i, w_j) = \sum_{k=1}^K P(d_i, w_j|z_k)P(z_k) = \sum_{k=1}^K P(w_j|z_k)P(d_i|z_k)P(z_k)$$
$$= \sum_{k=1}^K P(w_j|z_k)P(z_k|d_i)P(d_i) = P(d_i) \sum_{k=1}^K P(w_j|z_k)P(z_k|d_i)$$
$$P(d_i, w_j) = P(w_j|d_i)P(d_i) \implies P(w_j|d_i) = \sum_{k=1}^K P(z_k|d_i)P(w_j|z_k)$$
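A sketch of this generative process with hypothetical probability tables (all numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
p_d = np.array([0.5, 0.5])                  # P(d_i)
p_z_d = np.array([[0.8, 0.2],               # P(z_k|d_i), one row per document
                  [0.3, 0.7]])
p_w_z = np.array([[0.6, 0.3, 0.1],          # P(w_j|z_k), one row per class
                  [0.1, 0.2, 0.7]])

def sample_pair():
    d = rng.choice(2, p=p_d)        # select a document w.p. P(d_i)
    z = rng.choice(2, p=p_z_d[d])   # pick a latent class w.p. P(z_k|d_i)
    w = rng.choice(3, p=p_w_z[z])   # generate a word w.p. P(w_j|z_k)
    return d, w                     # z remains unobserved

print([sample_pair() for _ in range(5)])
```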
23. Appendix
$$P(w_j|d_i) = \sum_{k=1}^K P(z_k|d_i)P(w_j|z_k)$$
since $\sum_{k=1}^K P(z_k|d_i) = 1$, $P(w_j|d_i)$ is a convex combination of the $P(w_j|z_k)$
≈ each document is modelled as a mixture of topics
Return
24. Appendix
$$P(z_k|d_i, w_j) = \frac{P(d_i, w_j|z_k)P(z_k)}{P(d_i, w_j)} \quad (5)$$
$$= \frac{P(w_j|z_k)P(d_i|z_k)P(z_k)}{P(d_i, w_j)} \quad (6)$$
$$= \frac{P(w_j|z_k)P(z_k|d_i)}{P(w_j|d_i)} \quad (7)$$
$$= \frac{P(w_j|z_k)P(z_k|d_i)}{\sum_{l=1}^K P(w_j|z_l)P(z_l|d_i)} \quad (8)$$
From (5) to (6) by the conditional independence assumption (3). From (7) to
(8) by (4). Return
25. Appendix
Lagrange multipliers $\tau_k$, $\rho_i$:
$$H = \sum_{i=1}^N \sum_{j=1}^M n(d_i, w_j) \sum_{k=1}^K P(z_k|d_i, w_j) \log[P(w_j|z_k)P(z_k|d_i)] + \sum_{k=1}^K \tau_k \left(1 - \sum_{j=1}^M P(w_j|z_k)\right) + \sum_{i=1}^N \rho_i \left(1 - \sum_{k=1}^K P(z_k|d_i)\right)$$
$$\frac{\partial H}{\partial P(w_j|z_k)} = \frac{\sum_{i=1}^N P(z_k|d_i, w_j)n(d_i, w_j)}{P(w_j|z_k)} - \tau_k = 0$$
$$\frac{\partial H}{\partial P(z_k|d_i)} = \frac{\sum_{j=1}^M n(d_i, w_j)P(z_k|d_i, w_j)}{P(z_k|d_i)} - \rho_i = 0$$
26. Appendix
from $\sum_{j=1}^M P(w_j|z_k) = 1$:
$$\tau_k = \sum_{j=1}^M \sum_{i=1}^N P(z_k|d_i, w_j)n(d_i, w_j)$$
from $\sum_{k=1}^K P(z_k|d_i, w_j) = 1$:
$$\rho_i = n(d_i)$$
$\implies$ the M-step updates for $P(w_j|z_k)$, $P(z_k|d_i)$. Return
27. Appendix
Applying Jensen's inequality
$f(x) = \log(x)$, a concave function:
$$f\left(E_{z^{(i)} \sim Q_i}\left[\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}\right]\right) \geq E_{z^{(i)} \sim Q_i}\left[f\left(\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}\right)\right]$$
Return