Online EM Algorithm and Some Extensions

                         Olivier Cappé

                     Télécom ParisTech & CNRS


                          March 2011




Online Estimation for Missing Data Models
Based on (C & Moulines, 2009) and (C, 2010)



Goals
              1  Maximum likelihood estimation, or
              1' Competitive with maximum likelihood estimation when #obs. is large
              2  Good scaling (performance vs. computational cost) as #obs. increases
              3  Process data on-the-fly (no storage)
              4  Simple to implement (no line-search, projection, preconditioning, etc.)


Outline

1   The EM Algorithm in Exponential Families

2   The Limiting EM Recursion

3   Online EM Algorithm
     The Algorithm
     Properties and Discussion

4   Use for Batch ML Estimation

5   Extensions

6   References


The EM Algorithm in Exponential Families


Missing Data Model


A missing data model is a statistical model {p_θ(x, y)}_{θ∈Θ} in which only Y
may be observed (the pair (X, Y) is referred to as the complete data)

    Hence, parameter estimates θ_n must be functions of the observations
    Y_1, . . . , Y_n only (here assumed to be independent and identically
    distributed)
    Of course, the statistical model could also be defined as {f_θ(y)}_{θ∈Θ},
    where f_θ(y) = ∫ p_θ(x, y) dx, but the specific structure of f_θ needs to
    be exploited

To analyze the methods, the data {Y_t}_{t≥1} is assumed to be generated by
an i.i.d. process with marginal π, not necessarily equal to f_θ



The EM Algorithm in Exponential Families




Finite Mixture Model
Mixture PDF

          f(y) = Σ_{i=1}^m α_i f_i(y)

Missing Data Interpretation

          P(X_t = i) = α_i
          Y_t | X_t = i ∼ f_i(y)




The EM Algorithm in Exponential Families


To determine the maximum likelihood estimate

          θ_n = arg max_θ Σ_{t=1}^n log f_θ(Y_t)

numerically, the standard approach is the following.

Expectation-Maximization (Dempster, Laird & Rubin, 1977)
Given a current parameter guess θ_n^k
     E-Step Compute

          q_{n,θ_n^k}(θ) = (1/n) Σ_{t=1}^n E_{θ_n^k}[ log p_θ(X_t, Y_t) | Y_t ]

     M-Step Update the parameter estimate to

          θ_n^{k+1} = arg max_{θ∈Θ} q_{n,θ_n^k}(θ)


The EM Algorithm in Exponential Families


Rationale
  1    It is an ascent algorithm (shown using Jensen's inequality)

             Figure: The EM intermediate quantity is a minorizing surrogate

  2    Because of the Fisher relation, the algorithm can only stop at a
       stationary point of the log-likelihood*

  * See (Wu, 1983) for the necessary topological and regularity assumptions
The EM Algorithm in Exponential Families




An Example: Poisson Mixture
  Likelihood

          f_θ(Y) = Σ_{j=1}^m α_j (λ_j^Y / Y!) e^{-λ_j}

“Complete-Data” Log-Likelihood

          log p_θ(X, Y) = − log(Y!) + Σ_{j=1}^m [log(α_j) − λ_j] 1{X = j}
                                    + Σ_{j=1}^m log(λ_j) Y 1{X = j}




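To make the model concrete, here is a minimal NumPy sketch that simulates data from this missing-data representation (the function name and the parameter values in the example are illustrative, not from the talk):

```python
import numpy as np

def sample_poisson_mixture(n, alpha, lam, seed=0):
    """Simulate n pairs (X_t, Y_t): X_t ~ Categorical(alpha), Y_t | X_t = j ~ Poisson(lam[j])."""
    rng = np.random.default_rng(seed)
    x = rng.choice(len(alpha), size=n, p=alpha)   # latent component labels
    y = rng.poisson(np.asarray(lam)[x])           # observed counts
    return x, y

# Example: two components with weights (0.3, 0.7) and rates (2.0, 9.0)
x, y = sample_poisson_mixture(2000, alpha=[0.3, 0.7], lam=[2.0, 9.0])
```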
The EM Algorithm in Exponential Families



EM Algorithm for the Poisson Mixture
 EM E-Step

          q_{n,θ_n^k}(θ) = Σ_{j=1}^m [log(α_j) − λ_j] (1/n) Σ_{t=1}^n P_{θ_n^k}(X_t = j | Y_t)
                         + Σ_{j=1}^m log(λ_j) (1/n) Σ_{t=1}^n Y_t P_{θ_n^k}(X_t = j | Y_t)

EM M-Step

          α_{n,j}^{k+1} = (1/n) Σ_{t=1}^n P_{θ_n^k}(X_t = j | Y_t)

          λ_{n,j}^{k+1} = Σ_{t=1}^n Y_t P_{θ_n^k}(X_t = j | Y_t) / Σ_{t=1}^n P_{θ_n^k}(X_t = j | Y_t)




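These E- and M-step formulas translate almost line by line into NumPy. The sketch below is a hedged illustration (the helper name and the use of scipy.stats are my own choices, not the author's code):

```python
import numpy as np
from scipy.stats import poisson

def poisson_mixture_batch_em(y, alpha, lam, n_iter=50):
    """Batch EM for the Poisson mixture: y is a 1-D array of counts,
    alpha (weights) and lam (rates) are length-m arrays of initial guesses."""
    y, alpha, lam = np.asarray(y), np.asarray(alpha, float), np.asarray(lam, float)
    for _ in range(n_iter):
        # E-step: posterior membership probabilities P_theta(X_t = j | Y_t)
        logw = np.log(alpha) + poisson.logpmf(y[:, None], lam)
        w = np.exp(logw - logw.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        # M-step: closed-form re-estimation of the weights and rates
        alpha = w.mean(axis=0)
        lam = (w * y[:, None]).sum(axis=0) / w.sum(axis=0)
    return alpha, lam
```

On data simulated as in the previous snippet, this recovers (α, λ) up to label permutation when the components are well separated.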
The EM Algorithm in Exponential Families


Exponential Family Model

In the following, we assume that the complete-data model belongs to an
exponential family

(Curved) Exponential Family Model

          p_θ(x, y) = exp( ⟨s(x, y), ψ(θ)⟩ − A(θ) )

                where s(x, y) is the vector of (complete-data) sufficient
                statistics

Explicit Complete-Data Maximum Likelihood

          S ↦ θ̄(S) = arg max_θ ⟨S, ψ(θ)⟩ − A(θ)

                is available in closed form


The EM Algorithm in Exponential Families


The EM Algorithm Revisited



The k-th EM Iteration (From n Observations)
     E-Step

          S_n^{k+1} = (1/n) Σ_{t=1}^n E_{θ_n^k}[ s(X_t, Y_t) | Y_t ]

     M-Step

          θ_n^{k+1} = θ̄(S_n^{k+1})




The Limiting EM Recursion


A Key Remark

The k-th EM Iteration (From n Observations)
     E-Step

          S_n^{k+1} = (1/n) Σ_{t=1}^n E_{θ_n^k}[ s(X_t, Y_t) | Y_t ]

     M-Step

          θ_n^{k+1} = θ̄(S_n^{k+1})

Can be fully reparameterized in the domain of sufficient statistics

          S_n^{k+1} = (1/n) Σ_{t=1}^n E_{θ̄(S_n^k)}[ s(X_t, Y_t) | Y_t ]



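For the Poisson mixture, the sufficient statistic is S = (S^α_j, S^λ_j)_{j=1..m} with θ̄(S) = (S^α_j, S^λ_j / S^α_j), so the reparameterized iteration can be sketched as follows (hypothetical helper names, for illustration only):

```python
import numpy as np
from scipy.stats import poisson

def theta_bar(s_alpha, s_lam):
    """Closed-form complete-data MLE θ̄(S) for the Poisson mixture."""
    return s_alpha, s_lam / s_alpha

def em_statistic_update(y, s_alpha, s_lam):
    """One EM iteration written purely in the domain of sufficient statistics:
    S^{k+1} = (1/n) Σ_t E_{θ̄(S^k)}[ s(X_t, Y_t) | Y_t ]."""
    y = np.asarray(y)
    alpha, lam = theta_bar(s_alpha, s_lam)
    logw = np.log(alpha) + poisson.logpmf(y[:, None], lam)
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w.mean(axis=0), (w * y[:, None]).mean(axis=0)
```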
The Limiting EM Recursion


The Limiting EM Recursion

By letting n tend to infinity, one obtains two equivalent updates:
Sufficient Statistics Update

          S^k = E_π( E_{θ̄(S^{k-1})}[ s(X_1, Y_1) | Y_1 ] )

Parameter Update

          θ^k = θ̄( E_π( E_{θ^{k-1}}[ s(X_1, Y_1) | Y_1 ] ) )

Using the usual EM arguments, these updates are such that
  1   The Kullback-Leibler divergence D(π|f_{θ^k}) is monotonically decreasing
      with k
  2   They converge to {θ : ∇_θ D(π|f_θ) = 0}


The Limiting EM Recursion


Batch EM Is Not Efficient for Large Data Records
see also (Neal & Hinton, 1999)


Figure:     Convergence of batch EM estimates of ‖u‖² as a function of the number of EM iterations for 2,000 (top) and
20,000 (bottom) observations. The box-and-whisker plots are computed from 1,000 independent replications of the simulated
data. The grey region corresponds to ±2 interquartile range (approx. 99.3% coverage) under the asymptotic Gaussian
approximation of the MLE (from [C, 2010]).

Online EM Algorithm   The Algorithm


The Online EM Algorithm




   The online EM algorithm outputs one updated parameter estimate θn
   after processing each individual observation Yn
   The parameter update is very similar to applying the EM algorithm to
   the single observation Yn (with smoothing)
   The memory footprint of the algorithm is constant while its
   computational cost is proportional to the number of processed
   observations




Online EM Algorithm   The Algorithm


Online EM: Rationale

   We try to locate the solutions of

          E_π( E_{θ̄(S)}[ s(X_1, Y_1) | Y_1 ] ) − S = 0

   Viewing E_{θ̄(S)}[ s(X_n, Y_n) | Y_n ] as a noisy observation of
   E_π( E_{θ̄(S)}[ s(X_1, Y_1) | Y_1 ] ), this is exactly the usual Stochastic
   Approximation (or Robbins-Monro) setup:

          S_n = S_{n-1} + γ_n ( E_{θ̄(S_{n-1})}[ s(X_n, Y_n) | Y_n ] − S_{n-1} )

   where (γ_n) is a sequence of decreasing positive stepsizes


Online EM Algorithm   The Algorithm


The Algorithm

Online EM Algorithm
Stochastic E-Step

          S_n = (1 − γ_n) S_{n-1} + γ_n E_{θ_{n-1}}[ s(X_n, Y_n) | Y_n ]

     M-Step

          θ_n = θ̄(S_n)

Practical Recommendations
                γ_n = 1/n^α with α ∈ [0.6, 0.7]
                Don't do the M-step for the first 10–20 observations
                (optional) Use Polyak-Ruppert averaging (requires choosing n0)

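A minimal sketch of this recursion with the above recommendations (step size n^{-α}, delayed M-step, optional Polyak-Ruppert averaging) is given below; `cond_expect_stats` and `theta_bar` are placeholder callbacks for the model-specific E- and M-step maps, and the parameter is assumed to be stored as a flat array:

```python
import numpy as np

def online_em(data, theta0, cond_expect_stats, theta_bar,
              alpha=0.6, n_min=20, n0=None):
    """Online EM: S_n = (1 - γ_n) S_{n-1} + γ_n E_{θ_{n-1}}[s(X_n, Y_n) | Y_n],
    θ_n = θ̄(S_n), with γ_n = n^{-α}; the M-step is skipped for the first n_min obs."""
    theta, s, thetas = np.asarray(theta0, float), None, []
    for n, y in enumerate(data, start=1):
        gamma = n ** (-alpha)
        e = np.asarray(cond_expect_stats(y, theta), float)
        s = e if s is None else (1.0 - gamma) * s + gamma * e
        if n > n_min:                      # don't update θ for the first observations
            theta = np.asarray(theta_bar(s), float)
        thetas.append(theta)
    if n0 is not None:                     # optional Polyak-Ruppert averaging
        theta = np.mean(np.asarray(thetas[n0:]), axis=0)
    return theta
```

For the Poisson mixture, `cond_expect_stats` would return the vector (p_{n,j}, p_{n,j} Y_n)_j of the next slide and `theta_bar` the corresponding closed-form map.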
Online EM Algorithm     The Algorithm


Online EM in the Poisson Mixture Example

SA E-Step
Computing Conditional Expectations

          p_{n,j} = α_{n-1,j} λ_{n-1,j}^{Y_n} e^{-λ_{n-1,j}} / Σ_{i=1}^m α_{n-1,i} λ_{n-1,i}^{Y_n} e^{-λ_{n-1,i}}

Statistics Update (Stochastic Approximation)

          S_{n,j}^α = (1 − γ_n) S_{n-1,j}^α + γ_n p_{n,j}
          S_{n,j}^λ = (1 − γ_n) S_{n-1,j}^λ + γ_n p_{n,j} Y_n

M-Step: Parameter Update

          α̂_{n,j} = S_{n,j}^α ,        λ̂_{n,j} = S_{n,j}^λ / S_{n,j}^α



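Put together, the three steps of this slide give, for instance, the following self-contained sketch (a hedged illustration, not the author's reference implementation):

```python
import numpy as np

def online_em_poisson_mixture(y, alpha, lam, step=0.6, n_min=20):
    """Online EM for the Poisson mixture with γ_n = n^{-step};
    alpha, lam are the initial weights and rates (length-m arrays)."""
    alpha, lam = np.asarray(alpha, float), np.asarray(lam, float)
    s_alpha, s_lam = alpha.copy(), alpha * lam        # start S at a point consistent with θ_0
    for n, yn in enumerate(np.asarray(y), start=1):
        gamma = n ** (-step)
        # SA E-step: posterior probabilities p_{n,j} (the -log(Y_n!) term cancels)
        logw = np.log(alpha) + yn * np.log(lam) - lam
        p = np.exp(logw - logw.max())
        p /= p.sum()
        s_alpha = (1.0 - gamma) * s_alpha + gamma * p
        s_lam = (1.0 - gamma) * s_lam + gamma * p * yn
        # M-step (skipped for the first n_min observations)
        if n > n_min:
            alpha, lam = s_alpha, s_lam / s_alpha
    return alpha, lam
```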
Online EM Algorithm   Properties and Discussion


Analysis
(C & Moulines, 2009)

Under Σ_n γ_n = ∞, Σ_n γ_n² < ∞, compactness of Θ and other regularity
assumptions

  1   The estimate θ_n converges to one of the roots of ∇_θ D(π|f_θ) = 0
  2   The algorithm is asymptotically equivalent to

          θ_n = θ_{n-1} + γ_n J^{-1}(θ_{n-1}) ∇_θ log f_{θ_{n-1}}(Y_n)

      where J(θ) = −E_π( E_θ[ ∇_θ² log p_θ(X_1, Y_1) | Y_1 ] )
  3   For a well-specified model (π = f_{θ⋆}) and under Polyak-Ruppert
      averaging†, θ̃_n is Fisher efficient

          √n (θ̃_n − θ⋆) −→_L N(0, I_f^{-1}(θ⋆))

      where I_f(θ⋆) = −E_{θ⋆}[ ∇_θ² log f_{θ⋆}(Y_1) ]

   † θ̃_n = 1/(n − n0) Σ_{t=n0+1}^n θ_t, with γ_n = n^{-α} and α ∈ (1/2, 1)
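The averaged estimate θ̃_n of the footnote can be computed from the stored parameter trajectory, for instance as in this small sketch (assuming the trajectory is kept as an (n, d) array):

```python
import numpy as np

def polyak_ruppert(thetas, n0):
    """θ̃_n = (1/(n - n0)) Σ_{t=n0+1}^n θ_t for every n > n0, where thetas is an
    (n, d) array holding the online EM trajectory θ_1, ..., θ_n."""
    tail = np.asarray(thetas, dtype=float)[n0:]
    return np.cumsum(tail, axis=0) / np.arange(1, len(tail) + 1)[:, None]
```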
Online EM Algorithm   Properties and Discussion


Some More Details
 1   (Andrieu et al., 2005) but also (Delyon, 1994), (Benaïm, 1999), using
     the fact that D(π|f_{θ̄(S)}) is a Lyapunov function:

          ⟨ ∇_S D(π|f_{θ̄(S)}) , E_π( E_{θ̄(S)}[ s(X_1, Y_1) | Y_1 ] ) − S ⟩ ≤ 0

     where the second argument of the inner product is the mean field of the
     recursion
 2   Taylor series expansion of θ̄ to establish the equivalence (with
     remainder a.s. o(γ_n))
 3   (Pelletier, 1998) to show that

          γ_n^{-1/2} (θ_n − θ⋆) −→_L N(0, I_p^{-1}(θ⋆)/2)

     in well-specified models (where I_p is the complete-data Fisher
     information matrix)
     General results of (Polyak and Juditsky, 1992), (Mokkadem and
     Pelletier, 2006) on averaging
Online EM Algorithm        Properties and Discussion


Illustration of Polyak-Ruppert Averaging


Figure:    Four superimposed trajectories of the estimate of u1 (first component of u) for various algorithm settings
(α = 0.9, α = 0.6 and α = 0.6 with Polyak-Ruppert averaging, from top to bottom). The actual value of u1 is equal to
zero.




Online EM Algorithm      Properties and Discussion


Performance of Online EM

Figure:     Online EM estimates of ‖u‖² for various data sizes (200, 2,000 and 20,000 observations, from left to right) and
algorithm settings (α = 0.9, α = 0.6 and α = 0.6 with Polyak-Ruppert averaging, from top to bottom). The
box-and-whisker plots (outlier plotting suppressed) are computed from 1,000 independent replications of the simulated data.
The grey regions correspond to ±2 interquartile range (approx. 99.3% coverage) under the asymptotic Gaussian
approximation of the MLE.


Online EM Algorithm   Properties and Discussion


Related Works


(Titterington, 1984) Proposes a gradient algorithm

                 θ_n = θ_{n-1} + γ_n I_p^{-1}(θ_{n-1}) ∇_θ log f_{θ_{n-1}}(Y_n)

                 It is asymptotically equivalent to the algorithm described
                 previously for well-specified models (π = f_{θ⋆})

(Neal & Hinton, 1999) Describe an algorithm called Incremental EM that
             is equivalent (during the first scan of the data only) to Online EM
             used with γ_n = 1/n

(Sato, 2000; Sato & Ishii, 2000) Describe the algorithm and provide some
             analysis in the flat-model case and for mixtures of Gaussians



Online EM Algorithm   Properties and Discussion


How Does This Work in Practice?


           Fine. But don't use‡ γ_n = 1/n

 Simulations in (C & Moulines, 2009) on mixtures of Gaussian regressions

Large-Scale Experiments on Real Data in (Liang & Klein, 2009), where
             the use of mini-batch blocking was found useful:
                          Apply the proposed algorithm considering
                          Y_{mk+1}, Y_{mk+2}, . . . , Y_{m(k+1)} as one observation
                 Mini-batch blocking is useful in dealing with mixture-like
                 models with infrequent components (see the sketch below)


   ‡
     γn = γ0 /(n0 + n) can be an option
but requires carefully setting γ0 and n0
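Mini-batch blocking simply averages the conditional expectations over m consecutive observations before the stochastic-approximation step; a possible sketch, reusing the callback convention of the generic loop given earlier:

```python
import numpy as np

def online_em_minibatch(data, theta0, cond_expect_stats, theta_bar,
                        m=10, alpha=0.6, k_min=2):
    """Online EM where each block Y_{mk+1}, ..., Y_{m(k+1)} is treated as a
    single observation: the expected statistics are averaged within the block."""
    data = np.asarray(data)
    theta, s = np.asarray(theta0, float), None
    for k in range(len(data) // m):
        gamma = (k + 1) ** (-alpha)
        block = data[k * m:(k + 1) * m]
        e = np.mean([cond_expect_stats(y, theta) for y in block], axis=0)
        s = e if s is None else (1.0 - gamma) * s + gamma * e
        if k + 1 > k_min:
            theta = np.asarray(theta_bar(s), float)
    return theta
```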
Online EM Algorithm   Properties and Discussion


Some Intuition About the Weights



If r_k = (1 − γ_k) r_{k-1} + γ_k E_k, for k ≥ 1

  1   r_n = Σ_{k=1}^n ω_k^n E_k + ω_0^n r_0 with Σ_{k=0}^n ω_k^n = 1

  2   ω_k^n = 1/(n + a) (for k ≥ 1) when γ_k = 1/(k + a), and is strictly
      increasing in k otherwise

  3   Σ_{k=1}^n (ω_k^n)² ≍ (1/2) n^{-α} when γ_k = k^{-α}, with 1/2 < α < 1

[Figure: weights ω_k^n for α = 1, α = 0.9 and α = 0.6, plotted for k up to 10,000]




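The weight formulas above are easy to check numerically; the short sketch below computes ω_k^n = γ_k Π_{j=k+1}^n (1 − γ_j) and ω_0^n = Π_{j=1}^n (1 − γ_j) and verifies points 1 and 2 for γ_k = 1/(k + a):

```python
import numpy as np

def online_weights(gammas):
    """Weights such that r_n = Σ_{k=1}^n ω_k^n E_k + ω_0^n r_0 for the recursion
    r_k = (1 - γ_k) r_{k-1} + γ_k E_k; gammas holds γ_1, ..., γ_n."""
    n = len(gammas)
    w = np.empty(n + 1)
    w[0] = np.prod(1.0 - gammas)                      # weight of the initial value r_0
    for k in range(1, n + 1):                         # ω_k^n = γ_k Π_{j=k+1}^n (1 - γ_j)
        w[k] = gammas[k - 1] * np.prod(1.0 - gammas[k:])
    return w

n, a = 50, 2.0
w = online_weights(1.0 / (np.arange(1, n + 1) + a))
print(np.isclose(w.sum(), 1.0), np.allclose(w[1:], 1.0 / (n + a)))   # True True
```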
Use for Batch ML Estimation


How to Use Online EM for Batch ML Estimation?


The most popular use of the method is to perform batch ML estimation
from very large datasets

Because we did not assume that π = f_θ, the previous analysis can be
applied to π ≡ the empirical measure associated with Y_1, . . . , Y_n
    Online EM can be used for batch ML estimation by (randomly)
    scanning the data Y_1, . . . , Y_n
    Convergence “speed” (with averaging) is (n_obs × n_scans)^{-1/2} versus
    ρ^{n_scans} for batch EM
    This is not a fair comparison in terms of computing time, as the M-step is
    not free and possible parallelization is ignored



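In practice this simply means feeding the online recursion with several randomly permuted passes over the fixed dataset; a hedged sketch reusing the Poisson mixture routine defined earlier:

```python
import numpy as np

def batch_ml_by_online_em(y, alpha, lam, n_scans=5, seed=0):
    """Batch ML estimation by running online EM over n_scans random scans of
    the fixed data Y_1, ..., Y_n (calls online_em_poisson_mixture from above)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    stream = np.concatenate([rng.permutation(y) for _ in range(n_scans)])
    return online_em_poisson_mixture(stream, alpha, lam)
```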
Use for Batch ML Estimation


Comparison With Batch and Incremental EM


Figure:     Normalized log-likelihood of the estimates obtained with, from top to bottom, batch EM, incremental EM and
online EM as a function of the number of batch tours (or iterations, for batch EM). The data is of length N = 100 and the
box-and-whisker plots summarize the results of 500 independent runs of the algorithms started from randomized starting points θ_0.



Use for Batch ML Estimation


Comparison With Batch and Incremental EM (Contd.)



                        Figure:     Same display for a data record of length N = 1,000.




Extensions


Summary



 The Good              Easy (esp. when EM implementation is available)
                       Can be used for ML estimation from a batch of
                       observations
                       Robust w.r.t. stepsize selection (note that the scale is
                       fixed due to the use of convex combinations)
                       Handles parameter constraints nicely (only requires that
                       S be closed under convex combinations with expected
                       sufficient statistics)




Extensions


Summary (Contd.)




  The Bad               Requires that the E-step be explicit
                        Requires that θ̄ be explicit
                        Not appropriate for short (say, less than 1000
                        observations) data records without cycling
                        What about non-independent observations?




Extensions


Online EM in Latent Factor Models (Ongoing Work)
Many models of the form

          C_n | H_n ∼ g_{ Σ_{k=1}^K θ_k H_{n,k} }

where {g_λ}_{λ∈Λ} is an exponential family of distributions and H_n is a latent
random vector of positive weights (probabilistic matrix factorization,
discrete component analysis, partial membership models, simplicial
mixtures)




                Figure:   Bayesian network representations of Latent Dirichlet Allocation (LDA)

Extensions


Simulated Online EM Algorithm for LDA


For n = 1, . . .
Simulated E-step
                 Simulate H̃_n given C_n and θ_{n-1}
                 (in practice, using a short run of
                 Metropolis-Hastings or collapsed
                 Gibbs sampling)
                 Use the Rao-Blackwellized update

          S_n = (1 − γ_n) S_{n-1} + γ_n E_{θ_{n-1}}[ s(Z_n, W_n) | W_n, H̃_n ]

      M-step   θ_n = θ̄(S_n)



Extensions




Ignoring the sampling bias, this recursion can be analyzed and has the
same asymptotic properties as the online EM algorithm

In particular, for well-specified models,

          γ_n^{-1/2} (θ_n − θ⋆) −→_L N(0, I_f^{-1}(θ⋆))

instead of

          γ_n^{-1/2} (θ_n − θ⋆) −→_L N(0, I_p^{-1}(θ⋆))

for the “exact” online EM algorithm (where I_p(θ⋆) = −E_{θ⋆}[ ∇_θ² log p_{θ⋆}(X_1, Y_1) ]).




References



Cappé, O. & Moulines, E. (2009). On-line expectation-maximization algorithm for
latent data models. J. Roy. Statist. Soc. B, 71(3):593-613.
Cappé, O. (2011). Online Expectation-Maximisation. To appear in Mengersen, K.,
Titterington, M., & Robert, C. P., eds., Mixtures, Wiley.
Liang, P. & Klein, D. (2009). Online EM for Unsupervised Models. In Proc.
NAACL Conference.
Neal, R. M. & Hinton, G. E. (1999). A view of the EM algorithm that justifies
incremental, sparse, and other variants. In Jordan, M. I., ed., Learning in graphical
models, pages 355–368. MIT Press, Cambridge, MA, USA.
Rohde, D. & Cappé, O. (2011). Online maximum-likelihood estimation for latent
factor models. Submitted.
Sato, M. (2000). Convergence of on-line EM algorithm. In Proc. International
Conference on Neural Information Processing, 1:476-481.
Sato, M. & Ishii, S. (2000). On-line EM algorithm for the normalized Gaussian
network. Neural Computation, 12:407-432.
Titterington, D. M. (1984). Recursive parameter estimation using incomplete
data. J. Roy. Statist. Soc. B, 46(2):257-267.



Olivier Cappé's talk at BigMC March 2011

  • 1. Online EM Algorithm and Some Extensions Olivier Capp´ e T´l´com ParisTech & CNRS ee March 2011 0. Capp´ (@ BigMC) e Online EM Algorithm March 2011 1 / 34
  • 2. Online Estimation for Missing Data Models Based on (C & Moulines, 2009) and (C, 2010) Goals 1 Maximum likelihood estimation, or 1’ Competitive with maximum likelihood estimation when #obs. is large 2 Good scaling (performance vs. computational cost) as #obs. increases (3) Process data on-the-fly (no storage) 4 Simple to implement (no line-search, projection, preconditioning, etc.) 0. Capp´ (@ BigMC) e Online EM Algorithm March 2011 2 / 34
  • 3. Outline 1 The EM Algorithm in Exponential Families 2 The Limiting EM Recursion 3 Online EM Algorithm The Algorithm Properties and Discussion 4 Use for Batch ML Estimation 5 Extensions 6 References 0. Capp´ (@ BigMC) e Online EM Algorithm March 2011 3 / 34
  • 4. The EM Algorithm in Exponential Families Missing Data Model A missing data model is a statistical model {pθ (x, y)}θ∈Θ in which only Y may be observed (the couple (X, Y ) is referred to as the complete data) Hence, parameter estimates θn must be function of observations Y1 , . . . , Yn only (here assumed to be independent and identically distributed) Of course, the statistical model could also be defined as {fθ (y)}θ∈Θ , where fθ (y) = pθ (x, y)dx but the specific structure of fθ needs to be exploited To analyze the methods the data {Yt }t≥1 is assumed to be generated by an iid. process with marginal π, not necessarily equal to fθ 0. Capp´ (@ BigMC) e Online EM Algorithm March 2011 4 / 34
  • 5. The EM Algorithm in Exponential Families Finite Mixture Model Mixture PDF m f (y) = αi fi (y) i=1 Missing Data Interpretation P(Xt = i) = αi Yt |Xt = i ∼ fi (y) 0. Capp´ (@ BigMC) e Online EM Algorithm March 2011 5 / 34
  • 6. The EM Algorithm in Exponential Families To determine the maximum likelihood estimate n θn = arg max log fθ (Yt ) θ t=1 numerically, the standard approach is the following. Expectation-Maximization (Dempster, Laird & Rubin, 1977) k Given a current parameter guess θn E-Step Compute n 1 qn,θn (θ) = k Eθn [ log pθ (Xt , Yt )| Yt ] k n t=1 M-Step Update the parameter estimate to k+1 θn = arg max qn,θn (θ) k θ∈Θ 0. Capp´ (@ BigMC) e Online EM Algorithm March 2011 6 / 34
  • 7. The EM Algorithm in Exponential Families Rationale 1 It is an ascent algorithm (shown using Jensen inequality) Figure: The EM intermediate quantity is a minorizing surrogate 2 Because of Fisher relation, the algorithm can only stop in a stationary point of the log-likelihood∗ ∗ See (Wu, 1983) for necessary topological and regularity assumptions 0. Capp´ (@ BigMC) e Online EM Algorithm March 2011 7 / 34
  • 8. The EM Algorithm in Exponential Families An Example: Poisson Mixture Likelihood m λj Y −λj fθ (Y ) = αj e Y! j=1 “Complete-Data” Log-Likelihood log pθ (X, Y ) = − log(Y !) m + [log(αj ) − λj ] 1{X = j} j=1 m + log(λj )Y 1{X = j} j=1 0. Capp´ (@ BigMC) e Online EM Algorithm March 2011 8 / 34
  • 9. The EM Algorithm in Exponential Families EM Algorithm for the Poisson Mixture EM E-Step m n 1 qn,θn = k [log(αj ) − λj ] Pθn (Xt = j|Yt ) k n j=1 t=1 m n 1 + log(λj ) Yt Pθn (Xt = j|Yt ) k n j=1 t=1 EM M-Step n k+1 1 αn,j = Pθn (Xt = j|Yt ) k n t=1 n t=1 Yt Pθn (Xt = j|Yt ) k λk+1 = n,j n t=1 Pθn (Xt = j|Yt ) k 0. Capp´ (@ BigMC) e Online EM Algorithm March 2011 9 / 34
  • 10. The EM Algorithm in Exponential Families Exponential Family Model In the following, we assume that the complete-data model belongs to an exponential family (Curved) Exponential Family Model pθ (x, y) = exp ( s(x, y), ψ(θ) − A(θ)) where s(x, y) is the vector (complete-data) sufficient statistics Explicit Complete-Data Maximum Likelihood ¯ S → θ(S) = arg max S, ψ(θ) − A(θ) θ is available in closed-form 0. Capp´ (@ BigMC) e Online EM Algorithm March 2011 10 / 34
  • 11. The EM Algorithm in Exponential Families The EM Algorithm Revisited The k-th EM Iteration (From n Observations) E-Step n k+1 1 Sn = Eθn [ s(Xt , Yt )| Yt ] k n t=1 M-Step k+1 ¯ k+1 θn = θ Sn 0. Capp´ (@ BigMC) e Online EM Algorithm March 2011 11 / 34
  • 12. The Limiting EM Recursion A Key Remark The k-th EM Iteration (From n Observations) E-Step n k+1 1 Sn = Eθn [ s(Xt , Yt )| Yt ] k n t=1 M-Step k+1 ¯ k+1 θn = θ Sn Can be fully reparameterized in the domain of sufficient statistics n k+1 1 Sn = Eθ(Sn ) [ s(Xt , Yt )| Yt ] ¯ k n t=1 0. Capp´ (@ BigMC) e Online EM Algorithm March 2011 12 / 34
  • 13. The Limiting EM Recursion The Limiting EM Recursion By letting n tend to infinity, one obtains two equivalent updates: Sufficient Statistics Update S k = Eπ Eθ(S k−1 ) [ s(X1 , Y1 )| Y1 ] ¯ Parameter Update ¯ θk = θ {Eπ (Eθk−1 [ s(X1 , Y1 )| Y1 ])} Using usual EM arguments, these updates are such that 1 The Kullback-Leibler divergence D(π|fθk ) is monotonically decreasing with k 2 Converge to {θ : θ D(π|fθ ) = 0} 0. Capp´ (@ BigMC) e Online EM Algorithm March 2011 13 / 34
  • 14. The Limiting EM Recursion Batch EM Is Not Efficient for Large Data Records see also (Neal & Hinton, 1999) 3 2 10 observations 4 3 ||u||2 2 1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 3 20 10 observations 4 3 ||u||2 2 1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Batch EM iterations Figure: Convergence of batch EM estimates of u 2 as a function of the number of EM iterations for 2,000 (top) and 20,000 (bottom) observations. The box-and-whisker plots are computed from 1,000 independent replications of the simulated data. The grey region corresponds to ±2 interquartile range (approx. 99.3% coverage) under the asymptotic Gaussian approximation of the MLE (from [C, 2010]). 0. Capp´ (@ BigMC) e Online EM Algorithm March 2011 14 / 34
  • 15. Online EM Algorithm The Algorithm The Online EM Algorithm The online EM algorithm outputs one updated parameter estimate θn after processing each individual observation Yn The parameter update is very similar to applying the EM algorithm to the single observation Yn (with smoothing) The memory footprint of the algorithm is constant while its computational cost is proportional to the number of processed observations 0. Capp´ (@ BigMC) e Online EM Algorithm March 2011 15 / 34
• 16. Online EM Algorithm / The Algorithm: Online EM Rationale

We try to locate the solutions of
$$\mathbb{E}_\pi\left(\mathbb{E}_{\bar\theta(S)}\left[\left. s(X_1, Y_1)\right| Y_1\right]\right) - S = 0$$

Viewing $\mathbb{E}_{\bar\theta(S)}\left[\left. s(X_n, Y_n)\right| Y_n\right]$ as a noisy observation of $\mathbb{E}_\pi\left(\mathbb{E}_{\bar\theta(S)}\left[\left. s(X_1, Y_1)\right| Y_1\right]\right)$, this is exactly the usual stochastic approximation (or Robbins-Monro) setup:
$$S_n = S_{n-1} + \gamma_n \left(\mathbb{E}_{\bar\theta(S_{n-1})}\left[\left. s(X_n, Y_n)\right| Y_n\right] - S_{n-1}\right)$$
where $(\gamma_n)$ is a sequence of decreasing positive stepsizes.
• 17. Online EM Algorithm / The Algorithm: The Algorithm

Online EM algorithm

Stochastic E-step
$$S_n = (1 - \gamma_n)\, S_{n-1} + \gamma_n\, \mathbb{E}_{\theta_{n-1}}\left[\left. s(X_n, Y_n)\right| Y_n\right]$$

M-step
$$\theta_n = \bar\theta(S_n)$$

Practical recommendations
- Use $\gamma_n = 1/n^\alpha$ with $\alpha \in [0.6, 0.7]$.
- Do not perform the M-step for the first 10-20 observations (optional).
- Use Polyak-Ruppert averaging (requires choosing $n_0$).
• 18. Online EM Algorithm / The Algorithm: Online EM in the Poisson Mixture Example

SA E-step

Computing conditional expectations:
$$p_{n,j} = \frac{\alpha_{n-1,j}\, \lambda_{n-1,j}^{Y_n}\, e^{-\lambda_{n-1,j}}}{\sum_{i=1}^m \alpha_{n-1,i}\, \lambda_{n-1,i}^{Y_n}\, e^{-\lambda_{n-1,i}}}$$

Statistics update (stochastic approximation):
$$S_{n,j}^\alpha = (1 - \gamma_n)\, S_{n-1,j}^\alpha + \gamma_n\, p_{n,j}, \qquad S_{n,j}^\lambda = (1 - \gamma_n)\, S_{n-1,j}^\lambda + \gamma_n\, p_{n,j}\, Y_n$$

M-step (parameter update):
$$\hat\alpha_{n,j} = S_{n,j}^\alpha, \qquad \hat\lambda_{n,j} = S_{n,j}^\lambda / S_{n,j}^\alpha$$
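A minimal Python sketch of this recursion (our own illustration, not code from the paper; the step-size exponent, the burn-in before the first M-step and the initialization of the statistics are choices that follow the practical recommendations of the previous slide):

```python
import numpy as np
from scipy.stats import poisson

def online_em_poisson(Y, alpha, lam, step_exponent=0.6, n_min=20):
    """Online EM for an m-component Poisson mixture (illustrative sketch).

    alpha, lam: initial mixture weights and Poisson means;
    step_exponent: gamma_n = n ** (-step_exponent);
    n_min: number of observations processed before the first M-step.
    """
    alpha = np.asarray(alpha, dtype=float)
    lam = np.asarray(lam, dtype=float)
    # initialize the sufficient statistics at their value under (alpha, lam)
    S_alpha, S_lam = alpha.copy(), alpha * lam
    for n, y in enumerate(Y, start=1):
        gamma = n ** (-step_exponent)
        # SA E-step: posterior probabilities p_{n,j} under the current parameters
        logw = np.log(alpha) + poisson.logpmf(y, lam)
        p = np.exp(logw - logw.max())
        p /= p.sum()
        # stochastic-approximation update of the sufficient statistics
        S_alpha = (1 - gamma) * S_alpha + gamma * p
        S_lam = (1 - gamma) * S_lam + gamma * p * y
        # M-step (skipped for the first n_min observations, as recommended)
        if n > n_min:
            alpha, lam = S_alpha, S_lam / S_alpha
    return alpha, lam
```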
• 19. Online EM Algorithm / Properties and Discussion: Analysis (C & Moulines, 2009)

Under $\sum_n \gamma_n = \infty$, $\sum_n \gamma_n^2 < \infty$, compactness of $\Theta$ and other regularity assumptions:

1. The estimate $\theta_n$ converges to one of the roots of $\nabla_\theta D(\pi \| f_\theta) = 0$.
2. The algorithm is asymptotically equivalent to
$$\theta_n = \theta_{n-1} + \gamma_n\, J^{-1}(\theta_{n-1})\, \nabla_\theta \log f_{\theta_{n-1}}(Y_n)$$
where $J(\theta) = -\mathbb{E}_\pi\left(\mathbb{E}_\theta\left[\left. \nabla_\theta^2 \log p_\theta(X_1, Y_1)\right| Y_1\right]\right)$.
3. For a well-specified model ($\pi = f_{\theta_\star}$) and under Polyak-Ruppert averaging†, $\tilde\theta_n$ is Fisher efficient:
$$\sqrt{n}\,(\tilde\theta_n - \theta_\star) \xrightarrow{\;\mathcal{L}\;} \mathcal{N}\left(0, I_f^{-1}(\theta_\star)\right) \quad \text{where } I_f(\theta_\star) = -\mathbb{E}_{\theta_\star}\left[\nabla_\theta^2 \log f_{\theta_\star}(Y_1)\right]$$

† $\tilde\theta_n = \frac{1}{n - n_0}\sum_{t=n_0+1}^n \theta_t$, with $\gamma_n = n^{-\alpha}$ and $\alpha \in (1/2, 1)$.
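As a small sketch of the averaging step $\tilde\theta_n$ defined in the footnote (our own helper; the trajectory storage format is an assumption made for the example):

```python
import numpy as np

def polyak_ruppert_average(theta_trajectory, n0):
    """Average the online EM iterates theta_{n0+1}, ..., theta_n.

    theta_trajectory: array-like of shape (n, dim) holding the successive
    estimates; n0: number of initial iterates discarded before averaging.
    The same quantity can also be computed recursively as
    theta_bar_k = theta_bar_{k-1} + (theta_k - theta_bar_{k-1}) / (k - n0).
    """
    theta = np.asarray(theta_trajectory, dtype=float)
    return theta[n0:].mean(axis=0)
```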
• 20. Online EM Algorithm / Properties and Discussion: Some More Details

1. (Andrieu et al., 2005), but also (Delyon, 1994) and (Benaïm, 1999), using the fact that $D(\pi \| f_{\bar\theta(S)})$ is a Lyapunov function:
$$\left\langle \nabla_S D(\pi \| f_{\bar\theta(S)})\,,\; \underbrace{\mathbb{E}_\pi\left(\mathbb{E}_{\bar\theta(S)}\left[\left. s(X_1, Y_1)\right| Y_1\right]\right) - S}_{\text{mean field}} \right\rangle \le 0$$
2. A Taylor series expansion of $\bar\theta$ to establish the equivalence (with remainder a.s. $o(\gamma_n)$).
3. (Pelletier, 1998) to show that
$$\gamma_n^{-1/2}\,(\theta_n - \theta_\star) \xrightarrow{\;\mathcal{L}\;} \mathcal{N}\left(0, I_p^{-1}(\theta_\star)/2\right)$$
in well-specified models (where $I_p$ is the complete-data Fisher information matrix); general results of (Polyak & Juditsky, 1992) and (Mokkadem & Pelletier, 2006) on averaging.
• 21. Online EM Algorithm / Properties and Discussion: Illustration of Polyak-Ruppert Averaging

[Figure: Four superimposed trajectories of the estimate of $u_1$ (first component of $u$) as a function of the number of observations (up to 2,000), for various algorithm settings ($\alpha = 0.9$, $\alpha = 0.6$ and $\alpha = 0.6$ with halfway Polyak-Ruppert averaging, from top to bottom). The actual value of $u_1$ is equal to zero.]
• 22. Online EM Algorithm / Properties and Discussion: Performance of Online EM

[Figure: Online EM estimates of $\|u\|^2$ for various data sizes (200, 2,000 and 20,000 observations, from left to right) and algorithm settings ($\alpha = 0.9$, $\alpha = 0.6$ and $\alpha = 0.6$ with halfway Polyak-Ruppert averaging, from top to bottom). The box-and-whisker plots (outlier plotting suppressed) are computed from 1,000 independent replications of the simulated data. The grey regions correspond to ±2 interquartile range (approx. 99.3% coverage) under the asymptotic Gaussian approximation of the MLE.]
• 23. Online EM Algorithm / Properties and Discussion: Related Works

(Titterington, 1984) proposes a gradient algorithm
$$\theta_n = \theta_{n-1} + \gamma_n\, I_p^{-1}(\theta_{n-1})\, \nabla_\theta \log f_{\theta_{n-1}}(Y_n)$$
It is asymptotically equivalent to the previously described algorithm for well-specified models ($\pi = f_{\theta_\star}$).

(Neal & Hinton, 1999) describe an algorithm called incremental EM that is equivalent (up to the first batch scan only) to online EM used with $\gamma_n = 1/n$.

(Sato, 2000; Sato & Ishii, 2000) describe the algorithm and provide some analysis in the flat model case and for mixtures of Gaussians.
• 24. Online EM Algorithm / Properties and Discussion: How Does This Work in Practice?

Fine, but don't use‡ $\gamma_n = 1/n$.

Simulations in (C & Moulines, 2009) on mixtures of Gaussian regressions.

Large-scale experiments on real data in (Liang & Klein, 2009), where the use of mini-batch blocking was found useful: apply the proposed algorithm considering $Y_{mk+1}, Y_{mk+2}, \dots, Y_{m(k+1)}$ as one observation. Mini-batch blocking is useful for dealing with mixture-like models that have infrequent components (a minimal sketch is given below).

‡ $\gamma_n = \gamma_0/(n_0 + n)$ can be an option, but requires carefully setting $\gamma_0$ and $n_0$.
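A sketch of the mini-batch variant for the Poisson mixture (our own illustration; the block size and step-size schedule are arbitrary choices, not values from the cited experiments):

```python
import numpy as np
from scipy.stats import poisson

def online_em_poisson_minibatch(Y, alpha, lam, block_size=20, step_exponent=0.6):
    """Online EM treating each block of block_size observations as a single
    'observation': the expected sufficient statistics are averaged over the
    block before the stochastic-approximation update (illustrative sketch)."""
    Y = np.asarray(Y)
    alpha = np.asarray(alpha, dtype=float)
    lam = np.asarray(lam, dtype=float)
    S_alpha, S_lam = alpha.copy(), alpha * lam
    for k in range(len(Y) // block_size):
        gamma = (k + 1) ** (-step_exponent)
        Yk = Y[k * block_size:(k + 1) * block_size]
        # E-step over the whole mini-batch, shape (block_size, m)
        logw = np.log(alpha) + poisson.logpmf(Yk[:, None], lam)
        p = np.exp(logw - logw.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        # one SA update using the block-averaged statistics
        S_alpha = (1 - gamma) * S_alpha + gamma * p.mean(axis=0)
        S_lam = (1 - gamma) * S_lam + gamma * (p * Yk[:, None]).mean(axis=0)
        alpha, lam = S_alpha, S_lam / S_alpha
    return alpha, lam
```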
• 25. Online EM Algorithm / Properties and Discussion: Some Intuition About the Weights

If $r_k = (1 - \gamma_k)\, r_{k-1} + \gamma_k E_k$, for $k \ge 1$:

1. $r_n = \sum_{k=1}^n \omega_k^n E_k + \omega_0^n r_0$, with $\sum_{k=0}^n \omega_k^n = 1$;
2. $\omega_k^n = \frac{1}{n+a}$ (for $k \ge 1$) when $\gamma_k = 1/(k+a)$, and $(\omega_k^n)_k$ is strictly increasing otherwise;
3. $\sum_{k=1}^n (\omega_k^n)^2 \asymp n^{-\alpha}$ when $\gamma_k = k^{-\alpha}$, with $1/2 < \alpha < 1$.

[Figure: weights $\omega_k^n$ plotted against $k$ (for $n = 10{,}000$) for $\alpha = 1$, $\alpha = 0.9$ and $\alpha = 0.6$, from top to bottom.]
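For completeness, unrolling the recursion gives the explicit form of these weights (a standard computation added here for reference, not shown on the slide):

$$r_n = \sum_{k=1}^{n} \omega_k^n E_k + \omega_0^n r_0, \qquad \omega_k^n = \gamma_k \prod_{j=k+1}^{n} (1 - \gamma_j) \;\; (k \ge 1), \qquad \omega_0^n = \prod_{j=1}^{n} (1 - \gamma_j),$$

and $\sum_{k=0}^n \omega_k^n = 1$ follows by telescoping.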
• 26. Use for Batch ML Estimation: How to Use Online EM for Batch ML Estimation?

- The most popular use of the method is to perform batch ML estimation from very large datasets.
- Because we did not assume that $\pi = f_\theta$, the previous analysis can be applied to $\pi \equiv$ the empirical measure associated with $Y_1, \dots, Y_n$.
- Online EM can thus be used for batch ML estimation by (randomly) scanning the data $Y_1, \dots, Y_n$.
- Convergence "speed" (with averaging) is $(n_{\mathrm{obs.}} \times n_{\mathrm{scans}})^{-1/2}$, versus $\rho^{\,n_{\mathrm{scans}}}$ for batch EM.
- This is not a fair comparison in terms of computing time, as the M-step is not free and possible parallelization is ignored.
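Schematically, this amounts to feeding the online algorithm with randomly permuted copies of the batch; a sketch reusing the online_em_poisson function defined earlier (our own illustration; the number of scans is an arbitrary choice):

```python
import numpy as np

def online_em_batch_scan(Y, alpha, lam, n_scans=5, seed=0):
    """Batch ML estimation by running online EM over random permutations of a
    fixed dataset (illustrative sketch, reusing online_em_poisson from above)."""
    rng = np.random.default_rng(seed)
    Y = np.asarray(Y)
    # concatenate n_scans independent random scans of the data into one stream
    stream = np.concatenate([rng.permutation(Y) for _ in range(n_scans)])
    return online_em_poisson(stream, alpha, lam)
```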
• 27. Use for Batch ML Estimation: Comparison With Batch and Incremental EM

[Figure: Normalized log-likelihood of the estimates obtained with, from top to bottom, batch EM, incremental EM and online EM, as a function of the number of batch tours (or iterations, for batch EM). The data is of length N = 100 and the box-and-whisker plots summarize the results of 500 independent runs of the algorithms started from randomized starting points $\theta_0$.]
• 28. Use for Batch ML Estimation: Comparison With Batch and Incremental EM (Contd.)

[Figure: Same display for a data record of length N = 1,000.]
• 29. Extensions: Summary

The Good
- Easy to implement (especially when an EM implementation is already available).
- Can be used for ML estimation from a batch of observations.
- Robust with respect to stepsize selection (note that the scale is fixed, due to the use of convex combinations).
- Handles parameter constraints nicely (requires only that the set of sufficient statistics S be closed under convex combinations with expected sufficient statistics).
• 30. Extensions: Summary (Contd.)

The Bad
- Requires that the E-step be explicit.
- Requires that $\bar\theta$ be explicit.
- Not appropriate for short data records (say, fewer than 1,000 observations) without cycling.
- What about non-independent observations?
• 31. Extensions: Online EM in Latent Factor Models (Ongoing Work)

Many models of the form
$$C_n \mid H_n \sim g_{\sum_{k=1}^K \theta_k H_{n,k}}$$
where $\{g_\lambda\}_{\lambda \in \Lambda}$ is an exponential family of distributions and $H_n$ is a latent random vector of positive weights (probabilistic matrix factorization, discrete component analysis, partial membership models, simplicial mixtures).

[Figure: Bayesian network representations of Latent Dirichlet Allocation (LDA).]
• 32. Extensions: Simulated Online EM Algorithm for LDA

For n = 1, ...

Simulated E-step
- Simulate $\tilde H_n$ given $C_n$ and $\theta_{n-1}$ (in practice, using a short run of Metropolis-Hastings or collapsed Gibbs sampling).
- Use the Rao-Blackwellized update
$$S_n = (1 - \gamma_n)\, S_{n-1} + \gamma_n\, \mathbb{E}_{\theta_{n-1}}\left[\left. s(Z_n, W_n)\right| W_n, \tilde H_n\right]$$

M-step
$$\theta_n = \bar\theta(S_n)$$
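Independently of the LDA specifics, the simulated E-step can be organized as in the following structural sketch (our own Python skeleton; sample_latent, expected_stats and theta_bar are hypothetical placeholders for the model-specific operations, not functions from the paper):

```python
def simulated_online_em(observations, theta0, S0, gamma,
                        theta_bar, sample_latent, expected_stats):
    """Structural sketch of simulated online EM (all names are placeholders).

    sample_latent(obs, theta)     -> a draw of the latent vector H_n (e.g. a few
                                     Metropolis-Hastings or collapsed Gibbs steps)
    expected_stats(obs, H, theta) -> Rao-Blackwellized conditional expectation
                                     of the complete-data sufficient statistics
    theta_bar(S)                  -> complete-data ML map S -> theta
    gamma(n)                      -> step size, e.g. n ** -0.6
    """
    theta, S = theta0, S0
    for n, obs in enumerate(observations, start=1):
        H = sample_latent(obs, theta)                                # simulated E-step
        S = (1 - gamma(n)) * S + gamma(n) * expected_stats(obs, H, theta)
        theta = theta_bar(S)                                         # M-step
    return theta
```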
• 33. Extensions

Ignoring the sampling bias, this recursion can be analyzed and has the same asymptotic properties as the online EM algorithm. In particular, for well-specified models,
$$\gamma_n^{-1/2}\,(\theta_n - \theta_\star) \xrightarrow{\;\mathcal{L}\;} \mathcal{N}\left(0, I_f^{-1}(\theta_\star)\right)$$
instead of
$$\gamma_n^{-1/2}\,(\theta_n - \theta_\star) \xrightarrow{\;\mathcal{L}\;} \mathcal{N}\left(0, I_p^{-1}(\theta_\star)\right)$$
for the "exact" online EM algorithm, where $I_p(\theta_\star) = -\mathbb{E}_{\theta_\star}\left[\nabla_\theta^2 \log p_{\theta_\star}(X_1, Y_1)\right]$.
• 34. References

- Cappé, O. & Moulines, E. (2009). On-line expectation-maximization algorithm for latent data models. J. Roy. Statist. Soc. B, 71(3):593-613.
- Cappé, O. (2011). Online expectation-maximisation. To appear in Mengersen, K., Titterington, M., & Robert, C. P., eds., Mixtures, Wiley.
- Liang, P. & Klein, D. (2009). Online EM for unsupervised models. In Proc. NAACL Conference.
- Neal, R. M. & Hinton, G. E. (1999). A view of the EM algorithm that justifies incremental, sparse, and other variants. In Jordan, M. I., ed., Learning in Graphical Models, pages 355-368. MIT Press, Cambridge, MA, USA.
- Rohde, D. & Cappé, O. (2011). Online maximum-likelihood estimation for latent factor models. Submitted.
- Sato, M. (2000). Convergence of on-line EM algorithm. In Proc. International Conference on Neural Information Processing, 1:476-481.
- Sato, M. & Ishii, S. (2000). On-line EM algorithm for the normalized Gaussian network. Neural Computation, 12:407-432.
- Titterington, D. M. (1984). Recursive parameter estimation using incomplete data. J. Roy. Statist. Soc. B, 46(2):257-267.