Online EM Algorithm and Some Extensions

                         Olivier Cappé

                     Télécom ParisTech & CNRS

                          March 2011

O. Cappé (@ BigMC)   Online EM Algorithm   March 2011
Online Estimation for Missing Data Models
Based on (C & Moulines, 2009) and (C, 2010)

              1    Maximum likelihood estimation, or
                   1'  Competitive with maximum likelihood estimation when the
                       number of observations is large

              2    Good scaling (performance vs. computational cost) as the number
                   of observations increases

              (3)  Process data on the fly (no storage)
              4    Simple to implement (no line search, projection,
                   preconditioning, etc.)


1   The EM Algorithm in Exponential Families

2   The Limiting EM Recursion

3   Online EM Algorithm
     The Algorithm
     Properties and Discussion

4   Use for Batch ML Estimation

5   Extensions

6   References

The EM Algorithm in Exponential Families

Missing Data Model

A missing data model is a statistical model $\{p_\theta(x, y)\}_{\theta \in \Theta}$ in which only $Y$
may be observed (the couple $(X, Y)$ is referred to as the complete data).

    Hence, parameter estimates $\theta_n$ must be functions of the observations
    $Y_1, \dots, Y_n$ only (here assumed to be independent and identically
    distributed).
    Of course, the statistical model could also be defined as $\{f_\theta(y)\}_{\theta \in \Theta}$,
    where $f_\theta(y) = \int p_\theta(x, y)\,dx$, but the specific structure of $f_\theta$ needs to
    be exploited.

To analyze the methods, the data $\{Y_t\}_{t \ge 1}$ is assumed to be generated by
an i.i.d. process with marginal $\pi$, not necessarily equal to $f_\theta$.


Finite Mixture Model
Mixture PDF
$$f(y) = \sum_{i=1}^{m} \alpha_i f_i(y)$$

Missing Data Interpretation
$$P(X_t = i) = \alpha_i, \qquad Y_t \mid X_t = i \sim f_i(y)$$
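
The missing-data interpretation doubles as a simulation recipe: draw the hidden label X_t first, then draw Y_t from the selected component. The sketch below assumes Poisson components (the example used later in the talk); the function name and the parameter values in the usage note are illustrative and do not come from the slides.

```python
import numpy as np

def sample_poisson_mixture(n, alpha, lam, rng):
    """Draw n pairs (X_t, Y_t) from a Poisson mixture.

    alpha: mixture weights (sum to 1); lam: Poisson means of the components.
    """
    alpha = np.asarray(alpha, dtype=float)
    lam = np.asarray(lam, dtype=float)
    x = rng.choice(len(alpha), size=n, p=alpha)  # missing labels, P(X_t = i) = alpha_i
    y = rng.poisson(lam[x])                      # observations, Y_t | X_t = i ~ Poisson(lam_i)
    return x, y
```

For instance, `sample_poisson_mixture(1000, [0.3, 0.7], [1.0, 8.0], np.random.default_rng(0))` returns the labels together with the counts; an estimation method would only be given the counts.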


To determine the maximum likelihood estimate
$$\hat\theta_n = \arg\max_\theta \sum_{t=1}^{n} \log f_\theta(Y_t)$$
numerically, the standard approach is the following.

Expectation-Maximization (Dempster, Laird & Rubin, 1977)
Given a current parameter guess $\theta_n^k$

     E-Step Compute
     $$q_{n,\theta_n^k}(\theta) = \sum_{t=1}^{n} E_{\theta_n^k}[\log p_\theta(X_t, Y_t) \mid Y_t]$$

     M-Step Update the parameter estimate to
     $$\theta_n^{k+1} = \arg\max_\theta q_{n,\theta_n^k}(\theta)$$


  1    It is an ascent algorithm (shown using Jensen's inequality)

            Figure: The EM intermediate quantity is a minorizing surrogate

  2    Because of Fisher's relation, the algorithm can only stop at a stationary
       point of the log-likelihood*

  * See (Wu, 1983) for the necessary topological and regularity assumptions

An Example: Poisson Mixture
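$$f_\theta(Y) = \sum_{j=1}^{m} \alpha_j \frac{\lambda_j^{Y}}{Y!} e^{-\lambda_j}$$

"Complete-Data" Log-Likelihood

$$\log p_\theta(X, Y) = -\log(Y!) + \sum_{j=1}^{m} [\log(\alpha_j) - \lambda_j]\, 1\{X = j\} + \sum_{j=1}^{m} \log(\lambda_j)\, Y\, 1\{X = j\}$$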


EM Algorithm for the Poisson Mixture
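EM E-Step
$$q_{n,\theta_n^k}(\theta) = \sum_{j=1}^{m} [\log(\alpha_j) - \lambda_j] \sum_{t=1}^{n} P_{\theta_n^k}(X_t = j \mid Y_t) + \sum_{j=1}^{m} \log(\lambda_j) \sum_{t=1}^{n} Y_t\, P_{\theta_n^k}(X_t = j \mid Y_t)$$

EM M-Step
$$\alpha_{n,j}^{k+1} = \frac{1}{n} \sum_{t=1}^{n} P_{\theta_n^k}(X_t = j \mid Y_t)$$
$$\lambda_{n,j}^{k+1} = \frac{\sum_{t=1}^{n} Y_t\, P_{\theta_n^k}(X_t = j \mid Y_t)}{\sum_{t=1}^{n} P_{\theta_n^k}(X_t = j \mid Y_t)}$$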
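
The batch E-step and M-step for the Poisson mixture can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the author's code: the function names are hypothetical, posteriors are computed in the log domain for numerical safety, and the log-likelihood helper omits the additive constant that does not depend on the parameters.

```python
import numpy as np

def responsibilities(y, alpha, lam):
    """P(X_t = j | Y_t) for each observation (rows) and component (columns)."""
    # log(alpha_j) + Y_t * log(lambda_j) - lambda_j; log(Y_t!) cancels when normalizing.
    log_w = np.log(alpha) + np.outer(y, np.log(lam)) - lam
    log_w -= log_w.max(axis=1, keepdims=True)  # guard against underflow
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)

def em_step(y, alpha, lam):
    """One batch EM iteration: returns the updated (alpha, lambda)."""
    y = np.asarray(y)
    post = responsibilities(y, alpha, lam)
    mass = post.sum(axis=0)                    # sum_t P(X_t = j | Y_t)
    new_alpha = mass / len(y)
    new_lam = (post * y[:, None]).sum(axis=0) / mass
    return new_alpha, new_lam

def log_likelihood(y, alpha, lam):
    """Observed-data log-likelihood, up to the constant -sum_t log(Y_t!)."""
    log_w = np.log(alpha) + np.outer(y, np.log(lam)) - lam
    top = log_w.max(axis=1)
    return float(np.sum(top + np.log(np.exp(log_w - top[:, None]).sum(axis=1))))
```

Iterating `alpha, lam = em_step(y, alpha, lam)` from an initial guess should never decrease `log_likelihood`, which is a convenient sanity check on an implementation. Each iteration touches all n observations.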


Exponential Family Model

In the following, we assume that the complete-data model belongs to an
exponential family

(Curved) Exponential Family Model

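$$p_\theta(x, y) = \exp\left(\langle s(x, y), \psi(\theta) \rangle - A(\theta)\right)$$

                where $s(x, y)$ is the vector of (complete-data) sufficient
                statistics

Explicit Complete-Data Maximum Likelihood
$$S \mapsto \bar\theta(S) = \arg\max_\theta\, \langle S, \psi(\theta) \rangle - A(\theta)$$

                is available in closed form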


The EM Algorithm Revisited

The k-th EM Iteration (From n Observations)
$$S_n^{k+1} = \frac{1}{n} \sum_{t=1}^{n} E_{\theta_n^k}[s(X_t, Y_t) \mid Y_t]$$
$$\theta_n^{k+1} = \bar\theta(S_n^{k+1})$$

The Limiting EM Recursion

A Key Remark

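The k-th EM Iteration (From n Observations)
$$S_n^{k+1} = \frac{1}{n} \sum_{t=1}^{n} E_{\theta_n^k}[s(X_t, Y_t) \mid Y_t]$$
$$\theta_n^{k+1} = \bar\theta(S_n^{k+1})$$

Can be fully reparameterized in the domain of sufficient statistics
$$S_n^{k+1} = \frac{1}{n} \sum_{t=1}^{n} E_{\bar\theta(S_n^k)}[s(X_t, Y_t) \mid Y_t]$$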


The Limiting EM Recursion

By letting n tend to infinity, one obtains two equivalent updates:

Sufficient Statistics Update
$$S^k = E_\pi\left(E_{\bar\theta(S^{k-1})}[s(X_1, Y_1) \mid Y_1]\right)$$

Parameter Update
$$\theta^k = \bar\theta\left\{E_\pi\left(E_{\theta^{k-1}}[s(X_1, Y_1) \mid Y_1]\right)\right\}$$

Using usual EM arguments, these updates are such that
  1   The Kullback-Leibler divergence $D(\pi \,|\, f_{\theta^k})$ is monotonically decreasing
      with k
  2   They converge to $\{\theta : \nabla_\theta D(\pi \,|\, f_\theta) = 0\}$


Batch EM Is Not Efficient for Large Data Records
see also (Neal & Hinton, 1999)

[Figure: box-and-whisker plots against the number of batch EM iterations (1 to 20); top panel 2 × 10^3 observations, bottom panel 20 × 10^3 observations]

Figure: Convergence of batch EM estimates of $\|u\|^2$ as a function of the number of EM iterations for 2,000 (top) and
20,000 (bottom) observations. The box-and-whisker plots are computed from 1,000 independent replications of the simulated
data. The grey region corresponds to ±2 interquartile range (approx. 99.3% coverage) under the asymptotic Gaussian
approximation of the MLE (from [C, 2010]).

Online EM Algorithm   The Algorithm

The Online EM Algorithm

   The online EM algorithm outputs one updated parameter estimate $\theta_n$
   after processing each individual observation $Y_n$.
   The parameter update is very similar to applying the EM algorithm to
   the single observation $Y_n$ (with smoothing).
   The memory footprint of the algorithm is constant, while its
   computational cost is proportional to the number of processed
   observations.

Online EM: Rationale

   We try to locate the solutions of
   $$E_\pi\left(E_{\bar\theta(S)}[s(X_1, Y_1) \mid Y_1]\right) - S = 0$$

   Viewing $E_{\bar\theta(S)}[s(X_n, Y_n) \mid Y_n]$ as a noisy observation of
   $E_\pi\left(E_{\bar\theta(S)}[s(X_1, Y_1) \mid Y_1]\right)$, this is exactly the usual Stochastic
   Approximation (or Robbins-Monro) setup:
   $$S_n = S_{n-1} + \gamma_n \left(E_{\bar\theta(S_{n-1})}[s(X_n, Y_n) \mid Y_n] - S_{n-1}\right)$$

   where $(\gamma_n)$ is a sequence of decreasing positive stepsizes
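
To make the Robbins-Monro recursion concrete, here is a toy sketch on a deliberately simple problem: locating the root of E[Y] − S = 0 from noisy evaluations Y_n − S. The function name, the `noisy_field` callback, and the default stepsize exponent are assumptions made for illustration; the online EM case replaces the noisy evaluation with a conditional expectation of sufficient statistics.

```python
import numpy as np

def robbins_monro(noisy_field, s0, n_steps, step_exponent=0.6):
    """Iterate S_n = S_{n-1} + gamma_n * H_n(S_{n-1}) with gamma_n = n^(-step_exponent).

    noisy_field(s, n) must return a noisy evaluation of the mean field at s.
    """
    s = float(s0)
    for n in range(1, n_steps + 1):
        gamma = n ** (-step_exponent)  # decreasing positive stepsizes
        s = s + gamma * noisy_field(s, n)
    return s
```

With `y` an array of i.i.d. draws, `robbins_monro(lambda s, n: y[n - 1] - s, 0.0, len(y))` approaches the mean of the draws while storing only the current value of `s`.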


The Algorithm

Online EM Algorithm

Stochastic E-Step
$$S_n = (1 - \gamma_n) S_{n-1} + \gamma_n E_{\theta_{n-1}}[s(X_n, Y_n) \mid Y_n]$$

M-Step
$$\theta_n = \bar\theta(S_n)$$

Practical Recommendations
                $\gamma_n = 1/n^{\alpha}$ with $\alpha \in [0.6, 0.7]$
                Don't do the M-step for the first 10–20 observations
                (optional) Use Polyak-Ruppert averaging (requires
                choosing $n_0$)


Online EM in the Poisson Mixture Example

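SA E-Step

Computing Conditional Expectations
$$p_{n,j} = \frac{\hat\alpha_{n-1,j}\, \hat\lambda_{n-1,j}^{Y_n}\, e^{-\hat\lambda_{n-1,j}}}{\sum_{i=1}^{m} \hat\alpha_{n-1,i}\, \hat\lambda_{n-1,i}^{Y_n}\, e^{-\hat\lambda_{n-1,i}}}$$

Statistics Update (Stochastic Approximation)
$$S^{\alpha}_{n,j} = (1 - \gamma_n) S^{\alpha}_{n-1,j} + \gamma_n\, p_{n,j}$$
$$S^{\lambda}_{n,j} = (1 - \gamma_n) S^{\lambda}_{n-1,j} + \gamma_n\, p_{n,j}\, Y_n$$

M-Step: Parameter Update
$$\hat\alpha_{n,j} = S^{\alpha}_{n,j}, \qquad \hat\lambda_{n,j} = S^{\lambda}_{n,j} / S^{\alpha}_{n,j}$$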
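
The three steps of this slide translate almost line by line into code. The sketch below is a minimal illustration, assuming stepsizes γ_n = n^(-0.6) and a burn-in of 20 observations without M-step, in line with the practical recommendations; the function name, argument names, and the initialization of the statistics from the starting parameters are assumptions.

```python
import numpy as np

def online_em_poisson_mixture(y, alpha0, lam0, step_exponent=0.6, burn_in=20):
    """Single pass of online EM over the counts y for a Poisson mixture."""
    alpha = np.asarray(alpha0, dtype=float).copy()
    lam = np.asarray(lam0, dtype=float).copy()
    s_alpha = alpha.copy()   # S^alpha, initialised so that the M-step returns alpha0
    s_lam = alpha * lam      # S^lambda, initialised so that the M-step returns lam0
    for n, y_n in enumerate(y, start=1):
        gamma = n ** (-step_exponent)
        # Conditional probabilities p_{n,j}, computed in the log domain.
        log_w = np.log(alpha) + y_n * np.log(lam) - lam
        p = np.exp(log_w - log_w.max())
        p /= p.sum()
        # Stochastic approximation update of the sufficient statistics.
        s_alpha = (1.0 - gamma) * s_alpha + gamma * p
        s_lam = (1.0 - gamma) * s_lam + gamma * p * y_n
        # M-step, skipped during the burn-in period.
        if n > burn_in:
            alpha = s_alpha
            lam = s_lam / s_alpha
    return alpha, lam
```

Only the two statistic vectors and the current parameters are stored, so memory use does not grow with the number of observations.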

Online EM Algorithm   Properties and Discussion

(C & Moulines, 2009)

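Under $\sum_n \gamma_n = \infty$, $\sum_n \gamma_n^2 < \infty$, compactness of $\Theta$ and other regularity
assumptions

  1   The estimate $\theta_n$ converges to one of the roots of $\nabla_\theta D(\pi \,|\, f_\theta) = 0$
  2   The algorithm is asymptotically equivalent to
      $$\theta_n = \theta_{n-1} + \gamma_n J^{-1}(\theta_{n-1})\, \nabla_\theta \log f_{\theta_{n-1}}(Y_n)$$
      where $J(\theta) = -E_\pi\left(E_\theta\left[\nabla_\theta^2 \log p_\theta(X_1, Y_1) \mid Y_1\right]\right)$
  3   For a well-specified model ($\pi = f_{\theta_\star}$) and under Polyak-Ruppert
      averaging†, $\tilde\theta_n$ is Fisher efficient:
      $$\sqrt{n}\,(\tilde\theta_n - \theta_\star) \xrightarrow{L} N\left(0, I_f^{-1}(\theta_\star)\right)$$
      where $I_f(\theta_\star) = -E_{\theta_\star}\left[\nabla_\theta^2 \log f_{\theta_\star}(Y_1)\right]$

  † $\tilde\theta_n = \frac{1}{n - n_0} \sum_{t=n_0+1}^{n} \theta_t$, with $\gamma_n = n^{-\alpha}$ and $\alpha \in (1/2, 1)$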
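
Polyak-Ruppert averaging, as defined in the footnote, is simply the mean of the iterates produced after index n_0. A minimal sketch follows, assuming the iterates are stored with θ_t at row t − 1; both function names are hypothetical. The second function gives the same average as a running update, which avoids storing the trajectory.

```python
import numpy as np

def polyak_ruppert_average(theta_path, n0):
    """Average of theta_{n0+1}, ..., theta_n, where theta_path[t - 1] holds theta_t."""
    theta_path = np.asarray(theta_path, dtype=float)
    if not 0 <= n0 < len(theta_path):
        raise ValueError("n0 must satisfy 0 <= n0 < n")
    return theta_path[n0:].mean(axis=0)

def running_average_update(avg, theta_t, t, n0):
    """Streaming form: call for t = n0 + 1, n0 + 2, ... starting from avg = 0."""
    return avg + (theta_t - avg) / (t - n0)
```

The "halfway averaging" setting mentioned in the figures of the talk would correspond to taking n_0 equal to half the number of observations.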

Some More Details
 1   (Andrieu et al., 2005) but also (Delyon, 1994), (Benaïm, 1999) using
     the fact that $D(\pi \,|\, f_{\bar\theta(S)})$ is a Lyapunov function:
     $$\left\langle \nabla_S D(\pi \,|\, f_{\bar\theta(S)}),\ \underbrace{E_\pi\left(E_{\bar\theta(S)}[s(X_1, Y_1) \mid Y_1]\right) - S}_{\text{mean field}} \right\rangle \le 0$$

 2   Taylor series expansion of $\bar\theta$ to establish the equivalence (with
     remainder a.s. $o(\gamma_n)$)

 3   (Pelletier, 1998) to show that
     $$\gamma_n^{-1/2} (\theta_n - \theta_\star) \xrightarrow{L} N\left(0, I_p^{-1}(\theta_\star)/2\right)$$
     in well-specified models (where $I_p$ is the complete-data Fisher
     information matrix)

     General results of (Polyak and Juditsky, 1992), (Mokkadem and
     Pelletier, 2006) on averaging

Illustration of Polyak-Ruppert Averaging

[Figure: three panels of estimate trajectories against the number of observations (0 to 2000): α = 0.9, α = 0.6, and α = 0.6 with halfway averaging]

Figure: Four superimposed trajectories of the estimate of $u_1$ (first component of $u$) for various algorithm settings
(α = 0.9, α = 0.6 and α = 0.6 with Polyak-Ruppert averaging, from top to bottom). The actual value of $u_1$ is equal to

Performance of Online EM
[Figure: three panels of box-and-whisker plots against the number of observations (0.2 × 10^3, 2 × 10^3, 20 × 10^3): α = 0.9, α = 0.6, and α = 0.6 with halfway averaging]

Figure: Online EM estimates of $\|u\|^2$ for various data sizes (200, 2,000 and 20,000 observations, from left to right) and
algorithm settings (α = 0.9, α = 0.6 and α = 0.6 with Polyak-Ruppert averaging, from top to bottom). The
box-and-whisker plots (outlier plotting suppressed) are computed from 1,000 independent replications of the simulated data.
The grey regions correspond to ±2 interquartile range (approx. 99.3% coverage) under the asymptotic Gaussian
approximation of the MLE.


Related Works

(Titterington, 1984) Proposes a gradient algorithm
                $$\theta_n = \theta_{n-1} + \gamma_n I_p^{-1}(\theta_{n-1})\, \nabla_\theta \log f_{\theta_{n-1}}(Y_n)$$

                It is asymptotically equivalent to the algorithm (previously
                described) for well-specified models ($\pi = f_{\theta_\star}$)

(Neal & Hinton, 1999) Describe an algorithm called Incremental EM that
            is equivalent (up to the first batch scan only) to Online EM used
            with $\gamma_n = 1/n$

(Sato, 2000; Sato & Ishii, 2000) Describe the algorithm and provide some
             analysis in the flat model case and for mixtures of Gaussians


How Does This Work in Practice?

          Fine, but don't use $\gamma_n = 1/n$‡

 Simulations in (C & Moulines, 2009) on mixtures of Gaussian regressions

Large-Scale Experiments on Real Data in (Liang & Klein, 2009), where
             the use of mini-batch blocking was found useful:
                          Apply the proposed algorithm considering
                          $Y_{mk+1}, Y_{mk+2}, \dots, Y_{m(k+1)}$ as one observation
                 Mini-batch blocking is useful in dealing with mixture-like
                 models with infrequent components

 ‡ $\gamma_n = \gamma_0/(n_0 + n)$ can be an option
   but requires carefully setting $\gamma_0$ and $n_0$
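
Mini-batch blocking can be sketched for the Poisson mixture example by treating each block of m consecutive counts as a single observation. The version below makes one assumption worth flagging: the block statistic is the average of the per-observation conditional expectations, which keeps the M-step formulas unchanged. The function name, block size, and burn-in (counted in blocks) are illustrative choices, not taken from (Liang & Klein, 2009).

```python
import numpy as np

def blocked_online_em(y, alpha0, lam0, block_size=10, step_exponent=0.6, burn_in=2):
    """Online EM for a Poisson mixture, one stochastic approximation step per block."""
    y = np.asarray(y)
    alpha = np.asarray(alpha0, dtype=float).copy()
    lam = np.asarray(lam0, dtype=float).copy()
    s_alpha, s_lam = alpha.copy(), alpha * lam
    n_blocks = len(y) // block_size  # a trailing partial block is ignored
    for k in range(n_blocks):
        block = y[k * block_size:(k + 1) * block_size]
        gamma = (k + 1) ** (-step_exponent)
        # Posterior probabilities for every observation of the block.
        log_w = np.log(alpha) + np.outer(block, np.log(lam)) - lam
        log_w -= log_w.max(axis=1, keepdims=True)
        p = np.exp(log_w)
        p /= p.sum(axis=1, keepdims=True)
        # Block-averaged statistics stand in for the single-observation ones.
        s_alpha = (1.0 - gamma) * s_alpha + gamma * p.mean(axis=0)
        s_lam = (1.0 - gamma) * s_lam + gamma * (p * block[:, None]).mean(axis=0)
        if k + 1 > burn_in:
            alpha, lam = s_alpha, s_lam / s_alpha
    return alpha, lam
```

Averaging within a block means that a rare component is more likely to receive some posterior mass in every update, which is one way to read the remark about infrequent components.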

Some Intuition About the Weights

If $r_k = (1 - \gamma_k) r_{k-1} + \gamma_k E_k$, for $k \ge 1$

  1   $r_n = \sum_{k=1}^{n} \omega_k^n E_k + \omega_0^n r_0$ with $\sum_{k=0}^{n} \omega_k^n = 1$

  2   $\omega_k^n = \frac{1}{n+a}$ (for $k \ge 1$) when $\gamma_k = 1/(k + a)$ and is strictly
      increasing otherwise

  3   $\sum_{k=1}^{n} (\omega_k^n)^2 \sim \frac{1}{2} n^{-\alpha}$ when $\gamma_k = k^{-\alpha}$,
      with $1/2 < \alpha < 1$

[Figure: plots of the weights for indices 0 to 10000, with panels labelled α = 0.9 and α = 0.6]
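
Unrolling the recursion gives the weights explicitly as ω_k^n = γ_k ∏_{j=k+1}^{n} (1 − γ_j) for k ≥ 1 and ω_0^n = ∏_{j=1}^{n} (1 − γ_j). The sketch below computes them for an arbitrary stepsize sequence so that the three properties can be checked numerically; the function name is hypothetical.

```python
import numpy as np

def sa_weights(gammas):
    """Weights (omega_0^n, ..., omega_n^n) for stepsizes gamma_1, ..., gamma_n."""
    gammas = np.asarray(gammas, dtype=float)
    n = len(gammas)
    # tail[k] = prod_{j=k+1}^{n} (1 - gamma_j) for k = 0, ..., n (empty product = 1).
    tail = np.ones(n + 1)
    tail[:-1] = np.cumprod((1.0 - gammas)[::-1])[::-1]
    weights = tail.copy()
    weights[1:] *= gammas  # omega_k^n = gamma_k * tail[k] for k >= 1
    return weights
```

With `gammas = 1.0 / (np.arange(1, n + 1) + a)` every entry of `sa_weights(gammas)[1:]` equals 1/(n + a), while a choice such as `np.arange(1, n + 1) ** -0.6` puts most of the weight on recent terms.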

Use for Batch ML Estimation

How to Use Online EM for Batch ML Estimation?

The most popular use of the method is to perform batch ML estimation
from very large datasets

Because we did not assume that $\pi = f_\theta$, the previous analysis can be
applied with $\pi \equiv$ the empirical measure associated with $Y_1, \dots, Y_n$
    Online EM can be used for batch ML estimation by (randomly)
    scanning the data $Y_1, \dots, Y_n$
    Convergence "speed" (with averaging) is $(n_{\text{obs.}} \times n_{\text{scans}})^{-1/2}$ versus
    $\rho^{n_{\text{scans}}}$ for batch EM
    Not a fair comparison in terms of computing time, as the M-Step is
    not free and possible parallelization is ignored


Comparison With Batch and Incremental EM

[Figure: three panels of box-and-whisker plots against the number of batch tours (1 to 5): Batch EM, Incremental EM, Online EM]

Figure: Normalized log-likelihood of the estimates obtained with, from top to bottom, batch EM, incremental EM and
online EM as a function of the number of batch tours (or iterations, for batch EM). The data is of length N = 100 and the
box-and-whisker plots summarize the results of 500 independent runs of the algorithms started from randomized starting points $\theta_0$.


Comparison With Batch and Incremental EM (Contd.)

[Figure: three panels of box-and-whisker plots against the number of batch tours (1 to 5): Batch EM, Incremental EM, Online EM]

Figure: Same display for a data record of length N = 1,000.



Summary

 The Good              Easy (esp. when an EM implementation is available)
                       Can be used for ML estimation from a batch of
                       observations
                       Robust with respect to stepsize selection (note that the
                       scale is fixed due to the use of convex combinations)
                       Handles parameter constraints nicely (only requires that
                       S be closed under convex combinations with expected
                       sufficient statistics)


Summary (Contd.)

  The Bad               Requires the E-step to be explicit
                        Requires $\bar\theta$ to be explicit
                        Not appropriate for short data records (say, fewer than
                        1,000 observations) without cycling
                        What about non-independent observations?


Online EM in Latent Factor Models (Ongoing Work)
Many models of the form
$$C_n \mid H_n \sim g_{\sum_{k=1}^{K} \theta_k H_{n,k}}$$

where {gλ }λ∈Λ is an exponential family of distributions and Hn is a latent
random vector of positive weights (probabilistic matrix factorization,
discrete component analysis, partial membership models, simplicial

                Figure:   Bayesian network representations of Latent Dirichlet Allocation (LDA)


Simulated Online EM Algorithm for LDA

For n = 1, . . .
Simulated E-step
                           Simulate $H_n$ given $C_n$ and $\theta_{n-1}$
                           (in practice, using a short run of
                           Metropolis-Hastings or collapsed
                           Gibbs sampling)
                           Use the Rao-Blackwellized update
                           $$S_n = (1 - \gamma_n) S_{n-1} + \gamma_n E_{\theta_{n-1}}[s(Z_n, W_n) \mid W_n, H_n]$$

      M-step $\theta_n = \bar\theta(S_n)$


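Ignoring the sampling bias, this recursion can be analyzed and has the
same asymptotic properties as the online EM algorithm

In particular, for well-specified models,
$$\gamma_n^{-1/2} (\theta_n - \theta_\star) \xrightarrow{L} N\left(0, I_f^{-1}(\theta_\star)\right)$$
instead of
$$\gamma_n^{-1/2} (\theta_n - \theta_\star) \xrightarrow{L} N\left(0, I_p^{-1}(\theta_\star)\right)$$
for the "exact" online EM algorithm ($I_p(\theta_\star) = -E_{\theta_\star}\left[\nabla_\theta^2 \log p_{\theta_\star}(X_1, Y_1)\right]$).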


References

Cappé, O. & Moulines, E. (2009). On-line expectation-maximization algorithm for
latent data models. J. Roy. Statist. Soc. B, 71(3):593-613.
Cappé, O. (2011). Online Expectation-Maximisation. To appear in Mengersen, K.,
Titterington, M., & Robert, C. P., eds., Mixtures, Wiley.
Liang, P. & Klein, D. (2009). Online EM for Unsupervised Models. In Proc.
NAACL Conference.
Neal, R. M. & Hinton, G. E. (1999). A view of the EM algorithm that justifies
incremental, sparse, and other variants. In Jordan, M. I., ed., Learning in graphical
models, pages 355-368. MIT Press, Cambridge, MA, USA.
Rohde, D. & Cappé, O. (2011). Online maximum-likelihood estimation for latent
factor models. Submitted.
Sato, M. (2000). Convergence of on-line EM algorithm. In Proc. International
Conference on Neural Information Processing, 1:476-481.
Sato, M. & Ishii, S. (2000). On-line EM algorithm for the normalized Gaussian
network. Neural Computation, 12:407-432.
Titterington, D. M. (1984). Recursive parameter estimation using incomplete
data. J. Roy. Statist. Soc. B, 46(2):257-267.

Runtime Analysis of Population-based Evolutionary AlgorithmsPer Kristian Lehre
Quantum modes - Ion Cotaescu
Quantum modes - Ion CotaescuQuantum modes - Ion Cotaescu
Quantum modes - Ion CotaescuSEENET-MTP
Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Pierre Jacob
7. toda yamamoto-granger causality
7. toda yamamoto-granger causality7. toda yamamoto-granger causality
7. toda yamamoto-granger causalityQuang Hoang
Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Pierre Jacob

Similaire à Olivier Cappé's talk at BigMC March 2011 (20)

Ml mle_bayes
Ml  mle_bayesMl  mle_bayes
Ml mle_bayes
Deep Learning Opening Workshop - Statistical and Computational Guarantees of ...
Deep Learning Opening Workshop - Statistical and Computational Guarantees of ...Deep Learning Opening Workshop - Statistical and Computational Guarantees of ...
Deep Learning Opening Workshop - Statistical and Computational Guarantees of ...
Handling missing data with expectation maximization algorithm
Handling missing data with expectation maximization algorithmHandling missing data with expectation maximization algorithm
Handling missing data with expectation maximization algorithm
5. cem granger causality ecm
5. cem granger causality  ecm 5. cem granger causality  ecm
5. cem granger causality ecm
Tutorial on EM algorithm – Part 2
Tutorial on EM algorithm – Part 2Tutorial on EM algorithm – Part 2
Tutorial on EM algorithm – Part 2
Seminar Talk: Multilevel Hybrid Split Step Implicit Tau-Leap for Stochastic R...
Seminar Talk: Multilevel Hybrid Split Step Implicit Tau-Leap for Stochastic R...Seminar Talk: Multilevel Hybrid Split Step Implicit Tau-Leap for Stochastic R...
Seminar Talk: Multilevel Hybrid Split Step Implicit Tau-Leap for Stochastic R...
Finite mixture model with EM algorithm
Finite mixture model with EM algorithmFinite mixture model with EM algorithm
Finite mixture model with EM algorithm
Recent developments on unbiased MCMC
Recent developments on unbiased MCMCRecent developments on unbiased MCMC
Recent developments on unbiased MCMC
Parameter Uncertainty and Learning in Dynamic Financial Decisions
Parameter Uncertainty and Learning in Dynamic Financial DecisionsParameter Uncertainty and Learning in Dynamic Financial Decisions
Parameter Uncertainty and Learning in Dynamic Financial Decisions
Final Present Pap1on relibility
Final Present Pap1on relibilityFinal Present Pap1on relibility
Final Present Pap1on relibility
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
Case Study (All)
Case Study (All)Case Study (All)
Case Study (All)
Runtime Analysis of Population-based Evolutionary Algorithms
Runtime Analysis of Population-based Evolutionary AlgorithmsRuntime Analysis of Population-based Evolutionary Algorithms
Runtime Analysis of Population-based Evolutionary Algorithms
Runtime Analysis of Population-based Evolutionary Algorithms
Runtime Analysis of Population-based Evolutionary AlgorithmsRuntime Analysis of Population-based Evolutionary Algorithms
Runtime Analysis of Population-based Evolutionary Algorithms
Quantum modes - Ion Cotaescu
Quantum modes - Ion CotaescuQuantum modes - Ion Cotaescu
Quantum modes - Ion Cotaescu
Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...
7. toda yamamoto-granger causality
7. toda yamamoto-granger causality7. toda yamamoto-granger causality
7. toda yamamoto-granger causality
Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...

Plus de BigMC

Anisotropic Metropolis Adjusted Langevin Algorithm: convergence and utility i...
Anisotropic Metropolis Adjusted Langevin Algorithm: convergence and utility i...Anisotropic Metropolis Adjusted Langevin Algorithm: convergence and utility i...
Anisotropic Metropolis Adjusted Langevin Algorithm: convergence and utility i...BigMC
Dealing with intractability: Recent Bayesian Monte Carlo methods for dealing ...
Dealing with intractability: Recent Bayesian Monte Carlo methods for dealing ...Dealing with intractability: Recent Bayesian Monte Carlo methods for dealing ...
Dealing with intractability: Recent Bayesian Monte Carlo methods for dealing ...BigMC
Stability of adaptive random-walk Metropolis algorithms
Stability of adaptive random-walk Metropolis algorithmsStability of adaptive random-walk Metropolis algorithms
Stability of adaptive random-walk Metropolis algorithmsBigMC
"Monte-Carlo Tree Search for the game of Go"
"Monte-Carlo Tree Search for the game of Go""Monte-Carlo Tree Search for the game of Go"
"Monte-Carlo Tree Search for the game of Go"BigMC
Hedibert Lopes' talk at BigMC
Hedibert Lopes' talk at  BigMCHedibert Lopes' talk at  BigMC
Hedibert Lopes' talk at BigMCBigMC
Andreas Eberle
Andreas EberleAndreas Eberle
Andreas EberleBigMC
Olivier Féron's talk at BigMC March 2011
Olivier Féron's talk at BigMC March 2011Olivier Féron's talk at BigMC March 2011
Olivier Féron's talk at BigMC March 2011BigMC
Estimation de copules, une approche bayésienne
Estimation de copules, une approche bayésienneEstimation de copules, une approche bayésienne
Estimation de copules, une approche bayésienneBigMC
Comparing estimation algorithms for block clustering models
Comparing estimation algorithms for block clustering modelsComparing estimation algorithms for block clustering models
Comparing estimation algorithms for block clustering modelsBigMC
Learning spline-based curve models (Laure Amate)
Learning spline-based curve models (Laure Amate)Learning spline-based curve models (Laure Amate)
Learning spline-based curve models (Laure Amate)BigMC
Omiros' talk on the Bernoulli factory problem
Omiros' talk on the  Bernoulli factory problemOmiros' talk on the  Bernoulli factory problem
Omiros' talk on the Bernoulli factory problemBigMC

Plus de BigMC (11)

Anisotropic Metropolis Adjusted Langevin Algorithm: convergence and utility i...
Anisotropic Metropolis Adjusted Langevin Algorithm: convergence and utility i...Anisotropic Metropolis Adjusted Langevin Algorithm: convergence and utility i...
Anisotropic Metropolis Adjusted Langevin Algorithm: convergence and utility i...
Dealing with intractability: Recent Bayesian Monte Carlo methods for dealing ...
Dealing with intractability: Recent Bayesian Monte Carlo methods for dealing ...Dealing with intractability: Recent Bayesian Monte Carlo methods for dealing ...
Dealing with intractability: Recent Bayesian Monte Carlo methods for dealing ...
Stability of adaptive random-walk Metropolis algorithms
Stability of adaptive random-walk Metropolis algorithmsStability of adaptive random-walk Metropolis algorithms
Stability of adaptive random-walk Metropolis algorithms
"Monte-Carlo Tree Search for the game of Go"
"Monte-Carlo Tree Search for the game of Go""Monte-Carlo Tree Search for the game of Go"
"Monte-Carlo Tree Search for the game of Go"
Hedibert Lopes' talk at BigMC
Hedibert Lopes' talk at  BigMCHedibert Lopes' talk at  BigMC
Hedibert Lopes' talk at BigMC
Andreas Eberle
Andreas EberleAndreas Eberle
Andreas Eberle
Olivier Féron's talk at BigMC March 2011
Olivier Féron's talk at BigMC March 2011Olivier Féron's talk at BigMC March 2011
Olivier Féron's talk at BigMC March 2011
Estimation de copules, une approche bayésienne
Estimation de copules, une approche bayésienneEstimation de copules, une approche bayésienne
Estimation de copules, une approche bayésienne
Comparing estimation algorithms for block clustering models
Comparing estimation algorithms for block clustering modelsComparing estimation algorithms for block clustering models
Comparing estimation algorithms for block clustering models
Learning spline-based curve models (Laure Amate)
Learning spline-based curve models (Laure Amate)Learning spline-based curve models (Laure Amate)
Learning spline-based curve models (Laure Amate)
Omiros' talk on the Bernoulli factory problem
Omiros' talk on the  Bernoulli factory problemOmiros' talk on the  Bernoulli factory problem
Omiros' talk on the Bernoulli factory problem

Olivier Cappé's talk at BigMC March 2011

  • 1. Online EM Algorithm and Some Extensions
    Olivier Cappé, Télécom ParisTech & CNRS. BigMC, March 2011.
  • 2. Online Estimation for Missing Data Models
    Based on (C & Moulines, 2009) and (C, 2010)
    Goals
    1 Maximum likelihood estimation, or
    1’ Competitive with maximum likelihood estimation when #obs. is large
    2 Good scaling (performance vs. computational cost) as #obs. increases
    (3) Process data on-the-fly (no storage)
    4 Simple to implement (no line-search, projection, preconditioning, etc.)
  • 3. Outline
    1 The EM Algorithm in Exponential Families
    2 The Limiting EM Recursion
    3 Online EM Algorithm (The Algorithm; Properties and Discussion)
    4 Use for Batch ML Estimation
    5 Extensions
    6 References
  • 4. The EM Algorithm in Exponential Families: Missing Data Model
    A missing data model is a statistical model $\{p_\theta(x, y)\}_{\theta \in \Theta}$ in which only $Y$ may be observed (the couple $(X, Y)$ is referred to as the complete data).
    - Hence, parameter estimates $\hat\theta_n$ must be functions of the observations $Y_1, \dots, Y_n$ only (here assumed to be independent and identically distributed).
    - Of course, the statistical model could also be defined as $\{f_\theta(y)\}_{\theta \in \Theta}$, where $f_\theta(y) = \int p_\theta(x, y)\,dx$, but the specific structure of $f_\theta$ needs to be exploited.
    To analyze the methods, the data $\{Y_t\}_{t \ge 1}$ is assumed to be generated by an i.i.d. process with marginal $\pi$, not necessarily equal to $f_\theta$.
  • 5. The EM Algorithm in Exponential Families: Finite Mixture Model
    Mixture PDF
    $f(y) = \sum_{i=1}^{m} \alpha_i f_i(y)$
    Missing Data Interpretation
    $P(X_t = i) = \alpha_i$
    $Y_t \mid X_t = i \sim f_i(y)$
  • 6. The EM Algorithm in Exponential Families
    To determine the maximum likelihood estimate
    $\hat\theta_n = \arg\max_\theta \sum_{t=1}^{n} \log f_\theta(Y_t)$
    numerically, the standard approach is the following.
    Expectation-Maximization (Dempster, Laird & Rubin, 1977)
    Given a current parameter guess $\theta_n^k$:
    E-Step: Compute
    $q_{n,\theta_n^k}(\theta) = \frac{1}{n} \sum_{t=1}^{n} E_{\theta_n^k}[\log p_\theta(X_t, Y_t) \mid Y_t]$
    M-Step: Update the parameter estimate to
    $\theta_n^{k+1} = \arg\max_{\theta \in \Theta} q_{n,\theta_n^k}(\theta)$
  • 7. The EM Algorithm in Exponential Families: Rationale
    1 It is an ascent algorithm (shown using Jensen's inequality)
    [Figure: The EM intermediate quantity is a minorizing surrogate]
    2 Because of Fisher's relation, the algorithm can only stop at a stationary point of the log-likelihood∗
    ∗ See (Wu, 1983) for necessary topological and regularity assumptions
  • 8. The EM Algorithm in Exponential Families: An Example: Poisson Mixture
    Likelihood
    $f_\theta(Y) = \sum_{j=1}^{m} \alpha_j \frac{\lambda_j^{Y}}{Y!} e^{-\lambda_j}$
    “Complete-Data” Log-Likelihood
    $\log p_\theta(X, Y) = -\log(Y!) + \sum_{j=1}^{m} [\log(\alpha_j) - \lambda_j]\, 1\{X = j\} + \sum_{j=1}^{m} \log(\lambda_j)\, Y\, 1\{X = j\}$
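  • 9. The EM Algorithm in Exponential Families: EM Algorithm for the Poisson Mixture
    EM E-Step (up to the term $-\frac{1}{n}\sum_{t=1}^{n} \log(Y_t!)$, which does not depend on $\theta$)
    $q_{n,\theta_n^k}(\theta) = \sum_{j=1}^{m} [\log(\alpha_j) - \lambda_j]\, \frac{1}{n} \sum_{t=1}^{n} P_{\theta_n^k}(X_t = j \mid Y_t) + \sum_{j=1}^{m} \log(\lambda_j)\, \frac{1}{n} \sum_{t=1}^{n} Y_t\, P_{\theta_n^k}(X_t = j \mid Y_t)$
    EM M-Step
    $\alpha_{n,j}^{k+1} = \frac{1}{n} \sum_{t=1}^{n} P_{\theta_n^k}(X_t = j \mid Y_t)$
    $\lambda_{n,j}^{k+1} = \frac{\sum_{t=1}^{n} Y_t\, P_{\theta_n^k}(X_t = j \mid Y_t)}{\sum_{t=1}^{n} P_{\theta_n^k}(X_t = j \mid Y_t)}$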
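The E- and M-step formulas for the Poisson mixture can be sketched as a single function. This is only an illustrative sketch, not code from the talk: the names `poisson_pmf` and `em_step` are hypothetical, the Poisson probabilities are evaluated in log-space as an implementation choice, and no guard is included against components whose posterior weight underflows to zero.

```python
import math


def poisson_pmf(y, lam):
    # Evaluate lam**y * exp(-lam) / y! in log-space to avoid overflow.
    return math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))


def em_step(ys, alphas, lams):
    """One batch EM iteration for an m-component Poisson mixture."""
    n, m = len(ys), len(alphas)
    s_alpha = [0.0] * m  # (1/n) * sum_t P(X_t = j | Y_t)
    s_lam = [0.0] * m    # (1/n) * sum_t Y_t * P(X_t = j | Y_t)
    for y in ys:
        # E-step: posterior probability of each component given Y_t = y.
        w = [a * poisson_pmf(y, lam) for a, lam in zip(alphas, lams)]
        total = sum(w)
        for j in range(m):
            p = w[j] / total
            s_alpha[j] += p / n
            s_lam[j] += y * p / n
    # M-step: closed-form update of the weights and the intensities.
    new_lams = [s_lam[j] / s_alpha[j] for j in range(m)]
    return s_alpha, new_lams
```

Calling `em_step` repeatedly on the same data gives the batch EM iterates $\theta_n^k$; each call scans all $n$ observations, which is the cost the online variant is designed to avoid.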
  • 10. The EM Algorithm in Exponential Families: Exponential Family Model
    In the following, we assume that the complete-data model belongs to an exponential family.
    (Curved) Exponential Family Model
    $p_\theta(x, y) = \exp(\langle s(x, y), \psi(\theta) \rangle - A(\theta))$
    where $s(x, y)$ is the vector of (complete-data) sufficient statistics
    Explicit Complete-Data Maximum Likelihood
    $S \mapsto \bar\theta(S) = \arg\max_\theta\, \langle S, \psi(\theta) \rangle - A(\theta)$
    is available in closed form
  • 11. The EM Algorithm in Exponential Families: The EM Algorithm Revisited
    The k-th EM Iteration (From n Observations)
    E-Step: $S_n^{k+1} = \frac{1}{n} \sum_{t=1}^{n} E_{\theta_n^k}[s(X_t, Y_t) \mid Y_t]$
    M-Step: $\theta_n^{k+1} = \bar\theta(S_n^{k+1})$
  • 12. The Limiting EM Recursion: A Key Remark
    The k-th EM Iteration (From n Observations)
    E-Step: $S_n^{k+1} = \frac{1}{n} \sum_{t=1}^{n} E_{\theta_n^k}[s(X_t, Y_t) \mid Y_t]$
    M-Step: $\theta_n^{k+1} = \bar\theta(S_n^{k+1})$
    Can be fully reparameterized in the domain of sufficient statistics:
    $S_n^{k+1} = \frac{1}{n} \sum_{t=1}^{n} E_{\bar\theta(S_n^k)}[s(X_t, Y_t) \mid Y_t]$
  • 13. The Limiting EM Recursion: The Limiting EM Recursion
    By letting $n$ tend to infinity, one obtains two equivalent updates:
    Sufficient Statistics Update
    $S^k = E_\pi(E_{\bar\theta(S^{k-1})}[s(X_1, Y_1) \mid Y_1])$
    Parameter Update
    $\theta^k = \bar\theta\{E_\pi(E_{\theta^{k-1}}[s(X_1, Y_1) \mid Y_1])\}$
    Using usual EM arguments, these updates are such that
    1 The Kullback-Leibler divergence $D(\pi | f_{\theta^k})$ is monotonically decreasing with $k$
    2 They converge to $\{\theta : \nabla_\theta D(\pi | f_\theta) = 0\}$
  • 14. The Limiting EM Recursion: Batch EM Is Not Efficient for Large Data Records
    See also (Neal & Hinton, 1999).
    [Figure: two panels, $2 \cdot 10^3$ observations (top) and $20 \cdot 10^3$ observations (bottom); y-axis $\|u\|^2$, x-axis batch EM iterations (1 to 20).]
    Figure: Convergence of batch EM estimates of $\|u\|^2$ as a function of the number of EM iterations for 2,000 (top) and 20,000 (bottom) observations. The box-and-whisker plots are computed from 1,000 independent replications of the simulated data. The grey region corresponds to ±2 interquartile range (approx. 99.3% coverage) under the asymptotic Gaussian approximation of the MLE (from [C, 2010]).
  • 15. Online EM Algorithm: The Algorithm: The Online EM Algorithm
    - The online EM algorithm outputs one updated parameter estimate $\theta_n$ after processing each individual observation $Y_n$
    - The parameter update is very similar to applying the EM algorithm to the single observation $Y_n$ (with smoothing)
    - The memory footprint of the algorithm is constant while its computational cost is proportional to the number of processed observations
  • 16. Online EM Algorithm: The Algorithm: Online EM: Rationale
    We try to locate the solutions of
    $E_\pi(E_{\bar\theta(S)}[s(X_1, Y_1) \mid Y_1]) - S = 0$
    Viewing $E_{\bar\theta(S)}[s(X_n, Y_n) \mid Y_n]$ as a noisy observation of $E_\pi(E_{\bar\theta(S)}[s(X_1, Y_1) \mid Y_1])$, this is exactly the usual Stochastic Approximation (or Robbins-Monro) setup:
    $S_n = S_{n-1} + \gamma_n \left( E_{\bar\theta(S_{n-1})}[s(X_n, Y_n) \mid Y_n] - S_{n-1} \right)$
    where $(\gamma_n)$ is a sequence of decreasing positive stepsizes
  • 17. Online EM Algorithm: The Algorithm: The Algorithm
    Online EM Algorithm
    Stochastic E-Step: $S_n = (1 - \gamma_n) S_{n-1} + \gamma_n E_{\theta_{n-1}}[s(X_n, Y_n) \mid Y_n]$
    M-Step: $\theta_n = \bar\theta(S_n)$
    Practical Recommendations
    - $\gamma_n = 1/n^{\alpha}$ with $\alpha \in [0.6, 0.7]$
    - Don’t do the M-step for the first 10–20 observations
    - (optional) Use Polyak-Ruppert averaging (requires choosing $n_0$)
  • 18. Online EM Algorithm: The Algorithm: Online EM in the Poisson Mixture Example
    SA E-Step
    Computing Conditional Expectations
    $p_{n,j} = \frac{\alpha_{n-1,j}\, \lambda_{n-1,j}^{Y_n}\, e^{-\lambda_{n-1,j}}}{\sum_{i=1}^{m} \alpha_{n-1,i}\, \lambda_{n-1,i}^{Y_n}\, e^{-\lambda_{n-1,i}}}$
    Statistics Update (Stochastic Approximation)
    $S^{\alpha}_{n,j} = (1 - \gamma_n) S^{\alpha}_{n-1,j} + \gamma_n\, p_{n,j}$
    $S^{\lambda}_{n,j} = (1 - \gamma_n) S^{\lambda}_{n-1,j} + \gamma_n\, p_{n,j}\, Y_n$
    M-Step: Parameter Update
    $\hat\alpha_{n,j} = S^{\alpha}_{n,j}$, $\hat\lambda_{n,j} = S^{\lambda}_{n,j} / S^{\alpha}_{n,j}$
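The three updates above translate almost line by line into code. The following is a minimal sketch under stated assumptions, not the author's implementation: the function name `online_em_poisson`, the parameter names `alpha_exp` and `burn_in`, and the choice to initialise the statistics from the starting parameters are all hypothetical. It uses $\gamma_n = n^{-\alpha}$ and skips the M-step during the first observations, as the practical recommendations suggest; the $Y_n!$ factor cancels in $p_{n,j}$ and is therefore omitted.

```python
import math


def online_em_poisson(ys, alphas, lams, alpha_exp=0.6, burn_in=20):
    """Online EM for a Poisson mixture, one pass over the stream ys."""
    m = len(alphas)
    # Assumed initialisation: statistics matching the starting parameters.
    s_alpha = list(alphas)
    s_lam = [a * lam for a, lam in zip(alphas, lams)]
    for n, y in enumerate(ys, start=1):
        gamma = n ** (-alpha_exp)  # stepsize gamma_n = n^(-alpha)
        # Conditional probabilities p_{n,j}, computed from log-weights.
        logw = [math.log(a) + y * math.log(lam) - lam
                for a, lam in zip(alphas, lams)]
        top = max(logw)
        w = [math.exp(v - top) for v in logw]
        total = sum(w)
        for j in range(m):
            p = w[j] / total
            # Stochastic approximation update of the sufficient statistics.
            s_alpha[j] = (1 - gamma) * s_alpha[j] + gamma * p
            s_lam[j] = (1 - gamma) * s_lam[j] + gamma * p * y
        if n > burn_in:
            # M-step: parameters are explicit functions of the statistics.
            alphas = list(s_alpha)
            lams = [s_lam[j] / s_alpha[j] for j in range(m)]
    return alphas, lams
```

Only the $2m$ statistics and the current parameters are stored, so memory use does not grow with the number of observations. The sketch has no safeguard for a component whose statistics collapse to zero.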
  • 19. Online EM Algorithm: Properties and Discussion: Analysis (C & Moulines, 2009)
    Under $\sum_n \gamma_n = \infty$, $\sum_n \gamma_n^2 < \infty$, compactness of $\Theta$ and other regularity assumptions:
    1 The estimate $\theta_n$ converges to one of the roots of $\nabla_\theta D(\pi | f_\theta) = 0$
    2 The algorithm is asymptotically equivalent to
    $\theta_n = \theta_{n-1} + \gamma_n J^{-1}(\theta_{n-1}) \nabla_\theta \log f_{\theta_{n-1}}(Y_n)$
    where $J(\theta) = -E_\pi(E_\theta[\nabla^2_\theta \log p_\theta(X_1, Y_1) \mid Y_1])$
    3 For a well-specified model ($\pi = f_{\theta_\star}$) and under Polyak-Ruppert averaging†, $\tilde\theta_n$ is Fisher efficient:
    $\sqrt{n}(\tilde\theta_n - \theta_\star) \xrightarrow{L} N(0, I_f^{-1}(\theta_\star))$
    where $I_f(\theta_\star) = -E_{\theta_\star}[\nabla^2_\theta \log f_{\theta_\star}(Y_1)]$
    † $\tilde\theta_n = \frac{1}{n - n_0} \sum_{t=n_0+1}^{n} \theta_t$, with $\gamma_n = n^{-\alpha}$ and $\alpha \in (1/2, 1)$
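The averaged estimate $\tilde\theta_n$ can be maintained without storing the past iterates. Below is a small sketch of such a running average; the class name `RunningAverage` and its interface are hypothetical, and parameters are assumed to be passed as flat lists of floats.

```python
class RunningAverage:
    """Polyak-Ruppert average of theta_t for t > n0, updated incrementally."""

    def __init__(self, n0):
        self.n0 = n0       # number of initial iterates to discard
        self.seen = 0      # number of iterates received so far
        self.value = None  # current average (None until t > n0)

    def update(self, theta):
        self.seen += 1
        if self.seen <= self.n0:
            return self.value
        count = self.seen - self.n0
        if self.value is None:
            self.value = [0.0] * len(theta)
        # Incremental mean: avg <- avg + (theta - avg) / count.
        self.value = [v + (x - v) / count for v, x in zip(self.value, theta)]
        return self.value
```

Each call to `update` costs time proportional to the dimension of $\theta$, so averaging preserves the constant memory footprint of the online algorithm.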
  • 20. Online EM Algorithm: Properties and Discussion: Some More Details
    1 (Andrieu et al., 2005) but also (Delyon, 1994), (Benaïm, 1999) using the fact that $D(\pi | f_{\bar\theta(S)})$ is a Lyapunov function:
    $\langle \nabla_S D(\pi | f_{\bar\theta(S)}),\ E_\pi(E_{\bar\theta(S)}[s(X_1, Y_1) \mid Y_1]) - S \rangle \le 0$
    where the second argument, $E_\pi(E_{\bar\theta(S)}[s(X_1, Y_1) \mid Y_1]) - S$, is the mean field
    2 Taylor series expansion of $\bar\theta$ to establish the equivalence (with remainder a.s. $o(\gamma_n)$)
    3 (Pelletier, 1998) to show that
    $\gamma_n^{-1/2}(\theta_n - \theta_\star) \xrightarrow{L} N(0, I_p^{-1}(\theta_\star)/2)$
    in well-specified models (where $I_p$ is the complete-data Fisher information matrix)
    General results of (Polyak and Juditsky, 1992), (Mokkadem and Pelletier, 2006) on averaging
  • 21. Online EM Algorithm: Properties and Discussion: Illustration of Polyak-Ruppert Averaging
    [Figure: three panels, α = 0.9, α = 0.6, and α = 0.6 with halfway averaging; y-axis $u_1$, x-axis number of observations (0 to 2000).]
    Figure: Four superimposed trajectories of the estimate of $u_1$ (first component of $u$) for various algorithm settings (α = 0.9, α = 0.6 and α = 0.6 with Polyak-Ruppert averaging, from top to bottom). The actual value of $u_1$ is equal to zero.
  • 22. Online EM Algorithm: Properties and Discussion: Performance of Online EM
    [Figure: three panels, α = 0.9, α = 0.6, and α = 0.6 with halfway averaging; y-axis $\|u\|^2$, x-axis number of observations ($0.2 \cdot 10^3$, $2 \cdot 10^3$, $20 \cdot 10^3$).]
    Figure: Online EM estimates of $\|u\|^2$ for various data sizes (200, 2,000 and 20,000 observations, from left to right) and algorithm settings (α = 0.9, α = 0.6 and α = 0.6 with Polyak-Ruppert averaging, from top to bottom). The box-and-whisker plots (outlier plotting suppressed) are computed from 1,000 independent replications of the simulated data. The grey regions correspond to ±2 interquartile range (approx. 99.3% coverage) under the asymptotic Gaussian approximation of the MLE.
  • 23. Online EM Algorithm: Properties and Discussion: Related Works
    (Titterington, 1984) Proposes a gradient algorithm
    $\theta_n = \theta_{n-1} + \gamma_n I_p^{-1}(\theta_{n-1}) \nabla_\theta \log f_{\theta_{n-1}}(Y_n)$
    It is asymptotically equivalent to the algorithm (previously described) for well-specified models ($\pi = f_{\theta_\star}$)
    (Neal & Hinton, 1999) Describe an algorithm called Incremental EM that is equivalent (up to the first batch scan only) to Online EM used with $\gamma_n = 1/n$
    (Sato, 2000; Sato & Ishii, 2000) Describe the algorithm and provide some analysis in the flat model case and for mixtures of Gaussians
  • 24. Online EM Algorithm: Properties and Discussion: How Does This Work in Practice?
    Fine. But don’t use $\gamma_n = 1/n$‡
    - Simulations in (C & Moulines, 2009) on mixtures of Gaussian regressions
    - Large-scale experiments on real data in (Liang & Klein, 2009), where the use of mini-batch blocking was found useful: apply the proposed algorithm considering $Y_{mk+1}, Y_{mk+2}, \dots, Y_{m(k+1)}$ as one observation
    - Mini-batch blocking is useful in dealing with mixture-like models with infrequent components
    ‡ $\gamma_n = \gamma_0 / (n_0 + n)$ can be an option but requires carefully setting $\gamma_0$ and $n_0$
  • 25. Online EM Algorithm: Properties and Discussion: Some Intuition About the Weights
    If $r_k = (1 - \gamma_k) r_{k-1} + \gamma_k E_k$, for $k \ge 1$:
    1 $r_n = \sum_{k=1}^{n} \omega_k^n E_k + \omega_0^n r_0$ with $\sum_{k=0}^{n} \omega_k^n = 1$
    2 $\omega_k^n = \frac{1}{n+a}$ (for $k \ge 1$) when $\gamma_k = 1/(k + a)$ and is strictly increasing otherwise
    3 $\sum_{k=1}^{n} (\omega_k^n)^2 \equiv \frac{1}{2} n^{-\alpha}$ when $\gamma_k = k^{-\alpha}$, with $1/2 < \alpha < 1$
    [Figure: three panels labelled α = 1, α = 0.9 and α = 0.6; x-axis from 0 to 10000.]
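Unrolling the recursion gives $\omega_k^n = \gamma_k \prod_{j=k+1}^{n} (1 - \gamma_j)$ for $k \ge 1$ and $\omega_0^n = \prod_{j=1}^{n} (1 - \gamma_j)$. The sketch below computes these weights for an arbitrary stepsize sequence so that the claims on this slide can be checked numerically; the function name `sa_weights` is hypothetical.

```python
def sa_weights(gammas):
    """Weights omega_0..omega_n of r_0, E_1..E_n in r_n.

    gammas[k - 1] holds the stepsize gamma_k for k = 1..n.
    """
    n = len(gammas)
    omega = [0.0] * (n + 1)
    tail = 1.0  # running product of (1 - gamma_j) for j = k+1..n
    for k in range(n, 0, -1):
        omega[k] = gammas[k - 1] * tail
        tail *= 1.0 - gammas[k - 1]
    omega[0] = tail  # weight left on the initial value r_0
    return omega
```

With $\gamma_k = 1/(k + a)$ every observation receives the same weight $1/(n + a)$, whereas with $\gamma_k = k^{-\alpha}$ and $\alpha < 1$ the most recent observations weigh more, which is why early, poorly informed updates are forgotten faster.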
  • 26. Use for Batch ML Estimation: How to Use Online EM for Batch ML Estimation?
    The most popular use of the method is to perform batch ML estimation from very large datasets.
    - Because we did not assume that $\pi = f_{\theta_\star}$, the previous analysis can be applied to $\pi \equiv$ the empirical measure associated with $Y_1, \dots, Y_n$
    - Online EM can be used for batch ML estimation by (randomly) scanning the data $Y_1, \dots, Y_n$
    - Convergence “speed” (with averaging) is $(n_{\text{obs.}} \times n_{\text{scans}})^{-1/2}$ versus $\rho^{n_{\text{scans}}}$ for batch EM
    - Not a fair comparison in terms of computing time, as the M-step is not free and possible parallelization is ignored
  • 27. Use for Batch ML Estimation: Comparison With Batch and Incremental EM
    [Figure: three panels, Batch EM, Incremental EM and Online EM; y-axis normalized log-likelihood (−1.58 to −1.54), x-axis batch tours (1 to 5).]
    Figure: Normalized log-likelihood of the estimates obtained with, from top to bottom, batch EM, incremental EM and online EM as a function of the number of batch tours (or iterations, for batch EM). The data is of length N = 100 and the box-and-whisker plots summarize the results of 500 independent runs of the algorithms started from randomized starting points $\theta_0$.
  • 28. Use for Batch ML Estimation: Comparison With Batch and Incremental EM (Contd.)
    [Figure: three panels, Batch EM, Incremental EM and Online EM; y-axis normalized log-likelihood (−1.6 to −1.56), x-axis batch tours (1 to 5).]
    Figure: Same display for a data record of length N = 1,000.
  • 29. Extensions: Summary
    The Good
    - Easy (esp. when an EM implementation is available)
    - Can be used for ML estimation from a batch of observations
    - Robust wrt. stepsize selection (note that the scale is fixed due to the use of convex combinations)
    - Handles parameter constraints nicely (only requires that $\mathcal{S}$ be closed under convex combinations with expected sufficient statistics)
  • 30. Extensions: Summary (Contd.)
    The Bad
    - Needs that the E-step be explicit
    - Needs that $\bar\theta$ be explicit
    - Not appropriate for short (say, less than 1000 observations) data records without cycling
    - What about non-independent observations?
  • 31. Extensions: Online EM in Latent Factor Models (Ongoing Work)
    Many models of the form
    $C_n \mid H_n \sim g_{\sum_{k=1}^{K} \theta_k H_{n,k}}$
    where $\{g_\lambda\}_{\lambda \in \Lambda}$ is an exponential family of distributions and $H_n$ is a latent random vector of positive weights (probabilistic matrix factorization, discrete component analysis, partial membership models, simplicial mixtures)
    [Figure: Bayesian network representations of Latent Dirichlet Allocation (LDA)]
  • 32. Extensions: Simulated Online EM Algorithm for LDA
    For $n = 1, \dots$
    Simulated E-step
    - Simulate $\tilde H_n$ given $C_n$ and $\theta_{n-1}$ (in practice, using a short run of Metropolis-Hastings or collapsed Gibbs sampling)
    - Use the Rao-Blackwellized update
    $S_n = (1 - \gamma_n) S_{n-1} + \gamma_n E_{\theta_{n-1}}[s(Z_n, W_n) \mid W_n, \tilde H_n]$
    M-step: $\theta_n = \bar\theta(S_n)$
  • 33. Extensions
    Ignoring the sampling bias, this recursion can be analyzed and has the same asymptotic properties as the online EM algorithm.
    In particular, for well-specified models,
    $\gamma_n^{-1/2}(\theta_n - \theta_\star) \xrightarrow{L} N(0, I_f^{-1}(\theta_\star))$
    instead of
    $\gamma_n^{-1/2}(\theta_n - \theta_\star) \xrightarrow{L} N(0, I_p^{-1}(\theta_\star))$
    for the “exact” online EM algorithm ($I_p(\theta_\star) = -E_{\theta_\star}[\nabla^2_\theta \log p_{\theta_\star}(X_1, Y_1)]$).
  • 34. References
    Cappé, O. & Moulines, E. (2009). On-line expectation-maximization algorithm for latent data models. J. Roy. Statist. Soc. B, 71(3):593–613.
    Cappé, O. (2011). Online Expectation-Maximisation. To appear in Mengersen, K., Titterington, M., & Robert, C. P., eds., Mixtures, Wiley.
    Liang, P. & Klein, D. (2009). Online EM for Unsupervised Models. In Proc. NAACL Conference.
    Neal, R. M. & Hinton, G. E. (1999). A view of the EM algorithm that justifies incremental, sparse, and other variants. In Jordan, M. I., ed., Learning in Graphical Models, pages 355–368. MIT Press, Cambridge, MA, USA.
    Rohde, D. & Cappé, O. (2011). Online maximum-likelihood estimation for latent factor models. Submitted.
    Sato, M. (2000). Convergence of on-line EM algorithm. In Proc. International Conference on Neural Information Processing, 1:476–481.
    Sato, M. & Ishii, S. (2000). On-line EM algorithm for the normalized Gaussian network. Neural Computation, 12:407–432.
    Titterington, D. M. (1984). Recursive parameter estimation using incomplete data. J. Roy. Statist. Soc. B, 46(2):257–267.