An introduction to advanced (?) MCMC methods




        An introduction to advanced (?) MCMC methods

                                          Christian P. Robert

                              Université Paris-Dauphine and CREST-INSEE
                              http://www.ceremade.dauphine.fr/~xian


                        Royal Statistical Society, October 13, 2010
An introduction to advanced (?) MCMC methods
  Motivating example




Motivating example




        1   Motivating example

        2   The Metropolis-Hastings Algorithm
An introduction to advanced (?) MCMC methods
  Motivating example




Latent structures make life harder!



             Even simple models may lead to computational complications,
             as in latent variable models

                                        f (x|θ) = ∫ f ⋆ (x, x⋆ |θ) dx⋆
An introduction to advanced (?) MCMC methods
  Motivating example




Latent structures make life harder!



             Even simple models may lead to computational complications,
             as in latent variable models

                                        f (x|θ) = ∫ f ⋆ (x, x⋆ |θ) dx⋆


             If (x, x⋆ ) observed, fine!
An introduction to advanced (?) MCMC methods
  Motivating example




Latent structures make life harder!



             Even simple models may lead to computational complications,
             as in latent variable models

                                        f (x|θ) = ∫ f ⋆ (x, x⋆ |θ) dx⋆


             If (x, x⋆ ) observed, fine!
             If only x observed, trouble!
An introduction to advanced (?) MCMC methods
  Motivating example




      Example (Mixture models)
      Models of mixtures of distributions:

                                    X ∼ fj with probability pj ,

      for j = 1, 2, . . . , k, with overall density

                                 X ∼ p1 f1 (x) + · · · + pk fk (x) .
An introduction to advanced (?) MCMC methods
  Motivating example




      Example (Mixture models)
      Models of mixtures of distributions:

                                     X ∼ fj with probability pj ,

      for j = 1, 2, . . . , k, with overall density

                                 X ∼ p1 f1 (x) + · · · + pk fk (x) .

      For a sample of independent random variables (X1 , · · · , Xn ),
      sample density
                             ∏_{i=1}^{n} {p1 f1 (xi ) + · · · + pk fk (xi )} .
An introduction to advanced (?) MCMC methods
  Motivating example




      Example (Mixture models)
      Models of mixtures of distributions:

                                     X ∼ fj with probability pj ,

      for j = 1, 2, . . . , k, with overall density

                                 X ∼ p1 f1 (x) + · · · + pk fk (x) .

      For a sample of independent random variables (X1 , · · · , Xn ),
      sample density
                             ∏_{i=1}^{n} {p1 f1 (xi ) + · · · + pk fk (xi )} .

      Expanding this product involves k^n elementary terms: prohibitive
      to compute in large samples.
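
      In contrast, the mixture likelihood itself can be evaluated directly as a
      product of sums, in O(nk) operations. A minimal illustrative sketch in
      Python (not part of the original slides; the sample x is a stand-in):

          import numpy as np
          from scipy.stats import norm

          def mixture_loglik(x, p, mu, sigma):
              # log prod_i { sum_j p_j f_j(x_i) }: O(n k) work, no k^n expansion needed
              dens = np.array([p_j * norm.pdf(x, m_j, s_j)
                               for p_j, m_j, s_j in zip(p, mu, sigma)])
              return np.sum(np.log(dens.sum(axis=0)))

          x = np.random.default_rng(0).normal(size=100)   # stand-in sample
          print(mixture_loglik(x, [0.3, 0.7], [0.0, 2.5], [1.0, 1.0]))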
An introduction to advanced (?) MCMC methods
  Motivating example




      [Figure: log-likelihood surface of the mixture 0.3N (µ1 , 1) + 0.7N (µ2 , 1),
      plotted over the (µ1 , µ2 ) plane]
An introduction to advanced (?) MCMC methods
  Motivating example




A typology of Bayes computational problems
         (i)   use of a complex parameter space, as for instance in
               constrained parameter sets like those resulting from imposing
               stationarity constraints in dynamic models;
An introduction to advanced (?) MCMC methods
  Motivating example




A typology of Bayes computational problems
         (i)    use of a complex parameter space, as for instance in
                constrained parameter sets like those resulting from imposing
                stationarity constraints in dynamic models;
         (ii)   use of a complex sampling model with an intractable
                likelihood, as for instance in missing data and graphical
                models;
An introduction to advanced (?) MCMC methods
  Motivating example




A typology of Bayes computational problems
         (i)    use of a complex parameter space, as for instance in
                constrained parameter sets like those resulting from imposing
                stationarity constraints in dynamic models;
         (ii)   use of a complex sampling model with an intractable
                likelihood, as for instance in missing data and graphical
                models;
        (iii)   use of a huge dataset;
An introduction to advanced (?) MCMC methods
  Motivating example




A typology of Bayes computational problems
         (i)    use of a complex parameter space, as for instance in
                constrained parameter sets like those resulting from imposing
                stationarity constraints in dynamic models;
         (ii)   use of a complex sampling model with an intractable
                likelihood, as for instance in missing data and graphical
                models;
        (iii)   use of a huge dataset;
        (iv)    use of a complex prior distribution (which may be the
                posterior distribution associated with an earlier sample);
An introduction to advanced (?) MCMC methods
  Motivating example




A typology of Bayes computational problems
         (i)    use of a complex parameter space, as for instance in
                constrained parameter sets like those resulting from imposing
                stationarity constraints in dynamic models;
         (ii)   use of a complex sampling model with an intractable
                likelihood, as for instance in missing data and graphical
                models;
        (iii)   use of a huge dataset;
        (iv)    use of a complex prior distribution (which may be the
                posterior distribution associated with an earlier sample);
         (v)    use of a complex inferential procedure, as for instance Bayes
                factors

                        B^π_01 (x) = { P (θ ∈ Θ0 | x)/P (θ ∈ Θ1 | x) } / { π(θ ∈ Θ0 )/π(θ ∈ Θ1 ) } .
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm




The Metropolis-Hastings Algorithm



      1   Motivating example

      2   The Metropolis-Hastings Algorithm
            Monte Carlo Methods based on Markov Chains
            The Metropolis–Hastings algorithm
            A collection of Metropolis-Hastings algorithms
            Extensions
            Convergence assessment
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Monte Carlo Methods based on Markov Chains


Running Monte Carlo via Markov Chains



      Fact: It is not necessary to use a sample from the distribution f to
      approximate the integral

                                         I = ∫ h(x)f (x)dx ,
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Monte Carlo Methods based on Markov Chains


Running Monte Carlo via Markov Chains



      Fact: It is not necessary to use a sample from the distribution f to
      approximate the integral

                                         I = ∫ h(x)f (x)dx ,


      We can obtain X1 , . . . , Xn ∼ f (approx) without directly
      simulating from f , using an ergodic Markov chain with
      stationary distribution f
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Monte Carlo Methods based on Markov Chains


Running Monte Carlo via Markov Chains (2)


      Idea
      For an arbitrary starting value x(0) , an ergodic chain (X (t) ) is
      generated using a transition kernel with stationary distribution f
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Monte Carlo Methods based on Markov Chains


Running Monte Carlo via Markov Chains (2)


      Idea
      For an arbitrary starting value x(0) , an ergodic chain (X (t) ) is
      generated using a transition kernel with stationary distribution f


             Ensures the convergence in distribution of (X (t) ) to a random
             variable from f .
             For a “large enough” T0 , X (T0 ) can be considered as
             distributed from f
             Produces a dependent sample X (T0 ) , X (T0 +1) , . . ., which is
             generated from f , sufficient for most approximation purposes.
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     The Metropolis–Hastings algorithm


The Metropolis–Hastings algorithm

      Problem:
      How can one build a Markov chain with a given stationary
      distribution?
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     The Metropolis–Hastings algorithm


The Metropolis–Hastings algorithm

      Problem:
      How can one build a Markov chain with a given stationary
      distribution?

      MH basics
      Algorithm that converges to the objective (target) density

                                                 f

      using an arbitrary transition kernel density

                                               q(x, y)

      called instrumental (or proposal) distribution
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     The Metropolis–Hastings algorithm


The MH algorithm

      Algorithm (Metropolis–Hastings)
      Given x(t) ,
         1    Generate Yt ∼ q(x(t) , y).
         2    Take

                            X (t+1) = Yt      with prob. ρ(x(t) , Yt ),
                                      x(t)    with prob. 1 − ρ(x(t) , Yt ),

              where
                            ρ(x, y) = min{ f (y) q(y, x) / [f (x) q(x, y)] , 1 } .
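
      A minimal sketch of one such transition in Python (not from the slides);
      the unnormalised target f, the proposal sampler q_sample and its density
      q_dens are assumed supplied by the user:

          def mh_step(x, f, q_sample, q_dens, rng):
              # One Metropolis-Hastings transition with proposal kernel q(x, .)
              y = q_sample(x, rng)                                  # Y_t ~ q(x, .)
              ratio = (f(y) * q_dens(y, x)) / (f(x) * q_dens(x, y))
              return y if rng.uniform() < min(1.0, ratio) else x    # accept w.p. rho(x, y)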
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     The Metropolis–Hastings algorithm


Features

              Independent of normalizing constants for both f and q(x, ·)
              (i.e., those constants independent of x)
              Never move to values with f (y) = 0
              The chain (x(t) )t may take the same value several times in a
              row, even though f is a density wrt Lebesgue measure
              The sequence (yt )t is usually not a Markov chain
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     The Metropolis–Hastings algorithm


Features

              Independent of normalizing constants for both f and q(x, ·)
              (i.e., those constants independent of x)
              Never move to values with f (y) = 0
              The chain (x(t) )t may take the same value several times in a
              row, even though f is a density wrt Lebesgue measure
              The sequence (yt )t is usually not a Markov chain
   Satisfies the detailed balance condition

             f (x)K(x, y) = f (y)K(y, x)

      [Diagram: transition probabilities P(θ → θ′ ) and P(θ′ → θ) between θ and θ′ ]

                                                                 [Green, 1995]
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     The Metropolis–Hastings algorithm


Convergence properties


         1    The M-H Markov chain is reversible, with invariant/stationary
              density f .
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     The Metropolis–Hastings algorithm


Convergence properties


         1    The M-H Markov chain is reversible, with invariant/stationary
              density f .
         2    As f is a probability measure, the chain is positive recurrent
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     The Metropolis–Hastings algorithm


Convergence properties


         1    The M-H Markov chain is reversible, with invariant/stationary
              density f .
         2    As f is a probability measure, the chain is positive recurrent
         3    If
                       Pr[ f (Yt ) q(Yt , X (t) ) / {f (X (t) ) q(X (t) , Yt )} ≥ 1 ] < 1 .   (1)

              i.e., if the event {X (t+1) = X (t) } occurs with positive
              probability, then the chain is aperiodic
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     The Metropolis–Hastings algorithm


Convergence properties (2)
         4    If
                                         q(x, y) > 0 for every (x, y),   (2)
              the chain is irreducible
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     The Metropolis–Hastings algorithm


Convergence properties (2)
         4    If
                                         q(x, y) > 0 for every (x, y),   (2)
              the chain is irreducible
         5    For M-H, f -irreducibility implies Harris recurrence
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     The Metropolis–Hastings algorithm


Convergence properties (2)
         4    If
                                         q(x, y) > 0 for every (x, y),                        (2)
              the chain is irreducible
         5    For M-H, f -irreducibility implies Harris recurrence
         6    Thus, under conditions (1) and (2)
                    (i) For h with Ef |h(X)| < ∞,

                          lim_{T→∞} (1/T) Σ_{t=1}^{T} h(X (t) ) = ∫ h(x) f (x) dx      a.e. f.

                   (ii) and

                          lim_{n→∞} ‖ ∫ K^n (x, ·) µ(dx) − f ‖TV = 0

                          for every initial distribution µ, where K^n (x, ·) denotes the
                          kernel for n transitions.
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms


The Independent Case

      The instrumental distribution q(x, ·) is independent of x and is
      denoted g
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms


The Independent Case

      The instrumental distribution q(x, ·) is independent of x and is
      denoted g

      Algorithm (Independent Metropolis-Hastings)
      Given x(t) ,
          1   Generate Yt ∼ g(y)
           2   Take

                   X (t+1) = Yt      with prob. min{ f (Yt ) g(x(t) ) / [f (x(t) ) g(Yt )] , 1 },
                             x(t)    otherwise.
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms


Properties
      The resulting sample is not iid
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms


Properties
      The resulting sample is not iid but there exist strong convergence
      properties:
      Theorem (Ergodicity)
      The algorithm produces a uniformly ergodic chain if there exists a
      constant M such that

                                      f (x) ≤ M g(x) ,         x ∈ supp f.

       In this case,

                        ‖K^n (x, ·) − f ‖TV ≤ (1 − 1/M )^n .

                                                               [Mengersen & Tweedie, 1996]
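
      A minimal sketch of one independent Metropolis–Hastings transition (not
      from the slides); the target f, a sampler g_sample from the fixed proposal
      g and its density g_dens are assumed supplied by the user:

          def indep_mh_step(x, f, g_sample, g_dens, rng):
              # Independent Metropolis-Hastings: the proposal g does not depend on x
              y = g_sample(rng)
              ratio = (f(y) * g_dens(x)) / (f(x) * g_dens(y))
              return y if rng.uniform() < min(1.0, ratio) else x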
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms




      Example (Noisy AR(1))
      Hidden Markov chain from a regular AR(1) model,

                                 xt+1 = ϕ xt + εt+1 ,        εt ∼ N (0, τ²)

       and observables
                                 yt | xt ∼ N (xt², σ²)
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms




      Example (Noisy AR(1))
      Hidden Markov chain from a regular AR(1) model,

                                 xt+1 = ϕ xt + εt+1 ,        εt ∼ N (0, τ²)

       and observables
                                 yt | xt ∼ N (xt², σ²)

       The distribution of xt given xt−1 , xt+1 and yt is proportional to

            exp{ −(1/2τ²) [ (xt − ϕ xt−1 )² + (xt+1 − ϕ xt )² + (τ²/σ²)(yt − xt²)² ] } .
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms




      Example (Noisy AR(1) too)
       Use for proposal the N (µt , ωt²) distribution, with

                       µt = ϕ (xt−1 + xt+1 )/(1 + ϕ²)    and    ωt² = τ²/(1 + ϕ²) .
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms




      Example (Noisy AR(1) too)
       Use for proposal the N (µt , ωt²) distribution, with

                       µt = ϕ (xt−1 + xt+1 )/(1 + ϕ²)    and    ωt² = τ²/(1 + ϕ²) .

       Ratio
                            π(x)/qind (x) = exp{ −(yt − xt²)² / 2σ² }

       is bounded
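
      A minimal sketch of this independence sampler for a single xt (parameter
      values below are illustrative, not from the slides), using the fact that
      the target-to-proposal ratio reduces to exp{−(yt − xt²)²/2σ²}:

          import numpy as np

          def sample_xt(x_prev, x_next, y_t, phi, tau, sigma, n_iter, rng):
              # Independence MH for x_t | x_{t-1}, x_{t+1}, y_t in the noisy AR(1) model
              mu = phi * (x_prev + x_next) / (1 + phi**2)
              omega = tau / np.sqrt(1 + phi**2)
              log_r = lambda x: -(y_t - x**2)**2 / (2 * sigma**2)  # log pi(x) - log g(x)
              x = mu
              for _ in range(n_iter):
                  y = rng.normal(mu, omega)
                  if np.log(rng.uniform()) < log_r(y) - log_r(x):
                      x = y
              return x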
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms




       (top) Last 500 realisations of the chain {Xk }k out of 10,000
       iterations; (bottom) histogram of the chain, compared with
       the target distribution.
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms


Random walk Metropolis–Hastings


      Instead, use a local perturbation as proposal

                                                  Yt = X (t) + εt ,

      where εt ∼ g, independent of X (t) .
      The instrumental density is now of the form g(y − x) and the
      Markov chain is a random walk if g is symmetric

                                                      g(x) = g(−x)
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms




      Algorithm (Random walk Metropolis)
      Given x(t)
          1   Generate Yt ∼ g(y − x(t) )
           2   Take

                   X (t+1) = Yt      with prob. min{ 1, f (Yt )/f (x(t) ) },
                             x(t)    otherwise.
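
      A minimal sketch of this random walk sampler (not from the slides),
      assuming a Gaussian perturbation and a user-supplied log target:

          import numpy as np

          def rw_mh(log_f, x0, scale, n_iter, rng):
              # Random walk Metropolis: Y_t = X_t + eps_t, eps_t ~ N(0, scale^2 I)
              x, chain = np.atleast_1d(np.asarray(x0, dtype=float)), []
              for _ in range(n_iter):
                  y = x + scale * rng.standard_normal(x.shape)
                  if np.log(rng.uniform()) < log_f(y) - log_f(x):  # accept w.p. min(1, f(y)/f(x))
                      x = y
                  chain.append(x.copy())
              return np.array(chain)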
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms


Probit illustration


      Likelihood and posterior given by
             π(β|y, X) ∝ ℓ(β|y, X) ∝ ∏_{i=1}^{n} Φ(xiT β)^yi (1 − Φ(xiT β))^(ni −yi) .

      under the flat prior
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms


Probit illustration


      Likelihood and posterior given by
             π(β|y, X) ∝ ℓ(β|y, X) ∝ ∏_{i=1}^{n} Φ(xiT β)^yi (1 − Φ(xiT β))^(ni −yi) .

      under the flat prior
       A random walk proposal works well for a small number of
       predictors. Use the maximum likelihood estimate β̂ as starting
       value and the asymptotic (Fisher) covariance matrix of the MLE, Σ̂,
       as scale
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms


MCMC algorithm


       Probit random-walk Metropolis-Hastings
               Initialization: Set β (0) = β̂ and compute Σ̂
               Iteration t:
                   1   Generate β̃ ∼ Nk+1 (β (t−1) , τ Σ̂)
                   2   Compute

                           ρ(β (t−1) , β̃) = min{ 1, π(β̃|y) / π(β (t−1) |y) }

                   3   With probability ρ(β (t−1) , β̃) set β (t) = β̃;
                       otherwise set β (t) = β (t−1) .
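
      A hedged sketch of this sampler (not from the slides), assuming Bernoulli
      responses (ni = 1) and hypothetical data arrays y and X; the MLE and an
      approximate covariance come from a BFGS fit rather than an analytic
      Fisher matrix:

          import numpy as np
          from scipy.stats import norm
          from scipy.optimize import minimize

          def probit_logpost(beta, y, X):
              # log posterior under a flat prior (Bernoulli responses assumed)
              eta = X @ beta
              return np.sum(y * norm.logcdf(eta) + (1 - y) * norm.logcdf(-eta))

          def probit_rwmh(y, X, tau=1.0, n_iter=10_000, seed=0):
              rng = np.random.default_rng(seed)
              # BFGS fit: MLE start, approximate inverse Hessian as covariance scale
              fit = minimize(lambda b: -probit_logpost(b, y, X), np.zeros(X.shape[1]))
              beta, cov = fit.x, tau * fit.hess_inv
              chain = np.empty((n_iter, X.shape[1]))
              for t in range(n_iter):
                  prop = rng.multivariate_normal(beta, cov)
                  if np.log(rng.uniform()) < probit_logpost(prop, y, X) - probit_logpost(beta, y, X):
                      beta = prop
                  chain[t] = beta
              return chain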
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms


R bank benchmark
        Probit modelling with no intercept over the four measurements.
        Three different scales τ = 1, 0.1, 10: best mixing behavior is
        associated with τ = 1.
        Average of the parameters over 9,000 MCMC iterations gives the
        plug-in estimate

            p̂i = Φ (−1.2193 xi1 + 0.9540 xi2 + 0.9795 xi3 + 1.1481 xi4 ) .

        [Figure: traces, histograms and autocorrelations of the four
        components of β for the three scales]
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms




      Example (Mixture models)
                     π(θ|x) ∝ [ ∏_{j=1}^{n} Σ_{ℓ=1}^{k} pℓ f (xj |µℓ , σℓ ) ] π(θ)
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms




      Example (Mixture models)
                     π(θ|x) ∝ [ ∏_{j=1}^{n} Σ_{ℓ=1}^{k} pℓ f (xj |µℓ , σℓ ) ] π(θ)

       Metropolis-Hastings proposal:

                     θ(t+1) = θ(t) + ωε(t)    if u(t) < ρ(t)
                              θ(t)            otherwise

       where
                     ρ(t) = π(θ(t) + ωε(t) |x) / π(θ(t) |x) ∧ 1

       and ω scaled for good acceptance rate
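
      A sketch matching the setting of the figures that follow: a random walk
      of scale ω on the means (µ1 , µ2 ) of the two-component mixture. The
      simulated data and seed are illustrative stand-ins, not the original
      example:

          import numpy as np
          from scipy.stats import norm

          rng = np.random.default_rng(2)
          x = np.concatenate([rng.normal(0.0, 1.0, 70), rng.normal(2.5, 1.0, 30)])

          def log_post(mu):
              # log pi(mu1, mu2 | x) for .7 N(mu1, 1) + .3 N(mu2, 1), flat prior on the means
              return np.sum(np.log(0.7 * norm.pdf(x, mu[0], 1) + 0.3 * norm.pdf(x, mu[1], 1)))

          mu, omega, chain = np.array([1.0, 1.0]), 1.0, []
          for t in range(10_000):
              prop = mu + omega * rng.standard_normal(2)
              if np.log(rng.uniform()) < log_post(prop) - log_post(mu):  # min(1, ratio)
                  mu = prop
              chain.append(mu.copy())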
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms


                                 Random walk MCMC output for
                                     .7N (µ1 , 1) + .3N (µ2 , 1)
                                           and scale 1
       [Figure: chain in the (µ1 , µ2 ) plane at iteration 1]
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms


                                 Random walk MCMC output for
                                     .7N (µ1 , 1) + .3N (µ2 , 1)
                                           and scale 1
       [Figure: chain in the (µ1 , µ2 ) plane at iteration 10]
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms



                                 Random walk MCMC output for
                                     .7N (µ1 , 1) + .3N (µ2 , 1)
                                           and scale 1
       [Figure: chain in the (µ1 , µ2 ) plane at iteration 100]
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms



                                 Random walk MCMC output for
                                     .7N (µ1 , 1) + .3N (µ2 , 1)
                                           and scale 1
       [Figure: chain in the (µ1 , µ2 ) plane at iteration 500]
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms



                                 Random walk MCMC output for
                                     .7N (µ1 , 1) + .3N (µ2 , 1)
                                           and scale 1
       [Figure: chain in the (µ1 , µ2 ) plane at iteration 1000]
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms


                                 Random walk MCMC output for
                                     .7N (µ1 , 1) + .3N (µ2 , 1)
                                           and scale √.1
       [Figure: chain in the (µ1 , µ2 ) plane at iteration 10]
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms


                                 Random walk MCMC output for
                                     .7N (µ1 , 1) + .3N (µ2 , 1)
                                           and scale √.1
       [Figure: chain in the (µ1 , µ2 ) plane at iteration 100]
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms


                                 Random walk MCMC output for
                                     .7N (µ1 , 1) + .3N (µ2 , 1)
                                           and scale √.1
       [Figure: chain in the (µ1 , µ2 ) plane at iteration 500]
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms



                                 Random walk MCMC output for
                                     .7N (µ1 , 1) + .3N (µ2 , 1)
                                           and scale √.1
       [Figure: chain in the (µ1 , µ2 ) plane at iteration 1000]
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms



                                 Random walk MCMC output for
                                     .7N (µ1 , 1) + .3N (µ2 , 1)
                                           and scale √.1
       [Figure: chain in the (µ1 , µ2 ) plane at iteration 10,000]
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms



                                 Random walk MCMC output for
                                     .7N (µ1 , 1) + .3N (µ2 , 1)
                                           and scale √.1
       [Figure: chain in the (µ1 , µ2 ) plane at iteration 5000]
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms


Convergence properties



      Uniform ergodicity prohibited by random walk structure
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms


Convergence properties



      Uniform ergodicity prohibited by random walk structure
      At best, geometric ergodicity:

      Theorem (Sufficient ergodicity)
      For a symmetric density f , log-concave in the tails, and a positive
      and symmetric density g, the chain (X (t) ) is geometrically ergodic.
                                           [Mengersen & Tweedie, 1996]
                                                                  no tail effect
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     A collection of Metropolis-Hastings algorithms




   Example (Comparison of tail effects)
   Random-walk Metropolis–Hastings algorithms based on a N (0, 1)
   instrumental for the generation of (a) a N (0, 1) distribution and
   (b) a distribution with density ψ(x) ∝ (1 + |x|)^−3
   [Figure: 90% confidence envelopes of the means, derived from 500
   parallel independent chains]
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Extensions


Extensions




      There are many other families of MH algorithms
              Adaptive Rejection Metropolis Sampling
              Reversible Jump
              Langevin algorithms
      to name just a few...
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Extensions


Langevin Algorithms


      The proposal is based on the Langevin diffusion Lt , defined by the
      stochastic differential equation

                            dLt = dBt + (1/2) ∇ log f (Lt ) dt ,

       where Bt is the standard Brownian motion
      Theorem
      The Langevin diffusion is the only non-explosive diffusion which is
      reversible with respect to f .
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Extensions


Discretization

      Because continuous time cannot be simulated, consider the
      discretised sequence

               x(t+1) = x(t) + (σ²/2) ∇ log f (x(t) ) + σ εt ,          εt ∼ Np (0, Ip )

        where σ² corresponds to the discretisation step
        Example of f (x) = exp(−x⁴) with σ² = .1
        [Figure: fitted density of the discretised chain against the target]
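
      A minimal sketch of this discretised (unadjusted) Langevin update (not
      from the slides), with the gradient of log f supplied by the user; for
      f (x) = exp(−x⁴) one has ∇ log f (x) = −4x³:

          import numpy as np

          def ula_step(x, grad_log_f, sigma2, rng):
              # Euler discretisation of the Langevin diffusion (no MH correction)
              return x + 0.5 * sigma2 * grad_log_f(x) + np.sqrt(sigma2) * rng.standard_normal(np.shape(x))

          rng = np.random.default_rng(3)
          x = np.zeros(1)
          for _ in range(5_000):
              x = ula_step(x, lambda z: -4 * z**3, sigma2=0.1, rng=rng)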
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Extensions


Discretization

      Because continuous time cannot be simulated, consider the
      discretised sequence

               x(t+1) = x(t) + (σ²/2) ∇ log f (x(t) ) + σ εt ,          εt ∼ Np (0, Ip )

        where σ² corresponds to the discretisation step
        Example of f (x) = exp(−x⁴) with σ² = .01
        [Figure: fitted density of the discretised chain against the target]
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Extensions


Discretization

      Because continuous time cannot be simulated, consider the
      discretised sequence

               x(t+1) = x(t) + (σ²/2) ∇ log f (x(t) ) + σ εt ,          εt ∼ Np (0, Ip )

        where σ² corresponds to the discretisation step
        Example of f (x) = exp(−x⁴) with σ² = .001
        [Figure: fitted density of the discretised chain against the target]
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Extensions


Discretization

      Because continuous time cannot be simulated, consider the
      discretised sequence

               x(t+1) = x(t) + (σ²/2) ∇ log f (x(t) ) + σ εt ,          εt ∼ Np (0, Ip )

        where σ² corresponds to the discretisation step
        Example of f (x) = exp(−x⁴) with σ² = .0001
        [Figure: fitted density of the discretised chain against the target]
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Extensions


Discretization

      Because continuous time cannot be simulated, consider the
      discretised sequence

               x(t+1) = x(t) + (σ²/2) ∇ log f (x(t) ) + σ εt ,          εt ∼ Np (0, Ip )

        where σ² corresponds to the discretisation step
        Example of f (x) = exp(−x⁴) with σ² = .0001∗
        [Figure: fitted density of the discretised chain against the target]
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Extensions


Discretization




      Unfortunately, the discretized chain may be transient, for instance
      when
                           lim_{x→±∞} σ² |∇ log f (x)| |x|^−1 > 1

       Example of f (x) = exp(−x⁴) when σ² = .2
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Extensions


MH correction

      Accept the new value Yt with probability
             f (Yt )      exp{ −‖Yt − x(t) − (σ²/2) ∇ log f (x(t) )‖² / 2σ² }
            ---------  ·  -----------------------------------------------------  ∧ 1 .
             f (x(t) )    exp{ −‖x(t) − Yt − (σ²/2) ∇ log f (Yt )‖² / 2σ² }


      Choice of the scaling factor σ
      Should lead to an acceptance rate of 0.574 to achieve optimal
      convergence rates (when the components of x are uncorrelated)
                                            [Roberts & Rosenthal, 1998]
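
      A sketch of the resulting Metropolis-adjusted Langevin step (not from the
      slides), with user-supplied log f and its gradient; the Gaussian
      normalising constants cancel in the ratio:

          import numpy as np

          def mala_step(x, log_f, grad_log_f, sigma2, rng):
              # Langevin proposal plus Metropolis-Hastings correction (MALA)
              mean_x = x + 0.5 * sigma2 * grad_log_f(x)
              y = mean_x + np.sqrt(sigma2) * rng.standard_normal(np.shape(x))
              mean_y = y + 0.5 * sigma2 * grad_log_f(y)
              log_q_xy = -np.sum((y - mean_x)**2) / (2 * sigma2)   # log q(x -> y), up to a constant
              log_q_yx = -np.sum((x - mean_y)**2) / (2 * sigma2)   # log q(y -> x)
              log_alpha = log_f(y) - log_f(x) + log_q_yx - log_q_xy
              return y if np.log(rng.uniform()) < log_alpha else x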
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Extensions


Optimizing the Acceptance Rate



      Problem of choice of the transition kernel from a practical point of
      view
      Most common alternatives:
          1   a fully automated algorithm like ARMS;
          2   an instrumental density g which approximates f , such that
              f /g is bounded for uniform ergodicity to apply;
          3   a random walk
       In both cases (2) and (3), the choice of g is critical.
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Extensions


Case of the random walk


      Different approach to acceptance rates
       A high acceptance rate does not indicate that the algorithm is exploring
       f well: it rather indicates that the random walk is moving too slowly on
       the surface of f .
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Extensions


Case of the random walk


      Different approach to acceptance rates
       A high acceptance rate does not indicate that the algorithm is exploring
       f well: it rather indicates that the random walk is moving too slowly on
       the surface of f .
       If x(t) and yt are close, i.e. f (x(t) ) ≃ f (yt ), then yt is accepted with
       probability
                               min{ f (yt )/f (x(t) ) , 1 } ≃ 1 .
       For multimodal densities with well separated modes, the negative
       effect of these limited moves on the surface of f clearly shows.
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Extensions


Case of the random walk (2)




      If the average acceptance rate is low, the successive values of f (yt )
      tend to be small compared with f (x(t) ), which means that the
      random walk moves quickly on the surface of f since it often
      reaches the “borders” of the support of f
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Extensions


Rule of thumb




      In small dimensions, aim at an average acceptance rate of 50%. In
      large dimensions, at an average acceptance rate of 25%.
                                        [Gelman, Gilks and Roberts, 1995]
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Extensions


Rule of thumb




      In small dimensions, aim at an average acceptance rate of 50%. In
      large dimensions, at an average acceptance rate of 25%.
                                        [Gelman, Gilks and Roberts, 1995]

      This rule is to be taken with a pinch of salt!
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Extensions




      Example (Noisy AR(1) continued)
       For a Gaussian random walk with a scale ω small enough, the
       random walk never jumps to the other mode. But if the scale ω is
       sufficiently large, the Markov chain explores both modes and gives a
       satisfactory approximation of the target distribution.
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Extensions




                  Markov chain based on a random walk with scale ω = .1
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Extensions




                  Markov chain based on a random walk with scale ω = .5
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Extensions


Where do we stand?
      MCMC in a nutshell:
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Extensions


Where do we stand?
      MCMC in a nutshell:
              Running a sequence Xt+1 = Ψ(Xt , Yt ) provides an approximation
              to the target density f when the detailed balance condition holds

                                       f (x)K(x, y) = f (y)K(y, x)
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Extensions


Where do we stand?
      MCMC in a nutshell:
              Running a sequence Xt+1 = Ψ(Xt , Yt ) provides an approximation
              to the target density f when the detailed balance condition holds

                                       f (x)K(x, y) = f (y)K(y, x)


              Easiest implementation of the principle is random walk
              Metropolis-Hastings

                                               Yt = X (t) + εt
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Extensions


Where do we stand?
      MCMC in a nutshell:
              Running a sequence Xt+1 = Ψ(Xt , Yt ) provides an approximation
              to the target density f when the detailed balance condition holds

                                       f (x)K(x, y) = f (y)K(y, x)


              Easiest implementation of the principle is random walk
              Metropolis-Hastings

                                               Yt = X (t) + εt


              Practical convergence requires sufficient energy in the proposal,
              which is calibrated by trial and error.
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Convergence assessment


Convergence diagnostics

      How many iterations?
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Convergence assessment


Convergence diagnostics

      How many iterations?
              Rule # 1 There is no absolute number of simulations, i.e.
              1,000 is neither large nor small.
             Rule # 2 It takes [much] longer to check for convergence
             than for the chain itself to converge.
             Rule # 3 MCMC is a “what-you-get-is-what-you-see”
             algorithm: it fails to tell about unexplored parts of the space.
             Rule # 4 When in doubt, run MCMC chains in parallel and
             check for consistency.
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Convergence assessment


Convergence diagnostics

      How many iterations?
              Rule # 1 There is no absolute number of simulations, i.e.
              1,000 is neither large nor small.
             Rule # 2 It takes [much] longer to check for convergence
             than for the chain itself to converge.
             Rule # 3 MCMC is a “what-you-get-is-what-you-see”
             algorithm: it fails to tell about unexplored parts of the space.
             Rule # 4 When in doubt, run MCMC chains in parallel and
             check for consistency.

      Many “quick-&-dirty” solutions in the literature, but not
      necessarily 100% trustworthy.
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Convergence assessment




       Example (Bimodal target)
       Density

            f (x) = exp(−x²/2) {4(x − .3)² + .01} / [ √(2π) {4(1 + (.3)²) + .01} ]

       [Figure: density f over (−4, 4)]
       and use of random walk Metropolis–Hastings algorithm with
       variance .04
       Evaluation of the missing mass by

                      Σ_{t=1}^{T−1} [θ(t+1) − θ(t) ] f (θ(t) )
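
      A sketch of this mass evaluation (not from the slides); the sum is taken
      here over the ordered chain values, i.e. as a Riemann-sum approximation,
      which is an assumption on how the diagnostic is computed:

          import numpy as np

          def f(x):
              # bimodal density from this slide
              return (np.exp(-x**2 / 2) * (4 * (x - 0.3)**2 + 0.01)
                      / (np.sqrt(2 * np.pi) * (4 * (1 + 0.3**2) + 0.01)))

          def covered_mass(theta):
              # Riemann-sum estimate of the mass visited by the chain
              ts = np.sort(np.asarray(theta))
              return np.sum(np.diff(ts) * f(ts[:-1]))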
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Convergence assessment


       [Figure: mass estimate against iteration index, 0 to 2000]




                    Sequence [in blue] and mass evaluation [in brown]


                                                               [Philippe & Robert, 2001]
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Convergence assessment


Effective sample size
      How many iid simulations from π are equivalent to N simulations
      from the MCMC algorithm?
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Convergence assessment


Effective sample size
      How many iid simulations from π are equivalent to N simulations
      from the MCMC algorithm?

       Based on the estimated k-th order auto-correlation,

                                           ρk = corr( x(t) , x(t+k) ) ,

       effective sample size

                                       N^ess = n ( 1 + 2 Σ_{k=1}^{T0} ρ̂k )^−1 ,



             Only partial indicator that fails to signal chains stuck in one
             mode of the target
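
      A sketch of this computation (not from the slides); the truncation lag T0
      and the chain are assumed supplied by the user, and the autocorrelations
      are estimated empirically:

          import numpy as np

          def ess(chain, T0=50):
              # N^ess = n / (1 + 2 * sum_{k=1..T0} rho_k), rho_k the lag-k autocorrelation
              x = np.asarray(chain, dtype=float)
              x = x - x.mean()
              n = len(x)
              var = np.dot(x, x) / n
              rho = np.array([np.dot(x[:-k], x[k:]) / (n * var) for k in range(1, T0 + 1)])
              return n / (1.0 + 2.0 * rho.sum())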
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Convergence assessment


Tempering

      Facilitate exploration of π by flattening the target: simulate from
      πα (x) ∝ π(x)α for α > 0 small enough
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Convergence assessment


Tempering

      Facilitate exploration of π by flattening the target: simulate from
      πα (x) ∝ π(x)α for α > 0 small enough
             Determine where the modal regions of π are (possibly with
             parallel versions using different α’s)
             Recycle simulations from π(x)α into simulations from π by
             importance sampling
             Simple modification of the Metropolis–Hastings algorithm,
             with new acceptance
                           { π(θ′ |x) / π(θ|x) }^α · q(θ|θ′ ) / q(θ′ |θ) ∧ 1
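
      A sketch of the corresponding step for a symmetric random walk proposal
      (not from the slides), in which case the q ratio cancels and only the
      flattened target ratio remains:

          import numpy as np

          def tempered_rw_step(theta, log_post, alpha, scale, rng):
              # Random walk MH on the flattened target pi(theta|x)^alpha, 0 < alpha <= 1
              prop = theta + scale * rng.standard_normal(np.shape(theta))
              if np.log(rng.uniform()) < alpha * (log_post(prop) - log_post(theta)):
                  return prop
              return theta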
An introduction to advanced (?) MCMC methods
  The Metropolis-Hastings Algorithm
     Convergence assessment


Tempering with the mean mixture

       [Figure: random walk paths in the (µ1 , µ2 ) plane for the tempered
       targets π^α with α = 1, 0.5, 0.2]
