Anisotropic Metropolis Adjusted Langevin Algorithm: convergence and utility in Stochastic EM algorithm

Stéphanie Allassonnière
CMAP, École Polytechnique

BigMC, January 2012

Joint work with Estelle Kuhn (INRA, France)


Introduction

Introduction:

Where does the problem come from?
Image analysis: compare two observations by quantifying
the deformation from one to the other (D'Arcy Thompson, 1917)

Each element of a population is a smooth deformation of a template


Registration / Variance
Template estimation / Mean



Deformable template model ($u$ = voxel, $v_u$ its position):

$$y(u) = I_0(v_u - m(v_u)) + \sigma\,\varepsilon(u),$$


Estimation of the template $I_0$ and of the geometry law (distribution of $m$)


High dimensional setting, Low sample size


Considering the LDDMM framework through the shooting equations


Introduction

Outline:

1. AMALA: simulation of random variables in high dimension
   Anisotropic MALA description
   Convergence property
2. AMALA within stochastic algorithm for parameter estimation
   Maximum likelihood estimation for incomplete data setting
   AMALA-SAEM
   Convergence properties
3. Experiments
   BME-Template model: small deformation setting
   BME-Template model: LDDMM setting




Anisotropic Metropolis Adjusted Langevin algorithm (AMALA)

Introduction:

General setting:



Simulation of random variables in high-dimensional settings → Gibbs sampler not useful



Metropolis Adjusted Langevin Algorithm (MALA)



Target distribution: π



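At iteration $k$ of this algorithm, $X_k$ is the current value
Simulate $X_c$ w.r.t. $\mathcal{N}(X_k + \delta D(X_k), \delta\,\mathrm{Id}_d)$
where $D(x) = \frac{b}{\max(b, |\nabla \log \pi(x)|)} \nabla \log \pi(x)$.
Update $X_{k+1} = X_c$ with probability
$$\alpha(X_k, X_c) = \min\left(1, \frac{\pi(X_c)\, q_{MALA}(X_c, X_k)}{q_{MALA}(X_k, X_c)\, \pi(X_k)}\right)$$
and $X_{k+1} = X_k$ otherwise.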




Problem: isotropic covariance matrix = chain numerically trapped
($\alpha(X_k, X_c) = 0$)



→ Anisotropic Metropolis Adjusted Langevin Algorithm (AMALA)


Anisotropic Metropolis Adjusted Langevin algorithm (AMALA) Description of the algorithm

How to include anisotropy?

Following the magnitude of the gradient

First approximation: independence of directions

Bounded covariance (same as bounded drift)



Anisotropic Metropolis Adjusted Langevin Algorithm (AMALA)

For all $k = 1 : k_{end}$ (iterations of the Markov chain)

Sample $X_c$ with respect to
$$\mathcal{N}(X_k + \delta D(X_k), \delta \Sigma(X_k))$$
with $D(X_k) = \frac{b}{\max(b, |\nabla \log \pi(X_k)|)} \nabla \log \pi(X_k)$ and
$$\Sigma(X_k) = \mathrm{Id}_d + \mathrm{diag}\left([\nabla \log \pi(X_k)]_1^2 \wedge b, \ldots, [\nabla \log \pi(X_k)]_d^2 \wedge b\right)$$

Compute the acceptance ratio
$$\alpha(X_k, X_c) = \min\left(1, \frac{\pi(X_c)\, q_c(X_c, X_k)}{q_c(X_k, X_c)\, \pi(X_k)}\right)$$
($q_c$ = the pdf of this distribution).

Sample $X_{k+1} = X_c$ with probability $\alpha(X_k, X_c)$ and $X_{k+1} = X_k$ with
probability $1 - \alpha(X_k, X_c)$ = accept/reject
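
The AMALA iteration described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`amala_step`, `run_amala`), the toy 2-dimensional Gaussian target and the values of `delta` and `b` are all assumptions chosen for the example.

```python
import numpy as np

def proposal_params(x, grad_log_pi, delta, b):
    """Mean and diagonal variance of the AMALA proposal started at x."""
    g = grad_log_pi(x)
    drift = b / max(b, np.linalg.norm(g)) * g   # truncated drift D(x)
    sigma_diag = 1.0 + np.minimum(g**2, b)      # diagonal of Sigma(x)
    return x + delta * drift, delta * sigma_diag

def log_gauss_diag(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean)**2 / var)

def amala_step(x, log_pi, grad_log_pi, delta, b, rng):
    """One proposal plus accept/reject step; returns (new state, accepted)."""
    mean_x, var_x = proposal_params(x, grad_log_pi, delta, b)
    cand = mean_x + np.sqrt(var_x) * rng.standard_normal(x.shape)
    mean_c, var_c = proposal_params(cand, grad_log_pi, delta, b)
    # log of pi(Xc) q_c(Xc, Xk) / (q_c(Xk, Xc) pi(Xk))
    log_alpha = (log_pi(cand) + log_gauss_diag(x, mean_c, var_c)
                 - log_pi(x) - log_gauss_diag(cand, mean_x, var_x))
    if np.log(rng.uniform()) < log_alpha:
        return cand, True
    return x, False

def run_amala(log_pi, grad_log_pi, x0, n_iter, delta, b, seed=0):
    rng = np.random.default_rng(seed)
    chain = np.empty((n_iter + 1, x0.size))
    chain[0] = x0
    n_accept = 0
    for k in range(n_iter):
        chain[k + 1], accepted = amala_step(chain[k], log_pi, grad_log_pi,
                                            delta, b, rng)
        n_accept += accepted
    return chain, n_accept / n_iter

# Toy target (assumption): zero-mean Gaussian with variances 1 and 25.
variances = np.array([1.0, 25.0])
log_pi = lambda x: -0.5 * np.sum(x**2 / variances)
grad_log_pi = lambda x: -x / variances
chain, acc_rate = run_amala(log_pi, grad_log_pi, np.zeros(2), 2000,
                            delta=0.5, b=10.0)
```

Because $\Sigma(X_k)$ is diagonal, the proposal density factorises over coordinates, which keeps the acceptance ratio cheap to evaluate in high dimension.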



Geometric ergodicity of the Markov chain

Condition:
$\pi$ super-exponential: smoothness condition on the target distribution
(B1) The density $\pi$ is positive with continuous first derivative such that:

$$\lim_{|x| \to \infty} n(x) \cdot \nabla \log \pi(x) = -\infty \qquad (1)$$

and

$$\limsup_{|x| \to \infty} n(x) \cdot m(x) < 0 \qquad (2)$$

where $\nabla$ is the gradient operator in $\mathbb{R}^d$, $n(x) = \frac{x}{|x|}$ is the unit vector
pointing in the direction of $x$ and $m(x) = \frac{\nabla \pi(x)}{|\nabla \pi(x)|}$ is the unit vector in
the direction of the gradient of the stationary distribution at point $x$.
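
As a quick sanity check of (B1), here is a worked example under an assumption that is not in the slides: take the standard Gaussian target $\pi(x) \propto \exp(-|x|^2/2)$ on $\mathbb{R}^d$. Then

$$\nabla \log \pi(x) = -x, \qquad n(x) \cdot \nabla \log \pi(x) = -\frac{x \cdot x}{|x|} = -|x| \xrightarrow[|x| \to \infty]{} -\infty,$$

so (1) holds, and

$$\nabla \pi(x) = -x\,\pi(x), \qquad m(x) = -\frac{x}{|x|}, \qquad n(x) \cdot m(x) = -1 < 0,$$

so (2) holds as well.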




Result:

Existence of a small set

$$\Pi(x, A) \ge \varepsilon\, \nu(A)\, \mathbb{1}_C(x), \quad \forall x \in \mathcal{X} \text{ and } \forall A \in \mathcal{B}$$

Drift condition: pulls the chain back into the small set

$$\Pi V(x) \le \lambda V(x) + b\, \mathbb{1}_C(x).$$

Geometric ergodicity

$$\sup_{x \in \mathcal{X}} \frac{|\Pi^n V(x) - \pi(x)|}{V(x)} \le R \rho^n. \qquad (3)$$



Experiments on synthetic data

Target: 10-dimensional Gaussian distribution with zero mean and
diagonal covariance matrix with diagonal coefficients randomly picked
between 1 and 2500

Comparison of AMALA and symmetric random walk

500,000 iterations for each algorithm starting at zero

Mean squared jump distance (MSJD) in stationarity:
AMALA 0.1504, random walk 0.0407.
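
For reference, the MSJD is the average of $|X_{k+1} - X_k|^2$ along the chain. A minimal sketch of how it could be computed from stored iterates is given below; the function name and the tiny example chain are assumptions for illustration and are unrelated to the figures reported above.

```python
import numpy as np

def mean_squared_jump_distance(chain):
    """Average squared Euclidean distance between consecutive iterates.

    `chain` has one row per iteration; rejected moves count as zero jumps.
    """
    chain = np.asarray(chain, dtype=float)
    jumps = np.diff(chain, axis=0)
    return float(np.mean(np.sum(jumps**2, axis=1)))

# Hypothetical 3-iterate chain in dimension 2: jumps of squared length 1 and 4.
example_chain = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]]
example_msjd = mean_squared_jump_distance(example_chain)  # 2.5
```

A larger MSJD at stationarity means the sampler moves further per iteration, which is the sense in which the comparison above favours AMALA.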



Experiments on synthetic data

Figure: Autocorrelation functions of the AMALA (red) and the random walk
(blue) samplers for four of the ten components of the 10-dimensional Gaussian
distribution.


Why not use existing MALA-like algorithms?

Optimised MALA-like algorithms are usually adaptive
Good performance in practice
Good theoretical properties

However

Numerical problems during the first iterations (not yet stationary):
convergence time?
Most important: our goal = parameter estimation
AMALA = one tool inside another algorithm
Adaptive + estimation algorithm = numerical issues: too many
degrees of freedom


Applying AMALA within SAEM




Maximum likelihood estimation for incomplete data setting

$y \in \mathbb{R}^n$: observed data
$z \in \mathbb{R}^l$: missing data
$(y, z) \in \mathbb{R}^{n+l}$: complete data
$\mathcal{P} = \{f(y, z; \theta), \theta \in \Theta\}$: family of parametric pdfs on $\mathbb{R}^{n+l}$

Assumption:
$\exists \theta \in \Theta$ s.t. the complete data likelihood $q(y, z; \theta) = f(y, z; \theta)$

Observed likelihood:

$$g(y; \theta) = \int f(y, z; \theta)\, \mu(dz). \qquad (4)$$

Given a sample of observations $(y_i)_{1 \le i \le n} = y_1^n$
Find: $\hat{\theta}_g$ in $\Theta$ s.t.
$$\hat{\theta}_g = \arg\max_{\theta \in \Theta} g(y_1^n; \theta)$$
θ∈Θ


Applying AMALA within SAEM Description of the algorithm

AMALA-SAEM

Incomplete data setting + maximum likelihood estimation = EM algorithm

General case → E step not tractable

Stochastic Approximation EM for convergence properties

with MCMC method for simulation step.

→ AMALA-SAEM: using AMALA as the MCMC method



Description of the algorithm

Assumption: model in the exponential family = all information carried by
the sufficient statistics $S$
For $k = 1 : k_{end}$ (iterations of SAEM)

Sample $z_k$ through a single AMALA step (simulation and
accept/reject) using the current parameter $\theta_{k-1}$

Compute the stochastic approximation

$$s_k = s_{k-1} + \gamma_k (S(z_k) - s_{k-1}),$$

where $(\gamma_k)_k$ is a sequence of positive step sizes.

Update the parameter
$$\theta_k = \hat{\theta}(s_k).$$

Can require truncation on random boundaries for convergence purposes
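
The three steps above can be sketched on a toy model. Everything specific in this example is an assumption made for illustration and is not taken from the slides: the model $z_i \sim \mathcal{N}(\theta, 1)$, $y_i \mid z_i \sim \mathcal{N}(z_i, 1)$ (so $S(z)$ is the mean of $z$ and $\hat{\theta}(s) = s$), the step sizes, and the values of `delta` and `b`. Truncation on random boundaries is omitted.

```python
import numpy as np

def amala_step(z, log_pi, grad_log_pi, delta, b, rng):
    """One AMALA proposal plus accept/reject step for the target pi."""
    def proposal(x):
        g = grad_log_pi(x)
        drift = b / max(b, np.linalg.norm(g)) * g
        var = delta * (1.0 + np.minimum(g**2, b))   # delta * diag(Sigma(x))
        return x + delta * drift, var

    def log_q(x_to, mean, var):
        return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x_to - mean)**2 / var)

    mean_z, var_z = proposal(z)
    cand = mean_z + np.sqrt(var_z) * rng.standard_normal(z.shape)
    mean_c, var_c = proposal(cand)
    log_alpha = (log_pi(cand) + log_q(z, mean_c, var_c)
                 - log_pi(z) - log_q(cand, mean_z, var_z))
    return cand if np.log(rng.uniform()) < log_alpha else z

def amala_saem(y, n_iter=4000, burn_in=1000, delta=0.03, b=5.0, seed=0):
    """SAEM for the toy model, with one AMALA step per iteration."""
    rng = np.random.default_rng(seed)
    z = np.zeros_like(y)
    theta, s = 0.0, 0.0
    for k in range(1, n_iter + 1):
        # Conditional law of z given y and the current theta (up to a constant).
        log_pi = lambda x, th=theta: -0.5 * np.sum((x - th)**2 + (y - x)**2)
        grad_log_pi = lambda x, th=theta: -(x - th) + (y - x)
        z = amala_step(z, log_pi, grad_log_pi, delta, b, rng)       # simulation
        gamma = 1.0 if k <= burn_in else (k - burn_in) ** -0.7      # step size
        s = s + gamma * (z.mean() - s)          # stochastic approximation
        theta = s                               # theta_k = theta_hat(s_k)
    return theta

data_rng = np.random.default_rng(1)
y = 2.0 + np.sqrt(2.0) * data_rng.standard_normal(20)   # simulated observations
theta_hat = amala_saem(y)
```

In this toy model the observed-data maximum likelihood estimate is the sample mean of `y`, so `theta_hat` can be compared with `y.mean()` to check the behaviour of the sketch.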


Applying AMALA within SAEM Convergence properties


Conditions:
Smoothness of the model (classic conditions for convergence of
stochastic approximation and EM)
Condition for AMALA geometric ergodicity (B1)

Results:
Convergence of $(s_k)$ a.s. towards a critical point of the mean field of the
problem
Convergence of the estimated parameters $(\theta_k)$ a.s. towards a critical point
of the observed likelihood
Central limit theorem for $(\theta_k)$ with rate $1/\sqrt{\gamma_k}$



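Conditions for the SA to converge

Define for any $V : \mathcal{X} \to [1, \infty]$ and any $g : \mathcal{X} \to \mathbb{R}^m$ the norm

$$\|g\|_V = \sup_{z \in \mathcal{X}} \frac{\|g(z)\|}{V(z)}.$$

(A1') $\mathcal{S}$ is an open subset of $\mathbb{R}^m$, $h : \mathcal{S} \to \mathbb{R}^m$ is continuous and there exists
a continuously differentiable function $w : \mathcal{S} \to [0, \infty[$ with the
following properties.
(i) There exists an $M_0 > 0$ such that

$$\mathcal{L} \triangleq \{s \in \mathcal{S}, \langle \nabla w(s), h(s) \rangle = 0\} \subset \{s \in \mathcal{S}, w(s) < M_0\}.$$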



Conditions for the SA to converge (2)

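(ii) There exists a closed convex set $\mathcal{S}_a \subset \mathcal{S}$ for which
$s \to s + \rho H_s(z) \in \mathcal{S}_a$ for any $\rho \in [0, 1]$ and $(z, s) \in \mathcal{X} \times \mathcal{S}_a$ ($\mathcal{S}_a$ is
absorbing) and such that for any $M_1 \in ]M_0, \infty]$, the set $\mathcal{W}_{M_1} \cap \mathcal{S}_a$ is a
compact set of $\mathcal{S}$ where $\mathcal{W}_{M_1} \triangleq \{s \in \mathcal{S}, w(s) \le M_1\}$.
(iii) For any $s \in \mathcal{S} \setminus \mathcal{L}$, $\langle \nabla w(s), h(s) \rangle < 0$.
(iv) The closure of $w(\mathcal{L})$ has an empty interior.

(A2') For any $s \in \mathcal{S}$, $H_s : \mathcal{X} \to \mathcal{S}$ is measurable and $\int \|H_s(z)\|\, \pi_s(dz) < \infty$.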




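(A3'') There exists a function $V : \mathcal{X} \to [1, \infty]$ such that
$\{z \in \mathcal{X}, V(z) < \infty\} \ne \emptyset$, constants $a \in ]0, 1]$, $p \ge 2$, $r > 0$ and
$q \ge 1$ such that for any compact subset $\mathcal{K} \subset \mathcal{S}$,
(i)

$$\sup_{s \in \mathcal{K}} \|H_s\|_V < \infty, \qquad (5)$$
$$\sup_{s \in \mathcal{K}} \left(\|g_s\|_V + \|\Pi_s g_s\|_V\right) < \infty, \qquad (6)$$
$$\sup_{s, s' \in \mathcal{K}} \|s - s'\|^{-a} \left\{\|g_s - g_{s'}\|_{V^q} + \|\Pi_s g_s - \Pi_{s'} g_{s'}\|_{V^q}\right\} < \infty, \qquad (7)$$

where for any $s \in \mathcal{S}$ a solution of the Poisson equation
$g - \Pi_s g = H_s - \pi_s(H_s)$ is denoted by $g_s$.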




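(ii) For any sequence $\varepsilon = (\varepsilon_k)_{k \ge 0}$ satisfying $\varepsilon_k < \bar{\varepsilon}$ for an $\bar{\varepsilon}$ sufficiently
small, for any sequence $\gamma = (\gamma_k)_{k \ge 0}$, there exists a constant $C$ such
that for any $z \in \mathcal{X}$,

$$\sup_{s \in \mathcal{K}} \sup_{k \ge 0} E^{\gamma}_{z,s}\left[V^p(z_k)\, \mathbb{1}_{\sigma(\mathcal{K}) \wedge \nu(\varepsilon) \ge k}\right] \le C\, V^{p+r}(z), \qquad (8)$$

where $\nu(\varepsilon) = \inf\{k \ge 1, \|s_k - s_{k-1}\| \ge \varepsilon_k\}$ and
$\sigma(\mathcal{K}) = \inf\{k \ge 1, s_k \notin \mathcal{K}\}$ and the expectation is related to the
non-homogeneous Markov chain $((z_k, s_k))_{k \ge 0}$ using the step-size
sequence $\gamma = (\gamma_k)_{k \ge 0}$.
(A4) The sequences $\gamma = (\gamma_k)_{k \ge 0}$ and $\varepsilon = (\varepsilon_k)_{k \ge 0}$ are non-increasing,
positive and satisfy: $\sum_{k=0}^{\infty} \gamma_k = \infty$, $\lim_{k \to \infty} \varepsilon_k = 0$ and
$\sum_{k=1}^{\infty} \left\{\gamma_k^2 + \gamma_k \varepsilon_k^a + (\gamma_k \varepsilon_k^{-1})^p\right\} < \infty$, where $a$ and $p$ are defined in (A3'').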
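
As an illustration of (A4), under the assumption of polynomial sequences (a choice not made in the slides), one can take

$$\gamma_k = (k+1)^{-1}, \qquad \varepsilon_k = (k+1)^{-\beta}, \qquad 0 < \beta < 1 - \frac{1}{p}.$$

Both sequences are positive and non-increasing, $\sum_k \gamma_k = \infty$ and $\varepsilon_k \to 0$. The three terms of the second series are

$$\gamma_k^2 = (k+1)^{-2}, \qquad \gamma_k \varepsilon_k^a = (k+1)^{-(1 + a\beta)}, \qquad (\gamma_k \varepsilon_k^{-1})^p = (k+1)^{-p(1-\beta)},$$

and each is summable because $2 > 1$, $1 + a\beta > 1$ and $p(1 - \beta) > 1$. Since $p \ge 2$, the admissible range for $\beta$ is non-empty.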



Condition for AMALA-SAEM to converge

(M1) The parameter space $\Theta$ is an open subset of $\mathbb{R}^p$. The complete
data likelihood function is given by:

$$f(y, z; \theta) = \exp\left\{-\psi(\theta) + \langle S(z), \phi(\theta) \rangle\right\},$$

where $S$ is a Borel function on $\mathbb{R}^l$ taking its values in an open subset
$\mathcal{S}$ of $\mathbb{R}^m$. Moreover, the convex hull of $S(\mathbb{R}^l)$ is included in $\mathcal{S}$, and,
for all $\theta$ in $\Theta$,
$$\int \|S(z)\|\, p_\theta(z)\, \mu(dz) < \infty.$$

(M2) The functions $\psi$ and $\phi$ are twice continuously differentiable on
$\Theta$.



Condition for AMALA-SAEM to converge (2)

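(M3) The function $\bar{s} : \Theta \to \mathcal{S}$ defined as

$$\bar{s}(\theta) \triangleq \int S(z)\, p_\theta(z)\, \mu(dz)$$

is continuously differentiable on $\Theta$.
(M4) The function $l : \Theta \to \mathbb{R}$ defined as the observed-data
log-likelihood

$$l(\theta) \triangleq \log g(y; \theta) = \log \int f(y, z; \theta)\, \mu(dz)$$

is continuously differentiable on $\Theta$ and

$$\partial_\theta \int f(y, z; \theta)\, \mu(dz) = \int \partial_\theta f(y, z; \theta)\, \mu(dz).$$
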
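(M5) There exists a function $\hat{\theta} : \mathcal{S} \to \Theta$, such that:
$$\forall s \in \mathcal{S},\ \forall \theta \in \Theta, \quad L(s; \hat{\theta}(s)) \ge L(s; \theta).$$
Moreover, the function $\hat{\theta}$ is continuously differentiable on $\mathcal{S}$.
(M6) The functions $l : \Theta \to \mathbb{R}$ and $\hat{\theta} : \mathcal{S} \to \Theta$ are $m$ times
differentiable.

(M7)
(i) There exists an $M_0 > 0$ such that

$$\left\{s \in \mathcal{S}, \partial_s l(\hat{\theta}(s)) = 0\right\} \subset \left\{s \in \mathcal{S}, -l(\hat{\theta}(s)) < M_0\right\}.$$

(ii) For all $M_1 > M_0$, the set $\overline{\mathrm{Conv}(S(\mathbb{R}^l))} \cap \{s \in \mathcal{S}, -l(\hat{\theta}(s)) \le M_1\}$ is a
compact set of $\mathcal{S}$.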




(M8) There exists a polynomial function $P$ of degree 2 such that for
all $z \in \mathcal{X}$
$$\|S(z)\| \le |P(z)|.$$
(B3) For any compact subset $\mathcal{K}$ of $\mathcal{S}$, there exists a polynomial
function $Q$ of the hidden variable such that

$$\sup_{s \in \mathcal{K}} |\nabla_z \log p_{\hat{\theta}(s)}(z)| \le |Q(z)|.$$


Application to Bayesian mixed-effect template estimation



BME Template model with small deformations


$$y(u) = I_0(v_u - m(v_u)) + \sigma\,\varepsilon(u),$$

Parametric template and deformation:
$$I_\alpha(v) = (K_p \alpha)(v) = \sum_{j=1}^{k_p} K_p(v, r_{p,j})\, \alpha^j \quad \text{and} \quad m_z(v) = (K_g z)(v) = \sum_{j=1}^{k_g} K_g(v, r_{g,j})\, z^j.$$

Generative model:
$$\begin{cases} z \sim \otimes_{i=1}^{n} \mathcal{N}_{2k_g}(0, \Gamma_g) \mid \Gamma_g, \\ y \sim \otimes_{i=1}^{n} \mathcal{N}_{|\Lambda|}(m_{z_i} I_\alpha, \sigma^2 \mathrm{Id}) \mid z, \alpha, \sigma^2, \end{cases}$$

Bayesian framework → MAP estimator (= penalised MLE)
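
To make the generative model concrete, here is a minimal one-dimensional sketch. Several choices are assumptions made only for illustration: Gaussian kernels for $K_p$ and $K_g$, a 1D signal instead of a 2D image (so $z$ has $k_g$ rather than $2k_g$ coordinates), and all numerical values and function names.

```python
import numpy as np

def gaussian_kernel(v, r, scale):
    """Matrix of K(v_i, r_j) for a Gaussian kernel (assumed kernel choice)."""
    return np.exp(-(v[:, None] - r[None, :])**2 / (2.0 * scale**2))

def template(v, alpha, r_p, scale_p):
    """I_alpha(v) = sum_j K_p(v, r_{p,j}) alpha_j."""
    return gaussian_kernel(v, r_p, scale_p) @ alpha

def deformation(v, z, r_g, scale_g):
    """m_z(v) = sum_j K_g(v, r_{g,j}) z_j."""
    return gaussian_kernel(v, r_g, scale_g) @ z

def sample_observation(v, alpha, r_p, scale_p, r_g, scale_g, gamma_g, sigma, rng):
    """Draw z ~ N(0, Gamma_g), then y = I_alpha(v - m_z(v)) + sigma * noise."""
    z = rng.multivariate_normal(np.zeros(len(r_g)), gamma_g)
    warped = template(v - deformation(v, z, r_g, scale_g), alpha, r_p, scale_p)
    return warped + sigma * rng.standard_normal(v.shape), z

rng = np.random.default_rng(0)
v = np.linspace(0.0, 1.0, 50)        # voxel positions
r_p = np.linspace(0.0, 1.0, 8)       # template control points
r_g = np.linspace(0.0, 1.0, 4)       # deformation control points
alpha = rng.standard_normal(8)       # template coefficients
gamma_g = 0.01 * np.eye(4)           # covariance of the deformation coefficients
y, z = sample_observation(v, alpha, r_p, 0.1, r_g, 0.3, gamma_g, 0.1, rng)
```

The hidden variable of the model is `z`; in the estimation problem only `y` is observed, and `alpha`, `gamma_g` and `sigma` are the parameters to estimate.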

Results on the template estimation

Training sets

Figure: Left: Training set (inverse video). Right: Noisy training set (inverse
video).



Estimated templates

Figure: Estimated templates using different algorithms (columns: FAM-EM, H.G.-SAEM, AMALA-SAEM) and two levels of noise (rows: no noise; noisy, variance 1).
The training set includes 20 images per digit.


Results on the covariance matrix estimation

Estimated geometric variability

Figure: Synthetic samples generated with respect to the BME template model.


Empirical proof of the CLT

Figure: Evolution of the estimation of the noise variance along the SAEM
iterations. Left: original data. Right: Noisy training set.



Figure: Evolution of the estimation of the noise variance along the SAEM
iterations. Test of convergence towards the Gaussian distribution of the estimated
parameters.


Medical image template estimation

Corpus callosum database

Figure: Medical image template estimation: 10 corpus callosum and splenium
training images among the 47 available.

Figure: Grey level mean. FAM-EM estimated template. Hybrid Gibbs-SAEM
estimated template. AMALA-SAEM estimated template.


Results on the template estimation using LDDMM shooting

BME Template model with LDDMM


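$$y(u) = I_0(\phi_{\beta(0)}^{-1}(v_u)) + \sigma\,\varepsilon(u),$$

Parametric template: $I_\alpha(v) = (K_p \alpha)(v) = \sum_{j=1}^{k_p} K_p(v, r_{p,j})\, \alpha^j$ and $\phi_{\beta(0)}$ the
LDDMM solution of shooting with initial momentum $\beta(0)$.
Generative model:

$$\begin{cases} z \sim \otimes_{i=1}^{n} \mathcal{N}_{2k_g}(0, \Gamma_g) \mid \Gamma_g, \\ y \sim \otimes_{i=1}^{n} \mathcal{N}_{|\Lambda|}(\phi_{\beta(0)} I_\alpha, \sigma^2 \mathrm{Id}) \mid z, \alpha, \sigma^2, \end{cases}$$

Bayesian framework → MAP estimator (= penalised MLE)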



LDDMM: parametric deformation:

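Fix some control points: $c(t) = (c_1(t), \ldots, c_{n_g}(t))$
Choose a kernel $K_g$
Start from an initial momentum $\beta(0) = (\beta^1(0), \ldots, \beta^{n_g}(0))$
Then, Hamiltonian system → time evolution of both momenta and
control points

$$\begin{cases} \dfrac{dc}{dt} = \dfrac{\partial H}{\partial \beta}(c, \beta) = K_g(c(t))\, \beta(t) \\[2mm] \dfrac{d\beta}{dt} = -\dfrac{\partial H}{\partial c}(c, \beta) = -\dfrac{1}{2} \nabla_{c(t)} K_g(\beta(t), \beta(t)) \end{cases} \qquad (9)$$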
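
The shooting system (9) can be integrated numerically. The sketch below is only an illustration under explicit assumptions that are not in the slides: a Gaussian kernel for $K_g$, the usual Hamiltonian $H(c, \beta) = \frac{1}{2} \sum_{i,j} \beta^i \cdot \beta^j\, K_g(c_i, c_j)$, and a plain explicit Euler scheme with hypothetical function names.

```python
import numpy as np

def kernel_and_differences(c, scale):
    """Gaussian kernel matrix K(c_i, c_j) and pairwise differences c_i - c_j."""
    diff = c[:, None, :] - c[None, :, :]                  # shape (n, n, dim)
    kernel = np.exp(-np.sum(diff**2, axis=2) / (2.0 * scale**2))
    return kernel, diff

def hamiltonian_rhs(c, beta, scale):
    """Right-hand side of the shooting equations for a Gaussian kernel."""
    kernel, diff = kernel_and_differences(c, scale)
    dc = kernel @ beta                                    # dc/dt = K(c) beta
    inner = beta @ beta.T                                 # beta_i . beta_j
    # dbeta_i/dt = -dH/dc_i = sum_j (beta_i . beta_j) K_ij (c_i - c_j) / scale^2
    dbeta = np.sum((inner * kernel)[:, :, None] * diff, axis=1) / scale**2
    return dc, dbeta

def shoot(c0, beta0, scale, n_steps=100):
    """Explicit Euler integration of the system from t = 0 to t = 1."""
    c, beta = c0.copy(), beta0.copy()
    dt = 1.0 / n_steps
    for _ in range(n_steps):
        dc, dbeta = hamiltonian_rhs(c, beta, scale)
        c = c + dt * dc
        beta = beta + dt * dbeta
    return c, beta

rng = np.random.default_rng(0)
c0 = rng.standard_normal((5, 2))            # 5 control points in the plane
beta0 = 0.5 * rng.standard_normal((5, 2))   # initial momenta beta(0)
c1, beta1 = shoot(c0, beta0, scale=1.0)
```

With a translation-invariant kernel the sum of the momenta is conserved along the flow, which gives a simple check on the implementation; a higher-order integrator would be preferable in practice.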



LDDMM: parametric deformation (2):

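Interpolating on any point of the domain:
$$v_t(r) = (K_g \beta(t))(r) = \sum_{k=1}^{n_g} K_g(r, c_k(t))\, \beta^k(t) \quad \forall r \in D \qquad (10)$$

Deformation = solution of the flow equation:

$$\begin{cases} \dfrac{\partial \phi_{\beta(0)}(t)}{\partial t} = v_t \circ \phi_{\beta(0)}(t) \\[2mm] \phi_0 = \mathrm{Id}. \end{cases} \qquad (11)$$

$\phi_{\beta(0)} = \phi_{\beta(0)}(1)$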



Gradient computation

$$E(\underbrace{c_i, \beta_i}_{S_0}) = \underbrace{\sum_k \left(I_0(\phi_1^{-1}(y_k)) - I(y_k)\right)^2}_{A(y_k(0))} + \sigma^2 \underbrace{\mathrm{Reg}(\phi_1)}_{L(S_0) = \beta(0)^t \Gamma_g(q(0), q(0)) \beta(0)}$$

$$\begin{cases} S_0 = \{(c_i, \beta_i)\}_i \\[1mm] \dfrac{dS(t)}{dt} = F(S(t)), \quad S(0) = S_0 \\[2mm] \dfrac{dy(t)}{dt} = G(S(t), y(t)), \quad y(1) = y \end{cases}$$

$$\nabla_{S_0} E = (d_{S_0} y(0))^T \nabla_{y(0)} A + \nabla_{S_0} L$$

$$\nabla_{y_k(0)} A = 2 \left(I_0(y_k(0)) - I(y_k(1))\right) \nabla_{y_k(0)} I_0$$
- Momenta decrease image discrepancy
- Control points attracted by image contours

$$\frac{d\eta(t)}{dt} = (\partial_{S(t)} G)^T \eta(t), \quad \eta(0) = \nabla_{y(0)} A$$

$$\frac{d\xi(t)}{dt} = (\partial_{y(t)} G)^T \eta(t) - (dF)^T \xi(t), \quad \xi(1) = 0$$

$$\nabla_{S_0} E = \xi(0) + \nabla_{S_0} L$$



Using LDDMM deformations via shooting (preliminary results)

Figure: results obtained with AMALA and with GH.


Conclusion

Good performance (as accurate as other algorithms)

Reduced computational time

Can handle the movement of control points in practice (theory to be
confirmed)

Can handle sparsity of the template (model selection)

Removing control points? In practice, why not... theory?


Thank you !

