Forensic Biology & Its biological significance.pdf
Estimation of the score vector and observed information matrix in intractable models
1. Estimation of the score vector and observed
information matrix in intractable models
Arnaud Doucet (University of Oxford)
Pierre E. Jacob (University of Oxford)
Sylvain Rubenthaler (Universit´e Nice Sophia Antipolis)
June 15th, 2015
Pierre Jacob Derivative estimation 1/ 25
2. Outline
1 Context: intractable derivatives
2 Shift estimators
3 Monte Carlo shift estimators
Pierre Jacob Derivative estimation 1/ 25
3. Outline
1 Context: intractable derivatives
2 Shift estimators
3 Monte Carlo shift estimators
Pierre Jacob Derivative estimation 2/ 25
4. Motivation
Derivatives of the likelihood help optimizing / sampling.
They become crucial when the parameter space is
high-dimensional.
For many models, these derivatives are intractable.
One can resort to approximation techniques.
Pierre Jacob Derivative estimation 2/ 25
5. Motivation: hidden Markov models
y2
X2X0
y1
X1
...
... yT
XT
θ
Figure : Graph representation of a general hidden Markov model.
Pierre Jacob Derivative estimation 3/ 25
6. Motivation: hidden Markov models
The associated loglikelihood is
(θ) = log p(y1:T ; θ)
= log
XT+1
T
t=1
g(yt | xt, θ) µ(dx1 | θ)
T
t=2
f (dxt | xt−1, θ).
This high-dimensional integral can be approximated
efficiently using particle filters.
However, the approximation itself is “discontinuous” as a
function of the parameter θ, even when the “seed” is fixed.
Figures coming.
Pierre Jacob Derivative estimation 4/ 25
7. Motivation: hidden Markov models
−198
−196
−194
−192
−190
0.500 0.525 0.550 0.575 0.600
θ
loglikelihood
different seed truth
Figure : Approximation of the log-likelihood by particle filters.
Pierre Jacob Derivative estimation 5/ 25
8. Motivation: hidden Markov models
−196
−194
−192
−190
0.500 0.525 0.550 0.575 0.600
θ
loglikelihood
same seed truth
Figure : Approximation of the log-likelihood by particle filters.
Pierre Jacob Derivative estimation 6/ 25
9. Motivation: general context
Let (θ) denote a “log-likelihood” but it could be any
function.
Let L(θ) = exp (θ), the “likelihood”.
Assume that we have access to estimators L(θ) of L(θ)
such that
E[L(θ)] = L(θ)
and
V
L(θ)
L(θ)
=
v(θ)
M
≈
v
M
(∀θ).
Pierre Jacob Derivative estimation 7/ 25
10. Finite difference
First derivative:
(θ ) =
log L(θ + h) − log L(θ − h)
2h
converges to (θ ) when M → ∞ and h → 0.
Second derivative:
2 (θ) =
log L(θ + h) − 2 log L(θ) + log L(θ − h)
h2
converges to 2 (θ ) when M → ∞ and h → 0.
Pierre Jacob Derivative estimation 8/ 25
11. Finite difference
Bias:
E (θ ) − (θ ) = O(h2
).
Variance:
V (θ ) = O
1
Mh2
.
Optimal rate of convergence for the first derivative:
h ∼ M−1/6
leading to MSE ∼ M−2/3
.
For the second derivative:
h ∼ M−1/8
leading to MSE ∼ M−1/2
.
Pierre Jacob Derivative estimation 9/ 25
12. Outline
1 Context: intractable derivatives
2 Shift estimators
3 Monte Carlo shift estimators
Pierre Jacob Derivative estimation 10/ 25
13. Iterated Filtering
Ionides, Bhadra, Atchad´e, King, Iterated filtering, AoS, 2011.
Given a log-likelihood and a given point θ , consider a
“prior” distribution centered on θ , for instance
Θ ∼ N(θ , τ2
Σ).
We can consider “posterior” expectations, when they exist:
Epost [ϕ (Θ)] =
E [ϕ (Θ) exp (Θ)]
E [exp (Θ)]
,
where the expectations E are with respect to the “prior”.
Pierre Jacob Derivative estimation 10/ 25
14. Iterated Filtering
Under conditions on the prior distribution, and on the
log-likelihood, one gets the following result.
Posterior expectation when the prior variance goes to zero
For any θ , there exists a C such that, for τ small enough,
|τ−2
Σ−1
(Epost[Θ] − θ ) − (θ )| ≤ Cτ.
Phrased simply,
posterior mean − prior mean
prior variance
≈ score.
Result from Ionides, Bhadra, Atchad´e, King, Iterated filtering,
2011.
Pierre Jacob Derivative estimation 11/ 25
15. Normal priors
Let us focus on the normal case:
Θ ∼ N(θ , τ2
Σ).
Stein’s Lemma says...
Assume that E [| iL (Θ)|] < ∞ and E 2
ijL (Θ) < ∞ for
i, j ∈ {1, . . . , d}. Then we have
Epost [Θ − θ ] = τ2
Σ Epost [ (Θ)] ,
and
Epost (Θ − θ ) (Θ − θ )T
= τ2
Σ + τ4
ΣEpost
2
(Θ) + (Θ) (Θ)T
Σ.
Pierre Jacob Derivative estimation 12/ 25
16. Expected derivatives and derivatives
We do not want to estimate
Epost [ (Θ)]
but rather
(θ ).
Posterior moment expansion
When τ → 0, we have for ϕ satisfying some conditions,
Epost [ϕ (Θ)] = ϕ (θ )
+
τ2
2
d
i=1
d
j=1
2
ijϕ (θ ) + 2 iϕ (θ ) j (θ ) Σij
+ O τ4
.
Pierre Jacob Derivative estimation 13/ 25
17. Expected derivatives and derivatives
Conditions for this to hold:
Let ϕ : Rd → R be a four times continuously differentiable
function. Assume that there exists a constant M < ∞ and
δ > 0 such that ∂4ϕ(θ)
∂θi∂θj∂θk∂θl
≤ M for all θ ∈ BΣ(θ , δ), all
(i, j, k, l) ∈ {1, . . . , d}4
.
The likelihood function L is such that both θ → L (θ) and
θ → ϕ (θ) × L (θ) satisfy the same assumption as ϕ.
There exists τ0 > 0 such that Eτ0 [|ϕ (Θ)|] < ∞,
Eτ0 [|L (Θ)|] < ∞ and Eτ0 [|ϕ (Θ) L (Θ)|] < ∞.
Pierre Jacob Derivative estimation 14/ 25
18. Shift estimators
Shift estimators
Sτ (θ ) = τ−2
Σ−1
Epost [Θ − θ ] = (θ ) + O τ2
,
Iτ (θ ) = τ−4
Σ−1
Vpost [Θ] − τ2
Σ Σ−1
= 2
(θ ) + O τ2
.
Using Monte Carlo approximations of
Epost [Θ] and Vpost [Θ] ,
these formulas provide estimators of
(θ ) and 2
(θ ).
Pierre Jacob Derivative estimation 15/ 25
19. Example: normal likelihood
The parameter θ represents a d-dimensional location, and
Y ∼ N(θ, Λ−1
y ).
Thus the derivatives of the log-likelihood are
(θ) = −Λy(θ − y),
2
(θ) = −Λy.
Using a prior N θ , τ2Σ , the posterior is conjugate and the
shift estimators are given by
Sτ (θ ) =
1
(1 + τ2ΣΛy)
(−Λy(θ − y)),
Iτ (θ ) =
1
(1 + τ2ΣΛy)
(−Λy) .
Pierre Jacob Derivative estimation 16/ 25
20. Outline
1 Context: intractable derivatives
2 Shift estimators
3 Monte Carlo shift estimators
Pierre Jacob Derivative estimation 17/ 25
21. Monte Carlo shift estimators
Draw θi ∼ N θ , τ2Σ , for each i ∈ {1, . . . , N}.
Draw ˆwi = L (θi), for each i ∈ {1, . . . , N}.
Normalize the weights: for each i ∈ {1, . . . , N},
ˆWi = ˆwi/
N
j=1
ˆwj.
Then we can estimate
τ−2
Σ−1
(Epost[Θ] − θ )
by
τ−2
Σ−1
N
i=1
ˆWiθi − θ ,
or, even better,
τ−2
Σ−1
N
i=1
ˆWiθi −
1
N
N
i=1
θi .
Pierre Jacob Derivative estimation 17/ 25
22. Monte Carlo shift estimators
For the second order derivative, recall that:
τ−4
Σ−1
Vpost [Θ] − τ2
Σ Σ−1
= 2
(θ ) + O τ2
.
Importance sampling strategy (with control variates) leads to:
τ−4
Σ−1
N
i=1
ˆWi θi −
N
i=1
ˆWjθj
2
−
1
N
N
i=1
θi −
1
N
N
i=1
θi
2
Σ−1
.
with obvious extension in the multivariate case.
Pierre Jacob Derivative estimation 18/ 25
23. Monte Carlo shift estimators
Score estimator:
SN,τ (θ ) = τ−2
Σ−1
N
i=1
ˆWiθi −
1
N
N
i=1
θi .
Bias:
E SN,τ (θ ) − (θ ) = O τ2
.
Variance:
V SN,τ (θ ) = O
1
Nτ2
.
⇒ same rate of convergence than the finite difference method!
Similar results for the estimator of 2 (θ ).
Pierre Jacob Derivative estimation 19/ 25
24. Monte Carlo shift estimators
The shift estimators are competitive with the gold
standard!
The shift estimators are not better than the gold standard!
So why bother?
Why is Iterated Filtering used in practice?
Let’s have a look at the non-asymptotic behaviour.
Pierre Jacob Derivative estimation 20/ 25
25. Robustness
Recall:
SN,τ (θ ) = τ−2
Σ−1
N
i=1
ˆWiθi −
1
N
N
i=1
θi ,
where
ˆWi =
L(θi)
N
j=1 L(θj)
,
and V(L(θi)/L(θi)) = v/M.
Scenario: v is very large. Then N
i=1
ˆWiθi ≈ θj for some j.
We obtain, for all v:
V(SN,τ (θ ) ≤ Cτ−2
.
On the other hand, the variance of finite difference estimators
increases to ∞ with v.
Pierre Jacob Derivative estimation 21/ 25
26. Example: normal likelihood
Latent variable model:
X ∼ N(θ, Λ−1
y − λΛ−1
y|x), Y | X = x ∼ N(x, λΛ−1
y|x),
for fixed matrices Λy and Λy|x, and λ ∈ (0, 1).
Still corresponds to
Y ∼ N(θ, Λ−1
y ),
but a naive Monte Carlo approach has an infinite relative
variance when λ goes to zero.
The following figure represents the behaviour of the MSE for
the Monte Carlo shift estimator and the finite difference
estimator as λ goes to zero, i.e. v → ∞.
Pierre Jacob Derivative estimation 22/ 25
27. Robustness
1e+01
1e+04
1e+07
1e−01 1e−02 1e−03 1e−04 1e−05 1e−06
λ
MSE
Monte Carlo shift finite difference
Figure : Robustness of the Monte Carlo shift estimator when the
variance of the likelihood estimator increases.
Pierre Jacob Derivative estimation 23/ 25
28. Discussion
Monte Carlo shift estimators and finite difference
estimators: equivalent in mean squared error (N → ∞).
Monte Carlo shift estimators are robust to the noise of the
likelihood estimators (v → ∞). This might explain the
observed gain within optimization procedures.
There are specific forms of shift estimators for hidden
Markov models (e.g. Iterated Filtering).
Behaviour of the posterior distribution when the prior is a
concentrated normal distribution, interesting on its own
right?
Pierre Jacob Derivative estimation 24/ 25
29. Bibliography
Main references:
Inference for nonlinear dynamical systems, Ionides, Breto,
King, PNAS, 2006.
Iterated filtering, Ionides, Bhadra, Atchad´e, King, Annals
of Statistics, 2011.
Derivative-Free Estimation of the Score Vector
and Observed Information Matrix,
Doucet, Jacob, Rubenthaler, (on arXiv, to be updated).
Pierre Jacob Derivative estimation 25/ 25