2. • Know nothing about mushrooms, can see if 2 mushrooms belong to
the same species or not
• S = set of all conceivable species of mushrooms
• n = 3, 3 = 2 + 1
• Want to estimate (functionals of) p = (ps : s ∈ S)
• S is (probably) huge
• MLE is (species seen twice ↦ 2/3, species seen once ↦ 1/3, “other” ↦ 0)
3. Forget species names
(reduce data), then do MLE
• 3 = 2 + 1 is the partition of n = 3 generated by our sample =
reduced data
• qk := probability of the k-th most common species in the population
• Estimate q = (q1, q2, …) (i.e., ps listed in decreasing order)
• Marginal likelihood = q1²q2 + q2²q1 + q1²q3 + q3²q1 + …
• Maximized by q = (1/2, 1/2, 0, 0, …) (exercise!)
• Many interesting functionals of p and of q = rsort(p) coincide
• But q = rsort(p) does not generally hold for the MLEs based on
reduced vs. all data, and many interesting functionals differ too
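The "exercise" above can be checked numerically; a minimal sketch (restricting attention to two- and three-point spectra for illustration, and dropping the constant multinomial factor, which does not affect the maximizer):

```python
# Marginal likelihood (up to the constant multinomial factor) of the
# partition 3 = 2 + 1: sum over ordered pairs j != k of q_j^2 * q_k.
def partition_likelihood(q):
    return sum(q[j] ** 2 * q[k]
               for j in range(len(q)) for k in range(len(q)) if j != k)

# Grid search over two-point spectra q = (a, 1 - a): maximum at a = 1/2.
best_value, best_a = max(
    (partition_likelihood([a / 1000, 1 - a / 1000]), a / 1000)
    for a in range(500, 1000))
print(best_a, best_value)                        # 0.5 0.25

# A three-point competitor does worse:
print(partition_likelihood([0.5, 0.25, 0.25]))   # 0.21875
```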
7. The Fundamental Problem of Forensic Statistics
(Sparsity, and sometimes Less is More)
Richard Gill
with: Dragi Anevski, Stefan Zohren;
Michael Bargpeter; Giulia Cereda
September 12, 2014
8. Data (database):
X ∼ Multinomial(n, p) p = (ps : s ∈ S)
#S is huge (all theoretically possible species)
Very many ps are very small, almost all are zero
Conventional task:
Estimate ps, for a certain s such that Xs = 0
Preferably, also inform “consumer” of accuracy
At present hardly ever known, hardly ever done
9. Motivation:
Each “species” = a possible DNA profile
We have a database of size n of a sample of DNA profiles
e.g. Y-chromosome 8 locus STR haplotypes (yhrd.org)
e.g. Mitochondrial DNA, ...
Some profiles occur many times, some are very rare,
most “possible” profiles do not occur in the data-base at all
Profile s is observed at a crime-scene
We have a suspect and his profile matches
We never saw s before, so it must be rare
Likelihood ratio (prosecution : defence) is 1/ps
Evidential value is −log10(ps) bans
(one ban = 2.30 nats = 3.32 bits)
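For example, a (hypothetical) profile frequency ps = 10⁻⁴ gives LR = 10⁴ and an evidential value of 4 bans; a minimal sketch of the unit conversions:

```python
import math

def evidential_value_bans(p_s):
    """Evidential value of a match with a profile of frequency p_s:
    -log10 of the match probability (LR = 1/p_s)."""
    return -math.log10(p_s)

p_s = 1e-4                    # hypothetical profile frequency
bans = evidential_value_bans(p_s)
print(bans)                   # 4.0 bans
print(bans * math.log(10))    # ~9.21 nats  (1 ban = 2.30 nats)
print(bans * math.log2(10))   # ~13.29 bits (1 ban = 3.32 bits)
```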
10. Present day practices
Type 1: don’t use any prior information, discard most data
Various ad-hoc “elementary” approaches
No evaluation of accuracy
Type 2: use all prior information, all data
Model (ps : s ∈ S) using population genetics, biology, imagination:
Hierarchy of parametric models of rapidly increasing complexity:
“Present day population equals finite mixture of subpopulations
each following exact theoretical law”.
e.g. Andersen et al. (2013) “DiscLapMix”:
Model selection by AIC, estimation by EM (MLE), plug-in for LR
No evaluation of accuracy
Both types are (more precisely: easily can be) disastrous
11. New approach: Good-Turing
Idea 1: full data = database + crime scene + suspect data
Idea 2: reduce full data cleverly to reduce complexity,
to get better-estimable and better-evaluable LR without losing
much discriminatory power
Database & all evidence together = sample of size n, augmented twice
by +1 (to n + 1, then n + 2)
Reduce complexity by forgetting names of the species
All data together is reduced to
partition of n implied by database, together with
partition of n + 1 (add crime-scene species to database)
partition of n + 2 (add suspect species to database)
Database partition = spectrum of database species frequencies
= Y = rsort(X) (reverse sort)
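The reduction can be sketched on a toy database (the labels and counts below are invented for illustration):

```python
from collections import Counter

# Toy database of n = 10 profiles; the labels are about to be thrown
# away, only the partition of n matters.
database = ["a", "a", "a", "b", "b", "c", "d", "d", "e", "f"]

X = Counter(database)                 # species counts X_s
Y = sorted(X.values(), reverse=True)  # spectrum Y = rsort(X)
print(Y)                              # [3, 2, 2, 1, 1, 1]: partition 10 = 3+2+2+1+1+1

# Adding the crime-scene species, then the suspect's matching species,
# refines this to partitions of n + 1 and n + 2.
crime_scene = "g"                     # a species not seen before
Y_plus = sorted((X + Counter([crime_scene])).values(), reverse=True)
Y_plus_plus = sorted((X + Counter([crime_scene, crime_scene])).values(),
                     reverse=True)
print(Y_plus)                         # [3, 2, 2, 1, 1, 1, 1]
print(Y_plus_plus)                    # [3, 2, 2, 2, 1, 1, 1]
```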
12. The distribution of database frequency spectrum Y = rsort(X)
only depends on q = rsort(p)
the spectrum of true species probabilities (*)
Note: components of X and of p are indexed by s ∈ S,
components of Y and of q are indexed by k ∈ N+,
Y/n (database spectrum, expressed as relative frequencies)
is for our purposes a lousy estimator of q
(i.e., for the functionals of q we are interested in),
even more so than X/n is a lousy estimator of p (for those
functionals)
(*) More generally, (joint) laws of (Y, Y+, Y++) under prosecution and
defence hypotheses only depend on q
13. Our aim: estimate reduced data LR
LR = [Σ_{s∈S} (1 − ps)^n ps] / [Σ_{s∈S} (1 − ps)^n ps²]
   = [Σ_{k∈N+} (1 − qk)^n qk] / [Σ_{k∈N+} (1 − qk)^n qk²]
or, “estimate” reduced data conditional LR
LRcond = [Σ_{s∈S} I{Xs = 0} qs] / [Σ_{s∈S} I{Xs = 0} qs²]
and report (or at the least, know something about) the precision of
the estimate
Important to transform to “bans” (i.e., take negative log base 10)
before discussing precision
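A minimal sketch of the reduced-data LR as a function of q (the truncated spectrum below is an invented example, not real haplotype data):

```python
def reduced_LR(q, n):
    """Reduced-data likelihood ratio as a function of the sorted
    species-probability spectrum q (a finite truncation of it)."""
    num = sum((1 - qk) ** n * qk for qk in q)
    den = sum((1 - qk) ** n * qk ** 2 for qk in q)
    return num / den

# Hypothetical spectrum: a few common species plus many rare ones.
q = [0.05] * 10 + [0.001] * 500
print(reduced_LR(q, n=100))
```

The rare species dominate both sums once n is moderately large, since the factor (1 − qk)^n kills the common species first.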
14. Key observations
Database spectrum can be reduced to Nx = #{s : Xs = x},
x ∈ N+, or even further, to {(x, Nx ) : Nx > 0}
Add a superscript n to indicate dependence on the sample size:
(n+1 choose 1) Σ_{s∈S} (1 − ps)^n ps = E^{n+1}(N1) ≈ E^n(N1)
(n+2 choose 2) Σ_{s∈S} (1 − ps)^n ps² = E^{n+2}(N2) ≈ E^n(N2)
suggests: estimate LR (or LRcond!) from the database by n N1 / (2 N2)
Alternative: estimate by plug-in of database MLE of q in formula
for LR as function of q
But how accurate is the estimated LR?
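On simulated data the whole pipeline fits in a few lines. The Zipf-like population and all numbers below are illustrative assumptions, not from the talk:

```python
import random
from collections import Counter

random.seed(1)

# Hypothetical heavy-tailed population of S species (Zipf-like tail).
S = 5000
w = [1 / (s + 1) ** 1.5 for s in range(S)]
total = sum(w)
p = [x / total for x in w]

# Database: a multinomial sample of size n.
n = 2000
sample = random.choices(range(S), weights=p, k=n)

# N_x = #{s : X_s = x}, the number of species seen exactly x times.
counts = Counter(sample)
Nx = Counter(counts.values())
N1, N2 = Nx[1], Nx[2]

LR_hat = n * N1 / (2 * N2)   # Good-Turing-style estimate of the LR
print(N1, N2, LR_hat)
```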
15. Accuracy (1): asymptotic normality; wlog S = N+, p = q
Poissonization trick: pretend n is realisation of N ∼ Poisson(λ)
=⇒ Xs ∼ independent Poisson(λps)
(N, N1, N2) = Σ_s (Xs, I{Xs = 1}, I{Xs = 2}) should be
asymptotically trivariate Gaussian as λ → ∞
Conditional law of (N1, N2) given N = n = λ → ∞, hopefully,
converges to the corresponding conditional bivariate Gaussian
Proof: Esty (1983), no N2, p depending on n, √n normalisation;
Zhang & Zhang (2009): necessary & sufficient condition, implying p must
vary with n to have convergence (& non-degenerate limit) at this rate.
Conjecture: Gaussian limit with slower rate and fixed p possible,
under appropriate tail behaviour of p; rate will depend on tail
“exponent” (rate of decay of tail)
Formal delta-method calculations give simple conjectured
asymptotic variance estimator depending on n, N1, N2, N3, N4;
simulations suggest “it works”
16. Accuracy (2): MLE and bootstrap
Problem: Nx gets rapidly unstable as x increases
Good & Turing proposed “smoothing” the higher observed Nx
Alternative: Orlitsky et al. (2004, 2005, ...): estimate q by MLE
Likelihood = Σ_χ Π_k q_{χ(k)}^{Yk}, summed over bijections χ : N+ → N+
Anevski, Gill, Zohren (arXiv:1312.1200) prove (almost) root n
L1-consistency by delicate proof based on simple ideas inspired by
Orlitsky (et al.) “outline” of “proof”
Must “close” parameter space to make MLE exist – allow positive
probability “blob” of zero probability species
Computation: SA-MH-EM (simulated annealing, Metropolis-Hastings, EM)
Proposal: use database MLE to estimate accuracy (bootstrap),
and possibly to give alternative estimator of LR
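The bootstrap proposal can be sketched directly on the Good-Turing-style estimator of slide 14 (the toy database below is made up; a semiparametric variant would resample from the fitted q̂ instead of the empirical database):

```python
import math
import random
from collections import Counter

random.seed(7)

def good_turing_log10_LR(profiles):
    """Good-Turing-style log10 LR estimate n*N1/(2*N2) from a database."""
    counts = Counter(profiles)
    Nx = Counter(counts.values())
    n, N1, N2 = len(profiles), Nx[1], Nx[2]
    return math.log10(n * N1 / (2 * N2))

# Hypothetical database: 900 singletons, 46 doubletons, two species
# seen four times each (1000 profiles in total).
database = ([f"h{i}" for i in range(900)] + ["r1"] * 4 + ["r2"] * 4 +
            [f"d{i}" for i in range(46) for _ in (0, 1)])
est = good_turing_log10_LR(database)

# Nonparametric bootstrap: resample the database, recompute the estimate.
B = 200
boot = [good_turing_log10_LR(random.choices(database, k=len(database)))
        for _ in range(B)]
mean = sum(boot) / B
se = (sum((b - mean) ** 2 for b in boot) / (B - 1)) ** 0.5
print(est, se)
```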
17. Open problems
(1) Get results for these and other interesting functionals of MLE,
e.g., entropy
Conjecture: will see interesting convergence rates depending on
rate of decay of tail of q
Conjecture: interesting / useful theorems require q to depend on n
Problems of adaptation, coverage probability challenges
(2) Prove “semiparametric” bootstrap works for such functionals
(3) Design computationally feasible nonparametric Bayesian
approach (a confidence region for a likelihood ratio is an oxymoron)
and verify it has good frequentist properties;
or design hybrid (i.e., “empirical Bayes”) approaches = Bayes with
data-estimated hyperparameter
18. Remarks
Plug-in of the MLE to estimate E(Nx) almost exactly reproduces the
observed Nx: the order of size of sqrt(expected) minus sqrt(observed)
is ≈ ±1/2
Our approach is all the more useful when the case involves a match of a
not new but still (in the database) uncommon species:
the Good-Turing-inspired estimated LR is easy to generalise; it involves
Nx for “higher” x, hence better to smooth by replacing with MLE
predictions / estimates of expectations
20. • Convergence at rate n^{−(k−1)/2k} if θx ≈ C / x^k
• Convergence at rate n^{−1/2} if θx ≈ A exp(−B x^k)
Estimating a probability mass function with unknown labels
Dragi Anevski, Richard Gill, Stefan Zohren
Lund University, Leiden University, Oxford University
July 25, 2012 (last revised DA, 10:10 am CET)
1 The model
1.1 Introduction
Imagine an area inhabited by a population of animals which can be classified
by species. Which species actually live in the area (many of them previously
unknown to science) is a priori unknown. Let A denote the set of all possible
species potentially living in the area. For instance, if animals are identified by
their genetic code, then the species’ names α are “just” equivalence classes
of DNA sequences. The set of all possible DNA sequences is effectively
uncountably infinite, and for present purposes so is the set of equivalence
classes, each equivalence class defining one “potential” species.
Suppose that animals of species α ∈ A form a fraction θα ≥ 0 of the total
population of animals. The probabilities θ are completely unknown.
We show that an extended maximum likelihood estimator exists in
Appendix A of [3]. We next derive the almost sure consistency of (any)
extended maximum likelihood estimator θ̂.
Theorem 1. Let θ̂ = θ̂(n) be (any) extended maximum likelihood estimator.
Then for any δ > 0
P_{n,θ}(‖θ̂ − θ‖₁ > δ) ≤ (1/(√3 n)) e^{π√(2n/3)} e^{−nε²/2} (1 + o(1)) as n → ∞
where ε = δ/(8r) and r = r(θ, δ) is such that Σ_{i=r+1}^∞ θi ≤ δ/4.
Proof. Let Q_{θ,δ} be as in the statement of Lemma 1. Then there is an r
such that the conclusion of the lemma holds, i.e. for each n there is a set
A = An = { sup_{1≤x≤r} |f̂(n)_x − θx| ≤ ε }
such that Q_{n,θ′}(An) ≤ e^{−nε²/2} for all θ′ with ‖θ′ − θ‖₁ ≥ δ.
21. Tools
• Dvoretzky-Kiefer-Wolfowitz (DKW) inequality: for every ε > 0,
P_θ(sup_{x≥0} |F(n)(x) − F_θ(x)| ≥ ε) ≤ 2 e^{−2nε²},
where F_θ is the cumulative distribution function corresponding to θ and
F(n) the empirical distribution function based on i.i.d. data from F_θ
• “r-sort” (reverse sort = sort in decreasing order) is a contraction
mapping w.r.t. the sup norm
• If θ̂ is an extended ML estimator then dP_{n,θ̂}/dP_{n,θ} ≥ 1
• Hardy-Ramanujan: the number p(n) of partitions of n grows as
p(n) = (1/(4n√3)) e^{π√(2n/3)} (1 + o(1)), n → ∞
22. Key lemma
P, Q probability measures; p, q densities
• Find event A depending on P and 𝛿
• P (Ac) ≤ 𝜀
• Q (A) ≤ 𝜀 for all Q: d (Q, P) ≥ 𝛿
• Hence P ( p /q < 1) ≤ 2𝜀 if d (Q, P) ≥ 𝛿
Application:
P, Q are probability distributions of data,
depending on parameters 𝜃, 𝜙 respectively
A is event that Y/n is within 𝛿 of 𝜃 (sup norm)
d is L1 distance between 𝜃 and 𝜙
Proof:
P(p/q < 1)
= P({p/q < 1} ∩ Ac) + P({p/q < 1} ∩ A)
≤ P(Ac) + Q(A)
(on {p/q < 1} we have p < q, so the P-probability of {p/q < 1} ∩ A is at
most Q(A))
23. Preliminary steps
• By Dvoretzky-Kiefer-Wolfowitz, empirical relative frequencies
are close to true probabilities, in sup norm, ordering known
• By contraction property, same is true after monotone
reordering of empirical
• Apply key lemma, with as input:
• 1. Naive estimator is close to the truth (with large
probability)
• 2. Naive estimator is far from any particular distant non-
truth (with large probability)
24. Proof outline
• By key lemma, probability MLE is any particular q
distant from (true) p is very small
• By Hardy-Ramanujan, there are not many q to consider
• Therefore probability MLE is far from p is small
• Careful – sup norm on data, L1 norm on parameter
space, truncation of parameter vectors ...
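The Hardy-Ramanujan bound in this outline can be sanity-checked numerically; a minimal sketch comparing the exact partition function with the asymptotic formula:

```python
import math

def partition_count(n):
    """Exact number p(n) of partitions of n, by dynamic programming."""
    p = [1] + [0] * n
    for part in range(1, n + 1):        # allow parts of size `part`
        for total in range(part, n + 1):
            p[total] += p[total - part]
    return p[n]

def hardy_ramanujan(n):
    """Asymptotic formula p(n) ~ exp(pi*sqrt(2n/3)) / (4n*sqrt(3))."""
    return math.exp(math.pi * math.sqrt(2 * n / 3)) / (4 * n * math.sqrt(3))

n = 100
print(partition_count(n))    # 190569292
print(hardy_ramanujan(n))    # ~1.99e8: same order of magnitude
```

The count grows sub-exponentially in n, which is what makes the union bound over all candidate spectra affordable in the consistency proof.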
25. A practical question from the NFI (Ate Kloosterman):
It is a Y-chromosome identification case (MH17 disaster).
A nephew (in the paternal line) of a missing male individual and the unidentified victim share the
same Y-chromosome haplotype.
The evidential value of the autosomal DNA evidence for this ID is low (LR = 20).
The matching Y-STR haplotype is unique in a Dutch population sample (N = 2085) and in the
worldwide YHRD Y-STR database (containing 84 thousand profiles).
I know the distribution of the haplotype frequencies in the Dutch database (1826 haplotypes):

Distribution of haplotype frequencies
Database frequency of haplotype:   1     2    3   4  5  6  7  13
Number of haplotypes in database:  1650  130  30  7  5  2  1  1
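Plugging the table above into the Good-Turing-style estimate n·N1/(2·N2) of slide 14 (n = 2085 database profiles, N1 = 1650 singletons, N2 = 130 doubletons):

```python
import math

# Dutch Y-STR database, numbers taken from the table above.
n, N1, N2 = 2085, 1650, 130

LR_hat = n * N1 / (2 * N2)
print(round(LR_hat, 1))              # 13231.7
print(round(math.log10(LR_hat), 2))  # 4.12 bans
```

This is consistent with the bootstrap histogram on slide 28, which is centred near 4.1 bans.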
27. Observed and expected number of haplotypes, by database frequency

Frequency   1        2       3      4     5     6
O           1650     130     30     7     5     2
E           1650.49  129.64  29.14  8.99  3.77  1.75

[Hanging rootogram; x-axis: database frequency, y-axis: −1.0 to 1.0]
28. Bootstrap s.e. 0.040; Good-Turing asymptotic theory estimated s.e. 0.044
[Histogram: bootstrap distribution of Good-Turing log10 LR; x-axis: bootstrap
log10 LR (4.00 to 4.25), y-axis: probability density]