Bayes 250th versus Bayes 2.5.0
Christian P. Robert
Université Paris-Dauphine, University of Warwick, & CREST, Paris

written for EMS 2013, Budapest
Outline

Bayes, Thomas (1702–1761)
Jeffreys, Harold (1891–1989)
Lindley, Dennis (1923– )
Besag, Julian (1945–2010)
de Finetti, Bruno (1906–1985)
Bayes, Price and Laplace

Bayes, Thomas (1702–1761)
Bayes’ 1763 paper
Bayes’ example
Laplace’s 1774 derivation
Jeffreys, Harold (1891–1989)
Lindley, Dennis (1923– )
Besag, Julian (1945–2010)
de Finetti, Bruno (1906–1985)
a first Bayes 250
Took place in Edinburgh, Sept. 5–7, 2011:
Sparse Nonparametric Bayesian Learning from
Big Data David Dunson, Duke University
Classification Models and Predictions for Ordered
Data Chris Holmes, Oxford University
Bayesian Variable Selection in Markov Mixture
Models Luigi Spezia, Biomathematics
& Statistics Scotland, Aberdeen
Bayesian inference for partially observed Markov
processes, with application to systems biology
Darren Wilkinson, University of Newcastle
Coherent Inference on Distributed Bayesian
Expert Systems Jim Smith, University of Warwick

Bayesian Priors in the Brain Peggy Series,
University of Edinburgh
Approximate Bayesian Computation for model
selection Christian Robert, Université Paris-Dauphine
ABC-EP: Expectation Propagation for
Likelihood-free Bayesian Computation Nicholas
Chopin, CREST–ENSAE
Bayes at Edinburgh University - a talk and tour
Dr Andrew Fraser, Honorary Fellow, University of
Edinburgh

Probabilistic Programming John Winn, Microsoft
Research

Intractable likelihoods and exact approximate
MCMC algorithms Christophe Andrieu,
University of Bristol

How To Gamble If You Must (courtesy of the
Reverend Bayes) David Spiegelhalter, University
of Cambridge

Bayesian computational methods for intractable
continuous-time non-Gaussian time series Simon
Godsill, University of Cambridge

Inference and computing with decomposable
graphs Peter Green, University of Bristol

Efficient MCMC for Continuous Time Discrete
State Systems Yee Whye Teh, Gatsby
Computational Neuroscience Unit, University
College London

Nonparametric Bayesian Models for Sparse
Matrices and Covariances Zoubin Ghahramani,
University of Cambridge
Latent Force Models Neil Lawrence, University of
Sheffield
Does Bayes Theorem Work? Michael Goldstein,
Durham University

Adaptive Control and Bayesian Inference Carl
Rasmussen, University of Cambridge
Bernstein - von Mises theorem for irregular
statistical models Natalia Bochkina, University of
Edinburgh
Why Bayes 250?

Publication on Dec. 23, 1763 of
“An Essay towards solving a
Problem in the Doctrine of
Chances” by the late
Rev. Mr. Bayes, communicated
by Mr. Price in the Philosophical
Transactions of the Royal Society
of London.
250th anniversary of the Essay
Breaking news!!!
An accepted paper by Stephen Stigler in Statistical Science
uncovers the true title of the Essay:
A Method of
Calculating the Exact
Probability of All
Conclusions founded on
Induction
Intended as a reply to
Hume’s (1748) evaluation
of the probability of
miracles
Breaking news!!!
may have been written as early as 1749: “we may hope to
determine the Propositions, and, by degrees, the whole Nature
of unknown Causes, by a sufficient Observation of their
effects” (D. Hartley)
in 1767, Richard Price used
Bayes’ theorem as a tool to
attack Hume’s argument,
referring to the above title
Bayes’ offprints available at
Yale’s Beinecke Library (but
missing the title page) and
at the Library Company of
Philadelphia (Franklin’s
library)
[Stigler, 2013]
Bayes Theorem

Bayes theorem = Inversion of causes and effects
If A and E are events such that P(E) ≠ 0, P(A|E) and P(E|A) are related by
P(A|E) = P(E|A)P(A) / [P(E|A)P(A) + P(E|Aᶜ)P(Aᶜ)] = P(E|A)P(A) / P(E)
Bayes Theorem

Bayes theorem = Inversion of causes and effects
Continuous version for random variables X and Y:
f_{X|Y}(x|y) = f_{Y|X}(y|x) × f_X(x) / f_Y(y)
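A minimal numeric sketch of Bayes' theorem for events above (the prevalence and test error rates below are hypothetical, not from the slides):

# Bayes' theorem for events: "inverting" P(E|A) into P(A|E).
# A = condition present, E = positive test result; all numbers are illustrative.
p_A = 0.01                                  # prior P(A)
p_E_given_A = 0.95                          # P(E|A)
p_E_given_Ac = 0.05                         # P(E|A^c)

p_E = p_E_given_A * p_A + p_E_given_Ac * (1 - p_A)    # P(E) by total probability
p_A_given_E = p_E_given_A * p_A / p_E                 # Bayes' theorem
print(round(p_A_given_E, 3))                          # about 0.161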
Who was Thomas Bayes?
Reverend Thomas Bayes (ca. 1702–1761), educated in London
then at the University of Edinburgh (1719-1721), presbyterian
minister in Tunbridge Wells (Kent) from 1731, son of Joshua
Bayes, nonconformist minister.
“Election to the Royal Society based on
a tract of 1736 where he defended the
views and philosophy of Newton.
A notebook of his includes a method of
finding the time and place of
conjunction of two planets, notes on
weights and measures, a method of
differentiation, and logarithms.”
[Wikipedia]
Bayes' 1763 paper:

Billiard ball W rolled on a line of length one, with a uniform probability of stopping anywhere: W stops at p.
Second ball O then rolled n times under the same assumptions. X denotes the number of times the ball O stopped on the left of W.

Bayes' question:
Given X, what inference can we make on p?

Bayes' wording:
“Given the number of times in which an unknown event has happened and failed; Required the chance that the probability of its happening in a single trial lies somewhere between any two degrees of probability that can be named.”

Modern translation:
Derive the posterior distribution of p given X, when
p ∼ U([0, 1]) and X|p ∼ B(n, p)
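A minimal simulation sketch of this experiment (the numbers are hypothetical, not part of the original slides), checking that conditioning on X recovers the Beta posterior derived next:

# Simulate Bayes' billiard: p ~ U(0,1), X|p ~ B(n,p), then look at p given X = x_obs.
import numpy as np

rng = np.random.default_rng(0)
n, trials, x_obs = 10, 200_000, 3           # hypothetical settings
p = rng.uniform(size=trials)                # position of ball W
x = rng.binomial(n, p)                      # times ball O stops left of W
post_draws = p[x == x_obs]                  # draws from p | X = x_obs

a, b = x_obs + 1, n - x_obs + 1             # Beta(x+1, n-x+1) parameters
print(post_draws.mean(), a / (a + b))       # simulated vs exact posterior mean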
Resolution

Since
P(X = x|p) = C(n, x) p^x (1 − p)^(n−x),
P(a < p < b and X = x) = ∫_a^b C(n, x) p^x (1 − p)^(n−x) dp
and
P(X = x) = ∫_0^1 C(n, x) p^x (1 − p)^(n−x) dp,
Resolution (2)

then
P(a < p < b|X = x) = ∫_a^b C(n, x) p^x (1 − p)^(n−x) dp / ∫_0^1 C(n, x) p^x (1 − p)^(n−x) dp
                   = ∫_a^b p^x (1 − p)^(n−x) dp / B(x + 1, n − x + 1),
i.e.
p|x ∼ Be(x + 1, n − x + 1)
[Beta distribution]

In Bayes' words:
“The same things supposed, I guess that the probability of the event M lies somewhere between 0 and the ratio of Ab to AB, my chance to be in the right is the ratio of Abm to AiB.”
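A minimal sketch answering Bayes' question numerically with the Be(x+1, n−x+1) posterior just derived (the data and the two "degrees of probability" below are hypothetical):

# P(a < p < b | X = x) from the Beta posterior.
from scipy.stats import beta

n, x = 10, 3
a_deg, b_deg = 0.2, 0.5                     # hypothetical bounds on p
post = beta(x + 1, n - x + 1)
print(post.cdf(b_deg) - post.cdf(a_deg))    # Bayes' "chance to be in the right"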
Laplace's version

Pierre Simon (de) Laplace (1749–1827):
“Je me propose de déterminer la probabilité des causes par les événements, matière neuve à bien des égards et qui mérite d'autant plus d'être cultivée que c'est principalement sous ce point de vue que la science des hasards peut être utile à la vie civile.”
(I propose to determine the probability of causes from events, a subject new in many respects and all the more deserving of study since it is mainly from this point of view that the science of chances can be useful to civil life.)
[Mémoire sur la probabilité des causes par les événemens, 1774]
Laplace's version

“Si un événement peut être produit par un nombre n de causes différentes, les probabilités de l'existence de ces causes prises de l'événement, sont entre elles comme les probabilités de l'événement prises de ces causes, et la probabilité de l'existence de chacune d'elles, est égale à la probabilité de l'événement prise de cette cause, divisée par la somme de toutes les probabilités de l'événement prises de chacune de ces causes.”
(If an event can be produced by a number n of different causes, the probabilities of the existence of these causes given the event are to one another as the probabilities of the event given these causes, and the probability of the existence of each of them is equal to the probability of the event given that cause, divided by the sum of all the probabilities of the event given each of these causes.)
[Mémoire sur la probabilité des causes par les événemens, 1774]
Laplace's version

In modern terms: under a uniform prior,
P(A_i|E) / P(A_j|E) = P(E|A_i) / P(E|A_j)
and
f(x|y) = f(y|x) / ∫ f(y|x) dx
[Mémoire sur la probabilité des causes par les événemens, 1774]
Laplace's version

Later Laplace acknowledges Bayes by
“Bayes a cherché directement la probabilité que les possibilités indiquées par des expériences déjà faites sont comprises dans les limites données et il y est parvenu d'une manière fine et très ingénieuse.”
(Bayes sought directly the probability that the possibilities indicated by past experiments are contained within given limits, and he achieved this in a subtle and very ingenious way.)
[Essai philosophique sur les probabilités, 1810]
Another Bayes 250

Meeting that took place at the Royal Statistical Society, June
19-20, 2013, on the current state of Bayesian statistics
G. Roberts (University of Warwick) “Bayes for
differential equation models”
N. Best (Imperial College London) “Bayesian
space-time models for environmental
epidemiology”
D. Prangle (Lancaster University) “Approximate
Bayesian Computation”
P. Dawid (University of Cambridge), “Putting
Bayes to the Test”
M. Jordan (UC Berkeley) “Feature Allocations,
Probability Functions, and Paintboxes”
I. Murray (University of Edinburgh) “Flexible
models for density estimation”
M. Goldstein (Durham University) “Geometric
Bayes”
C. Andrieu (University of Bristol) “Inference with
noisy likelihoods”

A. Golightly (Newcastle University), “Auxiliary
particle MCMC schemes for partially observed
diffusion processes”
S. Richardson (MRC Biostatistics Unit)
“Biostatistics and Bayes”
C. Yau (Imperial College London)
“Understanding cancer through Bayesian
approaches”
S. Walker (University of Kent) “The Misspecified
Bayesian”
S. Wilson (Trinity College Dublin), “Linnaeus,
Bayes and the number of species problem”
B. Calderhead (UCL) “Probabilistic Integration
for Differential Equation Models”
P. Green (University of Bristol and UT Sydney)
“Bayesian graphical model determination”
The search for certain π

Bayes, Thomas (1702–1761)
Jeffreys, Harold (1891–1989)
Keynes’ treatise
Jeffreys’ prior distributions
Jeffreys’ Bayes factor
expected posterior priors
Lindley, Dennis (1923– )
Besag, Julian (1945–2010)
de Finetti, Bruno (1906–1985)
Keynes’ dead end

In John Maynard Keynes’s A Treatise on Probability (1921):

“I do not believe that there is
any direct and simple method by
which we can make the transition
from an observed numerical
frequency to a numerical measure
of probability.”
[Robert, 2011, ISR]
Keynes’ dead end

In John Maynard Keynes’s A Treatise on Probability (1921):
“Bayes’ enunciation is strictly
correct and its method of arriving
at it shows its true logical
connection with more
fundamental principles, whereas
Laplace’s enunciation gives it the
appearance of a new principle
specially introduced for the
solution of causal problems.”
[Robert, 2011, ISR]
Who was Harold Jeffreys?

Harold Jeffreys (1891–1989)
mathematician, statistician,
geophysicist, and astronomer.
Knighted in 1953 and Gold
Medal of the Royal Astronomical
Society in 1937. Founder of
modern British geophysics. Many
of his contributions are
summarised in his book The
Earth.

[Wikipedia]
Theory of Probability

The first modern and comprehensive treatise on (objective)
Bayesian statistics
Theory of Probability (1939)
begins with probability, refining
the treatment in Scientific
Inference (1937), and proceeds to
cover a range of applications
comparable to that in Fisher’s
book.
[Robert, Chopin & Rousseau, 2009, Stat. Science]
Jeffreys’ justifications

All probability statements are conditional
Actualisation of the information on θ by extracting the
information on θ contained in the observation x
The principle of inverse probability does correspond
to ordinary processes of learning (I, §1.5)
Allows incorporation of imperfect information in the decision
process
A probability number can be regarded as a
generalization of the assertion sign (I, §1.51).
Posterior distribution
Operates conditional upon the observations
Incorporates the requirement of the Likelihood Principle
...the whole of the information contained in the
observations that is relevant to the posterior
probabilities of different hypotheses is summed up in
the values that they give to the likelihood (II, §2.0).
Avoids averaging over the unobserved values of x
Coherent updating of the information available on θ,
independent of the order in which i.i.d. observations are
collected
...can be used as the prior probability in taking
account of a further set of data, and the theory can
therefore always take account of new information (I,
§1.5).
Provides a complete inferential scope
Subjective priors

Subjective nature of priors
Critics (...) usually say that the prior probability is
‘subjective’ (...) or refer to the vagueness of previous
knowledge as an indication that the prior probability
cannot be assessed (VIII, §8.0).
Long walk (from Laplace’s principle of insufficient reason) to a
reference prior:
A prior probability used to express ignorance is merely the
formal statement of ignorance (VIII, §8.1).
The fundamental prior

...if we took the prior probability density for the parameters to be proportional to ||g_ik||^{1/2} [= |I(θ)|^{1/2}], it could be stated for any law that is differentiable with respect to all parameters that the total probability in any region of the α_i would be equal to the total probability in the corresponding region of the α'_i; in other words, it satisfies the rule that equivalent propositions have the same probability (III, §3.10)
Note: Jeffreys never mentions Fisher information in connection with (g_ik)
The fundamental prior

In modern terms:
if I(θ) is the Fisher information matrix associated with the likelihood ℓ(θ|x),
I(θ) = E_θ[ ∂ log ℓ(θ|x)/∂θ · ∂ log ℓ(θ|x)/∂θ^T ],
the reference prior distribution is
π*(θ) ∝ |I(θ)|^{1/2}

Note: Jeffreys never mentions Fisher information in connection with (g_ik)
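A minimal symbolic sketch (not from the slides) of the rule π*(θ) ∝ |I(θ)|^{1/2} for a single Bernoulli(p) observation, recovering the Be(1/2, 1/2) prior:

# Jeffreys prior for a Bernoulli(p) observation, worked out with sympy.
import sympy as sp

p, y = sp.symbols('p y', positive=True)
loglik = y * sp.log(p) + (1 - y) * sp.log(1 - p)   # log-likelihood of one draw y in {0, 1}
score = sp.diff(loglik, p)

# Fisher information I(p) = E_p[score^2], expectation taken over y in {0, 1}
info = sp.simplify(score.subs(y, 1)**2 * p + score.subs(y, 0)**2 * (1 - p))
print(info)                                        # equal to 1/(p*(1 - p))
print(sp.sqrt(info))                               # Jeffreys prior ∝ p^(-1/2) (1-p)^(-1/2)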
Objective prior distributions

reference priors (Bayarri, Bernardo, Berger, ...)
not supposed to represent complete ignorance (Kass
& Wasserman, 1996)
The prior probabilities needed to express ignorance
of the value of a quantity to be estimated, where
there is nothing to call special attention to a
particular value are given by an invariance theory
(Jeffreys, VIII, §8.6).
often endowed with or seeking frequency-based properties
Jeffreys also proposed another Jeffreys prior dedicated to
testing (Bayarri & Garcia-Donato, 2007)
Jeffreys' Bayes factor
Definition (Bayes factor, Jeffreys, V, §5.01)
For testing hypothesis H0: θ ∈ Θ0 vs. Ha: θ ∉ Θ0,
B01 = [π(Θ0|x) / π(Θ0ᶜ|x)] / [π(Θ0) / π(Θ0ᶜ)] = ∫_{Θ0} f(x|θ) π0(θ) dθ / ∫_{Θ0ᶜ} f(x|θ) π1(θ) dθ

Equivalent to Bayes rule: acceptance if
B01 > {(1 − π(Θ0))/a1} / {π(Θ0)/a0}
What if... π0 is improper?!
[DeGroot, 1973; Berger, 1985; Marin & Robert, 2007]
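A minimal sketch (hypothetical data, not from the slides) of such a Bayes factor for H0: p = 1/2 against H1: p ∼ U(0, 1), with x successes in n Bernoulli trials:

# B_01 = m_0(x) / m_1(x) for a binomial point null.
from scipy.stats import binom
from scipy.integrate import quad

n, x = 20, 15                                      # hypothetical data
m0 = binom.pmf(x, n, 0.5)                          # marginal under H0 (p fixed at 1/2)
m1, _ = quad(lambda p: binom.pmf(x, n, p), 0, 1)   # marginal under H1, equals 1/(n+1)
print(m0 / m1)                                     # B_01 < 1 here: evidence against H0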
Expected posterior priors (example)
Starting from reference priors π0^N and π1^N, substitute by prior distributions π0 and π1 that solve the system of integral equations
π0(θ0) = ∫_X π0^N(θ0 | x) m1(x) dx
and
π1(θ1) = ∫_X π1^N(θ1 | x) m0(x) dx,
where x is an imaginary minimal training sample and m0, m1 are the marginals associated with π0 and π1 respectively:
m0(x) = ∫ f0(x|θ0) π0(dθ0)    m1(x) = ∫ f1(x|θ1) π1(dθ1)
[Pérez & Berger, 2000]
Existence/Unicity
Recurrence condition
When both the observations and the parameters in both models are continuous, if the Markov chain with transition
Q(θ0' | θ0) = ∫∫∫ g(θ0, θ0', θ1, x, x') dx dx' dθ1
where
g(θ0, θ0', θ1, x, x') = π0^N(θ0' | x') f1(x' | θ1) π1^N(θ1 | x) f0(x | θ0),
is recurrent, then there exists a solution to the integral equations, unique up to a multiplicative constant.
[Cano, Salmerón & Robert, 2008, 2013]
Bayesian testing of hypotheses
Bayes, Thomas (1702–1761)
Jeffreys, Harold (1891–1989)
Lindley, Dennis (1923– )
Lindley’s paradox
dual versions of the paradox
“Who should be afraid of the
Lindley–Jeffreys paradox?”
Bayesian resolutions
Besag, Julian (1945–2010)
de Finetti, Bruno (1906–1985)
Who is Dennis Lindley?

British statistician, decision theorist and
leading advocate of Bayesian statistics.
Held positions at Cambridge,
Aberystwyth, and UCL, retiring at the
early age of 54 to become an itinerant
scholar. Wrote four books and
numerous papers on Bayesian statistics.
“Coherence is everything”
Lindley's paradox

In a normal mean testing problem,
x̄_n ∼ N(θ, σ²/n),    H0: θ = θ0,
under Jeffreys prior, θ ∼ N(θ0, σ²), the Bayes factor
B01(t_n) = (1 + n)^{1/2} exp(−n t_n² / 2[1 + n]),
where t_n = √n |x̄_n − θ0| / σ, satisfies
B01(t_n) −→ ∞ as n −→ ∞
[assuming a fixed t_n]
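A minimal numerical sketch of this divergence, plugging a fixed (hypothetical) t_n into the slide's formula for B01(t_n):

# B_01(t_n) = (1+n)^{1/2} exp(-n t_n^2 / 2(1+n)) with t_n held fixed.
import numpy as np

t = 2.5                                            # hypothetical fixed value of t_n
for n in (10, 100, 1_000, 10_000, 100_000):
    b01 = np.sqrt(1 + n) * np.exp(-n * t**2 / (2 * (1 + n)))
    print(n, b01)                                  # grows like sqrt(n) exp(-t^2/2): H0 favoured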
Lindley's paradox

Often dubbed Jeffreys–Lindley paradox...
In terms of
t = √(n − 1) x̄ / s,    ν = n − 1,
K ∼ √(πν/2) (1 + t²/ν)^{−ν/2 + 1/2}.
(...) The variation of K with t is much more important than the variation with ν (Jeffreys, V, §5.2).
Two versions of the paradox
“the weight of Lindley's paradoxical result (...) burdens proponents of the Bayesian practice”.
[Lad, 2003]
official version, opposing frequentist and Bayesian assessments
[Lindley, 1957]
intra-Bayesian version, blaming vague and improper priors for the Bayes factor misbehaviour:
if π1(·|σ) depends on a scale parameter σ, it is often the case that
B01(x) −→ +∞ as σ −→ ∞
for a given x, meaning H0 is always accepted
[Robert, 1992, 2013]
Evacuation of the first version

Two paradigms [(b) versus (f)]
one (b) operates on the parameter space Θ, while the other
(f) is produced from the sample space
one (f) relies solely on the point-null hypothesis H0 and the
corresponding sampling distribution, while the other
(b) opposes H0 to a (predictive) marginal version of H1
one (f) could reject “a hypothesis that may be true (...)
because it has not predicted observable results that have not
occurred” (Jeffreys, VII, §7.2) while the other (b) conditions
upon the observed value xobs
one (f) resorts to an arbitrary fixed bound α on the p-value, while the other (b) refers to the boundary probability of 1/2
More arguments on the first version

observing a constant tn as n increases is of limited interest:
under H0 tn has limiting N(0, 1) distribution, while, under H1
tn a.s. converges to ∞
behaviour that remains entirely compatible with the
consistency of the Bayes factor, which a.s. converges either to
0 or ∞, depending on which hypothesis is true.
Consequent literature (e.g., Berger & Sellke, 1987) has since then
shown how divergent those two approaches could be (to the point
of being asymptotically incompatible).
[Robert, 2013]
Nothing's wrong with the second version
n, prior's scale factor: prior variance n times larger than the observation variance and, when n goes to ∞, the Bayes factor goes to ∞ no matter what the observation is
n becomes what Lindley (1957) calls “a measure of lack of conviction about the null hypothesis”
when prior diffuseness under H1 increases, the only relevant information becomes that θ could be equal to θ0, and this overwhelms any evidence to the contrary contained in the data
mass of the prior distribution in the vicinity of any fixed neighbourhood of the null hypothesis vanishes to zero under H1
[Robert, 2013]
deep coherence in the outcome: being indecisive about the alternative hypothesis means we should not choose it
“Who should be afraid of the Lindley–Jeffreys paradox?”

Recent publication by A. Spanos with the above title:
the paradox is presented as evidence against Bayesian and likelihood resolutions of the problem, for failing to account for the large sample size
the failure of all three main paradigms leads Spanos to advocate Mayo's and Spanos's “postdata severity evaluation”
[Spanos, 2013]
“Who should be afraid of the Lindley–Jeffreys paradox?”
Recent publication by A. Spanos with above title:
“the postdata severity evaluation
(...) addresses the key problem with
Fisherian p-values in the sense that
the severity evaluation provides the
“magnitude” of the warranted
discrepancy from the null by taking
into account the generic capacity of
the test (that includes n) in question
as it relates to the observed
data”(p.88)
[Spanos, 2013]
On some resolutions of the second version
use of pseudo-Bayes factors, fractional Bayes factors, &tc, which lack proper Bayesian justification
[Berger & Pericchi, 2001]
use of identical improper priors on nuisance parameters, a
notion already entertained by Jeffreys
[Berger et al., 1998; Marin & Robert, 2013]
use of the posterior predictive distribution, which uses the
data twice (see also Aitkin’s (2010) integrated likelihood)
[Gelman, Rousseau & Robert, 2013]
use of score functions extending the log score function
log B12 (x) = log m1 (x) − log m2 (x) = S0 (x, m1 ) − S0 (x, m2 ) ,
that are independent of the normalising constant
[Dawid et al., 2013]
Bayesian computing (R)evolution

Bayes, Thomas (1702–1761)
Jeffreys, Harold (1891–1989)
Lindley, Dennis (1923– )
Besag, Julian (1945–2010)
Besag’s early contributions
MCMC revolution and beyond
de Finetti, Bruno (1906–1985)
computational jam

In the 1970’s and early 1980’s, theoretical foundations of Bayesian
statistics were sound, but methodology was lagging for lack of
computing tools.
restriction to conjugate priors
limited complexity of models
small sample sizes
The field was desperately in need of a new computing paradigm!
[Robert & Casella, 2012]
MCMC as in Markov Chain Monte Carlo

Notion that i.i.d. simulation is definitely not necessary, all that
matters is the ergodic theorem
Realization that Markov chains could be used in a wide variety of
situations only came to mainstream statisticians with Gelfand and
Smith (1990) despite earlier publications in the statistical literature
like Hastings (1970) and growing awareness in spatial statistics
(Besag, 1986)
Reasons:
lack of computing machinery
lack of background on Markov chains
lack of trust in the practicality of the method
Who was Julian Besag?

British statistician known chiefly for his
work in spatial statistics (including its
applications to epidemiology, image
analysis and agricultural science), and
Bayesian inference (including Markov
chain Monte Carlo algorithms).
Lecturer in Liverpool and Durham, then
professor in Durham and Seattle.
[Wikipedia]
pre-Gibbs/pre-Hastings era

Early 1970's, Hammersley, Clifford, and Besag were working on the specification of joint distributions from conditional distributions and on necessary and sufficient conditions for the conditional distributions to be compatible with a joint distribution.
[Hammersley and Clifford, 1971]
“What is the most general form of the conditional probability functions that define a coherent joint function? And what will the joint look like?”
[Besag, 1972]
Hammersley-Clifford[-Besag] theorem

Theorem (Hammersley-Clifford)
Joint distribution of a vector associated with a dependence graph must be represented as a product of functions over the cliques of the graph, i.e., of functions depending only on the components indexed by the labels in the clique.
[Cressie, 1993; Lauritzen, 1996]
Hammersley-Clifford[-Besag] theorem

Theorem (Hammersley-Clifford)
A probability distribution P with positive and continuous density f
satisfies the pairwise Markov property with respect to an
undirected graph G if and only if it factorizes according to G, i.e.,
(F ) ≡ (G )

[Cressie, 1993; Lauritzen, 1996]
Hammersley-Clifford[-Besag] theorem

Theorem (Hammersley-Clifford)
Under the positivity condition, the joint distribution g satisfies
g(y_1, . . . , y_p) ∝ ∏_{j=1}^{p} g_j(y_j | y_1, . . . , y_{j−1}, y'_{j+1}, . . . , y'_p) / g_j(y'_j | y_1, . . . , y_{j−1}, y'_{j+1}, . . . , y'_p)
for every permutation on {1, 2, . . . , p} and every y' ∈ Y.
[Cressie, 1993; Lauritzen, 1996]
To Gibbs or not to Gibbs?

Julian Besag should certainly be credited to a large extent with the (re?-)discovery of the Gibbs sampler.
“The simulation procedure is to consider the sites cyclically and, at each stage, to amend or leave unaltered the particular site value in question, according to a probability distribution whose elements depend upon the current value at neighboring sites (...) However, the technique is unlikely to be particularly helpful in many other than binary situations and the Markov chain itself has no practical interpretation.”
[Besag, 1974]
Clicking in

After Peskun (1973), MCMC mostly dormant in mainstream
statistical world for about 10 years, then several papers/books
highlighted its usefulness in specific settings:
Geman and Geman (1984)
Besag (1986)
Strauss (1986)
Ripley (Stochastic Simulation, 1987)
Tanner and Wong (1987)
Younes (1988)
Enters the Gibbs sampler

Geman and Geman (1984), building on
Metropolis et al. (1953), Hastings (1970), and
Peskun (1973), constructed a Gibbs sampler
for optimisation in a discrete image processing
problem with a Gibbs random field without
completion.
Back to Metropolis et al., 1953: the Gibbs
sampler is already in use therein and ergodicity
is proven on the collection of global maxima
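A minimal sketch of a Gibbs sampler (a toy bivariate normal target, not Geman and Geman's image model), alternating draws from the two full conditionals:

# Two-block Gibbs sampler for (X, Y) bivariate normal with correlation rho.
import numpy as np

rng = np.random.default_rng(1)
rho, n_iter = 0.8, 5_000
x = y = 0.0
draws = np.empty((n_iter, 2))

for t in range(n_iter):
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))   # X | Y=y ~ N(rho*y, 1 - rho^2)
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))   # Y | X=x ~ N(rho*x, 1 - rho^2)
    draws[t] = x, y

print(np.corrcoef(draws[1_000:].T)[0, 1])          # close to rho after burn-in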
Besag (1986) integrates GS for SA...

“...easy to construct the transition matrix Q, of a
discrete time Markov chain, with state space Ω and limit
distribution (4). Simulated annealing proceeds by
running an associated time inhomogeneous Markov chain
with transition matrices QT , where T is progressively
decreased according to a prescribed “schedule” to a value
close to zero.”
[Besag, 1986]
...and links with Metropolis-Hastings...

“There are various related methods of constructing a
manageable QT (Hastings, 1970). Geman and Geman
(1984) adopt the simplest, which they term the ”Gibbs
sampler” (...) time reversibility, a common ingredient in
this type of problem (see, for example, Besag, 1977a), is
present at individual stages but not over complete cycles,
though Peter Green has pointed out that it returns if QT
is taken over a pair of cycles, the second of which visits
pixels in reverse order”
[Besag, 1986]
The candidate's formula
Representation of the marginal likelihood as
m(x) = π(θ) f(x|θ) / π(θ|x)
or of the marginal predictive as
p_n(y'|y) = f(y'|θ) π_n(θ|y) / π_{n+1}(θ|y, y')
[Besag, 1989]

Why candidate?
“Equation (2) appeared without explanation in a Durham University undergraduate final examination script of 1984. Regrettably, the student's name is no longer known to me.”
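A minimal conjugate check of the candidate's formula (a hypothetical beta-binomial example, not from the slides): the ratio π(θ)f(x|θ)/π(θ|x) is free of θ and equals the marginal likelihood m(x):

# Candidate's formula on a beta-binomial model with Be(a, b) prior.
from scipy.stats import beta, binom
from scipy.special import betaln, comb
import numpy as np

a, b, n, x = 2.0, 3.0, 10, 7                       # hypothetical prior and data
theta = 0.4                                        # arbitrary evaluation point

m_candidate = (beta.pdf(theta, a, b) * binom.pmf(x, n, theta)
               / beta.pdf(theta, a + x, b + n - x))          # prior * likelihood / posterior
m_exact = comb(n, x) * np.exp(betaln(a + x, b + n - x) - betaln(a, b))
print(m_candidate, m_exact)                        # identical up to rounding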
Implications

Newton and Raftery (1994) used this representation to derive the [infamous] harmonic mean approximation to the marginal likelihood
Gelfand and Dey (1994) also relied on this formula for the same purpose in a more general perspective
Geyer and Thompson (1995) derived MLEs by a Monte Carlo approximation to the normalising constant
Chib (1995) uses this representation to build an MCMC approximation to the marginal likelihood
Marin and Robert (2010) and Robert and Wraith (2009) corrected Newton and Raftery (1994) by restricting the importance function to an HPD region
[Chen, Shao & Ibrahim, 2000]
Removing the jam

In the early 1990s, researchers found that Gibbs and then Metropolis–Hastings algorithms would crack almost any problem!
Flood of papers followed applying MCMC:
linear mixed models (Gelfand & al., 1990; Zeger & Karim, 1991;
Wang & al., 1993, 1994)
generalized linear mixed models (Albert & Chib, 1993)
mixture models (Tanner & Wong, 1987; Diebolt & Robert., 1990,
1994; Escobar & West, 1993)
changepoint analysis (Carlin & al., 1992)
point processes (Grenander & Møller, 1994)
&tc
Removing the jam
In the early 1990s, researchers found that Gibbs and then Metropolis–Hastings algorithms would crack almost any problem!
Flood of papers followed applying MCMC:
genomics (Stephens & Smith, 1993; Lawrence & al., 1993;
Churchill, 1995; Geyer & Thompson, 1995; Stephens & Donnelly,
2000)
ecology (George & Robert, 1992)
variable selection in regression (George & McCulloch, 1993; Green,
1995; Chen & al., 2000)
spatial statistics (Raftery & Banfield, 1991; Besag & Green, 1993)
longitudinal studies (Lange & al., 1992)
&tc
MCMC and beyond

reversible jump MCMC, which considerably impacted Bayesian model choice (Green, 1995)
adaptive MCMC algorithms (Haario & al., 1999; Roberts
& Rosenthal, 2009)
exact approximations to targets (Tanner & Wong, 1987;
Beaumont, 2003; Andrieu & Roberts, 2009)
particle filters with application to sequential statistics,
state-space models, signal processing, &tc. (Gordon & al.,
1993; Doucet & al., 2001; del Moral & al., 2006)
MCMC and beyond beyond

comp’al stats catching up with comp’al physics: free energy
sampling (e.g., Wang-Landau), Hamiltonian Monte Carlo
(Girolami & Calderhead, 2011)
sequential Monte Carlo (SMC) for non-sequential problems
(Chopin, 2002; Neal, 2001; Del Moral et al 2006)
retrospective sampling
intractability: EP – GIMH – PMCMC – SMC2 – INLA
QMC[MC] (Owen, 2011)
Particles

Iterating/sequential importance sampling is about as old as Monte
Carlo methods themselves!
[Hammersley and Morton, 1954; Rosenbluth and Rosenbluth, 1955]

Found in the molecular simulation literature of the 50’s with
self-avoiding random walks and signal processing
[Marshall, 1965; Handschin and Mayne, 1969]
Use of the term “particle” dates back to Kitagawa (1996), and Carpenter
et al. (1997) coined the term “particle filter”.
pMC & pMCMC

Recycling of past simulations legitimate to build better importance sampling functions as in population Monte Carlo
[Iba, 2000; Cappé et al., 2004; Del Moral et al., 2007]
synthesis by Andrieu, Doucet, and Holenstein (2010) using particles to build an evolving MCMC kernel p̂θ(y1:T) in state-space models p(x1:T) p(y1:T|x1:T)
importance sampling on discretely observed diffusions
[Beskos et al., 2006; Fearnhead et al., 2008, 2010]
towards ever more complexity

Bayes, Thomas (1702–1761)
Jeffreys, Harold (1891–1989)
Lindley, Dennis (1923– )
Besag, Julian (1945–2010)
de Finetti, Bruno (1906–1985)
de Finetti’s exchangeability theorem
Bayesian nonparametrics
Bayesian analysis in a Big Data era
Who was Bruno de Finetti?
“Italian probabilist, statistician and actuary, noted for the “operational subjective” conception of probability. The classic exposition of his distinctive theory is the 1937 “La prévision: ses lois logiques, ses sources subjectives,” which discussed probability founded on the coherence of betting odds and the consequences of exchangeability.”
[Wikipedia]
Chair in Financial Mathematics at Trieste University (1939) and Roma (1954) then in Calculus of Probabilities (1961). Most famous sentence:
“Probability does not exist”
Exchangeability
Notion of exchangeable sequences:
A random sequence (x1, . . . , xn, . . .) is exchangeable if for any n the distribution of (x1, . . . , xn) is equal to the distribution of any permutation of the sequence (xσ1, . . . , xσn)
de Finetti's theorem (1937):
An exchangeable distribution is a mixture of iid distributions
p(x1, . . . , xn) = ∫ ∏_{i=1}^{n} f(xi|G) dπ(G)
where G can be infinite-dimensional
Extension to Markov chains (Freedman, 1962; Diaconis & Freedman, 1980)
Bayesian nonparametrics

Based on de Finetti’s representation,
use of priors on functional spaces (densities, regression, trees,
partitions, clustering, &tc)
production of Bayes estimates in those spaces
convergence mileage may vary
available efficient (MCMC) algorithms to conduct
non-parametric inference
[van der Vaart, 1998; Hjort et al., 2010; Müller & Rodriguez, 2013]
Dirichlet processes

One of the earliest examples of priors on distributions
[Ferguson, 1973]

stick-breaking construction of D(α0, G0):
generate βk ∼ Be(1, α0)
define π1 = β1 and πk = ∏_{j=1}^{k−1} (1 − βj) βk
generate θk ∼ G0
derive G = Σ_k πk δ_{θk} ∼ D(α0, G0)
[Sethuraman, 1994]
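A minimal sketch of this stick-breaking construction, truncated at K atoms and with a standard normal base measure G0 (both hypothetical choices, not from the slides):

# Truncated stick-breaking draw from D(alpha0, G0) with G0 = N(0, 1).
import numpy as np

rng = np.random.default_rng(2)
alpha0, K = 2.0, 500

betas = rng.beta(1.0, alpha0, size=K)                              # beta_k ~ Be(1, alpha0)
pis = betas * np.concatenate(([1.0], np.cumprod(1 - betas)[:-1]))  # pi_k = beta_k prod_{j<k}(1 - beta_j)
thetas = rng.normal(size=K)                                        # theta_k ~ G0

# G = sum_k pi_k delta_{theta_k}; sample a few points from the (truncated) G
print(pis.sum())                                   # close to 1 for large K
print(rng.choice(thetas, size=5, p=pis / pis.sum()))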
Chinese restaurant process
If we assume
G ∼ D(α0, G0),    θi ∼ G,
then the marginal distribution of (θ1, . . .) is a Chinese restaurant process (Pólya urn model), which is exchangeable. In particular,
θi | θ1:i−1 ∼ α0 G0 + Σ_{j=1}^{i−1} δ_{θj}
(up to normalisation)
Posterior distribution built by MCMC
[Escobar and West, 1992]
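A minimal sketch of the Pólya-urn predictive above (hypothetical α0 and G0 = N(0, 1)): each new θi is a fresh draw from G0 with probability α0/(α0 + i − 1), or a copy of an earlier θj otherwise:

# Chinese restaurant process draw of theta_1, ..., theta_n.
import numpy as np

rng = np.random.default_rng(4)
alpha0, n = 1.0, 200
thetas = []

for i in range(n):                                 # i previous values already drawn
    if rng.uniform() < alpha0 / (alpha0 + i):
        thetas.append(rng.normal())                # new value from G0
    else:
        thetas.append(thetas[rng.integers(i)])     # repeat a past value, chosen uniformly

print(len(set(thetas)))                            # number of distinct values (clusters)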
Many alternatives

truncated Dirichlet processes
Pitman Yor processes
completely random measures
normalized random measures with independent increments
(NRMI)
[Müller and Mitra, 2013]
Theoretical advances

posterior consistency: seminal work of Schwartz (1965) in the iid case and extension by Barron et al. (1999) for general consistency
consistency rates: Ghosal & van der Vaart (2000) and Ghosal et al. (2008), with minimax (adaptive) Bayesian nonparametric estimators for nonparametric process mixtures (Gaussian, Beta) (Rousseau, 2008; Kruijer, Rousseau & van der Vaart, 2010; Shen, Tokdar & Ghosal, 2013; Scricciolo, 2013)
Bernstein-von Mises theorems: (Castillo, 2011; Rivoirard
& Rousseau, 2012; Kleijn & Bickel, 2013; Castillo
& Rousseau, 2013)
recent extensions to semiparametric models
Consistency and posterior concentration rates

Posterior
dπ(θ|X^n) = fθ(X^n) dπ(θ) / m(X^n),    m(X^n) = ∫_Θ fθ(X^n) dπ(θ)
and posterior concentration: under Pθ0,
P^π[d(θ, θ0) ≤ ε | X^n] = 1 + op(1),    P^π[d(θ, θ0) ≤ εn | X^n] = 1 + op(1)
Given ε: consistency
Setting εn ↓ 0: consistency rates
where d(θ, θ') is a loss function, e.g. Hellinger, L1, L2, L∞
Bernstein–von Mises theorems

Parameter of interest
ψ = ψ(θ) ∈ R^d,    d < +∞    (with dim(θ) = +∞)
BVM:
π[√n (ψ − ψ̂) ≤ z | X^n] = Φ(z / √V0) + op(1)    under Pθ0
and
√n (ψ̂ − ψ(θ0)) ≈ N(0, V0)    under Pθ0
[Doob, 1949; Le Cam, 1986; van der Vaart, 1998]
New challenges

Novel statistical issues that force a different Bayesian answer:
very large datasets
complex or unknown dependence structures with maybe p ≫ n
multiple and involved random effects
missing data structures containing most of the information
sequential structures involving most of the above
New paradigm?

“Surprisingly, the confident prediction of the previous
generation that Bayesian methods would ultimately supplant
frequentist methods has given way to a realization that Markov
chain Monte Carlo (MCMC) may be too slow to handle
modern data sets. Size matters because large data sets stress
computer storage and processing power to the breaking point.
The most successful compromises between Bayesian and
frequentist methods now rely on penalization and
optimization.”

[Lange et al., ISR, 2013]
New paradigm?

Observe (Xi, Ri, Yi Ri) where
Xi ∼ U([0, 1]^d),    Ri|Xi ∼ B(π(Xi))    and    Yi|Xi ∼ B(θ(Xi))
(π(·) is known and θ(·) is unknown)
Then any estimator of E[Y] that does not depend on π is inconsistent.
There is no genuine Bayesian answer producing a consistent estimator (without throwing away part of the data)
[Robins & Wasserman, 2000, 2013]
New paradigm?

sad reality constraint that
size does matter
focus on much smaller
dimensions and on sparse
summaries
many (fast if non-Bayesian)
ways of producing those
summaries
Bayesian inference can kick
in almost automatically at
this stage
Approximate Bayesian computation (ABC)
Case of a well-defined statistical model where the likelihood function
ℓ(θ|y) = f(y1, . . . , yn|θ)
is out of reach!
Empirical approximations to the original Bayesian inference problem
Degrading the data precision down to a tolerance ε
Replacing the likelihood with a non-parametric approximation
Summarising/replacing the data with insufficient statistics
ABC methodology
Bayesian setting: target is π(θ) f(x|θ)
When likelihood f(x|θ) not in closed form, likelihood-free rejection technique:

Foundation
For an observation y ∼ f(y|θ), under the prior π(θ), if one keeps jointly simulating
θ' ∼ π(θ),    z ∼ f(z|θ'),
until the auxiliary variable z is equal to the observed value, z = y, then the selected
θ' ∼ π(θ|y)
[Rubin, 1984; Diggle & Gratton, 1984; Griffith et al., 1997]
ABC algorithm

In most implementations, degree of approximation:
Algorithm 1 Likelihood-free rejection sampler
for i = 1 to N do
  repeat
    generate θ' from the prior distribution π(·)
    generate z from the likelihood f(·|θ')
  until ρ{η(z), η(y)} ≤ ε
  set θi = θ'
end for
where η(y) defines a (not necessarily sufficient) statistic
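A minimal sketch of Algorithm 1 on a toy normal-mean problem (where the likelihood is of course available), with η the sample mean, ρ the absolute difference, and a hypothetical tolerance ε:

# Likelihood-free rejection sampler, toy version.
import numpy as np

rng = np.random.default_rng(3)
n_obs, eps, N = 50, 0.1, 500

y = rng.normal(2.0, 1.0, size=n_obs)               # observed data (true theta = 2)
eta_y = y.mean()                                   # summary statistic eta(y)

accepted = []
while len(accepted) < N:
    theta = rng.normal(0.0, 10.0)                  # theta' from the prior N(0, 10^2)
    z = rng.normal(theta, 1.0, size=n_obs)         # z from the likelihood f(.|theta')
    if abs(z.mean() - eta_y) <= eps:               # rho{eta(z), eta(y)} <= epsilon
        accepted.append(theta)

print(np.mean(accepted), np.std(accepted))         # approximate posterior mean and sd of theta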
Comments

role of distance paramount (because ε ≠ 0)
scaling of components of η(y) also capital
ε matters little if “small enough”
representative of “curse of dimensionality”
small is beautiful!, i.e. data as a whole may be weakly informative for ABC
non-parametric method at core
ABC simulation advances

Simulating from the prior is often poor in efficiency
Either modify the proposal distribution on θ to increase the density of x's within the vicinity of y...
[Marjoram et al., 2003; Beaumont et al., 2009; Del Moral et al., 2012]
...or view the problem as one of conditional density estimation and develop techniques to allow for larger ε
[Beaumont et al., 2002; Blum & François, 2010; Biau et al., 2013]
...or even include ε in the inferential framework [ABCµ]
[Ratmann et al., 2009]
ABC as an inference machine

Starting point is summary statistic η(y), either chosen for
computational realism or imposed by external constraints
ABC can produce a distribution on the parameter of interest
conditional on this summary statistic η(y)
inference based on ABC may be consistent or not, so it needs
to be validated on its own
the choice of the tolerance level is dictated by both
computational and convergence constraints
How Bayesian aBc is..?

At best, ABC approximates π(θ|η(y)):
approximation error unknown (w/o massive simulation)
pragmatic or empirical Bayes (there is no other solution!)
many calibration issues (tolerance, distance, statistics)
the NP side should be incorporated into the whole Bayesian
picture
the approximation error should also be part of the Bayesian
inference
Noisy ABC
ABC approximation error (under non-zero tolerance ε) replaced with exact simulation from a controlled approximation to the target, a convolution of the true posterior with a kernel function
πε(θ, z|y) = π(θ) f(z|θ) Kε(y − z) / ∫∫ π(θ) f(z|θ) Kε(y − z) dz dθ,
with Kε a kernel parameterised by bandwidth ε.
[Wilkinson, 2013]

Theorem
The ABC algorithm based on a randomised observation ỹ = y + ξ, ξ ∼ Kε, and an acceptance probability of
Kε(y − z)/M
gives draws from the posterior distribution π(θ|y).
Which summary?

Fundamental difficulty of the choice of the summary statistic when there are no non-trivial sufficient statistics [except when done by the experimenters in the field]
Loss of statistical information balanced against gain in data roughening
Approximation error and information loss remain unknown
Choice of statistics induces choice of distance function
towards standardisation: borrowing tools from data analysis (LDA) and machine learning
[Estoup et al., ME, 2012]
may be imposed for external/practical reasons
may gather several non-B point estimates
we can learn about efficient combination
distance can be provided by estimation techniques
Which summary for model choice?
‘This is also why focus on model discrimination typically (...) proceeds by (...) accepting that the Bayes Factor that one obtains is only derived from the summary statistics and may in no way correspond to that of the full model.'
[Scott Sisson, Jan. 31, 2011, xianblog]

Depending on the choice of η(·), the Bayes factor based on this insufficient statistic,
B^η_12(y) = ∫ π1(θ1) f1^η(η(y)|θ1) dθ1 / ∫ π2(θ2) f2^η(η(y)|θ2) dθ2,
is either consistent or inconsistent
[Robert et al., PNAS, 2012]
[Figure: two panels (n = 100) comparing the Gauss and Laplace models]
Selecting proper summaries
Consistency only depends on the range of
µi (θ) = Ei [η(y)]
under both models against the asymptotic mean µ0 of η(y)

Theorem
If Pn belongs to one of the two models and if µ0 cannot be
attained by the other one :
0 = min (inf{|µ0 − µi (θi )|; θi ∈ Θi }, i = 1, 2)
< max (inf{|µ0 − µi (θi )|; θi ∈ Θi }, i = 1, 2) ,
then the Bayes factor B^η_12 is consistent

[Marin et al., 2012]
Selecting proper summaries
Consistency only depends on the range of
µi(θ) = Ei[η(y)]
under both models against the asymptotic mean µ0 of η(y)
[Figure: panels comparing the two models M1 and M2]
[Marin et al., 2012]
on some Bayesian open problems
In 2011, Michael Jordan, then ISBA President, conducted a
mini-survey on Bayesian open problems:
Nonparametrics and semiparametrics: assessing and validating priors on infinite-dimensional spaces with an infinite number of nuisance parameters
Priors: elicitation mechanisms and strategies to get the prior from the likelihood or even from the posterior distribution
Bayesian/frequentist relationships: how far should one reach
for frequentist validation?
Computation and statistics: computational abilities should be
part of the modelling, with some expressing doubts about
INLA and ABC
Model selection and hypothesis testing: still unsettled
opposition between model checking, model averaging and
model selection
[Jordan, ISBA Bulletin, March 2011]
yet another Bayes 250

Meeting that will take place at Duke University, December 17:
Stephen Fienberg, Carnegie-Mellon
University
Michael Jordan, University of
California, Berkeley
Christopher Sims, Princeton University
Adrian Smith, University of London
Stephen Stigler, University of Chicago
Sharon Bertsch McGrayne, author of “The Theory That Would Not Die”

Robert

  • 1. Bayes 250th versus Bayes 2.5.0 Christian P. Robert Universit´ Paris-Dauphine, University of Warwick, & CREST, Paris e written for EMS 2013, Budapest
  • 2. Outline Bayes, Thomas (1702–1761) Jeffreys, Harold (1891–1989) Lindley, Dennis (1923– ) Besag, Julian (1945–2010) de Finetti, Bruno (1906–1985)
  • 3. Bayes, Price and Laplace Bayes, Thomas (1702–1761) Bayes’ 1763 paper Bayes’ example Laplace’s 1774 derivation Jeffreys, Harold (1891–1989) Lindley, Dennis (1923– ) Besag, Julian (1945–2010) de Finetti, Bruno (1906–1985)
  • 4. a first Bayes 250 Took place in Edinburgh, Sept. 5–7, 2011: Sparse Nonparametric Bayesian Learning from Big Data David Dunson, Duke University Classification Models and Predictions for Ordered Data Chris Holmes, Oxford University Bayesian Variable Selection in Markov Mixture Models Luigi Spezia, Biomathematics & Statistics Scotland, Aberdeen Bayesian inference for partially observed Markov processes, with application to systems biology Darren Wilkinson, University of Newcastle Coherent Inference on Distributed Bayesian Expert Systems Jim Smith, University of Warwick Bayesian Priors in the Brain Peggy Series, University of Edinburgh Approximate Bayesian Computation for model selection Christian Robert, Universit´ e Paris-Dauphine ABC-EP: Expectation Propagation for Likelihood-free Bayesian Computation Nicholas Chopin, CREST–ENSAE Bayes at Edinburgh University - a talk and tour Dr Andrew Fraser, Honorary Fellow, University of Edinburgh Probabilistic Programming John Winn, Microsoft Research Intractable likelihoods and exact approximate MCMC algorithms Christophe Andrieu, University of Bristol How To Gamble If You Must (courtesy of the Reverend Bayes) David Spiegelhalter, University of Cambridge Bayesian computational methods for intractable continuous-time non-Gaussian time series Simon Godsill, University of Cambridge Inference and computing with decomposable graphs Peter Green, University of Bristol Efficient MCMC for Continuous Time Discrete State Systems Yee Whye Teh, Gatsby Computational Neuroscience Unit, University College London Nonparametric Bayesian Models for Sparse Matrices and Covariances Zoubin Gharamani, University of Cambridge Latent Force Models Neil Lawrence, University of Sheffield Does Bayes Theorem Work? Michael Goldstein, Durham University Adaptive Control and Bayesian Inference Carl Rasmussen, University of Cambridge Bernstein - von Mises theorem for irregular statistical models Natalia Bochkina, University of Edinburgh
  • 5. Why Bayes 250? Publication on Dec. 23, 1763 of “An Essay towards solving a Problem in the Doctrine of Chances” by the late Rev. Mr. Bayes, communicated by Mr. Price in the Philosophical Transactions of the Royal Society of London. c 250th anniversary of the Essay
  • 6. Why Bayes 250? Publication on Dec. 23, 1763 of “An Essay towards solving a Problem in the Doctrine of Chances” by the late Rev. Mr. Bayes, communicated by Mr. Price in the Philosophical Transactions of the Royal Society of London. c 250th anniversary of the Essay
  • 7. Why Bayes 250? Publication on Dec. 23, 1763 of “An Essay towards solving a Problem in the Doctrine of Chances” by the late Rev. Mr. Bayes, communicated by Mr. Price in the Philosophical Transactions of the Royal Society of London. c 250th anniversary of the Essay
  • 8. Why Bayes 250? Publication on Dec. 23, 1763 of “An Essay towards solving a Problem in the Doctrine of Chances” by the late Rev. Mr. Bayes, communicated by Mr. Price in the Philosophical Transactions of the Royal Society of London. c 250th anniversary of the Essay
  • 9. Breaking news!!! An accepted paper by Stephen Stigler in Statistical Science uncovers the true title of the Essay: A Method of Calculating the Exact Probability of All Conclusions founded on Induction Intended as a reply to Hume’s (1748) evaluation of the probability of miracles
  • 10. Breaking news!!! may have been written as early as 1749: “we may hope to determine the Propositions, and, by degrees, the whole Nature of unknown Causes, by a sufficient Observation of their effects” (D. Hartley) in 1767, Richard Price used Bayes’ theorem as a tool to attack Hume’s argument, refering to the above title Bayes’ offprints available at Yale’s Beinecke Library (but missing the title page) and at the Library Company of Philadelphia (Franklin’s library) [Stigler, 2013]
  • 11. Bayes Theorem Bayes theorem = Inversion of causes and effects If A and E are events such that P(E ) = 0, P(A|E ) and P(E |A) are related by P(A|E ) = P(E |A)P(A) P(E |A)P(A) + P(E |Ac )P(Ac ) P(E |A)P(A) = P(E )
  • 12. Bayes Theorem Bayes theorem = Inversion of causes and effects If A and E are events such that P(E ) = 0, P(A|E ) and P(E |A) are related by P(A|E ) = P(E |A)P(A) P(E |A)P(A) + P(E |Ac )P(Ac ) P(E |A)P(A) = P(E )
  • 13. Bayes Theorem Bayes theorem = Inversion of causes and effects Continuous version for random variables X and Y fX |Y (x|y ) = fY |X (y |x) × fX (x) fY (y )
  • 14. Who was Thomas Bayes? Reverend Thomas Bayes (ca. 1702–1761), educated in London then at the University of Edinburgh (1719-1721), presbyterian minister in Tunbridge Wells (Kent) from 1731, son of Joshua Bayes, nonconformist minister. “Election to the Royal Society based on a tract of 1736 where he defended the views and philosophy of Newton. A notebook of his includes a method of finding the time and place of conjunction of two planets, notes on weights and measures, a method of differentiation, and logarithms.” [Wikipedia]
  • 15. Who was Thomas Bayes? Reverend Thomas Bayes (ca. 1702–1761), educated in London then at the University of Edinburgh (1719-1721), presbyterian minister in Tunbridge Wells (Kent) from 1731, son of Joshua Bayes, nonconformist minister. “Election to the Royal Society based on a tract of 1736 where he defended the views and philosophy of Newton. A notebook of his includes a method of finding the time and place of conjunction of two planets, notes on weights and measures, a method of differentiation, and logarithms.” [Wikipedia]
  • 16. Bayes’ 1763 paper: Billiard ball W rolled on a line of length one, with a uniform probability of stopping anywhere: W stops at p. Second ball O then rolled n times under the same assumptions. X denotes the number of times the ball O stopped on the left of W .
  • 17. Bayes’ 1763 paper: Billiard ball W rolled on a line of length one, with a uniform probability of stopping anywhere: W stops at p. Second ball O then rolled n times under the same assumptions. X denotes the number of times the ball O stopped on the left of W . Bayes’ question: Given X , what inference can we make on p?
  • 18. Bayes’ 1763 paper: Billiard ball W rolled on a line of length one, with a uniform probability of stopping anywhere: W stops at p. Second ball O then rolled n times under the same assumptions. X denotes the number of times the ball O stopped on the left of W . Bayes’ wording: “Given the number of times in which an unknown event has happened and failed; Required the chance that the probability of its happening in a single trial lies somewhere between any two degrees of probability that can be named.”
  • 19. Bayes’ 1763 paper: Billiard ball W rolled on a line of length one, with a uniform probability of stopping anywhere: W stops at p. Second ball O then rolled n times under the same assumptions. X denotes the number of times the ball O stopped on the left of W . Modern translation: Derive the posterior distribution of p given X , when p ∼ U([0, 1]) and X |p ∼ B(n, p)
  • 20. Resolution Since P(X = x|p) = n x p (1 − p)n−x , x b P(a < p < b and X = x) = a and 1 P(X = x) = 0 n x p (1 − p)n−x dp x n x p (1 − p)n−x dp, x
  • 21. Resolution (2) then P(a < p < b|X = x) = = b a 1 0 b a n x n−x dp x p (1 − p) n x n−x dp x p (1 − p) p x (1 − p)n−x dp B(x + 1, n − x + 1) , i.e. p|x ∼ Be(x + 1, n − x + 1) [Beta distribution]
  • 22. Resolution (2) then P(a < p < b|X = x) = = b a 1 0 b a n x n−x dp x p (1 − p) n x n−x dp x p (1 − p) p x (1 − p)n−x dp B(x + 1, n − x + 1) , i.e. p|x ∼ Be(x + 1, n − x + 1) In Bayes’ words: “The same things supposed, I guess that the probability of the event M lies somewhere between 0 and the ratio of Ab to AB, my chance to be in the right is the ratio of Abm to AiB.”
  • 23. Laplace’s version Pierre Simon (de) Laplace (1749–1827): “Je me propose de d´terminer la probabilit´ e e des causes par les ´v´nements mati`re neuve ` e e e a bien des ´gards et qui m´rite d’autant plus e e d’ˆtre cultiv´e que c’est principalement sous ce e e point de vue que la science des hasards peut ˆtre utile ` la vie civile.” e a [M´moire sur la probabilit´ des causes par les ´v´nemens, 1774] e e e e
  • 24. Laplace’s version “Si un ´v`nement peut ˆtre produit par un e e e nombre n de causes diff`rentes, les probabilit´s e e de l’existence de ces causes prises de l’´v`nement, sont entre elles comme les e e probabilit´s de l’´v`nement prises de ces e e e causes, et la probabilit´ de l’existence de e chacune d’elles, est ´gale ` la probabilit´ de e a e l’´v`nement prise de cette cause, divise´ par la e e e somme de toutes les probabilit´s de e l’´v`nement prises de chacune de ces causes.” e e [M´moire sur la probabilit´ des causes par les ´v´nemens, 1774] e e e e
  • 25. Laplace’s version In modern terms: Under a uniform prior, P(Ai |E ) P(E |Ai ) = P(Aj |E ) P(E |Aj ) and f (x|y ) = f (y |x) f (y |x) dy [M´moire sur la probabilit´ des causes par les ´v´nemens, 1774] e e e e
  • 26. Laplace’s version Later Laplace acknowledges Bayes by “Bayes a cherch´ directement la probabilit´ e e que les possibilit´s indiqu´es par des e e exp´riences d´j` faites sont comprises dans les e ea limites donn´es et il y est parvenu d’une e mani`re fine et tr`s ing´nieuse” e e e [Essai philosophique sur les probabilit´s, 1810] e
  • 27. Another Bayes 250 Meeting that took place at the Royal Statistical Society, June 19-20, 2013, on the current state of Bayesian statistics G. Roberts (University of Warwick) “Bayes for differential equation models” N. Best (Imperial College London) “Bayesian space-time models for environmental epidemiology” D. Prangle (Lancaster University) “Approximate Bayesian Computation” P. Dawid (University of Cambridge), “Putting Bayes to the Test” M. Jordan (UC Berkeley) “Feature Allocations, Probability Functions, and Paintboxes” I. Murray (University of Edinburgh) “Flexible models for density estimation” M. Goldstein (Durham University) “Geometric Bayes” C. Andrieu (University of Bristol) “Inference with noisy likelihoods” A. Golightly (Newcastle University), “Auxiliary particle MCMC schemes for partially observed diffusion processes” S. Richardson (MRC Biostatistics Unit) “Biostatistics and Bayes” C. Yau (Imperial College London) “Understanding cancer through Bayesian approaches” S. Walker (University of Kent) “The Misspecified Bayesian” S. Wilson (Trinity College Dublin), “Linnaeus, Bayes and the number of species problem” B. Calderhead (UCL) “Probabilistic Integration for Differential Equation Models” P. Green (University of Bristol and UT Sydney) “Bayesian graphical model determination”
  • 28. The search for certain π Bayes, Thomas (1702–1761) Jeffreys, Harold (1891–1989) Keynes’ treatise Jeffreys’ prior distributions Jeffreys’ Bayes factor expected posterior priors Lindley, Dennis (1923– ) Besag, Julian (1945–2010) de Finetti, Bruno (1906–1985)
  • 29. Keynes’ dead end In John Maynard Keynes’s A Treatise on Probability (1921): “I do not believe that there is any direct and simple method by which we can make the transition from an observed numerical frequency to a numerical measure of probability.” [Robert, 2011, ISR]
  • 30. Keynes’ dead end In John Maynard Keynes’s A Treatise on Probability (1921): “Bayes’ enunciation is strictly correct and its method of arriving at it shows its true logical connection with more fundamental principles, whereas Laplace’s enunciation gives it the appearance of a new principle specially introduced for the solution of causal problems.” [Robert, 2011, ISR]
  • 31. Who was Harold Jeffreys? Harold Jeffreys (1891–1989) mathematician, statistician, geophysicist, and astronomer. Knighted in 1953 and Gold Medal of the Royal Astronomical Society in 1937. Funder of modern British geophysics. Many of his contributions are summarised in his book The Earth. [Wikipedia]
  • 32. Theory of Probability The first modern and comprehensive treatise on (objective) Bayesian statistics Theory of Probability (1939) begins with probability, refining the treatment in Scientific Inference (1937), and proceeds to cover a range of applications comparable to that in Fisher’s book. [Robert, Chopin & Rousseau, 2009, Stat. Science]
  • 33. Jeffreys’ justifications All probability statements are conditional Actualisation of the information on θ by extracting the information on θ contained in the observation x The principle of inverse probability does correspond to ordinary processes of learning (I, §1.5) Allows incorporation of imperfect information in the decision process A probability number can be regarded as a generalization of the assertion sign (I, §1.51).
  • 34. Posterior distribution Operates conditional upon the observations Incorporates the requirement of the Likelihood Principle ...the whole of the information contained in the observations that is relevant to the posterior probabilities of different hypotheses is summed up in the values that they give the likelihood (II, §2.0). Avoids averaging over the unobserved values of x Coherent updating of the information available on θ, independent of the order in which i.i.d. observations are collected ...can be used as the prior probability in taking account of a further set of data, and the theory can therefore always take account of new information (I, §1.5). Provides a complete inferential scope
  • 35. Posterior distribution Operates conditional upon the observations Incorporates the requirement of the Likelihood Principle ...the whole of the information contained in the observations that is relevant to the posterior probabilities of different hypotheses is summed up in the values that they give the likelihood (II, §2.0). Avoids averaging over the unobserved values of x Coherent updating of the information available on θ, independent of the order in which i.i.d. observations are collected ...can be used as the prior probability in taking account of a further set of data, and the theory can therefore always take account of new information (I, §1.5). Provides a complete inferential scope
  • 36. Posterior distribution Operates conditional upon the observations Incorporates the requirement of the Likelihood Principle ...the whole of the information contained in the observations that is relevant to the posterior probabilities of different hypotheses is summed up in the values that they give the likelihood (II, §2.0). Avoids averaging over the unobserved values of x Coherent updating of the information available on θ, independent of the order in which i.i.d. observations are collected ...can be used as the prior probability in taking account of a further set of data, and the theory can therefore always take account of new information (I, §1.5). Provides a complete inferential scope
  • 37. Posterior distribution Operates conditional upon the observations Incorporates the requirement of the Likelihood Principle ...the whole of the information contained in the observations that is relevant to the posterior probabilities of different hypotheses is summed up in the values that they give the likelihood (II, §2.0). Avoids averaging over the unobserved values of x Coherent updating of the information available on θ, independent of the order in which i.i.d. observations are collected ...can be used as the prior probability in taking account of a further set of data, and the theory can therefore always take account of new information (I, §1.5). Provides a complete inferential scope
  • 38. Subjective priors Subjective nature of priors Critics (...) usually say that the prior probability is ‘subjective’ (...) or refer to the vagueness of previous knowledge as an indication that the prior probability cannot be assessed (VIII, §8.0). Long walk (from Laplace’s principle of insufficient reason) to a reference prior: A prior probability used to express ignorance is merely the formal statement of ignorance (VIII, §8.1).
  • 39. Subjective priors Subjective nature of priors Critics (...) usually say that the prior probability is ‘subjective’ (...) or refer to the vagueness of previous knowledge as an indication that the prior probability cannot be assessed (VIII, §8.0). Long walk (from Laplace’s principle of insufficient reason) to a reference prior: A prior probability used to express ignorance is merely the formal statement of ignorance (VIII, §8.1).
  • 40. The fundamental prior ...if we took the prior probability density for the parameters to be proportional to ||gik ||1/2 [= |I (θ)|1/2 ], it could stated for any law that is differentiable with respect to all parameters that the total probability in any region of the αi would be equal to the total probability in the corresponding region of the αi ; in other words, it satisfies the rule that equivalent propositions have the same probability (III, §3.10) Note: Jeffreys never mentions Fisher information in connection with (gik )
  • 41. The fundamental prior In modern terms: if I (θ) is the Fisher information matrix associated with the likelihood (θ|x), ∂ ∂ I (θ) = Eθ ∂θT ∂θ the reference prior distribution is π∗ (θ) ∝ |I (θ)|1/2 Note: Jeffreys never mentions Fisher information in connection with (gik )
  • 42. Objective prior distributions reference priors (Bayarri, Bernardo, Berger, ...) not supposed to represent complete ignorance (Kass & Wasserman, 1996) The prior probabilities needed to express ignorance of the value of a quantity to be estimated, where there is nothing to call special attention to a particular value are given by an invariance theory (Jeffreys, VIII, §8.6). often endowed with or seeking frequency-based properties Jeffreys also proposed another Jeffreys prior dedicated to testing (Bayarri & Garcia-Donato, 2007)
  • 43. Jeffreys’ Bayes factor Definition (Bayes factor, Jeffreys, V, §5.01) For testing hypothesis H0 : θ ∈ Θ0 vs. Ha : θ ∈ Θ0 B01 = π(Θ0 |x) π(Θc |x) 0 π(Θ0 ) = π(Θc ) 0 f (x|θ)π0 (θ)dθ Θ0 Θc 0 f (x|θ)π1 (θ)dθ Equivalent to Bayes rule: acceptance if B01 > {(1 − π(Θ0 ))/a1 }/{π(Θ0 )/a0 } What if... π0 is improper?! [DeGroot, 1973; Berger, 1985; Marin & Robert, 2007]
  • 44. Jeffreys’ Bayes factor Definition (Bayes factor, Jeffreys, V, §5.01) For testing hypothesis H0 : θ ∈ Θ0 vs. Ha : θ ∈ Θ0 B01 = π(Θ0 |x) π(Θc |x) 0 π(Θ0 ) = π(Θc ) 0 f (x|θ)π0 (θ)dθ Θ0 Θc 0 f (x|θ)π1 (θ)dθ Equivalent to Bayes rule: acceptance if B01 > {(1 − π(Θ0 ))/a1 }/{π(Θ0 )/a0 } What if... π0 is improper?! [DeGroot, 1973; Berger, 1985; Marin & Robert, 2007]
  • 45. Expected posterior priors (example) Starting from reference priors πN and πN , substitute by prior 0 1 distributions π0 and π1 that solve the system of integral equations π0 (θ0 ) = X πN (θ0 | x)m1 (x)dx 0 and π1 (θ1 ) = X πN (θ1 | x)m0 (x)dx, 1 where x is an imaginary minimal training sample and m0 , m1 are the marginals associated with π0 and π1 respectively m0 (x) = f0 (x|θ0 )π0 (dθ0 ) m1 (x) = f1 (x|θ1 )π1 (dθ1 ) [Perez & Berger, 2000]
  • 46. Existence/Unicity Recurrence condition When both the observations and the parameters in both models are continuous, if the Markov chain with transition Q θ0 | θ0 = g θ0 , θ0 , θ1 , x, x dxdx dθ1 where g θ0 , θ0 , θ1 , x, x = πN θ0 | x f1 (x | θ1 ) πN θ1 | x 0 1 f0 x | θ0 , is recurrent, then there exists a solution to the integral equations, unique up to a multiplicative constant. [Cano, Salmer´n, & Robert, 2008, 2013] o
  • 47. Bayesian testing of hypotheses Bayes, Thomas (1702–1761) Jeffreys, Harold (1891–1989) Lindley, Dennis (1923– ) Lindley’s paradox dual versions of the paradox “Who should be afraid of the Lindley–Jeffreys paradox?” Bayesian resolutions Besag, Julian (1945–2010) de Finetti, Bruno (1906–1985)
  • 48. Who is Dennis Lindley? British statistician, decision theorist and leading advocate of Bayesian statistics. Held positions at Cambridge, Aberystwyth, and UCL, retiring at the early age of 54 to become an itinerant scholar. Wrote four books and numerous papers on Bayesian statistics. c “Coherence is everything”
  • 49. Lindley’s paradox In a normal mean testing problem, ¯ xn ∼ N(θ, σ2 /n) , H0 : θ = θ 0 , under Jeffreys prior, θ ∼ N(θ0 , σ2 ), the Bayes factor 2 B01 (tn ) = (1 + n)1/2 exp −ntn /2[1 + n] , where tn = √ n|¯n − θ0 |/σ, satisfies x n−→∞ B01 (tn ) −→ ∞ [assuming a fixed tn ]
  • 50. Lindley’s paradox Often dubbed Jeffreys–Lindley paradox... In terms of t= K∼ √ n − 1¯/s , x πν 2 1+ ν=n−1 t2 ν −1/2ν+1/2 . (...) The variation of K with t is much more important than the variation with ν (Jeffreys, V, §5.2).
  • 51. Two versions of the paradox “the weight of Lindley’s paradoxical result (...) burdens proponents of the Bayesian practice”. [Lad, 2003] official version, opposing frequentist and Bayesian assessments [Lindley, 1957] intra-Bayesian version, blaming vague and improper priors for the Bayes factor misbehaviour: if π1 (·|σ) depends on a scale parameter σ, it is often the case that σ−→∞ B01 (x) −→ +∞ for a given x, meaning H0 is always accepted [Robert, 1992, 2013]
  • 52. Evacuation of the first version Two paradigms [(b) versus (f)] one (b) operates on the parameter space Θ, while the other (f) is produced from the sample space one (f) relies solely on the point-null hypothesis H0 and the corresponding sampling distribution, while the other (b) opposes H0 to a (predictive) marginal version of H1 one (f) could reject “a hypothesis that may be true (...) because it has not predicted observable results that have not occurred” (Jeffreys, VII, §7.2) while the other (b) conditions upon the observed value xobs one (f) resorts to an arbitrary fixed bound α on the p-value, while the other (b) refers to the boundary probability of 1 2
  • 53. More arguments on the first version observing a constant tn as n increases is of limited interest: under H0 tn has limiting N(0, 1) distribution, while, under H1 tn a.s. converges to ∞ behaviour that remains entirely compatible with the consistency of the Bayes factor, which a.s. converges either to 0 or ∞, depending on which hypothesis is true. Consequent literature (e.g., Berger & Sellke,1987) has since then shown how divergent those two approaches could be (to the point of being asymptotically incompatible). [Robert, 2013]
  • 54. Nothing’s wrong with the second version n, prior’s scale factor: prior variance n times larger than the observation variance and when n goes to ∞, Bayes factor goes to ∞ no matter what the observation is n becomes what Lindley (1957) calls “a measure of lack of conviction about the null hypothesis” when prior diffuseness under H1 increases, only relevant information becomes that θ could be equal to θ0 , and this overwhelms any evidence to the contrary contained in the data mass of the prior distribution in the vicinity of any fixed neighbourhood of the null hypothesis vanishes to zero under H1 [Robert, 2013] c deep coherence in the outcome: being indecisive about the alternative hypothesis means we should not chose it
  • 55. Nothing’s wrong with the second version n, prior’s scale factor: prior variance n times larger than the observation variance and when n goes to ∞, Bayes factor goes to ∞ no matter what the observation is n becomes what Lindley (1957) calls “a measure of lack of conviction about the null hypothesis” when prior diffuseness under H1 increases, only relevant information becomes that θ could be equal to θ0 , and this overwhelms any evidence to the contrary contained in the data mass of the prior distribution in the vicinity of any fixed neighbourhood of the null hypothesis vanishes to zero under H1 [Robert, 2013] c deep coherence in the outcome: being indecisive about the alternative hypothesis means we should not chose it
  • 56. “Who should be afraid of the Lindley–Jeffreys paradox?” Recent publication by A. Spanos with above title: the paradox demonstrates against Bayesian and likelihood resolutions of the problem for failing to account for the large sample size. the failure of all three main paradigms leads Spanos to advocate Mayo’s and Spanos’“postdata severity evaluation” [Spanos, 2013]
  • 57. “Who should be afraid of the Lindley–Jeffreys paradox?” Recent publication by A. Spanos with above title: “the postdata severity evaluation (...) addresses the key problem with Fisherian p-values in the sense that the severity evaluation provides the “magnitude” of the warranted discrepancy from the null by taking into account the generic capacity of the test (that includes n) in question as it relates to the observed data”(p.88) [Spanos, 2013]
  • 58. On some resolutions of the second version use of pseudo-Bayes factors, fractional Bayes factors, &tc, which lacks proper Bayesian justification [Berger & Pericchi, 2001] use of identical improper priors on nuisance parameters, a notion already entertained by Jeffreys [Berger et al., 1998; Marin & Robert, 2013] use of the posterior predictive distribution, which uses the data twice (see also Aitkin’s (2010) integrated likelihood) [Gelman, Rousseau & Robert, 2013] use of score functions extending the log score function log B12 (x) = log m1 (x) − log m2 (x) = S0 (x, m1 ) − S0 (x, m2 ) , that are independent of the normalising constant [Dawid et al., 2013]
  • 59. Bayesian computing (R)evolution Bayes, Thomas (1702–1761) Jeffreys, Harold (1891–1989) Lindley, Dennis (1923– ) Besag, Julian (1945–2010) Besag’s early contributions MCMC revolution and beyond de Finetti, Bruno (1906–1985)
  • 60. computational jam In the 1970’s and early 1980’s, theoretical foundations of Bayesian statistics were sound, but methodology was lagging for lack of computing tools. restriction to conjugate priors limited complexity of models small sample sizes The field was desperately in need of a new computing paradigm! [Robert & Casella, 2012]
  • 61. MCMC as in Markov Chain Monte Carlo Notion that i.i.d. simulation is definitely not necessary, all that matters is the ergodic theorem Realization that Markov chains could be used in a wide variety of situations only came to mainstream statisticians with Gelfand and Smith (1990) despite earlier publications in the statistical literature like Hastings (1970) and growing awareness in spatial statistics (Besag, 1986) Reasons: lack of computing machinery lack of background on Markov chains lack of trust in the practicality of the method
  • 62. Who was Julian Besag? British statistician known chiefly for his work in spatial statistics (including its applications to epidemiology, image analysis and agricultural science), and Bayesian inference (including Markov chain Monte Carlo algorithms). Lecturer in Liverpool and Durham, then professor in Durham and Seattle. [Wikipedia]
  • 63. pre-Gibbs/pre-Hastings era Early 1970’s, Hammersley, Clifford, and Besag were working on the specification of joint distributions from conditional distributions and on necessary and sufficient conditions for the conditional distributions to be compatible with a joint distribution. [Hammersley and Clifford, 1971]
  • 64. pre-Gibbs/pre-Hastings era Early 1970’s, Hammersley, Clifford, and Besag were working on the specification of joint distributions from conditional distributions and on necessary and sufficient conditions for the conditional distributions to be compatible with a joint distribution. “What is the most general form of the conditional probability functions that define a coherent joint function? And what will the joint look like?” [Besag, 1972]
  • 65. Hammersley-Clifford[-Besag] theorem Theorem (Hammersley-Clifford) Joint distribution of vector associated with a dependence graph must be represented as product of functions over the cliques of the graphs, i.e., of functions depending only on the components indexed by the labels in the clique. [Cressie, 1993; Lauritzen, 1996]
  • 66. Hammersley-Clifford[-Besag] theorem Theorem (Hammersley-Clifford) A probability distribution P with positive and continuous density f satisfies the pairwise Markov property with respect to an undirected graph G if and only if it factorizes according to G, i.e., (F ) ≡ (G ) [Cressie, 1993; Lauritzen, 1996]
  • 67. Hammersley-Clifford[-Besag] theorem Theorem (Hammersley-Clifford) Under the positivity condition, the joint distribution g satisfies g j (y j |y 1 , . . . , y j−1 , y j+1 , . . . , y p ) p g (y1 , . . . , yp ) ∝ j=1 for every permutation g j (y j |y 1 , . . . , y j−1 , y j+1 , . . . , y p ) on {1, 2, . . . , p} and every y ∈ Y. [Cressie, 1993; Lauritzen, 1996]
  • 68. To Gibbs or not to Gibbs? Julian Besag should certainly be credited to a large extent of the (re?-)discovery of the Gibbs sampler.
  • 69. To Gibbs or not to Gibbs? Julian Besag should certainly be credited to a large extent of the (re?-)discovery of the Gibbs sampler. “The simulation procedure is to consider the sites cyclically and, at each stage, to amend or leave unaltered the particular site value in question, according to a probability distribution whose elements depend upon the current value at neighboring sites (...) However, the technique is unlikely to be particularly helpful in many other than binary situations and the Markov chain itself has no practical interpretation.” [Besag, 1974]
  • 70. Clicking in After Peskun (1973), MCMC mostly dormant in mainstream statistical world for about 10 years, then several papers/books highlighted its usefulness in specific settings: Geman and Geman (1984) Besag (1986) Strauss (1986) Ripley (Stochastic Simulation, 1987) Tanner and Wong (1987) Younes (1988)
  • 71. Enters the Gibbs sampler Geman and Geman (1984), building on Metropolis et al. (1953), Hastings (1970), and Peskun (1973), constructed a Gibbs sampler for optimisation in a discrete image processing problem with a Gibbs random field without completion. Back to Metropolis et al., 1953: the Gibbs sampler is already in use therein and ergodicity is proven on the collection of global maxima
  • 72. Enters the Gibbs sampler Geman and Geman (1984), building on Metropolis et al. (1953), Hastings (1970), and Peskun (1973), constructed a Gibbs sampler for optimisation in a discrete image processing problem with a Gibbs random field without completion. Back to Metropolis et al., 1953: the Gibbs sampler is already in use therein and ergodicity is proven on the collection of global maxima
  • 73. Besag (1986) integrates GS for SA... “...easy to construct the transition matrix Q, of a discrete time Markov chain, with state space Ω and limit distribution (4). Simulated annealing proceeds by running an associated time inhomogeneous Markov chain with transition matrices QT , where T is progressively decreased according to a prescribed “schedule” to a value close to zero.” [Besag, 1986]
  • 74. ...and links with Metropolis-Hastings... “There are various related methods of constructing a manageable QT (Hastings, 1970). Geman and Geman (1984) adopt the simplest, which they term the ”Gibbs sampler” (...) time reversibility, a common ingredient in this type of problem (see, for example, Besag, 1977a), is present at individual stages but not over complete cycles, though Peter Green has pointed out that it returns if QT is taken over a pair of cycles, the second of which visits pixels in reverse order” [Besag, 1986]
  • 75. The candidate’s formula Representation of the marginal likelihood as m(x) = π(θ)f (x|θ) π(θ|x) or of the marginal predictive as pn (y |y ) = f (y |θ)πn (θ|y ) πn+1 (θ|y , y ) [Besag, 1989] Why candidate? “Equation (2) appeared without explanation in a Durham University undergraduate final examination script of 1984. Regrettably, the student’s name is no longer known to me.”
  • 76. The candidate’s formula Representation of the marginal likelihood as m(x) = π(θ)f (x|θ) π(θ|x) or of the marginal predictive as pn (y |y ) = f (y |θ)πn (θ|y ) πn+1 (θ|y , y ) [Besag, 1989] Why candidate? “Equation (2) appeared without explanation in a Durham University undergraduate final examination script of 1984. Regrettably, the student’s name is no longer known to me.”
  • 77. Implications Newton and Raftery (1994) used this representation to derive the [infamous] harmonic mean approximation to the marginal likelihood Gelfand and Dey (1994) Geyer and Thompson (1995) Chib (1995) Marin and Robert (2010) and Robert and Wraith (2009) [Chen, Shao & Ibrahim, 2000]
  • 78. Implications Newton and Raftery (1994) Gelfand and Dey (1994) also relied on this formula for the same purpose in a more general perspective Geyer and Thompson (1995) Chib (1995) Marin and Robert (2010) and Robert and Wraith (2009) [Chen, Shao & Ibrahim, 2000]
  • 79. Implications Newton and Raftery (1994) Gelfand and Dey (1994) Geyer and Thompson (1995) derived MLEs by a Monte Carlo approximation to the normalising constant Chib (1995) Marin and Robert (2010) and Robert and Wraith (2009) [Chen, Shao & Ibrahim, 2000]
  • 80. Implications Newton and Raftery (1994) Gelfand and Dey (1994) Geyer and Thompson (1995) Chib (1995) uses this representation to build a MCMC approximation to the marginal likelihood Marin and Robert (2010) and Robert and Wraith (2009) [Chen, Shao & Ibrahim, 2000]
  • 81. Implications Newton and Raftery (1994) Gelfand and Dey (1994) Geyer and Thompson (1995) Chib (1995) Marin and Robert (2010) and Robert and Wraith (2009) corrected Newton and Raftery (1994) by restricting the importance function to an HPD region [Chen, Shao & Ibrahim, 2000]
  • 82. Removing the jam In early 1990s, researchers found that Gibbs and then Metropolis Hastings algorithms would crack almost any problem! Flood of papers followed applying MCMC: linear mixed models (Gelfand & al., 1990; Zeger & Karim, 1991; Wang & al., 1993, 1994) generalized linear mixed models (Albert & Chib, 1993) mixture models (Tanner & Wong, 1987; Diebolt & Robert., 1990, 1994; Escobar & West, 1993) changepoint analysis (Carlin & al., 1992) point processes (Grenander & Møller, 1994) &tc
  • 83. Removing the jam In early 1990s, researchers found that Gibbs and then Metropolis Hastings algorithms would crack almost any problem! Flood of papers followed applying MCMC: genomics (Stephens & Smith, 1993; Lawrence & al., 1993; Churchill, 1995; Geyer & Thompson, 1995; Stephens & Donnelly, 2000) ecology (George & Robert, 1992) variable selection in regression (George & mcCulloch, 1993; Green, 1995; Chen & al., 2000) spatial statistics (Raftery & Banfield, 1991; Besag & Green, 1993)) longitudinal studies (Lange & al., 1992) &tc
  • 84. MCMC and beyond reversible jump MCMC which impacted considerably Bayesian model choice (Green, 1995) adaptive MCMC algorithms (Haario & al., 1999; Roberts & Rosenthal, 2009) exact approximations to targets (Tanner & Wong, 1987; Beaumont, 2003; Andrieu & Roberts, 2009) particle filters with application to sequential statistics, state-space models, signal processing, &tc. (Gordon & al., 1993; Doucet & al., 2001; del Moral & al., 2006)
  • 85. MCMC and beyond beyond comp’al stats catching up with comp’al physics: free energy sampling (e.g., Wang-Landau), Hamilton Monte Carlo (Girolami & Calderhead, 2011) sequential Monte Carlo (SMC) for non-sequential problems (Chopin, 2002; Neal, 2001; Del Moral et al 2006) retrospective sampling intractability: EP – GIMH – PMCMC – SMC2 – INLA QMC[MC] (Owen, 2011)
  • 86. Particles Iterating/sequential importance sampling is about as old as Monte Carlo methods themselves! [Hammersley and Morton,1954; Rosenbluth and Rosenbluth, 1955] Found in the molecular simulation literature of the 50’s with self-avoiding random walks and signal processing [Marshall, 1965; Handschin and Mayne, 1969] Use of the term “particle” dates back to Kitagawa (1996), and Carpenter et al. (1997) coined the term “particle filter”.
  • 87. Particles Iterating/sequential importance sampling is about as old as Monte Carlo methods themselves! [Hammersley and Morton,1954; Rosenbluth and Rosenbluth, 1955] Found in the molecular simulation literature of the 50’s with self-avoiding random walks and signal processing [Marshall, 1965; Handschin and Mayne, 1969] Use of the term “particle” dates back to Kitagawa (1996), and Carpenter et al. (1997) coined the term “particle filter”.
  • 88. pMC & pMCMC Recycling of past simulations legitimate to build better importance sampling functions as in population Monte Carlo [Iba, 2000; Capp´ et al, 2004; Del Moral et al., 2007] e synthesis by Andrieu, Doucet, and Hollenstein (2010) using particles to build an evolving MCMC kernel pθ (y1:T ) in state ^ space models p(x1:T )p(y1:T |x1:T ) importance sampling on discretely observed diffusions [Beskos et al., 2006; Fearnhead et al., 2008, 2010]
  • 89. towards ever more complexity Bayes, Thomas (1702–1761) Jeffreys, Harold (1891–1989) Lindley, Dennis (1923– ) Besag, Julian (1945–2010) de Finetti, Bruno (1906–1985) de Finetti’s exchangeability theorem Bayesian nonparametrics Bayesian analysis in a Big Data era
  • 90. Who was Bruno de Finetti? “Italian probabilist, statistician and actuary, noted for the “operational subjective” conception of probability. The classic exposition of his distinctive theory is the 1937 “La pr´vision: ses e lois logiques, ses sources subjectives,” which discussed probability founded on the coherence of betting odds and the consequences of exchangeability.” [Wikipedia] Chair in Financial Mathematics at Trieste University (1939) and Roma (1954) then in Calculus of Probabilities (1961). Most famous sentence: “Probability does not exist”
  • 91. Who was Bruno de Finetti? “Italian probabilist, statistician and actuary, noted for the “operational subjective” conception of probability. The classic exposition of his distinctive theory is the 1937 “La pr´vision: ses e lois logiques, ses sources subjectives,” which discussed probability founded on the coherence of betting odds and the consequences of exchangeability.” [Wikipedia] Chair in Financial Mathematics at Trieste University (1939) and Roma (1954) then in Calculus of Probabilities (1961). Most famous sentence: “Probability does not exist”
  • 92. Exchangeability Notion of exchangeable sequences: A random sequence (x1 , . . . , xn , . . .) is exchangeable if for any n the distribution of (x1 , . . . , xn ) is equal to the distribution of any permutation of the sequence (xσ1 , . . . , xσn ) de Finetti’s theorem (1937): An exchangeable distribution is a mixture of iid distributions n f (xi |G )dπ(G ) p(x1 , . . . , xn ) = i=1 where G can be infinite-dimensional Extension to Markov chains (Freedman, 1962; Diaconis & Freedman, 1980)
  • 93. Exchangeability Notion of exchangeable sequences: A random sequence (x1 , . . . , xn , . . .) is exchangeable if for any n the distribution of (x1 , . . . , xn ) is equal to the distribution of any permutation of the sequence (xσ1 , . . . , xσn ) de Finetti’s theorem (1937): An exchangeable distribution is a mixture of iid distributions n f (xi |G )dπ(G ) p(x1 , . . . , xn ) = i=1 where G can be infinite-dimensional Extension to Markov chains (Freedman, 1962; Diaconis & Freedman, 1980)
  • 94. Exchangeability Notion of exchangeable sequences: A random sequence (x1 , . . . , xn , . . .) is exchangeable if for any n the distribution of (x1 , . . . , xn ) is equal to the distribution of any permutation of the sequence (xσ1 , . . . , xσn ) de Finetti’s theorem (1937): An exchangeable distribution is a mixture of iid distributions n f (xi |G )dπ(G ) p(x1 , . . . , xn ) = i=1 where G can be infinite-dimensional Extension to Markov chains (Freedman, 1962; Diaconis & Freedman, 1980)
  • 95. Bayesian nonparametrics Based on de Finetti’s representation, use of priors on functional spaces (densities, regression, trees, partitions, clustering, &tc) production of Bayes estimates in those spaces convergence mileage may vary available efficient (MCMC) algorithms to conduct non-parametric inference [van der Vaart, 1998; Hjort et al., 2010; M¨ller & Rodriguez, 2013] u
  • 96. Dirichlet processes One of the earliest examples of priors on distributions [Ferguson, 1973] stick-breaking construction of D(α0 , G0 ) generate βk ∼ B(1, α0 ) define π1 = β1 and πk = k−1 j=1 (1 − βj )βk generate θk ∼ G0 derive G = k πk δθk ∼ D(α0 , G0 ) [Sethuraman, 1994]
  • 97. Chinese restaurant process If we assume G ∼ D(α0 , G0 ) θi ∼ G then the marginal distribution of (θ1 , . . .) is a Chinese restaurant process (P´lya urn model), which is exchangeable. In particular, o i−1 θi |θ1:i−1 ∼ α0 G0 + δθj j=1 Posterior distribution built by MCMC [Escobar and West, 1992]
  • 98. Chinese restaurant process If we assume G ∼ D(α0 , G0 ) θi ∼ G then the marginal distribution of (θ1 , . . .) is a Chinese restaurant process (P´lya urn model), which is exchangeable. In particular, o i−1 θi |θ1:i−1 ∼ α0 G0 + δθj j=1 Posterior distribution built by MCMC [Escobar and West, 1992]
  • 99. Many alternatives truncated Dirichlet processes Pitman Yor processes completely random measures normalized random measures with independent increments (NRMI) [M¨ller and Mitra, 2013] u
  • 100. Theoretical advances posterior consistency: Seminal work of Schwarz (1965) in iid case and extension of Barron et al. (1999) for general consistency consistency rates: Ghosal & van der Vaart (2000) Ghosal et al. (2008) with minimax (adaptive ) Bayesian nonparametric estimators for nonparametric process mixtures (Gaussian, Beta) (Rousseau, 2008; Kruijer, Rousseau & van der Vaart, 2010; Shen, Tokdar & Ghosal, 2013; Scricciolo, 2013) Bernstein-von Mises theorems: (Castillo, 2011; Rivoirard & Rousseau, 2012; Kleijn & Bickel, 2013; Castillo & Rousseau, 2013) recent extensions to semiparametric models
  • 101. Consistency and posterior concentration rates Posterior dπ(θ|X n ) = fθ (X n )dπ(θ) m(X n ) fθ (X n )dπ(θ) m(X n ) = Θ and posterior concentration: Under Pθ0 Pπ [d(θ, θ0 ) |X n ] = 1+op (1), Pπ [d(θ, θ0 ) n |X n ] = 1+op (1) Given n : consistency where d(θ, θ ) is a loss function. e.g. Hellinger, L1 , L2 , L∞
  • 102. Consistency and posterior concentration rates Posterior dπ(θ|X n ) = fθ (X n )dπ(θ) m(X n ) fθ (X n )dπ(θ) m(X n ) = Θ and posterior concentration: Under Pθ0 Pπ [d(θ, θ0 ) |X n ] = 1+op (1), Pπ [d(θ, θ0 ) n |X n ] = 1+op (1) Setting n ↓ 0: consistency rates where d(θ, θ ) is a loss function. e.g. Hellinger, L1 , L2 , L∞
  • 103. Bernstein–von Mises theorems Parameter of interest ψ = ψ(θ) ∈ Rd , (with dim(θ) = +∞) BVM: √ ^ π[ n(ψ − ψ) and d < +∞, z|X n ] = Φ(z/ θ∼π V0 ) + op (1), √ ^ n(ψ − ψ(θ0 )) ≈ N(0, V0 ) Pθ0 under Pθ0 [Doob, 1949; Le Cam, 1986; van der Vaart, 1998]
• 104. New challenges
Novel statistical issues that force a different Bayesian answer:
very large datasets
complex or unknown dependence structures, possibly with multiple and involved random effects
missing data structures containing most of the information
sequential structures involving most of the above
• 105. New paradigm?
“Surprisingly, the confident prediction of the previous generation that Bayesian methods would ultimately supplant frequentist methods has given way to a realization that Markov chain Monte Carlo (MCMC) may be too slow to handle modern data sets. Size matters because large data sets stress computer storage and processing power to the breaking point. The most successful compromises between Bayesian and frequentist methods now rely on penalization and optimization.”
[Lange et al., ISR, 2013]
• 106. New paradigm?
Observe (X_i, R_i, Y_i R_i) where X_i ∼ U(0, 1)^d, R_i | X_i ∼ B(π(X_i)) and Y_i | X_i ∼ B(θ(X_i)) (π(·) is known and θ(·) is unknown)
Then any estimator of E[Y] that does not depend on π is inconsistent.
c There is no genuine Bayesian answer producing a consistent estimator (without throwing away part of the data)
[Robins & Wasserman, 2000, 2013]
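A small simulation sketch of this example (the specific π(·) and θ(·) below are my own choices, not the slide's): the Horvitz–Thompson estimator, which uses the known π(·), recovers E[Y], while the complete-case average that ignores π(·) does not.

```python
# Robins-Wasserman style toy simulation: pi(.) known, theta(.) "unknown",
# only (X, R, R*Y) observed; here E[Y] = 0.5 by construction.
import numpy as np

rng = np.random.default_rng(2)
n, d = 200_000, 5
X = rng.uniform(size=(n, d))
pi = 0.1 + 0.8 * X[:, 0]             # known selection probability pi(x)
theta = 0.1 + 0.8 * X[:, 0]          # regression theta(x), correlated with pi(x)
R = rng.binomial(1, pi)              # missingness indicator
Y = rng.binomial(1, theta)           # response, only R*Y is observed

horvitz_thompson = np.mean(R * Y / pi)   # uses pi(.): close to E[Y] = 0.5
complete_case = Y[R == 1].mean()         # ignores pi(.): biased upwards here
print(horvitz_thompson, complete_case)
```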
• 107. New paradigm?
sad reality constraint that size does matter
focus on much smaller dimensions and on sparse summaries
many (fast if non-Bayesian) ways of producing those summaries
Bayesian inference can kick in almost automatically at this stage
• 108. Approximate Bayesian computation (ABC)
Case of a well-defined statistical model where the likelihood function
$$\ell(\theta \mid y) = f(y_1, \ldots, y_n \mid \theta)$$
is out of reach!
Empirical approximations to the original Bayesian inference problem:
degrading the data precision down to a tolerance ε
replacing the likelihood with a non-parametric approximation
summarising/replacing the data with insufficient statistics
• 112. ABC methodology
Bayesian setting: target is π(θ)f(x|θ)
When the likelihood f(x|θ) is not in closed form, likelihood-free rejection technique:
Foundation: for an observation y ∼ f(y|θ), under the prior π(θ), if one keeps jointly simulating
θ' ∼ π(θ), z ∼ f(z|θ'),
until the auxiliary variable z is equal to the observed value, z = y, then the selected θ' ∼ π(θ|y)
[Rubin, 1984; Diggle & Gratton, 1984; Griffith et al., 1997]
• 115. ABC algorithm
In most implementations, a degree of approximation:
Algorithm 1 Likelihood-free rejection sampler
  for i = 1 to N do
    repeat
      generate θ' from the prior distribution π(·)
      generate z from the likelihood f(·|θ')
    until ρ{η(z), η(y)} ≤ ε
    set θ_i = θ'
  end for
where η(y) defines a (not necessarily sufficient) statistic
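A hedged sketch of Algorithm 1 on a toy normal-mean model, with the sample mean as summary statistic η(·) and an absolute-difference distance ρ; the model, prior and tolerance are assumptions of mine, not the slides'.

```python
# Likelihood-free rejection sampler on a toy normal mean problem.
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(loc=1.5, scale=1.0, size=100)         # "observed" data
eta_obs = y.mean()                                    # eta(y): sample mean

def abc_rejection(N=500, eps=0.1):
    draws = []
    while len(draws) < N:
        theta = rng.normal(0.0, 10.0)                 # theta' ~ pi(.)
        z = rng.normal(theta, 1.0, size=y.size)       # z ~ f(.|theta')
        if abs(z.mean() - eta_obs) <= eps:            # rho{eta(z), eta(y)} <= eps
            draws.append(theta)
    return np.array(draws)

sample = abc_rejection()
print(sample.mean(), sample.std())                    # crude ABC posterior summaries
```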
• 116. Comments
role of distance paramount (because ε ≠ 0)
scaling of components of η(y) also capital
ε matters little if “small enough”
representative of the “curse of dimensionality”
small is beautiful!, i.e. data as a whole may be weakly informative for ABC
non-parametric method at core
• 117. ABC simulation advances
Simulating from the prior is often poor in efficiency
Either modify the proposal distribution on θ to increase the density of x’s within the vicinity of y...
[Marjoram et al, 2003; Beaumont et al., 2009; Del Moral et al., 2012]
...or view the problem as conditional density estimation and develop techniques to allow for larger ε
[Beaumont et al., 2002; Blum & François, 2010; Biau et al., 2013]
...or even include ε in the inferential framework [ABCµ]
[Ratmann et al., 2009]
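As an illustration of the first option, a hedged sketch of an ABC-MCMC move in the spirit of Marjoram et al. (2003), reusing the toy normal-mean setting above; the random-walk scale and prior are my own assumptions.

```python
# ABC-MCMC: symmetric random-walk proposal, pseudo-data simulated at the
# proposed value, move accepted only if the summary distance is within eps
# and the prior Metropolis-Hastings ratio passes.
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(1.5, 1.0, size=100)
eta_obs = y.mean()
log_prior = lambda t: -0.5 * (t / 10.0) ** 2          # N(0, 10^2) prior, up to a constant

def abc_mcmc(T=5_000, eps=0.1, step=0.5, theta0=0.0):
    chain = [theta0]
    for _ in range(T):
        theta = chain[-1]
        prop = theta + step * rng.standard_normal()    # symmetric random walk proposal
        z = rng.normal(prop, 1.0, size=y.size)         # pseudo-data under the proposal
        close = abs(z.mean() - eta_obs) <= eps         # likelihood-free acceptance test
        if close and np.log(rng.uniform()) < log_prior(prop) - log_prior(theta):
            chain.append(prop)
        else:
            chain.append(theta)
    return np.array(chain)

print(abc_mcmc().mean())
```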
• 121. ABC as an inference machine
Starting point is the summary statistic η(y), either chosen for computational realism or imposed by external constraints
ABC can produce a distribution on the parameter of interest conditional on this summary statistic η(y)
inference based on ABC may be consistent or not, so it needs to be validated on its own
the choice of the tolerance level ε is dictated by both computational and convergence constraints
• 123. How Bayesian aBc is..?
At best, ABC approximates π(θ | η(y)):
approximation error unknown (without massive simulation)
pragmatic or empirical Bayes (there is no other solution!)
many calibration issues (tolerance, distance, statistics)
the NP side should be incorporated into the whole Bayesian picture
the approximation error should also be part of the Bayesian inference
• 124. Noisy ABC
ABC approximation error (under non-zero tolerance ε) replaced with exact simulation from a controlled approximation to the target, the convolution of the true posterior with a kernel function,
$$\pi_\epsilon(\theta, z \mid y) = \frac{\pi(\theta)\, f(z \mid \theta)\, K_\epsilon(y - z)}{\int \pi(\theta)\, f(z \mid \theta)\, K_\epsilon(y - z)\, \mathrm{d}z\, \mathrm{d}\theta},$$
with K_ε a kernel parameterised by the bandwidth ε.
[Wilkinson, 2013]
Theorem: the ABC algorithm based on a randomised observation ỹ = y + ξ, ξ ∼ K_ε, and an acceptance probability of K_ε(y − z)/M gives draws from the posterior distribution π(θ|y).
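A hedged sketch of kernel-acceptance ABC targeting the smoothed density π_ε displayed above, with a Gaussian kernel applied (for simplicity) to a summary statistic rather than to the full data; the toy model, prior and bandwidth are my own assumptions.

```python
# Kernel-acceptance ABC: accept (theta, z) with probability
# K_eps(eta(y) - eta(z)) / M, where M = K_eps(0) and K_eps is Gaussian.
import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(1.5, 1.0, size=100)
eta_obs = y.mean()

def kernel_abc(N=500, eps=0.1):
    draws = []
    while len(draws) < N:
        theta = rng.normal(0.0, 10.0)                     # theta ~ pi(.)
        z = rng.normal(theta, 1.0, size=y.size)           # z ~ f(.|theta)
        accept_prob = np.exp(-0.5 * ((z.mean() - eta_obs) / eps) ** 2)
        if rng.uniform() < accept_prob:
            draws.append(theta)
    return np.array(draws)

print(kernel_abc().mean())
```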
• 126. Which summary? Fundamental difficulty of the choice of the summary statistic when there are no non-trivial sufficient statistics [except when done by the experimenters in the field]
• 127. Which summary?
Fundamental difficulty of the choice of the summary statistic when there are no non-trivial sufficient statistics [except when done by the experimenters in the field]
Loss of statistical information balanced against gain in data roughening
Approximation error and information loss remain unknown
Choice of statistics induces choice of distance function
towards standardisation, borrowing tools from data analysis (LDA) and machine learning
[Estoup et al., ME, 2012]
• 128. Which summary?
Fundamental difficulty of the choice of the summary statistic when there are no non-trivial sufficient statistics [except when done by the experimenters in the field]
may be imposed for external/practical reasons
may gather several non-Bayesian point estimates
we can learn about efficient combination
distance can be provided by estimation techniques
• 129. Which summary for model choice?
‘This is also why focus on model discrimination typically (...) proceeds by (...) accepting that the Bayes Factor that one obtains is only derived from the summary statistics and may in no way correspond to that of the full model.’
[Scott Sisson, Jan. 31, 2011, xianblog]
Depending on the choice of η(·), the Bayes factor based on this insufficient statistic,
$$B_{12}^{\eta}(y) = \frac{\int \pi_1(\theta_1)\, f_1^{\eta}(\eta(y) \mid \theta_1)\, \mathrm{d}\theta_1}{\int \pi_2(\theta_2)\, f_2^{\eta}(\eta(y) \mid \theta_2)\, \mathrm{d}\theta_2},$$
is either consistent or inconsistent
[Robert et al., PNAS, 2012]
• 130. Which summary for model choice?
Depending on the choice of η(·), the Bayes factor based on this insufficient statistic,
$$B_{12}^{\eta}(y) = \frac{\int \pi_1(\theta_1)\, f_1^{\eta}(\eta(y) \mid \theta_1)\, \mathrm{d}\theta_1}{\int \pi_2(\theta_2)\, f_2^{\eta}(\eta(y) \mid \theta_2)\, \mathrm{d}\theta_2},$$
is either consistent or inconsistent
[Robert et al., PNAS, 2012]
[Figure: boxplots of the summary-based Bayes factor for n = 100 under the Gauss and Laplace models]
• 131. Selecting proper summaries
Consistency only depends on the range of µ_i(θ) = E_i[η(y)] under both models against the asymptotic mean µ_0 of η(y)
Theorem: if P^n belongs to one of the two models and if µ_0 cannot be attained by the other one, i.e.
$$0 = \min\big(\inf\{|\mu_0 - \mu_i(\theta_i)|;\ \theta_i \in \Theta_i\},\ i = 1, 2\big) < \max\big(\inf\{|\mu_0 - \mu_i(\theta_i)|;\ \theta_i \in \Theta_i\},\ i = 1, 2\big),$$
then the Bayes factor B_{12}^{η} is consistent
[Marin et al., 2012]
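A toy illustration of the separation condition, phrased by me but consistent with the Gauss versus Laplace comparison shown in the earlier boxplots:

```latex
% Toy illustration (my wording): take M_1: y_i ~ N(theta_1, 1) and
% M_2: y_i ~ Laplace(theta_2, 1/sqrt(2)), with eta(y) the sample mean. Then
\[
  \mu_1(\theta_1) = \theta_1, \qquad \mu_2(\theta_2) = \theta_2,
\]
% so any asymptotic mean mu_0 is attainable under both models,
\[
  \inf\{|\mu_0-\mu_1(\theta_1)|;\ \theta_1\in\Theta_1\}
  = \inf\{|\mu_0-\mu_2(\theta_2)|;\ \theta_2\in\Theta_2\} = 0,
\]
% the separation condition of the theorem fails, and B_{12}^{\eta} cannot be
% consistent; adding a scale-sensitive component to eta(y) (e.g. the median
% absolute deviation) separates the attainable means and restores consistency.
```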
• 132. Selecting proper summaries
Consistency only depends on the range of µ_i(θ) = E_i[η(y)] under both models against the asymptotic mean µ_0 of η(y)
[Figure: boxplots of the summary-based Bayes factor under models M1 and M2, across several choices of summary statistics]
[Marin et al., 2012]
• 133. on some Bayesian open problems
In 2011, Michael Jordan, then ISBA President, conducted a mini-survey on Bayesian open problems:
Nonparametrics and semiparametrics: assessing and validating priors on infinite-dimensional spaces with an infinite number of nuisance parameters
Priors: elicitation mechanisms and strategies to get the prior from the likelihood or even from the posterior distribution
Bayesian/frequentist relationships: how far should one reach for frequentist validation?
Computation and statistics: computational abilities should be part of the modelling, with some expressing doubts about INLA and ABC
Model selection and hypothesis testing: still unsettled opposition between model checking, model averaging and model selection
[Jordan, ISBA Bulletin, March 2011]
• 134. yet another Bayes 250
Meeting that will take place at Duke University, December 17:
Stephen Fienberg, Carnegie-Mellon University
Michael Jordan, University of California, Berkeley
Christopher Sims, Princeton University
Adrian Smith, University of London
Stephen Stigler, University of Chicago
Sharon Bertsch McGrayne, author of “The Theory That Would Not Die”