Bayesian model choice
(and some alternatives)
Christian P. Robert
Université Paris-Dauphine, IUF, & CREST
http://www.ceremade.dauphine.fr/~xian
November 20, 2010
Outline
Anyone not shocked by the Bayesian theory of inference has not understood it
Senn, BA, 2008
1 Introduction
2 Tests and model choice
3 Incoherent inferences
Vocabulary and concepts
Bayesian inference is a coherent mathematical theory
but I don’t trust it in scientific applications.
Gelman, BA, 2008
1 Introduction
Models
The Bayesian framework
Improper prior distributions
Noninformative prior distributions
2 Tests and model choice
3 Incoherent inferences
Parametric model
Bayesians promote the idea that a multiplicity of parameters can be handled via
hierarchical, typically exchangeable, models, but it seems implausible that this
could really work automatically [instead of] giving reasonable answers using
minimal assumptions.
Gelman, BA, 2008
Observations x1, . . . , xn generated from a probability distribution
fi(xi|θi, x1, . . . , xi−1) = fi(xi|θi, x1:i−1)
x = (x1, . . . , xn) ∼ f(x|θ), θ = (θ1, . . . , θn)
Associated likelihood
ℓ(θ|x) = f(x|θ)
[inverted density & starting point]
Bayes theorem 101
Bayes theorem = Inversion of probabilities
If A and E are events such that P(E) ≠ 0, P(A|E) and P(E|A) are
related by
P(A|E) = P(E|A)P(A) / [P(E|A)P(A) + P(E|Ac)P(Ac)] = P(E|A)P(A) / P(E)
[Thomas Bayes (?)]
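As a quick numerical illustration of this inversion (hypothetical numbers, not from the slides), in Python:

# Bayes inversion with hypothetical numbers: P(A) = 0.01, P(E|A) = 0.95, P(E|Ac) = 0.05
p_A, p_E_given_A, p_E_given_Ac = 0.01, 0.95, 0.05
p_E = p_E_given_A * p_A + p_E_given_Ac * (1 - p_A)
p_A_given_E = p_E_given_A * p_A / p_E
print(p_A_given_E)  # about 0.161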
Bayesian approach
The impact of treating x as a fixed constant
is to increase statistical power as an artefact
Templeton, Molec. Ecol., 2009
New perspective
Uncertainty on the parameters θ of a model modeled through a
probability distribution π on Θ, called prior distribution
Inference based on the distribution of θ conditional on x, π(θ|x),
called posterior distribution
π(θ|x) = f(x|θ)π(θ) / ∫ f(x|θ)π(θ) dθ .
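A minimal sketch of this updating step, using a conjugate normal model (illustrative numbers, not from the slides):

# x | theta ~ N(theta, 1) and prior theta ~ N(0, tau2): the posterior is also normal
tau2, x = 2.0, 1.3
post_var = 1.0 / (1.0 / tau2 + 1.0)   # (1/tau2 + 1/sigma2)^(-1) with sigma2 = 1
post_mean = post_var * x              # shrinks the observation toward the prior mean 0
print(post_mean, post_var)            # about 0.867 and 0.667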
[Nonphilosophical] justifications
Ignoring the sampling error of x undermines
the statistical validity of all inferences made by the method
Templeton, Molec. Ecol., 2009
Semantic drift from unknown to random
Actualization of the information on θ by extracting the information on
θ contained in the observation x
Allows incorporation of imperfect information in the decision process
Unique mathematical way to condition upon the observations
(conditional perspective)
Unique way to give meaning to statements like P(θ > 0)
Posterior distribution
Bayesian methods are presented as an automatic inference engine,
and this raises suspicion in anyone with applied experience
Gelman, BA, 2008
π(θ|x) central to Bayesian inference
Operates conditional upon the observations
Incorporates the requirement of the Likelihood Principle
Avoids averaging over the unobserved values of x
Coherent updating of the information available on θ
Provides a complete inferential machinery
Improper distributions
If we take P(dσ) ∝ dσ as a statement that σ may have any value between 0 and
∞ (...), we must use ∞ instead of 1 to denote certainty.
Jeffreys, ToP, 1939
Necessary extension from a prior distribution to a prior σ-finite measure π
such that
∫Θ π(θ) dθ = +∞
Improper prior distribution
[Weird? Inappropriate?? report!! ]
Justifications
If the parameter may have any value from −∞ to +∞,
its prior probability should be taken as uniformly distributed
Jeffreys, ToP, 1939
Automated prior determination often leads to improper priors
1 Similar performances of estimators derived from these generalized
distributions
2 Improper priors as limits of proper distributions in many
[mathematical] senses
Further justifications
There is no good objective principle for choosing a noninformative prior (even if
that concept were mathematically defined, which it is not)
Gelman, BA, 2008
4 Robust answer against possible misspecifications of the prior
5 Frequentist justifications, such as:
(i) minimaxity
(ii) admissibility
(iii) invariance (Haar measure)
6 Improper priors [much] preferred to vague proper priors like N(0, 10⁶)
Validation
The mistake is to think of them as representing ignorance
Lindley, JASA, 1990
Extension of the posterior distribution π(θ|x) associated with an improper
prior π as given by Bayes’s formula
π(θ|x) = f(x|θ)π(θ) / ∫Θ f(x|θ)π(θ) dθ ,
when
∫Θ f(x|θ)π(θ) dθ < ∞
Delete emotionally loaded names
Noninformative priors
...cannot be expected to represent exactly total ignorance about the problem, but
should rather be taken as reference priors, upon which everyone could fall back
when the prior information is missing.
Kass and Wasserman, JASA, 1996
What if all we know is that we know “nothing” ?!
In the absence of prior information, prior distributions solely derived from
the sample distribution f(x|θ)
Difficulty with uniform priors, lacking invariance properties. Rather use
Jeffreys’ prior.
[Jeffreys, 1939; Robert, Chopin & Rousseau, 2009]
Tests and model choice
The Jeffreys-subjective synthesis betrays a much more dangerous confusion than
the Neyman-Pearson-Fisher synthesis as regards hypothesis tests
Senn, BA, 2008
1 Introduction
2 Tests and model choice
Bayesian tests
Opposition to classical tests
Model choice
Pseudo-Bayes factors
Compatible priors
Variable selection
3 Incoherent inferences
Construction of Bayes tests
What is almost never used, however, is the Jeffreys significance test.
Senn, BA, 2008
Definition (Test)
Given an hypothesis H0 : θ ∈ Θ0 on the parameter θ ∈ Θ0 of a statistical
model, a test is a statistical procedure that takes its values in {0, 1}.
Example (Normal mean)
For x ∼ N (θ, 1), decide whether or not θ ≤ 0.
Decision-theoretic perspective
Loss functions [are] not relevant to statistical inference
Gelman, BA, 2008
Theorem (Optimal Bayes decision)
Under the 0 − 1 loss function
L(θ, d) = 0 if d = IΘ0(θ),  a0 if d = 1 and θ ∉ Θ0,  a1 if d = 0 and θ ∈ Θ0,
the Bayes procedure is
δ^π(x) = 1 if Pr^π(θ ∈ Θ0|x) ≥ a0/(a0 + a1), and δ^π(x) = 0 otherwise
A function of posterior probabilities
The method posits two or more alternative hypotheses and tests their relative fits
to some observed statistics — Templeton, Mol. Ecol., 2009
Definition (Bayes factors)
For hypotheses H0 : θ ∈ Θ0 vs. Ha : θ ∉ Θ0,
B01 = [π(Θ0|x) / π(Θ0ᶜ|x)] ÷ [π(Θ0) / π(Θ0ᶜ)] = ∫Θ0 f(x|θ)π0(θ) dθ / ∫Θ0ᶜ f(x|θ)π1(θ) dθ
[Good, 1958 & Jeffreys, 1961]
pseudo-Bayes factors
Self-contained concept
Having a high relative probability does not mean that a hypothesis is true or
supported by the data — Templeton, Mol. Ecol., 2009
Non-decision-theoretic:
eliminates choice of π(Θ0)
Bayesian/marginal equivalent to the likelihood ratio
Jeffreys’ scale of evidence:
if log10(B^π_10) between 0 and 0.5, evidence against H0 weak,
if log10(B^π_10) between 0.5 and 1, evidence substantial,
if log10(B^π_10) between 1 and 2, evidence strong, and
if log10(B^π_10) above 2, evidence decisive
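A small helper encoding this scale (a sketch; negative values of log10(B^π_10) point toward H0 and are reported here as weak evidence against it):

def jeffreys_evidence(log10_bf10):
    # evidence against H0 on Jeffreys' scale, from log10 of the Bayes factor B10
    if log10_bf10 < 0.5:
        return "weak"
    if log10_bf10 < 1:
        return "substantial"
    if log10_bf10 < 2:
        return "strong"
    return "decisive"

print(jeffreys_evidence(1.3))  # strong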
A major modification
Considering whether a location parameter α is 0. The prior is uniform and we
should have to take f(α) = 0 and B10 would always be infinite
Jeffreys, ToP, 1939
When the null hypothesis is supported by a set of measure 0, π(Θ0) = 0
and thus π(Θ0|x) = 0.
[End of the story?!]
Changing the prior to fit the hypotheses
Given that some logical overlap is common when dealing with complex models,
this means that much of the literature is invalid
Templeton, Trends in Ecology and Evolution, 2010
Requirement
Define prior distributions under both assumptions,
π0(θ) ∝ π(θ)IΘ0 (θ), π1(θ) ∝ π(θ)IΘ1 (θ),
[under the standard dominating measures on Θ0 and Θ1], leading to
π(θ) = ρ0 π0(θ) + ρ1 π1(θ), where ρ0 and ρ1 are the prior weights of Θ0 and Θ1.
Point null hypotheses
I have no patience for statistical methods that assign positive probability to point
hypotheses of the θ = 0 type that can never actually be true
Gelman, BA, 2008
Take ρ0 = Pr^π(θ = θ0) and let g1 be the prior density under Ha. Then
π(Θ0|x) = f(x|θ0)ρ0 / ∫ f(x|θ)π(θ) dθ = f(x|θ0)ρ0 / [f(x|θ0)ρ0 + (1 − ρ0)m1(x)]
and the Bayes factor is
B^π_01(x) = [f(x|θ0)ρ0 / m1(x)(1 − ρ0)] ÷ [ρ0 / (1 − ρ0)] = f(x|θ0) / m1(x)
Point null hypotheses (cont’d)
Example (Normal mean)
Test of H0 : θ = 0 when x ∼ N(θ, 1): we take π1 as N(0, τ²)
m1(x) / f(x|0) = √(σ²/(σ² + τ²)) exp{τ²x² / 2σ²(σ² + τ²)}
and the posterior probability is

τ/x    0      0.68   1.28   1.96
1      0.586  0.557  0.484  0.351
10     0.768  0.729  0.612  0.366
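A numerical check of this example (a sketch: it assumes equal prior weights ρ0 = 1/2 and σ = 1, and it reads the second row of the table as corresponding to a prior variance τ² = 10):

import numpy as np

def post_prob_H0(x, tau2, sigma2=1.0, rho0=0.5):
    # posterior probability of H0: theta = 0 when x ~ N(theta, sigma2), theta ~ N(0, tau2) under Ha
    bf01 = np.sqrt((sigma2 + tau2) / sigma2) * np.exp(-tau2 * x**2 / (2 * sigma2 * (sigma2 + tau2)))
    return 1.0 / (1.0 + (1.0 - rho0) / rho0 / bf01)

for tau2 in (1.0, 10.0):
    print([round(post_prob_H0(x, tau2), 3) for x in (0.0, 0.68, 1.28, 1.96)])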
Comparison with classical tests
The 95 percent frequentist intervals will live up to their advertised coverage
claims — Wasserman, BA, 2008
Standard/classical answer
Definition (p-value)
The p-value p(x) associated with a test is the largest significance level for
which H0 is rejected
Problems with p-values
The use of P implies that a hypothesis that may be true may be rejected because
it had not predicted observable results that have not occurred
Jeffreys, ToP, 1939
Evaluation of the wrong quantity, namely the probability of exceeding
the observed value (wrong conditioning)
Evaluation only under the null hypothesis
Huge numerical difference with the Bayesian range of answers
Bayesian lower bounds
If the Bayes estimator has good frequency behavior
then we might as well use the frequentist method.
If it has bad frequency behavior then we shouldn’t use it.
Wasserman, BA, 2008
Least favourable Bayesian answer is
B(x, GA) = inf_{g∈GA} f(x|θ0) / ∫Θ f(x|θ)g(θ) dθ ,
i.e., if there exists an MLE θ̂(x) for θ,
B(x, GA) = f(x|θ0) / f(x|θ̂(x))
Illustration
Example (Normal case)
When x ∼ N(θ, 1) and H0 : θ = 0, the lower bounds are
B(x, GA) = e^{−x²/2} and P(x, GA) = (1 + e^{x²/2})^{−1} ,
i.e.

p-value   0.10    0.05    0.01    0.001
P         0.205   0.128   0.035   0.004
B         0.256   0.146   0.036   0.004
[Quite different!]
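A quick reproduction of this table (a sketch assuming the p-values come from two-sided z-tests):

import numpy as np
from scipy.stats import norm

for p in (0.10, 0.05, 0.01, 0.001):
    z = norm.ppf(1 - p / 2)           # z statistic matching the two-sided p-value
    B = np.exp(-z**2 / 2)             # lower bound on the Bayes factor
    P = 1 / (1 + np.exp(z**2 / 2))    # lower bound on the posterior probability of H0
    print(f"p = {p:5.3f}  P = {P:.3f}  B = {B:.3f}")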
Model choice and model comparison
There is no null hypothesis, which complicates the computation of sampling error
Templeton, Mol. Ecol., 2009
Choice among models:
Several models available for the same observation(s)
Mi : x ∼ fi(x|θi), i ∈ I
where I can be finite or infinite
Bayesian resolution
The posterior probabilities are constructed by using a numerator that is a function
of the observation for a particular model, then divided by a denominator that
ensures that the ”probabilities” sum to one. — Templeton, Mol. Ecol., 2009
Probabilise the entire model/parameter space
allocate probabilities pi to all models Mi
define priors πi(θi) for each parameter space Θi
compute
π(Mi|x) = pi ∫Θi fi(x|θi)πi(θi) dθi / Σj pj ∫Θj fj(x|θj)πj(θj) dθj
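For instance, once the marginal likelihoods mi(x) = ∫Θi fi(x|θi)πi(θi) dθi are available, the posterior model probabilities follow by a simple normalisation (illustrative numbers only):

import numpy as np

prior_weights = np.array([0.5, 0.3, 0.2])            # p_i (hypothetical)
marginal_liks = np.array([1.2e-4, 3.4e-4, 0.9e-4])   # m_i(x) (hypothetical)
post = prior_weights * marginal_liks
post /= post.sum()                                   # pi(M_i | x)
print(post)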
Bayesian resolution(2)
The numerators are not co-measurable across hypotheses, and the denominators
are sums of non-co-measurable entities. This means that it is mathematically
impossible for them to be probabilities — Templeton, Mol. Ecol., 2009
take largest π(Mi|x) to determine “best” model,
or use the averaged predictive
Σj π(Mj|x) ∫Θj fj(x′|θj)πj(θj|x) dθj
Natural Occam’s razor
Pluralitas non est ponenda sine neccesitate
Variation is random until the contrary
is shown; and new parameters in laws,
when they are suggested, must be
tested one at a time, unless there is
specific reason to the contrary.
Jeffreys, ToP, 1939
The Bayesian approach naturally weights differently models with different
parameter dimensions (BIC being an approximate log-Bayes factor).
A fundamental difficulty
1) ABC can and does produce results that are mathematically impossible;
2) the “posterior probabilities” of ABC cannot possibly be true probability
measures;
and 3) ABC is statistically incoherent.
Templeton, Trends in Ecology and Evolution, 2010
Improper priors are NOT allowed here
If
∫Θ1 π1(dθ1) = ∞ or ∫Θ2 π2(dθ2) = ∞
then either π1 or π2 cannot be coherently normalised, but the normalisation matters in the Bayes factor
Recall Bayes factor
Normal illustration
Take x ∼ N (θ, 1) and H0 : θ = 0
Impact of the constant
x            0.0     1.0     1.65    1.96     2.58
π(θ) = 1     0.285   0.195   0.089   0.055    0.014
π(θ) = 10    0.0384  0.0236  0.0101  0.00581  0.00143
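A sketch of the mechanism behind this table (assuming equal prior weights on H0: θ = 0 and its complement, and a flat prior equal to the constant c on the alternative, so that m1(x) = c; the slide's exact values may use slightly different settings):

import numpy as np
from scipy.stats import norm

def post_prob_H0_flat(x, c, rho0=0.5):
    # pi(H0 | x) when x ~ N(theta, 1), H0: theta = 0, and pi(theta) = c under the alternative
    f0 = norm.pdf(x)   # f(x | theta = 0)
    return rho0 * f0 / (rho0 * f0 + (1 - rho0) * c)

for c in (1.0, 10.0):
    print([round(post_prob_H0_flat(x, c), 4) for x in (0.0, 1.0, 1.65, 1.96, 2.58)])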
Vague proper priors are NOT the solution
Taking a proper prior with a “very large” variance (e.g., BUGS) will
most often result in an undefined or ill-defined limit
Example (Lindley’s paradox)
If testing H0 : θ = 0 when observing x ∼ N(θ, 1), under a normal N(0, α)
prior π1(θ),
B01(x) −→ ∞ as α −→ ∞
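A numerical sketch of the paradox: the Bayes factor B01 of H0 against the N(0, α) alternative grows without bound as the prior variance α increases.

import numpy as np
from scipy.stats import norm

def B01(x, alpha):
    # Bayes factor of H0: theta = 0 against theta ~ N(0, alpha), with x ~ N(theta, 1)
    return norm.pdf(x, 0, 1) / norm.pdf(x, 0, np.sqrt(1 + alpha))

x = 1.96
for alpha in (1, 10, 100, 1e4, 1e6):
    print(alpha, B01(x, alpha))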
Learning from the sample
It is possible for data to discriminate among a set of hypotheses without saying
anything about a proposition that is common to all the alternatives considered.
Sober, Evidence and Evolution, 2008
Definition (Learning sample)
Given an improper prior π, (x1, . . . , xn) is a learning sample if
π(·|x1, . . . , xn) is proper and a minimal learning sample if none of its
subsamples is a learning sample
There is just enough information in a minimal learning sample to make
inference about θ under the prior π
Pseudo-Bayes factors
Idea
Use a first part x[i] of the data x to make the prior proper:
πi improper but πi(·|x[i]) proper
and
∫ fi(x[n/i]|θi) πi(θi|x[i]) dθi / ∫ fj(x[n/i]|θj) πj(θj|x[i]) dθj
independent of normalizing constant
Use remaining part x[n/i] to run test as if πj(θj|x[i]) was the true prior
Motivation
Provides a working principle for improper priors
Gather enough information from data to achieve properness
and use this properness to run the test on remaining data
does not use the data x twice as in Aitkin’s (1991,2010)
Back later!
Fractional Bayes factor
To test a theory, you need to test it against alternatives.
Sober, Evidence and Evolution, 2008
Idea
use directly the likelihood to separate training sample from testing sample
B^F_12 = B12(x) × ∫ L2^b(θ2)π2(θ2) dθ2 / ∫ L1^b(θ1)π1(θ1) dθ1
[O’Hagan, 1995]
Proportion b of the sample used to gain proper-ness
Fractional Bayes factor (cont’d)
Example (Normal mean)
B^F_12 = (1/√b) exp{n(b − 1)x̄²/2}
corresponds to the exact Bayes factor for the prior N(0, (1 − b)/nb)
If b constant, prior variance goes to 0
If b = 1/n, prior variance stabilises around 1
If b = n^{−α}, α < 1, prior variance goes to 0 too.
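A numerical check (a sketch) that the fractional Bayes factor above coincides with the exact Bayes factor under the N(0, (1 − b)/nb) prior, here with hypothetical values n = 33, x̄ = 0.4 and b = 1/n:

import numpy as np

def fractional_bf_12(xbar, n, b):
    # fractional Bayes factor of H0: theta = 0 against a flat prior on theta (formula above)
    return np.exp(n * (b - 1) * xbar**2 / 2) / np.sqrt(b)

def exact_bf_12(xbar, n, tau2):
    # exact Bayes factor of H0: theta = 0 against theta ~ N(0, tau2), using xbar ~ N(theta, 1/n)
    s0, s1 = 1 / n, 1 / n + tau2
    return np.sqrt(s1 / s0) * np.exp(-xbar**2 / (2 * s0) + xbar**2 / (2 * s1))

n, xbar, b = 33, 0.4, 1 / 33
print(fractional_bf_12(xbar, n, b))
print(exact_bf_12(xbar, n, (1 - b) / (n * b)))   # same value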
Compatibility principle
Further complicating dimensionality of test statistics is the fact that the models
are often not nested, and one model may contain parameters that do not have
analogues in the other models and vice versa
Templeton, Mol. Ecol., 2009
Difficulty of finding simultaneously priors on a collection of models
Easier to start from a single prior on a “big” [encompassing] model and to
derive others from a coherence principle
[Dawid & Lauritzen, 2000]
Raw regression output
An illustration for linear regression
In the case M1 and M2 are two nested Gaussian linear regression models
with Zellner’s g-priors and the same variance σ2 ∼ π(σ2):
M1 : y|β1, σ² ∼ N(X1β1, σ²In) with
β1|σ² ∼ N(s1, σ² n1 (X1ᵀX1)⁻¹),
where X1 is a (n × k1) matrix of rank k1 ≤ n
M2 : y|β2, σ² ∼ N(X2β2, σ²In) with
β2|σ² ∼ N(s2, σ² n2 (X2ᵀX2)⁻¹),
where X2 is a (n × k2) matrix with span(X2) ⊆ span(X1)
[© Marin & Robert, Bayesian Core]
Compatible g-priors
I don’t see any role for squared error loss, minimax, or the rest of what is
sometimes called statistical decision theory
Gelman, BA, 2008
Since σ2 is a nuisance parameter, minimize the Kullback-Leibler
divergence between both marginal distributions conditional on σ2:
m1(y|σ2; s1, n1) and m2(y|σ2; s2, n2), with solution
β2|X2, σ² ∼ N(s2*, σ² n2* (X2ᵀX2)⁻¹)
with
s2* = (X2ᵀX2)⁻¹ X2ᵀ X1 s1,  n2* = n1
Symmetrised compatible priors
If those prior probabilities are obscure, the same will be true of the posterior
probabilities — Sober, Evidence and Evolution, 2008
Postulate: Previous principle requires embedded models (or an
encompassing model) and proper priors, while being hard to implement
outside exponential families
We determine prior measures on two models M1 and M2, π1 and π2,
directly by a compatibility principle.
Generalised expected posterior priors
[Pérez & Berger, 2000]
EPP Principle
Starting from reference priors π1^N and π2^N, substitute prior distributions
π1 and π2 that solve the system of integral equations
π1(θ1) = ∫X π1^N(θ1 | x) m2(x) dx
and
π2(θ2) = ∫X π2^N(θ2 | x) m1(x) dx,
where x is an imaginary minimal training sample and m1, m2 are the
marginals associated with π1 and π2 respectively.
Motivations
Eliminates the “imaginary observation” device and proper-isation
through part of the data by integration under the “truth”
Assumes that both models are equally valid and equipped with ideal
unknown priors
πi, i = 1, 2,
that yield “true” marginals balancing each model wrt the other
For a given π1, π2 is an expected posterior prior
Using both equations introduces symmetry into the game
Bayesian coherence
Logical overlap is the norm for the complex models analyzed with ABC, so many
ABC posterior model probabilities published to date are wrong.
Templeton, PNAS, 2009
Theorem (True Bayes factor)
If π1 and π2 are the EPPs and if their marginals are finite, then the
corresponding Bayes factor
B1,2(x)
is either a (true) Bayes factor or a limit of (true) Bayes factors.
Obviously only interesting when both π1 and π2 are improper.
Variable selection
Regression setup where y regressed on a set {x1, . . . , xp} of p potential
explanatory regressors (plus intercept)
Corresponding 2^p submodels Mγ, where γ ∈ Γ = {0, 1}^p indicates
inclusion/exclusion of variables by a binary representation,
e.g. γ = 101001011 means that x1, x3, x5, x7 and x8 are included.
Notations
For model Mγ,
qγ variables included
t1(γ) = {t1,1(γ), . . . , t1,qγ (γ)} indices of those variables and t0(γ)
indices of the variables not included
For β ∈ R^{p+1},
βt1(γ) = (β0, βt1,1(γ), . . . , βt1,qγ(γ))
Xt1(γ) = [1n | xt1,1(γ) | . . . | xt1,qγ(γ)] .
Submodel Mγ is thus
y|β, γ, σ² ∼ N(Xt1(γ) βt1(γ), σ² In)
Global and compatible priors
Use Zellner’s g-prior, i.e. a normal prior for β conditional on σ²,
β|σ² ∼ N(β̃, cσ²(XᵀX)⁻¹)
and a Jeffreys prior for σ²,
π(σ²) ∝ σ⁻²
Noninformative g
Resulting compatible prior
βt1(γ) ∼ N( (Xt1(γ)ᵀXt1(γ))⁻¹ Xt1(γ)ᵀX β̃ , cσ² (Xt1(γ)ᵀXt1(γ))⁻¹ )
Posterior model probability
Can be obtained in closed form:
π(γ|y) ∝ (c + 1)^{−(qγ+1)/2} [ yᵀy − c yᵀP1y/(c + 1) + β̃ᵀXᵀP1Xβ̃/(c + 1) − 2yᵀP1Xβ̃/(c + 1) ]^{−n/2} .
Conditionally on γ, posterior distributions of β and σ²:
βt1(γ)|σ², y, γ ∼ N( c/(c + 1) (U1y + U1Xβ̃/c) , σ² c/(c + 1) (Xt1(γ)ᵀXt1(γ))⁻¹ ),
σ²|y, γ ∼ IG( n/2 , yᵀy/2 − c yᵀP1y/(2(c + 1)) + β̃ᵀXᵀP1Xβ̃/(2(c + 1)) − yᵀP1Xβ̃/(c + 1) ).
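A minimal sketch of how π(γ|y) can be evaluated in practice, assuming β̃ = 0 (so only the first two terms remain) and taking P1 as the orthogonal projection onto the columns of Xt1(γ), intercept included; the data and settings below are illustrative only:

import numpy as np
from itertools import combinations

def log_post_gamma(y, X1, c):
    # log pi(gamma | y) up to an additive constant, for beta_tilde = 0;
    # X1 stacks the intercept column and the covariates selected by gamma
    n, k = X1.shape                                      # k = q_gamma + 1
    P1y = X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]     # P1 y, the projection of y
    quad = y @ y - c / (c + 1) * y @ P1y
    return -k / 2 * np.log(c + 1) - n / 2 * np.log(quad)

rng = np.random.default_rng(0)
n, p, c = 30, 3, 100
X = rng.normal(size=(n, p))
y = 1 + 2 * X[:, 0] + rng.normal(size=n)
models, logps = [], []
for q in range(p + 1):
    for idx in combinations(range(p), q):
        X1 = np.column_stack([np.ones(n)] + [X[:, j] for j in idx])
        models.append(idx)
        logps.append(log_post_gamma(y, X1, c))
probs = np.exp(np.array(logps) - max(logps))
probs /= probs.sum()
for m, pr in zip(models, probs):
    print(m, round(pr, 3))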
Noninformative case
Use the same compatible informative g-prior distribution with ˜β = 0p+1
and a hierarchical diffuse prior distribution on c,
π(c) ∝ c⁻¹ I_{N*}(c)  or  π(c) ∝ c⁻¹ I_{c>0}
Recall g-prior
The choice of this hierarchical diffuse prior distribution on c is due to the
model posterior sensitivity to large values of c:
Taking ˜β = 0p+1 and c large does not work
Processionary caterpillar
Influence of some forest settlement characteristics on the development of
caterpillar colonies
Response y log-transform of the average number of nests of caterpillars
per tree on an area of 500 square meters (n = 33 areas)
[© Marin & Robert, Bayesian Core]
Processionary caterpillar (cont’d)
Potential explanatory variables
x1 altitude (in meters), x2 slope (in degrees),
x3 number of pines in the square,
x4 height (in meters) of the tree at the center of the square,
x5 diameter of the tree at the center of the square,
x6 index of the settlement density,
x7 orientation of the square (from 1 if southbound to 2 otherwise),
x8 height (in meters) of the dominant tree,
x9 number of vegetation strata,
x10 mix settlement index (from 1 if not mixed to 2 if mixed).
Bayesian regression output
Estimate BF log10(BF)
(Intercept) 9.2714 26.334 1.4205 (***)
X1 -0.0037 7.0839 0.8502 (**)
X2 -0.0454 3.6850 0.5664 (**)
X3 0.0573 0.4356 -0.3609
X4 -1.0905 2.8314 0.4520 (*)
X5 0.1953 2.5157 0.4007 (*)
X6 -0.3008 0.3621 -0.4412
X7 -0.2002 0.3627 -0.4404
X8 0.1526 0.4589 -0.3383
X9 -1.0835 0.9069 -0.0424
X10 -0.3651 0.4132 -0.3838
evidence against H0: (****) decisive, (***) strong, (**) substantial,
(*) poor
Bayesian variable selection
t1(γ) π(γ|y, X)
0,1,2,4,5 0.0929
0,1,2,4,5,9 0.0325
0,1,2,4,5,10 0.0295
0,1,2,4,5,7 0.0231
0,1,2,4,5,8 0.0228
0,1,2,4,5,6 0.0228
0,1,2,3,4,5 0.0224
0,1,2,3,4,5,9 0.0167
0,1,2,4,5,6,9 0.0167
0,1,2,4,5,8,9 0.0137
Noninformative G-prior model choice
Fringe alternatives
1 Introduction
2 Tests and model choice
3 Incoherent inferences
Templeton’s debate
Bayes/likelihood fusion
A revealing confusion
In statistics, coherent measures of fit of nested and overlapping composite
hypotheses are technically those measures that are consistent with the constraints
of formal logic. For example, the probability of the nested special case must be
less than or equal to the probability of the general model within which the special
case is nested. Any statistic that assigns greater probability to the special case is
said to be incoherent.
Templeton, PNAS, 2009
ABC algorithm
Instead of evaluating hypotheses in terms of how probable they say the data are,
we evaluate them by estimating how accurately they’ll predict new data when
fitted to old — Sober, Evidence and Evolution, 2008
Algorithm 1 Likelihood-free rejection sampler
for i = 1 to N do
repeat
generate θ′ from the prior distribution π(·)
generate z from the likelihood f(·|θ′)
until ρ{η(z), η(y)} ≤ ε
set θi = θ′
end for
where η(y) defines a (not necessarily sufficient) statistic
[Pritchard et al., 1999]
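A minimal Python sketch of this rejection sampler, on a toy normal-mean problem (the model, summary statistic, and tolerance below are illustrative choices, not taken from the slides):

import numpy as np

rng = np.random.default_rng(0)

def abc_rejection(y, prior_sim, likelihood_sim, summary, rho, eps, N):
    # likelihood-free rejection sampler following Algorithm 1
    s_obs, samples = summary(y), []
    while len(samples) < N:
        theta = prior_sim()                  # theta' from the prior
        z = likelihood_sim(theta)            # z from the likelihood f(.|theta')
        if rho(summary(z), s_obs) <= eps:    # keep theta' if the summaries are close enough
            samples.append(theta)
    return np.array(samples)

y = rng.normal(1.5, 1.0, size=20)            # pseudo-observed data
post = abc_rejection(
    y,
    prior_sim=lambda: rng.normal(0.0, np.sqrt(10.0)),
    likelihood_sim=lambda th: rng.normal(th, 1.0, size=y.size),
    summary=np.mean,
    rho=lambda a, b: abs(a - b),
    eps=0.1,
    N=500,
)
print(post.mean(), post.std())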
ABC output
The likelihood-free algorithm samples from the marginal in z of:
πε(θ, z|y) = π(θ) f(z|θ) I_{Aε,y}(z) / ∫_{Aε,y×Θ} π(θ) f(z|θ) dz dθ ,
where Aε,y = {z ∈ D | ρ(η(z), η(y)) < ε}.
The idea behind ABC is that the summary statistics coupled with a small
tolerance should provide a good approximation of the posterior
distribution:
πε(θ|y) = ∫ πε(θ, z|y) dz ≈ π(θ|y) .
The ”Great ABC controversy”
On-going controversy in phylogeographic genetics about the validity of
using ABC for testing
Against: Templeton, 2008, 2009, 2010a, 2010b, 2010c argues that nested
hypotheses cannot have higher probabilities than nesting hypotheses (!)

The probability of the nested special case must be less than or equal to the
probability of the general model within which the special case is nested. Any
statistic that assigns greater probability to the special case is incoherent. An
example of incoherence is shown for the ABC method.
Templeton, PNAS, 2010

Incoherent methods, such as ABC, Bayes factor, or any simulation approach that
treats all hypotheses as mutually exclusive, should never be used with logically
overlapping hypotheses.
Templeton, PNAS, 2010

The central equation of ABC
P(Hi|H, S*) = Gi(||Si − S*||) Πi / Σ_{j=1}^n Gj(||Sj − S*||) Πj
is inherently incoherent. This fundamental equation is mathematically incorrect
in every instance of overlap.
Templeton, PNAS, 2010

Replies: Fagundes et al., 2008, Beaumont et al., 2010, Berger et al., 2010,
Csilléry et al., 2010 point out that the criticisms are addressed at [Bayesian]
model-based inference and have nothing to do with ABC...

ABC is a statistically valid approach, alongside other computational statistical
techniques that have been successfully used to infer parameters and compare
models in population genetics.
Beaumont et al., Molec. Ecology, 2010

The confusion seems to arise from misunderstanding the difference between
scientific hypotheses and their mathematical representation. Bayes’ theorem
shows that the simpler model can indeed have a much higher posterior probability.
Berger et al., PNAS, 2010
Aitkin’s alternative
Without a specific alternative, the best we can do is to
make posterior probability statements about µ and transfer
these to the posterior distribution of the likelihood ratio.
Aitkin, Statistical Inference, 2010
Proposal to examine the posterior distribution of the likelihood function:
compare models via the “posterior distribution” of the likelihood ratio
L1(θ1|x) / L2(θ2|x) ,
with θ1 ∼ π1(θ1|x) and θ2 ∼ π2(θ2|x).
Using the data “twice”
A persistent criticism of the posterior likelihood approach has been based
on the claim that these approaches are ‘using the data twice’, or are
‘violating temporal coherence’ — Aitkin, Statistical Inference, 2010
Complete separation between the two models due to simulation under the
product of the posterior distributions, i.e. it replaces standard Bayesian
inference under the joint posterior of (θ1, θ2),
p1 m1(x) π1(θ1|x) π2(θ2) + p2 m2(x) π2(θ2|x) π1(θ1)
by the product of both posteriors
Illustration
Comparison of a Poisson model against a negative binomial with m = 5
successes, when x = 3.
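A simulation sketch of this comparison under Aitkin's proposal (the priors are hypothetical conjugate choices, λ ~ Exp(1) and p ~ U(0, 1), not taken from the slides): it draws from each posterior separately and looks at the distribution of the likelihood ratio.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x, m, N = 3, 5, 100_000

lam = rng.gamma(shape=x + 1, scale=1 / 2, size=N)   # posterior of lambda under an Exp(1) prior
p = rng.beta(m + 1, x + 1, size=N)                  # posterior of p under a U(0,1) prior

L1 = stats.poisson.pmf(x, lam)                      # Poisson likelihood at x = 3
L2 = stats.nbinom.pmf(x, m, p)                      # negative binomial (m = 5 successes) likelihood
print("P(L1/L2 > 1 | x) ≈", np.mean(L1 / L2 > 1))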
Pros ...
This quite small change to standard Bayesian analysis allows a very
general approach to a wide range of apparently different inference
problems; a particular advantage of the approach is that it can use the
same noninformative priors — Aitkin, Statistical Inference, 2010
the approach is general and makes it possible to resolve the difficulties with the
Bayesian processing of point null hypotheses;
the approach allows for the use of generic noninformative and
improper priors;
the approach handles more naturally the “vexed question of model
fit”;
the approach is “simple”.
... & cons
The p-value is equal to the posterior probability that the likelihood ratio,
for null hypothesis to alternative, is greater than 1 (...) The posterior
probability is p that the posterior probability of H0 is greater than 0.5.
Aitkin, Statistical Inference, 2010
the approach is not Bayesian (product of the posteriors)
the approach uses indeterminate entities (“posterior probability that
the posterior probability is larger than 0.5”...)
the approach tries to get as close as possible to the p-value
 
Boston talk
Boston talkBoston talk
Boston talk
 
(Approximate) Bayesian computation as a new empirical Bayes (something)?
(Approximate) Bayesian computation as a new empirical Bayes (something)?(Approximate) Bayesian computation as a new empirical Bayes (something)?
(Approximate) Bayesian computation as a new empirical Bayes (something)?
 
An overview of Bayesian testing
An overview of Bayesian testingAn overview of Bayesian testing
An overview of Bayesian testing
 
JSM 2011 round table
JSM 2011 round tableJSM 2011 round table
JSM 2011 round table
 
JSM 2011 round table
JSM 2011 round tableJSM 2011 round table
JSM 2011 round table
 
ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
 
Statistics symposium talk, Harvard University
Statistics symposium talk, Harvard UniversityStatistics symposium talk, Harvard University
Statistics symposium talk, Harvard University
 
Varese italie #2
Varese italie #2Varese italie #2
Varese italie #2
 
On the vexing dilemma of hypothesis testing and the predicted demise of the B...
On the vexing dilemma of hypothesis testing and the predicted demise of the B...On the vexing dilemma of hypothesis testing and the predicted demise of the B...
On the vexing dilemma of hypothesis testing and the predicted demise of the B...
 
MaxEnt 2009 talk
MaxEnt 2009 talkMaxEnt 2009 talk
MaxEnt 2009 talk
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
Jsm09 talk
Jsm09 talkJsm09 talk
Jsm09 talk
 
Slides econometrics-2018-graduate-3
Slides econometrics-2018-graduate-3Slides econometrics-2018-graduate-3
Slides econometrics-2018-graduate-3
 

Plus de Christian Robert

Asymptotics of ABC, lecture, Collège de France
Asymptotics of ABC, lecture, Collège de FranceAsymptotics of ABC, lecture, Collège de France
Asymptotics of ABC, lecture, Collège de FranceChristian Robert
 
Workshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael MartinWorkshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael MartinChristian Robert
 
How many components in a mixture?
How many components in a mixture?How many components in a mixture?
How many components in a mixture?Christian Robert
 
Testing for mixtures at BNP 13
Testing for mixtures at BNP 13Testing for mixtures at BNP 13
Testing for mixtures at BNP 13Christian Robert
 
Inferring the number of components: dream or reality?
Inferring the number of components: dream or reality?Inferring the number of components: dream or reality?
Inferring the number of components: dream or reality?Christian Robert
 
Testing for mixtures by seeking components
Testing for mixtures by seeking componentsTesting for mixtures by seeking components
Testing for mixtures by seeking componentsChristian Robert
 
discussion on Bayesian restricted likelihood
discussion on Bayesian restricted likelihooddiscussion on Bayesian restricted likelihood
discussion on Bayesian restricted likelihoodChristian Robert
 
NCE, GANs & VAEs (and maybe BAC)
NCE, GANs & VAEs (and maybe BAC)NCE, GANs & VAEs (and maybe BAC)
NCE, GANs & VAEs (and maybe BAC)Christian Robert
 
Coordinate sampler : A non-reversible Gibbs-like sampler
Coordinate sampler : A non-reversible Gibbs-like samplerCoordinate sampler : A non-reversible Gibbs-like sampler
Coordinate sampler : A non-reversible Gibbs-like samplerChristian Robert
 
Likelihood-free Design: a discussion
Likelihood-free Design: a discussionLikelihood-free Design: a discussion
Likelihood-free Design: a discussionChristian Robert
 
CISEA 2019: ABC consistency and convergence
CISEA 2019: ABC consistency and convergenceCISEA 2019: ABC consistency and convergence
CISEA 2019: ABC consistency and convergenceChristian Robert
 
a discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment models
a discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment modelsa discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment models
a discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment modelsChristian Robert
 

Plus de Christian Robert (20)

Asymptotics of ABC, lecture, Collège de France
Asymptotics of ABC, lecture, Collège de FranceAsymptotics of ABC, lecture, Collège de France
Asymptotics of ABC, lecture, Collège de France
 
Workshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael MartinWorkshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael Martin
 
discussion of ICML23.pdf
discussion of ICML23.pdfdiscussion of ICML23.pdf
discussion of ICML23.pdf
 
How many components in a mixture?
How many components in a mixture?How many components in a mixture?
How many components in a mixture?
 
restore.pdf
restore.pdfrestore.pdf
restore.pdf
 
Testing for mixtures at BNP 13
Testing for mixtures at BNP 13Testing for mixtures at BNP 13
Testing for mixtures at BNP 13
 
Inferring the number of components: dream or reality?
Inferring the number of components: dream or reality?Inferring the number of components: dream or reality?
Inferring the number of components: dream or reality?
 
CDT 22 slides.pdf
CDT 22 slides.pdfCDT 22 slides.pdf
CDT 22 slides.pdf
 
Testing for mixtures by seeking components
Testing for mixtures by seeking componentsTesting for mixtures by seeking components
Testing for mixtures by seeking components
 
discussion on Bayesian restricted likelihood
discussion on Bayesian restricted likelihooddiscussion on Bayesian restricted likelihood
discussion on Bayesian restricted likelihood
 
NCE, GANs & VAEs (and maybe BAC)
NCE, GANs & VAEs (and maybe BAC)NCE, GANs & VAEs (and maybe BAC)
NCE, GANs & VAEs (and maybe BAC)
 
Coordinate sampler : A non-reversible Gibbs-like sampler
Coordinate sampler : A non-reversible Gibbs-like samplerCoordinate sampler : A non-reversible Gibbs-like sampler
Coordinate sampler : A non-reversible Gibbs-like sampler
 
eugenics and statistics
eugenics and statisticseugenics and statistics
eugenics and statistics
 
ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
 
asymptotics of ABC
asymptotics of ABCasymptotics of ABC
asymptotics of ABC
 
ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
 
Likelihood-free Design: a discussion
Likelihood-free Design: a discussionLikelihood-free Design: a discussion
Likelihood-free Design: a discussion
 
the ABC of ABC
the ABC of ABCthe ABC of ABC
the ABC of ABC
 
CISEA 2019: ABC consistency and convergence
CISEA 2019: ABC consistency and convergenceCISEA 2019: ABC consistency and convergence
CISEA 2019: ABC consistency and convergence
 
a discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment models
a discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment modelsa discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment models
a discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment models
 

Bayesian model choice (and some alternatives)

  • 10. [Nonphilosophical] justifications Ignoring the sampling error of x undermines the statistical validity of all inferences made by the method Templeton, Molec. Ecol., 2009 Semantic drift from unknown to random Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 7 / 64
  • 11. [Nonphilosophical] justifications Ignoring the sampling error of x undermines the statistical validity of all inferences made by the method Templeton, Molec. Ecol., 2009 Semantic drift from unknown to random Actualization of the information on θ by extracting the information on θ contained in the observation x Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 7 / 64
  • 12. [Nonphilosophical] justifications Ignoring the sampling error of x undermines the statistical validity of all inferences made by the method Templeton, Molec. Ecol., 2009 Semantic drift from unknown to random Actualization of the information on θ by extracting the information on θ contained in the observation x Allows incorporation of imperfect information in the decision process Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 7 / 64
  • 13. [Nonphilosophical] justifications Ignoring the sampling error of x undermines the statistical validity of all inferences made by the method Templeton, Molec. Ecol., 2009 Semantic drift from unknown to random Actualization of the information on θ by extracting the information on θ contained in the observation x Allows incorporation of imperfect information in the decision process Unique mathematical way to condition upon the observations (conditional perspective) Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 7 / 64
  • 14. [Nonphilosophical] justifications Ignoring the sampling error of x undermines the statistical validity of all inferences made by the method Templeton, Molec. Ecol., 2009 Semantic drift from unknown to random Actualization of the information on θ by extracting the information on θ contained in the observation x Allows incorporation of imperfect information in the decision process Unique mathematical way to condition upon the observations (conditional perspective) Unique way to give meaning to statements like P(θ > 0) Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 7 / 64
  • 15. Posterior distribution Bayesian methods are presented as an automatic inference engine, and this raises suspicion in anyone with applied experience Gelman, BA, 2008 π(θ|x) central to Bayesian inference Operates conditional upon the observations Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 8 / 64
  • 16. Posterior distribution Bayesian methods are presented as an automatic inference engine, and this raises suspicion in anyone with applied experience Gelman, BA, 2008 π(θ|x) central to Bayesian inference Operates conditional upon the observations Incorporates the requirement of the Likelihood Principle Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 8 / 64
  • 17. Posterior distribution Bayesian methods are presented as an automatic inference engine, and this raises suspicion in anyone with applied experience Gelman, BA, 2008 π(θ|x) central to Bayesian inference Operates conditional upon the observations Incorporates the requirement of the Likelihood Principle Avoids averaging over the unobserved values of x Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 8 / 64
  • 18. Posterior distribution Bayesian methods are presented as an automatic inference engine, and this raises suspicion in anyone with applied experience Gelman, BA, 2008 π(θ|x) central to Bayesian inference Operates conditional upon the observations Incorporates the requirement of the Likelihood Principle Avoids averaging over the unobserved values of x Coherent updating of the information available on θ Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 8 / 64
  • 19. Posterior distribution Bayesian methods are presented as an automatic inference engine, and this raises suspicion in anyone with applied experience Gelman, BA, 2008 π(θ|x) central to Bayesian inference Operates conditional upon the observations Incorporates the requirement of the Likelihood Principle Avoids averaging over the unobserved values of x Coherent updating of the information available on θ Provides a complete inferential machinery Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 8 / 64
  • 20. Improper distributions If we take P(dσ) ∝ dσ as a statement that σ may have any value between 0 and ∞ (...), we must use ∞ instead of 1 to denote certainty. Jeffreys, ToP, 1939 Necessary extension from a prior distribution to a prior σ-finite measure π such that ∫Θ π(θ) dθ = +∞ Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 9 / 64
  • 21. Improper distributions If we take P(dσ) ∝ dσ as a statement that σ may have any value between 0 and ∞ (...), we must use ∞ instead of 1 to denote certainty. Jeffreys, ToP, 1939 Necessary extension from a prior distribution to a prior σ-finite measure π such that ∫Θ π(θ) dθ = +∞ Improper prior distribution Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 9 / 64
  • 22. Improper distributions If we take P(dσ) ∝ dσ as a statement that σ may have any value between 0 and ∞ (...), we must use ∞ instead of 1 to denote certainty. Jeffreys, ToP, 1939 Necessary extension from a prior distribution to a prior σ-finite measure π such that ∫Θ π(θ) dθ = +∞ Improper prior distribution [Weird? Inappropriate?? report!! ] Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 9 / 64
  • 23. Justifications If the parameter may have any value from −∞ to +∞, its prior probability should be taken as uniformly distributed Jeffreys, ToP, 1939 Automated prior determination often leads to improper priors 1 Similar performances of estimators derived from these generalized distributions Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 10 / 64
  • 24. Justifications If the parameter may have any value from −∞ to +∞, its prior probability should be taken as uniformly distributed Jeffreys, ToP, 1939 Automated prior determination often leads to improper priors 1 Similar performances of estimators derived from these generalized distributions 2 Improper priors as limits of proper distributions in many [mathematical] senses Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 10 / 64
  • 25. Further justifications There is no good objective principle for choosing a noninformative prior (even if that concept were mathematically defined, which it is not) Gelman, BA, 2008 4 Robust answer against possible misspecifications of the prior Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 11 / 64
  • 26. Further justifications There is no good objective principle for choosing a noninformative prior (even if that concept were mathematically defined, which it is not) Gelman, BA, 2008 4 Robust answer against possible misspecifications of the prior 5 Frequentist justifications, such as: (i) minimaxity (ii) admissibility (iii) invariance (Haar measure) Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 11 / 64
  • 27. Further justifications There is no good objective principle for choosing a noninformative prior (even if that concept were mathematically defined, which it is not) Gelman, BA, 2008 4 Robust answer against possible misspecifications of the prior 5 Frequentist justifications, such as: (i) minimaxity (ii) admissibility (iii) invariance (Haar measure) 6 Improper priors [much] preferred to vague proper priors like N(0, 10^6) Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 11 / 64
  • 28. Validation The mistake is to think of them as representing ignorance Lindley, JASA, 1990 Extension of the posterior distribution π(θ|x) associated with an improper prior π as given by Bayes’s formula π(θ|x) = f(x|θ)π(θ) / ∫Θ f(x|θ)π(θ) dθ , Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 12 / 64
  • 29. Validation The mistake is to think of them as representing ignorance Lindley, JASA, 1990 Extension of the posterior distribution π(θ|x) associated with an improper prior π as given by Bayes’s formula π(θ|x) = f(x|θ)π(θ) / ∫Θ f(x|θ)π(θ) dθ , when ∫Θ f(x|θ)π(θ) dθ < ∞ Delete emotionally loaded names Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 12 / 64
  • 30. Noninformative priors ...cannot be expected to represent exactly total ignorance about the problem, but should rather be taken as reference priors, upon which everyone could fall back when the prior information is missing. Kass and Wasserman, JASA, 1996 What if all we know is that we know “nothing” ?! Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 13 / 64
  • 31. Noninformative priors ...cannot be expected to represent exactly total ignorance about the problem, but should rather be taken as reference priors, upon which everyone could fall back when the prior information is missing. Kass and Wasserman, JASA, 1996 What if all we know is that we know “nothing” ?! In the absence of prior information, prior distributions solely derived from the sample distribution f(x|θ) Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 13 / 64
  • 32. Noninformative priors ...cannot be expected to represent exactly total ignorance about the problem, but should rather be taken as reference priors, upon which everyone could fall back when the prior information is missing. Kass and Wasserman, JASA, 1996 What if all we know is that we know “nothing” ?! In the absence of prior information, prior distributions solely derived from the sample distribution f(x|θ) Difficulty with uniform priors, lacking invariance properties. Rather use Jeffreys’ prior. [Jeffreys, 1939; Robert, Chopin & Rousseau, 2009] Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 13 / 64
  • 33. Tests and model choice The Jeffreys-subjective synthesis betrays a much more dangerous confusion than the Neyman-Pearson-Fisher synthesis as regards hypothesis tests Senn, BA, 2008 1 Introduction 2 Tests and model choice Bayesian tests Opposition to classical tests Model choice Pseudo-Bayes factors Compatible priors Variable selection 3 Incoherent inferences Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 14 / 64
  • 34. Construction of Bayes tests What is almost never used, however, is the Jeffreys significance test. Senn, BA, 2008 Definition (Test) Given an hypothesis H0 : θ ∈ Θ0 on the parameter θ ∈ Θ0 of a statistical model, a test is a statistical procedure that takes its values in {0, 1}. Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 15 / 64
  • 35. Construction of Bayes tests What is almost never used, however, is the Jeffreys significance test. Senn, BA, 2008 Definition (Test) Given an hypothesis H0 : θ ∈ Θ0 on the parameter θ ∈ Θ0 of a statistical model, a test is a statistical procedure that takes its values in {0, 1}. Example (Normal mean) For x ∼ N (θ, 1), decide whether or not θ ≤ 0. Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 15 / 64
  • 36. Decision-theoretic perspective Loss functions [are] not relevant to statistical inference Gelman, BA, 2008 Theorem (Optimal Bayes decision) Under the 0 − 1 loss function L(θ, d) = 0 if d = I_{Θ0}(θ), a0 if d = 1 and θ ∉ Θ0, a1 if d = 0 and θ ∈ Θ0 Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 16 / 64
  • 37. Decision-theoretic perspective Loss functions [are] not relevant to statistical inference Gelman, BA, 2008 Theorem (Optimal Bayes decision) Under the 0 − 1 loss function L(θ, d) = 0 if d = I_{Θ0}(θ), a0 if d = 1 and θ ∉ Θ0, a1 if d = 0 and θ ∈ Θ0 the Bayes procedure is δ^π(x) = 1 if Pr^π(θ ∈ Θ0|x) ≥ a0/(a0 + a1) and 0 otherwise Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 16 / 64
  • 38. A function of posterior probabilities The method posits two or more alternative hypotheses and tests their relative fits to some observed statistics — Templeton, Mol. Ecol., 2009 Definition (Bayes factors) For hypotheses H0 : θ ∈ Θ0 vs. Ha : θ ∉ Θ0 B01 = {π(Θ0|x)/π(Θ0^c|x)} / {π(Θ0)/π(Θ0^c)} = ∫_{Θ0} f(x|θ)π0(θ)dθ / ∫_{Θ0^c} f(x|θ)π1(θ)dθ [Good, 1958 & Jeffreys, 1961] pseudo-Bayes factors Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 17 / 64
  • 39. Self-contained concept Having a high relative probability does not mean that a hypothesis is true or supported by the data — Templeton, Mol. Ecol., 2009 Non-decision-theoretic: eliminates choice of π(Θ0) Bayesian/marginal equivalent to the likelihood ratio Jeffreys’ scale of evidence: if log10(B^π_10) between 0 and 0.5, evidence against H0 weak, if log10(B^π_10) between 0.5 and 1, evidence substantial, if log10(B^π_10) between 1 and 2, evidence strong and if log10(B^π_10) above 2, evidence decisive Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 18 / 64
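As a quick illustration of this scale, here is a minimal Python sketch (mine, not from the slides) that maps a Bayes factor B10 onto the qualitative categories quoted above.

```python
import math

def jeffreys_scale(b10):
    """Map a Bayes factor B10 to Jeffreys' qualitative scale of evidence against H0."""
    log_b = math.log10(b10)
    if log_b < 0:
        return "supports H0"
    if log_b <= 0.5:
        return "weak evidence against H0"
    if log_b <= 1:
        return "substantial evidence against H0"
    if log_b <= 2:
        return "strong evidence against H0"
    return "decisive evidence against H0"

for b in (1.5, 5, 30, 150):
    print(b, jeffreys_scale(b))
```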
  • 40. A major modification Considering whether a location parameter α is 0. The prior is uniform and we should have to take f(α) = 0 and B10 would always be infinite Jeffreys, ToP, 1939 When the null hypothesis is supported by a set of measure 0, π(Θ0) = 0 and thus π(Θ0|x) = 0. [End of the story?!] Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 19 / 64
  • 41. Changing the prior to fit the hypotheses Given that some logical overlap is common when dealing with complex models, this means that much of the literature is invalid Templeton, Trends in Ecology and Evolution, 2010 Requirement Define prior distributions under both assumptions, π0(θ) ∝ π(θ)I_{Θ0}(θ), π1(θ) ∝ π(θ)I_{Θ1}(θ), [under the standard dominating measures on Θ0 and Θ1], leading to π(θ) = ρ0 π0(θ) + ρ1 π1(θ). Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 20 / 64
  • 42. Point null hypotheses I have no patience for statistical methods that assign positive probability to point hypotheses of the θ = 0 type that can never actually be true Gelman, BA, 2008 Take ρ0 = Prπ (θ = θ0) and g1 prior density under Ha. Then Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 21 / 64
  • 43. Point null hypotheses I have no patience for statistical methods that assign positive probability to point hypotheses of the θ = 0 type that can never actually be true Gelman, BA, 2008 Take ρ0 = Pr^π(θ = θ0) and g1 prior density under Ha. Then π(Θ0|x) = f(x|θ0)ρ0 / ∫ f(x|θ)π(θ) dθ = f(x|θ0)ρ0 / [f(x|θ0)ρ0 + (1 − ρ0)m1(x)] and Bayes factor B^π_01(x) = {f(x|θ0)ρ0 / [m1(x)(1 − ρ0)]} / {ρ0/(1 − ρ0)} = f(x|θ0)/m1(x) Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 21 / 64
  • 44. Point null hypotheses (cont’d) Example (Normal mean) Test of H0 : θ = 0 when x ∼ N(θ, 1): we take π1 as N(0, τ²); then m1(x)/f(x|0) = √(σ²/(σ² + τ²)) exp{τ²x²/(2σ²(σ² + τ²))} and the posterior probability is
    τ \ x    0       0.68    1.28    1.96
    1        0.586   0.557   0.484   0.351
    10       0.768   0.729   0.612   0.366
  Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 22 / 64
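A minimal numerical check of the closed form above (a sketch of mine, reading the row labels of the table as prior variances τ² = 1 and τ² = 10, with ρ0 = 1/2):

```python
import math

def post_prob_null(x, v, sigma2=1.0, rho0=0.5):
    """Posterior probability of H0: theta = 0 when x ~ N(theta, sigma2)
    and theta ~ N(0, v) under the alternative, with prior weight rho0 on H0."""
    # marginal ratio m1(x) / f(x|0), cf. the closed form on the slide
    ratio = math.sqrt(sigma2 / (sigma2 + v)) * math.exp(v * x**2 / (2 * sigma2 * (sigma2 + v)))
    return 1.0 / (1.0 + (1 - rho0) / rho0 * ratio)

for v in (1.0, 10.0):
    print(v, [round(post_prob_null(x, v), 3) for x in (0.0, 0.68, 1.28, 1.96)])
```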
  • 45. Comparison with classical tests The 95 percent frequentist intervals will live up to their advertised coverage claims — Wasserman, BA, 2008 Standard/classical answer Definition (p-value) The p-value p(x) associated with a test is the largest significance level for which H0 is rejected Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 23 / 64
  • 46. Problems with p-values The use of P implies that a hypothesis that may be true may be rejected because it had not predicted observable results that have not occurred Jeffreys, ToP, 1939 Evaluation of the wrong quantity, namely the probability of exceeding the observed quantity (wrong conditioning) Evaluation only under the null hypothesis Huge numerical difference with the Bayesian range of answers Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 24 / 64
  • 47. Bayesian lower bounds If the Bayes estimator has good frequency behavior then we might as well use the frequentist method. If it has bad frequency behavior then we shouldn’t use it. Wasserman, BA, 2008 Least favourable Bayesian answer is B(x, G_A) = inf_{g∈G_A} f(x|θ0) / ∫Θ f(x|θ)g(θ) dθ , i.e., if there exists an mle for θ, θ̂(x), B(x, G_A) = f(x|θ0) / f(x|θ̂(x)) Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 25 / 64
  • 48. Illustration Example (Normal case) When x ∼ N(θ, 1) and H0 : θ0 = 0, the lower bounds are B(x, G_A) = e^{−x²/2} and P(x, G_A) = (1 + e^{x²/2})^{−1} , Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 26 / 64
  • 49. Illustration Example (Normal case) When x ∼ N(θ, 1) and H0 : θ0 = 0, the lower bounds are B(x, G_A) = e^{−x²/2} and P(x, G_A) = (1 + e^{x²/2})^{−1} , i.e.
    p-value   0.10    0.05    0.01    0.001
    P         0.205   0.128   0.035   0.004
    B         0.256   0.146   0.036   0.004
  Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 26 / 64
  • 50. Illustration Example (Normal case) When x ∼ N(θ, 1) and H0 : θ0 = 0, the lower bounds are B(x, G_A) = e^{−x²/2} and P(x, G_A) = (1 + e^{x²/2})^{−1} , i.e.
    p-value   0.10    0.05    0.01    0.001
    P         0.205   0.128   0.035   0.004
    B         0.256   0.146   0.036   0.004
  [Quite different!] Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 26 / 64
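The correspondence between two-sided p-values and these lower bounds can be verified directly; a small sketch of mine, assuming scipy is available:

```python
import math
from scipy.stats import norm

# two-sided p-value p = 2(1 - Phi(|x|))  =>  |x| = Phi^{-1}(1 - p/2)
for p in (0.10, 0.05, 0.01, 0.001):
    x = norm.ppf(1 - p / 2)
    B = math.exp(-x**2 / 2)                # lower bound on the Bayes factor B(x, G_A)
    P = 1.0 / (1.0 + math.exp(x**2 / 2))   # lower bound on the posterior probability P(x, G_A)
    print(f"p = {p:5.3f}   B = {B:.3f}   P = {P:.3f}")
```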
  • 51. Model choice and model comparison There is no null hypothesis, which complicates the computation of sampling error Templeton, Mol. Ecol., 2009 Choice among models: Several models available for the same observation(s) Mi : x ∼ fi(x|θi), i ∈ I where I can be finite or infinite Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 27 / 64
  • 52. Bayesian resolution The posterior probabilities are constructed by using a numerator that is a function of the observation for a particular model, then divided by a denominator that ensures that the ”probabilities” sum to one. — Templeton, Mol. Ecol., 2009 Probabilise the entire model/parameter space Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 28 / 64
  • 53. Bayesian resolution The posterior probabilities are constructed by using a numerator that is a function of the observation for a particular model, then divided by a denominator that ensures that the ”probabilities” sum to one. — Templeton, Mol. Ecol., 2009 Probabilise the entire model/parameter space allocate probabilities pi to all models Mi define priors πi(θi) for each parameter space Θi Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 28 / 64
  • 54. Bayesian resolution The posterior probabilities are constructed by using a numerator that is a function of the observation for a particular model, then divided by a denominator that ensures that the ”probabilities” sum to one. — Templeton, Mol. Ecol., 2009 Probabilise the entire model/parameter space allocate probabilities pi to all models Mi define priors πi(θi) for each parameter space Θi compute π(Mi|x) = pi ∫Θi fi(x|θi)πi(θi)dθi / Σj pj ∫Θj fj(x|θj)πj(θj)dθj Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 28 / 64
  • 55. Bayesian resolution (2) The numerators are not co-measurable across hypotheses, and the denominators are sums of non-co-measurable entities. This means that it is mathematically impossible for them to be probabilities — Templeton, Mol. Ecol., 2009 take largest π(Mi|x) to determine “best” model, or use averaged predictive Σj π(Mj|x) ∫Θj fj(x′|θj)πj(θj|x)dθj Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 29 / 64
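When the marginal likelihoods ∫ fi(x|θi)πi(θi) dθi are available in closed form, the posterior model probabilities reduce to this normalisation; a toy sketch of mine with two Gaussian models, where the prior variance tau2 and the model weights are illustrative choices:

```python
from scipy.stats import norm

def model_posteriors(x, tau2=1.0, p1=0.5):
    """Posterior probabilities of M1: x ~ N(0,1) vs M2: x ~ N(theta,1), theta ~ N(0,tau2).
    Both marginal likelihoods are available in closed form here."""
    m1 = norm.pdf(x, loc=0.0, scale=1.0)                  # marginal under M1
    m2 = norm.pdf(x, loc=0.0, scale=(1.0 + tau2) ** 0.5)  # marginal under M2 (theta integrated out)
    num1, num2 = p1 * m1, (1 - p1) * m2
    return num1 / (num1 + num2), num2 / (num1 + num2)

print(model_posteriors(1.96))
```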
  • 56. Natural Occam’s razor Pluralitas non est ponenda sine neccesitate Variation is random until the contrary is shown; and new parameters in laws, when they are suggested, must be tested one at a time, unless there is specific reason to the contrary. Jeffreys, ToP, 1939 Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 30 / 64
  • 57. Natural Occam’s razor Pluralitas non est ponenda sine neccesitate Variation is random until the contrary is shown; and new parameters in laws, when they are suggested, must be tested one at a time, unless there is specific reason to the contrary. Jeffreys, ToP, 1939 The Bayesian approach naturally weights differently models with different parameter dimensions (BIC being an approximative log-Bayes factor). Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 30 / 64
  • 58. A fundamental difficulty 1) ABC can and does produce results that are mathematically impossible; 2) the “posterior probabilities” of ABC cannot possibly be true probability measures; and 3) ABC is statistically incoherent. Templeton, Trends in Ecology and Evolution, 2010 Improper priors are NOT allowed here If ∫Θ1 π1(dθ1) = ∞ or ∫Θ2 π2(dθ2) = ∞ then either π1 or π2 cannot be coherently normalised Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 31 / 64
  • 59. A fundamental difficulty 1) ABC can and does produce results that are mathematically impossible; 2) the “posterior probabilities” of ABC cannot possibly be true probability measures; and 3) ABC is statistically incoherent. Templeton, Trends in Ecology and Evolution, 2010 Improper priors are NOT allowed here If ∫Θ1 π1(dθ1) = ∞ or ∫Θ2 π2(dθ2) = ∞ then either π1 or π2 cannot be coherently normalised but the normalisation matters in the Bayes factor Recall Bayes factor Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 31 / 64
  • 60. Normal illustration Take x ∼ N(θ, 1) and H0 : θ = 0 Impact of the constant
    x            0.0      1.0      1.65     1.96      2.58
    π(θ) = 1     0.285    0.195    0.089    0.055     0.014
    π(θ) = 10    0.0384   0.0236   0.0101   0.00581   0.00143
  Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 32 / 64
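The rows above follow from taking the "marginal" under the alternative to be the arbitrary constant c of the improper prior; a short sketch of mine reproducing them (up to rounding):

```python
from scipy.stats import norm

def pseudo_post_prob(x, c):
    """'Posterior probability' of H0: theta = 0 when x ~ N(theta, 1) and the
    improper prior pi(theta) = c is used under the alternative: the marginal
    under Ha is then c, so the answer depends on the arbitrary constant c."""
    f0 = norm.pdf(x)        # f(x | theta = 0)
    return f0 / (f0 + c)    # with equal prior weights on H0 and Ha

for c in (1.0, 10.0):
    print(c, [round(pseudo_post_prob(x, c), 4) for x in (0.0, 1.0, 1.65, 1.96, 2.58)])
```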
  • 61. Vague proper priors are NOT the solution Taking a proper prior with a “very large” variance (e.g., BUGS) Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 33 / 64
  • 62. Vague proper priors are NOT the solution Taking a proper prior with a “very large” variance (e.g., BUGS) will most often result in an undefined or ill-defined limit Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 33 / 64
  • 63. Vague proper priors are NOT the solution Taking a proper prior with a “very large” variance (e.g., BUGS) will most often result in an undefined or ill-defined limit Example (Lindley’s paradox) If testing H0 : θ = 0 when observing x ∼ N(θ, 1), under a normal N(0, α) prior π1(θ), B01(x) −→ ∞ as α −→ ∞ Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 33 / 64
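A quick numerical look at the paradox (my own sketch): m1(x) is the N(0, 1 + α) density, so B01 grows without bound as the prior variance α increases, whatever the value of x.

```python
from scipy.stats import norm

def B01(x, alpha):
    """Bayes factor of H0: theta = 0 against theta ~ N(0, alpha), for x ~ N(theta, 1)."""
    return norm.pdf(x, 0.0, 1.0) / norm.pdf(x, 0.0, (1.0 + alpha) ** 0.5)

for alpha in (1, 100, 1e4, 1e6):
    print(alpha, round(B01(2.0, alpha), 2))   # even at x = 2.0 the factor keeps growing with alpha
```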
  • 64. Learning from the sample It is possible for data to discriminate among a set of hypotheses without saying anything about a proposition that is common to all the alternatives considered. Seber, Evidence and Evolution, 2008 Definition (Learning sample) Given an improper prior π, (x1, . . . , xn) is a learning sample if π(·|x1, . . . , xn) is proper and a minimal learning sample if none of its subsamples is a learning sample Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 34 / 64
  • 65. Learning from the sample It is possible for data to discriminate among a set of hypotheses without saying anything about a proposition that is common to all the alternatives considered. Seber, Evidence and Evolution, 2008 Definition (Learning sample) Given an improper prior π, (x1, . . . , xn) is a learning sample if π(·|x1, . . . , xn) is proper and a minimal learning sample if none of its subsamples is a learning sample There is just enough information in a minimal learning sample to make inference about θ under the prior π Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 34 / 64
  • 66. Pseudo-Bayes factors Idea Use a first part x[i] of the data x to make the prior proper: Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 35 / 64
  • 67. Pseudo-Bayes factors Idea Use a first part x[i] of the data x to make the prior proper: πi improper but πi(·|x[i]) proper and ∫ fi(x[n/i]|θi) πi(θi|x[i])dθi / ∫ fj(x[n/i]|θj) πj(θj|x[i])dθj independent of normalizing constant Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 35 / 64
  • 68. Pseudo-Bayes factors Idea Use a first part x[i] of the data x to make the prior proper: πi improper but πi(·|x[i]) proper and ∫ fi(x[n/i]|θi) πi(θi|x[i])dθi / ∫ fj(x[n/i]|θj) πj(θj|x[i])dθj independent of normalizing constant Use remaining part x[n/i] to run test as if πj(θj|x[i]) was the true prior Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 35 / 64
  • 69. Motivation Provides a working principle for improper priors Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 36 / 64
  • 70. Motivation Provides a working principle for improper priors Gather enough information from data to achieve properness and use this properness to run the test on remaining data Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 36 / 64
  • 71. Motivation Provides a working principle for improper priors Gather enough information from data to achieve properness and use this properness to run the test on remaining data does not use the data x twice as in Aitkin’s (1991,2010) Back later! Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 36 / 64
  • 72. Fractional Bayes factor To test a theory, you need to test it against alternatives. Seber, Evidence and Evolution, 2008 Idea use directly the likelihood to separate training sample from testing sample B^F_12 = B12(x) × ∫ L2^b(θ2)π2(θ2)dθ2 / ∫ L1^b(θ1)π1(θ1)dθ1 [O’Hagan, 1995] Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 37 / 64
  • 73. Fractional Bayes factor To test a theory, you need to test it against alternatives. Seber, Evidence and Evolution, 2008 Idea use directly the likelihood to separate training sample from testing sample B^F_12 = B12(x) × ∫ L2^b(θ2)π2(θ2)dθ2 / ∫ L1^b(θ1)π1(θ1)dθ1 [O’Hagan, 1995] Proportion b of the sample used to gain proper-ness Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 37 / 64
  • 74. Fractional Bayes factor (cont’d) Example (Normal mean) B^F_12 = (1/√b) exp{n(b − 1)x̄n²/2} corresponds to the exact Bayes factor for the prior N(0, (1 − b)/(nb)) If b constant, prior variance goes to 0 If b = 1/n, prior variance stabilises around 1 If b = n^{−α}, α < 1, prior variance goes to 0 too. Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 38 / 64
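A small check of this identity (my sketch; the values of n, x̄ and b are illustrative): the fractional Bayes factor coincides with the exact Bayes factor computed under the N(0, (1 − b)/(nb)) prior.

```python
import math

def fbf_normal_mean(xbar, n, b):
    """Fractional Bayes factor B12^F for theta = 0 vs theta unknown (flat prior),
    x_i ~ N(theta, 1), following the closed form on the slide."""
    return (1.0 / math.sqrt(b)) * math.exp(n * (b - 1) * xbar**2 / 2)

def exact_bf_normal_mean(xbar, n, tau2):
    """Exact Bayes factor of theta = 0 against theta ~ N(0, tau2), x_i ~ N(theta, 1)."""
    return math.sqrt(1 + n * tau2) * math.exp(-n * xbar**2 / 2 * (n * tau2) / (1 + n * tau2))

n, xbar, b = 20, 0.4, 1 / 20
print(fbf_normal_mean(xbar, n, b))
print(exact_bf_normal_mean(xbar, n, (1 - b) / (n * b)))  # same value: prior N(0, (1-b)/(nb))
```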
  • 75. Compatibility principle Further complicating dimensionality of test statistics is the fact that the models are often not nested, and one model may contain parameters that do not have analogues in the other models and vice versa Templeton, Mol. Ecol., 2009 Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 39 / 64
  • 76. Compatibility principle Further complicating dimensionality of test statistics is the fact that the models are often not nested, and one model may contain parameters that do not have analogues in the other models and vice versa Templeton, Mol. Ecol., 2009 Difficulty of finding simultaneously priors on a collection of models Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 39 / 64
  • 77. Compatibility principle Further complicating dimensionality of test statistics is the fact that the models are often not nested, and one model may contain parameters that do not have analogues in the other models and vice versa Templeton, Mol. Ecol., 2009 Difficulty of finding simultaneously priors on a collection of models Easier to start from a single prior on a “big” [encompassing] model and to derive others from a coherence principle [Dawid & Lauritzen, 2000] Raw regression output Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 39 / 64
  • 78. An illustration for linear regression In the case M1 and M2 are two nested Gaussian linear regression models with Zellner’s g-priors and the same variance σ² ∼ π(σ²): M1 : y|β1, σ² ∼ N(X1β1, σ²) with β1|σ² ∼ N(s1, σ²n1(X1^T X1)^{−1}) where X1 is a (n × k1) matrix of rank k1 ≤ n Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 40 / 64
  • 79. An illustration for linear regression In the case M1 and M2 are two nested Gaussian linear regression models with Zellner’s g-priors and the same variance σ² ∼ π(σ²): M1 : y|β1, σ² ∼ N(X1β1, σ²) with β1|σ² ∼ N(s1, σ²n1(X1^T X1)^{−1}) where X1 is a (n × k1) matrix of rank k1 ≤ n M2 : y|β2, σ² ∼ N(X2β2, σ²) with β2|σ² ∼ N(s2, σ²n2(X2^T X2)^{−1}), where X2 is a (n × k2) matrix with span(X2) ⊆ span(X1) [© Marin & Robert, Bayesian Core] Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 40 / 64
  • 80. Compatible g-priors I don’t see any role for squared error loss, minimax, or the rest of what is sometimes called statistical decision theory Gelman, BA, 2008 Since σ2 is a nuisance parameter, minimize the Kullback-Leibler divergence between both marginal distributions conditional on σ2: m1(y|σ2; s1, n1) and m2(y|σ2; s2, n2), with solution Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 41 / 64
  • 81. Compatible g-priors I don’t see any role for squared error loss, minimax, or the rest of what is sometimes called statistical decision theory Gelman, BA, 2008 Since σ² is a nuisance parameter, minimize the Kullback-Leibler divergence between both marginal distributions conditional on σ²: m1(y|σ²; s1, n1) and m2(y|σ²; s2, n2), with solution β2|X2, σ² ∼ N(s2*, σ²n2*(X2^T X2)^{−1}) with s2* = (X2^T X2)^{−1} X2^T X1 s1 and n2* = n1 Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 41 / 64
  • 82. Symmetrised compatible priors If those prior probabilities are obscure, the same will be true of the posterior probabilities — Seber, Evidence and Evolution, 2008 Postulate: Previous principle requires embedded models (or an encompassing model) and proper priors, while being hard to implement outside exponential families Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 42 / 64
  • 83. Symmetrised compatible priors If those prior probabilities are obscure, the same will be true of the posterior probabilities — Seber, Evidence and Evolution, 2008 Postulate: Previous principle requires embedded models (or an encompassing model) and proper priors, while being hard to implement outside exponential families We determine prior measures on two models M1 and M2, π1 and π2, directly by a compatibility principle. Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 42 / 64
  • 84. Generalised expected posterior priors [Pérez & Berger, 2000] EPP Principle Starting from reference priors π1^N and π2^N, substitute by prior distributions π1 and π2 that solve the system of integral equations π1(θ1) = ∫X π1^N(θ1 | x) m2(x) dx and π2(θ2) = ∫X π2^N(θ2 | x) m1(x) dx, where x is an imaginary minimal training sample and m1, m2 are the marginals associated with π1 and π2 respectively. Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 43 / 64
  • 85. Motivations Eliminates the “imaginary observation” device and proper-isation through part of the data by integration under the “truth” Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 44 / 64
  • 86. Motivations Eliminates the “imaginary observation” device and proper-isation through part of the data by integration under the “truth” Assumes that both models are equally valid and equipped with ideal unknown priors πi, i = 1, 2, that yield “true” marginals balancing each model wrt the other Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 44 / 64
  • 87. Motivations Eliminates the “imaginary observation” device and proper-isation through part of the data by integration under the “truth” Assumes that both models are equally valid and equipped with ideal unknown priors πi, i = 1, 2, that yield “true” marginals balancing each model wrt the other For a given π1, π2 is an expected posterior prior Using both equations introduces symmetry into the game Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 44 / 64
  • 88. Bayesian coherence Logical overlap is the norm for the complex models analyzed with ABC, so many ABC posterior model probabilities published to date are wrong. Templeton, PNAS, 2009 Theorem (True Bayes factor) If π1 and π2 are the EPPs and if their marginals are finite, then the corresponding Bayes factor B1,2(x) is either a (true) Bayes factor or a limit of (true) Bayes factors. Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 45 / 64
  • 89. Bayesian coherence Logical overlap is the norm for the complex models analyzed with ABC, so many ABC posterior model probabilities published to date are wrong. Templeton, PNAS, 2009 Theorem (True Bayes factor) If π1 and π2 are the EPPs and if their marginals are finite, then the corresponding Bayes factor B1,2(x) is either a (true) Bayes factor or a limit of (true) Bayes factors. Obviously only interesting when both π1 and π2 are improper. Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 45 / 64
  • 90. Variable selection Regression setup where y regressed on a set {x1, . . . , xp} of p potential explanatory regressors (plus intercept) Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 46 / 64
  • 91. Variable selection Regression setup where y regressed on a set {x1, . . . , xp} of p potential explanatory regressors (plus intercept) Corresponding 2^p submodels Mγ, where γ ∈ Γ = {0, 1}^p indicates inclusion/exclusion of variables by a binary representation, Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 46 / 64
  • 92. Variable selection Regression setup where y regressed on a set {x1, . . . , xp} of p potential explanatory regressors (plus intercept) Corresponding 2^p submodels Mγ, where γ ∈ Γ = {0, 1}^p indicates inclusion/exclusion of variables by a binary representation, e.g. γ = 101001011 means that x1, x3, x6, x8 and x9 are included. Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 46 / 64
  • 93. Notations For model Mγ, qγ variables included t1(γ) = {t1,1(γ), . . . , t1,qγ(γ)} indices of those variables and t0(γ) indices of the variables not included For β ∈ R^{p+1}, β_{t1(γ)} = (β0, β_{t1,1(γ)}, . . . , β_{t1,qγ(γ)}) and X_{t1(γ)} = [1n | x_{t1,1(γ)} | . . . | x_{t1,qγ(γ)}]. Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 47 / 64
  • 94. Notations For model Mγ, qγ variables included t1(γ) = {t1,1(γ), . . . , t1,qγ(γ)} indices of those variables and t0(γ) indices of the variables not included For β ∈ R^{p+1}, β_{t1(γ)} = (β0, β_{t1,1(γ)}, . . . , β_{t1,qγ(γ)}) and X_{t1(γ)} = [1n | x_{t1,1(γ)} | . . . | x_{t1,qγ(γ)}]. Submodel Mγ is thus y|β, γ, σ² ∼ N(X_{t1(γ)}β_{t1(γ)}, σ²In) Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 47 / 64
  • 95. Global and compatible priors Use Zellner’s g-prior, i.e. a normal prior for β conditional on σ², β|σ² ∼ N(β̃, cσ²(X^T X)^{−1}) and a Jeffreys prior for σ², π(σ²) ∝ σ^{−2} Noninformative g Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 48 / 64
  • 96. Global and compatible priors Use Zellner’s g-prior, i.e. a normal prior for β conditional on σ², β|σ² ∼ N(β̃, cσ²(X^T X)^{−1}) and a Jeffreys prior for σ², π(σ²) ∝ σ^{−2} Noninformative g Resulting compatible prior β_{t1(γ)} ∼ N((X_{t1(γ)}^T X_{t1(γ)})^{−1} X_{t1(γ)}^T X β̃, cσ²(X_{t1(γ)}^T X_{t1(γ)})^{−1}) Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 48 / 64
  • 97. Posterior model probability Can be obtained in closed form: π(γ|y) ∝ (c + 1)^{−(qγ+1)/2} [y^T y − c y^T P1 y/(c + 1) + β̃^T X^T P1 X β̃/(c + 1) − 2 y^T P1 X β̃/(c + 1)]^{−n/2}. Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 49 / 64
  • 98. Posterior model probability Can be obtained in closed form: π(γ|y) ∝ (c + 1)^{−(qγ+1)/2} [y^T y − c y^T P1 y/(c + 1) + β̃^T X^T P1 X β̃/(c + 1) − 2 y^T P1 X β̃/(c + 1)]^{−n/2}. Conditionally on γ, posterior distributions of β and σ²: β_{t1(γ)}|σ², y, γ ∼ N(c/(c + 1)(U1y + U1X β̃/c), σ² c/(c + 1)(X_{t1(γ)}^T X_{t1(γ)})^{−1}), σ²|y, γ ∼ IG(n/2, y^T y/2 − c y^T P1 y/(2(c + 1)) + β̃^T X^T P1 X β̃/(2(c + 1)) − y^T P1 X β̃/(c + 1)). Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 49 / 64
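A sketch of mine of this computation in the special case β̃ = 0 (anticipating the noninformative choice of the next slides), where the closed form reduces to π(γ|y) ∝ (c + 1)^{−(qγ+1)/2}[y^T y − c y^T P1 y/(c + 1)]^{−n/2}; the simulated data, the value of c and the variable names are illustrative.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, p, c = 33, 4, 100
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=n)   # toy data: true model uses x1 and x3

def log_post_model(idx):
    """log pi(gamma | y) up to an additive constant, g-prior with beta_tilde = 0:
    pi(gamma|y) is proportional to (c+1)^{-(q+1)/2} [y'y - c/(c+1) y'P1 y]^{-n/2}."""
    Xg = np.column_stack([np.ones(n)] + [X[:, j] for j in idx])   # intercept + selected columns
    P1y = Xg @ np.linalg.lstsq(Xg, y, rcond=None)[0]              # projection of y on span(Xg)
    q = len(idx)
    return -(q + 1) / 2 * np.log(c + 1) - n / 2 * np.log(y @ y - c / (c + 1) * y @ P1y)

models = [idx for k in range(p + 1) for idx in combinations(range(p), k)]
logw = np.array([log_post_model(idx) for idx in models])
probs = np.exp(logw - logw.max())
probs /= probs.sum()
for idx, pr in sorted(zip(models, probs), key=lambda t: -t[1])[:5]:
    print(idx, round(float(pr), 4))
```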
  • 99. Noninformative case Use the same compatible informative g-prior distribution with β̃ = 0_{p+1} and a hierarchical diffuse prior distribution on c, π(c) ∝ c^{−1} I_{N∗}(c) or π(c) ∝ c^{−1} I_{c>0} Recall g-prior Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 50 / 64
  • 100. Noninformative case Use the same compatible informative g-prior distribution with β̃ = 0_{p+1} and a hierarchical diffuse prior distribution on c, π(c) ∝ c^{−1} I_{N∗}(c) or π(c) ∝ c^{−1} I_{c>0} Recall g-prior The choice of this hierarchical diffuse prior distribution on c is due to the model posterior sensitivity to large values of c: Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 50 / 64
  • 101. Noninformative case Use the same compatible informative g-prior distribution with β̃ = 0_{p+1} and a hierarchical diffuse prior distribution on c, π(c) ∝ c^{−1} I_{N∗}(c) or π(c) ∝ c^{−1} I_{c>0} Recall g-prior The choice of this hierarchical diffuse prior distribution on c is due to the model posterior sensitivity to large values of c: Taking β̃ = 0_{p+1} and c large does not work Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 50 / 64
  • 102. Processionary caterpillar Influence of some forest settlement characteristics on the development of caterpillar colonies Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 51 / 64
  • 103. Processionary caterpillar Influence of some forest settlement characteristics on the development of caterpillar colonies Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 51 / 64
  • 104. Processionary caterpillar Influence of some forest settlement characteristics on the development of caterpillar colonies Response y log-transform of the average number of nests of caterpillars per tree on an area of 500 square meters (n = 33 areas) [ c Marin & Robert, Bayesian Core] Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 51 / 64
  • 105. Processionary caterpillar (cont’d) Potential explanatory variables x1 altitude (in meters), x2 slope (in degrees), x3 number of pines in the square, x4 height (in meters) of the tree at the center of the square, x5 diameter of the tree at the center of the square, x6 index of the settlement density, x7 orientation of the square (from 1 if south-oriented to 2 otherwise), x8 height (in meters) of the dominant tree, x9 number of vegetation strata, x10 mix settlement index (from 1 if not mixed to 2 if mixed). Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 52 / 64
  • 106. Bayesian regression output
                   Estimate    BF        log10(BF)
    (Intercept)     9.2714     26.334     1.4205 (***)
    X1             -0.0037      7.0839    0.8502 (**)
    X2             -0.0454      3.6850    0.5664 (**)
    X3              0.0573      0.4356   -0.3609
    X4             -1.0905      2.8314    0.4520 (*)
    X5              0.1953      2.5157    0.4007 (*)
    X6             -0.3008      0.3621   -0.4412
    X7             -0.2002      0.3627   -0.4404
    X8              0.1526      0.4589   -0.3383
    X9             -1.0835      0.9069   -0.0424
    X10            -0.3651      0.4132   -0.3838
  evidence against H0: (****) decisive, (***) strong, (**) substantial, (*) poor
  Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 53 / 64
  • 107. Bayesian variable selection
    t1(γ)             π(γ|y, X)
    0,1,2,4,5         0.0929
    0,1,2,4,5,9       0.0325
    0,1,2,4,5,10      0.0295
    0,1,2,4,5,7       0.0231
    0,1,2,4,5,8       0.0228
    0,1,2,4,5,6       0.0228
    0,1,2,3,4,5       0.0224
    0,1,2,3,4,5,9     0.0167
    0,1,2,4,5,6,9     0.0167
    0,1,2,4,5,8,9     0.0137
  Noninformative G-prior model choice
  Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 54 / 64
  • 108. Fringe alternatives 1 Introduction 2 Tests and model choice 3 Incoherent inferences Templeton’s debate Bayes/likelihood fusion Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 55 / 64
  • 109. A revealing confusion In statistics, coherent measures of fit of nested and overlapping composite hypotheses are technically those measures that are consistent with the constraints of formal logic. For example, the probability of the nested special case must be less than or equal to the probability of the general model within which the special case is nested. Any statistic that assigns greater probability to the special case is said to be incoherent. Templeton, PNAS, 2009 Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 56 / 64
  • 110. ABC algorithm Instead of evaluating hypotheses in terms of how probable they say the data are, we evaluate them by estimating how accurately they’ll predict new data when fitted to old — Seber, Evidence and Evolution, 2008 Algorithm 1 Likelihood-free rejection sampler
    for i = 1 to N do
      repeat
        generate θ′ from the prior distribution π(·)
        generate z from the likelihood f(·|θ′)
      until ρ{η(z), η(y)} ≤ ε
      set θi = θ′
    end for
  where η(y) defines a (not necessarily sufficient) statistic [Pritchard et al., 1999]
  Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 57 / 64
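A direct transcription of Algorithm 1 for a toy normal-mean problem (my sketch; the prior, the sample-mean summary and the tolerance are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)

# toy setup: y_i ~ N(theta, 1), prior theta ~ N(0, 10), summary eta(y) = sample mean
y = rng.normal(2.0, 1.0, size=50)
eta_y = y.mean()

def abc_rejection(N=500, eps=0.1):
    """Likelihood-free rejection sampler (Algorithm 1): keep theta' whenever the
    summary of the pseudo-data z falls within eps of the observed summary."""
    samples = []
    while len(samples) < N:
        theta = rng.normal(0.0, np.sqrt(10.0))       # generate theta' from the prior
        z = rng.normal(theta, 1.0, size=y.size)      # generate z from the likelihood
        if abs(z.mean() - eta_y) <= eps:             # rho{eta(z), eta(y)} <= eps
            samples.append(theta)
    return np.array(samples)

theta_abc = abc_rejection()
# compare with the exact posterior: mean ~ n*ybar/(n + 0.1), sd ~ (n + 0.1)^(-1/2)
print(theta_abc.mean(), theta_abc.std())
```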
  • 111. ABC output The likelihood-free algorithm samples from the marginal in z of: π_ε(θ, z|y) = π(θ)f(z|θ)I_{A_{ε,y}}(z) / ∫_{A_{ε,y}×Θ} π(θ)f(z|θ) dz dθ , where A_{ε,y} = {z ∈ D | ρ(η(z), η(y)) < ε}. Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 58 / 64
  • 112. ABC output The likelihood-free algorithm samples from the marginal in z of: π_ε(θ, z|y) = π(θ)f(z|θ)I_{A_{ε,y}}(z) / ∫_{A_{ε,y}×Θ} π(θ)f(z|θ) dz dθ , where A_{ε,y} = {z ∈ D | ρ(η(z), η(y)) < ε}. The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the posterior distribution: π_ε(θ|y) = ∫ π_ε(θ, z|y) dz ≈ π(θ|y) . Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 58 / 64
  • 113. The ”Great ABC controversy” On-going controvery in phylogeographic genetics about the validity of using ABC for testing Against: Templeton, 2008, 2009, 2010a, 2010b, 2010c argues that nested hypotheses cannot have higher probabilities than nesting hypotheses (!) Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 59 / 64
  • 114. The ”Great ABC controversy” On-going controvery in phylogeographic genetics about the validity of using ABC for testing Against: Templeton, 2008, 2009, 2010a, 2010b, 2010c argues that nested hypotheses cannot have higher probabilities than nesting hypotheses (!) The probability of the nested special case must be less than or equal to the probability of the general model within which the special case is nested. Any statistic that assigns greater probability to the special case is incoherent. An example of incoherence is shown for the ABC method. Templeton, PNAS, 2010 Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 59 / 64
  • 115. The “Great ABC controversy” On-going controversy in phylogeographic genetics about the validity of using ABC for testing Against: Templeton, 2008, 2009, 2010a, 2010b, 2010c argues that nested hypotheses cannot have higher probabilities than nesting hypotheses (!) Incoherent methods, such as ABC, Bayes factor, or any simulation approach that treats all hypotheses as mutually exclusive, should never be used with logically overlapping hypotheses. Templeton, PNAS, 2010 Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 59 / 64
  • 116. The “Great ABC controversy” On-going controversy in phylogeographic genetics about the validity of using ABC for testing Against: Templeton, 2008, 2009, 2010a, 2010b, 2010c argues that nested hypotheses cannot have higher probabilities than nesting hypotheses (!) The central equation of ABC
P(Hi | H, S∗) = Gi(||Si − S∗||) Πi / Σ_{j=1}^n Gj(||Sj − S∗||) Πj
is inherently incoherent. This fundamental equation is mathematically incorrect in every instance of overlap. Templeton, PNAS, 2010
Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 59 / 64
  • 117. The “Great ABC controversy” On-going controversy in phylogeographic genetics about the validity of using ABC for testing Against: Templeton, 2008, 2009, 2010a, 2010b, 2010c argues that nested hypotheses cannot have higher probabilities than nesting hypotheses (!) Replies: Fagundes et al., 2008, Beaumont et al., 2010, Berger et al., 2010, Csilléry et al., 2010 point out that the criticisms are addressed at [Bayesian] model-based inference and have nothing to do with ABC... Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 59 / 64
  • 118. The “Great ABC controversy” On-going controversy in phylogeographic genetics about the validity of using ABC for testing ABC is a statistically valid approach, alongside other computational statistical techniques that have been successfully used to infer parameters and compare models in population genetics. Beaumont et al., Molec. Ecology, 2010 Replies: Fagundes et al., 2008, Beaumont et al., 2010, Berger et al., 2010, Csilléry et al., 2010 point out that the criticisms are addressed at [Bayesian] model-based inference and have nothing to do with ABC... Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 59 / 64
  • 119. The “Great ABC controversy” On-going controversy in phylogeographic genetics about the validity of using ABC for testing The confusion seems to arise from misunderstanding the difference between scientific hypotheses and their mathematical representation. Bayes’ theorem shows that the simpler model can indeed have a much higher posterior probability. Berger et al., PNAS, 2010 Replies: Fagundes et al., 2008, Beaumont et al., 2010, Berger et al., 2010, Csilléry et al., 2010 point out that the criticisms are addressed at [Bayesian] model-based inference and have nothing to do with ABC... Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 59 / 64
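Berger et al.'s point can be checked with a two-line computation. In the toy problem of testing H0: θ = 0 against H1: θ ∼ N(0, τ²) from a single observation x ∼ N(θ, 1) — a setting used here purely for illustration, not taken from the slides — the Bayes factor is available in closed form and the nested point null routinely receives the larger posterior probability.

```python
import numpy as np

def posterior_prob_H0(x, tau2=10.0, prior_H0=0.5):
    """P(H0 | x) for H0: theta = 0 vs H1: theta ~ N(0, tau2), with x ~ N(theta, 1).
    The marginal of x under H1 is N(0, 1 + tau2), hence
    B01 = N(x; 0, 1) / N(x; 0, 1 + tau2)."""
    log_b01 = 0.5 * np.log(1.0 + tau2) - 0.5 * x**2 * tau2 / (1.0 + tau2)
    b01 = np.exp(log_b01)
    return prior_H0 * b01 / (prior_H0 * b01 + (1.0 - prior_H0))

for x in (0.0, 1.0, 2.0, 3.0):
    print(f"x = {x}:  P(H0 | x) = {posterior_prob_H0(x):.3f}")
# for small |x| the nested point null dominates the encompassing alternative,
# even though H0 is "contained" in H1 as a scientific hypothesis
```

The posterior probability refers to the mathematical representation of the hypotheses (a point mass versus a continuous prior), which is exactly the distinction Berger et al. invoke.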
  • 120. Aitkin’s alternative Without a specific alternative, the best we can do is to make posterior probability statements about µ and transfer these to the posterior distribution of the likelihood ratio. Aitkin, Statistical Inference, 2010 Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 60 / 64
  • 121. Aitkin’s alternative Without a specific alternative, the best we can do is to make posterior probability statements about µ and transfer these to the posterior distribution of the likelihood ratio. Aitkin, Statistical Inference, 2010 Proposal to examine the posterior distribution of the likelihood function: compare models via the “posterior distribution” of the likelihood ratio
L1(θ1|x) / L2(θ2|x), with θ1 ∼ π1(θ1|x) and θ2 ∼ π2(θ2|x).
Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 60 / 64
  • 122. Using the data “twice” A persistent criticism of the posterior likelihood approach has been based on the claim that these approaches are ‘using the data twice’, or are ‘violating temporal coherence’ — Aitkin, Statistical Inference, 2010 Complete separation between both models, due to simulation under the product of the posterior distributions: this replaces standard Bayesian inference under the joint posterior of (θ1, θ2),
p1 m1(x) π1(θ1|x) π2(θ2) + p2 m2(x) π2(θ2|x) π1(θ1),
by the product of both posteriors.
Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 61 / 64
  • 123. Illustration Comparison of a Poisson model against a negative binomial with m = 5 successes, when x = 3: Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 62 / 64
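A sketch of what Aitkin's comparison might look like in this setting: draw λ from its posterior under the Poisson model and p from its posterior under the negative binomial model with m = 5 successes, then inspect the posterior distribution of the likelihood ratio. The Gamma(1, 1) prior on λ, the Beta(1, 1) prior on p, and the "number of failures before m successes" parameterisation are assumptions made here for illustration; the slides do not spell out the choices behind the figure.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x, m, N = 3, 5, 100_000

# Model 1: x ~ Poisson(lambda), assumed prior lambda ~ Gamma(1, rate=1)
#   => posterior lambda | x ~ Gamma(1 + x, rate = 2)
lam = rng.gamma(shape=1 + x, scale=1.0 / 2.0, size=N)
logL1 = stats.poisson.logpmf(x, lam)

# Model 2: x failures before m = 5 successes, x ~ NegBinom(m, p),
#   assumed prior p ~ Beta(1, 1) => posterior p | x ~ Beta(1 + m, 1 + x)
p = rng.beta(1 + m, 1 + x, size=N)
logL2 = stats.nbinom.logpmf(x, m, p)

ratio = np.exp(logL1 - logL2)          # L1(lambda | x) / L2(p | x)
print("P( L1/L2 > 1 | x ) ~", np.mean(ratio > 1))
print("posterior median of the ratio ~", np.median(ratio))
```

Note that λ and p are simulated independently from their respective posteriors, which is precisely the "product of both posteriors" feature criticised on the previous slide.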
  • 124. Pros ... This quite small change to standard Bayesian analysis allows a very general approach to a wide range of apparently different inference problems; a particular advantage of the approach is that it can use the same noninformative priors — Aitkin, Statistical Inference, 2010
the approach is general and makes it possible to resolve the difficulties with the Bayesian processing of point null hypotheses;
the approach allows for the use of generic noninformative and improper priors;
the approach handles the “vexed question of model fit” more naturally;
the approach is “simple”.
Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 63 / 64
  • 125. ... & cons The p-value is equal to the posterior probability that the likelihood ratio, for null hypothesis to alternative, is greater than 1 (...) The posterior probability is p that the posterior probability of H0 is greater than 0.5. Aitkin, Statistical Inference, 2010
the approach is not Bayesian (it relies on a product of the posteriors);
the approach uses indeterminate entities (a “posterior probability that the posterior probability is larger than 0.5”...);
the approach tries to get as close as possible to the p-value.
Christian P. Robert (Paris-Dauphine) Bayesian model choice November 20, 2010 64 / 64