1) The document discusses philosophical interventions in statistical debates and proposed reforms. It outlines three types of interventions: illuminating debates within statistics, reformulating frequentist tools through a severity perspective, and scrutinizing proposed reforms from the replication crisis.
2) A key idea is that evidence for a claim only comes when it has undergone severe testing, meaning it would probably have failed if it were false. The document argues for keeping the best aspects of Fisherian and Neyman-Pearson testing through a severity interpretation.
3) In scrutinizing reforms, it questions proposals to abandon statistical significance and P-value thresholds, arguing this could exacerbate selective reporting instead of reducing it. The document advocates reformulating tests so that, instead of binary cut-offs, the particular outcome is used to infer which discrepancies from a test hypothesis are or are not warranted.
1.
Philosophical Interventions in the
Statistics Wars
Deborah G. Mayo
Virginia Tech
Philosophy in Science: Can Philosophers of
Science Contribute to Science?
PSA 2021 November 13, 2-4 pm
2.
“A Statistical Scientist Meets a
Philosopher of Science”
Sir David Cox: “Deborah, in some fields
foundations do not seem very important, but we
both think foundations of statistical inference are
important; why do you think that is?”
Mayo: “…in statistics …we invariably cross into
philosophical questions about empirical knowledge
and inductive inference.” (Cox and Mayo 2011)
Some call statistics “applied philosophy of science”
(Kempthorne 1976)
3.
Statistics → Philosophy of science
Most of my interactions with statistics have been
drawing out insights from stat:
(1) To solve philosophical problems about
inductive inference, evidence, experiment;
(2) To answer knotty metamethodological
questions: When (if ever) is it legitimate to use
the ‘same’ data to construct and test a
hypothesis?
4.
Philosophy of Science → Statistics
• In the last decade I’m more likely to be intervening
in stat—in the sense of this session: PinS
• A central job for philosophers of science: minister
to conceptual and logical problems of sciences
• Especially when widely used methods (e.g.,
statistical significance tests) are said to be causing
a crisis (and should be “abandoned” or “retired”)
5.
Long-standing philosophical
controversy on probability
Frequentist (error statisticians): to control and
assess the relative frequency of misinterpretations
of data—error probabilities
(e.g., P-values, confidence intervals, randomization,
resampling)
Bayesians (and other probabilists): to assign
comparative degrees of belief or support in claims
(e.g., Bayes factors, Bayesian posterior probabilities)
6.
• Wars between frequentists and Bayesians have been contentious; everyone wants to believe we are long past them.
• Long-standing battles still simmer below the surface
7.
My first type of intervention:
• Illuminate the debates, within and between rival stat
tribes, in relation to today’s problems
• What’s behind the drumbeat that there’s a statistical
crisis in science?
8. • High-powered methods enable arriving at well-fitting models and impressive-looking effects even if they’re not warranted.
• I set sail with a simple tool: if little or nothing has
been done to rule out flaws in inferring a claim, we
do not have evidence for it.
9. A claim is warranted to the extent
it passes severely
• We have evidence for a claim only to the extent that it has been subjected to and passes a test that would probably have found it flawed or false, were it so
• This probability is the stringency or severity
with which it has passed the test
10. Second type of intervention:
Statistical inference as severe testing
• Reformulate frequentist error statistical tools
• Probability arises (in scientific inference) to assess and
control how capable methods are at uncovering and
avoiding erroneous interpretations of data (Probativism)
• Excavation tool: Holds for any kind of inference; you
needn’t accept this philosophy to use it to get beyond
today’s statistical wars and scrutinize reforms
11. Third type of intervention: scrutinize
proposed reforms growing out of the
“replication crisis”
• Several proposed reforms are welcome:
preregistration, avoidance of cookbook statistics,
calls for more replication research
• Others are quite radical, and even obstruct practices
known to improve on replication.
12. Consider statistical significance tests
(frequentist)
Significance tests (R.A. Fisher) are a small part of an
error statistical methodology
“…to test the conformity of the particular data under
analysis with H0 in some respect….”
…the p-value: the probability of getting an even larger value than the observed test statistic t_obs, assuming background variability or noise (Mayo and Cox 2006, 81)
13. Testing reasoning, as I see it
• If even larger differences than t_obs occur fairly frequently under H0 (i.e., the P-value is not small), there’s scarcely evidence of incompatibility with H0
• A small P-value indicates some underlying discrepancy from H0, because very probably (1 − P) you would have seen a smaller difference than t_obs were H0 true
• Even if the small P-value is valid, it isn’t evidence of a scientific conclusion H*
The Stat-Sub fallacy: H1 => H* (sliding from the statistical to the substantive)
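To fix ideas, here is a minimal sketch of this P-value reasoning in code (my illustration, not from the slides; the normal model, σ, n, and the observed mean are assumptions):

```python
# Illustrative sketch (assumed normal model): one-sided test of H0: mu = 0,
# X ~ N(mu, sigma^2) with sigma known; all numbers are hypothetical.
import math
from scipy.stats import norm

sigma, n = 1.0, 25            # assumed known sd and sample size
x_bar = 0.4                   # hypothetical observed sample mean
t_obs = math.sqrt(n) * (x_bar - 0.0) / sigma   # observed test statistic

# P-value: Pr(T >= t_obs; H0) -- the probability of an even larger value
# of the test statistic, assuming background variability (noise) alone
p_value = norm.sf(t_obs)
print(f"t_obs = {t_obs:.2f}, P-value = {p_value:.3f}")   # ~2.0, ~0.023
# If larger values than t_obs occur frequently under H0 (P not small),
# there is scarcely evidence of incompatibility with H0.
```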
14. Neyman-Pearson (N-P) put
Fisherian tests on firmer footing
(1933):
Introduces an alternative hypothesis H1 alongside H0:
H0: μ ≤ 0 vs. H1: μ > 0
• Constrains tests by requiring control of both Type I error
(erroneously rejecting) and Type II error (erroneously
failing to reject) H0, and power
(Neyman also developed confidence interval estimation
at the same time)
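A minimal sketch of these error probabilities for the one-sided test above (again my illustration; α, n, σ, and the alternatives are assumptions):

```python
# Illustrative N-P error probabilities for H0: mu <= 0 vs. H1: mu > 0,
# X ~ N(mu, sigma^2) with sigma known; all numbers are hypothetical.
import math
from scipy.stats import norm

sigma, n, alpha = 1.0, 25, 0.05
c = norm.ppf(1 - alpha)          # reject when t >= c; Type I error = alpha

def power(mu1):
    # Pr(reject H0; mu = mu1); the Type II error at mu1 is 1 - power(mu1)
    return norm.sf(c - math.sqrt(n) * mu1 / sigma)

print(f"cutoff c = {c:.3f}")
print(f"power at mu = 0.5: {power(0.5):.2f}")   # ~0.80
print(f"power at mu = 0.2: {power(0.2):.2f}")   # ~0.26
```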
15. N-P tests: tools for optimal performance:
• Their success in optimal control of error
probabilities gives a new paradigm for statistics
• Also encouraged viewing tests as “accept/reject”
rules more apt for industrial quality control, or
high throughput screening, than science
• Fisher, later in life, criticized N-P for turning “his” tests into acceptance-sampling tools—I learned later it was mostly in-fighting
16. • Can we keep the best from Fisherian and N-P tests without an “inconsistent hybrid” (Gigerenzer)?
• This fueled my second intervention (Mayo 1991, 1996)
later developed with econometrician Aris Spanos in
2000 and statistician David Cox in 2003
• “Our goal is to identify a key principle of evidence by
which hypothetical error probabilities may be used for
inductive inference.” (Mayo and Cox 2006)
• Mathematically Fisher and N-P are nearly identical—it is
an interpretation or philosophy that is needed
17. Both Fisher & N-P: it’s easy to lie with
biasing selection effects
• Sufficient finagling—cherry-picking, significance
seeking, multiple testing, post-data subgroups, trying
and trying again—may practically guarantee a
preferred claim H gets support, even if it’s unwarranted
by evidence
• Such a test fails a minimal requirement for a stringent
or severe test (P-value is invalidated)
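A quick simulation of the point (my illustration, with assumed numbers: 20 independent tests of true nulls, reported at the nominal .05 level):

```python
# Illustrative: with 20 independent null hypotheses all true, how often does
# cherry-picking the best-looking test yield a nominally significant result?
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
trials, n_tests, n_obs = 10_000, 20, 25
hits = 0
for _ in range(trials):
    z = np.sqrt(n_obs) * rng.normal(0, 1, (n_tests, n_obs)).mean(axis=1)
    p = norm.sf(z)                 # one-sided P-values, all nulls true
    hits += p.min() < 0.05         # report only the smallest P-value
print(f"Pr(some nominally significant result; all nulls true) ~ {hits/trials:.2f}")
# Analytically 1 - 0.95**20 ~ 0.64: the reported P-value is invalidated.
```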
18. Key to solving a central problem
• Why is reliable performance relevant for a specific
inference?
• Ask yourself: what bothers you about selective reporting, cherry-picking, stopping when the data look good, P-hacking?
19.
• Not a problem about long-run performance—
• It’s that we can’t say the test did its job in the case at hand: to give “a first line of defense against being fooled by randomness” (Benjamini 2016)
20. Inferential construal of error
probabilities
• Use error probabilities to assess capabilities of tools to
probe various flaws (Probativism)
• They are what Popper called “methodological probabilities”
• “Severe Testing as a Basic Concept in a Neyman-Pearson
Philosophy of Induction” (Mayo and Spanos 2006)
• “Frequentist statistics as a theory of inductive inference” (Mayo and Cox 2006)
21.
Popper vs logics of induction/
confirmation
Severity was Popper’s term, and the debate between Popperian falsificationism and inductive logics of confirmation/support parallels the debates in statistics.
Popper: claim C is “corroborated” to the extent C
passes a severe test (one that probably would have
detected C’s falsity, if false).
22. Comparative logic of support
• Ian Hacking (1965) “Law of Likelihood”: data x support hypothesis H0 less well than H1 if Pr(x; H0) < Pr(x; H1)
A problem is:
• Any hypothesis that perfectly fits the data is
maximally likely (even if data-dredged)
• “there always is such a rival hypothesis viz., that
things just had to turn out the way they actually
did” (Barnard 1972, 129)
23. Error probabilities are
“one level above” a fit measure:
• Pr(H0 is less well supported than H1; H0) is high for some H1 or other
“to fix a limit between ‘small’ and ‘large’ values of
[the likelihood ratio] we must know how often such
values appear when we deal with a true
hypothesis.” (Pearson and Neyman 1967, 106)
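To see both points in code (my example, assuming a normal model): the data-dredged rival H1: μ = x̄ is always maximally likely, so “large” likelihood ratios must be calibrated by how often they arise under a true H0.

```python
# Illustrative: the best-fitting rival H1: mu = x_bar always beats H0: mu = 0
# on likelihood; what matters is how often large LRs occur when H0 is true.
import numpy as np

rng = np.random.default_rng(2)
n, trials, k = 25, 100_000, 8.0
x_bar = rng.normal(0, 1, (trials, n)).mean(axis=1)   # data generated under H0
lr = np.exp(n * x_bar**2 / 2)     # LR of fitted H1 vs. H0 (sigma = 1)
print(f"min LR = {lr.min():.2f} (never below 1: the rival always fits better)")
print(f"Pr(LR >= {k}; H0 true) ~ {(lr >= k).mean():.3f}")   # ~0.04
```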
24. “There is No Such Thing as a Logic
of Statistical Inference”
• Hacking retracts his Law of Likelihood (LL) (1972, 1980)
• And retracts his earlier denial that Neyman–Pearson statistics is inferential.
“I now believe that Neyman, Peirce, and
Braithwaite were on the right lines to follow in the
analysis of inductive arguments”
(Hacking 1980, 141)
25. Likelihood Principle: what counts
as evidence?
A pervasive view is that all the evidence is contained in the ratio of likelihoods:
Pr(x; H0)/Pr(x; H1) (the likelihood principle, LP)
On the LP (followed by strict Bayesians):
“Sampling distributions, significance levels,
power, all depend on something more [than
the likelihood function]–something that is
irrelevant in Bayesian inference–namely the
sample space” (Lindley 1971, 436)
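The stakes can be made concrete with the classic binomial/negative-binomial example (my rendering, not from the slides): two designs give proportional likelihoods, hence identical evidence on the LP, yet different P-values, because the sample space differs.

```python
# Classic illustration: 9 heads and 3 tails observed; H0: theta = 0.5, one-sided.
from scipy.stats import binom, nbinom

# Design A: flip exactly n = 12 times and count heads.
p_binom = binom.sf(8, 12, 0.5)       # Pr(9 or more heads; H0) ~ 0.073

# Design B: flip until the 3rd tail; count heads before stopping.
# scipy's nbinom counts "failures" (heads) before r "successes" (tails).
p_nbinom = nbinom.sf(8, 3, 0.5)      # Pr(9 or more heads; H0) ~ 0.033

print(f"binomial design:          P = {p_binom:.3f}")
print(f"negative binomial design: P = {p_nbinom:.3f}")
# Same likelihood function up to a constant; different error probabilities.
```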
26. Bayesians Howson and Urbach
• They say a significance test is precluded from
giving judgments about empirical support
• “[it] depends not only on the outcome that a trial
produced, but also on the outcomes that it could
have produced but did not. …determined by certain
private intentions of the experimenters, embodying
their stopping rule.” (1993, 212)
• Whether error probabilities matter turns on your
methodology being able to pick up on them.
28. • So the frequentist needs to know the stopping rule
For a (strict) Bayesian:
“It seems very strange that a frequentist could not
analyze a given set of data…if the stopping rule is
not given….Data should be able to speak for itself.”
(Berger and Wolpert, The Likelihood Principle 1988,
78)
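A simulation sketch of why the stopping rule matters to the error statistician (my illustration, with assumed numbers): peeking after each new observation until “significance” inflates the Type I error far beyond the nominal level.

```python
# Illustrative optional stopping under a true null: test at the .05 level
# after each n from 10 to 100; stop as soon as the result is 'significant'.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
trials, n_max = 10_000, 100
ns = np.arange(10, n_max + 1)
rejections = 0
for _ in range(trials):
    x = rng.normal(0, 1, n_max)              # H0 true: mu = 0, sigma = 1
    z = np.cumsum(x)[ns - 1] / np.sqrt(ns)   # sqrt(n) * running mean
    rejections += (norm.sf(z) < 0.05).any()  # any nominally significant look?
print(f"actual Type I error with optional stopping ~ {rejections/trials:.2f}")
# Well above .05 -- yet the likelihood function ignores the stopping rule.
```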
29. Radiation oncologists look to phil
science: “Why do we disagree about
clinical trials?” (ASTRO 2021)
In a case we considered, Bayesian researchers:
“The [regulatory] requirement of type I error control for
Bayesian adaptive designs causes them to lose many
of their philosophical advantages, such as compliance
with the likelihood principle [which does not require
adjusting]” (Ryan et al. 2020).
They admit “the type I error was inflated in the [trials]…without adjustments to account for multiplicity”.
• No wonder they disagree, and it turns partly on the likelihood principle (LP).
30. Bayesians may block implausible
inferences
• With a low prior degree of belief on H (e.g., real
effect), the Bayesian can block inferring H
• Can work in some cases
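One way to see the blocking (a textbook-style calculation I have added, with assumed numbers):

```python
# Illustrative: prior Pr(real effect) = 0.1, power = 0.8, alpha = 0.05.
# Posterior probability of a real effect given a significant result:
prior, power, alpha = 0.1, 0.8, 0.05
posterior = power * prior / (power * prior + alpha * (1 - prior))
print(f"Pr(real effect | significant result) = {posterior:.2f}")   # ~0.64
# A low prior dampens the inference -- but it does not pinpoint what the
# researchers did wrong (the dredging), the next slide's concern.
```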
31. Concerns
• An additional source of flexibility: priors as well as biasing selection effects
• Doesn’t show what researchers had done wrong—
it’s the multiple testing, data-dredging
• The believability of data-dredged hypotheses is
what makes them so seductive
• Claims can be highly probable (in any sense) while
poorly probed
32. Family feuds within the Bayesian
school: default, objective priors:
• Most Bayesian practitioners (last decade) look for
non-subjective prior probabilities
• “Default” priors are supposed to prevent prior beliefs from influencing the posteriors, so that the data dominate
33. How should we interpret them?
“By definition, ‘non-subjective’ prior distributions are not intended to describe personal beliefs, and in most cases, they are not even proper probability distributions…” (Bernardo 1997, 159–60)
• No agreement on rival systems for default/non-
subjective priors
(invariance, maximum entropy, maximizing missing
information, matching (Kass and Wasserman 1996))
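For a flavor of the disagreement (my example, not from the slides): even for a single Bernoulli parameter θ, rival default systems recommend different priors, and the answers differ.

```python
# Illustrative: two "default" priors for a Bernoulli theta. Jeffreys
# (invariance-based) is Beta(1/2, 1/2); maximum entropy on [0, 1] gives
# the uniform Beta(1, 1). Posterior means after s = 3 successes in n = 10:
s, n = 3, 10
for name, (a, b) in [("Jeffreys Beta(0.5, 0.5)", (0.5, 0.5)),
                     ("Uniform  Beta(1, 1)   ", (1.0, 1.0))]:
    print(f"{name}: posterior mean = {(a + s) / (a + b + n):.3f}")
# Different defaults, different posteriors -- and neither is meant to
# describe anyone's beliefs.
```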
34. There may be ways to combine Bayesian
and error statistical accounts
(Gelman: Falsificationist Bayesian; Shalizi: error statistician)
“[C]rucial parts of Bayesian data analysis, … can be
understood as ‘error probes’ in Mayo’s sense”
“[W]hat we are advocating, then, is what Cox and Hinkley
(1974) call ‘pure significance testing’, in which certain of
the model’s implications are compared directly to the
data.” (Gelman and Shalizi 2013, 10, 20).
• Gelman was at a session on significance testing controversies at the 2016 PSA with Gigerenzer and Glymour
• One can’t also champion “abandoning statistical significance”
35. Now we get to scrutinizing proposed
reforms
36. No Threshold view: Don’t say
‘significance’, don’t use P-value
thresholds
• In 2019, the executive director of the American Statistical Association (ASA), Ron Wasserstein, and two co-authors announced: “declarations of ‘statistical significance’ be abandoned”
• Don’t say “significance”, don’t use P-value thresholds (e.g., .05, .01, .005)
• John Ioannidis invited me and Andrew Gelman to write
opposing editorials on the “no threshold view”
(European Journal of Clinical Investigation)
Mine was “P-value thresholds: forfeit at your peril”
• To be fair, many who signed on to the “no threshold view” think that by removing P-value thresholds, researchers lose an incentive to data-dredge, multiple-test, and otherwise exploit researcher flexibility
• I argue banning the use of P-value thresholds in
interpreting data does not diminish but rather exacerbates
data-dredging
• In a world without predesignated thresholds, it would be hard to hold data dredgers accountable for reporting a nominally small P-value obtained through ransacking, multiple testing, trying and trying again.
• What distinguishes genuine P-values from invalid
ones is that they meet a prespecified error probability.
• No thresholds, no tests.
• We agree the actual P-value should be reported (as
all the founders of tests recommended)
39.
Problems are avoided by reformulating
tests with a discrepancy γ from H0
Instead of a binary cut-off (significant or not), the particular outcome is used to infer discrepancies that are or are not warranted
In a nutshell: one tests several discrepancies from a test hypothesis and infers those well or poorly warranted
E.g., with non-significant results, we set an upper bound (e.g., any discrepancy from H0 is less than γ), as in the sketch below
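Here is a minimal sketch of such a reformulation (my illustration of the severity computation, in the spirit of Mayo and Spanos 2006; σ, n, and the observed mean are assumed):

```python
# Illustrative severity for test T+: H0: mu <= 0 vs. H1: mu > 0,
# X ~ N(mu, sigma^2), sigma = 1 known; a non-significant result.
import math
from scipy.stats import norm

sigma, n = 1.0, 100
x_bar = 0.1                        # observed mean: z = 1.0, P ~ 0.16

def sev_upper(gamma):
    # Severity for 'mu < gamma': Pr(X_bar > x_bar; mu = gamma) --
    # the probability of a worse fit with H0, were the discrepancy gamma real
    return norm.sf(math.sqrt(n) * (x_bar - gamma) / sigma)

for gamma in (0.1, 0.2, 0.3, 0.4):
    print(f"SEV(mu < {gamma}) = {sev_upper(gamma):.2f}")
# gamma near x_bar is poorly ruled out (SEV ~ 0.5); gamma = 0.3 gives
# SEV ~ 0.98, a well-warranted upper bound on the discrepancy.
```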
40. Final Remarks: to intervene in
statistics battles you need to ask:
• How do they use probability?
(probabilism, performance, probativism (severe testing))
• What’s their notion of evidence?
(error probability principle, likelihood principle)
41. Intervening in today’s stat policy
reforms requires chutzpah
• Things have gotten so political that sometimes an outsider’s status can help with acrimonious battles with thought leaders in statistics.
• To give an update: a Task Force of 14 statisticians
was appointed by the ASA President in 2019 “to
address concerns that [the no threshold view] might
be mistakenly interpreted as official ASA policy”
(Benjamini 2021)
• “the use of P-values and significance testing, properly applied and interpreted, are important tools that should not be abandoned” (Benjamini et al. 2021)
• Instead, we need to confront the fact that basic stat
concepts are more confused than ever (in medicine,
economics, law, psychology, climate science, social
science etc.)
• I was glad to see the morning’s session* organized by members of the 2019 Summer Seminar in Phil Stat (which Aris Spanos and I ran)
• I hope more philosophers of science enter the 2-way street
*Current Debates on Statistical Modeling and Inference
Phil Sci ↔ Stat Sci
45. (FEV) Frequentist Principle of
Evidence: Mayo and Cox (2006)
(SEV): Mayo 1991, 1996, 2018; Mayo
and Spanos (2006)
FEV/SEV Small P-value: indicates a discrepancy γ from H0 only if there is a high probability the test would have resulted in a larger P-value were a discrepancy as large as γ absent.
FEV/SEV Moderate or large P-value: indicates the absence of a discrepancy γ from H0 only if there is a high probability the test would have given a worse fit with H0 (i.e., a smaller P-value) were a discrepancy γ present.
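In symbols (my transcription, using the notation of Mayo and Spanos 2006 for test T+: H0: μ ≤ μ0 vs. H1: μ > μ0, with test statistic d(X) and outcome x0):

```latex
% Significant result (small P): infer mu > mu_0 + gamma only if severity is high:
\mathrm{SEV}(\mu > \mu_0 + \gamma)
  = \Pr\!\big(d(\mathbf{X}) \le d(\mathbf{x}_0);\ \mu = \mu_0 + \gamma\big)

% Non-significant result (moderate/large P): infer mu <= mu_0 + gamma only if:
\mathrm{SEV}(\mu \le \mu_0 + \gamma)
  = \Pr\!\big(d(\mathbf{X}) > d(\mathbf{x}_0);\ \mu = \mu_0 + \gamma\big)
```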
46. References
• Barnard, G. (1972). The logic of statistical inference (Review of “The Logic of Statistical Inference” by Ian
Hacking). British Journal for the Philosophy of Science 23(2), 123–32.
• Benjamini, Y., De Veaux, R., Efron, B., et al. (2021). The ASA President’s task force statement on
statistical significance and replicability. The Annals of Applied Statistics. (Online June 20, 2021.)
• Berger, J. O. and Wolpert, R. (1988). The Likelihood Principle, 2nd ed. Vol. 6 Lecture Notes-Monograph
Series. Hayward, CA: Institute of Mathematical Statistics.
• Cox, D. R., and Mayo, D. G. (2010). “Objectivity and Conditionality in Frequentist Inference.” In Error and
Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality
of Science, edited by Deborah G. Mayo and Aris Spanos, 276–304. Cambridge: Cambridge University
Press.
• Cox, D. and Mayo, D. (2011). “A Statistical Scientist Meets a Philosopher of Science: A Conversation
between Sir David Cox and Deborah Mayo”, in Rationality, Markets and Morals (RMM) 2, 103–14.
• Fisher, R. A. (1935a). The Design of Experiments. Oxford: Oxford University Press.
• Gelman, A. and Shalizi, C. (2013). Philosophy and the Practice of Bayesian Statistics and Rejoinder,
British Journal of Mathematical and Statistical Psychology 66(1), 8–38; 76–80.
• Giere, R. (1976). Empirical probability, objective statistical methods, and scientific inquiry. In Foundations of probability theory, statistical inference and statistical theories of science, vol. 2, edited by W. L. Harper and C. A. Hooker, 63–101. Dordrecht, The Netherlands: D. Reidel.
• Hacking, I. (1965). Logic of Statistical Inference. Cambridge: Cambridge University Press.
• Hacking, I. (1972). Likelihood. British Journal for the Philosophy of Science 23(2), 132–37.
• Hacking, I. (1980). The theory of probable inference: Neyman, Peirce and Braithwaite. In Mellor, D.
(ed.), Science, Belief and Behavior: Essays in Honour of R. B. Braithwaite, Cambridge: Cambridge
University Press, pp. 141–60.
47. • Harper, W. L., and C. A. Hooker, eds. (1976). Foundations of probability theory, statistical inference and statistical theories of science. Vol. 2. Dordrecht, The Netherlands: D. Reidel.
• Howson, C. & Urbach, P. (1993). Scientific Reasoning: The Bayesian Approach. LaSalle, IL: Open Court.
• Kass, R. & Wasserman, L. (1996). The Selection of Prior Distributions by Formal Rules. Journal of the
American Statistical Association 91, 1343–70.
• Kempthorne, O. (1976). Statistics and the Philosophers, in Harper, W. and Hooker, C. (eds.), Foundations of
Probability Theory, Statistical Inference and Statistical Theories of Science, Volume II. 273–314. Boston, MA:
D. Reidel.
• Lindley, D. V. (1971). The Estimation of Many Parameters in Godambe, V. and Sprott, D. (eds.), Foundations
of Statistical Inference 435–455. Toronto: Holt, Rinehart and Winston.
• Mayo, D. (1991). Novel Evidence and Severe Tests. Philosophy of Science 58(4), 523–52.
• Mayo, D. (1996). Error and the Growth of Experimental Knowledge. Chicago: University of Chicago Press.
• Mayo, D. (2014). On the Birnbaum Argument for the Strong Likelihood Principle (with discussion), Statistical
Science 29(2), 227–39; 261–6.
• Mayo, D. (2016). Don’t Throw Out the Error Control Baby with the Bad Statistics Bathwater: A Commentary on Wasserstein, R. L. and Lazar, N. A. 2016, “The ASA’s Statement on p-Values: Context, Process, and Purpose”. The American Statistician 70(2) (supplemental materials).
• Mayo, D. (2018). Statistical inference as severe testing: How to get beyond the statistics wars. Cambridge:
Cambridge University Press.
• Mayo, D. (forthcoming). The Statistics Wars and Intellectual Conflicts of Interest (editorial). Conservation
Biology.
48. • Mayo, D. & Cox, D. (2006). Frequentist statistics as a theory of inductive inference. In Rojo, J. (ed.),
Optimality: The Second Erich L. Lehmann Symposium, Lecture Notes-Monograph series, Institute of
Mathematical Statistics (IMS), 49, pp. 77–97. (Reprinted 2010 in Mayo, D. and Spanos, A. (eds.), pp. 247–
75.)
• Mayo, D. & Hand, D. (under review). Statistical Significance Tests: Practicing damaging science, or damaging
scientific practice? In Kao, M., Shech, E., & Mayo, D. (eds.), Synthese (Special Issue: Recent Issues in Philosophy of Statistics: Evidence, Testing, and Applications).
• Mayo, D. & Spanos, A. (2006). Severe testing as a basic concept in a Neyman–Pearson philosophy of
induction. British Journal for the Philosophy of Science 57(2), 323–57.
• Mayo, D. G., and A. Spanos (2011). “Error Statistics.” In Philosophy of Statistics, edited by Prasanta S.
Bandyopadhyay and Malcolm R. Forster, 7:152–198. Handbook of the Philosophy of Science. The
Netherlands: Elsevier.
• Musgrave, A. (1974). ‘Logical versus Historical Theories of Confirmation’, The British Journal for the
Philosophy of Science 25(1), 1–23.
• Neyman, J. & Pearson, E. (1967). On the problem of the most efficient tests of statistical hypotheses. In Joint Statistical Papers, 140–85. Berkeley: University of California Press. First published in Philosophical Transactions of the Royal Society A 231 (1933), 289–337.
• Popper, K. (1959). The Logic of Scientific Discovery. London, New York: Routledge.
• Simmons, J., Nelson, L., & Simonsohn, U. (2012). A 21 word solution. Dialogue: The Official Newsletter of
the Society for Personality and Social Psychology 26(2), 4–7.
• Wasserstein, R. & Lazar, N. (2016). The ASA’s statement on p-values: Context, process and purpose (and
supplemental materials). The American Statistician, 70(2), 129-133.
• Wasserstein, R., Schirm, A,. & Lazar, N. (2019). Moving to a world beyond “p < 0.05” (Editorial). The
American Statistician 73(S1), 1–19. https://doi.org/10.1080/00031305.2019.1583913