"4 Waves in Philosophy of Statistics"
What is the Philosophy of Statistics?
At one level of analysis at least, statisticians and philosophers of science ask many of the same
questions:
 What should be observed and what may justifiably be inferred from the resulting data?
 How well do data confirm or fit a model?
 What is a good test?
 Must predictions be “novel” in some sense? (selection effects, double counting, data mining)
 How can spurious relationships be distinguished from genuine regularities? from causal
regularities?
 How can we infer more accurate and reliable observations from less accurate ones?
 When does a fitted model account for regularities in the data?
1
That these very general questions are entwined with long standing debates in philosophy of
science helps to explain why the field of statistics tends to cross over so often into
philosophical territory.
That statistics is a kind of “applied philosophy of science” is not too far off the mark
(Kempthorne, 1976).

2
Statistics → philosophy: 3 ways statistical accounts are used in philosophy of science
(1) Model Scientific Inference—to capture either the actual or rational ways to arrive at
evidence and inference
(2) Resolve Philosophical Problems about scientific inference, observation, experiment;
(problem of induction, objectivity of observation, reliable evidence, Duhem's problem,
underdetermination).
(3) Perform a Metamethodological Critique—scrutinize methodological rules, e.g., accord
special weight to "novel" facts, avoid ad hoc hypotheses, avoid "data mining", require
randomization.

Philosophy → statistics
Its central job: to help resolve the conceptual, logical, and methodological discomforts of scientists as to how to make reliable inferences despite uncertainties and errors.
In tackling the problems around which the statistics wars have been fought, I claim, one also
arrives at a general account of inductive inference that solves or makes progress on:
the philosopher's problems of induction, objective evidence, underdetermination.

3
History and philosophy of statistics is a huge territory marked by 70 years of debates widely
known for reaching unusual heights both of passion and of technical complexity.
To get a handle on the movements and cycles without too much distortion, I propose to
identify four main “battle waves”—
Wave I: ~1930–1955/60
Wave II: ~1955/60–1980
Wave III: ~1980–2005
Wave IV: ~2005– (ongoing)

4
Confirmation Theory: The Search for Measures of
Degree of Evidential-Relationship (E-R)
Philosophy of science in the 1960s and 70s (and indeed through the early to mid 20th century) saw a resurgence of interest in solving the traditional Humean problem of induction.
Conceding that all attempts to solve the problem of induction fail, philosophers of induction turned to constructing logics of induction or confirmation theories (e.g., Carnap 1962).
The thinking was/is:
Deductive logic: rules to compute whether a conclusion is true, given the truth of a set of premises
(True)
Inductive logic or confirmation theory: would provide rules to compute the probability of a
conclusion, given the truth of certain evidence statements (?)

Having conceded loss in the battle for justifying induction, philosophers appeal to logic to capture
scientific method

5
Inductive Logics (“Confirmation Theory”)
 Rules to assign degrees of probability or confirmation to hypotheses given evidence e — Carnap’s C(H, e)
 Inductive Logicians: we can build and try to justify “inductive logics”; the straight rule: assign degrees of confirmation/credibility
 Statistical affinity: Bayesian (and likelihoodist) accounts

Logic of Falsification (Methodological falsification)
 Rules to decide when to “prefer” or accept hypotheses — Popper
 Deductive Testers: we can reject induction and uphold the “rationality” of preferring or accepting H if it is “well tested”
 Statistical affinity: Fisherian, Neyman-Pearson methods, where probability enters to ensure the reliability and severity of tests
6
The goal of an inductive logic: supply means to compute the degree of evidential relationship
between given evidence statements, e, and a hypothesis H, e.g., look to conditional probability or
Bayes’s Theorem:
P(H|e) = P(e|H)P(H)/P(e)
where P(e) = P(e|H)P(H) + P(e|not-H) P(not-H).
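For concreteness, a minimal sketch of this computation with made-up illustrative numbers (the prior P(H) = .2 and likelihoods .9 and .3 are assumptions for illustration, not from the slides):

```python
# Minimal sketch of the posterior computation above; the numbers
# (P(H) = .2, P(e|H) = .9, P(e|not-H) = .3) are illustrative assumptions.
p_H = 0.2
p_e_given_H, p_e_given_notH = 0.9, 0.3

p_e = p_e_given_H * p_H + p_e_given_notH * (1 - p_H)  # total probability of e
posterior = p_e_given_H * p_H / p_e                   # Bayes's theorem: P(H|e)
print(round(posterior, 3))  # 0.429: e raises P(H) from .2 to about .43
```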
Computing P(H|e), the posterior probability, requires a probability assignment to all of the
members of “not-H”.
Major source of difficulty: how to obtain and interpret these prior probabilities.
a. If analytic and a priori, relevance for predicting and learning about empirical
phenomena is problematic
b. If they measure subjective degrees of belief, their relevance for giving objective
guarantees of reliable inference is unclear.
In statistics, (a) is analogous to “objective” Bayesianism (e.g., Jeffreys); (b) to subjective Bayesianism.
The Bayesian-frequentist controversy is one of the big topics we’ll explore in this course.
7
A core question: What is the nature and role of probabilistic concepts, methods, and
models in making inferences in the face of limited data, uncertainty and error?
1. Three Roles for Probability:
Degrees of Confirmation, Long-Run Error Rates, Degrees of Well-Testedness
a. To provide a post-data assignment of degree of probability, confirmation, support or
belief in a hypothesis (probabilism);
b. To ensure long-run reliability of methods (performance)
c. To determine the warrant of hypotheses by assessing how stringently or severely probed
they are (probativeness)
These three contrasting philosophies of the role of probability in statistical inference are very
much at the heart of the central points of controversy in the “four waves” of philosophy of
statistics…

8
I. Philosophy of Statistics: “The First Battle Wave”
WAVE I: circa 1930–1955/60:
Fisher, Neyman, Pearson, Savage, and Jeffreys.
Statistical inference tools use data x to probe aspects of the data generating source:
In statistical testing, these aspects are in terms of statistical hypotheses about parameters
governing a statistical distribution
H tells us the “probability of x under H”, written P(x;H)
(probabilistic assignments under a model)
P(H,H,T,H,T,T,T,H,H,T; fair coin) = (.5)^10
We will explain how this differs from conditional probabilities in Bayes’s rule or theorem, P(x|H).

9
Modern Statistics Begins with Fisher:
“Simple” Significance Tests
Fisher strongly objected to Bayesian inference, in particular to the use of prior distributions (relevant, he held, for psychology, not science).
He looked to develop ways to express the uncertainty of inferences without deviating from frequentist probabilities.
Example. Let the sample X = (X1, …, Xn) be n iid (independent and identically distributed) outcomes from a Normal distribution with mean µ and standard deviation σ = 1.
1. A null hypothesis H0: µ = 0
e.g., 0 mean concentration of lead, no difference in mean survival in a given group, in mean risk, mean deflection of
light.
2. A function of the sample, d(X), the test statistic, which reflects the difference between the data x0 = (x1, …, xn) and H0.
The larger d(x0), the further the outcome is from what is expected under H0, with respect to the particular question being asked.
3. The p-value is the probability of a difference larger than d(x0), under the assumption that H0 is true:
p(x0) = P(d(X) > d(x0); H0)

10
Mini-recipe for p-value calculation:
The observed significance level (p-value), with observed mean x̄ = .1:
p(x0) = P(d(X) > d(x0); H0).
The relevant test statistic d(X) is:
d(X) = (X̄ − µ0)/σx = [Observed − Expected (under H0)]/σx,
where X̄ is the sample mean, with standard deviation σx = σ/√n.
Let n = 25. Since σx = 1/√25 = .2, d(X) = (.1 − 0) in units of σx yields
d(x0) = .1/.2 = .5
Under the null, d(X) is distributed as standard Normal, denoted by d(X) ~ N(0,1).
The p-value (area to the right of .5) ≈ .3, i.e., not very significant.
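A minimal sketch of this recipe in code (assuming, as above, σ = 1, n = 25, observed mean .1, and the one-sided question):

```python
# Sketch of the slide's p-value recipe: sigma = 1, n = 25, observed mean .1.
from math import erf, sqrt

def p_value(xbar, mu0=0.0, sigma=1.0, n=25):
    se = sigma / sqrt(n)                 # sigma_x, the SD of the sample mean
    d_obs = (xbar - mu0) / se            # the observed test statistic d(x0)
    # one-sided p-value P(Z > d(x0)) with Z ~ N(0,1), via the error function
    return 0.5 * (1 - erf(d_obs / sqrt(2)))

print(p_value(0.1))  # d(x0) = .5, p ≈ .31 — not statistically significant
```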

11
Logic of Simple Significance Tests: Statistical Modus Tollens
“Every experiment may be said to exist only in order to give the facts a chance of
disproving the null hypothesis” (Fisher, 1956, p.160).

Statistical analogy to the deductively valid pattern modus tollens:
If the hypothesis H0 is correct then, with high probability, 1− p, the data would not
be statistically significant at level p.
x0 is statistically significant at level p.
____________________________________________
Thus, x0 is evidence against H0, or x0 indicates the falsity of H0.

12
The Alternative or “Non-Null” Hypothesis
Evidence against H0 seems to indicate evidence for some alternative.
Fisherian significance tests strictly consider only the H0
Neyman and Pearson (N-P) tests introduce an alternative H1 (even if only to serve as a
direction of departure)
Example. X = (X1, …, Xn), iid Normal with σ = 1,
H0: µ = 0 vs. H1: µ > 0
Despite the bitter disputes with Fisher that were to erupt soon after ~1935, Neyman and
Pearson, at first, saw their work as merely placing Fisherian tests on firmer logical footing.
Much of Fisher’s hostility toward N-P methods reflects professional and personality conflicts
more than philosophical differences.

13
Neyman-Pearson (N-P) Tests
N-P hypothesis test: maps each outcome x = (x1, …,xn) into either the null hypothesis H0, or
an alternative hypothesis H1 (where the two exhaust the parameter space) to ensure the
probabilities of erroneous rejections (type I errors) and erroneous acceptances (type II errors)
are controlled at prespecified values, e.g., 0.05 or 0.01, the significance level of the test. It
also requires a sensible distance measure d(x0).

Test T+: X = (X1, …, Xn), iid Normal with σ = 1
H0: µ = 0 vs. H1: µ > 0
If d(x0) > c, “reject” H0 (or declare the result statistically significant at the α level);
if d(x0) ≤ c, “do not reject” or “accept” H0.
e.g., c = 1.96 for α = .025
“Accept”/“reject” are uninterpreted parts of the mathematical apparatus.

14
Testing Errors and Error Probabilities
Type I error: Reject H0 even though H0 is true.
Type II error: Fail to reject H0 even though H0 is false.
Probability of a Type I error = P(d(X) > c; H0) ≤ α
Probability of a Type II error (at an alternative µ1):
P(Test T+ does not reject H0; µ = µ1) = P(d(X) ≤ c; µ = µ1) = β(µ1), for any µ1 > 0.
The “best” test at level α at the same time minimizes the value of β(µ1) for all µ1 > 0, or equivalently, maximizes the power:
POW(µ1) = P(d(X) > c; µ = µ1)
T+ is a Uniformly Most Powerful (UMP) level α test
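A minimal sketch of these error probabilities for test T+ (assuming σ = 1, n = 25, and the cutoff c = 1.96 from the slide):

```python
# Sketch: type I error probability and power for test T+ (sigma = 1, n = 25).
from math import erf, sqrt

def Phi(z):  # standard Normal CDF
    return 0.5 * (1 + erf(z / sqrt(2)))

def power(mu1, c=1.96, mu0=0.0, sigma=1.0, n=25):
    se = sigma / sqrt(n)
    # POW(mu1) = P(d(X) > c; mu = mu1); under mu1, d(X) ~ N((mu1 - mu0)/se, 1)
    return 1 - Phi(c - (mu1 - mu0) / se)

print(power(0.0))  # = alpha = .025: the type I error probability
print(power(0.2))  # power against mu1 = .2 (≈ .17); beta(.2) = 1 - this
```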

15
Inductive Behavior Philosophy
Philosophical issues and debates arise once one begins to consider the interpretations of the formal
apparatus
‘Accept/Reject’ are identified with deciding to take specific actions, e.g., publishing a result,
announcing a new effect.
The justification for optimal tests is that:
“it may often be proved that if we behave according to such a rule ... we shall reject H when
it is true not more, say, than once in a hundred times, and in addition we may have evidence
that we shall reject H sufficiently often when it is false.”
Neyman: Tests are not rules of inductive inference but rules of behavior:
The goal is not to adjust our beliefs but rather to “adjust our behavior” to limited amounts of data
Is he just drawing a stark contrast between N-P tests and Fisherian as well as Bayesian methods?
Or is the behavioral interpretation essential to the tests?
16
“Inductive behavior” vs. “Inductive inference” battle
Commingles philosophical, statistical and personality clashes.
Fisher (1955) denounced the way that Neyman and Pearson transformed ‘his’ significance
tests into ‘acceptance procedures’
 They’ve turned my tests into mechanical rules or ‘recipes’ for ‘deciding’ to accept or reject a statistical hypothesis H0;
 The concern has more to do with speeding up production or making money than in
learning about phenomena
N-P followers are like:
“Russians (who) are made familiar with the ideal that research in pure science can and
should be geared to technological performance, in the comprehensive organized effort of a
five-year plan for the nation.” (1955, 70)

17
Pearson distanced himself from Neyman’s “inductive behavior” jargon, calling it “Professor
Neyman’s field rather than mine.”
But the most impressive mathematical results were in the decision-theoretic framework of
Neyman-Pearson-Wald.
Many of the qualifications by Neyman and Pearson in the first wave are overlooked in the
philosophy of statistics literature.
Admittedly, these “evidential” practices were not made explicit *. (Had they been, the
subsequent waves of philosophy of statistics might have looked very different).
*Mayo’s goal as a graduate student.

18
The Second Wave: ~1955/60 -1980
“Post-data criticisms of N-P methods”:
Ian Hacking (1965) framed the main lines of criticism by philosophers: Neyman-Pearson tests are “suitable for before-trial betting, but not for after-trial evaluation” (p. 99).
Battles: “initial precision” vs. “final precision”;
“before-data” vs. “after-data”
After the data, he claimed, the relevant measure of support is the (relative) likelihood
Two data sets x and y may afford the same "support" to H, yet warrant different
inferences [on significance test reasoning] because x and y arose from tests with
different error probabilities.
o This is just what error statisticians want!
o But (at least early on) Hacking (1965) held to the
“Law of Likelihood”: x supports hypothesis H1 more than H2 if
P(x; H1) > P(x; H2).

19
Yet, as Barnard notes, “there always is such a rival hypothesis: That things just had to turn
out the way they actually did”.
(H,H,T,H) is made most probable by the hypothesis that makes P(H) = 1 on trials 1, 2, and 4
(0 on trial 3).
“Best explanation”? Since such a maximally likely alternative H2 can always be
constructed, H1 may always be found less well supported, even if H1 is true—no error
control.
Hacking soon rejected the likelihood approach on such grounds, but likelihoodist accounts
are advocated by others—most especially philosophers (e.g., formal epistemologists).
So we will want to consider some of the problems that beset such accounts (in philosophy
and in statistics).
To begin with we’ll need to be clear on what a likelihood function is.
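As a first pass, a minimal sketch of a likelihood function for Barnard's coin example above (treating hypotheses as fixed chances of heads; the “rigged” hypothesis is the one just described):

```python
# Sketch: likelihoods for Barnard's coin example, data x = (H,H,T,H).
# The data are fixed; the likelihood varies with the hypothesis.
def likelihood(p_heads, data="HHTH"):
    prob = 1.0
    for toss in data:
        prob *= p_heads if toss == "H" else (1 - p_heads)
    return prob

print(likelihood(0.5))   # fair coin: (.5)^4 = .0625
print(likelihood(0.75))  # ≈ .105, the best "fixed chance of heads" hypothesis
# The rigged hypothesis that makes each observed outcome certain has
# likelihood 1.0 — maximal — yet inferring it affords no error control.
```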

20
Perhaps THE key issue of controversy in the philosophy of statistics battles
The (strong) likelihood principle: likelihoods suffice to convey “all that the data have to say”:
According to Bayes’s theorem, P(x|µ) ... constitutes the entire evidence of the experiment,
that is, it tells all that the experiment has to tell. More fully and more precisely, if y is the
datum of some other experiment, and if it happens that P(x|µ) and P(y|µ) are proportional
functions of µ (that is, constant multiples of each other), then each of the two data x and y
have exactly the same thing to say about the values of µ… (Savage 1962, p. 17.)
—the error probability statistician needs to consider, in addition, the sampling distribution of
the likelihoods.
—significance levels and other error probabilities all violate the likelihood principle (Savage
1962).
Breakthrough update: A long-held “proof” of the likelihood principle by Allan Birnbaum is
the subject of some recent work of mine; I will give a colloquium talk on this in the Philosophy
Department, May 2.

21
Paradox of Optional Stopping
Instead of fixing the sample size n in advance, in some tests n is determined by a stopping rule:
In Normal testing, 2-sided H0: µ = 0 vs. H1: µ ≠ 0:
keep sampling until H0 is rejected at the .05 level
(i.e., keep sampling until |X̄| ≥ 1.96σ/√n).

Nominal vs. Actual significance levels:
With n fixed, the type I error probability is .05;
with this stopping rule, the actual significance level differs from, and will be greater than, .05.
By contrast, since likelihoods are unaffected by the stopping rule, the LP follower denies there
really is an evidential difference between the two cases (i.e., n fixed and n determined by the
stopping rule).
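A minimal simulation sketch of the inflation (assumptions: standard Normal data under a true null, stopping as soon as |X̄| ≥ 1.96/√n, with an assumed cap of 100 observations):

```python
# Sketch: simulating try-and-try-again optional stopping under a true null.
import random
from math import sqrt

def rejects_under_null(n_max=100):
    total = 0.0
    for n in range(1, n_max + 1):
        total += random.gauss(0, 1)           # one more observation
        if abs(total / n) >= 1.96 / sqrt(n):  # |xbar| >= 1.96*sigma/sqrt(n)
            return True                       # "significant" — stop sampling
    return False

trials = 10_000
print(sum(rejects_under_null() for _ in range(trials)) / trials)
# well above the nominal .05 — and it grows as n_max is allowed to grow
```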

22
Intuitively: Should it matter if I decided to toss the coin 100 times and happened to get 60% heads,
or if I decided to keep tossing until I could reject at the .05 level (2-sided) and this happened to
occur on trial 100?
Should it matter if I kept going until I found statistical significance?
Error statistical principles: Yes!—penalty for perseverance!
The LP says NO!
Savage Forum 1959: Savage audaciously declares that the lesson to draw from the optional
stopping effect is that “optional stopping is no sin” so the problem must lie with the use of
significance levels. But why accept the likelihood principle (LP)? (simplicity and freedom?)
The likelihood principle emphasized in Bayesian statistics implies, … that the rules governing
when data collection stops are irrelevant to data interpretation. It is entirely appropriate to
collect data until a point has been proved or disproved (p. 193)…This irrelevance of stopping
rules to statistical inference restores a simplicity and freedom to experimental design that had
been lost by classical emphasis on significance levels (in the sense of Neyman and Pearson)
(Edwards, Lindman, Savage 1963, p. 239).

23
For frequentists this only underscores the point raised years before by Pearson and Neyman:
A likelihood ratio (LR) may be a criterion of relative fit but it “is still necessary to determine its
sampling distribution in order to control the error involved in rejecting a true hypothesis,
because a knowledge of [LR] alone is not adequate to insure control of this error (Pearson and
Neyman, 1930, p. 106).

The key difference: likelihood fixes the actual outcome, i.e., just d(x0), while error statistics
considers outcomes other than the one observed in order to assess the error properties.
LP ⇒ irrelevance of, and no control over, error probabilities.
("why you cannot be just a little bit Bayesian" EGEK 1996)
EGEK: Error and the Growth of Experimental Knowledge (Mayo 1996)

24
The Statistical Significance Test Controversy
(Morrison and Henkel, 1970) – contributors chastise social scientists for slavish use of
significance tests
o Focus on simple Fisherian significance tests
o Philosophers direct criticisms mostly to N-P tests.
Fallacies of Rejection: Statistical vs. Substantive Significance
(i) Take statistical significance as evidence of a substantive theory that explains the effect;
(ii) Infer a discrepancy from the null beyond what the test warrants.
(i) Paul Meehl: It is fallacious to go from a statistically significant result, e.g., at the .001
level, to infer that “one’s substantive theory T, which entails the [statistical] alternative H1, has
received .. quantitative support of magnitude around .999”
A statistically significant difference (e.g., in child rearing) is not automatically evidence for a
Freudian theory.
Merely refuting the null hypothesis is too weak to corroborate substantive theories, “we have
to have ‘Popperian risk’, ‘severe test’ [as in Mayo], or what philosopher Wesley Salmon
called a highly improbable coincidence” (Meehl and Waller 2002, 184) (“damn coincidence”)
25
Fallacies of rejection:
(i) Take statistical significance as evidence of substantive theory that explains the effect
(ii) Infer a discrepancy from the null beyond what the test warrants
Finding a statistically significant effect, d(x0) > c (the cut-off for rejection), need not be
indicative of a large or meaningful effect size — the test may be too sensitive.
Large n Problem: an α-significant rejection of H0 can be very probable, even with a
substantively trivial discrepancy from H0.
This is often taken as a criticism because it is assumed that statistical significance at a given
level is more evidence against the null the larger the sample size (n)—fallacy!
"The thesis implicit in the [NP] approach [is] that a hypothesis may be rejected with increasing
confidence or reasonableness as the power of the test increases” (Howson and Urbach 1989 and
later editions)
In fact, it is indicative of less of a discrepancy from the null than if it resulted from a smaller
sample size.
This also comes in the form of the “Jeffreys-Good-Lindley” paradox:
Even a highly statistically significant result can, with n sufficiently large, correspond to a high
posterior probability on the null hypothesis.
26
Fallacy of Non-Statistically Significant Results
Test T fails to reject the null, when the test statistic fails to reach the cut-off point for
rejection, i.e., d(x0) ≤ c.
A classic fallacy is to construe such a “negative” result as evidence for the correctness of the null
hypothesis (common in risk assessment contexts).
“No evidence against” is not “evidence for”
Merely surviving the statistical test is too easy, occurs too frequently, even when the null is false.
—results from tests lacking sufficient sensitivity or power.
The Power Analytic Movement of the 60’s in psychology
Jacob Cohen: By considering ahead of time the Power of the test, select a test capable of detecting
discrepancies of interest.
(Power is a feature of N-P tests, but apparently the prevalence of Fisherian tests in the social
sciences, coupled, perhaps, with the difficulty in calculating power, resulted in ignoring power)
A multitude of tables were supplied (Cohen, 1988), but until his death he bemoaned their all-too-rare use.
27
Post-data use of power to avoid fallacies of insensitive tests
If there's a low probability of a statistically significant result even when a non-trivial
discrepancy is present (low power against that discrepancy), then a non-significant difference is not good
evidence that a non-trivial discrepancy is absent.
This still retains an unacceptable coarseness: power is always calculated relative to the cut-off
point c for rejecting H0.
We will introduce you to a way of retaining the main logic but in a data-dependent use of power.
Rather than calculating
(1) P(d(X) > c; µ = .2)    [power]
one should calculate
(2) P(d(X) > d(x0); µ = .2)    [observed power (severity)]
Even if (1) is low, (2) may be high. We return to this in the developments of Wave III.

28
III. The Third Wave: Relativism, Reformulations, Reconciliations ~1980-2005+
Rational Reconstruction and Relativism in Philosophy of Science
With Kuhnian battles fought over the very idea of a unified method of scientific inference, statistical inference became less
prominent in philosophy
— largely used in rational reconstructions of scientific episodes,
— in appraising methodological rules,
— in classic philosophical problems e.g., Duhem’s problem—reconstruct a given assignment of blame so as to
be “warranted” by Bayesian probability assignments.

problem with reconstructions: normative force.
Given the recognition that science involves subjective judgments and values, reconstructions often appeal to a
subjective Bayesian account (Salmon’s “Tom Kuhn Meets Tom Bayes”).
(Kuhn thought this was confused: no reason to suppose an algorithm remains through theory change)

Naturalisms, HPS —immersed in biology, psychology, etc., philosophers of science recoil from
unified inferential accounts.
Achinstein (2001): “scientists do not and should not take such philosophical accounts of evidence
seriously” (p. 9).
They are a priori while they should be empirical; but being empirical is not enough ….
29
Wave III in Scientific Practice: Still operative
— Statisticians turn to eclecticism.
— Non-statistician practitioners (e.g., in psychology, ecology, medicine) bemoan “unholy hybrids” (the New Hybridists):
a mixture of ideas from N-P methods, Fisherian tests, and Bayesian accounts that is
“inconsistent from both perspectives and burdened with conceptual confusion” (Gigerenzer, 1993,
p. 323).
 Faced with foundational questions, non-statistician practitioners raise anew the questions
from the first and second waves.
 Finding the automaticity and fallacies still rampant, many call on an outright “ban” on
significance tests in research, or at least insist on reforms and reformulations of statistical
tests.
Task Force to consider Test Ban in Psychology: 1990s
(They didn’t ban them, but it’s continued to give them fodder for reforms: e.g., confidence
interval estimation. Fine, but they commit the same fallacies, often, using confidence intervals.)
30
Reforms and Reinterpretations Within Error Probability Statistics
Any adequate reformulation must:
(i) show how to avoid classic fallacies (of acceptance and of rejection) on principled grounds;
(ii) show that it provides an account of inductive inference.

31
Avoiding Fallacies
We will discuss attempts to avoid fallacies of acceptance and rejection (e.g., using confidence
interval estimates).
Move away from coarse accept/reject rule; use specific result (significant or insignificant) to infer
those discrepancies from the null that are well ruled-out, and those which are not.
e.g., Interpretation of Non-Significant results
If d(x) is not statistically significant, and the test had a very high probability of a
more statistically significant difference if µ > µ0 + γ, then d(x) is good grounds for
inferring µ ≤ µ0 + γ.
Use the specific outcome to infer an upper bound
µ ≤ µ* (values beyond µ* are ruled out at the given severity).

32
Takes us back to the post-data version of power:
Rather than construe “a miss as good as a mile”, parity of logic suggests that the post-data
power assessment should replace the usual calculation of power against µ1:
POW(µ1) = P(d(X) > c; µ = µ1),
with what might be called the power actually attained or, to have a distinct term, the severity
(SEV):
SEV(µ < µ1) = P(d(X) > d(x0); µ = µ1),
where d(x0) is the observed (non-statistically significant) result.
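A minimal sketch of ordinary power vs. attained power (severity) for the running example (assuming σ = 1, n = 25, cutoff c = 1.96, and an observed non-significant d(x0) = .5):

```python
# Sketch: ordinary power vs. attained power (severity) at mu1 = .2,
# for sigma = 1, n = 25, c = 1.96, and observed d(x0) = .5.
from math import erf, sqrt

def Phi(z):  # standard Normal CDF
    return 0.5 * (1 + erf(z / sqrt(2)))

def tail(threshold, mu1, mu0=0.0, sigma=1.0, n=25):
    se = sigma / sqrt(n)
    return 1 - Phi(threshold - (mu1 - mu0) / se)  # P(d(X) > threshold; mu1)

print(tail(1.96, 0.2))  # (1) POW(.2)      ≈ .17 — low
print(tail(0.5, 0.2))   # (2) SEV(mu < .2) ≈ .69 — considerably higher
print(tail(0.5, 0.5))   # SEV(mu < .5)     ≈ .98: mu < .5 well warranted
```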

33
Fallacies of Rejection: The Large n-Problem
While with a nonsignificant result, the concern is erroneously inferring that a discrepancy from µ0
is absent;
With a significant result x0, the concern is erroneously inferring that it is present.
Utilizing the severity assessment: an α-significant difference with sample size n1 passes µ > µ1 less
severely than one with n2, where n1 > n2.
(What’s more indicative of a large effect (fire): a fire alarm that goes off with burnt toast, or
one so insensitive that it doesn’t go off unless the house is fully ablaze? The larger sample size is
like the alarm that goes off with burnt toast.)
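A minimal sketch of this large-n effect (assuming σ = 1 and a just-significant result d(x0) = 1.96 at each sample size):

```python
# Sketch: a just-significant result (d(x0) = 1.96) warrants mu > .2
# less severely as n grows (sigma = 1 assumed).
from math import erf, sqrt

def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

def sev_mu_greater(mu1, n, d_obs=1.96, mu0=0.0, sigma=1.0):
    se = sigma / sqrt(n)
    # SEV(mu > mu1) = P(d(X) <= d(x0); mu = mu1)
    return Phi(d_obs - (mu1 - mu0) / se)

for n in (25, 100, 400):
    print(n, round(sev_mu_greater(0.2, n), 2))  # .83, .48, .02
```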
In this way we solve the problems of tests too sensitive or not sensitive enough, but there’s one
more thing ... showing how it supplies an account of inductive inference.
Many argue in Wave III that error statistical methods cannot supply an account of inductive
inference because error probabilities conflict with posterior probabilities.

34
P-values vs Bayesian Posteriors
A statistically significant difference from H0 can correspond to large posteriors in H0
From the Bayesian perspective, it follows that p-values come up short as a measure of
inductive evidence,
 the significance testers balk at the fact that the recommended priors result in highly
significant results being construed as no evidence against the null — or even evidence for
it!
The conflict often considers the two-sided test:
H0: µ = µ0 versus H1: µ ≠ µ0.
(The difference between p-values and posteriors is far less marked with one-sided tests.)
“Assuming a prior of .5 to H0, with n = 50 one can classically ‘reject H0 at significance level p =
.05,’ although P(H0|x) = .52 (which would actually indicate that the evidence favors H0).”
This is taken as a criticism of p-values only because it is assumed the .52 posterior is the
appropriate measure of belief-worthiness.
As the sample size increases, the conflict becomes more noteworthy.

35
If n = 1000, a result statistically significant at the .05 level leads to a posterior on the null of .82!
SEV(H1) = .95, while the corresponding posterior has gone from .5 to .82. What warrants such a prior?
Posterior P(H0|x) by sample size n:

p      t        n=10    n=20    n=50    n=100   n=1000
.10    1.645    .47     .56     .65     .72     .89
.05    1.960    .37     .42     .52     .60     .82
.01    2.576    .14     .16     .22     .27     .53
.001   3.291    .024    .026    .034    .045    .124
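One standard way numbers like these arise is a sketch assuming a spiked prior P(H0) = .5 and a Normal(0, σ²) prior on µ under H1 (a Berger-Sellke-style calculation; that this is the slide's exact setup is an assumption):

```python
# Sketch: posterior P(H0|x) with prior P(H0) = .5 and a N(0, sigma^2)
# prior on mu under H1 (Berger-Sellke-style; assumed, not from the slide).
from math import exp, sqrt

def posterior_H0(t, n):
    # Bayes factor in favor of H0: mu = 0, for z-statistic t at sample size n
    B01 = sqrt(1 + n) * exp(-t**2 / (2 * (1 + 1 / n)))
    return B01 / (1 + B01)  # posterior, given equal priors on H0 and H1

for n in (10, 20, 50, 100, 1000):
    print(n, round(posterior_H0(1.96, n), 2))  # ≈ .37 .42 .52 .60 .82
```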

(1) Some claim the prior of .5 is a warranted frequentist assignment:
H0 was randomly selected from an urn in which 50% are true.
(*) Therefore P(H0) = .5.
H0 may be 0 change in extinction rates, 0 lead concentration, etc.
What should go in the urn of hypotheses?
For the frequentist, either H0 is true or false; the probability in (*) is fallacious and results from an
unsound instantiation.
We are very interested in how false H0 might be, which is what we can assess by means of a severity
assessment.
36
(2) Subjective degree of belief assignments will not ensure the error probability, and thus the
severity, assessments we need.
(3) Some suggest an “impartial” or “uninformative” Bayesian prior gives .5 to H0, the remaining .5
probability being spread out over the alternative parameter space, e.g., Jeffreys.
This “spiked concentration of belief in the null” is at odds with the prevailing view “we know all
nulls are false”.

37
Wave IV ~2005–: Recapitulation of previous waves + new challenges to the reliability of science
A. Contemporary “Impersonal” Bayesianism: In the Bayes vs frequentist wars, the impersonal
Bayesian tries to have frequentist guarantees
Because of the difficulty of eliciting subjective priors, and because of the reluctance among
scientists to allow subjective beliefs to be conflated with the information provided by data,
much current Bayesian work in practice favors conventional “default”, “uninformative,” or
“reference” priors.
We may call them “conventional” Bayesians
The conventional Bayesians abandon coherence, the LP, and strive to match frequentist error
probabilities!

38
Some questions for “reference” Bayesians
1. What do reference posteriors measure?
 A classic conundrum: there is no unique “noninformative” prior. (Supposing there is
one leads to inconsistencies in calculating posterior marginal probabilities).
 Any representation of ignorance or lack of information that succeeds for one
parameterization will, under a different parameterization, entail having knowledge.
 The conventional prior is said to be simply something that allows computing the
posterior (otherwise undefined); the priors are weights of some sort.
 Not to be considered expressions of uncertainty, ignorance, or degree of belief.
 May not even be probabilities; flat priors may not integrate to one (improper priors). If priors
are not probabilities, what then is the interpretation of a posterior?

39
2. Priors for the same hypothesis change according to what experiment is to be done!
(Bayesianly incoherent)
If the prior is to represent information, why should it be influenced by the sample space of a
contemplated experiment?
Violates the likelihood principle — the cornerstone of Bayesian coherency
Conventional Bayesians: it is “the price” of objectivity.
Seems to wreak havoc with basic Bayesian foundations, but without the payoff of an
objective, interpretable output—even subjective Bayesians object.
3. Reference posteriors with good frequentist properties
Reference priors are touted as having some good frequentist properties, at least in one-dimensional problems.
They are deliberately designed to match frequentist error probabilities.
If you want error probabilities, why not use techniques that provide them directly?
By the way, using conditional probability—which is part and parcel of probability theory (as in
“Bayes nets”, etc.)—in no way makes one a Bayesian: no priors are assigned to hypotheses….
40
B. Brand new sets of crises (so new we have barely started writing on them):
Research implicating statistical methods with pseudoscience, fraud, unreplicable results
Origins?
 Controversies about models in economics, climate change, medicine?
 Economic downturn (open-source journals demand sexy results?); big data
makes it easy to “cherry pick” and data mine to get ad hoc models and unreliable
results?
 Use of Bayesian statistics?
 Use of frequentist statistics?
 Computerized data analysis?
Regardless of the source, it has resulted in one of the hottest topics in science (one that
philosophers should be involved in).
41
Different forms:
(i) Science-wise false discovery rates.
Given type 1 and type 2 error probabilities, and an assumption of the proportion
of false hypotheses studied, it is argued that most statistically significant
“discoveries” are false—
Stems from large-scale screening in bioinformatics
Not based on real data, but conjecture and simulation.
(ii) Journal practices: attention-getting articles with eye-catching, but inadequately
scrutinized, conjectures. (Stapel in social psychology, who collected no data.)

(iii) Unthinking uses of statistics (previous waves I-III)

42
We can’t possibly cover the tremendous number of important issues, let alone
readings; we were sorely tempted to include many “greats” that we’ve had to
omit to avoid overwhelming you. But you will, by the end of the course, have a
basic methodological framework within which current methodological problems
may be understood and addressed.

43
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 

Phil 6334 Mayo slides Day 1

  • 1. "4 Waves in Philosophy of Statistics" What is the Philosophy of Statistics? At one level of analysis at least, statisticians and philosophers of science ask many of the same questions:  What should be observed and what may justifiably be inferred from the resulting data?  How well do data confirm or fit a model?  What is a good test?  Must predictions be “novel” in some sense? (selection effects, double counting, data mining)  How can spurious relationships be distinguished from genuine regularities? from causal regularities?  How can we infer more accurate and reliable observations from less accurate ones?  When does a fitted model account for regularities in the data? 1
  • 2. That these very general questions are entwined with long standing debates in philosophy of science helps to explain why the field of statistics tends to cross over so often into philosophical territory. That statistics is a kind of “applied philosophy of science” is not too far off the mark (Kempthorne, 1976). 2
  • 3. Statistics  philosophy3 ways statistical accounts are used in philosophy of science (1) Model Scientific Inference—to capture either the actual or rational ways to arrive at evidence and inference (2) Resolve Philosophical Problems about scientific inference, observation, experiment; (problem of induction, objectivity of observation, reliable evidence, Duhem's problem, underdetermination). (3) Perform a Metamethodological Critique—scrutinize methodological rules, e.g., accord special weight to "novel" facts, avoid ad hoc hypotheses, avoid "data mining", require randomization. philosophy  statistics central job to help resolve the conceptual, logical, and methodological discomforts of scientists as to: how to make reliable inferences despite uncertainties and errors? In tackling the problems around which the statistics wars have been fought, I claim, one also arrives at a general account of inductive inference that solves or makes progress on: the philosopher's problems of induction, objective evidence, underdetermination. 3
  • 4. History and philosophy of statistics is a huge territory marked by 70 years of debates widely known for reaching unusual heights both of passion and of technical complexity. To get a handle on the movements and cycles without too much distortion, I propose to identify four main “battle waves”— Wave I ~ 1930 –1955/60 Wave II~ 1955/60-1980 Wave III~1980-2005 Wave IV~2005- (ongoing) 4
  • 5. Confirmation Theory: The Search for Measures of Degree of Evidential-Relationship (E-R) Philosophy of science: e.g., 1960s and 70s (in the early to mid 20th century), saw a resurgence of interest in solving the traditional Humean problem of induction. Conceding that all attempts to solve the problem of induction, fail, philosophers of induction turned to constructing logics of induction or confirmation theories (e.g., Carnap 1962). The thinking was/is: Deductive logic: rules to compute whether a conclusion is true, given the truth of a set of premises (True) Inductive logic or confirmation theory: would provide rules to compute the probability of a conclusion, given the truth of certain evidence statements (?) Having conceded loss in the battle for justifying induction, philosophers appeal to logic to capture scientific method 5
  • 6. Inductive Logics

"Confirmation Theory": rules to assign degrees of probability or confirmation to hypotheses given evidence e (Carnap: C(H, e)).
Inductive Logicians: we can build and try to justify "inductive logics" (the straight rule); assign degrees of confirmation/credibility.
Statistical affinity: Bayesian (and likelihoodist) accounts.

Logic of falsification (methodological falsification): rules to decide when to "prefer" or accept hypotheses (Popper).
Deductive Testers: we can reject induction and uphold the "rationality" of preferring or accepting H if it is "well tested."
Statistical affinity: Fisherian and Neyman-Pearson methods, where probability enters to ensure the reliability and severity of tests.
  • 7. The goal of an inductive logic: to supply the means to compute the degree of evidential relationship between given evidence statements e and a hypothesis H, e.g., by looking to conditional probability or Bayes's Theorem:
P(H|e) = P(e|H)P(H)/P(e), where P(e) = P(e|H)P(H) + P(e|not-H)P(not-H).
Computing P(H|e), the posterior probability, requires a probability assignment to all of the members of "not-H". A major source of difficulty: how to obtain and interpret these prior probabilities.
a. If analytic and a priori, their relevance for predicting and learning about empirical phenomena is problematic.
b. If they measure subjective degrees of belief, their relevance for giving objective guarantees of reliable inference is unclear.
In statistics, (a) is analogous to "objective" Bayesianism (e.g., Jeffreys); (b) to subjective Bayesianism.
The Bayesian-frequentist controversy is one of the big topics we'll explore in this course.
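To make the computation concrete, here is a minimal Python sketch of Bayes's theorem; the prior and likelihoods are invented purely for illustration:

# Minimal sketch of Bayes's theorem; all numbers are invented for illustration.
def posterior(prior_H, p_e_given_H, p_e_given_notH):
    # P(H|e) = P(e|H)P(H) / [P(e|H)P(H) + P(e|not-H)P(not-H)]
    p_e = p_e_given_H * prior_H + p_e_given_notH * (1 - prior_H)
    return p_e_given_H * prior_H / p_e

print(posterior(0.5, 0.8, 0.2))  # 0.8: evidence e raises P(H) from .5 to .8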
  • 8. A core question: What is the nature and role of probabilistic concepts, methods, and models in making inferences in the face of limited data, uncertainty, and error?
Three roles for probability: degrees of confirmation, long-run error rates, degrees of well-testedness:
a. To provide a post-data assignment of degree of probability, confirmation, support, or belief in a hypothesis (probabilism);
b. To ensure the long-run reliability of methods (performance);
c. To determine the warrant of hypotheses by assessing how stringently or severely they have been probed (probativeness).
These three contrasting philosophies of the role of probability in statistical inference are at the heart of the central points of controversy in the "four waves" of philosophy of statistics.
  • 9. I. Philosophy of Statistics: "The First Battle Wave"
Wave I, circa 1930-1955/60: Fisher, Neyman, Pearson, Savage, and Jeffreys.
Statistical inference tools use data x to probe aspects of the data-generating source. In statistical testing, these aspects are framed as statistical hypotheses about parameters governing a statistical distribution. H tells us the "probability of x under H", written P(x; H) (a probabilistic assignment under a model):
P(H,H,T,H,T,T,T,H,H,T; fair coin) = (.5)^10
We will explain how this differs from the conditional probability in Bayes's rule or theorem, P(x|H).
  • 10. Modern Statistics Begins with Fisher: "Simple" Significance Tests
Fisher strongly objected to Bayesian inference, in particular to the use of prior distributions (relevant, on his view, for psychology, not science). He looked to develop ways of expressing the uncertainty of inferences without deviating from frequentist probabilities.
Example. Let the sample X = (X1, …, Xn) be n iid (independent and identically distributed) outcomes from a Normal distribution with standard deviation σ = 1.
1. A null hypothesis H0: μ = 0, e.g., zero mean concentration of lead, no difference in mean survival in a given group, in mean risk, in mean deflection of light.
2. A test statistic d(X), a function of the sample, reflecting the difference between the data x0 = (x1, …, xn) and H0: the larger d(x0), the further the outcome is from what is expected under H0, with respect to the particular question being asked.
3. The p-value: the probability of a difference larger than d(x0), computed under the assumption that H0 is true:
p(x0) = P(d(X) > d(x0); H0).
  • 11. Mini-recipe for the p-value calculation: p(x0) = P(d(X) > d(x0); H0).
Suppose the observed sample mean is x̄ = .1. The relevant test statistic is
d(X) = (X̄ − μ0)/σx̄, where X̄ is the sample mean and σx̄ = σ/√n is its standard deviation; that is,
d(X) = (Observed − Expected under H0)/σx̄.
Let n = 25. Since σx̄ = σ/√n = 1/5 = .2, the observed difference .1 − 0 in units of σx̄ yields d(x0) = .1/.2 = .5.
Under the null, d(X) is distributed as standard Normal, denoted d(X) ~ N(0,1). The area to the right of .5 is ~.3, i.e., not very significant.
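The same mini-recipe as a minimal Python sketch, using the slide's numbers (n = 25, σ = 1, observed mean .1; statistics.NormalDist needs Python 3.8+):

from math import sqrt
from statistics import NormalDist

n, sigma, xbar, mu0 = 25, 1.0, 0.1, 0.0
se = sigma / sqrt(n)                    # sigma_xbar = 1/5 = 0.2
d_obs = (xbar - mu0) / se               # d(x0) = 0.1/0.2 = 0.5
p_value = 1 - NormalDist().cdf(d_obs)   # area to the right of 0.5 under N(0,1)
print(d_obs, round(p_value, 3))         # 0.5, ~0.309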
  • 12. Logic of Simple Significance Tests: Statistical Modus Tollens
"Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis" (Fisher, 1956, p. 160).
The statistical analogy to the deductively valid pattern modus tollens:
If the hypothesis H0 is correct then, with high probability 1 − p, the data would not be statistically significant at level p.
x0 is statistically significant at level p.
Therefore, x0 is evidence against H0, or x0 indicates the falsity of H0.
  • 13. The Alternative or "Non-Null" Hypothesis
Evidence against H0 seems to indicate evidence for some alternative. Fisherian significance tests strictly consider only H0; Neyman and Pearson (N-P) tests introduce an alternative H1 (even if only to serve as a direction of departure).
Example. X = (X1, …, Xn), iid Normal with σ = 1: H0: μ = 0 vs. H1: μ > 0.
Despite the bitter disputes with Fisher that were to erupt soon after ~1935, Neyman and Pearson at first saw their work as merely placing Fisherian tests on firmer logical footing. Much of Fisher's hostility toward N-P methods reflects professional and personality conflicts more than philosophical differences.
  • 14. Neyman-Pearson (N-P) Tests
An N-P hypothesis test maps each outcome x = (x1, …, xn) into either the null hypothesis H0 or an alternative hypothesis H1 (where the two exhaust the parameter space), so as to ensure that the probabilities of erroneous rejections (Type I errors) and erroneous acceptances (Type II errors) are controlled at prespecified values, e.g., 0.05 or 0.01, the significance level of the test. It also requires a sensible distance measure d(x0).
Test T+: X = (X1, …, Xn), iid Normal with σ = 1: H0: μ = μ0 vs. H1: μ > μ0.
If d(x0) > c, "reject" H0 (or declare the result statistically significant at the α level);
if d(x0) ≤ c, "do not reject" or "accept" H0. E.g., c = 1.96 for α = .025.
"Accept"/"reject" are uninterpreted parts of the mathematical apparatus.
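A minimal sketch of the T+ rule in Python; the data vector is hypothetical, and σ is assumed known, as in the slide:

from math import sqrt

def np_test_Tplus(xs, mu0=0.0, sigma=1.0, c=1.96):
    # d(x0) = (xbar - mu0) / (sigma / sqrt(n)); reject iff d(x0) > c
    n = len(xs)
    d = (sum(xs) / n - mu0) / (sigma / sqrt(n))
    return ("reject H0" if d > c else "do not reject H0"), d

decision, d = np_test_Tplus([0.8, 1.3, -0.2, 1.1, 0.9])  # hypothetical data
print(decision, round(d, 2))  # "do not reject H0", d ~ 1.74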
  • 15. Testing Errors and Error Probabilities
Type I error: reject H0 even though H0 is true.
Type II error: fail to reject H0 even though H0 is false.
Probability of a Type I error: P(d(X) > c; H0) ≤ α.
Probability of a Type II error: P(test T+ does not reject H0; μ = μ1) = P(d(X) ≤ c; μ = μ1) = β(μ1), for any μ1 > μ0.
The "best" test at level α at the same time minimizes the value of β(μ1) for all μ1 > μ0, or equivalently, maximizes the power:
POW(μ1) = P(d(X) > c; μ = μ1).
T+ is a uniformly most powerful (UMP) level α test.
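A small sketch computing POW(μ1) for T+ under the running setup (σ = 1 known, c = 1.96; the n = 25 and the alternatives are illustrative choices):

from math import sqrt
from statistics import NormalDist

def power(mu1, mu0=0.0, sigma=1.0, n=25, c=1.96):
    # Under mu = mu1, d(X) ~ N((mu1 - mu0)/(sigma/sqrt(n)), 1)
    shift = (mu1 - mu0) / (sigma / sqrt(n))
    return 1 - NormalDist().cdf(c - shift)

print(round(power(0.2), 2))  # ~0.17: low power against a small discrepancy
print(round(power(0.8), 2))  # ~0.98: high power against a large one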
  • 16. Inductive Behavior Philosophy
Philosophical issues and debates arise once one begins to consider the interpretations of the formal apparatus.
"Accept"/"reject" are identified with deciding to take specific actions, e.g., publishing a result, announcing a new effect.
The justification for optimal tests is that "it may often be proved that if we behave according to such a rule ... we shall reject H when it is true not more, say, than once in a hundred times, and in addition we may have evidence that we shall reject H sufficiently often when it is false."
Neyman: tests are not rules of inductive inference but rules of behavior. The goal is not to adjust our beliefs but rather to "adjust our behavior" to limited amounts of data.
Is he just drawing a stark contrast between N-P tests and Fisherian as well as Bayesian methods? Or is the behavioral interpretation essential to the tests?
  • 17. The "Inductive Behavior" vs. "Inductive Inference" Battle
It commingles philosophical, statistical, and personality clashes. Fisher (1955) denounced the way Neyman and Pearson transformed "his" significance tests into "acceptance procedures":
 They've turned my tests into mechanical rules or "recipes" for "deciding" to accept or reject statistical hypotheses H0.
 Their concern has more to do with speeding up production or making money than with learning about phenomena.
N-P followers are like "Russians (who) are made familiar with the ideal that research in pure science can and should be geared to technological performance, in the comprehensive organized effort of a five-year plan for the nation." (1955, p. 70)
  • 18. Pearson distanced himself from Neyman's "inductive behavior" jargon, calling it "Professor Neyman's field rather than mine." But the most impressive mathematical results were in the decision-theoretic framework of Neyman-Pearson-Wald.
Many of the qualifications made by Neyman and Pearson in the first wave are overlooked in the philosophy of statistics literature. Admittedly, these "evidential" practices were not made explicit.* (Had they been, the subsequent waves of philosophy of statistics might have looked very different.)
*Mayo's goal as a graduate student.
  • 19. The Second Wave: ~1955/60-1980
"Post-data criticisms of N-P methods": Ian Hacking (1965) framed the main lines of criticism by philosophers: "Neyman-Pearson tests as suitable for before-trial betting, but not for after-trial evaluation" (p. 99).
Battles: "initial precision" vs. "final precision"; before-data vs. after-data.
After the data, he claimed, the relevant measure of support is the (relative) likelihood.
Two data sets x and y may afford the same "support" to H, yet warrant different inferences [on significance-test reasoning] because x and y arose from tests with different error probabilities.
o This is just what error statisticians want!
o But (at least early on) Hacking (1965) held to the "Law of Likelihood": x supports hypothesis H1 more than H2 if P(x; H1) > P(x; H2).
  • 20. Yet, as Barnard notes, "there always is such a rival hypothesis: That things just had to turn out the way they actually did."
(H,H,T,H) is made most probable by the hypothesis that makes P(H) = 1 on trials 1, 2, and 4 (and 0 on trial 3). "Best explanation"?
Since such a maximally likely alternative H2 can always be constructed, H1 may always be found less well supported, even if H1 is true: no error control.
Hacking soon rejected the likelihood approach on such grounds, but likelihoodist accounts are advocated by others, most especially philosophers (e.g., formal epistemologists). So we will want to consider some of the problems that beset such accounts (in philosophy and in statistics). To begin with, we'll need to be clear on what a likelihood function is.
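The arithmetic behind Barnard's point, as a tiny sketch:

p_fair = 0.5 ** 4   # P(H,H,T,H; fair coin)
p_rigged = 1.0      # P under the rival that each toss had to land exactly as it did
print(p_fair, p_rigged / p_fair)  # 0.0625; likelihood ratio of 16 favoring the rigged rival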
  • 21. Perhaps THE key issue of controversy in the philosophy of statistics battles
The (strong) likelihood principle (LP): likelihoods suffice to convey "all that the data have to say":
"According to Bayes's theorem, P(x|µ) ... constitutes the entire evidence of the experiment, that is, it tells all that the experiment has to tell. More fully and more precisely, if y is the datum of some other experiment, and if it happens that P(x|µ) and P(y|µ) are proportional functions of µ (that is, constant multiples of each other), then each of the two data x and y have exactly the same thing to say about the values of µ…" (Savage 1962, p. 17)
The error-probability statistician needs to consider, in addition, the sampling distribution of the likelihoods. Significance levels and other error probabilities all violate the likelihood principle (Savage 1962).
Breakthrough update: a long-held "proof" of the likelihood principle by Allan Birnbaum is the subject of some recent work of mine; I will give a colloquium talk on this in the Philosophy Department, May 2.
  • 22. Paradox of Optional Stopping
Instead of fixing the sample size n in advance, in some tests n is determined by a stopping rule. In Normal testing of the two-sided H0: μ = 0 vs. H1: μ ≠ 0:
Keep sampling until H0 is rejected at the .05 level (i.e., keep sampling until |X̄| ≥ 1.96/√n).
Nominal vs. actual significance levels: with n fixed, the Type I error probability is .05; with this stopping rule, the actual significance level differs from, and will be greater than, .05.
By contrast, since likelihoods are unaffected by the stopping rule, the LP follower denies there really is an evidential difference between the two cases (n fixed vs. n determined by the stopping rule).
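A simulation sketch of the optional-stopping effect under a true null; the budget n_max = 100, the 10,000 trials, and the seed are all arbitrary choices:

import random
from math import sqrt

def stops_with_rejection(n_max=100, rng=random):
    # Sample under a TRUE null (mu = 0, sigma = 1); stop at the first n with
    # |xbar| >= 1.96/sqrt(n), or give up after n_max draws.
    total = 0.0
    for n in range(1, n_max + 1):
        total += rng.gauss(0, 1)
        if abs(total / n) >= 1.96 / sqrt(n):
            return True   # "statistically significant" at the nominal .05 level
    return False

random.seed(1)  # arbitrary seed for reproducibility
trials = 10_000
print(sum(stops_with_rejection() for _ in range(trials)) / trials)
# prints a rejection rate well above the nominal .05

With an unbounded budget, perseverance pushes the actual rejection rate higher still: this is the "sampling to a foregone conclusion" that the error statistician penalizes and the LP follower ignores.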
  • 23. Intuitively: Should it matter if I decided to toss the coin 100 times and happened to get 60% heads, or if I decided to keep tossing until I could reject at the .05 level (2-sided) and this happened to occur on trial 100? Should it matter if I kept going until I found statistical significance?
Error statistical principles say yes: there is a penalty for perseverance! The LP says no.
Savage Forum, 1959: Savage audaciously declares that the lesson to draw from the optional stopping effect is that "optional stopping is no sin," so the problem must lie with the use of significance levels. But why accept the likelihood principle (LP)? (Simplicity and freedom?)
"The likelihood principle emphasized in Bayesian statistics implies, … that the rules governing when data collection stops are irrelevant to data interpretation. It is entirely appropriate to collect data until a point has been proved or disproved (p. 193)… This irrelevance of stopping rules to statistical inference restores a simplicity and freedom to experimental design that had been lost by classical emphasis on significance levels (in the sense of Neyman and Pearson)." (Edwards, Lindman, and Savage 1963, p. 239)
  • 24. For frequentists this only underscores the point raised years before by Pearson and Neyman: a likelihood ratio (LR) may be a criterion of relative fit, but it "is still necessary to determine its sampling distribution in order to control the error involved in rejecting a true hypothesis, because a knowledge of [LR] alone is not adequate to insure control of this error" (Pearson and Neyman, 1930, p. 106).
The key difference: likelihood fixes the actual outcome, i.e., just d(x0), while error statistics considers outcomes other than the one observed in order to assess the error properties of the method.
The LP entails the irrelevance of, and no control over, error probabilities. ("Why you cannot be just a little bit Bayesian," EGEK 1996.)
EGEK: Error and the Growth of Experimental Knowledge (Mayo 1996)
  • 25. The Statistical Significance Test Controversy (Morrison and Henkel, 1970): contributors chastise social scientists for slavish use of significance tests.
o The focus is on simple Fisherian significance tests; philosophers direct their criticisms mostly at N-P tests.
Fallacies of Rejection: Statistical vs. Substantive Significance
(i) taking statistical significance as evidence of a substantive theory that explains the effect;
(ii) inferring a discrepancy from the null beyond what the test warrants.
On (i), Paul Meehl: it is fallacious to go from a statistically significant result, e.g., at the .001 level, to infer that "one's substantive theory T, which entails the [statistical] alternative H1, has received .. quantitative support of magnitude around .999".
A statistically significant difference (e.g., in child rearing) is not automatically evidence for a Freudian theory. Merely refuting the null hypothesis is too weak to corroborate substantive theories: "we have to have 'Popperian risk', 'severe test' [as in Mayo], or what philosopher Wesley Salmon called a highly improbable coincidence" (Meehl and Waller 2002, p. 184) ("damn coincidence").
  • 26. Fallacies of rejection:
(i) taking statistical significance as evidence of a substantive theory that explains the effect;
(ii) inferring a discrepancy from the null beyond what the test warrants.
Finding a statistically significant effect, d(x0) > c (the cut-off for rejection), need not be indicative of a large or meaningful effect size: the test may be too sensitive.
The Large n Problem: an α-significant rejection of H0 can be very probable, even with a substantively trivial discrepancy from H0. This is often taken as a criticism because it is assumed that statistical significance at a given level is more evidence against the null the larger the sample size (n). Fallacy! "The thesis implicit in the [N-P] approach [is] that a hypothesis may be rejected with increasing confidence or reasonableness as the power of the test increases" (Howson and Urbach 1989 and later editions).
In fact, a result significant at a given level is indicative of less of a discrepancy from the null than if it resulted from a smaller sample size.
This also comes in the form of the "Jeffreys-Good-Lindley" paradox: even a highly statistically significant result can, with n sufficiently large, correspond to a high posterior probability on the null hypothesis.
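A tiny sketch of why: holding the cut-off d(x0) = 1.96 fixed, the observed mean needed for significance shrinks with n (σ = 1, µ0 = 0, as in the running example):

from math import sqrt

# Observed mean just reaching d(x0) = 1.96 at each sample size:
for n in (25, 100, 10_000):
    print(n, round(1.96 / sqrt(n), 3))
# n=25 requires xbar ~ 0.392; n=10,000 is "significant" with xbar ~ 0.02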
  • 27. Fallacy of Non-Statistically Significant Results
Test T fails to reject the null when the test statistic fails to reach the cut-off point for rejection, i.e., d(x0) ≤ c.
A classic fallacy is to construe such a "negative" result as evidence for the correctness of the null hypothesis (common in risk assessment contexts). "No evidence against" is not "evidence for."
Merely surviving the statistical test is too easy; it occurs too frequently, even when the null is false, with tests lacking sufficient sensitivity or power.
The Power Analytic Movement of the 1960s in psychology. Jacob Cohen: by considering the power of the test ahead of time, select a test capable of detecting discrepancies of interest. (Power is a feature of N-P tests, but apparently the prevalence of Fisherian tests in the social sciences, coupled, perhaps, with the difficulty of calculating power, resulted in power being ignored.) A multitude of tables were supplied (Cohen, 1988), but until his death he bemoaned their all-too-rare use.
  • 28. Post-data Use of Power to Avoid Fallacies of Insensitive Tests
If there is a low probability of a statistically significant result even when a non-trivial discrepancy is present (low power against the non-trivial alternative), then a non-significant difference is not good evidence that a non-trivial discrepancy is absent.
This still retains an unacceptable coarseness: power is always calculated relative to the cut-off point c for rejecting H0. We will introduce a way of retaining the main logic but in a data-dependent use of power. Rather than calculating
(1) P(d(X) > c; µ = .2)   [power],
one should calculate
(2) P(d(X) > d(x0); µ = .2)   [observed power (severity)].
Even if (1) is low, (2) may be high. We return to this in the developments of Wave III.
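A minimal sketch contrasting (1) and (2) under the running setup (σ = 1, µ0 = 0, n = 25, c = 1.96); the non-significant d(x0) = .5 is a hypothetical value:

from math import sqrt
from statistics import NormalDist

n, sigma, mu0, mu1, c = 25, 1.0, 0.0, 0.2, 1.96
d_obs = 0.5  # hypothetical non-significant result
shift = (mu1 - mu0) / (sigma / sqrt(n))   # mean of d(X) when mu = 0.2
print(round(1 - NormalDist().cdf(c - shift), 2))      # (1) power: ~0.17
print(round(1 - NormalDist().cdf(d_obs - shift), 2))  # (2) observed power: ~0.69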
  • 29. III. The Third Wave: Relativism, Reformulations, Reconciliations, ~1980-2005+
Rational Reconstruction and Relativism in Philosophy of Science
With Kuhnian battles being fought over the very idea of a unified method of scientific inference, statistical inference became less prominent in philosophy. It was largely used:
— in rational reconstructions of scientific episodes,
— in appraising methodological rules,
— in classic philosophical problems, e.g., Duhem's problem: reconstruct a given assignment of blame so as to be "warranted" by Bayesian probability assignments.
The problem with reconstructions: normative force. Recognizing that science involves subjective judgments and values, reconstructions often appeal to a subjective Bayesian account (Salmon's "Tom Kuhn Meets Tom Bayes"). (Kuhn thought this was confused: there is no reason to suppose an algorithm remains through theory change.)
Naturalisms, HPS: immersed in biology, psychology, etc., philosophers of science recoil from unified inferential accounts. Achinstein (2001): "scientists do not and should not take such philosophical accounts of evidence seriously" (p. 9). They are a priori while they should be empirical; but being empirical is not enough….
  • 30. Wave III in Scientific Practice: Still Operative
— Statisticians turn to eclecticism.
— Non-statistician practitioners (e.g., in psychology, ecology, medicine) bemoan "unholy hybrids" (the New Hybridists): a mixture of ideas from N-P methods, Fisherian tests, and Bayesian accounts that is "inconsistent from both perspectives and burdened with conceptual confusion" (Gigerenzer, 1993, p. 323).
 Faced with foundational questions, non-statistician practitioners raise anew the questions from the first and second waves.
 Finding automaticity and fallacies still rampant, many call for an outright "ban" on significance tests in research, or at least insist on reforms and reformulations of statistical tests.
A task force was convened in the 1990s to consider a test ban in psychology. (They didn't ban the tests, but the episode has continued to supply fodder for reforms, e.g., confidence interval estimation. Fine, but practitioners often commit the same fallacies using confidence intervals.)
  • 31. Reforms and Reinterpretations Within Error Probability Statistics
Any adequate reformulation must:
(i) show how to avoid the classic fallacies (of acceptance and of rejection) on principled grounds;
(ii) show that it provides an account of inductive inference.
  • 32. Avoiding Fallacies
We will discuss attempts to avoid fallacies of acceptance and rejection (e.g., using confidence interval estimates).
Move away from the coarse accept/reject rule; use the specific result (significant or insignificant) to infer those discrepancies from the null that are well ruled out, and those that are not.
E.g., interpretation of non-significant results: if d(x0) is not statistically significant, and the test had a very high probability of yielding a more statistically significant difference were µ > µ0 + γ (for a discrepancy of interest γ), then d(x0) is good grounds for inferring µ ≤ µ0 + γ.
Use the specific outcome to infer an upper bound µ ≤ µ*: values beyond it are ruled out with given severity.
  • 33. This takes us back to the post-data version of power. Rather than construing "a miss as good as a mile," parity of logic suggests that the post-data power assessment should replace the usual calculation of power against µ1:
POW(µ1) = P(d(X) > c; µ = µ1),
with what might be called the power actually attained or, to have a distinct term, the severity (SEV):
SEV(µ < µ1) = P(d(X) > d(x0); µ = µ1),
where d(x0) is the observed (non-statistically significant) result.
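A sketch of the SEV calculation for test T+ (σ = 1 assumed known, as throughout; d(x0) = .5 and the µ1 values are hypothetical):

from math import sqrt
from statistics import NormalDist

def sev_upper(mu1, d_obs, mu0=0.0, sigma=1.0, n=25):
    # SEV(mu < mu1) = P(d(X) > d(x0); mu = mu1)
    shift = (mu1 - mu0) / (sigma / sqrt(n))
    return 1 - NormalDist().cdf(d_obs - shift)

for mu1 in (0.1, 0.2, 0.4):  # d(x0) = 0.5: a hypothetical non-significant result
    print(mu1, round(sev_upper(mu1, 0.5), 2))
# mu < 0.4 is inferred with severity ~.93; mu < 0.1 with only ~.50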
  • 34. Fallacies of Rejection: The Large-n Problem
While with a non-significant result the concern is erroneously inferring that a discrepancy from µ0 is absent, with a significant result x0 the concern is erroneously inferring that one is present.
Utilizing the severity assessment: an α-significant difference with n1 passes µ > µ1 less severely than with n2, where n1 > n2.
(What's more indicative of a large effect (a fire): a fire alarm that goes off with burnt toast, or one so insensitive that it doesn't go off unless the house is fully ablaze? The larger sample size is like the alarm that goes off with burnt toast.)
In this way we solve the problems of tests that are too sensitive or not sensitive enough, but there's one more thing: showing how this supplies an account of inductive inference. Many argue in Wave III that error statistical methods cannot supply an account of inductive inference because error probabilities conflict with posterior probabilities.
  • 35. P-values vs. Bayesian Posteriors
A statistically significant difference from H0 can correspond to a large posterior probability in H0.
From the Bayesian perspective, it follows that p-values come up short as a measure of inductive evidence; the significance testers balk at the fact that the recommended priors result in highly significant results being construed as no evidence against the null, or even as evidence for it!
The conflict usually concerns the two-sided test of H0: µ = µ0 versus H1: µ ≠ µ0. (The differences between p-values and posteriors are far less marked with one-sided tests.)
"Assuming a prior of .5 to H0, with n = 50 one can classically 'reject H0 at significance level p = .05,' although P(H0|x) = .52 (which would actually indicate that the evidence favors H0)."
This is taken as a criticism of p-values only because it is assumed that the .52 posterior is the appropriate measure of belief-worthiness. As the sample size increases, the conflict becomes more noteworthy.
  • 36. If n = 1000, a result statistically significant at the .05 level leads to a posterior probability on the null of .82! SEV(H1) = .95, while the corresponding posterior has gone from .5 to .82. What warrants such a prior?

Posterior P(H0|x) as a function of sample size n:

p       t        n=10   n=20   n=50   n=100   n=1000
.10    1.645     .47    .56    .65    .72     .89
.05    1.960     .37    .42    .52    .60     .82
.01    2.576     .14    .16    .22    .27     .53
.001   3.291     .024   .026   .034   .045    .124

(1) Some claim the prior of .5 is a warranted frequentist assignment: H0 was randomly selected from an urn in which 50% are true (*); therefore P(H0) = .5.
H0 may be zero change in extinction rates, zero lead concentration, etc. What should go in the urn of hypotheses?
For the frequentist, either H0 is true or it is false; the probability in (*) is fallacious and results from an unsound instantiation. We are very interested in how false H0 might be, which is what we can get at by means of a severity assessment.
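A sketch of the kind of spiked-prior calculation behind such tables, assuming P(H0) = .5 on the point null and the remaining .5 spread as a N(0, σ²) prior on µ under H1. This common convention reproduces the .52 (n = 50) and .82 (n = 1000) figures quoted above, though other entries of the table may reflect a slightly different prior choice:

from math import sqrt, exp

def posterior_null(t, n, prior=0.5):
    # Bayes factor for H0: mu = 0 vs. a N(0, sigma^2) prior on mu under H1
    b01 = sqrt(n + 1) * exp(-t**2 * n / (2 * (n + 1)))
    return prior * b01 / (prior * b01 + (1 - prior))

print(round(posterior_null(1.960, 50), 2))    # 0.52, despite p = .05
print(round(posterior_null(1.960, 1000), 2))  # 0.82: the conflict grows with n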
  • 37. (2) Subjective degree-of-belief assignments will not ensure the error probability, and thus the severity, assessments we need.
(3) Some suggest an "impartial" or "uninformative" Bayesian prior that gives .5 to H0, the remaining .5 probability being spread out over the alternative parameter space, e.g., Jeffreys. This "spiked concentration of belief in the null" is at odds with the prevailing view that "we know all nulls are false."
  • 38. Wave IV, ~2005-: Recapitulation of Previous Waves + New Challenges to the Reliability of Science
A. Contemporary "Impersonal" Bayesianism: in the Bayesian vs. frequentist wars, the impersonal Bayesian tries to have frequentist guarantees.
Because of the difficulty of eliciting subjective priors, and because of the reluctance among scientists to allow subjective beliefs to be conflated with the information provided by data, much current Bayesian work in practice favors "default," "uninformative," or "reference" priors. We may call their users "conventional" Bayesians.
The conventional Bayesians abandon coherence and the LP, and strive to match frequentist error probabilities!
  • 39. Some questions for "reference" Bayesians
1. What do reference posteriors measure?
 A classic conundrum: there is no unique "noninformative" prior. (Supposing there is one leads to inconsistencies in calculating posterior marginal probabilities.)
 Any representation of ignorance or lack of information that succeeds for one parameterization will, under a different parameterization, entail having knowledge.
 The conventional prior is said to be simply something that allows computing the posterior; it is otherwise undefined, a weight of some sort.
 Such priors are not to be considered expressions of uncertainty, ignorance, or degree of belief.
 They may not even be probabilities; flat priors may not sum to one (improper priors). If priors are not probabilities, what then is the interpretation of a posterior?
  • 40. 2. Priors for the same hypothesis change according to which experiment is to be done! This is Bayesianly incoherent: if the prior is to represent information, why should it be influenced by the sample space of a contemplated experiment?
It violates the likelihood principle, the cornerstone of Bayesian coherency. Conventional Bayesians: it is "the price" of objectivity. This seems to wreak havoc with basic Bayesian foundations, but without the payoff of an objective, interpretable output; even subjective Bayesians object.
3. Reference posteriors with good frequentist properties: reference priors are touted as having some good frequentist properties, at least in one-dimensional problems. They are deliberately designed to match frequentist error probabilities. If you want error probabilities, why not use techniques that provide them directly?
By the way, using conditional probability, which is part and parcel of probability theory (as in "Bayes nets," etc.), in no way makes one a Bayesian: no priors are assigned to hypotheses….
  • 41. B. Brand new sets of crises (so new we have barely started writing on them): research implicating statistical methods in pseudoscience, fraud, and unreplicable results.
Origins?
 Controversies about models in economics, climate change, medicine?
 The economic downturn (do open-source journals demand sexy results?); big data making it easy to "cherry pick" and data mine one's way to ad hoc models and unreliable results?
 Use of Bayesian statistics?
 Use of frequentist statistics?
 Computerized data analysis?
Regardless of the source, it has become one of the hottest topics in science (and one philosophers should be involved in).
  • 42. It takes different forms:
(i) Science-wise false discovery rates. Given Type I and Type II error probabilities, and an assumption about the proportion of false hypotheses studied, it is argued that most statistically significant "discoveries" are false. This stems from large-scale screening in bioinformatics and is based not on real data but on conjecture and simulation.
(ii) Journal practices: attention-getting articles with eye-catching, but inadequately scrutinized, conjectures. (Stapel in social psychology, who collected no data.)
(iii) Unthinking uses of statistics (previous Waves I-III).
  • 43. We can't possibly cover the tremendous number of important issues, let alone the readings; and we were sorely tempted to include so many "greats" that we have had to omit them to avoid overwhelming you. But you will, by the end of the course, have a basic methodological framework within which current methodological problems may be understood and addressed.