1. Philosophy of Science and the New
Paradigm of Data-Driven Science
“Your data-driven claims must still be
probed severely”
Deborah G. Mayo
June 4, 2018
American Statistical Association Conference on
Statistical Learning and Data Science/Nonparametric
Statistics
2.
A central job for
philosophers of science:
• Minister to scientists’
conceptual and methodological
discomforts
• Especially fields wanting to learn despite threats of
errors and mistakes
• The risk of error (mistaken interpretations of data)
enters if you want to move beyond the data
• The output is data-transcending: you want warranted
inductive inference (not abduction)
3. A (Kuhnian) paradigm shift:
• An unreasoned shift of an entire set of methods,
theories and aims–all at once–akin to a gestalt switch or
religious conversion.
• Once you switch, you cannot go back or even
communicate with those in the old paradigm (incommensurability)
Of course, you don’t really mean this (nor does anyone!)
4. Two claims in tension?
(i) “Thanks to today’s world of high powered
computer searches and data-mining, we can
embrace a paradigm change in scientific
method.”
(ii) “Thanks to today’s world of high powered
computer searches and data-mining, we
have a crisis of replication”
5. • High powered methods make it easy to uncover
impressive-looking findings even if they are false:
• Everyone agrees you don’t have evidence for a
claim if little has been done to rule out flaws in
inferring it: in my language, it has not passed
a severe test.
• The notion applies to exploration, estimation, or
prediction.
6. • Probability is used to assess and control how
capable methods are at uncovering or
avoiding erroneous interpretations of data.
• It can be quasi-formal or informal
• What might this account say about data
science?
7. Data Science
“Data scientists have the pragmatic mandate of
producing answers, which may be in the form of
estimates, predictions or classifications. Success
of a procedure is judged by good performance,
in a dataset-specific sense, on held-out subsets
of the data.” (Kuffner and Young 2017, 3)
No radical change of aim
8. A Modest claim: Data science gives a whole
new world of exploration
• Screening or exploration: data-driven science
gives ways to discover models and theories to be
tested on new data
• Logic of Discovery: there is none (creative
brilliance, heuristics, analogies, learning from
failed theories)
9. A Bolder Claim
• Data-driven science violates rules of existing
methods of science while still being reliable
• “Observation can inspire theory, but it is
generally not scientifically (or statistically) valid
to use the very same observations to both
formulate a theoretical hypothesis and perform
empirical evaluation of the hypothesis.” (Kuffner
and Young 2017, 12)
10. No double-counting
• A common intuition about evidence: if data x
have been used to construct a hypothesis
H(x), then x should not be used again as a test
of H(x)
• Long debated in philosophy and in statistics
(Mill, Keynes, Peirce, Popper)–unresolved
11. • Testing accounts: learning takes place by
testing, falsifying (statistically), corroborating
(inferring claims that have survived probes
they would have failed if specifiably false)
• Confirmation accounts: learning takes
place by updating degrees of belief, support,
confirmation in claims
How it arises in Testing (vs
Confirmation) Accounts
12. Building an account of testing
Data x provide a good test of, or good evidence for H
if (and only if)
(a) x agree with or “fit” H
______________________________
(b) x must be novel in some sense: A novel fact for a
hypothesis H is one not already used in arriving at or
constructing H.
13. Inferences involving double-counting
may be characterized by means of a rule R
R: data x are used to construct or select hypothesis
H(x) so that the resulting H(x) fits x; and then used
“again” as evidence to warrant H (as supported,
well tested, indicated, or the like).
A “use-constructed” test procedure — H(x) violates
“use-novelty” (Musgrave 1974, Worrall 1978, 1989).
Today it might be called “data-based”
14. In large-scale science:
“I think that people emphasize prediction in validating
scientific theories because the classic attitude of
commentators on science is not to trust the theorist.
The fear is that the theorist adjusts his or her theory
to fit whatever experimental facts are already known,
so that for the theory to fit these facts is not a reliable
test of that theory….”
(Steven Weinberg, Dreams of a Final Theory, 1992,
96-7)
Why prefer hypothesis first?
15. Based on a discredited empiricism: trust data
• “…on the other hand the experimentalist
does know about the theoretical result
when he does the experiment” (Weinberg)
(Now we know not to trust either)
16.
Same threat of biased data in statistical
cases
Hypothesis: treatment is beneficial
Capitalizing on Chance: if you search through several
factors (beneficial effects of a treatment) and report
just those that show (apparently) impressive
correlations, there is a high probability of erroneously
inferring a real correlation.
—your p-value will have no relation to the actual error
probability
—need to adjust error probabilities.
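A minimal simulation sketch of this selection effect (my illustration, not from the talk; all numbers invented): screen 20 null factors at the nominal 0.05 level and report only the most impressive one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_trials, n_factors, n = 2000, 20, 30
hits = 0

for _ in range(n_trials):
    outcome = rng.normal(size=n)              # no factor has any real effect
    factors = rng.normal(size=(n_factors, n))
    # search all factors, keep only the most impressive-looking p-value
    pvals = [stats.pearsonr(f, outcome)[1] for f in factors]
    if min(pvals) < 0.05:
        hits += 1

# roughly 1 - 0.95**20 ≈ 0.64: the reported p-value bears no
# relation to the actual probability of erroneously finding an effect
print(f"P(some 'significant' factor | all nulls true) ≈ {hits / n_trials:.2f}")
```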
17.
Novel evidence (H first) is not sufficient
Stipulating H first still allows opportunities to
interpret the data in support of H
H: Unscrambling soap-related words makes
subjects less judgmental on ethical dilemmas
Rationale behind Registered Reports: report
planned analysis, stopping rule, variables,
predictions/hypotheses, rules for excluding data
18.
Surprise: It’s not necessary either!
We can (in some cases) reliably use the same data
both to arrive at and warrant claims H(x)
(Mayo 1991):
1. Test model assumptions: the same data can be
used to both arrive at and test an assumption (if
violated) (a significance test, graphical analysis)
Can still vouchsafe error probabilities
Pr(R(X) would output H(x0); H(x0) is false) = low
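A hedged sketch of point 1 (my example, not the talk's): the same data both fit a regression and probe its error assumptions, with checks whose error properties are known.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = np.linspace(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

# Use the data to fit the model...
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# ...and use the very same data to test the model's assumptions:
# lag-1 autocorrelation probes independence, Shapiro-Wilk probes
# normality of the errors; both have known error probabilities.
lag1 = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
_, p_normal = stats.shapiro(residuals)
print(f"lag-1 autocorrelation of residuals: {lag1:.2f}")
print(f"Shapiro-Wilk p-value for normality: {p_normal:.2f}")
```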
19.
Severity can be improved by searching
(in some cases):
Searching for a DNA match with a criminal’s DNA: The
probability is high that we would not obtain a match
with person i, if i were not the criminal;
So, finding the match is good evidence that i is the
criminal.
(wrong to say a frequentist would have to be penalized
for searching here)
Quite a lot of background knowledge, numerous
assumptions; not knowledge-free or model-free
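A toy calculation (all numbers hypothetical) of why searching a DNA database differs from hunting for significance: even after probing a million people, the probability of any erroneous match stays small, so a match survives a severe probe.

```python
# Hypothetical figures: random-match probability per innocent
# person, and the number of people searched.
p_random_match = 1e-9
n_searched = 1_000_000

# Probability that the search erroneously matches anyone at all:
p_any_false_match = 1 - (1 - p_random_match) ** n_searched
print(f"P(search matches some innocent person) ≈ {p_any_false_match:.4f}")
# ≈ 0.0010: finding a match remains strong evidence despite the search
```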
20.
How the severe tester fills out the (b)
requirement
Data x provide a good test of, or good evidence for
H if (and only if)
(a) x agree with or “fit” H
______________________________
(b) So good a fit would (very probably) not have
occurred were H false
(very probably, a worse fit would occur if H fails to
solve our problem)
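In the Pr notation of slide 18, requirement (b) might be written schematically (my paraphrase of the slide, not a verbatim formula):

```latex
\text{(b)}\quad
\Pr\big(\text{a fit with } H \text{ as good as } x_0 \,;\, H \text{ false}\big) = \text{low}
\;\Longleftrightarrow\;
\Pr\big(\text{a worse fit with } H \,;\, H \text{ false}\big) = \text{high}
```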
21.
My diagnosis (of why novelty is often thought
necessary):
• A slip from ‘rule R is bound to output an H(x) that
fits’, to ‘R is bound to output well-fitting hypotheses
unreliably’
22.
Birth of the Severity Criterion
• It is the severity, stringency, or probativeness of
the test—or lack of it—that determines if a
double-use of data is permissible—or so I argue.
• Must be able to correctly estimate the error
properties of the procedure (formal or informal)
• What’s the lesson for Data Science?
23.
• What’s problematic is not using x to both
construct (or discover) and appraise a claim;
what’s problematic is doing it in the same way
as if the claim were predesignated and no biases
introduced
• The use-constructed procedure has different
ways of being wrong; the ability to protect against
mistaken interpretations changes.
• Your methods work only to the extent that you
take account of this
24.
Separating kosher from non-kosher cases is not
so easy (the best source I’ve found is statistics)
• Use a method designed to work independently
of what’s unknown (testing assumptions)
• Know enough about the properties of the
method, or have a truly eliminative search for a known effect (DNA)
• Take account of the damage of data mining,
hunting for significance, tuning on the signal
• Able to learn what it would be like if you had
predesignated relevant factors
25.
Example. Can learn what it would be like
(approximately) to have separate data for training
and for testing.
“Cross-validation: a statistical method [to estimate]
a model's performance using a single data set, by
dividing the data into multiple segments, and
iteratively fitting the model to all but one segment
and then evaluating its performance on the
remaining segment.”
This is from the Omics Guidelines after the Anil
Potti controversy: Evolution of Translational Omics:
Lessons Learned and the Path Forward (2012, 17)
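A brief sketch of the quoted procedure, assuming scikit-learn's API and synthetic stand-in data (both my additions, not from the Omics guidelines):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 200 samples, 10 features, one informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

# 5-fold cross-validation: fit on four segments, evaluate on the
# held-out fifth, rotating through all five splits.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("held-out accuracy per fold:", np.round(scores, 2))
print(f"mean: {scores.mean():.2f}")
```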
26. Personalized medicine
• Over 100 patients signed up at Duke (2007-10) for a
custom-tailored cancer treatment
• based on Potti and Nevins’ prediction model
using data on associations of various tumors
and positive responses to chemotherapy
27.
“[Potti’s method is] fitting a statistical model to all
available study data then splitting the data into
subsets, labelling one of them a ‘training’ set, another
a ‘validation’ or ‘test’ set, and showing that the
statistical model works well for both sets.” (S.
McKinney, 2012)
“When we apply the same methods but maintain the
separation of training and test sets, …the results are
no better than those obtained with randomly selected
cell lines.” (Baggerly, Wang and Coombes, 2007, 1277)
“they reproduce our result when they use our method”
(Potti and Nevins, 2007, 1277).
28.
There was a whistle-blower!
“only those samples which fit the model best in
cross validation were included. Over half of the
original samples were removed…This was an
incredibly biased approach which does little more
than give the appearance of a successful cross
validation.” (Brad Perez 2015)
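A toy reconstruction of the bias Perez describes (synthetic data, not Potti's actual code or data): drop the samples the model gets wrong under cross-validation, then "validate" on the survivors. Even on pure noise, the filtered figure lands well above the honest one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))       # pure-noise "expression" features
y = rng.integers(0, 2, size=200)     # labels carry no signal at all

model = LogisticRegression(max_iter=1000)
honest = cross_val_score(model, X, y, cv=5).mean()

# The biased step: keep only samples already classified correctly
# in cross-validation, then report cross-validation on the rest.
keep = cross_val_predict(model, X, y, cv=5) == y
biased = cross_val_score(model, X[keep], y[keep], cv=5).mean()

print(f"honest CV accuracy: {honest:.2f}")       # near chance, ~0.5
print(f"kept {keep.sum()} of {len(y)} samples")
print(f"'validated' CV accuracy: {biased:.2f}")  # inflated well above chance
```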
29.
Not just an isolated case, though it’s unique in
having clinical trials underway
“Candidate omics-based tests should be confirmed
using an independent set of samples not used in
the generation of the computational model, … the
specimens …will have been collected at a different
point in time, … processed in a different
laboratory” (Omics, 36)
30.
All the microarray data pre-2010 is unreliable due
to spurious associations from “batch effects”
• Minute differences in processing can easily
swamp the difference of interest.
• By randomly assigning the order of cases and
controls, the spurious associations vanish!
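A small simulation of the batch-effect point (all numbers invented): when batch perfectly tracks case/control status, a modest processing shift yields a "significant" group difference; randomizing the run order makes it vanish.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
group = np.array([1] * 50 + [0] * 50)   # 50 cases, 50 controls

# Confounded design: all cases processed in batch 1, controls in batch 2.
batch = group.copy()
expression = rng.normal(size=100) + 0.8 * batch   # 0.8 = batch shift only
t_conf = stats.ttest_ind(expression[group == 1], expression[group == 0])
print(f"confounded p-value:  {t_conf.pvalue:.2g}")   # spuriously tiny

# Randomized design: run order no longer tracks group status.
batch_rand = rng.permutation(batch)
expression_rand = rng.normal(size=100) + 0.8 * batch_rand
t_rand = stats.ttest_ind(expression_rand[group == 1],
                         expression_rand[group == 0])
print(f"randomized p-value: {t_rand.pvalue:.2g}")    # no spurious effect
```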
31.
Post Mortems
“To call in the statistician after the
experiment is done may be no more than
asking him to perform a postmortem
examination:… to say what the experiment
died of.” (Fisher 1938, p. 14).
32. CONCLUDING REMARKS
Can you use the severe testing approach to
evaluate the credentials of data intensive science?
• Instead of rules, it focuses on the ability to sustain
an argument: you have evidence for a claim to the
extent that your procedure ruled out flaws in moving
from x to H
• There are reliable procedures that both construct
and test H(x) with x
• Special precautions, modes of data analysis,
experimental designs are needed, depending on the
threat of error
33.
• The account is piecemeal: it allows saying we’ve
severely probed some aspect, restricted to
samples generated and processed in a given way
• It requires indicating what was poorly probed—
mistaken interpretations of data that would not
have been uncovered
34.
• It may be that a data-driven method is reliable so
long as the data used to construct the hypothesis
don’t differ much from the context of application
• There, black boxes are fine–you still have to
check IID (even with non-parametrics)
• Fine for predicting what book I’ll buy given I
bought books on Popper and poetry
• That it’s restricted doesn’t mean it’s not important
35.
• It’s not all of science
• The growth of knowledge comes from being
skeptical that future cases will be like past cases,
and from developing ways to probe for mistakes
• Theories with greater content (and more
assumptions!) can pass with higher severity!
• Errors ramify more quickly with full-bodied
interconnected checks and triangulation
36. • Experiment can live a life of its own, but theories
about mistaken interpretations of data are central
• Need to build repertoires of mistaken extrapolations
of data
• Need to know the “why” underlying successful and
failed predictions to make progress
You need theories (about mistakes) to Learn
Without Theories
37. • Data-driven methods are developing; theories of
mistakes and fallacies regarding them are still futuristic
• An interdisciplinary task for scientists, statisticians,
data analysts, philosophers
38. References:
• Baggerly, K., Coombes, K. and Neeley, E. (2008). “Run Batch Effects Potentially
Compromise the Usefulness of Genomic Signatures for Ovarian Cancer”, JCO, March 1,
2008: 1186-1187.
• Baggerly, K. and Coombes, K. (2009). ‘Deriving Chemosensitivity from Cell Lines:
Forensic Bioinformatics and Reproducible Research in High-throughput Biology’,
Annals of Applied Statistics 3(4), 1309–34.
• Coombes, K., Wang, J. and Baggerly, K. (2007). “Microarrays: Retracing Steps”,
Nature Medicine 13(11): 1276-7.
• Fisher, R. A. (1938). “Presidential Address”, Sankhyā: The Indian Journal of Statistics 4
(1), 14–17.
• Hitchcock, C. and Sober, E. (2004). “Prediction Versus Accommodation and the Risk of
Overfitting”, The British Journal For the Philosophy of Science, 55: 1-34.
• Keynes, J. ([1921]1952). A Treatise on Probability. London: MacMillan and Co.,
Limited. Reprinted, New York: St. Martin‘s Press.
• Kuffner, T. and Young, G. (2017). “Philosophy of science, principled statistical inference,
and data science”, Working paper.
• Kuhn, T. (1962). The Structure of Scientific Revolutions. Chicago: University of Chicago
Press.
• Lambert, C. and Black, L. (2012). 'Learning From Our GWAS Mistakes: From
Experimental Design to Scientific Method', Biostatistics, 13(2), 195-203.
39.
• Mayo, D. G. (1991). "Novel Evidence and Severe Tests." Philosophy of Science 58:
523-552. (Reprinted in The Philosopher's Annual XIV(1991): 203-232.)
• Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge. The
University of Chicago Press (Series in Conceptual Foundations of Science).
• Mayo, D. G. (2008). “How to Discount Double-Counting when It Counts: Some
Clarifications,” British Journal of Philosophy of Science 59: 857-879.
• Mayo, D. G. (2010). "An Ad Hoc Save of a Theory of Adhocness? Exchanges with
John Worrall," in Error and Inference: Recent Exchanges on Experimental
Reasoning, Reliability and the Objectivity and Rationality of Science (D. Mayo and A.
Spanos eds.), Cambridge: Cambridge University Press: 155-169.
• Mayo, D. G. (2018). Statistical Inference as Severe Testing: How to Get Beyond the
Statistics Wars, Cambridge: Cambridge University Press.
• Mayo, D. G. and D.R. Cox (2006). “Frequentist Statistics as a Theory of Inductive
Inference,” in Optimality: The Second Erich L. Lehmann Symposium, (ed. J. Rojo),
Lecture Notes-Monograph Series, Institute of Mathematical Statistics (IMS) 49: 77-
97.
• Mayo, D. G. and A. Spanos (2010). “Introduction and Background,” in Error and
Inference: Recent Exchanges on Experimental Reasoning, Reliability and the
Objectivity and Rationality of Science, (D. Mayo and A. Spanos eds.), Cambridge
University Press: 1-27.
40.
• McKinney, S. (2010). “December 16, 2010 letter to IOM committee”, PAF Document 19.
(Accessible at errorstatistics.com, May 31, 2014:
https://errorstatistics.com/2014/05/31/what-have-we-learned-from-the-anil-potti-training-and-test-data-fireworks-part-1/)
• Micheel, C., Nass, S. and Omenn, G. (eds). (2012). Evolution of Translational
OMICS: Lessons Learned and the Path Forward. Washington D.C.: The National
Academies Press.
• Mill, J. S. (1888). A System of Logic, 8th edn., New York: Harper and Brothers
• Musgrave, A. (1974). “Logical versus Historical Theories of Confirmation”, The
British Journal for the Philosophy of Science 25(1), 1–23.
• Peirce, C. S. (1931–35). Collected Papers, Volumes 1–6. Hartshorne, C. and
Weiss, P. (eds.), Cambridge, MA: Harvard University Press.
• Perez, B. (2015). ‘Research Concerns, The Med. Student’s Memo’, Cancer Letter,
1/9/2015.
• Popper, K. (1959). The Logic of Scientific Discovery. New York: Basic Books.
Reprinted 2000 The Logic of Scientific Discovery. London, New York: Routledge.
• Potti, A., Dressman H. K., Bild, A., et al. (2006). ‘Genomic Signatures to Guide the
Use of Chemotherapeutics’, Nature Medicine 12(11), 1294–300.
• Potti, A. and Nevins, J. (2007). ‘Potti et al. Reply’, Nature Medicine 13(11), 1277–8.
41.
• Schnall, S., Benton, J. and Harvey, S. (2008). ‘With a Clean Conscience:
Cleanliness Reduces the Severity of Moral Judgments’, Psychological Science
19(12), 1219-22.
• Weinberg, S. (1992). Dreams of a Final Theory: A Scientist’s Search for the
Ultimate Laws of Nature, New York: Pantheon Books.
• Worrall, J. (1978). ‘Research Programmes, Empirical Support, and the Duhem
Problem: Replies to Criticism’, in Radnitzky, G. and Andersson, G. (eds.), Progress
and Rationality in Science, Dordrecht, The Netherlands: D. Reidel: 321–38.
• Worrall, J. (1989). ‘Fresnel, Poisson and the White Spot: The Role of Successful
Predictions in the Acceptance of Scientific Theories’, in Gooding, D., Pinch, T. and
Schaffer, S. (eds.), The Uses of Experiment: Studies in the Natural Sciences,
Cambridge: Cambridge University Press: 135–57.
Editor’s notes
A term that’s trickled down from philosophy.
We know what the second means.
Everyone agrees that you don’t have evidence for a claim if nothing or little has been done to rule out flaws in inferring it. In my language, we say the claim hasn’t passed a severe test. It could be a machine that does the test automatically.
How, from this perspective, might we understand the first claim: the great new shift afforded by data science?
It’s true we never had a logic of discovery, and some might see it as the holy grail. But I think there’s more to the “data science has radically changed science” mantra.
Potti’s response
You might say this was extreme but as the Omics book shows, the teams routinely fall into statistical problems of analysis and design.
Genomicist Christopher Lambert says that statisticians tell him they’re never called in beforehand, only to clean up afterwards.