3. THE CONTEXT OF DATA
ANALYSIS
Problem/Issue
Outcome
Form Objectives & Take actions
4. THE CONTEXT OF DATA
ANALYSIS
Problem/Issue
Outcome
Prepare data
Analysis
Report / Summary
Get data
5. • Which objectives are the right objectives?
• What is the right way to achieve them?
• Increased data set size and increased computational power
only address a small piece of the puzzle
EVEN BIGGER CONTEXT
6. • Question everything in the social world, and try to “get to the root of
the matter”
√
• How do social scientists explain social facts?
• With other social facts!
• Challenge: Everything in the social world is related to everything else
• The whole world is the original “big data” to sociologists
SOCIOLOGICAL VIEW
7. VOCABULARY TEST
a be dawned finger Michael on Symonds
adjudged beckoned day Hussey Mike overnight the
an before depended inevitable much Oz this
and being dreaded inside mustered perished Thus
Andrew But duo latter near Ponting to/too/two
as capacity edge leg new pyrotechnics unfortunate
Australia cheered either little ninth run wicket
bar Clarke every lunch of shown with
batsmen crowd final lustily off side wizard
(Any objection to any of these words?)
8. VOCABULARY TEST
dawned finger
adjudged beckoned day overnight
before depended inevitable much
being dreaded inside mustered perished
duo latter near
capacity edge leg new pyrotechnics unfortunate
cheered either little ninth run wicket
bar every lunch shown
batsmen crowd final lustily side wizard
Stop words and proper nouns removed...
Ready for a reading test?
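The filtering step between the two vocabulary slides can be sketched in Python. This is a minimal sketch, not the procedure used for the talk: the STOP_WORDS and PROPER_NOUNS sets are hand-built assumptions chosen to match the slide, and a real pipeline would more likely rely on a curated stop-word list and a named-entity tagger.

```python
# Vocabulary as shown on the first test slide, row by row.
VOCABULARY = """
a be dawned finger Michael on Symonds
adjudged beckoned day Hussey Mike overnight the
an before depended inevitable much Oz this
and being dreaded inside mustered perished Thus
Andrew But duo latter near Ponting to/too/two
as capacity edge leg new pyrotechnics unfortunate
Australia cheered either little ninth run wicket
bar Clarke every lunch of shown with
batsmen crowd final lustily off side wizard
""".split()

# Assumption: hand-built lists that reproduce the slide, not a standard resource.
STOP_WORDS = {
    "a", "an", "and", "as", "be", "but", "of", "off", "on",
    "the", "this", "thus", "to", "too", "two", "with",
}
PROPER_NOUNS = {
    "Andrew", "Australia", "Clarke", "Hussey", "Michael",
    "Mike", "Oz", "Ponting", "Symonds",
}


def is_stop_word(token):
    """True if every slash-separated variant (e.g. to/too/two) is a stop word."""
    return all(part.lower() in STOP_WORDS for part in token.split("/"))


def content_words(tokens):
    """Drop stop words and proper nouns, keeping the original order."""
    return [t for t in tokens if t not in PROPER_NOUNS and not is_stop_word(t)]


kept = content_words(VOCABULARY)
print(len(VOCABULARY), "words before,", len(kept), "after")
print(" ".join(kept))
```

Removing words this way is mechanical. Knowing what the remaining words mean in a cricket report is not, which is the point of the reading test that follows.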
9. READING TEST
Thus, as the final day dawned and a near capacity
crowd lustily cheered every run Australia mustered,
much depended on Ponting and the new wizard of
Oz, Mike Hussey, the two overnight batsmen. But this
duo perished either side of lunch--the latter a little
unfortunate to be adjudged leg-before--and with
Andrew Symonds, too, being shown the dreaded
finger off an inside edge, the inevitable beckoned, bar
the pyrotechnics of Michael Clarke and the ninth
wicket.
What Your Kindergartner Needs to Know
E.D. Hirsch, Jr. and John Holdren, eds. (2013)
10. WHAT IS A SKILL?
• Substantive knowledge required for reading
• Reading is not a skill
• Data analysis is not a skill
• Both require substantive understanding
• If there is such a thing as “data science” its practitioners must
combine skills and subject matter knowledge
11. • Hard Science
• Theories + Evidence = Laws
• Experiments
• Repeatability!
• Is anything we do remotely like that?
SCIENCE
12. • Essence of scientific approach is trying to find valid
generalizations
• “Theories” or “laws” relate causal conditions to outcomes
• Making predictions about things that are actually observable
is much more likely to result in valid generalizations.
• Placing p-values next to regressions does not make an
analysis scientific.
SCIENCE???
13. • Universal?
• Plenty of laws are not universal. (e.g. F = ma)
• Precise?
• The more accurately we measure, the more we discover
discrepancies.
• No exceptions?
• “Exceptions are just the least frequent alternative in a collection of
facts.” --M. Bunge
REQUIREMENTS FOR A
SCIENTIFIC LAW?
14. • No experiment or empirical result provides absolute
consistency (between inputs/outputs or causes and effects)
• The finer the measurement, the more inconsistencies you find
IN/CONSISTENCY
[Figure: a consistency scale running from High to Low, with the labels “More general laws, more widely applicable” and “Laws with more limited scope”]
• (But nobody cares about “laws” that have tiny, tiny scope)
15. • Lawfulness is not identical to universal applicability + 100%
consistency
• There can be all kinds of lawfulness, discoverable by
proceeding scientifically.
• Science advances in a community
• Discovery of lawfulness is not only the primary result of scientific research, it is a fundamental presupposition of scientific
endeavors.
SCIENCE: YES
16. • Counterfactual thinking
• Things could have turned out differently from [i.e. counter to] the actual results [i.e. fact]
• Find the conditions that make the difference in outcomes
• A pattern in data is not evidence if you search only for that pattern
without letting other possibilities into the picture
• Observational data complicates things
• Have to find a way to meet “all else equal” condition
• Reminder: predictions about actually observable phenomena >> p-values
PROCEEDING SCIENTIFICALLY
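One common way to approximate the “all else equal” condition with observational data is to compare outcomes within strata of a measured confounder. The sketch below uses entirely hypothetical counts (a campaign shown mostly to returning customers, who convert more often anyway); the segment names, counts, and function names are all assumptions for illustration. Stratifying only handles confounders you have measured.

```python
from collections import defaultdict

# Hypothetical counts: (segment, treated?, customers, conversions)
COUNTS = [
    ("returning", True, 800, 400),
    ("returning", False, 200, 110),
    ("new", True, 200, 20),
    ("new", False, 800, 120),
]


def naive_difference(counts):
    """Treated minus untreated conversion rate, ignoring segment."""
    totals = {True: [0, 0], False: [0, 0]}
    for _, treated, n, successes in counts:
        totals[treated][0] += n
        totals[treated][1] += successes
    rate = {t: s / n for t, (n, s) in totals.items()}
    return rate[True] - rate[False]


def stratified_difference(counts):
    """Average of the within-segment differences, weighted by segment size."""
    strata = defaultdict(dict)
    for segment, treated, n, successes in counts:
        strata[segment][treated] = (n, successes)
    total = sum(n for _, _, n, _ in counts)
    diff = 0.0
    for arms in strata.values():
        (n_t, s_t), (n_c, s_c) = arms[True], arms[False]
        weight = (n_t + n_c) / total
        diff += weight * (s_t / n_t - s_c / n_c)
    return diff


print(f"naive difference:      {naive_difference(COUNTS):+.2f}")
print(f"stratified difference: {stratified_difference(COUNTS):+.2f}")
```

With these made-up numbers the pooled comparison favors the campaign while the comparison within each segment goes the other way. The pattern you find depends on which alternatives you let into the picture.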
18. • Matching the level of precision to the scope of the generalizations
needed
• It helps (a lot) to develop agreement on
• What you need to know about to meet objectives
• Balance required
• Too limited scope doesn’t do much for the next project
• Too ambitious is too costly, takes too long, and still might not answer
your questions
ORGANIZATIONAL
IMPLICATIONS
19. “SHOW YOUR WORK”
• Good answers have credibility when the process that
generates them is clear
• Building valid generalizations is often incremental, not starting
over from scratch each time (in an interactive environment)
• Obstacles to modifying your analysis create intolerable friction
(mental and organizational)
• Spreadsheet jockeys and interactive statistics package users:
you are on notice
20. REPRODUCIBLE COMPUTING
ENHANCES COLLABORATION
• Thanks to developments like knitr, anybody can reproduce your analysis.
• Motivation for the programming steps becomes much clearer
• Combining it with distributed version control enables collaborative,
reproducible research.
• People with complementary skills can collaborate
• Most importantly, we are following Knuth’s dictum that we should
concentrate on explaining to other humans what we want the
computer to do
21. GOOD QUESTIONS
• Started with “Problem/Issue” ... formulate that in the form of a question
• “What is the relationship between this and that?”
• Treats the relationship itself as a hypothesis
• Good questions are posed at a useful level of abstraction
• “So what?” Good questions provide answers for the inevitable “so
what?”
• Derived from and add to/extend/challenge current thinking
22. APPROPRIATE DATA SET SIZE
• Adding/collecting lots of data just because you can is not a strategy
• You might find something surprising/cool
• You might waste your time looking at what you have instead of what you need to look at
• The correct balance depends on the problem
• Too small to resolve issues is a waste of money
• Unnecessarily large is also a waste
• Lots of variables won’t save you from endogeneity problems (when cause and
effect are unclear)
23. • Big data is
• Going to help me find a parking spot
• Going to help you offer me up just the right ad at just the
right time
BIG [DATA] CONCLUSION
• Computational complexity of dealing with big data is dropping
• Not going to change the logic of establishing lawfulness
24. SAMPLES VS. DATA SETS
• True samples are random and representative.
• Statistics from random samples have clear relationship to the whole population
• The rest (i.e. “what you have”) are just ad hoc collections of data of varying quality
• Selection bias exists and is a problem when who (or what) gets
into the data set is related to what is being measured by the data
set.
• Example: election polls that call only landline telephones (cellphone-only
voters tend to vote differently from landline users)
• Important to think very carefully about the data-generation mechanism.
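The landline example can be worked through with a few lines of arithmetic. The electorate below is hypothetical (group sizes and support shares are assumptions, not polling data), and the function names are illustrative. The sketch compares what a poll would find on average if it could only reach landline households with the true level of support, and shows that a larger sample shrinks the standard error but leaves the bias untouched.

```python
import math

# Hypothetical electorate: group -> (number of voters, share supporting candidate A)
ELECTORATE = {
    "landline": (6000, 0.55),
    "cell_only": (4000, 0.40),
}


def expected_support(electorate, reachable):
    """Expected poll result when only the `reachable` groups can be sampled."""
    voters = sum(electorate[g][0] for g in reachable)
    supporters = sum(electorate[g][0] * electorate[g][1] for g in reachable)
    return supporters / voters


def standard_error(p, n):
    """Standard error of a proportion p estimated from a simple random sample of n."""
    return math.sqrt(p * (1 - p) / n)


truth = expected_support(ELECTORATE, ["landline", "cell_only"])
landline_only = expected_support(ELECTORATE, ["landline"])
bias = landline_only - truth

print(f"true support: {truth:.3f}  landline-only estimate: {landline_only:.3f}")
for n in (1_000, 100_000):
    se = standard_error(landline_only, n)
    print(f"n={n:>7}: standard error {se:.4f}, bias {bias:+.3f}")
```

More calls to the same landline-only frame give a tighter interval around the wrong number, because who gets into the data set is related to what is being measured.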
25. IS YOUR DATA SUBJECT TO
SELECTION BIAS?
• Yes
• But you have to size the impact
• (Paul Rosenbaum’s Observational Studies offers an accessible framework to quantify how
vulnerable results are to unmeasured biases)
• Thousands of variables won’t help you figure out what is going on if...
• you are missing substantial chunks of the population of interest
• you are not measuring the right thing(s)
• Temptation to assume that lots of data (variables or observations) means lots of coverage and
therefore implies representative data
• Very poor assumption
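To give a flavor of “sizing the impact,” here is a minimal sketch in the spirit of a sensitivity analysis for matched pairs, using a simple sign test. It is a simplification, not Rosenbaum's full framework, and the pair counts are hypothetical. The parameter gamma is the assumed factor by which an unmeasured bias could multiply the odds that one unit in a pair, rather than the other, received the treatment; gamma = 1 means no hidden bias.

```python
from math import comb


def binomial_upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def worst_case_p_value(n_pairs, n_treated_higher, gamma):
    """Upper bound on the one-sided sign-test p-value under hidden bias gamma >= 1.

    n_pairs counts matched pairs without ties; n_treated_higher counts the pairs
    in which the treated unit had the higher outcome.
    """
    if gamma < 1:
        raise ValueError("gamma must be at least 1")
    p_plus = gamma / (1 + gamma)  # largest chance the treated unit "wins" by bias alone
    return binomial_upper_tail(n_pairs, n_treated_higher, p_plus)


# Hypothetical study: 10 matched pairs, treated unit higher in 9 of them.
for gamma in (1, 1.5, 2):
    print(f"gamma={gamma}: worst-case p-value {worst_case_p_value(10, 9, gamma):.3f}")
```

In this made-up case the result looks convincing with no hidden bias (p of about 0.011), but a bias that merely doubles the odds of treatment would be enough to push the worst-case p-value to about 0.10. That is what “vulnerable to unmeasured biases” looks like when quantified.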
26. BLACK BOX PREDICTIVE
SYSTEMS TO THE RESCUE?
• Can you build a machine-learning system that compensates
for the missing segments of the population?
• Assumes the relationship between the represented and
unrepresented population is stable.
• But we can’t measure everything, so what do we do?
27. THE NON-ROYAL ROAD
• Measure the right things to discover valid laws about the relationships we
are interested in
• It could take anywhere from tens to trillions of measurements
• Be cognizant of the valid scope of generalizations possible
• Selection bias, unreliable humans, etc.
• Knowing the weaknesses of a data set or a statistical method is up to us
• Combine: Subject matter knowledge and statistical understanding and data
manipulation
28. CAGE MATCH!
• Measure the “right” things
• Use the “right” data
Vs
• But this is the only data I have. I know it’s flawed, but it’s this or nothing.
• Consider the limitations of who is in the data, the validity of claiming that a
measurement is really the thing you want it to be, etc.
• Consider if there is any other way to see how far off you are on critical issues
because of these limits.
29. APPLICATION
• During the talk I gave a too-hasty example that a lot of people didn’t get.
• It showed the strength of the relationship between different items. (This sort of visualization makes
sense for trade-offs or flows as well.)
• I created the graph using the R package circlize, which implements Circos-style circular visualization in R.
• circlize
• Zuguang Gu (2014). circlize: Circular visualization in R. R package version 0.0.8.
• http://CRAN.R-project.org/package=circlize
• Circos
• http://circos.ca/
• Then I was trying to make a little joke about how I would show the “steps” to produce the plot, with
the idea that the listener might think they are about to see R code. What they actually saw was...
31. FINAL THOUGHTS
• Situational lawfulness
• Defining the situation in which we are trying to find lawfulness of sufficient generality contributes to
organizational harmony and success
• “Defining” is not a solo activity
• Black boxes can be predictive, but understanding relationships matters
• The right data is better than lots of data
• The right data depends on the questions, which depend on substantive understanding
• Science progresses in a community: Listen and engage actively and broadly
• Communicate to contribute
• There is no simple formula for “good questions,” only general guidelines about how to head in the right
direction