This document summarizes a social scientist's perspective on data science. It argues that data come from many sources and in many formats, so data scientists must know how to obtain data and work with different file types and APIs. It also notes that real data are often messy, with duplicates, missing values, and inconsistent formats, and that combining data from multiple sources requires tools such as UNIX commands, scripting languages, and databases. Although data munging takes 80% of the effort, the document suggests that hacking skills are straightforward to teach by borrowing from computer science curricula. It then covers exploring and modeling data with methods that scale and that match different data types such as text, geospatial, and web-scale data. The document advocates focusing on questions rather than technology.
Drew Conway: A Social Scientist's Perspective on Data Science
1. A social scientist's perspectives on data science
Drew Conway
NYC Data Science Meetup
March 5, 2013
http://www.flickr.com/photos/uiowa/8047195100/
2.
3. Hacking Skills
Obtain Munge
I hold the following truths to be self-evident...
1. Data come from many sources
2. Data come in many form(at)s
[Pie chart: 10% / 10% / 80%]
A .zip file of PDFs ≠ data
‣Data scientists must know where to get data and how to obtain it
‣Work with big text files
$ head publicvotes-20101018_votes.dump
‣Work with APIs
$ curl http://search.twitter.com/search.json?q=@drewconway > drewconway.json
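The API step above can also be scripted. The following is a minimal sketch, assuming the endpoint shown on the slide (which dates from 2013 and may no longer respond) and a hypothetical `results` field in the response; it only builds the request URL and parses a JSON payload, and leaves the actual network call out.

```python
import json
from urllib.parse import urlencode

# Endpoint copied from the slide; it dates from 2013 and may no longer respond.
SEARCH_ENDPOINT = "http://search.twitter.com/search.json"


def build_search_url(query, endpoint=SEARCH_ENDPOINT):
    """Return the endpoint with a URL-encoded `q` parameter."""
    return endpoint + "?" + urlencode({"q": query})


def parse_results(raw_json, key="results"):
    """Return the list stored under `key`, or [] if the key is absent.

    The field name `results` is an assumption made for illustration.
    """
    payload = json.loads(raw_json)
    return payload.get(key, [])


url = build_search_url("@drewconway")

# A made-up payload standing in for a saved response such as drewconway.json.
saved_response = '{"results": [{"text": "example tweet"}]}'
tweets = parse_results(saved_response)
```

Separating URL construction from parsing keeps both pieces usable on saved files, which matches the slide's pattern of writing the response to disk first.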
Real data are messy
‣Even curated data: duplicates, missing values, date formats
‣Combine data from multiple sources/formats
‣Tools
• *NIX tools: sed, awk, grep
• Scripting languages: Perl, Python and R
$ cat ufo_awesome.tsv | grep probe | wc -l
131
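The same kind of munging can be done in a scripting language. Below is a minimal Python sketch, assuming a small made-up tab-separated sample (not the real ufo_awesome.tsv), that mirrors the `grep | wc -l` count and also drops duplicate rows and rows with missing fields.

```python
import csv
import io


def count_matching_lines(lines, needle):
    """Count lines containing `needle`, like `grep needle | wc -l`."""
    return sum(1 for line in lines if needle in line)


def drop_duplicates_and_blanks(rows):
    """Keep the first occurrence of each row; skip rows with any empty field."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(field.strip() for field in row)
        if not key or any(field == "" for field in key) or key in seen:
            continue
        seen.add(key)
        cleaned.append(list(key))
    return cleaned


# Made-up sample: one duplicate row and one row with a missing location.
SAMPLE_TSV = (
    "19950101\tIowa\tsaw a probe\n"
    "19950101\tIowa\tsaw a probe\n"
    "19960202\t\tbright light\n"
    "19970303\tOhio\tprobe again\n"
)

rows = list(csv.reader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
cleaned = drop_duplicates_and_blanks(rows)
probe_count = count_matching_lines(SAMPLE_TSV.splitlines(), "probe")
```

Note that the count runs on the raw lines, so duplicates are counted, just as the shell pipeline would count them.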
4. Hacking Skills
While 80% of effort is spent here, it is perhaps the most straightforward to teach
Heavily tool-focused; borrow from CS/EE curricula
‣Comfort working at the command-line, with text editors
‣A language for every season!
Conveying findings in creative and compelling ways
5. Math & Stats Knowledge
If: Better data beats better math
Then: What methods should be taught?
How do you find structure in new data?
‣Scatter plots
‣Density plots
Data exploration that scales
‣Reduce dimensionality
‣PCA, SVD, MDS
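As an illustration of dimensionality reduction, here is a minimal sketch of PCA's first component in plain Python, using power iteration on the sample covariance matrix. The data points are made up, and power iteration is a choice made here for brevity; in practice a library SVD routine would be the usual tool.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (unit direction of greatest variance, mean-centered rows)."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeatedly apply cov and renormalize.
    # Assumes the start vector is not orthogonal to the leading component.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance along v; nothing to iterate on
            break
        v = [x / norm for x in w]
    return v, centered


def project(centered, direction):
    """Reduce each centered row to a single score along `direction`."""
    return [sum(c[j] * direction[j] for j in range(len(direction))) for c in centered]


# Made-up 2-D points lying on the line y = 2x.
POINTS = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, centered = first_principal_component(POINTS)
scores = project(centered, direction)
```

For these points the recovered direction is proportional to (1, 2), and the one-dimensional scores preserve the ordering of the original points.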
Methods must match data
‣Text
‣Geospatial
‣Web-scale
What is the "best" model?
‣Most predictive
‣Most parsimonious
Explore Model
6. Math & Stats Knowledge
Universities good at methods training...
...but what methods fit into Data Science?
Things data scientists like...
‣Illustrating the current state of the world
‣Predicting future observations
‣Classifying/ranking observations
Things social scientists like...
‣Testable theoretical models
‣Natural experiments
‣Causality
1. When applicable
2. Right tool / right job
3. Open black boxes
4. Learn limitations
7. Substantive Expertise
Data Science, as a discipline, is fundamentally about human behavior
Inquire Interpret
Focus on questions / not tech
‣What new questions can be asked from web-scale data?
‣Tools are a means to an end
Social science has questions
‣Markets
‣Organization
How do we know when the results we get make sense, if ever?
9. Median Voter Theorem
Theorem: In a majority-rule system, the preference of the median voter will succeed
http://thomasmoreinstitute.wordpress.com/2010/04/28/the-uk-election-and-the-curse-of-the-median-voter/
Assumption: The political/ideological preferences of voters can be projected onto a single numeric dimension
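The theorem and its one-dimensional assumption can be checked on a toy example. The sketch below assumes made-up voter ideal points, and assumes each voter supports whichever of two positions is closer to their own (ties abstain); it then verifies that the median position wins every pairwise majority vote against a grid of challengers.

```python
import statistics


def votes_for(a, b, ideal_points):
    """Count voters strictly closer to position `a` than to position `b`."""
    return sum(1 for x in ideal_points if abs(x - a) < abs(x - b))


def beats(a, b, ideal_points):
    """True if `a` gets strictly more votes than `b` in a pairwise vote."""
    return votes_for(a, b, ideal_points) > votes_for(b, a, ideal_points)


# Made-up ideal points on a single left-right dimension (odd number of voters).
IDEAL_POINTS = [-2.0, -1.0, 0.0, 3.0, 5.0]
median_position = statistics.median(IDEAL_POINTS)

# Challenger positions from -5 to 5 in steps of 0.5, excluding the median itself.
challengers = [x / 2 for x in range(-10, 11) if x != 0]
median_unbeaten = all(
    beats(median_position, c, IDEAL_POINTS) for c in challengers
)
```

The result depends on the single-dimension assumption; once preferences need two or more dimensions, a position that beats all others need not exist.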
13. One thing we have a lot of: text
Politicians
‣Speeches
‣Constituent communication
Parties
‣Platform / manifestos
‣Position statements
Countries
‣Diplomatic cables
‣Military declarations
Expert Coding!
14. How expert coding (typically) works
http://en.wikipedia.org/wiki/Official_Monster_Raving_Loony_Party
Expert Code Book
1. Health & Safety: We propose to ban Self Responsibility on the grounds that it may be dangerous to your health.
2. M.P's Expenses: We propose that instead of a second home allowance M.P's will have a caravan which will be parked outside the Houses of Parliament. This will make it easier as flipping a caravan is easier than flipping homes
3. Eurofit: The European Constitution which will be sorted out by going for a long Walk. “As everyone knows that walking is good for the constitution”
Manifesto
Party                 Year  Score
Monster Raving Loony  2010  -2
DATA!
16. Can we use non-experts to code political manifestos?
How can we measure the quality/validity of non-expert codings?
Use Mechanical Turk to code many manifesto fragments.
17. Experimental approach
Baseline data
Expert codings
Texts: 18 “big 3” British party manifestos 1987-2010
Experts: 5 advanced poli. sci. graduate students + 2 tenured faculty
Coding: deliberately simple schema
Experimental design
MT codings
Hypothesis: Stronger filter on Turkers leads to better coding
Filter: Use MT qualification test as gatekeeper
Three experiments
No Qualification: Anyone in
Low-Threshold: 4/6 Correct
High-Threshold: 5/6 Correct
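The gatekeeping logic of the three experiments can be sketched as a simple filter. The thresholds (anyone, 4 of 6, 5 of 6) come from the slide; the worker IDs, the gold answers, and the shape of the qualification test are made up for illustration, since the slides do not show the actual test.

```python
# Minimum number of correct answers (out of 6) needed in each experiment.
THRESHOLDS = {
    "No Qualification": 0,
    "Low-Threshold": 4,
    "High-Threshold": 5,
}


def qualification_score(answers, gold):
    """Count how many test answers match the gold-standard answers."""
    return sum(1 for a, g in zip(answers, gold) if a == g)


def admitted_workers(test_answers, gold, experiment):
    """Return the sorted IDs of workers who meet the experiment's threshold."""
    minimum = THRESHOLDS[experiment]
    return sorted(
        worker
        for worker, answers in test_answers.items()
        if qualification_score(answers, gold) >= minimum
    )


# Hypothetical six-question test and three hypothetical workers.
GOLD = ["economic", "social", "neither", "economic", "social", "neither"]
TEST_ANSWERS = {
    "worker_a": ["economic", "social", "neither", "economic", "social", "neither"],
    "worker_b": ["economic", "social", "economic", "economic", "social", "social"],
    "worker_c": ["social", "social", "economic", "social", "economic", "neither"],
}

admitted = {name: admitted_workers(TEST_ANSWERS, GOLD, name) for name in THRESHOLDS}
```

On Mechanical Turk itself the filter would be enforced by the platform's qualification mechanism rather than by client-side code like this.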
18. How do we think about coding a manifesto fragment?
20. How do we implement this (aka, the glue)?
Expert codings
[{ 'text_unit_id': ...,
   'sentence_text': ...,
   ....
 },
 ...
]
Random sample, as JSON
EC2
S3
MT
Dynamically generate HITs
MT codings
Push HITs + retrieve results
Statistical analysis of results
Scholarship, FTW!
https://github.com/drewconway/mturk_coder_quality
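The "random sample, as JSON" step of this pipeline might look like the sketch below. Only the two field names `text_unit_id` and `sentence_text` come from the slide; the placeholder sentences, the sample size, and the fixed seed are assumptions, and this is not taken from the linked repository.

```python
import json
import random


def sample_as_json(sentences, k, seed=0):
    """Draw `k` sentences without replacement and serialize them as JSON."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = rng.sample(sentences, k)
    return json.dumps(sample, indent=2)


# Placeholder records; the slide elides any other fields a record may carry.
SENTENCES = [
    {"text_unit_id": i, "sentence_text": f"Placeholder sentence {i}."}
    for i in range(1, 11)
]

payload = sample_as_json(SENTENCES, k=3)
decoded = json.loads(payload)
```

A file like this could then be uploaded to storage and read by whatever process generates the HITs.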
21. What's good about MT non-experts?
They're fast
They're biased?
They're cheap
They're wrong?
Fast: The last crowd-sourced coding job was for 600 sentences and got 4,300 sentences coded in about 20 hours (about 3.6 sentences per minute)
Cheap:
• We pay about $0.02 / sentence
• Typical manifesto (in British set) has 1,000 sentences
• Whole manifesto coded for $20
• By comparison, the CMP pays expert coders about €150 per manifesto, call it €.15 or $.20/sentence - 10x more per sentence
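The speed and cost figures above follow from simple arithmetic, worked through below. All inputs are the slide's own numbers; the $0.20 expert figure is read here as a per-sentence price, which is what €150 spread over 1,000 sentences implies.

```python
# Figures quoted on the slide.
MT_PRICE_PER_SENTENCE_USD = 0.02
SENTENCES_PER_MANIFESTO = 1000
EXPERT_FEE_PER_MANIFESTO_EUR = 150
EXPERT_PRICE_PER_SENTENCE_USD = 0.20  # the slide's rounded dollar figure
SENTENCES_CODED = 4300
HOURS_ELAPSED = 20

# Cost of one whole manifesto coded on Mechanical Turk.
mt_cost_per_manifesto = MT_PRICE_PER_SENTENCE_USD * SENTENCES_PER_MANIFESTO

# Expert cost per sentence, in euros.
expert_price_per_sentence_eur = EXPERT_FEE_PER_MANIFESTO_EUR / SENTENCES_PER_MANIFESTO

# How many times more an expert costs per sentence.
cost_ratio = EXPERT_PRICE_PER_SENTENCE_USD / MT_PRICE_PER_SENTENCE_USD

# Throughput of the crowd-sourced job.
sentences_per_minute = SENTENCES_CODED / (HOURS_ELAPSED * 60)

print(f"MT cost per manifesto: ${mt_cost_per_manifesto:.2f}")
print(f"Expert price per sentence: EUR {expert_price_per_sentence_eur:.2f}")
print(f"Expert / MT cost ratio: {cost_ratio:.0f}x")
print(f"Throughput: {sentences_per_minute:.1f} sentences per minute")
```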
22. Results of initial MT experiments
Agreement by experiment
                                                      Kappa Statistic
Experiment      Sentences  # MT Coders  % Agreement   k*    Std. Error  z
No Qual.        1,315      89           0.65          0.47  0.13        22.6
Low-Threshold   1,393      56           0.7           0.54  0.12        26.7
High-Threshold  1,250      23           0.62          0.41  0.13        18.3
* A k value between 0.4-0.6 is considered “moderate” agreement
Agreement by expert-coding
Experiment      Expert Coding  MT % Agreement
No Qual.        Economic       0.77
                Social         0.92
                Neither        0.22
Low-Threshold   Economic       0.87
                Social         0.98
                Neither        0.2
High-Threshold  Economic       0.77
                Social         0.91
                Neither        0.09
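For readers unfamiliar with the kappa statistic, the sketch below computes raw agreement and Cohen's kappa for two coders on made-up labels. The slides do not say which kappa variant was used in the experiments, so the two-coder form is an assumption chosen for simplicity; the three category labels come from the tables above.

```python
from collections import Counter


def percent_agreement(codes_a, codes_b):
    """Share of items on which the two coders assigned the same label."""
    return sum(1 for a, b in zip(codes_a, codes_b) if a == b) / len(codes_a)


def cohens_kappa(codes_a, codes_b):
    """Agreement corrected for the agreement expected by chance.

    kappa = (p_o - p_e) / (1 - p_e), where p_e is computed from each
    coder's marginal label frequencies. Undefined when p_e equals 1.
    """
    n = len(codes_a)
    p_o = percent_agreement(codes_a, codes_b)
    counts_a = Counter(codes_a)
    counts_b = Counter(codes_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Made-up codings of six sentences by two coders.
CODER_1 = ["economic", "economic", "social", "social", "neither", "neither"]
CODER_2 = ["economic", "economic", "social", "neither", "neither", "neither"]

agreement = percent_agreement(CODER_1, CODER_2)
kappa = cohens_kappa(CODER_1, CODER_2)
```

Here raw agreement is 5/6 while kappa is lower, which shows why the tables report both: kappa discounts the matches two coders would produce by chance alone.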
23. Separating Social and Economic Sentences
                                                 Kappa Statistic
Experiment  Sentences  # MT Coders  % Agreement  k*    Std. Error  z
Econ-only   942        15           0.62         0.23  0.1         4.28
Soc-only    955        32           0.6          0.17  0.09        0.95
* A k value between 0.4-0.6 is considered “moderate” agreement
Experiment     Expert Coding  MT % Agreement
Economic-only  Economic       0.92
               Neither        0.28
Social-only    Social         0.97
               Neither        0.19
Non-experts have a very hard time with a “null” coding!
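One plausible way to compute agreement broken down by expert coding is to group the MT codings by the expert's label and take the share that match within each group. The slides do not spell out the exact calculation, so this is an assumed reading, and the coded pairs below are made up.

```python
from collections import defaultdict


def agreement_by_expert_code(pairs):
    """Map each expert label to the share of MT codings that match it.

    `pairs` is an iterable of (expert_code, mt_code) tuples.
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for expert_code, mt_code in pairs:
        totals[expert_code] += 1
        if expert_code == mt_code:
            matches[expert_code] += 1
    return {code: matches[code] / totals[code] for code in totals}


# Made-up (expert, MT) pairs for an economic-only style task.
PAIRS = [
    ("economic", "economic"),
    ("economic", "economic"),
    ("economic", "neither"),
    ("economic", "economic"),
    ("neither", "economic"),
    ("neither", "neither"),
    ("neither", "economic"),
    ("neither", "economic"),
]

by_code = agreement_by_expert_code(PAIRS)
```

A breakdown like this makes a weak "null" category visible even when overall agreement looks acceptable.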
24. Joint work with...
Michael Laver, NYU
Kenneth Benoit, LSE
Slava Mikhaylov, UCL
Paper: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2260437
Presentation: http://bit.ly/nonexperts