Drew Conway: A Social Scientist's Perspective on Data Science

A social scientist's perspectives on data science
Drew Conway
NYC Data Science Meetup
March 5, 2013
http://www.flickr.com/photos/uiowa/8047195100/

Hacking Skills
Obtain / Munge
I hold the following truths to be self-evident...
1. Data come from many sources
2. Data come in many form(at)s
10% / 10% / 80%
A .zip file of PDFs ≠ data
‣Data scientists must know where to find data and how to obtain it
‣Work with big text files
$ head publicvotes-20101018_votes.dump
‣Work with APIs
$ curl http://search.twitter.com/search.json?q=@drewconway > drewconway.json
Real data are messy
‣Even curated data: duplicates, missing values, date formats
‣Combine data from multiple sources/formats
‣Tools
• *NIX tools: sed, awk, grep
• Scripting languages: Perl, Python and R
$ cat ufo_awesome.tsv | grep probe | wc -l
131
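A rough Python equivalent of the grep pipeline above, as a minimal sketch. The function name count_matching_lines is made up for illustration, and the demo writes a tiny stand-in file with invented rows rather than reading ufo_awesome.tsv, whose contents are not shown in these slides.

```python
import os
import tempfile


def count_matching_lines(path, needle):
    """Count lines containing `needle`, like `grep needle | wc -l`."""
    count = 0
    # Stream line by line so a large dump never has to fit in memory.
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if needle in line:  # case-sensitive, as plain grep is
                count += 1
    return count


# Hypothetical rows standing in for the real file.
demo_rows = [
    "1995\tIowa\tbright light, then a probe\n",
    "1998\tTexas\ttriangle shape\n",
    "2001\tOhio\tprobe hovering over field\n",
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".tsv", delete=False, encoding="utf-8"
) as tmp:
    tmp.writelines(demo_rows)
    demo_path = tmp.name

demo_count = count_matching_lines(demo_path, "probe")
os.remove(demo_path)
print(demo_count)
```

The shell one-liner is quicker for a first look; a script like this starts to pay off once the match condition needs more than a substring test.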

Hacking Skills
While 80% of effort is spent here, it is perhaps the most straightforward to teach
Heavily tool-focused; borrow from CS/EE curricula
‣Comfort working at the command line, with text editors
‣A language for every season!
Conveying findings in creative and compelling ways

Math & Stats Knowledge
If: Better data beats better math
Then: What methods should be taught?
How do you find structure in new data?
‣Scatter plots
‣Density plots
Data exploration that scales
‣Reduce dimensionality
‣PCA, SVD, MDS
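To make "reduce dimensionality" concrete, here is a minimal sketch of PCA's first component, found by power iteration on the sample covariance matrix. It is pure Python for illustration only; the function name and the toy data are assumptions, and real work would normally use a library SVD routine instead.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, variance along it) for the top component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: multiply by cov and renormalise until it settles.
    # Assumes the start vector is not orthogonal to the top eigenvector.
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    variance = sum(
        vec[i] * sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)
    )
    return vec, variance


# Toy data lying exactly on the line y = 2x, so one dimension suffices.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, variance = first_principal_component(points)
print(direction, variance)
```

Because the toy points are perfectly collinear, all of the variance falls along a single direction proportional to (1, 2), which is the kind of structure PCA is meant to surface in higher-dimensional data.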
Methods must match data
‣Text
‣Geospatial
‣Web-scale
What is the 'best' model?
‣Most predictive
‣Most parsimonious
Explore / Model

Math & Stats Knowledge
Universities are good at methods training...
...but which methods fit into data science?
Things data scientists like...
‣Illustrating the current state of the world
‣Predicting future observations
‣Classifying/ranking observations
Things social scientists like...
‣Testable theoretical models
‣Natural experiments
‣Causality
1. When applicable
2. Right tool / right job
3. Open black boxes
4. Learn limitations

Substantive Expertise
Data science, as a discipline, is fundamentally about human behavior
Inquire / Interpret
10% / 10% / 80%
Focus on questions, not tech
‣What new questions can be asked from web-scale data?
‣Tools are a means to an end
Social science has questions
‣Markets
‣Organization
How do we know when the results we get make sense, if ever?

http://www.flickr.com/photos/cawley/3242403224/
Case Study: Methods for Collecting Large-Scale Non-Expert Text Coding

Median Voter Theorem
Theorem: In a majority-rule system, the preference of the median voter will prevail
http://thomasmoreinstitute.wordpress.com/2010/04/28/the-uk-election-and-the-curse-of-the-median-voter/
Assumption: The political/ideological preferences of voters can be projected onto a single numeric dimension
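A small worked example of the theorem, as a sketch under stated assumptions: voters sit at made-up ideal points on one dimension, and each votes for whichever of two positions is closer (symmetric, single-peaked preferences). The helper names and the numbers are hypothetical.

```python
from statistics import median


def votes_for(position_a, position_b, ideal_points):
    """Count voters strictly closer to position_a than to position_b."""
    return sum(
        1 for x in ideal_points if abs(x - position_a) < abs(x - position_b)
    )


def never_loses(candidate, alternatives, ideal_points):
    """True if `candidate` is not beaten in any pairwise majority vote."""
    for alt in alternatives:
        if alt == candidate:
            continue
        challenger = votes_for(alt, candidate, ideal_points)
        incumbent = votes_for(candidate, alt, ideal_points)
        if challenger > incumbent:
            return False
    return True


# Hypothetical ideal points on a single left-right dimension.
voters = [-2.0, -1.0, 0.5, 1.0, 3.0]
median_position = median(voters)
alternatives = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]

median_wins = never_loses(median_position, alternatives, voters)
off_median_wins = never_loses(1.0, [median_position], voters)
print(median_position, median_wins, off_median_wins)
```

The median position survives every pairwise contest here, while a position to its right loses to it. The result depends entirely on the one-dimension assumption stated above.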

Median Voter Theorem
http://voteview.com/blog/?p=5
How do we calculate these numbers?

We make it up...
http://www.flickr.com/photos/estherlairlandesa/46495660
But, we have to!

http://en.wikipedia.org/wiki/File:Obama_Health_Care_Speech_to_Joint_Session_of_Congress.jpg
http://www.flickr.com/photos/becca02/6727193557/
A tale of two disciplines
Physics: Build instrument → Measure
Political Science: Observe action → Infer

One thing we have a lot of: text
Politicians
‣Speeches
‣Constituent communication
Parties
‣Platform / manifestos
‣Position statements
Countries
‣Diplomatic cables
‣Military declarations
Expert Coding!

How expert coding (typically) works
http://en.wikipedia.org/wiki/Official_Monster_Raving_Loony_Party
Expert
Code Book
Manifesto
1. Health & Safety: We propose to ban Self Responsibilty on the grounds that it may be dangerous to your health.
2. M.P's Expenses: We propose that instead of a second home allowance M.P's will have a caravan which will be parked outside the Houses of Parliament. This will make it easier as flipping a caravan is easier than flipping homes
3. Eurofit: The European Constitution which will be sorted out by going for a long Walk. “As everyone knows that walking is good for the constitution”
Party                 Year  Score
Monster Raving Loony  2010  -2
DATA!

What's wrong with experts?
They're slow
They're biased
They're expensive
They're wrong

Can we use non-experts to code political manifestos?
How can we measure the quality/validity of non-expert codings?
Use Mechanical Turk to code many manifesto fragments.

Experimental approach
Baseline data: expert codings
Texts: 18 “big 3” British party manifestos, 1987-2010
Experts: 5 advanced poli. sci. graduate students + 2 tenured faculty
Coding: deliberately simple schema
Experimental design: MT codings
Hypothesis: Stronger filter on Turkers leads to better coding
Filter: Use MT qualification test as gatekeeper
Three experiments
Experiment        Who may code
No Qualification  Anyone
Low-Threshold     4/6 correct
High-Threshold    5/6 correct
How do we think about coding a manifesto fragment?

Example text coding HIT from the experiment

How do we implement this (aka, the glue)?
Expert codings
Random sample, as JSON
[{'text_unit_id': ...,
  'sentence_text': ...,
  ...
 },
 ...
]
EC2 / S3: Dynamically generate HITs
MT: Push HITs + retrieve results
MT codings
Statistical analysis of results
Scholarship, FTW!
https://github.com/drewconway/mturk_coder_quality
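The "random sample, as JSON" step might look roughly like the sketch below. Only the keys text_unit_id and sentence_text come from the slide; the function name, the example sentences, and the seed parameter are assumptions, and this is not taken from the linked repository.

```python
import json
import random


def sample_text_units(units, k, seed=None):
    """Draw k text units without replacement and serialise them as JSON."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    sample = rng.sample(units, k)
    return json.dumps(sample, indent=2)


# Hypothetical expert-coded sentences; the real records carry more fields.
expert_units = [
    {"text_unit_id": 1, "sentence_text": "We will cut income tax."},
    {"text_unit_id": 2, "sentence_text": "We will reform marriage law."},
    {"text_unit_id": 3, "sentence_text": "Britain deserves better."},
    {"text_unit_id": 4, "sentence_text": "We will raise the minimum wage."},
    {"text_unit_id": 5, "sentence_text": "We will protect civil liberties."},
]

sample_json = sample_text_units(expert_units, k=3, seed=2013)
print(sample_json)
```

The JSON payload is what a HIT generator would consume downstream, so keeping it flat and self-describing makes the glue code simpler.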

What's good about MT non-experts?
They're fast
They're biased?
They're cheap
They're wrong?
The last crowd-sourced coding job, for 600 sentences, returned 4,300 sentence codings in about 20 hours (about 3.6 per minute)
• We pay about $0.02 / sentence
• Typical manifesto (in British set) has 1,000 sentences
• Whole manifesto coded for $20
• By comparison, the CMP pays expert coders about €150 per manifesto, call it €0.15 or $0.20 per sentence: 10x more per sentence
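The arithmetic behind these figures, written out as a short sketch. All inputs are the numbers quoted on the slide; the $0.20 figure is the slide's own rounded euro-to-dollar conversion, not a computed exchange rate.

```python
# Figures quoted on the slide.
MT_PRICE_PER_SENTENCE_USD = 0.02
SENTENCES_PER_MANIFESTO = 1000
CMP_PRICE_PER_MANIFESTO_EUR = 150
CMP_PRICE_PER_SENTENCE_USD = 0.20  # the slide's rounded conversion of EUR 0.15
CODINGS_RETURNED = 4300
HOURS_ELAPSED = 20

# Cost of coding one whole manifesto on Mechanical Turk.
mt_cost_per_manifesto = MT_PRICE_PER_SENTENCE_USD * SENTENCES_PER_MANIFESTO

# Expert cost per sentence, in euros.
cmp_cost_per_sentence_eur = CMP_PRICE_PER_MANIFESTO_EUR / SENTENCES_PER_MANIFESTO

# How many times more an expert-coded sentence costs.
cost_ratio = CMP_PRICE_PER_SENTENCE_USD / MT_PRICE_PER_SENTENCE_USD

# Throughput of the last crowd-sourced job.
codings_per_minute = CODINGS_RETURNED / (HOURS_ELAPSED * 60)

print(f"MT cost per manifesto: ${mt_cost_per_manifesto:.2f}")
print(f"CMP cost per sentence: EUR {cmp_cost_per_sentence_eur:.2f}")
print(f"Expert / MT cost ratio: {cost_ratio:.0f}x")
print(f"Throughput: {codings_per_minute:.1f} per minute")
```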

Results of initial MT experiments
Agreement by experiment (Kappa statistic)
Experiment      Sentences  # MT Coders  % Agreement  k*    Std. Error  z
No Qual.        1,315      89           0.65         0.47  0.13        22.6
Low-Threshold   1,393      56           0.7          0.54  0.12        26.7
High-Threshold  1,250      23           0.62         0.41  0.13        18.3
* A k value between 0.4-0.6 is considered “moderate” agreement
Agreement by expert coding
Experiment      Expert Coding  MT % Agreement
No Qual.        Economic       0.77
No Qual.        Social         0.92
No Qual.        Neither        0.22
Low-Threshold   Economic       0.87
Low-Threshold   Social         0.98
Low-Threshold   Neither        0.2
High-Threshold  Economic       0.77
High-Threshold  Social         0.91
High-Threshold  Neither        0.09
Separating Social and Economic Sentences
Results (Kappa statistic)
Experiment  Sentences  # MT Coders  % Agreement  k*    Std. Error  z
Econ-only   942        15           0.62         0.23  0.1         4.28
Soc-only    955        32           0.6          0.17  0.09        0.95
* A k value between 0.4-0.6 is considered “moderate” agreement
Experiment     Expert Coding  MT % Agreement
Economic-only  Economic       0.92
Economic-only  Neither        0.28
Social-only    Social         0.97
Social-only    Neither        0.19
Non-experts have a very hard time with a “null” coding!
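Breaking agreement down by the expert's code is what exposes the "null" coding problem. A minimal sketch of that breakdown follows; the function name and the example pairs are hypothetical and are not the study's data.

```python
from collections import defaultdict


def agreement_by_expert_code(pairs):
    """Map each expert code to the share of MT codes that match it.

    `pairs` is an iterable of (expert_code, mt_code) tuples.
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for expert_code, mt_code in pairs:
        totals[expert_code] += 1
        if mt_code == expert_code:
            matches[expert_code] += 1
    return {code: matches[code] / totals[code] for code in totals}


# Hypothetical (expert, Turker) codings for ten sentences.
codings = [
    ("economic", "economic"),
    ("economic", "economic"),
    ("economic", "economic"),
    ("economic", "neither"),
    ("social", "social"),
    ("social", "social"),
    ("neither", "economic"),
    ("neither", "social"),
    ("neither", "neither"),
    ("neither", "economic"),
]

breakdown = agreement_by_expert_code(codings)
print(breakdown)
```

An overall agreement figure would hide the pattern visible here: substantive categories match well, while sentences the expert left uncoded are frequently given a substantive label.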

Joint work with...
Michael Laver, NYU
Kenneth Benoit, LSE
Slava Mikhaylov, UCL
Paper: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2260437
Presentation: http://bit.ly/nonexperts

Coder performance stability
Panels: No Qualification, Low-threshold, High-threshold
Performance becomes very stable after approximately 20 HITs
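One way to check when a coder's performance settles is to track the running agreement rate over the HITs completed. The slides do not say how stability was measured, so the definition below (the running rate stays within a tolerance of its final value) is an assumption, as are the function names and the toy data.

```python
def cumulative_agreement(matches):
    """Running share of HITs that agree with the expert, in completion order."""
    running = []
    correct = 0
    for completed, matched in enumerate(matches, start=1):
        correct += bool(matched)
        running.append(correct / completed)
    return running


def stable_after(running, tolerance=0.05):
    """Number of HITs after which the running rate stays near its final value.

    Returns the position of the last HIT whose running rate is more than
    `tolerance` away from the final rate (0 if it never strays that far).
    """
    final = running[-1]
    last_unstable = 0
    for completed, rate in enumerate(running, start=1):
        if abs(rate - final) > tolerance:
            last_unstable = completed
    return last_unstable


# Hypothetical coder: True means the HIT matched the expert coding.
coder_matches = [True, False, True, True]
running_rate = cumulative_agreement(coder_matches)
settled_at = stable_after(running_rate, tolerance=0.1)
print(running_rate, settled_at)
```

With real data, the same check run per coder would show whether the "about 20 HITs" pattern holds across all three experiments.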
