Data, Responsibly:
The Next Decade of Data Science
Bill Howe, PhD
Associate Professor, Information School
Director, Cascadia Urban Analytics Cooperative
Adjunct Associate Professor, Computer Science & Engineering
University of Washington
My goals this afternoon…
• Describe “data science” from my perspective
• Describe some concerns that have recently emerged around the
irresponsible use of data science techniques and technologies
• Show off some of the work we’re doing to address it
DataLab
• Bill Howe: Databases, data management
• Jessica Hullman: Visualization, HCI
• Carole Palmer: Open data, digital curation
• Nic Weber: Open data, civic tech
• Jevin West: Science of science, bibliometrics… “calling bullshit”
• Emma Spiro: Social network analysis
The Fourth Paradigm
1. Empirical + experimental
2. Theoretical
3. Computational
4. Data-Intensive
Jim Gray
1/10/2018 Bill Howe, UW 4
Nearly every field of discovery is transitioning from
“data poor” to “data rich”
Astronomy: LSST
Physics: LHC
Oceanography: OOI
Social Sciences
Biology: Sequencing
Economics
Neuroscience: EEG, fMRI
My view:
Data science is about answering questions
using large, noisy, and heterogeneous
datasets, usually those that were
collected for some unrelated purpose
Question:
How early and accurately can we predict flu
outbreaks, so we can plan production levels
of flu vaccine?
Dataset:
Search histories of users
Question:
Do people that take paroxetine and
pravastatin together exhibit
hypoglycemia symptoms?
Dataset:
Search engine histories
Ryen W White, Nicholas P Tatonetti, Nigam H Shah, Russ B Altman, Eric Horvitz,
Web-scale pharmacovigilance: listening to signals from the crowd,
J Am Med Inform Assoc, March 2013, doi:10.1136/amiajnl-2012-001482
Open Sidewalks – Sidewalk maps for low-mobility citizens
Project Leads: Nick Bolten, Anat Caspi – Taskar Center, CSE
DSSG Fellows: Amir Amini, Yun Hao, Vaishnavi Ravichandran,
Andre Stephens
ALVA High School Students: Nick Krasnoselsky, Doris Layman
eScience Data Scientist Mentors: Anthony Arendt, Jake
Vanderplas
“30 million Americans over 15 years old experience limited mobility,
including difficulty walking, climbing stairs, using wheelchairs,
crutches, walkers,” while “24 million more persons experience
difficulty walking a quarter mile”
Picture: US Federal Highway Administration
http://www.fhwa.dot.gov/environment/bicycle_pedestrian/publications/sidewalk2/sidewalks204.cfm
Automated cleaning of sidewalk data through computational geometry
Powered by data from: SDOT/Socrata, Google API
Step                     Runtime   Solved (All)       Percent
Connecting T-Gaps        ~3.9s     3,837 (4,352)      88.2
Intersection Cleaning    ~23.6s    38,844 (44,700)    86.9
Polygon Cleaning         ~10min    7,283 (8,035)      90.6
Subgraphs                ~23.2s    39,913 (45,265)    88.1
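The slides name the "Connecting T-Gaps" step but do not show how it works. Here is a minimal sketch, assuming a T-gap is a sidewalk endpoint that stops just short of another sidewalk segment and that the fix is to snap the endpoint onto the nearest segment within a tolerance. The function names, the tolerance, and the coordinates are all hypothetical and are not taken from the Open Sidewalks code.

```python
import math

def closest_point_on_segment(p, a, b):
    """Return the point on segment a-b that is closest to p (2-D tuples)."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return a  # degenerate segment: both ends coincide
    # Project p onto the line, then clamp the parameter to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)

def connect_t_gap(endpoint, segments, tolerance):
    """Snap a dangling endpoint onto the nearest segment if the gap is
    within `tolerance`; otherwise return the endpoint unchanged."""
    best, best_dist = None, float("inf")
    for a, b in segments:
        q = closest_point_on_segment(endpoint, a, b)
        d = math.dist(endpoint, q)
        if d < best_dist:
            best, best_dist = q, d
    if best is not None and best_dist <= tolerance:
        return best
    return endpoint

# Hypothetical coordinates: one sidewalk running along the x-axis.
segments = [((0.0, 0.0), (10.0, 0.0))]
snapped = connect_t_gap((4.0, 0.5), segments, tolerance=1.0)
unchanged = connect_t_gap((4.0, 5.0), segments, tolerance=1.0)
```

A real pipeline would likely use a spatial index rather than a linear scan over all segments; the scan here just keeps the sketch short.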
Homeless families may take many pathways through programs
Emergency
shelter
Transitional
housing
Rapid
re-housing
Permanent
housing
Housing with
services
Unsuccessful exit
Preliminary results to understand potential predictors of
successful outcomes
Correlation with successful outcome,
by family characteristics
Correlation with successful outcome, by
homelessness program
Emergency Shelter use
tends to be associated with
unsuccessful outcomes
(unsurprising!)
Homelessness Prevention
programs more strongly
associated with positive
outcomes than
transitional housing
Substance abuse strongly
associated with
unsuccessful outcomes
Parent employment
strongest predictor of
successful outcomes
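As an illustration of what "correlation with successful outcome" can mean for yes/no family characteristics, here is a minimal sketch that computes the Pearson correlation between a binary characteristic and a binary outcome. The records are synthetic toy values invented for this example, not the study's data, and the field names are hypothetical. The slides do not say which correlation measure the project used.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when either variable is constant
    return cov / math.sqrt(var_x * var_y)

# Synthetic toy records (NOT real data); 1 = yes, 0 = no.
families = [
    {"parent_employed": 1, "substance_abuse": 0, "successful_exit": 1},
    {"parent_employed": 1, "substance_abuse": 0, "successful_exit": 1},
    {"parent_employed": 1, "substance_abuse": 1, "successful_exit": 1},
    {"parent_employed": 1, "substance_abuse": 1, "successful_exit": 0},
    {"parent_employed": 0, "substance_abuse": 1, "successful_exit": 0},
    {"parent_employed": 0, "substance_abuse": 1, "successful_exit": 0},
    {"parent_employed": 0, "substance_abuse": 0, "successful_exit": 1},
    {"parent_employed": 0, "substance_abuse": 0, "successful_exit": 0},
]

outcome = [f["successful_exit"] for f in families]
r_employed = pearson([f["parent_employed"] for f in families], outcome)
r_substance = pearson([f["substance_abuse"] for f in families], outcome)
```

A positive value means the characteristic co-occurs with successful exits in the sample and a negative value means the opposite. Neither says anything about cause.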
Common trajectories lead to different outcomes:
• a successful exit from an episode would mean that the family found a permanent housing
solution
• a proportion of these still receive government subsidies
• other exits are exits back into homelessness, or to other, unknown destinations
Analyzing Family Trajectories through Programs
Data: Pierce County
Emergency Shelter -> Rapid Re-housing
Emergency Shelter -> Transitional Housing
80% successful exits
Only 40% successful exits
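A trajectory comparison like the one above can be sketched as a group-by: key each family episode on its ordered sequence of programs, then compute the share of successful exits per sequence. The episode format and the program labels ("A", "B", "C") are placeholders chosen for this example, and the counts are made up. None of it comes from the Pierce County data.

```python
from collections import defaultdict

def success_rate_by_trajectory(episodes):
    """Map each program sequence to its fraction of successful exits.

    `episodes` is a list of (programs, successful) pairs, where `programs`
    is an ordered list of program names and `successful` is a bool.
    """
    counts = defaultdict(lambda: [0, 0])  # trajectory -> [successes, total]
    for programs, successful in episodes:
        key = " -> ".join(programs)
        counts[key][0] += int(successful)
        counts[key][1] += 1
    return {key: wins / total for key, (wins, total) in counts.items()}

# Made-up episodes with placeholder program names.
episodes = [
    (["A", "B"], True),
    (["A", "B"], True),
    (["A", "B"], True),
    (["A", "B"], False),
    (["A", "C"], True),
    (["A", "C"], False),
]
rates = success_rate_by_trajectory(episodes)
```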
Observation:
Epistemic issues are beginning to dominate
the data science discussion in every field
reproducibility, “algorithmic bias,” curation, discrimination,
accountability, transparency, provenance, explanations,
persuasion, privacy
Ex: Staples online pricing
Reasoning: Offer deals to people that live near competitors’ stores
Effect: lower prices offered to buyers who live in more affluent
neighborhoods
[Latanya Sweeney; CACM 2013]
Racially identifying names trigger
ads suggestive of an arrest record
slide adapted from Stoyanovich, Miklau
The Special Committee on Criminal Justice Reform's
hearing on reducing the pre-trial jail population.
Technical.ly, September 2016
Philadelphia is grappling with the prospect of a racist computer algorithm
Any background signal in the
data of institutional racism is
amplified by the algorithm
operationalized by the algorithm
legitimized by the algorithm
“Should I be afraid of risk assessment tools?”
“No, you gotta tell me a lot more about yourself.
At what age were you first arrested?
What is the date of your most recent crime?”
“And what’s the culture of policing in the
neighborhood in which I grew up in?”
First decade of Data Science research and practice:
What can we do with massive, noisy, heterogeneous datasets?
Next decade of Data Science research and practice:
What should we do with massive, noisy, heterogeneous datasets?
The way I think about this… (1)
The way I think about this… (2)
Decisions are based on two sources of information:
1. Past examples
e.g., “prior arrests tend to increase likelihood of future arrests”
2. Societal constraints
e.g., “we must avoid racial discrimination”
11/10/2016 Data, Responsibly / SciTech NW 16
We’ve become very good at automating the use of past examples
We’ve only just started to think about incorporating societal constraints
The way I think about this… (3)
How do we apply societal constraints to algorithmic
decision-making?
Option 1: Keep a human in the loop
Ex: EU General Data Protection Regulation requires that a
human be involved in legally binding algorithmic decision-making
Ex: Wisconsin Supreme Court says a human must review
algorithmic decisions made by recidivism models
Option 2: Build them into the algorithms themselves
I’ll talk about some approaches for this
The way I think about this… (4)
On transparency vs. accountability:
• For human decision-making, sometimes explanations are
required, improving transparency
– Supreme court decisions
– Employee reprimands/termination
• But when transparency is difficult, accountability takes over
– medical emergencies, business decisions
• As we shift decisions to algorithms, we lose both
transparency AND accountability
• “The buck stops where?”
So what can we do about it?
• Algorithms that balance predictive accuracy with fairness
• Increase data sharing, while protecting privacy
– Avoid the “tyranny of convenience”
• Ensure transparency in all methods, datasets
• Track known biases in how data was collected, so it can
be controlled in downstream analytics
• All of these approaches are being explored in the
research community.
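To make "balance predictive accuracy with fairness" concrete, here is a minimal sketch that scores a set of predictions on both axes. It assumes demographic parity (equal positive-prediction rates across groups) as the fairness notion. That is only one of several competing definitions and is not necessarily the one the talk has in mind. The function names, the penalty weight, and the toy labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def demographic_parity_gap(y_pred, groups):
    """Largest difference in positive-prediction rate between any two groups."""
    rates = {}
    for g in set(groups):
        preds = [p for p, member in zip(y_pred, groups) if member == g]
        rates[g] = sum(preds) / len(preds)
    return max(rates.values()) - min(rates.values())

def penalized_score(y_true, y_pred, groups, weight):
    """Accuracy minus a weighted fairness penalty (an assumed trade-off form)."""
    return accuracy(y_true, y_pred) - weight * demographic_parity_gap(y_pred, groups)

# Toy labels and predictions, invented for illustration only.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

acc = accuracy(y_true, y_pred)
gap = demographic_parity_gap(y_pred, groups)
score = penalized_score(y_true, y_pred, groups, weight=0.5)
```

The weight makes the trade-off explicit: at zero the score is plain accuracy, and larger values favor models whose positive rates differ less across groups.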
Recap
• There’s a sea change underway in how we will teach
and practice data science
• No longer only about what can be done, but about
what should be done
• This is not just a policy/behavior/culture issue – there
are technical problems to solve
• Prediction: If a company is not thinking about this
stuff, they will soon be facing retention and
compliance issues
– Witness how the privacy discussion evolved
Science is a complete mess
• Reproducibility
– Begley & Ellis, Nature 2012: 6 out of 53 cancer studies reproducible
– Only about half of 100 psychology studies had effect sizes that approximated
the original result (Science, 2015)
– Ioannidis 2005: Why Most Published Research Findings Are False
– Reinhart & Rogoff: global economic policy based on spreadsheet fuck ups
11/10/2016 Bill Howe, UW 33
Vision: Validate scientific claims automatically
– Check for manipulation (manipulated images, Benford’s Law)
– Extract claims from papers
– Check claims against the authors’ data
– Check claims against related data sets
– Automatic meta-analysis across the literature + public datasets
• First steps
– Automatic curation: Validate and attach metadata to public datasets
– Longitudinal analysis of the visual literature
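The Benford's Law check mentioned above can be sketched as a first-digit test: tally the leading digits of the reported numbers and compare them with the expected frequencies log10(1 + 1/d) using a chi-square statistic. This is a minimal illustration, not the project's tooling. The function names are hypothetical, and a high statistic is only a prompt for closer inspection, never proof of manipulation.

```python
import math
from collections import Counter

# Expected leading-digit frequencies under Benford's Law.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(x):
    """Return the first significant digit of a nonzero number."""
    if x == 0:
        raise ValueError("zero has no leading digit")
    # Scientific notation puts the first significant digit first.
    return int(f"{abs(x):.15e}"[0])

def benford_chi_square(values):
    """Chi-square statistic of observed leading digits against Benford's Law."""
    digits = [leading_digit(v) for v in values if v != 0]
    n = len(digits)
    observed = Counter(digits)
    return sum(
        (observed.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in BENFORD.items()
    )

# Powers of two are a classic Benford-like sequence; the integers
# 100..999 have uniformly distributed leading digits and are not.
benford_like = [2 ** k for k in range(100)]
uniform_like = list(range(100, 1000))

stat_benford_like = benford_chi_square(benford_like)
stat_uniform_like = benford_chi_square(uniform_like)
```

Larger values indicate a greater departure from the Benford distribution. The test only makes sense for quantities that span several orders of magnitude.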
Microarray samples submitted to the Gene Expression Omnibus
Curation is fast becoming the
bottleneck to data sharing
Maxim Gretchkin, Hoifung Poon
color = labels supplied as metadata
clusters = first two PCA dimensions of the gene expression data itself
Can we curate algorithmically?
Maxim Gretchkin, Hoifung Poon
The expression data
and the text labels
appear to disagree
Deep Curation
Maxim Gretchkin, Hoifung Poon
Distant supervision and co-learning between a text-based classifier and an
expression-based classifier: both models improve by training on each
other’s results.
Free-text classifier
Expression classifier
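The co-learning idea can be sketched with a toy co-training loop: two classifiers each see a different view of the same samples, and in each round each one labels the unlabeled sample it is most confident about, which the other then trains on. This is a sketch under strong assumptions. One-dimensional nearest-centroid models stand in for the actual free-text and expression classifiers, the seed labels stand in for distant supervision, and all names and numbers are invented.

```python
def fit_centroids(xs, ys):
    """Fit a 1-D nearest-centroid model: mean feature value per class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    """Return (label, confidence); confidence is the distance margin
    between the nearest and second-nearest centroid."""
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    best = ranked[0]
    if len(ranked) == 1:
        return best, 0.0
    margin = abs(x - centroids[ranked[1]]) - abs(x - centroids[best])
    return best, margin

def co_train(view_a, view_b, seed_labels, rounds):
    """Grow a shared label pool: each round, each view's model labels
    the unlabeled sample it is most confident about."""
    labels = dict(seed_labels)  # sample index -> class label
    for _ in range(rounds):
        unlabeled = [i for i in range(len(view_a)) if i not in labels]
        if not unlabeled:
            break
        for view in (view_a, view_b):
            idx = sorted(labels)
            model = fit_centroids([view[i] for i in idx],
                                  [labels[i] for i in idx])
            candidates = [i for i in unlabeled if i not in labels]
            if not candidates:
                break
            pick = max(candidates, key=lambda i: predict(model, view[i])[1])
            labels[pick] = predict(model, view[pick])[0]
    return labels

# Invented two-view data: view_a stands in for a text-derived score,
# view_b for an expression-derived score. Samples 2 and 5 are ambiguous
# in view_a but clear in view_b.
view_a = [0.0, 1.0, 4.0, 8.0, 6.0, 3.5]
view_b = [0.0, 2.0, 1.5, 8.0, 5.0, 8.0]
seed_labels = {0: 0, 3: 1}  # stand-in for distant supervision

final_labels = co_train(view_a, view_b, seed_labels, rounds=3)
```

In this toy run the samples that are ambiguous in one view get labeled through the other view, which is the intuition behind letting two models train on each other's output.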
Deep Curation:
Our stuff wins, with no training data
Maxim Gretchkin, Hoifung Poon
[Figure: comparison of the state of the art, our reimplementation of the state of the art, and our “dueling pianos” NN; x-axis: amount of training data used]
Editor's notes
4
And processing power, either as raw processor speed or via novel multi-core and many-core architectures, is also continuing to increase exponentially…
… but human cognitive capacity is remaining constant. How can computing technologies help scientists make sense out of these vast and complex data sets?
The challenges stem from the data being large, noisy, and heterogeneous more than from collecting the data in the first place.
Data scie
Google
So in part as an attempt to relate “eScience” and “data science,” and in part to make sure the idea of data science wasn’t completely taken over by the machine learning people, we ran a massively open online course last Spring called Introduction to Data Science
We taught Scalable Databases, MapReduce, Statistics, Machine Learning, Visualization
Following a 2014 report entitled “Big Data: Seizing Opportunities, Preserving Values”