Data, Responsibly:
The Next Decade of Data Science
Bill Howe, PhD
Associate Professor, Information School
Director, Cascadia Urban Analytics Cooperative
Adjunct Associate Professor, Computer Science & Engineering
University of Washington
My goals this afternoon…
• Describe “data science” from my perspective
• Describe some concerns that have recently emerged around the
irresponsible use of data science techniques and technologies
• Show off some of the work we’re doing to address it
DataLab
Bill Howe
Databases, data
management
Jessica Hullman
Visualization, HCI
Carole Palmer
Open data, digital
curation
Nic Weber
Open data, civic tech
Jevin West
Science of science,
bibliometrics
…“calling bullshit”
Emma Spiro
Social network
analysis
The Fourth Paradigm
1. Empirical + experimental
2. Theoretical
3. Computational
4. Data-Intensive
Jim Gray
1/10/2018 Bill Howe, UW 4
Nearly every field of discovery is transitioning from
“data poor” to “data rich”
Astronomy: LSST
Physics: LHC
Oceanography: OOI
Social Sciences
Biology: Sequencing
Economics
Neuroscience: EEG, fMRI
My view:
Data science is about answering questions
using large, noisy, and heterogeneous
datasets, usually those that were
collected for some unrelated purpose
Question:
How early and accurately can we predict flu
outbreaks, so we can plan production levels
of flu vaccine?
Dataset:
Search histories of users
source:
http://www.google.org/flutrends/us/#US
http://www.google.com/permissions/using-product-graphics.html
flu risk
“Scientific hindsight shows that
Google Flu Trends far overstated this
year's flu season….”
“Lots of media attention to this
year's flu season skewed Google's
search engine traffic.”
David Wagner, Atlantic Wire, Feb
13 2013
Question:
Do people that take paroxetine and
pravastatin together exhibit
hypoglycemia symptoms?
Dataset:
Search engine histories
Ryen W White, Nicholas P Tatonetti, Nigam H Shah, Russ B Altman, Eric Horvitz,
Web-scale pharmacovigilance: listening to signals from the crowd, J Am
Med Inform Assoc, March 2013, doi:10.1136/amiajnl-2012-001482
Open Sidewalks – Sidewalk maps for low-mobility citizens
Project Leads: Nick Bolten, Anat Caspi – Taskar Center, CSE
DSSG Fellows: Amir Amini, Yun Hao, Vaishnavi Ravichandran,
Andre Stephens
ALVA High School Students: Nick Krasnoselsky, Doris Layman
eScience Data Scientist Mentors: Anthony Arendt, Jake
Vanderplas
“30 million Americans over 15
years old experience limited mobility,
including difficulty walking, climbing stairs, using
wheelchairs, crutches, walkers,” while “24
million more persons experience
difficulty walking a quarter mile”
Picture: US Federal Highway Administration
http://www.fhwa.dot.gov/environment/bicycle_pedestrian/publications/sidewalk2/sidewalks204.cfm
Automated cleaning of sidewalk data through computational geometry
powered by data
from:
SDOT/Socrata
Google API
Step                    Runtime   Solved (All)      Percent
Connecting T-Gaps       ~3.9s     3,837 (4,352)     88.2
Intersection Cleaning   ~23.6s    38,844 (44,700)   86.9
Polygon Cleaning        ~10min    7,283 (8,035)     90.6
Subgraphs               ~23.2s    39,913 (45,265)   88.1
Homeless families may take many pathways through programs
Emergency
shelter
Transitional
housing
Rapid
re-housing
Permanent
housing
Housing with
services
Unsuccessful exit
Develop visualizations to show how homeless families move
through programs
Preliminary results to understand potential predictors of
successful outcomes
Correlation with successful outcome,
by family characteristics
Correlation with successful outcome, by
homelessness program
Emergency Shelter use
tends to be associated with
unsuccessful outcomes
(unsurprising!)
Homelessness Prevention
programs more strongly
associated with positive
outcomes than
transitional housing
Substance abuse strongly
associated with
unsuccessful outcomes
Parent employment
strongest predictor of
successful outcomes
Common trajectories lead to different outcomes:
• a successful exit from an episode would mean that the family found a permanent housing
solution
• a proportion of these still receive government subsidies
• other exits are exits back into homelessness, or to other, unknown destinations
Analyzing Family Trajectories through Programs
Data: Pierce County
Emergency Shelter -> Rapid Re-housing
Emergency Shelter -> Transitional Housing
80% successful exits
Only 40% successful exits
ORCA Percentage Difference in Ridership, Seattle
Mark Hallenbeck, TRAC
Metro Boardings By Type of Rider
Passenger Type   Redmond (boardings)   Tukwila (boardings)   Redmond (%)   Tukwila (%)
Adult            317181                72202                 91%           67%
Youth            12818                 7433                  4%            7%
Senior           5425                  4577                  2%            4%
Disabled         7722                  10449                 2%            10%
Low Income       6912                  12438                 2%            12%
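The percentage columns in the boardings table appear to be each rider type's share of all boardings at that location. A minimal Python sketch, assuming that reading, recomputes the shares from the raw counts. The names `boardings` and `shares` are illustrative and not from the original slides.

```python
# Raw boarding counts from the slide: rider type -> (Redmond, Tukwila).
boardings = {
    "Adult": (317181, 72202),
    "Youth": (12818, 7433),
    "Senior": (5425, 4577),
    "Disabled": (7722, 10449),
    "Low Income": (6912, 12438),
}


def shares(counts):
    """Return each count as a whole-number percentage of the column total."""
    total = sum(counts)
    return [round(100 * c / total) for c in counts]


redmond_shares = shares([r for r, _ in boardings.values()])
tukwila_shares = shares([t for _, t in boardings.values()])

for rider, r_pct, t_pct in zip(boardings, redmond_shares, tukwila_shares):
    print(f"{rider:<12}{r_pct:>4}%{t_pct:>6}%")
```

Under that assumption the recomputed shares match the slide: adults dominate boardings in Redmond, while disabled and low-income riders make up a much larger share in Tukwila.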
Session 2
Summer 2014
121,215 students
Session 1
Spring 2013
119,504 students
Cathy O’Neil
September 2016
Three properties of a WMD:
Opacity
Scale
Damage
July 2016
“Data, Responsibly”
Dagstuhl Workshop
Gerhard Weikum
Serge Abiteboul
Julia Stoyanovich
Gerome Miklau
Observation:
Epistemic issues are beginning to dominate
the data science discussion in every field
reproducibility, “algorithmic bias,” curation, discrimination,
accountability, transparency, provenance, explanations,
persuasion, privacy
Ex: Staples online pricing
Reasoning: Offer deals to people that live near competitors’ stores
Effect: lower prices offered to buyers who live in more affluent
neighborhoods
[Latanya Sweeney; CACM 2013]
Racially identifying names trigger
ads suggestive of an arrest record
slide adapted from Stoyanovich, Miklau
Amazon Prime Now Delivery Area: Atlanta Bloomberg, 2016
Amazon Prime Now Delivery Area: Chicago Bloomberg, 2016
Amazon Prime Now Delivery Area: Boston Bloomberg, 2016
ProPublica, May 2016
The Special Committee on Criminal Justice Reform's
hearing on reducing the pre-trial jail population.
Technical.ly, September 2016
Philadelphia is grappling with the prospect of a racist computer algorithm
Any background signal in the
data of institutional racism is
amplified by the algorithm
operationalized by the algorithm
legitimized by the algorithm
“Should I be afraid of risk assessment tools?”
“No, you gotta tell me a lot more about yourself.
At what age were you first arrested?
What is the date of your most recent crime?”
“And what’s the culture of policing in the
neighborhood in which I grew up?”
First decade of Data Science research and practice:
What can we do with massive, noisy, heterogeneous datasets?
Next decade of Data Science research and practice:
What should we do with massive, noisy, heterogeneous datasets?
The way I think about this… (1)
The way I think about this… (2)
Decisions are based on two sources of information:
1. Past examples
e.g., “prior arrests tend to increase likelihood of future arrests”
2. Societal constraints
e.g., “we must avoid racial discrimination”
11/10/2016 Data, Responsibly / SciTech NW 16
We’ve become very good at automating the use of past examples
We’ve only just started to think about incorporating societal constraints
The way I think about this… (3)
How do we apply societal constraints to algorithmic
decision-making?
Option 1: Keep a human in the loop
Ex: EU General Data Protection Regulation requires that a
human be involved in legally binding algorithmic decision-making
Ex: Wisconsin Supreme Court says a human must review
algorithmic decisions made by recidivism models
Option 2: Build them into the algorithms themselves
I’ll talk about some approaches for this
The way I think about this… (4)
On transparency vs. accountability:
• For human decision-making, sometimes explanations are
required, improving transparency
– Supreme Court decisions
– Employee reprimands/termination
• But when transparency is difficult, accountability takes over
– medical emergencies, business decisions
• As we shift decisions to algorithms, we lose both
transparency AND accountability
• “The buck stops where?”
So what can we do about it?
• Algorithms that balance predictive accuracy with fairness
• Increase data sharing, while protecting privacy
– Avoid the “tyranny of convenience”
• Ensure transparency in all methods, datasets
• Track known biases in how data was collected, so they can
be controlled for in downstream analytics
• All of these approaches are being explored in the
research community.
Recap
• There’s a sea change underway in how we will teach
and practice data science
• No longer only about what can be done, but about
what should be done
• This is not just a policy/behavior/culture issue – there
are technical problems to solve
• Prediction: If a company is not thinking about this
stuff, they will soon be facing retention and
compliance issues
– Witness how the privacy discussion evolved
REPRODUCIBILITY
11/10/2016 Bill Howe, UW 32
Science is a complete mess
• Reproducibility
– Begley & Ellis, Nature 2012: 6 out of 53 cancer studies reproducible
– Only about half of 100 psychology studies had effect sizes that approximated
the original result (Science, 2015)
– Ioannidis 2005: Why most published research findings are false
– Reinhart & Rogoff: global economic policy based on spreadsheet fuck ups
Science, 2015
11/10/2016 Data, Responsibly @ Dagstuhl 35
Retractions are increasing…..
Why is this happening? (1)
Why is this happening? (2)
Publication Bias!
“DEEP CURATION”
TOWARDS AUTOMATIC SCIENTIFIC CLAIM CHECKING
Vision: Validate scientific claims automatically
– Check for manipulation (manipulated images, Benford’s Law)
– Extract claims from papers
– Check claims against the authors’ data
– Check claims against related data sets
– Automatic meta-analysis across the literature + public datasets
• First steps
– Automatic curation: Validate and attach metadata to public datasets
– Longitudinal analysis of the visual literature
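One of the manipulation checks named above, Benford's Law, can be sketched in a few lines. Benford's Law says that in many naturally occurring datasets the leading digit d appears with probability log10(1 + 1/d). The sketch below is an illustration only, not the method used in this work: it compares observed leading digits against that distribution with a chi-square statistic. The function names are hypothetical.

```python
import math
from collections import Counter


def benford_expected(d):
    """Probability of leading digit d (1-9) under Benford's Law."""
    return math.log10(1 + 1 / d)


def leading_digit(x):
    """First significant digit of a nonzero number."""
    # Scientific notation puts the first significant digit first.
    return int(f"{abs(x):.15e}"[0])


def benford_chi_square(values):
    """Chi-square statistic of observed leading digits vs. Benford's Law."""
    digits = [leading_digit(v) for v in values if v != 0]
    n = len(digits)
    counts = Counter(digits)
    chi2 = 0.0
    for d in range(1, 10):
        expected = n * benford_expected(d)
        chi2 += (counts.get(d, 0) - expected) ** 2 / expected
    return chi2


# Powers of 2 are known to follow Benford's Law closely.
natural = [2 ** k for k in range(100)]
# Leading digits spread evenly over 1-9 do not.
suspicious = list(range(1, 10)) * 20

print(benford_chi_square(natural))
print(benford_chi_square(suspicious))
```

With 8 degrees of freedom, a statistic above roughly 15.5 rejects conformity at the 5% level. A large value is only a signal that reported numbers deserve a closer look, not proof of manipulation.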
Microarray experiments
Microarray samples submitted to the Gene Expression Omnibus
Curation is fast becoming the
bottleneck to data sharing
Maxim Gretchkin, Hoifung Poon
No growth in number of
datasets used per paper!
Majority of samples are
one-time-use only!
color = labels supplied as metadata
clusters = 1st two PCA dimensions on the gene expression data itself
Can we curate algorithmically?
The expression data
and the text labels
appear to disagree
Better Tissue
Type Labels
Domain knowledge
(Ontology)
Expression data
Free-text Metadata
2 Deep Networks
text
expr
SVM
Deep Curation
Distant supervision and co-learning between a text-based
classifier and an expression-based classifier: both
models improve by training on each other’s results.
Free-text classifier
Expression classifier
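The co-learning idea can be illustrated with a toy co-training loop. This is a sketch under loud assumptions, not the Deep Curation implementation: the slides describe two deep networks, while the code below swaps in one-dimensional nearest-centroid classifiers, and every name and data value is hypothetical. Each example has a "text" view and an "expression" view. In each round, each classifier labels the unlabeled examples it is most confident about and hands those labels to the other classifier as training data.

```python
def centroid_fit(points):
    """Fit a 1-D nearest-centroid model: mean feature value per class."""
    means = {}
    for label in (0, 1):
        xs = [x for x, y in points if y == label]
        means[label] = sum(xs) / len(xs)
    return means


def centroid_predict(means, x):
    """Return (label, confidence); confidence is the distance margin."""
    d0, d1 = abs(x - means[0]), abs(x - means[1])
    return (0 if d0 <= d1 else 1), abs(d0 - d1)


def co_train(seed, unlabeled, rounds=5, per_round=2):
    """seed: [((text_x, expr_x), label)]; unlabeled: [(text_x, expr_x)]."""
    train_text = [(views[0], y) for views, y in seed]
    train_expr = [(views[1], y) for views, y in seed]
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        m_text = centroid_fit(train_text)
        m_expr = centroid_fit(train_expr)
        # The text model pseudo-labels its most confident examples
        # and adds them to the expression model's training set.
        by_text = sorted(
            pool, key=lambda v: -centroid_predict(m_text, v[0])[1]
        )[:per_round]
        for v in by_text:
            train_expr.append((v[1], centroid_predict(m_text, v[0])[0]))
        # The expression model does the same for the text model.
        by_expr = sorted(
            pool, key=lambda v: -centroid_predict(m_expr, v[1])[1]
        )[:per_round]
        for v in by_expr:
            train_text.append((v[0], centroid_predict(m_expr, v[1])[0]))
        used = set(by_text) | set(by_expr)
        pool = [v for v in pool if v not in used]
    return centroid_fit(train_text), centroid_fit(train_expr)


# Hypothetical data: two seed labels, four unlabeled two-view examples.
seed = [((0.0, 0.0), 0), ((1.0, 1.0), 1)]
unlabeled = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.1), (0.8, 0.9)]
text_model, expr_model = co_train(seed, unlabeled)
print(centroid_predict(text_model, 0.3)[0])
print(centroid_predict(expr_model, 0.7)[0])
```

The design point is that neither classifier needs hand-labeled training data beyond the seed. In the slides, the seed role is played by distant supervision from domain knowledge rather than by manual labels.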
Deep Curation:
Our stuff wins, with no training data
[Figure: comparison of the state of the art, our reimplementation of the state of the art, and our “dueling pianos” NN; x-axis: amount of training data used]

Contenu connexe

Tendances

Transitioning Education’s Knowledge Infrastructure ICLS 2018
Transitioning Education’s Knowledge Infrastructure ICLS 2018Transitioning Education’s Knowledge Infrastructure ICLS 2018
Transitioning Education’s Knowledge Infrastructure ICLS 2018
Simon Buckingham Shum
 
Ralph schroeder and eric meyer
Ralph schroeder and eric meyerRalph schroeder and eric meyer
Ralph schroeder and eric meyer
oiisdp
 
Mest3 Internet Lessons 1-3
Mest3 Internet Lessons 1-3Mest3 Internet Lessons 1-3
Mest3 Internet Lessons 1-3
Macguffin
 
Making Decisions in a World Awash in Data: We’re going to need a different bo...
Making Decisions in a World Awash in Data: We’re going to need a different bo...Making Decisions in a World Awash in Data: We’re going to need a different bo...
Making Decisions in a World Awash in Data: We’re going to need a different bo...
Micah Altman
 

Tendances (20)

Citizen Sensor Data Mining, Social Media Analytics and Applications
Citizen Sensor Data Mining, Social Media Analytics and ApplicationsCitizen Sensor Data Mining, Social Media Analytics and Applications
Citizen Sensor Data Mining, Social Media Analytics and Applications
 
Transitioning Education’s Knowledge Infrastructure ICLS 2018
Transitioning Education’s Knowledge Infrastructure ICLS 2018Transitioning Education’s Knowledge Infrastructure ICLS 2018
Transitioning Education’s Knowledge Infrastructure ICLS 2018
 
Ralph schroeder and eric meyer
Ralph schroeder and eric meyerRalph schroeder and eric meyer
Ralph schroeder and eric meyer
 
Future Flight Fridays: Public Trust in Future Flight
Future Flight Fridays: Public Trust in Future FlightFuture Flight Fridays: Public Trust in Future Flight
Future Flight Fridays: Public Trust in Future Flight
 
Social and Physical Sensing Enabled Decision Support for Disaster Management ...
Social and Physical Sensing Enabled Decision Support for Disaster Management ...Social and Physical Sensing Enabled Decision Support for Disaster Management ...
Social and Physical Sensing Enabled Decision Support for Disaster Management ...
 
Crowdsourcing Science
Crowdsourcing ScienceCrowdsourcing Science
Crowdsourcing Science
 
Science as an Open Enterprise – Geoffrey Boulton
Science as an Open Enterprise – Geoffrey BoultonScience as an Open Enterprise – Geoffrey Boulton
Science as an Open Enterprise – Geoffrey Boulton
 
Learning Analytics as Educational Knowledge Infrastructure
Learning Analytics as Educational Knowledge InfrastructureLearning Analytics as Educational Knowledge Infrastructure
Learning Analytics as Educational Knowledge Infrastructure
 
Citizen Science Phenotypes
Citizen Science PhenotypesCitizen Science Phenotypes
Citizen Science Phenotypes
 
Delphi2 results (Cycle 2) and towards Delphi3
Delphi2 results (Cycle 2) and towards Delphi3Delphi2 results (Cycle 2) and towards Delphi3
Delphi2 results (Cycle 2) and towards Delphi3
 
Mest3 Internet Lessons 1-3
Mest3 Internet Lessons 1-3Mest3 Internet Lessons 1-3
Mest3 Internet Lessons 1-3
 
MIT Program on Information Science Talk -- Julia Flanders on Jobs, Roles, Ski...
MIT Program on Information Science Talk -- Julia Flanders on Jobs, Roles, Ski...MIT Program on Information Science Talk -- Julia Flanders on Jobs, Roles, Ski...
MIT Program on Information Science Talk -- Julia Flanders on Jobs, Roles, Ski...
 
Web Observatories and e-Research
Web Observatories and e-ResearchWeb Observatories and e-Research
Web Observatories and e-Research
 
Making Decisions in a World Awash in Data: We’re going to need a different bo...
Making Decisions in a World Awash in Data: We’re going to need a different bo...Making Decisions in a World Awash in Data: We’re going to need a different bo...
Making Decisions in a World Awash in Data: We’re going to need a different bo...
 
AI & Work, with Transparency & the Crowd
AI & Work, with Transparency & the Crowd AI & Work, with Transparency & the Crowd
AI & Work, with Transparency & the Crowd
 
Information, Science, and Society
Information, Science, and SocietyInformation, Science, and Society
Information, Science, and Society
 
Little eScience
Little eScienceLittle eScience
Little eScience
 
Data Science and Urban Science @ UW
Data Science and Urban Science @ UWData Science and Urban Science @ UW
Data Science and Urban Science @ UW
 
Public data archiving: Who does? Who doesn't? What can we do about it?
Public data archiving: Who does?  Who doesn't?  What can we do about it?Public data archiving: Who does?  Who doesn't?  What can we do about it?
Public data archiving: Who does? Who doesn't? What can we do about it?
 
An initial exploration of Citizen Science
An initial exploration of Citizen ScienceAn initial exploration of Citizen Science
An initial exploration of Citizen Science
 

Similaire à Data Responsibly: The next decade of data science

Respond to these two classmates’ posts. 1. After reading thi.docx
Respond to these two classmates’ posts. 1. After reading thi.docxRespond to these two classmates’ posts. 1. After reading thi.docx
Respond to these two classmates’ posts. 1. After reading thi.docx
daynamckernon
 
Respond to at least two of your classmates’ posts. 1. After .docx
Respond to at least two of your classmates’ posts. 1. After .docxRespond to at least two of your classmates’ posts. 1. After .docx
Respond to at least two of your classmates’ posts. 1. After .docx
daynamckernon
 
After reading this journal article regarding ethics of interne.docx
After reading this journal article regarding ethics of interne.docxAfter reading this journal article regarding ethics of interne.docx
After reading this journal article regarding ethics of interne.docx
rosiecabaniss
 

Similaire à Data Responsibly: The next decade of data science (20)

Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data Science
 
Thoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State LegislatureThoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State Legislature
 
Science Data, Responsibly
Science Data, ResponsiblyScience Data, Responsibly
Science Data, Responsibly
 
CODATA International Training Workshop in Big Data for Science for Researcher...
CODATA International Training Workshop in Big Data for Science for Researcher...CODATA International Training Workshop in Big Data for Science for Researcher...
CODATA International Training Workshop in Big Data for Science for Researcher...
 
Open Data in a Big Data World: easy to say, but hard to do?
Open Data in a Big Data World: easy to say, but hard to do?Open Data in a Big Data World: easy to say, but hard to do?
Open Data in a Big Data World: easy to say, but hard to do?
 
Mind the Gap: Reflections on Data Policies and Practice
Mind the Gap: Reflections on Data Policies and PracticeMind the Gap: Reflections on Data Policies and Practice
Mind the Gap: Reflections on Data Policies and Practice
 
Biomedical Data Science: We Are Not Alone
Biomedical Data Science: We Are Not AloneBiomedical Data Science: We Are Not Alone
Biomedical Data Science: We Are Not Alone
 
Citizen Science overview for ASU HSD598 graduate course, "Citizen Science"
Citizen Science overview for ASU HSD598 graduate course, "Citizen Science"Citizen Science overview for ASU HSD598 graduate course, "Citizen Science"
Citizen Science overview for ASU HSD598 graduate course, "Citizen Science"
 
Univ of Miami CTSI: Citizen science seminar; Oct 2014
Univ of Miami CTSI: Citizen science seminar; Oct 2014Univ of Miami CTSI: Citizen science seminar; Oct 2014
Univ of Miami CTSI: Citizen science seminar; Oct 2014
 
Studying Cybercrime: Raising Awareness of Objectivity & Bias
Studying Cybercrime: Raising Awareness of Objectivity & BiasStudying Cybercrime: Raising Awareness of Objectivity & Bias
Studying Cybercrime: Raising Awareness of Objectivity & Bias
 
Respond to these two classmates’ posts. 1. After reading thi.docx
Respond to these two classmates’ posts. 1. After reading thi.docxRespond to these two classmates’ posts. 1. After reading thi.docx
Respond to these two classmates’ posts. 1. After reading thi.docx
 
Data Science definition
Data Science definitionData Science definition
Data Science definition
 
Let's talk about Data Science
Let's talk about Data ScienceLet's talk about Data Science
Let's talk about Data Science
 
Acting as Advocate? Seven steps for libraries in the data decade
Acting as Advocate? Seven steps for libraries in the data decadeActing as Advocate? Seven steps for libraries in the data decade
Acting as Advocate? Seven steps for libraries in the data decade
 
Computational Social Science:The Collaborative Futures of Big Data, Computer ...
Computational Social Science:The Collaborative Futures of Big Data, Computer ...Computational Social Science:The Collaborative Futures of Big Data, Computer ...
Computational Social Science:The Collaborative Futures of Big Data, Computer ...
 
Open Data in a Global Ecosystem
Open Data in a Global EcosystemOpen Data in a Global Ecosystem
Open Data in a Global Ecosystem
 
Respond to at least two of your classmates’ posts. 1. After .docx
Respond to at least two of your classmates’ posts. 1. After .docxRespond to at least two of your classmates’ posts. 1. After .docx
Respond to at least two of your classmates’ posts. 1. After .docx
 
A politics of counting - putting people back into big data
A politics of counting - putting people back into big dataA politics of counting - putting people back into big data
A politics of counting - putting people back into big data
 
After reading this journal article regarding ethics of interne.docx
After reading this journal article regarding ethics of interne.docxAfter reading this journal article regarding ethics of interne.docx
After reading this journal article regarding ethics of interne.docx
 
A brave new world: student surveillance in higher education
A brave new world: student surveillance in higher educationA brave new world: student surveillance in higher education
A brave new world: student surveillance in higher education
 

Plus de University of Washington

Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)
University of Washington
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data Interaction
University of Washington
 
Enabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShareEnabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShare
University of Washington
 

Plus de University of Washington (20)

Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore Environments
 
Big Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsBig Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD Models
 
Democratizing Data Science in the Cloud
Democratizing Data Science in the CloudDemocratizing Data Science in the Cloud
Democratizing Data Science in the Cloud
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data Interaction
 
The Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingThe Other HPC: High Productivity Computing
The Other HPC: High Productivity Computing
 
Urban Data Science at UW
Urban Data Science at UWUrban Data Science at UW
Urban Data Science at UW
 
Intro to Data Science Concepts
Intro to Data Science ConceptsIntro to Data Science Concepts
Intro to Data Science Concepts
 
Big Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DBig Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&D
 
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
 
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
 
XLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaXLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and Myria
 
Myria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsMyria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) Scientists
 
Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013
 
eResearch New Zealand Keynote
eResearch New Zealand KeynoteeResearch New Zealand Keynote
eResearch New Zealand Keynote
 
Data science curricula at UW
Data science curricula at UWData science curricula at UW
Data science curricula at UW
 
Enabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShareEnabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShare
 
Virtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible ResearchVirtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible Research
 
End-to-End eScience
End-to-End eScienceEnd-to-End eScience
End-to-End eScience
 
HaLoop: Efficient Iterative Processing on Large-Scale Clusters
HaLoop: Efficient Iterative Processing on Large-Scale ClustersHaLoop: Efficient Iterative Processing on Large-Scale Clusters
HaLoop: Efficient Iterative Processing on Large-Scale Clusters
 

Dernier

一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
pyhepag
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
DilipVasan
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
pyhepag
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
cyebo
 
一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理
cyebo
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
pyhepag
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理
pyhepag
 

Dernier (20)

How I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prisonHow I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prison
 
Slip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp ClaimsSlip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp Claims
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
 
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
 
Pre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxPre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptx
 
Easy and simple project file on mp online
Easy and simple project file on mp onlineEasy and simple project file on mp online
Easy and simple project file on mp online
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
 
Data analytics courses in Nepal Presentation
Data analytics courses in Nepal PresentationData analytics courses in Nepal Presentation
Data analytics courses in Nepal Presentation
 
一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
 
2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理
 
Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)
 
Artificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdfArtificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdf
 
basics of data science with application areas.pdf
basics of data science with application areas.pdfbasics of data science with application areas.pdf
basics of data science with application areas.pdf
 
AI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfAI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdf
 
Generative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdfGenerative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdf
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
 

Data Responsibly: The next decade of data science

  • 1. Data, Responsibly: The Next Decade of Data Science Bill Howe, PhD Associate Professor, Information School Director, Cascadia Urban Analytics Cooperative Adjunct Associate Professor, Computer Science & Engineering University of Washington
  • 2. My goals this afternoon… • Describe “data science” from my perspective • Describe some concerns that have recently emerged around the irresponsible use of data science techniques and technologies • Show off some of the work we’re doing to address it
  • 3. DataLab Bill Howe Databases, data management Jessica Hullman Visualization, HCI Carole Palmer Open data, digital curation Nic Weber Open data, civic tech Jevin West Science of science, bibliometrics …”calling bullshit” Emma Spiro Social network analysis
  • 4. The Fourth Paradigm 1. Empirical + experimental 2. Theoretical 3. Computational 4. Data-Intensive Jim Gray 1/10/2018 Bill Howe, UW 4
  • 5. Nearly every field of discovery is transitioning from “data poor” to “data rich” Astronomy: LSST Physics: LHC Oceanography: OOI Social Sciences Biology: Sequencing Economics Neuroscience: EEG, fMRI
  • 6. My view: Data science is about answering questions using large, noisy, and heterogeneous datasets, usually those that were collected for some unrelated purpose
  • 7. Question: How early and accurately can we predict flu outbreaks, so we can plan production levels of flu vaccine? Dataset: Search histories of users
  • 8. source: http://www.google.org/flutrends/us/#US http://www.google.com/permissions/using-product-graphics.html flu risk “Scientific hindsight shows that Google Flu Trends far overstated this year's flu season….” “Lots of media attention to this year's flu season skewed Google's search engine traffic.” David Wagner, Atlantic Wire, Feb 13 2013
  • 9. Question: Do people who take paroxetine and pravastatin together exhibit hyperglycemia symptoms? Dataset: Search engine histories
  • 10. Ryen W White, Nicholas P Tatonetti, Nigam H Shah, Russ B Altman, Eric Horvitz, Web-scale pharmacovigilance: listening to signals from the crowd, J Am Med Inform Assoc, March 2013, doi:10.1136/amiajnl-2012-001482
  • 11. Open Sidewalks – Sidewalk maps for low-mobility citizens. Project Leads: Nick Bolten, Anat Caspi – Taskar Center, CSE. DSSG Fellows: Amir Amini, Yun Hao, Vaishnavi Ravichandran, Andre Stephens. ALVA High School Students: Nick Krasnoselsky, Doris Layman. eScience Data Scientist Mentors: Anthony Arendt, Jake Vanderplas. “30 million Americans over 15 years old experience limited mobility, including difficulty walking, climbing stairs, using wheelchairs, crutches, walkers, while 24 million more persons experience difficulty walking a quarter mile” | Picture: US Federal Highway Administration http://www.fhwa.dot.gov/environment/bicycle_pedestrian/publications/sidewalk2/sidewalks204.cfm
  • 12. Automated cleaning of sidewalk data through computational geometry, powered by data from: SDOT/Socrata, Google API
      Step | Runtime | Solved (All) | Percent
      Connecting T-Gaps | ~3.9s | 3,837 (4,352) | 88.2
      Intersection Cleaning | ~23.6s | 38,844 (44,700) | 86.9
      Polygon Cleaning | ~10min | 7,283 (8,035) | 90.6
      Subgraphs | ~23.2s | 39,913 (45,265) | 88.1
  • 13. Homeless families may take many pathways through programs Emergency shelter Transitional housing Rapid re-housing Permanent housing Housing with services Unsuccessful exit
  • 14. Develop visualizations to show how homeless families move through programs
  • 15. Preliminary results to understand potential predictors of successful outcomes. Charts: correlation with successful outcome, by family characteristics; correlation with successful outcome, by homelessness program • Emergency Shelter use tends to be associated with unsuccessful outcomes (unsurprising!) • Homelessness Prevention programs more strongly associated with positive outcomes than transitional housing • Substance abuse strongly associated with unsuccessful outcomes • Parent employment strongest predictor of successful outcomes
  • 16. Common trajectories lead to different outcomes: • a successful exit from an episode would mean that the family found a permanent housing solution • a proportion of these still receive government subsidies • other exits are exits back into homelessness, or to other, unknown destinations Analyzing Family Trajectories through Programs Data: Pierce County Emergency Shelter -> Rapid Re-housing Emergency Shelter -> Transitional Housing 80% successful exits Only 40% successful exits
  • 17. ORCA Percentage Difference in Ridership, Seattle Mark Hallenbeck TRAC
  • 18. Metro Boardings By Type of Rider
      Passenger Type | Redmond (boardings) | Tukwila (boardings) | Redmond (share) | Tukwila (share)
      Adult | 317181 | 72202 | 91% | 67%
      Youth | 12818 | 7433 | 4% | 7%
      Senior | 5425 | 4577 | 2% | 4%
      Disabled | 7722 | 10449 | 2% | 10%
      Low Income | 6912 | 12438 | 2% | 12%
  • 20. Session 2 Summer 2014 121,215 students Session 1 Spring 2013 119,504 students
  • 22. Cathy O’Neil, September 2016. Three properties of a WMD: Opacity, Scale, Damage
  • 23. July 2016 “Data, Responsibly” Dagstuhl Workshop Gerhard Weikum Serge Abiteboul Julia Stoyanovich Gerome Miklau
  • 24. Observation: Epistemic issues are beginning to dominate the data science discussion in every field: reproducibility, “algorithmic bias,” curation, discrimination, accountability, transparency, provenance, explanations, persuasion, privacy
  • 25. Ex: Staples online pricing. Reasoning: Offer deals to people that live near competitors’ stores. Effect: lower prices offered to buyers who live in more affluent neighborhoods
  • 26. [Latanya Sweeney; CACM 2013] Racially identifying names trigger ads suggestive of an arrest record (slide adapted from Stoyanovich, Miklau)
  • 27. Amazon Prime Now Delivery Area: Atlanta (Bloomberg, 2016)
  • 28. Amazon Prime Now Delivery Area: Chicago (Bloomberg, 2016)
  • 29. Amazon Prime Now Delivery Area: Boston (Bloomberg, 2016)
  • 31. The Special Committee on Criminal Justice Reform's hearing on reducing the pre-trial jail population. Technical.ly, September 2016: Philadelphia is grappling with the prospect of a racist computer algorithm. Any background signal in the data of institutional racism is amplified by the algorithm, operationalized by the algorithm, legitimized by the algorithm. “Should I be afraid of risk assessment tools?” “No, you gotta tell me a lot more about yourself. At what age were you first arrested? What is the date of your most recent crime?” “And what’s the culture of policing in the neighborhood in which I grew up in?”
  • 32. The way I think about this… (1) First decade of Data Science research and practice: What can we do with massive, noisy, heterogeneous datasets? Next decade of Data Science research and practice: What should we do with massive, noisy, heterogeneous datasets?
  • 33. The way I think about this…. (2) Decisions are based on two sources of information: 1. Past examples e.g., “prior arrests tend to increase likelihood of future arrests” 2. Societal constraints e.g., “we must avoid racial discrimination” 11/10/2016 Data, Responsibly / SciTech NW 16 We’ve become very good at automating the use of past examples We’ve only just started to think about incorporating societal constraints
  • 34. The way I think about this… (3) How do we apply societal constraints to algorithmic decision-making? Option 1: Keep a human in the loop. Ex: EU General Data Protection Regulation requires that a human be involved in legally binding algorithmic decision-making. Ex: Wisconsin Supreme Court says a human must review algorithmic decisions made by recidivism models. Option 2: Build them into the algorithms themselves. I’ll talk about some approaches for this
  • 35. The way I think about this… (4) On transparency vs. accountability: • For human decision-making, sometimes explanations are required, improving transparency – Supreme court decisions – Employee reprimands/termination • But when transparency is difficult, accountability takes over – medical emergencies, business decisions • As we shift decisions to algorithms, we lose both transparency AND accountability • “The buck stops where?”
  • 36. So what can we do about it? • Algorithms that balance predictive accuracy with fairness • Increase data sharing, while protecting privacy – Avoid the “tyranny of convenience” • Ensure transparency in all methods, datasets • Track known biases in how data was collected, so it can be controlled in downstream analytics • All of these approaches are being explored in the research community.
  • 37. Recap • There’s a sea change underway in how we will teach and practice data science • No longer only about what can be done, but about what should be done • This is not just a policy/behavior/culture issue – there are technical problems to solve • Prediction: If a company is not thinking about this stuff, they will soon be facing retention and compliance issues – Witness how the privacy discussion evolved
  • 39. Science is a complete mess • Reproducibility – Begley & Ellis, Nature 2012: 6 out of 53 cancer studies reproducible – Only about half of 100 psychology studies had effect sizes that approximated the original result (Science, 2015) – Ioannidis 2005: Why Most Published Research Findings Are False – Reinhart & Rogoff: global economic policy based on spreadsheet fuck ups 11/10/2016 Bill Howe, UW 33
  • 41. 11/10/2016 Data, Responsibly @ Dagstuhl 35 Retractions are increasing…..
  • 43. Why is this happening? (1)
  • 44. Why is this happening? (2)
  • 45. Why is this happening? (2) Publication Bias!
  • 46. “DEEP CURATION” TOWARDS AUTOMATIC SCIENTIFIC CLAIM CHECKING
  • 47. Vision: Validate scientific claims automatically – Check for manipulation (manipulated images, Benford’s Law) – Extract claims from papers – Check claims against the authors’ data – Check claims against related data sets – Automatic meta-analysis across the literature + public datasets • First steps – Automatic curation: Validate and attach metadata to public datasets – Longitudinal analysis of the visual literature
  • 49. Microarray samples submitted to the Gene Expression Omnibus. Curation is fast becoming the bottleneck to data sharing. (Maxim Gretchkin, Poon Hoifung)
  • 50. No growth in number of datasets used per paper!
  • 51. Majority of samples are one-time-use only!
  • 52. Can we curate algorithmically? color = labels supplied as metadata; clusters = 1st two PCA dimensions on the gene expression data itself. The expression data and the text labels appear to disagree
  • 53. Better Tissue Type Labels. [Diagram labels: Domain knowledge (Ontology); Expression data; Free-text Metadata; 2 Deep Networks (text, expr); SVM]
  • 54. Deep Curation: Distant supervision and co-learning between a text-based classifier and an expression-based classifier: both models improve by training on each other’s results. [Diagram labels: Free-text classifier; Expression classifier]
  • 55. Deep Curation: Our stuff wins, with no training data. [Chart legend: state of the art; our reimplementation of the state of the art; our dueling pianos NN. Axis: amount of training data used]
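
The share columns on slide 18 (Metro boardings by type of rider) can be recomputed from the boarding counts. The sketch below is only an arithmetic check on the numbers shown on the slide; the dictionary layout and the `shares` helper are illustrative names, not part of the talk.

```python
# Boardings by rider type, copied from slide 18: (Redmond, Tukwila).
boardings = {
    "Adult": (317181, 72202),
    "Youth": (12818, 7433),
    "Senior": (5425, 4577),
    "Disabled": (7722, 10449),
    "Low Income": (6912, 12438),
}


def shares(column):
    """Percentage of total boardings per rider type (column 0 = Redmond, 1 = Tukwila)."""
    total = sum(counts[column] for counts in boardings.values())
    return {
        rider: round(100 * counts[column] / total)
        for rider, counts in boardings.items()
    }


redmond = shares(0)
tukwila = shares(1)

for rider in boardings:
    print(f"{rider:<11} Redmond {redmond[rider]:>3}%   Tukwila {tukwila[rider]:>3}%")
```

The recomputed percentages agree with the slide: Disabled and Low Income riders make up a much larger share of boardings in Tukwila than in Redmond.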
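
Slide 36 calls for algorithms that balance predictive accuracy with fairness, but does not name a fairness measure. As one possible illustration, the sketch below computes a demographic parity gap: the difference in favorable-decision rates between groups. The choice of metric, the function names, and the toy data are all assumptions made for this example.

```python
def selection_rate(predictions):
    """Fraction of favorable (1) decisions in a list of 0/1 predictions."""
    return sum(predictions) / len(predictions)


def demographic_parity_gap(predictions, groups):
    """Return (gap, per-group rates), where gap = max rate - min rate."""
    by_group = {}
    for pred, group in zip(predictions, groups):
        by_group.setdefault(group, []).append(pred)
    rates = {g: selection_rate(p) for g, p in by_group.items()}
    return max(rates.values()) - min(rates.values()), rates


# Hypothetical toy data: 1 = favorable decision, groups "a" and "b".
preds = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

gap, rates = demographic_parity_gap(preds, groups)
print(rates, gap)
```

A gap near zero means the groups receive favorable decisions at similar rates. A model could be tuned to trade a little accuracy for a smaller gap, which is one reading of "balance" on the slide.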
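
Slide 47 lists Benford's Law as one check for manipulation. Under Benford's Law the leading digit d of many naturally occurring numbers appears with probability log10(1 + 1/d). The sketch below compares observed leading digits against that distribution using a chi-square statistic. It is an illustration only, not the method used in the work described, and the 0.05 threshold (15.51 for 8 degrees of freedom) is a conventional choice assumed here.

```python
import math
from collections import Counter


def leading_digit(value):
    """First significant digit of a nonzero number, via scientific notation."""
    return int(f"{abs(value):e}"[0])


def benford_chi_square(values):
    """Chi-square statistic of observed leading digits against Benford's Law."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    n = len(digits)
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        stat += (counts.get(d, 0) - expected) ** 2 / expected
    return stat


# Powers of two are a standard example of Benford-like data.
benford_like = [2 ** k for k in range(1, 201)]
# Three-digit integers have uniformly distributed leading digits.
uniform_like = list(range(100, 1000))

CRITICAL_8_DF = 15.51  # chi-square critical value, 8 degrees of freedom, alpha = 0.05

print(benford_chi_square(benford_like) < CRITICAL_8_DF)
print(benford_chi_square(uniform_like) < CRITICAL_8_DF)
```

A large statistic only flags a dataset for a closer look. Many legitimate datasets, such as bounded measurements, do not follow Benford's Law at all.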
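
Slide 52 compares the first two PCA dimensions of the expression data with the labels supplied as metadata. A minimal sketch of that kind of check follows, assuming NumPy is available and using synthetic data in place of real expression profiles. The tissue names, the nearest-centroid agreement score, and all function names are hypothetical.

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto the first two principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T


def label_agreement(points, labels):
    """Fraction of samples whose nearest label centroid matches their own label."""
    labels = np.asarray(labels)
    names = sorted(set(labels.tolist()))
    centroids = np.array([points[labels == name].mean(axis=0) for name in names])
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    nearest = np.array(names)[dists.argmin(axis=1)]
    return float((nearest == labels).mean())


# Synthetic stand-in for expression data: two well-separated groups of 40 samples.
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 1.0, size=(40, 50))
group_b = rng.normal(0.0, 1.0, size=(40, 50))
group_b[:, :5] += 6.0
X = np.vstack([group_a, group_b])

clean_labels = ["liver"] * 40 + ["brain"] * 40
# Simulate metadata errors: the first 10 "liver" samples are mislabeled.
noisy_labels = ["brain"] * 10 + ["liver"] * 30 + ["brain"] * 40

points = pca_2d(X)
clean_score = label_agreement(points, clean_labels)
noisy_score = label_agreement(points, noisy_labels)
print(clean_score, noisy_score)
```

A drop in agreement is the numeric counterpart of what the slide shows visually: clusters in the expression data that do not line up with the text labels.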

Editor's notes

  1. 4
  2. And processing power, either as raw processor speed or via novel multi-core and many-core architectures, is also continuing to increase exponentially…
  3. … but human cognitive capacity is remaining constant. How can computing technologies help scientists make sense out of these vast and complex data sets?
  4. The challenges stem from the large, noisy, and heterogeneous nature of the data more than from collecting the data in the first place. Data scie
  5. Google
  6. So in part as an attempt to relate “eScience” and “data science,” and in part to make sure the idea of data science wasn’t completely taken over by the machine learning people, we ran a massively open online course last Spring called Introduction to Data Science. We taught Scalable Databases, MapReduce, Statistics, Machine Learning, Visualization
  7. Following a 2014 report entitled “Big Data: Seizing Opportunities, Preserving Values”