Data Responsibly: The next decade of data science

  1. Data, Responsibly: The Next Decade of Data Science Bill Howe, PhD Associate Professor, Information School Director, Cascadia Urban Analytics Cooperative Adjunct Associate Professor, Computer Science & Engineering University of Washington
  2. My goals this afternoon… • Describe “data science” from my perspective • Describe some concerns that have recently emerged around the irresponsible use of data science techniques and technologies • Show off some of the work we’re doing to address it
  3. DataLab Bill Howe Databases, data management Jessica Hullman Visualization, HCI Carole Palmer Open data, digital curation Nic Weber Open data, civic tech Jevin West Science of science, bibliometrics …”calling bullshit” Emma Spiro Social network analysis
  4. The Fourth Paradigm 1. Empirical + experimental 2. Theoretical 3. Computational 4. Data-Intensive Jim Gray 1/10/2018 Bill Howe, UW 4
  5. Nearly every field of discovery is transitioning from “data poor” to “data rich” Astronomy: LSST Physics: LHC Oceanography: OOI Social Sciences Biology: Sequencing Economics Neuroscience: EEG, fMRI
  6. My view: Data science is about answering questions using large, noisy, and heterogeneous datasets, usually those that were collected for some unrelated purpose
  7. Question: How early and accurately can we predict flu outbreaks, so we can plan production levels of flu vaccine? Dataset: Search histories of users
  8. source: flu risk “Scientific hindsight shows that Google Flu Trends far overstated this year's flu season….” “Lots of media attention to this year's flu season skewed Google's search engine traffic.” David Wagner, Atlantic Wire, Feb 13 2013
  9. Question: Do people that take paroxetine and pravastatin together exhibit hypoglycemia symptoms? Dataset: Search engine histories
  10. Ryen W White, Nicholas P Tatonetti, Nigam H Shah, Russ B Altman, Eric Horvitz, Web-scale pharmacovigilance: listening to signals from the crowd, J Am Med Inform Assoc, March 2013, doi:10.1136/amiajnl-2012-001482
  11. Open Sidewalks – Sidewalk maps for low-mobility citizens Project Leads: Nick Bolten, Anat Caspi – Taskar Center, CSE DSSG Fellows: Amir Amini, Yun Hao, Vaishnavi Ravichandran, Andre Stephens ALVA High School Students: Nick Krasnoselsky, Doris Layman eScience Data Scientist Mentors: Anthony Arendt, Jake Vanderplas “30 million Americans over 15 years old experience limited mobility, including difficulty walking, climbing stairs, using wheelchairs, crutches, walkers, while 24 million more persons experience difficulty walking a quarter mile” | Picture: US Federal Highway Administration
  12. Automated cleaning of sidewalk data through computational geometry, powered by data from: SDOT/Socrata, Google API
      Step | Runtime | Solved (All) | Percent
      Connecting T-Gaps | ~3.9s | 3,837 (4,352) | 88.2
      Intersection Cleaning | ~23.6s | 38,844 (44,700) | 86.9
      Polygon Cleaning | ~10min | 7,283 (8,035) | 90.6
      Subgraphs | ~23.2s | 39,913 (45,265) | 88.1
  13. Homeless families may take many pathways through programs Emergency shelter Transitional housing Rapid re-housing Permanent housing Housing with services Unsuccessful exit
  14. Develop visualizations to show how homeless families move through programs
  15. Preliminary results to understand potential predictors of successful outcomes • Correlation with successful outcome, by family characteristics • Correlation with successful outcome, by homelessness program • Emergency Shelter use tends to be associated with unsuccessful outcomes (unsurprising!) • Homelessness Prevention programs more strongly associated with positive outcomes than transitional housing • Substance abuse strongly associated with unsuccessful outcomes • Parent employment strongest predictor of successful outcomes
  16. Common trajectories lead to different outcomes: • a successful exit from an episode would mean that the family found a permanent housing solution • a proportion of these still receive government subsidies • other exits are exits back into homelessness, or to other, unknown destinations Analyzing Family Trajectories through Programs Data: Pierce County Emergency Shelter -> Rapid Re-housing Emergency Shelter -> Transitional Housing 80% successful exits Only 40% successful exits
  17. ORCA Percentage Difference in Ridership, Seattle Mark Hallenbeck TRAC
  18. Metro Boardings By Type of Rider
      Passenger Type | Redmond (boardings) | Tukwila (boardings) | Redmond (share) | Tukwila (share)
      Adult | 317181 | 72202 | 91% | 67%
      Youth | 12818 | 7433 | 4% | 7%
      Senior | 5425 | 4577 | 2% | 4%
      Disabled | 7722 | 10449 | 2% | 10%
      Low Income | 6912 | 12438 | 2% | 12%
  20. Session 2 Summer 2014 121,215 students Session 1 Spring 2013 119,504 students
  22. Cathy O’Neil, September 2016. Three properties of a WMD: Opacity, Scale, Damage
  23. July 2016 “Data, Responsibly” Dagstuhl Workshop Gerhard Weikum Serge Abiteboul Julia Stoyanovich Gerome Miklau
  24. Observation: Epistemic issues are beginning to dominate the data science discussion in every field reproducibility, “algorithmic bias,” curation, discrimination, accountability, transparency, provenance, explanations, persuasion, privacy
  25. Ex: Staples online pricing. Reasoning: Offer deals to people that live near competitors’ stores. Effect: lower prices offered to buyers who live in more affluent neighborhoods
  26. [Latanya Sweeney; CACM 2013] Racially identifying names trigger ads suggestive of an arrest record (slide adapted from Stoyanovich, Miklau)
  27. Amazon Prime Now Delivery Area: Atlanta (Bloomberg, 2016)
  28. Amazon Prime Now Delivery Area: Chicago (Bloomberg, 2016)
  29. Amazon Prime Now Delivery Area: Boston (Bloomberg, 2016)
  30. ProPublica, May 2016
  31. The Special Committee on Criminal Justice Reform's hearing on reducing the pre-trial jail population, September 2016. Philadelphia is grappling with the prospect of a racist computer algorithm. Any background signal in the data of institutional racism is amplified by the algorithm, operationalized by the algorithm, legitimized by the algorithm. “Should I be afraid of risk assessment tools?” “No, you gotta tell me a lot more about yourself. At what age were you first arrested? What is the date of your most recent crime?” “And what’s the culture of policing in the neighborhood in which I grew up in?”
  32. The way I think about this… (1) First decade of Data Science research and practice: What can we do with massive, noisy, heterogeneous datasets? Next decade of Data Science research and practice: What should we do with massive, noisy, heterogeneous datasets?
  33. The way I think about this… (2) Decisions are based on two sources of information: 1. Past examples, e.g., “prior arrests tend to increase likelihood of future arrests” 2. Societal constraints, e.g., “we must avoid racial discrimination”. We’ve become very good at automating the use of past examples. We’ve only just started to think about incorporating societal constraints. 11/10/2016 Data, Responsibly / SciTech NW 16
  34. The way I think about this… (3) How do we apply societal constraints to algorithmic decision-making? Option 1: Keep a human in the loop. Ex: EU General Data Protection Regulation requires that a human be involved in legally binding algorithmic decision-making. Ex: Wisconsin Supreme Court says a human must review algorithmic decisions made by recidivism models. Option 2: Build them into the algorithms themselves. I’ll talk about some approaches for this
  35. The way I think about this… (4) On transparency vs. accountability: • For human decision-making, sometimes explanations are required, improving transparency – Supreme court decisions – Employee reprimands/termination • But when transparency is difficult, accountability takes over – medical emergencies, business decisions • As we shift decisions to algorithms, we lose both transparency AND accountability • “The buck stops where?”
  36. So what can we do about it? • Algorithms that balance predictive accuracy with fairness • Increase data sharing, while protecting privacy – Avoid the “tyranny of convenience” • Ensure transparency in all methods, datasets • Track known biases in how data was collected, so they can be controlled for in downstream analytics • All of these approaches are being explored in the research community.
  37. Recap • There’s a sea change underway in how we will teach and practice data science • No longer only about what can be done, but about what should be done • This is not just a policy/behavior/culture issue – there are technical problems to solve • Prediction: If a company is not thinking about this stuff, they will soon be facing retention and compliance issues – Witness how the privacy discussion evolved
  38. REPRODUCIBILITY
  39. Science is a complete mess • Reproducibility – Begley & Ellis, Nature 2012: 6 out of 53 cancer studies reproducible – Only about half of 100 psychology studies had effect sizes that approximated the original result (Science, 2015) – Ioannidis 2005: Why most published research findings are false – Reinhart & Rogoff: global economic policy based on spreadsheet fuck ups
  40. Science, 2015
  41. 11/10/2016 Data, Responsibly @ Dagstuhl 35 Retractions are increasing…..
  42. Why is this happening? (1)
  43. Why is this happening? (2)
  44. Why is this happening? (2) Publication Bias!
  46. Vision: Validate scientific claims automatically – Check for manipulation (manipulated images, Benford’s Law) – Extract claims from papers – Check claims against the authors’ data – Check claims against related data sets – Automatic meta-analysis across the literature + public datasets • First steps – Automatic curation: Validate and attach metadata to public datasets – Longitudinal analysis of the visual literature
  47. Microarray experiments
  48. Microarray samples submitted to the Gene Expression Omnibus. Curation is fast becoming the bottleneck to data sharing. Maxim Gretchkin, Poon Hoifung
  49. Maxim Gretchkin Poon Hoifung No growth in number of datasets used per paper!
  50. Maxim Gretchkin Poon Hoifung Majority of samples are one-time-use only!
  51. color = labels supplied as metadata; clusters = 1st two PCA dimensions on the gene expression data itself. Can we curate algorithmically? Maxim Gretchkin, Poon Hoifung. The expression data and the text labels appear to disagree
  52. Maxim Gretchkin Poon Hoifung Better Tissue Type Labels Domain knowledge (Ontology) Expression data Free-text Metadata 2 Deep Networks text expr SVM
  53. Deep Curation. Maxim Gretchkin, Poon Hoifung. Distant supervision and co-learning between text-based classifier and expression-based classifier: Both models improve by training on each other’s results. Free-text classifier, Expression classifier
  54. Deep Curation: Our stuff wins, with no training data Maxim Gretchkin Poon Hoifung state of the art our reimplementation of the state of the art our dueling pianos NN amount of training data used
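
Slide 46 lists Benford’s Law as one check for manipulated numbers. The following is a minimal sketch of such a check, not the method used in the talk: it compares the leading digits of a set of values against the Benford distribution with a chi-square statistic. The function names and the choice of a chi-square test are assumptions made for illustration.

```python
import math
from collections import Counter

# Chi-square critical value for 8 degrees of freedom at the 0.05 level.
CHI2_CRITICAL_8DF = 15.507


def first_digit(value):
    """Return the leading nonzero digit of a nonzero number."""
    # Strip the sign, then any leading zeros and decimal point ("0.0012" -> "12").
    text = str(abs(value)).lstrip("0.")
    return int(text[0])


def benford_chi_square(values):
    """Chi-square distance between observed leading digits and Benford's Law."""
    digits = [first_digit(v) for v in values if v != 0]
    n = len(digits)
    observed = Counter(digits)
    statistic = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)  # Benford probability of digit d
        statistic += (observed.get(d, 0) - expected) ** 2 / expected
    return statistic


if __name__ == "__main__":
    natural = [2 ** k for k in range(1, 201)]   # powers of two track Benford closely
    uniform = list(range(100, 1000))            # leading digits are uniform instead
    for name, data in (("powers of two", natural), ("100..999", uniform)):
        stat = benford_chi_square(data)
        flag = "suspicious" if stat > CHI2_CRITICAL_8DF else "consistent"
        print(f"{name}: chi2={stat:.1f} ({flag} with Benford)")
```

A large statistic only says the digits do not look Benford-like; many legitimate datasets (bounded ranges, assigned identifiers) fail the test for innocent reasons, so a flag is a prompt for inspection rather than evidence of manipulation.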
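
Slide 53 describes co-learning, where a text-based classifier and an expression-based classifier each improve by training on the other’s results. The sketch below illustrates that loop under strong simplifying assumptions: a nearest-centroid classifier stands in for the deep networks and SVM named on slide 52, a small dictionary of seed labels stands in for distant supervision, and both views share one growing label pool. All function names, feature values, and tissue labels are hypothetical.

```python
def fit_centroids(rows, labels):
    """Return {label: mean feature vector} for the labeled rows."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, row):
    """Return (nearest label, margin), where margin is the distance gap to the runner-up."""
    ranked = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, center)), label)
        for label, center in centroids.items()
    )
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else float("inf")
    return ranked[0][1], margin


def co_train(text_view, expr_view, seed_labels, rounds=5, per_round=2):
    """Alternate between views; each adds its most confident guesses to a shared pool."""
    labels = dict(seed_labels)  # {sample index: label}, grows as guesses are accepted
    for _ in range(rounds):
        for view in (text_view, expr_view):
            known = sorted(labels)
            model = fit_centroids([view[i] for i in known], [labels[i] for i in known])
            unlabeled = [i for i in range(len(view)) if i not in labels]
            if not unlabeled:
                return labels
            # Rank unlabeled samples by confidence (largest margin first).
            scored = sorted(
                ((predict(model, view[i]), i) for i in unlabeled),
                key=lambda item: -item[0][1],
            )
            for (label, _margin), i in scored[:per_round]:
                labels[i] = label  # the other view trains on this in its next turn
    return labels


# Toy data: six samples, two views each; only samples 0 and 3 start with a label.
text_view = [[0.0], [1.0], [0.5], [10.0], [9.0], [9.5]]
expr_view = [[5.0, 0.0], [4.0, 1.0], [5.5, 0.5], [-5.0, 0.0], [-4.0, 1.0], [-5.5, 0.5]]
result = co_train(text_view, expr_view, {0: "liver", 3: "brain"})
print(result)
```

Accepting only the highest-margin predictions each turn is what keeps early mistakes from one view from flooding the other; in practice the threshold and the number of samples accepted per round would need tuning.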

Editor's notes

  1. 4
  2. And processing power, either as raw processor speed or via novel multi-core and many-core architectures, is also continuing to increase exponentially…
  3. … but human cognitive capacity is remaining constant. How can computing technologies help scientists make sense out of these vast and complex data sets?
  4. The challenges stem from the data being large, noisy, and heterogeneous more than from collecting the data in the first place. Data scie
  5. Google
  6. So in part as an attempt to relate “eScience” and “data science,” and in part to make sure the idea of data science wasn’t completely taken over by the machine learning people, we ran a massively open online course last Spring called Introduction to Data Science. We taught Scalable Databases, MapReduce, Statistics, Machine Learning, Visualization
  7. Following a 2014 report entitled “Big Data: Seizing Opportunities, Preserving Values”