Data science governance and GDPR

An extended talk on the importance of data science governance for production ML, and on how GDPR can act as a catalyst for it while also generating value for organizations.

  1. DATA SCIENCE GOVERNANCE: Turn GDPR’s accountability principles into added value for your business. Data Science Meetup - Milan - March 18. www.kensu.io
  2. KENSU & ME. Andy Petrella: CEO & Founder; Mathematics, Computer Science; Spark Notebook; O’Reilly Training. Kensu started with an enterprise stack for Data Scientists (Agile Data Science Toolkit), pivoted on an internal component (Data Science Catalog), and now has one main focus: Data Science Governance.
  3. TOPICS. 1. Some thoughts on “Data Science” 2. Data Science Governance: What 3. Data Science Governance: How 4. GDPR: Accountability and Transparency Principles 5. How to leverage GDPR and Data Science to improve or disrupt the business
  4. SOME THOUGHTS ON “DATA SCIENCE”
  5. MACHINE LEARNING. Pioneers in the 1950s. AI winter in the 1970s due to pessimism. Resurgence in the 1980s. Machine learning (and related techniques) has been in use since the 1990s (esp. SVM and RNN). Deep learning sees widespread commercial use in the 2000s. Machine learning receives great publicity (read: buzz) in the 2010s. ref: https://en.wikipedia.org/wiki/Timeline_of_machine_learning
  6. DATA SCIENCE: +ENGINEERING. Claim: “Data Scientist” was coined by DJ Patil in 2008, pretty much when Machine Learning became part of software. In a way, that is when we added “engineering” to the mix. Engineering is even more prominent with Big Data and distributed computing.
  7. DATA SCIENCE: +EXPERIMENTATION. So much data available. So many tools, libraries, frameworks, … So many things we can try. We have distributed computing now, right? => Let’s try everything. Discover new insights (and potentially new businesses).
  8. DATA SCIENCE: RECAP. Maths: stats, machine learning and so on. Engineering: ETL, databases, computing frameworks, software, platforms, … Creativity: “From business intelligence to intelligent business” - Michael Fergusson. Data Science is an umbrella over all activities on data.
  9. DON’T BELIEVE ME? https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
  10. DON’T WANT TO READ THE PAPER? What about this 3-minute lecture in the Google Machine Learning Crash Course, talking about production ML systems…
  11. OR THIS ONE MAYBE? Okay, it’s a 14-minute lecture (probably as long as reading the paper ^^), talking about data dependencies.
  12. DATA SCIENCE GOVERNANCE: WHAT
  13. DATA PIPELINE. A data pipeline connects activities on data, potentially involving several technologies. A pipeline is generally thought of as an end-to-end processing line that solves one problem. But parts of pipelines are reused to save computation, storage, time, … Thus the interdependency between pipeline segments grows with the number of initiatives.
  14. GOAL: MAKE DECISIONS. Data pipelines, connected together, aren’t created for the beauty of it. The ultimate goal is always to make decisions. Decisions are generally made by, or linked to, humans with responsibilities (even for self-driving cars, in case of a problem). Given that pipelines are cut-and-wired, interleaved, … how can we not be anxious when deploying the last piece used by the decision maker?
  15. SOURCES OF ANXIETY. What if: • the data used in the process suddenly shows different patterns? • one of the tools, projects or similar is modified upstream? • the insights deviate from reality? • …
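The first "what if" on this slide (data suddenly showing different patterns) can be illustrated with a very small drift check. This is only a sketch under stated assumptions: the function names `mean_shift_score` and `has_drifted`, the threshold of 3 standard deviations and the sample values are all hypothetical, and a real monitor would likely compare full distributions rather than means.

```python
from statistics import mean, stdev


def mean_shift_score(reference, batch):
    """Distance between the batch mean and the reference mean,
    expressed in reference standard deviations."""
    ref_std = stdev(reference)
    shift = abs(mean(batch) - mean(reference))
    if ref_std == 0:
        # A constant reference: any move at all is an infinite shift.
        return 0.0 if shift == 0 else float("inf")
    return shift / ref_std


def has_drifted(reference, batch, threshold=3.0):
    """Flag a batch whose mean moved more than `threshold` (an assumed
    cut-off) reference standard deviations away."""
    return mean_shift_score(reference, batch) > threshold


# Hypothetical values of one numeric column, observed at training time.
reference_values = [10, 11, 9, 10, 12, 8, 10]

stable_batch = [10, 11, 9]
shifted_batch = [20, 21, 19]

print(has_drifted(reference_values, stable_batch))
print(has_drifted(reference_values, shifted_batch))
```

Running such a check on every new batch, before the data reaches the decision maker, turns one source of anxiety into an explicit alert.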
  16. DEBUGGING? To reduce the anxiety or, actually, the risks, we need ways to debug. In pure engineering, we have unit, functional and integration tests, … but what do we do when the problems come from the data themselves? We can’t generate all cases of data variations, right? How to debug? Without the big picture, we may spend weeks optimising a model for nothing.
  17. DATA SCIENCE GOVERNANCE. Data governance: • checks that data meets precise standards • involves monitoring against production data. Data Science Governance: • checks that data activity meets precise standards • involves monitoring against production data activity. A Data Activity is a phenomenon composed of Technologies, Users, Systems, Data and Processing.
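To make the definition of a Data Activity concrete, here is a minimal sketch of such a record and of a check that it "meets precise standards". The class `DataActivity`, its field names, the `violations` function and the three rules it applies are assumptions made for illustration; they are not Kensu's data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataActivity:
    """One observed activity: which technology, user and system
    applied which processing to which data (all names hypothetical)."""
    technology: str
    user: str
    system: str
    inputs: tuple
    outputs: tuple
    processing: str


def violations(activity, approved_technologies):
    """Return the list of (assumed) standards this activity breaks."""
    issues = []
    if not activity.user:
        issues.append("no user attached")
    if activity.technology not in approved_technologies:
        issues.append(f"technology not approved: {activity.technology}")
    if not activity.outputs:
        issues.append("no output declared")
    return issues


activity = DataActivity(
    technology="excel",
    user="",
    system="laptop",
    inputs=("crm.customers",),
    outputs=(),
    processing="manual_export",
)
print(violations(activity, approved_technologies={"spark", "python"}))
```

The point of the sketch is the shift in what is governed: the rules apply to the activity as a whole, not only to the data it touches.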
  18. GOVERNING DATA SCIENCE. Who does what, on which data, and where is it done? What is the impact of a process on the global system? What are the performance metrics (quality, execution, …) of the processes?
  19. CONTINUOUS INTEGRATION FOR DATA SCIENCE. Data Scientists/Citizens have a holistic view of their data, systems and processes. They also have control over their own results in production. They have the opportunity to analyse and debug any pipeline involving all activities: • independently of the technologies • involving several units in the enterprise
  20. DATA SCIENCE GOVERNANCE: HOW
  21. CHALLENGES. So many tools are using data! The number of processes is growing impressively. We have to take care of the legacy…
  22. GET THE DATA. As usual, we have to collect the right data to make the right decision. First run an assessment to create a high-level map of all the tools involved in a company. For each data tool, do whatever it takes to collect information describing its activities. This information includes metadata, lineage, statistics, accuracy measures, …
  23. CONNECT THE DATA. Data Science Governance needs the global picture. To get it, we need to connect all the data that can be collected, so that it is possible to create a cartography of all ongoing processes. This map tracks all data and their descendants.
  24. USE THE DATA. This is where the fun part starts… the map of data activities is an amazing source of information. Here are a few things you can do with this kind of data: • impact analysis • dependency analysis • pipeline optimisation • data or model recommendation
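Impact analysis and dependency analysis are both graph traversals over the map of data and their descendants. The sketch below assumes the map is stored as a plain directed graph; the class `LineageMap`, its method names and the data set names are hypothetical.

```python
from collections import defaultdict, deque


class LineageMap:
    """Directed graph of data flows: an edge source -> target means
    that `target` was produced from `source`."""

    def __init__(self):
        self._children = defaultdict(set)
        self._parents = defaultdict(set)

    def add_flow(self, source, target):
        self._children[source].add(target)
        self._parents[target].add(source)

    @staticmethod
    def _reach(start, edges):
        # Breadth-first traversal collecting every node reachable from start.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def impact(self, data_source):
        """Impact analysis: everything derived, directly or not, from it."""
        return self._reach(data_source, self._children)

    def dependencies(self, data_source):
        """Dependency analysis: everything it was derived from."""
        return self._reach(data_source, self._parents)


lineage = LineageMap()
lineage.add_flow("raw", "clean")
lineage.add_flow("clean", "features")
lineage.add_flow("crm", "features")
lineage.add_flow("features", "model")

print(sorted(lineage.impact("raw")))
print(sorted(lineage.dependencies("model")))
```

With the same structure, pipeline optimisation amounts to spotting segments shared by many descendants, since those are the ones worth computing once and reusing.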
  25. GDPR: General Data Protection Regulation
  26. ACCOUNTABILITY PRINCIPLE. Implement appropriate technical and organisational measures that ensure and demonstrate that you comply. This may include internal data protection policies such as staff training, internal audits of processing activities, and reviews of internal HR policies.
  27. TRANSPARENCY. As well as your obligation to provide comprehensive, clear and transparent privacy policies, if your organisation has more than 250 employees, you must maintain additional internal records of your processing activities.
  28. ACCOUNTABILITY: DATA SCIENCE GOVERNANCE. To govern data science, we have to: • collect activities • connect activities. Or… build and maintain the audit trails needed to create measures that demonstrate accountability.
  29. TRANSPARENCY: DATA SCIENCE GOVERNANCE. To govern data science, seen as a continuous integration solution, we have to monitor activities independently of the technologies. With this information we can reliably and automatically create the process registry, composed of the goals pursued and all the data involved.
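As an illustration of building the process registry automatically, the sketch below folds a list of monitored activities into one entry per process, holding the goal pursued and all the data involved. The record layout (keys `process`, `goal`, `technology`, `inputs`, `outputs`) and the function name `build_process_registry` are assumptions made for this example.

```python
from collections import defaultdict


def build_process_registry(activities):
    """Group monitored activities by process and collect, for each one,
    the goal pursued, the data involved and the technologies used."""
    registry = defaultdict(
        lambda: {"goal": None, "data": set(), "technologies": set()}
    )
    for activity in activities:
        entry = registry[activity["process"]]
        entry["goal"] = activity["goal"]
        entry["data"].update(activity["inputs"])
        entry["data"].update(activity["outputs"])
        entry["technologies"].add(activity["technology"])
    return dict(registry)


# Hypothetical activities, as they might be collected from two tools.
monitored = [
    {"process": "churn_scoring", "goal": "predict churn", "technology": "spark",
     "inputs": ["crm.customers"], "outputs": ["dw.churn_scores"]},
    {"process": "churn_scoring", "goal": "predict churn", "technology": "tensorflow",
     "inputs": ["dw.churn_scores"], "outputs": ["api.churn"]},
]

for name, entry in build_process_registry(monitored).items():
    print(name, entry["goal"], sorted(entry["data"]))
```

Because the registry is derived from observed activities rather than written by hand, it stays independent of the technologies that produced them.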
  30. BUSINESS: IMPROVE AND DISRUPT. Connect data and business. Spoiler alert: one-liner ahead
  31. DATA TO BUSINESS: Business KPIs are nothing but data!
  32. BUSINESS TO DATA: Change the business to match the data. ADAPT!
  33. KENSU. Making it real… yet, taking the idea even further. Kinda pitchy I know but meh… :-D
  34. ARTIFICIAL INTELLIGENCE ON DATA SCIENCE [diagram labels: Solution; Scientist / Engineer; Manager; Business; CDO; Authority; Activities; API; Governance; Compliance; Transformation; Machine Learning; Performance; Artificial Intelligence; Actionable Data]
  35. DATA FLOW, USERS, PROCESSES: ALL-IN. Data sources, schemas; categories of data involved; transitive lineage; markers on privacy data involved; users involved in the processes; programs used to create/run the flow.
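Combining transitive lineage with privacy markers means that a marker set on one source should follow the data through every flow. A minimal sketch of that propagation, assuming flows are given as (source, target) pairs; the function name `propagate_privacy` and the data set names are hypothetical.

```python
def propagate_privacy(flows, private_sources):
    """Return every data set that transitively derives from a source
    marked as holding privacy data (the marked sources included)."""
    flows = list(flows)  # iterated several times below
    marked = set(private_sources)
    changed = True
    while changed:
        changed = False
        for source, target in flows:
            if source in marked and target not in marked:
                marked.add(target)
                changed = True
    return marked


flows = [
    ("customers", "joined"),
    ("orders", "joined"),
    ("joined", "report"),
    ("weather", "forecast"),
]
print(sorted(propagate_privacy(flows, {"customers"})))
```

Here `report` ends up marked even though it never reads `customers` directly, which is exactly the case a per-tool view would miss.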
  36. INTUITIVE AND COMPREHENSIBLE REST API. Example in Python:

        # input_schema, input_schema_2, output_schema, process, user,
        # code_version and process_name are expected to be defined beforehand.
        dam_dependencies = [
            ProcessLineageDepsBuilder(input_schema, output_schema)
                .identity_from_output("data")
                .append('f', ['name', 'last'], "data"),
            ProcessLineageDepsBuilder(input_schema_2, output_schema)
                .identity_from_output("data")
        ]
        dam_create_process_run_and_lineage(process, user, code_version, process_name, dam_dependencies)

      High-level integration, like for Spark:

        // Initializing library to hook up to Apache Spark
        import io.kensu.dam.lineage.spark.lineage.Implicits._
        spark.track()

      Automatically tracks: - data transformations - data stats - machine learning models - performance of models
  37. MONITORING MACHINE LEARNING PERFORMANCE. DEV / Offline: read cold data, pick parameters, <train>. PROD / Online: read hot data, use parameters, <train>. Automated monitoring: - create data flow (data -> prepared data -> model) - register parameters - compute/gather performance metrics
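The monitoring steps on this slide (register parameters, gather performance metrics, compare offline and online runs) can be sketched as follows. The names `ModelRun` and `RunMonitor`, the tolerance of 0.05 and the assumption that a higher metric is better are all choices made for this example, not a description of the Kensu product.

```python
from dataclasses import dataclass


@dataclass
class ModelRun:
    environment: str  # "dev" (offline, cold data) or "prod" (online, hot data)
    parameters: dict  # parameters picked in dev and reused in prod
    metrics: dict     # performance metrics gathered for this run


class RunMonitor:
    """Keeps every registered run and flags a production metric that
    falls too far below the latest development value."""

    def __init__(self, tolerance=0.05):
        self.tolerance = tolerance
        self.runs = []

    def register(self, run):
        self.runs.append(run)

    def _latest(self, environment, metric):
        values = [
            run.metrics[metric]
            for run in self.runs
            if run.environment == environment and metric in run.metrics
        ]
        return values[-1] if values else None

    def degraded(self, metric):
        # Assumes a higher value is better (accuracy, AUC, ...).
        dev = self._latest("dev", metric)
        prod = self._latest("prod", metric)
        if dev is None or prod is None:
            return False
        return dev - prod > self.tolerance


monitor = RunMonitor()
params = {"max_depth": 6}
monitor.register(ModelRun("dev", params, {"accuracy": 0.92}))
monitor.register(ModelRun("prod", params, {"accuracy": 0.80}))
print(monitor.degraded("accuracy"))
```

Keeping the parameters next to the metrics is what makes a degraded run debuggable: the monitor can tell whether production used the parameters picked offline.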
  38. EASY TO INTEGRATE IN DEV ENVIRONMENT. Jupyter; Spark (Python, R, Scala); even notebooks! Google Colab; Python, TensorFlow.
  39. OUR PRODUCT: KENSU DATA ACTIVITY MANAGEMENT. Data Science Governance: first Governance, Compliance and Performance solution for data science. Feature / Benefit / Why it matters:
 • Connect.Collect.Learn / Automatically captures all data-science-relevant activities related to governance, compliance and performance within a given domain. / Provides end-to-end control and insights into all relevant aspects of data-science-related activities. #GDPR
 • DPO Dashboard / One-stop control center for all potential data privacy violations. / Near-real-time notifications and actionable intelligence on the current state of “compliance health”. #GDPR
 • Compliance Reporting / One-click reports for all relevant governance and compliance reports. / Guarantees a good relationship with the authorities in charge by respecting their templates. #GDPR
  40. BTW: Spark and Machine Learning training in Rome in June: http://www.technologytransfer.eu/event/1779/Apache_Spark_and_Machine_Learning_Workshop.html Interested in our way of thinking about ML and DS? We have another 3-day training on this (Spark, TensorFlow, H2O, …), with one in Rome to be scheduled this Fall.
  41. DATA SCIENCE GOVERNANCE. Andy Petrella, CEO and Co-Founder. @noootsab, andy.petrella@kensu.io, @kensuio. Let’s chat after the talk o/ Or contact me for: - DAM (demo, pilot, …) - training!