DATA SCIENCE GOVERNANCE
Turn GDPR’s accountability principles
into an added-value for your business
Data Science Meetup - Milan - March 18
- CEO & Founder -
KENSU & ME
Started with an enterprise stack for Data Scientists:
Agile Data Science Toolkit
Pivot on internal component:
Data Science Catalog
Data Science Governance
Spark Notebook O’Reilly Training
1. Some thoughts on “Data Science”
2. Data Science Governance: What
3. Data Science Governance: How
4. GDPR: Accountability and Transparency Principles
5. How to leverage GDPR and Data Science to improve
or disrupt the Business
Pioneers in 1950s
AI Winter in the 1970s due to pessimism
Resurgence in 1980s
Machine Learning (and related) has been used since the 1990s (esp. SVM and RNN)
Deep learning sees widespread commercial use in the 2000s
Machine learning receives great publicity (read: buzz) in 2010s
DATA SCIENCE: +ENGINEERING
Claim: “Data Scientist” coined by DJ Patil in 2008.
Pretty much when Machine Learning became part of software
In a way, when we added “engineering” to the mix
Also, engineering is even more prominent with Big Data Distributed
DATA SCIENCE: +EXPERIMENTATION
So much data available
So many tools, libraries, frameworks, …
So many things we can try
We have distributed computing now, right? => Let’s try everything
Discover new insights (and potentially new businesses)
DATA SCIENCE: RECAP
Maths: stats, machine learning and so on
Engineering: ETL, Databases, Computing framework, Softwares, Platforms,
Creativity: “From business intelligence To intelligent business”- Michael Fergusson
Data Science is an umbrella on top of all activities on data
Data pipeline is connecting activities on data, potentially involving
A pipeline is generally thought of as an end-to-end processing line to
solve one problem.
But parts of pipelines are reused to save computation, storage, time, …
Thus interdependency between pipeline segments grows with initiatives
GOAL: TAKE DECISION
Data Pipelines, connected together, aren’t created for the beauty of it.
The ultimate goal is always to take decisions.
Decisions are generally taken or linked to humans with responsibilities.
(even for self driving cars, in case of problem)
Given that pipelines are cut-and-wired, interleaved, …
How can we not be anxious about deploying the last piece used by the decision maker?
SOURCES OF ANXIETY
• one of the datasets used in the process suddenly shows different patterns?
• one of the tools, projects or similar is modified upstream?
• the insights are deviating from the reality?
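The first source of anxiety, data that suddenly shows different patterns, can be sketched as a simple check. This is only an illustration and not something shown in the talk: the function name `pattern_shift`, the threshold of 3 standard deviations, and the sample batches are all assumptions.

```python
from statistics import mean, stdev

def pattern_shift(reference, current, threshold=3.0):
    """Return True when the mean of `current` lies more than `threshold`
    reference standard deviations away from the mean of `reference`.

    Hypothetical helper: real monitoring would look at more than the mean.
    """
    ref_mean = mean(reference)
    ref_std = stdev(reference)
    if ref_std == 0:
        # A constant reference batch: any change of the mean counts as a shift.
        return mean(current) != ref_mean
    return abs(mean(current) - ref_mean) / ref_std > threshold

# Made-up batches of one numeric column.
reference_batch = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8]
same_pattern = [10.1, 9.9, 10.4, 10.0]
new_pattern = [25.0, 26.5, 24.0, 25.5]

print(pattern_shift(reference_batch, same_pattern))
print(pattern_shift(reference_batch, new_pattern))
```

A check like this only covers one column and one statistic; it is meant to show the kind of signal that has to be watched in production, not a complete solution.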
To reduce the anxiety or, actually, the risks, we need ways to debug.
In pure engineering, we have unit, functional and integration tests, … but
what do we do when the problems come from the data itself?
We can’t generate all cases of data variations, right?
How to debug?
Without the big picture, we may try to optimise a model for weeks for nothing
DATA SCIENCE GOVERNANCE
• controls that data meets precise standards
• involves monitoring against production data.
Data Science Governance:
• controls that data activity meets precise standards
• involves monitoring against production data activity.
A Data Activity is a phenomenon composed of
Technologies, Users, Systems, Data and Processing
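One possible way to represent such a Data Activity as a record, and to control it against a standard, is sketched below. The field names, the `violations` helper and the rules it checks are assumptions made for illustration; the slides only name the five components.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class DataActivity:
    """Hypothetical record covering the five components named on the slide."""
    technology: str            # e.g. "Spark"
    user: str                  # who ran it
    system: str                # where it ran
    data: Tuple[str, ...]      # datasets read or written
    processing: str            # what was done

def violations(activity: DataActivity, approved_technologies: set) -> List[str]:
    """Return the list of (assumed) standards that the activity does not meet."""
    problems = []
    if activity.technology not in approved_technologies:
        problems.append("technology not approved: " + activity.technology)
    if not activity.user:
        problems.append("no user attached to the activity")
    if not activity.data:
        problems.append("no data declared for the activity")
    return problems

activity = DataActivity(
    technology="Spark",
    user="alice",
    system="analytics-cluster",
    data=("customers", "customers_prepared"),
    processing="prepare_customers",
)
print(violations(activity, approved_technologies={"Spark"}))
```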
GOVERNING DATA SCIENCE
Who does what on which data and where it is done?
What is the impact of a process on the global system?
What are the performance metrics (quality, execution,…) of the
CONTINUOUS INTEGRATION FOR DATA SCIENCE
Data Scientists/Citizens have a holistic view of their data,
system and processes.
They also have control over their own results in production
They have the opportunity to analyse and debug any pipeline
involving all activities:
• independently of the technologies
• involving several units in the enterprise
So many tools are using data!
The number of processes is growing impressively.
We have to take care of the legacy…
GET THE DATA
As usual, we have to collect the right data to take the right decision.
First run an assessment to create a high level map of all the tools
involved in a company.
For each data tool, do whatever it takes to collect information
describing its activities.
This information includes metadata, lineage, statistics, accuracy measures, …
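As a minimal sketch of what collecting such information could look like for one dataset, the snippet below derives a schema and a few statistics from rows held in memory. The `describe` function, its output layout and the sample rows are assumptions for illustration, not the tooling presented in the talk.

```python
from statistics import mean

def describe(name, rows):
    """Summarise a dataset given as a list of dicts (hypothetical helper)."""
    columns = sorted({key for row in rows for key in row})
    stats = {}
    for column in columns:
        values = [row[column] for row in rows if row.get(column) is not None]
        # Only report a mean when every present value is a real number.
        numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        stats[column] = {
            "count": len(values),
            "missing": len(rows) - len(values),
            "mean": mean(values) if values and numeric else None,
        }
    return {"name": name, "schema": columns, "rows": len(rows), "stats": stats}

# Made-up sample data.
customers = [
    {"name": "Ada", "age": 36},
    {"name": "Linus", "age": 50},
    {"name": "Grace", "age": None},
]
summary = describe("customers", customers)
print(summary["schema"])
print(summary["stats"]["age"])
```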
CONNECT THE DATA
To do that we need to connect all the data that can be collected,
so that it is possible to create a map of all ongoing processes.
This map tracks all data and their descendants
Data Science Governance needs the global picture.
USE THE DATA
This is where the fun part starts… the map of data activities is an
amazing source of information
Here are a few things you can think of when using this kind of data:
• impact analysis
• dependency analysis
• pipeline optimisation
• data or model recommendation
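The first two uses, impact analysis and dependency analysis, can be sketched as traversals of the map of data activities. The edge list, the dataset names and the `reachable` function below are made up for illustration; impact is read here as "everything downstream" and dependency as "everything upstream".

```python
# Hypothetical map of data activities: (source dataset, derived dataset).
edges = [
    ("raw_events", "clean_events"),
    ("clean_events", "features"),
    ("customers", "features"),
    ("features", "churn_model"),
    ("clean_events", "daily_report"),
]

def reachable(edges, start, downstream=True):
    """Return every node reachable from `start`, following edges forward
    (impact analysis) or backward (dependency analysis)."""
    neighbours = {}
    for source, target in edges:
        key, value = (source, target) if downstream else (target, source)
        neighbours.setdefault(key, set()).add(value)
    found, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in neighbours.get(node, ()):
            if nxt not in found:
                found.add(nxt)
                stack.append(nxt)
    return found

# What is impacted if clean_events changes?
impact = reachable(edges, "clean_events", downstream=True)
# What does churn_model depend on?
dependencies = reachable(edges, "churn_model", downstream=False)
print(sorted(impact))
print(sorted(dependencies))
```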
General Data Protection Regulation
Implement appropriate technical and organisational measures that
ensure and demonstrate that you comply. This may include internal
data protection policies such as staff training, internal audits of
processing activities, and reviews of internal HR policies.
As well as your obligation to provide comprehensive, clear
and transparent privacy policies, if your organisation has
more than 250 employees, you must maintain additional
internal records of your processing activities.
ACCOUNTABILITY: DATA SCIENCE GOVERNANCE
To govern data science, we have to:
• collect activities
• connect activities
Or… building and maintaining the audit trails needed to
create measures that demonstrate accountability
TRANSPARENCY: DATA SCIENCE GOVERNANCE
To govern data science seen as a continuous integration solution:
We have to monitor activities independently of the technologies
With this information we can reliably and automatically create the
process registry composed of the goals pursued and all the data involved
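A minimal sketch of deriving such a registry from monitored activities follows. The record layout (`process`, `goal`, `data`), the sample activities and the `build_registry` function are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical monitored activities, independent of the technology that ran them.
observed = [
    {"process": "churn_prediction", "goal": "reduce customer churn",
     "data": ["customers", "clean_events"]},
    {"process": "churn_prediction", "goal": "reduce customer churn",
     "data": ["features"]},
    {"process": "daily_reporting", "goal": "operational reporting",
     "data": ["clean_events"]},
]

def build_registry(observed):
    """Group activities by process, collecting goals pursued and data involved."""
    registry = defaultdict(lambda: {"goals": set(), "data": set()})
    for activity in observed:
        entry = registry[activity["process"]]
        entry["goals"].add(activity["goal"])
        entry["data"].update(activity["data"])
    # Sort the sets so the registry is stable and easy to report on.
    return {
        name: {"goals": sorted(entry["goals"]), "data": sorted(entry["data"])}
        for name, entry in registry.items()
    }

registry = build_registry(observed)
for name, entry in registry.items():
    print(name, entry)
```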
BUSINESS: IMPROVE AND DISRUPT
Connect data and business
Spoiler alert: one-liner ahead
DATA TO BUSINESS
Business KPIs are nothing but data!
BUSINESS TO DATA
Change the business to match the data
Making it real… yet, taking the idea even further
Kinda pitchy I know but meh… :-D
ARTIFICIAL INTELLIGENCE ON DATA SCIENCE
Scientist / Engineer
DATA FLOW, USERS, PROCESSES ALL-IN
Data sources, Schemas
Categories of data involved
Markers on privacy data involved
Users involved in the processes
Programs used to create/run the flow
INTUITIVE AND COMPREHENSIBLE REST API
dam_dependencies = [
.append('f', ['name', 'last'], "data"),
Example in Python
High level integration like for Spark
// Initializing library to hook up to Apache Spark
- data transformations
- data stats
- machine learning models
- performance of models
MONITORING MACHINE LEARNING PERFORMANCE
Read cold data
DEV / Offline
PROD / Online
Read hot data
- Create data flow
data -> prepared data -> model
- Register parameters
- Compute/Gather performance metrics
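The comparison between DEV / Offline and PROD / Online performance can be sketched as below. Accuracy as the metric, the tolerance of 0.05 and the sample predictions are assumptions chosen for illustration; they are not taken from the product.

```python
def accuracy(predictions, labels):
    """Share of predictions that match the labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

def degraded(offline_score, online_score, tolerance=0.05):
    """True when the online score falls below the offline score by more
    than `tolerance` (hypothetical alerting rule)."""
    return (offline_score - online_score) > tolerance

# Made-up labels and predictions for both environments.
labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
offline_predictions = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # on cold data
online_predictions = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]   # on hot data

offline_score = accuracy(offline_predictions, labels)
online_score = accuracy(online_predictions, labels)
print(offline_score, online_score, degraded(offline_score, online_score))
```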
EASY TO INTEGRATE IN DEV ENVIRONMENT
Spark (Python, R, Scala)
Even notebooks !
OUR PRODUCT: KENSU DATA ACTIVITY MANAGEMENT
Data Science Governance
First Governance, Compliance and Performance solution for Data science
Feature | Benefit | Why it matters
Automatically captures all data
science relevant activities related to
governance, compliance and
performance within a given domain.
Provides end-to-end control and
insights into all relevant aspects of
data-science-related activities
One-stop control center for all
potential data privacy violations
Near-realtime notifications and
actionable intelligence on the current state
of “compliance health”
One-click reports for all relevant
governance and compliance reports
Guarantee for good relationship with
authorities in charge by respecting
Spark and Machine training in Roma in June:
Interested in our way to think about ML and DS?
We have another 3-day training on this (Spark, TensorFlow, H2O, …)
(One in Roma to be scheduled this Fall)
DATA SCIENCE GOVERNANCE
CEO and Co-Founder
Let’s chat after the talk o/
Or, contact me for:
- DAM (demo, pilot, …)