SlideShare une entreprise Scribd logo
1  sur  30
Télécharger pour lire hors ligne
www.kensu.io
DATA SCIENCE GOVERNANCE
1
What and How
www.kensu.io 2
- CEO & Founder -
Mathematics & Computer Science MsC.
Creator of Spark Notebook
- CSO & Founder -
Physics PhD. 

Genomics & Quantitative Finance
XAVIER TORDOIRANDY PETRELLA
KENSU & ME
Started in 2015 as Data Fellas, focus on Data Science consulting
Team of 10 engineers and scientists
Shift toward Product Company in 2016, renamed to Kensu,
Focus on Data Science Governance
Accelerated by Alchemist Accelerator in San Francisco and The Faktory in Belgium
www.kensu.io
TOPICS
1. Some thoughts on “Data Science”
2. Data Science Governance: What
3. Data Science Governance: How
4. GDPR: Accountability principle and transparency
3
www.kensu.io
SOME THOUGHTS ON “DATA SCIENCE”
4
www.kensu.io
MACHINE LEARNING
Pioneers in 1950s
AI Winter in 1970s due pessimism
Resurgence in 1980s
Machine Learning (and related) is used since the 1990s (esp. SVM and RNN)
Deep learning see widespread commercial use in 2000s
Machine learning receives great publicity (read: buzz) in 2010s
5ref: https://en.wikipedia.org/wiki/Timeline_of_machine_learning
www.kensu.io
DATA SCIENCE: +ENGINEERING
Claim: “Data Scientist” coined by DJ Patil in 2008.
Pretty much where Machine Learning was part of Softwares
In a way, when we added “engineering” to the mix
Also, engineering is even more prominent with Big Data Distributed
Computing
6
www.kensu.io
DATA SCIENCE: +EXPERIMENTATION
So much data available
So many tools, libraries, frameworks, …
So many things we can try
We have distributed computing now, right? => Let’s try everything
Discover new insights (and potentially new businesses)
7
www.kensu.io
DATA SCIENCE: RECAP
Maths: stats, machine learning and so on
Engineering: ETL, Databases, Computing framework, Softwares, Platforms, …
Creativity: “From business intelligence To intelligent business”- Michael Fergusson
Data Science is an umbrella on top of all activities on data
8
www.kensu.io
DATA SCIENCE GOVERNANCE: WHAT
9
www.kensu.io
DATA PIPELINE
Data pipeline is connecting activities on data, potentially involving
several technologies.
A pipeline is generally thought as an End-to-End processing line to solve
one problem.
But, part of pipelines are reused to save computation, storage, time, …
Thus interdependency between pipeline segments grows with initiatives
10
www.kensu.io
GOAL: TAKE DECISION
Data Pipelines, connected together, aren’t created for the beauty of it.
The ultimate goal is always to take decisions.
Decisions are generally taken or linked to humans with responsibilities.

(even for self driving cars, in case of problem)
Given that pipelines are cut-and-wired, interleaved, …
How not to be anxious at deploying the last piece used by the decision maker
11
www.kensu.io
SOURCES OF ANXIETY
What if:
• one of the data used in the process has different patterns suddenly?
• one of the tools, projects or similar is modified upstream?
• the insights are deviating from the reality?
• …
12
www.kensu.io
DEBUGGING?
To reduce the anxiety or, actually, reducing the risks, we need ways to debug.
In pure engineering, we have unit, function, integrations tests,… but
How do we do when the problems come from the data themselves?
We can’t generate all cases of data variations, right?
How to debug? 

Without the big picture, we may try to optimise a model for weeks for nothing
13
www.kensu.io
DATA SCIENCE GOVERNANCE
Data governance: controls that data meets precise standards and
involves monitoring against production data.
Data Science Governance: control that data activity meets precise
standards and involves monitoring against production data activity.
A Data Activity is described by at least technologies, users, systems,
data, processing
14
www.kensu.io
GOVERNING DATA SCIENCE
Who does what on which data and where it is done?
What is the impact of a process on the global system?
What are the performance metrics (quality, execution,…) of the processes?
15
www.kensu.io
CONTINUOUS INTEGRATION FOR DATA SCIENCE
Data Scientists/Citizens have a view on all the activities applied to
the original sources used in his/her own process.
They also have a control on their own results in production
They have the opportunity to analyse and debug a pipeline
involving all activities:
• independently of the technologies
• involving several people in the enterprise
16
www.kensu.io
DATA SCIENCE GOVERNANCE: HOW
17
www.kensu.io
CHALLENGES
So many tools are using data!
The number of processing is growing impressively.
We have to take care of the legacy…
18
www.kensu.io
GET THE DATA
As usual, we have to collect the right data to take right decision.
First run an assessment to create a high level map of all the tools
involved into a company.
For each tool, do whatever it takes to collect information about the
activities it is creating.
Information are metadata, lineage, statistics, accuracy measures, …
19
www.kensu.io
CONNECT THE DATA
Data Science Governance needs the global picture.
To do that we need to connect all data that can be collected.
So that, it is possible to create a cartography of all on-going processes.
This map tracks all data and their descendants
20
www.kensu.io
USE THE DATA
This is where the fun part starts… the map of data activities is an
amazing source of information
Here are a few things you can think of when using this kind of data:
• impact analysis
• dependency analysis
• optimisation
• recommendation
21
www.kensu.io
GDPR
22
General Data Protection Regulation
www.kensu.io
ACCOUNTABILITY PRINCIPLE
Implement appropriate technical and organisational measures that
ensure and demonstrate that you comply. This may include internal
data protection policies such as staff training, internal audits of
processing activities, and reviews of internal HR policies.
23
www.kensu.io
TRANSPARENCY
As well as your obligation to provide comprehensive, clear and
transparent privacy policies, if your organisation has more than 250
employees, you must maintain additional internal records of your
processing activities.
24
www.kensu.io
ACCOUNTABILITY: DATA SCIENCE GOVERNANCE
To govern data science, we have to:
• collect activities
• connect activities
With this information we can reliably create automatically the
process registry
25
www.kensu.io
TRANSPARENCY: DATA SCIENCE GOVERNANCE
To govern data science seen as a continuous integration solution: 

we have to explain and measure activities independently of the
technologies.
With this information we can reliably create transparent reports of
activities across the whole chain of processing
26
www.kensu.io
GUESS WHAT?
This what Adalog, our product at Kensu, does!
27
www.kensu.io
ADALOG
28
Adalog Collectors
Adalog Service
Data Citizen
HTTPSPortonly
Recommendation System
Data Process Registry
Impact Analyzer
Data
Protection
Officer
Dashboard
www.kensu.io
WANT TO SEE MORE?
Request a demo on our website: http://kensu.io
29
www.kensu.io
DATA SCIENCE GOVERNANCE
Andy Petrella
CEO Co Founder
0032 495 99 11 04
@noootsab
Xavier Tordoir
CSO Co Founder
0032 495 99 11 04
+1 (628) 236-9239
@xtordoir
@kensuio

Contenu connexe

Tendances

DataEd Slides: Data Strategy – Plans Are Useless but Planning Is Invaluable
DataEd Slides: Data Strategy – Plans Are Useless but Planning Is InvaluableDataEd Slides: Data Strategy – Plans Are Useless but Planning Is Invaluable
DataEd Slides: Data Strategy – Plans Are Useless but Planning Is Invaluable
DATAVERSITY
 
DataEd Slides: Expressing Data Improvements as Business Outcomes
DataEd Slides: Expressing Data Improvements as Business OutcomesDataEd Slides: Expressing Data Improvements as Business Outcomes
DataEd Slides: Expressing Data Improvements as Business Outcomes
DATAVERSITY
 
BIG DATA WORKBOOK OCT 2015
BIG DATA WORKBOOK OCT 2015BIG DATA WORKBOOK OCT 2015
BIG DATA WORKBOOK OCT 2015
Fiona Lew
 

Tendances (20)

Data Management vs. Data Governance Program
Data Management vs. Data Governance ProgramData Management vs. Data Governance Program
Data Management vs. Data Governance Program
 
DataEd Slides: Data Strategy – Plans Are Useless but Planning Is Invaluable
DataEd Slides: Data Strategy – Plans Are Useless but Planning Is InvaluableDataEd Slides: Data Strategy – Plans Are Useless but Planning Is Invaluable
DataEd Slides: Data Strategy – Plans Are Useless but Planning Is Invaluable
 
DataEd Slides: Expressing Data Improvements as Business Outcomes
DataEd Slides: Expressing Data Improvements as Business OutcomesDataEd Slides: Expressing Data Improvements as Business Outcomes
DataEd Slides: Expressing Data Improvements as Business Outcomes
 
Key Elements for a Successful Service Analytics Program
Key Elements for a Successful Service Analytics ProgramKey Elements for a Successful Service Analytics Program
Key Elements for a Successful Service Analytics Program
 
Approaching Data Quality
Approaching Data QualityApproaching Data Quality
Approaching Data Quality
 
Predictive Analytics - How to get stuff out of your Crystal Ball
Predictive Analytics - How to get stuff out of your Crystal BallPredictive Analytics - How to get stuff out of your Crystal Ball
Predictive Analytics - How to get stuff out of your Crystal Ball
 
DAS Slides: Building a Data Strategy — Practical Steps for Aligning with Busi...
DAS Slides: Building a Data Strategy — Practical Steps for Aligning with Busi...DAS Slides: Building a Data Strategy — Practical Steps for Aligning with Busi...
DAS Slides: Building a Data Strategy — Practical Steps for Aligning with Busi...
 
RWDG Slides: Metadata Governance for Catalogs, Glossaries, Dictionaries, and ...
RWDG Slides: Metadata Governance for Catalogs, Glossaries, Dictionaries, and ...RWDG Slides: Metadata Governance for Catalogs, Glossaries, Dictionaries, and ...
RWDG Slides: Metadata Governance for Catalogs, Glossaries, Dictionaries, and ...
 
DI&A Slides: Data Lake vs. Data Warehouse
DI&A Slides: Data Lake vs. Data WarehouseDI&A Slides: Data Lake vs. Data Warehouse
DI&A Slides: Data Lake vs. Data Warehouse
 
Are You Your Company's Chief Data Officer?
Are You Your Company's Chief Data Officer?Are You Your Company's Chief Data Officer?
Are You Your Company's Chief Data Officer?
 
Data Governance Best Practices and Lessons Learned
Data Governance Best Practices and Lessons LearnedData Governance Best Practices and Lessons Learned
Data Governance Best Practices and Lessons Learned
 
ADV Slides: What Happened of Note in 1H 2020 in Enterprise Advanced Analytics
ADV Slides: What Happened of Note in 1H 2020 in Enterprise Advanced AnalyticsADV Slides: What Happened of Note in 1H 2020 in Enterprise Advanced Analytics
ADV Slides: What Happened of Note in 1H 2020 in Enterprise Advanced Analytics
 
Building a data fluent organization
Building a data fluent organizationBuilding a data fluent organization
Building a data fluent organization
 
Data Literacy and Data Virtualization: A Step-by-step Guide to Bolstering You...
Data Literacy and Data Virtualization: A Step-by-step Guide to Bolstering You...Data Literacy and Data Virtualization: A Step-by-step Guide to Bolstering You...
Data Literacy and Data Virtualization: A Step-by-step Guide to Bolstering You...
 
Data-Ed Webinar: Data Quality Success Stories
Data-Ed Webinar: Data Quality Success StoriesData-Ed Webinar: Data Quality Success Stories
Data-Ed Webinar: Data Quality Success Stories
 
Data-Ed Online Webinar: Data Governance Strategies
Data-Ed Online Webinar: Data Governance StrategiesData-Ed Online Webinar: Data Governance Strategies
Data-Ed Online Webinar: Data Governance Strategies
 
BIG DATA WORKBOOK OCT 2015
BIG DATA WORKBOOK OCT 2015BIG DATA WORKBOOK OCT 2015
BIG DATA WORKBOOK OCT 2015
 
How to Consume Your Data for AI
How to Consume Your Data for AIHow to Consume Your Data for AI
How to Consume Your Data for AI
 
DataEd Slides: Getting (Re)Started with Data Stewardship
DataEd Slides: Getting (Re)Started with Data StewardshipDataEd Slides: Getting (Re)Started with Data Stewardship
DataEd Slides: Getting (Re)Started with Data Stewardship
 
DataEd Slides: Data Management + Data Strategy = Interoperability
DataEd Slides: Data Management + Data Strategy = InteroperabilityDataEd Slides: Data Management + Data Strategy = Interoperability
DataEd Slides: Data Management + Data Strategy = Interoperability
 

Similaire à Data science governance : what and how

Confirming PagesLess managing. More teaching. Greater
Confirming PagesLess managing. More teaching. Greater Confirming PagesLess managing. More teaching. Greater
Confirming PagesLess managing. More teaching. Greater
AlleneMcclendon878
 
How Can Analytics Improve Business?
How Can Analytics Improve Business?How Can Analytics Improve Business?
How Can Analytics Improve Business?
Inside Analysis
 

Similaire à Data science governance : what and how (20)

Governance compliance
Governance   complianceGovernance   compliance
Governance compliance
 
Innovation med big data – chr. hansens erfaringer
Innovation med big data – chr. hansens erfaringerInnovation med big data – chr. hansens erfaringer
Innovation med big data – chr. hansens erfaringer
 
A Survey on Data Mining
A Survey on Data MiningA Survey on Data Mining
A Survey on Data Mining
 
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
 
Introduction to business analyticsand historidal overview.pptx
Introduction to business analyticsand historidal overview.pptxIntroduction to business analyticsand historidal overview.pptx
Introduction to business analyticsand historidal overview.pptx
 
Information what is it
Information what is itInformation what is it
Information what is it
 
Big data
Big dataBig data
Big data
 
Big dataplatform operationalstrategy
Big dataplatform operationalstrategyBig dataplatform operationalstrategy
Big dataplatform operationalstrategy
 
DATASCIENCE vs BUSINESS INTELLIGENCE.pptx
DATASCIENCE vs BUSINESS INTELLIGENCE.pptxDATASCIENCE vs BUSINESS INTELLIGENCE.pptx
DATASCIENCE vs BUSINESS INTELLIGENCE.pptx
 
Luciano uvi hackfest.28.10.2020
Luciano uvi hackfest.28.10.2020Luciano uvi hackfest.28.10.2020
Luciano uvi hackfest.28.10.2020
 
Confirming PagesLess managing. More teaching. Greater
Confirming PagesLess managing. More teaching. Greater Confirming PagesLess managing. More teaching. Greater
Confirming PagesLess managing. More teaching. Greater
 
How to Prepare for a Career in Data Science
How to Prepare for a Career in Data ScienceHow to Prepare for a Career in Data Science
How to Prepare for a Career in Data Science
 
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
 
My FAIR share of the work - Diamond Light Source - Dec 2018
My FAIR share of the work - Diamond Light Source - Dec 2018My FAIR share of the work - Diamond Light Source - Dec 2018
My FAIR share of the work - Diamond Light Source - Dec 2018
 
Introduction to Data Science and Analytics
Introduction to Data Science and AnalyticsIntroduction to Data Science and Analytics
Introduction to Data Science and Analytics
 
Big data
Big dataBig data
Big data
 
Driving Business Value Through Agile Data Assets
Driving Business Value Through Agile Data AssetsDriving Business Value Through Agile Data Assets
Driving Business Value Through Agile Data Assets
 
How Can Analytics Improve Business?
How Can Analytics Improve Business?How Can Analytics Improve Business?
How Can Analytics Improve Business?
 
Story of Bigdata and its Applications in Financial Institutions
Story of Bigdata and its Applications in Financial InstitutionsStory of Bigdata and its Applications in Financial Institutions
Story of Bigdata and its Applications in Financial Institutions
 
Agile data science
Agile data scienceAgile data science
Agile data science
 

Plus de Andy Petrella

Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphX
Andy Petrella
 

Plus de Andy Petrella (20)

Data Observability Best Pracices
Data Observability Best PracicesData Observability Best Pracices
Data Observability Best Pracices
 
How to Build a Global Data Mapping
How to Build a Global Data MappingHow to Build a Global Data Mapping
How to Build a Global Data Mapping
 
Interactive notebooks
Interactive notebooksInteractive notebooks
Interactive notebooks
 
Scala: the unpredicted lingua franca for data science
Scala: the unpredicted lingua franca  for data scienceScala: the unpredicted lingua franca  for data science
Scala: the unpredicted lingua franca for data science
 
Agile data science with scala
Agile data science with scalaAgile data science with scala
Agile data science with scala
 
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
 
What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.
 
Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)
 
Distributed machine learning 101 using apache spark from a browser devoxx.b...
Distributed machine learning 101 using apache spark from a browser   devoxx.b...Distributed machine learning 101 using apache spark from a browser   devoxx.b...
Distributed machine learning 101 using apache spark from a browser devoxx.b...
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scale
 
Leveraging mesos as the ultimate distributed data science platform
Leveraging mesos as the ultimate distributed data science platformLeveraging mesos as the ultimate distributed data science platform
Leveraging mesos as the ultimate distributed data science platform
 
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
 
Spark meetup london share and analyse genomic data at scale with spark, adam...
Spark meetup london  share and analyse genomic data at scale with spark, adam...Spark meetup london  share and analyse genomic data at scale with spark, adam...
Spark meetup london share and analyse genomic data at scale with spark, adam...
 
Distributed machine learning 101 using apache spark from the browser
Distributed machine learning 101 using apache spark from the browserDistributed machine learning 101 using apache spark from the browser
Distributed machine learning 101 using apache spark from the browser
 
Liège créative: Open Science
Liège créative: Open ScienceLiège créative: Open Science
Liège créative: Open Science
 
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale
BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at ScaleBioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
 
Spark devoxx2014
Spark devoxx2014Spark devoxx2014
Spark devoxx2014
 
Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and Scala
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphX
 

Dernier

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Dernier (20)

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 

Data science governance : what and how

  • 2. www.kensu.io 2 - CEO & Founder - Mathematics & Computer Science MsC. Creator of Spark Notebook - CSO & Founder - Physics PhD. 
 Genomics & Quantitative Finance XAVIER TORDOIRANDY PETRELLA KENSU & ME Started in 2015 as Data Fellas, focus on Data Science consulting Team of 10 engineers and scientists Shift toward Product Company in 2016, renamed to Kensu, Focus on Data Science Governance Accelerated by Alchemist Accelerator in San Francisco and The Faktory in Belgium
  • 3. www.kensu.io TOPICS 1. Some thoughts on “Data Science” 2. Data Science Governance: What 3. Data Science Governance: How 4. GDPR: Accountability principle and transparency 3
  • 4. www.kensu.io SOME THOUGHTS ON “DATA SCIENCE” 4
  • 5. www.kensu.io MACHINE LEARNING Pioneers in 1950s AI Winter in 1970s due pessimism Resurgence in 1980s Machine Learning (and related) is used since the 1990s (esp. SVM and RNN) Deep learning see widespread commercial use in 2000s Machine learning receives great publicity (read: buzz) in 2010s 5ref: https://en.wikipedia.org/wiki/Timeline_of_machine_learning
  • 6. www.kensu.io DATA SCIENCE: +ENGINEERING Claim: “Data Scientist” coined by DJ Patil in 2008. Pretty much where Machine Learning was part of Softwares In a way, when we added “engineering” to the mix Also, engineering is even more prominent with Big Data Distributed Computing 6
  • 7. www.kensu.io DATA SCIENCE: +EXPERIMENTATION So much data available So many tools, libraries, frameworks, … So many things we can try We have distributed computing now, right? => Let’s try everything Discover new insights (and potentially new businesses) 7
  • 8. www.kensu.io DATA SCIENCE: RECAP Maths: stats, machine learning and so on Engineering: ETL, Databases, Computing framework, Softwares, Platforms, … Creativity: “From business intelligence To intelligent business”- Michael Fergusson Data Science is an umbrella on top of all activities on data 8
  • 10. www.kensu.io DATA PIPELINE Data pipeline is connecting activities on data, potentially involving several technologies. A pipeline is generally thought as an End-to-End processing line to solve one problem. But, part of pipelines are reused to save computation, storage, time, … Thus interdependency between pipeline segments grows with initiatives 10
  • 11. www.kensu.io GOAL: TAKE DECISION Data Pipelines, connected together, aren’t created for the beauty of it. The ultimate goal is always to take decisions. Decisions are generally taken or linked to humans with responsibilities.
 (even for self driving cars, in case of problem) Given that pipelines are cut-and-wired, interleaved, … How not to be anxious at deploying the last piece used by the decision maker 11
  • 12. www.kensu.io SOURCES OF ANXIETY What if: • one of the data used in the process has different patterns suddenly? • one of the tools, projects or similar is modified upstream? • the insights are deviating from the reality? • … 12
  • 13. www.kensu.io DEBUGGING? To reduce the anxiety or, actually, reducing the risks, we need ways to debug. In pure engineering, we have unit, function, integrations tests,… but How do we do when the problems come from the data themselves? We can’t generate all cases of data variations, right? How to debug? 
 Without the big picture, we may try to optimise a model for weeks for nothing 13
  • 14. www.kensu.io DATA SCIENCE GOVERNANCE Data governance: controls that data meets precise standards and involves monitoring against production data. Data Science Governance: control that data activity meets precise standards and involves monitoring against production data activity. A Data Activity is described by at least technologies, users, systems, data, processing 14
  • 15. www.kensu.io GOVERNING DATA SCIENCE Who does what on which data and where it is done? What is the impact of a process on the global system? What are the performance metrics (quality, execution,…) of the processes? 15
  • 16. www.kensu.io CONTINUOUS INTEGRATION FOR DATA SCIENCE Data Scientists/Citizens have a view on all the activities applied to the original sources used in his/her own process. They also have a control on their own results in production They have the opportunity to analyse and debug a pipeline involving all activities: • independently of the technologies • involving several people in the enterprise 16
  • 18. www.kensu.io CHALLENGES So many tools are using data! The number of processing is growing impressively. We have to take care of the legacy… 18
  • 19. www.kensu.io GET THE DATA As usual, we have to collect the right data to take right decision. First run an assessment to create a high level map of all the tools involved into a company. For each tool, do whatever it takes to collect information about the activities it is creating. Information are metadata, lineage, statistics, accuracy measures, … 19
  • 20. www.kensu.io CONNECT THE DATA Data Science Governance needs the global picture. To do that we need to connect all data that can be collected. So that, it is possible to create a cartography of all on-going processes. This map tracks all data and their descendants 20
  • 21. www.kensu.io USE THE DATA This is where the fun part starts… the map of data activities is an amazing source of information Here are a few things you can think of when using this kind of data: • impact analysis • dependency analysis • optimisation • recommendation 21
  • 23. www.kensu.io ACCOUNTABILITY PRINCIPLE Implement appropriate technical and organisational measures that ensure and demonstrate that you comply. This may include internal data protection policies such as staff training, internal audits of processing activities, and reviews of internal HR policies. 23
  • 24. www.kensu.io TRANSPARENCY As well as your obligation to provide comprehensive, clear and transparent privacy policies, if your organisation has more than 250 employees, you must maintain additional internal records of your processing activities. 24
  • 25. www.kensu.io ACCOUNTABILITY: DATA SCIENCE GOVERNANCE To govern data science, we have to: • collect activities • connect activities With this information we can reliably create automatically the process registry 25
  • 26. www.kensu.io TRANSPARENCY: DATA SCIENCE GOVERNANCE To govern data science seen as a continuous integration solution: 
 we have to explain and measure activities independently of the technologies. With this information we can reliably create transparent reports of activities across the whole chain of processing 26
  • 27. www.kensu.io GUESS WHAT? This what Adalog, our product at Kensu, does! 27
  • 28. www.kensu.io ADALOG 28 Adalog Collectors Adalog Service Data Citizen HTTPSPortonly Recommendation System Data Process Registry Impact Analyzer Data Protection Officer Dashboard
  • 29. www.kensu.io WANT TO SEE MORE? Request a demo on our website: http://kensu.io 29
  • 30. www.kensu.io DATA SCIENCE GOVERNANCE Andy Petrella CEO Co Founder 0032 495 99 11 04 @noootsab Xavier Tordoir CSO Co Founder 0032 495 99 11 04 +1 (628) 236-9239 @xtordoir @kensuio