Agile Data Science with Scala
by @DataFellas
Xavier Tordoir
xtordoir@data-fellas.guru
@xtordoir
Andy Petrella
noootsab@data-fellas.guru
@noootsab
Data Fellas
Andy Petrella
Maths
Geospatial
Distributed Computing
Spark Notebook
Trainer Spark/Scala
Machine Learning
Xavier Tordoir
Physics
Bioinformatics
Distributed Computing
Scala (& Perl)
trainer Spark
Machine Learning
© Data Fellas SPRL 2016
Lineup
So if you’re not sure you want to stay...
● Pipeline: productizing Data Science
● Demo of Distributed Pipeline (Spark, Mesos, Akka, Cassandra, Kafka, Spark Notebook)
● Why Micro Services?
● Painful points:
○ Data science is Discontiguous
○ Context Lost in Translation
● Solution: Data Fellas’ Agile Data Science Toolkit
Pipeline
Productizing Data Science
● Modelling
○ Finding Data
○ Parsing structures
○ Cleaning
○ (Reducing)
○ Learning
○ Predicting
● Coding
○ Connect PROD data
○ Tuning training parameters
○ Create Prediction Service
○ Generate Deployable
● Deploying
○ Connect to PROD infrastructure
○ Integration with existing env
○ Allocate (schedule) resources
○ Ensure availability
Distributed Data Science
Demo
All-In Spark Notebooks
Get data: Source → Kafka
Prepare View: Kafka → Cassandra
Train Model: Cassandra → ML...
Create Server: Cassandra/ML/... → Akka HTTP
Create Client: JSON → HTML form, chart, table, ...
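The stages of the demo can be read as a chain of functions. The following is a minimal sketch in plain Scala, assuming in-memory stand-ins for every component: no Spark, Kafka, Cassandra or Akka is used, and the `Event` type, the function names and the one-weight linear model are hypothetical illustrations, not the code shown in the demo.

```scala
// Plain-Scala stand-ins for the demo stages (all names are hypothetical).
object DemoPipelineSketch {
  // One observation: a key, a single feature x and a target y.
  final case class Event(key: String, x: Double, y: Double)

  // "Get data: Source -> Kafka": parse raw "key,x,y" lines, dropping malformed ones.
  def ingest(lines: Seq[String]): Seq[Event] =
    lines.flatMap { line =>
      line.split(",").map(_.trim) match {
        case Array(k, x, y) =>
          scala.util.Try(Event(k, x.toDouble, y.toDouble)).toOption
        case _ => None
      }
    }

  // "Prepare View: Kafka -> Cassandra": group events by key, as a keyed table would.
  def prepareView(events: Seq[Event]): Map[String, Seq[Event]] =
    events.groupBy(_.key)

  // "Train Model: Cassandra -> ML": least-squares weight for y = w * x (no intercept).
  def train(view: Map[String, Seq[Event]]): Double = {
    val all = view.values.flatten.toSeq
    val num = all.map(e => e.x * e.y).sum
    val den = all.map(e => e.x * e.x).sum
    if (den == 0.0) 0.0 else num / den
  }

  // "Create Server": the trained model as a plain function that a service could wrap.
  def predictor(w: Double): Double => Double = x => w * x

  // The whole chain, from raw lines to a prediction function.
  def build(lines: Seq[String]): Double => Double =
    predictor(train(prepareView(ingest(lines))))
}
```

Usage: `DemoPipelineSketch.build(Seq("a,1,2", "b,2,4"))` returns a function that can be applied to a new feature value. Keeping each stage a separate function mirrors the slide: any stage can be swapped for a real system without touching the others.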
Bad Pipeline
Targeting Dashboard
Modelling → Coding → Deploying → Dashboard
● Data scientist focuses on the dashboard/report instead of the content
● Breaks reusability of data
● Time wasted on learning visualization instead of increasing accuracy (or velocity)
● Monolithic instead of service-oriented
Extended Pipeline
Micro Services
Modelling → Coding → Deploying → Creating Services → Integrating Application
● Creating Services
○ Abstracts access to prepared views
○ Exposes Prediction capabilities
○ Highly horizontally scalable
○ Scaling a micro services cluster → cheaper than scaling a computing cluster
● Integrating Application
○ Customer integration
○ Can be any technology
○ Can even be another pipeline!
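A service that abstracts access to prepared views and exposes prediction capabilities can be pictured as a small contract. This is a sketch under stated assumptions: the trait `PredictionService`, the class `InMemoryService` and the linear scoring are hypothetical and stand in for a real store and a trained model.

```scala
// Hypothetical service contract; not an API from the talk or from any library.
object ServiceSketch {
  // Callers see views and predictions only, never the storage or the model behind them.
  trait PredictionService {
    def view(key: String): Option[Seq[Double]]
    def predict(features: Seq[Double]): Double
  }

  // In-memory implementation: a Map plays the prepared view, a weight vector plays the model.
  final class InMemoryService(views: Map[String, Seq[Double]], weights: Seq[Double])
      extends PredictionService {

    def view(key: String): Option[Seq[Double]] = views.get(key)

    // Linear score: sum of feature * weight, rejecting inputs of the wrong size.
    def predict(features: Seq[Double]): Double = {
      require(features.length == weights.length, "feature count must match weight count")
      features.zip(weights).map { case (f, w) => f * w }.sum
    }
  }
}
```

Because the service holds no mutable state, any number of copies can run side by side, which is what makes this layer easy to scale horizontally. A consumer depends only on the trait, so it can be a web form, a chart, or another pipeline.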
Painful points
Data science is Discontiguous
➔ Highly heterogeneous environment
➔ Too many friction areas
➔ Time to market too long
Stages and owners:
● Collecting: Data Eng.
● Modelling: Scientist
● Coding: Data Eng.
● Deploying: Ops. Eng.
● Creating Services: Web Eng.
● Integrating Application: Customers
Frictions between stages:
➔ No integration
➔ Error prone
➔ Schedule delays
Result: Lack of Agility
Painful points
Context Lost in Translation
Diagram: Input Data, Data Lake, Processing, Machine Learning Model, Output Data, Application
➔ No contextual discovery
➔ No quality info
➔ No lineage (origin of the data)
➔ Link to process and input discarded
➔ Huge gap in architecture: binary and schema aware serving layer
➔ Accuracy depends on concealed quality of inputs
➔ No schema! Hard and long integration, poor satisfaction
Moreover: No backward links → no agility and no context awareness
Result: Lack of Reproducibility
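The backward links this slide says are missing can be pictured as metadata kept next to each dataset. The sketch below is an assumption-laden illustration in plain Scala: the `Dataset` record, its fields and the `lineage` function are hypothetical and do not describe how any particular tool stores lineage.

```scala
// Hypothetical lineage metadata; names and fields are illustrative only.
object LineageSketch {
  // A dataset with its schema, the process that produced it, and the names of its inputs.
  final case class Dataset(
      name: String,
      schema: Map[String, String],
      process: Option[String] = None,
      inputs: Seq[String] = Seq.empty
  )

  // Follow backward links from one dataset to every upstream dataset name.
  // Inputs that are not in the catalog are still reported; cycles are visited once.
  def lineage(catalog: Map[String, Dataset], name: String): Set[String] = {
    @scala.annotation.tailrec
    def loop(todo: List[String], seen: Set[String]): Set[String] = todo match {
      case Nil                        => seen
      case head :: tail if seen(head) => loop(tail, seen)
      case head :: tail =>
        val parents = catalog.get(head).map(_.inputs.toList).getOrElse(Nil)
        loop(parents ::: tail, seen + head)
    }
    loop(List(name), Set.empty) - name
  }
}
```

With such records, the origin of an output can be answered by a lookup instead of by asking whoever wrote the job, and the stored schema gives integrators something concrete to code against.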
Data Fellas…
Agile Data Science Toolkit
Our Approach
Agile Data Science Toolkit
Automatic Semantics Engine + Autogenerated Microservices + Integrated End-to-End Environment = Huge gain in Time and Reliability
● Components: Notebook, Computing Cluster, Access Layer, Knowledge Base
● Access Layer exposes: database, learning models, stream sources, notebooks, ...
● Knowledge Base records: data type, process, lineage, usage
● Users: Consumers, Customers
● Easy to Release, Easy to (Re)Use
● Artifacts:
○ Notebook
○ Version Control (Git)
○ Spark Job Project (SBT)
○ Service Projects (SBT)
○ Metadata (Doc, Logic, Schema, ...)
○ Catalog (Elasticsearch)
○ Deployable (Jar, Docker)
○ Repository (Nexus, Docker Repo, PyPI, Gem Server)
○ Client Projects (Node.js, Java, Scala, Python, Ruby)
○ Publishable (NPM, Jar, Pip/EasyInstall, Gem)
● Roles: scientist, data engineer, ops engineer
Agile Data Science Toolkit
In a nutshell
Data Fellas…
Announcements!!!
O’Reilly
Online seminar
Growing
We’re Hiring! http://www.data-fellas.guru/#skillsjobs
Q/A
References
http://www.data-fellas.guru/
http://spark-notebook.io/
https://github.com/andypetrella/spark-notebook/
https://gitter.im/andypetrella/spark-notebook
Come see us at Strata
-- London at least
-- We have two talks :-)
