SlideShare une entreprise Scribd logo
1  sur  48
Télécharger pour lire hors ligne
What is a Distributed Data Science Pipeline
How with Apache Spark and Friends.
by @DataFellas
@Noootsab, 23th Nov. ‘15 @YaJUG
● (Legacy) Data Science Pipeline/Product
● What changed since then
● Distributed Data Science (today)
● Challenges
● Going beyond (productivity)
Outline
Data Fellas
6 months old Belgian Startup
Andy Petrella
@noootsab
Maths
Geospatial
Distributed Computing
@SparkNotebook
Spark/Scala trainer
Machine Learning
Xavier Tordoir
@xtordoir
Physics
Bioinformatics
Distributed Computing
Scala (& Perl)
Spark trainer
Machine Learning
(Legacy) Data Science Pipeline
Or, so called, Data Product
Static Results
Lot of information lost in translation
Sounds like Waterfall
ETL look and feel
Sampling Modelling Tuning Report Interprete
(Legacy) Data Science Pipeline
Or, so called, Data Product
Mono machine!
CPU bounds
Memory bounds
Or resampling because small-ish data
Sampling Modelling Tuning Report Interprete
Facts
Data gets bigger or, precisely, the amount of available
sources explodes
Data gets faster (and faster) - - only even consider:
watching netflix on 4G ôÖ
Our world Today
No, it wasn’t better before
Consequences
HARD (or will be too big...)
Ephemeral
Restricted View
Sampling
Report
Our world Today
No, it wasn’t better before
Interpretation
⇒ Too SLOW to get real ROI out of the overall system
How to work around that?
Our world Today
No, it wasn’t better before
Consequences
Our world Today
No, it wasn’t better before
Alerting system over descriptive charts
More accurate results
more or harder models (e.g. Deep Learning)
More data
Constant data flow
Online interactions under control (e.g. direct feedback)
Needs are
Our world Today
No, it wasn’t better before
Distributed Systems
So, we need...
Distributed Data Science
System/Platform/SDK/Pipeline/Product/… whatever you call it
“Create” Cluster
Find available sources (context, content, quality, semantic, …)
Connect to sources (structure, schema/types, …)
Create distributed data pipeline/Model
Tune accuracy
Tune performances
Write results to Sinks
Access Layer
User Access
Distributed Data Science
System/Platform/SDK/Pipeline/Product/… whatever you call it
“Create” Cluster
Find available sources (context, content, quality, semantic, …)
Connect to sources (structure, schema/types, …)
Create distributed data pipeline/Model
Tune accuracy
Tune performances
Write results to Sinks
Access Layer
User Access
Distributed Data Science
System/Platform/SDK/Pipeline/Product/… whatever you call it
“Create” Cluster
Find available sources (context, content, quality, semantic, …)
Connect to sources (structure, schema/types, …)
Create distributed data pipeline/Model
Tune accuracy
Tune performances
Write results to Sinks
Access Layer
User Access
Distributed Data Science
System/Platform/SDK/Pipeline/Product/… whatever you call it
“Create” Cluster
Find available sources (context, content, quality, semantic, …)
Connect to sources (structure, schema/types, …)
Create distributed data pipeline/Model
Tune accuracy
Tune performances
Write results to Sinks
Access Layer
User Access
Distributed Data Science
System/Platform/SDK/Pipeline/Product/… whatever you call it
“Create” Cluster
Find available sources (context, content, quality, semantic, …)
Connect to sources (structure, schema/types, …)
Create distributed data pipeline/Model
Tune accuracy
Tune performances
Write results to Sinks
Access Layer
User Access
Distributed Data Science
System/Platform/SDK/Pipeline/Product/… whatever you call it
“Create” Cluster
Find available sources (context, content, quality, semantic, …)
Connect to sources (structure, schema/types, …)
Create distributed data pipeline/Model
Tune accuracy
Tune performances
Write results to Sinks
Access Layer
User Access
YO!
Aren’t we talking about
“Big” Data ?
Fast Data ?
So could really (all) results being
neither big nor fast?
Actually, Results are becoming
themselves
“Big” Data !
Fast Data !
Distributed Data Science
System/Platform/SDK/Pipeline/Product/… whatever you call it
“Create” Cluster
Find available sources (context, content, quality, semantic, …)
Connect to sources (structure, schema/types, …)
Create distributed data pipeline/Model
Tune accuracy
Tune performances
Write results to Sinks
Access Layer
User Access
how do we access data since 90’s? remember SOA?
→ SERVICES!
Nowadays, we’re talking about micro services.
Here we are, one service for one result.
Distributed Data Science
System/Platform/SDK/Pipeline/Product/… whatever you call it
“Create” Cluster
Find available sources (context, content, quality, semantic, …)
Connect to sources (structure, schema/types, …)
Create distributed data pipeline/Model
Tune accuracy
Tune performances
Write results to Sinks
Access Layer
User Access
C’mon, charts/Tables Cannot only be the
only views offered to customers/clients
right?
We need to open the capabilities to UI
(dashboard), connectors (third parties),
other services (“SOA”) …
…
OTHER Pipelines !!!
What about Productivity?
Streamlining development lifecycle most welcome
“Create” Cluster
Find available sources (context, content, quality, semantic, …)
Connect to sources (structure, schema/types, …)
Create distributed data pipeline/Model
Tune accuracy
Tune performances
Write results to Sinks
Access Layer
User Access
ops
data
ops data
sci
sci ops
sci
ops data
web ops data
web ops data sci
What about Productivity?
Streamlining development lifecycle most welcome
“Create” Cluster
Find available sources (context, content, quality, semantic, …)
Connect to sources (structure, schema/types, …)
Create distributed data pipeline/Model
Tune accuracy
Tune performances
Write results to Sinks
Access Layer
User Access
ops
data
ops
sci
sci ops
sci
ops data
web ops data
web ops data
data
sci
What about Productivity?
Streamlining development lifecycle most welcome
“Create” Cluster
Find available sources (context, content, quality, semantic, …)
Connect to sources (structure, schema/types, …)
Create distributed data pipeline/Model
Tune accuracy
Tune performances
Write results to Sinks
Access Layer
User Access
ops
data
ops data
sci
sci ops
sci
ops data
web ops data
web ops data sci
What about Productivity?
Streamlining development lifecycle most welcome
➔ Longer production line
➔ More constraints (resources sharing, time, …)
➔ More people
➔ More skills
Overlooking these points and you’ll be soon or sooner
So, how to have:
● results coming fast enough whilst keeping accuracy level high?
● Responsivity to external/unpredictable events?
WHEN...
kicked
Warning
Team Fight: seen by members
Warning
Team Fight: seen by managers
Warning
Team Fight: seen by employers
Warning
Team Fight: seen by customers
What about Productivity?
Streamlining development lifecycle most welcome
At Data Fellas, we think that we need Interactivity and Reactivity to
tighten the frontiers (within team and in time).
Hence, Data Fellas
● extends the Spark Notebook (interactivity)
● builds the Shar3 product arounds it (Integrated Reactivity)
Concepts of Data Fellas’ Shar3
Shareable and Streamlined Data Science
Analysis
Production
DistributionRendering
Discovery
Catalog
Project
Generator
Micro Service /
Binary format
Schema for output
Metadata
Using Shar3
yeah o/
Let’s take this example where some buddies from
Datastax Joel Jacobson @joeljacobson
Simon Ambridge @stratman1958
Mesosphere Michael Hausenblas @mhausenblas
Typesafe Iulian Dragos @jaguarul
Data Fellas Xavier Tordoir @xtordoir
(and me)
Using Shar3
yeah o/
Let’s take this example where some buddies from
Datastax Joel Jacobson @joeljacobson
Simon Ambridge @stratman1958
Mesosphere Michael Hausenblas @mhausenblas
Typesafe Iulian Dragos @jaguarul
Data Fellas Xavier Tordoir @xtordoir
(and me)
Using Shar3
yeah o/
Using Shar3
yeah o/
Using Shar3
yeah o/
Using Shar3
yeah o/
Using Shar3
yeah o/
What do we need to do now?
● Deploy
● connect the dots
● track
● scale
BoTh the Jobs and the services
Using Shar3
yeah o/
From notebook
to SBT project
to Docker
to Marathon
SNB
SBT/JAR
Docker
marathon
Using Shar3
yeah o/
From Notebook
● to output
● to Avro
SNB
Using Shar3
yeah o/
From Notebook
● to Avro
● to service
● to SBT
● to Docker
● to Marathon
SNB
SBT/JAR Docker
marathon
Using Shar3
yeah o/
From Notebook
● to Avro
● to Tableau
● or QlikView
● or D3.JS
● or …
SNB
Using Shar3
yeah o/
So we have these information available:
● notebook’s markdown text
● notebook’s code/model
● data sources
● Output/sinks
● Output/services
● Avro schema
Shouldn’t them all be
reused???
Using Shar3
yeah o/
Variant Analysis
There is a service!
Using Shar3
yeah o/
Let’s use it…
Using Shar3
yeah o/
What was the process?
Using Shar3
yeah o/
Fine and the output is in C*
Using Shar3
yeah o/
Let’s check what’s in-there
Using Shar3
yeah o/
not what I need, let’s ADAPT
Using Shar3
yeah o/
Poke us on
@DataFellas
@Shar3_Fellas
@SparkNotebook
@Xtordoir & @Noootsab
Now @TypeSafe: http://t.co/o1Bt6dQtgH
If you wanna learn more about the different tools… Join us @ O’Reilly
Follow up Soon on http://NoETL.org (HI5 to @ChiefScientist for that)
That’s all folks
Thanks for listening/staying

Contenu connexe

Tendances

Scala: the unpredicted lingua franca for data science
Scala: the unpredicted lingua franca  for data scienceScala: the unpredicted lingua franca  for data science
Scala: the unpredicted lingua franca for data scienceAndy Petrella
 
How to Integrate Spark MLlib and Apache Solr to Build Real-Time Entity Type R...
How to Integrate Spark MLlib and Apache Solr to Build Real-Time Entity Type R...How to Integrate Spark MLlib and Apache Solr to Build Real-Time Entity Type R...
How to Integrate Spark MLlib and Apache Solr to Build Real-Time Entity Type R...Spark Summit
 
Seattle Scalability Mahout
Seattle Scalability MahoutSeattle Scalability Mahout
Seattle Scalability MahoutJake Mannix
 
EDHREC @ Data Science MD
EDHREC @ Data Science MDEDHREC @ Data Science MD
EDHREC @ Data Science MDDonald Miner
 
AMP Camp 5 Intro
AMP Camp 5 IntroAMP Camp 5 Intro
AMP Camp 5 Introjeykottalam
 
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...Databricks
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonPaco Nathan
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learningPaco Nathan
 
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache SparkScalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache SparkEvan Casey
 
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Rodney Joyce
 
Vital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent AppsVital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent AppsVital.AI
 
Applied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R WorkshopApplied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R WorkshopAvkash Chauhan
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusPaco Nathan
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment Databricks
 
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...Spark Summit
 
Data Analytics with R and SQL Server
Data Analytics with R and SQL ServerData Analytics with R and SQL Server
Data Analytics with R and SQL ServerStéphane Fréchette
 
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...Spark Summit
 
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...Databricks
 
Analyzing Data With Python
Analyzing Data With PythonAnalyzing Data With Python
Analyzing Data With PythonSarah Guido
 
Spark MLlib - Training Material
Spark MLlib - Training Material Spark MLlib - Training Material
Spark MLlib - Training Material Bryan Yang
 

Tendances (20)

Scala: the unpredicted lingua franca for data science
Scala: the unpredicted lingua franca  for data scienceScala: the unpredicted lingua franca  for data science
Scala: the unpredicted lingua franca for data science
 
How to Integrate Spark MLlib and Apache Solr to Build Real-Time Entity Type R...
How to Integrate Spark MLlib and Apache Solr to Build Real-Time Entity Type R...How to Integrate Spark MLlib and Apache Solr to Build Real-Time Entity Type R...
How to Integrate Spark MLlib and Apache Solr to Build Real-Time Entity Type R...
 
Seattle Scalability Mahout
Seattle Scalability MahoutSeattle Scalability Mahout
Seattle Scalability Mahout
 
EDHREC @ Data Science MD
EDHREC @ Data Science MDEDHREC @ Data Science MD
EDHREC @ Data Science MD
 
AMP Camp 5 Intro
AMP Camp 5 IntroAMP Camp 5 Intro
AMP Camp 5 Intro
 
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in Python
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
 
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache SparkScalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
 
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
 
Vital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent AppsVital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent Apps
 
Applied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R WorkshopApplied Machine learning using H2O, python and R Workshop
Applied Machine learning using H2O, python and R Workshop
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and Erasmus
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment
 
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
 
Data Analytics with R and SQL Server
Data Analytics with R and SQL ServerData Analytics with R and SQL Server
Data Analytics with R and SQL Server
 
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
 
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...
 
Analyzing Data With Python
Analyzing Data With PythonAnalyzing Data With Python
Analyzing Data With Python
 
Spark MLlib - Training Material
Spark MLlib - Training Material Spark MLlib - Training Material
Spark MLlib - Training Material
 

En vedette

Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics d...
Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics d...Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics d...
Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics d...Dataconomy Media
 
Rakuten LeoFs - distributed file system
Rakuten LeoFs - distributed file systemRakuten LeoFs - distributed file system
Rakuten LeoFs - distributed file systemRakuten Group, Inc.
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerMichael Spector
 
Architecture of a Kafka camus infrastructure
Architecture of a Kafka camus infrastructureArchitecture of a Kafka camus infrastructure
Architecture of a Kafka camus infrastructuremattlieber
 
Aggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataAggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataRostislav Pashuto
 
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeData Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeSpark Summit
 
Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedInBuilding a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedInAmy W. Tang
 

En vedette (8)

Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics d...
Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics d...Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics d...
Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics d...
 
Rakuten LeoFs - distributed file system
Rakuten LeoFs - distributed file systemRakuten LeoFs - distributed file system
Rakuten LeoFs - distributed file system
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at Appsflyer
 
Architecture of a Kafka camus infrastructure
Architecture of a Kafka camus infrastructureArchitecture of a Kafka camus infrastructure
Architecture of a Kafka camus infrastructure
 
Aggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataAggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of data
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeData Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
 
Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedInBuilding a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn
 

Similaire à What is a distributed data science pipeline. how with apache spark and friends.

A Look into the Apache OODT Ecosystem
A Look into the Apache OODT EcosystemA Look into the Apache OODT Ecosystem
A Look into the Apache OODT EcosystemChris Mattmann
 
Off-Label Data Mesh: A Prescription for Healthier Data
Off-Label Data Mesh: A Prescription for Healthier DataOff-Label Data Mesh: A Prescription for Healthier Data
Off-Label Data Mesh: A Prescription for Healthier DataHostedbyConfluent
 
Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_VeriticalsPeyman Mohajerian
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceeRic Choo
 
Polyglot metadata for Hadoop
Polyglot metadata for HadoopPolyglot metadata for Hadoop
Polyglot metadata for HadoopJim Dowling
 
Windows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldWindows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldRob Gillen
 
Integration Patterns for Big Data Applications
Integration Patterns for Big Data ApplicationsIntegration Patterns for Big Data Applications
Integration Patterns for Big Data ApplicationsMichael Häusler
 
Natural Language Processing & Semantic Models in an Imperfect World
Natural Language Processing & Semantic Modelsin an Imperfect WorldNatural Language Processing & Semantic Modelsin an Imperfect World
Natural Language Processing & Semantic Models in an Imperfect WorldVital.AI
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
 
Deep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDBDeep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDBArangoDB Database
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...jaxLondonConference
 
Querying datasets on the Web with high availability
Querying datasets on the Web with high availabilityQuerying datasets on the Web with high availability
Querying datasets on the Web with high availabilityRuben Verborgh
 
A machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesA machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesDataWorks Summit
 
Fishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data LakeFishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data LakeArangoDB Database
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)James Serra
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data ScientistsRichard Garris
 
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)Jeff Magnusson
 
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & AlluxioAlluxio, Inc.
 

Similaire à What is a distributed data science pipeline. how with apache spark and friends. (20)

A Look into the Apache OODT Ecosystem
A Look into the Apache OODT EcosystemA Look into the Apache OODT Ecosystem
A Look into the Apache OODT Ecosystem
 
Off-Label Data Mesh: A Prescription for Healthier Data
Off-Label Data Mesh: A Prescription for Healthier DataOff-Label Data Mesh: A Prescription for Healthier Data
Off-Label Data Mesh: A Prescription for Healthier Data
 
Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_Veriticals
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data Science
 
Polyglot metadata for Hadoop
Polyglot metadata for HadoopPolyglot metadata for Hadoop
Polyglot metadata for Hadoop
 
Windows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldWindows Azure: Lessons From The Field
Windows Azure: Lessons From The Field
 
Integration Patterns for Big Data Applications
Integration Patterns for Big Data ApplicationsIntegration Patterns for Big Data Applications
Integration Patterns for Big Data Applications
 
Natural Language Processing & Semantic Models in an Imperfect World
Natural Language Processing & Semantic Modelsin an Imperfect WorldNatural Language Processing & Semantic Modelsin an Imperfect World
Natural Language Processing & Semantic Models in an Imperfect World
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
Deep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDBDeep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDB
 
Big data apache spark + scala
Big data   apache spark + scalaBig data   apache spark + scala
Big data apache spark + scala
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...
 
Querying datasets on the Web with high availability
Querying datasets on the Web with high availabilityQuerying datasets on the Web with high availability
Querying datasets on the Web with high availability
 
A machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesA machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companies
 
Fishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data LakeFishing Graphs in a Hadoop Data Lake
Fishing Graphs in a Hadoop Data Lake
 
ETL 2.0 Data Engineering for developers
ETL 2.0 Data Engineering for developersETL 2.0 Data Engineering for developers
ETL 2.0 Data Engineering for developers
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data Scientists
 
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
 
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
 

Plus de Andy Petrella

Data Observability Best Pracices
Data Observability Best PracicesData Observability Best Pracices
Data Observability Best PracicesAndy Petrella
 
How to Build a Global Data Mapping
How to Build a Global Data MappingHow to Build a Global Data Mapping
How to Build a Global Data MappingAndy Petrella
 
Interactive notebooks
Interactive notebooksInteractive notebooks
Interactive notebooksAndy Petrella
 
Governance compliance
Governance   complianceGovernance   compliance
Governance complianceAndy Petrella
 
Data science governance and GDPR
Data science governance and GDPRData science governance and GDPR
Data science governance and GDPRAndy Petrella
 
Data science governance : what and how
Data science governance : what and howData science governance : what and how
Data science governance : what and howAndy Petrella
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleAndy Petrella
 
Spark meetup london share and analyse genomic data at scale with spark, adam...
Spark meetup london  share and analyse genomic data at scale with spark, adam...Spark meetup london  share and analyse genomic data at scale with spark, adam...
Spark meetup london share and analyse genomic data at scale with spark, adam...Andy Petrella
 
Distributed machine learning 101 using apache spark from the browser
Distributed machine learning 101 using apache spark from the browserDistributed machine learning 101 using apache spark from the browser
Distributed machine learning 101 using apache spark from the browserAndy Petrella
 
Liège créative: Open Science
Liège créative: Open ScienceLiège créative: Open Science
Liège créative: Open ScienceAndy Petrella
 
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale
BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at ScaleBioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at ScaleAndy Petrella
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkAndy Petrella
 
Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaAndy Petrella
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphXAndy Petrella
 
Quanti-litative Revolution in GIS
Quanti-litative Revolution in GISQuanti-litative Revolution in GIS
Quanti-litative Revolution in GISAndy Petrella
 
Scala and-fp-in-big-data
Scala and-fp-in-big-dataScala and-fp-in-big-data
Scala and-fp-in-big-dataAndy Petrella
 
Software Crafted And Libraries Available
Software Crafted And Libraries AvailableSoftware Crafted And Libraries Available
Software Crafted And Libraries AvailableAndy Petrella
 
Fp and entrepreneurship
Fp and entrepreneurshipFp and entrepreneurship
Fp and entrepreneurshipAndy Petrella
 
BigData Week 2014 Belgium: Velocity
BigData Week 2014 Belgium: VelocityBigData Week 2014 Belgium: Velocity
BigData Week 2014 Belgium: VelocityAndy Petrella
 

Plus de Andy Petrella (20)

Data Observability Best Pracices
Data Observability Best PracicesData Observability Best Pracices
Data Observability Best Pracices
 
How to Build a Global Data Mapping
How to Build a Global Data MappingHow to Build a Global Data Mapping
How to Build a Global Data Mapping
 
Interactive notebooks
Interactive notebooksInteractive notebooks
Interactive notebooks
 
Governance compliance
Governance   complianceGovernance   compliance
Governance compliance
 
Data science governance and GDPR
Data science governance and GDPRData science governance and GDPR
Data science governance and GDPR
 
Data science governance : what and how
Data science governance : what and howData science governance : what and how
Data science governance : what and how
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scale
 
Spark meetup london share and analyse genomic data at scale with spark, adam...
Spark meetup london  share and analyse genomic data at scale with spark, adam...Spark meetup london  share and analyse genomic data at scale with spark, adam...
Spark meetup london share and analyse genomic data at scale with spark, adam...
 
Distributed machine learning 101 using apache spark from the browser
Distributed machine learning 101 using apache spark from the browserDistributed machine learning 101 using apache spark from the browser
Distributed machine learning 101 using apache spark from the browser
 
Liège créative: Open Science
Liège créative: Open ScienceLiège créative: Open Science
Liège créative: Open Science
 
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale
BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at ScaleBioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
 
Spark devoxx2014
Spark devoxx2014Spark devoxx2014
Spark devoxx2014
 
Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and Scala
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphX
 
Quanti-litative Revolution in GIS
Quanti-litative Revolution in GISQuanti-litative Revolution in GIS
Quanti-litative Revolution in GIS
 
Scala and-fp-in-big-data
Scala and-fp-in-big-dataScala and-fp-in-big-data
Scala and-fp-in-big-data
 
Software Crafted And Libraries Available
Software Crafted And Libraries AvailableSoftware Crafted And Libraries Available
Software Crafted And Libraries Available
 
Fp and entrepreneurship
Fp and entrepreneurshipFp and entrepreneurship
Fp and entrepreneurship
 
BigData Week 2014 Belgium: Velocity
BigData Week 2014 Belgium: VelocityBigData Week 2014 Belgium: Velocity
BigData Week 2014 Belgium: Velocity
 

Dernier

Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRDelhi Call girls
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptxRajatChauhan518211
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfSumit Kumar yadav
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...ssuser79fe74
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPirithiRaju
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and ClassificationsAreesha Ahmad
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencySheetal Arora
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLkantirani197
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bSérgio Sacani
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...chandars293
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICEayushi9330
 
Creating and Analyzing Definitive Screening Designs
Creating and Analyzing Definitive Screening DesignsCreating and Analyzing Definitive Screening Designs
Creating and Analyzing Definitive Screening DesignsNurulAfiqah307317
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfrohankumarsinghrore1
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 

Dernier (20)

Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
Creating and Analyzing Definitive Screening Designs
Creating and Analyzing Definitive Screening DesignsCreating and Analyzing Definitive Screening Designs
Creating and Analyzing Definitive Screening Designs
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 

What is a distributed data science pipeline. how with apache spark and friends.

  • 1. What is a Distributed Data Science Pipeline How with Apache Spark and Friends. by @DataFellas @Noootsab, 23th Nov. ‘15 @YaJUG
  • 2. ● (Legacy) Data Science Pipeline/Product ● What changed since then ● Distributed Data Science (today) ● Challenges ● Going beyond (productivity) Outline
  • 3. Data Fellas 6 months old Belgian Startup Andy Petrella @noootsab Maths Geospatial Distributed Computing @SparkNotebook Spark/Scala trainer Machine Learning Xavier Tordoir @xtordoir Physics Bioinformatics Distributed Computing Scala (& Perl) Spark trainer Machine Learning
  • 4. (Legacy) Data Science Pipeline Or, so called, Data Product Static Results Lot of information lost in translation Sounds like Waterfall ETL look and feel Sampling Modelling Tuning Report Interprete
  • 5. (Legacy) Data Science Pipeline Or, so called, Data Product Mono machine! CPU bounds Memory bounds Or resampling because small-ish data Sampling Modelling Tuning Report Interprete
  • 6. Facts Data gets bigger or, precisely, the amount of available sources explodes Data gets faster (and faster) - - only even consider: watching netflix on 4G ôÖ Our world Today No, it wasn’t better before
  • 7. Consequences HARD (or will be too big...) Ephemeral Restricted View Sampling Report Our world Today No, it wasn’t better before
  • 8. Interpretation ⇒ Too SLOW to get real ROI out of the overall system How to work around that? Our world Today No, it wasn’t better before Consequences
  • 9. Our world Today No, it wasn’t better before Alerting system over descriptive charts More accurate results more or harder models (e.g. Deep Learning) More data Constant data flow Online interactions under control (e.g. direct feedback) Needs are
  • 10. Our world Today No, it wasn’t better before Distributed Systems So, we need...
  • 11. Distributed Data Science System/Platform/SDK/Pipeline/Product/… whatever you call it “Create” Cluster Find available sources (context, content, quality, semantic, …) Connect to sources (structure, schema/types, …) Create distributed data pipeline/Model Tune accuracy Tune performances Write results to Sinks Access Layer User Access
  • 12. Distributed Data Science System/Platform/SDK/Pipeline/Product/… whatever you call it “Create” Cluster Find available sources (context, content, quality, semantic, …) Connect to sources (structure, schema/types, …) Create distributed data pipeline/Model Tune accuracy Tune performances Write results to Sinks Access Layer User Access
  • 13. Distributed Data Science System/Platform/SDK/Pipeline/Product/… whatever you call it “Create” Cluster Find available sources (context, content, quality, semantic, …) Connect to sources (structure, schema/types, …) Create distributed data pipeline/Model Tune accuracy Tune performances Write results to Sinks Access Layer User Access
  • 14. Distributed Data Science System/Platform/SDK/Pipeline/Product/… whatever you call it “Create” Cluster Find available sources (context, content, quality, semantic, …) Connect to sources (structure, schema/types, …) Create distributed data pipeline/Model Tune accuracy Tune performances Write results to Sinks Access Layer User Access
  • 15. Distributed Data Science System/Platform/SDK/Pipeline/Product/… whatever you call it “Create” Cluster Find available sources (context, content, quality, semantic, …) Connect to sources (structure, schema/types, …) Create distributed data pipeline/Model Tune accuracy Tune performances Write results to Sinks Access Layer User Access
  • 16. Distributed Data Science System/Platform/SDK/Pipeline/Product/… whatever you call it “Create” Cluster Find available sources (context, content, quality, semantic, …) Connect to sources (structure, schema/types, …) Create distributed data pipeline/Model Tune accuracy Tune performances Write results to Sinks Access Layer User Access YO! Aren’t we talking about “Big” Data ? Fast Data ? So could really (all) results being neither big nor fast? Actually, Results are becoming themselves “Big” Data ! Fast Data !
  • 17. Distributed Data Science System/Platform/SDK/Pipeline/Product/… whatever you call it “Create” Cluster Find available sources (context, content, quality, semantic, …) Connect to sources (structure, schema/types, …) Create distributed data pipeline/Model Tune accuracy Tune performances Write results to Sinks Access Layer User Access how do we access data since 90’s? remember SOA? → SERVICES! Nowadays, we’re talking about micro services. Here we are, one service for one result.
  • 18. Distributed Data Science System/Platform/SDK/Pipeline/Product/… whatever you call it “Create” Cluster Find available sources (context, content, quality, semantic, …) Connect to sources (structure, schema/types, …) Create distributed data pipeline/Model Tune accuracy Tune performances Write results to Sinks Access Layer User Access C’mon, charts/Tables Cannot only be the only views offered to customers/clients right? We need to open the capabilities to UI (dashboard), connectors (third parties), other services (“SOA”) … … OTHER Pipelines !!!
  • 19. What about Productivity? Streamlining development lifecycle most welcome “Create” Cluster Find available sources (context, content, quality, semantic, …) Connect to sources (structure, schema/types, …) Create distributed data pipeline/Model Tune accuracy Tune performances Write results to Sinks Access Layer User Access ops data ops data sci sci ops sci ops data web ops data web ops data sci
  • 20. What about Productivity? Streamlining development lifecycle most welcome “Create” Cluster Find available sources (context, content, quality, semantic, …) Connect to sources (structure, schema/types, …) Create distributed data pipeline/Model Tune accuracy Tune performances Write results to Sinks Access Layer User Access ops data ops sci sci ops sci ops data web ops data web ops data data sci
  • 21. What about Productivity? Streamlining development lifecycle most welcome “Create” Cluster Find available sources (context, content, quality, semantic, …) Connect to sources (structure, schema/types, …) Create distributed data pipeline/Model Tune accuracy Tune performances Write results to Sinks Access Layer User Access ops data ops data sci sci ops sci ops data web ops data web ops data sci
  • 22. What about Productivity? Streamlining development lifecycle most welcome ➔ Longer production line ➔ More constraints (resources sharing, time, …) ➔ More people ➔ More skills Overlooking these points and you’ll be soon or sooner So, how to have: ● results coming fast enough whilst keeping accuracy level high? ● Responsivity to external/unpredictable events? WHEN... kicked
  • 25. Warning Team Fight: seen by employers
  • 26. Warning Team Fight: seen by customers
  • 27. What about Productivity? Streamlining development lifecycle most welcome At Data Fellas, we think that we need Interactivity and Reactivity to tighten the frontiers (within team and in time). Hence, Data Fellas ● extends the Spark Notebook (interactivity) ● builds the Shar3 product arounds it (Integrated Reactivity)
  • 28. Concepts of Data Fellas’ Shar3 Shareable and Streamlined Data Science Analysis Production DistributionRendering Discovery Catalog Project Generator Micro Service / Binary format Schema for output Metadata
  • 29. Using Shar3 yeah o/ Let’s take this example where some buddies from Datastax Joel Jacobson @joeljacobson Simon Ambridge @stratman1958 Mesosphere Michael Hausenblas @mhausenblas Typesafe Iulian Dragos @jaguarul Data Fellas Xavier Tordoir @xtordoir (and me)
  • 30. Using Shar3 yeah o/ Let’s take this example where some buddies from Datastax Joel Jacobson @joeljacobson Simon Ambridge @stratman1958 Mesosphere Michael Hausenblas @mhausenblas Typesafe Iulian Dragos @jaguarul Data Fellas Xavier Tordoir @xtordoir (and me)
  • 35. Using Shar3 yeah o/ What do we need to do now? ● Deploy ● connect the dots ● track ● scale BoTh the Jobs and the services
  • 36. Using Shar3 yeah o/ From notebook to SBT project to Docker to Marathon SNB SBT/JAR Docker marathon
  • 37. Using Shar3 yeah o/ From Notebook ● to output ● to Avro SNB
  • 38. Using Shar3 yeah o/ From Notebook ● to Avro ● to service ● to SBT ● to Docker ● to Marathon SNB SBT/JAR Docker marathon
  • 39. Using Shar3 yeah o/ From Notebook ● to Avro ● to Tableau ● or QlikView ● or D3.JS ● or … SNB
  • 40. Using Shar3 yeah o/ So we have these information available: ● notebook’s markdown text ● notebook’s code/model ● data sources ● Output/sinks ● Output/services ● Avro schema Shouldn’t them all be reused???
  • 42. There is a service! Using Shar3 yeah o/
  • 43. Let’s use it… Using Shar3 yeah o/
  • 44. What was the process? Using Shar3 yeah o/
  • 45. Fine and the output is in C* Using Shar3 yeah o/
  • 46. Let’s check what’s in-there Using Shar3 yeah o/
  • 47. not what I need, let’s ADAPT Using Shar3 yeah o/
  • 48. Poke us on @DataFellas @Shar3_Fellas @SparkNotebook @Xtordoir & @Noootsab Now @TypeSafe: http://t.co/o1Bt6dQtgH If you wanna learn more about the different tools… Join us @ O’Reilly Follow up Soon on http://NoETL.org (HI5 to @ChiefScientist for that) That’s all folks Thanks for listening/staying