What was a data product before the world changed and got so complex?
Why is distributed computing/data science the solution?
What problems does that add?
How to solve most of them using the right technologies (Spark Notebook, Spark, Scala, Mesos, and so on) in an accompanying framework.
1. What is a Distributed Data Science Pipeline
How with Apache Spark and Friends.
by @DataFellas
@Noootsab, 23rd Nov. ‘15 @YaJUG
2. ● (Legacy) Data Science Pipeline/Product
● What changed since then
● Distributed Data Science (today)
● Challenges
● Going beyond (productivity)
Outline
3. Data Fellas
A 6-month-old Belgian startup
Andy Petrella
@noootsab
Maths
Geospatial
Distributed Computing
@SparkNotebook
Spark/Scala trainer
Machine Learning
Xavier Tordoir
@xtordoir
Physics
Bioinformatics
Distributed Computing
Scala (& Perl)
Spark trainer
Machine Learning
4. (Legacy) Data Science Pipeline
Also called a Data Product
Static results
Lots of information lost in translation
Sounds like Waterfall
ETL look and feel
Sampling → Modelling → Tuning → Report → Interpret
5. (Legacy) Data Science Pipeline
Also called a Data Product
Single machine!
CPU-bound
Memory-bound
Or resampling because of small-ish data
Sampling → Modelling → Tuning → Report → Interpret
6. Facts
Data gets bigger or, more precisely, the number of available
sources explodes
Data gets faster (and faster): just consider
watching Netflix on 4G
Our world Today
No, it wasn’t better before
7. Consequences
Sampling: HARD (or the data will be too big...)
Report: ephemeral, restricted view
Our world Today
No, it wasn’t better before
8. Our world Today
No, it wasn’t better before
Consequences
Interpretation ⇒ Too SLOW to get real ROI out of the overall system
How to work around that?
9. Our world Today
No, it wasn’t better before
Needs are:
● Alerting systems over descriptive charts
● More accurate results (more or harder models, e.g. Deep Learning; more data)
● Constant data flow
● Online interactions under control (e.g. direct feedback)
10. Our world Today
No, it wasn’t better before
So, we need...
Distributed Systems
11. Distributed Data Science
System/Platform/SDK/Pipeline/Product/… whatever you call it
“Create” Cluster
Find available sources (context, content, quality, semantic, …)
Connect to sources (structure, schema/types, …)
Create distributed data pipeline/Model
Tune accuracy
Tune performance
Write results to Sinks
Access Layer
User Access
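The steps above can be sketched end to end. The following is a minimal, hypothetical sketch in plain Scala (no actual cluster, and all names are made up): each stage is a function, and the pipeline is just their composition, which is essentially the shape a Spark job builds, lazily, over distributed RDDs/DataFrames.

```scala
// Hypothetical pipeline sketch: each stage is a plain function, and
// the full pipeline is their composition. A real Spark job has the
// same shape, but over distributed RDDs/DataFrames.
object PipelineSketch {
  type Record = Map[String, Double]

  // Connect to a source: here, just generate some records.
  def readSource(): Seq[Record] =
    (1 to 100).map(i => Map("x" -> i.toDouble))

  // Create the model: a trivial transformation standing in for
  // training/scoring.
  def model(records: Seq[Record]): Seq[Record] =
    records.map(r => r + ("score" -> r("x") * 2.0))

  // Write to a sink: here, fold down to a single summary value.
  def writeSink(records: Seq[Record]): Double =
    records.map(_("score")).sum

  // The pipeline is the composition of its stages.
  def run(): Double = writeSink(model(readSource()))
}
```

Tuning accuracy and performance then means iterating on the `model` stage and on how the stages are distributed, without changing the overall shape.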
16. Distributed Data Science
System/Platform/SDK/Pipeline/Product/… whatever you call it
YO!
Aren’t we talking about
“Big” Data? Fast Data?
So can the results really (all) be neither big nor fast?
Actually, results are themselves becoming
“Big” Data! Fast Data!
17. Distributed Data Science
System/Platform/SDK/Pipeline/Product/… whatever you call it
Access Layer
How have we accessed data since the ’90s? Remember SOA?
→ SERVICES!
Nowadays, we’re talking about microservices.
Here we are: one service for one result.
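"One service for one result" can be very small indeed. Below is a hypothetical sketch using only the JDK's built-in HTTP server from Scala; the endpoint path and the JSON payload are made up for illustration, standing in for the latest materialized result of a pipeline.

```scala
import com.sun.net.httpserver.{HttpExchange, HttpServer}
import java.net.InetSocketAddress

// Minimal sketch of "one service for one result": a single endpoint
// serving the latest materialized result of a pipeline.
// Endpoint path and JSON payload are hypothetical.
object ResultService {
  def start(port: Int): HttpServer = {
    val server = HttpServer.create(new InetSocketAddress(port), 0)
    server.createContext("/result", (exchange: HttpExchange) => {
      val body = """{"model":"churn-v1","score":0.87}""".getBytes("UTF-8")
      exchange.getResponseHeaders.add("Content-Type", "application/json")
      exchange.sendResponseHeaders(200, body.length.toLong)
      exchange.getResponseBody.write(body)
      exchange.close()
    })
    server.start()
    server
  }
}
```

In practice such a service would read from the sink the pipeline writes to, rather than return a constant.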
18. Distributed Data Science
System/Platform/SDK/Pipeline/Product/… whatever you call it
User Access
C’mon, charts/tables cannot be the
only views offered to customers/clients,
right?
We need to open the capabilities to UIs
(dashboards), connectors (third parties),
other services (“SOA”) …
…
OTHER Pipelines !!!
19. What about Productivity?
Streamlining the development lifecycle is most welcome
Roles involved at each step:
“Create” Cluster: ops
Find available sources (context, content, quality, semantic, …): data
Connect to sources (structure, schema/types, …): ops, data
Create distributed data pipeline/Model: sci
Tune accuracy: sci, ops
Tune performance: sci
Write results to Sinks: ops, data
Access Layer: web, ops, data
User Access: web, ops, data, sci
22. What about Productivity?
Streamlining the development lifecycle is most welcome
➔ Longer production line
➔ More constraints (resource sharing, time, …)
➔ More people
➔ More skills
Overlook these points and, sooner or later, you’ll be kicked
So, how to have:
● results coming fast enough whilst keeping the accuracy level high?
● responsiveness to external/unpredictable events?
27. What about Productivity?
Streamlining the development lifecycle is most welcome
At Data Fellas, we think that we need Interactivity and Reactivity to
tighten the frontiers (within the team and in time).
Hence, Data Fellas
● extends the Spark Notebook (interactivity)
● builds the Shar3 product around it (integrated Reactivity)
28. Concepts of Data Fellas’ Shar3
Shareable and Streamlined Data Science
Analysis
Production
Distribution
Rendering
Discovery
Catalog
Project
Generator
Micro Service /
Binary format
Schema for output
Metadata
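A "schema for output" here means something like an Avro schema describing the records a pipeline publishes. A hypothetical example for a scored record (all field names invented for illustration):

```json
{
  "type": "record",
  "name": "ScoredCustomer",
  "namespace": "example.shar3",
  "fields": [
    {"name": "customerId", "type": "string"},
    {"name": "score",      "type": "double"},
    {"name": "scoredAt",   "type": "long"}
  ]
}
```

Such a schema is what lets the generated microservice, the binary format, and downstream pipelines all agree on the shape of a result.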
29. Using Shar3
yeah o/
Let’s take this example, built with some buddies from:
Datastax: Joel Jacobson @joeljacobson,
Simon Ambridge @stratman1958
Mesosphere: Michael Hausenblas @mhausenblas
Typesafe: Iulian Dragos @jaguarul
Data Fellas: Xavier Tordoir @xtordoir
(and me)
40. Using Shar3
yeah o/
So we have this information available:
● notebook’s markdown text
● notebook’s code/model
● data sources
● Output/sinks
● Output/services
● Avro schema
Shouldn’t it all be reused?
47. Using Shar3
yeah o/
Not what I need? Let’s ADAPT!
48. Poke us on
@DataFellas
@Shar3_Fellas
@SparkNotebook
@Xtordoir & @Noootsab
Now @TypeSafe: http://t.co/o1Bt6dQtgH
If you wanna learn more about the different tools… Join us @ O’Reilly
Follow up Soon on http://NoETL.org (HI5 to @ChiefScientist for that)
That’s all folks
Thanks for listening/staying