More Data, More Problems:
Evolving big data machine learning pipelines with Spark & Luigi
Alex Sadovsky
Director of Data Science: Oracle Data Cloud
alex.sadovsky@oracle.com
It's like the more data we come across
The more problems we see
Data Science is growing up
Data Science is growing up
For Data Science to succeed, we need to learn to play well with others.
Important Business Decisions
How will operations adapt to this code change?
We’ll need a classifier capable of capturing non-linear interactions!
Who are the players?
• Data Ingest
• Operations
• Product
• Architecture
• Finance / Investors
• Data Scientists
Litmus Test: These are the parties involved in every Data Science Product
What is success?
Success is getting from A to B with everyone staying happy
Data In
• Product Goals & Data Science Realization
• Operational Insight
• Utilization of Existing Architecture
• Don’t Break the Bank
Data Out
Are we even talking about Spark today?
Outline:
1. Automated ML services
2. Scikit-Learn Pipelines
3. Spark Pipelines
4. Spotify’s Luigi
5. Data Science Pipelines: Spark + Luigi
Spoiler alert:
Spark is still going to be the answer to all of our big data problems
Amazon Machine Learning
ML on AWS: A simple model
Data
{
  "version" : "1.0",
  "rowId" : null,
  "rowWeight" : null,
  "targetAttributeName" : "y",
  "dataFormat" : "CSV",
  "dataFileContainsHeader" : true,
  "attributes" : [ {
    "attributeName" : "age",
    "attributeType" : "NUMERIC"
  }
  …
  ]
}
Models
{
  "MLModelId": "string",
  "MLModelName": "string",
  "MLModelType": "string",
  "Parameters":
  {
    "string" :
      "string"
  },
  "Recipe": "string",
  "RecipeUri": "string",
  "TrainingDataSourceId": "string"
}
Recipes
{
  "groups": {
    "LONGTEXT": "group_remove(ALL_TEXT, title, subject)",
    "SPECIALTEXT": "group(title, subject)",
    "BINCAT": "group(ALL_CATEGORICAL, ALL_BINARY)"
  },
  "assignments": {
    "binned_age" : "quantile_bin(age,30)",
    "country_gender_interaction" : "cartesian(country, gender)"
  },
  "outputs": [
    "lowercase(no_punct(LONGTEXT))",
    "ngram(lowercase(no_punct(SPECIALTEXT)),3)",
    "quantile_bin(hours-per-week, 10)",
    "cartesian(binned_age, quantile_bin(hours-per-week,10)) // this one is critical",
    "country_gender_interaction",
    "BINCAT"
  ]
}
ML on AWS: A simple model
ML on AWS: Who’s happy?
• Data Ingest
• Operations
• Product*
• Architecture
• Finance / Investors
• Data Scientists
Scoring 3 billion records = ($0.10 / 1000) * 3 000 000 000 = $300,000 USD + compute fees
*Amazon Machine Learning can train models on datasets up to 100 GB in size.
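The scoring estimate above can be reproduced with a few lines of arithmetic. This is only a sketch using the two figures quoted on the slide ($0.10 per 1,000 predictions and 3 billion records); the function and constant names are illustrative, and compute fees are left out because the slide does not quantify them.

```python
# Back-of-the-envelope batch scoring cost, using the figures quoted on the slide.
PRICE_PER_1000_PREDICTIONS_USD = 0.10  # rate from the slide
RECORDS_TO_SCORE = 3_000_000_000       # volume from the slide


def scoring_cost_usd(records: int, price_per_1000: float) -> float:
    """Return the prediction fee in USD, excluding compute fees."""
    return records / 1000 * price_per_1000


cost = scoring_cost_usd(RECORDS_TO_SCORE, PRICE_PER_1000_PREDICTIONS_USD)
print(f"${cost:,.0f} USD + compute fees")
```

The point of the slide survives the arithmetic: a per-prediction price that looks negligible becomes a six-figure bill at billions of records.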
Scikit-Learn Pipelines
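from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Without a pipeline: every step is fitted and applied by hand.
# Xtrain/Xtest are the documents; ytrain (assumed name) holds the training labels.
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = SGDClassifier()
vX = vect.fit_transform(Xtrain)
tfidfX = tfidf.fit_transform(vX)
predicted = clf.fit(tfidfX, ytrain).predict(tfidfX)
# Now evaluate all steps on test set (transform only; do not refit)
vX = vect.transform(Xtest)
tfidfX = tfidf.transform(vX)
predicted = clf.predict(tfidfX)
VS
# With a pipeline: the same steps sit behind a single fit/predict interface.
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
predicted = pipeline.fit(Xtrain, ytrain).predict(Xtrain)
# Now evaluate all steps on test set
predicted = pipeline.predict(Xtest)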
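To make the contrast concrete, here is a toy, dependency-free sketch of what a pipeline object does. It is not scikit-learn's implementation; `Center`, `SignClassifier`, and `ToyPipeline` are hypothetical classes invented for illustration. The key behaviour is that `fit` learns state on the training data, while `predict` only reuses that state on new data.

```python
# Toy sketch of the pipeline idea (hypothetical classes, not scikit-learn itself).
class Center:
    """Transformer: subtracts the mean learned during fit."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class SignClassifier:
    """Estimator: predicts 1 for positive inputs, else 0 (nothing to learn)."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [1 if x > 0 else 0 for x in X]


class ToyPipeline:
    """Chains transformers and a final estimator behind fit/predict."""

    def __init__(self, steps):
        self.transformers = list(steps[:-1])
        self.estimator = steps[-1]

    def fit(self, X, y):
        for t in self.transformers:
            X = t.fit(X, y).transform(X)  # fit on training data only
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        for t in self.transformers:
            X = t.transform(X)  # reuse fitted state, never refit
        return self.estimator.predict(X)


pipe = ToyPipeline([Center(), SignClassifier()])
pipe.fit([1.0, 2.0, 3.0, 6.0], [0, 0, 0, 1])
test_predictions = pipe.predict([2.0, 10.0])
```

Because the mean is learned once from the training data, the test inputs are shifted by the same amount, which is exactly the bookkeeping the hand-written version has to get right at every step.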
Scikit-Learn Pipelines: Who’s happy?
• Data Ingest
• Operations
• Product
• Architecture
• Finance / Investors
• Data Scientists
I thought we were going to have Big Data…
Big Data? Sounds like we need Spark.
• Great for data manipulation
• Great for large scale modeling
Spark Pipelines
Awesome… but they don’t really take us end to end for anything but modeling
Spark Pipelines: Who’s happy?
• Data Ingest
• Operations
• Product
• Architecture
• Finance / Investors
• Data Scientists
Just not flexible enough to comprise a whole product
So why (not) Spark?
• Great for data manipulation
• Great for large scale modeling
• Not a data warehouse
• Not needed for reporting
• Not needed for operational insight
– If anything, it’s an error source!
Pipelines
Spotify’s Luigi
https://github.com/spotify/luigi
Luigi is a pipeline tool for workflow management
• Apache 2.0 License
• Similar to the Make utility in Linux
– You have tasks which have dependencies
– Luigi makes sure those dependencies are met
• Similar to Spark
– It creates a directed acyclic graph and executes accordingly
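The "build a dependency graph, then execute in order" idea can be sketched with Python's standard library alone. This is not how Luigi is implemented; it only illustrates the concept. The task names are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task names; each key maps to the set of tasks it requires.
requires = {
    "score": {"model", "scoring_variables"},
    "model": {"modeling_variables"},
    "modeling_variables": {"data"},
    "scoring_variables": {"data"},
    "data": set(),
}

# static_order() yields every task after all of its requirements.
execution_order = list(TopologicalSorter(requires).static_order())
print(execution_order)
```

The relative order of independent tasks (here the two variable-creation steps) is not fixed, which is also what leaves room for running them in parallel.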
Or in meme form:
How does it work: Tasks & Targets
• Tasks
– Code we want to run (that requires other tasks)
– Tasks output targets
• Targets
– A desired state
Luigi works with anything
• Tasks
– Hadoop commands
– Spark jobs
– Python, Perl, Fortran, shell scripts
– Anything that can be wrapped in a Python “run” method
• Targets
– local, S3, FTP, HDFS files
– database entries
– Anything that can let a Python wrapper return “true” when it exists
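The task/target contract described above can be sketched without Luigi installed. The classes below are hypothetical stand-ins that mimic the shape of the contract (a target answers `exists()`, a task has `output()` and `run()`), not Luigi's own classes; the marker-file approach is just one assumed way to represent "a desired state".

```python
import os
import tempfile


class MarkerFileTarget:
    """Hypothetical target: it 'exists' once a marker file is present."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)


class TouchTask:
    """Hypothetical task: run() produces the target returned by output()."""

    def __init__(self, path):
        self.path = path

    def output(self):
        return MarkerFileTarget(self.path)

    def complete(self):
        # A task counts as done when its target exists.
        return self.output().exists()

    def run(self):
        with open(self.path, "w") as f:
            f.write("done\n")


workdir = tempfile.mkdtemp()
task = TouchTask(os.path.join(workdir, "marker"))
before = task.complete()  # target not yet produced
task.run()
after = task.complete()   # target now exists
```

Swapping the marker file for a database row or a remote object only changes what `exists()` checks; the scheduling logic stays the same.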
It’s all python too
• No XML or YAML
• Configurable via code
Foo and Bar
import luigi

class Foo(luigi.WrapperTask):
    def run(self):
        print("Running Foo")

    def requires(self):
        yield Bar()
Foo and Bar
class Bar(luigi.Task):
    def run(self):
        f = self.output().open('w')
        f.write("hello, foobar world\n")
        f.close()

    def output(self):
        return luigi.LocalTarget('/tmp/bar')
/anaconda/bin/python /Users/alexander.sadovsky/PycharmProjects/megalodon/foo2.py Foo
DEBUG: Checking if examples.Foo() is complete
DEBUG: Checking if examples.Bar() is complete
INFO: Informed scheduler that task examples.Foo() has status PENDING
INFO: Informed scheduler that task examples.Bar() has status PENDING
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 2
INFO: [pid 12803] Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) running examples.Bar()
INFO: [pid 12803] Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) done examples.Bar()
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task examples.Bar() has status DONE
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 12803] Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) running examples.Foo()
INFO: [pid 12803] Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) done examples.Foo()
DEBUG: 1 running tasks, waiting for next task to finish
Running Foo
INFO: Informed scheduler that task examples.Foo() has status DONE
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) was stopped. Shutting down Keep-Alive thread
INFO:
===== Luigi Execution Summary =====
Scheduled 2 tasks of which:
* 2 ran successfully:
- 1 examples.Bar()
- 1 examples.Foo()
This progress looks :) because there were no failed tasks or missing external dependencies
===== Luigi Execution Summary =====
Process finished with exit code 0
Foo and Bars
Foo and Bars
import luigi

class Foo(luigi.WrapperTask):
    def run(self):
        print("Running Foo")

    def requires(self):
        # Fan out: one Bar task per value of num
        for i in range(10):
            yield Bar(i)

class Bar(luigi.Task):
    # IntParameter keeps num an integer, so the '%d' format below is safe
    num = luigi.IntParameter()

    def run(self):
        f = self.output().open('w')
        f.write("hello, foobar world\n")
        f.close()

    def output(self):
        return luigi.LocalTarget('/tmp/bar/%d' % self.num)
Foo and Bars: Parallel Processing
What about Spark?
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext()
    sc.textFile(sys.argv[1]) \
      .flatMap(lambda line: line.split()) \
      .map(lambda word: (word, 1)) \
      .reduceByKey(lambda a, b: a + b) \
      .saveAsTextFile(sys.argv[2])
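For readers new to Spark, the same word count can be traced in plain Python. This is only an illustration of what each stage of the chain computes, on made-up input; it runs on one machine and says nothing about how Spark distributes the work.

```python
from collections import defaultdict

lines = ["more data more problems", "more pipelines"]  # made-up input

# flatMap: each line becomes many words, flattened into one sequence
words = [word for line in lines for word in line.split()]

# map: each word becomes a (word, 1) pair
pairs = [(word, 1) for word in words]

# reduceByKey: values sharing a key are combined with a + b
totals = defaultdict(int)
for word, n in pairs:
    totals[word] += n
counts = dict(totals)
```

In Spark the three stages run across partitions of the input, and `saveAsTextFile` writes the resulting pairs out instead of keeping them in a local dictionary.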
What about Spark?
from luigi.contrib.s3 import S3Target
from luigi.contrib.spark import PySparkTask

class InlinePySparkWordCount(PySparkTask):
    def input(self):
        return S3Target("s3n://bucket.example.org/wordcount.input")

    def output(self):
        return S3Target('s3n://bucket.example.org/wordcount.output')

    def main(self, sc, *args):
        # Same word count as before, but input and output are Luigi targets
        sc.textFile(self.input().path) \
          .flatMap(lambda line: line.split()) \
          .map(lambda word: (word, 1)) \
          .reduceByKey(lambda a, b: a + b) \
          .saveAsTextFile(self.output().path)
[Diagram: Modeling Pipeline. Stages shown: Data, Modeling ID Selection, Variable Creation, Variable Reduction, Model, Scoring ID Selection, Scoring Variable Creation, Scoring.]
But wait! There’s more!
• Failure retries are built in
• Upstream failures will stop downstream processing
• If targets are files, filesystem states, or database states, entire pipelines can be rerun without actually “re-running” every step: tasks whose targets already exist are skipped
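The last two bullets can be illustrated with a small simulation. This is a sketch of the behaviour, not Luigi's scheduler: the step names are hypothetical, the pipeline is assumed to be a simple linear chain, a set stands in for "targets that already exist", and retries are not modeled.

```python
def run_pipeline(steps, done, failing=()):
    """Run a linear chain of steps and return the steps actually executed.

    steps:   ordered step names, each depending on the one before it
    done:    set of steps whose targets already exist (updated in place)
    failing: steps that are simulated as failing when they run
    """
    executed = []
    for step in steps:
        if step in done:
            continue  # target already exists, so the step is skipped
        if step in failing:
            break     # upstream failure: downstream steps never start
        executed.append(step)
        done.add(step)
    return executed


steps = ["ingest", "variables", "model", "score"]  # hypothetical step names
done = set()
first = run_pipeline(steps, done, failing={"model"})  # fails partway through
second = run_pipeline(steps, done)                    # rerun after the fix
```

The rerun picks up at the failed step because the earlier steps' targets are still in place, which is what makes rerunning an entire pipeline cheap.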
Spark + Luigi: Who’s happy?
• Data Ingest
• Operations
• Product
• Architecture
• Finance / Investors
• Data Scientists
Everybody!
How does it scale?
Shout out to my awesome team
I’m (almost) always hiring
Questions?


Editor's notes

  1. One guy writing horrible code in R, siloed away (story)… Productionized, deployed code that a business depends on
  2. One guy writing horrible code in R, siloed away (story)… Productionized, deployed code that a business depends on
  3. One person or multiple teams:
  4. ML pipeline
  5. Bear with me, I’m getting there
  6. diagram.
  7. diagram
  8. Lost pic? One person or multiple teams:
  9. diagram.
  10. Lost pic? One person or multiple teams:
  11. diagram.
  12. Lost pic? One person or multiple teams:
  13. Mario is off getting the girl, Luigi is off creating world-class data science pipeline products
  14. Apache Oozie / LinkedIn Azkaban
  15. Apache Oozie / LinkedIn Azkaban
  16. Lost pic? One person or multiple teams: