These are the slides from the Denver/Boulder Spark meet-up on February 24th, 2016. (deck build animations are all broken here... sorry!)
This talk provides an evaluation of existing machine learning pipelines through the eyes of different key stakeholders in the data science ecosystem. Focus is placed on the entire process from data to product (and keeping everyone in between happy). Ultimately I explore how to use Spotify’s Luigi pipeline tool in combination with Spark to produce batch-processing machine learning pipelines with operational insight and redundancy built in.
More Data, More Problems: Evolving big data machine learning pipelines with Spark & Luigi
1. More Data, More Problems:
Evolving big data machine learning pipelines with Spark & Luigi
Alex Sadovsky
Director of Data Science: Oracle Data Cloud
alex.sadovsky@oracle.com
It's like the more data we come across
The more problems we see
3. Data Science is growing up
For Data Science to succeed, we need to learn to play
well with others.
Important
Business
Decisions
How will
operations
adapt to this
code change?
We’ll need a
classifier capable of
capturing non-
linear interactions!
4. Who are the players?
Data Ingest Operations
Product Architecture
Finance / Investors Data Scientists
Litmus Test: These are the parties involved in every Data Science Product
5. What is success?
Success is getting from A to B with everyone staying happy
Data In
Product Goals
&
Data Science
Realization
Operational Insight
Utilization of
Existing
Architecture
Don’t Break the Bank
Data Out
6. Are we even talking about Spark today?
Outline:
1. Automated ML services
2. Scikit-Learn Pipelines
3. Spark Pipelines
4. Spotify’s Luigi
5. Data Science Pipelines: Spark + Luigi
Spoiler alert:
Spark is still going to be the answer to all of our big data problems
10. ML on AWS: Who’s happy?
Data Ingest Operations
Product* Architecture
Finance / Investors Data Scientists
Scoring 3 billion records = ($0.10 / 1,000) * 3,000,000,000 = $300,000 USD + compute fees
*Amazon Machine Learning can train models on datasets up to 100 GB in size.
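The back-of-envelope math above fits in a few lines of Python (the $0.10 per 1,000 predictions rate and the 3 billion record count are taken from the slide; real AWS pricing may differ):

```python
# Rough cost estimate for batch-scoring with a hosted ML service.
# Rate and record count come from the slide; compute fees are excluded.
PRICE_PER_1000_PREDICTIONS = 0.10  # USD
records = 3_000_000_000

prediction_cost = (PRICE_PER_1000_PREDICTIONS / 1000) * records
print(f"${prediction_cost:,.0f}")  # $300,000 -- before compute fees
```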
11. Scikit-Learn Pipelines
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = SGDClassifier()
vX = vect.fit_transform(Xtrain)
tfidfX = tfidf.fit_transform(vX)
clf.fit(tfidfX, ytrain)
# Now evaluate all steps on test set
# (transform only -- refitting on test data would leak)
vX = vect.transform(Xtest)
tfidfX = tfidf.transform(vX)
predicted = clf.predict(tfidfX)
pipeline = Pipeline([
('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', SGDClassifier()),
])
pipeline.fit(Xtrain, ytrain)
# Now evaluate all steps on test set
predicted = pipeline.predict(Xtest)
VS
12. Scikit-Learn Pipelines: Who’s happy?
Data Ingest Operations
Product Architecture
Finance / Investors Data Scientists
I thought we were going to have Big Data…
13. Big Data? Sounds like we need Spark.
• Great for data manipulation
• Great for large scale modeling
15. Spark Pipelines: Who’s happy?
Data Ingest Operations
Product Architecture
Finance / Investors Data Scientists
Just not flexible enough to comprise a whole product
16. So why (not) Spark?
• Great for data manipulation
• Great for large scale modeling
• Not a data warehouse
• Not needed for reporting
• Not needed for operational insight
– If anything, it’s an error source!
18. Spotify’s Luigi
https://github.com/spotify/luigi
Luigi is a pipeline tool for workflow management
• Apache 2.0 License
• Similar to Make utility in Linux
• You have tasks which have dependencies
• Luigi makes sure those dependencies are met
• Similar to Spark
• It creates a directed acyclic graph and executes accordingly
20. How does it work: Tasks & Targets
• Tasks
– Code we want to run (that may require other tasks)
– Tasks output targets
• Targets
– A desired state
21. Luigi works with anything
• Tasks
– Hadoop commands
– Spark jobs
– Python, perl, fortran, shell scripts
– Anything that can be wrapped in a python “run” method
• Targets
– local, S3, FTP, HDFS files
– database entries
– Anything that can let a python wrapper return “true”
when it exists
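A target, in other words, is just an object that can answer “do I exist yet?”. Here is a minimal sketch of that contract — in real code you would subclass `luigi.Target` (or use a built-in like `luigi.LocalTarget`); the standalone class and path below are purely illustrative:

```python
import os

class FileFlagTarget:
    """Sketch of the Luigi target contract: the only requirement is an
    exists() method that returns True once the desired state is reached.
    In real code this would subclass luigi.Target."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        # The scheduler treats the producing task as complete when this is True.
        return os.path.exists(self.path)

flag = FileFlagTarget("/tmp/some-output-file")
flag.exists()  # True once the task has actually written the file
```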
22. It’s all python too
• No XML or YAML
• Configurable via code
23. Foo and Bar
class Foo( luigi.WrapperTask ):
def run(self):
print("Running Foo")
def requires(self):
yield Bar()
24. Foo and Bar
class Bar(luigi.Task):
def run(self):
f = self.output().open('w')
f.write("hello, foobar world\n")
f.close()
def output(self):
return luigi.LocalTarget('/tmp/bar')
25. /anaconda/bin/python /Users/alexander.sadovsky/PycharmProjects/megalodon/foo2.py Foo
DEBUG: Checking if examples.Foo() is complete
DEBUG: Checking if examples.Bar() is complete
INFO: Informed scheduler that task examples.Foo() has status PENDING
INFO: Informed scheduler that task examples.Bar() has status PENDING
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 2
INFO: [pid 12803] Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) running
examples.Bar()
INFO: [pid 12803] Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) done
examples.Bar()
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task examples.Bar() has status DONE
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 12803] Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) running
examples.Foo()
INFO: [pid 12803] Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) done
examples.Foo()
DEBUG: 1 running tasks, waiting for next task to finish
Running Foo
INFO: Informed scheduler that task examples.Foo() has status DONE
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) was stopped. Shutting down
Keep-Alive thread
INFO:
===== Luigi Execution Summary =====
Scheduled 2 tasks of which:
* 2 ran successfully:
- 1 examples.Bar()
- 1 examples.Foo()
This progress looks :) because there were no failed tasks or missing external dependencies
===== Luigi Execution Summary =====
Process finished with exit code 0
28. Foo and Bars
class Foo(luigi.WrapperTask):
def run(self):
print("Running Foo")
def requires(self):
for i in range(10):
yield Bar(i)
class Bar(luigi.Task):
num = luigi.IntParameter()
def run(self):
f = self.output().open('w')
f.write("hello, foobar world\n")
f.close()
def output(self):
return luigi.LocalTarget('/tmp/bar/%d' % self.num)
30. What about Spark?
import sys
from pyspark import SparkContext
if __name__ == "__main__":
sc = SparkContext()
(sc.textFile(sys.argv[1])
   .flatMap(lambda line: line.split())
   .map(lambda word: (word, 1))
   .reduceByKey(lambda a, b: a + b)
   .saveAsTextFile(sys.argv[2]))
31. What about Spark?
class InlinePySparkWordCount(PySparkTask):
def input(self):
return S3Target("s3n://bucket.example.org/wordcount.input")
def output(self):
return S3Target('s3n://bucket.example.org/wordcount.output')
def main(self, sc, *args):
(sc.textFile(self.input().path)
   .flatMap(lambda line: line.split())
   .map(lambda word: (word, 1))
   .reduceByKey(lambda a, b: a + b)
   .saveAsTextFile(self.output().path))
35. But wait! There’s more!
• Failure retries are built in
• Upstream failures will stop downstream
processing
• If tasks are files/filesystem/database states,
entire pipelines can be rerun without actually
“re-running” every step
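That rerun behaviour falls out of the target check: before running a task, the scheduler asks whether its output already exists and skips it if so. A toy, Luigi-free sketch of the idea (step names are made up; completed outputs are modelled as a set instead of real files):

```python
# Toy model of Luigi's skip-if-output-exists behaviour.
completed_outputs = {"raw_data", "features"}       # steps that already ran

pipeline = ["raw_data", "features", "model", "scores"]  # upstream -> downstream

def run_pipeline(steps, done):
    ran = []
    for step in steps:
        if step in done:   # target exists: do not re-run the task
            continue
        ran.append(step)   # "run" the task...
        done.add(step)     # ...and record its output target
    return ran

print(run_pipeline(pipeline, set(completed_outputs)))  # ['model', 'scores']
```

Rerunning the whole pipeline is therefore cheap: only the steps whose targets are missing actually execute.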
36. Spark + Luigi: Who’s happy?
Data Ingest Operations
Product Architecture
Finance / Investors Data Scientists
Everybody!