Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan Cutler
1. DBG / June 5, 2018 / © 2018 IBM Corporation
Model Parallelism in
Spark ML
Cross-validation
Nick Pentreath
Principal Engineer
Bryan Cutler
Software Engineer
2. DBG / June 5, 2018 / © 2018 IBM Corporation
About Nick
@MLnick on Twitter & Github
Principal Engineer, IBM
CODAIT - Center for Open-Source Data & AI
Technologies
Machine Learning & AI
Apache Spark committer & PMC
Author of Machine Learning with Spark
Various conferences & meetups
3. DBG / June 5, 2018 / © 2018 IBM Corporation
About Bryan
Software Engineer, IBM CODAIT
Apache Spark committer
Apache Arrow committer
Python, Machine Learning OSS
@BryanCutler on Github
4. DBG / June 5, 2018 / © 2018 IBM Corporation
Center for Open Source Data and AI Technologies
CODAIT
codait.org
CODAIT aims to make AI solutions
dramatically easier to create, deploy,
and manage in the enterprise
Relaunch of the Spark Technology
Center (STC) to reflect expanded
mission
Improving Enterprise AI Lifecycle in Open Source
5. DBG / June 5, 2018 / © 2018 IBM Corporation
Agenda
Model Tuning in Spark
Scaling Model Tuning
Performance Results
Best Practices
Future Directions in Optimizing
Pipelines
6. DBG / June 5, 2018 / © 2018 IBM Corporation
Model Tuning in Spark
7. DBG / June 5, 2018 / © 2018 IBM Corporation
Model selection: workflow within a workflow
Model Tuning in Spark
Ingest
Data
Processing
Feature
Engineering
Model
Selection
Final Model
Candidate
models
Train
Evaluate
Adjust
8. DBG / June 5, 2018 / © 2018 IBM Corporation
Pipeline cross-validation
Model Tuning in Spark
Spark ML Pipeline: Tokenizer → CountVectorizer → LogisticRegression
Parameters:
# features: 10, 100
regParam: 0.001, 0.1
9. DBG / June 5, 2018 / © 2018 IBM Corporation
Pipeline cross-validation
Model Tuning in Spark
Tokenizer → CountVectorizer (# features: 10) → LogisticRegression (regParam: 0.001)
Tokenizer → CountVectorizer (# features: 10) → LogisticRegression (regParam: 0.1)
Tokenizer → CountVectorizer (# features: 100) → LogisticRegression (regParam: 0.001)
Tokenizer → CountVectorizer (# features: 100) → LogisticRegression (regParam: 0.1)
Parameters: # features: 10, 100; regParam: 0.001, 0.1
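The four candidate pipelines above are the Cartesian product of the two hyperparameter lists. A minimal sketch of that expansion in plain Python follows; the dictionary keys `num_features` and `reg_param` and the helper `expand_grid` are hypothetical names chosen for illustration, not Spark's API.

```python
from itertools import product

# Hypothetical parameter grid mirroring the slide: two values per hyperparameter.
param_grid = {
    "num_features": [10, 100],   # CountVectorizer setting
    "reg_param": [0.001, 0.1],   # LogisticRegression setting
}

def expand_grid(grid):
    """Return one dict per combination of hyperparameter values."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

candidates = expand_grid(param_grid)
for params in candidates:
    print(params)
```

Each resulting dictionary corresponds to one full pipeline that cross-validation has to fit and evaluate.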
10. DBG / June 5, 2018 / © 2018 IBM Corporation
Pipeline cross-validation
Model Tuning in Spark
Tokenizer → CountVectorizer → LogisticRegression
Parameters: # features: 10, 100; regParam: 0.001, 0.1
14. DBG / June 5, 2018 / © 2018 IBM Corporation
Cross-validation is expensive!
Model Tuning in Spark
• 5 x 5 x 5 hyperparameters = 125 pipelines
• ... across 4 machine learning models = 500
• If training & evaluation does not fully utilize
available cluster resources then that waste is
compounded for each model
Based on XKCD comic: https://xkcd.com/303/
& https://github.com/mislavcimpersak/xkcd-excuse-generator
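The arithmetic on this slide can be checked in a few lines. The grid sizes come from the slide; the fold count below is a hypothetical value added only to show how k-fold cross-validation multiplies the number of fits further.

```python
from math import prod

# Counts taken from the slide.
values_per_hyperparameter = [5, 5, 5]
num_model_types = 4

# Assumption for illustration only: the slide does not state a fold count.
num_folds = 3

pipelines_per_model = prod(values_per_hyperparameter)
total_pipelines = pipelines_per_model * num_model_types
# k-fold cross-validation fits every candidate once per fold.
total_fits = total_pipelines * num_folds

print(pipelines_per_model, total_pipelines, total_fits)
```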
15. DBG / June 5, 2018 / © 2018 IBM Corporation
Scaling Model Tuning
16. DBG / June 5, 2018 / © 2018 IBM Corporation
Parallel model evaluation
Scaling Model Tuning
Tokenizer → CountVectorizer (# features: 10) → LogisticRegression (regParam: 0.001)
Tokenizer → CountVectorizer (# features: 10) → LogisticRegression (regParam: 0.1)
Tokenizer → CountVectorizer (# features: 100) → LogisticRegression (regParam: 0.001)
Tokenizer → CountVectorizer (# features: 100) → LogisticRegression (regParam: 0.1)
Parameters: # features: 10, 100; regParam: 0.001, 0.1
20. DBG / June 5, 2018 / © 2018 IBM Corporation
Parallel model evaluation
Scaling Model Tuning
• Added in SPARK-19357 and SPARK-21911
(PySpark)
• Parallelism parameter governs the
maximum # models to be trained at once
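To illustrate what "maximum # models to be trained at once" means, here is a small plain-Python sketch that uses a fixed-size thread pool as a stand-in for the parallelism parameter. It is not Spark code: `fit_and_evaluate` is a hypothetical placeholder that only sleeps, and the candidate list is made up.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

parallelism = 2                 # hypothetical setting
candidates = list(range(6))     # stand-ins for six parameter combinations

lock = threading.Lock()
state = {"running": 0, "peak": 0}

def fit_and_evaluate(candidate):
    """Placeholder for training and scoring one model."""
    with lock:
        state["running"] += 1
        state["peak"] = max(state["peak"], state["running"])
    time.sleep(0.05)            # simulate the work of one fit
    with lock:
        state["running"] -= 1
    return candidate

# The pool size plays the role of the parallelism parameter.
with ThreadPoolExecutor(max_workers=parallelism) as pool:
    results = list(pool.map(fit_and_evaluate, candidates))

print("peak concurrent fits:", state["peak"])
```

However many candidates are queued, the number of fits in flight never exceeds the pool size.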
21. DBG / June 5, 2018 / © 2018 IBM Corporation
Parallel model evaluation
Scaling Model Tuning
Tokenizer → CountVectorizer → LogisticRegression
Parameters: # features: 10, 100; regParam: 0.001, 0.1
25. DBG / June 5, 2018 / © 2018 IBM Corporation
Implementation considerations
Scaling Model Tuning
• Parallelism parameter sets the size of the
thread pool under the hood
• Dedicated ExecutionContext created to
avoid deadlocks from using the default
thread pool
• Used Futures instead of parallel
collections – more flexible
• Model-specific parallel fitting
implementations not supported
• SPARK-22126
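A rough Python analogue of the design described above: a pool owned by the tuning call itself, with one future per candidate. This is a sketch of the general pattern, not Spark's Scala implementation; `fit_and_score` and `cross_validate_fold` are hypothetical names, and the returned metric is a dummy value.

```python
from concurrent.futures import ThreadPoolExecutor

def fit_and_score(reg_param):
    """Hypothetical stand-in that returns a dummy metric for one candidate."""
    return 1.0 / (1.0 + reg_param)

def cross_validate_fold(reg_params, parallelism):
    # A pool created for this call only, rather than a shared default pool,
    # so unrelated work queued elsewhere cannot block these tasks.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = [pool.submit(fit_and_score, r) for r in reg_params]
        # Collect in submission order so metrics line up with the grid.
        return [f.result() for f in futures]

metrics = cross_validate_fold([0.001, 0.01, 0.1], parallelism=3)
print(metrics)
```

Holding explicit futures makes it easy to control result ordering and error handling per candidate, which is one way to read the "more flexible" remark on the slide.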
26. DBG / June 5, 2018 / © 2018 IBM Corporation
Performance tests
Scaling Model Tuning
• Compared parallel CV to serial CV with
varying number of samples
• Simple LogisticRegression with regParam
and fitIntercept; parameter grid size 12
• Measure elapsed time for cross-validation
• Data size: 100,000 -> 5,000,000
• Number features: 10
• Number partitions: 10
• Number CV folds: 5
• Parallelism: 3
• Standalone cluster with 30 cores
27. DBG / June 5, 2018 / © 2018 IBM Corporation
Results
Scaling Model Tuning
• Approximately 2.4x speedup
• Stays roughly constant as #
samples increases
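One way to read this number: if the parallelism setting of 3 used in the test is treated as the ideal speedup (an assumption, since the serial run may already use some of the cluster), the reported speedup corresponds to the following parallel efficiency.

```python
speedup = 2.4      # reported on the slide
parallelism = 3    # setting used in the performance test

# Fraction of the assumed ideal speedup that was actually achieved.
efficiency = speedup / parallelism
print(f"parallel efficiency: {efficiency:.0%}")
```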
28. DBG / June 5, 2018 / © 2018 IBM Corporation
Best practices
Scaling Model Tuning
• Simple integer parameter is the only thing
you can set (for now)
• Too low => under-utilize resources
• Too high => could lead to memory issues or
overloading cluster
• Rough rule: # cores / # partitions
• But depends on data and model sizes
• Mid-sized cluster probably <= 10
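The rough rule above can be written as a small helper. This is only a starting-point heuristic under the slide's assumptions; the function name `suggest_parallelism` and the default cap of 10 (taken from the "mid-sized cluster" remark) are illustrative choices.

```python
def suggest_parallelism(num_cores, num_partitions, cap=10):
    """Rough starting point: cores divided by partitions, kept within [1, cap]."""
    if num_cores < 1 or num_partitions < 1:
        raise ValueError("num_cores and num_partitions must be positive")
    return max(1, min(cap, num_cores // num_partitions))

# The earlier performance test used 30 cores and 10 partitions.
print(suggest_parallelism(30, 10))
```

For that configuration the helper returns 3, which matches the parallelism used in the test. Data and model sizes may still call for a lower value.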
29. DBG / June 5, 2018 / © 2018 IBM Corporation
Optimizing Tuning for
Pipeline Models
30. DBG / June 5, 2018 / © 2018 IBM Corporation
Challenges
Optimizing Tuning for Pipeline Models
• Multi-stage, complex pipelines
• Parameter grid with hyperparameters from
different stages
• Easy to have huge number of candidate
parameter combinations
• Model parallelism helps, but can we do
better?
31. DBG / June 5, 2018 / © 2018 IBM Corporation
Duplicating work
Optimizing Tuning for Pipeline Models
• Each Pipeline treated
independently
• Depending on parameter grid
and pipeline stages
• Fit the same model multiple
times
• Perform same transformations
multiple times
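To make the duplication concrete, the sketch below counts stage executions for the four-pipeline example from earlier in the deck. The tuple representation is hypothetical; a stage's output depends on everything upstream of it, so the count is taken over pipeline prefixes rather than single stages.

```python
from collections import Counter

# Each candidate pipeline as a tuple of (stage, setting) pairs.
pipelines = [
    (("Tokenizer", None), ("CountVectorizer", 10), ("LogisticRegression", 0.001)),
    (("Tokenizer", None), ("CountVectorizer", 10), ("LogisticRegression", 0.1)),
    (("Tokenizer", None), ("CountVectorizer", 100), ("LogisticRegression", 0.001)),
    (("Tokenizer", None), ("CountVectorizer", 100), ("LogisticRegression", 0.1)),
]

# Count how often each prefix (a stage plus everything before it) occurs.
prefix_counts = Counter(p[: i + 1] for p in pipelines for i in range(len(p)))

independent_runs = sum(prefix_counts.values())  # every pipeline runs every stage
unique_runs = len(prefix_counts)                # each distinct prefix runs once

print(independent_runs, unique_runs)
```

Treating the pipelines independently runs 12 stage executions, while only 7 of them are distinct.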
32. DBG / June 5, 2018 / © 2018 IBM Corporation
Optimize with a DAG
Optimizing Tuning for Pipeline Models
• A node is an estimator/transformer with a
set of hyperparameters
• A path in the graph is a single pipeline
model
[Diagram: Tokenizer feeds CountVectorizer (nfeat=10) and CountVectorizer (nfeat=100); each CountVectorizer feeds LR (reg=0.1) and LR (reg=0.01)]
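A minimal sketch of the idea, assuming the graph is a prefix tree in which pipelines that share leading stages share nodes. The string labels and the helpers `build_dag` and `paths` are hypothetical, not taken from the authors' implementation.

```python
def build_dag(pipelines):
    """Merge pipelines that share a prefix into a nested dict (a prefix tree)."""
    root = {}
    for stages in pipelines:
        node = root
        for stage in stages:
            node = node.setdefault(stage, {})
    return root

def paths(node, prefix=()):
    """Yield every root-to-leaf path; each path is one full pipeline."""
    if not node:
        yield prefix
        return
    for stage, child in node.items():
        yield from paths(child, prefix + (stage,))

pipelines = [
    ("Tokenizer", f"CountVectorizer(nfeat={n})", f"LR(reg={r})")
    for n in (10, 100)
    for r in (0.1, 0.01)
]

dag = build_dag(pipelines)
for path in paths(dag):
    print(" -> ".join(path))
```

Walking the tree from root to each leaf recovers the original four pipelines, while shared stages such as the Tokenizer appear only once.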
33. DBG / June 5, 2018 / © 2018 IBM Corporation
Parallelize in breadth-first order
Optimizing Tuning for Pipeline Models
• Example with parallelism parameter set to
2
• Tokenizer is only a transform, proceed to fit
CountVectorizer nodes
[Diagram: Tokenizer feeds CountVectorizer (nfeat=10) and CountVectorizer (nfeat=100); each CountVectorizer feeds LR (reg=0.1) and LR (reg=0.01)]
34. DBG / June 5, 2018 / © 2018 IBM Corporation
Fit estimators
Optimizing Tuning for Pipeline Models
• Cache the result and proceed to fit the first
2 LogisticRegression models
[Diagram: DAG of Tokenizer, CountVectorizer (nfeat=10, nfeat=100) and LR (reg=0.1, reg=0.01) nodes, annotated "Cache result"]
35. DBG / June 5, 2018 / © 2018 IBM Corporation
Fit estimators
Optimizing Tuning for Pipeline Models
• Unpersist when child tasks done
• Fit final 2 LR models
[Diagram: DAG of Tokenizer, CountVectorizer (nfeat=10, nfeat=100) and LR (reg=0.1, reg=0.01) nodes, annotated "Unpersist cached dataframe" and "Cache result"]
36. DBG / June 5, 2018 / © 2018 IBM Corporation
Fit estimators
Optimizing Tuning for Pipeline Models
• All 4 LR models fitted
[Diagram: DAG of Tokenizer, CountVectorizer (nfeat=10, nfeat=100) and LR (reg=0.1, reg=0.01) nodes, annotated "Unpersist cached dataframe"]
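The fitting walk-through on the last few slides can be summarized as a scheduling order. The sketch below is a sequential simulation that only records that order (it does not run anything in parallel or call Spark); the node labels, the `schedule` helper, and the choice to log cache and unpersist events are assumptions made for illustration.

```python
from collections import deque

# Hypothetical DAG matching the slides: node -> children.
dag = {
    "Tokenizer": ["CV(nfeat=10)", "CV(nfeat=100)"],
    "CV(nfeat=10)": ["LR(nfeat=10,reg=0.1)", "LR(nfeat=10,reg=0.01)"],
    "CV(nfeat=100)": ["LR(nfeat=100,reg=0.1)", "LR(nfeat=100,reg=0.01)"],
}

def schedule(dag, root, parallelism):
    """Log batches in breadth-first order, at most `parallelism` nodes per batch."""
    parent_of = {c: p for p, children in dag.items() for c in children}
    queue = deque(dag.get(root, []))  # the root is transform-only, nothing to fit
    pending = {}                      # cached node -> children not yet fitted
    log = []
    while queue:
        batch = [queue.popleft() for _ in range(min(parallelism, len(queue)))]
        log.append(("fit", batch))
        for node in batch:
            children = dag.get(node, [])
            if children:
                # Keep this node's output around for its children.
                pending[node] = len(children)
                log.append(("cache", node))
                queue.extend(children)
            parent = parent_of[node]
            if parent in pending:
                pending[parent] -= 1
                if pending[parent] == 0:
                    log.append(("unpersist", parent))
    return log

log = schedule(dag, "Tokenizer", parallelism=2)
for event in log:
    print(event)
```

With parallelism set to 2, both CountVectorizer nodes are fitted first, then the LR models in two batches, and each cached result is released once its last child has been fitted.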
37. DBG / June 5, 2018 / © 2018 IBM Corporation
Evaluate models
Optimizing Tuning for Pipeline Models
• Evaluate models using similar method
• CountVectorizerModel is now a transformer
• Cache transform result
[Diagram: DAG of Tokenizer, CVModel (nfeat=10, nfeat=100) and LRModel (reg=0.1, reg=0.01) nodes, annotated "Cache result"]
38. DBG / June 5, 2018 / © 2018 IBM Corporation
Evaluate models
Optimizing Tuning for Pipeline Models
• Evaluate models using similar method
• CountVectorizerModel is now a transformer
• Cache transform result
[Diagram: DAG of Tokenizer, CVModel (nfeat=10, nfeat=100) and LRModel (reg=0.1, reg=0.01) nodes, annotated "Unpersist cached dataframe" and "Cache result"]
Metrics: 0.62 0.62
39. DBG / June 5, 2018 / © 2018 IBM Corporation
Evaluate models
Optimizing Tuning for Pipeline Models
• All models evaluated for this fold
[Diagram: DAG of Tokenizer, CVModel (nfeat=10, nfeat=100) and LRModel (reg=0.1, reg=0.01) nodes, annotated "Unpersist cached dataframe"]
Metrics: 0.62 0.62 0.72 0.66
40. DBG / June 5, 2018 / © 2018 IBM Corporation
Select best model
Optimizing Tuning for Pipeline Models
• Average the metrics from all folds and
select the best PipelineModel
[Diagram: DAG of Tokenizer, CVModel (nfeat=10, nfeat=100) and LRModel (reg=0.1, reg=0.01) nodes]
Avg Metrics: 0.64 0.64 0.71 0.65
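The selection step can be sketched as follows. The first row of metrics is the fold shown on the earlier slide; the second row is hypothetical, made up only so there is more than one fold to average over. The helper `select_best` and its `larger_is_better` flag are illustrative names, since whether a higher metric is better depends on the evaluator.

```python
from statistics import mean

fold_metrics = [
    [0.62, 0.62, 0.72, 0.66],  # fold shown on the slide
    [0.66, 0.66, 0.70, 0.64],  # hypothetical second fold, for illustration only
]

def select_best(fold_metrics, larger_is_better=True):
    """Average each candidate's metric over folds; return (best_index, averages)."""
    averages = [mean(column) for column in zip(*fold_metrics)]
    pick = max if larger_is_better else min
    best_index = pick(range(len(averages)), key=averages.__getitem__)
    return best_index, averages

best_index, averages = select_best(fold_metrics)
print(best_index, [round(a, 2) for a in averages])
```

The index returned identifies which path through the DAG becomes the final PipelineModel.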
41. DBG / June 5, 2018 / © 2018 IBM Corporation
Performance tests
Optimizing Tuning for Pipeline Models
• Compared to Standard Spark CV with
parallelism enabled
• Pipeline:
MinMaxScaler → PCA → LinearRegression
• Measure elapsed time for cross-validation
varying size of parameter grid from 36 to
80 models to evaluate
• Data size: 1,000,000
• Number features: 50
• Number partitions: 16
• Number CV folds: 4
• Parallelism: 3
• Standalone cluster with 30 cores
42. DBG / June 5, 2018 / © 2018 IBM Corporation
Results
Optimizing Tuning for Pipeline Models
• Up to 3.25x speedup
• Increases with more models …
• … and more complex pipelines
• Check out:
• https://github.com/BryanCutler/PipelineTuning
• Experimental!
• Watch SPARK-19071
[Chart: Elapsed time for DAG CV vs Simple Parallel CV; x-axis: # models (36, 48, 60, 80); y-axis ticks from 0 to 1100; series: Parallel, DAG Parallel]
43. DBG / June 5, 2018 / © 2018 IBM Corporation
Thank you!
codait.org
twitter.com/MLnick
github.com/MLnick
github.com/BryanCutler
developer.ibm.com/code
FfDL
Sign up for IBM Cloud and try Watson Studio!
https://datascience.ibm.com/
MAX
44. DBG / June 5, 2018 / © 2018 IBM Corporation
Date, Time, Location & Duration Session title and Speaker
Tue, June 5 | 11 AM
2010-2012, 30 mins
Productionizing Spark ML Pipelines with the Portable Format for Analytics
Nick Pentreath (IBM)
Tue, June 5 | 2 PM
2018, 30 mins
Making PySpark Amazing—From Faster UDFs to Dependency Management and Graphing!
Holden Karau (Google) Bryan Cutler (IBM)
Tue, June 5 | 2 PM
Nook by 2001, 30 mins
Making Data and AI Accessible for All
Armand Ruiz Gabernet (IBM)
Tue, June 5 | 2:40 PM
2002-2004, 30 mins
Cognitive Database: An Apache Spark-Based AI-Enabled Relational Database System
Rajesh Bordawekar (IBM T.J. Watson Research Center)
Tue, June 5 | 3:20 PM
3016-3022, 30 mins
Dynamic Priorities for Apache Spark Application’s Resource Allocations
Michael Feiman (IBM Spectrum Computing) Shinnosuke Okada (IBM Canada Ltd.)
Tue, June 5 | 3:20 PM
2001-2005, 30 mins
Model Parallelism in Spark ML Cross-Validation
Nick Pentreath (IBM) Bryan Cutler (IBM)
Tue, June 5 | 3:20 PM
2007, 30 mins
Serverless Machine Learning on Modern Hardware Using Apache Spark
Patrick Stuedi (IBM)
Tue, June 5 | 5:40 PM
2002-2004, 30 mins
Create a Loyal Customer Base by Knowing Their Personality Using AI-Based Personality Recommendation Engine
Sourav Mazumder (IBM Analytics) Aradhna Tiwari (University of South Florida)
Tue, June 5 | 5:40 PM
2007, 30 mins
Transparent GPU Exploitation on Apache Spark
Dr. Kazuaki Ishizaki (IBM) Madhusudanan Kandasamy (IBM)
Tue, June 5 | 5:40 PM
2009-2011, 30 mins
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for Deep Neural Networks
Yonggang Hu (IBM) Chao Xue (IBM)
IBM Sessions at Spark+AI Summit 2018 (Tuesday, June 5)
45. DBG / June 5, 2018 / © 2018 IBM Corporation
Date, Time, Location & Duration Session title and Speaker
Wed, June 6 | 12:50 PM Birds of a Feather: Apache Arrow in Spark and More
Bryan Cutler (IBM) Li Jin (Two Sigma Investments, LP)
Wed, June 6 | 2 PM
2002-2004, 30 mins
Deep Learning for Recommender Systems
Nick Pentreath (IBM)
Wed, June 6 | 3:20 PM
2018, 30 mins
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer
Frederick Reiss (IBM) Vijay Bommireddipalli (IBM Center for Open-Source Data & AI Technologies)
IBM Sessions at Spark+AI Summit 2018 (Wednesday, June 6)
Meet us at IBM booth in the Expo area.