SlideShare a Scribd company logo
1 of 46
Download to read offline
DBG / June 5, 2018 / © 2018 IBM Corporation
Model Parallelism in
Spark ML 

Cross-validation
Nick Pentreath
Principal Engineer
Bryan Cutler
Software Engineer
DBG / June 5, 2018 / © 2018 IBM Corporation
About Nick
@MLnick on Twitter & Github
Principal Engineer, IBM
CODAIT - Center for Open-Source Data & AI
Technologies
Machine Learning & AI
Apache Spark committer & PMC
Author of Machine Learning with Spark
Various conferences & meetups
DBG / June 5, 2018 / © 2018 IBM Corporation
About Bryan
Software Engineer, IBM CODAIT
Apache Spark committer
Apache Arrow committer
Python, Machine Learning OSS
@BryanCutler on Github
DBG / June 5, 2018 / © 2018 IBM Corporation
Center for Open Source Data and AI Technologies
CODAIT
codait.org
CODAIT aims to make AI solutions
dramatically easier to create, deploy,
and manage in the enterprise
Relaunch of the Spark Technology
Center (STC) to reflect expanded
mission
Improving Enterprise AI Lifecycle in Open Source
DBG / June 5, 2018 / © 2018 IBM Corporation
Agenda
Model Tuning in Spark
Scaling Model Tuning
Performance Results
Best Practices
Future Directions in Optimizing
Pipelines
DBG / June 5, 2018 / © 2018 IBM Corporation
Model Tuning in Spark
DBG / June 5, 2018 / © 2018 IBM Corporation
Model selection: workflow within a workflow
Model Tuning in Spark
Ingest
Data
Processing
Feature
Engineering
Model
Selection
Final Model
Candidate
models
Train
Evaluate
Adjust
DBG / June 5, 2018 / © 2018 IBM Corporation
Pipeline cross-validation
Model Tuning in Spark
Tokenizer CountVectorizer LogisticRegression
Spark ML Pipeline
# features:
10
# features:
100
regParam:
0.001
regParam:
0.1
Parameters
DBG / June 5, 2018 / © 2018 IBM Corporation
Pipeline cross-validation
Model Tuning in Spark
Tokenizer
CountVectorizer
# features: 10
LogisticRegression
regParam: 0.001
# features:
10
Tokenizer
CountVectorizer
# features: 10
LogisticRegression
regParam: 0.1
Tokenizer
CountVectorizer
# features: 100
LogisticRegression
regParam: 0.001
Tokenizer
CountVectorizer
# features: 100
LogisticRegression
regParam: 0.1
# features:
100
regParam:
0.001
regParam:
0.1
DBG / June 5, 2018 / © 2018 IBM Corporation
Pipeline cross-validation
Model Tuning in Spark
# features:
10
# features:
100
regParam:
0.001
regParam:
0.1
Tokenizer CountVectorizer LogisticRegression
DBG / June 5, 2018 / © 2018 IBM Corporation
Pipeline cross-validation
Model Tuning in Spark
DBG / June 5, 2018 / © 2018 IBM Corporation
Pipeline cross-validation
Model Tuning in Spark
DBG / June 5, 2018 / © 2018 IBM Corporation
Pipeline cross-validation
Model Tuning in Spark
DBG / June 5, 2018 / © 2018 IBM Corporation
Cross-validation is expensive!
Model Tuning in Spark
• 5 x 5 x 5 hyperparameters = 125 pipelines
• ... across 4 machine learning models = 500
• If training & evaluation does not fully utilize
available cluster resources then that waste is
compounded for each model
Based on XKCD comic: https://xkcd.com/303/
& https://github.com/mislavcimpersak/xkcd-excuse-generator
DBG / June 5, 2018 / © 2018 IBM Corporation
Scaling Model Tuning
DBG / June 5, 2018 / © 2018 IBM Corporation
Parallel model evaluation
Scaling Model Tuning
Tokenizer
CountVectorizer
# features: 10
LogisticRegression
regParam: 0.001
# features:
10
Tokenizer
CountVectorizer
# features: 10
LogisticRegression
regParam: 0.1
Tokenizer
CountVectorizer
# features: 100
LogisticRegression
regParam: 0.001
Tokenizer
CountVectorizer
# features: 100
LogisticRegression
regParam: 0.1
# features:
100
regParam:
0.001
regParam:
0.1
DBG / June 5, 2018 / © 2018 IBM Corporation
Parallel model evaluation
Scaling Model Tuning
Tokenizer
CountVectorizer
# features: 10
LogisticRegression
regParam: 0.001
# features:
10
Tokenizer
CountVectorizer
# features: 10
LogisticRegression
regParam: 0.1
Tokenizer
CountVectorizer
# features: 100
LogisticRegression
regParam: 0.001
Tokenizer
CountVectorizer
# features: 100
LogisticRegression
regParam: 0.1
# features:
100
regParam:
0.001
regParam:
0.1
DBG / June 5, 2018 / © 2018 IBM Corporation
Parallel model evaluation
Scaling Model Tuning
Tokenizer
CountVectorizer
# features: 10
LogisticRegression
regParam: 0.001
# features:
10
Tokenizer
CountVectorizer
# features: 10
LogisticRegression
regParam: 0.1
Tokenizer
CountVectorizer
# features: 100
LogisticRegression
regParam: 0.001
Tokenizer
CountVectorizer
# features: 100
LogisticRegression
regParam: 0.1
# features:
100
regParam:
0.001
regParam:
0.1
DBG / June 5, 2018 / © 2018 IBM Corporation
Parallel model evaluation
Scaling Model Tuning
Tokenizer
CountVectorizer
# features: 10
LogisticRegression
regParam: 0.001
# features:
10
Tokenizer
CountVectorizer
# features: 10
LogisticRegression
regParam: 0.1
Tokenizer
CountVectorizer
# features: 100
LogisticRegression
regParam: 0.001
Tokenizer
CountVectorizer
# features: 100
LogisticRegression
regParam: 0.1
# features:
100
regParam:
0.001
regParam:
0.1
DBG / June 5, 2018 / © 2018 IBM Corporation
Parallel model evaluation
Scaling Model Tuning
• Added in SPARK-19357 and SPARK-21911
(PySpark)
• Parallelism parameter governs the
maximum # models to be trained at once
DBG / June 5, 2018 / © 2018 IBM Corporation
Parallel model evaluation
Scaling Model Tuning
# features:
10
# features:
100
regParam:
0.001
regParam:
0.1
Tokenizer CountVectorizer LogisticRegression
DBG / June 5, 2018 / © 2018 IBM Corporation
Parallel model evaluation
Scaling Model Tuning
DBG / June 5, 2018 / © 2018 IBM Corporation
Parallel model evaluation
Scaling Model Tuning
DBG / June 5, 2018 / © 2018 IBM Corporation
Parallel model evaluation
Scaling Model Tuning
DBG / June 5, 2018 / © 2018 IBM Corporation
Implementation considerations
Scaling Model Tuning
• Parallelism parameter sets the size of
threadpool under the hood
• Dedicated ExecutionContext created to
avoid deadlocks with using the default
threadpool
• Used Futures instead of parallel
collections – more flexible
• Model-specific parallel fitting
implementations not supported
• SPARK-22126
DBG / June 5, 2018 / © 2018 IBM Corporation
Performance tests
Scaling Model Tuning
• Compared parallel CV to serial CV with
varying number of samples
• Simple LogisticRegression with regParam
and fitIntercept; parameter grid size 12
• Measure elapsed time for cross-validation
• Data size: 100,000 -> 5,000,000
• Number features: 10
• Number partitions: 10
• Number CV folds: 5
• Parallelism: 3
• Standalone cluster with 30 cores
DBG / June 5, 2018 / © 2018 IBM Corporation
Results
Scaling Model Tuning
• ±2.4x speedup
• Stays roughly constant as #
samples increases
DBG / June 5, 2018 / © 2018 IBM Corporation
Best practices
Scaling Model Tuning
• Simple integer parameter is the only thing
you can set (for now)
• Too low => under-utilize resources
• Too high => could lead to memory issues or
overloading cluster
• Rough rule: # cores / # partitions
• But depends on data and model sizes
• Mid-sized cluster probably <= 10
DBG / June 5, 2018 / © 2018 IBM Corporation
Optimizing Tuning for
Pipeline Models
DBG / June 5, 2018 / © 2018 IBM Corporation
Challenges
Optimizing Tuning for Pipeline Models
• Multi-stage, complex pipelines
• Parameter grid with hyperparameters from
different stages
• Easy to have huge number of candidate
parameter combinations
• Model parallelism helps, but can we do
better?
DBG / June 5, 2018 / © 2018 IBM Corporation
Duplicating work
Optimizing Tuning for Pipeline Models
• Each Pipeline treated
independently
• Depending on parameter grid
and pipeline stages
• Fit the same model multiple
times
• Perform same transformations
multiple times
DBG / June 5, 2018 / © 2018 IBM Corporation
Optimize with a DAG
Optimizing Tuning for Pipeline Models
• A node is an estimator/transformer with a
set of hyperparameters
• A path in the graph is a single pipeline
model
Tokenizer
Count
Vectorizer
nfeat=10
Count
Vectorizer
nfeat=100
LR
reg=0.1
LR
reg=0.01
LR
reg=0.1
LR
reg=0.01
DBG / June 5, 2018 / © 2018 IBM Corporation
Parallelize in breadth-first order
Optimizing Tuning for Pipeline Models
• Example with parallelism parameter set to
2
• Tokenizer is only a transform, proceed to fit
CountVectorizer nodes
Tokenizer
Count
Vectorizer
nfeat=10
Count
Vectorizer
nfeat=100
LR
reg=0.1
LR
reg=0.01
LR
reg=0.1
LR
reg=0.01
DBG / June 5, 2018 / © 2018 IBM Corporation
Fit estimators
Optimizing Tuning for Pipeline Models
• Cache the result and proceed to fit the first
2 LogisticRegression models Tokenizer
Count
Vectorizer
nfeat=10
Count
Vectorizer
nfeat=100
LR
reg=0.1
LR
reg=0.01
LR
reg=0.1
LR
reg=0.01
Cache result
DBG / June 5, 2018 / © 2018 IBM Corporation
Fit estimators
Optimizing Tuning for Pipeline Models
• Unpersist when child tasks done
• Fit final 2 LR models Tokenizer
Count
Vectorizer
nfeat=10
Count
Vectorizer
nfeat=100
LR
reg=0.1
LR
reg=0.01
LR
reg=0.1
LR
reg=0.01
Unpersist
cached
dataframe
Cache
result
DBG / June 5, 2018 / © 2018 IBM Corporation
Fit estimators
Optimizing Tuning for Pipeline Models
• All 4 LR models fitted
Tokenizer
Count
Vectorizer
nfeat=10
Count
Vectorizer
nfeat=100
LR
reg=0.1
LR
reg=0.01
LR
reg=0.1
LR
reg=0.01
Unpersist
cached
dataframe
DBG / June 5, 2018 / © 2018 IBM Corporation
Evaluate models
Optimizing Tuning for Pipeline Models
• Evaluate models using similar method
• CountVectorizerModel is now a transformer
• Cache transform result
Tokenizer
CVModel
nfeat=10
CVModel
nfeat=100
LRModel
reg=0.1
LRModel
reg=0.01
LRModel
reg=0.1
LRModel
reg=0.01
Cache result
DBG / June 5, 2018 / © 2018 IBM Corporation
Evaluate models
Optimizing Tuning for Pipeline Models
• Evaluate models using similar method
• CountVectorizerModel is now a transformer
• Cache transform result
Tokenizer
CVModel
nfeat=10
CVModel
nfeat=100
LRModel
reg=0.1
LRModel
reg=0.01
LRModel
reg=0.1
LRModel
reg=0.01
Unpersist
cached
dataframe
Cache
result
Metrics: 0.62 0.62
DBG / June 5, 2018 / © 2018 IBM Corporation
Evaluate models
Optimizing Tuning for Pipeline Models
• All models evaluated for this fold
Tokenizer
CVModel
nfeat=10
CVModel
nfeat=100
LRModel
reg=0.1
LRModel
reg=0.01
LRModel
reg=0.1
LRModel
reg=0.01
Unpersist
cached
dataframe
Metrics: 0.62 0.62 0.72 0.66
DBG / June 5, 2018 / © 2018 IBM Corporation
Select best model
Optimizing Tuning for Pipeline Models
• Average the metrics from all folds and
select the best PipelineModel Tokenizer
CVModel
nfeat=10
CVModel
nfeat=100
LRModel
reg=0.1
LRModel
reg=0.01
LRModel
reg=0.1
LRModel
reg=0.01
Avg
Metrics:
0.64 0.64 0.71 0.65
DBG / June 5, 2018 / © 2018 IBM Corporation
Performance tests
Optimizing Tuning for Pipeline Models
• Compared to Standard Spark CV with
parallelism enabled
• Pipeline:

MinMaxScaler → PCA → LinearRegression

• Measure elapsed time for cross-validation
varying size of parameter grid from 36 to
80 models to evaluate
• Data size: 1,000,000
• Number features: 50
• Number partitions: 16
• Number CV folds: 4
• Parallelism: 3
• Standalone cluster with 30 cores
DBG / June 5, 2018 / © 2018 IBM Corporation
Results
Optimizing Tuning for Pipeline Models
• Up to 3.25x speedup
• Increases with more models …
• … and more complex pipelines
• Check out:
• https://github.com/BryanCutler/PipelineTuning
• Experimental!
• Watch SPARK-19071
Elapsed time for DAG CV vs Simple Parallel CV
0
275
550
825
1100
# models
36 48 60 80
Parallel DAG Parallel
DBG / June 5, 2018 / © 2018 IBM Corporation
Thank you!
codait.org
twitter.com/MLnick
github.com/MLnick
github.com/BryanCutler
developer.ibm.com/code
FfDL
Sign up for IBM Cloud and try Watson Studio!
https://datascience.ibm.com/
MAX
DBG / June 5, 2018 / © 2018 IBM Corporation
Date, Time, Location & Duration Session title and Speaker
Tue, June 5 | 11 AM
2010-2012, 30 mins
Productionizing Spark ML Pipelines with the Portable Format for Analytics
Nick Pentreath (IBM)
Tue, June 5 | 2 PM
2018, 30 mins
Making PySpark Amazing—From Faster UDFs to Dependency Management and Graphing!
Holden Karau (Google) Bryan Cutler (IBM)
Tue, June 5 | 2 PM
Nook by 2001, 30 mins
Making Data and AI Accessible for All
Armand Ruiz Gabernet (IBM)
Tue, June 5 | 2:40 PM
2002-2004, 30 mins
Cognitive Database: An Apache Spark-Based AI-Enabled Relational Database System
Rajesh Bordawekar (IBM T.J. Watson Research Center)
Tue, June 5 | 3:20 PM
3016-3022, 30 mins
Dynamic Priorities for Apache Spark Application’s Resource Allocations
Michael Feiman (IBM Spectrum Computing) Shinnosuke Okada (IBM Canada Ltd.)
Tue, June 5 | 3:20 PM
2001-2005, 30 mins
Model Parallelism in Spark ML Cross-Validation
Nick Pentreath (IBM) Bryan Cutler (IBM)
Tue, June 5 | 3:20 PM
2007, 30 mins
Serverless Machine Learning on Modern Hardware Using Apache Spark
Patrick Stuedi (IBM)
Tue, June 5 | 5:40 PM
2002-2004, 30 mins
Create a Loyal Customer Base by Knowing Their Personality Using AI-Based Personality Recommendation Engine;
Sourav Mazumder (IBM Analytics) Aradhna Tiwari (University of South Florida)
Tue, June 5 | 5:40 PM
2007, 30 mins
Transparent GPU Exploitation on Apache Spark
Dr. Kazuaki Ishizaki (IBM) Madhusudanan Kandasamy (IBM)
Tue, June 5 | 5:40 PM
2009-2011, 30 mins
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for Deep Neural Networks
Yonggang Hu (IBM) Chao Xue (IBM)
IBM Sessions at Spark+AI Summit 2018 (Tuesday, June 5)
DBG / June 5, 2018 / © 2018 IBM Corporation
Date, Time, Location & Duration Session title and Speaker
Wed, June 6 | 12:50 PM Birds of a Feather: Apache Arrow in Spark and More
Bryan Cutler (IBM) Li Jin (Two Sigma Investments, LP)
Wed, June 6 | 2 PM
2002-2004, 30 mins
Deep Learning for Recommender Systems
Nick Pentreath (IBM) )
Wed, June 6 | 3:20 PM
2018, 30 mins
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer
Frederick Reiss (IBM) Vijay Bommireddipalli (IBM Center for Open-Source Data & AI Technologies)
IBM Sessions at Spark+AI Summit 2018 (Wednesday, June 6)
Meet us at IBM booth in the Expo area.
DBG / June 5, 2018 / © 2018 IBM Corporation

More Related Content

What's hot

Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
Dvir Volk
 

What's hot (20)

More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
 
NGINX High-performance Caching
NGINX High-performance CachingNGINX High-performance Caching
NGINX High-performance Caching
 
Scaling your logging infrastructure using syslog-ng
Scaling your logging infrastructure using syslog-ngScaling your logging infrastructure using syslog-ng
Scaling your logging infrastructure using syslog-ng
 
Bootstrapping state in Apache Flink
Bootstrapping state in Apache FlinkBootstrapping state in Apache Flink
Bootstrapping state in Apache Flink
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Practical Partitioning in Production with Postgres
Practical Partitioning in Production with PostgresPractical Partitioning in Production with Postgres
Practical Partitioning in Production with Postgres
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 
High Availability PostgreSQL with Zalando Patroni
High Availability PostgreSQL with Zalando PatroniHigh Availability PostgreSQL with Zalando Patroni
High Availability PostgreSQL with Zalando Patroni
 
Understanding MySQL Performance through Benchmarking
Understanding MySQL Performance through BenchmarkingUnderstanding MySQL Performance through Benchmarking
Understanding MySQL Performance through Benchmarking
 
MariaDB Server Performance Tuning & Optimization
MariaDB Server Performance Tuning & OptimizationMariaDB Server Performance Tuning & Optimization
MariaDB Server Performance Tuning & Optimization
 
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
 
Deep Dive into Amazon ElastiCache Architecture and Design Patterns (DAT307) |...
Deep Dive into Amazon ElastiCache Architecture and Design Patterns (DAT307) |...Deep Dive into Amazon ElastiCache Architecture and Design Patterns (DAT307) |...
Deep Dive into Amazon ElastiCache Architecture and Design Patterns (DAT307) |...
 
Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0
 
Clone Oracle Databases In Minutes Without Risk Using Enterprise Manager 13c
Clone Oracle Databases In Minutes Without Risk Using Enterprise Manager 13cClone Oracle Databases In Minutes Without Risk Using Enterprise Manager 13c
Clone Oracle Databases In Minutes Without Risk Using Enterprise Manager 13c
 
Pentesting Modern Web Apps: A Primer
Pentesting Modern Web Apps: A PrimerPentesting Modern Web Apps: A Primer
Pentesting Modern Web Apps: A Primer
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Practical Memory Tuning for PostgreSQL
Practical Memory Tuning for PostgreSQLPractical Memory Tuning for PostgreSQL
Practical Memory Tuning for PostgreSQL
 
OLTP+OLAP=HTAP
 OLTP+OLAP=HTAP OLTP+OLAP=HTAP
OLTP+OLAP=HTAP
 
High Performance Data Lake with Apache Hudi and Alluxio at T3Go
High Performance Data Lake with Apache Hudi and Alluxio at T3GoHigh Performance Data Lake with Apache Hudi and Alluxio at T3Go
High Performance Data Lake with Apache Hudi and Alluxio at T3Go
 

Similar to Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan Cutler

Index conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreathIndex conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreath
Chester Chen
 

Similar to Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan Cutler (20)

Productionizing Spark ML Pipelines with the Portable Format for Analytics
Productionizing Spark ML Pipelines with the Portable Format for AnalyticsProductionizing Spark ML Pipelines with the Portable Format for Analytics
Productionizing Spark ML Pipelines with the Portable Format for Analytics
 
Productionizing Spark ML Pipelines with the Portable Format for Analytics wit...
Productionizing Spark ML Pipelines with the Portable Format for Analytics wit...Productionizing Spark ML Pipelines with the Portable Format for Analytics wit...
Productionizing Spark ML Pipelines with the Portable Format for Analytics wit...
 
Optimizing your SparkML pipelines using the latest features in Spark 2.3
Optimizing your SparkML pipelines using the latest features in Spark 2.3Optimizing your SparkML pipelines using the latest features in Spark 2.3
Optimizing your SparkML pipelines using the latest features in Spark 2.3
 
ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...
ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...
ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...
 
Index conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreathIndex conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreath
 
Productionizing Spark ML pipelines with the portable format for analytics
Productionizing Spark ML pipelines with the portable format for analyticsProductionizing Spark ML pipelines with the portable format for analytics
Productionizing Spark ML pipelines with the portable format for analytics
 
SigOpt for Hedge Funds
SigOpt for Hedge FundsSigOpt for Hedge Funds
SigOpt for Hedge Funds
 
SigOpt for Machine Learning and AI
SigOpt for Machine Learning and AISigOpt for Machine Learning and AI
SigOpt for Machine Learning and AI
 
Search and Recommendations: 3 Sides of the Same Coin
Search and Recommendations: 3 Sides of the Same CoinSearch and Recommendations: 3 Sides of the Same Coin
Search and Recommendations: 3 Sides of the Same Coin
 
Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)
 
Serverless AI with Scikit-Learn (GPSWS405) - AWS re:Invent 2018
Serverless AI with Scikit-Learn (GPSWS405) - AWS re:Invent 2018Serverless AI with Scikit-Learn (GPSWS405) - AWS re:Invent 2018
Serverless AI with Scikit-Learn (GPSWS405) - AWS re:Invent 2018
 
Past Experiences and Future Challenges using Automatic Performance Modelling ...
Past Experiences and Future Challenges using Automatic Performance Modelling ...Past Experiences and Future Challenges using Automatic Performance Modelling ...
Past Experiences and Future Challenges using Automatic Performance Modelling ...
 
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
 
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
 
Make your PySpark Data Fly with Arrow!
Make your PySpark Data Fly with Arrow!Make your PySpark Data Fly with Arrow!
Make your PySpark Data Fly with Arrow!
 
Big Data Helsinki v 3 | "Distributed Machine and Deep Learning at Scale with ...
Big Data Helsinki v 3 | "Distributed Machine and Deep Learning at Scale with ...Big Data Helsinki v 3 | "Distributed Machine and Deep Learning at Scale with ...
Big Data Helsinki v 3 | "Distributed Machine and Deep Learning at Scale with ...
 
Airline reservations and routing: a graph use case
Airline reservations and routing: a graph use caseAirline reservations and routing: a graph use case
Airline reservations and routing: a graph use case
 
LLMOps for Your Data: Best Practices to Ensure Safety, Quality, and Cost
LLMOps for Your Data: Best Practices to Ensure Safety, Quality, and CostLLMOps for Your Data: Best Practices to Ensure Safety, Quality, and Cost
LLMOps for Your Data: Best Practices to Ensure Safety, Quality, and Cost
 
Post compiler software optimization for reducing energy
Post compiler software optimization for reducing energyPost compiler software optimization for reducing energy
Post compiler software optimization for reducing energy
 
Developer insight into why applications run amazingly Fast in CF 2018
Developer insight into why applications run amazingly Fast in CF 2018Developer insight into why applications run amazingly Fast in CF 2018
Developer insight into why applications run amazingly Fast in CF 2018
 

More from Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
shivangimorya083
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
shambhavirathore45
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 

Recently uploaded (20)

Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 

Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan Cutler

  • 1. DBG / June 5, 2018 / © 2018 IBM Corporation Model Parallelism in Spark ML 
 Cross-validation Nick Pentreath Principal Engineer Bryan Cutler Software Engineer
  • 2. DBG / June 5, 2018 / © 2018 IBM Corporation About Nick @MLnick on Twitter & Github Principal Engineer, IBM CODAIT - Center for Open-Source Data & AI Technologies Machine Learning & AI Apache Spark committer & PMC Author of Machine Learning with Spark Various conferences & meetups
  • 3. DBG / June 5, 2018 / © 2018 IBM Corporation About Bryan Software Engineer, IBM CODAIT Apache Spark committer Apache Arrow committer Python, Machine Learning OSS @BryanCutler on Github
  • 4. DBG / June 5, 2018 / © 2018 IBM Corporation Center for Open Source Data and AI Technologies CODAIT codait.org CODAIT aims to make AI solutions dramatically easier to create, deploy, and manage in the enterprise Relaunch of the Spark Technology Center (STC) to reflect expanded mission Improving Enterprise AI Lifecycle in Open Source
  • 5. DBG / June 5, 2018 / © 2018 IBM Corporation Agenda Model Tuning in Spark Scaling Model Tuning Performance Results Best Practices Future Directions in Optimizing Pipelines
  • 6. DBG / June 5, 2018 / © 2018 IBM Corporation Model Tuning in Spark
  • 7. DBG / June 5, 2018 / © 2018 IBM Corporation Model selection: workflow within a workflow Model Tuning in Spark Ingest Data Processing Feature Engineering Model Selection Final Model Candidate models Train Evaluate Adjust
  • 8. DBG / June 5, 2018 / © 2018 IBM Corporation Pipeline cross-validation Model Tuning in Spark Tokenizer CountVectorizer LogisticRegression Spark ML Pipeline # features: 10 # features: 100 regParam: 0.001 regParam: 0.1 Parameters
  • 9. DBG / June 5, 2018 / © 2018 IBM Corporation Pipeline cross-validation Model Tuning in Spark Tokenizer CountVectorizer # features: 10 LogisticRegression regParam: 0.001 # features: 10 Tokenizer CountVectorizer # features: 10 LogisticRegression regParam: 0.1 Tokenizer CountVectorizer # features: 100 LogisticRegression regParam: 0.001 Tokenizer CountVectorizer # features: 100 LogisticRegression regParam: 0.1 # features: 100 regParam: 0.001 regParam: 0.1
  • 10. DBG / June 5, 2018 / © 2018 IBM Corporation Pipeline cross-validation Model Tuning in Spark # features: 10 # features: 100 regParam: 0.001 regParam: 0.1 Tokenizer CountVectorizer LogisticRegression
  • 11. DBG / June 5, 2018 / © 2018 IBM Corporation Pipeline cross-validation Model Tuning in Spark
  • 12. DBG / June 5, 2018 / © 2018 IBM Corporation Pipeline cross-validation Model Tuning in Spark
  • 13. DBG / June 5, 2018 / © 2018 IBM Corporation Pipeline cross-validation Model Tuning in Spark
  • 14. DBG / June 5, 2018 / © 2018 IBM Corporation Cross-validation is expensive! Model Tuning in Spark • 5 x 5 x 5 hyperparameters = 125 pipelines • ... across 4 machine learning models = 500 • If training & evaluation does not fully utilize available cluster resources then that waste is compounded for each model Based on XKCD comic: https://xkcd.com/303/ & https://github.com/mislavcimpersak/xkcd-excuse-generator
  • 15. DBG / June 5, 2018 / © 2018 IBM Corporation Scaling Model Tuning
  • 16. DBG / June 5, 2018 / © 2018 IBM Corporation Parallel model evaluation Scaling Model Tuning Tokenizer CountVectorizer # features: 10 LogisticRegression regParam: 0.001 # features: 10 Tokenizer CountVectorizer # features: 10 LogisticRegression regParam: 0.1 Tokenizer CountVectorizer # features: 100 LogisticRegression regParam: 0.001 Tokenizer CountVectorizer # features: 100 LogisticRegression regParam: 0.1 # features: 100 regParam: 0.001 regParam: 0.1
  • 17. DBG / June 5, 2018 / © 2018 IBM Corporation Parallel model evaluation Scaling Model Tuning Tokenizer CountVectorizer # features: 10 LogisticRegression regParam: 0.001 # features: 10 Tokenizer CountVectorizer # features: 10 LogisticRegression regParam: 0.1 Tokenizer CountVectorizer # features: 100 LogisticRegression regParam: 0.001 Tokenizer CountVectorizer # features: 100 LogisticRegression regParam: 0.1 # features: 100 regParam: 0.001 regParam: 0.1
  • 18. DBG / June 5, 2018 / © 2018 IBM Corporation Parallel model evaluation Scaling Model Tuning Tokenizer CountVectorizer # features: 10 LogisticRegression regParam: 0.001 # features: 10 Tokenizer CountVectorizer # features: 10 LogisticRegression regParam: 0.1 Tokenizer CountVectorizer # features: 100 LogisticRegression regParam: 0.001 Tokenizer CountVectorizer # features: 100 LogisticRegression regParam: 0.1 # features: 100 regParam: 0.001 regParam: 0.1
  • 19. DBG / June 5, 2018 / © 2018 IBM Corporation Parallel model evaluation Scaling Model Tuning Tokenizer CountVectorizer # features: 10 LogisticRegression regParam: 0.001 # features: 10 Tokenizer CountVectorizer # features: 10 LogisticRegression regParam: 0.1 Tokenizer CountVectorizer # features: 100 LogisticRegression regParam: 0.001 Tokenizer CountVectorizer # features: 100 LogisticRegression regParam: 0.1 # features: 100 regParam: 0.001 regParam: 0.1
  • 20. DBG / June 5, 2018 / © 2018 IBM Corporation Parallel model evaluation Scaling Model Tuning • Added in SPARK-19357 and SPARK-21911 (PySpark) • Parallelism parameter governs the maximum # models to be trained at once
  • 21. DBG / June 5, 2018 / © 2018 IBM Corporation Parallel model evaluation Scaling Model Tuning # features: 10 # features: 100 regParam: 0.001 regParam: 0.1 Tokenizer CountVectorizer LogisticRegression
  • 22. DBG / June 5, 2018 / © 2018 IBM Corporation Parallel model evaluation Scaling Model Tuning
  • 23. DBG / June 5, 2018 / © 2018 IBM Corporation Parallel model evaluation Scaling Model Tuning
  • 24. DBG / June 5, 2018 / © 2018 IBM Corporation Parallel model evaluation Scaling Model Tuning
  • 25. DBG / June 5, 2018 / © 2018 IBM Corporation Implementation considerations Scaling Model Tuning • Parallelism parameter sets the size of threadpool under the hood • Dedicated ExecutionContext created to avoid deadlocks with using the default threadpool • Used Futures instead of parallel collections – more flexible • Model-specific parallel fitting implementations not supported • SPARK-22126
  • 26. DBG / June 5, 2018 / © 2018 IBM Corporation Performance tests Scaling Model Tuning • Compared parallel CV to serial CV with varying number of samples • Simple LogisticRegression with regParam and fitIntercept; parameter grid size 12 • Measure elapsed time for cross-validation • Data size: 100,000 -> 5,000,000 • Number features: 10 • Number partitions: 10 • Number CV folds: 5 • Parallelism: 3 • Standalone cluster with 30 cores
  • 27. DBG / June 5, 2018 / © 2018 IBM Corporation Results Scaling Model Tuning • ±2.4x speedup • Stays roughly constant as # samples increases
  • 28. DBG / June 5, 2018 / © 2018 IBM Corporation Best practices Scaling Model Tuning • Simple integer parameter is the only thing you can set (for now) • Too low => under-utilize resources • Too high => could lead to memory issues or overloading cluster • Rough rule: # cores / # partitions • But depends on data and model sizes • Mid-sized cluster probably <= 10
  • 29. DBG / June 5, 2018 / © 2018 IBM Corporation Optimizing Tuning for Pipeline Models
  • 30. DBG / June 5, 2018 / © 2018 IBM Corporation Challenges Optimizing Tuning for Pipeline Models • Multi-stage, complex pipelines • Parameter grid with hyperparameters from different stages • Easy to have huge number of candidate parameter combinations • Model parallelism helps, but can we do better?
  • 31. DBG / June 5, 2018 / © 2018 IBM Corporation Duplicating work Optimizing Tuning for Pipeline Models • Each Pipeline treated independently • Depending on parameter grid and pipeline stages • Fit the same model multiple times • Perform same transformations multiple times
  • 32. DBG / June 5, 2018 / © 2018 IBM Corporation Optimize with a DAG Optimizing Tuning for Pipeline Models • A node is an estimator/transformer with a set of hyperparameters • A path in the graph is a single pipeline model Tokenizer Count Vectorizer nfeat=10 Count Vectorizer nfeat=100 LR reg=0.1 LR reg=0.01 LR reg=0.1 LR reg=0.01
  • 33. DBG / June 5, 2018 / © 2018 IBM Corporation Parallelize in breadth-first order Optimizing Tuning for Pipeline Models • Example with parallelism parameter set to 2 • Tokenizer is only a transform, proceed to fit CountVectorizer nodes Tokenizer Count Vectorizer nfeat=10 Count Vectorizer nfeat=100 LR reg=0.1 LR reg=0.01 LR reg=0.1 LR reg=0.01
  • 34. DBG / June 5, 2018 / © 2018 IBM Corporation Fit estimators Optimizing Tuning for Pipeline Models • Cache the result and proceed to fit the first 2 LogisticRegression models Tokenizer Count Vectorizer nfeat=10 Count Vectorizer nfeat=100 LR reg=0.1 LR reg=0.01 LR reg=0.1 LR reg=0.01 Cache result
  • 35. DBG / June 5, 2018 / © 2018 IBM Corporation Fit estimators Optimizing Tuning for Pipeline Models • Unpersist when child tasks done • Fit final 2 LR models Tokenizer Count Vectorizer nfeat=10 Count Vectorizer nfeat=100 LR reg=0.1 LR reg=0.01 LR reg=0.1 LR reg=0.01 Unpersist cached dataframe Cache result
  • 36. DBG / June 5, 2018 / © 2018 IBM Corporation Fit estimators Optimizing Tuning for Pipeline Models • All 4 LR models fitted Tokenizer Count Vectorizer nfeat=10 Count Vectorizer nfeat=100 LR reg=0.1 LR reg=0.01 LR reg=0.1 LR reg=0.01 Unpersist cached dataframe
  • 37. DBG / June 5, 2018 / © 2018 IBM Corporation Evaluate models Optimizing Tuning for Pipeline Models • Evaluate models using similar method • CountVectorizerModel is now a transformer • Cache transform result Tokenizer CVModel nfeat=10 CVModel nfeat=100 LRModel reg=0.1 LRModel reg=0.01 LRModel reg=0.1 LRModel reg=0.01 Cache result
  • 38. DBG / June 5, 2018 / © 2018 IBM Corporation Evaluate models Optimizing Tuning for Pipeline Models • Evaluate models using similar method • CountVectorizerModel is now a transformer • Cache transform result Tokenizer CVModel nfeat=10 CVModel nfeat=100 LRModel reg=0.1 LRModel reg=0.01 LRModel reg=0.1 LRModel reg=0.01 Unpersist cached dataframe Cache result Metrics: 0.62 0.62
  • 39. DBG / June 5, 2018 / © 2018 IBM Corporation Evaluate models Optimizing Tuning for Pipeline Models • All models evaluated for this fold Tokenizer CVModel nfeat=10 CVModel nfeat=100 LRModel reg=0.1 LRModel reg=0.01 LRModel reg=0.1 LRModel reg=0.01 Unpersist cached dataframe Metrics: 0.62 0.62 0.72 0.66
  • 40. DBG / June 5, 2018 / © 2018 IBM Corporation Select best model Optimizing Tuning for Pipeline Models • Average the metrics from all folds and select the best PipelineModel Tokenizer CVModel nfeat=10 CVModel nfeat=100 LRModel reg=0.1 LRModel reg=0.01 LRModel reg=0.1 LRModel reg=0.01 Avg Metrics: 0.64 0.64 0.71 0.65
  • 41. DBG / June 5, 2018 / © 2018 IBM Corporation Performance tests Optimizing Tuning for Pipeline Models • Compared to Standard Spark CV with parallelism enabled • Pipeline:
 MinMaxScaler → PCA → LinearRegression
 • Measure elapsed time for cross-validation varying size of parameter grid from 36 to 80 models to evaluate • Data size: 1,000,000 • Number features: 50 • Number partitions: 16 • Number CV folds: 4 • Parallelism: 3 • Standalone cluster with 30 cores
  • 42. DBG / June 5, 2018 / © 2018 IBM Corporation Results Optimizing Tuning for Pipeline Models • Up to 3.25x speedup • Increases with more models … • … and more complex pipelines • Check out: • https://github.com/BryanCutler/PipelineTuning • Experimental! • Watch SPARK-19071 Elapsed time for DAG CV vs Simple Parallel CV 0 275 550 825 1100 # models 36 48 60 80 Parallel DAG Parallel
  • 43. DBG / June 5, 2018 / © 2018 IBM Corporation Thank you! codait.org twitter.com/MLnick github.com/MLnick github.com/BryanCutler developer.ibm.com/code FfDL Sign up for IBM Cloud and try Watson Studio! https://datascience.ibm.com/ MAX
  • 44. DBG / June 5, 2018 / © 2018 IBM Corporation Date, Time, Location & Duration Session title and Speaker Tue, June 5 | 11 AM 2010-2012, 30 mins Productionizing Spark ML Pipelines with the Portable Format for Analytics Nick Pentreath (IBM) Tue, June 5 | 2 PM 2018, 30 mins Making PySpark Amazing—From Faster UDFs to Dependency Management and Graphing! Holden Karau (Google) Bryan Cutler (IBM) Tue, June 5 | 2 PM Nook by 2001, 30 mins Making Data and AI Accessible for All Armand Ruiz Gabernet (IBM) Tue, June 5 | 2:40 PM 2002-2004, 30 mins Cognitive Database: An Apache Spark-Based AI-Enabled Relational Database System Rajesh Bordawekar (IBM T.J. Watson Research Center) Tue, June 5 | 3:20 PM 3016-3022, 30 mins Dynamic Priorities for Apache Spark Application’s Resource Allocations Michael Feiman (IBM Spectrum Computing) Shinnosuke Okada (IBM Canada Ltd.) Tue, June 5 | 3:20 PM 2001-2005, 30 mins Model Parallelism in Spark ML Cross-Validation Nick Pentreath (IBM) Bryan Cutler (IBM) Tue, June 5 | 3:20 PM 2007, 30 mins Serverless Machine Learning on Modern Hardware Using Apache Spark Patrick Stuedi (IBM) Tue, June 5 | 5:40 PM 2002-2004, 30 mins Create a Loyal Customer Base by Knowing Their Personality Using AI-Based Personality Recommendation Engine; Sourav Mazumder (IBM Analytics) Aradhna Tiwari (University of South Florida) Tue, June 5 | 5:40 PM 2007, 30 mins Transparent GPU Exploitation on Apache Spark Dr. Kazuaki Ishizaki (IBM) Madhusudanan Kandasamy (IBM) Tue, June 5 | 5:40 PM 2009-2011, 30 mins Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for Deep Neural Networks Yonggang Hu (IBM) Chao Xue (IBM) IBM Sessions at Spark+AI Summit 2018 (Tuesday, June 5)
  • 45. DBG / June 5, 2018 / © 2018 IBM Corporation Date, Time, Location & Duration Session title and Speaker Wed, June 6 | 12:50 PM Birds of a Feather: Apache Arrow in Spark and More Bryan Cutler (IBM) Li Jin (Two Sigma Investments, LP) Wed, June 6 | 2 PM 2002-2004, 30 mins Deep Learning for Recommender Systems Nick Pentreath (IBM) ) Wed, June 6 | 3:20 PM 2018, 30 mins Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer Frederick Reiss (IBM) Vijay Bommireddipalli (IBM Center for Open-Source Data & AI Technologies) IBM Sessions at Spark+AI Summit 2018 (Wednesday, June 6) Meet us at IBM booth in the Expo area.
  • 46. DBG / June 5, 2018 / © 2018 IBM Corporation