2. About me
▪ Solutions Architect at Databricks (½ a year)
▪ Software Engineer at Databricks (5 ½ years)
▪ Apache Spark committer and PMC member
3. Global company with over 5,000 customers and 450+ partners
Original creators of popular data and machine learning open source projects
A unified data analytics platform for accelerating innovation across
data engineering, data science, and analytics
4. Hyperparameter
Model 1: “Dress”
Model 2: “Sneaker”
Why the difference?
Model 2 had better hyperparameter settings:
• Learning rate
• Model structure
• ...
What’s a hyperparameter?
• Statistical: assumptions about your model/data
• Practical: inputs your ML library does not learn from data
• Algorithmic: problem-dependent configs
5. Hyperparameter tuning
Expert knowledge
Not in this talk:
• Statistical best practices
• Overview of methods for tuning
In this talk:
• Data Science workflow best practices
• Tips for the big data and cloud computing space
Black-box tuning: ignore until needed
→ See references at end!
6. Hyperparameter tuning is tough
[Chart: fraction of the hyperparameter grid covered vs. number of hyperparameters (7 possible values each); coverage falls from 1 toward 0 as the number of hyperparameters grows]
Non-convex optimization
Curse of dimensionality
→ Computational cost
Unintuitive hyperparameters
• Regularization
• Neural net structure
• and many more...
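The computational cost of the curse of dimensionality is easy to quantify. A quick sketch, assuming 7 candidate values per hyperparameter (as on the slide) and an illustrative budget of 1,000 trials:

```python
# Curse of dimensionality: with a fixed trial budget, the fraction of a
# hyperparameter grid you can cover shrinks exponentially with the
# number of hyperparameters. The 1,000-trial budget is an illustrative
# assumption; "7 possible values each" comes from the slide's chart.
budget = 1_000
values_per_param = 7

for n_params in (1, 2, 3, 4, 5, 10):
    grid_size = values_per_param ** n_params
    coverage = min(1.0, budget / grid_size)
    print(f"{n_params:2d} hyperparameters: grid = {grid_size:>12,d}, "
          f"coverage = {coverage:.2%}")
```

With 3 hyperparameters the budget covers the whole grid; with 10, it covers well under a thousandth of a percent.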
9. Single-machine vs. distributed training
• Single-machine training
• Distributed training
• Training 1 model per group (per customer, product, etc.)
10. Single-machine training
Scale out via distributed hyperparameter tuning
→ Train 1 model per Spark task
[Diagram: driver coordinates tuning; each worker trains one model per Spark task]
Tools for this:
• Hyperopt + SparkTrials
• sklearn + joblibspark
• Pandas UDFs
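For the sklearn + joblibspark route, a minimal sketch of what "1 model per Spark task" looks like. It runs locally here with joblib's default backend; on a cluster you would register the Spark backend first (the `joblibspark` registration call is an assumption of this sketch, not shown running):

```python
# Sketch: distributed hyperparameter search with scikit-learn + joblib.
# On a Spark cluster you would first run
#     from joblibspark import register_spark; register_spark()
# and use parallel_backend("spark") so each candidate model trains in
# its own Spark task. Dataset and parameter grid are illustrative.
from joblib import parallel_backend
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_digits(return_X_y=True)
search = GridSearchCV(
    LogisticRegression(max_iter=200),
    param_grid={"C": [0.01, 0.1, 1.0]},
    cv=3,
    n_jobs=-1,  # one candidate model per worker (per Spark task on "spark")
)
with parallel_backend("loky"):  # swap "loky" for "spark" on a cluster
    search.fit(X, y)
print(search.best_params_)
```

The key point: the search loop is embarrassingly parallel, so only the backend changes between laptop and cluster.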
12. Training one model per group
Scale out by distributing over groups
→ Train 1 group’s model per Spark task
[Diagram: driver distributes groups across workers; each worker runs tuning for its own group]
Tools for this:
• Apache Spark, Pandas UDFs
• Hyperopt
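A minimal sketch of the one-model-per-group pattern, shown with plain pandas so it runs anywhere. On Spark, the same per-group function would be wrapped with `df.groupBy("group").applyInPandas(fit_group, schema=...)` so each group trains in its own task; the column names here are illustrative:

```python
# Sketch: train one model per group (per customer, product, etc.).
# With Spark, wrap fit_group in groupBy(...).applyInPandas(...) so each
# group's fit runs in its own Spark task. Data and columns are toy
# examples.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "group": ["a"] * 4 + ["b"] * 4,
    "x": [1, 2, 3, 4, 1, 2, 3, 4],
    "y": [2, 4, 6, 8, 3, 6, 9, 12],
})

def fit_group(pdf: pd.DataFrame) -> pd.DataFrame:
    """Fit one model on a single group's data; return its coefficient."""
    model = LinearRegression().fit(pdf[["x"]], pdf["y"])
    return pd.DataFrame({"group": [pdf["group"].iloc[0]],
                         "coef": [model.coef_[0]]})

result = pd.concat(fit_group(pdf) for _, pdf in df.groupby("group"))
print(result)
```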
14. Getting started
Start small
• Bias towards smaller models, fewer iterations, etc.
• Small is cheaper, and it may suffice.
• Regardless, it gives a baseline.
Think before you tune
• Separate train/validation/test sets.
• Use early stopping or smart tuning wherever possible.
• Pick hyperparameters carefully.
• Pick ranges carefully.
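Two of the tips above, sketched with scikit-learn (the dataset and parameter values are illustrative): hold out a test set before tuning, and use early stopping so training ends when validation scores plateau rather than always running the maximum number of rounds.

```python
# Sketch: separate train/validation/test sets + early stopping.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# Hold out the final test set first; tune only on the rest.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Early stopping: stop adding trees once the internal validation score
# stops improving, instead of always training all n_estimators rounds.
model = GradientBoostingClassifier(
    n_estimators=500,
    validation_fraction=0.2,
    n_iter_no_change=5,
    random_state=0,
).fit(X_trainval, y_trainval)

print(f"trees actually trained: {model.n_estimators_}")
print(f"test accuracy: {model.score(X_test, y_test):.3f}")
```

Early stopping makes each tuning trial cheaper, which compounds across the whole search.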
15. Models vs. pipelines
Best practice: Set up full pipeline before tuning.
• At what point does the pipeline compute the metric you care about?
Tuning models vs. tuning pipelines
• Tuning featurization
Optimizing tuning for pipelines
• Cache intermediate results
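One way to cache intermediate results is scikit-learn's built-in `Pipeline(memory=...)`; a minimal sketch (steps and parameter grid are illustrative):

```python
# Sketch: cache a pipeline's featurization steps so repeated fits
# during tuning reuse them instead of recomputing. Since only the
# final estimator's hyperparameters vary here, the scaler and PCA are
# fit once per CV fold and reused across all candidate values of C.
from tempfile import mkdtemp
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe = Pipeline(
    [("scale", StandardScaler()),
     ("pca", PCA(n_components=10)),
     ("clf", LogisticRegression(max_iter=500))],
    memory=mkdtemp(),  # cache fitted transformers on disk
)
search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)
print(search.best_params_)
```

This is also where tuning the full pipeline pays off: the metric is computed at the end of the pipeline you will actually deploy, not on an isolated model.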
16. Evaluating and iterating
Validation data and metrics
• Record many metrics on both training and validation data.
Tuning hyperparameters independently vs. jointly
• Using smarter hyperparameter search algorithms
Tracking and reproducibility
• Data, code, params, metrics, models, metadata
• Tip: Parametrize code to facilitate tracking
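A sketch of the "parametrize code" tip: drive training from a single params dict so the exact dict you ran with can be logged next to the metrics it produced (the train function, metric, and MLflow calls named in comments are illustrative assumptions):

```python
# Sketch: parametrize training so params and metrics travel together,
# which makes tracking (e.g., with MLflow) trivial and runs reproducible.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def train_and_evaluate(params: dict) -> dict:
    """Train from an explicit params dict; return params + metrics together."""
    X, y = load_breast_cancer(return_X_y=True)
    model = LogisticRegression(C=params["C"], max_iter=params["max_iter"])
    score = cross_val_score(model, X, y, cv=3).mean()
    # With MLflow you would log these inside an active run, e.g.
    # mlflow.log_params(params); mlflow.log_metric("cv_accuracy", score)
    return {"params": params, "metrics": {"cv_accuracy": score}}

run = train_and_evaluate({"C": 1.0, "max_iter": 500})
print(run["metrics"])
```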
19. Handling code
Getting code to workers
• Generally simple: Pandas UDFs or integrations (Hyperopt, etc.)
• Debugging code serialization
• Errors are often in worker logs and look like ML library bugs (e.g., “no module named X...”)
• Tip: For Python, import libraries within closures
Passing configs and credentials
• E.g., MLflow active runs and credentials
Helpful resource:
• Distributed Hyperopt best practices and troubleshooting
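A sketch of the import-within-closures tip. The pickle round-trip below stands in for shipping the function to a Spark worker; the objective function itself is a hypothetical placeholder:

```python
# Sketch: a function destined for Spark workers does its imports inside
# the function body, so the import resolves on the worker at call time
# rather than relying on driver-side module state. This helps avoid
# "no module named X" failures that surface in worker logs.
import pickle

def objective(trial_params):
    # Import inside the closure: executed on the worker when called.
    import math
    return math.log(trial_params["C"])

# Simulate ship-to-worker: round-trip through pickle, then call.
remote_objective = pickle.loads(pickle.dumps(objective))
print(remote_objective({"C": 1.0}))
```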
20. Moving data
Single-machine ML
• Broadcast
• Load from blob storage
• Caching data
Distributed ML
• Caching data
Blob storage data prep
• Delta Lake format
• Petastorm and TFRecords
Helpful resources:
• Distributed Hyperopt best practices and troubleshooting
• Prepare data for distributed training
21. Configuring clusters
Single-machine ML
• Sharing machine resources
• Selecting machine types
Distributed ML
• Right-sizing clusters
• Sharing a cluster
Helpful resource:
• Scaling Hyperopt to Tune Machine Learning Models in Python
23. Tools to know about
• Apache Spark: CrossValidator, TrainValidationSplit, Pandas UDFs
• MLflow: Tracking, auto-logging
• ML in Python: Hyperopt SparkTrials & more (see last year’s SAIS talk)
• Scikit-learn: sklearn.model_selection, skopt, joblib + Spark
• TensorFlow: HParams, Keras Tuner, Model Optimization Toolkit
• PyTorch: Ax
24. Resources
Blog posts + example notebooks
• Hyperparameter Tuning with MLflow, Apache Spark MLlib and Hyperopt
Talks
• Best Practices for Hyperparameter Tuning with MLflow (SAIS 2019)
• Advanced Hyperparameter Optimization for Deep Learning with MLflow (SAIS 2019)
Project pages + docs
• Hyperopt docs and Github
• MLflow homepage and Github
Slides: tinyurl.com/sais2020-joseph
Notebook: tinyurl.com/sais2020-joseph-demo