Model Selection and
Tuning at Scale
March 2016
About us
Owen Zhang
Chief Product Officer @ DataRobot
Former #1 ranked Data Scientist on
Kaggle
Former VP, Science @ AIG
Peter Prettenhofer
Software Engineer @ DataRobot
Scikit-learn core developer
Agenda
● Introduction
● Case-study Criteo 1TB
● Conclusion / Discussion
Model Selection
● Estimating the performance of different models in order to choose the best one.
● K-Fold Cross-validation
● The devil is in the detail:
○ Partitioning
○ Leakage
○ Sample size
○ Stacked-models require nested layers
[Figure: data partitioning into Train, Validation, and Holdout; folds numbered 1 to 5]
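A minimal sketch of the partitioning on this slide, assuming a plain random split: a holdout set is set aside first, and the remaining rows are divided into 5 folds, each used once for validation. The function name `kfold_with_holdout` and the 20% holdout fraction are illustrative assumptions, not taken from the talk.

```python
import random

def kfold_with_holdout(n_rows, n_folds=5, holdout_frac=0.2, seed=0):
    """Return (folds, holdout); folds is a list of (train_idx, valid_idx) pairs."""
    rng = random.Random(seed)
    idx = list(range(n_rows))
    rng.shuffle(idx)

    n_holdout = int(n_rows * holdout_frac)
    holdout = idx[:n_holdout]          # never used during model selection
    rest = idx[n_holdout:]

    # Deal the remaining rows round-robin into n_folds validation parts.
    parts = [rest[i::n_folds] for i in range(n_folds)]
    folds = []
    for k in range(n_folds):
        valid = parts[k]
        train = [i for j, part in enumerate(parts) if j != k for i in part]
        folds.append((train, valid))
    return folds, holdout

folds, holdout = kfold_with_holdout(100)
```

A random split like this can leak information when rows are ordered in time, which is one of the details the slide warns about.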
Model Complexity & Overfitting
More data to the rescue?
Underfitting or Overfitting?
http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html
Model Tuning
● Optimizing the performance of a model
● Example: Gradient Boosted Trees
○ Nr of trees
○ Learning rate
○ Tree depth / Nr of leaf nodes
○ Min leaf size
○ Example subsampling rate
○ Feature subsampling rate
Search Space
Hyperparameter            GBRT (naive)    GBRT    RandomForest
Nr of trees                          5       1               1
Learning rate                        5       5               -
Tree depth                           5       5               1
Min leaf size                        3       3               3
Example subsample rate               3       1               1
Feature subsample rate               2       2               5
Total                             2250     150              15
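The totals on this slide are the products of the per-hyperparameter counts, that is, the number of model fits a full grid would need. A short check, with dictionary keys chosen here only for illustration:

```python
from math import prod

# Number of candidate values per hyperparameter, as listed on the slide.
# RandomForest has no learning rate, so that entry is simply omitted.
grid_sizes = {
    "GBRT (naive)": {"n_trees": 5, "learning_rate": 5, "tree_depth": 5,
                     "min_leaf_size": 3, "example_subsample": 3,
                     "feature_subsample": 2},
    "GBRT": {"n_trees": 1, "learning_rate": 5, "tree_depth": 5,
             "min_leaf_size": 3, "example_subsample": 1,
             "feature_subsample": 2},
    "RandomForest": {"n_trees": 1, "tree_depth": 1, "min_leaf_size": 3,
                     "example_subsample": 1, "feature_subsample": 5},
}

# A full grid needs one fit per combination of values.
totals = {model: prod(sizes.values()) for model, sizes in grid_sizes.items()}
print(totals)
```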
Hyperparameter Optimization
● Grid Search
● Random Search
● Bayesian optimization
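A rough sketch of the first two strategies, assuming a hypothetical search space and a toy loss function that stands in for "fit a model and return its validation loss". None of the values come from the talk. Grid search evaluates every combination; random search evaluates a fixed number of randomly drawn ones. Bayesian optimization is not shown.

```python
import itertools
import random

# Hypothetical search space (illustrative values only).
space = {
    "learning_rate": [0.01, 0.03, 0.1, 0.3],
    "tree_depth": [2, 4, 6, 8],
    "min_leaf_size": [1, 10, 100],
}

def toy_loss(params):
    """Stand-in for fitting a model and measuring validation loss (assumption)."""
    return (params["learning_rate"] - 0.1) ** 2 + (params["tree_depth"] - 6) ** 2 / 100

def grid_search(space, loss):
    """Evaluate every combination in the grid."""
    names = list(space)
    candidates = [dict(zip(names, values))
                  for values in itertools.product(*space.values())]
    return min(candidates, key=loss), len(candidates)

def random_search(space, loss, n_iter, seed=0):
    """Evaluate n_iter randomly drawn combinations."""
    rng = random.Random(seed)
    candidates = [{name: rng.choice(values) for name, values in space.items()}
                  for _ in range(n_iter)]
    return min(candidates, key=loss), len(candidates)

best_grid, n_grid = grid_search(space, toy_loss)
best_rand, n_rand = random_search(space, toy_loss, n_iter=10)
print(n_grid, "fits for the grid,", n_rand, "fits for random search")
```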
Challenges at Scale
● Why is learning with more data harder?
○ Paradox: more data would let us use more complex models, but computational constraints prevent it*
○ => we need more efficient ways for creating complex models!
● Need to account for the combined cost: model fitting + model selection / tuning
○ Smart hyperparameter tuning tries to decrease the # of model fits
○ … we can accomplish this with fewer hyperparameters too**
* Pedro Domingos, A few useful things to know about machine learning, 2012.
** Practitioners often favor algorithms with few hyperparameters such as RandomForest or
AveragedPerceptron (see http://nlpers.blogspot.co.at/2014/10/hyperparameter-search-bayesian.html)
A case study -- binary classification on 1TB of data
● Criteo click-through data
● Downsampled ad impression data covering 24 days
● Fully anonymized dataset:
○ 1 target
○ 13 integer features
○ 26 hashed categorical features
● Experiment setup:
○ Using day 0 - day 22 data for training, day 23 data for testing
Big Data?
Data size:
● ~46GB/day
● ~180,000,000 examples/day
However, it is very imbalanced (even after downsampling non-events)
● ~3.5% event rate
Further downsampling of non-events to a balanced dataset will reduce the size of data to ~70GB
● Will fit into a single node under “optimal” conditions
● Loss of model accuracy is negligible in most situations
Assuming a 0.1% raw event (click-through) rate:
Raw data: 35TB @ 0.1% → Data: 1TB @ 3.5% → Data: 70GB @ 50%
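The three sizes follow from simple arithmetic, sketched below under two assumptions that the slide does not state explicitly: data volume is proportional to the number of rows, and every event is kept at each downsampling step. The variable names are illustrative.

```python
# Figures quoted on the slide.
raw_event_rate = 0.001     # assumed raw click-through rate (0.1%)
data_tb = 1.0              # size of the released dataset, in TB
data_event_rate = 0.035    # event rate in the released dataset (3.5%)

# Volume taken up by events alone (assumes size is proportional to row count).
events_tb = data_tb * data_event_rate

# A balanced dataset keeps all events plus an equal volume of non-events.
balanced_gb = 2 * events_tb * 1000

# If the same events made up only 0.1% of the raw data, the raw data was this big.
raw_tb = events_tb / raw_event_rate

print(round(balanced_gb), "GB balanced;", round(raw_tb), "TB raw")
```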
Where to start?
● 70GB (~260,000,000 data points) is still a lot of data
● Let’s take a tiny slice of that to experiment with
○ Take 0.25%, then 0.5%, then 1%, and do a grid search on each
[Figure: comparison of RF, ASVM, Regularized Regression, GBM (with Count), and GBM (without Count); axis: Time (Seconds); an arrow marks the "Better" direction]
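One way to draw such slices in a single pass over the data is sketched below. The helper `sample_rows` and the seed handling are assumptions for illustration; the talk does not say how the samples were drawn.

```python
import random

TOTAL_ROWS = 260_000_000           # approximate row count quoted on the slide
FRACTIONS = (0.0025, 0.005, 0.01)  # 0.25%, 0.5%, 1%

# Approximate number of rows each slice is expected to contain.
expected_rows = {f: round(TOTAL_ROWS * f) for f in FRACTIONS}

def sample_rows(rows, fraction, seed=0):
    """Yield each row independently with probability `fraction` (single pass)."""
    rng = random.Random(seed)
    for row in rows:
        if rng.random() < fraction:
            yield row

# Reusing the seed makes the slices nested: every row in the 0.25% slice
# also appears in the 1% slice.
small = set(sample_rows(range(100_000), 0.0025))
large = set(sample_rows(range(100_000), 0.01))
print(expected_rows)
```

Nested slices keep results at different sample sizes comparable, because the larger sample only adds rows to the smaller one.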
GBM is the way to go, let’s go up to 10% data
[Figure: results by # of trees; series labeled by sample size / depth of tree / time to finish]
A “Fairer” Way of Comparing Models
[Figure annotation: a better model when time is the constraint]
Can We Extrapolate?
[Figure annotation: where we (can) do better than generic Bayesian optimization]
Tree Depth vs Data Size
● A natural heuristic -- increment tree depth by 1 every time data size doubles
[Figure: panels for 1%, 2%, 4%, and 10% of the data]
Optimal Depth = a + b * log(DataSize)
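The heuristic corresponds to the formula with a base-2 logarithm and b = 1. A small sketch, assuming a depth has already been tuned on some base sample; the function name and the starting point (depth 6 at 1% of the data) are made up for illustration and are not results from the talk.

```python
import math

def suggested_depth(data_size, base_size, base_depth):
    """Add one level of tree depth for every doubling of the data size."""
    return base_depth + math.log2(data_size / base_size)

# Hypothetical starting point: depth 6 was tuned on 1% of the data.
for frac in (0.01, 0.02, 0.04, 0.10):
    depth = suggested_depth(frac, base_size=0.01, base_depth=6)
    print(f"{frac:.0%} of the data -> suggested depth {depth:.1f}")
```

In practice the result would be rounded to an integer depth, and a and b would be fitted from the small-sample grid searches rather than fixed in advance.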
What about VW (Vowpal Wabbit)?
● Highly efficient online learning algorithm
● Supports adaptive learning rates
● Inherently linear; the user needs to specify non-linear features or interactions explicitly
● 2-way and 3-way interactions can be generated on the fly
● Supports “every k” validation
● The only “tuning” REQUIRED is specification of interactions
○ Thanks to progressive validation, bad interactions can be detected immediately, so little time is wasted on them
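The idea behind progressive validation can be sketched in plain Python: each example is scored before the model trains on it, so the running loss is an honest estimate available from the first pass. This is a toy online logistic regression on synthetic data, not VW's implementation. The data generator, learning rate, and function names are all assumptions for illustration.

```python
import math
import random

def progressive_loss(examples, n_features, lr=0.1, report_every=1000):
    """Online logistic regression; returns the running log loss every k examples."""
    w = [0.0] * n_features
    total, history = 0.0, []
    for t, (x, y) in enumerate(examples, start=1):
        z = sum(wi * xi for wi, xi in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))
        p = min(max(p, 1e-12), 1 - 1e-12)
        # Score the example first ...
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        # ... and only then take a gradient step on it.
        for i, xi in enumerate(x):
            w[i] -= lr * (p - y) * xi
        if t % report_every == 0:
            history.append(total / t)
    return history

def make_examples(n, with_interaction, seed=0):
    """Synthetic data whose label depends only on the x1*x2 interaction."""
    rng = random.Random(seed)
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        y = 1 if x1 * x2 > 0 else 0
        x = [1.0, x1, x2] + ([x1 * x2] if with_interaction else [])
        yield x, y

without = progressive_loss(make_examples(5000, False), n_features=3)
with_ix = progressive_loss(make_examples(5000, True), n_features=4)
print("linear only:     ", [round(v, 3) for v in without])
print("with interaction:", [round(v, 3) for v in with_ix])
```

On this toy data the linear-only model stays near the loss of a constant prediction, while the model with the explicit interaction improves within the first reports, which is the kind of early signal the slide describes.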
Data pipeline for VW
[Figure: data pipeline. Labels: Training (T1, T2, ..., Tm), Test, Random Split (T1s, T2s, ..., Tms), Random Shuffle, Concat + Interleave]
It takes longer to prep the data than to run the model!
VW Results
[Figure: VW results without vs. with the Count + Count*Numeric interaction, at 1%, 10%, and 100% of the data]
Putting it All Together
[Figure: time markers at 1 Hour and 1 Day]
Do We Really “Tune/Select Model @ Scale”?
● What we claim we do:
○ Model tuning and selection on big data
● What we actually do:
○ Model tuning and selection on small data
○ Re-run the model and expect/hope that performance and hyperparameters extrapolate as expected
● If you start the model tuning/selection process with GBs (even
100s of MBs) of data, you are doing it wrong!
Some Interesting Observations
● At least for some datasets, it is very hard for a “pure linear” model to outperform (accuracy-wise) a non-linear model, even with much more data
● There is meaningful structure in the hyperparameter space
● When time is limited (relative to data size), running “deeper” models on a smaller data sample may actually yield better results
● To fully exploit the data, model estimation time is usually at least proportional to n*log(n), and we need models whose # of parameters can scale with the # of data points
○ GBM can have as many parameters as we want
○ So can factorization machines
● For any data and any model we will run into a “diminishing returns” issue as the data gets bigger and bigger
DataRobot Essentials
April 7-8 London
April 28-29 San Francisco
May 17-18 Atlanta
June 23-24 Boston
datarobot.com/training
© DataRobot, Inc. All rights reserved.
Thanks / Questions?
