I Don’t Want to Be a Dummy!
Encoding Predictors for Trees
Max Kuhn
NYRC
Trees
Tree–based models are nested sets of if/else statements that make predictions in the
terminal nodes:
> library(rpart)
> library(AppliedPredictiveModeling)
> data(schedulingData)
> rpart(Class ~ ., data = schedulingData, control = rpart.control(maxdepth = 2))
n= 4331
node), split, n, loss, yval, (yprob)
* denotes terminal node
 1) root 4331 2100 VF (0.511 0.311 0.119 0.060)
   2) Protocol=C,D,E,F,G,I,J,K,L,N 2884 860 VF (0.703 0.206 0.068 0.023) *
   3) Protocol=A,H,M,O 1447 690 F (0.126 0.521 0.219 0.133)
     6) Iterations< 1.5e+02 1363 610 F (0.134 0.553 0.232 0.081) *
     7) Iterations>=1.5e+02 84 1 L (0.000 0.000 0.012 0.988) *
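To make the "nested if/else" reading concrete, the printed tree can be written out by hand. This is only a sketch: `predict_tree` is a hypothetical helper name, and the cut point of 150 is taken from the rounded `1.5e+02` in the printout, so the exact threshold stored by rpart may differ.

```r
# Hand-written equivalent of the depth-2 tree printed above.
# The threshold 150 is the rounded value from the printout (an assumption).
predict_tree <- function(protocol, iterations) {
  if (protocol %in% c("C", "D", "E", "F", "G", "I", "J", "K", "L", "N")) {
    "VF"                        # terminal node 2
  } else if (iterations < 150) {
    "F"                         # terminal node 6
  } else {
    "L"                         # terminal node 7
  }
}

predict_tree("A", 200)
predict_tree("C", 10)
```

Each observation follows exactly one path, so the conditions are evaluated in order and the first terminal node reached gives the prediction.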
Max Kuhn (NYRC) I Don’t Want to Be a Dummy! 2 / 16
Rules
Similarly, rule–based models are non–nested sets of if statements:
> library(C50)
> summary(C5.0(Class ~ ., data = schedulingData, rules = TRUE))
<snip>
Rule 109: (17/7, lift 9.7)
Protocol in {F, J, N}
Compounds > 818
InputFields > 152
NumPending <= 0
Hour > 0.6333333
Day = Tue
-> class L [0.579]
Default class: VF
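A single rule can be read as one stand-alone condition rather than a path through a tree. The sketch below encodes Rule 109 as a predicate; `rule_109` and the two example jobs are made up for illustration, and a real ruleset would combine many such rules before falling back to the default class.

```r
# Rule 109 from the printout as a stand-alone predicate (hypothetical helper).
rule_109 <- function(d) {
  d$Protocol %in% c("F", "J", "N") &
    d$Compounds > 818 &
    d$InputFields > 152 &
    d$NumPending <= 0 &
    d$Hour > 0.6333333 &
    d$Day == "Tue"
}

# Two invented jobs: the first satisfies every condition, the second does not.
jobs <- data.frame(
  Protocol    = c("J", "A"),
  Compounds   = c(900, 900),
  InputFields = c(200, 200),
  NumPending  = c(0, 0),
  Hour        = c(9.5, 9.5),
  Day         = c("Tue", "Tue")
)

# If this were the only rule, covered rows get class L and the rest get
# the default class VF.
ifelse(rule_109(jobs), "L", "VF")
```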
Bayes!
Bayesian regression and classification models don’t really specify anything about the predictors
beyond Pr[X] and Pr[X|Y].
If there were only one categorical predictor, we could have Pr[X|Y] be a table of raw
probabilities:
> xtab <- table(schedulingData$Day, schedulingData$Class)
> apply(xtab, 2, function(x) x/sum(x))
        VF      F    M     L
Mon 0.1678 0.1492 0.15 0.162
Tue 0.1913 0.2019 0.27 0.255
Wed 0.2090 0.2101 0.19 0.228
Thu 0.1678 0.1589 0.18 0.154
Fri 0.2171 0.2183 0.20 0.178
Sat 0.0068 0.0082 0.00 0.023
Sun 0.0403 0.0535 0.00 0.000
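The same column-wise normalization can be checked on a small example. The `toy` data frame below is invented and only stands in for `schedulingData`; `prop.table()` with `margin = 2` is a base R alternative to the `apply()` call.

```r
# Invented stand-in for schedulingData: six jobs, two classes.
toy <- data.frame(
  Day   = c("Mon", "Mon", "Tue", "Tue", "Tue", "Wed"),
  Class = c("VF",  "F",   "VF",  "VF",  "F",   "F")
)

xtab <- table(toy$Day, toy$Class)

# Divide each column by its total: every column becomes Pr[Day | Class].
cond_prob <- prop.table(xtab, margin = 2)
cond_prob

# Each column of a conditional probability table sums to one.
colSums(cond_prob)
```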
Dummy Variables
For the other models, we typically encode a predictor with C categories into C − 1 binary
dummy variables:
> design_mat <- model.matrix(Class ~ Day, data = head(schedulingData))
> design_mat[, colnames(design_mat) != "(Intercept)"]
  DayTue DayWed DayThu DayFri DaySat DaySun
1      1      0      0      0      0      0
2      1      0      0      0      0      0
3      0      0      1      0      0      0
4      0      0      0      1      0      0
5      0      0      0      1      0      0
6      0      1      0      0      0      0
In this case, one predictor generates six columns in the design matrix
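As a small sketch of where the C − 1 comes from, the example below builds both the reference-cell coding and a full set of indicators for a made-up `day` factor with seven levels. The variable names are illustrative only.

```r
# A made-up factor with C = 7 levels, only four of which are observed.
day <- factor(c("Mon", "Tue", "Thu", "Sat"),
              levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))

# Default treatment contrasts: an intercept plus C - 1 = 6 indicator columns.
# The first level (Mon) is the reference and gets no column of its own.
ref_coded <- model.matrix(~ day)

# Removing the intercept gives one indicator per level (C = 7 columns).
one_hot <- model.matrix(~ day - 1)

colnames(ref_coded)
colnames(one_hot)
```

With the intercept present, a seventh indicator would be redundant because it is a linear combination of the other columns, which is why one level is dropped.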
Encoding Choices
We make the decision on how to encode the data prior to creating the model.
That means we choose whether to present the model with the grouped categories or
ungrouped binary dummy variables.
This means we could get different representations of the model (see the next two slides).
Does it matter? Let’s do some experiments!
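The difference in representation can be reproduced on simulated data. The sketch below is not the data behind the slides: the outcome, sample sizes, and the helper `n_splits` are all assumptions chosen so that weekends differ from weekdays.

```r
library(rpart)

set.seed(1)
days <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
wday <- factor(rep(days, each = 50), levels = days)

# Simulated outcome: weekends are lower than weekdays, plus noise.
y <- ifelse(wday %in% c("Sat", "Sun"), 5, 15) + rnorm(length(wday))

factor_data <- data.frame(y = y, wday = wday)                  # one factor column
dummy_data  <- data.frame(y = y, model.matrix(~ wday - 1))     # seven 0/1 columns

factor_tree <- rpart(y ~ ., data = factor_data)
dummy_tree  <- rpart(y ~ ., data = dummy_data)

# Count internal (non-leaf) nodes, i.e. the number of splits in each tree.
n_splits <- function(fit) sum(fit$frame$var != "<leaf>")
n_splits(factor_tree)
n_splits(dummy_tree)
```

With the factor, one split can send both weekend days to the same branch. With dummy variables, each split can isolate only one day, so separating the weekend takes more than one split.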
A Tree with Categorical Data
[Figure: tree with a single split on wday at node 1. Node 2 (n = 1530): Sun, Sat. Node 3 (n = 3826): Mon, Tues, Wed, Thurs, Fri. Each terminal node shows the outcome distribution on a 0 to 25000 axis.]
A Tree with Dummy Variables
[Figure: tree with two splits on dummy variables. Node 1 splits on Sun: ≥ 0.5 goes to Node 2 (n = 765), < 0.5 goes to node 3. Node 3 splits on Sat: ≥ 0.5 goes to Node 4 (n = 765), < 0.5 goes to Node 5 (n = 3826). Each terminal node shows the outcome distribution on a 0 to 25000 axis.]
Data Sets
Classification:
German Credit, 13 categorical predictors out of 20 (ROC AUC ≈ 0.76)
UCI Car Evaluation, 6 of 6 (Acc ≈ 0.96)
APM High Performance Computing, 2 of 7 (κ ≈ 0.7)
Regression:
Sacramento house prices, 3 of 8 but one has 37 unique values and another has 68
(RMSE ≈ 0.13, R² ≈ 0.6)
For each data set, we ran 10 separate simulations where 20% of the data were used for testing.
Repeated cross-validation was used to tune the models when they have tuning parameters.
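A minimal sketch of the outer resampling scheme, assuming simple random 80/20 splits (the slides do not say whether the splits were stratified). The row count `n` and the object names are hypothetical; any tuning by repeated cross-validation would happen inside each training set.

```r
set.seed(2017)
n      <- 200    # hypothetical number of rows in a data set
n_sims <- 10     # number of separate simulations

# Each list element holds the row indices of one 80% training set.
train_idx <- replicate(n_sims,
                       sample(n, size = floor(0.8 * n)),
                       simplify = FALSE)

# The remaining 20% of rows form the matching test set.
test_idx <- lapply(train_idx, function(idx) setdiff(seq_len(n), idx))

lengths(train_idx)
lengths(test_idx)
```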
Simulations
Models were fit twice on each data set (with and without dummy variables):
single trees (CART, C5.0)
single rulesets (C5.0, Cubist)
bagged trees
random forests
boosted models (SGB trees, C5.0, Cubist)
A number of performance metrics were computed for each (e.g. RMSE, binomial or
multinomial log–loss, etc.) and the test set results are used to compare models.
Confidence intervals were computed using a linear mixed model to account for the
resample–to–resample correlation structure.
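One way such an interval could be computed is sketched below with `lme()` from the nlme package and a random intercept per resample. The slides do not give the exact model specification, so the formula, the simulated RMSE values, and all object names here are assumptions.

```r
library(nlme)

set.seed(3)
n_sims <- 10

# Simulated results: both encodings share a resample-specific shift,
# which induces the resample-to-resample correlation.
resample_effect <- rnorm(n_sims, sd = 0.2)
res <- data.frame(
  resample = factor(rep(seq_len(n_sims), times = 2)),
  encoding = factor(rep(c("dummy", "factor"), each = n_sims)),
  rmse     = 1 + rep(resample_effect, times = 2) + rnorm(2 * n_sims, sd = 0.05)
)

# Random intercept per resample; the fixed effect of encoding is the
# average RMSE difference between factors and dummy variables.
fit <- lme(rmse ~ encoding, random = ~ 1 | resample, data = res)

# Approximate 95% confidence intervals for the fixed effects.
ci <- intervals(fit, which = "fixed")$fixed
ci["encodingfactor", ]
```

Treating the 20 results as independent would ignore that each pair of models was evaluated on the same split, which is the correlation the random intercept is meant to absorb.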
Regression Model Results
[Figure: RMSE difference between the two encodings for RF, CART, Cubist_boost, GBM, Cubist, and Bagging. The x-axis runs from −0.010 to 0.010, with "DV Better" on the left and "Factors Better" on the right.]
Classification Model Results
[Figure: loss ratio for CART, C50rule_boost, C50rule, C50tree_boost, C50tree, RF, and Bagging, in panels for German Credit, UCI Cars, and HPC. The x-axis has ticks at 1, 2, and 4. Ratio > 1 => Factors Did Better.]
It Depends!
For classification:
The larger differences in the UCI car data might indicate that, if the percentage of
categorical predictors is large, it might matter a lot.
However, the magnitude of improvement of factors over dummy variables depends on the
model.
For 2 of 3 data sets, there was no real difference.
For regression:
It doesn’t seem to matter (except when it does)
Two very similar models (bagging and random forests) showed effects in different
directions.
It Depends!
All of this is also dependent on how easy the problem is.
If no models are able to adequately model the data, the choice of factor vs. dummy won’t
matter.
Also, if the categorical predictors are really important, the difference would most likely be
discernible.
For the Sacramento data, ZIP code is very important. For the HPC data, the protocol variable
is also very informative.
However, one thing is definitive:
Factors Usually Take Less Time to Train
[Figure: speedup for using factors, by model (C50rule, C50tree, CART, C50tree_boost, C50rule_boost, RF, Bagging, Cubist_boost, Cubist, GBM), in panels for German Credit, UCI Cars, HPC, and Sacramento. The x-axis has ticks at 1, 2, and 4.]
R and Dummy Variables
In almost all cases, using a formula with a model function will convert factors to dummy
variables.
However, some do not (e.g. rpart, randomForest, gbm, C5.0, NaiveBayes, etc.). This
makes sense for these models.
If you are tuning your model with train, the formula method will create dummy variables and
the non–formula method does not:
> ## dummy variables presented to underlying model:
> train(Class ~ ., data = schedulingData, ...)
>
> ## any factors are preserved
> train(x = schedulingData[, -ncol(schedulingData)],
+ y = schedulingData$Class,
+ ...)
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
 

I Don't Want to Be a Dummy! Encoding Predictors for Trees

  • 1. I Don’t Want to Be a Dummy! Encoding Predictors for Trees Max Kuhn NYRC
  • 2. Trees
      Tree–based models are nested sets of if/else statements that make predictions in the terminal nodes:
      > library(rpart)
      > library(AppliedPredictiveModeling)
      > data(schedulingData)
      > rpart(Class ~ ., data = schedulingData, control = rpart.control(maxdepth = 2))
      n= 4331
      node), split, n, loss, yval, (yprob)
            * denotes terminal node
      1) root 4331 2100 VF (0.511 0.311 0.119 0.060)
        2) Protocol=C,D,E,F,G,I,J,K,L,N 2884 860 VF (0.703 0.206 0.068 0.023) *
        3) Protocol=A,H,M,O 1447 690 F (0.126 0.521 0.219 0.133)
          6) Iterations< 1.5e+02 1363 610 F (0.134 0.553 0.232 0.081) *
          7) Iterations>=1.5e+02 84 1 L (0.000 0.000 0.012 0.988) *
  • 3. Rules
      Similarly, rule–based models are non–nested sets of if statements:
      > library(C50)
      > summary(C5.0(Class ~ ., data = schedulingData, rules = TRUE))
      <snip>
      Rule 109: (17/7, lift 9.7)
          Protocol in {F, J, N}
          Compounds > 818
          InputFields > 152
          NumPending <= 0
          Hour > 0.6333333
          Day = Tue
          -> class L [0.579]
      Default class: VF
  • 4. Bayes!
      Bayesian regression and classification models don’t really specify anything about the predictors beyond Pr[X] and Pr[X|Y].
      If there were only one categorical predictor, we could have Pr[X|Y] be a table of raw probabilities:
      > xtab <- table(schedulingData$Day, schedulingData$Class)
      > apply(xtab, 2, function(x) x/sum(x))
              VF      F    M     L
      Mon 0.1678 0.1492 0.15 0.162
      Tue 0.1913 0.2019 0.27 0.255
      Wed 0.2090 0.2101 0.19 0.228
      Thu 0.1678 0.1589 0.18 0.154
      Fri 0.2171 0.2183 0.20 0.178
      Sat 0.0068 0.0082 0.00 0.023
      Sun 0.0403 0.0535 0.00 0.000
  • 5. Dummy Variables
      For the other models, we typically encode a predictor with C categories into C − 1 binary dummy variables:
      > design_mat <- model.matrix(Class ~ Day, data = head(schedulingData))
      > design_mat[, colnames(design_mat) != "(Intercept)"]
        DayTue DayWed DayThu DayFri DaySat DaySun
      1      1      0      0      0      0      0
      2      1      0      0      0      0      0
      3      0      0      1      0      0      0
      4      0      0      0      1      0      0
      5      0      0      0      1      0      0
      6      0      1      0      0      0      0
      In this case, one predictor generates six columns in the design matrix.
  • 6. Encoding Choices
      We make the decision on how to encode the data prior to creating the model. That means we choose whether to present the model with the grouped categories or ungrouped binary dummy variables. This means we could get different representations of the model (see the next two slides). Does it matter? Let’s do some experiments!
  • 7. A Tree with Categorical Data
      [Figure: a tree with one split on wday. {Sun, Sat} goes to Node 2 (n = 1530); {Mon, Tues, Wed, Thurs, Fri} goes to Node 3 (n = 3826). Each terminal node shows the outcome distribution on a 0 to 25000 axis.]
  • 8. A Tree with Dummy Variables
      [Figure: a tree with two splits on dummy variables. The first split is on Sun: ≥ 0.5 goes to Node 2 (n = 765). The remaining data are split on Sat: ≥ 0.5 goes to Node 4 (n = 765) and < 0.5 goes to Node 5 (n = 3826). Each terminal node shows the outcome distribution on a 0 to 25000 axis.]
  • 9. Data Sets
      Classification:
        • German Credit, 13 categorical predictors out of 20 (ROC AUC ≈ 0.76)
        • UCI Car Evaluation, 6 of 6 (Acc ≈ 0.96)
        • APM High Performance Computing, 2 of 7 (κ ≈ 0.7)
      Regression:
        • Sacramento house prices, 3 of 8, but one has 37 unique values and another has 68 (RMSE ≈ 0.13, R² ≈ 0.6)
      For each data set, we ran 10 separate simulations in which 20% of the data were used for testing. Repeated cross-validation was used to tune the models when they had tuning parameters.
  • 10. Simulations
      Models were fit twice on each data set (with and without dummy variables):
        • single trees (CART, C5.0)
        • single rulesets (C5.0, Cubist)
        • bagged trees
        • random forests
        • boosted models (SGB trees, C5.0, Cubist)
      A number of performance metrics were computed for each (e.g. RMSE, binomial or multinomial log–loss, etc.), and the test set results were used to compare models. Confidence intervals were computed using a linear mixed model to account for the resample–to–resample correlation structure.
  • 11. Regression Model Results
      [Figure: RMSE difference for each model (RF, CART, Cubist_boost, GBM, Cubist, Bagging). The x-axis, "RMSE Difference", runs from −0.010 to 0.010, with dummy variables (DV) better toward one end and factors better toward the other.]
  • 12. Classification Model Results
      [Figure: one panel per data set (German Credit, UCI Cars, HPC) showing the loss ratio for each model (CART, C50rule_boost, C50rule, C50tree_boost, C50tree, RF, Bagging). The x-axis, "Loss Ratio", has ticks at 1, 2, and 4; a ratio > 1 means factors did better.]
  • 13. It Depends!
      For classification:
        • The larger differences in the UCI car data might indicate that, if the percentage of categorical predictors is large, it might matter a lot.
        • However, the magnitude of improvement of factors over dummy variables depends on the model.
        • For 2 of the 3 data sets, there was no real difference.
      For regression:
        • It doesn’t seem to matter (except when it does).
        • Two very similar models (bagging and random forests) showed effects in different directions.
  • 14. It Depends!
      All of this is also dependent on how easy the problem is. If no models are able to adequately model the data, the choice of factor vs. dummy won’t matter.
      Also, if the categorical predictors are really important, the difference would most likely be discernible. For the Sacramento data, ZIP code is very important. For the HPC data, the protocol variable is also very informative.
      However, one thing is definitive:
  • 15. Factors Usually Take Less Time to Train
      [Figure: one panel per data set (German Credit, UCI Cars, HPC, Sacramento) showing the training-time speedup for each model (C50rule, C50tree, CART, C50tree_boost, C50rule_boost, RF, Bagging, Cubist_boost, Cubist, GBM). The x-axis, "Speedup for Using Factors", has ticks at 1, 2, and 4.]
  • 16. R and Dummy Variables
      In almost all cases, using a formula with a model function will convert factors to dummy variables. However, some do not (e.g. rpart, randomForest, gbm, C5.0, NaiveBayes, etc.). This makes sense for these models.
      If you are tuning your model with train, the formula method will create dummy variables and the non–formula method will not:
      > library(caret)
      > ## dummy variables presented to underlying model:
      > train(Class ~ ., data = schedulingData, ...)
      >
      > ## any factors are preserved
      > train(x = schedulingData[, -ncol(schedulingData)],
      +       y = schedulingData$Class,
      +       ...)
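
As a rough illustration of the encoding choice discussed in these slides, the base R sketch below contrasts the two representations of a categorical predictor: the factor kept as a single column, and the C − 1 dummy columns that `model.matrix` produces under R's default treatment contrasts. The `toy` data frame and its values are invented for illustration and are not taken from `schedulingData`.

```r
## Toy data: hypothetical values, not from the data sets used in the slides.
days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
toy <- data.frame(
  Day = factor(c("Tue", "Tue", "Thu", "Fri", "Fri", "Wed", "Sun"), levels = days),
  y   = c(1.2, 0.7, 3.1, 2.4, 2.9, 1.8, 0.3)
)

## Factor representation: a single column with C = 7 levels.
n_levels <- nlevels(toy$Day)

## Dummy-variable representation: model.matrix() expands the factor into
## C - 1 indicator columns; the first level (Mon) is the reference level.
design_mat <- model.matrix(y ~ Day, data = toy)
dummies <- design_mat[, colnames(design_mat) != "(Intercept)", drop = FALSE]

print(colnames(dummies))
print(ncol(dummies))

## Each row has at most one indicator set; a row of all zeros would mean "Mon".
row_totals <- rowSums(dummies)
```

A model function that accepts factors directly could be given `toy$Day` as is, so a single split can group several days at once. The `dummies` matrix is closer to what a formula interface that expands factors would present to the model, where each split can only separate one day from the rest.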