Comparison of Machine Learning Algorithms
in Market Segmentation Analysis
Zhaohua Huang
Dec 12, 2005
Abstract
This project compares four machine learning methods, bagging, random forests (RFA),
artificial neural networks (ANN), and support vector machines (SVM), using the sales data
of an orthopedic equipment company. All four methods show similarly unsatisfactory
prediction performance on this dataset. Although each method has its own advantages in
predicting specific categories, ANN is relatively the best based on the misclassification
rates.
1. Introduction
Bagging, random forests (RFA), artificial neural networks (ANN), and support vector
machines (SVM) are four useful machine learning methods that can be used to improve
classification accuracy. Bagging produces replicated training sets by sampling with
replacement from the training set and fits a classifier to each replicate. Random forests
are a combination of tree predictors such that each tree depends on the values of a random
vector sampled independently and with the same distribution for all trees in the forest.
The artificial neural network used here is a single-hidden-layer network: the inputs are
fed through one layer of hidden nodes to the output nodes via a series of weights
(Appendix 7 reports an 8-20-4 network). The support vector machine for classification
creates a hyperplane that separates the data into two classes with the maximum margin.
In theory, random forests yield lower error rates than bagging, at least, and are more
robust to noise.
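The resampling step behind bagging can be sketched as follows. This is a minimal illustration in Python (the study itself was run in R); the helper names `bootstrap_sample` and `majority_vote`, the seed, and the ten stand-in observations are assumptions made for the example, not part of the original analysis.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement (one bagging replicate)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(labels):
    """Combine the labels predicted by the individual classifiers."""
    return Counter(labels).most_common(1)[0][0]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
training_set = list(range(10))    # stand-in for 10 training observations
replicates = [bootstrap_sample(training_set, rng) for _ in range(5)]

# Each replicate has the same size as the training set, but some
# observations appear more than once and others are left out.
vote = majority_vote(["n", "n", "h"])   # three hypothetical classifier outputs
```

A classifier is fitted to each replicate, and the final prediction for an observation is the majority vote over those classifiers.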
In this project, we use a real data set to compare their classification performance
empirically. The data set contains the sales of a company’s orthopedic equipment in
4703 hospitals and 13 feature variables that can potentially explain the differences
in sales among these hospitals. The four classification algorithms above yield the
predicted probabilities of four sales categories, ordered from low to high: “no sale”,
“low”, “high”, and “very high”. These probabilities are then input to linear discriminant
analysis (LDA) for classification, and the misclassification rates are compared.
2. Analysis and Results
2.1 Overview
The analysis procedure is summarized in the following diagram. Since RFA directly
reports the classification result instead of probabilities, LDA is not applied to it.
[Diagram: Data Transformation -> PCA -> Bagging, Random Forests, ANN, SVM -> LDA]
2.2 Data Manipulation and PCA
The goal and method of data transformation are the same as in the previous project, and
the details are in Appendices 0 to 3. There are 5 closely related and highly correlated
variables: “knee 95”, “knee 96”, “hip 95”, “hip 96” and “femur 96”. They are the numbers
of knee, hip and femur operations in 1995 and 1996. Simply using them together as
predictors would add unnecessary noise to the prediction. Therefore, we apply
principal component analysis (PCA) to these 5 variables after transforming the data.
Since the first component explains 0.9138151 of the variance and the second one drops to
only 0.05, we use only the first principal component, “V1”, which is a linear combination
of the above five highly correlated variables:
V1 = -0.456 hip95 - 0.445 knee95 - 0.458 hip96 - 0.445 knee96 - 0.432 femur96
The other predictors do not show high correlation after transformation.
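To make the definition of V1 concrete, the sketch below evaluates the linear combination for one hospital. It is written in Python for illustration only; the loadings are those reported in the text, while the function name `v1_score` and the example hospital are hypothetical. Whether the original PCA centered or scaled the transformed variables first is not stated, so the sketch applies the loadings to the transformed values as given.

```python
import math

# Loadings of the first principal component, as reported in the text.
LOADINGS = {
    "hip95": -0.456,
    "knee95": -0.445,
    "hip96": -0.458,
    "knee96": -0.445,
    "femur96": -0.432,
}

def v1_score(transformed):
    """Return V1 for one hospital given its five transformed operation counts."""
    return sum(LOADINGS[name] * transformed[name] for name in LOADINGS)

# Principal component loadings have unit length, up to rounding.
norm = math.sqrt(sum(w * w for w in LOADINGS.values()))

# Hypothetical hospital whose five transformed values all equal 1.0.
example = {name: 1.0 for name in LOADINGS}
score = v1_score(example)   # the sum of the five loadings, -2.236
```

The five loadings are nearly equal and their squares sum to about 1, so V1 behaves like a negatively signed average of the five operation counts.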
In addition, the variable “rbeds” is transformed into a binary categorical variable, which
makes it the same as the variable “rehab”, so we drop “rbeds”. We also use cross validation
to look for a good combination of the predictors. The changes in the results are small,
and in most cases dropping any one variable weakens the predictive power. Hence, the final
predictors are “V1”, “beds”, “outv”, “adm”, “sir”, “th”, “trauma” and “rehab”. The
description of these 8 predictors is in the appendix. To examine the out-of-sample
performance of the four methods, we randomly split the whole data set into two subsets:
one training set with 4203 observations and one testing set with 500 observations.
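The random split described above can be sketched as follows, in Python for illustration. The function name `split_indices` and the seed are assumptions; the original split was presumably made in R and its seed is not reported.

```python
import random

def split_indices(n_total, n_test, seed):
    """Randomly partition the indices 0..n_total-1 into training and testing sets."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]

# 4703 hospitals in total, 500 held out for testing (seed chosen arbitrarily).
train_idx, test_idx = split_indices(4703, 500, seed=2005)
```

Every hospital ends up in exactly one of the two subsets, so the testing set plays no part in fitting the models.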
2.3 Results of bagging, RFA, ANN and SVM
The results of bagging, RFA, ANN and SVM are shown in Appendices 5 to 8, respectively.
All four methods involve some randomness: sometimes the result is good and sometimes it
is bad. Hence we run each of them several times and report only the best performance.
Specifically, we use the default settings for bagging; ntree = 1000 for RFA; size = 20 and
maxit = 1000 for ANN; and kernel = "polynomial", degree = 6, gamma = 0.5, cross = 10 for
SVM. Bagging offers few options in R. We test different numbers of bags from 5 to 100;
the results differ slightly, but only because of the randomness of the method, so we keep
the default setting. The number of trees in RFA is tested from 100 to 1000. Although
increasing it does not improve the result significantly, we get the best result with 1000
trees. ANN is very unstable: sometimes it stops after only 5 iterations and sometimes it
runs to 800 and above. In our runs, the higher the iteration number, the better the
result. The default maximum iteration number is relatively small, so we set it to 1000,
although the iterations never reach that limit. Our best result comes from a run that
stops at iteration 740. SVM has many options. We try sigmoid, radial and polynomial
kernels, and for the polynomial kernel we also try different degrees. In general, a
polynomial kernel with a high degree dominates sigmoid and radial. It fits the training
set fairly well, with a misclassification rate as low as 42%, but none of the
combinations does well on the testing set. Also, the default gamma coefficient is 1 over
the number of predictors, which is 1/8 = 0.125 in our study. However, the testing set
performs better if we increase gamma to 0.6. Therefore, we suspect there is an
overfitting issue with SVM.
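To illustrate how gamma and the degree interact, the sketch below evaluates the polynomial kernel in the form used by libsvm-based SVM implementations, (gamma * <u, v> + coef0) ^ degree. It is written in Python for illustration; the function name `polynomial_kernel` and the two all-ones input vectors are hypothetical, while degree = 6, coef0 = 0 and gamma = 0.5 are taken from Appendix 8.

```python
def polynomial_kernel(u, v, gamma, degree, coef0=0.0):
    """Polynomial kernel in the libsvm form: (gamma * <u, v> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(u, v))
    return (gamma * dot + coef0) ** degree

n_predictors = 8
default_gamma = 1 / n_predictors    # 1 over the number of predictors

# Two hypothetical observations with all 8 predictors equal to 1.
u = [1.0] * n_predictors
v = [1.0] * n_predictors

k_default = polynomial_kernel(u, v, gamma=default_gamma, degree=6)  # (0.125 * 8) ** 6
k_used = polynomial_kernel(u, v, gamma=0.5, degree=6)               # (0.5 * 8) ** 6
```

With degree 6, raising gamma from 0.125 to 0.5 multiplies this kernel value by 4^6 = 4096, which suggests how strongly a high-degree polynomial kernel reacts to the choice of gamma.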
In Appendix 9, we compare the misclassification rates to evaluate the prediction accuracy
of these four classifiers. First, we focus on their performance on the training and
testing sets. On the training set, SVM has the lowest misclassification rate, 45.3%.
However, its testing set misclassification rate is 50.8%. Compared with the relatively
consistent performance of the other three methods, this could be a sign of overfitting.
The result of bagging is also not good. Its training set misclassification rate is 48.5%,
which is close to ANN’s 48.0%, but the 49.8% rate on the testing set means that almost
half of its predictions are incorrect. RFA performs stably, as expected: the
misclassification rate for the training set is always close to the one for the testing
set. But the result is still not satisfactory. Surprisingly, ANN dominates the other
three methods. Its 48.0% rate on the training set is the second best, behind only the
possibly overfitted SVM, and its 47.4% rate on the testing set is definitely the best.
The problem with ANN is its randomness and inconsistency. One has to run ANN many times
to get the best result, and there is no way to know whether that result is really the
best one can possibly get.
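The testing set rates quoted above can be checked against the confusion matrices in Appendices 5 to 8. The Python sketch below is an illustration only; the function name `misclassification_rate` is an assumption, and the matrices are copied from the appendices with rows taken as the actual class and columns as the predicted class.

```python
def misclassification_rate(matrix):
    """1 - (sum of the diagonal) / (sum of all cells) for a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 1 - correct / total

# Testing-set confusion matrices copied from Appendices 5 to 8
# (rows = actual class, columns = predicted class, same order on both axes).
TEST_MATRICES = {
    "bagging": [[66, 3, 41, 20], [30, 12, 52, 20], [51, 2, 145, 8], [18, 0, 4, 28]],
    "rfa": [[164, 2, 35, 5], [58, 14, 35, 7], [51, 5, 61, 13], [7, 0, 25, 18]],
    "ann": [[165, 0, 35, 6], [59, 12, 28, 15], [48, 3, 64, 15], [7, 0, 21, 22]],
    "svm": [[163, 4, 38, 1], [70, 5, 33, 6], [56, 1, 65, 8], [6, 0, 31, 13]],
}

rates = {name: round(misclassification_rate(m), 3) for name, m in TEST_MATRICES.items()}
best = min(rates, key=rates.get)   # method with the lowest testing-set rate
```

The computed rates are 0.498, 0.486, 0.474 and 0.508 for bagging, RFA, ANN and SVM, matching the values reported in Appendix 9.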
As far as specific categories are concerned, the four methods perform similarly, with
small differences. In general, all four methods classify the “n” group well: according to
Appendix 9, the accuracy rates are around 80% for RFA, ANN and SVM and 70% to 73% for
bagging. But they do extremely badly on the “l” group, where the misclassification rate
is above 80%. They cannot reliably classify the “h” and “v” groups either: correct and
incorrect predictions are about half and half. A possible reason is that the
categorization is not good, since the difference between selling 10 and 50 pieces of
equipment can be very subtle. Hence classification among low, high and very high sales
is very difficult. If the number of categories were reduced to two, these methods could
perform very well.
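As a rough illustration of the two-category idea, the sketch below merges “l”, “h” and “v” into a single “sale” class in the ANN testing-set confusion matrix from Appendix 7. It is written in Python; the function name `collapse_to_binary` is hypothetical. This only re-counts the existing four-class predictions and is not a refitted two-class model, so the resulting figure is an arithmetic illustration rather than a result of the study.

```python
def collapse_to_binary(matrix, first_index=0):
    """Merge every class except `first_index` into a single second class."""
    binary = [[0, 0], [0, 0]]
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            r = 0 if i == first_index else 1
            c = 0 if j == first_index else 1
            binary[r][c] += count
    return binary

# ANN testing-set confusion matrix from Appendix 7 (class order n, l, h, v;
# rows = actual class, columns = predicted class).
ann_test = [[165, 0, 35, 6], [59, 12, 28, 15], [48, 3, 64, 15], [7, 0, 21, 22]]

binary = collapse_to_binary(ann_test)    # "no sale" versus any sale
error = (binary[0][1] + binary[1][0]) / sum(map(sum, binary))
```

Under this re-counting, 155 of the 500 test hospitals fall on the wrong side of the “no sale” versus “sale” boundary, a rate of 0.31, compared with 0.474 for the four-category problem.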
The four methods differ in their prediction abilities across the categories of the
response variable. In general, bagging does worse on “n” but fairly well on “v”. RFA is
weak on “h” and “v”. SVM does very badly on “l” and “v”, but fairly well on “n” and “h”.
ANN is acceptable on “l” and “h” and fairly good on “n” and “v”. Since the “n” group has
the most observations, the method that performs best on that group is very likely to be
the best overall. In our case, it is ANN.
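The link between per-category and overall error can be made explicit: the overall rate is the average of the per-category rates weighted by each category's share of the observations. The Python sketch below illustrates this with the SVM testing-set matrix from Appendix 8; the function name `per_class_error` is hypothetical, and classes 1 to 4 are assumed to correspond to n, l, h and v, which agrees with the rates listed in Appendix 9.

```python
def per_class_error(matrix):
    """Row-wise error: share of each actual class that was predicted wrongly."""
    return [1 - row[i] / sum(row) for i, row in enumerate(matrix)]

# SVM testing-set confusion matrix from Appendix 8
# (rows = actual class, columns = predicted class; classes assumed n, l, h, v).
svm_test = [[163, 4, 38, 1], [70, 5, 33, 6], [56, 1, 65, 8], [6, 0, 31, 13]]

errors = per_class_error(svm_test)
n_total = sum(sum(row) for row in svm_test)
shares = [sum(row) / n_total for row in svm_test]    # class shares in the test set

# Overall rate as the share-weighted average of the per-class rates.
overall = sum(s * e for s, e in zip(shares, errors))
```

The “n” group accounts for 206 of the 500 test hospitals (41.2%), the largest share, so its error rate carries the most weight in the overall figure.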
2.5 Something more
We also try the data mining tree (DMT) method, but it does not produce a result on our
computer within two hours. Furthermore, the goal of DMT is to find interesting groups,
which is a little different from the goal of this project. Therefore, we regretfully give
it up for now and might try it again after upgrading the computer. We plan to try
different categorizations to further test the prediction abilities, since the result with
four categories is far from satisfactory. Besides, the ANN we use is the
single-hidden-layer neural network provided by R; a more complicated ANN might perform
better. However, due to the time limit, we have to leave all of these to future study.
3. Conclusion
None of the four methods yields ideal results. The artificial neural network is
acceptable, and random forests show the expected robustness. The prediction abilities for
the different categories of the response variable differ among the methods. The category
“n” is relatively the easiest to predict, while none of the four methods can successfully
identify the category “l”. There are several ways to improve the performance, such as
reducing the number of categories or adding a region or state factor to the analysis. In
summary, ANN relatively dominates the other three methods in the analysis of this
orthopedic equipment data.
Appendix 0 The Notations of Variables
Response:
SALES12 : SALES OF REHAB. EQUIP. FOR THE LAST 12 MO
Features (predictors):
BEDS : NUMBER OF HOSPITAL BEDS
RBEDS : NUMBER OF REHAB BEDS
OUT-V : NUMBER OF OUTPATIENT VISITS
ADM : ADMINISTRATIVE COST (IN $1000'S PER YEAR)
SIR : REVENUE FROM INPATIENT
HIP95 : NUMBER OF HIP OPERATIONS FOR 1995
KNEE95 : NUMBER OF KNEE OPERATIONS FOR 1995
TH : TEACHING HOSPITAL? 0, 1
TRAUMA : DO THEY HAVE A TRAUMA UNIT? 0, 1
REHAB : DO THEY HAVE A REHAB UNIT? 0, 1
HIP96 : NUMBER OF HIP OPERATIONS FOR 1996
KNEE96 : NUMBER OF KNEE OPERATIONS FOR 1996
FEMUR96 : NUMBER OF FEMUR OPERATIONS FOR 1996
Appendix 1 Transformations of Selected Variables
beds <- log(beds + 1)
rbeds = 1 if rbeds ≠ 1
outv <- 15 * log(outv + 215)
adm <- 0.0001 * log(adm + 425)
sir <- log(0.1 * sir + 42)
hip95 <- log(3 * hip95 + 11)
knee95 <- sqrt(log(3 * knee95 + 15))
hip96 <- log(25 * hip96 + 150)
knee96 <- log(5 + 10 * knee96)
femur96 <- log(20 * femur96 + 60)
Appendix 5 Result of Bagging:
       Length Class      Mode
y      4203   -none-     numeric
X         8   data.frame list
Mtrees   10   -none-     list
OOB       1   -none-     logical
comb      1   -none-     logical
call      4   -none-     call
Bagging regression trees with 20 bootstrap replications
Call: bagging.data.frame(formula = yy_ ~ v1 + beds + outv + adm + sir +
th + trauma + rehab, data = hos.train, nbagg = 20, coob = T)
Training set:
     pred
y       h     l     n     v
  h   473    14   324   261
  l   257    91   401   102
  n   387     2  1314    87
  v   156     1    46   287
Misclassification rate = 0.485
Testing set:
lda.predict
     pred
y       h     l     n     v
  h    66     3    41    20
  l    30    12    52    20
  n    51     2   145     8
  v    18     0     4    28
Misclassification rate = 0.498
Appendix 6 Result of Random Forests:
Call:
randomForest(x = xx.train, y = yf.train, xtest = xx.test, ytest = yf.test, ntree = 1000)
Type of random forest: classification
Number of trees: 1000
No. of variables tried at each split: 2
OOB estimate of error rate: 49.75%
Confusion matrix:
     Pre
y       1     2     3     4   class.error
  1  1424    38   285    43     0.2044693
  2   448   107   242    54     0.8742656
  3   439    65   421   147     0.6072761
  4    88    12   230   160     0.6734694
Test set error rate: 48.6%
Confusion matrix:
     Pre
y       1     2     3     4   class.error
  1   164     2    35     5     0.2038835
  2    58    14    35     7     0.8771930
  3    51     5    61    13     0.5307692
  4     7     0    25    18     0.6400000
Appendix 7 Results of ANN
a 8-20-4 network with 264 weights
options were -
# weights: 264
initial value 3855.019060
iter 10 value 2583.006140
iter 20 value 2409.884200
iter 30 value 2357.671846
iter 40 value 2332.398369
......
iter 740 value 2198.189341
final value 2198.189123
converged
LDA result (training set):
     Pred
y       n     l     h     v
  n  1440     0   292    58
  l   461    92   219    79
  h   423    13   407   229
  v    64     1   178   247
Misclassification rate = 0.480
LDA result (testing set):
     Pred
y       n     l     h     v
  n   165     0    35     6
  l    59    12    28    15
  h    48     3    64    15
  v     7     0    21    22
Misclassification rate = 0.474
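The figure of 264 weights reported at the top of this appendix follows from the 8-20-4 layout. The Python sketch below is an illustration only; the function name `n_weights` is hypothetical, and it assumes a fully connected network with one bias per hidden and output unit and no skip-layer connections.

```python
def n_weights(layer_sizes):
    """Weights in a fully connected feed-forward net, one bias per non-input unit."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# 8 inputs, 20 hidden units, 4 outputs: (8 + 1) * 20 + (20 + 1) * 4
weights = n_weights([8, 20, 4])
```

The 8 inputs are the final predictors, the 20 hidden units correspond to size = 20, and the 4 outputs are the sales categories.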
Appendix 8 Results of SVM
Call:
svm.default(x = xx.train, y = yf.train, kernel = "polynomial", degree = 6, gamma = 0.5,
cross = 10)
Parameters:
SVM-Type: C-classification
SVM-Kernel: polynomial
cost: 1
degree: 6
gamma: 0.5
coef.0: 0
Number of Support Vectors: 3251
( 779 1139 932 401 )
Number of Classes: 4
Levels:
1 2 3 4
10-fold cross-validation on training data:
Total Accuracy: 45.12557
Single Accuracies:
51.35135 51.08108 40.81081 40.70081 38.91892 37.02703 44.74394 42.43243 48.64865
55.52561
LDA result (training set):
     pred
y       1     2     3     4
  1  1500     9   257    24
  2   513    62   244    32
  3   420     9   579    64
  4    68     3   259   160
Misclassification rate = 0.453
LDA result (testing set):
     pred
y       1     2     3     4
  1   163     4    38     1
  2    70     5    33     6
  3    56     1    65     8
  4     6     0    31    13
Misclassification rate = 0.508
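The ten single accuracies from the cross-validation output in this appendix can be summarized as follows. This Python sketch is an illustration only; the values are copied from the output above, and reading the small gap between their unweighted mean and the reported total accuracy as an effect of unequal fold sizes is an assumption.

```python
# Single accuracies (in percent) from the 10-fold cross-validation output.
fold_accuracies = [51.35135, 51.08108, 40.81081, 40.70081, 38.91892,
                   37.02703, 44.74394, 42.43243, 48.64865, 55.52561]

mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)   # unweighted mean
spread = max(fold_accuracies) - min(fold_accuracies)          # best fold minus worst fold
cv_error = 1 - mean_accuracy / 100                            # cross-validated error rate
```

The unweighted mean is about 45.12%, in line with the reported total accuracy of 45.12557, and the folds range over roughly 18.5 percentage points. The implied cross-validated error of about 54.9% is well above the 45.3% training set rate, which is consistent with the overfitting concern raised in Section 2.3.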
Appendix 9 Comparison of 4 Methods:
Misclassification rates
                            Training set                          Testing set
Method                      Overall  N      L      H      V       Overall  N      L      H      V
Bagging                     0.485    0.266  0.893  0.559  0.414   0.498    0.296  0.895  0.492  0.440
Random Forests              0.498    0.204  0.874  0.607  0.673   0.486    0.204  0.877  0.531  0.640
Artificial Neural Network   0.480    0.196  0.892  0.620  0.496   0.474    0.199  0.895  0.508  0.560
Support Vector Machine      0.453    0.162  0.927  0.460  0.673   0.508    0.209  0.956  0.500  0.740