



Comparison of Machine Learning Algorithms in Market Segmentation Analysis

Zhaohua Huang

Dec 12, 2005




Abstract

This project compares four machine learning methods: bagging, random forests (RFA), artificial neural networks (ANN) and support vector machines (SVM), using the sales data of an orthopedic equipment company. The results show that all four methods deliver similarly unsatisfactory prediction performance on this dataset. Although each method has its own advantages in predicting specific categories, ANN is relatively the best based on the misclassification rates.


1. Introduction
Bagging, random forests (RFA), artificial neural networks (ANN) and support vector machines (SVM) are four useful machine learning methods that can be used to improve classification accuracy. Bagging produces replicated training sets by sampling with replacement from the training set and builds a classifier on each replicate. Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The artificial neural network used here is the single-hidden-layer feed-forward network provided by R, in which the inputs are connected to the outputs through one layer of hidden units. The support vector machine for classification creates a hyperplane that separates the data into two classes with the maximum margin. Theoretically, random forests should yield better error rates than bagging and be more robust to noise.

In this project, we use a real data set to empirically compare their classification performance. The data set contains the sales of a company's orthopedic equipment in 4703 hospitals and 13 feature variables that can potentially explain the differences in sales among these hospitals. The four classification algorithms yield predicted probabilities for four sales categories, ordered from low to high: "no sale", "low", "high", and "very high". These probabilities are then input to LDA for classification, and the misclassification rates are compared.
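As a rough sketch of this last step (a sketch only: we assume MASS::lda for the LDA step; prob.train and prob.test are hypothetical placeholders for the class-probability matrices produced by one of the classifiers, while yf.train and yf.test are the observed categories, following the object names in Appendices 6 and 8):

    library(MASS)   # lda()

    # Use a classifier's predicted class probabilities as LDA inputs.
    lda.fit     <- lda(x = prob.train, grouping = yf.train)
    lda.predict <- predict(lda.fit, prob.test)$class

    # Misclassification rate: share of hospitals assigned to the wrong category.
    mean(lda.predict != yf.test)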

2. Analysis and Results
2.1 Overview
The procedure of the analysis is summarized in the following diagram: the transformed data are reduced by PCA and then passed to the four classifiers, whose predicted probabilities are fed to LDA. Since RFA directly reports the classification result instead of probabilities, LDA is not applied to it.

[Diagram: Data Transformation -> PCA -> {Bagging, Random Forests, ANN, SVM} -> LDA]

2.2 Data Manipulation and PCA
The goal and method of the data transformation are the same as in the previous project; the details are in Appendices 0 to 3. There are 5 closely related and highly correlated variables, "knee95", "knee96", "hip95", "hip96" and "femur96": the numbers of knee, hip and femur operations in 1995 and 1996. Simply using them together as predictors would add unnecessary noise to the prediction. Therefore, we apply principal component analysis to these 5 variables after transforming the data. Since the first component explains 91.4% of the variance and the second drops to only about 5%, we use only the first principal component "V1", which is the following linear combination of the five variables: V1 = -0.456 hip95 - 0.445 knee95 - 0.458 hip96 - 0.445 knee96 - 0.432 femur96. The other predictors do not show high correlation after transformation.
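A short sketch of this step, following the call shown in Appendix 4 (check is the matrix of the five transformed operation counts; the name v1 mirrors the predictor used below):

    # Principal component analysis of the five correlated operation counts.
    pc <- princomp(x = check)
    summary(pc)              # Comp.1 explains about 91.4% of the variance
    unclass(pc$loadings)     # loadings defining V1 (see Appendix 4)
    v1 <- pc$scores[, 1]     # first principal component, used as predictor "V1"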


In addition, the variable "rbeds" is transformed into a binary categorical variable, which makes it identical to the variable "rehab", so we drop the former. We also try cross-validation to find a good combination of predictors. The changes in the results are subtle, and in most cases dropping any one variable weakens the prediction power. Hence, the final predictors are "V1", "beds", "outv", "adm", "sir", "th", "trauma" and "rehab"; the description of these 8 predictors is in the appendix. To examine the out-of-sample performance of the four methods, we randomly split the whole data set into two subsets: a training set with 4203 observations and a testing set with 500 observations.
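A minimal sketch of this split (the full data frame name hos and the fixed seed are assumptions; hos.train matches the object name appearing in Appendix 5):

    # Randomly hold out 500 of the 4703 hospitals as the testing set.
    set.seed(1)                            # assumed; any fixed seed
    test.id   <- sample(nrow(hos), 500)
    hos.test  <- hos[test.id, ]            # 500 observations
    hos.train <- hos[-test.id, ]           # 4203 observations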

2.3 Results of bagging, RFA, ANN and SVM
The results of bagging, RFA, ANN and SVM are shown in Appendices 5 to 8, respectively. All four methods involve some randomness, so the results vary from run to run. Hence we run each of them several times and report only the best performance.

Specifically, we choose the default settings for bagging, ntree = 1000 for RFA, size = 20 and maxit = 1000 for ANN, and kernel = "polynomial", degree = 6, gamma = 0.5, cross = 10 for SVM. We do not have many options for bagging in R. We test different numbers of bags from 5 to 100; the results differ slightly, but only because of the randomness of the method, so we keep the default setting. The number of trees in RFA has also been tested from 100 to 1000. Though the increase does not improve the result significantly, we get the best result when we set it to 1000. ANN is very unstable: sometimes the iteration count is only 5 and sometimes it goes to 800 and above. In general, the more iterations, the better the result. The default maximum number of iterations is relatively small, so we set it to 1000, though the iterations never reach that high; our best result is obtained when the optimization stops at iteration 740. There are many options for SVM. We try different kernels (sigmoid, radial and polynomial) and, for the polynomial kernel, different degrees. In general, a high-degree polynomial kernel dominates the sigmoid and radial kernels. It fits the training set fairly well, with a misclassification rate as low as 42%, but none of the combinations does well on the testing set. Also, the gamma coefficient should be 1 over the number of predictors, which would be 0.125 in our study; however, the testing set performs better if we increase gamma to 0.5. Therefore, we suspect there is an overfitting issue with SVM.
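The calls below restate these settings; they are a sketch based on the output in Appendices 5 to 8, so the package choices (ipred, randomForest, nnet, e1071) are our reading of that output, and the object names xx.train, yf.train, hos.train and the model formula are taken from the appendices:

    library(ipred)          # bagging()
    library(randomForest)   # randomForest()
    library(nnet)           # nnet(), class.ind()
    library(e1071)          # svm()

    # Bagging (Appendix 5): 20 bootstrap replications with out-of-bag estimation.
    bag.fit <- bagging(yy_ ~ v1 + beds + outv + adm + sir + th + trauma + rehab,
                       data = hos.train, nbagg = 20, coob = TRUE)

    # Random forests (Appendix 6): 1000 trees.
    rf.fit <- randomForest(x = xx.train, y = yf.train,
                           xtest = xx.test, ytest = yf.test, ntree = 1000)

    # Single-hidden-layer network (Appendix 7): 20 hidden units, up to 1000 iterations.
    ann.fit <- nnet(x = xx.train, y = class.ind(yf.train),
                    size = 20, maxit = 1000, softmax = TRUE)

    # SVM (Appendix 8): polynomial kernel of degree 6, gamma = 0.5, 10-fold CV.
    # (probability = TRUE would additionally be needed to extract class probabilities.)
    svm.fit <- svm(x = xx.train, y = yf.train, kernel = "polynomial",
                   degree = 6, gamma = 0.5, cross = 10)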

In Appendix 9, we compare the misclassification rates to evaluate the prediction accuracy of the 4 classifiers. First, we focus on their overall performance on the training and testing sets. On the training set, SVM has the lowest misclassification rate, 45.3%; however, its testing-set misclassification rate is 50.8%. Compared with the relatively consistent performance of the other 3 methods, this could be a sign of overfitting. The result of bagging is also not good: its training-set misclassification rate is 48.5%, close to ANN's 48.0%, but the 49.8% rate on the testing set means that bagging gets almost half of its predictions wrong. RFA performs stably, as expected; its misclassification rate on the training set is always close to that on the testing set, but the result is still not satisfactory. Somewhat surprisingly, ANN dominates the other three methods: its 48.0% training rate is the second best (behind the possibly overfitting SVM), and its 47.4% testing rate is clearly the best. The problem with ANN is its randomness and inconsistency: one has to run ANN many times to get the best result, and no one knows whether that result is really the best attainable.

As far as the specific categories are concerned, the four methods' performances are close but slightly different. In general, all four methods classify the "n" group very well: the accuracy rates are all above 80%. But they do extremely badly on the "l" group, where the misclassification rate is above 80%. They cannot really separate the "h" and "v" groups; correct and incorrect predictions are about half and half. The reason could be that the categorization itself is not good: the difference between selling 10 and 50 pieces of equipment can be very subtle, so distinguishing low, high and very high sales is very difficult. If the number of categories were reduced to two, these methods could perform very well. The four methods differ in their prediction abilities for the different categories of the response variable. In general, bagging does worse on "n" but fairly well on "v". RFA has its weakness on "h" and "v". SVM does very badly on "l" and "v", but fairly well on "n" and "h". ANN is acceptable for "l" and "h", and good for "n" and "v". Since the "n" group contains most of the observations, the method that performs best on that group is very likely to be the best overall; in our case, that is ANN.
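For reference, the per-category rates reported in Appendix 9 are the row-wise error rates of the confusion matrices (the class.error column of Appendix 6); a small sketch, assuming pred.test holds the predicted categories for the testing set:

    # Confusion matrix: rows are observed categories, columns are predictions.
    cm <- table(observed = yf.test, predicted = pred.test)

    # Per-category misclassification rate: share of each observed category
    # predicted into some other category (the N/L/H/V columns of Appendix 9).
    1 - diag(cm) / rowSums(cm)

    # Overall misclassification rate.
    1 - sum(diag(cm)) / sum(cm)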

2.5 Something more
We also try the data mining tree (DMT) method, but the computation does not finish on our computer within two hours. Furthermore, the goal of DMT is to find interesting groups, which is somewhat different from the goal of this project. Therefore, we might try it again after upgrading the computer, but we regretfully give it up for now. We plan to try a different categorization to further test the prediction abilities, since the four-category results are far from satisfactory. Besides, the ANN we use is the single-hidden-layer network provided by R; a more complicated ANN might perform better. However, due to the time limit, we have to leave all of this to future study.

3. Conclusion
The four methods do not yield ideal results. The artificial neural network is acceptable, and random forests show the expected robustness. The methods differ in their prediction abilities for the different categories of the response variable: the category "n" is relatively the easiest to predict, while none of the four methods can successfully identify the category "l". There are several ways to improve the performance, such as reducing the number of categories or adding region or state factors to the analysis. In summary, ANN relatively dominates the other three methods in the analysis of this orthopedic equipment data.





Appendix 0 The Notations of Variables




Response:

SALES12 : SALES OF REHAB. EQUIP. FOR THE LAST 12 MO

Features (predictors):

BEDS    : NUMBER OF HOSPITAL BEDS
RBEDS   : NUMBER OF REHAB BEDS
OUT-V   : NUMBER OF OUTPATIENT VISITS
ADM     : ADMINISTRATIVE COST (in $1000's per year)
SIR     : REVENUE FROM INPATIENT
HIP95   : NUMBER OF HIP OPERATIONS FOR 1995
KNEE95  : NUMBER OF KNEE OPERATIONS FOR 1995
TH      : TEACHING HOSPITAL? 0, 1
TRAUMA  : DO THEY HAVE A TRAUMA UNIT? 0, 1
REHAB   : DO THEY HAVE A REHAB UNIT? 0, 1
HIP96   : NUMBER OF HIP OPERATIONS FOR 1996
KNEE96  : NUMBER OF KNEE OPERATIONS FOR 1996
FEMUR96 : NUMBER OF FEMUR OPERATIONS FOR 1996



Appendix 1 Transformations of Selected Variables

beds = log(beds+1)
rbeds = 1 if rbeds ≠ 1
outv = 15*log(outv+215)
adm = 0.0001*log(adm+425)
sir <- log(0.1*sir+42)
hip95 <- log(3*hip95+11)
knee95 <- sqrt(log(3*knee95+15))
hip96 <- log(25*hip96+150)
knee96 <- log(5+10*knee96)
femur96 <- log(20*femur96+60)


Appendix 2 Distributions before Transformation


Appendix 3 Distributions after Transformation



Appendix 4 Result of PCA:
 Call: princomp(x = check)
Standard deviations:
  Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
2.1373123 0.4777930 0.3386012 0.2303204 0.1866775
 5 variables and 4703 observations.

Importance of components:
              Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
Standard deviation    2.1373123 0.47779303 0.33860116 0.23032037 0.18667749
Proportion of Variance 0.9138151 0.04566695 0.02293503 0.01061175 0.00697118
Cumulative Proportion 0.9138151 0.95948204 0.98241707 0.99302882 1.00000000

Loadings:
  Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
[1,] -0.456     -0.529 0.162 0.697
[2,] -0.445 -0.548 -0.214 -0.612 -0.286
[3,] -0.458 0.126 -0.227 0.587 -0.615
[4,] -0.445 -0.344 0.749 0.261 0.233
[5,] -0.432 0.751 0.248 -0.432

         Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
SS loadings    1.0 1.0 1.0 1.0 1.0
Proportion Var 0.2 0.2 0.2 0.2 0.2
Cumulative Var 0.2 0.4 0.6 0.8 1.0


Appendix 5 Result of Bagging:
      Length Class   Mode
 y     4203 -none- numeric
 X      8 data.frame list
 Mtrees 10 -none- list
 OOB 1 -none- logical
 comb 1 -none- logical
 call    4 -none- call
 Bagging regression trees with 20 bootstrap replications

 Call: bagging.data.frame(formula = yy_ ~ v1 + beds + outv + adm + sir +
   th + trauma + rehab, data = hos.train, nbagg = 20, coob = T)

  Training set:
 pred h l n v
  h 473 14 324 261
y l 257 91 401 102
  n 387 2 1314 87
  v 156 1 46 287
Misclassification rate= 0.485

Testing set:
  lda.predict
 pred h l n v
   h 66 3 41 20
y l 30 12 52 20
   n 51 2 145 8
   v 18 0 4 28
Misclassification rate= 0.498


Appendix 6 Result of Random Forests:
Call:
randomForest(x = xx.train, y = yf.train, xtest = xx.test, ytest = yf.test,   ntree = 1000)
         Type of random forest: classification
             Number of trees: 1000
No. of variables tried at each split: 2

    OOB estimate of error rate: 49.75%
Confusion matrix:
 Pre 1 2 3 4 class.error
   1 1424 38 285 43 0.2044693
y 2 448 107 242 54 0.8742656
   3 439 65 421 147 0.6072761
   4 88 12 230 160 0.6734694
         Test set error rate: 48.6%
Confusion matrix:
 Pre 1 2 3 4 class.error
   1 164 2 35 5 0.2038835
y 2 58 14 35 7 0.8771930
   3 51 5 61 13 0.5307692
   4 7 0 25 18 0.6400000


       Appendix 7 Results of ANN
a 8-20-4 network with 264 weights
options were -
# weights: 264
initial value 3855.019060
iter 10 value 2583.006140
iter 20 value 2409.884200
iter 30 value 2357.671846
iter 40 value 2332.398369
......
iter 740 value 2198.189341
final value 2198.189123
converged


LDA result(training set):
  Pred n l h         v
  n 1440 0 292 58
y l 461 92 219 79
  h 423 13 407 229
   v 64 1 178 247
Misclassification rate= 0.480

LDA result(testing set):
  Pred n l h         v
  n 165 0 35 6
y l 59 12 28 15
  h 48 3 64 15
  v 7 0 21 22
Misclassification rate= 0.474


Appendix 8 Results of SVM
Call:
svm.default(x = xx.train, y = yf.train, kernel = "polynomial", degree = 6, gamma = 0.5,
cross = 10)

Parameters:
  SVM-Type: C-classification
SVM-Kernel: polynomial
     cost: 1
   degree: 6
    gamma: 0.5
   coef.0: 0
Number of Support Vectors: 3251
( 779 1139 932 401 )

Number of Classes: 4
Levels:
 1 2 3 4

10-fold cross-validation on training data:

Total Accuracy: 45.12557
Single Accuracies:
 51.35135 51.08108 40.81081 40.70081 38.91892 37.02703 44.74394 42.43243 48.64865
55.52561

LDA result(training set):
    pred 1 2 3 4
   1 1500 9 257 24
 y 2 513 62 244 32
   3 420 9 579 64
   4 68 3 259 160
Misclassification rate= 0.453

LDA result(testing set):
 pred 1 2 3 4
    1 163 4 38 1
y 2 70 5 33 6
    3 56 1 65 8
    4 6 0 31 13
Misclassification rate= 0.508


Appendix 9 Comparison of 4 Methods:

Overall and per-category misclassification rates on the training and testing sets:

                              Training set                          Testing set
Method                        Overall  N      L      H      V       Overall  N      L      H      V
Bagging                       0.485    0.266  0.893  0.559  0.414   0.498    0.296  0.895  0.492  0.440
Random Forests                0.498    0.204  0.874  0.607  0.673   0.486    0.204  0.877  0.531  0.640
Artificial Neural Network     0.480    0.196  0.892  0.620  0.496   0.474    0.199  0.895  0.508  0.560
Support Vector Machine        0.453    0.162  0.927  0.460  0.673   0.508    0.209  0.956  0.500  0.740
