Model Evaluation for Classification and Regression
Aravind Kumar Balasubramaniam, 14123754
School of Computing
National College of Ireland
Dublin, Ireland
Email: aravindkumar.balasubramaniam@student.ncirl.ie
Anisha Kudalappa Gudagi, 14123223
School of Computing
National College of Ireland
Dublin, Ireland
Email: anishaKudalappa.Gudagi@student.ncirl.ie
Abstract—This paper aims to compare classification and re-
gression models using the CARET package in R. Classification and regression analysis are performed for predicting survival on the Titanic dataset and quantifying hazard scores on a property-liability dataset respectively. Four classification algorithms are modeled on data samples from the Titanic dataset and evaluated against Accuracy, Sensitivity, Specificity, Pos Pred Value, Neg Pred Value, Kappa and F-measure metrics. Based on F-measure the best model is selected and classification analysis is performed on the Titanic dataset. Three regression algorithms are modeled on data samples from the property-liability dataset and evaluated against RMSE, R-squared, RMSE SD and R-squared SD. The RMSE metric is used for selecting the best model for regression analysis and predicting the hazard score.
I. INTRODUCTION
Data prediction is a non-trivial task and has enormous value. This paper outlines the heuristics of classification and regression modelling and provides a literature review.
Two different data sets individually suitable for classification
and regression are used for evaluating the model performance
metrics with the aid of CARET package and to train test the
selected algorithm. This research paper presents and discusses
classification based survival prediction of the infamous Titanic
shipwrecks and regression based prediction of property haz-
ards based on property information in the property-liability
insurance industry. Predictive analysis is one of the most
important supervised data mining techniques, as it enables us to find unidentified patterns or trends in datasets. Machine
learning from disaster is one of the most explored areas in
the field of Data mining and disaster management. In the
past, many researchers have used Data mining techniques for
predicting survivals from natural disasters and for predicting
patient survival rates for uncommon and incurable diseases. We have used classification data mining algorithms to predict what sort of people were likely to survive the shipwreck. The pre-
diction depends on the key attributes such as gender, passenger
class, age, parents, siblings and the type of ticket. Regression
analysis is done on the property-liability data for quantifying
hazards. Property insurance requires inspection of the property
based on the condition of the property such as foundation,
roof, flooring, etc., which are key property attributes. These key property attributes need to be investigated by the insurance companies before a property is approved for insurance. Thus, to
provide a clear insight to insurance companies to find property
hazards from key attributes of properties, predictive analysis is
best suited. In this paper we have analyzed and built predictive
models to classify the rate of survivals in Titanic shipwreck
and to quantify property hazards before the time of inspection using the Liberty Mutual Insurance company dataset, which contains a hazard score provided for each property by inspection. There are two queries: the first is to “predict what sort of people were likely to survive the shipwreck” using classification, and the second is to “predict which of these hazards contribute more for each property”. We have built
four classification models and three regression models for
comparison and selected the best model. The rest of the paper
is divided as follows: section (2) related work, which discusses predictive analysis in the insurance industry and compares models, section (3) methodology used to answer the query,
section (4) evaluation of results and section (5) conclusion
and future work.
II. RELATED WORK
The classification data mining technique refers to assigning each data instance to a class label. It is one of the most researched areas in the field of data mining. Classification
algorithms aim to find a classifier that will assist in assigning
the input instance to a class label [8]. Classification has
been used as the most popular technique in the field of
disaster management and medicine. Conventional survival analysis techniques evaluate the data using the Kaplan-Meier method or the Cox proportional hazards model. However, in recent years classification models have been widely used in medicine. These models first establish
a tree structure by partitioning the training data into several
subclasses. This partitioning is done according to the test
conditions until all the data are grouped under one subclass.
After the tree structure is established, pruning is done to
the tree from the bottom of the tree. After the pruning,
rules are created from the output and these rules are used in
the classification of new training data for prediction [2]. The K-nearest neighbor algorithm is one of the oldest non-parametric methods of classification. In this algorithm a class
is assigned based on the common class amongst the k-nearest
neighbors. Fuzzy k-nearest neighbor is an extension of KNN
in which the algorithm assigns the fuzzy memberships of
data samples to different classes [3]. Boosting is an iterative
algorithm combining classification rules with performance in
terms of error rate to produce an accurate classification rule.
Regularized discriminant analysis (RDA) assigns objects to one of several groups on the basis of measurements obtained from each object. Regularization techniques are applied to linear discriminant analysis and quadratic discriminant analysis and have been successful on poorly-posed inverse problems. RDA improves the misclassification risk and error rate relative to LDA and QDA [4]. Recursive partitioning
(rpart) is a statistical method that is used in classification
and regression trees. This method is used to discover structures and trends in the data sample. It is used in various scientific fields for multivariate data exploration, for example DNA sequencing and medicine. The algorithm can be tuned to perform either classification or regression [10]. Data mining
techniques have been used in the insurance industry for quite some time, since insurance databases consist of large datasets
which provide valuable business knowledge for improving the
customer relationship or improving profits or expanding the
business. Modeling insurance risk is done by applying data
mining techniques. Past research has shown that data mining methods improve existing models by discovering extra variables and by detecting nonlinear relationships. Data mining
has been of great importance in the insurance industry by
identifying risk factors that help in predicting profits and
losses. Data mining techniques like decision trees and neural
networks can accurately predict risk. Customer relationship management analysis helps in understanding the customer and in accurately selecting which policies to offer a customer
[5]. Random Forest is a data mining algorithm for performing
classification and regression analysis. The tuning parameter for randomForest is mtry, which controls the number of predictor variables tried at each split. The randomForest package produces two
sets of information: a metric for specifying the importance
of the predicting variables and a metric for specifying the
structure of the data [7]. Multivariate regression methods such
as principal component analysis and partial least squares have many applications in a variety of industries. Quantitative structure-activity relations and quantitative structure-property relations use PLSR and PCR. In PCR, the first a principal components (PCs) are used to approximate the predictor matrix; Y is then regressed on the scores, which in turn gives the regression coefficients [9]. Gradient boosted regression
is an iterative algorithm for finding a predictor. Regression trees repeatedly expand nodes until a stopping criterion is met; initially all the data points are assigned to a single node. Parallelized boosted regression trees keep the boosting itself sequential while parallelizing the construction of each individual tree [11]. The CARET package
which is short for classification and regression training con-
sists of classification and regression models. It is used for
model tuning and training across models. This package helps
in comparing model performance between different models.
Since classification and regression models are used in many different applications, the caret package helps in selecting the best model and approach [6].
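As a brief illustration of this workflow (a minimal sketch on R's built-in iris data, not code from the experiments reported later in this paper), caret can fit several models under a shared resampling scheme and then collect their resampling results for side-by-side comparison:
# Illustrative sketch only: compare two classifiers under the same 10-fold,
# 3-repeat cross-validation and summarise their resampling results
library(caret)
set.seed(42)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
fit_rf <- train(Species ~ ., data = iris, method = "rf", trControl = ctrl)
fit_knn <- train(Species ~ ., data = iris, method = "knn", trControl = ctrl)
resamps <- resamples(list(RF = fit_rf, KNN = fit_knn))
summary(resamps) # Accuracy and Kappa distributions across the 30 resamples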
III. METHODOLOGY
The data mining methodology used here is CRISP-DM
methodology. CRISP-DM stands for Cross-Industry Standard
Process for Data Mining. This methodology consists of six
steps:
Business Understanding: This phase includes understanding
the business objectives and requirements.
Data Understanding: This phase starts with the initial data
collection and identifying subsets in the data.
Data Preparation: Transformation and cleaning is done in
this phase. Data is prepared for modeling.
Modeling: Modeling techniques are applied to the dataset.
Evaluation: Evaluation of the model results are performed
in this phase. The results are measured to satisfy business
objectives.
Deployment: The model built is deployed to the customer
with the results [1].
A. Implementation for Classification Modelling
1) Business Understanding: The objective was to identify or predict what sort of people, other than the upper class, women and children, were likely to survive the RMS Titanic, one of the worst shipwrecks in history. The preliminary plan was
to understand the key variables or dimensions, which will be
used for predicting the class-object and use caret package for
model evaluation and identifying the parameter values.
2) Data Understanding: The Titanic data set was collected
from Kaggle. The key variables identified were ‘Sex’, ‘Age’, ‘Pclass’, ‘Parch’, ‘Fare’, ‘SibSp’ and ‘Embarked’, used to predict the categorical class ‘Survived’. Some problems, such as NULLs and NAs, were identified in the data set; these are addressed in the following stages.
3) Data Preparation: Problems detected in data under-
standing are addressed here.
• Median values are replaced for NA’s in ‘Fare’.
• NULL values are replaced with “S” in ‘Embarked’.
• NA’s in ‘Age’ are replaced with -1.
• ‘Sex’ is converted to factor.
• ‘Embarked’ is converted to factor.
4) Modeling: Four modeling algorithms were selected in
a random manner. Random forest (‘rf’), K-Nearest neighbor
(‘knn’), Regularized Discriminant Analysis (‘rda’) and Tree
based model – CART (‘rpart’). In order to tune the parameters and evaluate the metrics of the above models, the Classification And REgression Training package, abbreviated CARET1, is used.
Model Tuning: NAs, NULLs and factorization are handled in data pre-processing. Data partitioning is performed through simple splitting by creating balanced splits of 75% and 25% for each class: 75% is used for training and 25% for testing within the caret package. Parameter selection: the train function in ‘CARET’ is used to evaluate models by resampling the training data with a parameterized number of folds and repeats. Here resampling is done using ten folds and three repeats via the ‘repeatedcv’ method, which repeats the cross validation. Class probability is set to TRUE to compute class probabilities for the held-out samples. summaryFunction is set to ‘twoClassSummary’ so that caret computes specificity, sensitivity and the area under the Receiver Operating Characteristic curve. The predictors neither require estimating power transformations nor have zero or negative values, so the Box-Cox and Yeo-Johnson pre-processing methods are not needed; instead, only centering and scaling are used for random forest, and ‘knnImpute’ is used to find the distance between the k closest samples using Euclidean distance in the case of K-Nearest Neighbour. Tune grid: gamma is set to range from 0.00 to 1.00 in steps of 0.25 and lambda is fixed at 0.75 for Regularized Discriminant Analysis to find the optimal ROC. The ntree parameter was checked over the range 10 to 15 in steps of 1 and found to have the best fit at 13 for Random Forest.
1 http://topepo.github.io/caret/index.html
R-Code for Classification Modeling using CARET
##CARET Model evaluation for Titanic data set##
##Common Process across all algorithms begin##
library(caret)
library(readr)
set.seed(121)
#Read Titanic data from csv
crtTrain<-read_csv("trainTc.csv",
col_names = TRUE,n_max = -1,
progress = interactive())
#Data cleaning: 'NA' in Age is replaced with the median value
crtTrain$Age[is.na(crtTrain$Age)] <-
median(crtTrain$Age, na.rm=TRUE)
#Selecting required variables
crtTrain <- crtTrain[c(#'PassengerId',
'Pclass',#'Name',
'Sex','Age','SibSp','Parch',
#'Ticket','Fare','Cabin','Embarked',
'Survived')]
#Convert the survived from binary into factor
crtTrain$Survived <-
ifelse(crtTrain$Survived==1,'yes','no')
crtTrain$Survived <-
as.factor(crtTrain$Survived)
#Partition 75% to training and the remainder to test
set.seed(221)
inTrain <- createDataPartition(y =
crtTrain$Survived,p = .75,list = FALSE)
#Create train and test
tcTrain <- crtTrain[ inTrain,]
tcTest <- crtTrain[-inTrain,]
####Common Process ends#####
####Train function on all 4 algorithms#######
####Random Forest#######
set.seed(301)
rfctrl <- trainControl(method =
"repeatedcv",number=10,repeats = 3,
verboseIter=TRUE,classProbs
= TRUE,
summaryFunction =
twoClassSummary)
#install.packages("pROC")
rfFit <- train(Survived ~ .,
data = tcTrain,
method = "rf",
metric="ROC",
ntree = 13,
preProcess = c("center", "scale"),
trControl = rfctrl)
rfFit
plot(rfFit)
#Predict
rfPredictTC <- predict(rfFit, newdata =
tcTest)
rfProbs <- predict(rfFit, newdata = tcTest,
type = "prob")
confusionMatrix(data = rfPredictTC,
tcTest$Survived)
######K-Nearest Neighbour########
knnctrl <- trainControl(method =
"repeatedcv",number=10,repeats = 3,
verboseIter=TRUE,classProbs
= TRUE,
summaryFunction =
twoClassSummary)
set.seed(301)
#install.packages("pROC")
knnFit <- train(Survived ~ .,
data = tcTrain,
method = "knn",
metric="ROC",
preProcess = "knnImpute",
tuneLength = 10,
trControl = knnctrl)
knnFit
plot(knnFit)
#Predict
knnPredictTC <- predict(knnFit, newdata =
tcTest)
knnProbs <- predict(knnFit, newdata = tcTest,
type = "prob")
confusionMatrix(data = knnPredictTC,
tcTest$Survived)
####Regularized Discriminant Analysis#####
mygrid <- data.frame(gamma = (0:4)/4, lambda
= 3/4)
rdactrl <- trainControl(method =
"repeatedcv",number=10,repeats = 3,
verboseIter=TRUE,classProbs
= TRUE,
summaryFunction =
twoClassSummary)
rdaFit <- train(Survived ~ .,
data = tcTrain,
method = "rda",
trControl = rdactrl,
metric = "ROC",
tuneGrid=mygrid,
trace = FALSE,
maxit = 100)
rdaFit
plot(rdaFit)
#Predict
rdaPredictTC <- predict(rdaFit, newdata =
tcTest)
rdaProbs <- predict(rdaFit, newdata = tcTest,
type = "prob")
confusionMatrix(data = rdaPredictTC,
tcTest$Survived)
####Tree based model CART (rpart)#####
rptctrl <- trainControl(method =
"repeatedcv",number=10,repeats = 3,
verboseIter=TRUE,classProbs
= TRUE,
summaryFunction =
twoClassSummary)
rptFit <- train(Survived ~ .,
data = tcTrain,method = "rpart",
trControl = rptctrl,metric =
"ROC",tuneLength = 10)
rptFit
plot(rptFit)
#Predict
rptPredictTC <- predict(rptFit, newdata =
tcTest)
rptProbs <- predict(rptFit, newdata = tcTest,
type = "prob")
confusionMatrix(data = rptPredictTC,
tcTest$Survived)
######End of Modeling###########
Performance metrics comparison between 4 models: After executing the models using the CARET package, the following statistics were collected to evaluate the results and choose the best model. Fig 1 shows the tabulation of the cross table matrix and F-Measure statistics for all four models.
Figure 1. Table: Cross table and F-Measure
It is inferred from Fig 2 that Random Forest has an F-Measure of 77.3%, which is higher than that of the other models.
Fig 3 shows the other metrics comparison for all four
models.
Fig 4 shows the Kappa statistic comparison for all four
models.
Fig 5 highlights the Random Forest performance against
other models.
The Random Forest model, with tuning parameters ntree = 13 and mtry = 2, is evaluated to be the best fit using the CARET package, based on the metrics collected and consolidated across the models.
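The F-measure used for this comparison is not reported directly by caret's twoClassSummary; a small helper of the following form can derive it from each model's confusion matrix. This is a hedged sketch rather than the authors' code, and it assumes the predictions (rfPredictTC and so on) from the modeling code above are still in memory:
# Assumed helper (not from the paper): F-measure from a caret confusionMatrix
f_measure <- function(cm) {
  precision <- cm$byClass["Pos Pred Value"]
  recall <- cm$byClass["Sensitivity"]
  unname(2 * precision * recall / (precision + recall))
}
rfCM <- confusionMatrix(data = rfPredictTC, tcTest$Survived)
f_measure(rfCM) # repeat with the knn, rda and rpart predictions to build Fig 1
Note that confusionMatrix treats the first factor level as the positive class by default, so the class of interest may need to be set explicitly via its positive argument.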
B. Implementation for Regression Modelling
1) Business Understanding: The objective of the research was to quantify property hazards before the time of inspection using the Liberty Mutual Insurance company dataset. At this stage the key attributes and dimensionality of the dataset were identified. The caret package in R was chosen for model evaluation and for selecting the best model for regression analysis.
Figure 2. F-Measure comparison
Figure 3. Other metrics comparison
2) Data Understanding: The property hazard dataset was
downloaded from the Kaggle website. The dataset is split into two: train and test. The train dataset contains the hazard score
and anonymized predictor variables. This dataset was used
for modeling algorithms and selecting the best model. Test
dataset contains only the anonymised predictor variables. Each
row in the train dataset corresponds to a property that was evaluated and given a hazard score. The hazard score attribute
is a continuous number that represents the condition of the
property provided by the inspection committee. The dataset
contains 51000 records.
3) Data Preparation: The train dataset was used for mod-
eling and selecting the best model for regression. This dataset
did not contain any NULL values or NA values. The dataset
was clean and did not require any preparation or transforma-
tion for modeling.
4) Modeling: Three modeling algorithms are used for prediction through regression analysis: Random Forest (rf), Partial Least Squares (pls) and eXtreme Gradient Boosting (xgboost). The caret package contains all of the classification and regression models; this package is used for tuning the parameters and comparing the models in R. The train dataset is loaded into R and data partitioning is done by splitting the data into a training set containing 75% of the records and a test set containing the remaining 25%.
Figure 4. Kappa Statistic
Figure 5. Other Metrics
Model Tuning: Parameter selection is done through the trainControl function. This function is used for resampling the training data by specifying the tuning parameters; it also controls the computational refinements of the train function. The tuning parameters used are: method (repeatedcv), the resampling method to be used; number (10), the number of folds or resampling iterations; repeats (3), the number of complete sets of folds to compute; verboseIter (TRUE), a logical that prints a training log when TRUE; classProbs (FALSE), since class probabilities apply only to classification analysis; and summaryFunction (defaultSummary), a function to compute performance metrics across resamples. The train function is used to fit predictive models over different tuning parameters; it fits each model and calculates a resampling-based performance measure. The parameters used for train are: data, the training data; method, which specifies the model used (rf, pls and xgbTree); trControl, which specifies how the function acts (the trainControl object); tuneLength, specified as 10 for the number of tuning levels; ntree, the number of trees (20); and importance, set to TRUE.
R-Code for Regression Modeling using CARET
##CARET Model evaluation for Liberty Mutual Group dataset##
##Common Process across all algorithms begin##
#Install 'caret', 'randomForest', 'readr', 'pls', 'xgboost'
install.packages("caret")
install.packages("randomForest")
install.packages("readr")
install.packages("xgboost")
install.packages("pls")
# Load the caret, readr, xgboost, pls and randomForest packages
library(caret)
library(xgboost)
library(readr)
library(randomForest)
library(pls)
# Read the first 5000 rows of the Liberty Mutual dataset into R using the readr package
Crttrain<-read_csv("C:/train.csv", n_max =
5000, progress = interactive())
#Partition the data into training and test.
intrain<- createDataPartition(y =
Crttrain$Hazard,p = .75,list = FALSE)
#create train and test
tcTrain <- Crttrain[ intrain,]
tcTest <- Crttrain [-intrain,]
#checking the number of rows and columns
nrow(tcTrain)
ncol(tcTest)
# set seed
set.seed(107)
####Common Process ends#####
###Train function on all 3 algorithms#######
###Random Forest####
# Tuning the model using trainControl: method repeatedcv with 10 folds
rfctrl <- trainControl(method =
"repeatedcv",repeats = 3, number = 10,
verboseIter=TRUE,
classProbs = FALSE,
summaryFunction =
defaultSummary)
# running the randomForest model
rfFit <- train(Hazard ~ ., data = tcTrain,
method = "rf", trControl = rfctrl,
tuneLength = 10, ntree = 20, importance =
TRUE)
#checking the output
rfFit
# plotting the model
plot(rfFit)
# using Predict function to predict the model
rf <- predict(rfFit, newdata = tcTest)
str(rf)
# Checking for problems
rfProbs <- predict(rfFit, newdata = tcTest,
type = "raw")
head(rfProbs)
####Partial Least Squares####
# Tuning the model using trainControl: method repeatedcv with 10 folds
plsctrl <- trainControl(method =
"repeatedcv",repeats = 3,number = 10,
verboseIter=TRUE, classProbs = FALSE,
summaryFunction = defaultSummary)
# Running the PLS model
plsFit <- train(Hazard ~ ., data = tcTrain,
method = "pls", trControl = plsctrl,
tuneLength = 10,ntree = 20, importance =
TRUE)
# checking the output
plsFit
plot(plsFit)
# using Predict function to predict the model
pls <- predict(plsFit, newdata = tcTest)
str(pls)
# Checking for problems
plsProbs <- predict(plsFit, newdata = tcTest,
type = "raw")
head(plsProbs)
####eXtreme Gradient Boosting####
# Tuning the model using trainControl: method repeatedcv with 10 folds
xgctrl <- trainControl(method =
"repeatedcv",repeats = 3,number = 10,
verboseIter=TRUE, classProbs = FALSE,
summaryFunction = defaultSummary)
#running the Xgboost model
xgFit <- train(Hazard ~ ., data = tcTrain,
method = "xgbTree", trControl = xgctrl,
tuneLength = 10, ntree = 20, importance =
TRUE)
#checking the output
xgFit
plot(xgFit)
#using Predict function to predict the model
xgboost <- predict(xgFit, newdata = tcTest)
str(xgboost)
#Checking for problems
xgProbs <- predict(xgFit, newdata = tcTest,
type = "raw")
head(xgProbs)
######End of Modeling###########
Performance metrics comparison between 3 models: The output of these models shows four metrics, namely RMSE, R-squared, RMSE SD and R-squared SD. RMSE and R-squared are considered for comparing the models. RMSE, which stands for Root Mean Square Error, is a standard metric for reporting the prediction error of a continuous variable. R-squared is used to examine how well the model fits the training data; it tells us what percentage of the variance in the data is explained by the model. The model with the lowest RMSE is considered optimal. Fig 6 shows the comparison of metrics for all three models. As shown in Fig 6, RF and xgboost have the lowest RMSE values, and these values are very close at 3.874635 and 3.872974 respectively. This allows us to ensemble the RF and xgboost models and use them for regression analysis.
Figure 6. Comparison of metrics
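The tabulated comparison in Fig 6 can also be produced programmatically; the following is a hedged sketch, assuming the rfFit, plsFit and xgFit objects created by the modeling code above are still in memory:
# Collect the resampling results of the three regression fits and compare them
regResamps <- resamples(list(rf = rfFit, pls = plsFit, xgboost = xgFit))
summary(regResamps) # RMSE and R-squared summaries per model
bwplot(regResamps, metric = "RMSE") # lower RMSE indicates a better fit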
IV. EVALUATION
A. Classification Evaluation
The Random Forest model with parameters ntree = 13 and mtry = 2 is executed against the test data set to predict the results.
R-Code for Random Forest output
###Survival prediction for Titanic data set
using Random Forest######
library(randomForest)
library(readr)
set.seed(1001)
#Read Titanic data from csv
tcTrain<-read_csv("trainTc.csv",
col_names = TRUE,n_max = -1,
progress = interactive())
tcTest <- read_csv("test.csv",
col_names = TRUE,n_max = -1,
progress = interactive())
#Manual function to preprocess data
tcPredictors <- function(inputdat) {
predictors <- c("Pclass","Age","Sex",
"Parch","SibSp",
"Fare","Embarked")
preds <- inputdat[,predictors]
preds$Fare[is.na(preds$Fare)] <-
median(preds$Fare, na.rm=TRUE)
preds$Embarked[preds$Embarked==""] = "S"
preds$Age[is.na(preds$Age)] <- -1
preds$Sex <- as.factor(preds$Sex)
preds$Embarked <- as.factor(preds$Embarked)
return(preds)
}
#Random Forest algorithm
rf <- randomForest(tcPredictors(tcTrain),
as.factor(tcTrain$Survived),
ntree=13,mtry = 2,
importance=TRUE)
#Write out result
outResult <- data.frame(PassengerId =
tcTest$PassengerId)
outResult$Survived <- predict(rf,
tcPredictors(tcTest))
#Write CSV file for out result
write.csv(outResult, file =
"TitanicResult.csv", row.names=FALSE)
###End of code###
Fig 7 shows a screenshot of the output, written in CSV format by the Random Forest model.
Figure 7. Predicted Output screen shot
B. Regression Evaluation
With the results of the model comparison, we have selected an ensemble of the rf and xgboost models for the regression analysis. The tuning parameters are selected from the results of the model comparison. For applying the regression algorithm, the train and test datasets are used and the predictor dataset is formed from the train data. The tuning parameters are as follows: for random forest, ntree = 20 and sampsize = 10000; for xgboost, nrounds = 150, eta = 0.3 and max_depth = 1, with the objective of xgboost specified as linear regression.
R-Code for ensemble of Rf and Xgboost models
### Predicting transformed count of Hazards ####
# load required packages
require(xgboost)
library(caret)
library(randomForest)
library(readr)
# load raw data
train = read_csv('C:/train.csv')
test = read_csv('C:/test.csv')
# Create the response variable
hazard = train$Hazard
# Create the predictor data sets and encode categorical variables using the caret library
htrain = train[,-c(1,2)]
htest = test[,-c(1)]
dummy <- dummyVars(~ ., data = htrain)
htrain = predict(dummy, newdata = htrain)
htest = predict(dummy, newdata = htest)
#running Random Forest
set.seed(1234)
rf <- randomForest(htrain, hazard, ntree=20,
imp=TRUE, sampsize=10000, do.trace=TRUE)
predict_rf <- predict(rf, htest)
# Set necessary xgboost parameters and use parallel threads
parameters <- list("objective" =
"reg:linear", "nthread" = 8, "verbose"=0)
# running xgboost model
xgb.fit = xgboost(param=parameters, data =
htrain, label = hazard, nrounds=150, eta
= .3, max_depth = 1, min_child_weight =
5, scale_pos_weight = 1.0, subsample=0.8)
predict_xgboost <- predict(xgb.fit, htest)
# Predict Hazard for the test set
predict <- data.frame(Id=test$Id)
predict$Hazard <-
(predict_rf+predict_xgboost)/2
#write predict output as csv to system
write_csv(predict, "predict.csv")
V. CONCLUSION
The CARET package in R has helped in model evaluation and in comparing the performance of different models. Multiple classification and regression models were modeled and evaluated with respect to performance metrics using the CARET package. Four classification models were used, namely KNN, rpart, randomForest and regularized discriminant analysis (rda), with the CARET package used to identify the best model. RandomForest performed best with respect to F-measure, hence this model was used for predicting what type of people survived the Titanic shipwreck. Three regression models were used, namely partial least squares (pls), randomForest and extreme gradient boosting (xgboost), with the CARET package used to select the model for regression analysis. An ensemble of randomForest and xgboost was used for predicting the hazards for each property. Future work proposes building an ensemble of classification models and evaluating it. We also propose evaluating and comparing other models of classification and regression analysis.
Figure 8. Output of the regression analysis
Figure 9. Scree plot of the regression analysis
REFERENCES
[1] Ana Isabel Rojão Lourenço Azevedo. “KDD, SEMMA
and CRISP-DM: a parallel overview”. In: (2008).
[2] Cheng-Mei Chen et al. “Prediction of survival in pa-
tients with liver cancer using artificial neural networks
and classification and regression trees”. In: Natural
Computation (ICNC), 2011 Seventh International Con-
ference on. Vol. 2. IEEE. 2011, pp. 811–815.
[3] Hui-Ling Chen et al. “A novel bankruptcy prediction
model based on an adaptive fuzzy k-nearest neighbor
method”. In: Knowledge-Based Systems 24.8 (2011),
pp. 1348–1359.
[4] Jerome H Friedman. “Regularized discriminant analy-
sis”. In: Journal of the American statistical association
84.405 (1989), pp. 165–175.
[5] Lijia Guo. “Applying data mining techniques in proper-
ty/casualty insurance”. In: in CAS 2003 Winter Forum,
Data Management, Quality, and Technology Call Pa-
pers and Ratemaking Discussion Papers, CAS. Citeseer.
2003.
[6] Max Kuhn. “Building predictive models in R using the
caret package”. In: Journal of Statistical Software 28.5
(2008), pp. 1–26.
[7] Andy Liaw and Matthew Wiener. “Classification and
regression by randomForest”. In: R news 2.3 (2002),
pp. 18–22.
[8] Mark Menor and Kyungim Baek. “Relevance units
machine for classification”. In: Biomedical Engineering
and Informatics (BMEI), 2011 4th International Con-
ference on. Vol. 4. IEEE. 2011, pp. 2295–2299.
[9] Bjørn-Helge Mevik and Ron Wehrens. “The pls pack-
age: principal component and partial least squares re-
gression in R”. In: Journal of Statistical Software 18.2
(2007), pp. 1–24.
[10] Carolin Strobl, James Malley, and Gerhard Tutz. “An
introduction to recursive partitioning: rationale, applica-
tion, and characteristics of classification and regression
trees, bagging, and random forests.” In: Psychological
methods 14.4 (2009), p. 323.
[11] Stephen Tyree et al. “Parallel boosted regression trees
for web search ranking”. In: Proceedings of the 20th
international conference on World wide web. ACM.
2011, pp. 387–396.

Contenu connexe

Tendances

Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...AIRCC Publishing Corporation
 
Selecting the correct Data Mining Method: Classification & InDaMiTe-R
Selecting the correct Data Mining Method: Classification & InDaMiTe-RSelecting the correct Data Mining Method: Classification & InDaMiTe-R
Selecting the correct Data Mining Method: Classification & InDaMiTe-RIOSR Journals
 
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary AlgorithmAutomatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithmaciijournal
 
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary AlgorithmAutomatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithmaciijournal
 
STAT 897D Project 2 - Final Draft
STAT 897D Project 2 - Final DraftSTAT 897D Project 2 - Final Draft
STAT 897D Project 2 - Final DraftJonathan Fivelsdal
 
Business Bankruptcy Prediction Based on Survival Analysis Approach
Business Bankruptcy Prediction Based on Survival Analysis ApproachBusiness Bankruptcy Prediction Based on Survival Analysis Approach
Business Bankruptcy Prediction Based on Survival Analysis Approachijcsit
 
Adaptive Classification of Imbalanced Data using ANN with Particle of Swarm O...
Adaptive Classification of Imbalanced Data using ANN with Particle of Swarm O...Adaptive Classification of Imbalanced Data using ANN with Particle of Swarm O...
Adaptive Classification of Imbalanced Data using ANN with Particle of Swarm O...ijtsrd
 
11.software modules clustering an effective approach for reusability
11.software modules clustering an effective approach for  reusability11.software modules clustering an effective approach for  reusability
11.software modules clustering an effective approach for reusabilityAlexander Decker
 
A Study on Cancer Perpetuation Using the Classification Algorithms
A Study on Cancer Perpetuation Using the Classification AlgorithmsA Study on Cancer Perpetuation Using the Classification Algorithms
A Study on Cancer Perpetuation Using the Classification Algorithmspaperpublications3
 
Survey on Feature Selection and Dimensionality Reduction Techniques
Survey on Feature Selection and Dimensionality Reduction TechniquesSurvey on Feature Selection and Dimensionality Reduction Techniques
Survey on Feature Selection and Dimensionality Reduction TechniquesIRJET Journal
 
Implementation of Prototype Based Credal Classification approach For Enhanced...
Implementation of Prototype Based Credal Classification approach For Enhanced...Implementation of Prototype Based Credal Classification approach For Enhanced...
Implementation of Prototype Based Credal Classification approach For Enhanced...IRJET Journal
 
Controlling informative features for improved accuracy and faster predictions...
Controlling informative features for improved accuracy and faster predictions...Controlling informative features for improved accuracy and faster predictions...
Controlling informative features for improved accuracy and faster predictions...Damian R. Mingle, MBA
 
A novel hybrid feature selection approach
A novel hybrid feature selection approachA novel hybrid feature selection approach
A novel hybrid feature selection approachijaia
 
Basic course for computer based methods
Basic course for computer based methodsBasic course for computer based methods
Basic course for computer based methodsimprovemed
 
Final Report
Final ReportFinal Report
Final Reportimu409
 

Tendances (20)

Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
 
Selecting the correct Data Mining Method: Classification & InDaMiTe-R
Selecting the correct Data Mining Method: Classification & InDaMiTe-RSelecting the correct Data Mining Method: Classification & InDaMiTe-R
Selecting the correct Data Mining Method: Classification & InDaMiTe-R
 
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary AlgorithmAutomatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
 
JEDM_RR_JF_Final
JEDM_RR_JF_FinalJEDM_RR_JF_Final
JEDM_RR_JF_Final
 
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary AlgorithmAutomatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
 
STAT 897D Project 2 - Final Draft
STAT 897D Project 2 - Final DraftSTAT 897D Project 2 - Final Draft
STAT 897D Project 2 - Final Draft
 
I0704047054
I0704047054I0704047054
I0704047054
 
Business Bankruptcy Prediction Based on Survival Analysis Approach
Business Bankruptcy Prediction Based on Survival Analysis ApproachBusiness Bankruptcy Prediction Based on Survival Analysis Approach
Business Bankruptcy Prediction Based on Survival Analysis Approach
 
Adaptive Classification of Imbalanced Data using ANN with Particle of Swarm O...
Adaptive Classification of Imbalanced Data using ANN with Particle of Swarm O...Adaptive Classification of Imbalanced Data using ANN with Particle of Swarm O...
Adaptive Classification of Imbalanced Data using ANN with Particle of Swarm O...
 
U0 vqmtq2otq=
U0 vqmtq2otq=U0 vqmtq2otq=
U0 vqmtq2otq=
 
Statsci
StatsciStatsci
Statsci
 
11.software modules clustering an effective approach for reusability
11.software modules clustering an effective approach for  reusability11.software modules clustering an effective approach for  reusability
11.software modules clustering an effective approach for reusability
 
A Study on Cancer Perpetuation Using the Classification Algorithms
A Study on Cancer Perpetuation Using the Classification AlgorithmsA Study on Cancer Perpetuation Using the Classification Algorithms
A Study on Cancer Perpetuation Using the Classification Algorithms
 
Survey on Feature Selection and Dimensionality Reduction Techniques
Survey on Feature Selection and Dimensionality Reduction TechniquesSurvey on Feature Selection and Dimensionality Reduction Techniques
Survey on Feature Selection and Dimensionality Reduction Techniques
 
Implementation of Prototype Based Credal Classification approach For Enhanced...
Implementation of Prototype Based Credal Classification approach For Enhanced...Implementation of Prototype Based Credal Classification approach For Enhanced...
Implementation of Prototype Based Credal Classification approach For Enhanced...
 
One Graduate Paper
One Graduate PaperOne Graduate Paper
One Graduate Paper
 
Controlling informative features for improved accuracy and faster predictions...
Controlling informative features for improved accuracy and faster predictions...Controlling informative features for improved accuracy and faster predictions...
Controlling informative features for improved accuracy and faster predictions...
 
A novel hybrid feature selection approach
A novel hybrid feature selection approachA novel hybrid feature selection approach
A novel hybrid feature selection approach
 
Basic course for computer based methods
Basic course for computer based methodsBasic course for computer based methods
Basic course for computer based methods
 
Final Report
Final ReportFinal Report
Final Report
 

Similaire à DataMining_CA2-4

Proficiency comparison ofladtree
Proficiency comparison ofladtreeProficiency comparison ofladtree
Proficiency comparison ofladtreeijcsa
 
Review of Algorithms for Crime Analysis & Prediction
Review of Algorithms for Crime Analysis & PredictionReview of Algorithms for Crime Analysis & Prediction
Review of Algorithms for Crime Analysis & PredictionIRJET Journal
 
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETSA HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETSEditor IJCATR
 
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTION
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTIONCATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTION
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTIONIJDKP
 
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTION
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTIONCATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTION
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTIONIJDKP
 
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEYCLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEYEditor IJMTER
 
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...IRJET Journal
 
IRJET - Breast Cancer Prediction using Supervised Machine Learning Algorithms...
IRJET - Breast Cancer Prediction using Supervised Machine Learning Algorithms...IRJET - Breast Cancer Prediction using Supervised Machine Learning Algorithms...
IRJET - Breast Cancer Prediction using Supervised Machine Learning Algorithms...IRJET Journal
 
KIT-601 Lecture Notes-UNIT-2.pdf
KIT-601 Lecture Notes-UNIT-2.pdfKIT-601 Lecture Notes-UNIT-2.pdf
KIT-601 Lecture Notes-UNIT-2.pdfDr. Radhey Shyam
 
Simplified Knowledge Prediction: Application of Machine Learning in Real Life
Simplified Knowledge Prediction: Application of Machine Learning in Real LifeSimplified Knowledge Prediction: Application of Machine Learning in Real Life
Simplified Knowledge Prediction: Application of Machine Learning in Real LifePeea Bal Chakraborty
 
IRJET - Survey on Analysis of Breast Cancer Prediction
IRJET - Survey on Analysis of Breast Cancer PredictionIRJET - Survey on Analysis of Breast Cancer Prediction
IRJET - Survey on Analysis of Breast Cancer PredictionIRJET Journal
 
Csit65111ASSOCIATIVE REGRESSIVE DECISION RULE MINING FOR ASSOCIATIVE REGRESSI...
Csit65111ASSOCIATIVE REGRESSIVE DECISION RULE MINING FOR ASSOCIATIVE REGRESSI...Csit65111ASSOCIATIVE REGRESSIVE DECISION RULE MINING FOR ASSOCIATIVE REGRESSI...
Csit65111ASSOCIATIVE REGRESSIVE DECISION RULE MINING FOR ASSOCIATIVE REGRESSI...cscpconf
 
Associative Regressive Decision Rule Mining for Predicting Customer Satisfact...
Associative Regressive Decision Rule Mining for Predicting Customer Satisfact...Associative Regressive Decision Rule Mining for Predicting Customer Satisfact...
Associative Regressive Decision Rule Mining for Predicting Customer Satisfact...csandit
 
LABELING CUSTOMERS USING DISCOVERED KNOWLEDGE CASE STUDY: AUTOMOBILE INSURAN...
LABELING CUSTOMERS USING DISCOVERED KNOWLEDGE  CASE STUDY: AUTOMOBILE INSURAN...LABELING CUSTOMERS USING DISCOVERED KNOWLEDGE  CASE STUDY: AUTOMOBILE INSURAN...
LABELING CUSTOMERS USING DISCOVERED KNOWLEDGE CASE STUDY: AUTOMOBILE INSURAN...ijmvsc
 
Assessment of Decision Tree Algorithms on Student’s Recital
Assessment of Decision Tree Algorithms on Student’s RecitalAssessment of Decision Tree Algorithms on Student’s Recital
Assessment of Decision Tree Algorithms on Student’s RecitalIRJET Journal
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsDinusha Dilanka
 
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUE
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUECLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUE
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUEAIRCC Publishing Corporation
 
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUE
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUECLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUE
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUEijcsit
 

Similaire à DataMining_CA2-4 (20)

Proficiency comparison ofladtree
Proficiency comparison ofladtreeProficiency comparison ofladtree
Proficiency comparison ofladtree
 
Review of Algorithms for Crime Analysis & Prediction
Review of Algorithms for Crime Analysis & PredictionReview of Algorithms for Crime Analysis & Prediction
Review of Algorithms for Crime Analysis & Prediction
 
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETSA HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
 
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTION
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTIONCATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTION
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTION
 
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTION
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTIONCATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTION
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTION
 
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEYCLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
 
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
 
Dmml report final
Dmml report finalDmml report final
Dmml report final
 
IRJET - Breast Cancer Prediction using Supervised Machine Learning Algorithms...
IRJET - Breast Cancer Prediction using Supervised Machine Learning Algorithms...IRJET - Breast Cancer Prediction using Supervised Machine Learning Algorithms...
IRJET - Breast Cancer Prediction using Supervised Machine Learning Algorithms...
 
KIT-601 Lecture Notes-UNIT-2.pdf
KIT-601 Lecture Notes-UNIT-2.pdfKIT-601 Lecture Notes-UNIT-2.pdf
KIT-601 Lecture Notes-UNIT-2.pdf
 
Simplified Knowledge Prediction: Application of Machine Learning in Real Life
Simplified Knowledge Prediction: Application of Machine Learning in Real LifeSimplified Knowledge Prediction: Application of Machine Learning in Real Life
Simplified Knowledge Prediction: Application of Machine Learning in Real Life
 
IRJET - Survey on Analysis of Breast Cancer Prediction
IRJET - Survey on Analysis of Breast Cancer PredictionIRJET - Survey on Analysis of Breast Cancer Prediction
IRJET - Survey on Analysis of Breast Cancer Prediction
 
Csit65111ASSOCIATIVE REGRESSIVE DECISION RULE MINING FOR ASSOCIATIVE REGRESSI...
Csit65111ASSOCIATIVE REGRESSIVE DECISION RULE MINING FOR ASSOCIATIVE REGRESSI...Csit65111ASSOCIATIVE REGRESSIVE DECISION RULE MINING FOR ASSOCIATIVE REGRESSI...
Csit65111ASSOCIATIVE REGRESSIVE DECISION RULE MINING FOR ASSOCIATIVE REGRESSI...
 
Associative Regressive Decision Rule Mining for Predicting Customer Satisfact...
Associative Regressive Decision Rule Mining for Predicting Customer Satisfact...Associative Regressive Decision Rule Mining for Predicting Customer Satisfact...
Associative Regressive Decision Rule Mining for Predicting Customer Satisfact...
 
LABELING CUSTOMERS USING DISCOVERED KNOWLEDGE CASE STUDY: AUTOMOBILE INSURAN...
LABELING CUSTOMERS USING DISCOVERED KNOWLEDGE  CASE STUDY: AUTOMOBILE INSURAN...LABELING CUSTOMERS USING DISCOVERED KNOWLEDGE  CASE STUDY: AUTOMOBILE INSURAN...
LABELING CUSTOMERS USING DISCOVERED KNOWLEDGE CASE STUDY: AUTOMOBILE INSURAN...
 
Manuscript dss
Manuscript dssManuscript dss
Manuscript dss
 
Assessment of Decision Tree Algorithms on Student’s Recital
Assessment of Decision Tree Algorithms on Student’s RecitalAssessment of Decision Tree Algorithms on Student’s Recital
Assessment of Decision Tree Algorithms on Student’s Recital
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning Algorithms
 
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUE
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUECLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUE
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUE
 
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUE
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUECLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUE
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUE
 

DataMining_CA2-4

  • 1. Model Evaluation for Classification and Regression Aravind Kumar Balasubramaniam, 14123754 School of Computing National College of Ireland Dublin, Ireland Email: aravindkumar.balasubramaniam@student.ncirl.ie Anisha Kudalappa Gudagi, 14123223 School of Computing National College of Ireland Dublin, Ireland Email: anishaKudalappa.Gudagi@student.ncirl.ie Abstract—This paper aims to compare classification and re- gression models using CARET package in R. Classification and regression analysis are performed for predicting survival rate on titanic dataset and quantifying hazard score on property-liability dataset respectively. Four models of classification algorithms are modeled for the data samples from Titanic dataset and evaluated against Accuracy, Sensitivity, Specificity, Pos pred value, Neg pred value, Kappa and F-measure metrics. Based on F-measure the best model is selected and classification analysis is performed on the titanic dataset. Three models of regression algorithms are modeled for the data samples from property-liability dataset an evaluated against RMSE,R-squared, RMSE SD and R-squared SD. RMSE metric is used for selecting the best model for regression analysis and predicting the hazard score. I. INTRODUCTION Data prediction is a non-trivial task and has an enormous value, this paper out rights the heuristics about classification and regression modelling by providing a literature review. Two different data sets individually suitable for classification and regression are used for evaluating the model performance metrics with the aid of CARET package and to train test the selected algorithm. This research paper presents and discusses classification based survival prediction of the infamous Titanic shipwrecks and regression based prediction of property haz- ards based on property information in the property-liability insurance industry. Predictive analysis is one of the most important supervised data mining technique which enables us to find unidentified patterns or trends in datasets. Machine learning from disaster is one of the most explored areas in the field of Data mining and disaster management. In the past, many researchers have used Data mining techniques for predicting survivals from natural disasters and for predicting patient survival rate for uncommon and cure-less diseases. We have used classification data mining algorithms to predict what sort of people were likely to survive the shipwrecks. The pre- diction depends on the key attributes such as gender, passenger class, age, parents, siblings and the type of ticket. Regression analysis is done on the property-liability data for quantifying hazards. Property insurance requires inspection of the property based on the condition of the property such as foundation, roof, flooring etc. which are key property attributes. These key property attributes needs to be investigated by the insurance companies before they are approved for insured. Thus, to provide a clear insight to insurance companies to find property hazards from key attributes of properties, predictive analysis is best suited. In this paper we have analyzed and built predictive models to classify the rate of survivals in Titanic shipwreck and to quantify property hazards before time of inspection of Liberty Mutual Insurance company dataset which contains a hazard score provided for each property by inspection. 
There are two queries- the first query is to “predict what sort of people were likely to survive the shipwreck” using classification and second query is to “predict which of these hazards contribute more for each property. We have built four classification models and three regression models for comparison and selected the best model. The rest of the paper is divided as follows: section (2) related work which discusses about predictive analysis in insurance industry and comparing models, section (3) methodology used to answer the query, section (4) evaluation of results and section (5) conclusion and future work. NCI August 20, 2015 II. RELATED WORK Classification data mining technique refers to assigning each data point instance to a class label. It is one of the most researched platforms in the field of data mining. Classification algorithms aim to find a classifier that will assist in assigning the input instance to a class label [8]. Classification has been used as the most popular technique in the field of disaster management and medicine. A conventional algorithm for survival analysis technique evaluate the data inputs by using the Kaplan-Meier method or Cox proportional hazard model. However during the recent years classification models are widely used for medicine. These models first establishes a tree structure by partitioning the training data into several subclasses. This partitioning is done according to the test conditions until all the data are grouped under one subclass. After the tree structure is established, pruning is done to the tree from the bottom of the tree. After the pruning, rules are created from the output and these rules are used in the classification of new training data for prediction [2].The K-nearest neighbor is one of the oldest methods and non- parametric methods of classification. In this algorithm a class is assigned based on the common class amongst the k-nearest neighbors. Fuzzy k-nearest neighbor is an extension of KNN in which the algorithm assigns the fuzzy memberships of data samples to different classes [3]. Boosting is an iterative
  • 2. algorithm combining classification rules with performance in terms of error rate to produce an accurate classification rule. Regularized discriminant analysis refers to assigning objects to one or several groups which is obtained from each object. Regularization techniques are applied for linear discriminant analysis and quadratic discriminant analysis and have been successful in the results of poorly-posed inverse problems. The efficiency of RDA is to improve the misclassification risk and error rate from LDA and QDA [4].Recursive partitioning (rpart) is a statistical method that is used in classification and regression trees. This method is used to discover data structures in trends in the data sample. It is in various scientific fields as multivariate data exploration, for example: DNA sequencing, medicine. This algorithm can be tuned to perform classification and regression [10]. Data mining techniques have been used in the insurance industry from quite some time since insurance databases consists of large datasets which provide valuable business knowledge for improving the customer relationship or improving profits or expanding the business. Modeling insurance risk is done by applying data mining techniques. Past research has showed that data mining methods improves the existing models by discovering extra variables and by detecting nonlinear relationships. Data mining has been of great importance in the insurance industry by identifying risk factors that helps in predicting profits and losses. Data mining techniques like decision trees and neural network can accurately predict risk. Customer relationship management analysis helps in understanding the customer and accurately select which policies to be offered to a customer [5]. Random Forest is a data mining algorithm for performing classification and regression analysis. The metric measure for randomForest is mtry which specifies the best fit split from the predictor variables. The randomForest package produces two sets of information: a metric for specifying the importance of the predicting variables and a metric for specifying the structure of the data [7]. Multivariate regression methods such as Principal component analysis and partial least square have lot of implications in a variety of industries. Quantitative –structure-activity relations and quantitative structure –prop- erty relations use PSLR and PCR. In PCR, the first a principal components (PCs) is used to approximate the matrix. Next the Y is regressed on the scores, which in turn provides the regression coefficients [9]. Gradient Boosted regression is an iterative algorithm for finding a predictor. Regression trees provides a repeated expansion of nodes until a stopping criteria is met. All the data points in the data are assigned to a single node initially. Parallelizing the boosted regression trees implies to boosting analysis sequentially by parallelizing the building of trees individually [11]. The CARET package which is short for classification and regression training con- sists of classification and regression models. It is used for model tuning and training across models. This package helps in comparing model performance between different models. Since classification and regression models are used in many different applications, Caret package will help in selecting the best model and approach [6] III. METHODOLOGY The data mining methodology used here is CRISP-DM methodology. CRISP-DM stands for Cross-Industry Standard Process for Data Mining. 
III. METHODOLOGY

The data mining methodology used here is CRISP-DM, which stands for Cross-Industry Standard Process for Data Mining. This methodology consists of six steps:
• Business Understanding: understanding the business objectives and requirements.
• Data Understanding: initial data collection and identification of subsets in the data.
• Data Preparation: transformation and cleaning of the data in preparation for modeling.
• Modeling: modeling techniques are applied to the dataset.
• Evaluation: the model results are evaluated and measured against the business objectives.
• Deployment: the model built is deployed to the customer together with the results [1].

A. Implementation for Classification Modelling

1) Business Understanding: The objective was to identify or predict what sort of people, other than the upper class, women and children, were likely to survive on the RMS Titanic, one of the most infamous shipwrecks in history. The preliminary plan was to understand the key variables or dimensions to be used for predicting the class label, and to use the caret package for model evaluation and for identifying the parameter values.
2) Data Understanding: The Titanic dataset was collected from Kaggle. The key variables identified were 'Sex', 'Age', 'Pclass', 'Parch', 'Fare', 'SibSp' and 'Embarked', used to predict the categorical class 'Survived'. Some problems, such as NULLs and NAs, were identified in the dataset; these are taken care of in the following stages.
3) Data Preparation: The problems detected during data understanding are addressed here.
• NAs in 'Fare' are replaced with the median value.
• NULL values in 'Embarked' are replaced with "S".
• NAs in 'Age' are replaced with -1.
• 'Sex' is converted to a factor.
• 'Embarked' is converted to a factor.
4) Modeling: Four modeling algorithms were selected in a random manner: random forest ('rf'), K-nearest neighbor ('knn'), regularized discriminant analysis ('rda') and the tree-based CART model ('rpart'). To tune the parameters and evaluate the metrics of these models, the Classification And REgression Training (CARET) package is used (http://topepo.github.io/caret/index.html).
Model Tuning: NAs, NULLs and factorization are handled in data pre-processing. Data partitioning is performed through simple splitting, creating balanced 75%/25% splits for each class; 75% of the data is used for training and 25% for testing within the caret package.
Parameter selection: The train function in caret is used to evaluate the models by resampling the training data with a parameterized number of folds and repeats.
Here resampling is done with ten folds and three repeats using the 'repeatedcv' method, i.e. repeated cross-validation. classProbs is set to TRUE to compute class probabilities for the held-out samples, and summaryFunction is set to 'twoClassSummary' so that caret computes specificity, sensitivity and the area under the Receiver Operating Characteristic (ROC) curve. The predictors neither require estimated power transformations nor contain zero or negative values, so the Box-Cox and Yeo-Johnson pre-processing methods are not needed; instead only centering and scaling is used for random forest, while 'knnImpute' is used for K-nearest neighbor, which finds the k closest samples by Euclidean distance. Tune grid: for regularized discriminant analysis, gamma is set to range from 0.00 to 1.00 in steps of 0.25 and lambda is fixed at 0.75 to find the optimal ROC. For random forest, the ntree parameter was checked over the range 10 to 15 in steps of 1 and was found to have the best fit at 13.

R-Code for Classification Modeling using CARET

##CARET Model evaluation for Titanic data set##
##Common Process across all algorithms begin##
library(caret)
library(readr)
set.seed(121)
#Read Titanic data from csv
crtTrain <- read_csv("trainTc.csv", col_names = TRUE, n_max = -1,
                     progress = interactive())
#Data cleaning: 'NA' in Age is replaced with the median value
crtTrain$Age[is.na(crtTrain$Age)] <- median(crtTrain$Age, na.rm = TRUE)
#Selecting required variables
crtTrain <- crtTrain[c(#'PassengerId',
                       'Pclass', #'Name',
                       'Sex', 'Age', 'SibSp', 'Parch',
                       #'Ticket', 'Fare', 'Cabin', 'Embarked',
                       'Survived')]
#Convert Survived from binary into a factor
crtTrain$Survived <- ifelse(crtTrain$Survived == 1, 'yes', 'no')
crtTrain$Survived <- as.factor(crtTrain$Survived)
#Partition 75% to training and the remainder to test
set.seed(221)
inTrain <- createDataPartition(y = crtTrain$Survived, p = .75, list = FALSE)
#Create train and test sets
tcTrain <- crtTrain[ inTrain,]
tcTest  <- crtTrain[-inTrain,]
####Common Process ends#####

####Train function on all 4 algorithms#######
####Random Forest#######
set.seed(301)
rfctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                       verboseIter = TRUE, classProbs = TRUE,
                       summaryFunction = twoClassSummary)
#install.packages("pROC")
rfFit <- train(Survived ~ ., data = tcTrain, method = "rf", metric = "ROC",
               ntree = 13, preProcess = c("center", "scale"),
               trControl = rfctrl)
rfFit
plot(rfFit)
#Predict
rfPredictTC <- predict(rfFit, newdata = tcTest)
rfProbs <- predict(rfFit, newdata = tcTest, type = "prob")
confusionMatrix(data = rfPredictTC, tcTest$Survived)

######K-Nearest Neighbour########
knnctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                        verboseIter = TRUE, classProbs = TRUE,
                        summaryFunction = twoClassSummary)
set.seed(301)
#install.packages("pROC")
knnFit <- train(Survived ~ ., data = tcTrain, method = "knn", metric = "ROC",
                preProcess = "knnImpute", tuneLength = 10,
                trControl = knnctrl)
knnFit
plot(knnFit)
#Predict
knnPredictTC <- predict(knnFit, newdata = tcTest)
knnProbs <- predict(knnFit, newdata = tcTest, type = "prob")
confusionMatrix(data = knnPredictTC, tcTest$Survived)

####Regularized Discriminant Analysis#####
mygrid <- data.frame(gamma = (0:4)/4, lambda = 3/4)
rdactrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                        verboseIter = TRUE, classProbs = TRUE,
                        summaryFunction = twoClassSummary)
rdaFit <- train(Survived ~ ., data = tcTrain, method = "rda",
                trControl = rdactrl, metric = "ROC", tuneGrid = mygrid,
                trace = FALSE, maxit = 100)
rdaFit
plot(rdaFit)
#Predict
rdaPredictTC <- predict(rdaFit, newdata = tcTest)
rdaProbs <- predict(rdaFit, newdata = tcTest, type = "prob")
confusionMatrix(data = rdaPredictTC, tcTest$Survived)
####Tree based model CART (rpart)#####
rptctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                        verboseIter = TRUE, classProbs = TRUE,
                        summaryFunction = twoClassSummary)
rptFit <- train(Survived ~ ., data = tcTrain, method = "rpart",
                trControl = rptctrl, metric = "ROC", tuneLength = 10)
rptFit
plot(rptFit)
#Predict
rptPredictTC <- predict(rptFit, newdata = tcTest)
rptProbs <- predict(rptFit, newdata = tcTest, type = "prob")
confusionMatrix(data = rptPredictTC, tcTest$Survived)
######End of Modeling###########

Performance metrics comparison between the 4 models: After executing the models using the CARET package, the following statistics were collected to evaluate the results and choose the best model. Fig 1 tabulates the cross-table matrix and F-measure statistics for all four models.
Figure 1. Table: Cross table and F-Measure
It is inferred from Fig 2 that random forest has an F-measure of 77.3%, which is higher than that of the other models. Fig 3 shows the comparison of the other metrics for all four models, Fig 4 shows the Kappa statistic comparison, and Fig 5 highlights the random forest performance against the other models.
Figure 2. F-Measure comparison
Figure 3. Other metrics comparison
Figure 4. Kappa Statistic
Figure 5. Other Metrics
The random forest model, with its tuning parameters ntree = 13 and mtry = 2, is evaluated to be the best fit using the CARET package, based on the metrics collected and consolidated against the other models.
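The F-measure and Kappa values reported above can be derived from each model's confusion matrix on the held-out test set. The following is a minimal illustrative sketch (not part of the original listings), assuming the prediction object rfPredictTC and the test set tcTest from the classification code above; the same computation applies to the other three models.

## Minimal sketch: F-measure and Kappa from a caret confusion matrix.
## Assumes rfPredictTC and tcTest exist as in the classification listing above.
library(caret)

cm    <- confusionMatrix(data = rfPredictTC, reference = tcTest$Survived)
prec  <- cm$byClass["Pos Pred Value"]     # precision
rec   <- cm$byClass["Sensitivity"]        # recall
fmeas <- 2 * prec * rec / (prec + rec)    # F-measure (F1)
kappa <- cm$overall["Kappa"]              # Kappa statistic
c(F_measure = unname(fmeas), Kappa = unname(kappa))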
B. Implementation for Regression Modelling

1) Business Understanding: The objective of this part of the research was to quantify property hazards before the time of inspection, using the Liberty Mutual Insurance company dataset. At this stage the key attributes and the dimensionality of the dataset were identified. The caret package in R was chosen for model evaluation and for selecting the best model for regression analysis.
2) Data Understanding: The property hazard dataset was downloaded from the Kaggle website. The dataset is split into two parts, train and test. The train dataset contains the hazard score and anonymized predictor variables; it was used for modeling the algorithms and selecting the best model. The test dataset contains only the anonymized predictor variables. Each row in the train dataset corresponds to a property that was evaluated and given a hazard score. The hazard score is a continuous number that represents the condition of the property as assessed by the inspection committee. The dataset contains 51000 records.
3) Data Preparation: The train dataset was used for modeling and for selecting the best model for regression. This dataset did not contain any NULL or NA values; it was clean and did not require any preparation or transformation for modeling.
4) Modeling: Prediction is done through regression analysis. Random forest (rf), partial least squares (pls) and eXtreme gradient boosting (xgboost) are the three models used. The caret package contains all of these classification and regression models and is used for tuning the parameters and comparing the models in R. The train dataset is loaded into R, and data partitioning is done by splitting the data into a training set containing 75% of the records and a test set containing the remainder.
Model Tuning: Parameter selection is done through the trainControl function, which is used for resampling the training data by specifying the tuning parameters; it also controls the computational refinements of the train function. The tuning parameters used are: method (repeatedcv), which specifies the resampling method; number (10), the number of folds or resampling iterations; repeats (3), the number of complete sets of folds to compute; verboseIter (TRUE), a logical for printing a training log; classProbs (FALSE), since class probabilities apply only to classification; and summaryFunction (defaultSummary), a function to compute performance metrics across resamples. The train function is used to fit predictive models over different tuning parameters: it fits each model and calculates a resampling-based performance measure. The parameters used for train are: data, the training data; method, which specifies the model used (rf, pls or xgboost); trControl, which specifies how the function acts (the trainControl object); tuneLength, set to 10 for the number of tuning levels; ntree, the number of trees (20); and importance, set to TRUE.

R-Code for Regression Modeling using CARET

##CARET Model evaluation for Liberty Mutual Group dataset##
##Common Process across all algorithms begin##
#Install 'caret', 'randomForest', 'readr', 'pls' and 'xgboost'
install.packages("caret")
install.packages("randomForest")
install.packages("readr")
install.packages("xgboost")
install.packages("pls")
#Load the caret, readr, xgboost, pls and randomForest packages
library(caret)
library(xgboost)
library(readr)
library(randomForest)
library(pls)
#Read the first 5000 rows of the Liberty Mutual dataset into R using the readr package
Crttrain <- read_csv("C:/train.csv", n_max = 5000, progress = interactive())
#Partition the data into training and test
intrain <- createDataPartition(y = Crttrain$Hazard, p = .75, list = FALSE)
#Create train and test sets
tcTrain <- Crttrain[ intrain,]
tcTest  <- Crttrain[-intrain,]
#Check the number of rows and columns
nrow(tcTrain)
ncol(tcTest)
#Set seed
set.seed(107)
####Common Process ends#####

###Train function on all 3 algorithms#######
###Random Forest####
#Tune the model using trainControl, specifying the method as repeatedcv and the number of folds as 10
rfctrl <- trainControl(method = "repeatedcv", repeats = 3, number = 10,
                       verboseIter = TRUE, classProbs = FALSE,
                       summaryFunction = defaultSummary)
#Run the randomForest model
rfFit <- train(Hazard ~ ., data = tcTrain, method = "rf", trControl = rfctrl,
               tuneLength = 10, ntree = 20, importance = TRUE)
#Check the output
rfFit
#Plot the model
plot(rfFit)
#Use the predict function on the held-out test data
rf <- predict(rfFit, newdata = tcTest)
str(rf)
#Check the raw predictions
rfProbs <- predict(rfFit, newdata = tcTest, type = "raw")
head(rfProbs)
####Partial Least Squares####
#Tune the model using trainControl, specifying the method as repeatedcv and the number of folds as 10
plsctrl <- trainControl(method = "repeatedcv", repeats = 3, number = 10,
                        verboseIter = TRUE, classProbs = FALSE,
                        summaryFunction = defaultSummary)
#Run the PLS model (the random-forest specific ntree/importance arguments are not needed here)
plsFit <- train(Hazard ~ ., data = tcTrain, method = "pls",
                trControl = plsctrl, tuneLength = 10)
#Check the output
plsFit
plot(plsFit)
#Use the predict function on the held-out test data
pls <- predict(plsFit, newdata = tcTest)
str(pls)
#Check the raw predictions
plsProbs <- predict(plsFit, newdata = tcTest, type = "raw")
head(plsProbs)

####eXtreme Gradient Boosting####
#Tune the model using trainControl, specifying the method as repeatedcv and the number of folds as 10
xgctrl <- trainControl(method = "repeatedcv", repeats = 3, number = 10,
                       verboseIter = TRUE, classProbs = FALSE,
                       summaryFunction = defaultSummary)
#Run the xgboost model (the ntree/importance arguments from the random forest call are not needed here)
xgFit <- train(Hazard ~ ., data = tcTrain, method = "xgbTree",
               trControl = xgctrl, tuneLength = 10)
#Check the output
xgFit
plot(xgFit)
#Use the predict function on the held-out test data
xgboost <- predict(xgFit, newdata = tcTest)
str(xgboost)
#Check the raw predictions
xgProbs <- predict(xgFit, newdata = tcTest, type = "raw")
head(xgProbs)
######End of Modeling###########

Performance metrics comparison between the 3 models: The output of these models reports four metrics, namely RMSE, R-squared, RMSE SD and R-squared SD. RMSE and R-squared are considered for comparing the models. RMSE, which stands for Root Mean Square Error, is a standard metric for reporting the prediction error of a continuous variable; R-squared examines how well the model fits the training data and tells us what percentage of the variance in the data is explained by the model. The model with the lowest RMSE is considered optimal. Fig 6 shows the comparison of the metrics for all three models. As shown in Fig 6, rf and xgboost have the lowest RMSE values, and these values are very close, at 3.874635 and 3.872974 respectively. This motivates ensembling the rf and xgboost models and using the ensemble for the regression analysis.
Figure 6. Comparison of metrics
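For reference, these two metrics can also be computed directly from a model's held-out predictions. The following is a small illustrative sketch (not part of the original listings), assuming the prediction vector rf and the test set tcTest from the regression code above.

## Minimal sketch: RMSE and R-squared on the held-out test data.
## Assumes rf (predicted Hazard values) and tcTest exist as in the listing above.
library(caret)

obs  <- tcTest$Hazard
pred <- rf

rmse <- sqrt(mean((obs - pred)^2))                           # root mean square error
rsq  <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)   # coefficient of determination
c(RMSE = rmse, Rsquared = rsq)

#caret's postResample() reports RMSE together with an R-squared
#computed as the squared correlation between observed and predicted values
postResample(pred = pred, obs = obs)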
  • 7. "Parch","SibSp", "Fare","Embarked") preds <- inputdat[,predictors] preds$Fare[is.na(preds$Fare)] <- median(preds$Fare, na.rm=TRUE) preds$Embarked[preds$Embarked==""] = "S" preds$Age[is.na(preds$Age)] <- -1 preds$Sex <- as.factor(preds$Sex) preds$Embarked <- as.factor(preds$Embarked) return(preds) } #Random Forest algorithm rf <- randomForest(tcPredictors(tcTrain), as.factor(tcTrain$Survived), ntree=13,mtry = 2, importance=TRUE) #Write out result outResult <- data.frame(PassengerId = tcTest$PassengerId) outResult$Survived <- predict(rf, tcPredictors(tcTest)) #Write CSV file for out result write.csv(outResult, file = "TitanicResult.csv", row.names=FALSE) ###End of code### Fig 7 shows a screen shot of the output written csv format by Random Forest model. Figure 7. Predicted Output screen shot B. Regression Evaluation With the results of the model comparison, we have selected rf and xgboost models as ensemble for regression analysis. The tuning parameters are selected by the result of the model comparison. For applying the regression algorithm, train and test datasets are used and the predictor dataset is formed from Train data. The tuning parameters are as follow: For random forest ntree=20; sampsize=10000; Xgboost; nrounds=150; eta = .3; maxdepth = 1; and the objective of xgboost is specified as linear regression. R-Code for ensemble of Rf and Xgboost models ### Predicting transformed count of Hazards#### # load required packages require(xgboost) library(caret) library(randomForest) library(readr) # load raw data train = read_csv(’C:/train.csv’) test = read_csv(’C:/test.csv’) # Create the response variable hazard = train$Hazard # Create the predictor data set and encode categorical variables using caret library. htrain = train[,-c(1,2)] htest = test[,-c(1)] dummy <- dummyVars(˜ ., data = htrain) htrain = predict(dummy, newdata = htrain) htest = predict(dummy, newdata = htest) #running Random Forest set.seed(1234) rf <- randomForest(htrain, hazard, ntree=20, imp=TRUE, sampsize=10000, do.trace=TRUE) predict_rf <- predict(rf, htest) # Set necessary parameters and use parallel threads parameters <- list("objective" = "reg:linear", "nthread" = 8, "verbose"=0) # running xgboost model xgb.fit = xgboost(param=parameters, data = htrain, label = hazard, nrounds=150, eta = .3, max_depth = 1, min_child_weight = 5, scale_pos_weight = 1.0, subsample=0.8) predict_xgboost <- predict(xgb.fit, htest) # Predict Hazard for the test set predict <- data.frame(Id=test$Id) predict$Hazard <- (predict_rf+predict_xgboost)/2 #write predict output as csv to system write_csv(predict, "predict.csv") V. CONCLUSION CARET package in R has helped in model evaluation and comparing the performances between different models. Mul- tiple classification and regression models were modeled and evaluated with respect to performance metrics using CARET package. Four classification models were used namely- KNN, rpart, randomForest and regularized discriminant analysis (rda)
Figure 8. Output of the regression analysis
Figure 9. Scree plot of the regression analysis

V. CONCLUSION

The CARET package in R has helped in model evaluation and in comparing the performance of different models. Multiple classification and regression models were built and evaluated with respect to performance metrics using the CARET package. Four classification models were used, namely KNN, rpart, randomForest and regularized discriminant analysis (rda), with the CARET package used to check for the best model. RandomForest performed best with respect to F-measure, and hence this model was used for predicting what type of people survived the Titanic shipwreck. Three regression models were used, namely partial least squares (pls), randomForest and extreme gradient boosting (xgboost), with the CARET package used to check for the best model for the regression analysis. An ensemble of randomForest and xgboost was used for predicting the hazards for each property. Future work includes building and evaluating an ensemble of classification models, as well as evaluating and comparing other models for classification and regression analysis.

REFERENCES
[1] Ana Isabel Rojão Lourenço Azevedo. “KDD, SEMMA and CRISP-DM: a parallel overview”. In: (2008).
[2] Cheng-Mei Chen et al. “Prediction of survival in patients with liver cancer using artificial neural networks and classification and regression trees”. In: Natural Computation (ICNC), 2011 Seventh International Conference on. Vol. 2. IEEE. 2011, pp. 811–815.
[3] Hui-Ling Chen et al. “A novel bankruptcy prediction model based on an adaptive fuzzy k-nearest neighbor method”. In: Knowledge-Based Systems 24.8 (2011), pp. 1348–1359.
[4] Jerome H Friedman. “Regularized discriminant analysis”. In: Journal of the American Statistical Association 84.405 (1989), pp. 165–175.
[5] Lijia Guo. “Applying data mining techniques in property/casualty insurance”. In: CAS 2003 Winter Forum, Data Management, Quality, and Technology Call Papers and Ratemaking Discussion Papers, CAS. Citeseer. 2003.
[6] Max Kuhn. “Building predictive models in R using the caret package”. In: Journal of Statistical Software 28.5 (2008), pp. 1–26.
[7] Andy Liaw and Matthew Wiener. “Classification and regression by randomForest”. In: R News 2.3 (2002), pp. 18–22.
[8] Mark Menor and Kyungim Baek. “Relevance units machine for classification”. In: Biomedical Engineering and Informatics (BMEI), 2011 4th International Conference on. Vol. 4. IEEE. 2011, pp. 2295–2299.
[9] Björn-Helge Mevik and Ron Wehrens. “The pls package: principal component and partial least squares regression in R”. In: Journal of Statistical Software 18.2 (2007), pp. 1–24.
[10] Carolin Strobl, James Malley, and Gerhard Tutz. “An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests.” In: Psychological Methods 14.4 (2009), p. 323.
[11] Stephen Tyree et al. “Parallel boosted regression trees for web search ranking”. In: Proceedings of the 20th International Conference on World Wide Web. ACM. 2011, pp. 387–396.