Model Evaluation for Classification and Regression
Aravind Kumar Balasubramaniam, 14123754
School of Computing
National College of Ireland
Dublin, Ireland
Email: aravindkumar.balasubramaniam@student.ncirl.ie
Anisha Kudalappa Gudagi, 14123223
School of Computing
National College of Ireland
Dublin, Ireland
Email: anishaKudalappa.Gudagi@student.ncirl.ie
Abstract—This paper aims to compare classification and re-
gression models using the CARET package in R. Classification and regression analysis are performed for predicting survival on the Titanic dataset and quantifying hazard scores on a property-liability dataset respectively. Four classification algorithms are modeled on data samples from the Titanic dataset and evaluated against Accuracy, Sensitivity, Specificity, Pos Pred Value, Neg Pred Value, Kappa and F-measure metrics. Based on F-measure the best model is selected and classification analysis is performed on the Titanic dataset. Three regression algorithms are modeled on data samples from the property-liability dataset and evaluated against RMSE, R-squared, RMSE SD and R-squared SD. The RMSE metric is used for selecting the best model for regression analysis and predicting the hazard score.
I. INTRODUCTION
Data prediction is a non-trivial task and has enormous value. This paper outlines the heuristics of classification and regression modelling and provides a literature review.
Two different data sets individually suitable for classification
and regression are used for evaluating the model performance
metrics with the aid of CARET package and to train test the
selected algorithm. This research paper presents and discusses
classification based survival prediction of the infamous Titanic
shipwrecks and regression based prediction of property haz-
ards based on property information in the property-liability
insurance industry. Predictive analysis is one of the most
important supervised data mining techniques, as it enables us to find unidentified patterns or trends in datasets. Machine
learning from disaster is one of the most explored areas in
the field of Data mining and disaster management. In the
past, many researchers have used Data mining techniques for
predicting survivals from natural disasters and for predicting
patient survival rates for uncommon and incurable diseases. We have used classification data mining algorithms to predict what sort of people were likely to survive the shipwreck. The pre-
diction depends on the key attributes such as gender, passenger
class, age, parents, siblings and the type of ticket. Regression
analysis is done on the property-liability data for quantifying
hazards. Property insurance requires inspection of the property
based on the condition of the property such as foundation,
roof, flooring, etc., which are key property attributes. These key property attributes need to be investigated by the insurance companies before a property is approved for insurance. Thus, to
provide a clear insight to insurance companies to find property
hazards from key attributes of properties, predictive analysis is
best suited. In this paper we have analyzed and built predictive
models to classify the rate of survivals in Titanic shipwreck
and to quantify property hazards before the time of inspection using the Liberty Mutual Insurance company dataset, which contains a hazard score provided for each property by inspection. There are two queries: the first is to “predict what sort of people were likely to survive the shipwreck” using classification, and the second is to “predict which of these hazards contribute more for each property”. We have built
four classification models and three regression models for
comparison and selected the best model. The rest of the paper
is divided as follows: section (2) related work, which discusses predictive analysis in the insurance industry and compares models, section (3) methodology used to answer the query,
section (4) evaluation of results and section (5) conclusion
and future work.
II. RELATED WORK
The classification data mining technique refers to assigning each data instance to a class label. It is one of the most researched areas in the field of data mining. Classification
algorithms aim to find a classifier that will assist in assigning
the input instance to a class label [8]. Classification has
been used as the most popular technique in the field of
disaster management and medicine. Conventional survival analysis techniques evaluate the data using the Kaplan-Meier method or the Cox proportional hazards model. However, in recent years classification models have been widely used in medicine. These models first establish
a tree structure by partitioning the training data into several
subclasses. This partitioning is done according to the test
conditions until all the data are grouped under one subclass.
After the tree structure is established, pruning is done to
the tree from the bottom of the tree. After the pruning,
rules are created from the output and these rules are used in
the classification of new training data for prediction [2]. The K-nearest neighbor algorithm is one of the oldest non-parametric methods of classification. In this algorithm a class
is assigned based on the common class amongst the k-nearest
neighbors. Fuzzy k-nearest neighbor is an extension of KNN
in which the algorithm assigns the fuzzy memberships of
data samples to different classes [3]. Boosting is an iterative
algorithm combining classification rules with performance in
terms of error rate to produce an accurate classification rule.
Regularized discriminant analysis (RDA) assigns objects to one of several groups on the basis of measurements obtained from each object. Regularization techniques are applied to linear discriminant analysis and quadratic discriminant analysis and have been successful on poorly-posed inverse problems. RDA improves the misclassification risk and error rate relative to LDA and QDA [4]. Recursive partitioning
(rpart) is a statistical method that is used in classification
and regression trees. This method is used to discover structures and trends in the data sample. It is used in various scientific fields for multivariate data exploration, for example DNA sequencing and medicine. The algorithm can be tuned to perform either classification or regression [10]. Data mining
techniques have been used in the insurance industry for quite some time, since insurance databases consist of large datasets
which provide valuable business knowledge for improving the
customer relationship or improving profits or expanding the
business. Modeling insurance risk is done by applying data
mining techniques. Past research has shown that data mining methods improve existing models by discovering extra variables and by detecting nonlinear relationships. Data mining
has been of great importance in the insurance industry by
identifying risk factors that help in predicting profits and
losses. Data mining techniques like decision trees and neural
networks can accurately predict risk. Customer relationship management analysis helps in understanding the customer and in accurately selecting which policies to offer a customer
[5]. Random Forest is a data mining algorithm for performing
classification and regression analysis. The tuning parameter for randomForest is mtry, which controls the number of predictor variables tried at each split. The randomForest package produces two
sets of information: a metric for specifying the importance
of the predicting variables and a metric for specifying the
structure of the data [7]. Multivariate regression methods such
as principal component analysis and partial least squares have many applications in a variety of industries. Quantitative structure-activity relations and quantitative structure-property relations use PLSR and PCR. In PCR, the first a principal components (PCs) are used to approximate the predictor matrix; Y is then regressed on the scores, which in turn gives the regression coefficients [9]. Gradient boosted regression
is an iterative algorithm for finding a predictor. Regression trees repeatedly expand nodes until a stopping criterion is met; initially all the data points are assigned to a single node. Parallelized boosted regression trees keep the boosting itself sequential while parallelizing the construction of each individual tree [11]. The CARET package
which is short for classification and regression training con-
sists of classification and regression models. It is used for
model tuning and training across models. This package helps
in comparing model performance between different models.
Since classification and regression models are used in many different applications, the caret package helps in selecting the best model and approach [6].
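As a brief illustration of this workflow (a minimal sketch on R's built-in iris data, not code from the experiments reported later in this paper), caret can fit several models under a shared resampling scheme and then collect their resampling results for side-by-side comparison:
# Illustrative sketch only: compare two classifiers under the same 10-fold,
# 3-repeat cross-validation and summarise their resampling results
library(caret)
set.seed(42)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
fit_rf <- train(Species ~ ., data = iris, method = "rf", trControl = ctrl)
fit_knn <- train(Species ~ ., data = iris, method = "knn", trControl = ctrl)
resamps <- resamples(list(RF = fit_rf, KNN = fit_knn))
summary(resamps) # Accuracy and Kappa distributions across the 30 resamples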
III. METHODOLOGY
The data mining methodology used here is CRISP-DM
methodology. CRISP-DM stands for Cross-Industry Standard
Process for Data Mining. This methodology consists of six
steps:
Business Understanding: This phase includes understanding
the business objectives and requirements.
Data Understanding: This phase starts with the initial data
collection and identifying subsets in the data.
Data Preparation: Transformation and cleaning is done in
this phase. Data is prepared for modeling.
Modeling: Modeling techniques are applied to the dataset.
Evaluation: Evaluation of the model results are performed
in this phase. The results are measured to satisfy business
objectives.
Deployment: The model built is deployed to the customer
with the results [1].
A. Implementation for Classification Modelling
1) Business Understanding: The objective was to identify or predict what sort of people, other than the upper class, women and children, were likely to survive the RMS Titanic, one of the worst shipwrecks in history. The preliminary plan was
to understand the key variables or dimensions, which will be
used for predicting the class-object and use caret package for
model evaluation and identifying the parameter values.
2) Data Understanding: The Titanic data set was collected
from Kaggle. The key variables identified were ‘Sex’, ‘Age’, ‘Pclass’, ‘Parch’, ‘Fare’, ‘SibSp’ and ‘Embarked’, used to predict the categorical class ‘Survived’. Some problems, such as NULLs and NAs, were identified in the data set; these are addressed in the following stages.
3) Data Preparation: Problems detected in data under-
standing are addressed here.
• Median values are replaced for NA’s in ‘Fare’.
• NULL values are replaced with “S” in ‘Embarked’.
• NA’s in ‘Age’ are replaced with -1.
• ‘Sex’ is converted to factor.
• ‘Embarked’ is converted to factor.
4) Modeling: Four modeling algorithms were selected in
a random manner. Random forest (‘rf’), K-Nearest neighbor
(‘knn’), Regularized Discriminant Analysis (‘rda’) and Tree
based model – CART (‘rpart’). In order to tune the parameters and evaluate the metrics of the above models, the Classification And REgression Training package, abbreviated CARET1, is used.
Model Tuning: NAs, NULLs and factorization are handled in data pre-processing. Data partitioning is performed through simple splitting by creating balanced splits of 75% and 25% for each class: 75% is used for training and 25% for testing within the caret package. Parameter selection: the train function in ‘CARET’ is used to evaluate models by resampling the training data with a parameterized number of folds and repeats. Here resampling is done using ten folds and three repeats via the ‘repeatedcv’ method, which repeats the cross validation. Class probability is set to TRUE to compute class probabilities for the held-out samples. summaryFunction is set to ‘twoClassSummary’ so that caret computes specificity, sensitivity and the area under the Receiver Operating Characteristic curve. The predictors neither require estimating power transformations nor have zero or negative values, so the Box-Cox and Yeo-Johnson pre-processing methods are not needed; instead, only centering and scaling are used for random forest, and ‘knnImpute’ is used to find the distance between the k closest samples using Euclidean distance in the case of K-Nearest Neighbour. Tune grid: gamma is set to range from 0.00 to 1.00 in steps of 0.25 and lambda is fixed at 0.75 for Regularized Discriminant Analysis to find the optimal ROC. The ntree parameter was checked over the range 10 to 15 in steps of 1 and found to have the best fit at 13 for Random Forest.
1 http://topepo.github.io/caret/index.html
R-Code for Classification Modeling using CARET
##CARET Model evaluation for Titanic data set##
##Common Process across all algorithms begin##
library(caret)
library(readr)
set.seed(121)
#Read Titanic data from csv
crtTrain<-read_csv("trainTc.csv",
col_names = TRUE,n_max = -1,
progress = interactive())
#Data cleaning: 'NA' in Age is replaced with the median value
crtTrain$Age[is.na(crtTrain$Age)] <-
median(crtTrain$Age, na.rm=TRUE)
#Selecting required variables
crtTrain <- crtTrain[c(#'PassengerId',
'Pclass',#'Name',
'Sex','Age','SibSp','Parch',
#'Ticket','Fare','Cabin','Embarked',
'Survived')]
#Convert the survived from binary into factor
crtTrain$Survived <-
ifelse(crtTrain$Survived==1,'yes','no')
crtTrain$Survived <-
as.factor(crtTrain$Survived)
#Partition 75% to training and the remainder to test
set.seed(221)
inTrain <- createDataPartition(y =
crtTrain$Survived,p = .75,list = FALSE)
#Create train and test
tcTrain <- crtTrain[ inTrain,]
tcTest <- crtTrain[-inTrain,]
####Common Process ends#####
####Train function on all 4 algorithms#######
####Random Forest#######
set.seed(301)
rfctrl <- trainControl(method =
"repeatedcv",number=10,repeats = 3,
verboseIter=TRUE,classProbs
= TRUE,
summaryFunction =
twoClassSummary)
#install.packages("pROC")
rfFit <- train(Survived ~ .,
data = tcTrain,
method = "rf",
metric="ROC",
ntree = 13,
preProcess = c("center", "scale"),
trControl = rfctrl)
rfFit
plot(rfFit)
#Predict
rfPredictTC <- predict(rfFit, newdata =
tcTest)
rfProbs <- predict(rfFit, newdata = tcTest,
type = "prob")
confusionMatrix(data = rfPredictTC,
tcTest$Survived)
######K-Nearest Neighbour########
knnctrl <- trainControl(method =
"repeatedcv",number=10,repeats = 3,
verboseIter=TRUE,classProbs
= TRUE,
summaryFunction =
twoClassSummary)
set.seed(301)
#install.packages("pROC")
knnFit <- train(Survived ~ .,
data = tcTrain,
method = "knn",
metric="ROC",
preProcess = "knnImpute",
tuneLength = 10,
trControl = knnctrl)
knnFit
plot(knnFit)
#Predict
knnPredictTC <- predict(knnFit, newdata =
tcTest)
knnProbs <- predict(knnFit, newdata = tcTest,
type = "prob")
confusionMatrix(data = knnPredictTC,
tcTest$Survived)
####Regularized Discriminant Analysis#####
mygrid <- data.frame(gamma = (0:4)/4, lambda
= 3/4)
rdactrl <- trainControl(method =
"repeatedcv",number=10,repeats = 3,
verboseIter=TRUE,classProbs
= TRUE,
summaryFunction =
twoClassSummary)
rdaFit <- train(Survived ~ .,
data = tcTrain,
method = "rda",
trControl = rdactrl,
metric = "ROC",
tuneGrid=mygrid,
trace = FALSE,
maxit = 100)
rdaFit
plot(rdaFit)
#Predict
rdaPredictTC <- predict(rdaFit, newdata =
tcTest)
rdaProbs <- predict(rdaFit, newdata = tcTest,
type = "prob")
confusionMatrix(data = rdaPredictTC,
tcTest$Survived)
####Tree based model CART (rpart)#####
rptctrl <- trainControl(method =
"repeatedcv",number=10,repeats = 3,
verboseIter=TRUE,classProbs
= TRUE,
summaryFunction =
twoClassSummary)
rptFit <- train(Survived ~ .,
data = tcTrain,method = "rpart",
trControl = rptctrl,metric =
"ROC",tuneLength = 10)
rptFit
plot(rptFit)
#Predict
rptPredictTC <- predict(rptFit, newdata =
tcTest)
rptProbs <- predict(rptFit, newdata = tcTest,
type = "prob")
confusionMatrix(data = rptPredictTC,
tcTest$Survived)
######End of Modeling###########
Performance metrics comparison between 4 models: After executing the models using the CARET package, the following statistics were collected to evaluate the results and choose the best model. Fig 1 shows the tabulation of the cross table matrix and F-Measure statistics for all four models.
Figure 1. Table: Cross table and F-Measure
It is inferred from Fig 2 that Random Forest has an F-Measure of 77.3%, which is higher than that of the other models.
Fig 3 shows the other metrics comparison for all four
models.
Fig 4 shows the Kappa statistic comparison for all four
models.
Fig 5 highlights the Random Forest performance against
other models.
The Random Forest model, with tuning parameters ntree = 13 and mtry = 2, is evaluated to be the best fit using the CARET package, based on the metrics collected and consolidated across the models.
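The F-measure used for this comparison is not reported directly by caret's twoClassSummary; a small helper of the following form can derive it from each model's confusion matrix. This is a hedged sketch rather than the authors' code, and it assumes the predictions (rfPredictTC and so on) from the modeling code above are still in memory:
# Assumed helper (not from the paper): F-measure from a caret confusionMatrix
f_measure <- function(cm) {
  precision <- cm$byClass["Pos Pred Value"]
  recall <- cm$byClass["Sensitivity"]
  unname(2 * precision * recall / (precision + recall))
}
rfCM <- confusionMatrix(data = rfPredictTC, tcTest$Survived)
f_measure(rfCM) # repeat with the knn, rda and rpart predictions to build Fig 1
Note that confusionMatrix treats the first factor level as the positive class by default, so the class of interest may need to be set explicitly via its positive argument.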
B. Implementation for Regression Modelling
1) Business Understanding: The objective of the research was to quantify property hazards before the time of inspection using the Liberty Mutual Insurance company dataset. At this stage the key attributes and dimensionality of the dataset were identified. The caret package in R was chosen for model evaluation and for selecting the best model for regression analysis.
Figure 2. F-Measure comparison
Figure 3. Other metrics comparison
2) Data Understanding: The property hazard dataset was
downloaded from the Kaggle website. The dataset is split into two: train and test. The train dataset contains the hazard score
and anonymized predictor variables. This dataset was used
for modeling algorithms and selecting the best model. Test
dataset contains only the anonymised predictor variables. Each
row in the train dataset corresponds to a property that was evaluated and given a hazard score. The hazard score attribute
is a continuous number that represents the condition of the
property provided by the inspection committee. The dataset
contains 51000 records.
3) Data Preparation: The train dataset was used for mod-
eling and selecting the best model for regression. This dataset
did not contain any NULL values or NA values. The dataset
was clean and did not require any preparation or transforma-
tion for modeling.
4) Modeling: Three modeling algorithms are used for prediction through regression analysis: Random Forest (rf), Partial Least Squares (pls) and eXtreme Gradient Boosting (xgboost). The caret package contains all of the classification and regression models; this package is used for tuning the parameters and comparing the models in R. The train dataset is loaded into R and data partitioning is done by splitting the data into a training set containing 75% of the records and a test set containing the remaining 25%.
Figure 4. Kappa Statistic
Figure 5. Other Metrics
Model Tuning: Parameter selection is done through the trainControl function. This function is used for resampling the training data by specifying the tuning parameters; it also controls the computational refinements of the train function. The tuning parameters used are: method (repeatedcv), the resampling method to be used; number (10), the number of folds or resampling iterations; repeats (3), the number of complete sets of folds to compute; verboseIter (TRUE), a logical that prints a training log when TRUE; classProbs (FALSE), since class probabilities apply only to classification analysis; and summaryFunction (defaultSummary), a function to compute performance metrics across resamples. The train function is used to fit predictive models over different tuning parameters; it fits each model and calculates a resampling-based performance measure. The parameters used for train are: data, the training data; method, which specifies the model used (rf, pls and xgbTree); trControl, which specifies how the function acts (the trainControl object); tuneLength, specified as 10 for the number of tuning levels; ntree, the number of trees (20); and importance, set to TRUE.
R-Code for Regression Modeling using CARET
##CARET Model evaluation for Liberty Mutual Group dataset##
##Common Process across all algorithms begin##
#Install 'caret', 'randomForest', 'readr', 'pls', 'xgboost'
install.packages("caret")
install.packages("randomForest")
install.packages("readr")
install.packages("xgboost")
install.packages("pls")
# Load the caret, readr, xgboost, pls and randomForest packages
library(caret)
library(xgboost)
library(readr)
library(randomForest)
library(pls)
# Read the first 5000 rows of the Liberty Mutual dataset into R using the readr package
Crttrain<-read_csv("C:/train.csv", n_max =
5000, progress = interactive())
#Partition the data into training and test.
intrain<- createDataPartition(y =
Crttrain$Hazard,p = .75,list = FALSE)
#create train and test
tcTrain <- Crttrain[ intrain,]
tcTest <- Crttrain [-intrain,]
#checking the number of rows and columns
nrow(tcTrain)
ncol(tcTest)
# set seed
set.seed(107)
####Common Process ends#####
###Train function on all 3 algorithms#######
###Random Forest####
# Tuning the model using trainControl: method repeatedcv with 10 folds
rfctrl <- trainControl(method =
"repeatedcv",repeats = 3, number = 10,
verboseIter=TRUE,
classProbs = FALSE,
summaryFunction =
defaultSummary)
# running the randomForest model
rfFit <- train(Hazard ~ ., data = tcTrain,
method = "rf", trControl = rfctrl,
tuneLength = 10, ntree = 20, importance =
TRUE)
#checking the output
rfFit
# plotting the model
plot(rfFit)
# using Predict function to predict the model
rf <- predict(rfFit, newdata = tcTest)
str(rf)
# Checking for problems
rfProbs <- predict(rfFit, newdata = tcTest,
type = "raw")
head(rfProbs)
####Partial Least Squares####
# Tuning the model using trainControl: method repeatedcv with 10 folds
plsctrl <- trainControl(method =
"repeatedcv",repeats = 3,number = 10,
verboseIter=TRUE, classProbs = FALSE,
summaryFunction = defaultSummary)
# Running the PLS model
plsFit <- train(Hazard ~ ., data = tcTrain,
method = "pls", trControl = plsctrl,
tuneLength = 10,ntree = 20, importance =
TRUE)
# checking the output
plsFit
plot(plsFit)
# using Predict function to predict the model
pls <- predict(plsFit, newdata = tcTest)
str(pls)
# Checking for problems
plsProbs <- predict(plsFit, newdata = tcTest,
type = "raw")
head(plsProbs)
####eXtreme Gradient Boosting####
# Tuning the model using trainControl: method repeatedcv with 10 folds
xgctrl <- trainControl(method =
"repeatedcv",repeats = 3,number = 10,
verboseIter=TRUE, classProbs = FALSE,
summaryFunction = defaultSummary)
#running the Xgboost model
xgFit <- train(Hazard ~ ., data = tcTrain,
method = "xgbTree", trControl = xgctrl,
tuneLength = 10, ntree = 20, importance =
TRUE)
#checking the output
xgFit
plot(xgFit)
#using Predict function to predict the model
xgboost <- predict(xgFit, newdata = tcTest)
str(xgboost)
#Checking for problems
xgProbs <- predict(xgFit, newdata = tcTest,
type = "raw")
head(xgProbs)
######End of Modeling###########
Performance metrics comparison between 3 models: The output of these models shows four metrics, namely RMSE, R-squared, RMSE SD and R-squared SD. RMSE and R-squared are considered for comparing the models. RMSE, which stands for Root Mean Square Error, is a standard metric for reporting the prediction error of a continuous variable. R-squared is used to examine how well the model fits the training data; it tells us what percentage of the variance in the data is explained by the model. The model with the lowest RMSE is considered optimal. Fig 6 shows the comparison of metrics for all three models. As shown in Fig 6, RF and xgboost have the lowest RMSE values, and these values are very close at 3.874635 and 3.872974 respectively. This allows us to ensemble the RF and xgboost models and use them for regression analysis.
Figure 6. Comparison of metrics
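The tabulated comparison in Fig 6 can also be produced programmatically; the following is a hedged sketch, assuming the rfFit, plsFit and xgFit objects created by the modeling code above are still in memory:
# Collect the resampling results of the three regression fits and compare them
regResamps <- resamples(list(rf = rfFit, pls = plsFit, xgboost = xgFit))
summary(regResamps) # RMSE and R-squared summaries per model
bwplot(regResamps, metric = "RMSE") # lower RMSE indicates a better fit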
IV. EVALUATION
A. Classification Evaluation
The Random Forest model with parameters ntree = 13 and mtry = 2 is executed against the test data set to predict the results.
R-Code for Random Forest output
###Survival prediction for Titanic data set
using Random Forest######
library(randomForest)
library(readr)
set.seed(1001)
#Read Titanic data from csv
tcTrain<-read_csv("trainTc.csv",
col_names = TRUE,n_max = -1,
progress = interactive())
tcTest <- read_csv("test.csv",
col_names = TRUE,n_max = -1,
progress = interactive())
#Manual function to preprocess data
tcPredictors <- function(inputdat) {
predictors <- c("Pclass","Age","Sex",
"Parch","SibSp",
"Fare","Embarked")
preds <- inputdat[,predictors]
preds$Fare[is.na(preds$Fare)] <-
median(preds$Fare, na.rm=TRUE)
preds$Embarked[preds$Embarked==""] = "S"
preds$Age[is.na(preds$Age)] <- -1
preds$Sex <- as.factor(preds$Sex)
preds$Embarked <- as.factor(preds$Embarked)
return(preds)
}
#Random Forest algorithm
rf <- randomForest(tcPredictors(tcTrain),
as.factor(tcTrain$Survived),
ntree=13,mtry = 2,
importance=TRUE)
#Write out result
outResult <- data.frame(PassengerId =
tcTest$PassengerId)
outResult$Survived <- predict(rf,
tcPredictors(tcTest))
#Write CSV file for out result
write.csv(outResult, file =
"TitanicResult.csv", row.names=FALSE)
###End of code###
Fig 7 shows a screenshot of the output, written in CSV format by the Random Forest model.
Figure 7. Predicted Output screen shot
B. Regression Evaluation
With the results of the model comparison, we have selected an ensemble of the rf and xgboost models for the regression analysis. The tuning parameters are selected from the results of the model comparison. For applying the regression algorithm, the train and test datasets are used and the predictor dataset is formed from the train data. The tuning parameters are as follows: for random forest, ntree = 20 and sampsize = 10000; for xgboost, nrounds = 150, eta = 0.3 and max_depth = 1, with the objective of xgboost specified as linear regression.
R-Code for ensemble of Rf and Xgboost models
### Predicting transformed count of Hazards ####
# load required packages
require(xgboost)
library(caret)
library(randomForest)
library(readr)
# load raw data
train = read_csv('C:/train.csv')
test = read_csv('C:/test.csv')
# Create the response variable
hazard = train$Hazard
# Create the predictor data sets and encode categorical variables using the caret library
htrain = train[,-c(1,2)]
htest = test[,-c(1)]
dummy <- dummyVars(~ ., data = htrain)
htrain = predict(dummy, newdata = htrain)
htest = predict(dummy, newdata = htest)
#running Random Forest
set.seed(1234)
rf <- randomForest(htrain, hazard, ntree=20,
imp=TRUE, sampsize=10000, do.trace=TRUE)
predict_rf <- predict(rf, htest)
# Set necessary xgboost parameters and use parallel threads
parameters <- list("objective" =
"reg:linear", "nthread" = 8, "verbose"=0)
# running xgboost model
xgb.fit = xgboost(param=parameters, data =
htrain, label = hazard, nrounds=150, eta
= .3, max_depth = 1, min_child_weight =
5, scale_pos_weight = 1.0, subsample=0.8)
predict_xgboost <- predict(xgb.fit, htest)
# Predict Hazard for the test set
predict <- data.frame(Id=test$Id)
predict$Hazard <-
(predict_rf+predict_xgboost)/2
#write predict output as csv to system
write_csv(predict, "predict.csv")
V. CONCLUSION
The CARET package in R has helped in model evaluation and in comparing the performance of different models. Multiple classification and regression models were modeled and evaluated with respect to performance metrics using the CARET package. Four classification models were used, namely KNN, rpart, randomForest and regularized discriminant analysis (rda), with the CARET package used to identify the best model. RandomForest performed best with respect to F-measure, hence this model was used for predicting what type of people survived the Titanic shipwreck. Three regression models were used, namely partial least squares (pls), randomForest and extreme gradient boosting (xgboost), with the CARET package used to select the model for regression analysis. An ensemble of randomForest and xgboost was used for predicting the hazards for each property. Future work proposes building an ensemble of classification models and evaluating it. We also propose evaluating and comparing other models of classification and regression analysis.
Figure 8. Output of the regression analysis
Figure 9. Scree plot of the regression analysis
REFERENCES
[1] Ana Isabel Rojão Lourenço Azevedo. “KDD, SEMMA
and CRISP-DM: a parallel overview”. In: (2008).
[2] Cheng-Mei Chen et al. “Prediction of survival in pa-
tients with liver cancer using artificial neural networks
and classification and regression trees”. In: Natural
Computation (ICNC), 2011 Seventh International Con-
ference on. Vol. 2. IEEE. 2011, pp. 811–815.
[3] Hui-Ling Chen et al. “A novel bankruptcy prediction
model based on an adaptive fuzzy k-nearest neighbor
method”. In: Knowledge-Based Systems 24.8 (2011),
pp. 1348–1359.
[4] Jerome H Friedman. “Regularized discriminant analy-
sis”. In: Journal of the American statistical association
84.405 (1989), pp. 165–175.
[5] Lijia Guo. “Applying data mining techniques in proper-
ty/casualty insurance”. In: in CAS 2003 Winter Forum,
Data Management, Quality, and Technology Call Pa-
pers and Ratemaking Discussion Papers, CAS. Citeseer.
2003.
[6] Max Kuhn. “Building predictive models in R using the
caret package”. In: Journal of Statistical Software 28.5
(2008), pp. 1–26.
[7] Andy Liaw and Matthew Wiener. “Classification and
regression by randomForest”. In: R news 2.3 (2002),
pp. 18–22.
[8] Mark Menor and Kyungim Baek. “Relevance units
machine for classification”. In: Biomedical Engineering
and Informatics (BMEI), 2011 4th International Con-
ference on. Vol. 4. IEEE. 2011, pp. 2295–2299.
[9] Bjørn-Helge Mevik and Ron Wehrens. “The pls pack-
age: principal component and partial least squares re-
gression in R”. In: Journal of Statistical Software 18.2
(2007), pp. 1–24.
[10] Carolin Strobl, James Malley, and Gerhard Tutz. “An
introduction to recursive partitioning: rationale, applica-
tion, and characteristics of classification and regression
trees, bagging, and random forests.” In: Psychological
methods 14.4 (2009), p. 323.
[11] Stephen Tyree et al. “Parallel boosted regression trees
for web search ranking”. In: Proceedings of the 20th
international conference on World wide web. ACM.
2011, pp. 387–396.

Contenu connexe

Tendances

Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...AIRCC Publishing Corporation
 
Selecting the correct Data Mining Method: Classification & InDaMiTe-R
Selecting the correct Data Mining Method: Classification & InDaMiTe-RSelecting the correct Data Mining Method: Classification & InDaMiTe-R
Selecting the correct Data Mining Method: Classification & InDaMiTe-RIOSR Journals
 
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary AlgorithmAutomatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithmaciijournal
 
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary AlgorithmAutomatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithmaciijournal
 
STAT 897D Project 2 - Final Draft
STAT 897D Project 2 - Final DraftSTAT 897D Project 2 - Final Draft
STAT 897D Project 2 - Final DraftJonathan Fivelsdal
 
Business Bankruptcy Prediction Based on Survival Analysis Approach
Business Bankruptcy Prediction Based on Survival Analysis ApproachBusiness Bankruptcy Prediction Based on Survival Analysis Approach
Business Bankruptcy Prediction Based on Survival Analysis Approachijcsit
 
Adaptive Classification of Imbalanced Data using ANN with Particle of Swarm O...
Adaptive Classification of Imbalanced Data using ANN with Particle of Swarm O...Adaptive Classification of Imbalanced Data using ANN with Particle of Swarm O...
Adaptive Classification of Imbalanced Data using ANN with Particle of Swarm O...ijtsrd
 
11.software modules clustering an effective approach for reusability
11.software modules clustering an effective approach for  reusability11.software modules clustering an effective approach for  reusability
11.software modules clustering an effective approach for reusabilityAlexander Decker
 
A Study on Cancer Perpetuation Using the Classification Algorithms
A Study on Cancer Perpetuation Using the Classification AlgorithmsA Study on Cancer Perpetuation Using the Classification Algorithms
A Study on Cancer Perpetuation Using the Classification Algorithmspaperpublications3
 
Survey on Feature Selection and Dimensionality Reduction Techniques
Survey on Feature Selection and Dimensionality Reduction TechniquesSurvey on Feature Selection and Dimensionality Reduction Techniques
Survey on Feature Selection and Dimensionality Reduction TechniquesIRJET Journal
 
Implementation of Prototype Based Credal Classification approach For Enhanced...
Implementation of Prototype Based Credal Classification approach For Enhanced...Implementation of Prototype Based Credal Classification approach For Enhanced...
Implementation of Prototype Based Credal Classification approach For Enhanced...IRJET Journal
 
Controlling informative features for improved accuracy and faster predictions...
Controlling informative features for improved accuracy and faster predictions...Controlling informative features for improved accuracy and faster predictions...
Controlling informative features for improved accuracy and faster predictions...Damian R. Mingle, MBA
 
A novel hybrid feature selection approach
A novel hybrid feature selection approachA novel hybrid feature selection approach
A novel hybrid feature selection approachijaia
 
Basic course for computer based methods
Basic course for computer based methodsBasic course for computer based methods
Basic course for computer based methodsimprovemed
 
Final Report
Final ReportFinal Report
Final Reportimu409
 

Tendances (20)

Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
 
Selecting the correct Data Mining Method: Classification & InDaMiTe-R
Selecting the correct Data Mining Method: Classification & InDaMiTe-RSelecting the correct Data Mining Method: Classification & InDaMiTe-R
Selecting the correct Data Mining Method: Classification & InDaMiTe-R
 
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary AlgorithmAutomatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
 
JEDM_RR_JF_Final
JEDM_RR_JF_FinalJEDM_RR_JF_Final
JEDM_RR_JF_Final
 
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary AlgorithmAutomatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
 
STAT 897D Project 2 - Final Draft
STAT 897D Project 2 - Final DraftSTAT 897D Project 2 - Final Draft
STAT 897D Project 2 - Final Draft
 
I0704047054
I0704047054I0704047054
I0704047054
 
Business Bankruptcy Prediction Based on Survival Analysis Approach
Business Bankruptcy Prediction Based on Survival Analysis ApproachBusiness Bankruptcy Prediction Based on Survival Analysis Approach
Business Bankruptcy Prediction Based on Survival Analysis Approach
 
Adaptive Classification of Imbalanced Data using ANN with Particle of Swarm O...
Adaptive Classification of Imbalanced Data using ANN with Particle of Swarm O...Adaptive Classification of Imbalanced Data using ANN with Particle of Swarm O...
Adaptive Classification of Imbalanced Data using ANN with Particle of Swarm O...
 
U0 vqmtq2otq=
U0 vqmtq2otq=U0 vqmtq2otq=
U0 vqmtq2otq=
 
Statsci
StatsciStatsci
Statsci
 
11.software modules clustering an effective approach for reusability
11.software modules clustering an effective approach for  reusability11.software modules clustering an effective approach for  reusability
11.software modules clustering an effective approach for reusability
 
A Study on Cancer Perpetuation Using the Classification Algorithms
A Study on Cancer Perpetuation Using the Classification AlgorithmsA Study on Cancer Perpetuation Using the Classification Algorithms
A Study on Cancer Perpetuation Using the Classification Algorithms
 
Survey on Feature Selection and Dimensionality Reduction Techniques
Survey on Feature Selection and Dimensionality Reduction TechniquesSurvey on Feature Selection and Dimensionality Reduction Techniques
Survey on Feature Selection and Dimensionality Reduction Techniques
 
Implementation of Prototype Based Credal Classification approach For Enhanced...
Implementation of Prototype Based Credal Classification approach For Enhanced...Implementation of Prototype Based Credal Classification approach For Enhanced...
Implementation of Prototype Based Credal Classification approach For Enhanced...
 
One Graduate Paper
One Graduate PaperOne Graduate Paper
One Graduate Paper
 
Controlling informative features for improved accuracy and faster predictions...
Controlling informative features for improved accuracy and faster predictions...Controlling informative features for improved accuracy and faster predictions...
Controlling informative features for improved accuracy and faster predictions...
 
A novel hybrid feature selection approach
A novel hybrid feature selection approachA novel hybrid feature selection approach
A novel hybrid feature selection approach
 
Basic course for computer based methods
Basic course for computer based methodsBasic course for computer based methods
Basic course for computer based methods
 
Final Report
Final ReportFinal Report
Final Report
 

Similaire à DataMining_CA2-4

Proficiency comparison ofladtree
Proficiency comparison ofladtreeProficiency comparison ofladtree
Proficiency comparison ofladtreeijcsa
 
Review of Algorithms for Crime Analysis & Prediction
Review of Algorithms for Crime Analysis & PredictionReview of Algorithms for Crime Analysis & Prediction
Review of Algorithms for Crime Analysis & PredictionIRJET Journal
 
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETSA HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETSEditor IJCATR
 
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTION
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTIONCATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTION
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTIONIJDKP
 
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTION
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTIONCATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTION
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTIONIJDKP
 
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEYCLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEYEditor IJMTER
 
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...IRJET Journal
 
IRJET - Breast Cancer Prediction using Supervised Machine Learning Algorithms...
IRJET - Breast Cancer Prediction using Supervised Machine Learning Algorithms...IRJET - Breast Cancer Prediction using Supervised Machine Learning Algorithms...
IRJET - Breast Cancer Prediction using Supervised Machine Learning Algorithms...IRJET Journal
 
KIT-601 Lecture Notes-UNIT-2.pdf
KIT-601 Lecture Notes-UNIT-2.pdfKIT-601 Lecture Notes-UNIT-2.pdf
KIT-601 Lecture Notes-UNIT-2.pdfDr. Radhey Shyam
 
Simplified Knowledge Prediction: Application of Machine Learning in Real Life
Simplified Knowledge Prediction: Application of Machine Learning in Real LifeSimplified Knowledge Prediction: Application of Machine Learning in Real Life
Simplified Knowledge Prediction: Application of Machine Learning in Real LifePeea Bal Chakraborty
 
IRJET - Survey on Analysis of Breast Cancer Prediction
IRJET - Survey on Analysis of Breast Cancer PredictionIRJET - Survey on Analysis of Breast Cancer Prediction
IRJET - Survey on Analysis of Breast Cancer PredictionIRJET Journal
 
Csit65111ASSOCIATIVE REGRESSIVE DECISION RULE MINING FOR ASSOCIATIVE REGRESSI...
Csit65111ASSOCIATIVE REGRESSIVE DECISION RULE MINING FOR ASSOCIATIVE REGRESSI...Csit65111ASSOCIATIVE REGRESSIVE DECISION RULE MINING FOR ASSOCIATIVE REGRESSI...
Csit65111ASSOCIATIVE REGRESSIVE DECISION RULE MINING FOR ASSOCIATIVE REGRESSI...cscpconf
 
Associative Regressive Decision Rule Mining for Predicting Customer Satisfact...
Associative Regressive Decision Rule Mining for Predicting Customer Satisfact...Associative Regressive Decision Rule Mining for Predicting Customer Satisfact...
Associative Regressive Decision Rule Mining for Predicting Customer Satisfact...csandit
 
LABELING CUSTOMERS USING DISCOVERED KNOWLEDGE CASE STUDY: AUTOMOBILE INSURAN...
LABELING CUSTOMERS USING DISCOVERED KNOWLEDGE  CASE STUDY: AUTOMOBILE INSURAN...LABELING CUSTOMERS USING DISCOVERED KNOWLEDGE  CASE STUDY: AUTOMOBILE INSURAN...
LABELING CUSTOMERS USING DISCOVERED KNOWLEDGE CASE STUDY: AUTOMOBILE INSURAN...ijmvsc
 
Assessment of Decision Tree Algorithms on Student’s Recital
Assessment of Decision Tree Algorithms on Student’s RecitalAssessment of Decision Tree Algorithms on Student’s Recital
Assessment of Decision Tree Algorithms on Student’s RecitalIRJET Journal
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsDinusha Dilanka
 
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUE
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUECLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUE
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUEAIRCC Publishing Corporation
 
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUE
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUECLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUE
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUEijcsit
 

Similaire à DataMining_CA2-4 (20)

Proficiency comparison ofladtree
Proficiency comparison ofladtreeProficiency comparison ofladtree
Proficiency comparison ofladtree
 
Review of Algorithms for Crime Analysis & Prediction
Review of Algorithms for Crime Analysis & PredictionReview of Algorithms for Crime Analysis & Prediction
Review of Algorithms for Crime Analysis & Prediction
 
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETSA HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
 
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTION
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTIONCATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTION
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTION
 
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTION
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTIONCATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTION
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTION
 
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEYCLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
 
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
 
Dmml report final
Dmml report finalDmml report final
Dmml report final
 
IRJET - Breast Cancer Prediction using Supervised Machine Learning Algorithms...
IRJET - Breast Cancer Prediction using Supervised Machine Learning Algorithms...IRJET - Breast Cancer Prediction using Supervised Machine Learning Algorithms...
IRJET - Breast Cancer Prediction using Supervised Machine Learning Algorithms...
 
KIT-601 Lecture Notes-UNIT-2.pdf
KIT-601 Lecture Notes-UNIT-2.pdfKIT-601 Lecture Notes-UNIT-2.pdf
KIT-601 Lecture Notes-UNIT-2.pdf
 
Simplified Knowledge Prediction: Application of Machine Learning in Real Life
Simplified Knowledge Prediction: Application of Machine Learning in Real LifeSimplified Knowledge Prediction: Application of Machine Learning in Real Life
Simplified Knowledge Prediction: Application of Machine Learning in Real Life
 
IRJET - Survey on Analysis of Breast Cancer Prediction
IRJET - Survey on Analysis of Breast Cancer PredictionIRJET - Survey on Analysis of Breast Cancer Prediction
IRJET - Survey on Analysis of Breast Cancer Prediction
 
Csit65111ASSOCIATIVE REGRESSIVE DECISION RULE MINING FOR ASSOCIATIVE REGRESSI...
Csit65111ASSOCIATIVE REGRESSIVE DECISION RULE MINING FOR ASSOCIATIVE REGRESSI...Csit65111ASSOCIATIVE REGRESSIVE DECISION RULE MINING FOR ASSOCIATIVE REGRESSI...
Csit65111ASSOCIATIVE REGRESSIVE DECISION RULE MINING FOR ASSOCIATIVE REGRESSI...
 
Associative Regressive Decision Rule Mining for Predicting Customer Satisfact...
Associative Regressive Decision Rule Mining for Predicting Customer Satisfact...Associative Regressive Decision Rule Mining for Predicting Customer Satisfact...
Associative Regressive Decision Rule Mining for Predicting Customer Satisfact...
 
LABELING CUSTOMERS USING DISCOVERED KNOWLEDGE CASE STUDY: AUTOMOBILE INSURAN...
LABELING CUSTOMERS USING DISCOVERED KNOWLEDGE  CASE STUDY: AUTOMOBILE INSURAN...LABELING CUSTOMERS USING DISCOVERED KNOWLEDGE  CASE STUDY: AUTOMOBILE INSURAN...
LABELING CUSTOMERS USING DISCOVERED KNOWLEDGE CASE STUDY: AUTOMOBILE INSURAN...
 
Manuscript dss
Manuscript dssManuscript dss
Manuscript dss
 
Assessment of Decision Tree Algorithms on Student’s Recital
Assessment of Decision Tree Algorithms on Student’s RecitalAssessment of Decision Tree Algorithms on Student’s Recital
Assessment of Decision Tree Algorithms on Student’s Recital
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning Algorithms
 
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUE
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUECLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUE
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUE
 
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUE
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUECLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUE
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUE
 

DataMining_CA2-4

  • 1. Model Evaluation for Classification and Regression Aravind Kumar Balasubramaniam, 14123754 School of Computing National College of Ireland Dublin, Ireland Email: aravindkumar.balasubramaniam@student.ncirl.ie Anisha Kudalappa Gudagi, 14123223 School of Computing National College of Ireland Dublin, Ireland Email: anishaKudalappa.Gudagi@student.ncirl.ie Abstract—This paper aims to compare classification and re- gression models using CARET package in R. Classification and regression analysis are performed for predicting survival rate on titanic dataset and quantifying hazard score on property-liability dataset respectively. Four models of classification algorithms are modeled for the data samples from Titanic dataset and evaluated against Accuracy, Sensitivity, Specificity, Pos pred value, Neg pred value, Kappa and F-measure metrics. Based on F-measure the best model is selected and classification analysis is performed on the titanic dataset. Three models of regression algorithms are modeled for the data samples from property-liability dataset an evaluated against RMSE,R-squared, RMSE SD and R-squared SD. RMSE metric is used for selecting the best model for regression analysis and predicting the hazard score. I. INTRODUCTION Data prediction is a non-trivial task and has an enormous value, this paper out rights the heuristics about classification and regression modelling by providing a literature review. Two different data sets individually suitable for classification and regression are used for evaluating the model performance metrics with the aid of CARET package and to train test the selected algorithm. This research paper presents and discusses classification based survival prediction of the infamous Titanic shipwrecks and regression based prediction of property haz- ards based on property information in the property-liability insurance industry. Predictive analysis is one of the most important supervised data mining technique which enables us to find unidentified patterns or trends in datasets. Machine learning from disaster is one of the most explored areas in the field of Data mining and disaster management. In the past, many researchers have used Data mining techniques for predicting survivals from natural disasters and for predicting patient survival rate for uncommon and cure-less diseases. We have used classification data mining algorithms to predict what sort of people were likely to survive the shipwrecks. The pre- diction depends on the key attributes such as gender, passenger class, age, parents, siblings and the type of ticket. Regression analysis is done on the property-liability data for quantifying hazards. Property insurance requires inspection of the property based on the condition of the property such as foundation, roof, flooring etc. which are key property attributes. These key property attributes needs to be investigated by the insurance companies before they are approved for insured. Thus, to provide a clear insight to insurance companies to find property hazards from key attributes of properties, predictive analysis is best suited. In this paper we have analyzed and built predictive models to classify the rate of survivals in Titanic shipwreck and to quantify property hazards before time of inspection of Liberty Mutual Insurance company dataset which contains a hazard score provided for each property by inspection. 
There are two queries- the first query is to “predict what sort of people were likely to survive the shipwreck” using classification and second query is to “predict which of these hazards contribute more for each property. We have built four classification models and three regression models for comparison and selected the best model. The rest of the paper is divided as follows: section (2) related work which discusses about predictive analysis in insurance industry and comparing models, section (3) methodology used to answer the query, section (4) evaluation of results and section (5) conclusion and future work. NCI August 20, 2015 II. RELATED WORK Classification data mining technique refers to assigning each data point instance to a class label. It is one of the most researched platforms in the field of data mining. Classification algorithms aim to find a classifier that will assist in assigning the input instance to a class label [8]. Classification has been used as the most popular technique in the field of disaster management and medicine. A conventional algorithm for survival analysis technique evaluate the data inputs by using the Kaplan-Meier method or Cox proportional hazard model. However during the recent years classification models are widely used for medicine. These models first establishes a tree structure by partitioning the training data into several subclasses. This partitioning is done according to the test conditions until all the data are grouped under one subclass. After the tree structure is established, pruning is done to the tree from the bottom of the tree. After the pruning, rules are created from the output and these rules are used in the classification of new training data for prediction [2].The K-nearest neighbor is one of the oldest methods and non- parametric methods of classification. In this algorithm a class is assigned based on the common class amongst the k-nearest neighbors. Fuzzy k-nearest neighbor is an extension of KNN in which the algorithm assigns the fuzzy memberships of data samples to different classes [3]. Boosting is an iterative
  • 2. algorithm combining classification rules with performance in terms of error rate to produce an accurate classification rule. Regularized discriminant analysis refers to assigning objects to one or several groups which is obtained from each object. Regularization techniques are applied for linear discriminant analysis and quadratic discriminant analysis and have been successful in the results of poorly-posed inverse problems. The efficiency of RDA is to improve the misclassification risk and error rate from LDA and QDA [4].Recursive partitioning (rpart) is a statistical method that is used in classification and regression trees. This method is used to discover data structures in trends in the data sample. It is in various scientific fields as multivariate data exploration, for example: DNA sequencing, medicine. This algorithm can be tuned to perform classification and regression [10]. Data mining techniques have been used in the insurance industry from quite some time since insurance databases consists of large datasets which provide valuable business knowledge for improving the customer relationship or improving profits or expanding the business. Modeling insurance risk is done by applying data mining techniques. Past research has showed that data mining methods improves the existing models by discovering extra variables and by detecting nonlinear relationships. Data mining has been of great importance in the insurance industry by identifying risk factors that helps in predicting profits and losses. Data mining techniques like decision trees and neural network can accurately predict risk. Customer relationship management analysis helps in understanding the customer and accurately select which policies to be offered to a customer [5]. Random Forest is a data mining algorithm for performing classification and regression analysis. The metric measure for randomForest is mtry which specifies the best fit split from the predictor variables. The randomForest package produces two sets of information: a metric for specifying the importance of the predicting variables and a metric for specifying the structure of the data [7]. Multivariate regression methods such as Principal component analysis and partial least square have lot of implications in a variety of industries. Quantitative –structure-activity relations and quantitative structure –prop- erty relations use PSLR and PCR. In PCR, the first a principal components (PCs) is used to approximate the matrix. Next the Y is regressed on the scores, which in turn provides the regression coefficients [9]. Gradient Boosted regression is an iterative algorithm for finding a predictor. Regression trees provides a repeated expansion of nodes until a stopping criteria is met. All the data points in the data are assigned to a single node initially. Parallelizing the boosted regression trees implies to boosting analysis sequentially by parallelizing the building of trees individually [11]. The CARET package which is short for classification and regression training con- sists of classification and regression models. It is used for model tuning and training across models. This package helps in comparing model performance between different models. Since classification and regression models are used in many different applications, Caret package will help in selecting the best model and approach [6] III. METHODOLOGY The data mining methodology used here is CRISP-DM methodology. CRISP-DM stands for Cross-Industry Standard Process for Data Mining. 
III. METHODOLOGY

The data mining methodology used here is CRISP-DM, which stands for Cross-Industry Standard Process for Data Mining. This methodology consists of six steps:
• Business Understanding: understanding the business objectives and requirements.
• Data Understanding: initial data collection and identification of subsets in the data.
• Data Preparation: transformation and cleaning of the data in preparation for modeling.
• Modeling: modeling techniques are applied to the dataset.
• Evaluation: the model results are evaluated and measured against the business objectives.
• Deployment: the model built is deployed to the customer together with the results [1].

A. Implementation for Classification Modelling

1) Business Understanding: The objective was to identify or predict what sort of people, other than the upper class, women and children, were likely to survive on the RMS Titanic, one of the most infamous shipwrecks in history. The preliminary plan was to understand the key variables or dimensions to be used for predicting the class label, and to use the caret package for model evaluation and for identifying the parameter values.
2) Data Understanding: The Titanic dataset was collected from Kaggle. The key variables identified were 'Sex', 'Age', 'Pclass', 'Parch', 'Fare', 'SibSp' and 'Embarked', used to predict the categorical class 'Survived'. Some problems, such as NULLs and NAs, were identified in the dataset; these are taken care of in the following stages.
3) Data Preparation: The problems detected during data understanding are addressed here.
• NAs in 'Fare' are replaced with the median value.
• NULL values in 'Embarked' are replaced with "S".
• NAs in 'Age' are replaced with -1.
• 'Sex' is converted to a factor.
• 'Embarked' is converted to a factor.
4) Modeling: Four modeling algorithms were selected in a random manner: random forest ('rf'), K-nearest neighbor ('knn'), regularized discriminant analysis ('rda') and the tree-based CART model ('rpart'). To tune the parameters and evaluate the metrics of these models, the Classification And REgression Training (CARET) package is used (http://topepo.github.io/caret/index.html).
Model Tuning: NAs, NULLs and factorization are handled in data pre-processing. Data partitioning is performed through simple splitting, creating balanced 75%/25% splits for each class; 75% of the data is used for training and 25% for testing within the caret package.
Parameter selection: The train function in caret is used to evaluate the models by resampling the training data with a parameterized number of folds and repeats.
Here resampling is done with ten folds and three repeats using the 'repeatedcv' method, i.e. repeated cross-validation. classProbs is set to TRUE to compute class probabilities for the held-out samples, and summaryFunction is set to 'twoClassSummary' so that caret computes specificity, sensitivity and the area under the Receiver Operating Characteristic (ROC) curve. The predictors neither require estimated power transformations nor contain zero or negative values, so the Box-Cox and Yeo-Johnson pre-processing methods are not needed; instead only centering and scaling is used for random forest, while 'knnImpute' is used for K-nearest neighbor, which finds the k closest samples by Euclidean distance. Tune grid: for regularized discriminant analysis, gamma is set to range from 0.00 to 1.00 in steps of 0.25 and lambda is fixed at 0.75 to find the optimal ROC. For random forest, the ntree parameter was checked over the range 10 to 15 in steps of 1 and was found to have the best fit at 13.

R-Code for Classification Modeling using CARET

##CARET Model evaluation for Titanic data set##
##Common Process across all algorithms begin##
library(caret)
library(readr)
set.seed(121)
#Read Titanic data from csv
crtTrain <- read_csv("trainTc.csv", col_names = TRUE, n_max = -1,
                     progress = interactive())
#Data cleaning: 'NA' in Age is replaced with the median value
crtTrain$Age[is.na(crtTrain$Age)] <- median(crtTrain$Age, na.rm = TRUE)
#Selecting required variables
crtTrain <- crtTrain[c(#'PassengerId',
                       'Pclass', #'Name',
                       'Sex', 'Age', 'SibSp', 'Parch',
                       #'Ticket', 'Fare', 'Cabin', 'Embarked',
                       'Survived')]
#Convert Survived from binary into a factor
crtTrain$Survived <- ifelse(crtTrain$Survived == 1, 'yes', 'no')
crtTrain$Survived <- as.factor(crtTrain$Survived)
#Partition 75% to training and the remainder to test
set.seed(221)
inTrain <- createDataPartition(y = crtTrain$Survived, p = .75, list = FALSE)
#Create train and test sets
tcTrain <- crtTrain[ inTrain,]
tcTest  <- crtTrain[-inTrain,]
####Common Process ends#####

####Train function on all 4 algorithms#######
####Random Forest#######
set.seed(301)
rfctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                       verboseIter = TRUE, classProbs = TRUE,
                       summaryFunction = twoClassSummary)
#install.packages("pROC")
rfFit <- train(Survived ~ ., data = tcTrain, method = "rf", metric = "ROC",
               ntree = 13, preProcess = c("center", "scale"),
               trControl = rfctrl)
rfFit
plot(rfFit)
#Predict
rfPredictTC <- predict(rfFit, newdata = tcTest)
rfProbs <- predict(rfFit, newdata = tcTest, type = "prob")
confusionMatrix(data = rfPredictTC, tcTest$Survived)

######K-Nearest Neighbour########
knnctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                        verboseIter = TRUE, classProbs = TRUE,
                        summaryFunction = twoClassSummary)
set.seed(301)
#install.packages("pROC")
knnFit <- train(Survived ~ ., data = tcTrain, method = "knn", metric = "ROC",
                preProcess = "knnImpute", tuneLength = 10,
                trControl = knnctrl)
knnFit
plot(knnFit)
#Predict
knnPredictTC <- predict(knnFit, newdata = tcTest)
knnProbs <- predict(knnFit, newdata = tcTest, type = "prob")
confusionMatrix(data = knnPredictTC, tcTest$Survived)

####Regularized Discriminant Analysis#####
mygrid <- data.frame(gamma = (0:4)/4, lambda = 3/4)
rdactrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                        verboseIter = TRUE, classProbs = TRUE,
                        summaryFunction = twoClassSummary)
rdaFit <- train(Survived ~ ., data = tcTrain, method = "rda",
                trControl = rdactrl, metric = "ROC", tuneGrid = mygrid,
                trace = FALSE, maxit = 100)
rdaFit
plot(rdaFit)
#Predict
rdaPredictTC <- predict(rdaFit, newdata = tcTest)
rdaProbs <- predict(rdaFit, newdata = tcTest, type = "prob")
confusionMatrix(data = rdaPredictTC, tcTest$Survived)
####Tree based model CART (rpart)#####
rptctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                        verboseIter = TRUE, classProbs = TRUE,
                        summaryFunction = twoClassSummary)
rptFit <- train(Survived ~ ., data = tcTrain, method = "rpart",
                trControl = rptctrl, metric = "ROC", tuneLength = 10)
rptFit
plot(rptFit)
#Predict
rptPredictTC <- predict(rptFit, newdata = tcTest)
rptProbs <- predict(rptFit, newdata = tcTest, type = "prob")
confusionMatrix(data = rptPredictTC, tcTest$Survived)
######End of Modeling###########

Performance metrics comparison between the 4 models: After executing the models using the CARET package, the following statistics were collected to evaluate the results and choose the best model. Fig 1 tabulates the cross-table matrix and F-measure statistics for all four models.
Figure 1. Table: Cross table and F-Measure
It is inferred from Fig 2 that random forest has an F-measure of 77.3%, which is higher than that of the other models. Fig 3 shows the comparison of the other metrics for all four models, Fig 4 shows the Kappa statistic comparison, and Fig 5 highlights the random forest performance against the other models.
Figure 2. F-Measure comparison
Figure 3. Other metrics comparison
Figure 4. Kappa Statistic
Figure 5. Other Metrics
The random forest model, with its tuning parameters ntree = 13 and mtry = 2, is evaluated to be the best fit using the CARET package, based on the metrics collected and consolidated against the other models.
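The F-measure and Kappa values reported above can be derived from each model's confusion matrix on the held-out test set. The following is a minimal illustrative sketch (not part of the original listings), assuming the prediction object rfPredictTC and the test set tcTest from the classification code above; the same computation applies to the other three models.

## Minimal sketch: F-measure and Kappa from a caret confusion matrix.
## Assumes rfPredictTC and tcTest exist as in the classification listing above.
library(caret)

cm    <- confusionMatrix(data = rfPredictTC, reference = tcTest$Survived)
prec  <- cm$byClass["Pos Pred Value"]     # precision
rec   <- cm$byClass["Sensitivity"]        # recall
fmeas <- 2 * prec * rec / (prec + rec)    # F-measure (F1)
kappa <- cm$overall["Kappa"]              # Kappa statistic
c(F_measure = unname(fmeas), Kappa = unname(kappa))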
B. Implementation for Regression Modelling

1) Business Understanding: The objective of this part of the research was to quantify property hazards before the time of inspection, using the Liberty Mutual Insurance company dataset. At this stage the key attributes and the dimensionality of the dataset were identified. The caret package in R was chosen for model evaluation and for selecting the best model for regression analysis.
2) Data Understanding: The property hazard dataset was downloaded from the Kaggle website. The dataset is split into two parts, train and test. The train dataset contains the hazard score and anonymized predictor variables; it was used for modeling the algorithms and selecting the best model. The test dataset contains only the anonymized predictor variables. Each row in the train dataset corresponds to a property that was evaluated and given a hazard score. The hazard score is a continuous number that represents the condition of the property as assessed by the inspection committee. The dataset contains 51000 records.
3) Data Preparation: The train dataset was used for modeling and for selecting the best model for regression. This dataset did not contain any NULL or NA values; it was clean and did not require any preparation or transformation for modeling.
4) Modeling: Prediction is done through regression analysis. Random forest (rf), partial least squares (pls) and eXtreme gradient boosting (xgboost) are the three models used. The caret package contains all of these classification and regression models and is used for tuning the parameters and comparing the models in R. The train dataset is loaded into R, and data partitioning is done by splitting the data into a training set containing 75% of the records and a test set containing the remainder.
Model Tuning: Parameter selection is done through the trainControl function, which is used for resampling the training data by specifying the tuning parameters; it also controls the computational refinements of the train function. The tuning parameters used are: method (repeatedcv), which specifies the resampling method; number (10), the number of folds or resampling iterations; repeats (3), the number of complete sets of folds to compute; verboseIter (TRUE), a logical for printing a training log; classProbs (FALSE), since class probabilities apply only to classification; and summaryFunction (defaultSummary), a function to compute performance metrics across resamples. The train function is used to fit predictive models over different tuning parameters: it fits each model and calculates a resampling-based performance measure. The parameters used for train are: data, the training data; method, which specifies the model used (rf, pls or xgboost); trControl, which specifies how the function acts (the trainControl object); tuneLength, set to 10 for the number of tuning levels; ntree, the number of trees (20); and importance, set to TRUE.

R-Code for Regression Modeling using CARET

##CARET Model evaluation for Liberty Mutual Group dataset##
##Common Process across all algorithms begin##
#Install 'caret', 'randomForest', 'readr', 'pls' and 'xgboost'
install.packages("caret")
install.packages("randomForest")
install.packages("readr")
install.packages("xgboost")
install.packages("pls")
#Load the caret, readr, xgboost, pls and randomForest packages
library(caret)
library(xgboost)
library(readr)
library(randomForest)
library(pls)
#Read the first 5000 rows of the Liberty Mutual dataset into R using the readr package
Crttrain <- read_csv("C:/train.csv", n_max = 5000, progress = interactive())
#Partition the data into training and test
intrain <- createDataPartition(y = Crttrain$Hazard, p = .75, list = FALSE)
#Create train and test sets
tcTrain <- Crttrain[ intrain,]
tcTest  <- Crttrain[-intrain,]
#Check the number of rows and columns
nrow(tcTrain)
ncol(tcTest)
#Set seed
set.seed(107)
####Common Process ends#####

###Train function on all 3 algorithms#######
###Random Forest####
#Tune the model using trainControl, specifying the method as repeatedcv and the number of folds as 10
rfctrl <- trainControl(method = "repeatedcv", repeats = 3, number = 10,
                       verboseIter = TRUE, classProbs = FALSE,
                       summaryFunction = defaultSummary)
#Run the randomForest model
rfFit <- train(Hazard ~ ., data = tcTrain, method = "rf", trControl = rfctrl,
               tuneLength = 10, ntree = 20, importance = TRUE)
#Check the output
rfFit
#Plot the model
plot(rfFit)
#Use the predict function on the held-out test data
rf <- predict(rfFit, newdata = tcTest)
str(rf)
#Check the raw predictions
rfProbs <- predict(rfFit, newdata = tcTest, type = "raw")
head(rfProbs)
####Partial Least Squares####
#Tune the model using trainControl, specifying the method as repeatedcv and the number of folds as 10
plsctrl <- trainControl(method = "repeatedcv", repeats = 3, number = 10,
                        verboseIter = TRUE, classProbs = FALSE,
                        summaryFunction = defaultSummary)
#Run the PLS model (the random-forest specific ntree/importance arguments are not needed here)
plsFit <- train(Hazard ~ ., data = tcTrain, method = "pls",
                trControl = plsctrl, tuneLength = 10)
#Check the output
plsFit
plot(plsFit)
#Use the predict function on the held-out test data
pls <- predict(plsFit, newdata = tcTest)
str(pls)
#Check the raw predictions
plsProbs <- predict(plsFit, newdata = tcTest, type = "raw")
head(plsProbs)

####eXtreme Gradient Boosting####
#Tune the model using trainControl, specifying the method as repeatedcv and the number of folds as 10
xgctrl <- trainControl(method = "repeatedcv", repeats = 3, number = 10,
                       verboseIter = TRUE, classProbs = FALSE,
                       summaryFunction = defaultSummary)
#Run the xgboost model (the ntree/importance arguments from the random forest call are not needed here)
xgFit <- train(Hazard ~ ., data = tcTrain, method = "xgbTree",
               trControl = xgctrl, tuneLength = 10)
#Check the output
xgFit
plot(xgFit)
#Use the predict function on the held-out test data
xgboost <- predict(xgFit, newdata = tcTest)
str(xgboost)
#Check the raw predictions
xgProbs <- predict(xgFit, newdata = tcTest, type = "raw")
head(xgProbs)
######End of Modeling###########

Performance metrics comparison between the 3 models: The output of these models reports four metrics, namely RMSE, R-squared, RMSE SD and R-squared SD. RMSE and R-squared are considered for comparing the models. RMSE, which stands for Root Mean Square Error, is a standard metric for reporting the prediction error of a continuous variable; R-squared examines how well the model fits the training data and tells us what percentage of the variance in the data is explained by the model. The model with the lowest RMSE is considered optimal. Fig 6 shows the comparison of the metrics for all three models. As shown in Fig 6, rf and xgboost have the lowest RMSE values, and these values are very close, at 3.874635 and 3.872974 respectively. This motivates ensembling the rf and xgboost models and using the ensemble for the regression analysis.
Figure 6. Comparison of metrics
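For reference, these two metrics can also be computed directly from a model's held-out predictions. The following is a small illustrative sketch (not part of the original listings), assuming the prediction vector rf and the test set tcTest from the regression code above.

## Minimal sketch: RMSE and R-squared on the held-out test data.
## Assumes rf (predicted Hazard values) and tcTest exist as in the listing above.
library(caret)

obs  <- tcTest$Hazard
pred <- rf

rmse <- sqrt(mean((obs - pred)^2))                           # root mean square error
rsq  <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)   # coefficient of determination
c(RMSE = rmse, Rsquared = rsq)

#caret's postResample() reports RMSE together with an R-squared
#computed as the squared correlation between observed and predicted values
postResample(pred = pred, obs = obs)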
  • 7. "Parch","SibSp", "Fare","Embarked") preds <- inputdat[,predictors] preds$Fare[is.na(preds$Fare)] <- median(preds$Fare, na.rm=TRUE) preds$Embarked[preds$Embarked==""] = "S" preds$Age[is.na(preds$Age)] <- -1 preds$Sex <- as.factor(preds$Sex) preds$Embarked <- as.factor(preds$Embarked) return(preds) } #Random Forest algorithm rf <- randomForest(tcPredictors(tcTrain), as.factor(tcTrain$Survived), ntree=13,mtry = 2, importance=TRUE) #Write out result outResult <- data.frame(PassengerId = tcTest$PassengerId) outResult$Survived <- predict(rf, tcPredictors(tcTest)) #Write CSV file for out result write.csv(outResult, file = "TitanicResult.csv", row.names=FALSE) ###End of code### Fig 7 shows a screen shot of the output written csv format by Random Forest model. Figure 7. Predicted Output screen shot B. Regression Evaluation With the results of the model comparison, we have selected rf and xgboost models as ensemble for regression analysis. The tuning parameters are selected by the result of the model comparison. For applying the regression algorithm, train and test datasets are used and the predictor dataset is formed from Train data. The tuning parameters are as follow: For random forest ntree=20; sampsize=10000; Xgboost; nrounds=150; eta = .3; maxdepth = 1; and the objective of xgboost is specified as linear regression. R-Code for ensemble of Rf and Xgboost models ### Predicting transformed count of Hazards#### # load required packages require(xgboost) library(caret) library(randomForest) library(readr) # load raw data train = read_csv(’C:/train.csv’) test = read_csv(’C:/test.csv’) # Create the response variable hazard = train$Hazard # Create the predictor data set and encode categorical variables using caret library. htrain = train[,-c(1,2)] htest = test[,-c(1)] dummy <- dummyVars(˜ ., data = htrain) htrain = predict(dummy, newdata = htrain) htest = predict(dummy, newdata = htest) #running Random Forest set.seed(1234) rf <- randomForest(htrain, hazard, ntree=20, imp=TRUE, sampsize=10000, do.trace=TRUE) predict_rf <- predict(rf, htest) # Set necessary parameters and use parallel threads parameters <- list("objective" = "reg:linear", "nthread" = 8, "verbose"=0) # running xgboost model xgb.fit = xgboost(param=parameters, data = htrain, label = hazard, nrounds=150, eta = .3, max_depth = 1, min_child_weight = 5, scale_pos_weight = 1.0, subsample=0.8) predict_xgboost <- predict(xgb.fit, htest) # Predict Hazard for the test set predict <- data.frame(Id=test$Id) predict$Hazard <- (predict_rf+predict_xgboost)/2 #write predict output as csv to system write_csv(predict, "predict.csv") V. CONCLUSION CARET package in R has helped in model evaluation and comparing the performances between different models. Mul- tiple classification and regression models were modeled and evaluated with respect to performance metrics using CARET package. Four classification models were used namely- KNN, rpart, randomForest and regularized discriminant analysis (rda)
Figure 8. Output of the regression analysis
Figure 9. Scree plot of the regression analysis

V. CONCLUSION

The CARET package in R has helped in model evaluation and in comparing the performance of different models. Multiple classification and regression models were built and evaluated with respect to performance metrics using the CARET package. Four classification models were used, namely KNN, rpart, randomForest and regularized discriminant analysis (rda), with the CARET package used to check for the best model. RandomForest performed best with respect to F-measure, and hence this model was used for predicting what type of people survived the Titanic shipwreck. Three regression models were used, namely partial least squares (pls), randomForest and extreme gradient boosting (xgboost), with the CARET package used to check for the best model for the regression analysis. An ensemble of randomForest and xgboost was used for predicting the hazards for each property. Future work includes building and evaluating an ensemble of classification models, as well as evaluating and comparing other models for classification and regression analysis.

REFERENCES
[1] Ana Isabel Rojão Lourenço Azevedo. “KDD, SEMMA and CRISP-DM: a parallel overview”. In: (2008).
[2] Cheng-Mei Chen et al. “Prediction of survival in patients with liver cancer using artificial neural networks and classification and regression trees”. In: Natural Computation (ICNC), 2011 Seventh International Conference on. Vol. 2. IEEE. 2011, pp. 811–815.
[3] Hui-Ling Chen et al. “A novel bankruptcy prediction model based on an adaptive fuzzy k-nearest neighbor method”. In: Knowledge-Based Systems 24.8 (2011), pp. 1348–1359.
[4] Jerome H Friedman. “Regularized discriminant analysis”. In: Journal of the American Statistical Association 84.405 (1989), pp. 165–175.
[5] Lijia Guo. “Applying data mining techniques in property/casualty insurance”. In: CAS 2003 Winter Forum, Data Management, Quality, and Technology Call Papers and Ratemaking Discussion Papers, CAS. Citeseer. 2003.
[6] Max Kuhn. “Building predictive models in R using the caret package”. In: Journal of Statistical Software 28.5 (2008), pp. 1–26.
[7] Andy Liaw and Matthew Wiener. “Classification and regression by randomForest”. In: R News 2.3 (2002), pp. 18–22.
[8] Mark Menor and Kyungim Baek. “Relevance units machine for classification”. In: Biomedical Engineering and Informatics (BMEI), 2011 4th International Conference on. Vol. 4. IEEE. 2011, pp. 2295–2299.
[9] Björn-Helge Mevik and Ron Wehrens. “The pls package: principal component and partial least squares regression in R”. In: Journal of Statistical Software 18.2 (2007), pp. 1–24.
[10] Carolin Strobl, James Malley, and Gerhard Tutz. “An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests.” In: Psychological Methods 14.4 (2009), p. 323.
[11] Stephen Tyree et al. “Parallel boosted regression trees for web search ranking”. In: Proceedings of the 20th International Conference on World Wide Web. ACM. 2011, pp. 387–396.