SlideShare a Scribd company logo
1 of 6
Download to read offline
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 03 | Mar 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3869
Supervised Learning Classification Algorithms Comparison
Aditya Singh Rathore
B.Tech, J.K. Lakshmipat University
-------------------------------------------------------------***---------------------------------------------------------
Abstract: Under supervised machine learning, classification tasks are one of the most important tasks as a part of data
analysis. It gives a lot of actionable insights to data scientists after using different machine learning algorithms. For study
purpose the Titanic dataset has been chosen. Through this paper, an effort has been made to evaluate different classification
models’ performance on the dataset using the scikit-learn library. The classifiers chosen are Logistic Regression, K-Nearest
Neighbor, Decision Tree, Random Forest, Support Vector Machine and Naïve Bayes classifier. In the end evaluation
parameters like confusion matrix, precision, recall, f1-score and accuracy have been discussed.
Key Words: Numpy, Pandas, classification, knn, logistic regression, decision tree, random forest, Naïve Bayes, SVM,
sklearn.
1. INTRODUCTION:
Machine learning has been getting a lot of attention for the past few years because it holds the key to solve the problems of
the future. In the 21st Century, we are living in a world where data is everywhere and as we go by of our day we are
generating data, be it text messaging or simply walking down the street and different antennas picking GPS signals. The
majority of practical machine learning uses supervised learning. In supervised learning, the machine learning models are
first trained over dataset with associated correct labels. The objective of training the models first is to create a very good
mapping between the independent and dependent variables, so that when that model is run on a new dataset, it can
predict the output variable. Classification is a type of problem under supervised learning. A classification problem is when
the output is a category and when given one or more inputs a classification model will try to predict the value based on
what relation it developed between the input and output. Below are the classifiers chosen for evaluation:
1.1 Logistic Regression:
This type of technique proves to be more appropriate when the dependent variable/output is in the form of binary output,
i.e. 1 or 0. This technique is mainly used when one wants to see the relationship between the dependent variable and one
or more independent variable provided that the dataset is linear. The output of this technique gives out a binary output
which is a better type of output than absolute values as it elucidates the output. In order to limit our output between 0 and
1 we will be using Sigmoid Function.
Below is the mathematical representation of Sigmoid Function:
f(x)= 1/(1+e^-x)
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 03 | Mar 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3870
1.2 K- Nearest Neighbor:
This type of technique is one of the simplest classification algorithm. It first stores the entire data in its training phase as
an array of data points. . Whenever a prediction is required for a new data it searches through the entire training dataset
for a number of most similar instances and the data with most similar instance is returned as prediction. This number is
defined by an integer ‘k’. The k denotes the number of neighbours which are the determining class of the test data. For
example, if k=3, then the labels of the 3 closest instances are checked and the most common label is assigned to the test
data. The distance measured from test data to the closest instance is the least distance measure. This can be achieved using
various measuring tools such as Euclidean Distance. The Euclidean distance between point p and q is calculated as
below:
1.3 Decision Tree:
A decision tree is a graphical representation of all possible solutions to a decision based on certain conditions. It consists of
a Root/Parent node which is the base node or the first node of a tree. Then comes the splitting which is basically dividing
the Root/Child nodes into different parts on a certain condition. These parts are called the Branch/ Sub Tree. To remove
unwanted branches from the tree we make use of Pruning. The last node is called the leaf node when the data cannot be
segregated to further levels. We will use CART (Classification and Regression Tree) algorithm to our dataset. Below are
some terminologies used by CART algorithm in order to select the root node :
a) Gini Index: It is the measure of impurity used to build a decision tree
b) Information Gain: It is the decrease in entropy after a dataset is split on the basis of a criteria.
c) Reduction in Variance: The split with lower variance is selected as a criteria.
d) Chi Square: It is the measure of statistical significance between the differences between sub-nodes and parent
node.
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 03 | Mar 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3871
1.4 Random Forest:
This classifier is one of the most used algorithms today because it can be used for classification as well as regression. The
main problem with decision trees is that it can over fit the data. One solution to prevent this is to create multiple decision
trees on random subsamples of the data. Now when a new data point is to be predicted then that will be computed by all
decision trees and a majority vote will be calculated. This vote will determine the class of the data. This in turns gives
higher accuracy when compared to decision tree alone. More decision trees directly correlates with more accuracy.
1.5 Support Vector Machine (SVM):
This classifier uses a hyperplane/decision boundary that separates the classes. This hyperplane is positioned in
such a way that it maximises the margin i.e. the space between the decision boundary and the points in each of
the classes that are closest to the decision boundary. These points are called Support Vectors. A hyperplane is a
decision surface that splits the space into 2 parts i.e. it is a binary classifier. This hyperplane in R^n space is an
n-1 dimensional subspace.
SVMs can easily be used for linear classification. However, in order to use SVM for non-linear classification once can use a
kernel trick with which the data is transformed in another dimension to determine the appropriate position of hyperplane.
1.6 Naïve Bayes:
This classifier is used primarily for text classification which generally involves training on high dimensional datasets. This
classifier is relatively faster than other algorithms in making predictions. It is called “naïve” because it makes an
assumption that a feature in the dataset is independent of the occurrence of other features i.e. it is conditionally
independent. This classifier is based on the Bayes Theorem which states that the probability of an event Y given X is equal
to the probability of the event X given Y multiplied by the probability of X upon probability of Y.
P(X|Y) =(P(Y|X) ⋅ P(X))/P(Y)
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 03 | Mar 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3872
2. METHODOLOGY:
2.1 Dataset:
The data has been split into two groups:
a) training set (train.csv)
b) test set (test.csv)
The training set is used to build and train the machine learning models. The test set is used to see how well the model
performed on unseen data. We will use the trained models to predict whether or not the passengers in the test set
survived the sinking of the Titanic.
Below is the original structure of the dataset:
Data Dictionary:
 Survived: 0 = No, 1 = Yes
 pclass: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
 sibsp: # of siblings / spouses aboard the Titanic
 parch: # of parents / children aboard the Titanic
 ticket: Ticket number
 cabin: Cabin number
 embarked: Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton
After pre-processing of the data and feature engineering, below is the dataset we would be finally be working with:
2.2 Evaluation Parameters:
2.2.1 Confusion Matrix:
On a classification problem a confusion matrix gives out a summary of predicted results. The basic idea behind it is to
count the number of times the instances of class A are classified as class B.
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 03 | Mar 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3873
Below is the sample confusion matrix:
Pridicted Yes Pridicted No
Actual Yes True Positive False Negative
Actual No False Positive True Negative
2.2.2 Precision:
It is the ratio of total number of true positives divided by total sum of true positives and false positives. In summary it
describes how many of the returned hits were true positive i.e. how many of the found were correct hits.
Precision = True Positives / (True Positives + False Positives)
2.2.3 Recall:
It is the ratio of the total number of true positives divided by total sum of the true positives and the false negatives. It is
also called sensitivity. In summary it describes how many of the true positives were recalled (found), i.e. how many of the
correct hits were also found.
Recall = True Positives / (True Positives + False Negatives)
2.2.4 F1- Score:
It is simply the weighted average of precision and recall wherein the best value for a model is 1 and the worst value is 0.
Below is the formula to calculate the same:
F1-score = 2 * (precision * recall) / (precision + recall)
2.3 Experimental Results:
Below are the experimental results of different Classifiers:
Classifier Class Classification Report Confusion
Matrix
Precision Recall f1-score Accuracy
Logistic
Regression
0 0.81 0.86 0.84 78.90% [[120 19]
[ 28 56]]1 0.75 0.67 0.7
KNN 0 0.73 0.83 0.77 69.90% [[115 24]
[ 43 41]]1 0.63 0.49 0.55
Decision Tree 0 0.84 0.78 0.81 77.17% [[108 31]
[ 20 64]]1 0.67 0.76 0.72
Random Forest 0 0.87 0.88 0.88 84.30% [[123 16]
[ 19 65]]1 0.8 0.77 0.79
SVM 0 0.73 0.76 0.74 67.20% [[106 33]
[ 40 44]]1 0.57 0.52 0.55
Naïve Bayes 0 0.83 0.83 0.83 78.47% [[115 24]
[ 24 60]]1 0.71 0.71 0.71
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 03 | Mar 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3874
3. CONCLUSION:
In this paper we experimented with 6 types of classifiers on the Titanic dataset. In order to evaluate the above the train
dataset was split into train & test dataset, since there is no true output for test dataset. From the above the random forest
seems to be most effective with highest precision and recall values and has the least number of false positive and false
negatives.
4. REFERENCES:
[1].P-N. Tan, M. Steinbach, V. Kumar, “Introduction to Data Mining,” Addison-Wesley Publishing, 2006.
[2]. Simm, J., Magrans de Abril, I., & Sugiyama, M., Tree-based ensemble multi-task learning method for classification and
regression, 2014.
[3] Takeuchi, I. & Sugiyama, M, Target neighbor consistent feature weighting for nearest neighbor classification.
[4] M. Kantardzic, “Data Mining: Concepts, Models, Methods, and Algorithms,” John Wiley & Sons Publishing, 2003.

More Related Content

What's hot

A tour of the top 10 algorithms for machine learning newbies
A tour of the top 10 algorithms for machine learning newbiesA tour of the top 10 algorithms for machine learning newbies
A tour of the top 10 algorithms for machine learning newbies
Vimal Gupta
 
Slides distancecovariance
Slides distancecovarianceSlides distancecovariance
Slides distancecovariance
Shrey Nishchal
 

What's hot (14)

THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
 
Pca analysis
Pca analysisPca analysis
Pca analysis
 
A tour of the top 10 algorithms for machine learning newbies
A tour of the top 10 algorithms for machine learning newbiesA tour of the top 10 algorithms for machine learning newbies
A tour of the top 10 algorithms for machine learning newbies
 
Slides distancecovariance
Slides distancecovarianceSlides distancecovariance
Slides distancecovariance
 
PCA (Principal component analysis)
PCA (Principal component analysis)PCA (Principal component analysis)
PCA (Principal component analysis)
 
CART Classification and Regression Trees Experienced User Guide
CART Classification and Regression Trees Experienced User GuideCART Classification and Regression Trees Experienced User Guide
CART Classification and Regression Trees Experienced User Guide
 
Principal component analysis and lda
Principal component analysis and ldaPrincipal component analysis and lda
Principal component analysis and lda
 
Principal Component Analysis (PCA) and LDA PPT Slides
Principal Component Analysis (PCA) and LDA PPT SlidesPrincipal Component Analysis (PCA) and LDA PPT Slides
Principal Component Analysis (PCA) and LDA PPT Slides
 
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
 
Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning Algorithms
 
AI: Belief Networks
AI: Belief NetworksAI: Belief Networks
AI: Belief Networks
 
84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b
 
KNN and ARL Based Imputation to Estimate Missing Values
KNN and ARL Based Imputation to Estimate Missing ValuesKNN and ARL Based Imputation to Estimate Missing Values
KNN and ARL Based Imputation to Estimate Missing Values
 
IRJET- Evaluation of Classification Algorithms with Solutions to Class Imbala...
IRJET- Evaluation of Classification Algorithms with Solutions to Class Imbala...IRJET- Evaluation of Classification Algorithms with Solutions to Class Imbala...
IRJET- Evaluation of Classification Algorithms with Solutions to Class Imbala...
 

Similar to IRJET- Supervised Learning Classification Algorithms Comparison

A02610104
A02610104A02610104
A02610104
theijes
 

Similar to IRJET- Supervised Learning Classification Algorithms Comparison (20)

IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification Algorithms
 
A02610104
A02610104A02610104
A02610104
 
IRJET- A Detailed Study on Classification Techniques for Data Mining
IRJET- A Detailed Study on Classification Techniques for Data MiningIRJET- A Detailed Study on Classification Techniques for Data Mining
IRJET- A Detailed Study on Classification Techniques for Data Mining
 
Water Quality Index Calculation of River Ganga using Decision Tree Algorithm
Water Quality Index Calculation of River Ganga using Decision Tree AlgorithmWater Quality Index Calculation of River Ganga using Decision Tree Algorithm
Water Quality Index Calculation of River Ganga using Decision Tree Algorithm
 
IRJET- Titanic Survival Analysis using Logistic Regression
IRJET-  	  Titanic Survival Analysis using Logistic RegressionIRJET-  	  Titanic Survival Analysis using Logistic Regression
IRJET- Titanic Survival Analysis using Logistic Regression
 
Review of Algorithms for Crime Analysis & Prediction
Review of Algorithms for Crime Analysis & PredictionReview of Algorithms for Crime Analysis & Prediction
Review of Algorithms for Crime Analysis & Prediction
 
AMAZON STOCK PRICE PREDICTION BY USING SMLT
AMAZON STOCK PRICE PREDICTION BY USING SMLTAMAZON STOCK PRICE PREDICTION BY USING SMLT
AMAZON STOCK PRICE PREDICTION BY USING SMLT
 
A Comparative Study for Anomaly Detection in Data Mining
A Comparative Study for Anomaly Detection in Data MiningA Comparative Study for Anomaly Detection in Data Mining
A Comparative Study for Anomaly Detection in Data Mining
 
AIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTIONAIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTION
 
Ijaems apr-2016-23 Study of Pruning Techniques to Predict Efficient Business ...
Ijaems apr-2016-23 Study of Pruning Techniques to Predict Efficient Business ...Ijaems apr-2016-23 Study of Pruning Techniques to Predict Efficient Business ...
Ijaems apr-2016-23 Study of Pruning Techniques to Predict Efficient Business ...
 
IRJET - An Overview of Machine Learning Algorithms for Data Science
IRJET - An Overview of Machine Learning Algorithms for Data ScienceIRJET - An Overview of Machine Learning Algorithms for Data Science
IRJET - An Overview of Machine Learning Algorithms for Data Science
 
Real Estate Investment Advising Using Machine Learning
Real Estate Investment Advising Using Machine LearningReal Estate Investment Advising Using Machine Learning
Real Estate Investment Advising Using Machine Learning
 
CASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGES
CASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGESCASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGES
CASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGES
 
Accident Prediction System Using Machine Learning
Accident Prediction System Using Machine LearningAccident Prediction System Using Machine Learning
Accident Prediction System Using Machine Learning
 
A detailed analysis of the supervised machine Learning Algorithms
A detailed analysis of the supervised machine Learning AlgorithmsA detailed analysis of the supervised machine Learning Algorithms
A detailed analysis of the supervised machine Learning Algorithms
 
Perfomance Comparison of Decsion Tree Algorithms to Findout the Reason for St...
Perfomance Comparison of Decsion Tree Algorithms to Findout the Reason for St...Perfomance Comparison of Decsion Tree Algorithms to Findout the Reason for St...
Perfomance Comparison of Decsion Tree Algorithms to Findout the Reason for St...
 
The pertinent single-attribute-based classifier for small datasets classific...
The pertinent single-attribute-based classifier  for small datasets classific...The pertinent single-attribute-based classifier  for small datasets classific...
The pertinent single-attribute-based classifier for small datasets classific...
 
IRJET- Expert Independent Bayesian Data Fusion and Decision Making Model for ...
IRJET- Expert Independent Bayesian Data Fusion and Decision Making Model for ...IRJET- Expert Independent Bayesian Data Fusion and Decision Making Model for ...
IRJET- Expert Independent Bayesian Data Fusion and Decision Making Model for ...
 
IRJET- A Comparative Research of Rule based Classification on Dataset using W...
IRJET- A Comparative Research of Rule based Classification on Dataset using W...IRJET- A Comparative Research of Rule based Classification on Dataset using W...
IRJET- A Comparative Research of Rule based Classification on Dataset using W...
 
IRJET- A Novel Gabor Feed Forward Network for Pose Invariant Face Recogni...
IRJET-  	  A Novel Gabor Feed Forward Network for Pose Invariant Face Recogni...IRJET-  	  A Novel Gabor Feed Forward Network for Pose Invariant Face Recogni...
IRJET- A Novel Gabor Feed Forward Network for Pose Invariant Face Recogni...
 

More from IRJET Journal

More from IRJET Journal (20)

TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
TUNNELING IN HIMALAYAS WITH NATM METHOD: A SPECIAL REFERENCES TO SUNGAL TUNNE...
 
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURESTUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
STUDY THE EFFECT OF RESPONSE REDUCTION FACTOR ON RC FRAMED STRUCTURE
 
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
A COMPARATIVE ANALYSIS OF RCC ELEMENT OF SLAB WITH STARK STEEL (HYSD STEEL) A...
 
Effect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil CharacteristicsEffect of Camber and Angles of Attack on Airfoil Characteristics
Effect of Camber and Angles of Attack on Airfoil Characteristics
 
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
A Review on the Progress and Challenges of Aluminum-Based Metal Matrix Compos...
 
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
Dynamic Urban Transit Optimization: A Graph Neural Network Approach for Real-...
 
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
Structural Analysis and Design of Multi-Storey Symmetric and Asymmetric Shape...
 
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
A Review of “Seismic Response of RC Structures Having Plan and Vertical Irreg...
 
A REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADASA REVIEW ON MACHINE LEARNING IN ADAS
A REVIEW ON MACHINE LEARNING IN ADAS
 
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
Long Term Trend Analysis of Precipitation and Temperature for Asosa district,...
 
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD ProP.E.B. Framed Structure Design and Analysis Using STAAD Pro
P.E.B. Framed Structure Design and Analysis Using STAAD Pro
 
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
A Review on Innovative Fiber Integration for Enhanced Reinforcement of Concre...
 
Survey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare SystemSurvey Paper on Cloud-Based Secured Healthcare System
Survey Paper on Cloud-Based Secured Healthcare System
 
Review on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridgesReview on studies and research on widening of existing concrete bridges
Review on studies and research on widening of existing concrete bridges
 
React based fullstack edtech web application
React based fullstack edtech web applicationReact based fullstack edtech web application
React based fullstack edtech web application
 
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
A Comprehensive Review of Integrating IoT and Blockchain Technologies in the ...
 
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
A REVIEW ON THE PERFORMANCE OF COCONUT FIBRE REINFORCED CONCRETE.
 
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
Optimizing Business Management Process Workflows: The Dynamic Influence of Mi...
 
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic DesignMultistoried and Multi Bay Steel Building Frame by using Seismic Design
Multistoried and Multi Bay Steel Building Frame by using Seismic Design
 
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
Cost Optimization of Construction Using Plastic Waste as a Sustainable Constr...
 

Recently uploaded

"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
mphochane1998
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
ssuser89054b
 
Verification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxVerification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptx
chumtiyababu
 

Recently uploaded (20)

Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
Moment Distribution Method For Btech Civil
Moment Distribution Method For Btech CivilMoment Distribution Method For Btech Civil
Moment Distribution Method For Btech Civil
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
 
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARHAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
kiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadkiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal load
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
Computer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersComputer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to Computers
 
PE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and propertiesPE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and properties
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdf
 
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptxOrlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
 
Computer Networks Basics of Network Devices
Computer Networks  Basics of Network DevicesComputer Networks  Basics of Network Devices
Computer Networks Basics of Network Devices
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
Verification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxVerification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptx
 
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
 

IRJET- Supervised Learning Classification Algorithms Comparison

  • 1. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 03 | Mar 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3869 Supervised Learning Classification Algorithms Comparison Aditya Singh Rathore B.Tech, J.K. Lakshmipat University -------------------------------------------------------------***--------------------------------------------------------- Abstract: Under supervised machine learning, classification tasks are one of the most important tasks as a part of data analysis. It gives a lot of actionable insights to data scientists after using different machine learning algorithms. For study purpose the Titanic dataset has been chosen. Through this paper, an effort has been made to evaluate different classification models’ performance on the dataset using the scikit-learn library. The classifiers chosen are Logistic Regression, K-Nearest Neighbor, Decision Tree, Random Forest, Support Vector Machine and Naïve Bayes classifier. In the end evaluation parameters like confusion matrix, precision, recall, f1-score and accuracy have been discussed. Key Words: Numpy, Pandas, classification, knn, logistic regression, decision tree, random forest, Naïve Bayes, SVM, sklearn. 1. INTRODUCTION: Machine learning has been getting a lot of attention for the past few years because it holds the key to solve the problems of the future. In the 21st Century, we are living in a world where data is everywhere and as we go by of our day we are generating data, be it text messaging or simply walking down the street and different antennas picking GPS signals. The majority of practical machine learning uses supervised learning. In supervised learning, the machine learning models are first trained over dataset with associated correct labels. The objective of training the models first is to create a very good mapping between the independent and dependent variables, so that when that model is run on a new dataset, it can predict the output variable. Classification is a type of problem under supervised learning. A classification problem is when the output is a category and when given one or more inputs a classification model will try to predict the value based on what relation it developed between the input and output. Below are the classifiers chosen for evaluation: 1.1 Logistic Regression: This type of technique proves to be more appropriate when the dependent variable/output is in the form of binary output, i.e. 1 or 0. This technique is mainly used when one wants to see the relationship between the dependent variable and one or more independent variable provided that the dataset is linear. The output of this technique gives out a binary output which is a better type of output than absolute values as it elucidates the output. In order to limit our output between 0 and 1 we will be using Sigmoid Function. Below is the mathematical representation of Sigmoid Function: f(x)= 1/(1+e^-x)
  • 2. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 03 | Mar 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3870 1.2 K- Nearest Neighbor: This type of technique is one of the simplest classification algorithm. It first stores the entire data in its training phase as an array of data points. . Whenever a prediction is required for a new data it searches through the entire training dataset for a number of most similar instances and the data with most similar instance is returned as prediction. This number is defined by an integer ‘k’. The k denotes the number of neighbours which are the determining class of the test data. For example, if k=3, then the labels of the 3 closest instances are checked and the most common label is assigned to the test data. The distance measured from test data to the closest instance is the least distance measure. This can be achieved using various measuring tools such as Euclidean Distance. The Euclidean distance between point p and q is calculated as below: 1.3 Decision Tree: A decision tree is a graphical representation of all possible solutions to a decision based on certain conditions. It consists of a Root/Parent node which is the base node or the first node of a tree. Then comes the splitting which is basically dividing the Root/Child nodes into different parts on a certain condition. These parts are called the Branch/ Sub Tree. To remove unwanted branches from the tree we make use of Pruning. The last node is called the leaf node when the data cannot be segregated to further levels. We will use CART (Classification and Regression Tree) algorithm to our dataset. Below are some terminologies used by CART algorithm in order to select the root node : a) Gini Index: It is the measure of impurity used to build a decision tree b) Information Gain: It is the decrease in entropy after a dataset is split on the basis of a criteria. c) Reduction in Variance: The split with lower variance is selected as a criteria. d) Chi Square: It is the measure of statistical significance between the differences between sub-nodes and parent node.
  • 3. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 03 | Mar 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3871 1.4 Random Forest: This classifier is one of the most used algorithms today because it can be used for classification as well as regression. The main problem with decision trees is that it can over fit the data. One solution to prevent this is to create multiple decision trees on random subsamples of the data. Now when a new data point is to be predicted then that will be computed by all decision trees and a majority vote will be calculated. This vote will determine the class of the data. This in turns gives higher accuracy when compared to decision tree alone. More decision trees directly correlates with more accuracy. 1.5 Support Vector Machine (SVM): This classifier uses a hyperplane/decision boundary that separates the classes. This hyperplane is positioned in such a way that it maximises the margin i.e. the space between the decision boundary and the points in each of the classes that are closest to the decision boundary. These points are called Support Vectors. A hyperplane is a decision surface that splits the space into 2 parts i.e. it is a binary classifier. This hyperplane in R^n space is an n-1 dimensional subspace. SVMs can easily be used for linear classification. However, in order to use SVM for non-linear classification once can use a kernel trick with which the data is transformed in another dimension to determine the appropriate position of hyperplane. 1.6 Naïve Bayes: This classifier is used primarily for text classification which generally involves training on high dimensional datasets. This classifier is relatively faster than other algorithms in making predictions. It is called “naïve” because it makes an assumption that a feature in the dataset is independent of the occurrence of other features i.e. it is conditionally independent. This classifier is based on the Bayes Theorem which states that the probability of an event Y given X is equal to the probability of the event X given Y multiplied by the probability of X upon probability of Y. P(X|Y) =(P(Y|X) ⋅ P(X))/P(Y)
  • 4. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 03 | Mar 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3872 2. METHODOLOGY: 2.1 Dataset: The data has been split into two groups: a) training set (train.csv) b) test set (test.csv) The training set is used to build and train the machine learning models. The test set is used to see how well the model performed on unseen data. We will use the trained models to predict whether or not the passengers in the test set survived the sinking of the Titanic. Below is the original structure of the dataset: Data Dictionary:  Survived: 0 = No, 1 = Yes  pclass: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd  sibsp: # of siblings / spouses aboard the Titanic  parch: # of parents / children aboard the Titanic  ticket: Ticket number  cabin: Cabin number  embarked: Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton After pre-processing of the data and feature engineering, below is the dataset we would be finally be working with: 2.2 Evaluation Parameters: 2.2.1 Confusion Matrix: On a classification problem a confusion matrix gives out a summary of predicted results. The basic idea behind it is to count the number of times the instances of class A are classified as class B.
  • 5. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 03 | Mar 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3873 Below is the sample confusion matrix: Pridicted Yes Pridicted No Actual Yes True Positive False Negative Actual No False Positive True Negative 2.2.2 Precision: It is the ratio of total number of true positives divided by total sum of true positives and false positives. In summary it describes how many of the returned hits were true positive i.e. how many of the found were correct hits. Precision = True Positives / (True Positives + False Positives) 2.2.3 Recall: It is the ratio of the total number of true positives divided by total sum of the true positives and the false negatives. It is also called sensitivity. In summary it describes how many of the true positives were recalled (found), i.e. how many of the correct hits were also found. Recall = True Positives / (True Positives + False Negatives) 2.2.4 F1- Score: It is simply the weighted average of precision and recall wherein the best value for a model is 1 and the worst value is 0. Below is the formula to calculate the same: F1-score = 2 * (precision * recall) / (precision + recall) 2.3 Experimental Results: Below are the experimental results of different Classifiers: Classifier Class Classification Report Confusion Matrix Precision Recall f1-score Accuracy Logistic Regression 0 0.81 0.86 0.84 78.90% [[120 19] [ 28 56]]1 0.75 0.67 0.7 KNN 0 0.73 0.83 0.77 69.90% [[115 24] [ 43 41]]1 0.63 0.49 0.55 Decision Tree 0 0.84 0.78 0.81 77.17% [[108 31] [ 20 64]]1 0.67 0.76 0.72 Random Forest 0 0.87 0.88 0.88 84.30% [[123 16] [ 19 65]]1 0.8 0.77 0.79 SVM 0 0.73 0.76 0.74 67.20% [[106 33] [ 40 44]]1 0.57 0.52 0.55 Naïve Bayes 0 0.83 0.83 0.83 78.47% [[115 24] [ 24 60]]1 0.71 0.71 0.71
  • 6. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 03 | Mar 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3874 3. CONCLUSION: In this paper we experimented with 6 types of classifiers on the Titanic dataset. In order to evaluate the above the train dataset was split into train & test dataset, since there is no true output for test dataset. From the above the random forest seems to be most effective with highest precision and recall values and has the least number of false positive and false negatives. 4. REFERENCES: [1].P-N. Tan, M. Steinbach, V. Kumar, “Introduction to Data Mining,” Addison-Wesley Publishing, 2006. [2]. Simm, J., Magrans de Abril, I., & Sugiyama, M., Tree-based ensemble multi-task learning method for classification and regression, 2014. [3] Takeuchi, I. & Sugiyama, M, Target neighbor consistent feature weighting for nearest neighbor classification. [4] M. Kantardzic, “Data Mining: Concepts, Models, Methods, and Algorithms,” John Wiley & Sons Publishing, 2003.