Caravan Insurance Data Mining Assignment
K6225 Knowledge Discovery and Data Mining

By
Sesagiri Raamkumar Aravind (G1101761F)
Thangavelu Muthu Kumaar (G1101765E)



                   Page 1 of 11
Table of Contents
1.0 Objective
2.0 Summary of Final Results
3.0 Exercise Lifecycle
   3.1 Understanding the objective of the exercise and its expectations
   3.2 Understanding the data dictionary of the data set
   3.3 Assigning appropriate measure values (Set/Range) for data fields
   3.4 Constructing first level models with Training dataset
       3.4.1 Logistic Regression
       3.4.2 Decision Trees
       3.4.3 Neural Networks
   3.5 Running the first level Models with Test data
   3.6 Performing bivariate analysis on training dataset
   3.7 Creating interaction variables based on results of Step 6
   3.8 Balancing the training data
   3.9 Constructing second level models with Training dataset
   3.10 Running the second level Models with Test dataset
   3.11 Constructing third level models by adding new interaction variables
   3.12 Running the third level models with Test dataset
   3.13 Final Results Interpretation




1.0 Objective
The objective of this data mining exercise is to find the best possible model for predicting, from a
customer signature, whether the customer will opt for caravan insurance (a mobile home policy). The
techniques used are logistic regression, decision trees and neural networks.



2.0 Summary of Final Results
The models built using Logistic Regression and Decision Tree achieved the highest accuracy, ahead of
the models built using Neural Network. The best models had an accuracy of 94%. The most interesting
finding is that the base models (built on the dataset as originally provided), without any
interaction variables or balancing, gave the best results. As expected, most models had higher
accuracy on the training dataset than on the test dataset. Cross-validation techniques such as
10-fold cross-validation were not used in this exercise; they could have given a more reliable
picture of the results.
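
As an illustration of the cross-validation that was not carried out, the following is a minimal
sketch of how a dataset could be split into k folds (k = 10 here). It is not part of the Clementine
stream; the function name `k_fold_indices` and the record count of 25 are hypothetical and chosen
only to keep the example small.

```python
import random


def k_fold_indices(n_records, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_records))
    # Shuffle once with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i,
    # so fold sizes differ by at most one record.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Hypothetical usage: 25 records split into 10 folds.
splits = list(k_fold_indices(25, k=10))
for train_idx, test_idx in splits:
    # A model would be trained on train_idx and scored on test_idx here.
    pass
```

Each record appears in exactly one test fold, so averaging the ten test accuracies gives an estimate
that depends less on a single train/test split.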



3.0 Exercise Lifecycle
The lifecycle of the complete data mining exercise comprises the following steps:

    1. Understanding the objective of the exercise and its expectations
    2. Understanding the data dictionary of the data set
    3. Assigning appropriate measure values (Set/Range) for data fields
    4. Constructing first level Models with Training dataset
    5. Running the first level Models with Test dataset
    6. Performing bivariate analysis
    7. Balancing the training data
    8. Constructing second level Models with Training dataset
    9. Dataset modification of Training dataset
    10. Running the second level Models with Test dataset
    11. Creating interaction variables based on results of Step 6
    12. Constructing third level Models with Training dataset
    13. Running the third level Models with Test dataset
    14. Final Results Interpretation




3.1 Understanding the objective of the exercise and its expectations
The first step in a data mining exercise is to understand the business objective. The business wants
to use its existing customer signatures to build a model that predicts the number of mobile home
policies. The model and the inferences drawn from it will be a precursor to a potential marketing
campaign targeting specific customer groups. The data mining techniques in scope for this exercise
are logistic regression, decision trees and neural networks.


3.2 Understanding the data dictionary of the data set
The data dictionary consists of 86 variables with an equal mix of socio-demographic and product
ownership data. A few ordinal variables need to be changed to numeric variables for build
efficiency. The socio-demographic variables are captured at zip-code level.


3.3 Assigning appropriate measure values (Set/Range) for data fields
In addition to the automatically assigned measures, the measure of the following variables was
manually changed to ‘Range’ in Clementine:

MAANTHUI: Number of houses
MGEMOMV: Avg size household
MGEMLEEF: Avg age
MGODRK: Roman catholic
PWAPART: Contribution private third party insurance

There is an academic view that socio-demographic variables should be converted to ‘Range’ variables
so that their values can be plotted conveniently on the logistic regression curve. The authors
initially retained the other variables as ‘Set’ variables in order to test this postulation at a
later stage.
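
To make the Set/Range distinction concrete, the sketch below shows how the same coded value could be
presented to a model either as a number (Range) or as a category (Set). This is an illustration of
the general idea only, not of Clementine's internal encoding; the helper names and the list of age
codes are assumptions.

```python
def encode_as_range(value):
    """Treat the code as a number: one input, ordering preserved."""
    return [float(value)]


def encode_as_set(value, categories):
    """Treat the code as a category: one 0/1 indicator per distinct value."""
    return [1.0 if value == c else 0.0 for c in categories]


# Hypothetical coded values for a banded field such as MGEMLEEF (Avg age).
age_codes = [1, 2, 3, 4, 5, 6]

as_range = encode_as_range(3)           # a single numeric input
as_set = encode_as_set(3, age_codes)    # six indicator inputs, one of them set
```

A Range field adds one coefficient to a logistic regression, while a Set field adds roughly one per
category, which is one reason the choice of measure can affect both build time and fit.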


3.4 Constructing first level models with Training dataset
The authors planned to arrive at the best model using a three-level approach. The models built in
the first level are crude models constructed directly on the dataset, without any new interaction
variables or data balancing. These models are the first benchmark against which subsequent
improvements are gauged. Models were built using the Logistic, C5.0 and Neural Net nodes.

3.4.1 Logistic Regression
No changes were made to the settings of the Logistic Regression node, as all attributes appeared to
be optimal.




3.4.2 Decision Trees
The following changes were made to the C5 node in Expert mode:

Pruning Severity was set to 5.

‘Minimum records per child branch’ was changed to 5 from 2 as it was found to be the optimal number.
A value of 1 impaired the results, and the same could be said for values greater than 2.

‘Use Boosting’ was enabled so that more classifiers are created. The number of trials was set to 15
for the first level and changed to 5 for the second and third levels.




                                      Fig 7: C5 Model Attributes

3.4.3 Neural Networks
For the neural network, the RBFN method was selected first, but the model did not produce better
results. The final method selected was ‘Quick’. The number of hidden layers was set to 3 so that more
transformations could take place. The learning rates were initially increased marginally to check
for performance improvements, on the assumption that training was converging towards the global
minimum of the error surface. As the marginal increase in the alpha learning rate did not produce
significant improvements, it was increased sharply to 0.9 to escape a possible local minimum. The
final values are available in the screenshot below.




Fig 8: Neural Network Attributes




3.5 Running the first level Models with Test data
The trained first level models were run on the test dataset, and the results of the different
modelling techniques were compared using the Analysis and Evaluation nodes. Logistic Regression and
Decision Tree both had the best accuracy rate of 94%. The Nagelkerke R-square value on the training
dataset was 16.7%. These results serve as the first level benchmark. Screenshots are provided below.
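
For readers unfamiliar with the Nagelkerke R-square, the sketch below computes it from the
log-likelihoods of an intercept-only model and a fitted logistic regression model, using the
standard definition (the Cox and Snell value divided by its maximum attainable value). The function
name and the inputs are hypothetical; the report's own log-likelihoods are not available, so no
attempt is made to reproduce the 16.7% figure.

```python
import math


def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke R-square.

    ll_null  -- log-likelihood of the intercept-only model
    ll_model -- log-likelihood of the fitted model
    n        -- number of records used to fit the model
    """
    # Cox and Snell R-square: 1 - (L_null / L_model) ** (2 / n)
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    # Largest value Cox and Snell can reach: 1 - L_null ** (2 / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_cox_snell


# Hypothetical log-likelihoods, for illustration only.
r2 = nagelkerke_r2(ll_null=-100.0, ll_model=-80.0, n=200)
```

The value is 0 when the fitted model is no better than the intercept-only model and 1 when it fits
the training data perfectly.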




                            Fig 1: First Level Models Analysis Node Results


Fig 2: First Level Gain and Lift Chart


3.6 Performing bivariate analysis on training dataset
This step marks the start of the second level model building process. Bivariate analysis in
Clementine can be done using the Web node, which represents the strength of relationships between
the values of variables using thick and thin lines. The authors performed the analysis using both
the normal web and the directed web options of the Web node. The directed web had the CARAVAN
variable as the target, and all the other variables were placed in the dependent section. This
analysis was not helpful: relationships were present among many different values of the independent
variables and CARAVAN, so no significant inferences could be made. However, the normal web analysis
indicated strong relationships between customer type and customer subtype, making them a potential
candidate for an interaction variable.
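
The idea behind a web display can be sketched as a count of how often each pair of values occurs
together, with larger counts corresponding to thicker lines. The code below is an approximation of
that idea, not of the Web node itself. The field names MOSHOOFD (customer main type) and MOSTYPE
(customer subtype) are assumed from the public data dictionary, and the four records are invented.

```python
from collections import Counter


def cooccurrence(records, field_a, field_b):
    """Count how often each (value_a, value_b) pair occurs together."""
    return Counter((r[field_a], r[field_b]) for r in records)


# Invented records, for illustration only.
records = [
    {"MOSHOOFD": 8, "MOSTYPE": 33},
    {"MOSHOOFD": 8, "MOSTYPE": 33},
    {"MOSHOOFD": 8, "MOSTYPE": 35},
    {"MOSHOOFD": 3, "MOSTYPE": 10},
]

links = cooccurrence(records, "MOSHOOFD", "MOSTYPE")
# The pair with the highest count would be drawn with the thickest line.
strongest_pair, strongest_count = links.most_common(1)[0]
```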


3.7 Creating interaction variables based on results of Step 6
The indication from the previous step was implemented in this step by creating two interaction
variables. The first interaction variable, Derive1 (aka customer lifestyle reflector), combines the
parent variables Customer Type and Subtype. The second interaction variable, Derive3 (aka Combined
Age-Income Factor), combines the parent variables Avg age and Avg Income. This variable was created
based on the authors’ intuition that it would help build a better model. Screenshots are provided
below for reference.
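
One possible way to derive such variables is sketched below. The exact expressions used in the
Derive nodes appear only in the screenshot, so both formulas here are assumptions: the first variable
joins the two customer codes into one label, and the second multiplies the age and income codes. The
field names MOSHOOFD, MOSTYPE and MINKGEM are also assumed from the public data dictionary.

```python
def add_interactions(record):
    """Return a copy of the record with two assumed interaction fields added."""
    out = dict(record)
    # Assumed form of Derive1: main type and subtype joined into one label.
    out["Derive1"] = f'{record["MOSHOOFD"]}_{record["MOSTYPE"]}'
    # Assumed form of Derive3: product of the age and income codes.
    out["Derive3"] = record["MGEMLEEF"] * record["MINKGEM"]
    return out


# Invented customer signature, for illustration only.
customer = {"MOSHOOFD": 8, "MOSTYPE": 33, "MGEMLEEF": 3, "MINKGEM": 4}
enriched = add_interactions(customer)
```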




Fig 3: Derived Variables


3.8 Balancing the training data
The training dataset contains relatively few positive cases, i.e. CARAVAN=1. Models constructed
using this dataset may therefore not be the best predictors of positive cases. Clementine provides a
feature called Balancing that creates more signatures based on conditions, increasing the overall
share of positive cases in the dataset. The authors chose a factor of 6 to give the dataset a
somewhat better value share (72%:28%).
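
The effect of a balancing factor can be checked with a short calculation. The sketch below repeats
every positive record six times and leaves negative records untouched, which is one simple way to
realise such a factor. The record counts (5,474 negatives and 348 positives) are an assumption taken
from the public CoIL 2000 training file and are not stated in this report.

```python
def balance(records, target, factor):
    """Repeat every positive record `factor` times; keep negatives once."""
    balanced = []
    for r in records:
        copies = factor if r[target] == 1 else 1
        balanced.extend([r] * copies)
    return balanced


def positive_share(records, target):
    """Fraction of records whose target equals 1."""
    return sum(r[target] for r in records) / len(records)


# Assumed class counts for the training dataset (not given in the report).
train = [{"CARAVAN": 0}] * 5474 + [{"CARAVAN": 1}] * 348
boosted = balance(train, "CARAVAN", factor=6)

share_before = positive_share(train, "CARAVAN")
share_after = positive_share(boosted, "CARAVAN")
```

Under these assumed counts the positive share rises from about 6% to about 28%, which agrees with
the 72%:28% split quoted above.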




                                            Fig 4: Balancing




3.9 Constructing second level models with Training dataset
The second level models were built with the balanced dataset. The attributes of the nodes were
maintained from the first level except for the C5 node, in which the number of boosting trials was
changed to 5 as the software did not have enough memory to run with a value of 15.


3.10 Running the second level Models with Test dataset
The trained second level models were run on the test dataset, and the results of the different
modelling techniques were compared using the Analysis and Evaluation nodes. The Decision Tree model
came out with the highest accuracy of 90.48%. These results were maintained as the second level
benchmark. Screenshots are provided below.




                          Fig 5: Second Level Models Analysis Node Results




Fig 6: Second Level Models Gain and Lift Charts


3.11 Constructing third level models by adding new interaction variables
The third level model building step differs from the second level in terms of data fields: the two
new interaction variables Derive 1 and Derive 2 were added. No additional balancing was done.


3.12 Running the third level models with Test dataset
The trained third level models were run on the test dataset, and the results of the different
modelling techniques were compared using the Analysis and Evaluation nodes. The Neural Network model
gave the best accuracy rate at 90.1%.




                           Fig 9: Third Level Models Analysis Node Results



Fig 10: Third Level Models Gain and Lift Charts


3.13 Final Results Interpretation
The table below compares the output of the Analysis node from all three levels. There is no marked
improvement from one level to the next. It is inferred that building the model after balancing the
training dataset does not produce a better model.

In Level 1 (base dataset): highest accuracy is generated by both Decision Tree and Logistic
Regression

In Level 2 (models built with balanced dataset): highest accuracy is generated by Decision Tree

In Level 3 (models built with balanced dataset and interaction variables): highest accuracy is
generated by Neural Network

Technique             Factor                    1st level   2nd level   3rd level
Logistic Regression   Test dataset Accuracy     94.00%      87.50%      87.50%
Decision Tree         Test dataset Accuracy     94.00%      90.48%      89.75%
Neural Network        Test dataset Accuracy     92.05%      90.12%      90.10%
Combined              Agreement with CARAVAN    94.52%      95.11%      94.80%

                                Table 1: Level Comparison
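
When reading accuracy figures on a dataset with few positive cases, it can help to compare them with
the accuracy of always predicting the majority class. The sketch below computes that baseline. The
test set composition used here (3,762 negatives and 238 positives) is an assumption taken from the
public CoIL 2000 evaluation file and is not stated in this report.

```python
def accuracy(actual, predicted):
    """Fraction of records where the prediction matches the actual value."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def majority_baseline(actual):
    """Accuracy obtained by predicting the most frequent class for every record."""
    majority = max(set(actual), key=actual.count)
    return accuracy(actual, [majority] * len(actual))


# Assumed class counts for the test dataset (not given in the report).
actual = [0] * 3762 + [1] * 238
baseline = majority_baseline(actual)
print(f"{baseline:.2%}")
```

If these assumed counts hold, the baseline is close to the first level accuracies in Table 1, which
suggests that gain and lift charts may be more informative than accuracy alone for comparing the
models.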




 
Drug Information Services- DIC and Sources.
Drug Information Services- DIC and Sources.Drug Information Services- DIC and Sources.
Drug Information Services- DIC and Sources.raviapr7
 
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...Nguyen Thanh Tu Collection
 
General views of Histopathology and step
General views of Histopathology and stepGeneral views of Histopathology and step
General views of Histopathology and stepobaje godwin sunday
 
3.21.24 The Origins of Black Power.pptx
3.21.24  The Origins of Black Power.pptx3.21.24  The Origins of Black Power.pptx
3.21.24 The Origins of Black Power.pptxmary850239
 

Dernier (20)

Benefits & Challenges of Inclusive Education
Benefits & Challenges of Inclusive EducationBenefits & Challenges of Inclusive Education
Benefits & Challenges of Inclusive Education
 
How to Show Error_Warning Messages in Odoo 17
How to Show Error_Warning Messages in Odoo 17How to Show Error_Warning Messages in Odoo 17
How to Show Error_Warning Messages in Odoo 17
 
Prescribed medication order and communication skills.pptx
Prescribed medication order and communication skills.pptxPrescribed medication order and communication skills.pptx
Prescribed medication order and communication skills.pptx
 
Prelims of Kant get Marx 2.0: a general politics quiz
Prelims of Kant get Marx 2.0: a general politics quizPrelims of Kant get Marx 2.0: a general politics quiz
Prelims of Kant get Marx 2.0: a general politics quiz
 
HED Office Sohayok Exam Question Solution 2023.pdf
HED Office Sohayok Exam Question Solution 2023.pdfHED Office Sohayok Exam Question Solution 2023.pdf
HED Office Sohayok Exam Question Solution 2023.pdf
 
How to Manage Cross-Selling in Odoo 17 Sales
How to Manage Cross-Selling in Odoo 17 SalesHow to Manage Cross-Selling in Odoo 17 Sales
How to Manage Cross-Selling in Odoo 17 Sales
 
In - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptxIn - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptx
 
Finals of Kant get Marx 2.0 : a general politics quiz
Finals of Kant get Marx 2.0 : a general politics quizFinals of Kant get Marx 2.0 : a general politics quiz
Finals of Kant get Marx 2.0 : a general politics quiz
 
How to Make a Field read-only in Odoo 17
How to Make a Field read-only in Odoo 17How to Make a Field read-only in Odoo 17
How to Make a Field read-only in Odoo 17
 
Easter in the USA presentation by Chloe.
Easter in the USA presentation by Chloe.Easter in the USA presentation by Chloe.
Easter in the USA presentation by Chloe.
 
Patient Counselling. Definition of patient counseling; steps involved in pati...
Patient Counselling. Definition of patient counseling; steps involved in pati...Patient Counselling. Definition of patient counseling; steps involved in pati...
Patient Counselling. Definition of patient counseling; steps involved in pati...
 
Maximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdf
Maximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdfMaximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdf
Maximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdf
 
What is the Future of QuickBooks DeskTop?
What is the Future of QuickBooks DeskTop?What is the Future of QuickBooks DeskTop?
What is the Future of QuickBooks DeskTop?
 
Personal Resilience in Project Management 2 - TV Edit 1a.pdf
Personal Resilience in Project Management 2 - TV Edit 1a.pdfPersonal Resilience in Project Management 2 - TV Edit 1a.pdf
Personal Resilience in Project Management 2 - TV Edit 1a.pdf
 
The Singapore Teaching Practice document
The Singapore Teaching Practice documentThe Singapore Teaching Practice document
The Singapore Teaching Practice document
 
Practical Research 1 Lesson 9 Scope and delimitation.pptx
Practical Research 1 Lesson 9 Scope and delimitation.pptxPractical Research 1 Lesson 9 Scope and delimitation.pptx
Practical Research 1 Lesson 9 Scope and delimitation.pptx
 
Drug Information Services- DIC and Sources.
Drug Information Services- DIC and Sources.Drug Information Services- DIC and Sources.
Drug Information Services- DIC and Sources.
 
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
 
General views of Histopathology and step
General views of Histopathology and stepGeneral views of Histopathology and step
General views of Histopathology and step
 
3.21.24 The Origins of Black Power.pptx
3.21.24  The Origins of Black Power.pptx3.21.24  The Origins of Black Power.pptx
3.21.24 The Origins of Black Power.pptx
 

Caravan insurance data mining prediction models

1.0 Objective
The objective of this data mining exercise is to find the best possible model to predict whether a customer (signature) will opt for caravan insurance (mobile home policy) or not. The techniques used are logistic regression, decision trees and neural networks.

2.0 Summary of Final Results
The models built using Logistic Regression and Decision Tree achieved the highest accuracy, ahead of the models built using Neural Network. The best model had an accuracy of 94%. The most interesting finding is that the base model (built on the data as originally provided), without any interaction variables or balancing, gave the best results. As expected, most models had higher accuracy on the training dataset than on the test dataset. Cross-validation techniques such as 10-fold cross-validation were not used in this exercise; they could have delineated the results even further.

3.0 Exercise Lifecycle
The lifecycle of the complete data mining exercise comprises the following steps:
1. Understanding the objective of the exercise and its expectations
2. Understanding the data dictionary of the data set
3. Assigning appropriate measure values (Set/Range) for data fields
4. Constructing first level Models with Training dataset
5. Running the first level Models with Test dataset
6. Performing bivariate analysis
7. Balancing the training data
8. Constructing second level Models with Training dataset
9. Dataset modification of Training dataset
10. Running the second level Models with Test dataset
11. Creating interaction variables based on results of Step 6
12. Constructing third level Models with Training dataset
13. Running the third level Models with Test dataset
14. Final Results Interpretation
3.1 Understanding the objective of the exercise and its expectations
The first and foremost step in a data mining exercise is to understand the business objective. The business wants to use its existing customer signatures to build a predictive model for predicting the number of mobile home policies. The model construction and its inference will be a precursor to a potential marketing campaign that targets specific customer groups. The data mining techniques in the scope of this exercise are logistic regression, decision trees and neural networks.

3.2 Understanding the data dictionary of the data set
The data dictionary consists of 86 variables with an equal mix of socio-demographic and product ownership data. There are a few ordinal variables that need to be changed to numeric variables for build efficiency. The socio-demographic variables are captured at zip-code level.

3.3 Assigning appropriate measure values (Set/Range) for data fields
Apart from the automatically assigned measures, the measure of the following variables was manually changed to 'Range' in Clementine:

MAANTHUI    Number of houses
MGEMOMV     Avg size household
MGEMLEEF    Avg age
MGODRK      Roman catholic
PWAPART     Contribution private third party insurance

There is an academic insight that socio-demographic variables should be converted to 'Range' variables so that their values can be conveniently plotted on the logistic regression curve. The authors initially retained these variables as 'Set' variables in order to test this postulation at a later stage.

3.4 Constructing first level models with Training dataset
The authors planned to arrive at the best model using a three-level approach. The models built in the first level are crude models constructed directly on the data set, without any new interaction variables or data balancing. These models are the first benchmark against which subsequent improvements are gauged. Models were built using the Logistic, C5.0 and Neural Net nodes.

3.4.1 Logistic Regression
No changes were made to the Logistic Regression node, as all attributes were seemingly optimal.
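To make the 'Set' versus 'Range' distinction from section 3.3 concrete for logistic regression, here is a minimal sketch in plain Python. It is not Clementine's implementation; the function names and the coded levels 1 to 6 are hypothetical and chosen only for illustration. A 'Range' field enters the model as one numeric input (one coefficient), while a 'Set' field is expanded into one indicator per category.

```python
def encode_range(value):
    """Treat a coded value as a single numeric ('Range') input."""
    return [float(value)]


def encode_set(value, categories):
    """Treat a coded value as categorical ('Set'): one 0/1 indicator per category."""
    return [1.0 if value == category else 0.0 for category in categories]


# Hypothetical coded levels for a field such as MGEMLEEF (Avg age).
age_levels = [1, 2, 3, 4, 5, 6]

print(encode_range(3))            # one numeric column
print(encode_set(3, age_levels))  # six indicator columns, one of them set
```

The 'Range' encoding assumes the effect on the log-odds changes steadily with the code, whereas the 'Set' encoding lets every level have its own effect at the cost of more parameters.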
3.4.2 Decision Trees
The following changes were made to the C5 node under Expert mode:
- Pruning Severity was set to 5.
- 'Minimum records per child branch' was changed to 5 from 2 as it was found to be the optimal number. Value 1 impaired the results and the same could be said for values greater than 2.
- The 'Use Boosting' option was enabled so that more classifiers are created. The value was set to 15 for the first level and changed to 5 for the second and third levels.

Fig 7: C5 Model Attributes

3.4.3 Neural Networks
For the neural networks, the RBFN method was selected first, but the model did not produce better results. The final method selected was 'Quick'. The number of hidden layers was set to 3 so that more transformations can take place. The learning rates were initially increased marginally to check for performance improvements, on the assumption that training was converging towards the global minimum (the "globally consistent depression") of the network's learning curve. As the marginal increase of the alpha learning rate did not produce significant results, it was increased dramatically to 0.9 in order to escape a possible local minimum. The final values are available in the screenshot below.
Fig 8: Neural Network Attributes

3.5 Running the first level Models with Test data
The trained first level models were run with the test dataset, and the results of the different modelling techniques were compared using the Analysis and Evaluation nodes. Logistic Regression and Decision Tree both had the best accuracy rate of 94%. The Nagelkerke R-square value with the training dataset was 16.7%. These results are maintained as the first level benchmark. Screenshots are provided below.

Fig 1: First Level Models Analysis Node Results
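For readers unfamiliar with the Nagelkerke R-square quoted in section 3.5, the sketch below shows the standard definition: the Cox and Snell R-square rescaled by its maximum attainable value. The function name and the log-likelihood figures in the usage lines are hypothetical; the report does not give the log-likelihoods behind its 16.7% value.

```python
import math


def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke R-square from the log-likelihoods of the intercept-only
    model (ll_null) and the fitted model (ll_model) on n records."""
    # Cox and Snell R-square: 1 - (L_null / L_model) ** (2 / n)
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    # Largest value Cox and Snell can reach for this data: 1 - L_null ** (2 / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_cox_snell


# Hypothetical log-likelihoods, for illustration only.
print(nagelkerke_r2(ll_null=-100.0, ll_model=-80.0, n=500))
```

A value of 0 means the fitted model is no better than the intercept-only model, and 1 means a perfect fit, so a value such as 16.7% indicates modest explanatory power even when raw accuracy looks high.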
Fig 2: First Level Gain and Lift Chart

3.6 Performing bivariate analysis on training dataset
This step marks the start of the second level model building process. Bivariate analysis in Clementine can be done using the Web node, which represents the relationships between the values of variables using thick and thin lines. The authors performed the analysis using both the normal web and the directed web options of the Web node. The directed web had the CARAVAN variable as the target, and all the other variables were put in the dependent section. This analysis was not helpful: relationships appeared between many different values of the independent variables and CARAVAN, so no significant inferences could be made. However, the normal web analysis indicated a strong relationship between the customer type and customer subtype, making them a potential candidate for an interaction variable.

3.7 Creating interaction variables based on results of Step 6
The indication from the last step was implemented in this step by creating two interaction variables. The first interaction variable, Derive1 (aka customer lifestyle reflector), combines the parent variables Customer Type and Subtype. The second interaction variable, Derive3 (aka Combined Age-Income Factor), combines the parent variables Avg age and Avg Income. This variable was created based on the authors' intuition that it would help build a better model. Screenshots are provided below for reference.
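The idea behind the derived fields in section 3.7 can be sketched outside Clementine as follows. This is only an illustration: the report does not state the expression used in the Derive nodes, so simple concatenation of the parent values is an assumption, and the field names MOSHOOFD, MOSTYPE and MINKGEM as well as the record values are assumed for the example.

```python
def derive_interaction(record, parents, sep="_"):
    """Combine the values of the parent fields into one categorical value."""
    return sep.join(str(record[parent]) for parent in parents)


# Hypothetical customer signature; field names and values are assumptions.
customer = {"MOSHOOFD": 8, "MOSTYPE": 33, "MGEMLEEF": 2, "MINKGEM": 4}

# Customer type combined with customer subtype.
print(derive_interaction(customer, ["MOSHOOFD", "MOSTYPE"]))
# Average age combined with average income.
print(derive_interaction(customer, ["MGEMLEEF", "MINKGEM"]))
```

Each distinct pair of parent values becomes its own category, so a model can pick up effects that depend on the combination and not on either field alone.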
Fig 3: Derived Variables

3.8 Balancing the training data
It was noticed that positive cases (CARAVAN = 1) are poorly represented in the training dataset. Models constructed using this dataset may therefore not be the best predictors of positive cases. Clementine provides a feature called Balancing that creates more signatures based on conditions, which increases the overall share of positive cases in the dataset. The authors chose a factor of 6 to bring the value share to a somewhat better 72%:28%.

Fig 4: Balancing
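The arithmetic behind the 72%:28% share in section 3.8 can be checked with a short sketch. The class counts used here (5,474 negative and 348 positive records) are the figures commonly cited for this training set and are an assumption, since the report does not state them; the function name is hypothetical, and the sketch assumes that a factor of 6 multiplies the number of positive records by 6.

```python
def balanced_shares(n_negative, n_positive, factor):
    """Return (negative share, positive share) after boosting positives by factor."""
    boosted_positive = n_positive * factor
    total = n_negative + boosted_positive
    return n_negative / total, boosted_positive / total


# Assumed class counts for the training set; not taken from the report.
neg_share, pos_share = balanced_shares(n_negative=5474, n_positive=348, factor=6)
print(f"{neg_share:.0%} : {pos_share:.0%}")  # prints "72% : 28%"
```

Because only the training data is balanced, the test dataset keeps its original class mix, which is one reason accuracy on the test dataset can drop after balancing.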
3.9 Constructing second level models with Training dataset
The second level models were built with the balanced dataset. The node attributes were kept the same as in the first level, except for the C5 node, where the boosting value was changed to 5 because the software did not have enough memory to run with the value 15.

3.10 Running the second level Models with Test dataset
The trained second level models were run with the test dataset, and the results of the different modelling techniques were compared using the Analysis and Evaluation nodes. The Decision Tree model came out with the highest accuracy of 90.48%. These results were maintained as the second level benchmark. Screenshots are provided below.

Fig 5: Second Level Models Analysis Node Results
Fig 6: Second Level Models Gain and Lift Charts

3.11 Constructing third level models by adding new interaction variables
The third level model building step differs from the second level in terms of data fields: the two new interaction variables Derive 1 and Derive 2 were created. No additional balancing was done.

3.12 Running the third level models with Test dataset
The trained third level models were run with the test dataset, and the results of the different modelling techniques were compared using the Analysis and Evaluation nodes. The Neural Network model gave the best accuracy rate at 90.1%.

Fig 9: Third Level Models Analysis Node Results
Fig 10: Third Level Models Gain and Lift Charts

3.13 Final Results Interpretation
The table below compares the output of the Analysis node from all three levels. There is no marked improvement from one level to the next. It has been inferred that building the model after balancing the training dataset does not produce a better model.

In Level 1 (base dataset): highest accuracy is generated by both Decision Tree and Logistic Regression
In Level 2 (model built with balanced dataset): highest accuracy is generated by Decision Tree
In Level 3 (model built with balanced dataset and interaction variables): highest accuracy is generated by Neural Network

Technique            Factor                  1st level  2nd level  3rd level
Logistic Regression  Test dataset Accuracy   94.00%     87.50%     87.50%
Decision Tree        Test dataset Accuracy   94.00%     90.48%     89.75%
Neural Network       Test dataset Accuracy   92.05%     90.12%     90.10%
Combined             Agreement with CARAVAN  94.52%     95.11%     94.80%

Table 1: Level Comparison
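The per-level winners stated in section 3.13 can be reproduced from the accuracy figures in Table 1 with a few lines of Python. This is a small sketch for checking the comparison; the function name is hypothetical and the data is copied from the table.

```python
# Test dataset accuracy (%) per technique for levels 1, 2 and 3, from Table 1.
accuracy = {
    "Logistic Regression": [94.00, 87.50, 87.50],
    "Decision Tree": [94.00, 90.48, 89.75],
    "Neural Network": [92.05, 90.12, 90.10],
}


def best_per_level(table):
    """Return, for each level, the top accuracy and every technique reaching it."""
    n_levels = len(next(iter(table.values())))
    results = []
    for level in range(n_levels):
        top = max(scores[level] for scores in table.values())
        winners = [name for name, scores in table.items() if scores[level] == top]
        results.append((top, winners))
    return results


for level, (top, winners) in enumerate(best_per_level(accuracy), start=1):
    print(f"Level {level}: {top:.2f}% by {', '.join(winners)}")
```

Ties are reported together, which matches the first level, where Logistic Regression and Decision Tree share the top accuracy.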