SlideShare une entreprise Scribd logo
1  sur  4
Télécharger pour lire hors ligne
International Journal of Trend in Scientific Research and Development (IJTSRD)
Volume: 3 | Issue: 3 | Mar-Apr 2019 Available Online: www.ijtsrd.com e-ISSN: 2456 - 6470
@ IJTSRD | Unique Paper ID - IJTSRD21667 | Volume – 3 | Issue – 3 | Mar-Apr 2019 Page: 157
Preprocessing of Low Response Data for Predictive Modeling
Farzana Naz, Imaad Shafi, Md Kamre Alam
Al-Falah University, Dhouj, Haryana, India
How to cite this paper: Farzana Naz |
Imaad Shafi | Md Kamre Alam
"Preprocessing of Low Response Data
for Predictive Modeling" Published in
International Journal of Trend in
Scientific Research and Development
(ijtsrd), ISSN: 2456-
6470, Volume-3 |
Issue-3, April 2019,
pp.157-160, URL:
http://www.ijtsrd.co
m/papers/ijtsrd216
67.pdf
Copyright © 2019 by author(s) and
International Journal of Trend in
Scientific Research and Development
Journal. This is an Open Access article
distributed under
the terms of the
Creative Commons
Attribution License (CC BY 4.0)
(http://creativecommons.org/licenses/
by/4.0)
ABSTRACT
Fortrainingamodel,therawdatahavetogothroughvariouspreprocessingphaseslike
Cleaning,MissingValuesImputation,Dimension/Variablereduction,andSampling.These
steps are data and problem specific and affect the accuracy of the model at a very large
extent.
Forthecurrentscenario,wehave2.2Mrecordswith511variables.Thisdatawasusedina
DirectMailCampaignofsomeLifeInsuranceProductsandnowweknowwhichrecordhad
apositiveresponseforthecampaign.
#Rows(records):2,259,747
#Columns:511
#Rowswithpositiveresponse:2,739,
i.e.ResponseRate:0.1212%.
Thedatasetisnotcomplete,i.e.wehavetotakecareofmissingvalues.
KEYWORDS: Logistic Regression, Datasets, Principal component analysis, Variable
Reduction
INTRODUCTION
We have to build a model using this data, so that each record
is assigned a probability score. This score depicts the
likelihood of any person responding to the Mail Campaign.
Sorting the records according to the score helps us selecting
people whom to send the mail, which in turns reduces the
campaign cost.
Resources
All the steps were performed on 64-bit machines with 8
cores and 32GB of RAM running Ubuntu 12.04. R-Studio was
used to write and run R scripts.
Variable Reduction (Manually)
Analyzing the missing values shows that 166 variables have
more than 50% values missing. Filling the missing values
with any constant1 will
1 Typically, missing values are imputed with Mean, Median
and Mode
reduce the variance of these variables and therefore they
will not contribute significantly to the model. In addition,
considering these variables in model building will result in
increased number of dimensions, which requires more time
and memory to train the model. So, these 166 variables are
discarded before any further analysis [4].
In a very similar manner, variables having less than 10%
values missing were selected without any inspection. And
those having 10% to 50%valuesmissingwere examinedone
by one, for whether they are important2, before selecting.
This led the number of variables to 324.
Furthermore, variables like Names, Consumer IDs were
discarded and the count collapsed to 306.
Birthdays (Year, Month and Date) were replaced by age.
Similarly, date of last purchase was replaced by the total
months passed since last purchase. After all these steps,
number of variables into consideration was around 280.
Missing Values
First we tried to predict the missing entries by performing
FAMD3 and then reconstructing the originals from factors.
This didn’t work mainly due to following reasons4:
2 This step was based on intuition and discussion among
team members
3 Factor Analysis of Mixed Data. Similar to PCA for numeric
variables and MCA (Multiple Correspondence Analysis) for
categorical Variables
IJTSRD21667
International Journal of Trend in Scientific Research and Development (IJTSRD) @ www.ijtsrd.com eISSN: 2456-6470
@ IJTSRD | Unique Paper ID - IJTSRD21667 | Volume – 3 | Issue – 3 | Mar-Apr 2019 Page: 158
A. FAMD creates dummy variables foreach uniquevalueof
categorical variable, demanding more memory than the
provided 32GB.
B. To solve the memory problem, small sample was tried.
But now the problem was with the small number of
records corresponding to some categories of the
categorical variable. A small selected sample was not
able to include all categories, leading to constant
value(0)insome columns(dummy)makingitimpossible
to perform FAMD.
To avoid the problem, missing values of a column were
imputed with median of the available values in that column.
Missing values in columns having either 1 or NA (missing)
were imputed with 0. Mean could have been used instead of
median but we selected the later one due to two reasons:
A. Mean of a Boolean variablecolumn willbe arealnumber
B. For continuous variables like household area, there
were outliers which were distortingthenaturalposition
of mean.
Anyways, imputing median too will compromise the results
compared to if we had used any technique based on PCA,
FAMD or other predictive methods[3].
Categorical Variables
In the following figure, we can see that initiallythereare108
factor variables; but some of them were discarded during
manual variable.
top 15 components were selected on the basis of variance
chart
Figure 3: Variances of top 30 Principal Components
over 400K records
4 Can be solved on a machine with little more memory
using PCA
reduction step. Of the remaining, those having 2
categories/levels were simply converted to Boolean using
dummy variables for each category.
For remaining, which have more than 2categories,following
steps were taken:
A. Create dummies for each category to convert the
variable into Boolean. This expansion of 56 such
categorical variables resulted in an indicator matrix of
906 dummy columns.
B. Now the aim was to perform PCA on this indicator
matrix, but that couldn’t be done on whole 2.2Mrecords
due to resource constraints. To resolve the issue, the
PCA was performed onrandomlyselected400Krecords,
and
C. Remaining records were multiplied with the rotation
matrix for 15PCs to compute/predict5 the components
This way we got new data set with 266 columns containing
58 dummy columns corresponding to 29 categorical (with
two levels) variables, 15PCs and 222 original numeric
columns.
ZIP codes were excluded from the PC computation part, as it
had too many levels. A simple idea6 is to replace the ZIP with
Longitude and Latitude.[7]
Note: n-1 dummy columns are sufficient to represent any
categorical variables having n levels. So, instead of 58
dummy columns, 29 could have sufficed[6].
At this stage, we have preprocessed and clean data having
2.2M rows and 266 columns.
Preparation of Datasets
First of all, the preprocessed data was divided intotwoparts
randomly. First part (70%) for preparing training datasets
using various sampling methods, and the rest 30% left
untouched for testing purpose.
Let us call the first part DS_TRAIN and the second one
DS_TEST.
Both DS_TRAIN and DS_TEST were sampled randomly and
maintain the response rate (≈0.12%) similar to that in the
original raw data.
A model trained with such a low response data will fail to
predict the responded rows. To understand the situation, let
us suppose that we have a dataset of 10,000 records. This
dataset will contain only 12 positive responses. To maintain
the accuracy, model will learn to predict the negative
responses and even if it predicts all 10,000 rows as not
responded, its accuracy is
(
10000
10000
−12
)×100 = 99.88%
( If we replicate the responded rows to increase the data, this problem can be
avoided. But the model will be more optimistic,i.e. false positive predictions
will increase. But when it comes to assign a score to records rather than to
classify them as responded/not-responded, it does good.
5 Levels of a factor variable is the number of categories for
that variable
6 Factor class is a data type used in R to store categorical
variables.
International Journal of Trend in Scientific Research and Development (IJTSRD) @ www.ijtsrd.com eISSN: 2456-6470
@ IJTSRD | Unique Paper ID - IJTSRD21667 | Volume – 3 | Issue – 3 | Mar-Apr 2019 Page: 159
To increase the response rate in training dataset, two
different strategies were taken:
1. Three training datasets were prepared using stratified
sampling. Records with positive response in DS_TRAIN
were replicated via Simple Random Sampling with
Replacement to make the response rate to 10%, 15%
and 20%. Let us call these sets as DS_TRAIN1A,
DS_TRAIN1B and DS_TRAIN1C.
2. DS_TRAIN was divided in twoparts:DS_TRAIN_Rhaving
all records with response 1, and rest in DS_TRAIN_N.
Now DS_TRAIN_N was divided randomly in 10 equal
data sets. Then DS_TRAIN_R was appended to each of
these sets. So all 10 sets have same responded records
but mutually exclusive and exhaustive not-responded
records.
There is a discussion in machine-learning community about
what should be the response rate in the stratified sample.
According to some blogs, keeping the ratio to 50% is
supposed to be a good strategy. Here, we tried threetraining
sa7mples with 10%, 15% and 20% response rates, and our
results shows that there is no need to increase it further.
Variable Reduction (Automated)
Before training the model, variables were selected for all
training datasets using stepwise regression methods:
Forward Selection and Backward Elimination[2]. Forward
Selection involves starting with no variables in the model,
adds variables one by one and compares the statistic9.
Similarly, Backward Elimination involves starting with all
variables and then testing the model after deletion of
variables one by one.
We used the method regsubsets from R package leaps
forcing in all 15PCs.Getting idea from R-Squared plot, 165
variables were selected from forward method and same
number from backward. Intersection of these two sets
resulted in 144-155 variables for different datasets
Figure 4: R-Squared and Mallows’s C Plot for Forward
Selection
79 We have considered R-Squared, but F-tests, Akaike
Information Criterion, Mallows’s Cp etc are also possible
Model selection based on R-Squared statistic was consistent
with Mallows’s Cp and Residual Sum of Squares.
Training and Testing the Models
We used Logistic Regression to train the models. For the
datasets prepared using first strategy, three models were
trained and then were tested on the untouched test dataset,
which is 30% of the whole data[1]. Due to very low response
in the original data, the model can’t be judged just by
observing the confusion matrix. Instead, following method
was followed:
A. The dataset was sorted in decreasing order of
probabilities predicted by the model.
B. Then it was divided in 10 equal groups, each called a
Decile.
C. Number of positive responses in each decile was
calculated and its cumulative sum was plotted against
deciles.
10% Responses 15%Responses Responses
23.54399 23.66791 24.16357
39.15737 39.03346 39.4052
49.19455 51.1772 49.81413
59.72739 59.35564 58.2404
67.28625 68.02974 66.41884
75.5886 75.21685 75.5886
83.02354 82.89963 81.9083
88.59975 88.9715 88.35192
95.66295 95.41512 95.41512
100 100 100
The table shows the percentage of responsescapturedforall
the three training data-sets created using the first strategy.
Figure 5: Decile Plot for Training Dataset having 15%
Responses
For the 10 training data sets prepared using 2nd method,
average of predicted probabilities by all 10 models was
taken as the final score. The results were very similar to
what we have here using the first strategy.Note: Models
using GBM (Gradient BoostingMachines)and Random forest
were also triedon the same datasets by some other team
members . They too came up with similar
International Journal of Trend in Scientific Research and Development (IJTSRD) @ www.ijtsrd.com eISSN: 2456-6470
@ IJTSRD | Unique Paper ID - IJTSRD21667 | Volume – 3 | Issue – 3 | Mar-Apr 2019 Page: 160
Summary and Further Scopes
Observing the similarity in results of different models tried,
it can be interpreted that results are largely dependent on
initial variable selection and sampling strategy followed.
Regsubsets showed that there were multiple linear
dependencies amongvariables. Thesedependenciescould be
removed by doing a little deep analysis. Also creating n-1
dummy
columns for a factor variable with nlevelsmightimprovethe
Principal Components.
Relative importance of variables as depicted by the GLM11
shows that more (than 15) principal components can be
included.
As discussed above, ZIP codes should be replaced by
geographic coordinates.
Most of Single Valued12 variables were discarded during
manual variable selection. It may improve the model if we
impute the missing values of such variables with 0 and
considering them till automated variable reduction phase.
References
[1] https://datascienceplus.com/perform-logistic-
regression-in-r/
[2] http://www.ijcem.org/papers032013/ijcem_032013_0
6.pdf
[3] https://en.wikipedia.org/wiki/Factor_analysis_of_mixed
_data
[4] https://books.google.co.in/books?id=1ysHiIpL4PQC&pg
=PA167&lpg=PA167&dq=A32+ATTRIBUTES&source=bl
&ots=VKf6kFqg33&sig=ACfU3U0U9Q4MwOq3I6LGqyxx
r9hqXEZa7Q&hl=en&sa=X&ved=2ahUKEwjDxoyhufDgA
hUSA3IKHXsqA7oQ6AEwBXoECAEQAQ#v=onepage&q=
A32%20ATTRIBUTES&f=false
[5] https://nhorton.people.amherst.edu/r2/datasets.php
[6] https://developer.ibm.com/customer-
engagement/tutorials/insert-update-records-relational-
table/
[7] http://support.sas.com/documentation/cdl/en/graphre
f/63022/HTML/default/viewer.htm#overview-
geocode.htm

Contenu connexe

Tendances

Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA)Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA)Anmol Dwivedi
 
Data mining approaches and methods
Data mining approaches and methodsData mining approaches and methods
Data mining approaches and methodssonangrai
 
7. logistics regression using spss
7. logistics regression using spss7. logistics regression using spss
7. logistics regression using spssDr Nisha Arora
 
Ibm spss statistics 19 brief guide
Ibm spss statistics 19 brief guideIbm spss statistics 19 brief guide
Ibm spss statistics 19 brief guideMarketing Utopia
 
Mb0040 statistics for management spring2015_assignment- SMU_MBA-Solved-Assign...
Mb0040 statistics for management spring2015_assignment- SMU_MBA-Solved-Assign...Mb0040 statistics for management spring2015_assignment- SMU_MBA-Solved-Assign...
Mb0040 statistics for management spring2015_assignment- SMU_MBA-Solved-Assign...pkharb
 
An Introduction to SPSS
An Introduction to SPSSAn Introduction to SPSS
An Introduction to SPSSRayman Soe
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessingSalah Amean
 
Discriminant Analysis in Sports
Discriminant Analysis in SportsDiscriminant Analysis in Sports
Discriminant Analysis in SportsJ P Verma
 
Spss series - data entry and coding
Spss series - data entry and codingSpss series - data entry and coding
Spss series - data entry and codingDr. Majdi Al Jasim
 
Statistics and Data Mining
Statistics and  Data MiningStatistics and  Data Mining
Statistics and Data MiningR A Akerkar
 
Classification modelling review
Classification modelling reviewClassification modelling review
Classification modelling reviewJaideep Adusumelli
 
[M3A4] Data Analysis and Interpretation Specialization
[M3A4] Data Analysis and Interpretation Specialization[M3A4] Data Analysis and Interpretation Specialization
[M3A4] Data Analysis and Interpretation SpecializationAndrea Rubio
 

Tendances (19)

Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA)Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA)
 
Data Analyst - Interview Guide
Data Analyst - Interview GuideData Analyst - Interview Guide
Data Analyst - Interview Guide
 
Data mining approaches and methods
Data mining approaches and methodsData mining approaches and methods
Data mining approaches and methods
 
7. logistics regression using spss
7. logistics regression using spss7. logistics regression using spss
7. logistics regression using spss
 
Ibm spss statistics 19 brief guide
Ibm spss statistics 19 brief guideIbm spss statistics 19 brief guide
Ibm spss statistics 19 brief guide
 
Mb0040 statistics for management spring2015_assignment- SMU_MBA-Solved-Assign...
Mb0040 statistics for management spring2015_assignment- SMU_MBA-Solved-Assign...Mb0040 statistics for management spring2015_assignment- SMU_MBA-Solved-Assign...
Mb0040 statistics for management spring2015_assignment- SMU_MBA-Solved-Assign...
 
Multiple Linear Regression
Multiple Linear Regression Multiple Linear Regression
Multiple Linear Regression
 
Spss beginners
Spss beginnersSpss beginners
Spss beginners
 
An Introduction to SPSS
An Introduction to SPSSAn Introduction to SPSS
An Introduction to SPSS
 
Spss an introduction
Spss  an introductionSpss  an introduction
Spss an introduction
 
Data entry in Excel and SPSS
Data entry in Excel and SPSS Data entry in Excel and SPSS
Data entry in Excel and SPSS
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
 
Discriminant Analysis in Sports
Discriminant Analysis in SportsDiscriminant Analysis in Sports
Discriminant Analysis in Sports
 
Machine learning session2
Machine learning   session2Machine learning   session2
Machine learning session2
 
Spss series - data entry and coding
Spss series - data entry and codingSpss series - data entry and coding
Spss series - data entry and coding
 
Statistics and Data Mining
Statistics and  Data MiningStatistics and  Data Mining
Statistics and Data Mining
 
Classification modelling review
Classification modelling reviewClassification modelling review
Classification modelling review
 
Data preparation
Data preparationData preparation
Data preparation
 
[M3A4] Data Analysis and Interpretation Specialization
[M3A4] Data Analysis and Interpretation Specialization[M3A4] Data Analysis and Interpretation Specialization
[M3A4] Data Analysis and Interpretation Specialization
 

Similaire à Preprocessing of Low Response Data for Predictive Modeling

Poor man's missing value imputation
Poor man's missing value imputationPoor man's missing value imputation
Poor man's missing value imputationLeonardo Auslender
 
As mentioned earlier, the mid-term will have conceptual and quanti.docx
As mentioned earlier, the mid-term will have conceptual and quanti.docxAs mentioned earlier, the mid-term will have conceptual and quanti.docx
As mentioned earlier, the mid-term will have conceptual and quanti.docxfredharris32
 
Guide for building GLMS
Guide for building GLMSGuide for building GLMS
Guide for building GLMSAli T. Lotia
 
Store segmentation progresso
Store segmentation progressoStore segmentation progresso
Store segmentation progressoveesingh
 
Data Trend Analysis by Assigning Polynomial Function For Given Data Set
Data Trend Analysis by Assigning Polynomial Function For Given Data SetData Trend Analysis by Assigning Polynomial Function For Given Data Set
Data Trend Analysis by Assigning Polynomial Function For Given Data SetIJCERT
 
Missing Value imputation, Poor man's
Missing Value imputation, Poor man'sMissing Value imputation, Poor man's
Missing Value imputation, Poor man'sLeonardo Auslender
 
Top 100+ Google Data Science Interview Questions.pdf
Top 100+ Google Data Science Interview Questions.pdfTop 100+ Google Data Science Interview Questions.pdf
Top 100+ Google Data Science Interview Questions.pdfDatacademy.ai
 
Towards reducing the
Towards reducing theTowards reducing the
Towards reducing theIJDKP
 
IE332Engineering Statistics IINOTES· Show your work,
IE332Engineering Statistics IINOTES· Show your work,IE332Engineering Statistics IINOTES· Show your work,
IE332Engineering Statistics IINOTES· Show your work,MalikPinckney86
 
Recommender system
Recommender systemRecommender system
Recommender systemBhumi Patel
 
4 Solutions To Exercises 4.1 About These Solutions 4.2 Using The Table Of Ran...
4 Solutions To Exercises 4.1 About These Solutions 4.2 Using The Table Of Ran...4 Solutions To Exercises 4.1 About These Solutions 4.2 Using The Table Of Ran...
4 Solutions To Exercises 4.1 About These Solutions 4.2 Using The Table Of Ran...Wendy Berg
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Derek Kane
 

Similaire à Preprocessing of Low Response Data for Predictive Modeling (20)

JEDM_RR_JF_Final
JEDM_RR_JF_FinalJEDM_RR_JF_Final
JEDM_RR_JF_Final
 
Predictive modeling
Predictive modelingPredictive modeling
Predictive modeling
 
Poor man's missing value imputation
Poor man's missing value imputationPoor man's missing value imputation
Poor man's missing value imputation
 
report
reportreport
report
 
As mentioned earlier, the mid-term will have conceptual and quanti.docx
As mentioned earlier, the mid-term will have conceptual and quanti.docxAs mentioned earlier, the mid-term will have conceptual and quanti.docx
As mentioned earlier, the mid-term will have conceptual and quanti.docx
 
Guide for building GLMS
Guide for building GLMSGuide for building GLMS
Guide for building GLMS
 
Store segmentation progresso
Store segmentation progressoStore segmentation progresso
Store segmentation progresso
 
Data Trend Analysis by Assigning Polynomial Function For Given Data Set
Data Trend Analysis by Assigning Polynomial Function For Given Data SetData Trend Analysis by Assigning Polynomial Function For Given Data Set
Data Trend Analysis by Assigning Polynomial Function For Given Data Set
 
Missing Value imputation, Poor man's
Missing Value imputation, Poor man'sMissing Value imputation, Poor man's
Missing Value imputation, Poor man's
 
Intro to SPSS.ppt
Intro to SPSS.pptIntro to SPSS.ppt
Intro to SPSS.ppt
 
Top 100+ Google Data Science Interview Questions.pdf
Top 100+ Google Data Science Interview Questions.pdfTop 100+ Google Data Science Interview Questions.pdf
Top 100+ Google Data Science Interview Questions.pdf
 
Dmml report final
Dmml report finalDmml report final
Dmml report final
 
Towards reducing the
Towards reducing theTowards reducing the
Towards reducing the
 
IE332Engineering Statistics IINOTES· Show your work,
IE332Engineering Statistics IINOTES· Show your work,IE332Engineering Statistics IINOTES· Show your work,
IE332Engineering Statistics IINOTES· Show your work,
 
Recommender system
Recommender systemRecommender system
Recommender system
 
Bank loan purchase modeling
Bank loan purchase modelingBank loan purchase modeling
Bank loan purchase modeling
 
Telecom customer churn prediction
Telecom customer churn predictionTelecom customer churn prediction
Telecom customer churn prediction
 
4 Solutions To Exercises 4.1 About These Solutions 4.2 Using The Table Of Ran...
4 Solutions To Exercises 4.1 About These Solutions 4.2 Using The Table Of Ran...4 Solutions To Exercises 4.1 About These Solutions 4.2 Using The Table Of Ran...
4 Solutions To Exercises 4.1 About These Solutions 4.2 Using The Table Of Ran...
 
SOLUTION.PDF
SOLUTION.PDFSOLUTION.PDF
SOLUTION.PDF
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests
 

Plus de ijtsrd

‘Six Sigma Technique’ A Journey Through its Implementation
‘Six Sigma Technique’ A Journey Through its Implementation‘Six Sigma Technique’ A Journey Through its Implementation
‘Six Sigma Technique’ A Journey Through its Implementationijtsrd
 
Edge Computing in Space Enhancing Data Processing and Communication for Space...
Edge Computing in Space Enhancing Data Processing and Communication for Space...Edge Computing in Space Enhancing Data Processing and Communication for Space...
Edge Computing in Space Enhancing Data Processing and Communication for Space...ijtsrd
 
Dynamics of Communal Politics in 21st Century India Challenges and Prospects
Dynamics of Communal Politics in 21st Century India Challenges and ProspectsDynamics of Communal Politics in 21st Century India Challenges and Prospects
Dynamics of Communal Politics in 21st Century India Challenges and Prospectsijtsrd
 
Assess Perspective and Knowledge of Healthcare Providers Towards Elehealth in...
Assess Perspective and Knowledge of Healthcare Providers Towards Elehealth in...Assess Perspective and Knowledge of Healthcare Providers Towards Elehealth in...
Assess Perspective and Knowledge of Healthcare Providers Towards Elehealth in...ijtsrd
 
The Impact of Digital Media on the Decentralization of Power and the Erosion ...
The Impact of Digital Media on the Decentralization of Power and the Erosion ...The Impact of Digital Media on the Decentralization of Power and the Erosion ...
The Impact of Digital Media on the Decentralization of Power and the Erosion ...ijtsrd
 
Online Voices, Offline Impact Ambedkars Ideals and Socio Political Inclusion ...
Online Voices, Offline Impact Ambedkars Ideals and Socio Political Inclusion ...Online Voices, Offline Impact Ambedkars Ideals and Socio Political Inclusion ...
Online Voices, Offline Impact Ambedkars Ideals and Socio Political Inclusion ...ijtsrd
 
Problems and Challenges of Agro Entreprenurship A Study
Problems and Challenges of Agro Entreprenurship A StudyProblems and Challenges of Agro Entreprenurship A Study
Problems and Challenges of Agro Entreprenurship A Studyijtsrd
 
Comparative Analysis of Total Corporate Disclosure of Selected IT Companies o...
Comparative Analysis of Total Corporate Disclosure of Selected IT Companies o...Comparative Analysis of Total Corporate Disclosure of Selected IT Companies o...
Comparative Analysis of Total Corporate Disclosure of Selected IT Companies o...ijtsrd
 
The Impact of Educational Background and Professional Training on Human Right...
The Impact of Educational Background and Professional Training on Human Right...The Impact of Educational Background and Professional Training on Human Right...
The Impact of Educational Background and Professional Training on Human Right...ijtsrd
 
A Study on the Effective Teaching Learning Process in English Curriculum at t...
A Study on the Effective Teaching Learning Process in English Curriculum at t...A Study on the Effective Teaching Learning Process in English Curriculum at t...
A Study on the Effective Teaching Learning Process in English Curriculum at t...ijtsrd
 
The Role of Mentoring and Its Influence on the Effectiveness of the Teaching ...
The Role of Mentoring and Its Influence on the Effectiveness of the Teaching ...The Role of Mentoring and Its Influence on the Effectiveness of the Teaching ...
The Role of Mentoring and Its Influence on the Effectiveness of the Teaching ...ijtsrd
 
Design Simulation and Hardware Construction of an Arduino Microcontroller Bas...
Design Simulation and Hardware Construction of an Arduino Microcontroller Bas...Design Simulation and Hardware Construction of an Arduino Microcontroller Bas...
Design Simulation and Hardware Construction of an Arduino Microcontroller Bas...ijtsrd
 
Sustainable Energy by Paul A. Adekunte | Matthew N. O. Sadiku | Janet O. Sadiku
Sustainable Energy by Paul A. Adekunte | Matthew N. O. Sadiku | Janet O. SadikuSustainable Energy by Paul A. Adekunte | Matthew N. O. Sadiku | Janet O. Sadiku
Sustainable Energy by Paul A. Adekunte | Matthew N. O. Sadiku | Janet O. Sadikuijtsrd
 
Concepts for Sudan Survey Act Implementations Executive Regulations and Stand...
Concepts for Sudan Survey Act Implementations Executive Regulations and Stand...Concepts for Sudan Survey Act Implementations Executive Regulations and Stand...
Concepts for Sudan Survey Act Implementations Executive Regulations and Stand...ijtsrd
 
Towards the Implementation of the Sudan Interpolated Geoid Model Khartoum Sta...
Towards the Implementation of the Sudan Interpolated Geoid Model Khartoum Sta...Towards the Implementation of the Sudan Interpolated Geoid Model Khartoum Sta...
Towards the Implementation of the Sudan Interpolated Geoid Model Khartoum Sta...ijtsrd
 
Activating Geospatial Information for Sudans Sustainable Investment Map
Activating Geospatial Information for Sudans Sustainable Investment MapActivating Geospatial Information for Sudans Sustainable Investment Map
Activating Geospatial Information for Sudans Sustainable Investment Mapijtsrd
 
Educational Unity Embracing Diversity for a Stronger Society
Educational Unity Embracing Diversity for a Stronger SocietyEducational Unity Embracing Diversity for a Stronger Society
Educational Unity Embracing Diversity for a Stronger Societyijtsrd
 
Integration of Indian Indigenous Knowledge System in Management Prospects and...
Integration of Indian Indigenous Knowledge System in Management Prospects and...Integration of Indian Indigenous Knowledge System in Management Prospects and...
Integration of Indian Indigenous Knowledge System in Management Prospects and...ijtsrd
 
DeepMask Transforming Face Mask Identification for Better Pandemic Control in...
DeepMask Transforming Face Mask Identification for Better Pandemic Control in...DeepMask Transforming Face Mask Identification for Better Pandemic Control in...
DeepMask Transforming Face Mask Identification for Better Pandemic Control in...ijtsrd
 
Streamlining Data Collection eCRF Design and Machine Learning
Streamlining Data Collection eCRF Design and Machine LearningStreamlining Data Collection eCRF Design and Machine Learning
Streamlining Data Collection eCRF Design and Machine Learningijtsrd
 

Plus de ijtsrd (20)

‘Six Sigma Technique’ A Journey Through its Implementation
‘Six Sigma Technique’ A Journey Through its Implementation‘Six Sigma Technique’ A Journey Through its Implementation
‘Six Sigma Technique’ A Journey Through its Implementation
 
Edge Computing in Space Enhancing Data Processing and Communication for Space...
Edge Computing in Space Enhancing Data Processing and Communication for Space...Edge Computing in Space Enhancing Data Processing and Communication for Space...
Edge Computing in Space Enhancing Data Processing and Communication for Space...
 
Dynamics of Communal Politics in 21st Century India Challenges and Prospects
Dynamics of Communal Politics in 21st Century India Challenges and ProspectsDynamics of Communal Politics in 21st Century India Challenges and Prospects
Dynamics of Communal Politics in 21st Century India Challenges and Prospects
 
Assess Perspective and Knowledge of Healthcare Providers Towards Elehealth in...
Assess Perspective and Knowledge of Healthcare Providers Towards Elehealth in...Assess Perspective and Knowledge of Healthcare Providers Towards Elehealth in...
Assess Perspective and Knowledge of Healthcare Providers Towards Elehealth in...
 
The Impact of Digital Media on the Decentralization of Power and the Erosion ...
The Impact of Digital Media on the Decentralization of Power and the Erosion ...The Impact of Digital Media on the Decentralization of Power and the Erosion ...
The Impact of Digital Media on the Decentralization of Power and the Erosion ...
 
Online Voices, Offline Impact Ambedkars Ideals and Socio Political Inclusion ...
Online Voices, Offline Impact Ambedkars Ideals and Socio Political Inclusion ...Online Voices, Offline Impact Ambedkars Ideals and Socio Political Inclusion ...
Online Voices, Offline Impact Ambedkars Ideals and Socio Political Inclusion ...
 
Problems and Challenges of Agro Entreprenurship A Study
Problems and Challenges of Agro Entreprenurship A StudyProblems and Challenges of Agro Entreprenurship A Study
Problems and Challenges of Agro Entreprenurship A Study
 
Comparative Analysis of Total Corporate Disclosure of Selected IT Companies o...
Comparative Analysis of Total Corporate Disclosure of Selected IT Companies o...Comparative Analysis of Total Corporate Disclosure of Selected IT Companies o...
Comparative Analysis of Total Corporate Disclosure of Selected IT Companies o...
 
The Impact of Educational Background and Professional Training on Human Right...
The Impact of Educational Background and Professional Training on Human Right...The Impact of Educational Background and Professional Training on Human Right...
The Impact of Educational Background and Professional Training on Human Right...
 
A Study on the Effective Teaching Learning Process in English Curriculum at t...
A Study on the Effective Teaching Learning Process in English Curriculum at t...A Study on the Effective Teaching Learning Process in English Curriculum at t...
A Study on the Effective Teaching Learning Process in English Curriculum at t...
 
The Role of Mentoring and Its Influence on the Effectiveness of the Teaching ...
The Role of Mentoring and Its Influence on the Effectiveness of the Teaching ...The Role of Mentoring and Its Influence on the Effectiveness of the Teaching ...
The Role of Mentoring and Its Influence on the Effectiveness of the Teaching ...
 
Design Simulation and Hardware Construction of an Arduino Microcontroller Bas...
Design Simulation and Hardware Construction of an Arduino Microcontroller Bas...Design Simulation and Hardware Construction of an Arduino Microcontroller Bas...
Design Simulation and Hardware Construction of an Arduino Microcontroller Bas...
 
Sustainable Energy by Paul A. Adekunte | Matthew N. O. Sadiku | Janet O. Sadiku
Sustainable Energy by Paul A. Adekunte | Matthew N. O. Sadiku | Janet O. SadikuSustainable Energy by Paul A. Adekunte | Matthew N. O. Sadiku | Janet O. Sadiku
Sustainable Energy by Paul A. Adekunte | Matthew N. O. Sadiku | Janet O. Sadiku
 
Concepts for Sudan Survey Act Implementations Executive Regulations and Stand...
Concepts for Sudan Survey Act Implementations Executive Regulations and Stand...Concepts for Sudan Survey Act Implementations Executive Regulations and Stand...
Concepts for Sudan Survey Act Implementations Executive Regulations and Stand...
 
Towards the Implementation of the Sudan Interpolated Geoid Model Khartoum Sta...
Towards the Implementation of the Sudan Interpolated Geoid Model Khartoum Sta...Towards the Implementation of the Sudan Interpolated Geoid Model Khartoum Sta...
Towards the Implementation of the Sudan Interpolated Geoid Model Khartoum Sta...
 
Activating Geospatial Information for Sudans Sustainable Investment Map
Activating Geospatial Information for Sudans Sustainable Investment MapActivating Geospatial Information for Sudans Sustainable Investment Map
Activating Geospatial Information for Sudans Sustainable Investment Map
 
Educational Unity Embracing Diversity for a Stronger Society
Educational Unity Embracing Diversity for a Stronger SocietyEducational Unity Embracing Diversity for a Stronger Society
Educational Unity Embracing Diversity for a Stronger Society
 
Integration of Indian Indigenous Knowledge System in Management Prospects and...
Integration of Indian Indigenous Knowledge System in Management Prospects and...Integration of Indian Indigenous Knowledge System in Management Prospects and...
Integration of Indian Indigenous Knowledge System in Management Prospects and...
 
DeepMask Transforming Face Mask Identification for Better Pandemic Control in...
DeepMask Transforming Face Mask Identification for Better Pandemic Control in...DeepMask Transforming Face Mask Identification for Better Pandemic Control in...
DeepMask Transforming Face Mask Identification for Better Pandemic Control in...
 
Streamlining Data Collection eCRF Design and Machine Learning
Streamlining Data Collection eCRF Design and Machine LearningStreamlining Data Collection eCRF Design and Machine Learning
Streamlining Data Collection eCRF Design and Machine Learning
 

Dernier

Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhikauryashika82
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfPoh-Sun Goh
 
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-IIFood Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-IIShubhangi Sonawane
 
Role Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptxRole Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptxNikitaBankoti2
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxAreebaZafar22
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...Poonam Aher Patil
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfChris Hunter
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin ClassesCeline George
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 

Dernier (20)

Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-IIFood Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
 
Role Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptxRole Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptx
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 

Preprocessing of Low Response Data for Predictive Modeling

  • 1. International Journal of Trend in Scientific Research and Development (IJTSRD) Volume: 3 | Issue: 3 | Mar-Apr 2019 Available Online: www.ijtsrd.com e-ISSN: 2456 - 6470 @ IJTSRD | Unique Paper ID - IJTSRD21667 | Volume – 3 | Issue – 3 | Mar-Apr 2019 Page: 157 Preprocessing of Low Response Data for Predictive Modeling Farzana Naz, Imaad Shafi, Md Kamre Alam Al-Falah University, Dhouj, Haryana, India How to cite this paper: Farzana Naz | Imaad Shafi | Md Kamre Alam "Preprocessing of Low Response Data for Predictive Modeling" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456- 6470, Volume-3 | Issue-3, April 2019, pp.157-160, URL: http://www.ijtsrd.co m/papers/ijtsrd216 67.pdf Copyright © 2019 by author(s) and International Journal of Trend in Scientific Research and Development Journal. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0) (http://creativecommons.org/licenses/ by/4.0) ABSTRACT Fortrainingamodel,therawdatahavetogothroughvariouspreprocessingphaseslike Cleaning,MissingValuesImputation,Dimension/Variablereduction,andSampling.These steps are data and problem specific and affect the accuracy of the model at a very large extent. Forthecurrentscenario,wehave2.2Mrecordswith511variables.Thisdatawasusedina DirectMailCampaignofsomeLifeInsuranceProductsandnowweknowwhichrecordhad apositiveresponseforthecampaign. #Rows(records):2,259,747 #Columns:511 #Rowswithpositiveresponse:2,739, i.e.ResponseRate:0.1212%. Thedatasetisnotcomplete,i.e.wehavetotakecareofmissingvalues. KEYWORDS: Logistic Regression, Datasets, Principal component analysis, Variable Reduction INTRODUCTION We have to build a model using this data, so that each record is assigned a probability score. This score depicts the likelihood of any person responding to the Mail Campaign. Sorting the records according to the score helps us selecting people whom to send the mail, which in turns reduces the campaign cost. Resources All the steps were performed on 64-bit machines with 8 cores and 32GB of RAM running Ubuntu 12.04. R-Studio was used to write and run R scripts. Variable Reduction (Manually) Analyzing the missing values shows that 166 variables have more than 50% values missing. Filling the missing values with any constant1 will 1 Typically, missing values are imputed with Mean, Median and Mode reduce the variance of these variables and therefore they will not contribute significantly to the model. In addition, considering these variables in model building will result in increased number of dimensions, which requires more time and memory to train the model. So, these 166 variables are discarded before any further analysis [4]. In a very similar manner, variables having less than 10% values missing were selected without any inspection. And those having 10% to 50%valuesmissingwere examinedone by one, for whether they are important2, before selecting. This led the number of variables to 324. Furthermore, variables like Names, Consumer IDs were discarded and the count collapsed to 306. Birthdays (Year, Month and Date) were replaced by age. Similarly, date of last purchase was replaced by the total months passed since last purchase. After all these steps, number of variables into consideration was around 280. Missing Values First we tried to predict the missing entries by performing FAMD3 and then reconstructing the originals from factors. This didn’t work mainly due to following reasons4: 2 This step was based on intuition and discussion among team members 3 Factor Analysis of Mixed Data. Similar to PCA for numeric variables and MCA (Multiple Correspondence Analysis) for categorical Variables IJTSRD21667
  • 2. International Journal of Trend in Scientific Research and Development (IJTSRD) @ www.ijtsrd.com eISSN: 2456-6470 @ IJTSRD | Unique Paper ID - IJTSRD21667 | Volume – 3 | Issue – 3 | Mar-Apr 2019 Page: 158 A. FAMD creates dummy variables foreach uniquevalueof categorical variable, demanding more memory than the provided 32GB. B. To solve the memory problem, small sample was tried. But now the problem was with the small number of records corresponding to some categories of the categorical variable. A small selected sample was not able to include all categories, leading to constant value(0)insome columns(dummy)makingitimpossible to perform FAMD. To avoid the problem, missing values of a column were imputed with median of the available values in that column. Missing values in columns having either 1 or NA (missing) were imputed with 0. Mean could have been used instead of median but we selected the later one due to two reasons: A. Mean of a Boolean variablecolumn willbe arealnumber B. For continuous variables like household area, there were outliers which were distortingthenaturalposition of mean. Anyways, imputing median too will compromise the results compared to if we had used any technique based on PCA, FAMD or other predictive methods[3]. Categorical Variables In the following figure, we can see that initiallythereare108 factor variables; but some of them were discarded during manual variable. top 15 components were selected on the basis of variance chart Figure 3: Variances of top 30 Principal Components over 400K records 4 Can be solved on a machine with little more memory using PCA reduction step. Of the remaining, those having 2 categories/levels were simply converted to Boolean using dummy variables for each category. For remaining, which have more than 2categories,following steps were taken: A. Create dummies for each category to convert the variable into Boolean. This expansion of 56 such categorical variables resulted in an indicator matrix of 906 dummy columns. B. Now the aim was to perform PCA on this indicator matrix, but that couldn’t be done on whole 2.2Mrecords due to resource constraints. To resolve the issue, the PCA was performed onrandomlyselected400Krecords, and C. Remaining records were multiplied with the rotation matrix for 15PCs to compute/predict5 the components This way we got new data set with 266 columns containing 58 dummy columns corresponding to 29 categorical (with two levels) variables, 15PCs and 222 original numeric columns. ZIP codes were excluded from the PC computation part, as it had too many levels. A simple idea6 is to replace the ZIP with Longitude and Latitude.[7] Note: n-1 dummy columns are sufficient to represent any categorical variables having n levels. So, instead of 58 dummy columns, 29 could have sufficed[6]. At this stage, we have preprocessed and clean data having 2.2M rows and 266 columns. Preparation of Datasets First of all, the preprocessed data was divided intotwoparts randomly. First part (70%) for preparing training datasets using various sampling methods, and the rest 30% left untouched for testing purpose. Let us call the first part DS_TRAIN and the second one DS_TEST. Both DS_TRAIN and DS_TEST were sampled randomly and maintain the response rate (≈0.12%) similar to that in the original raw data. A model trained with such a low response data will fail to predict the responded rows. To understand the situation, let us suppose that we have a dataset of 10,000 records. This dataset will contain only 12 positive responses. To maintain the accuracy, model will learn to predict the negative responses and even if it predicts all 10,000 rows as not responded, its accuracy is ( 10000 10000 −12 )×100 = 99.88% ( If we replicate the responded rows to increase the data, this problem can be avoided. But the model will be more optimistic,i.e. false positive predictions will increase. But when it comes to assign a score to records rather than to classify them as responded/not-responded, it does good. 5 Levels of a factor variable is the number of categories for that variable 6 Factor class is a data type used in R to store categorical variables.
  • 3. International Journal of Trend in Scientific Research and Development (IJTSRD) @ www.ijtsrd.com eISSN: 2456-6470 @ IJTSRD | Unique Paper ID - IJTSRD21667 | Volume – 3 | Issue – 3 | Mar-Apr 2019 Page: 159 To increase the response rate in training dataset, two different strategies were taken: 1. Three training datasets were prepared using stratified sampling. Records with positive response in DS_TRAIN were replicated via Simple Random Sampling with Replacement to make the response rate to 10%, 15% and 20%. Let us call these sets as DS_TRAIN1A, DS_TRAIN1B and DS_TRAIN1C. 2. DS_TRAIN was divided in twoparts:DS_TRAIN_Rhaving all records with response 1, and rest in DS_TRAIN_N. Now DS_TRAIN_N was divided randomly in 10 equal data sets. Then DS_TRAIN_R was appended to each of these sets. So all 10 sets have same responded records but mutually exclusive and exhaustive not-responded records. There is a discussion in machine-learning community about what should be the response rate in the stratified sample. According to some blogs, keeping the ratio to 50% is supposed to be a good strategy. Here, we tried threetraining sa7mples with 10%, 15% and 20% response rates, and our results shows that there is no need to increase it further. Variable Reduction (Automated) Before training the model, variables were selected for all training datasets using stepwise regression methods: Forward Selection and Backward Elimination[2]. Forward Selection involves starting with no variables in the model, adds variables one by one and compares the statistic9. Similarly, Backward Elimination involves starting with all variables and then testing the model after deletion of variables one by one. We used the method regsubsets from R package leaps forcing in all 15PCs.Getting idea from R-Squared plot, 165 variables were selected from forward method and same number from backward. Intersection of these two sets resulted in 144-155 variables for different datasets Figure 4: R-Squared and Mallows’s C Plot for Forward Selection 79 We have considered R-Squared, but F-tests, Akaike Information Criterion, Mallows’s Cp etc are also possible Model selection based on R-Squared statistic was consistent with Mallows’s Cp and Residual Sum of Squares. Training and Testing the Models We used Logistic Regression to train the models. For the datasets prepared using first strategy, three models were trained and then were tested on the untouched test dataset, which is 30% of the whole data[1]. Due to very low response in the original data, the model can’t be judged just by observing the confusion matrix. Instead, following method was followed: A. The dataset was sorted in decreasing order of probabilities predicted by the model. B. Then it was divided in 10 equal groups, each called a Decile. C. Number of positive responses in each decile was calculated and its cumulative sum was plotted against deciles. 10% Responses 15%Responses Responses 23.54399 23.66791 24.16357 39.15737 39.03346 39.4052 49.19455 51.1772 49.81413 59.72739 59.35564 58.2404 67.28625 68.02974 66.41884 75.5886 75.21685 75.5886 83.02354 82.89963 81.9083 88.59975 88.9715 88.35192 95.66295 95.41512 95.41512 100 100 100 The table shows the percentage of responsescapturedforall the three training data-sets created using the first strategy. Figure 5: Decile Plot for Training Dataset having 15% Responses For the 10 training data sets prepared using 2nd method, average of predicted probabilities by all 10 models was taken as the final score. The results were very similar to what we have here using the first strategy.Note: Models using GBM (Gradient BoostingMachines)and Random forest were also triedon the same datasets by some other team members . They too came up with similar
  • 4. International Journal of Trend in Scientific Research and Development (IJTSRD) @ www.ijtsrd.com eISSN: 2456-6470 @ IJTSRD | Unique Paper ID - IJTSRD21667 | Volume – 3 | Issue – 3 | Mar-Apr 2019 Page: 160 Summary and Further Scopes Observing the similarity in results of different models tried, it can be interpreted that results are largely dependent on initial variable selection and sampling strategy followed. Regsubsets showed that there were multiple linear dependencies amongvariables. Thesedependenciescould be removed by doing a little deep analysis. Also creating n-1 dummy columns for a factor variable with nlevelsmightimprovethe Principal Components. Relative importance of variables as depicted by the GLM11 shows that more (than 15) principal components can be included. As discussed above, ZIP codes should be replaced by geographic coordinates. Most of Single Valued12 variables were discarded during manual variable selection. It may improve the model if we impute the missing values of such variables with 0 and considering them till automated variable reduction phase. References [1] https://datascienceplus.com/perform-logistic- regression-in-r/ [2] http://www.ijcem.org/papers032013/ijcem_032013_0 6.pdf [3] https://en.wikipedia.org/wiki/Factor_analysis_of_mixed _data [4] https://books.google.co.in/books?id=1ysHiIpL4PQC&pg =PA167&lpg=PA167&dq=A32+ATTRIBUTES&source=bl &ots=VKf6kFqg33&sig=ACfU3U0U9Q4MwOq3I6LGqyxx r9hqXEZa7Q&hl=en&sa=X&ved=2ahUKEwjDxoyhufDgA hUSA3IKHXsqA7oQ6AEwBXoECAEQAQ#v=onepage&q= A32%20ATTRIBUTES&f=false [5] https://nhorton.people.amherst.edu/r2/datasets.php [6] https://developer.ibm.com/customer- engagement/tutorials/insert-update-records-relational- table/ [7] http://support.sas.com/documentation/cdl/en/graphre f/63022/HTML/default/viewer.htm#overview- geocode.htm