I performed this analysis using SAS on a dataset with 5000 records. I used CART and logistic regression to build a predictive model that identifies customers who are likely to switch to a competitor's network.
2. Predictive modeling using CART & Logistic Regression algorithms
What is churn rate & how does it affect companies?
Data collection and descriptive statistics
Comparison between CART & Logistic Regression models and final recommendation
3. High Value Customers
High Value Customers who are likely to churn
Customers who are likely to churn
Fig 1.1
4.
Sl No. | state | account_length | area_code | international_plan | voice_mail_plan | number_vmail_messages | total_day_minutes | total_day_calls | total_day_charge | total_eve_minutes | total_eve_calls | total_eve_charge | total_night_minutes | total_night_calls | total_night_charge | total_intl_minutes | total_intl_calls | total_intl_charge | number_customer_service_calls | churn
1 | KS | 128 | area_code_415 | no | yes | 25 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | 91 | 11.01 | 10 | 3 | 2.7 | 1 | 0
2 | OH | 107 | area_code_415 | no | yes | 26 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | 0
3 | NJ | 137 | area_code_415 | no | no | 0 | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | 0
4 | OH | 84 | area_code_408 | yes | no | 0 | 299.4 | 71 | 50.9 | 61.9 | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | 0
5 | OK | 75 | area_code_415 | yes | no | 0 | 166.7 | 113 | 28.34 | 148.3 | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | 0
6 | AL | 118 | area_code_510 | yes | no | 0 | 223.4 | 98 | 37.98 | 220.6 | 101 | 18.75 | 203.9 | 118 | 9.18 | 6.3 | 6 | 1.7 | 0 | 0
7 | MA | 121 | area_code_510 | no | yes | 24 | 218.2 | 88 | 37.09 | 348.5 | 108 | 29.62 | 212.6 | 118 | 9.57 | 7.5 | 7 | 2.03 | 3 | 0
8 | MO | 147 | area_code_415 | yes | no | 0 | 157 | 79 | 26.69 | 103.1 | 94 | 8.76 | 211.8 | 96 | 9.53 | 7.1 | 6 | 1.92 | 0 | 0
9 | LA | 117 | area_code_408 | no | no | 0 | 184.5 | 97 | 31.37 | 351.6 | 80 | 29.89 | 215.8 | 90 | 9.71 | 8.7 | 4 | 2.35 | 1 | 0
10 | WV | 141 | area_code_415 | yes | yes | 37 | 258.6 | 84 | 43.96 | 222 | 111 | 18.87 | 326.4 | 97 | 14.69 | 11.2 | 5 | 3.02 | 0 | 0
Data Set Dimensions
Data Set | # of Observations | # of Variables
Churn | 5000 | 20
Train_Churn | 3333 | 20
Test_Churn | 1667 | 20
The data set used in this analysis is taken from the CRAN repository, where it is bundled with the C50 package. It consists of 5000 observations and 20 variables, of which 19 are predictor variables and 1 is the response variable. The data set is partitioned into Train and Test sets, with about two-thirds of the records (3333) used for training and the remaining one-third (1667) for testing.
Table 1.1
Snapshot of Dataset used in the Analysis Table 1.2
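A two-thirds / one-third partition like the one in Table 1.1 can be sketched as follows. This is only an illustration: the function name `split_indices` and the seed are assumptions, and the Train/Test split that ships with the data set may have been produced differently.

```python
import random

def split_indices(n_records, n_train, seed=42):
    """Randomly partition record indices 0..n_records-1 into train and test lists."""
    rng = random.Random(seed)          # fixed seed (assumed) so the split is repeatable
    indices = list(range(n_records))
    rng.shuffle(indices)
    return sorted(indices[:n_train]), sorted(indices[n_train:])

# Sizes taken from Table 1.1: 5000 records, 3333 of them for training.
train_idx, test_idx = split_indices(5000, 3333)
print(len(train_idx), len(test_idx))
```

Every record lands in exactly one of the two sets, so the test set is never seen while the model is being fitted.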
5. Description, Role & Class of Variables in the Dataset
Table 1.3
Variable | Role | Class | Description | Use in Model
churn | Response | Binary | 0 = Customer did not leave the service provider, 1 = Customer left the service provider | DV
state | Predictor | Nominal | State to which the customer belongs | IV
account_length | Predictor | Numeric | No. of days the customer has been associated with the service provider | IV
area_code | Predictor | Nominal | Area within each state | IV
international_plan | Predictor | Categorical | Yes (1) = International plan, No (0) = No international plan | IV
voice_mail_plan | Predictor | Categorical | Yes (1) = Active voice mail plan, No (0) = No voice mail plan | IV
number_vmail_messages | Predictor | Numeric | Self explanatory | IV
total_day_minutes | Predictor | Numeric | Self explanatory | IV
total_day_calls | Predictor | Numeric | Self explanatory | IV
total_day_charge | Predictor | Numeric | Self explanatory | IV
total_eve_minutes | Predictor | Numeric | Self explanatory | IV
total_eve_calls | Predictor | Numeric | Self explanatory | IV
total_eve_charge | Predictor | Numeric | Self explanatory | IV
total_night_minutes | Predictor | Numeric | Self explanatory | IV
total_night_calls | Predictor | Numeric | Self explanatory | IV
total_night_charge | Predictor | Numeric | Self explanatory | IV
total_intl_minutes | Predictor | Numeric | Self explanatory | IV
total_intl_calls | Predictor | Numeric | Self explanatory | IV
total_intl_charge | Predictor | Numeric | Self explanatory | IV
number_customer_service_calls | Predictor | Numeric | Self explanatory | IV
DV: Dependent Variable; IV: Independent Variable
Table 1.3 lists the class, role and description of each variable. Churn is the response (dependent) variable and the other 19 variables are predictor (independent) variables. We use all 19 predictors for modelling. Before modelling, we compute descriptive statistics to get a fair idea of how each variable relates to Churn.
6. The next step in model building is descriptive statistics, which give an idea of which predictor variables are likely to be significant. The model will eventually confirm or reject these impressions.
Fig 1.2
Fig 1.3
First, we calculate summary statistics using PROC MEANS in SAS. To better understand how individual predictor variables relate to Churn, we use box plots. A few such box plots are shown in Fig 1.2 and Fig 1.3.
Table 1.3
In these two box plots we can clearly see that the distribution of total_day_charge differs markedly between Churn and No-Churn customers. The distribution of number_customer_service_calls (i.e. the number of service calls) also differs markedly between the two groups.
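The comparison a box plot makes can also be expressed numerically as a five-number summary per group. The sketch below is only an illustration: the helper `five_number_summary` is our own name, and the charge values are made up rather than taken from the data set.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the statistics a box plot draws."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

# Made-up total_day_charge values per churn group (NOT taken from the dataset).
charge_by_churn = {
    0: [22.1, 27.5, 30.4, 33.8, 41.0],
    1: [25.3, 38.9, 44.2, 47.6, 52.7],
}
for churn_flag, charges in charge_by_churn.items():
    print(churn_flag, five_number_summary(charges))
```

If the medians and quartiles of the two groups are far apart, as in this toy input, the variable is a promising predictor.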
7. Fig 1.4
Next, to understand the effect of a nominal variable such as “State”, we used Tableau to generate an area map based on longitude and latitude information. From the area map we can clearly see that churn is noticeably high in a few states, led by New Jersey (NJ) and followed by Texas (TX).
We now have a fair idea of the relative importance of each variable and have completed the data preparation stage, so we shift our focus to the most important part of the analysis: modeling.
8. Predictive Model Using CART ( Classification and Regression Tree ) Algorithm
Tree-based learning algorithms are among the most widely used supervised learning methods. Tree-based methods give predictive models high accuracy, stability and ease of interpretation. Unlike linear models, they map non-linear relationships quite well. They can be applied to both classification and regression problems.
A decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. It works for both categorical and continuous input and output variables. In this technique, we split the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter / differentiator among the input variables. Let’s have a look at the terminology associated with decision trees.
Root Node: Represents the entire population or sample, which is further divided into two or more homogeneous sets.
Decision Node: A sub-node that splits into further sub-nodes.
Leaf / Terminal Node: A node that does not split.
Pruning: Removing the sub-nodes of a decision node. It is the opposite of splitting.
Branch / Sub-Tree: A sub-section of the entire tree.
Fig 1.5: Terminology associated with Decision Tree
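How a node picks its "most significant splitter" can be sketched for a single numeric variable. This is a simplified illustration that uses Gini impurity. It is not the search PROC HPSPLIT performs, and the function names and the toy data are assumptions.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_threshold(values, labels):
    """Find the split 'value <= t' that minimizes the weighted Gini impurity."""
    best_t, best_score = None, float("inf")
    for t in sorted(set(values))[:-1]:   # the largest value would leave the right side empty
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Made-up service-call counts and churn labels (NOT taken from the dataset).
calls = [0, 1, 1, 2, 3, 4, 5, 6]
churn = [0, 0, 0, 0, 0, 1, 1, 1]
print(best_threshold(calls, churn))
```

A full tree repeats this search over every variable at every node and keeps the split with the lowest impurity.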
9. SAS Code for CART (Classification & Regression Tree)
PROC HPSPLIT: SAS procedure that builds tree-based statistical models for classification and regression.
Fig 1.6
GROW statement: Specifies the criterion used to grow the tree, i.e. to choose the splits that minimize each node's error. Entropy is a common choice when growing a classification tree. Gini is another well-known criterion.
PRUNE statement: Specifies the method for pruning a tree into a smaller sub-tree. The most common method is cost-complexity pruning, in which the algorithm makes a trade-off between tree complexity and error rate.
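The two ideas behind GROW and PRUNE can be sketched in a few lines: entropy as a node-impurity measure, and a cost-complexity score that adds a per-leaf penalty to the error rate. The candidate sub-trees and the penalty `alpha` below are hypothetical values chosen for illustration, not output from PROC HPSPLIT.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def cost_complexity(error_rate, n_leaves, alpha):
    """Cost-complexity measure: error rate plus a penalty of alpha per leaf."""
    return error_rate + alpha * n_leaves

# Hypothetical candidate sub-trees as (number of leaves, error rate).
subtrees = [(2, 0.15), (6, 0.09), (12, 0.06), (30, 0.055)]
alpha = 0.001  # assumed penalty per leaf
best = min(subtrees, key=lambda s: cost_complexity(s[1], s[0], alpha))
print(entropy([0, 0, 1, 1]), best)
```

With a larger `alpha` the penalty dominates and a smaller sub-tree wins. With `alpha` near zero the largest tree is preferred.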
10. Results for CART ( Classification & Regression Tree )
Table 1.4 Fig 1.7
In Table 1.4, the split criterion and pruning method are as specified in our code, and the modeled level is ‘0’, which means the model is predicting No-Churn.
Fig 1.7 plots ASE (average squared error), or average misclassification rate, against cost-complexity. The vertical reference line marks the tree with the minimum ASE, which in this case has 19 leaves.
11. Fig 1.7
Fig 1.8 Fig 1.9
From Fig 1.9 we can clearly see the 4-stage sub-tree generated from the complete tree shown in Fig 1.8. The first split is on total_day_charge, followed by number_customer_service_calls and voice_mail_plan in the 2nd stage.
12. Fig 1.10: ROC Curve for dummy_churn (Training). Axes: 1 - Specificity (x) vs Sensitivity (y). Training AUC = 0.91
Table 1.5
Table 1.4
Type 1 Error
Type 2 Error
From Table 1.4 we can see that the model misclassifies No-Churn customers at a rate of 1.16% and Churn customers at a rate of 23.81%.
Total misclassification is 4.45%, i.e. the overall accuracy of this model is 95.55%, which is good.
From Table 1.5 we can see that only 9 of the 19 predictor variables are significant for model building. The table lists them in decreasing order of relative importance.
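The accuracy and error rates quoted above all come from the four cells of a confusion matrix. A minimal sketch, assuming the convention used here that a Type 1 error is a No-Churn customer flagged as Churn and a Type 2 error is a missed Churn customer. The counts are hypothetical and are not the values behind Table 1.4.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Return accuracy, Type 1 error rate and Type 2 error rate from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "type1_error": fp / (tn + fp),   # No-Churn customers predicted as Churn
        "type2_error": fn / (fn + tp),   # Churn customers predicted as No-Churn
    }

# Hypothetical counts for illustration only (NOT the values behind Table 1.4).
metrics = confusion_metrics(tn=900, fp=20, fn=30, tp=50)
print(metrics)
```

Because churners are a minority, a model can reach high overall accuracy while still missing a large share of them, so the Type 2 error rate has to be read alongside accuracy.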
13. Introduction to Logistic Regression
What is Logistic Regression ?
Logistic regression is a classification algorithm. It is used to predict a binary outcome (1 / 0, Yes / No, True / False) given a set of independent variables. To represent a binary / categorical outcome, we use dummy variables. We can also think of logistic regression as a special case of linear regression for a categorical outcome variable, in which the log of odds is used as the dependent variable. In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function.
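The relationship between a probability and its log of odds can be sketched directly. The function names below are our own.

```python
import math

def logit(p):
    """Log of odds for a probability p strictly between 0 and 1."""
    return math.log(p / (1 - p))

def inverse_logit(z):
    """Map a log-odds value z back to a probability."""
    return 1 / (1 + math.exp(-z))

print(logit(0.5), inverse_logit(0.0))
```

The model is linear on the log-odds scale, and `inverse_logit` converts its output back to a probability between 0 and 1.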
Important Points in GLM ( Generalized Linear Model )
Logistic Regression is part of a larger class of algorithms known as Generalized Linear Model (GLM).
GLM does not assume a linear relationship between dependent and independent variables. However, in the logit model it assumes a linear relationship between the link function and the independent variables.
The dependent variable need not be normally distributed.
It does not use OLS (Ordinary Least Squares) for parameter estimation. Instead, it uses maximum likelihood estimation (MLE).
Errors need to be independent but not normally distributed.
Performance Measure of Logistic regression Model
AIC (Akaike Information Criterion): AIC is the logistic regression analogue of adjusted R². It is a measure of fit that penalizes the model for its number of coefficients. Therefore, we always prefer the model with the minimum AIC value.
Confusion Matrix: A tabular representation of actual vs predicted values. It helps us find the accuracy of the model and, when computed on both train and test data, detect overfitting.
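The likelihood that MLE maximizes, and the AIC derived from it, can be sketched as follows. The outcomes, predicted probabilities and parameter count are made-up inputs for illustration, not output from PROC LOGISTIC.

```python
import math

def log_likelihood(y_true, p_pred):
    """Bernoulli log-likelihood of 0/1 outcomes given predicted probabilities."""
    return sum(math.log(p if y == 1 else 1 - p) for y, p in zip(y_true, p_pred))

def aic(log_lik, n_params):
    """Akaike Information Criterion: 2k - 2*ln(L)."""
    return 2 * n_params - 2 * log_lik

# Made-up outcomes and predicted probabilities (NOT model output).
y = [0, 0, 1, 1]
p = [0.1, 0.2, 0.7, 0.9]
ll = log_likelihood(y, p)
print(ll, aic(ll, n_params=2))
```

Adding a coefficient raises AIC by 2 unless it improves the log-likelihood by more than 1, which is how the criterion penalizes extra parameters.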
14. SAS code for Logistic Regression
The PROC LOGISTIC statement invokes the LOGISTIC procedure and optionally identifies input and output data sets, suppresses the display of results, and controls the ordering of the response levels.
Table 1.6
15. Results for Logistic Regression
Table 1.8
Table 1.7
Table 1.6
The important results obtained for the logistic regression algorithm are shown in Tables 1.6, 1.7 and 1.8.
From Table 1.6 we can see that our model is built with the response variable ‘Churn’ and that the optimization technique used is Fisher’s scoring.
AIC is a measure of the performance of the model. The high AIC value in this case indicates a loose fit, i.e. the accuracy of the model is expected to be low.
From the Maximum Likelihood Estimates table we can see that the predictor variables circled in red are significant at the 95% confidence level.
16. Final model, based on the results in the Maximum Likelihood Estimates (Table 1.8):
Logit = -8.6514 + 2.0427*(international_plan) - 2.0248*(voice_mail_plan) + 0.0359*(number_vmail_message) - 0.0930*(total_intl_calls) + 16.3896*(total_intl_charge) + 0.5136*(number_customer_serv)
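The equation above can be read through its odds ratios. The sketch below transcribes the coefficients as they appear on the slide, with variable names spelled as on the slide, and shows how the logit becomes a probability. The helper names are our own, and treating a missing feature as 0 is an assumption made for the illustration.

```python
import math

# Coefficients transcribed from the equation above; names are spelled as on the slide.
INTERCEPT = -8.6514
COEFFICIENTS = {
    "international_plan": 2.0427,
    "voice_mail_plan": -2.0248,
    "number_vmail_message": 0.0359,
    "total_intl_calls": -0.0930,
    "total_intl_charge": 16.3896,
    "number_customer_serv": 0.5136,
}

def linear_predictor(features):
    """Logit = intercept + sum of coefficient * feature value (missing features count as 0)."""
    return INTERCEPT + sum(b * features.get(name, 0.0) for name, b in COEFFICIENTS.items())

def churn_probability(features):
    """Convert the logit to a probability with the logistic function."""
    return 1 / (1 + math.exp(-linear_predictor(features)))

# Odds ratio per one-unit increase in each predictor, holding the others fixed.
odds_ratios = {name: math.exp(b) for name, b in COEFFICIENTS.items()}
print(round(odds_ratios["international_plan"], 2))
print(churn_probability({}))  # baseline: every predictor set to 0
```

An odds ratio above 1 (for example for international_plan) means the predictor raises the odds of churn, and a ratio below 1 (for example for voice_mail_plan) means it lowers them.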
Confusion Matrix on Train Data
Confusion Matrix on Test Data
Table 1.10 Table 1.9
Overall accuracy on the Train data is 89.19%, and the Type II error is 78.46%, which is very high.
Overall accuracy on the Test data is 87.40%, and the Type II error is 80.80%, which is very high.
So overall accuracy looks fine, but the Type II error is very high.
17. Conclusion & Recommendation
The overall accuracy of the CART model is 95.55%, with a Type II error of 23.81%.
The overall accuracy of the logistic regression model is approximately 87%, with a Type II error as high as 80.80%.
Based on these two key observations, we recommend using CART for telecom churn prediction.
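The Type II error rates behind this recommendation can be restated as the share of churners each model actually identifies. This is simple arithmetic on the figures reported above, and the dictionary labels are our own.

```python
# Type II error rates (in %) as reported in the text above.
type2_error = {
    "CART": 23.81,
    "Logistic (train)": 78.46,
    "Logistic (test)": 80.80,
}

# Sensitivity (recall on churners) is the complement of the Type II error rate.
sensitivity = {model: round(100 - err, 2) for model, err in type2_error.items()}
for model, value in sensitivity.items():
    print(f"{model}: {value}% of churners identified")
```

On these figures CART identifies roughly three out of four churners, while the logistic model identifies about one in five.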
Key Advantages of CART:
Easy to understand: Decision tree output is very easy to understand, even for people from a non-analytical background. No statistical knowledge is required to read and interpret it. Its graphical representation is very intuitive, and users can easily relate it to their hypotheses.
Less data cleaning required: It requires less data cleaning than some other modeling techniques. It is fairly robust to outliers and missing values.
Data type is not a constraint: It can handle both numerical and categorical variables.
Non-parametric method: A decision tree is considered to be a non-parametric method. This means that decision trees make no assumptions about the space distribution or the classifier structure.
Disadvantages
Overfitting: Overfitting is one of the main practical difficulties for decision tree models. It is addressed by setting constraints on model parameters and by pruning.
Not fit for continuous variables: When working with continuous numerical variables, a decision tree loses information when it bins the variable into categories.