The project analyzes customer complaints and inquiries received by a US-based mortgage (loan) servicing company.
The goal of the project is to build a predictive model using the identified significant
contributors and to recommend changes that will lead to
1. Reducing re-work
2. Reducing operational cost
3. Improving customer satisfaction
4. Improving the company's preparedness to respond to customers.
Three models were built: Logistic Regression, Random Forest and Gradient Boosting. Accuracy, AUC (area under the curve), sensitivity and specificity improved markedly as model complexity increased.
Logistic regression did not generalize well to the non-linear data, so the model suffered from both bias and variance. Random Forest is itself an ensemble technique and reduces variance to a great extent. Gradient Boosting, with its sequential learning, helps reduce bias. The results from Random Forest and Gradient Boosting did not differ by much. This is consistent with the bias-variance trade-off: on non-linear data, inflexible simple models have high bias (and can also show high variance), so more flexible models tend to do better.
Additionally, a lift chart was built, which shows a cumulative lift of 133% in the first four deciles.
Reduction in customer complaints - Mortgage Industry
1. Reduction in Customer Complaints: Mortgage Servicing Industry
By predicting which customers are likely to complain and taking
proactive steps to prevent those complaints.
Original Project Partners
Pranov Mishra
Aniket Chhabra
Vivek Chandel
Madhu Gollpudi
The code and cleaned dataset can be found in my GitHub account:
https://github.com/Pranov1984/Great-Lakes-Capstone-Project
2. Executive Summary
Project Overview:
The project analyzes customer complaints and inquiries received by a US-based mortgage (loan)
servicing company. The scope of the project is limited to complaints relating to the part of the servicing
life cycle concerned with Escrow Analysis and other related or subsidiary activities.
Goal Statement
Identify the major contributors towards complaints/inquiries, then use the identified significant
contributors to recommend changes or new implementations with the following goals:
▫ Reducing re-work
▫ Reducing operational cost
▫ Improving customer satisfaction
▫ Improving the company's preparedness to respond to customers
Data Considered
A few months of data on the organisation's standard servicing loans, comprising circa 154,000
records, was used for data exploration, visualization and hypothesis generation. The data had many missing
values, which typically arose when the event corresponding to the variable concerned did not
occur for that observation. The data was cleaned by creating dummy variables with no missing values.
The code and dataset provided at the GitHub link use the cleaned data.
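As a rough illustration of the cleaning step described above, the sketch below turns a column where "missing" means "the event did not occur" into a 0/1 dummy variable. It is written in Python for illustration only (the project's own code is in R), and the column name and values are hypothetical, not taken from the dataset.

```python
from typing import List, Optional


def to_event_indicator(value: Optional[float]) -> int:
    """Return 1 if the event was recorded (a value is present), else 0."""
    return 0 if value is None else 1


# Hypothetical raw column: None means the event never occurred for that loan.
raw_waiver_amounts: List[Optional[float]] = [None, 125.0, None, None, 40.5]

# The dummy variable has no missing values by construction.
waiver_flag = [to_event_indicator(v) for v in raw_waiver_amounts]
print(waiver_flag)  # [0, 1, 0, 0, 1]
```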
3. Executive Summary (continued)
• Escalations typically lead to extra work, reputational damage, and sometimes regulatory scrutiny and penalties.
Preventing customer complaints and escalations is in the best interest of the company.
• The data was highly imbalanced (majority class : minority class = 96% : 4%), so an appropriate model evaluation
metric had to be chosen. A combination of the F1 score (the harmonic mean of precision and recall) and Area Under the Curve (AUC)
was used to select the best model.
• The following models were tried in order to arrive at the best one:
▫ A simple model, Logistic Regression, with different classification thresholds
▫ Random Forest, after balancing the dataset using the Synthetic Minority Oversampling Technique (SMOTE)
▫ Stochastic Gradient Boosting, after balancing the dataset in the same way as for Random Forest
• The key insights from the best-performing model indicate that the variables that significantly impact
customer behaviour can be broadly classified as follows:
▫ Waiver of escrow payments, which could arise when an incorrect escrow analysis results in the customer
requesting a waiver of the extra charges levied.
▫ Presence of “Initials”, which comes into play when a customer is escrowed for the first time. The customer
could be escalating either because they were incorrectly escrowed or because the initial payment calculation for the escrow
services is incorrect.
▫ Inadequate handling of force-escrowed loans, leading to customer queries and complaints.
• A gains chart was prepared, which gave a cumulative lift of 133% in the first four deciles. The customers with the highest
probability of making a complaint were identified. The company could use this information to proactively review the
operations performed on these customers' accounts and correct any errors found.
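To make the point about the 96%:4% imbalance concrete, the sketch below compares accuracy with the F1 score on a hypothetical set of 100 customers. It is an illustrative Python example with made-up predictions, not the project's R code or its results.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(actual, predicted):
    tp, tn, _, _ = confusion_counts(actual, predicted)
    return (tp + tn) / len(actual)


def f1_score(actual, predicted):
    """Harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    tp, _, fp, fn = confusion_counts(actual, predicted)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


# Hypothetical sample mirroring the 96:4 split: 4 complaints, 96 non-complaints.
actual = [1] * 4 + [0] * 96

# A "model" that always predicts "no complaint".
always_no = [0] * 100
acc_naive = accuracy(actual, always_no)  # 0.96
f1_naive = f1_score(actual, always_no)   # 0.0

# A hypothetical model that catches 3 of 4 complaints with 5 false alarms.
model_pred = [1, 1, 1, 0] + [1] * 5 + [0] * 91
acc_model = accuracy(actual, model_pred)  # 0.94
f1_model = f1_score(actual, model_pred)   # 0.5

print(acc_naive, f1_naive, acc_model, f1_model)
```

The always-"no" classifier scores higher on accuracy yet finds no complaints at all, which is why accuracy alone is a poor guide on data this imbalanced.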
4. Brief Overview of Escrow Account
Escrow:
Money held by a third party on behalf of transacting parties
Escrow account:
An escrow account is established with a lender to pay for recurring
expenses related to one's property, such as real estate taxes and
homeowner’s insurance.
It helps the borrower anticipate and manage payment of these expenses
by including them as a portion of the monthly mortgage payment.
How does an escrow account work?
• When an escrow account is established, the customer’s
annual real estate taxes and homeowner’s insurance are estimated,
based on the customer’s most recent bills and premiums.
• An incremental amount for these expenses is added to the
customer’s monthly mortgage payment, in order to cover the
expenses when they are due.
• Each year, the escrow account is reviewed to determine whether the
amount being escrowed each month is sufficient to cover any
change in the customer’s real estate taxes or homeowner’s insurance
premiums.
• If a non-escrowed customer defaults on payment of taxes and
insurance, the lender advances payments to the respective agencies
to protect its rights on the property. The lender then force-escrows
the delinquent customer to recover the money.
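The mechanics above can be sketched with simple arithmetic. This Python example assumes the monthly escrow amount is one twelfth of the estimated annual bills and ignores any cushion or other rules a real escrow analysis may apply. The function names and figures are hypothetical.

```python
def monthly_escrow(annual_taxes: float, annual_insurance: float) -> float:
    """Assumed rule: spread the estimated annual bills evenly over 12 payments."""
    return (annual_taxes + annual_insurance) / 12


def annual_review(collected: float, actual_bills: float):
    """Compare what was escrowed over the year with what was actually due."""
    diff = collected - actual_bills
    if diff > 0:
        return "surplus", diff
    if diff < 0:
        return "shortage", -diff
    return "balanced", 0.0


# Hypothetical estimate at account set-up.
payment = monthly_escrow(annual_taxes=3600.0, annual_insurance=1200.0)
print(payment)  # 400.0

# At the yearly review the actual bills turned out higher than estimated.
status, amount = annual_review(collected=12 * payment, actual_bills=5100.0)
print(status, amount)  # shortage 300.0
```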
5. Analytical framework used to prepare the model
Data Treatment
• Missing value treatment
• Outlier treatment
• Removing inconsistencies
Exploratory Data Analysis
• Review of each variable and transforming the appropriate variables
Balancing Data set
• Event rate is highly skewed in favor of “No Complaints”.
• SMOTE* used to balance the data.
Data Partition
• Data was split in 70:30 ratio.
• Models were trained on 70% of the data and validated on 30%.
Model Building
• Logistic Regression
• Random Forest
• Stochastic Gradient Boosting
Validation: Evaluation Metrics
• Accuracy
• Sensitivity (TPR) and Specificity (TNR)
• Area Under the Curve (AUC)
* SMOTE: Synthetic Minority Oversampling Technique
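The balancing step can be pictured with a minimal SMOTE-style sketch: each synthetic minority record is placed at a random point on the line between a minority record and one of its nearest minority neighbours. This is an illustrative Python simplification, not the project's R code or a full SMOTE implementation. The function name, parameters and data points are assumptions.

```python
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority points.

    minority: list of numeric tuples (needs at least two points).
    k: number of nearest minority neighbours to choose from.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class records with two numeric features.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2)]
synthetic = smote_like(minority, n_new=8)
print(len(synthetic))  # 8
```

Because every synthetic point lies between two real minority records, the new records stay inside the region the minority class already occupies.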
6. Data Visualization & Data Preparation
Both graphs suggest factor variables: most of the data crowds around zero and the
remaining few points crowd around one, with no data points in between. The variables were
transformed to factors.
7. Data Visualization & Data Preparation
Both graphs suggest factor variables: most of the data crowds around zero and the
remaining few points crowd around one, with no data points in between. The variables were
transformed to factors. Additionally, the presence of Waiver appears to be associated with complaints received.
8. Data Visualization & Data Preparation
Both graphs suggest factor variables: most of the data crowds around zero and the
remaining few points crowd around one, with no data points in between. The variables were
transformed to factors. Additionally, the presence of Reversed appears to be even more strongly
associated with complaints received.
9. Data Visualization & Data Preparation
The variable looks like it should be numeric, but it has a high number of outliers. Most of the values are small, with
a few very high values. Close to 17% of observations are outliers. On further analysis by deciling, eight deciles,
which constitute 80% of the data, have a minimum and maximum of zero. Complaints are received only in the 9th and 10th deciles,
where surpluses are greater than 0. Essentially, there is a very high chance of a complaint/query when there is
a surplus. This suggests that the customer thinks the analysis is incorrect or that the surplus is not being returned
on time by the company. The variable was converted to a factor.
summary(mydata$Surplus)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     0.0      0.0      0.0     75.7      0.0 452213.2
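The deciling used for Surplus can be sketched as follows: sort the values, cut them into ten equal-sized groups and inspect each group's minimum and maximum. This is an illustrative Python example on hypothetical values that mimic the pattern described (mostly zeros with a few large surpluses). It is not the project's R code or data.

```python
def decile_summary(values):
    """Return (decile, min, max) for ten equal-sized groups of sorted values.

    Assumes at least ten values.
    """
    ordered = sorted(values)
    n = len(ordered)
    summary = []
    for d in range(10):
        chunk = ordered[d * n // 10:(d + 1) * n // 10]
        summary.append((d + 1, min(chunk), max(chunk)))
    return summary


# Hypothetical surplus values: 80% zeros, the rest positive and highly skewed.
surplus = [0.0] * 16 + [50.0, 120.0, 900.0, 4500.0]

for decile, lo, hi in decile_summary(surplus):
    print(decile, lo, hi)
# Deciles 1-8 print 0.0 0.0; decile 9 prints 50.0 120.0; decile 10 prints 900.0 4500.0
```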
10. Data Visualization & Data Preparation
Shortage Spread looks like a continuous variable but had outliers. The outliers were treated by capping the
extreme values so that all values lie between 0 and the 85th percentile of the actual values.
Whenever there is a shortage, there is a higher chance of a complaint.
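The outlier treatment for Shortage Spread can be sketched as capping values to the range from 0 to the 85th percentile. This illustrative Python example uses a nearest-rank percentile, which is an assumption; the project's R code may compute quantiles differently. The values are hypothetical.

```python
def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile for an integer pct (1-100), using integer arithmetic."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceiling of pct * n / 100
    return ordered[rank - 1]


def cap_values(values, lower=0, upper_pct=85):
    """Clamp every value into [lower, upper_pct-th percentile]."""
    upper = percentile_nearest_rank(values, upper_pct)
    return [min(max(v, lower), upper) for v in values]


# Hypothetical shortage spreads with a few extreme values.
shortage = [0, 0, 0, 0, 5, 10, 12, 15, 18, 20,
            22, 25, 30, 35, 40, 45, 50, 300, 2500, 90000]

capped = cap_values(shortage)
print(max(capped))  # 50
```

Only the three largest values change; everything at or below the 85th percentile is left as it was.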
11. Data Visualization & Data Preparation
With the numbers highly skewed in favor of NonBK analysis, a separate analysis restricted to NonBK analysis loans could be
considered.
12. Random Forest models gave the best results. Tuning the mtry parameter (the number of variables randomly sampled
as candidates at each split) improved the results marginally. The Gradient Boosting results were
nearly as good, only marginally lower than those from Random Forest. An important point to note is
that the tree-based models were trained on data that had been balanced using SMOTE (Synthetic
Minority Oversampling Technique). The logistic regression results were the least impressive.
14. Lift Chart
A cumulative lift of 133% is achieved in the first four deciles. This means that by selecting the
40% of customers ranked highest by the model, using the associated gains chart, we can identify more
than 50% (1.33 × 40% ≈ 53%) of the customers who are likely to complain. Without the model, a random selection of
40% of customers would be expected to capture only about 40% of the potential complainers.
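The relationship between cumulative gains and lift can be sketched as follows. Customers are ranked by predicted score, and for each decile the lift is the share of complainers captured divided by the share of customers selected, so a lift of 1.0 corresponds to selecting customers at random. This is an illustrative Python example. The scores and labels are hypothetical and were chosen so that the fourth decile reproduces a lift of about 1.33; they are not the project's data.

```python
def cumulative_gains(scores, labels, n_bins=10):
    """Return (bin, share_of_customers, share_of_positives_captured, lift) rows."""
    ranked = [label for _, label in
              sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)]
    total_pos = sum(ranked)
    n = len(ranked)
    rows = []
    for b in range(1, n_bins + 1):
        cutoff = b * n // n_bins
        share = cutoff / n
        captured = sum(ranked[:cutoff]) / total_pos
        rows.append((b, share, captured, captured / share))
    return rows


# Hypothetical data: 100 customers with distinct scores, 15 of whom complain.
scores = [100 - i for i in range(100)]
complainers = {0, 3, 7, 12, 18, 25, 31, 38, 45, 52, 60, 68, 77, 85, 93}
labels = [1 if i in complainers else 0 for i in range(100)]

rows = cumulative_gains(scores, labels)
decile, share, captured, lift = rows[3]
print(f"decile {decile}: share {share:.2f}, captured {captured:.3f}, lift {lift:.2f}")
# decile 4: share 0.40, captured 0.533, lift 1.33
```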