Whitepaper - Ready Reckoner: Probability of Default Modeling

Comparison of Credit Scoring Models for Probability of Default Estimation
White Paper by Rahul Dutta

Fierce competition amongst the banking and other financial sectors, as well as the recent
global financial crisis and the subsequent new regulatory environments, have brought
modelling credit scoring of business and personal loans to prominence. An accurate
estimation of the credit risk associated with customers has become paramount, as this
information assists financial institutions in deciding whether to grant credit to their
customers. As the demand for credit products rapidly increases and lenders consistently face
potential financial losses due to customers who are likely to default, it is important for
lenders to identify the main risk factors contributing to the probability of default, as well as
to predict the Probability of Default (PD) as accurately as possible. Several modelling
techniques are available for this analysis. In this paper we are going to compare different
credit scoring models for Probability of Default and Loss Given Default (LGD) estimation.
Executive Summary
Probability of Default and Loss Given Default analysis:
Probability of Default/Loss Given Default analysis is a method generally used by larger financial institutions
to calculate expected loss. A probability of default is assigned to a specific risk measure, per
guidance, and represents the expected likelihood of default as a percentage, measured most frequently by assessing past
dues. Loss Given Default measures the expected loss, net of any recoveries, expressed as a percentage, and
will be unique to the industry or segment.
When combined with the variable Exposure at Default (EAD) or current balance at default, the expected loss
calculation is deceptively simple:
\[ \text{Expected Loss} = PD \times LGD \times EAD \]
While the equation itself may be simple, deriving the variables requires in-depth analysis. PD and LGD
represent the past experience of a financial institution but also represent what an institution expects to
experience in the future. Expected loss, being a function of EAD, PD and LGD, depends on the estimates of
these quantities. EAD, PD and LGD can be estimated using various techniques and at different levels. EAD
can be estimated using simple OLS techniques or Survival Analysis, while LGD
can be predicted using either a regression model or decision trees. The most important part of this
calculation is the estimation of Probability of Default. The common methodology followed to estimate
PD is Logistic Modelling, which predicts whether a customer will default on the payment of a
particular debt. For example, if a bank provides Auto Loans to its customers, this method can
predict the probability that a particular customer defaults on that loan and, given that the
customer is a defaulter, LGD models help to identify the loss amount for the bank.
Estimation of all these components can be done at different levels. While Probability of Default can be
calculated for every customer, LGD and EAD can be calculated at an aggregated level. For example, after calculating
the probability of default of individuals, EAD and LGD can be calculated for different time periods or for
groups of individuals with the same loan amount, the same FICO score, etc. Finally, expected loss is calculated at
an aggregated level.
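As a minimal sketch of the aggregated calculation described above, the snippet below applies Expected Loss = PD x LGD x EAD to each segment and sums the results. The segment labels and all figures are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str    # hypothetical segment label, e.g. a FICO band
    pd: float    # probability of default, between 0 and 1
    lgd: float   # loss given default, as a fraction of exposure
    ead: float   # exposure at default, in currency units


def expected_loss(segment: Segment) -> float:
    """Expected loss for one segment: PD x LGD x EAD."""
    return segment.pd * segment.lgd * segment.ead


def portfolio_expected_loss(segments: list[Segment]) -> float:
    """Sum the segment-level expected losses into a portfolio figure."""
    return sum(expected_loss(s) for s in segments)


# Illustrative (made-up) auto-loan segments.
segments = [
    Segment("FICO 675-705", pd=0.04, lgd=0.50, ead=1_000_000.0),
    Segment("FICO 706-740", pd=0.02, lgd=0.40, ead=2_000_000.0),
]
total = portfolio_expected_loss(segments)
print(f"Portfolio expected loss: {total:,.0f}")
```

Grouping by FICO band is only one choice; the same function works for any grouping (vintage, loan amount, time period) as long as PD, LGD and EAD are estimated at that level.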

Instead of calculating Expected Loss at an aggregated level, we can also calculate it for individuals in the
following way.
Let us assume we want to calculate total expected loss for the next one year from an auto loan given by a
Bank to its customers. Assume for the i-th individual PD_it is the probability of default within t months, EAD_it
is the exposure at default and R_it is the recovery rate for the individual. So, for individual i at time t, expected
loss is PD_it * EAD_it * (1 - R_it). So, expected loss within the next one year for an individual is as follows:
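A hedged reconstruction of this sum, assuming a 12-month horizon and reading PD_it as the probability that individual i defaults in month t, so that the monthly terms do not overlap. The symbol EL_i is introduced here for the individual's one-year expected loss:

```latex
\[
EL_i \;=\; \sum_{t=1}^{12} PD_{it} \times EAD_{it} \times (1 - R_{it})
\]
```

If PD_it is instead read cumulatively, as the probability of default within t months, the monthly increment PD_it - PD_i,t-1 would take its place in each term.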
Clearly, this method gives us the expected loss values for each individual. Estimation of EAD_it and R_it can be
done in the following way.
• EAD_it can be expressed as a function of i and t. For example, suppose an individual's total loan amount is $100 with an
interest of 10% and a tenure of one year, with a monthly payment of $11. If the individual defaults after 3 months,
Exposure at Default (EAD) is $67, i.e. the $100 loan amount less three payments of $11.
• R_it can be calculated in a similar fashion from available financial values that are functions of
'i' and 't'.
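The EAD figure in the example above can be reproduced with a short sketch. It assumes the simple convention implied by the $67 result, namely that each payment reduces the original loan amount directly and interest accrual is ignored. The function names and the 40% recovery rate are hypothetical.

```python
def exposure_at_default(loan_amount: float, monthly_payment: float,
                        months_paid: int) -> float:
    """Outstanding balance when default occurs after `months_paid` payments.

    Simplifying assumption: payments are subtracted from the original loan
    amount and interest accrual is ignored.
    """
    return max(loan_amount - monthly_payment * months_paid, 0.0)


def loss_if_default(ead: float, recovery_rate: float) -> float:
    """Loss on default: exposure net of the recovered fraction."""
    return ead * (1.0 - recovery_rate)


ead = exposure_at_default(loan_amount=100.0, monthly_payment=11.0, months_paid=3)
print(ead)  # 67.0
loss = loss_if_default(ead, recovery_rate=0.40)  # hypothetical recovery rate
```

A production schedule would amortise principal and interest separately; the subtraction above is only meant to mirror the worked figure.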
Probability of Default for the individual, PD_it, can be predicted in several ways. Following are the techniques
to calculate Probability of Default (PD).
Logistic Regression
Model Structure:
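The Logistic Regression takes the following form:
\[ \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_K x_K \]
where 'p' is the probability of the event occurring, and each of the 'K' independent variables 'x_k' is weighted by a
coefficient 'β_k'.
The above equation can be written as:
\[ p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_K x_K)}} \]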
Interpretation: In logistic regression, a change in one factor changes the risk by an amount that is
proportional to the level of the other factors.
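To make the logistic form and its interpretation concrete, here is a minimal sketch that converts a linear score into a probability and checks that a one-unit change in a predictor multiplies the odds by exp(β), whatever the values of the other predictors. The coefficients and predictor names are hypothetical, not estimates from any data set.

```python
import math


def default_probability(intercept: float, coefs: dict[str, float],
                        x: dict[str, float]) -> float:
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum_k b_k * x_k)))."""
    score = intercept + sum(coefs[name] * x[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-score))


def odds(p: float) -> float:
    """Odds of default corresponding to probability p."""
    return p / (1.0 - p)


# Hypothetical coefficients for two illustrative predictors.
intercept = -2.0
coefs = {"loan_to_asset_ratio": 1.5, "months_employed": -0.02}

base = {"loan_to_asset_ratio": 0.8, "months_employed": 36.0}
bumped = dict(base, months_employed=37.0)   # one more month employed

p_base = default_probability(intercept, coefs, base)
p_bumped = default_probability(intercept, coefs, bumped)

# The odds ratio equals exp(coefficient) regardless of the other predictors.
odds_ratio = odds(p_bumped) / odds(p_base)
print(round(p_base, 4), round(odds_ratio, 4), round(math.exp(-0.02), 4))
```

The change in the probability itself depends on where the customer sits on the curve, which is why the effect of one factor varies with the level of the others.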
Data used: To identify defaulters by this method, the data used is usually account-level
(customer-level) data. At any given point in time, the model uses the past record to predict, given a
customer's values of the predictors, the chance that the customer will be a
defaulter.
Predictors are usually the following ones:
• Loan Amount Issued
• Asset Amount
• Loan to Asset ratio
• No. of months in books
• Down Payment made
• No. of months employed
• FICO score
• Total amount of loan taken till date etc.

Disadvantage of this procedure:
Although this method takes into account a significant amount of information and allows the coefficients of the
predictors to be interpreted, the downside to this approach is that it takes
into account closed cases only, that is, accounts where the loan has either been paid in full, not been paid, or
received no payment for a certain period (e.g. 90 days). In other words, accounts that have not paid any amount
for a certain period of time are deliberately treated as defaulters, and the model is then built
to predict the probability of default. From an overview of this analysis we can say that the behaviour of
accounts/customers of a particular loan product varies across two dimensions. The first is the set of predictor variables
mentioned above, which account for the variability between different customers and try to
quantify the probability of being a defaulter. The second dimension is time, which accounts for the change in
payment behaviour within a particular observation (customer). This time factor brings in the concept of
failure rate for a customer at any point in time.
So, the Logistic Modelling approach may help to identify or predict the defaulters, but it
cannot identify when someone will default, for instance 6 months after approval of the loan or after 2 years.
Knowing this would serve banking institutions considerably.
To resolve this issue the approach that we can follow is a survival analysis technique which is able to take
into account the time variant factors and will be able to answer the following questions:
i) Which borrower will default?
ii) When will that borrower default?
The advantages of the Survival Analysis method over Logistic Regression for credit scoring are as follows:
i) Survival models naturally match the loan default process,
ii) They give a clearer approach to assessing the likely profitability of an applicant, and
iii) Survival estimates provide a forecast as a function of time.
We can understand this with an example.
We expect that rises in interest rates may increase the risk of an individual failing to make payments. This can
be due to increased payment demands on loans and mortgages as well as outstanding credit card debt. The
scenario is similar for other economic indicators such as the unemployment index, property prices etc. Since these variables
are time variant, it is quite complicated to include them in a Logistic Regression setup, whereas in Survival
Analysis they can be easily incorporated.
Survival analysis provides the predicted distribution of 'T' (Time to default) along with a number of other
advantages:
• First, it provides a consistent means of predicting probability of default within many different periods
of time (e.g., 12 month default rate, 24 month default rate, etc.).
• Second, it possesses an inherent mechanism for taking into consideration the most recent data. By
contrast, if one wishes to predict the probability of default within 24 months using Logistic Regression,
customers joining within the past 24 months cannot be included while fitting the model.
• Third, it provides comprehensive information on the predicted behaviour of 'T' via its predicted
distribution.

Proportional Hazard Model:
The objective here is to model the time of default for a particular customer. Let us assume T denotes the
random variable for time to default.
Let f(t) and F(t) denote the Probability Density Function (PDF) and Cumulative Distribution Function (CDF) of
the time 'T' to default (T=0 corresponds to the time of approval of the loan).
The hazard function is then defined as
\[ h(t) = \frac{f(t)}{1 - F(t)} \]
and is interpreted as the instantaneous likelihood
of defaulting at time 't', given that the customer has not defaulted prior to time 't'. From the definition of the
hazard function, it can be shown that
\[ F(t) = 1 - \exp\left( -\int_0^t h(s)\, ds \right) \]
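The step from the hazard to the CDF can be sketched as follows, using the standard definition h(t) = f(t)/(1 - F(t)). The survival function S(t) = 1 - F(t) is notation introduced here for convenience:

```latex
\begin{align*}
h(t) &= \frac{f(t)}{S(t)} = -\frac{d}{dt}\,\ln S(t), \qquad S(t) = 1 - F(t),\\
\int_0^t h(s)\,ds &= -\ln S(t) + \ln S(0) = -\ln S(t) \quad \text{since } S(0) = 1,\\
F(t) &= 1 - \exp\!\left(-\int_0^t h(s)\,ds\right).
\end{align*}
```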
Let x_1, x_2, ..., x_M denote a set of 'M' predictor variables for an applicant, and define the predictor vector
x = [x_1, x_2, ..., x_M]'.
In survival analysis, perhaps the most popular way to allow the distribution of 'T' to depend on a set of
predictor variables is through a PH survival model, defined below:
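Denoting the hazard function for a customer with predictors 'x' by h(t; x) to indicate its explicit dependence
on x, a PH survival model represents:
\[ h(t; \mathbf{x}) = h_0(t)\, \exp(\boldsymbol{\beta}' \mathbf{x}) \]
where h_0(t) is the baseline hazard and β is a vector of coefficients. The baseline function, f_0(t), can follow a
standard parametric distribution, e.g. Exponential, Weibull, Log Normal etc.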
A quantity that can be extracted from the predicted distribution is the probability that an applicant will default
within a specific time period, which is what Logistic Regression produces. For example, the predicted
probability of default within 24 months is simply F(24; x), which is the predicted CDF evaluated at month t=24.
Conceptually, this is the area under the predicted PDF f(t; x) between t=0 and t=24 in the chart.
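As a minimal sketch of how F(t; x) yields default probabilities for several horizons, the code below assumes a PH model with a Weibull baseline, for which the baseline cumulative hazard is (t/scale)^shape. All parameter values and the two predictors are hypothetical.

```python
import math


def baseline_cumulative_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull baseline: H0(t) = (t / scale) ** shape."""
    return (t / scale) ** shape


def default_cdf(t: float, x: list[float], beta: list[float],
                shape: float, scale: float) -> float:
    """F(t; x) = 1 - exp(-H0(t) * exp(beta'x)) under proportional hazards."""
    risk = math.exp(sum(b * xi for b, xi in zip(beta, x)))
    return 1.0 - math.exp(-baseline_cumulative_hazard(t, shape, scale) * risk)


# Hypothetical parameters: time in months, two standardised predictors.
shape, scale = 1.2, 120.0
beta = [0.5, -0.3]
applicant = [1.0, 0.5]

pd_12 = default_cdf(12, applicant, beta, shape, scale)
pd_24 = default_cdf(24, applicant, beta, shape, scale)
print(f"12-month PD: {pd_12:.4f}, 24-month PD: {pd_24:.4f}")
```

One fitted model serves every horizon: changing the argument t is all that is needed to move from a 12-month to a 24-month default rate.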
But the above approach does not include the impact of different economic scenarios, i.e. what will be the
change in default rate of the customers joining in different periods of time? To take into account that time
variability, the Time Dependent Proportional Hazard (TDPH) model is of great use.

Time Dependent Proportional Hazard Model:
To talk about the time factor in a default modelling scenario let's look at the following chart which shows the
percentage of customers who defaulted within the first 9 months of the loan tenure. Three vintages were
considered (i.e., customers joining in the three different quarters: Quarter 2 of 2004, Quarter 4 of 2005, and
Quarter 4 of 2007). The customers in all three vintages fell into the same FICO scoring band (between 675 and
705). Hence, if one ignored market trends and attempted to predict default probability based only on the
applicants' predictor variables, one could naively conclude that customers in the three vintages all have the
same default probability. In reality, the figure shows that the default probability is much higher for the Q4
2007 vintage, because of the severe economic downturn in 2008.
To account for such temporal effects, one potential approach is to incorporate macroeconomic variables into
the PH survival model.
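One way such an extension might be written, as a sketch only: let τ be the calendar month in which the customer joins, t the time since joining, h_0 and β the baseline hazard and coefficients of a standard PH model, and z(·) a vector of macroeconomic variables indexed by calendar time with coefficient vector γ. The symbols z and γ are notation assumed here for illustration and are not taken from the paper.

```latex
\[
h(t; \mathbf{x}, \tau) \;=\; h_0(t)\,
\exp\!\bigl( \boldsymbol{\beta}' \mathbf{x} + \boldsymbol{\gamma}' \mathbf{z}(\tau + t) \bigr)
\]
```

In this form, two customers with identical x but different joining months τ face different hazards at the same loan age t, because the macroeconomic conditions at calendar time τ + t differ.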
For a customer that joins during month 'τ', denote their hazard function by h(t; x, τ) to explicitly indicate its
dependence not only on x, but also on the time at which the customer joins. In the TDPH survival model, one
represents,
Parameter estimation:
Maximum Likelihood estimation method can be used to estimate the parameters for the equation.
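To illustrate maximum likelihood with censored observations, consider the simplest special case: a constant hazard λ with no predictors. This is an assumption made for the sketch, not the model described above. With d_i = 1 for an observed default and d_i = 0 for a censored account observed for t_i months, the log-likelihood is the sum of d_i ln λ - λ t_i, which is maximised at λ = Σ d_i / Σ t_i.

```python
import math


def log_likelihood(rate: float, durations: list[float], events: list[int]) -> float:
    """Censored exponential log-likelihood: sum(d_i * ln(rate) - rate * t_i)."""
    return sum(d * math.log(rate) - rate * t for t, d in zip(durations, events))


def mle_rate(durations: list[float], events: list[int]) -> float:
    """Closed-form maximiser: observed defaults divided by total time at risk."""
    return sum(events) / sum(durations)


# Hypothetical accounts: months observed, and 1 = default seen, 0 = censored.
durations = [6.0, 24.0, 18.0, 24.0, 12.0]
events = [1, 0, 1, 0, 0]

rate_hat = mle_rate(durations, events)
pd_12 = 1.0 - math.exp(-rate_hat * 12)   # implied 12-month default probability
print(f"hazard per month: {rate_hat:.4f}, 12-month PD: {pd_12:.4f}")
```

With predictors and a non-constant baseline there is generally no closed form, and the same likelihood is maximised numerically.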
Data Structure:
Unlike the Logistic Modelling procedure, the PH or TDPH model does not require data for a particular time
window. For example, in logistic modelling, if we want to model the probability of default for 12 months we
cannot use any data from the year prior to the time of data collection. TDPH/PH models do not have
this problem. The basic difference between the data structures of the Logistic Model and the PH model is the censor
variable. Suppose the data collected covers 2002 to 2008 and we want to find the probability
of default for a loan with a cut-off point of 24 months. We create a binary (0/1) variable using the cut-off
point: '0' for accounts where either the loan tenure is over or payment has been made, and '1' for
censored accounts, i.e. those that are still open. In addition, some macroeconomic factors like
unemployment rate, house price index, interest rate etc. are included. The macroeconomic factors capture the
variability between customers whose loans were approved at different points in time but who have the same kind of
loan information (e.g. Loan Amount, Tenure, Credit History etc.); this loan information corresponds to the
time dynamic part of the data.
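A minimal sketch of this data preparation follows. The record layout, field names and month indices are hypothetical. The censor flag follows the 0/1 convention described above (1 for accounts still open at the end of the data window) and durations are capped at the 24-month cut-off. Treating an open account that has already passed the cut-off as no longer censored is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Optional

CUTOFF_MONTHS = 24
DATA_END = 84  # month index of Dec 2008 when Jan 2002 is month 0 (hypothetical)


@dataclass
class Account:
    open_month: int              # month index at which the loan was approved
    close_month: Optional[int]   # month of payoff or default; None if still open
    defaulted: bool


def survival_record(acc: Account) -> dict:
    """Turn one account into a duration, a default flag and a censor flag."""
    last_seen = DATA_END if acc.close_month is None else acc.close_month
    observed = last_seen - acc.open_month
    duration = min(observed, CUTOFF_MONTHS)
    default = int(acc.defaulted and observed <= CUTOFF_MONTHS)
    # Assumption: an open account is censored only if seen for < 24 months.
    censor = int(acc.close_month is None and observed < CUTOFF_MONTHS)
    return {"duration": duration, "default": default, "censor": censor}


accounts = [
    Account(open_month=10, close_month=19, defaulted=True),     # early default
    Account(open_month=20, close_month=56, defaulted=False),    # paid off
    Account(open_month=75, close_month=None, defaulted=False),  # recent, open
]
records = [survival_record(a) for a in accounts]
for r in records:
    print(r)
```

The third account was opened only nine months before the end of the window. A 24-month logistic model would have to drop it, whereas here it enters as a censored observation.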
Sampling Methods:
Biased sampling methods can be used to draw the training sample.

Generalized Additive Modelling Approach for Probability of
default and Loss given default modelling:
There are several credit scoring methods to calculate the probability of default and loss given default, such as
the Logit Model, Divergence-Discriminant Method, Neural Networks, Proportional Hazard model etc. Most of
the known methods are parametric and involve a distributional assumption to build the model, which
might not always be a good choice, given the dynamic nature of the economy and of customer behaviour.
To counter that effect, a semi-parametric approach can be taken, in which the
parametric part takes care of the conventional aspect of the predictors and the non-parametric part takes
care of the remainder, which does not follow any known distribution. A suitable modelling
approach of this kind is Generalized Additive Modelling.
In statistics, a Generalized Additive Model (GAM) is a generalized linear model in which the linear predictor
depends linearly on unknown smooth functions of some predictor variables, and interest focuses on
inference about these smooth functions.
The model relates a univariate response variable, 'Y', to some predictor variables, x_i. An exponential family
distribution is specified for 'Y' (for example normal, binomial or Poisson distributions) along with a link
function g (for example the identity or log functions) relating the expected value of Y to the predictor
variables via a structure such as:
\[ g(E(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_m(x_m) \]
In the widely-used parametric models, the relationships between continuous predictor variables and the
response variable are assumed to be known functional forms, even though they are mostly unknown in many
empirical applications. By contrast, the semi-parametric GAM does not assume any functional forms for
these relationships, but the data is allowed to determine them. The data-driven cubic B-spline (non-
parametric) method is used to estimate the GAM. By doing so, the underlying true relationships between
continuous predictor variables and the binary response variable may be uncovered.
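The cubic B-spline building block can be sketched in a few lines using the Cox-de Boor recursion. This shows only how a smooth term f(x) is represented as a weighted sum of basis functions. The knot positions and weights are hypothetical, and estimating the weights (typically with a smoothness penalty) is left to a statistical package.

```python
def bspline_basis(i: int, degree: int, knots: list[float], x: float) -> float:
    """Value at x of the i-th B-spline basis function (Cox-de Boor recursion)."""
    if degree == 0:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left = right = 0.0
    span_left = knots[i + degree] - knots[i]
    if span_left > 0:
        left = (x - knots[i]) / span_left * bspline_basis(i, degree - 1, knots, x)
    span_right = knots[i + degree + 1] - knots[i + 1]
    if span_right > 0:
        right = ((knots[i + degree + 1] - x) / span_right
                 * bspline_basis(i + 1, degree - 1, knots, x))
    return left + right


def smooth_term(x: float, weights: list[float], knots: list[float]) -> float:
    """f(x) = sum_m weights[m] * B_m(x) with cubic (degree 3) basis functions."""
    return sum(w * bspline_basis(m, 3, knots, x) for m, w in enumerate(weights))


# Clamped knot vector on [0, 1): 11 knots give 11 - 3 - 1 = 7 cubic basis functions.
knots = [0.0, 0.0, 0.0, 0.0, 0.25, 0.5, 0.75, 1.0, 1.0, 1.0, 1.0]
weights = [0.2, -0.1, 0.4, 0.9, 0.3, -0.2, 0.1]   # hypothetical weights

basis_at_x = [bspline_basis(m, 3, knots, 0.3) for m in range(len(weights))]
print(round(sum(basis_at_x), 6))   # basis functions sum to 1 inside the knot range
f_value = smooth_term(0.3, weights, knots)
```

Because the shape of f is carried entirely by the weights, the data rather than a pre-specified formula determine the relationship between a predictor and the response.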
Performance Measures:
Performance measures for PD, LGD models are universal. Following performance measures can be used to
validate the model:
Model Validation Techniques:
The usual model validation techniques for a Logistic Regression model, like KS Analysis or the ROC curve, can also be
applied in case of a Proportional Hazard model or a Time Dependent Proportional Hazard Model.
Following is a snapshot comparison of ROC curves for the following four procedures:
• Time Dependent Proportional Hazard Model
• Proportional Hazard Model
• Logistic Regression
• Logistic Regression with a dynamic time component

Receiver Operating Characteristic Curve:
ROC is commonly used to determine the overall classification power as well as to provide information
on the performance of a model at any cut-off score point. A widely used simple way to measure the
performance of a binary classifier system is a 2×2 contingency table of type I and type II errors. For a
given cut-off point score of 0.5, if the estimated probability is over 0.5, it is classified as a default or bad
loan.
This is a snapshot of a ROC curve.
The other measures for validating the performance of a model are:
• Area Under Curve (AUC): B+C, reflecting total accuracy of the model
• Gini Coefficient: B/(A+B)
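The validation measures above can be computed directly from scores and outcomes. The sketch below builds the 2×2 table at a cut-off, computes AUC as the share of (defaulter, non-defaulter) pairs that the model ranks correctly, and derives the KS statistic. The Gini coefficient is taken as 2·AUC - 1, the usual identity, which is assumed here to correspond to the area ratio read off the ROC chart. The scores and labels are made up.

```python
def confusion_table(scores, labels, cutoff=0.5):
    """2x2 counts at a cut-off: score > cutoff is classified as default (1)."""
    tp = sum(1 for s, y in zip(scores, labels) if s > cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > cutoff and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= cutoff and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s <= cutoff and y == 0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


def auc(scores, labels):
    """Share of (defaulter, non-defaulter) pairs ranked correctly; ties count 1/2."""
    bad = [s for s, y in zip(scores, labels) if y == 1]
    good = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if b > g else (0.5 if b == g else 0.0)
               for b in bad for g in good)
    return wins / (len(bad) * len(good))


def ks_statistic(scores, labels):
    """Largest gap between the defaulter and non-defaulter capture rates."""
    bad = [s for s, y in zip(scores, labels) if y == 1]
    good = [s for s, y in zip(scores, labels) if y == 0]
    gaps = []
    for c in set(scores):
        tpr = sum(s >= c for s in bad) / len(bad)
        fpr = sum(s >= c for s in good) / len(good)
        gaps.append(abs(tpr - fpr))
    return max(gaps)


scores = [0.9, 0.8, 0.4, 0.3, 0.2]   # hypothetical predicted PDs
labels = [1, 0, 1, 0, 0]             # 1 = defaulted, 0 = did not default

table = confusion_table(scores, labels)
auc_value = auc(scores, labels)
gini = 2 * auc_value - 1
ks = ks_statistic(scores, labels)
print(table, round(auc_value, 4), round(gini, 4), round(ks, 4))
```

The pairwise AUC is quadratic in the number of accounts, which is fine for a sketch; large portfolios would normally use a rank-based computation.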
Data Structure:
Similar to Logistic Modelling technique, this method also deals with closed accounts only; where, if one
wants to predict the probability of default within 12 months, data used should be at least one year preceding
the time of scoring. The following traits should be captured at a customer/account level: monthly income
(INC), debt-to-equity ratio (DE), the amount of loan (FND), monthly payment (MPM), and revolving credit
line utilization (UTIL), year(s) of employment experience (EMP), housing ownership (HOM) and delinquency
(DEL) reports within the recent history.
Application of GAM: GAM, a semi-parametric method proposed by Hastie and Tibshirani (1990), has been
applied to modelling bankruptcies (Berg 2007). It has also been applied to a comprehensive survey on the loan
recovery process of Italian banks.
References:
• A time-dependent proportional hazards survival model for credit risk analysis, May 2011 - J-K Im, DW Apley, C Qi and X Shan,
Department of Industrial Engineering & Management Sciences, Northwestern University
• Credit Scoring With Macroeconomic Variables Using Survival Analysis, May 2007 - Tony Bellotti and Jonathan Crook, Credit
Research Centre, Management School and Economics, University of Edinburgh
• Survival Analysis Methods For Personal Loan Data, April 2002 - Maria Stepanova, UBS AG, Financial Services Group, and Lyn
Thomas, Department of Management, University of Southampton
• A case study on using generalized additive models to fit credit rating scores, 2011 - Marlene Müller, Beuth Hochschule für
Technik Berlin
• Nonlinear and Semi-parametric Modelling of Personal Loan Credit Scoring, August 2013 - Nithi Sopitpongstorn, Jean-Pierre
Fenech and Param Silvapulle, Department of Econometrics and Business Statistics and Department of Accounting and Finance,
Monash University, Australia

Office
Bangalore: 389, 2nd Floor, 9th Main, HSR Layout, Sector – 7, Bangalore – 560 102
Phone: +91-80-42102154
US: 1013 Centre Road, ST # 403S, Wilmington, New Castle, DE 19805
Phone: +1 858 312 1075
www.bridgei2i.com | enquiries@bridgei2i.com
Facebook | Twitter | Google+ | LinkedIn: BRIDGEi2i
About BRIDGEi2i
BRIDGEi2i provides Business Analytics Solutions to enterprises globally, enabling them to achieve
accelerated business impact harnessing the power of data. These analytics services and technology
solutions enable business managers to consume more meaningful information from big data, generate
actionable insights from complex business problems and make data driven decisions across pan-
enterprise processes to create sustainable business impact. BRIDGEi2i has featured among the top 10
analytics and big data start-ups in several coveted publications.
