Identify prospects from the credit data set SMALL using data mining techniques
Data set: SMALL
• 145 Variables
• 8,000 observations
Tools Used:
• SAS Enterprise Miner Workstation 7.1
• SAS 9.3_M1
Steps involved:
• Data Quality Check
• Data Partition - TRAIN/ VALIDATE/ TEST
• Mining using Decision Trees - CHAID/ Pruned CHAID/ CART/ C4.5
• Data Mining using Regression - Forward/ Backward/ Stepwise
• Data Mining using Regression with Interaction terms included
• Data Mining using Neural Network
• Model Comparison and Scoring
Final Model Selection Analysis based on:
• LIFT Chart
• ROC Curve
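The TRAIN/VALIDATE/TEST partition listed above is done in Enterprise Miner's Data Partition node. As an illustrative sketch only (the 60/30/10 split ratios and the seed here are assumptions, not taken from the project), a seeded random partition can be written as:

```python
import random

def partition(rows, ratios=(0.6, 0.3, 0.1), seed=12345):
    """Randomly split rows into TRAIN/VALIDATE/TEST subsets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)          # reproducible shuffle
    n = len(rows)
    n_train = int(n * ratios[0])
    n_valid = int(n * ratios[1])
    return (rows[:n_train],                    # TRAIN
            rows[n_train:n_train + n_valid],   # VALIDATE
            rows[n_train + n_valid:])          # TEST

train, valid, test = partition(range(8000))
print(len(train), len(valid), len(test))       # 4800 2400 800
```

Enterprise Miner can additionally stratify the partition on RESP_FLG so each subset keeps the same responder rate; this sketch omits that step.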
2. Project Goals
• Goal: Using the historical credit data set SMALL, develop a model that can predict whether a prospect will respond to a future marketing campaign
• Scope: SMALL data set
– 145 Variables
– 8,000 observations
– Dependent Variable: RESP_FLG (Binary)
• Responder: 1
• Non-Responder: 0
3. Tools
• SAS Enterprise Miner Workstation 7.1
• SAS 9.3_M1
4. Variable Definitions

Variable          Definition                      Type
AAL01 - AAL17     All Types                       Char
AAU01 - AAU07     Auto                            Char
ABK01 - ABK15     Bankcard                        Char
ACE01 - ACE03     Cust Elim                       Char
ACL02 - ACL12     Collection                      Char
ADI01 - ADI09     Derog By Ind                    Char
AEQ01 - AEQ07     Home Equity                     Char
AHI01 - AHI05     Historical                      Char
AIN01 - AIN15     Installment                     Char
AIQ01 - AIQ05     Inquiries                       Char
ALE01 - ALE07     Lease                           Char
ALN01 - ALN07     LN Finance                      Char
AMG01 - AMG07     Mortgage                        Char
APR17 - APR21     Public REC                      Char
ART01 - ART15     Retail                          Char
ARV01 - ARV15     Revolving                       Char
CUS04             Customer Data                   Char
SCORE01           FICO                            Num
SCORE02           MDS (Market Derived Signals)    Num
RESP_FLG          Responder Flag                  Num
5. Data Cleaning
• Dataset SMALL has missing values for variables SCORE01 (FICO) and SCORE02 (MDS)

data mylib.small_clean mylib.small_bad;
  set mylib.small;
  if score01 = . or score02 = . then output mylib.small_bad;
  else output mylib.small_clean;
run;

LOG:
NOTE: There were 8000 observations read from the data set MYLIB.SMALL.
NOTE: The data set MYLIB.SMALL_CLEAN has 5782 observations and 145 variables.
NOTE: The data set MYLIB.SMALL_BAD has 2218 observations and 145 variables.

• Going forward, we will use dataset SMALL_CLEAN
• Investigate separately why 2218 observations had missing values for SCORE01 and SCORE02
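The DATA step above routes complete cases to SMALL_CLEAN and rows with a missing SCORE01 or SCORE02 to SMALL_BAD. The same split can be sketched in Python (illustrative only, using `None` for SAS's numeric missing value `.` and dicts for rows):

```python
def split_missing(rows):
    """Mimic the SAS DATA step: rows with a missing SCORE01 or
    SCORE02 go to `bad`; complete cases go to `clean`."""
    clean, bad = [], []
    for row in rows:
        if row["score01"] is None or row["score02"] is None:
            bad.append(row)
        else:
            clean.append(row)
    return clean, bad

# tiny hypothetical sample of the SMALL data set
rows = [{"score01": 700,  "score02": 5},
        {"score01": None, "score02": 3},
        {"score01": 650,  "score02": None}]
clean, bad = split_missing(rows)
print(len(clean), len(bad))   # 1 2
```

On the real data this reproduces the log counts above: 5782 clean rows and 2218 rows with at least one missing score.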
10. Model: Maximum CHAID
On the left side of the tree, the percentage of 1's (i.e., Respondents) is higher; hence prospects with a FICO score < 700.5 who fall in the (0, 4, 5, Missing) category of "RETAIL: BAL > 0 IN 6 MNTHS, ALL" are predicted to respond to the marketing campaign
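The leaf described above combines two splits: the FICO cutoff at 700.5 and the retail-balance category. As a hedged reading of that slide (the tree itself lives in Enterprise Miner; function and category encoding here are illustrative), the predicted-responder rule is:

```python
def predict_responder(fico, retail_bal_cat):
    """Leaf rule from the maximum CHAID tree: prospects with
    FICO < 700.5 whose 'RETAIL: BAL > 0 IN 6 MNTHS, ALL' category
    is 0, 4, 5, or Missing (None) are predicted to respond."""
    return fico < 700.5 and retail_bal_cat in ("0", "4", "5", None)

print(predict_responder(680, "4"))   # True
print(predict_responder(720, "4"))   # False
```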
62. Neural Network: Average Square Error
If we increase the number of iterations, the average square error decreases for the TRAIN data set but increases for the VALIDATE data set
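The pattern above, where training error keeps falling while validation error turns back up, is the classic overfitting signal; the network should stop at the iteration with the lowest validation error. A minimal sketch of that stopping rule (the ASE values are hypothetical):

```python
def best_iteration(valid_ase):
    """Return the 1-based iteration with the lowest validation
    average squared error -- where training should stop."""
    return min(range(len(valid_ase)), key=lambda i: valid_ase[i]) + 1

# hypothetical per-iteration validation ASE values
valid_ase = [0.25, 0.22, 0.20, 0.21, 0.23]
print(best_iteration(valid_ase))   # 3
```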
63. Ensemble Node
Select the model that performs best in:
– Decision Trees
– Regression
– Regression with Interaction Terms
Build an Ensemble Node on:
– Pruned CHAID
– Forward Regression
– Forward Regression with Interaction Terms
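An Ensemble node combines its input models by averaging their posterior probabilities. A sketch of that combination for the three models above (equal weighting assumed; the probability values are hypothetical):

```python
def ensemble_posterior(p_chaid, p_forward, p_forward_int):
    """Average the posterior response probabilities of the three
    component models: Pruned CHAID, Forward Regression, and
    Forward Regression with interaction terms."""
    return (p_chaid + p_forward + p_forward_int) / 3

print(round(ensemble_posterior(0.30, 0.24, 0.27), 2))   # 0.27
```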
66. Model Comparison
• ASSESSMENT REPORTS - NUMBER OF BINS = 50
• MODEL SELECTION - SELECTION STATISTIC = MISCLASSIFICATION RATE
• Comparing LIFT at the top 20%
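As a sketch of the two assessment statistics above: lift at the top 20% is the response rate among the best-scored fifth of prospects divided by the overall response rate, and the misclassification rate is the fraction of wrong predicted classes. (Illustrative only; the scores and responses below are made up, not from the SMALL data.)

```python
def lift_at(scores, responses, depth=0.20):
    """Lift at the top `depth` fraction: response rate among the
    highest-scored prospects over the overall response rate."""
    ranked = [r for _, r in sorted(zip(scores, responses), reverse=True)]
    top = ranked[:int(len(ranked) * depth)]
    return (sum(top) / len(top)) / (sum(responses) / len(responses))

def misclassification_rate(predicted, actual):
    """Fraction of prospects whose predicted class is wrong."""
    return sum(p != a for p, a in zip(predicted, actual)) / len(actual)

scores    = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
responses = [1,   1,   0,   0,   1,   0,   0,   0,   0,   0]
print(round(lift_at(scores, responses), 2))   # 3.33 (top fifth all respond)
print(misclassification_rate([1, 1, 1, 0, 0, 0, 0, 0, 0, 0], responses))   # 0.2
```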
71. Model Comparison: Conclusion
• TRAIN:
– Ensemble works best, followed by Forward Regression
– Check the VALIDATE and TEST results to finalize the model
• VALIDATE and TEST:
– Forward Regression works better than Ensemble
72. Final Model
• Forward Regression
• List of Variables:
– AAL11
– ACE01
– AEQ01
– AEQ07
– AHI01
– ALN01
– AMG01
– AMG07
– APR20
– ART11
– LOG_SCORE01
– AEQ02
73. SCORE
• In the Model Comparison node: SCORE -> SELECTION EDITOR
• Set YES for Forward Regression and NO for Stepwise Regression with Interaction terms (which was selected by default)
• Connect Model Comparison to the SCORE node, and run it
• Get the optimized SAS score code