Identify prospects from the credit data set SMALL using data mining techniques
Data set: SMALL
• 145 Variables
• 8,000 observations
Tools Used:
• SAS Enterprise Miner Workstation 7.1
• SAS 9.3_M1
Steps involved:
• Data Quality Check
• Data Partition - TRAIN/ VALIDATE/ TEST
• Mining using Decision Trees - CHAID/ Pruned CHAID/ CART/ C4.5
• Data Mining using Regression - Forward/ Backward/ Stepwise
• Data Mining using Regression with Interaction terms included
• Data Mining using Neural Network
• Model Comparison and Scoring
Final Model Selection Analysis based on:
• LIFT Chart
• ROC Curve
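The TRAIN/VALIDATE/TEST partition listed above is done in Enterprise Miner's Data Partition node. As an illustrative sketch only (the 60/30/10 split ratios and the seed here are assumptions, not taken from the project), a seeded random partition can be written as:

```python
import random

def partition(rows, ratios=(0.6, 0.3, 0.1), seed=12345):
    """Randomly split rows into TRAIN/VALIDATE/TEST subsets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)          # reproducible shuffle
    n = len(rows)
    n_train = int(n * ratios[0])
    n_valid = int(n * ratios[1])
    return (rows[:n_train],                    # TRAIN
            rows[n_train:n_train + n_valid],   # VALIDATE
            rows[n_train + n_valid:])          # TEST

train, valid, test = partition(range(8000))
print(len(train), len(valid), len(test))       # 4800 2400 800
```

Enterprise Miner can additionally stratify the partition on RESP_FLG so each subset keeps the same responder rate; this sketch omits that step.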
2. Project Goals
• Goal: Using the historical credit data set SMALL, develop a model that can predict whether a prospect will respond to a future marketing campaign
• Scope: SMALL data set
– 145 Variables
– 8,000 observations
– Dependent Variable: RESP_FLG (Binary)
• Responder: 1
• Non-Responder: 0
3. Tools
• SAS Enterprise Miner Workstation 7.1
• SAS 9.3_M1
4. Variable Definitions

Variable          Definition                      Type
AAL01 - AAL17     All Types                       Char
AAU01 - AAU07     Auto                            Char
ABK01 - ABK15     Bankcard                        Char
ACE01 - ACE03     Cust Elim                       Char
ACL02 - ACL12     Collection                      Char
ADI01 - ADI09     Derog By Ind                    Char
AEQ01 - AEQ07     Home Equity                     Char
AHI01 - AHI05     Historical                      Char
AIN01 - AIN15     Installment                     Char
AIQ01 - AIQ05     Inquiries                       Char
ALE01 - ALE07     Lease                           Char
ALN01 - ALN07     LN Finance                      Char
AMG01 - AMG07     Mortgage                        Char
APR17 - APR21     Public REC                      Char
ART01 - ART15     Retail                          Char
ARV01 - ARV15     Revolving                       Char
CUS04             Customer Data                   Char
SCORE01           FICO                            Num
SCORE02           MDS (Market Derived Signals)    Num
RESP_FLG          Responder Flag                  Num
5. Data Cleaning
• Dataset SMALL has missing values for variables SCORE01 (FICO) and SCORE02 (MDS)

data mylib.small_clean mylib.small_bad;
  set mylib.small;
  if score01 = . or score02 = . then output mylib.small_bad;
  else output mylib.small_clean;
run;

LOG:
NOTE: There were 8000 observations read from the data set MYLIB.SMALL.
NOTE: The data set MYLIB.SMALL_CLEAN has 5782 observations and 145 variables.
NOTE: The data set MYLIB.SMALL_BAD has 2218 observations and 145 variables.

• Going forward, we will use dataset SMALL_CLEAN
• Investigate separately why 2218 observations had missing values for SCORE01 and SCORE02
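The DATA step above routes complete cases to SMALL_CLEAN and rows with a missing SCORE01 or SCORE02 to SMALL_BAD. The same split can be sketched in Python (illustrative only, using `None` for SAS's numeric missing value `.` and dicts for rows):

```python
def split_missing(rows):
    """Mimic the SAS DATA step: rows with a missing SCORE01 or
    SCORE02 go to `bad`; complete cases go to `clean`."""
    clean, bad = [], []
    for row in rows:
        if row["score01"] is None or row["score02"] is None:
            bad.append(row)
        else:
            clean.append(row)
    return clean, bad

# tiny hypothetical sample of the SMALL data set
rows = [{"score01": 700,  "score02": 5},
        {"score01": None, "score02": 3},
        {"score01": 650,  "score02": None}]
clean, bad = split_missing(rows)
print(len(clean), len(bad))   # 1 2
```

On the real data this reproduces the log counts above: 5782 clean rows and 2218 rows with at least one missing score.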
10. Model: Maximum CHAID
On the left side of the tree, the percentage of 1's (i.e., Respondents) is higher; hence prospects with a FICO score < 700.5 who fall in the (0, 4, 5, Missing) category of "RETAIL: BAL > 0 IN 6 MNTHS, ALL" are predicted to respond to the marketing campaign
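The leaf described above combines two splits: the FICO cutoff at 700.5 and the retail-balance category. As a hedged reading of that slide (the tree itself lives in Enterprise Miner; function and category encoding here are illustrative), the predicted-responder rule is:

```python
def predict_responder(fico, retail_bal_cat):
    """Leaf rule from the maximum CHAID tree: prospects with
    FICO < 700.5 whose 'RETAIL: BAL > 0 IN 6 MNTHS, ALL' category
    is 0, 4, 5, or Missing (None) are predicted to respond."""
    return fico < 700.5 and retail_bal_cat in ("0", "4", "5", None)

print(predict_responder(680, "4"))   # True
print(predict_responder(720, "4"))   # False
```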
62. Neural Network: Average Square Error
If we increase the number of iterations, the average square error decreases for the TRAIN data set but increases for the VALIDATE data set
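The pattern above, where training error keeps falling while validation error turns back up, is the classic overfitting signal; the network should stop at the iteration with the lowest validation error. A minimal sketch of that stopping rule (the ASE values are hypothetical):

```python
def best_iteration(valid_ase):
    """Return the 1-based iteration with the lowest validation
    average squared error -- where training should stop."""
    return min(range(len(valid_ase)), key=lambda i: valid_ase[i]) + 1

# hypothetical per-iteration validation ASE values
valid_ase = [0.25, 0.22, 0.20, 0.21, 0.23]
print(best_iteration(valid_ase))   # 3
```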
63. Ensemble Node
Select the model that performs best in:
– Decision Trees
– Regression
– Regression with Interaction Terms
Build an Ensemble Node on:
– Pruned CHAID
– Forward Regression
– Forward Regression with Interaction Terms
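An Ensemble node combines its input models by averaging their posterior probabilities. A sketch of that combination for the three models above (equal weighting assumed; the probability values are hypothetical):

```python
def ensemble_posterior(p_chaid, p_forward, p_forward_int):
    """Average the posterior response probabilities of the three
    component models: Pruned CHAID, Forward Regression, and
    Forward Regression with interaction terms."""
    return (p_chaid + p_forward + p_forward_int) / 3

print(round(ensemble_posterior(0.30, 0.24, 0.27), 2))   # 0.27
```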
66. Model Comparison
• ASSESSMENT REPORTS - NUMBER OF BINS = 50
• MODEL SELECTION - SELECTION STATISTIC = MISCLASSIFICATION RATE
• Comparing LIFT at the top 20%
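As a sketch of the two assessment statistics above: lift at the top 20% is the response rate among the best-scored fifth of prospects divided by the overall response rate, and the misclassification rate is the fraction of wrong predicted classes. (Illustrative only; the scores and responses below are made up, not from the SMALL data.)

```python
def lift_at(scores, responses, depth=0.20):
    """Lift at the top `depth` fraction: response rate among the
    highest-scored prospects over the overall response rate."""
    ranked = [r for _, r in sorted(zip(scores, responses), reverse=True)]
    top = ranked[:int(len(ranked) * depth)]
    return (sum(top) / len(top)) / (sum(responses) / len(responses))

def misclassification_rate(predicted, actual):
    """Fraction of prospects whose predicted class is wrong."""
    return sum(p != a for p, a in zip(predicted, actual)) / len(actual)

scores    = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
responses = [1,   1,   0,   0,   1,   0,   0,   0,   0,   0]
print(round(lift_at(scores, responses), 2))   # 3.33 (top fifth all respond)
print(misclassification_rate([1, 1, 1, 0, 0, 0, 0, 0, 0, 0], responses))   # 0.2
```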
71. Model Comparison: Conclusion
• TRAIN:
– Ensemble works best, followed by Forward Regression
– Check the VALIDATE and TEST results to finalize the model
• VALIDATE and TEST:
– Forward Regression works better than Ensemble
72. Final Model
• Forward Regression
• List of Variables:
– AAL11
– ACE01
– AEQ01
– AEQ07
– AHI01
– ALN01
– AMG01
– AMG07
– APR20
– ART11
– LOG_SCORE01
– AEQ02
73. SCORE
• In the Model Comparison node: SCORE -> SELECTION EDITOR
• Set YES for Forward Regression and NO for Stepwise Regression with Interaction terms (which was selected by default)
• Connect Model Comparison to the SCORE node, and run it
• Get the optimized SAS score code