1. Data Mining with Decision Trees:
An Introduction to CART®
Dan Steinberg
Mikhail Golovnya
golomi@salford-systems.com
http://www.salford-systems.com
Introduction to CART® Salford Systems ©2007
2. Intro CART Outline
Historical background
Sample CART model
Binary Recursive Partitioning
Fundamentals of tree pruning
Competitors and surrogates
Missing value handling
Variable Importance
Penalties and structured trees
Introduction into splitting rules
Constrained trees
Introduction into priors and costs
Cross-Validation
Stable trees and train-test consistency (TTC)
Hot-spot analysis
Model scoring and translation
Batteries of CART runs
4. In The Beginning…
In 1984 Berkeley and Stanford statisticians announced a
remarkable new classification tool which:
Could separate relevant from irrelevant predictors
Did not require any kind of variable transforms (logs, square roots)
Was impervious to outliers and missing values
Could yield relatively simple and easy to comprehend models
Required little to no supervision by the analyst
Was frequently more accurate than traditional logistic regression or
discriminant analysis, and other parametric tools
5. The Years of Struggle
CART was slow to gain widespread recognition in
late 1980s and early 1990s for several reasons
Monograph introducing CART is challenging to read
brilliant book overflowing with insights into tree growing
methodology but fairly technical and brief
Method was not expounded in any textbooks
Originally taught only in advanced graduate statistics
classes at a handful of leading Universities
Original standalone software came with slender
documentation, and output was not self-explanatory
Method was such a radical departure from conventional
statistics
6. The Final Triumph
Rising interest in data mining
Availability of huge data sets requiring analysis
Need to automate or accelerate and improve analysis process
Comparative performance studies
Advantages of CART over other tree methods
handling of missing values
assistance in interpretation of results (surrogates)
performance advantages: speed, accuracy
New software and documentation make techniques
accessible to end users
Word of mouth generated by early adopters
Easy to use professional software available from Salford
Systems
8. So What is CART?
Best illustrated with a famous example: the UCSD Heart
Disease study
Given the diagnosis of a heart attack based on
Chest pain, Indicative EKGs, Elevation of enzymes typically
released by damaged heart muscle
Predict who is at risk of a 2nd heart attack and early death
within 30 days
Prediction will determine treatment program (intensive care or
not)
For each patient about 100 variables were available, including:
Demographics, medical history, lab results
19 noninvasive variables were used in the analysis
9. Typical CART Solution
Example of a CLASSIFICATION tree
Dependent variable is categorical (SURVIVE, DEAD)
Want to predict class membership

Root: PATIENTS = 215 (SURVIVE 178 82.8%, DEAD 37 17.2%)
Is BP <= 91?
  YES -> Terminal Node A: SURVIVE 6 30.0%, DEAD 14 70.0%, NODE = DEAD
  NO  -> PATIENTS = 195 (SURVIVE 172 88.2%, DEAD 23 11.8%)
         Is AGE <= 62.5?
           YES -> Terminal Node B: SURVIVE 102 98.1%, DEAD 2 1.9%, NODE = SURVIVE
           NO  -> PATIENTS = 91 (SURVIVE 70 76.9%, DEAD 21 23.1%)
                  Is SINUS <= .5?
                    YES -> Terminal Node C: SURVIVE 14 50.0%, DEAD 14 50.0%, NODE = DEAD
                    NO  -> Terminal Node D: SURVIVE 56 88.9%, DEAD 7 11.1%, NODE = SURVIVE
10. How to Read It
Entire tree represents a complete analysis or model
Has the form of a decision tree
Root of inverted tree contains all data
Root gives rise to child nodes
Child nodes can in turn give rise to their own children
At some point a given path ends in a terminal node
Terminal node classifies object
Path through the tree governed by the answers to
QUESTIONS or RULES
11. Decision Questions
Question is of the form: is statement TRUE?
Is continuous variable X ≤ c ?
Does categorical variable D take on levels i, j, or k?
e.g. Is geographic region 1, 2, 4, or 7?
Standard split:
If answer to question is YES a case goes left; otherwise it goes right
This is the form of all primary splits
Question is formulated so that only two answers possible
Called binary partitioning
In CART the YES answer always goes left
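A minimal sketch of this routing convention in Python (the dictionary-based node layout and function name are invented for illustration; this is not CART's internal representation):

```python
# Toy binary-tree routing: each internal node asks "is X <= c?";
# the YES answer always goes left, NO goes right.

def route(node, case):
    """Follow a case down the tree to a terminal node's class label."""
    while "question" in node:
        var, threshold = node["question"]      # e.g. ("BP", 91)
        node = node["left"] if case[var] <= threshold else node["right"]
    return node["label"]

# Simplified fragment of the heart-attack tree from the earlier slide
tree = {
    "question": ("BP", 91),
    "left": {"label": "DEAD"},                 # Terminal Node A
    "right": {
        "question": ("AGE", 62.5),
        "left": {"label": "SURVIVE"},          # Terminal Node B
        "right": {"label": "DEAD"},            # simplified; real tree splits on SINUS
    },
}

print(route(tree, {"BP": 85, "AGE": 70}))      # BP <= 91 -> left -> DEAD
print(route(tree, {"BP": 120, "AGE": 50}))     # BP > 91, AGE <= 62.5 -> SURVIVE
```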
12. The Tree is a Classifier
Classification is determined by following a case’s path down
the tree
Terminal nodes are associated with a single class
Any case arriving at a terminal node is assigned to that class
The implied classification threshold is constant across all
nodes and depends on the currently set priors, costs, and
observation weights (discussed later)
Other algorithms often use only the majority rule or a limited set of
fixed rules for node class assignment thus being less flexible
In standard classification, tree assignment is not
probabilistic
Another type of CART tree, the class probability tree does report
distributions of class membership in nodes (discussed later)
With large data sets we can take the empirical distribution in
a terminal node to represent the distribution of classes
13. The Importance of Binary Splits
Some researchers suggested using 3-way, 4-way, etc. splits
at each node (univariate)
There are strong theoretical and practical reasons why we
argue strongly against such approaches:
Exhaustive search for a binary split has linear algorithmic complexity;
the alternatives have much higher complexity
Choosing the best binary split is “less ambiguous” than the
suggested alternatives
Most importantly, binary splits result in the smallest rate of data
fragmentation, thus allowing a larger set of predictors to enter a
model
Any multi-way split can be replaced by a sequence of binary splits
There are numerous practical illustrations of the validity of
the above claims
14. Accuracy of a Tree
For classification trees
All objects reaching a terminal node are classified in the same way
e.g. All heart attack victims with BP>91 and AGE less than 62.5 are
classified as SURVIVORS
Classification is same regardless of other medical history and lab
results
The performance of any classifier can be summarized in
terms of the Prediction Success Table (a.k.a. Confusion
Matrix)
The key to comparing the performance of different models is
how one “wraps” the entire prediction success matrix into a
single number
Overall accuracy
Average within-class accuracy
Most generally – the Expected Cost
15. Prediction Success Table
                                 Classified As
TRUE Class               Class 1 "Survivors"   Class 2 "Early Deaths"   Total   % Correct
Class 1 "Survivors"              158                     20              178      88%
Class 2 "Early Deaths"             9                     28               37      75%
Total                            167                     48              215      86.51%
A large number of different terms exist to name various
simple ratios based on the table above:
Sensitivity, specificity, hit rate, recall, overall accuracy, false/true
positive, false/true negative, predictive value, etc.
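As a quick sketch, these ratios can be computed straight from the four cells of the table; the counts below come from the slide, while the function and key names are ours:

```python
# Summarize a 2x2 prediction success (confusion) matrix.
# tp/fn: the class-1 row of the table; fp/tn: the class-2 row.

def summarize(tp, fn, fp, tn):
    total = tp + fn + fp + tn
    sens = tp / (tp + fn)                    # class-1 recall (sensitivity)
    spec = tn / (fp + tn)                    # class-2 recall (specificity)
    return {
        "overall_accuracy": (tp + tn) / total,
        "sensitivity": sens,
        "specificity": spec,
        "average_within_class": 0.5 * (sens + spec),
    }

m = summarize(tp=158, fn=20, fp=9, tn=28)
print(round(100 * m["overall_accuracy"], 2))   # 86.51, as in the table
```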
16. Tree Interpretation and Use
Practical decision tool
Trees like this are used for real world decision making
Rules for physicians
Decision tool for a nationwide team of salesmen needing to classify
potential customers
Selection of the most important prognostic variables
Variable screen for parametric model building
Example data set had 100 variables
Use results to find variables to include in a logistic regression
Insight — in our example AGE is not relevant if BP is not
high
Suggests which interactions will be important
18. Classification with CART®
Real world study early 1990s
Fixed line service provider offering a
new mobile phone service
Wants to identify customers most
likely to accept new mobile offer
Data set based on limited market
trial
19. Real World Trial
830 households offered a mobile phone package
All were offered an identical package but pricing was varied at
random
Handset prices ranged from low to high values
Per minute prices ranged separately from low to high rates
Household asked to make yes or no decision on offer
15.2% of the households accepted the offer
Our goal was to answer two key questions
Who to make offers to?
How to price?
20. Setting Up the Model
Only requirement is to select TARGET (dependent) variable. CART will
do everything else automatically.
21. Model Results
CART completes analysis and gives access
to all results from the NAVIGATOR
Shown on the next slide
Upper section displays tree of a selected size
Lower section displays error rate for trees of
all possible sizes
Green bar marks most accurate tree
We display a compact 10 node tree for further
scrutiny
22. Navigator Display
Most accurate tree is marked with the green bar. We select the 10 node tree
for convenience of a more compact display.
23. Root Node
Root node displays details of TARGET
variable in overall training data
We see that 15.2% of the 830
households accepted the offer
Goal of the analysis is now to extract
patterns characteristic of responders
If we could only use a single piece of
information to separate responders
from non-responders CART chooses
the HANDSET PRICE
Those offered the phone with a price
> 130 contain only 9.9% responders
Those offered a lower price respond
at 21.9%
24. Maximal Tree
Maximal tree is the raw material for the best model
Goal is to find optimal tree embedded inside maximal
tree
Will find optimal tree via “pruning”
Like backwards stepwise regression
25. Main Drivers
Red = Good Response
Blue = Poor Response
High values of a split
variable always go to
the right; low values go
left
26. Examine Individual Node
Hover mouse over node to see
inside
Even though this node is on the
“high price” side of the tree, it still
exhibits the strongest response
across all terminal node
segments (43.5% response)
Here we select rules expressed
in C
Entire tree can also be rendered
in Java, XML/PMML, or SAS
27. Configure Print Image
Variety of fine controls to position and scale image nicely
28. Variable Importance Ranking
CART offers multiple ways to compute variable importance
29. Predictive Accuracy
This model is not very accurate but ranks responders well
30. Gains and ROC Curves
In the top decile the model captures about 40% of
responders
32. Binary Recursive Partitioning
Data is split into two partitions
thus “binary” partition
Partitions can also be split into sub-partitions
hence procedure is recursive
CART tree is generated by repeated dynamic
partitioning of data set
Automatically determines which variables to use
Automatically determines which regions to focus on
The whole process is totally data driven
33. Three Major Stages
Tree growing
Consider all possible splits of a node
Find the best split according to the given splitting
rule (best improvement)
Continue sequentially until the largest tree is grown
Tree Pruning – creating a sequence of nested
trees (pruning sequence) by systematic removal
of the weakest branches
Optimal Tree Selection – using test sample or
cross-validation to find the best tree in the
pruning sequence
34. General Workflow
Historical Data is divided into three samples: Learn, Test, and Validate
Stage 1 (Learn): build a sequence of nested trees
Stage 2 (Test): monitor performance and pick the best tree
Stage 3 (Validate): confirm findings
35. Searching All Possible Splits
For any node CART will examine ALL possible splits
Computationally intensive but there are only a finite number of splits
Consider first the variable AGE, which in our data set has
minimum value 40
The rule
Is AGE ≤ 40? will separate out these cases to the left — the youngest
persons
Next increase the AGE threshold to the next youngest
person
Is AGE ≤ 43
This will direct six cases to the left
Continue increasing the splitting threshold value by value
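The threshold enumeration just described can be sketched as follows (a simplified illustration using the ages from the split table, not Salford's implementation):

```python
def candidate_splits(values):
    """One candidate threshold per unique value except the largest:
    the rule "is X <= c?" sends matching cases to the left."""
    uniq = sorted(set(values))
    return uniq[:-1]                  # number of unique values minus one

ages = [40, 40, 40, 43, 43, 43, 45, 48, 48, 49, 49]
print(candidate_splits(ages))         # [40, 43, 45, 48]
# "Is AGE <= 40?" sends the three youngest cases left; raising the
# threshold to 43 directs six cases to the left, and so on.
```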
36. Split Tables
Split tables are used to locate continuous splits
There will be as many splits as there are unique values minus one
Sorted by Age:
AGE  BP   SINUST  SURVIVE
40   91   0       SURVIVE
40   110  0       SURVIVE
40   83   1       DEAD
43   99   0       SURVIVE
43   78   1       DEAD
43   135  0       SURVIVE
45   120  0       SURVIVE
48   119  1       DEAD
48   122  0       SURVIVE
49   150  0       DEAD
49   110  1       SURVIVE

Sorted by Blood Pressure:
AGE  BP   SINUST  SURVIVE
43   78   1       DEAD
40   83   1       DEAD
40   91   0       SURVIVE
43   99   0       SURVIVE
40   110  0       SURVIVE
49   110  1       SURVIVE
48   119  1       DEAD
45   120  0       SURVIVE
48   122  0       SURVIVE
43   135  0       SURVIVE
49   150  0       DEAD
37. Categorical Predictors
Categorical predictors spawn many more splits
For example, only four levels A, B, C, D result in 7 splits
In general, there will be 2^(K-1)-1 splits
CART considers all possible splits based on a categorical
predictor unless the number of categories exceeds 15

     Left     Right
1    A        B, C, D
2    B        A, C, D
3    C        A, B, D
4    D        A, B, C
5    A, B     C, D
6    A, C     B, D
7    A, D     B, C
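A brute-force enumeration of the 2^(K-1)-1 distinct left/right partitions can be sketched as follows (illustrative only, not CART's search):

```python
from itertools import combinations

def categorical_splits(levels):
    """All distinct left/right partitions of a set of K levels."""
    k = len(levels)
    splits = []
    for size in range(1, k // 2 + 1):
        for left in combinations(levels, size):
            right = tuple(l for l in levels if l not in left)
            # when both sides have equal size, each partition would be
            # generated twice; keep only one orientation
            if 2 * size == k and left > right:
                continue
            splits.append((left, right))
    return splits

print(len(categorical_splits(["A", "B", "C", "D"])))   # 7 = 2**3 - 1
```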
38. Binary Target Shortcut
When the target is binary, special algorithms reduce
compute time to linear in the number of levels (K-1 splits
instead of 2^(K-1)-1)
30 levels takes twice as long as 15, not 10,000+ times as long
When you have high-level categorical variables in a multi-
class problem
Create a binary classification problem
Try different definitions of DPV (which groups to combine)
Explore predictor groupings produced by CART
From a study of all results decide which are most informative
Create a new grouped predictor variable for the multi-class problem
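The shortcut rests on ordering the levels by their class-1 rate and checking only the K-1 splits along that ordering; a sketch with hypothetical counts:

```python
def ordered_splits(level_stats):
    """level_stats maps level -> (class-1 count, total count).
    Returns the K-1 candidate left-hand groups in order of
    increasing class-1 proportion."""
    ranked = sorted(level_stats,
                    key=lambda l: level_stats[l][0] / level_stats[l][1])
    return [tuple(ranked[:i]) for i in range(1, len(ranked))]

# Hypothetical response counts per level: only 3 splits to test, not 7
stats = {"A": (1, 10), "B": (9, 10), "C": (5, 10), "D": (3, 10)}
print(ordered_splits(stats))   # [('A',), ('A', 'D'), ('A', 'D', 'C')]
```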
40. GYM Model Example
The original data comes from a financial
institution and is disguised as a health club
Problem: need to understand a market
research clustering scheme
Clusters were created externally using 18
variables and conventional clustering software
Need to find simple rules to describe cluster
membership
CART tree provides a nice way to arrive at an
intuitive story
41. Variable Definitions
ID        Identification # of member (not used)
CLUSTER   Cluster assigned from clustering scheme (10-level categorical coded 1-10) (target)
All remaining variables are predictors:
ANYRAQT   Racquet ball usage (binary indicator coded 0, 1)
ONRCT     Number of on-peak racquet ball uses
ANYPOOL   Pool usage (binary indicator coded 0, 1)
ONPOOL    Number of on-peak pool uses
PLRQTPCT  Percent of pool and racquet ball usage
TPLRCT    Total number of pool and racquet ball uses
ONAER     Number of on-peak aerobics classes attended
OFFAER    Number of off-peak aerobics classes attended
SAERDIF   Difference between number of on- and off-peak aerobics visits
TANNING   Number of visits to tanning salon
PERSTRN   Personal trainer (binary indicator coded 0, 1)
CLASSES   Number of classes taken
NSUPPS    Number of supplements/vitamins/frozen dinners purchased
SMALLBUS  Small business discount (binary indicator coded 0, 1)
OFFER     Terms of offer
IPAKPRIC  Index variable for package price
MONFEE    Monthly fee paid
FIT       Fitness score
NFAMMEN   Number of family members
HOME      Home ownership (binary indicator coded 0, 1)
42. Things to Do
Dataset: gymsmall.syd
Set the target and predictors
Set 50% test sample
Navigate the largest, smallest, and optimal trees
Check [Splitters] and [Tree Details]
Observe the logic of pruning
Check [Summary Reports]
Understand competitors and surrogates
Understand variable importance
43. Pruning Multiple Nodes
Note that Terminal Node 5
has class assignment 1
All the remaining nodes
have class assignment 2
Pruning merges Terminal
Nodes 5 and 6 thus
creating a “chain reaction”
of redundant split removal
44. Pruning Weaker Split First
Eliminating nodes 3 and 4 results in
“lesser” damage – losing one
correct dominant class assignment
Eliminating nodes 1 and 2, or 5 and 6,
would result in losing one
correct minority class assignment
45. Understanding Variable Importance
Both FIT and CLASSES are reported as equally important
Yet the tree only includes CLASSES
ONPOOL is one of the main splitters yet it has very low importance
Want to understand why
46. Competitors and Surrogates
Node details for the root node split are shown above
Improvement measures class separation between two sides
Competitors are next best splits in the given node ordered
by improvement values
Surrogates are splits most resembling (case by case) the
main split
Association measures the degree of resemblance
Looking for competitors is free, finding surrogates is expensive
It is possible to have an empty list of surrogates
FIT is a perfect surrogate (and competitor) for CLASSES
47. The Utility of Surrogates
Suppose that the main splitter is INCOME and comes from
a survey form
People with very low or very high income are likely not to
report it – informative missingness in both tails of the
income distribution
The splitter itself wants to separate low income on the left
from high income on the right – hence, it is unreasonable to
treat missing INCOME as a separate unique value (the usual
way to handle such a case)
Using a surrogate will help to resolve ambiguity and
redistribute missing income records between the two sides
For example, all low education subjects will join the low
income side while all high education subjects will join the
high income side
48. Competitors and Surrogates Are Different
Three nominal classes A, B, C, equally present
Equal priors, unit costs, etc.
Split X sends classes A and B left and class C right
Split Y sends classes A and C left and class B right
For symmetry considerations, both splits must result in
identical improvements
Yet one split can’t be substituted for another
Hence Split X and Split Y are perfect competitors (same
improvement) but very poor surrogates (low association)
The reverse does not hold – a perfect surrogate is always a
perfect competitor
49. Calculating Association
Ten observations in a node
Main splitter X
Split Y mismatches the main splitter on 2 observations
The default rule sends all cases to the dominant side,
mismatching 3 observations
Association = (Default – Split Y)/Default = (3-2)/3 = 1/3
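The slide's arithmetic can be reproduced directly; the 0/1 vectors below encode which side each of the ten cases takes (the exact mismatch pattern is illustrative):

```python
def association(main_left, surrogate_left):
    """Association of a surrogate with the main splitter:
    (default mismatches - surrogate mismatches) / default mismatches."""
    n = len(main_left)
    n_left = sum(main_left)
    default_miss = min(n_left, n - n_left)   # send all to the dominant side
    surr_miss = sum(m != s for m, s in zip(main_left, surrogate_left))
    return (default_miss - surr_miss) / default_miss

main = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]   # main splitter X: 7 left, 3 right
surr = [1, 1, 1, 1, 1, 0, 1, 0, 0, 1]   # split Y mismatches on 2 cases
print(association(main, surr))           # (3 - 2) / 3 = 1/3
```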
50. Notes on Association
Measure local to a node, non-parametric in nature
Association is negative for nodes with very poor
resemblance, not limited by -1 from below
Any positive association is useful (better strategy
than the default rule)
Hence define surrogates as splits with the highest
positive association
In calculating case mismatch, check the possibility
for split reversal (if the rule is met, send cases to
the right)
CART marks straight splits as “s” and reverse splits as “r”
51. Calculating Variable Importance
List all main splitters as well as all surrogates along with the
resulting improvements
Aggregate the above list by variable accumulating
improvements
Sort the result descending
Divide each entry by the largest value and scale to 100%
The resulting importance list is based on the given tree
exclusively; if a different tree is grown, the importance list will
most likely change
Variable importance reveals a deeper structure in stark
contrast to a casual look at the main splitters
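The four steps above can be sketched as follows (the variable names echo the GYM example but the improvement values are invented):

```python
def variable_importance(splits):
    """splits: (variable, improvement) pairs covering every main
    splitter and every surrogate in the tree."""
    totals = {}
    for var, imp in splits:                      # aggregate by variable
        totals[var] = totals.get(var, 0.0) + imp
    top = max(totals.values())                   # scale so the best = 100%
    ranked = sorted(totals.items(), key=lambda kv: -kv[1])
    return {v: round(100.0 * s / top, 1) for v, s in ranked}

# FIT never appears as a main splitter but is a perfect surrogate at the root
splits = [("CLASSES", 0.40), ("FIT", 0.40),
          ("ONPOOL", 0.05), ("TANNING", 0.02)]
print(variable_importance(splits))
# CLASSES and FIT tie at 100.0, echoing the earlier mystery
```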
52. Mystery Solved
In our example improvements in the root node dominate the whole tree
Consequently, the variable importance reflects the surrogates table in the root node
53. Variable Importance Caution
Importance is a function of the OVERALL tree including
deepest nodes
Suppose you grow a large exploratory tree — review
importances
Then find an optimal tree via test set or CV yielding smaller
tree
Optimal tree SAME as exploratory tree in the top nodes
YET importances might be quite different.
WHY? Because larger tree uses more nodes to compute the
importance
When comparing results be sure to compare similar or same
sized trees
55. Splitting Rules - Introduction
Splitting rule controls the tree growing stage
At each node, a finite pool of all possible splits exists
The pool is independent of the target variable; it is based solely on
the observations (and corresponding values of predictors) in the
node
The task of the splitting rule is to rank all splits in the pool
according to their improvements
The winner will become the main splitter
The top runner-ups will become competitors
It is possible to construct different splitting rules
emphasizing various approaches to the definition of the
“best split”
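For instance, the Gini rule (one of the rules discussed next) scores a split by the decrease in node impurity; a sketch with priors and costs omitted for simplicity:

```python
def gini(counts):
    """Gini impurity of a node given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def improvement(parent, left, right):
    """Decrease in weighted impurity when the parent is split in two."""
    n = sum(parent)
    return (gini(parent)
            - (sum(left) / n) * gini(left)
            - (sum(right) / n) * gini(right))

# Rank two candidate splits of the same 50/50 node: the cleaner
# separation earns the larger improvement and would become the splitter.
print(improvement([50, 50], [40, 10], [10, 40]))   # ~0.18
print(improvement([50, 50], [30, 20], [20, 30]))   # ~0.02
```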
56. Splitting Rules Control
Selected in the “Method” tab
of the Model Setup dialog
There are 6 alternative rules
for classification and 2 rules
for regression
Since there is no theoretical
justification on the best
splitting rule, we recommend
trying all of them and picking
the best resulting model
57. Further Insights
On binary classification problems, GINI and
TWOING are equivalent
Therefore, we recommend setting the “Favor even Splits”
control to 1 when TWOING is used
GINI and Symmetric GINI only differ when an
asymmetric cost matrix is used and are identical
otherwise
Ordered TWOING is a restricted version of
TWOING on multinomial targets coded into a set of
contiguous integers
Class Probability is identical to GINI during tree
growing but differs during the pruning
58. Battery RULES
The battery was designed to automatically try all six
alternative splitting rules
According to the output above, the Entropy splitting rule resulted
in the most accurate model on the given data
Individual model navigators are also available
59. Controlled Tree Growth
CART was originally conceived to be fully
automated thus protecting an analyst from making
erroneous split decisions
In some cases, however, it could be desirable to
influence tree evolution to address specific
modeling needs
Modern CART engine allows multiple direct and
indirect ways of accomplishing this:
Penalties on variables, missing values, and categories
Controlling which variables are allowed or disallowed at
different levels in the tree and node sizes
Direct forcing of splits at the top two levels in the tree
60. Penalizing Individual Variables
We use the variable penalty control to
penalize CLASSES a little bit
This is enough to hand FIT the main
splitter position
The improvement associated with
CLASSES is now downgraded by a
factor of one minus the penalty
61. Penalizing Missing and Categorical Predictors
Since splits on variables with
missing values are evaluated on a
reduced set of observations, such
splits may have an unfair advantage
Introducing systematic penalty set to
the proportion of the variable
missing effectively removes the
possible bias
Same holds for categorical variables
with a large number of distinct levels
We recommend that the user always
set both controls to 1
62. Forcing Splits
CART allows direct forcing of the split variable (and possibly
split value) in the top three nodes in the tree
63. Notes on Forcing Splits
It is not guaranteed that a forced split below the root node
level will be preserved (because of pruning)
Do not simply overturn “illogical” automatic splits by forcing
something else; look for an explanation why CART makes a
“wrong” choice, as it could indicate a problem with your data
64. Constrained Trees
Many predictive models can benefit from
Salford’s patent pending “Structured Trees”
Trees constrained in how they are grown to
reflect decision support requirements
In mobile phone example: want tree to first
segment on customer characteristics and
then complete using price variables
Price variables are under the control of the
company
Customer characteristics are not under
company control
65. Setting Up Constraints
Green indicates where in the tree the variables of each group are
allowed to appear
66. Constrained Tree
Demographic and spend information at top of tree
Handset (HANDPRIC) and per minute pricing (USEPRICE) at bottom
67. Prior Probabilities
An essential component of any CART analysis
A prior probability distribution for the classes is needed to do
proper splitting since prior probabilities are used to calculate
the improvements in the Gini formula
Example: Suppose data contains
99% type A (nonresponders),
1% type B (responders),
CART might focus on not missing any class A objects
One "solution": classify all objects as nonresponders
Advantages: only a 1% error rate
Very simple predictive rules, although not informative
But realize this could be an optimal rule
68. Priors EQUAL
Most common priors: PRIORS EQUAL
This gives each class equal weight regardless of its frequency in
the data
Prevents CART from favoring more prevalent classes
If we have 900 class A and 100 class B and equal priors:

Root:         A: 900,  B: 100
Left child:   A: 300 (1/3 of all A),  B: 10 (1/10 of all B)  ->  Class as A
Right child:  A: 600 (2/3 of all A),  B: 90 (9/10 of all B)  ->  Class as B
With equal priors prevalence measured as percent of OWN
class size
Measurements relative to own class not entire learning set
If PRIORS DATA both child nodes would be class A
Equal priors puts both classes on an equal footing
69. Class Assignment Formula I
For equal priors a node t is classified as class 1 if

    N1(t) / N2(t)  >  N1 / N2
Ni = number of cases in class i at the root node
Ni (t) = number of cases in class i at the node t
Classify any node by “count ratio in node” relative to
“count ratio in root”
Classify by which level has greatest relative richness
70. Class Assignment Formula II
When priors are not equal:
π1N1(t) / π2N2(t) > N1 / N2
where πi is the prior for class i
Since π1 / π2 always appears as a ratio, just think in
terms of boosting the amount Ni(t) by a factor, e.g. 9:1,
100:1, 2:1
If priors are EQUAL, πi terms cancel out
If priors are DATA, then π1 / π2 = N1 / N2
which simplifies formula to plurality rule (simple
counting)
classify as class 1 if N1(t) > N2(t)
If priors are neither EQUAL nor DATA then formula
will provide boost to up-weighted classes
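A sketch of the rule, written in the equivalent form pi1*N1(t)/N1 > pi2*N2(t)/N2 (a rearrangement of the inequality above; the function name is ours):

```python
def assign_class(n1_t, n2_t, n1, n2, pi1=0.5, pi2=0.5):
    """Node t is class 1 when its prior-weighted relative richness in
    class 1 exceeds that in class 2; otherwise class 2."""
    return 1 if pi1 * n1_t / n1 > pi2 * n2_t / n2 else 2

# The 900 class-A / 100 class-B example (A = class 1, EQUAL priors):
print(assign_class(300, 10, 900, 100))   # 300/900 > 10/100 -> 1 (A)
print(assign_class(600, 90, 900, 100))   # 600/900 < 90/100 -> 2 (B)
# With DATA priors (0.9, 0.1) the rule reduces to plurality counting:
print(assign_class(600, 90, 900, 100, pi1=0.9, pi2=0.1))   # 600 > 90 -> 1
```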
71. Example

Root:         A: 900,  B: 100

Left child:   A: 300,  B: 10     300/10 > 900/100, so class as A
Right child:  A: 600,  B: 90     90/600 > 100/900, so class as B

In general the tests would be weighted by the appropriate priors:

    class A:  ( π(A) * 300 ) / ( π(B) * 10 )  >  900 / 100
    class B:  ( π(B) * 90 ) / ( π(A) * 600 )  >  100 / 900

The tests can be written in terms of A or B as convenient.
73. Cross-Validation
Cross-validation is a recent computer intensive
development in statistics
Purpose is to protect oneself from overfitting errors
Don't want to capitalize on chance — track idiosyncrasies
of this data
Idiosyncrasies which will NOT be observed on fresh data
Ideally would like to use large test data sets to evaluate
trees, N>5000
Practically, some studies don't have sufficient data to
spare for testing
Cross-validation will use SAME data for learning and for
testing
74. 10-fold CV: Industry Standard
Begin by growing maximal tree on ALL data
Divide data into 10 portions stratified on dependent
variable levels
Reserve first portion for test, grow new tree on remaining
9 portions
Use the 1/10 test set to measure error rate for this 9/10
data tree
Error rate is measured for the maximal tree and for all
subtrees
Now rotate out new 1/10 test data set
Grow new tree on remaining 9 portions
Repeat until all 10 portions of the data have been used
as test sets
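The rotation can be sketched as follows (round-robin fold assignment for brevity; as noted above, CART stratifies the portions on the target):

```python
def cv_folds(n_cases, k=10):
    """Yield (learn_indices, test_indices) for each of the k rotations."""
    folds = [list(range(i, n_cases, k)) for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        learn = [i for f in range(k) if f != held_out for i in folds[f]]
        yield learn, test

rotations = list(cv_folds(100))
print(len(rotations))           # 10 rotations
print(len(rotations[0][1]))     # each test portion holds 1/10 of the data
```

Every observation lands in a test set exactly once and in a learn set nine times, matching the description on the next slides.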
75. The CV Procedure
Portions: 1 2 3 4 5 6 7 8 9 10

Rotation 1:  Test = portion 1,  Learn = portions 2-10
Rotation 2:  Test = portion 2,  Learn = portions 1, 3-10
Rotation 3:  Test = portion 3,  Learn = portions 1-2, 4-10
ETC...
Rotation 10: Test = portion 10, Learn = portions 1-9
76. CV Details
Every observation is used as test case exactly once
In 10 fold CV each observation is used as a learning case
9 times
In 10 fold CV 10 auxiliary CART trees are grown in
addition to initial tree
When all 10 CV trees are done the error rates are
cumulated (summed)
Summing of error rates is done by TREE complexity
Summed Error (cost) rates are then attributed to INITIAL
tree
Observe that no two of these trees need be the same
Results are subject to random fluctuations — sometimes
severe
77. CV Replication and Stability
Look at the separate CV tree results of a single CV analysis
CART produces a summary of the results for each tree at
end of output
Table titled "CV TREE COMPETITOR LISTINGS" reports
results for each CV tree Root (TOP), Left Child of Root
(LEFT), and Right Child of Root (RIGHT)
Want to know how stable results are -- do different variables
split root node in different trees?
Only difference between different CV trees is random
elimination of 1/10 data
Special battery called CVR exists to study the effects of
repeated cross-validation process
78. Battery CVR
80. General Comments
Regression trees result in piecewise-constant
models (a multi-dimensional staircase) on an
orthogonal partition of the data space
Thus usually not the best possible performer in terms of
conventional regression loss functions
Only a very limited number of controls is available
to influence the modeling process
Priors and costs are no longer applicable
There are two splitting rules: LS and LAD
Very powerful in capturing high-order interactions
but somewhat weak in explaining simple main
effects
81. Boston Housing Data Set
Harrison, D. and D. Rubinfeld. Hedonic Housing Prices & Demand For
Clean Air. Journal of Environmental Economics and Management, v5,
81-102, 1978
506 census tracts in City of Boston for the year 1970
Goal: study relationship between quality of life variables and property values
MV median value of owner-occupied homes in tract (‘000s)
CRIM per capita crime rates
NOX concentration of nitrogen oxides (pphm)
AGE percent built before 1940
DIS weighted distance to centers of employment
RM average number of rooms per house
LSTAT percent neighborhood ‘lower SES’
RAD accessibility to radial highways
CHAS borders Charles River (0/1)
INDUS percent non-retail business
TAX tax rate
PT pupil teacher ratio
82. Splitting the Root Node
Improvement is defined in terms of the greatest reduction in the sum of
squared errors when a single constant prediction is replaced by two
separate constants on each side
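A sketch of that least-squares improvement for one candidate split, with toy target values:

```python
def sse(y):
    """Sum of squared errors around the node mean (one constant)."""
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)

def ls_improvement(y_left, y_right):
    """Reduction in SSE when the parent's single constant prediction is
    replaced by separate constants (the means) on each side."""
    return sse(y_left + y_right) - sse(y_left) - sse(y_right)

# A split separating low from high target values earns a large improvement
print(ls_improvement([10.0, 12.0, 11.0], [30.0, 29.0, 31.0]))   # 541.5
```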
83. Regression Tree Model
All cases in the given node are assigned the same predicted
response – the node average of the original target
Nodes are color-coded according to the predicted response
We have a convenient segmentation of the population
according to the average response levels
84. The Best and the Worst Segments
86. Notes on Scoring
Trees cannot be expressed as a simple analytical
expression; instead, a lengthy collection of rules is
needed to express the tree logic
Nonetheless, the scoring operation is very fast
once the rules are compiled into machine code
Scoring can be done in two ways:
Internally – by using the SCORE command
Externally – by using the TRANSLATE command to
generate C, SAS, Java, or PMML code
A model of any size can be scored or translated
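The code that TRANSLATE emits is essentially a cascade of if/else rules. A hand-written Python analogue (the variable names, thresholds, and node values below are invented for illustration, not actual TRANSLATE output) shows why scoring compiled rules is so fast:

```python
def score(rm, crim):
    """Hypothetical translated tree: each branch tests one splitter,
    each leaf returns its terminal-node average."""
    if rm <= 6.5:
        if crim <= 5.0:
            return 21.6   # terminal node 1 average (made up)
        return 12.1       # terminal node 2 average (made up)
    return 37.2           # terminal node 3 average (made up)

predictions = [score(6.0, 1.0), score(6.0, 10.0), score(8.0, 0.1)]
```

Scoring a case is just a handful of comparisons, one per level of the tree, regardless of how large the training data was.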
88. Output File Format
CASEID RESPONSE NODE CORRECT DV PROB_1 PROB_2 PATH_1 PATH_2 PATH_3 PATH_4 PATH_5 PATH_6 XA XB XC
1 0 1 1 0 0.998795 0.001205 1 2 3 4 5 -1 0 0 0
2 1 7 0 0 0.738636 0.261364 1 -7 0 0 0 0 1 0 0
3 0 1 1 0 0.998795 0.001205 1 2 3 4 5 -1 0 0 0
CART creates the following output variables
RESPONSE – predicted class assignment
NODE – terminal node the case falls into
CORRECT – indicates whether the prediction is correct (assuming
that the actual target is known)
PROB_x – class probability (constant within each node)
PATH_x – path traveled by case in the tree
Case 1 starts in the root node, then follows through internal nodes 2, 3,
4, 5, and finally lands in terminal node 1
Case 2 starts in the root node and then lands in terminal node 7
Finally CART adds all of the original model variables including the
target and predictors
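The PATH bookkeeping can be mimicked with a toy tree (node ids, variables, and thresholds below are invented for illustration): negative ids mark terminal nodes, matching the sign convention visible in the PATH_x columns above.

```python
# Toy tree: each internal node routes a case left or right;
# a negative child id means "land in that terminal node".
TREE = {
    1: ("x", 0.5, 2, -1),   # node 1: if x <= 0.5 go to node 2, else terminal 1
    2: ("y", 3.0, -2, -3),  # node 2: if y <= 3.0 terminal 2, else terminal 3
}

def path(case):
    """Return the list of node ids a case visits, ending with the
    (negated) terminal node id - a re-creation of the PATH_x idea."""
    trail, node = [], 1
    while node > 0:
        var, thr, lo, hi = TREE[node]
        trail.append(node)
        node = lo if case[var] <= thr else hi
    trail.append(node)
    return trail

p = path({"x": 0.2, "y": 5.0})  # root -> node 2 -> terminal 3
```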
90. Output Code
92. Automated CART Runs
Starting with CART 6.0 we have added a new feature called
batteries
A battery is a collection of runs obtained via a predefined
procedure
A basic set of batteries varies one of the key CART
parameters:
RULES – run every major splitting rule
ATOM – vary minimum parent node size
MINCHILD – vary minimum node size
DEPTH – vary tree depth
NODES – vary maximum number of nodes allowed
FLIP – flipping the learn and test samples (two runs only)
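Mechanically, a one-parameter battery is just a loop over the control values, collecting one model per value. A toy driver makes the idea concrete (the `run` callable here is a made-up stand-in for a full CART run, with a fabricated error curve):

```python
def battery(param_values, run):
    """Run one model per parameter value and collect the results keyed
    by that value - the essence of a battery such as DEPTH or ATOM."""
    return {v: run(v) for v in param_values}

# Toy stand-in 'run': pretend relative error shrinks with allowed depth.
results = battery([1, 2, 4], lambda depth: round(1.0 / depth, 2))
```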
93. More Batteries
DRAW – each run uses a randomly drawn learn sample
from the original data
CV – varying number of folds used in cross-validation
CVR – repeated cross-validation each time with a different
random number seed
KEEP – select a given number of predictors at random from
the initial larger set
Supports a core set of predictors that must always be present
ONEOFF – a one-predictor model for each predictor in the
given list
The remaining batteries are more advanced and will be
discussed individually
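The ONEOFF idea can be sketched directly (squared correlation stands in here for a single-predictor CART run's performance; the data are toy values): score each predictor alone against the target and compare.

```python
def r2(x, y):
    """Squared correlation - a stand-in score for a one-predictor model."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def oneoff(predictors, target):
    """One model per predictor, as in BATTERY ONEOFF."""
    return {name: r2(vals, target) for name, vals in predictors.items()}

target = [2.0, 4.0, 6.0, 8.0]
scores = oneoff({"a": [1, 2, 3, 4], "b": [1, -1, 1, -1]}, target)
```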
94. Battery MCT
The primary purpose is to test for possible model
overfitting by running a Monte Carlo simulation
Randomly permute the target variable
Build a CART tree
Repeat a number of times
The resulting family of profiles measures intrinsic
overfitting due to the learning process
Compare the actual run profile to the family of random
profiles to check the significance of the results
This battery is especially useful when the relative
error of the original run exceeds 90%
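The permutation scheme is easy to sketch (toy data; squared correlation stands in for a full CART run's performance measure): scores on shuffled targets form the "random profile" family that the actual run is compared against.

```python
import random

def r2(x, y):
    """Squared correlation - stand-in for a model performance score."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sum((a - mx) ** 2 for a in x) *
                        sum((b - my) ** 2 for b in y))

def mct(x, y, n_reps=20, seed=0):
    """Score the actual target, then score randomly permuted copies of
    it; the permuted scores estimate how well pure noise can be fit."""
    rng = random.Random(seed)
    actual = r2(x, y)
    null = []
    for _ in range(n_reps):
        y_perm = y[:]
        rng.shuffle(y_perm)
        null.append(r2(x, y_perm))
    return actual, null

actual, null = mct([1, 2, 3, 4], [2.0, 4.0, 6.0, 8.0])
```

If `actual` does not clearly beat the spread of `null`, the apparent fit is indistinguishable from overfitting to noise.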
95. Battery MVI
Designed to check the impact of missing value
indicators (MVIs) on model performance
Build a standard model, missing values are handled via
the mechanism of surrogates
Build a model using MVIs only – measures the amount of
predictive information contained in the missing value
indicators alone
Combine the two
Same as above with missing penalties turned on
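Constructing the MVIs themselves is simple; a sketch (`None` marks a missing value; column names are made up):

```python
def add_mvis(rows, predictors):
    """For each predictor, append a 0/1 missing-value indicator column -
    the MVIs whose predictive content BATTERY MVI measures."""
    out = []
    for row in rows:
        new = dict(row)
        for p in predictors:
            new[p + "_MVI"] = 1 if row[p] is None else 0
        out.append(new)
    return out

rows = [{"age": 34, "income": None}, {"age": None, "income": 52000}]
with_mvis = add_mvis(rows, ["age", "income"])
```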
96. Battery PRIOR
Vary the priors for the specified class within a given range
For example, vary the prior on class 1 from 0.2 to 0.8 in increments
of 0.02
Results in a family of models with drastically different
performance in terms of class richness and accuracy (see
the earlier discussion)
Can be used for hot spot detection
Can also be used to construct a very efficient voting
ensemble of trees
This battery is covered at greater length in our Advanced
CART class
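Why sweeping the prior changes the models can be seen in a simplified node-assignment sketch (toy counts; real CART also normalizes by the root-node class totals): the same node flips its class label as the prior on class 1 grows.

```python
def node_class(n1, n0, prior1):
    """Assign a node's class after reweighting its class counts by the
    priors - a simplified view of CART's within-node class assignment."""
    w1 = prior1 * n1
    w0 = (1.0 - prior1) * n0
    return 1 if w1 > w0 else 0

# One node (30 class-1 vs 70 class-0 cases) flips to class 1 as the
# prior on class 1 is raised - the effect BATTERY PRIOR sweeps over.
assignments = [node_class(30, 70, p) for p in (0.3, 0.5, 0.8)]
```

High priors on class 1 produce trees whose class-1 nodes are large but diluted; low priors produce small, rich nodes, which is what makes the sweep useful for hot-spot detection.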
97. Battery SHAVING
Automates variable selection process
Shaving from the top – at each step the most important
variable is eliminated
Capable of detecting and eliminating “model hijackers” – variables that
appear to be important on the learn sample but in general may hurt
ultimate model performance (for example, an ID variable)
Shaving from the bottom – at each step the least important
variable is eliminated
May drastically reduce the number of variables used by the model
SHAVING ERROR – at each iteration all current variables
are tried for elimination one at a time to determine which
predictor contributes the least
May result in a very long sequence of runs (quadratic complexity)
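Shaving from the bottom reduces to a simple elimination loop; a sketch (the importance scores are hypothetical, and for brevity they are frozen here, whereas the real battery refits and recomputes importances after each elimination):

```python
def shave_from_bottom(predictors, importance, n_keep):
    """Repeatedly drop the least important predictor until only
    n_keep remain - the BATTERY SHAVING idea in miniature."""
    current = list(predictors)
    while len(current) > n_keep:
        current.remove(min(current, key=importance.get))
    return current

# Hypothetical variable importances from a fitted tree.
imp = {"crim": 0.40, "rm": 0.35, "tax": 0.20, "age": 0.05}
kept = shave_from_bottom(["crim", "rm", "tax", "age"], imp, n_keep=2)
```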
98. Battery SAMPLE
Build a series of models in which the learn sample
is systematically reduced
Used to determine the impact of the learn sample
size on the model performance
Can be combined with BATTERY DRAW to get
additional spread
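A learning-curve sketch of the same idea (toy data; scoring the trivial mean model stands in for a CART run, and the real battery subsamples at random, whereas heads of the sample keep this deterministic):

```python
def sample_battery(y, fractions, fit_score):
    """Score a model on progressively smaller portions of the learn
    sample - the BATTERY SAMPLE idea."""
    return {f: fit_score(y[: max(2, int(len(y) * f))]) for f in fractions}

def mean_model_mse(y):
    """MSE of the simplest possible model: predicting the sample mean."""
    m = sum(y) / len(y)
    return sum((v - m) ** 2 for v in y) / len(y)

y = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
curve = sample_battery(y, [0.5, 1.0], mean_model_mse)
```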
99. Battery TARGET
Attempts to build a model for each variable in the
given list, treating it as the target and using the remaining
variables as predictors
Very effective in capturing inter-variable
dependency patterns
Can be used as an input to variable clustering and
multi-dimensional scaling approaches
When run on MVIs only, can reveal missing value
patterns in the data
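The rotation over targets reduces to a loop; a sketch (toy data; the best single-predictor squared correlation stands in for a full CART model's performance):

```python
def r2(x, y):
    """Squared correlation - stand-in for a model performance score."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sum((a - mx) ** 2 for a in x) *
                        sum((b - my) ** 2 for b in y))

def target_battery(data):
    """Treat each variable in turn as the target and score how well the
    remaining variables predict it - the BATTERY TARGET idea."""
    return {tgt: max(r2(x, y) for name, x in data.items() if name != tgt)
            for tgt, y in data.items()}

scores = target_battery({"a": [1, 2, 3, 4],
                         "b": [2, 4, 6, 8],
                         "c": [1, -1, 1, -1]})
```

Variables with high scores are strongly explained by the others (here `a` and `b` determine each other), which is the raw material for variable clustering.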
100. Stable Trees and Consistency
101. Notes on Tree Stability
Trees are usually stable near the top and become
progressively more unstable as the nodes
become smaller
Therefore, smaller trees in the pruning sequence will tend to
be more stable
Node instability usually manifests itself when comparing
node class distributions between the learn and test samples
Ideally we want to identify the largest stable tree in the
pruning sequence
The optimal tree suggested by CART may contain a few
unstable nodes structurally needed to optimize the overall
accuracy
Train Test Consistency feature (TTC) was designed to
address the topic at length
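The core of such a check is easy to sketch (node counts below are invented; the actual TTC report also checks rank-order agreement and tolerances): a node is direction-consistent when the learn and test samples agree on which class dominates it.

```python
def node_agreement(learn_counts, test_counts):
    """For each terminal node, compare learn vs test class-1 shares;
    a node is consistent if both samples point the same direction.
    A crude sketch of one check the TTC report performs."""
    report = {}
    for node in learn_counts:
        l1, l0 = learn_counts[node]
        t1, t0 = test_counts[node]
        l_share = l1 / (l1 + l0)
        t_share = t1 / (t1 + t0)
        report[node] = (l_share >= 0.5) == (t_share >= 0.5)
    return report

# Hypothetical (class-1, class-0) counts per terminal node.
learn = {1: (80, 20), 2: (10, 90), 3: (55, 45)}
test  = {1: (75, 25), 2: (15, 85), 3: (40, 60)}
stable = node_agreement(learn, test)  # node 3 flips direction on test
```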
102. MJ2000 Run
The dataset comes from an undisclosed financial institution
Set up a default classification run with 50% random test
sample
CART determines a 36-node optimal tree
103. Node Stability
Terminal Nodes TTC Report
A quick look at the Learn and Test node distributions
reveals a large number of unstable nodes
Let’s use the TTC feature by pressing the [T/T Consist] button
to identify the largest stable tree
104. The Largest Stable Tree
TTC Report
The tree with 11 terminal nodes is identified as stable (all nodes
match in both direction and rank)
While within the specified tolerance, nodes 2, 6, 8 are
somewhat unsatisfactory in terms of rank-order
These nodes will emerge as rank-order unstable if we lower the
tolerance level to 0.5
105. Improving Predictor List
Predictor Elimination by Shaving
Run BATTERY SHAVE to systematically eliminate the least important
predictors
The resulting optimal tree has even better accuracy than before
It also uses a much smaller set of only 6 predictors
Let’s study the tree stability for the new model
106. Comparing the Results
Original Tree New Tree
The tree on the left is the original (unshaved) one
The tree on the right is the result of shaving
The new tree is far more stable than the original one, except
for node 8
107. Justification
Learn Sample Test Sample
Node 8 is irrelevant for classification but its presence is
required in order to allow a very strong split below
108. Recommended Reading
Breiman, L., J. Friedman, R. Olshen and C. Stone (1984),
Classification and Regression Trees, Pacific Grove:
Wadsworth
Breiman, L. (1996), Bagging Predictors, Machine Learning,
24, 123-140
Breiman, L. (1996), Stacked Regressions, Machine
Learning, 24, 49-64
Friedman, J. H., T. Hastie and R. Tibshirani (2000), Additive
Logistic Regression: A Statistical View of Boosting, Annals of
Statistics, 28, 337-407