Data Mining with Decision Trees:
     An Introduction to CART®
                  Dan Steinberg
                Mikhail Golovnya
               golomi@salford-systems.com
            http://www.salford-systems.com




             Introduction to CART® Salford Systems ©2007
Intro CART Outline
 Historical background
 Sample CART model
 Binary Recursive Partitioning
 Fundamentals of tree pruning
 Competitors and surrogates
 Missing value handling
 Variable Importance
 Penalties and structured trees
 Introduction to splitting rules
 Constrained trees
 Introduction to priors and costs
 Cross-Validation
 Stable trees and train-test consistency (TTC)
 Hot-spot analysis
 Model scoring and translation
 Batteries of CART runs
Historical Background




In The Beginning…
 In 1984 Berkeley and Stanford statisticians announced a
 remarkable new classification tool which:
   Could separate relevant from irrelevant predictors
   Did not require any kind of variable transforms (logs, square roots)
    Was impervious to outliers and missing values
   Could yield relatively simple and easy to comprehend models
   Required little to no supervision by the analyst
   Was frequently more accurate than traditional logistic regression or
   discriminant analysis, and other parametric tools




The Years of Struggle
 CART was slow to gain widespread recognition in the
 late 1980s and early 1990s for several reasons:
   The monograph introducing CART is challenging to read
     a brilliant book overflowing with insights into tree-growing
     methodology, but fairly technical and brief
   The method was not expounded in any textbooks
   Originally taught only in advanced graduate statistics
   classes at a handful of leading universities
   The original standalone software came with slender
   documentation, and its output was not self-explanatory
   The method was a radical departure from conventional
   statistics

The Final Triumph
Rising interest in data mining
   Availability of huge data sets requiring analysis
   Need to automate or accelerate and improve analysis process
   Comparative performance studies
Advantages of CART over other tree methods
   handling of missing values
   assistance in interpretation of results (surrogates)
   performance advantages: speed, accuracy
New software and documentation make techniques
accessible to end users
Word of mouth generated by early adopters
Easy to use professional software available from Salford
Systems


Sample CART Model




So What is CART?
 Best illustrated with a famous example: the UCSD Heart
 Disease study
 Given the diagnosis of a heart attack based on
    Chest pain, Indicative EKGs, Elevation of enzymes typically
    released by damaged heart muscle
 Predict who is at risk of a 2nd heart attack and early death
 within 30 days
    Prediction will determine treatment program (intensive care or
    not)
 For each patient about 100 variables were available, including:
    Demographics, medical history, lab results
    19 noninvasive variables were used in the analysis


Typical CART Solution
 Example of a CLASSIFICATION tree
 Dependent variable is categorical (SURVIVE, DIE)
 Want to predict class membership

 The tree from the slide, flattened to text:

   Root: PATIENTS = 215, SURVIVE 178 (82.8%), DEAD 37 (17.2%)
   Is BP <= 91?
     YES -> Terminal Node A: SURVIVE 6 (30.0%), DEAD 14 (70.0%), NODE = DEAD
     NO  -> PATIENTS = 195, SURVIVE 172 (88.2%), DEAD 23 (11.8%)
            Is AGE <= 62.5?
              YES -> Terminal Node B: SURVIVE 102 (98.1%), DEAD 2 (1.9%), NODE = SURVIVE
              NO  -> PATIENTS = 91, SURVIVE 70 (76.9%), DEAD 21 (23.1%)
                     Is SINUS <= 0.5?
                       YES -> Terminal Node C: SURVIVE 14 (50.0%), DEAD 14 (50.0%), NODE = DEAD
                       NO  -> Terminal Node D: SURVIVE 56 (88.9%), DEAD 7 (11.1%), NODE = SURVIVE
How to Read It
 Entire tree represents a complete analysis or model
 Has the form of a decision tree
 Root of inverted tree contains all data
 Root gives rise to child nodes
 Child nodes can in turn give rise to their own children
 At some point a given path ends in a terminal node
 Terminal node classifies object
 Path through the tree governed by the answers to
 QUESTIONS or RULES
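Read as code, the example tree above is just a chain of YES/NO questions followed from the root to a terminal node. A minimal Python sketch (the function and argument names are ours, not part of CART):

```python
def classify(bp, age, sinus):
    """Follow the example heart-attack tree; YES answers go left."""
    if bp <= 91:
        return "DEAD"        # Terminal Node A
    if age <= 62.5:
        return "SURVIVE"     # Terminal Node B
    if sinus <= 0.5:
        return "DEAD"        # Terminal Node C
    return "SURVIVE"         # Terminal Node D
```

Each case follows exactly one path, so the model is fully described by these few rules.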




Decision Questions
 Question is of the form: is statement TRUE?
 Is continuous variable X ≤ c ?
 Does categorical variable D take on levels i, j, or k?
    e.g. Is geographic region 1, 2, 4, or 7?
 Standard split:
    If answer to question is YES a case goes left; otherwise it goes right
    This is the form of all primary splits
 Question is formulated so that only two answers possible
    Called binary partitioning
    In CART the YES answer always goes left




The Tree is a Classifier
 Classification is determined by following a case’s path down
 the tree
 Terminal nodes are associated with a single class
    Any case arriving at a terminal node is assigned to that class
 The implied classification threshold is constant across all
 nodes and depends on the currently set priors, costs, and
 observation weights (discussed later)
    Other algorithms often use only the majority rule or a limited set of
    fixed rules for node class assignment, and are therefore less flexible
 In standard classification, tree assignment is not
 probabilistic
    Another type of CART tree, the class probability tree, does report
    distributions of class membership in nodes (discussed later)
 With large data sets we can take the empirical distribution in
 a terminal node to represent the distribution of classes
The Importance of Binary Splits
 Some researchers suggested using 3-way, 4-way, etc. splits
 at each node (univariate)
 There are strong theoretical and practical reasons why we
 argue against such approaches:
    Exhaustive search for a binary split has linear algorithmic complexity;
    the alternatives have much higher complexity
    Choosing the best binary split is “less ambiguous” than the
    suggested alternatives
    Most importantly, binary splits result in the smallest rate of data
    fragmentation, thus allowing a larger set of predictors to enter a
    model
    Any multi-way split can be replaced by a sequence of binary splits
 There are numerous practical illustrations of the validity of
 the above claims


Accuracy of a Tree
 For classification trees
    All objects reaching a terminal node are classified in the same way
       e.g. All heart attack victims with BP>91 and AGE less than 62.5 are
       classified as SURVIVORS
    Classification is same regardless of other medical history and lab
    results
 The performance of any classifier can be summarized in
 terms of the Prediction Success Table (a.k.a. Confusion
 Matrix)
 The key to comparing the performance of different models is
 how one “wraps” the entire prediction success matrix into a
 single number
    Overall accuracy
    Average within-class accuracy
    Most generally – the Expected Cost

Prediction Success Table
                                       Classified As
         TRUE                    Class 1         Class 2
                                 "Survivors"     "Early Deaths"   Total   % Correct
         Class 1 "Survivors"         158              20           178      88%
         Class 2 "Early Deaths"        9              28            37      75%
         Total                       167              48           215      86.51%




 A large number of different terms exist to name various
 simple ratios based on the table above:
    Sensitivity, specificity, hit rate, recall, overall accuracy, false/true
    positive, false/true negative, predictive value, etc.
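The simplest of these ratios can be computed directly from the table above. A small sketch (variable names are ours; "Early Deaths" is taken as the positive class):

```python
# Ratios from the prediction success table above,
# taking "Early Deaths" as the positive class
tn, fp = 158, 20    # survivors: classified correctly / misclassified
fn, tp = 9, 28      # early deaths: missed / caught

sensitivity = tp / (tp + fn)                        # recall of early deaths
specificity = tn / (tn + fp)                        # recall of survivors
overall_accuracy = (tp + tn) / (tp + tn + fp + fn)  # 186/215, the 86.51%
```

Which single ratio to optimize is exactly the "wrapping" question raised above.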
Tree Interpretation and Use
 Practical decision tool
    Trees like this are used for real world decision making
       Rules for physicians
       Decision tool for a nationwide team of salesmen needing to classify
       potential customers
 Selection of the most important prognostic variables
    Variable screen for parametric model building
    Example data set had 100 variables
    Use results to find variables to include in a logistic regression
 Insight — in our example AGE is not relevant if BP is not
 high
    Suggests which interactions will be important



Introducing CART Interface




Classification with CART®

             Real world study early 1990s
             Fixed line service provider offering a
             new mobile phone service
             Wants to identify customers most
             likely to accept new mobile offer
             Data set based on limited market
             trial



Real World Trial
 830 households offered a mobile phone package
 All offered identical package but pricing was varied at
 random
    Handset prices ranged from low to high values
    Per minute prices ranged separately from low to high rates
 Household asked to make yes or no decision on offer
 15.2% of the households accepted the offer
 Our goal was to answer two key questions
    Who to make offers to?
    How to price?




Setting Up the Model




The only requirement is to select the TARGET (dependent) variable. CART will
do everything else automatically.
Model Results
  CART completes analysis and gives access
  to all results from the NAVIGATOR
    Shown on the next slide
  Upper section displays tree of a selected size
  Lower section displays error rate for trees of
  all possible sizes
  Green bar marks most accurate tree
  We display a compact 10 node tree for further
  scrutiny


Navigator Display




 Most accurate tree is marked with the green bar. We select the 10 node tree
 for convenience of a more compact display.
Root Node

                  Root node displays details of TARGET
                  variable in overall training data
                  We see that 15.2% of the 830
                  households accepted the offer
                  Goal of the analysis is now to extract
                  patterns characteristic of responders

                         If we could only use a single piece of
                         information to separate responders
                         from non-responders CART chooses
                         the HANDSET PRICE
                         Those offered the phone with a price
                         > 130 contain only 9.9% responders
                         Those offered a lower price respond
                         at 21.9%
Maximal Tree




Maximal tree is the raw material for the best model
Goal is to find optimal tree embedded inside maximal
tree
Will find optimal tree via “pruning”
Like backwards stepwise regression

Main Drivers

                                   (Red = good response,
                                   Blue = poor response)
                                  High values of a split
                                  variable always go to
                                  the right; low values go
                                  left




Examine Individual Node

                   Hover mouse over node to see
                   inside
                    Even though this node is on the
                    “high price” side of the tree it still
                   exhibits the strongest response
                   across all terminal node
                   segments (43.5% response)
                   Here we select rules expressed
                   in C
                   Entire tree can also be rendered
                   in Java, XML/PMML, or SAS

Configure Print Image




 Variety of fine controls to position and scale image nicely
Variable Importance Ranking




   CART offers multiple ways to compute variable importance
Predictive Accuracy




 This model is not very accurate but ranks responders well
Gains and ROC Curves




 In the top decile the model captures about 40% of
 responders

The Big Picture




Binary Recursive Partitioning
 Data is split into two partitions
   thus “binary” partition
 Partitions can also be split into sub-partitions
   hence procedure is recursive
 CART tree is generated by repeated dynamic
 partitioning of data set
   Automatically determines which variables to use
   Automatically determines which regions to focus on
   The whole process is totally data driven
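The procedure can be condensed into a toy sketch. This is not the real CART engine (no pruning, priors, costs, or surrogates; numeric predictors and Gini impurity only; all names are ours), but it shows the recursive structure:

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Exhaustively search every variable and threshold; return the
    (variable index, threshold, improvement) with the largest improvement."""
    best = (None, None, 0.0)
    parent = gini(labels)
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            gain = parent - (len(left) * gini(left)
                             + len(right) * gini(right)) / len(labels)
            if gain > best[2]:
                best = (j, t, gain)
    return best

def grow(rows, labels, min_node=2):
    """Binary recursive partitioning: split the data, then recurse on
    each partition until no worthwhile split remains."""
    j, t, gain = best_split(rows, labels)
    if j is None or gain <= 0.0 or len(labels) < min_node:
        return max(set(labels), key=labels.count)   # terminal node class
    left = [i for i, r in enumerate(rows) if r[j] <= t]
    right = [i for i, r in enumerate(rows) if r[j] > t]
    return (j, t,
            grow([rows[i] for i in left], [labels[i] for i in left], min_node),
            grow([rows[i] for i in right], [labels[i] for i in right], min_node))
```

The returned nested tuples mirror the tree: (variable, threshold, left subtree, right subtree), with a class label at each terminal node.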


Three Major Stages
 Tree growing
    Consider all possible splits of a node
    Find the best split according to the given splitting
   rule (best improvement)
   Continue sequentially until the largest tree is grown
 Tree Pruning – creating a sequence of nested
 trees (pruning sequence) by systematic removal
 of the weakest branches
 Optimal Tree Selection – using test sample or
 cross-validation to find the best tree in the
 pruning sequence
General Workflow
 Historical data is partitioned into three samples:
   Learn sample: build a sequence of nested trees (Stages 1 and 2)
   Test sample: monitor performance to find the best tree (Stage 3)
   Validate sample: confirm findings
Searching All Possible Splits
 For any node CART will examine ALL possible splits
    Computationally intensive but there are only a finite number of splits
 Consider first the variable AGE, which in our data set has
 a minimum value of 40
    The rule
       Is AGE ≤ 40? will separate out these cases to the left — the youngest
       persons
 Next increase the AGE threshold to the next youngest
 person
       Is AGE ≤ 43?
       This will direct six cases to the left
 Continue increasing the splitting threshold value by value
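The enumeration can be sketched in a few lines, using AGE values like those in the split-table example that follows:

```python
# AGE values as in the split-table example
ages = [40, 40, 40, 43, 43, 43, 45, 48, 48, 49, 49]

# One candidate question "Is AGE <= c?" per distinct value except the
# largest, i.e. (number of unique values - 1) splits in total
candidates = sorted(set(ages))[:-1]

# The slide's example: "Is AGE <= 43?" directs six cases to the left
left = [a for a in ages if a <= 43]
```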


Split Tables
 Split tables are used to locate continuous splits
    There will be as many splits as there are unique values minus one

       Sorted by Age                          Sorted by Blood Pressure

 AGE    BP    SINUST SURVIVE                 AGE       BP     SINUST SURVIVE
  40     91      0    SURVIVE                 43       78        1     DEAD
  40    110      0    SURVIVE                 40       83        1     DEAD
  40     83      1     DEAD                   40       91        0    SURVIVE
  43     99      0    SURVIVE                 43       99        0    SURVIVE
  43     78      1     DEAD                   40       110       0    SURVIVE
  43    135      0    SURVIVE                 49       110       1    SURVIVE
  45    120      0    SURVIVE                 48       119       1     DEAD
  48    119      1     DEAD                   45       120       0    SURVIVE
  48    122      0    SURVIVE                 48       122       0    SURVIVE
  49    150      0     DEAD                   43       135       0    SURVIVE
  49    110      1    SURVIVE                 49       150       0     DEAD

Categorical Predictors
      Left    Right
  1   A       B, C, D
  2   B       A, C, D
  3   C       A, B, D
  4   D       A, B, C
  5   A, B    C, D
  6   A, C    B, D
  7   A, D    B, C

 Categorical predictors spawn many more splits
    For example, only four levels A, B, C, D result in 7 splits
    In general, there will be 2^(K-1) - 1 splits
 CART considers all possible splits based on a categorical
 predictor unless the number of categories exceeds 15
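A sketch of the enumeration (the helper name is ours); fixing one level on the left side avoids counting each split and its mirror image twice:

```python
from itertools import combinations

def categorical_splits(levels):
    """Enumerate every left/right partition of a set of levels.
    Fixing the first level on the left avoids double-counting mirror
    images, giving 2**(K-1) - 1 distinct splits for K levels."""
    splits = []
    first, rest = levels[0], levels[1:]
    for k in range(len(rest) + 1):
        for combo in combinations(rest, k):
            left = {first, *combo}
            right = set(levels) - left
            if right:                     # skip the trivial all-left case
                splits.append((left, right))
    return splits
```

For the four levels A, B, C, D this reproduces exactly the 7 splits tabulated above.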


Binary Target Shortcut
 When the target is binary, special algorithms reduce
 compute time to linear in the number of levels (K-1 splits
 instead of 2^(K-1) - 1)
    30 levels takes twice as long as 15, not 10,000+ times as long
 When you have high-level categorical variables in a multi-
 class problem
    Create a binary classification problem
    Try different definitions of DPV (which groups to combine)
    Explore predictor groupings produced by CART
    From a study of all results decide which are most informative
    Create new grouped predicted variable for multi-class problem
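The shortcut rests on a classical result: for a binary target, order the levels by their class-1 rate, and only the K-1 contiguous groupings in that ordering need to be checked. A sketch under that assumption (the helper name and data layout are ours):

```python
from collections import defaultdict

def shortcut_splits(cases):
    """cases: (level, y) pairs with a binary 0/1 target.
    Order the levels by their class-1 rate; only the K-1 cut points
    in that ordering need to be evaluated."""
    counts = defaultdict(lambda: [0, 0])
    for level, y in cases:
        counts[level][y] += 1
    order = sorted(counts, key=lambda l: counts[l][1] / sum(counts[l]))
    # Candidate splits: left block vs right block at each cut point
    return [(set(order[:i]), set(order[i:])) for i in range(1, len(order))]
```

Three levels yield just two candidate splits instead of the 2^2 - 1 = 3 an exhaustive search would examine; the savings grow exponentially with K.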




Pruning and Variable Importance




GYM Model Example
 The original data comes from a financial
 institution and is disguised as a health club
 Problem: need to understand a market
 research clustering scheme
 Clusters were created externally using 18
 variables and conventional clustering software
 Need to find simple rules to describe cluster
 membership
 CART tree provides a nice way to arrive at an
 intuitive story

Variable Definitions
  ID         Identification # of member                                                  not used
  CLUSTER    Cluster assigned from clustering scheme (10 level categorical coded 1-10)   target
  ANYRAQT    Racquet ball usage (binary indicator coded 0, 1)
  ONRCT      Number of on-peak racquet ball uses
  ANYPOOL    Pool usage (binary indicator coded 0, 1)
  ONPOOL     Number of on-peak pool uses
  PLRQTPCT   Percent of pool and racquet ball usage
  TPLRCT     Total number of pool and racquet ball uses
  ONAER      Number of on-peak aerobics classes attended
  OFFAER     Number of off-peak aerobics classes attended
  SAERDIF    Difference between number of on- and off-peak aerobics visits
  TANNING    Number of visits to tanning salon
                                                                                         predictors
  PERSTRN    Personal trainer (binary indicator coded 0, 1)
  CLASSES    Number of classes taken
  NSUPPS     Number of supplements/vitamins/frozen dinners purchased
  SMALLBUS   Small business discount (binary indicator coded 0, 1)
  OFFER      Terms of offer
  IPAKPRIC   Index variable for package price
  MONFEE     Monthly fee paid
  FIT        Fitness score
  NFAMMEN    Number of family members
  HOME       Home ownership (binary indicator coded 0, 1)


Things to Do
 Dataset: gymsmall.syd
 Set the target and predictors
 Set 50% test sample
 Navigate the largest, smallest, and optimal trees
 Check [Splitters] and [Tree Details]
 Observe the logic of pruning
 Check [Summary Reports]
 Understand competitors and surrogates
 Understand variable importance

Pruning Multiple Nodes




                                      Note that Terminal Node 5
                                      has class assignment 1
                                      All the remaining nodes
                                      have class assignment 2
                                      Pruning merges Terminal
                                      Nodes 5 and 6 thus
                                      creating a “chain reaction”
                                      of redundant split removal
Pruning Weaker Split First
 Eliminating nodes 3 and 4 results in “lesser” damage –
 losing one correct dominant class assignment
 Eliminating nodes 1 and 2, or 5 and 6, would result in
 losing one correct minority class assignment


Understanding Variable Importance
 Both FIT and CLASSES are reported as equally important
 Yet the tree only includes CLASSES
 ONPOOL is one of the main splitters yet it has very low
 importance
 Want to understand why

Competitors and Surrogates



 Node details for the root node split are shown above
 Improvement measures class separation between two sides
 Competitors are next best splits in the given node ordered
 by improvement values
 Surrogates are splits most resembling (case by case) the
 main split
   Association measures the degree of resemblance
   Looking for competitors is free, finding surrogates is expensive
   It is possible to have an empty list of surrogates
 FIT is a perfect surrogate (and competitor) for CLASSES
The Utility of Surrogates
 Suppose that the main splitter is INCOME and comes from
 a survey form
 People with very low or very high income are likely not to
 report it – informative missingness in both tails of the
 income distribution
 The splitter itself wants to separate low income on the left
 from high income on the right – hence, it is unreasonable to
  treat missing INCOME as a separate unique value (the usual
  way to handle such cases)
 Using a surrogate will help to resolve ambiguity and
 redistribute missing income records between the two sides
 For example, all low education subjects will join the low
 income side while all high education subjects will join the
 high income side
Competitors and Surrogates Are Different

        Three nominal classes A, B, C, equally present
        (equal priors, unit costs, etc.)
        Split X sends classes A and B to the left and class C to the right;
        Split Y sends classes A and C to the left and class B to the right
     By symmetry, both splits must result in identical improvements
     Yet one split can’t be substituted for the other
     Hence Split X and Split Y are perfect competitors (same
     improvement) but very poor surrogates (low association)
     The reverse does not hold – a perfect surrogate is always a
     perfect competitor
Calculating Association
 Ten observations in a node
 Main splitter X sends 7 observations to the dominant side and 3 to the other
 Split Y mismatches the main split on 2 observations
 The default rule sends all cases to the dominant side,
 mismatching 3 observations

 Association = (Default – Split Y)/Default = (3 – 2)/3 = 1/3
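The calculation can be sketched as follows (the function name and the True-means-left encoding are ours; it assumes the main split is not trivial):

```python
def association(main_split, surrogate_split, cases):
    """Association of a surrogate with the main splitter:
    (default mismatches - surrogate mismatches) / default mismatches.
    Each split function returns True for 'go left'."""
    main = [bool(main_split(c)) for c in cases]
    surr = [bool(surrogate_split(c)) for c in cases]
    mismatch = sum(m != s for m, s in zip(main, surr))
    # Default rule: send every case to the dominant side of the main split
    default = len(main) - max(sum(main), len(main) - sum(main))
    return (default - mismatch) / default
```

With 10 cases, a 7/3 main split, and a surrogate that disagrees on 2 cases, this returns the slide's 1/3.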
Notes on Association
 Measure local to a node, non-parametric in nature
 Association is negative for nodes with very poor
 resemblance, not limited by -1 from below
 Any positive association is useful (better strategy
 than the default rule)
 Hence define surrogates as splits with the highest
 positive association
 In calculating case mismatch, check the possibility
 for split reversal (if the rule is met, send cases to
 the right)
   CART marks straight splits as “s” and reverse splits as “r”

Calculating Variable Importance
 List all main splitters as well as all surrogates along with the
 resulting improvements
 Aggregate the above list by variable accumulating
 improvements
 Sort the result descending
 Divide each entry by the largest value and scale to 100%
 The resulting importance list is based on the given tree
 exclusively; if a different tree is grown, the importance list will
 most likely change
 Variable importance reveals a deeper structure in stark
 contrast to a casual look at the main splitters
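The recipe above, sketched in Python; the helper name is ours, and in practice the (variable, improvement) records would come from walking a real tree (the numbers below are purely illustrative):

```python
def variable_importance(records):
    """records: (variable, improvement) pairs collected from every main
    splitter and every surrogate in the tree. Accumulate per variable,
    sort descending, and scale so the top variable reads 100."""
    totals = {}
    for var, imp in records:
        totals[var] = totals.get(var, 0.0) + imp
    top = max(totals.values())
    return {v: round(100.0 * s / top, 1)
            for v, s in sorted(totals.items(), key=lambda kv: -kv[1])}
```

A variable that never appears as a main splitter can still rank highly if it keeps turning up as a strong surrogate, which is exactly the FIT/CLASSES situation above.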


Mystery Solved
 In our example, improvements in the root node dominate
 the whole tree
 Consequently, the variable importance reflects the
 surrogates table in the root node

Variable Importance Caution
 Importance is a function of the OVERALL tree including
 deepest nodes
 Suppose you grow a large exploratory tree — review
 importances
 Then find an optimal tree via test set or CV yielding smaller
 tree
 Optimal tree SAME as exploratory tree in the top nodes
 YET importances might be quite different.
 WHY? Because larger tree uses more nodes to compute the
 importance
 When comparing results be sure to compare similar or same
 sized trees
Splitting Rules and Friends




Splitting Rules - Introduction
 Splitting rule controls the tree growing stage
 At each node, a finite pool of all possible splits exists
    The pool is independent of the target variable, it is based solely on
    the observations (and corresponding values of predictors) in the
    node
 The task of the splitting rule is to rank all splits in the pool
 according to their improvements
    The winner will become the main splitter
    The top runners-up will become competitors
 It is possible to construct different splitting rules
 emphasizing various approaches to the definition of the
 “best split”


Splitting Rules Control
                            Selected in the “Method” tab
                            of the Model Setup dialog
                            There are 6 alternative rules
                            for classification and 2 rules
                            for regression
                             Since there is no theoretical
                             justification for the best
                            splitting rule, we recommend
                            trying all of them and picking
                            the best resulting model

Further Insights
 On binary classification problems, GINI and
 TWOING are equivalent
    Therefore, we recommend setting the “Favor even Splits”
    control to 1 when TWOING is used
 GINI and Symmetric GINI only differ when an
 asymmetric cost matrix is used and are identical
 otherwise
 Ordered TWOING is a restricted version of
 TWOING on multinomial targets coded into a set of
 contiguous integers
 Class Probability is identical to GINI during tree
 growing but differs during the pruning
Battery RULES




 The battery was designed to automatically try all six
 alternative splitting rules
 According to the output above, the Entropy splitting rule resulted
 in the most accurate model on the given data
 Individual model navigators are also available
Controlled Tree Growth
 CART was originally conceived to be fully
 automated thus protecting an analyst from making
 erroneous split decisions
 In some cases, however, it could be desirable to
 influence tree evolution to address specific
 modeling needs
 Modern CART engine allows multiple direct and
 indirect ways of accomplishing this:
   Penalties on variables, missing values, and categories
   Controlling which variables are allowed or disallowed at
   different levels in the tree and node sizes
   Direct forcing of splits at the top two levels in the tree

Penalizing Individual Variables




                    We use the variable penalty control to
                    penalize CLASSES a little bit
                    This is enough to ensure FIT takes the main
                    splitter position
                    The improvement associated with
                    CLASSES is now downgraded by a factor of
                    one minus the penalty
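A hedged sketch of the arithmetic: assuming the penalty simply scales a split's improvement by one minus the penalty factor, a small penalty is enough to reorder two close contenders. The scores and the 0.1 penalty below are illustrative, not taken from the actual run:

```python
def penalized_improvement(raw_improvement, penalty):
    """Downgrade a split's improvement by its variable penalty.

    `penalty` in [0, 1]: 0 leaves the improvement unchanged, 1 removes
    the variable from contention entirely (illustrative formula).
    """
    return raw_improvement * (1.0 - penalty)

# Made-up scores: CLASSES now ranks below FIT despite a higher raw improvement
classes_score = penalized_improvement(0.050, 0.10)
fit_score = penalized_improvement(0.048, 0.00)
```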

Penalizing Missing and Categorical Predictors

                           Since splits on variables with
                           missing values are evaluated on a
                           reduced set of observations, such
                           splits may have an unfair advantage
                           Introducing a systematic penalty equal
                           to the proportion of the variable’s
                           values that are missing effectively
                           removes this possible bias
                           The same holds for categorical variables
                           with a large number of distinct levels
                           We recommend that the user always
                           set both controls to 1

Forcing Splits




 CART allows direct forcing of the split variable (and possibly
 split value) in the top three nodes in the tree
Notes on Forcing Splits
 It is not guaranteed that a forced split below the root node
 level will be preserved (because of pruning)
 Do not simply overturn “illogical” automatic splits by forcing
 something else – look for an explanation of why CART makes a
 “wrong” choice; it could indicate a problem with your data

Constrained Trees
 Many predictive models can benefit from
 Salford’s patent pending “Structured Trees”
 Trees constrained in how they are grown to
 reflect decision support requirements
 In mobile phone example: want tree to first
 segment on customer characteristics and
 then complete using price variables
 Price variables are under the control of the
 company
 Customer characteristics are not under
 company control

Setting Up Constraints




Green indicates where in the tree the variables of each group
are allowed to appear
Constrained Tree




 Demographic and spend information at top of tree
 Handset (HANDPRIC) and per minute pricing (USEPRICE) at bottom

Prior Probabilities
 An essential component of any CART analysis
 A prior probability distribution for the classes is needed to do
 proper splitting, since prior probabilities are used to calculate
 the improvements in the Gini formula
 Example: suppose the data contain
     99% type A (nonresponders),
      1% type B (responders)
 CART might focus on not missing any class A objects
    One “solution”: classify all objects as nonresponders
 Advantage: only a 1% error rate
 Very simple predictive rules, although not informative
 But realize this could be an optimal rule
Priors EQUAL
  Most common priors: PRIORS EQUAL
  This gives each class equal weight regardless of its frequency in
  the data
  Prevents CART from favoring more prevalent classes
  If we have 900 class A and 100 class B and equal priors
                                     A:   900
                                     B:   100



                         A:    300              A:      600
                         B:     10              B:       90

                        1/3 of all A            2/3 of all A
                        1/10 of all B           9/10 of all B
                        Class as A              Class as B

  With equal priors, prevalence is measured as a percent of OWN
  class size
  Measurements are relative to own class, not the entire learning set
  With PRIORS DATA both child nodes would be class A
  Equal priors puts both classes on an equal footing
Class Assignment Formula I

 For equal priors a node is classified as class 1 if

        N1(t) / N2(t) > N1 / N2

        Ni = number of cases in class i at the root node
        Ni(t) = number of cases in class i at node t


 Classify any node by the “count ratio in the node” relative to the
 “count ratio in the root”
 Classify by which class has the greatest relative richness
Class Assignment Formula II
  When priors are not equal, classify a node as class 1 if:
  π1N1(t) / π2N2(t) > N1 / N2
  where πi is the prior for class i
  Since π1 / π2 always appears as a ratio, just think in
  terms of boosting the amount N1(t) by a factor, e.g. 9:1,
  100:1, 2:1
  If priors are EQUAL, the πi terms cancel out
  If priors are DATA, then π1 / π2 = N1 / N2,
  which simplifies the formula to the plurality rule (simple
  counting):
  classify as class 1 if N1(t) > N2(t)
  If priors are neither EQUAL nor DATA, the formula
  provides a boost to the up-weighted classes
Example

                                     A:   900
                                     B:   100

                A:   300                              A:   600
                B:    10                              B:    90

      300/10 > 900/100, so classify as A      90/600 > 100/900, so classify as B



 In general the tests would be weighted by the appropriate priors:

      (π(A) · 300) / (π(B) · 10) > 900/100      (π(B) · 90) / (π(A) · 600) > 100/900

 The tests can be written as A/B or B/A, as convenient
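The assignment rule from the two formulas can be sketched in a few lines; the function below is illustrative (not Salford's code) and reproduces the example's numbers, with class 1 = A and class 2 = B:

```python
def assign_class(n1_t, n2_t, n1_root, n2_root, pi1=0.5, pi2=0.5):
    """Assign a node to class 1 when pi1*N1(t)/N1 > pi2*N2(t)/N2."""
    return 1 if pi1 * n1_t / n1_root > pi2 * n2_t / n2_root else 2

# Equal priors, root with 900 A / 100 B:
left = assign_class(300, 10, 900, 100)     # 300/900 > 10/100  -> class A
right = assign_class(600, 90, 900, 100)    # 600/900 < 90/100  -> class B
# PRIORS DATA (priors proportional to class size) reduces to plurality:
right_data = assign_class(600, 90, 900, 100, pi1=0.9, pi2=0.1)  # 600 > 90 -> A
```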
Cross Validation




Cross-Validation

  Cross-validation is a recent computer intensive
  development in statistics
  Purpose is to protect oneself from overfitting errors
    Don't want to capitalize on chance — track idiosyncrasies
    of this data
    Idiosyncrasies which will NOT be observed on fresh data
  Ideally would like to use large test data sets to evaluate
  trees, N>5000
  Practically, some studies don't have sufficient data to
  spare for testing
  Cross-validation will use SAME data for learning and for
  testing
10-fold CV: Industry Standard
  Begin by growing maximal tree on ALL data
  Divide data into 10 portions stratified on dependent
  variable levels
  Reserve first portion for test, grow new tree on remaining
  9 portions
  Use the 1/10 test set to measure error rate for this 9/10
  data tree
     Error rate is measured for the maximal tree and for all
     subtrees
  Now rotate out new 1/10 test data set
  Grow new tree on remaining 9 portions
  Repeat until all 10 portions of the data have been used
  as test sets
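The rotation described above can be sketched as follows. This is a minimal illustration that assigns folds by case index; real CART stratifies the folds on the target variable, which is omitted here:

```python
def cv_folds(n_cases, k=10):
    """Yield (test_idx, learn_idx) pairs, rotating each k-th out for test."""
    idx = list(range(n_cases))
    for fold in range(k):
        test = [i for i in idx if i % k == fold]
        learn = [i for i in idx if i % k != fold]
        yield test, learn

# Ten learn/test rotations over 100 cases:
folds = list(cv_folds(100))
```

In the real procedure, a maximal tree is grown on each 9/10 learn portion, its error is measured on the held-out 1/10, and the error rates are then summed by tree complexity and attributed to the initial all-data tree.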
The CV Procedure

    1           2    3     4    5       6       7      8   9   10

   Test                                Learn

    1           2    3     4    5       6       7      8   9   10

   Learn Test                               Learn

    1           2    3     4    5       6       7      8   9   10

        Learn       Test                       Learn
                                                                      ETC...

    1           2    3     4    5       6       7      8   9   10

                               Learn                           Test
CV Details
 Every observation is used as a test case exactly once
 In 10-fold CV each observation is used as a learning case
 9 times
 In 10-fold CV 10 auxiliary CART trees are grown in
 addition to the initial tree
 When all 10 CV trees are done the error rates are
 cumulated (summed)
 Summing of error rates is done by TREE complexity
 Summed Error (cost) rates are then attributed to INITIAL
 tree
 Observe that no two of these trees need be the same
 Results are subject to random fluctuations — sometimes
 severe
CV Replication and Stability
 Look at the separate CV tree results of a single CV analysis
 CART produces a summary of the results for each tree at
 end of output
 Table titled "CV TREE COMPETITOR LISTINGS" reports
 results for each CV tree Root (TOP), Left Child of Root
 (LEFT), and Right Child of Root (RIGHT)
 Want to know how stable results are -- do different variables
 split root node in different trees?
 Only difference between different CV trees is random
 elimination of 1/10 data
 Special battery called CVR exists to study the effects of
 repeated cross-validation process
Battery CVR




Regression Trees




General Comments
 Regression trees result in piecewise-constant
 models (a multi-dimensional staircase) on an
 orthogonal partition of the data space
   Thus usually not the best possible performer in terms of
   conventional regression loss functions
 Only a very limited number of controls is available
 to influence the modeling process
   Priors and costs are no longer applicable
   There are two splitting rules: LS (least squares) and
   LAD (least absolute deviation)
 Very powerful in capturing high-order interactions
 but somewhat weak in explaining simple main
 effects

Boston Housing Data Set
 Harrison, D. and D. Rubinfeld. Hedonic Housing Prices & Demand For
 Clean Air. Journal of Environmental Economics and Management, v5,
 81-102 , 1978
    506 census tracts in City of Boston for the year 1970
    Goal: study relationship between quality of life variables and property values
    MV          median value of owner-occupied homes in tract (‘000s)
    CRIM        per capita crime rates
    NOX         concentration of nitrogen oxides (pphm)
    AGE         percent built before 1940
    DIS         weighted distance to centers of employment
    RM          average number of rooms per house
    LSTAT       percent neighborhood ‘lower SES’
    RAD         accessibility to radial highways
    CHAS        borders Charles River (0/1)
    INDUS       percent non-retail business
    TAX         tax rate
    PT          pupil teacher ratio



Splitting the Root Node




 Improvement is defined as the greatest reduction in the sum of
 squared errors when a single constant prediction is replaced by two
 separate constants, one on each side of the split
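A minimal sketch of this improvement measure, assuming a least-squares (LS) split on a single predictor; the data values are made up for illustration:

```python
def sse(values):
    """Sum of squared errors around the mean of `values`."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def ls_improvement(y, split_point, x):
    """Reduction in SSE when one constant prediction is replaced by
    two node means, splitting cases on x <= split_point."""
    left = [yi for xi, yi in zip(x, y) if xi <= split_point]
    right = [yi for xi, yi in zip(x, y) if xi > split_point]
    return sse(y) - sse(left) - sse(right)

# Toy data: a split at x <= 3 yields two pure children,
# recovering the full root SSE as improvement
x = [1, 2, 3, 10, 11, 12]
y = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]
best = ls_improvement(y, 3, x)
```

The splitter maximizing this quantity over all candidate split points and predictors becomes the main splitter of the node.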
Regression Tree Model




 All cases in the given node are assigned the same predicted
 response – the node average of the original target
 Nodes are color-coded according to the predicted response
 We have a convenient segmentation of the population
 according to the average response levels

The Best and the Worst Segments




Scoring and Deployment




Notes on Scoring
 Trees cannot be expressed as a simple analytical
 expression; instead, a lengthy collection of rules is
 needed to express the tree logic
 Nonetheless, the scoring operation is very fast
 once the rules are compiled into machine code
 Scoring can be done in two ways:
   Internally – by using the SCORE command
   Externally – by using the TRANSLATE command to
   generate C, SAS, Java, or PMML code
 A model of any size can be scored or translated

Internal Scoring




Output File Format
  CASEID RESPONSE NODE CORRECT   DV   PROB_1   PROB_2 PATH_1 PATH_2 PATH_3 PATH_4 PATH_5 PATH_6   XA   XB   XC
      1        0    1        1    0 0.998795 0.001205      1      2      3     4      5      -1    0   0     0
      2        1    7        0    0 0.738636 0.261364      1     -7      0     0      0       0    1   0     0
      3        0    1        1    0 0.998795 0.001205      1      2      3     4      5      -1    0   0     0



 CART creates the following output variables
     RESPONSE – predicted class assignment
     NODE – terminal node the case falls into
     CORRECT – indicates whether the prediction is correct (assuming
     that the actual target is known)
     PROB_x – class probability (constant within each node)
     PATH_x – path traveled by case in the tree
           Case 1 starts in the root node, then follows through internal nodes 2, 3,
           4, 5 and finally lands in terminal node 1
           Case 2 starts in the root node and then lands in terminal node 7
     Finally CART adds all of the original model variables including the
     target and predictors


Model Translation




Output Code




CART Automation




Automated CART Runs
 Starting with CART 6.0 we have added a new feature called
 batteries
 A battery is a collection of runs obtained via a predefined
 procedure
 The simplest batteries vary one of the key CART
 parameters:
   RULES – run every major splitting rule
   ATOM – vary minimum parent node size
   MINCHILD – vary minimum node size
   DEPTH – vary tree depth
   NODES – vary maximum number of nodes allowed
   FLIP – flipping the learn and test samples (two runs only)


More Batteries
 DRAW – each run uses a randomly drawn learn sample
 from the original data
 CV – varying number of folds used in cross-validation
 CVR – repeated cross-validation each time with a different
 random number seed
 KEEP – select a given number of predictors at random from
 the initial larger set
   Supports a core set of predictors that must always be present
 ONEOFF – one predictor model for each predictor in the
 given list
 The remaining batteries are more advanced and will be
 discussed individually

Battery MCT
 The primary purpose is to test for possible model
 overfitting by running a Monte Carlo simulation:
   Randomly permute the target variable
   Build a CART tree
   Repeat a number of times
   The resulting family of profiles measures the intrinsic
   overfitting due to the learning process
   Compare the actual run’s profile to the family of random
   profiles to check the significance of the results
 This battery is especially useful when the relative
 error of the original run exceeds 90%
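The permutation idea can be sketched generically. Here `fit_and_score` is a hypothetical stand-in for a full CART build that returns a model error, and the toy one-split stump is only for demonstration:

```python
import random

def mct_profile(y, fit_and_score, n_reps=20, seed=13):
    """Score models fit on randomly permuted copies of the target.

    `fit_and_score(target)` is any routine returning a model error for
    the given target vector (a stand-in for a full CART run).
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_reps):
        shuffled = y[:]
        rng.shuffle(shuffled)
        scores.append(fit_and_score(shuffled))
    return scores

def stump_error(t):
    """Toy one-split 'tree': majority vote within each fixed half."""
    halves = (t[:10], t[10:])
    wrong = sum(len(h) - max(h.count(0), h.count(1)) for h in halves)
    return wrong / len(t)

y = [0] * 10 + [1] * 10
actual = stump_error(y)                   # the fixed split separates y perfectly
random_profile = mct_profile(y, stump_error)
```

If `actual` falls well below the distribution of `random_profile`, the model has found genuine structure rather than noise.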

Battery MVI
 Designed to check the impact of missing value indicators
 (MVIs) on model performance
   Build a standard model, missing values are handled via
   the mechanism of surrogates
   Build a model using MVIs only – measures the amount of
   predictive information contained in the missing value
   indicators alone
   Combine the two
   Same as above with missing penalties turned on




Battery PRIOR
 Vary the priors for the specified class within a given range
    For example, vary the prior on class 1 from 0.2 to 0.8 in increments
    of 0.02
 Results in a family of models with drastically different
 performance in terms of class richness and accuracy (see
 the earlier discussion)
 Can be used for hot spot detection
 Can also be used to construct a very efficient voting
 ensemble of trees
 This battery is covered in greater length in our Advanced
 CART class



Battery SHAVING
 Automates variable selection process
 Shaving from the top – at each step the most important
 variable is eliminated
   Capable of detecting and eliminating “model hijackers” – variables that
   appear to be important on the learn sample but may hurt
   ultimate model performance (for example, an ID variable)
 Shaving from the bottom – at each step the least important
 variable is eliminated
   May drastically reduce the number of variables used by the model
 SHAVING ERROR – at each iteration all current variables
 are tried for elimination one at a time to determine which
 predictor contributes the least
   May result in a very long sequence of runs (quadratic complexity)
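Shaving from the bottom can be sketched as a simple backward-elimination loop; the variable names and importance scores below are made up for illustration:

```python
def shave_from_bottom(predictors, importance):
    """Yield successive predictor lists, dropping the least important
    variable at each step (importance: name -> score; illustrative only)."""
    current = sorted(predictors, key=lambda p: importance[p], reverse=True)
    while current:
        yield list(current)
        current.pop()  # least important is last after the descending sort

# Hypothetical importance scores from an initial run:
imp = {"LSTAT": 10, "RM": 9, "AGE": 3, "ID": 1}
steps = list(shave_from_bottom(list(imp), imp))
```

In the battery, a model is rebuilt and evaluated at each step, and the step with the best test performance identifies the reduced predictor list.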

Battery SAMPLE
 Build a series of models in which the learn sample
 is systematically reduced
 Used to determine the impact of the learn sample
 size on the model performance
 Can be combined with BATTERY DRAW to get
 additional spread




Battery TARGET
 Attempt to build a model for each variable in the
 given list as a target using the remaining variables
 as predictors
 Very effective in capturing inter-variable
 dependency patterns
 Can be used as an input to variable clustering and
 multi-dimensional scaling approaches
 When run on MVIs only, can reveal missing value
 patterns in the data


Stable Trees and Consistency




Notes on Tree Stability
 Trees are usually stable near the top and become
 progressively more and more unstable as the nodes
 become smaller
 Therefore, smaller trees in the pruning sequence will tend to
 be more stable
 Node instability usually manifests itself when comparing
 node class distributions between the learn and test samples
 Ideally we want to identify the largest stable tree in the
 pruning sequence
 The optimal tree suggested by CART may contain a few
 unstable nodes structurally needed to optimize the overall
 accuracy
 The Train-Test Consistency (TTC) feature was designed to
 address this topic at length
MJ2000 Run




 The dataset comes from an undisclosed financial institution
 Set up a default classification run with a 50% random test
 sample
 CART determines a 36-node optimal tree


Node Stability
      Terminal Nodes                                    TTC Report




 A quick look at the Learn and Test node distributions
 reveals a large number of unstable nodes
 Let’s use TTC feature by pressing the [T/T Consist] button
 to identify the largest stable tree
The Largest Stable Tree

           TTC Report




 The tree with 11 terminal nodes is identified as stable (all nodes
 match on both direction and rank)
 While within the specified tolerance, nodes 2, 6, 8 are
 somewhat unsatisfactory in terms of rank-order
    These nodes will emerge as rank-order unstable if we lower the
    tolerance level to 0.5
Improving Predictor List
                          Predictor Elimination by Shaving




 Run BATTERY SHAVE to systematically eliminate the least important
 predictors
 The resulting optimal tree has even better accuracy than before
 It also uses a much smaller set of only 6 predictors
 Let’s study the tree stability for the new model

Comparing the Results
        Original Tree                   New Tree




 The tree on the left is the original (unshaved) one
 The tree on the right is the result of shaving
 The new tree is far more stable than the original one, except
 for node 8
Justification
        Learn Sample                                Test Sample




 Node 8 is irrelevant for classification but its presence is
 required in order to allow a very strong split below

Recommended Reading
 Breiman, L., J. Friedman, R. Olshen and C. Stone (1984),
 Classification and Regression Trees, Pacific Grove:
 Wadsworth
 Breiman, L (1996), Bagging Predictors, Machine Learning,
 24, 123-140
 Breiman, L (1996), Stacking Regressions, Machine
 Learning, 24, 49-64.
 Friedman, J. H., Hastie, T. and Tibshirani, R. "Additive
 Logistic Regression: a Statistical View of Boosting." (Aug.
 1998)





Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 

Dernier (20)

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 

• 5. The Years of Struggle
  CART was slow to gain widespread recognition in the late 1980s and early 1990s for several reasons:
    The monograph introducing CART is challenging to read: a brilliant book overflowing with insights into tree-growing methodology, but fairly technical and brief
    The method was not expounded in any textbooks
    It was originally taught only in advanced graduate statistics classes at a handful of leading universities
    The original standalone software came with slender documentation, and the output was not self-explanatory
    The method was such a radical departure from conventional statistics

• 6. The Final Triumph
  Rising interest in data mining
    Availability of huge data sets requiring analysis
    Need to automate, or accelerate and improve, the analysis process
  Comparative performance studies
  Advantages of CART over other tree methods
    Handling of missing values
    Assistance in interpretation of results (surrogates)
    Performance advantages: speed, accuracy
  New software and documentation made the techniques accessible to end users
  Word of mouth generated by early adopters
  Easy-to-use professional software available from Salford Systems

• 7. Sample CART Model

• 8. So What is CART?
  Best illustrated with a famous example: the UCSD Heart Disease study
  Given the diagnosis of a heart attack based on chest pain, indicative EKGs, and elevation of enzymes typically released by damaged heart muscle:
    Predict who is at risk of a second heart attack and early death within 30 days
    The prediction will determine the treatment program (intensive care or not)
  For each patient about 100 variables were available, including demographics, medical history, and lab results
  19 noninvasive variables were used in the analysis

• 9. Typical CART Solution
  Example of a CLASSIFICATION tree: the dependent variable is categorical (SURVIVE, DIE), and we want to predict class membership.
  [Tree diagram: the root node (215 patients: 178 SURVIVE, 82.8%; 37 DEAD, 17.2%) splits on "Is BP <= 91?". The YES branch is Terminal Node A (6 SURVIVE, 30.0%; 14 DEAD, 70.0%; class DEAD). The NO branch (195 patients: 172 SURVIVE, 88.2%; 23 DEAD, 11.8%) splits on "Is AGE <= 62.5?". Its YES branch is Terminal Node B (102 SURVIVE, 98.1%; 2 DEAD, 1.9%; class SURVIVE); its NO branch (91 patients: 70 SURVIVE, 76.9%; 21 DEAD, 23.1%) splits on "Is SINUS <= .5?" into Terminal Node C (14 SURVIVE, 50.0%; 14 DEAD, 50.0%; class DEAD) and Terminal Node D (56 SURVIVE, 88.9%; 7 DEAD, 11.1%; class SURVIVE).]

• 10. How to Read It
  The entire tree represents a complete analysis or model and has the form of a decision tree
  The root of the inverted tree contains all the data
  The root gives rise to child nodes; child nodes can in turn give rise to their own children
  At some point a given path ends in a terminal node, which classifies the object
  The path through the tree is governed by the answers to QUESTIONS or RULES

• 11. Decision Questions
  A question is of the form: is statement TRUE?
    Is continuous variable X ≤ c?
    Does categorical variable D take on levels i, j, or k? (e.g., is geographic region 1, 2, 4, or 7?)
  Standard split: if the answer to the question is YES, a case goes left; otherwise it goes right
    This is the form of all primary splits
  The question is formulated so that only two answers are possible: this is called binary partitioning
  In CART the YES answer always goes left
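The question-and-routing logic of slides 9-11 can be written as nested binary rules. Below is a hand-coded sketch of the sample heart-disease tree (YES answers go left), for illustration only; it reproduces the slide's splits, not CART output:

```python
def classify(bp, age, sinus):
    """Route one patient down the sample tree: YES answers go left."""
    if bp <= 91:              # root question: Is BP <= 91?
        return "DEAD"         # Terminal Node A
    if age <= 62.5:           # Is AGE <= 62.5?
        return "SURVIVE"      # Terminal Node B
    if sinus <= 0.5:          # Is SINUS <= .5?
        return "DEAD"         # Terminal Node C
    return "SURVIVE"          # Terminal Node D

print(classify(bp=85, age=50, sinus=0))    # low blood pressure -> DEAD
print(classify(bp=120, age=70, sinus=1))   # high BP, older, SINUS > .5 -> SURVIVE
```

Note how AGE and SINUS are consulted only on the high-BP side of the tree: the same variable can matter in one region of the data and be irrelevant in another.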
• 12. The Tree is a Classifier
  Classification is determined by following a case's path down the tree
  Terminal nodes are associated with a single class: any case arriving at a terminal node is assigned to that class
  The implied classification threshold is constant across all nodes and depends on the currently set priors, costs, and observation weights (discussed later)
    Other algorithms often use only the majority rule, or a limited set of fixed rules, for node class assignment, and are thus less flexible
  In standard classification, tree assignment is not probabilistic
    Another type of CART tree, the class probability tree, does report distributions of class membership in nodes (discussed later)
    With large data sets we can take the empirical distribution in a terminal node to represent the distribution of classes

• 13. The Importance of Binary Splits
  Some researchers have suggested using 3-way, 4-way, etc. splits at each node (univariate)
  There are strong theoretical and practical reasons against such approaches:
    Exhaustive search for a binary split has linear algorithmic complexity; the alternatives have much higher complexity
    Choosing the best binary split is "less ambiguous" than the suggested alternatives
    Most importantly, binary splits result in the smallest rate of data fragmentation, allowing a larger set of predictors to enter the model
    Any multi-way split can be replaced by a sequence of binary splits
  There are numerous practical illustrations of the validity of the above claims

• 14. Accuracy of a Tree
  For classification trees, all objects reaching a terminal node are classified in the same way
    e.g., all heart attack victims with BP > 91 and AGE less than 62.5 are classified as SURVIVORS
    The classification is the same regardless of other medical history and lab results
  The performance of any classifier can be summarized in terms of the Prediction Success Table (a.k.a. Confusion Matrix)
  The key to comparing the performance of different models is how one "wraps" the entire prediction success matrix into a single number:
    Overall accuracy
    Average within-class accuracy
    Most generally, the Expected Cost

• 15. Prediction Success Table

  TRUE Class               Classified Class 1     Classified Class 2     Total   % Correct
                           "Survivors"            "Early Deaths"
  Class 1 "Survivors"             158                     20               178      88%
  Class 2 "Early Deaths"            9                     28                37      75%
  Total                           167                     48               215    86.51%

  A large number of different terms exists to name various simple ratios based on the table above: sensitivity, specificity, hit rate, recall, overall accuracy, false/true positive, false/true negative, predictive value, etc.
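The summary numbers on this slide can be recomputed directly from the four cells of the table; a small sketch (the dictionary layout is just one convenient encoding):

```python
# Cells of the prediction success table: (true class, predicted class) -> count
table = {("surv", "surv"): 158, ("surv", "dead"): 20,
         ("dead", "surv"): 9,   ("dead", "dead"): 28}

n = sum(table.values())                                   # 215 cases total
overall = (table[("surv", "surv")] + table[("dead", "dead")]) / n
acc_surv = table[("surv", "surv")] / (table[("surv", "surv")] + table[("surv", "dead")])
acc_dead = table[("dead", "dead")] / (table[("dead", "surv")] + table[("dead", "dead")])
avg_within = (acc_surv + acc_dead) / 2                    # average within-class accuracy

print(f"overall accuracy      {overall:.4f}")   # 0.8651
print(f"survivor accuracy     {acc_surv:.2f}")  # 0.89
print(f"early-death accuracy  {acc_dead:.2f}")  # 0.76
```

The per-class accuracies here are the row-wise "% Correct" figures; weighting them by costs instead of averaging gives the Expected Cost mentioned on the previous slide.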
• 16. Tree Interpretation and Use
  A practical decision tool: trees like this are used for real-world decision making
    Rules for physicians
    Decision tool for a nationwide team of salesmen needing to classify potential customers
  Selection of the most important prognostic variables
    Variable screen for parametric model building
    The example data set had 100 variables; use the results to find variables to include in a logistic regression
  Insight: in our example AGE is not relevant if BP is not high
    Suggests which interactions will be important

• 17. Introducing CART Interface

• 18. Classification with CART®
  A real-world study from the early 1990s
  A fixed-line service provider offering a new mobile phone service wants to identify the customers most likely to accept the new mobile offer
  The data set is based on a limited market trial

• 19. Real World Trial
  830 households were offered a mobile phone package
  All were offered an identical package, but pricing was varied at random
    Handset prices ranged from low to high values
    Per-minute prices ranged separately from low to high rates
  Each household was asked to make a yes or no decision on the offer; 15.2% of the households accepted
  Our goal was to answer two key questions: who to make offers to, and how to price

• 20. Setting Up the Model
  The only requirement is to select the TARGET (dependent) variable; CART will do everything else automatically

• 21. Model Results
  CART completes the analysis and gives access to all results from the NAVIGATOR (shown on the next slide)
  The upper section displays the tree of a selected size; the lower section displays the error rate for trees of all possible sizes
  A green bar marks the most accurate tree
  We display a compact 10-node tree for further scrutiny

• 22. Navigator Display
  The most accurate tree is marked with the green bar; we select the 10-node tree for the convenience of a more compact display

• 23. Root Node
  The root node displays details of the TARGET variable in the overall training data
  We see that 15.2% of the 830 households accepted the offer
  The goal of the analysis is now to extract patterns characteristic of responders
  If we could use only a single piece of information to separate responders from non-responders, CART chooses the HANDSET PRICE
    Those offered the phone at a price > 130 contain only 9.9% responders; those offered a lower price respond at 21.9%

• 24. Maximal Tree
  The maximal tree is the raw material for the best model
  The goal is to find the optimal tree embedded inside the maximal tree
  We will find the optimal tree via "pruning", like backwards stepwise regression

• 25. Main Drivers
  (Red = good response, blue = poor response)
  High values of a split variable always go to the right; low values go left
• 26. Examine Individual Node
  Hover the mouse over a node to see inside
  Even though this node is on the "high price" side of the tree, it still exhibits the strongest response across all terminal-node segments (43.5% response)
  Here we select rules expressed in C; the entire tree can also be rendered in Java, XML/PMML, or SAS

• 27. Configure Print Image
  A variety of fine controls to position and scale the image nicely

• 28. Variable Importance Ranking
  CART offers multiple ways to compute variable importance

• 29. Predictive Accuracy
  This model is not very accurate but ranks responders well

• 30. Gains and ROC Curves
  In the top decile the model captures about 40% of responders

• 31. The Big Picture

• 32. Binary Recursive Partitioning
  Data is split into two partitions, thus "binary" partitioning
  Partitions can in turn be split into sub-partitions, hence the procedure is recursive
  A CART tree is generated by repeated dynamic partitioning of the data set
    Automatically determines which variables to use
    Automatically determines which regions to focus on
  The whole process is totally data driven

• 33. Three Major Stages
  Tree growing
    Consider all possible splits of a node
    Find the best split according to the given splitting rule (best improvement)
    Continue sequentially until the largest tree is grown
  Tree pruning: creating a sequence of nested trees (the pruning sequence) by systematic removal of the weakest branches
  Optimal tree selection: using a test sample or cross-validation to find the best tree in the pruning sequence
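The growing stage described above can be sketched as a bare-bones recursive partitioner. This is a toy illustration only (Gini impurity, a minimum node size, no pruning or surrogates), not Salford's implementation:

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Exhaustively rank every (feature, threshold) binary split by improvement."""
    best = None                                  # (improvement, feature, threshold)
    parent, n = gini(labels), len(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows})[:-1]:   # unique values minus one
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            child = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or parent - child > best[0]:
                best = (parent - child, f, t)
    return best

def grow(rows, labels, min_node=2):
    """Recursively partition until nodes are pure, too small, or unimprovable."""
    if len(set(labels)) == 1 or len(rows) < min_node:
        return max(set(labels), key=labels.count)      # terminal node: majority class
    found = best_split(rows, labels)
    if found is None or found[0] <= 0:
        return max(set(labels), key=labels.count)
    gain, f, t = found
    keep_l = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    keep_r = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            grow([r for r, _ in keep_l], [y for _, y in keep_l], min_node),
            grow([r for r, _ in keep_r], [y for _, y in keep_r], min_node))
```

`grow` returns nested `(feature, threshold, left, right)` tuples. Pruning and optimal tree selection, the other two stages, would operate on this maximal tree afterwards.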
• 34. General Workflow
  [Workflow diagram: Stage 1, learn from historical data; Stage 2, build a sequence of nested trees and monitor their performance on test data to select the best; Stage 3, validate to confirm the findings.]

• 35. Searching All Possible Splits
  For any node CART will examine ALL possible splits
    Computationally intensive, but there are only a finite number of splits
  Consider first the variable AGE, which in our data set has minimum value 40
    The rule "Is AGE ≤ 40?" will separate out these cases to the left (the youngest persons)
    Next increase the AGE threshold to the next youngest person: "Is AGE ≤ 43?" directs six cases to the left
    Continue increasing the splitting threshold value by value

• 36. Split Tables
  Split tables are used to locate continuous splits; there will be as many candidate splits as there are unique values, minus one

  Sorted by Age:
  AGE  BP   SINUST  SURVIVE
  40   91   0       SURVIVE
  40   110  0       SURVIVE
  40   83   1       DEAD
  43   99   0       SURVIVE
  43   78   1       DEAD
  43   135  0       SURVIVE
  45   120  0       SURVIVE
  48   119  1       DEAD
  48   122  0       SURVIVE
  49   150  0       DEAD
  49   110  1       SURVIVE

  Sorted by Blood Pressure:
  AGE  BP   SINUST  SURVIVE
  43   78   1       DEAD
  40   83   1       DEAD
  40   91   0       SURVIVE
  43   99   0       SURVIVE
  40   110  0       SURVIVE
  49   110  1       SURVIVE
  48   119  1       DEAD
  45   120  0       SURVIVE
  48   122  0       SURVIVE
  43   135  0       SURVIVE
  49   150  0       DEAD
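The split-table idea can be reproduced directly from the slide's data: sort on a variable, and each distinct value except the largest yields one candidate threshold (unique values minus one). A small sketch:

```python
# AGE and BP columns from the split tables above
ages = [40, 40, 40, 43, 43, 43, 45, 48, 48, 49, 49]
bps  = [91, 110, 83, 99, 78, 135, 120, 119, 122, 150, 110]

def candidate_thresholds(values):
    """One candidate split 'Is X <= t?' per unique value, except the largest."""
    return sorted(set(values))[:-1]

print(candidate_thresholds(ages))       # [40, 43, 45, 48] -> 4 candidate splits
print(len(candidate_thresholds(bps)))   # 9 candidate splits from 10 unique values
```

Splitting at the recorded values themselves (rather than midpoints) matches the slide's "Is AGE ≤ 40?", "Is AGE ≤ 43?" sequence; either convention partitions the training cases identically.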
• 37. Categorical Predictors
  Categorical predictors spawn many more splits; for example, just four levels A, B, C, D result in 7 splits:
    1. A | B, C, D
    2. B | A, C, D
    3. C | A, B, D
    4. D | A, B, C
    5. A, B | C, D
    6. A, C | B, D
    7. A, D | B, C
  In general, there will be 2^(K-1) - 1 splits
  CART considers all possible splits based on a categorical predictor unless the number of categories exceeds 15
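The 2^(K-1) - 1 count can be verified by enumerating the binary partitions directly; a sketch (fixing the first level on the left side avoids counting each partition twice):

```python
from itertools import combinations

def categorical_splits(levels):
    """All distinct binary partitions of a set of K levels: 2**(K-1) - 1 of them."""
    levels = list(levels)
    first, rest = levels[0], levels[1:]
    splits = []
    for size in range(len(rest) + 1):
        for combo in combinations(rest, size):
            left = {first, *combo}
            if len(left) < len(levels):      # skip the trivial everything-left "split"
                splits.append((left, set(levels) - left))
    return splits

splits = categorical_splits("ABCD")
print(len(splits))    # 7 = 2**(4-1) - 1, matching the table above
```

With K = 16 this is already 32,767 candidate splits for a single variable, which is why CART caps exhaustive categorical search at 15 levels.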
• 38. Binary Target Shortcut
  When the target is binary, special algorithms reduce compute time to linear in the number of levels (K - 1 splits instead of 2^(K-1) - 1)
    30 levels takes twice as long as 15, not 10,000+ times as long
  When you have high-level categorical variables in a multi-class problem:
    Create a binary classification problem
    Try different definitions of the DPV (which groups to combine)
    Explore the predictor groupings produced by CART
    From a study of all results, decide which are most informative
    Create a new grouped predictor variable for the multi-class problem
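The shortcut rests on a classical result used by CART for binary targets: sort the K levels by their observed response rate, and the optimal subset split is guaranteed to be one of the K - 1 cuts of that ordering. A sketch with hypothetical levels and rates:

```python
def ordered_splits(response_rate):
    """response_rate: dict mapping level -> P(target = 1 | level).
    Returns the K-1 candidate splits implied by the response-rate ordering."""
    order = sorted(response_rate, key=response_rate.get)
    return [(set(order[:i]), set(order[i:])) for i in range(1, len(order))]

rates = {"A": 0.05, "B": 0.30, "C": 0.12, "D": 0.60}   # hypothetical rates
for left, right in ordered_splits(rates):
    print(left, "vs", right)
# only 3 splits to evaluate instead of 2**3 - 1 = 7
```

Only these ordered cuts need to be scored by the splitting rule, which is what makes the search linear in K.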
• 39. Pruning and Variable Importance

• 40. GYM Model Example
  The original data come from a financial institution and are disguised as health-club data
  Problem: we need to understand a market-research clustering scheme
    Clusters were created externally using 18 variables and conventional clustering software
    We need to find simple rules to describe cluster membership
  A CART tree provides a nice way to arrive at an intuitive story

• 41. Variable Definitions
  ID        Identification # of member (not used)
  CLUSTER   Cluster assigned from clustering scheme (10-level categorical coded 1-10) (target)
  Predictors:
  ANYRAQT   Racquetball usage (binary indicator coded 0, 1)
  ONRCT     Number of on-peak racquetball uses
  ANYPOOL   Pool usage (binary indicator coded 0, 1)
  ONPOOL    Number of on-peak pool uses
  PLRQTPCT  Percent of pool and racquetball usage
  TPLRCT    Total number of pool and racquetball uses
  ONAER     Number of on-peak aerobics classes attended
  OFFAER    Number of off-peak aerobics classes attended
  SAERDIF   Difference between number of on- and off-peak aerobics visits
  TANNING   Number of visits to tanning salon
  PERSTRN   Personal trainer (binary indicator coded 0, 1)
  CLASSES   Number of classes taken
  NSUPPS    Number of supplements/vitamins/frozen dinners purchased
  SMALLBUS  Small business discount (binary indicator coded 0, 1)
  OFFER     Terms of offer
  IPAKPRIC  Index variable for package price
  MONFEE    Monthly fee paid
  FIT       Fitness score
  NFAMMEN   Number of family members
  HOME      Home ownership (binary indicator coded 0, 1)

• 42. Things to Do
  Dataset: gymsmall.syd
  Set the target and predictors
  Set a 50% test sample
  Navigate the largest, smallest, and optimal trees
  Check [Splitters] and [Tree Details]
  Observe the logic of pruning
  Check [Summary Reports]
  Understand competitors and surrogates
  Understand variable importance

• 43. Pruning Multiple Nodes
  Note that Terminal Node 5 has class assignment 1, while all the remaining nodes have class assignment 2
  Pruning merges Terminal Nodes 5 and 6, creating a "chain reaction" of redundant split removal

• 44. Pruning Weaker Split First
  Eliminating nodes 3 and 4 results in "lesser" damage: losing one correct dominant-class assignment
  Eliminating nodes 1 and 2, or nodes 5 and 6, would result in losing one correct minority-class assignment

• 45. Understanding Variable Importance
  Both FIT and CLASSES are reported as equally important, yet the tree only includes CLASSES
  ONPOOL is one of the main splitters, yet it has very low importance
  We want to understand why
• 46. Competitors and Surrogates
  Node details for the root-node split are shown above
  Improvement measures the class separation between the two sides
  Competitors are the next best splits in the given node, ordered by improvement values
  Surrogates are the splits most resembling (case by case) the main split
    Association measures the degree of resemblance
    Looking for competitors is free; finding surrogates is expensive
    It is possible to have an empty list of surrogates
  FIT is a perfect surrogate (and competitor) for CLASSES

• 47. The Utility of Surrogates
  Suppose the main splitter is INCOME and comes from a survey form
    People with very low or very high income are likely not to report it: informative missingness in both tails of the income distribution
  The splitter itself wants to separate low income on the left from high income on the right; hence it is unreasonable to treat missing INCOME as a separate unique value (the usual way to handle such a case)
  Using a surrogate helps resolve the ambiguity and redistribute missing-income records between the two sides
    For example, all low-education subjects will join the low-income side, while all high-education subjects will join the high-income side

• 48. Competitors and Surrogates Are Different
  [Diagram: a node containing three nominal classes A, B, C, equally present, with equal priors, unit costs, etc.; two candidate splits, Split X and Split Y, each isolate a different pairing of the classes.]
  For symmetry considerations, both splits must result in identical improvements, yet one split cannot be substituted for the other
  Hence Split X and Split Y are perfect competitors (same improvement) but very poor surrogates (low association)
  The reverse does not hold: a perfect surrogate is always a perfect competitor

• 49. Calculating Association
  Ten observations in a node; the main splitter is X
  Split Y mismatches the main splitter on 2 observations
  The default rule sends all cases to the dominant side, mismatching 3 observations
  Association = (Default - Split Y) / Default = (3 - 2) / 3 = 1/3
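The association formula on this slide is a one-liner; a sketch using the slide's own numbers:

```python
def association(default_mismatch, surrogate_mismatch):
    """Association of a candidate surrogate with the main splitter.
    default_mismatch: cases the default rule (send everything to the main
    split's dominant side) gets wrong; surrogate_mismatch: cases the
    candidate surrogate sends to the wrong side of the main split."""
    return (default_mismatch - surrogate_mismatch) / default_mismatch

# Slide example: 10 observations, default rule mismatches 3, Split Y mismatches 2
print(association(3, 2))   # 0.333... = 1/3
```

A surrogate that mismatches as often as the default rule scores 0, and one that does worse scores negative, which is why only positive-association splits are kept as surrogates (next slide).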
• 50. Notes on Association
  Association is local to a node and non-parametric in nature
  Association is negative for splits with very poor resemblance, and is not bounded by -1 from below
  Any positive association is useful (a better strategy than the default rule); hence surrogates are defined as the splits with the highest positive association
  In calculating case mismatch, check the possibility of split reversal (if the rule is met, send cases to the right)
    CART marks straight splits as "s" and reverse splits as "r"

• 51. Calculating Variable Importance
  List all main splitters as well as all surrogates, along with the resulting improvements
  Aggregate the above list by variable, accumulating improvements
  Sort the result in descending order
  Divide each entry by the largest value and scale to 100%
  The resulting importance list is based on the given tree exclusively; if a different tree is grown, the importance list will most likely change
  Variable importance reveals a deeper structure, in stark contrast to a casual look at the main splitters
  • 52. Mystery Solved In our example improvements in the root node dominate the whole tree Consequently, the variable importance reflects the surrogates table in the root node Introduction to CART® Salford Systems ©2007 52
  • 53. Variable Importance Caution Importance is a function of the OVERALL tree including deepest nodes Suppose you grow a large exploratory tree — review importances Then find an optimal tree via test set or CV yielding smaller tree Optimal tree SAME as exploratory tree in the top nodes YET importances might be quite different. WHY? Because larger tree uses more nodes to compute the importance When comparing results be sure to compare similar or same sized trees
  • 54. Splitting Rules and Friends Introduction to CART® Salford Systems ©2007 54
  • 55. Splitting Rules - Introduction Splitting rule controls the tree growing stage At each node, a finite pool of all possible splits exists The pool is independent of the target variable, it is based solely on the observations (and corresponding values of predictors) in the node The task of the splitting rule is to rank all splits in the pool according to their improvements The winner will become the main splitter The top runner-ups will become competitors It is possible to construct different splitting rules emphasizing various approaches to the definition of the “best split” Introduction to CART® Salford Systems ©2007 55
  • 56. Splitting Rules Control Selected in the “Method” tab of the Model Setup dialog There are 6 alternative rules for classification and 2 rules for regression Since there is no theoretical justification on the best splitting rule, we recommend trying all of them and picking the best resulting model Introduction to CART® Salford Systems ©2007 56
• 57. Further Insights On binary classification problems, GINI and TWOING are equivalent Therefore, we recommend setting the “Favor Even Splits” control to 1 when TWOING is used GINI and Symmetric GINI differ only when an asymmetric cost matrix is used and are identical otherwise Ordered TWOING is a restricted version of TWOING for multinomial targets coded as a set of contiguous integers Class Probability is identical to GINI during tree growing but differs during pruning Introduction to CART® Salford Systems ©2007 57
• 58. Battery RULES The battery was designed to automatically try all six alternative splitting rules According to the output above, the Entropy splitting rule resulted in the most accurate model on the given data Individual model navigators are also available Introduction to CART® Salford Systems ©2007 58
  • 59. Controlled Tree Growth CART was originally conceived to be fully automated thus protecting an analyst from making erroneous split decisions In some cases, however, it could be desirable to influence tree evolution to address specific modeling needs Modern CART engine allows multiple direct and indirect ways of accomplishing this: Penalties on variables, missing values, and categories Controlling which variables are allowed or disallowed at different levels in the tree and node sizes Direct forcing of splits at the top two levels in the tree Introduction to CART® Salford Systems ©2007 59
• 60. Penalizing Individual Variables We use the variable penalty control to penalize CLASSES slightly This is enough to let FIT take over the main splitter position The improvement associated with CLASSES is now downgraded by a factor of one minus the penalty Introduction to CART® Salford Systems ©2007 60
• 61. Penalizing Missing and Categorical Predictors Since splits on variables with missing values are evaluated on a reduced set of observations, such splits may have an unfair advantage Introducing a systematic penalty set to the proportion of the variable missing effectively removes the possible bias The same holds for categorical variables with a large number of distinct levels We recommend that the user always set both controls to 1 Introduction to CART® Salford Systems ©2007 61
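The penalty mechanics can be sketched as a simple discount on a split's raw improvement. The exact formula here is an illustrative assumption: the penalty control scales a discount proportional to the fraction of the variable that is missing.

```python
def penalized_improvement(raw, frac_missing, penalty=1.0):
    """Discount a split's raw improvement by how much of its
    variable is missing; penalty=1 applies the full discount,
    penalty=0 disables it (illustrative formula)."""
    return raw * (1.0 - penalty * frac_missing)

# A split worth 0.20 on a variable that is missing in 40% of cases:
print(penalized_improvement(0.20, 0.40))   # downgraded to 0.12
```

With the control at 1 (as recommended above), a split evaluated on only 60% of the node's cases must be substantially stronger than a complete-data split to win the node.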
  • 62. Forcing Splits CART allows direct forcing of the split variable (and possibly split value) in the top three nodes in the tree Introduction to CART® Salford Systems ©2007 62
• 63. Notes on Forcing Splits It is not guaranteed that a forced split below the root node level will be preserved (because of pruning) Do not simply overturn “illogical” automatic splits by forcing something else – look for an explanation of why CART makes a “wrong” choice; it could indicate a problem with your data Introduction to CART® Salford Systems ©2007 63
  • 64. Constrained Trees Many predictive models can benefit from Salford’s patent pending “Structured Trees” Trees constrained in how they are grown to reflect decision support requirements In mobile phone example: want tree to first segment on customer characteristics and then complete using price variables Price variables are under the control of the company Customer characteristics are not under company control Introduction to CART® Salford Systems ©2007 64
  • 65. Setting Up Constraints Green indicates where in the tree variables of group are allowed to appear Introduction to CART® Salford Systems ©2007 65
  • 66. Constrained Tree Demographic and spend information at top of tree Handset (HANDPRIC) and per minute pricing (USEPRICE) at bottom Introduction to CART® Salford Systems ©2007 66
• 67. Prior Probabilities An essential component of any CART analysis A prior probability distribution for the classes is needed to do proper splitting since prior probabilities are used to calculate the improvements in the Gini formula Example: Suppose the data contains 99% type A (nonresponders) and 1% type B (responders); CART might focus on not missing any class A objects One "solution": classify all objects as nonresponders Advantages: only a 1% error rate and very simple predictive rules, although not informative But realize this could be an optimal rule
• 68. Priors EQUAL Most common priors: PRIORS EQUAL This gives each class equal weight regardless of its frequency in the data Prevents CART from favoring more prevalent classes Example: 900 class A and 100 class B with equal priors; the root (A: 900, B: 100) splits into a left child (A: 300, B: 10) holding 1/3 of all A and 1/10 of all B, classified as A, and a right child (A: 600, B: 90) holding 2/3 of all A and 9/10 of all B, classified as B With equal priors prevalence is measured as a percent of OWN class size Measurements are relative to own class, not the entire learning set With PRIORS DATA both child nodes would be class A Equal priors puts both classes on an equal footing
• 69. Class Assignment Formula I For equal priors a node is classified as class 1 if N1(t) / N1 > N2(t) / N2 where Ni = number of cases in class i at the root node and Ni(t) = number of cases in class i at node t Classify any node by the “count ratio in node” relative to the “count ratio in root” Classify by which class has the greatest relative richness
• 70. Class Assignment Formula II When priors are not equal: classify as class 1 if π1 N1(t) / (π2 N2(t)) > N1 / N2 where πi is the prior for class i Since π1 / π2 always appears as a ratio, just think in terms of boosting the count Ni(t) by a factor, e.g. 9:1, 100:1, 2:1 If priors are EQUAL, the πi terms cancel out If priors are DATA, then π1 / π2 = N1 / N2, which simplifies the formula to the plurality rule (simple counting): classify as class 1 if N1(t) > N2(t) If priors are neither EQUAL nor DATA, the formula provides a boost to the up-weighted classes
• 71. Example Root: A: 900, B: 100 Left child: A: 300, B: 10; 300/900 > 10/100, so classify as A Right child: A: 600, B: 90; 90/100 > 600/900, so classify as B In general the tests would be weighted by the appropriate priors: π(A)·300/900 > π(B)·10/100 and π(B)·90/100 > π(A)·600/900 The tests can be written in terms of A or B, as convenient
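The class-assignment rule from the preceding slides can be sketched as a small function (the function name and call style are illustrative, not a Salford API):

```python
def assign_class(n1, n2, N1, N2, pi1=0.5, pi2=0.5):
    """Label a node class A (class 1) when pi1*n1/N1 > pi2*n2/N2,
    where (n1, n2) are node counts and (N1, N2) root counts."""
    return "A" if pi1 * n1 / N1 > pi2 * n2 / N2 else "B"

# The 900 A / 100 B example with equal priors:
print(assign_class(300, 10, 900, 100))   # 300/900 > 10/100  -> "A"
print(assign_class(600, 90, 900, 100))   # 90/100 > 600/900  -> "B"
# With PRIORS DATA (priors proportional to class sizes) the rule
# reduces to plurality counting, so the right child flips to A:
print(assign_class(600, 90, 900, 100, pi1=0.9, pi2=0.1))   # -> "A"
```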
  • 72. Cross Validation Introduction to CART® Salford Systems ©2007 72
• 73. Cross-Validation Cross-validation is a recent computer-intensive development in statistics Purpose is to protect oneself from overfitting errors Don't want to capitalize on chance by tracking idiosyncrasies of this data, idiosyncrasies which will NOT be observed on fresh data Ideally we would like to use large test data sets to evaluate trees, N>5000 Practically, some studies don't have sufficient data to spare for testing Cross-validation will use the SAME data for learning and for testing
  • 74. 10-fold CV: Industry Standard Begin by growing maximal tree on ALL data Divide data into 10 portions stratified on dependent variable levels Reserve first portion for test, grow new tree on remaining 9 portions Use the 1/10 test set to measure error rate for this 9/10 data tree Error rate is measured for the maximal tree and for all subtrees Now rotate out new 1/10 test data set Grow new tree on remaining 9 portions Repeat until all 10 portions of the data have been used as test sets
• 75. The CV Procedure [Diagram: the 10 data portions rotate through the test role; in run 1 portion 1 is Test and portions 2-10 are Learn, in run 2 portion 2 is Test, and so on through run 10]
  • 76. CV Details Every observation is used as test case exactly once In 10 fold CV each observation is used as a learning case 9 times In 10 fold CV 10 auxiliary CART trees are grown in addition to initial tree When all 10 CV trees are done the error rates are cumulated (summed) Summing of error rates is done by TREE complexity Summed Error (cost) rates are then attributed to INITIAL tree Observe that no two of these trees need be the same Results are subject to random fluctuations — sometimes severe
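The 10-fold rotation described above can be sketched in plain Python. This is a simplified deterministic partition; the real procedure stratifies the folds on the target and grows a full CART tree on each 9/10 learn set.

```python
def ten_fold_splits(n_cases, folds=10):
    """Yield (learn_indices, test_indices) for each rotation."""
    idx = list(range(n_cases))
    for f in range(folds):
        test = idx[f::folds]                    # every 10th case, offset f
        learn = [i for i in idx if i % folds != f]
        yield learn, test

splits = list(ten_fold_splits(100))
# 10 rotations; every case serves as a test case exactly once
all_test = sorted(i for _, test in splits for i in test)
print(len(splits), all_test == list(range(100)))   # 10 True
```

Each case lands in exactly one test set and in nine learn sets, matching the counts quoted on the CV Details slide.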
• 77. CV Replication and Stability Look at the separate CV tree results of a single CV analysis CART produces a summary of the results for each tree at the end of the output The table titled "CV TREE COMPETITOR LISTINGS" reports results for each CV tree: Root (TOP), Left Child of Root (LEFT), and Right Child of Root (RIGHT) Want to know how stable the results are -- do different variables split the root node in different trees? The only difference between CV trees is the random elimination of 1/10 of the data A special battery called CVR exists to study the effects of repeating the cross-validation process
  • 78. Battery CVR Introduction to CART® Salford Systems ©2007 78
  • 79. Regression Trees Introduction to CART® Salford Systems ©2007 79
• 80. General Comments Regression trees result in piecewise-constant models (a multi-dimensional staircase) on an orthogonal partition of the data space Thus usually not the best possible performer in terms of conventional regression loss functions Only a very limited number of controls is available to influence the modeling process Priors and costs are no longer applicable There are two splitting rules: LS and LAD Very powerful in capturing high-order interactions but somewhat weak in explaining simple main effects Introduction to CART® Salford Systems ©2007 80
• 81. Boston Housing Data Set Harrison, D. and D. Rubinfeld. Hedonic Housing Prices & Demand For Clean Air. Journal of Environmental Economics and Management, v5, 81-102, 1978 506 census tracts in the City of Boston for the year 1970 Goal: study the relationship between quality-of-life variables and property values
MV median value of owner-occupied homes in tract (‘000s)
CRIM per capita crime rates
NOX concentration of nitrogen oxides (pphm)
AGE percent built before 1940
DIS weighted distance to centers of employment
RM average number of rooms per house
LSTAT percent neighborhood ‘lower SES’
RAD accessibility to radial highways
CHAS borders Charles River (0/1)
INDUS percent non-retail business
TAX tax rate
PT pupil-teacher ratio
Introduction to CART® Salford Systems ©2007 81
  • 82. Splitting the Root Node Improvement is defined in terms of the greatest reduction in the sum of squared errors when a single constant prediction is replaced by two separate constants on each side Introduction to CART® Salford Systems ©2007 82
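A minimal sketch of that least-squares split search, on toy numbers rather than the actual Boston data: for every cut point, compare the SSE of one constant (the node mean) against two constants (the child means).

```python
def sse(ys):
    """Sum of squared errors around the mean."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(xs, ys):
    """Return (largest SSE reduction, cut point) over all cut points."""
    pairs = sorted(zip(xs, ys))
    best = (0.0, None)
    for k in range(1, len(pairs)):
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        gain = sse([y for _, y in pairs]) - sse(left) - sse(right)
        if gain > best[0]:
            best = (gain, (pairs[k - 1][0] + pairs[k][0]) / 2)
    return best

# Toy RM (rooms) vs MV (home value): larger homes are worth more
rm = [4.9, 5.2, 5.6, 6.1, 6.8, 7.2]
mv = [14.0, 15.0, 16.0, 30.0, 31.0, 33.0]
gain, cut = best_split(rm, mv)
print(round(cut, 2))   # the cut lands between 5.6 and 6.1 rooms: 5.85
```

The winning cut point isolates the jump in home values, which is exactly where replacing one constant with two reduces the squared error the most.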
  • 83. Regression Tree Model All cases in the given node are assigned the same predicted response – the node average of the original target Nodes are color-coded according to the predicted response We have a convenient segmentation of the population according to the average response levels Introduction to CART® Salford Systems ©2007 83
  • 84. The Best and the Worst Segments Introduction to CART® Salford Systems ©2007 84
  • 85. Scoring and Deployment Introduction to CART® Salford Systems ©2007 85
• 86. Notes on Scoring Trees cannot be expressed as a simple analytical expression; instead, a lengthy collection of rules is needed to express the tree logic Nonetheless, the scoring operation is very fast once the rules are compiled into machine code Scoring can be done in two ways: Internally – by using the SCORE command Externally – by using the TRANSLATE command to generate C, SAS, Java, or PMML code A model of any size can be scored or translated Introduction to CART® Salford Systems ©2007 86
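TRANSLATE emits the tree as nested if/else rules in C, SAS, Java, or PMML. A Python analogue of the emitted structure might look like this; the splits and node values are invented for illustration, not taken from an actual run.

```python
def score(case):
    """Hypothetical two-split regression tree over Boston-style
    variables; each leaf returns its node mean of MV."""
    if case["RM"] <= 6.5:
        if case["LSTAT"] > 14.4:
            return 12.0        # terminal node 1: low-value tracts
        return 21.6            # terminal node 2
    return 37.2                # terminal node 3: large homes

print(score({"RM": 7.1, "LSTAT": 5.0}))   # -> 37.2
```

Once compiled, scoring a case is just a handful of comparisons, which is why the operation is fast even for large trees.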
  • 87. Internal Scoring Introduction to CART® Salford Systems ©2007 87
• 88. Output File Format
CASEID RESPONSE NODE CORRECT DV PROB_1 PROB_2 PATH_1 PATH_2 PATH_3 PATH_4 PATH_5 PATH_6 XA XB XC
1 0 1 1 0 0.998795 0.001205 1 2 3 4 5 -1 0 0 0
2 1 7 0 0 0.738636 0.261364 1 -7 0 0 0 0 1 0 0
3 0 1 1 0 0.998795 0.001205 1 2 3 4 5 -1 0 0 0
CART creates the following output variables: RESPONSE – predicted class assignment; NODE – terminal node the case falls into; CORRECT – indicates whether the prediction is correct (assuming the actual target is known); PROB_x – class probability (constant within each node); PATH_x – path traveled by the case in the tree Case 1 starts in the root node, then follows through internal nodes 2, 3, 4, 5 and finally lands in terminal node 1 Case 2 starts in the root node and lands directly in terminal node 7 Finally, CART appends all of the original model variables, including the target and predictors Introduction to CART® Salford Systems ©2007 88
  • 89. Model Translation Introduction to CART® Salford Systems ©2007 89
  • 90. Output Code Introduction to CART® Salford Systems ©2007 90
  • 91. CART Automation Introduction to CART® Salford Systems ©2007 91
  • 92. Automated CART Runs Starting with CART 6.0 we have added a new feature called batteries A battery is a collection of runs obtained via a predefined procedure A trivial set of batteries is varying one of the key CART parameters: RULES – run every major splitting rule ATOM – vary minimum parent node size MINCHILD – vary minimum node size DEPTH – vary tree depth NODES – vary maximum number of nodes allowed FLIP – flipping the learn and test samples (two runs only) Introduction to CART® Salford Systems ©2007 92
  • 93. More Batteries DRAW – each run uses a randomly drawn learn sample from the original data CV – varying number of folds used in cross-validation CVR – repeated cross-validation each time with a different random number seed KEEP – select a given number of predictors at random from the initial larger set Supports a core set of predictors that must always be present ONEOFF – one predictor model for each predictor in the given list The remaining batteries are more advanced and will be discussed individually Introduction to CART® Salford Systems ©2007 93
  • 94. Battery MCT The primary purpose is to test possible model overfitting by running Monte Carlo simulation Randomly permute the target variable Build a CART tree Repeat a number of times The resulting family of profiles measures intrinsic overfitting due to the learning process Compare the actual run profile to the family of random profiles to check the significance of the results This battery is especially useful when the relative error of the original run exceeds 90% Introduction to CART® Salford Systems ©2007 94
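The Monte Carlo test behind Battery MCT can be sketched generically: permute the target, refit, and compare the real model's error against the null distribution. `fit_and_error` below is a hypothetical stand-in for a full CART run, and the toy model is purely illustrative.

```python
import random

def mct(x, y, fit_and_error, reps=50, seed=17):
    """Permutation test: fraction of random relabelings whose error
    matches or beats the real model's (small fractions are good)."""
    rng = random.Random(seed)
    real = fit_and_error(x, y)
    null = []
    for _ in range(reps):
        y_perm = y[:]
        rng.shuffle(y_perm)            # break any x-y relationship
        null.append(fit_and_error(x, y_perm))
    p = sum(e <= real for e in null) / reps
    return real, p

# Toy "model": error = fraction of cases where sign(x) mispredicts y
def fit_and_error(x, y):
    return sum((xi > 0) != yi for xi, yi in zip(x, y)) / len(y)

x = [-2, -1, -1, 1, 2, 3, -3, 2, -2, 1]
y = [False, False, False, True, True, True, False, True, False, True]
real, p = mct(x, y, fit_and_error)
print(real, p)   # real error is 0; random relabelings rarely match it
```

If the real run's profile sits inside the family of permuted profiles, the apparent fit is likely just the intrinsic overfitting of the learning process.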
  • 95. Battery MVI Designed to check the impact of the missing values indicators on the model performance Build a standard model, missing values are handled via the mechanism of surrogates Build a model using MVIs only – measures the amount of predictive information contained in the missing value indicators alone Combine the two Same as above with missing penalties turned on Introduction to CART® Salford Systems ©2007 95
• 96. Battery PRIOR Vary the priors for the specified class within a given range For example, vary the prior on class 1 from 0.2 to 0.8 in increments of 0.02 Will result in a family of models with drastically different performance in terms of class richness and accuracy (see the earlier discussion) Can be used for hot spot detection Can also be used to construct a very efficient voting ensemble of trees This battery is covered at greater length in our Advanced CART class Introduction to CART® Salford Systems ©2007 96
• 97. Battery SHAVING Automates the variable selection process Shaving from the top – at each step the most important variable is eliminated Capable of detecting and eliminating “model hijackers” – variables that appear to be important on the learn sample but in general may hurt ultimate model performance (for example, an ID variable) Shaving from the bottom – at each step the least important variable is eliminated May drastically reduce the number of variables used by the model SHAVING ERROR – at each iteration all current variables are tried for elimination one at a time to determine which predictor contributes the least May result in a very long sequence of runs (quadratic complexity) Introduction to CART® Salford Systems ©2007 97
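Shaving from the top can be sketched as a loop; `fit` below is a hypothetical stand-in returning an error rate and an importance ranking, and the hijacker scenario is invented for illustration.

```python
def shave_from_top(predictors, fit):
    """Repeatedly drop the most important predictor, refit, and keep
    the best-performing subset seen along the way."""
    best = (fit(predictors)[0], list(predictors))
    current = list(predictors)
    while len(current) > 1:
        _, ranking = fit(current)
        current = [p for p in current if p != ranking[0]]   # drop top var
        error_after, _ = fit(current)
        if error_after < best[0]:
            best = (error_after, list(current))
    return best   # (lowest error seen, predictor subset achieving it)

# Toy fit: an ID-like hijacker ranks most important yet hurts error
def fit(preds):
    error = 0.30 if "ID" in preds else 0.20
    ranking = sorted(preds, key=lambda p: p != "ID")   # ID first if present
    return error, ranking

print(shave_from_top(["ID", "AGE", "INCOME"], fit))
# (0.2, ['AGE', 'INCOME']) -- the hijacker is shaved away
```

The first shave removes the hijacker and the error improves, mirroring the MJ2000 result a few slides below where the shaved tree beats the original.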
  • 98. Battery SAMPLE Build a series of models in which the learn sample is systematically reduced Used to determine the impact of the learn sample size on the model performance Can be combined with BATTERY DRAW to get additional spread Introduction to CART® Salford Systems ©2007 98
  • 99. Battery TARGET Attempt to build a model for each variable in the given list as a target using the remaining variables as predictors Very effective in capturing inter-variable dependency patterns Can be used as an input to variable clustering and multi-dimensional scaling approaches When run on MVIs only, can reveal missing value patterns in the data Introduction to CART® Salford Systems ©2007 99
  • 100. Stable Trees and Consistency Introduction to CART® Salford Systems ©2007 100
• 101. Notes on Tree Stability Trees are usually stable near the top and become progressively more unstable as the nodes become smaller Therefore, smaller trees in the pruning sequence will tend to be more stable Node instability usually manifests itself when comparing node class distributions between the learn and test samples Ideally we want to identify the largest stable tree in the pruning sequence The optimal tree suggested by CART may contain a few unstable nodes structurally needed to optimize the overall accuracy The Train-Test Consistency (TTC) feature was designed to address this topic at length Introduction to CART® Salford Systems ©2007 101
  • 102. MJ2000 Run The dataset comes from an undisclosed financial institution Set up a default classification run with 50% random test sample CART determines 36-node optimal tree Introduction to CART® Salford Systems ©2007 102
• 103. Node Stability Terminal Nodes TTC Report A quick look at the Learn and Test node distributions reveals a large number of unstable nodes Let’s use the TTC feature by pressing the [T/T Consist] button to identify the largest stable tree Introduction to CART® Salford Systems ©2007 103
  • 104. The Largest Stable Tree TTC Report Tree with 11 terminal nodes is identified as stable (all nodes match on both direction and rank) While within the specified tolerance, nodes 2, 6, 8 are somewhat unsatisfactory in terms of rank-order These nodes will emerge as rank-order unstable if we lower the tolerance level to 0.5 Introduction to CART® Salford Systems ©2007 104
  • 105. Improving Predictor List Predictor Elimination by Shaving Run BATTERY SHAVE to systematically eliminate the least important predictors The resulting optimal tree has even better accuracy than before It also uses a much smaller set of only 6 predictors Let’s study the tree stability for the new model Introduction to CART® Salford Systems ©2007 105
  • 106. Comparing the Results Original Tree New Tree The tree on the left is the original (unshaved) one The tree on the right is the result of shaving The new tree is far more stable than the original one, except for node 8 Introduction to CART® Salford Systems ©2007 106
  • 107. Justification Learn Sample Test Sample Node 8 is irrelevant for classification but its presence is required in order to allow a very strong split below Introduction to CART® Salford Systems ©2007 107
• 108. Recommended Reading Breiman, L., J. Friedman, R. Olshen and C. Stone (1984), Classification and Regression Trees, Pacific Grove: Wadsworth Breiman, L. (1996), Bagging Predictors, Machine Learning, 24, 123-140 Breiman, L. (1996), Stacked Regressions, Machine Learning, 24, 49-64 Friedman, J. H., T. Hastie and R. Tibshirani (1998), Additive Logistic Regression: A Statistical View of Boosting Introduction to CART® Salford Systems ©2007 108