1. Decision Tree
R. Akerkar
TMRF, Kolhapur, India
2. Introduction
A classification scheme which generates a tree and a set of rules from a given data set.
The set of records available for developing classification methods is divided into two disjoint subsets – a training set and a test set.
The attributes of the records are categorised into two types:
Attributes whose domain is numerical are called numerical attributes.
Attributes whose domain is not numerical are called categorical attributes.
3. Introduction
A decision tree is a tree with the following properties:
An inner node represents an attribute.
An edge represents a test on the attribute of the father node.
A leaf represents one of the classes.
Construction of a decision tree
Based on the training data
Top-down strategy
4. Decision Tree
Example
The data set has five attributes.
There is a special attribute: the attribute class is the class label.
The attributes temp (temperature) and humidity are numerical attributes.
Other attributes are categorical, that is, they cannot be ordered.
Based on the training data set, we want to find a set of rules that tell us which values of outlook, temperature, humidity and wind determine whether or not to play golf.
5. Decision Tree
Example
We have five leaf nodes.
In a decision tree, each leaf node represents a rule.
We have the following rules corresponding to the tree given in the figure.
RULE 1 If it is sunny and the humidity is not above 75%, then play.
RULE 2 If it is sunny and the humidity is above 75%, then do not play.
RULE 3 If it is overcast, then play.
RULE 4 If it is rainy and not windy, then play.
RULE 5 If it is rainy and windy, then don't play.
6. Classification
The classification of an unknown input vector is done by traversing the tree from the root node to a leaf node.
A record enters the tree at the root node.
At the root, a test is applied to determine which child node the record will encounter next.
This process is repeated until the record arrives at a leaf node.
All the records that end up at a given leaf of the tree are classified in the same way.
There is a unique path from the root to each leaf.
The path is a rule which is used to classify the records.
7. In our tree, we can carry out the classification for an unknown record as follows.
Let us assume, for the record, that we know the values of the first four attributes (but we do not know the value of the class attribute): outlook = rain; temp = 70; humidity = 65; and windy = true.
We start from the root node to check the value of the attribute associated with the root node.
This attribute is the splitting attribute at this node.
For a decision tree, at every node there is an attribute associated
with the node called the splitting attribute.
In our example, outlook is the splitting attribute at root.
Since for the given record, outlook = rain, we move to the rightmost child node of the root.
At this node, the splitting attribute is windy and we find that for the record we want to classify, windy = true.
Hence, we move to the left child node to conclude that the class label is "no play".
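Below is a minimal Python sketch of this traversal, assuming the golf tree from the figure is hand-encoded as nested dictionaries; the encoding and the name classify are illustrative, not part of the original slides.

# A minimal sketch of the traversal described above; the nested-dictionary
# encoding of the tree is an assumption for illustration.
def classify(record, node):
    """Walk from the root to a leaf; each inner node tests one attribute."""
    while isinstance(node, dict):
        attr = node["attribute"]              # splitting attribute at this node
        value = record[attr]
        if "threshold" in node:               # numerical test, e.g. humidity <= 75
            branch = "<=" if value <= node["threshold"] else ">"
        else:                                 # categorical test, e.g. outlook
            branch = value
        node = node["children"][branch]
    return node                               # a leaf holds the class label

# Tree corresponding to RULES 1-5 of the example.
golf_tree = {
    "attribute": "outlook",
    "children": {
        "sunny": {"attribute": "humidity", "threshold": 75,
                  "children": {"<=": "play", ">": "no play"}},
        "overcast": "play",
        "rain": {"attribute": "windy",
                 "children": {True: "no play", False: "play"}},
    },
}

record = {"outlook": "rain", "temp": 70, "humidity": 65, "windy": True}
print(classify(record, golf_tree))  # -> no play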
The accuracy of the classifier is determined by the percentage of the test data set that is correctly classified.
We can see that for Rule 1 there are two records of the test data set
satisfying outlook= sunny and humidity < 75, and only one of these
is correctly classified as play.
Thus, the accuracy of this rule is 0.5 (or 50%). Similarly, the
accuracy of Rule 2 is also 0.5 (or 50%). The accuracy of Rule 3 is
0.66.
RULE 1 If it is sunny and the humidity is not above 75%, then play.
10. Concept of Categorical Attributes
Consider the following training data set.
There are three attributes,
namely, age, pincode and class.
The attribute class is used for the class label.
The attribute age is a numeric attribute, whereas pincode is a categorical
one.
Though the domain of pincode is numeric, no ordering can be defined among pincode values.
You cannot derive any useful information if one pincode is greater than another pincode.
The figure gives a decision tree for the training data.
The splitting attribute at the root is
pincode and the splitting criterion
here is pincode = 500 046.
Similarly, for the left child node, the splitting criterion is age < 48 (the splitting attribute is age).
At the root level, we have 9 records. The associated splitting criterion is pincode = 500 046.
As a result, we split the records into two subsets: records 1, 2, 4, 8, and 9 go to the left child node and the remaining records to the right node.
Although the right child node has the same splitting attribute (pincode), the splitting criterion is different.
The process is repeated at every node.
12. Advantages and Shortcomings of Decision
Tree Classifications
A decision tree construction process is concerned with identifying the splitting attributes and splitting criterion at every level of the tree.
Major strengths are:
Decision trees are able to generate understandable rules.
They are able to handle both numerical and categorical attributes.
They provide a clear indication of which fields are most important for prediction or classification.
Weaknesses are:
The process of growing a decision tree is computationally expensive. At each node, each candidate splitting field is examined before its best split can be found.
Some decision trees can only deal with binary-valued target classes.
13. Iterative Dichotomizer (ID3)
Quinlan (1986)
Each node corresponds to a splitting attribute
Each arc is a possible value of that attribute.
At each node the splitting attribute is selected to be the most
informative among the attributes not yet considered in the path from
the root.
Entropy is used to measure how informative a node is.
The algorithm uses the criterion of information gain to determine the goodness of a split.
The attribute with the greatest information gain is taken as the splitting attribute, and the data set is split for all distinct values of the attribute.
14. Training Dataset
This follows an example from Quinlan’s ID3
The class label attribute, buys_computer, has two distinct values.
Thus there are two distinct classes (m = 2).
Class C1 corresponds to yes and class C2 corresponds to no.
There are 9 samples of class yes and 5 samples of class no.

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no
15. Extracting Classification Rules from Trees
Represent the knowledge in
the form of IF-THEN rules
One rule is created for each
path from the root to a leaf
Each attribute-value pair
along a path forms a
conjunction
The leaf node holds the class prediction.
What are the rules?
Rules are easier for humans
to understand
16. Solution (Rules)
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
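The rules above can be read directly as code. The following small Python function is a sketch that encodes them, assuming a record is given as a dictionary; the function name and record format are illustrative, not from the slides.

# A minimal sketch encoding the five rules above; the dictionary-based record
# format is an assumption for illustration.
def buys_computer(record):
    age = record["age"]
    if age == "<=30":
        return "yes" if record["student"] == "yes" else "no"
    if age == "31…40":
        return "yes"
    # age == ">40": the decision depends on credit_rating
    return "no" if record["credit_rating"] == "excellent" else "yes"

print(buys_computer({"age": "<=30", "student": "yes",
                     "credit_rating": "fair"}))   # -> yes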
17. Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-conquer
manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
There are no samples left
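A compact Python sketch of this greedy top-down induction is given below, assuming categorical attributes and records represented as dictionaries with a "class" key; all names (entropy, information_gain, build_tree) are illustrative, not part of the original slides.

# A sketch of greedy top-down induction with information gain (ID3-style);
# records are dictionaries such as {"age": "<=30", ..., "class": "yes"}.
from collections import Counter
from math import log2

def entropy(records):
    counts = Counter(r["class"] for r in records)
    total = len(records)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(records, attr):
    """Gain(A) = entropy before the split minus expected entropy after it."""
    total = len(records)
    remainder = 0.0
    for value in set(r[attr] for r in records):
        subset = [r for r in records if r[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(records) - remainder

def build_tree(records, attributes):
    classes = [r["class"] for r in records]
    # Stopping condition: all samples belong to the same class.
    if len(set(classes)) == 1:
        return classes[0]
    # Stopping condition: no attributes left -> majority voting.
    if not attributes:
        return Counter(classes).most_common(1)[0][0]
    # Choose the splitting attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(records, a))
    node = {"attribute": best, "children": {}}
    remaining = [a for a in attributes if a != best]
    for value in set(r[best] for r in records):
        subset = [r for r in records if r[best] == value]
        node["children"][value] = build_tree(subset, remaining)
    return node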
18. Attribute Selection Measure: Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
S contains si tuples of class Ci for i = {1, …, m}
information measures the info required to classify any arbitrary tuple (information is encoded in bits):
I(s1, s2, ..., sm) = - Σ_{i=1..m} (si/s) log2(si/s)
entropy of attribute A with values {a1, a2, ..., av}:
E(A) = Σ_{j=1..v} ((s1j + ... + smj)/s) I(s1j, ..., smj)
information gained by branching on attribute A:
Gain(A) = I(s1, s2, ..., sm) - E(A)
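The three formulas translate directly into code. The following Python sketch works on class counts rather than raw records; the function names I, E and gain mirror the notation above and are purely illustrative.

# A direct translation of I(s1,...,sm), E(A) and Gain(A), working on counts.
from math import log2

def I(*counts):
    """Bits needed to classify a tuple drawn from these class counts."""
    s = sum(counts)
    return -sum(c / s * log2(c / s) for c in counts if c > 0)

def E(partitions):
    """Expected information after splitting on A; each partition is the
    tuple of class counts (s1j, ..., smj) for one value of A."""
    s = sum(sum(p) for p in partitions)
    return sum(sum(p) / s * I(*p) for p in partitions)

def gain(class_counts, partitions):
    """Gain(A) = I(s1,...,sm) - E(A)."""
    return I(*class_counts) - E(partitions)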
19. Entropy
Entropy measures the homogeneity (purity) of a set of examples.
It gives the information content of the set in terms of the class labels of
the examples.
Consider that you have a set of examples, S with two classes, P and N. Let
the set have p instances for the class P and n instances for the class N.
So the total number of instances we have is t = p + n. The view [p, n] can
be seen as a class distribution of S.
The entropy for S is defined as
Entropy(S) = - (p/t).log2(p/t) - (n/t).log2(n/t)
Example: Let a set of examples consist of 9 instances for class positive and 5 instances for class negative.
Answer: p = 9 and n = 5.
So Entropy(S) = - (9/14).log2(9/14) - (5/14).log2(5/14)
= -(0.64286)(-0.6375) - (0.35714)(-1.48557)
= (0.40982) + (0.53056)
= 0.940
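A quick Python check of this calculation (p = 9, n = 5):

# Verifies the Entropy(S) value worked out above.
from math import log2

p, n = 9, 5
t = p + n
entropy = -(p / t) * log2(p / t) - (n / t) * log2(n / t)
print(round(entropy, 3))  # -> 0.94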
20. Entropy
The entropy for a completely pure set is 0 and is 1 for a set with equal occurrences of both the classes.
i.e. Entropy[14,0] = - (14/14).log2(14/14) - (0/14).log2(0/14)
= - (1).log2(1) - (0).log2(0)
= - 0 - 0
= 0
i.e. Entropy[7,7] = - (7/14).log2(7/14) - (7/14).log2(7/14)
= - (0.5).log2(0.5) - (0.5).log2(0.5)
= - (0.5).(-1) - (0.5).(-1)
= 0.5 + 0.5
=1
21. Attribute Selection by Information Gain Computation
Class P: buys_computer = “yes”
Class N: buys_computer = “no”
I(p, n) = I(9, 5) = 0.940
Compute the entropy for age:

age     pi   ni   I(pi, ni)
<=30    2    3    0.971
31…40   4    0    0
>40     3    2    0.971

E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
(5/14) I(2,3) means “age <=30” has 5 out of 14 samples, with 2 yes's and 3 no's.
Hence, Gain(age) = I(p, n) - E(age) = 0.246
Similarly, Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048.
Since age has the highest information gain among the attributes, it is selected as the test attribute.
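The same numbers can be reproduced with a few lines of Python (the counts are taken from the table above):

# Numeric check of I(9,5), E(age) and Gain(age).
from math import log2

def info(*counts):
    s = sum(counts)
    return -sum(c / s * log2(c / s) for c in counts if c > 0)

i_total = info(9, 5)                                            # I(9, 5)
e_age = (5/14) * info(2, 3) + (4/14) * info(4, 0) + (5/14) * info(3, 2)
print(round(i_total, 3), round(e_age, 3), round(i_total - e_age, 3))
# -> 0.94 0.694 0.247 (the slides report 0.246 after rounding the intermediate values)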
22. Exercise 1
The following table consists of training data from an employee database.
Let status be the class attribute. Use the ID3 algorithm to construct a
decision tree from the given data.
24. Other Attribute Selection Measures
Gini index (CART, IBM IntelligentMiner)
All attributes are assumed continuous-valued
Assume there exist several possible split values for each
attribute
May need other tools, such as clustering, to get the
possible split values
Can be modified for categorical attributes
25. Gini Index (IBM IntelligentMiner)
If a data set T contains examples from n classes, the gini index gini(T) is defined as
gini(T) = 1 - Σ_{j=1..n} (pj)^2
where pj is the relative frequency of class j in T.
If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as
gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
The attribute that provides the smallest gini_split(T) is chosen to split the node (one needs to enumerate all possible splitting points for each attribute).
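A small Python sketch of these two definitions, assuming the class labels of a (sub)set are given as a plain list; the function names are illustrative:

# gini(T) and the weighted gini of a binary split.
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum of squared class frequencies."""
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_split(left_labels, right_labels):
    """(N1/N) gini(T1) + (N2/N) gini(T2) for a two-way split."""
    n1, n2 = len(left_labels), len(right_labels)
    n = n1 + n2
    return n1 / n * gini(left_labels) + n2 / n * gini(right_labels)

print(gini(["yes"] * 4))                   # pure set   -> 0.0
print(gini(["yes", "yes", "no", "no"]))    # 50/50 mix  -> 0.5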
28. Exercise 3
In the previous exercise, which is a better split of the data among the two split points? Why?
29. Solution 3
Intuitively, Salary <= 65K is a better split point since it produces relatively "pure" partitions, as opposed to Age <= 50, which results in more mixed partitions (i.e., just look at the distribution of Highs and Lows in S1 and S2).
More formally, let us consider the properties of the Gini index.
If a partition is totally pure, i.e., has all elements from the same
class, then gini(S) = 1-[1x1+0x0] = 1-1 = 0 (for two classes).
On the other hand, if the classes are totally mixed, i.e., both classes have equal probability, then gini(S) = 1 - [0.5x0.5 + 0.5x0.5] = 1 - [0.25 + 0.25] = 0.5.
In other words, the closer the gini value is to 0, the better the partition is. Since Salary has the lower gini, it is the better split.
30. Avoid Overfitting in Classification
Overfitting: An induced tree may overfit the training data
Too many branches, some may reflect anomalies due to noise
or outliers
Poor accuracy for unseen samples
Two approaches to avoid overfitting
Prepruning: Halt tree construction early—do not split a node if
this would result in the goodness measure falling below a
threshold
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a “fully grown” tree—get a sequence of progressively pruned trees
Use a set of data different from the training data to decide
which is the “best pruned tree”