
BCSE209L Machine Learning
Module- : Decision Trees
Dr. R. Jothi
Decision Tree Induction Algorithms
■ ID3
■ Can handle both numerical and categorical features
■ Feature selection – Entropy
■ CART (continuous features and continuous label)
■ Can handle both numerical and categorical features
■ Feature selection – Gini
■ Generally used for both regression and classification
Measure of Impurity: GINI
• The Gini index is the probability that a randomly chosen example from a node would be classified incorrectly if it were labeled at random according to the class distribution at that node.
• Gini index for a given node t with classes j:

$\mathrm{GINI}(t) = 1 - \sum_{j} \big[\, p(j \mid t) \,\big]^{2}$

NOTE: p(j | t) is computed as the relative frequency of class j at node t.
GINI Index: Example
• Example: Two classes C1 & C2 and node t has 5 C1 and
5 C2 examples. Compute Gini(t)
• Gini(t) = 1 − [p(C1|t)² + p(C2|t)²] = 1 − [(5/10)² + (5/10)²]
• = 1 − [¼ + ¼] = ½



$\mathrm{GINI}(t) = 1 - \sum_{j} \big[\, p(j \mid t) \,\big]^{2}$
For a two-class problem the Gini index always lies between 0 and 0.5: a value of 0 means the node is pure (all of its examples belong to a single class), while 0.5 means the node is maximally impure (the two classes are equally mixed).
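A minimal Python sketch of this node-level computation may help. The function name and the counts-based interface below are illustrative choices, not code from the slides; it simply implements GINI(t) = 1 − Σ_j p(j|t)² from the class counts at a node.

```python
def gini(counts):
    """Gini impurity of a node, given the class counts at that node.

    Implements GINI(t) = 1 - sum_j p(j|t)^2, where p(j|t) is the
    relative frequency of class j at node t.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Slide example: node t with 5 examples of C1 and 5 of C2
print(gini([5, 5]))   # 0.5  (worst case for two classes)
print(gini([10, 0]))  # 0.0  (pure node, best case)
```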
More on Gini
• The worst (largest) Gini occurs when every class has probability 1/nc, where nc is the number of classes; its value is then 1 − 1/nc.
• For 2-class problems the worst Gini is therefore ½.
• How do we get the best Gini? Come up with an example for node t with 10 examples from classes C1 and C2.
• 10 C1 and 0 C2
• Now what is the Gini?
• 1 − [(10/10)² + (0/10)²] = 1 − [1 + 0] = 0
• So 0 is the best Gini.
• So for 2-class problems, Gini varies from 0 (best) to ½ (worst).
Some More Examples
• Below we see the Gini values for 4 nodes with
different distributions. They are ordered from best to
worst. See next slide for details
• Note that thus far we are only computing GINI for one node.
We need to compute it for a split and then compute the
change in Gini from the parent node.
C1 = 0, C2 = 6 → Gini = 0.000
C1 = 1, C2 = 5 → Gini = 0.278
C1 = 2, C2 = 4 → Gini = 0.444
C1 = 3, C2 = 3 → Gini = 0.500
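Using the gini() helper sketched earlier (an assumed function, not part of the original slides), these four distributions can be checked directly:

```python
# Verify the four node distributions from the slide
for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    print(counts, round(gini(counts), 3))
# [0, 6] 0.0
# [1, 5] 0.278
# [2, 4] 0.444
# [3, 3] 0.5
```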
Examples for computing GINI
$\mathrm{GINI}(t) = 1 - \sum_{j} \big[\, p(j \mid t) \,\big]^{2}$

Node with C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0

Node with C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Gini = 1 − (1/6)² − (5/6)² = 0.278

Node with C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Gini = 1 − (2/6)² − (4/6)² = 0.444
Examples for Computing Error
$\mathrm{Error}(t) = 1 - \max_{i} P(i \mid t)$

Node with C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Error = 1 − max(0, 1) = 1 − 1 = 0

Node with C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Error = 1 − max(1/6, 5/6) = 1 − 5/6 = 1/6

Node with C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Error = 1 − max(2/6, 4/6) = 1 − 4/6 = 1/3
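A corresponding Python sketch for the misclassification error, again with an assumed counts-based interface rather than code from the slides:

```python
def classification_error(counts):
    """Misclassification error of a node: Error(t) = 1 - max_i P(i|t)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - max(counts) / total


# Verify the three examples above
for counts in ([0, 6], [1, 5], [2, 4]):
    print(counts, round(classification_error(counts), 3))
# [0, 6] 0.0
# [1, 5] 0.167  (= 1/6)
# [2, 4] 0.333  (= 1/3)
```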
Comparison among Splitting Criteria
For a 2-class problem:
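The comparison figure from the original slide is not reproduced in this text. The sketch below (assuming NumPy and Matplotlib are available) plots the three impurity measures for a two-class node as a function of p = P(C1 | t): all three vanish at p = 0 and p = 1 and peak at p = 0.5, where entropy reaches 1 while Gini and misclassification error reach 0.5.

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 500)          # P(C1 | t) for a two-class node
entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
gini_idx = 1 - (p ** 2 + (1 - p) ** 2)
error = 1 - np.maximum(p, 1 - p)

plt.plot(p, entropy, label="Entropy")
plt.plot(p, gini_idx, label="Gini index")
plt.plot(p, error, label="Misclassification error")
plt.xlabel("p = P(C1 | t)")
plt.ylabel("Impurity")
plt.legend()
plt.show()
```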
Example: Construct Decision Tree using Gini Index
Therefore, attribute B will be chosen to split the node.
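The data table for this example is not included in the extracted text, so the sketch below uses hypothetical class counts for two candidate attributes A and B; it only illustrates the general procedure: compute the weighted Gini of each candidate split and choose the attribute whose split gives the lowest weighted Gini (equivalently, the largest reduction from the parent node). It reuses the gini() helper sketched earlier.

```python
def weighted_gini(partitions):
    """Weighted Gini of a split: sum_k (|D_k|/|D|) * Gini(D_k).

    `partitions` is a list of class-count lists, one per child node.
    """
    total = sum(sum(part) for part in partitions)
    return sum(sum(part) / total * gini(part) for part in partitions)


# Hypothetical (C1, C2) counts in the two child nodes of each candidate split;
# both splits partition the same parent node with 6 C1 and 6 C2 examples.
split_A = [[4, 3], [2, 3]]   # attribute A = yes / no
split_B = [[6, 0], [0, 6]]   # attribute B = yes / no

for name, split in [("A", split_A), ("B", split_B)]:
    print(name, round(weighted_gini(split), 3))
# A 0.486
# B 0.0   -> the attribute with the smaller weighted Gini (here B) is chosen
```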
Gini vs Entropy
• Computationally, entropy is more complex since it uses logarithms; consequently, the Gini index is faster to calculate.
• Accuracy with the entropy criterion is often slightly better, but not always.
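A rough micro-benchmark of the per-term cost behind this claim (illustrative only, not a rigorous comparison): squaring a value versus taking its base-2 logarithm over the same list of numbers.

```python
import timeit

setup = "import math, random; p = [random.random() for _ in range(1000)]"
gini_expr = "1 - sum(x * x for x in p)"            # squares only
entropy_expr = "-sum(x * math.log2(x) for x in p)"  # logarithms

print("gini-like   :", timeit.timeit(gini_expr, setup, number=10_000))
print("entropy-like:", timeit.timeit(entropy_expr, setup, number=10_000))
```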
Table 11.6

Algorithm: ID3
Splitting criterion: Information Gain
$\alpha(A, D) = E(D) - E_A(D)$
where $E(D)$ is the entropy of D (a measure of uncertainty),
$E(D) = -\sum_{i=1}^{k} p_i \log_2 p_i$,
D is a data set with k classes $c_1, c_2, \ldots, c_k$, and $p_i = |C_{i,D}| / |D|$, where $C_{i,D}$ is the set of tuples with class $c_i$ in D.
$E_A(D)$ is the weighted average entropy when D is partitioned on the values of attribute A:
$E_A(D) = \sum_{j=1}^{m} \frac{|D_j|}{|D|} \, E(D_j)$,
where m is the number of distinct values of attribute A.
Remarks:
• The algorithm calculates $\alpha(A_i, D)$ for every attribute $A_i$ in D and chooses the attribute with the maximum $\alpha(A_i, D)$.
• The algorithm can handle both categorical and numerical attributes.
• It favors attributes that have a large number of distinct values.
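A short Python sketch of the ID3 criterion as defined above; the entropy() and information_gain() helpers and the toy data are illustrative assumptions, not code from the slides.

```python
import math
from collections import Counter


def entropy(labels):
    """E(D) = -sum_i p_i * log2(p_i) over the class distribution of D."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """alpha(A, D) = E(D) - E_A(D), where E_A(D) is the weighted average
    entropy of the partitions of D induced by attribute A."""
    n = len(labels)
    e_after = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        e_after += len(subset) / n * entropy(subset)
    return entropy(labels) - e_after


# Toy data: attribute values of A and class labels (illustrative only)
A = ["sunny", "sunny", "rain", "rain", "rain", "overcast"]
y = ["no",    "no",    "yes",  "yes",  "no",   "yes"]
print(information_gain(A, y))   # about 0.54 bits
```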
Algorithm: CART
Splitting criterion: Gini Index
$\gamma(A, D) = G(D) - G_A(D)$
where $G(D)$ is the Gini index of D (a measure of impurity),
$G(D) = 1 - \sum_{i=1}^{k} p_i^2$,
with $p_i = |C_{i,D}| / |D|$ and D a data set with k classes, and
$G_A(D) = \frac{|D_1|}{|D|} G(D_1) + \frac{|D_2|}{|D|} G(D_2)$
when D is partitioned into two data sets $D_1$ and $D_2$ based on some values of attribute A.
Remarks:
• The algorithm evaluates all binary partitions over the possible values of attribute A and chooses the binary partition with the maximum $\gamma(A, D)$.
• The algorithm becomes computationally very expensive when attribute A has a large number of values.
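A corresponding sketch for the CART criterion: it enumerates binary partitions of the distinct values of attribute A and returns the partition with the maximum γ(A, D). It reuses the gini() helper from earlier; the function name and interface are assumptions made for illustration.

```python
from itertools import combinations


def gini_gain(values, labels):
    """Best gamma(A, D) = G(D) - G_A(D) over the binary partitions of the
    distinct values of attribute A (CART-style split).

    Returns (maximum gamma, value subset defining the split), or None if
    attribute A has only one distinct value.
    """
    n = len(labels)
    classes = sorted(set(labels))
    parent = gini([labels.count(c) for c in classes])

    distinct = sorted(set(values))
    best = None
    for r in range(1, len(distinct)):          # enumerate subsets {S, not S}
        for left in combinations(distinct, r):
            d1 = [y for x, y in zip(values, labels) if x in left]
            d2 = [y for x, y in zip(values, labels) if x not in left]
            g_split = (len(d1) / n) * gini([d1.count(c) for c in classes]) \
                    + (len(d2) / n) * gini([d2.count(c) for c in classes])
            gain = parent - g_split
            if best is None or gain > best[0]:
                best = (gain, set(left))
    return best
```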
Algorithm: C4.5
Splitting criterion: Gain Ratio
$\beta(A, D) = \dfrac{\alpha(A, D)}{E_A^{*}(D)}$
where $\alpha(A, D)$ is the information gain of D (the same as in ID3), and $E_A^{*}(D)$ is the splitting information:
$E_A^{*}(D) = -\sum_{j=1}^{m} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}$
when D is partitioned into $D_1, D_2, \ldots, D_m$ corresponding to the m distinct values of attribute A.
Remarks:
• The attribute A with the maximum value of $\beta(A, D)$ is selected for splitting.
• The splitting information acts as a normalization term that counteracts the bias of information gain toward attributes with a large number of distinct values.
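Finally, a sketch of the C4.5 gain ratio, reusing the information_gain() helper from the ID3 sketch above (again, the name and interface are assumptions for illustration).

```python
import math


def gain_ratio(values, labels):
    """beta(A, D) = alpha(A, D) / E*_A(D), where E*_A(D) is the splitting
    information of the partition of D induced by attribute A."""
    n = len(labels)
    split_info = -sum(
        (values.count(v) / n) * math.log2(values.count(v) / n)
        for v in set(values)
    )
    if split_info == 0:          # attribute has a single value; no real split
        return 0.0
    return information_gain(values, labels) / split_info
```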
In addition to this, a few important characteristics of decision tree induction algorithms are highlighted in the following.