Data Mining: Model Overfitting
Introduction to Data Mining, 2nd Edition
by Tan, Steinbach, Karpatne, Kumar
Classification Errors
• Training errors: errors committed on the training set
• Test errors: errors committed on the test set
• Generalization errors: the expected error of a model on records randomly drawn from the same distribution
[Figure: the classification workflow. A learning algorithm induces a model from the Training Set (induction); the model is then applied to the Test Set (deduction).]

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
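A minimal sketch of this induction/deduction loop in Python with scikit-learn (the numeric encodings of Attrib1 and Attrib2 are illustrative, not from the slides):

from sklearn.tree import DecisionTreeClassifier

# Training records, encoded as (Attrib1: Yes=1/No=0, Attrib2: Small=0/Medium=1/Large=2, Attrib3 in K)
X_train = [[1, 2, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
           [0, 1, 60], [1, 2, 220], [0, 0, 85], [0, 1, 75], [0, 0, 90]]
y_train = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
# Unlabeled test records (Tid 11-15)
X_test = [[0, 0, 55], [1, 1, 80], [1, 2, 110], [0, 0, 95], [0, 2, 67]]

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # induction: learn the model
print(model.predict(X_test))                                          # deduction: apply the model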
Example Data Set
Two-class problem:
+ : 5400 instances
  • 5000 instances generated from a Gaussian centered at (10,10)
  • 400 noisy instances added
o : 5400 instances
  • Generated from a uniform distribution
10% of the data is used for training and 90% for testing.
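The data set can be reproduced approximately; a sketch (the Gaussian spread and the uniform bounds are assumptions, since the slide does not state them):

import numpy as np

rng = np.random.default_rng(0)
pos_core = rng.normal(loc=10.0, scale=1.0, size=(5000, 2))  # Gaussian centered at (10,10); scale assumed
pos_noise = rng.uniform(0, 20, size=(400, 2))               # 400 noisy '+' instances; bounds assumed
pos = np.vstack([pos_core, pos_noise])                      # the 5400 '+' instances
neg = rng.uniform(0, 20, size=(5400, 2))                    # 5400 'o' instances from a uniform distribution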
Increasing number of nodes in Decision Trees
Decision Tree with 4 nodes
[Figure: the 4-node decision tree and its decision boundaries on the training data]
Decision Tree with 50 nodes
[Figure: the 50-node decision tree and its decision boundaries on the training data]
Which tree is better?
[Figure: decision boundaries of the 4-node tree and the 50-node tree, side by side]
Model Underfitting and Overfitting
Underfitting: when the model is too simple, both training and test errors are large.
Overfitting: when the model is too complex, training error is small but test error is large.
• As the model becomes more and more complex, test error can start increasing even though training error keeps decreasing.
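This behavior is easy to reproduce; a sketch on synthetic data (all parameters are illustrative) that sweeps tree size and prints training vs. test error:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_informative=5, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.1, random_state=0)
for k in (2, 4, 8, 16, 64, 256):     # increasing model complexity
    t = DecisionTreeClassifier(max_leaf_nodes=k, random_state=0).fit(X_tr, y_tr)
    print(k, 1 - t.score(X_tr, y_tr), 1 - t.score(X_te, y_te))  # train error falls; test error eventually rises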
Model Overfitting – Impact of Training Data Size
Using twice the number of data instances:
• Increasing the size of the training data reduces the difference between training and testing errors at a given model size.
Model Overfitting – Impact of Training Data Size
Using twice the number of data instances:
• Increasing the size of the training data reduces the difference between training and testing errors at a given model size.
[Figure: decision boundaries of two 50-node decision trees, one trained on the original data and one on the doubled data]
Reasons for Model Overfitting
• Not enough training data
• High model complexity
  – Multiple comparison procedure
Effect of Multiple Comparison Procedure
• Consider the task of predicting whether the stock market will rise or fall on each of the next 10 trading days
• Random guessing: P(correct) = 0.5
• Make 10 random guesses in a row:
Day 1 Up
Day 2 Down
Day 3 Down
Day 4 Up
Day 5 Down
Day 6 Down
Day 7 Up
Day 8 Up
Day 9 Up
Day 10 Down
P(# correct ≥ 8) = [C(10,8) + C(10,9) + C(10,10)] / 2^10 = 0.0547
Effect of Multiple Comparison Procedure
• Approach:
  – Get 50 analysts
  – Each analyst makes 10 random guesses
  – Choose the analyst who makes the most correct predictions
• Probability that at least one analyst makes at least 8 correct predictions:
P(# correct ≥ 8) = 1 − (1 − 0.0547)^50 = 0.9399
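Both probabilities are easy to verify numerically; a minimal Python check:

from math import comb

# P(at least 8 of 10 fair guesses correct)
p_one = sum(comb(10, k) for k in (8, 9, 10)) / 2**10
print(round(p_one, 4))                 # 0.0547

# P(at least one of 50 independent analysts gets >= 8 correct)
print(round(1 - (1 - p_one)**50, 4))   # 0.9399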
Effect of Multiple Comparison Procedure
• Many algorithms employ the following greedy strategy:
  – Initial model: M
  – Alternative model: M′ = M ∪ γ, where γ is a component to be added to the model (e.g., a test condition of a decision tree)
  – Keep M′ if the improvement Δ(M, M′) > α
• Often, γ is chosen from a set of alternative components, Γ = {γ1, γ2, …, γk}
• If many alternatives are available, one may inadvertently add irrelevant components to the model, resulting in model overfitting
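A sketch of this greedy strategy, treating a model as a set of components (the function names and the set representation are illustrative):

def greedy_grow(M, candidates, improvement, alpha):
    """Repeatedly add the component gamma that most improves the model,
    as long as improvement(M, M | {gamma}) > alpha."""
    while candidates:
        gamma = max(candidates, key=lambda g: improvement(M, M | {g}))
        if improvement(M, M | {gamma}) <= alpha:
            break                       # no candidate clears the threshold
        M = M | {gamma}                 # keep M' = M union {gamma}
        candidates = candidates - {gamma}
    return M

With many candidates, some gamma will look good purely by chance, which is exactly the multiple-comparison effect described above.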
Effect of Multiple Comparison – Example
[Figure: two panels compare decision boundaries]
• Using only X and Y as attributes
• Using 100 additional noisy variables generated from a uniform distribution, along with X and Y
30% of the data is used for training and 70% for testing.
Notes on Overfitting
• Overfitting results in decision trees that are more complex than necessary
• Training error does not provide a good estimate of how well the tree will perform on previously unseen records
• Need ways to estimate generalization error
Model Selection
• Performed during model building
• Purpose is to ensure that the model is not overly complex (to avoid overfitting)
• Need to estimate generalization error:
  – Using a validation set
  – Incorporating model complexity
Model Selection: Using Validation Set
• Divide training data into two parts:
  – Training set: use for model building
  – Validation set: use for estimating generalization error
• Note: the validation set is not the same as the test set
• Drawback: less data available for training
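A sketch of validation-based model selection with scikit-learn (max_leaf_nodes stands in for model complexity; the candidate values are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
# carve a validation set out of the training data
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

best_k, best_acc = None, 0.0
for k in (4, 8, 16, 32, 64):                        # candidate complexities
    tree = DecisionTreeClassifier(max_leaf_nodes=k, random_state=0).fit(X_tr, y_tr)
    acc = tree.score(X_val, y_val)                  # validation accuracy estimates generalization
    if acc > best_acc:
        best_k, best_acc = k, acc
print(best_k, best_acc)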
Model Selection: Incorporating Model Complexity
• Rationale: Occam's Razor
  – Given two models with similar generalization errors, one should prefer the simpler model over the more complex model
  – A complex model has a greater chance of being fitted accidentally
  – Therefore, one should include model complexity when evaluating a model

Gen. Error(Model) = Train. Error(Model, Train. Data) + α × Complexity(Model)
Estimating the Complexity of Decision Trees
• Pessimistic error estimate of a decision tree T with k leaf nodes:

  err_gen(T) = err(T) + Ω × k / N_train

  – err(T): error rate on all training records
  – Ω: trade-off hyper-parameter (similar to α); the relative cost of adding a leaf node
  – k: number of leaf nodes
  – N_train: total number of training records
Estimating the Complexity of Decision Trees: Example
[Figure: two decision trees built from the same 24 training records: TL with 7 leaf nodes and TR with 4 leaf nodes; each leaf is annotated with its class counts (+/−).]
e(TL) = 4/24
e(TR) = 6/24
Ω = 1
e_gen(TL) = 4/24 + 1 × 7/24 = 11/24 = 0.458
e_gen(TR) = 6/24 + 1 × 4/24 = 10/24 = 0.417
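A one-line helper makes the comparison explicit (a sketch of the formula above):

def pessimistic_error(n_errors, n_leaves, n_train, omega=1.0):
    # err_gen(T) = err(T) + omega * k / N_train
    return (n_errors + omega * n_leaves) / n_train

print(pessimistic_error(4, 7, 24))   # TL: 11/24 = 0.458...
print(pessimistic_error(6, 4, 24))   # TR: 10/24 = 0.417...  -> prefer TR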
Estimating the Complexity of Decision Trees
• Resubstitution estimate:
  – Uses training error as an optimistic estimate of generalization error
  – Referred to as the optimistic error estimate
[Figure: the same two trees TL and TR as above, with their leaf class counts]
e(TL) = 4/24
e(TR) = 6/24
Minimum Description Length (MDL)
• Cost(Model, Data) = Cost(Data|Model) + α × Cost(Model)
  – Cost is the number of bits needed for encoding
  – Search for the least costly model
• Cost(Data|Model) encodes the misclassification errors
• Cost(Model) uses node encoding (number of children) plus splitting-condition encoding
[Figure: person A sends person B the labels of records X1…Xn, either by listing the labels directly or by transmitting a decision tree (internal nodes A?, B?, C?; leaf labels 0/1) that B applies to the unlabeled records.]
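As an illustration of MDL-style scoring (the encoding scheme below is an assumption, not the slide's; it charges log2(#attributes) bits per internal node and log2(#records) bits per misclassified record):

from math import ceil, log2

def mdl_cost(n_internal, n_attributes, n_errors, n_records):
    cost_model = n_internal * ceil(log2(n_attributes))  # which attribute each node splits on
    cost_data = n_errors * ceil(log2(n_records))        # identify each misclassified record
    return cost_data + cost_model

# a bigger tree pays more in Cost(Model) but may save bits in Cost(Data|Model)
print(mdl_cost(3, 16, 10, 100), mdl_cost(7, 16, 4, 100))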
Model Selection for Decision Trees
• Pre-pruning (early stopping rule)
  – Stop the algorithm before it grows a fully-grown tree
  – Typical stopping conditions for a node:
    · Stop if all instances belong to the same class
    · Stop if all the attribute values are the same
  – More restrictive conditions:
    · Stop if the number of instances is less than some user-specified threshold
    · Stop if the class distribution of instances is independent of the available features (e.g., using a χ² test)
    · Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)
    · Stop if the estimated generalization error falls below a certain threshold
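Libraries typically expose pre-pruning as hyperparameters; for example, a scikit-learn sketch (parameter values illustrative):

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
tree = DecisionTreeClassifier(
    min_samples_split=20,         # stop if a node has too few instances
    min_impurity_decrease=1e-3,   # stop if a split barely improves impurity
    max_depth=8,                  # hard cap on tree growth
    random_state=0,
).fit(X, y)
print(tree.get_n_leaves())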
Model Selection for Decision Trees
• Post-pruning
  – Grow the decision tree to its entirety
  – Subtree replacement:
    · Trim the nodes of the decision tree in a bottom-up fashion
    · If generalization error improves after trimming, replace the sub-tree with a leaf node
    · The class label of the leaf node is determined from the majority class of instances in the sub-tree
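scikit-learn implements a related post-pruning scheme, minimal cost-complexity pruning (not subtree replacement as described above); a sketch that picks the pruning strength by validation accuracy:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# candidate pruning strengths, from the fully-grown tree to heavy pruning
alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr).ccp_alphas
trees = [DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_tr, y_tr) for a in alphas]
best = max(trees, key=lambda t: t.score(X_val, y_val))   # keep the pruned tree that generalizes best
print(best.get_n_leaves())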
Example of Post-Pruning
[Figure: a node with 20 Yes / 10 No instances (Error = 10/30) splits on attribute A into four children A1–A4 with class counts (Yes 8 / No 4), (Yes 3 / No 4), (Yes 4 / No 1), and (Yes 5 / No 1).]
Training error (before splitting) = 10/30
Pessimistic error (before splitting) = (10 + 0.5)/30 = 10.5/30
Training error (after splitting) = 9/30
Pessimistic error (after splitting) = (9 + 4 × 0.5)/30 = 11/30
→ PRUNE!
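The pruning decision above is a two-line computation (penalty of 0.5 per leaf, as in the example):

def pessimistic(n_errors, n_leaves, n=30, penalty=0.5):
    return (n_errors + penalty * n_leaves) / n

before = pessimistic(10, 1)   # unsplit node: 10 errors, 1 leaf  -> 10.5/30
after = pessimistic(9, 4)     # after split: 9 errors, 4 leaves  -> 11/30
print("PRUNE!" if after >= before else "KEEP SPLIT")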
Examples of Post-pruning
Simplified Decision Tree:
depth = 1 :
| ImagePages <= 0.1333 : class 1
| ImagePages > 0.1333 :
| | breadth <= 6 : class 0
| | breadth > 6 : class 1
depth > 1 :
| MultiAgent = 0: class 0
| MultiAgent = 1:
| | totalPages <= 81 : class 0
| | totalPages > 81 : class 1
Decision Tree:
depth = 1 :
| breadth > 7 : class 1
| breadth <= 7 :
| | breadth <= 3 :
| | | ImagePages > 0.375 : class 0
| | | ImagePages <= 0.375 :
| | | | totalPages <= 6 : class 1
| | | | totalPages > 6 :
| | | | | breadth <= 1 : class 1
| | | | | breadth > 1 : class 0
| | width > 3 :
| | | MultiIP = 0:
| | | | ImagePages <= 0.1333 : class 1
| | | | ImagePages > 0.1333 :
| | | | | breadth <= 6 : class 0
| | | | | breadth > 6 : class 1
| | | MultiIP = 1:
| | | | TotalTime <= 361 : class 0
| | | | TotalTime > 361 : class 1
depth > 1 :
| MultiAgent = 0:
| | depth > 2 : class 0
| | depth <= 2 :
| | | MultiIP = 1: class 0
| | | MultiIP = 0:
| | | | breadth <= 6 : class 0
| | | | breadth > 6 :
| | | | | RepeatedAccess <= 0.0322 : class 0
| | | | | RepeatedAccess > 0.0322 : class 1
| MultiAgent = 1:
| | totalPages <= 81 : class 0
| | totalPages > 81 : class 1
[In the original figure, arrows from the full tree to the simplified tree are labeled Subtree Raising and Subtree Replacement.]
Model Evaluation
• Purpose: to estimate the performance of a classifier on previously unseen data (test set)
• Holdout
  – Reserve k% for training and (100−k)% for testing
  – Random subsampling: repeated holdout
• Cross validation
  – Partition data into k disjoint subsets
  – k-fold: train on k−1 partitions, test on the remaining one
  – Leave-one-out: k = n
Cross-validation Example
• 3-fold cross-validation
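A sketch of 3-fold cross-validation with scikit-learn (data and model illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y,
    cv=KFold(n_splits=3, shuffle=True, random_state=0),   # 3 disjoint folds
)
print(scores, scores.mean())   # one error estimate per fold, then the average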
Variations on Cross-validation
• Repeated cross-validation
  – Perform cross-validation a number of times
  – Gives an estimate of the variance of the generalization error
• Stratified cross-validation
  – Guarantees the same percentage of class labels in training and test folds
  – Important when classes are imbalanced and the sample is small
• Use a nested cross-validation approach for model selection and evaluation (see the sketch below)
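A sketch of nested cross-validation (the inner loop selects the model, the outer loop evaluates it; StratifiedKFold preserves class proportions in each fold):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)  # imbalanced classes
inner = GridSearchCV(                                    # inner CV: model selection
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_leaf_nodes": [4, 16, 64]},
    cv=StratifiedKFold(n_splits=3),
)
scores = cross_val_score(inner, X, y, cv=StratifiedKFold(n_splits=3))  # outer CV: evaluation
print(scores.mean())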