DATA MINING
TECHNIQUES
UNIT-IV
Classification-Basic Concept
• Decision Tree Induction – Bayes Classification Methods – Rule-Based Classification – Model Evaluation and
Selection – Techniques to improve Classification Accuracy - Bayesian Belief Networks – Classification by
Backpropagation.
• Classification and prediction are two forms of data analysis that can be used to extract models describing
important data classes or to predict future data trends
• Classification predicts categorical (discrete, unordered) class labels
• Prediction models continuous-valued functions
• E.g., bank loan application approval, AllElectronics customer data
• Regression analysis is a statistical methodology that is most often used for numeric prediction.
• Data classification is a two-step process
 Learning
 Classification
Classification-Basic Concept
• What are training data and test data?
• What are supervised learning and unsupervised learning?
• Issues regarding classification and prediction
Preprocessing the data for classification and prediction
Data cleaning
Relevance analysis
Data transformation and reduction (e.g., normalization)
• Comparing classification and prediction methods
Accuracy, speed, robustness, scalability, interpretability
Decision Tree Induction
• Decision tree induction is the learning of decision trees from class-labeled training tuples.
• A decision tree is a flowchart-like tree structure in which each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label.
• The topmost node in a tree is the root node.
Decision Tree Induction
• How is a decision tree used for classification?
• Why are decision tree classifiers so popular?
• Attribute selection measures are used to select the attribute that best
partitions the tuples into distinct classes
• Tree pruning attempts to identify and remove such branches with the
goal of improving classification accuracy on unseen data.
Decision Tree Induction
• J. Ross Quinlan, a researcher in machine learning, developed a decision tree algorithm known as ID3 (Iterative Dichotomiser)
• Successor of ID3: C4.5
• Classification and Regression Trees (CART)
• ID3, C4.5 and CART adopt a greedy (i.e., non-backtracking), top-down, recursive, divide-and-conquer approach.
• Attribute selection Measures: Information Gain, Gain Ratio, Gini Index
• Other attribute selection measures
• Tree pruning
• Scalability and decision tree induction
• Visual mining for decision tree induction
Attribute selection measures
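The attribute selection measures named above can be made concrete with a small sketch. The toy dataset, attribute layout and helper functions below are illustrative assumptions (not taken from the slides); the computation follows the usual ID3-style definition of entropy and information gain.

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum_i p_i * log2(p_i) over the class distribution of D."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attr_index, labels):
    """Gain(A) = Info(D) - sum_j (|Dj|/|D|) * Info(Dj), splitting D on attribute A."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    info_a = sum((len(part) / total) * entropy(part) for part in partitions.values())
    return entropy(labels) - info_a

# Hypothetical toy data: one categorical attribute (an age bucket), class = buys_computer
rows = [("youth",), ("youth",), ("middle_aged",), ("senior",), ("senior",), ("middle_aged",)]
labels = ["no", "no", "yes", "yes", "no", "yes"]
print(information_gain(rows, 0, labels))   # about 0.667 bits for this toy split
```

Gain ratio divides this gain by the split information of the attribute, and the Gini index replaces entropy with 1 − Σ p_i² in the same style of computation.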
Bayes classification Methods
• What are Bayesian classifiers?
Bayesian classifiers are statistical classifiers.
They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
Bayesian classification is based on Bayes' theorem.
Studies comparing classification algorithms have found a simple Bayesian classifier, known as the naïve Bayesian classifier, to be comparable in performance with decision tree and selected neural network classifiers.
Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases.
Bayes classification Methods
Naïve Bayesian classifiers assume that the effect of an attribute value
on a given class is independent of the values of the other attributes.
This assumption is called class conditional independence.
Bayes Theorem
Naïve Bayesian classification
Bayesian belief networks
Training Bayesian belief networks.
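To make the class-conditional independence assumption concrete, here is a minimal naïve Bayesian classifier for categorical attributes. The toy tuples, attribute names and helper functions are illustrative assumptions only; no smoothing is applied, so an unseen attribute value yields a zero probability (which is exactly what the Laplacian correction, mentioned later, is meant to avoid).

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate P(Ci) and P(xk | Ci) by counting over the categorical training tuples."""
    class_counts = Counter(labels)
    cond_counts = defaultdict(Counter)            # (class, attribute index) -> value counts
    for row, c in zip(rows, labels):
        for k, value in enumerate(row):
            cond_counts[(c, k)][value] += 1
    return class_counts, cond_counts

def predict(row, class_counts, cond_counts):
    """Pick the class maximizing P(Ci) * prod_k P(xk | Ci) (class-conditional independence)."""
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / total
        for k, value in enumerate(row):
            score *= cond_counts[(c, k)][value] / n_c
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical toy tuples: (age bucket, income), class = buys_computer
rows = [("youth", "high"), ("youth", "low"), ("senior", "high"), ("senior", "low")]
labels = ["no", "no", "yes", "yes"]
counts, cond = train_naive_bayes(rows, labels)
print(predict(("senior", "high"), counts, cond))   # "yes"
```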
Bayesian belief networks
• Basic concepts and Mechanisms
• Training Bayesian Belief Networks
Compute the gradients
Take a small step in the direction of the gradient
Renormalize the weights
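The three steps above can be written out symbolically. The update below is a hedged reconstruction of the usual gradient-based training of the conditional probability table (CPT) entries w_ijk = P(Y_i = y_ij | Pa(Y_i) = pa_ik); the learning-rate symbol l and the exact notation are assumptions, not taken from the slides.

```latex
% (1) Compute the gradient of the log-likelihood with respect to each CPT entry
\frac{\partial \ln P_w(D)}{\partial w_{ijk}}
  \;=\; \sum_{d=1}^{|D|} \frac{P\!\left(Y_i = y_{ij},\; \mathrm{Pa}(Y_i) = pa_{ik} \mid X_d\right)}{w_{ijk}}

% (2) Take a small step of size l (the learning rate) in the direction of the gradient
w_{ijk} \;\leftarrow\; w_{ijk} + l \cdot \frac{\partial \ln P_w(D)}{\partial w_{ijk}}

% (3) Renormalize so that each conditional distribution again sums to one
w_{ijk} \;\leftarrow\; \frac{w_{ijk}}{\sum_{j'} w_{ij'k}}
```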
Bayesian belief networks
Classification by back propagation
• Multilayer feed-forward neural network
Classification by back propagation
• Defining network topology
• Backpropagation
Initialize the weights
Propagate the inputs forward
Backpropagate the error: updating the weights and biases after the presentation of each tuple is referred to as case updating
Alternatively, the weight and bias increments could be accumulated in variables, so that the weights and biases are updated after all the tuples in the training set have been presented. This is called epoch updating, where one iteration through the training set is an epoch
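The sketch below shows case updating for a tiny multilayer feed-forward network with one hidden layer of sigmoid units: the inputs are propagated forward, the error is backpropagated, and the weights and biases are adjusted immediately after each tuple. The network size, learning rate, toy data and random initialization are illustrative assumptions; switching to epoch updating would mean accumulating the increments and applying them once per pass.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_case_updating(data, n_in, n_hidden, lr=0.5, epochs=1000):
    """Backpropagation with case updating: weights/biases change after every training tuple."""
    random.seed(0)
    w_h = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b_h = [0.0] * n_hidden
    w_o = [random.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b_o = 0.0
    for _ in range(epochs):                          # one pass over the data = one epoch
        for x, target in data:
            # Propagate the inputs forward
            h = [sigmoid(sum(w * xi for w, xi in zip(w_h[j], x)) + b_h[j])
                 for j in range(n_hidden)]
            out = sigmoid(sum(w * hj for w, hj in zip(w_o, h)) + b_o)
            # Backpropagate the error
            err_o = out * (1 - out) * (target - out)
            err_h = [h[j] * (1 - h[j]) * err_o * w_o[j] for j in range(n_hidden)]
            # Case updating: adjust weights and biases immediately
            for j in range(n_hidden):
                w_o[j] += lr * err_o * h[j]
                for i in range(n_in):
                    w_h[j][i] += lr * err_h[j] * x[i]
                b_h[j] += lr * err_h[j]
            b_o += lr * err_o
    return w_h, b_h, w_o, b_o

# Purely illustrative XOR-style data: two inputs, one output unit
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
weights = train_case_updating(data, n_in=2, n_hidden=2)
```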
Classification by back propagation
Terminating conditions
• Inside the black box: backpropagation and interpretability
Laplacian correction: used to avoid computing probability values of zero
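A small numeric illustration (the counts are chosen to mirror the usual textbook-style example and are otherwise arbitrary): adding one to each count leaves the estimates almost unchanged while removing the zero that would otherwise wipe out the whole naïve Bayes product.

```python
# Suppose one class has 1000 training tuples and, for attribute income,
# 0 tuples have income = low, 990 have income = medium and 10 have income = high.
counts = {"low": 0, "medium": 990, "high": 10}

uncorrected = {v: n / 1000 for v, n in counts.items()}            # P(low | class) = 0.0

q = len(counts)                                                   # number of distinct values
corrected = {v: (n + 1) / (1000 + q) for v, n in counts.items()}  # P(low | class) ~ 0.001

print(uncorrected["low"], corrected["low"])
```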
Rule based classification
• Using IF-THEN rules for classification: Size ordering and Rule
ordering
• Rule extraction from a decision tree
• Rule induction using a sequential covering algorithm
Rule quality measures
Rule pruning
Using IF-THEN rules for classification
Rule extraction from a decision tree
• To extract rules from a decision tree, one rule is created for each path from the root to a leaf node.
• Each splitting criterion along a given path is logically ANDed to form the rule antecedent ("IF" part). The leaf node holds the class prediction, forming the rule consequent ("THEN" part).
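The path-to-rule idea can be sketched directly. The nested-tuple tree representation and the attribute/value names below are illustrative assumptions modelled on a small buys_computer-style tree, not an extract from the slides.

```python
# Hypothetical tree: an internal node is (attribute, {value: subtree}); a leaf is a class label.
tree = ("age", {
    "youth":       ("student", {"yes": "buys_computer = yes", "no": "buys_computer = no"}),
    "middle_aged": "buys_computer = yes",
    "senior":      ("credit_rating", {"fair": "buys_computer = yes",
                                      "excellent": "buys_computer = no"}),
})

def extract_rules(node, conditions=()):
    """One IF-THEN rule per root-to-leaf path; splitting criteria are ANDed into the antecedent."""
    if isinstance(node, str):                      # leaf: holds the class prediction (consequent)
        yield "IF " + " AND ".join(conditions) + " THEN " + node
        return
    attribute, branches = node
    for value, subtree in branches.items():
        yield from extract_rules(subtree, conditions + (f"{attribute} = {value}",))

for rule in extract_rules(tree):
    print(rule)
```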
Rule induction using a sequential covering
algorithm
Rule Quality Measure
• Learn_one_rule needs a measure of rule quality. Every time it
considers an attribute test, it must check to see if appending such a test
to the current rule’s condition will result in an improved rule.
• Rule pruning: conjuncts (attribute tests) are removed from a rule if this improves its estimated quality, assessed on a set of tuples independent of the training data
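One widely used measure of this kind is FOIL_Gain (from FOIL/RIPPER-style sequential covering); the formulas below are a hedged reconstruction rather than a quotation from the slides. Here pos/neg are the numbers of positive/negative tuples covered by the current rule R, and pos'/neg' the numbers covered by the candidate rule R' obtained by appending the new attribute test; a pruned rule is kept if its FOIL_Prune value, computed on the independent pruning set, increases.

```latex
\mathrm{FOIL\_Gain} \;=\; pos' \times
  \left( \log_2 \frac{pos'}{pos' + neg'} \;-\; \log_2 \frac{pos}{pos + neg} \right)

\mathrm{FOIL\_Prune}(R) \;=\; \frac{pos - neg}{pos + neg}
```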
Model Evaluation and Selection
Choosing one classifier over another
• Metrics for evaluating classifier performance
Positive tuples (tuples of the main class of interest)
Negative tuples (all other tuples)
• Holdout method and random subsampling
• Cross validation
• Bootstrap
• Model selection using statistical tests of significance
• Comparing classifiers based on cost-benefit and ROC curves
Metrics for evaluating classifier
performance
Model Evaluation and Selection
• Evaluation metrics: How can we measure accuracy? Other metrics to consider?
• Use validation test set of class-labeled tuples instead of training set when
assessing accuracy
• Methods for estimating a classifier’s accuracy:
• Holdout method, random subsampling
• Cross-validation
• Bootstrap
• Comparing classifiers:
• Confidence intervals
• Cost-benefit analysis and ROC Curves
Classifier Evaluation Metrics:
Confusion Matrix
Confusion matrix (general form, for a positive class C1):

Actual class \ Predicted class    C1                     ¬C1
C1                                True Positives (TP)    False Negatives (FN)
¬C1                               False Positives (FP)   True Negatives (TN)

Example of a confusion matrix:

Actual class \ Predicted class    buy_computer = yes    buy_computer = no    Total
buy_computer = yes                6954                  46                   7000
buy_computer = no                 412                   2588                 3000
Total                             7366                  2634                 10000

• Given m classes, an entry CM(i, j) in a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j
• May have extra rows/columns to provide totals
Classifier Evaluation Metrics: Accuracy, Error Rate,
Sensitivity and Specificity
• Classifier Accuracy, or recognition
rate: percentage of test set tuples
that are correctly classified
Accuracy = (TP + TN)/All
• Error rate: 1 – accuracy, or
Error rate = (FP + FN)/All
 Class Imbalance Problem:
 One class may be rare, e.g.
fraud, or HIV-positive
 Significant majority of the
negative class and minority of
the positive class
 Sensitivity: True Positive
recognition rate
 Sensitivity = TP/P
 Specificity: True Negative
recognition rate
 Specificity = TN/N
Actual class \ Predicted class    C     ¬C    Total
C                                 TP    FN    P
¬C                                FP    TN    N
Total                             P'    N'    All
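As a quick numeric check of these definitions against the buy_computer confusion matrix shown earlier (a minimal sketch; only the formulas above are used):

```python
TP, FN = 6954, 46        # actual buy_computer = yes tuples: P = 7000
FP, TN = 412, 2588       # actual buy_computer = no tuples:  N = 3000
P, N = TP + FN, FP + TN
ALL = P + N

accuracy    = (TP + TN) / ALL    # 0.9542
error_rate  = (FP + FN) / ALL    # 0.0458
sensitivity = TP / P             # true positive recognition rate, ~0.9934
specificity = TN / N             # true negative recognition rate, ~0.8627
print(accuracy, error_rate, sensitivity, specificity)
```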
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
• Precision: exactness – what % of tuples that the classifier
labeled as positive are actually positive
• Recall: completeness – what % of positive tuples did the
classifier label as positive?
• Perfect score is 1.0
• Inverse relationship between precision & recall
• F measure (F1 or F-score): harmonic mean of precision and recall
• Fβ: weighted measure of precision and recall
• assigns β times as much weight to recall as to precision
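The formula images for these measures did not survive extraction; the standard definitions, reconstructed here, are:

```latex
\mathrm{precision} = \frac{TP}{TP + FP}
\qquad
\mathrm{recall} = \frac{TP}{TP + FN}

F = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
\qquad
F_{\beta} = \frac{(1 + \beta^{2}) \times \mathrm{precision} \times \mathrm{recall}}
                 {\beta^{2} \times \mathrm{precision} + \mathrm{recall}}
```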
Classifier Evaluation Metrics:
Example
Actual class \ Predicted class    cancer = yes    cancer = no    Total    Recognition (%)
cancer = yes                      90              210            300      30.00 (sensitivity)
cancer = no                       140             9560           9700     98.56 (specificity)
Total                             230             9770           10000    96.40 (accuracy)

Precision = 90/230 = 39.13%        Recall = 90/300 = 30.00%
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
• Holdout method
• Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
• Random subsampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the accuracies obtained
• Cross-validation (k-fold, where k = 10 is most popular)
• Randomly partition the data into k mutually exclusive subsets D1, …, Dk, each of approximately equal size
• At the i-th iteration, use Di as the test set and the remaining subsets together as the training set
• Leave-one-out: k folds where k = # of tuples, for small sized
data
• *Stratified cross-validation*: folds are stratified so that class
dist. in each fold is approx. the same as that in the initial data
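A minimal sketch of stratified 10-fold cross-validation, assuming scikit-learn is available; the bundled breast-cancer dataset and the decision-tree classifier are placeholders for whatever data and model are being evaluated.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# Stratified 10-fold CV: each fold keeps roughly the class distribution of the full data.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)   # one accuracy value per held-out fold
print(scores.mean())                         # overall accuracy estimate
```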
Evaluating Classifier Accuracy:
Bootstrap
• Bootstrap
• Works well with small data sets
• Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected
again and re-added to the training set
• Several bootstrap methods exist; a common one is the .632 bootstrap
• A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data end up in the bootstrap sample, and the remaining 36.8% form the test set (since (1 − 1/d)^d ≈ e^−1 = 0.368)
• Repeat the sampling procedure k times, overall accuracy of the model:
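The accuracy formula referred to here was lost in extraction; a hedged reconstruction of the usual .632 bootstrap estimate, averaged over the k bootstrap repetitions, is:

```latex
Acc(M) \;=\; \frac{1}{k} \sum_{i=1}^{k}
  \left( 0.632 \times Acc(M_i)_{\mathrm{test\ set}}
       + 0.368 \times Acc(M_i)_{\mathrm{train\ set}} \right)
```

where Acc(Mi)_test set is the accuracy of the model built on bootstrap sample i when applied to its test set, and Acc(Mi)_train set is its accuracy on the original set of data tuples.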
Estimating Confidence Intervals:
Classifier Models M1 vs. M2
• Suppose we have 2 classifiers, M1 and M2, which one is better?
• Use 10-fold cross-validation to obtain the mean error rates of M1 and M2
• These mean error rates are just estimates of error on the true population of
future data cases
• What if the difference between the 2 error rates is just attributed to chance?
• Use a test of statistical significance
• Obtain confidence limits for our error estimates
Estimating Confidence Intervals:
Null Hypothesis
• Perform 10-fold cross-validation
• Assume samples follow a t distribution with k–1 degrees of freedom (here, k=10)
• Use t-test (or Student’s t-test)
• Null Hypothesis: M1 & M2 are the same
• If we can reject null hypothesis, then
• we conclude that the difference between M1 & M2 is statistically significant
• Choose the model with the lower error rate
Estimating Confidence Intervals: t-test
• If only 1 test set available: pairwise comparison
• For the i-th round of 10-fold cross-validation, the same cross partitioning is used to obtain err(M1)_i and err(M2)_i
• Average over 10 rounds to get the mean error rate of each model
• The t-test computes the t-statistic with k−1 degrees of freedom
• If two test sets are available: use the non-paired t-test, where k1 & k2 are the # of cross-validation samples used for M1 & M2, respectively (both forms are sketched below)
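The t-statistic formulas referenced in these bullets were lost in extraction; the following are hedged reconstructions of the standard paired and non-paired forms (with k = 10 rounds here).

```latex
% Paired t-test: the same cross-validation partitioning is used for both models
t = \frac{\overline{err}(M_1) - \overline{err}(M_2)}{\sqrt{\mathrm{var}(M_1 - M_2)\,/\,k}}
\quad\text{with}\quad
\mathrm{var}(M_1 - M_2) = \frac{1}{k} \sum_{i=1}^{k}
  \Bigl[ err(M_1)_i - err(M_2)_i
        - \bigl( \overline{err}(M_1) - \overline{err}(M_2) \bigr) \Bigr]^{2}

% Non-paired t-test: two independent test sets with k_1 and k_2 samples
t = \frac{\overline{err}(M_1) - \overline{err}(M_2)}
         {\sqrt{\mathrm{var}(M_1)/k_1 + \mathrm{var}(M_2)/k_2}}
```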
Estimating Confidence Intervals:
Table for t-distribution
• Symmetric
• Significance level,
e.g., sig = 0.05 or 5%
means M1 & M2 are
significantly different
for 95% of
population
• Confidence limit, z =
sig/2
Estimating Confidence Intervals:
Statistical Significance
• Are M1 & M2 significantly different?
• Compute t. Select significance level (e.g. sig = 5%)
• Consult table for t-distribution: Find t value corresponding to k-1 degrees of
freedom (here, 9)
• t-distribution is symmetric: typically upper % points of distribution shown →
look up value for confidence limit z=sig/2 (here, 0.025)
• If t > z or t < -z, then t value lies in rejection region:
• Reject null hypothesis that mean error rates of M1 & M2 are same
• Conclude: statistically significant difference between M1 & M2
• Otherwise, conclude that any difference is chance
Model Selection: ROC Curves
• ROC (Receiver Operating
Characteristics) curves: for visual
comparison of classification models
• Originated from signal detection theory
• Shows the trade-off between the true
positive rate and the false positive rate
• The area under the ROC curve is a
measure of the accuracy of the model
• Rank the test tuples in decreasing
order: the one that is most likely to
belong to the positive class appears at
the top of the list
• The closer to the diagonal line (i.e., the
closer the area is to 0.5), the less
accurate is the model
 Vertical axis
represents the true
positive rate
 Horizontal axis rep.
the false positive rate
 The plot also shows a
diagonal line
 A model with perfect
accuracy will have an
area of 1.0
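The ranking procedure can be sketched directly: sort the test tuples by decreasing classifier score and sweep a threshold down the list, emitting one (FPR, TPR) point per tuple. The scores and labels below are invented for illustration.

```python
def roc_points(scores, labels):
    """Rank tuples by decreasing score; each prefix of the ranking gives one ROC point."""
    P = sum(labels)                    # number of positive tuples
    N = len(labels) - P               # number of negative tuples
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))   # (false positive rate, true positive rate)
    return points                          # the area under these points estimates the AUC

# Hypothetical probabilities of the positive class, with the true labels
scores = [0.90, 0.80, 0.70, 0.55, 0.45, 0.40, 0.30, 0.20]
labels = [1,    1,    0,    1,    0,    1,    0,    0]
print(roc_points(scores, labels))
```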
Issues Affecting Model Selection
• Accuracy
• classifier accuracy: predicting class label
• Speed
• time to construct the model (training time)
• time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
• understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
Techniques to Improve Classification
Accuracy
• Introducing Ensemble Methods
• Bagging
• Boosting and AdaBoost
• Random forest
• Improving classification accuracy of class Imbalanced data
Ensemble Methods: Increasing the
Accuracy
• Ensemble methods
• Use a combination of models to increase accuracy
• Combine a series of k learned models, M1, M2, …, Mk, with
the aim of creating an improved model M*
• Popular ensemble methods
• Bagging: averaging the prediction over a collection of
classifiers
• Boosting: weighted vote with a collection of classifiers
• Ensemble: combining a set of heterogeneous classifiers
Bagging: Bootstrap Aggregation
• Analogy: Diagnosis based on multiple doctors’ majority vote
• Training
• Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
• A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
• Each classifier Mi returns its class prediction
• The bagged classifier M* counts the votes and assigns the class with the
most votes to X
• Prediction: can be applied to the prediction of continuous values by taking the
average value of each prediction for a given test tuple
• Accuracy
• Often significantly better than a single classifier derived from D
• For noisy data: not considerably worse, more robust
• Proved to give improved accuracy for prediction
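A minimal sketch of the training/voting loop described above; the base learner (a 1-nearest-neighbour on one numeric attribute) and the toy tuples are illustrative assumptions.

```python
import random
from collections import Counter

def bagging(train_fn, D, k, rng=random):
    """Learn k models M1..Mk, each from a bootstrap sample Di of the same size as D."""
    models = []
    for _ in range(k):
        Di = [rng.choice(D) for _ in range(len(D))]   # sample d tuples with replacement
        models.append(train_fn(Di))
    return models

def bagged_predict(models, x):
    """Each classifier Mi returns its prediction; the class with the most votes wins."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

def train_1nn(Di):
    """Hypothetical base learner: 1-nearest-neighbour on a single numeric attribute."""
    return lambda x: min(Di, key=lambda tup: abs(tup[0] - x))[1]

D = [(1.0, "no"), (2.0, "no"), (3.0, "yes"), (4.0, "yes"), (5.0, "yes")]
models = bagging(train_1nn, D, k=25)
print(bagged_predict(models, 4.5))    # "yes"
```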
Boosting
• Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
• How boosting works?
• Weights are assigned to each training tuple
• A series of k classifiers is iteratively learned
• After a classifier Mi is learned, the weights are updated to
allow the subsequent classifier, Mi+1, to pay more attention to
the training tuples that were misclassified by Mi
• The final M* combines the votes of each individual classifier,
where the weight of each classifier's vote is a function of its
accuracy
• Boosting algorithm can be extended for numeric prediction
• Compared with bagging: boosting tends to have greater accuracy, but it also risks overfitting the model to misclassified data
Adaboost (Freund and Schapire,
1997)
• Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
• Initially, all the weights of tuples are set the same (1/d)
• Generate k classifiers in k rounds. At round i,
• Tuples from D are sampled (with replacement) to form a training set Di
of the same size
• Each tuple’s chance of being selected is based on its weight
• A classification model Mi is derived from Di
• Its error rate is calculated using Di as a test set
• If a tuple is misclassified, its weight is increased, o.w. it is decreased
• Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier Mi
error rate is the sum of the weights of the misclassified tuples:
error(Mi) = Σj wj × err(Xj)
• The weight of classifier Mi's vote is log((1 − error(Mi)) / error(Mi))
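The weight bookkeeping of one AdaBoost round can be sketched as follows; the base learner, toy data and handling of degenerate error rates are simplifying assumptions (the full algorithm would re-draw the sample when error(Mi) exceeds 0.5).

```python
import math
import random
from collections import Counter

def adaboost_round(D, weights, train_fn, rng=random):
    """One round: sample Di by tuple weight, learn Mi, compute error(Mi), reweight the tuples."""
    d = len(D)
    idx = rng.choices(range(d), weights=weights, k=d)        # sample Di with replacement
    model = train_fn([D[i] for i in idx])

    miss = [model(x) != y for x, y in D]
    error = sum(w for w, m in zip(weights, miss) if m)       # sum of misclassified weights
    if error == 0 or error >= 0.5:
        return model, None, weights                          # simplified: round would be redone

    # Correctly classified tuples get weight * error/(1 - error); then normalize all weights.
    new_w = [w * (error / (1 - error)) if not m else w for w, m in zip(weights, miss)]
    total = sum(new_w)
    new_w = [w / total for w in new_w]

    vote = math.log((1 - error) / error)                     # weight of classifier Mi's vote
    return model, vote, new_w

# Hypothetical usage with a trivial majority-class base learner
D = [("a", "yes"), ("b", "yes"), ("c", "yes"), ("d", "no")]
def train_majority(Di):
    label = Counter(y for _, y in Di).most_common(1)[0][0]
    return lambda x: label
model, vote, new_weights = adaboost_round(D, [0.25] * 4, train_majority)
```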
Random Forest (Breiman 2001)
• Random Forest:
• Each classifier in the ensemble is a decision tree classifier and is
generated using a random selection of attributes at each node to determine
the split
• During classification, each tree votes and the most popular class is
returned
• Two Methods to construct Random Forest:
• Forest-RI (random input selection): Randomly select, at each node, F
attributes as candidates for the split at the node. The CART methodology
is used to grow the trees to maximum size
• Forest-RC (random linear combinations): Creates new attributes (or
features) that are a linear combination of the existing attributes (reduces
the correlation between individual classifiers)
• Comparable in accuracy to Adaboost, but more robust to errors and outliers
• Insensitive to the number of attributes selected for consideration at each split,
and faster than bagging or boosting
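A minimal Forest-RI-style sketch assuming scikit-learn is available; the dataset is a placeholder and the parameter names below are scikit-learn's, not the slides'.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(
    n_estimators=100,      # number of trees voting in the ensemble
    max_features="sqrt",   # F = sqrt(#attributes) random candidates per split (Forest-RI idea)
    bootstrap=True,        # each tree is grown from a bootstrap sample of D
    random_state=0,
)
print(cross_val_score(forest, X, y, cv=10).mean())
```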
Classification of Class-Imbalanced Data Sets
• Class-imbalance problem: Rare positive example but numerous
negative ones, e.g., medical diagnosis, fraud, oil-spill, fault, etc.
• Traditional methods assume a balanced distribution of classes and
equal error costs: not suitable for class-imbalanced data
• Typical methods for imbalanced data in 2-class classification:
• Oversampling: re-sampling of data from positive class
• Under-sampling: randomly eliminate tuples from negative
class
• Threshold-moving: moves the decision threshold, t, so that the
rare class tuples are easier to classify, and hence, less chance of
costly false negative errors
• Ensemble techniques: Ensemble multiple classifiers introduced
above
• Still difficult for class imbalance problem on multiclass tasks
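A minimal sketch of random oversampling of the rare class in a 2-class problem; the toy tuples are invented, and in practice one would oversample only within the training folds.

```python
import random
from collections import Counter

def random_oversample(D, labels, rng=random):
    """Re-sample the rare class with replacement until both classes are equally frequent."""
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    need = max(counts.values()) - counts[minority]
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    extra = [rng.choice(minority_idx) for _ in range(need)]
    return list(D) + [D[i] for i in extra], list(labels) + [minority] * len(extra)

# Hypothetical imbalanced data: 6 negative tuples, 2 positive (rare) tuples
D = [(0,), (1,), (2,), (3,), (4,), (5,), (6,), (7,)]
labels = ["neg"] * 6 + ["pos"] * 2
D2, labels2 = random_oversample(D, labels)
print(Counter(labels2))   # balanced: 6 "neg" and 6 "pos"
```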