This document discusses decision trees for data classification. It defines a decision tree as a tree where internal nodes represent attributes, branches represent attribute values, and leaf nodes represent class predictions. It describes the basic decision tree algorithm which builds the tree by recursively splitting the training data on attributes and stopping when data is pure or stopping criteria are met. Finally, it notes advantages like interpretability but also disadvantages like potential overfitting and issues with non-numeric data.
2. Contents
• Introduction
• Decision Tree
• Decision Tree Algorithm
• Decision Tree Based Algorithm
• Algorithm
• Decision Tree Advantages and Disadvantages
3. Introduction
• Classification is one of the most familiar and
popular data mining techniques.
• Classification applications include image and
pattern recognition, loan approval, and fault
detection in industrial applications.
• All approaches to performing classification
assume some knowledge of the data.
• A training set is used to develop the specific
parameters required by the technique.
4. Decision Tree
• Decision Tree (DT):
▫ Tree where the root and each internal node are
labeled with a question.
▫ The arcs represent each possible answer to the
associated question.
▫ Each leaf node represents a prediction of a solution
to the problem.
• Popular technique for classification; each leaf node
indicates the class to which the corresponding tuple
belongs.
6. Decision Tree
• A Decision Tree Model is a computational
model consisting of three parts:
▫ Decision Tree
▫ Algorithm to create the tree
▫ Algorithm that applies the tree to data
• Creation of the tree is the most difficult part.
• Processing is basically a search similar to that
in a binary search tree (although DT may not
be binary).
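The search described above can be sketched in a few lines of Python. This is only an illustration: the `Node` class, the `classify` function, and the loan-style attributes (`income`, `has_collateral`) are hypothetical names, not part of the original slides. The root has three arcs, which shows that a DT need not be binary.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """A node of an illustrative decision tree (all names are assumptions)."""
    question: Optional[str] = None      # attribute asked about at an internal node
    children: Dict[str, "Node"] = field(default_factory=dict)  # answer -> subtree
    prediction: Optional[str] = None    # class label, set only on leaf nodes


def classify(node: Node, record: Dict[str, str]) -> str:
    """Walk from the root to a leaf, following the arc that matches each answer."""
    while node.prediction is None:
        answer = record[node.question]
        node = node.children[answer]
    return node.prediction


# A made-up tree: the root has three arcs, so the tree is not binary.
tree = Node(
    question="income",
    children={
        "low": Node(prediction="reject"),
        "medium": Node(
            question="has_collateral",
            children={
                "yes": Node(prediction="approve"),
                "no": Node(prediction="reject"),
            },
        ),
        "high": Node(prediction="approve"),
    },
)

result = classify(tree, {"income": "medium", "has_collateral": "yes"})
print(result)
```

Each step of the loop descends one level, so the cost of classifying a tuple is bounded by the depth of the tree, much like a lookup in a search tree.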
8. Algorithm Definition
• The decision tree approach is most useful in
classification problems. With this technique, a
tree is constructed to model the classification
process.
• Once the tree is built, it is applied to each tuple
in the database and results in a classification for
that tuple.
• There are two basic steps in this technique:
building the tree and applying the tree to the
database.
9. • The decision tree approach to classification is to
divide the search space into rectangular regions.
A tuple is classified based on the region into
which it falls.
• Definition: Given a database D = {t1, ..., tn}
where ti = <ti1, ..., tih>, a database schema
consisting of the attributes {A1, A2, ..., Ah},
and a set of classes C = {C1, ..., Cm}, a decision
tree (DT), or classification tree, is a tree associated
with D that has the following properties:
▫ Each internal node is labeled with an attribute Ai
▫ Each arc is labeled with a predicate that can be
applied to the attribute associated with the parent node.
▫ Each leaf node is labeled with a class Cj.
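The definition above can be mirrored directly in code. The sketch below is an assumption-laden illustration: the nested-tuple representation, the attribute names `height` and `weight`, the thresholds, and the class names are all invented for the example. An internal node is a pair (attribute, arcs), each arc carries a predicate on the parent's attribute, and a leaf is a class label. Because every predicate compares a single attribute with a constant, each leaf corresponds to an axis-aligned rectangular region of the search space.

```python
def classify(tree, record):
    """Follow the arc whose predicate holds until a leaf (a class label) is reached."""
    while not isinstance(tree, str):
        attribute, arcs = tree
        value = record[attribute]
        for predicate, subtree in arcs:
            if predicate(value):
                tree = subtree
                break
        else:
            raise ValueError(f"no arc predicate matches {attribute}={value!r}")
    return tree


# Hypothetical tree over two numeric attributes: height (m) and weight (kg).
tree = (
    "height",
    [
        (lambda h: h < 1.6, "short"),
        (lambda h: h >= 1.6, (
            "weight",
            [
                (lambda w: w < 70, "tall_light"),
                (lambda w: w >= 70, "tall_heavy"),
            ],
        )),
    ],
)

# The three leaves carve the (height, weight) plane into three rectangles:
#   height < 1.6                  -> short
#   height >= 1.6 and weight < 70 -> tall_light
#   height >= 1.6 and weight >= 70 -> tall_heavy
print(classify(tree, {"height": 1.8, "weight": 65}))
```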
10. Algorithm
• Input:
D // Training data
• Output:
T //Decision tree
• DTBuild algorithm
// Simplistic algorithm to illustrate a naive
// approach to building a DT
11. • T = ∅;
Determine best splitting criterion;
T = Create root node and label with splitting attribute;
T = Add arc to root node for each split predicate and label;
for each arc do
    D' = Database created by applying the arc's splitting predicate to D;
    if stopping point reached for this path then
        T' = Create leaf node and label with appropriate class;
    else
        T' = DTBuild(D');
    T = Add T' to arc;
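A runnable sketch of the recursive build step might look like the following. The slides leave both the splitting criterion and the stopping point open, so this example fills them with assumptions: attributes are categorical, the best split is the one with the lowest weighted entropy, and recursion stops when a subset is pure or no attributes remain (the leaf then takes the majority class). The function names and the toy weather data are hypothetical.

```python
import math
from collections import Counter


def entropy(rows):
    """Shannon entropy of the class labels in rows (each row is (attributes, label))."""
    counts = Counter(label for _, label in rows)
    total = len(rows)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def partition(rows, attribute):
    """Group rows by their value for attribute: one group per arc."""
    groups = {}
    for attrs, label in rows:
        groups.setdefault(attrs[attribute], []).append((attrs, label))
    return groups


def dt_build(rows, attributes):
    """Return a class label (leaf) or a pair (attribute, {value: subtree})."""
    labels = [label for _, label in rows]
    # Stopping point (assumed): the subset is pure or no attributes are left.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]

    # Best splitting criterion (assumed): lowest weighted entropy after the split.
    def split_cost(attribute):
        return sum(len(g) / len(rows) * entropy(g)
                   for g in partition(rows, attribute).values())

    best = min(attributes, key=split_cost)
    remaining = [a for a in attributes if a != best]
    # One arc per observed value; each arc leads to a subtree built on its subset.
    return best, {value: dt_build(subset, remaining)
                  for value, subset in partition(rows, best).items()}


# Made-up training data: (attribute values, class label).
training = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "play"),
    ({"outlook": "rainy", "windy": "no"}, "stay"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
    ({"outlook": "overcast", "windy": "no"}, "play"),
    ({"outlook": "overcast", "windy": "yes"}, "stay"),
]

tree = dt_build(training, ["outlook", "windy"])
print(tree)
```

Note that each recursive call receives only the subset of rows that satisfies its arc, while the parent's data stays untouched for the remaining arcs.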
12. DT Advantages/Disadvantages
• Advantages:
▫ Easy to understand.
▫ Easy to generate rules.
• Disadvantages:
▫ May suffer from overfitting.
▫ Classifies by rectangular partitioning.
▫ Does not easily handle nonnumeric data.
▫ Can be quite large – pruning is necessary.
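The "easy to generate rules" advantage can be illustrated with a short sketch: every root-to-leaf path reads as one IF-THEN rule. The nested (attribute, {value: subtree}) representation, the `extract_rules` function, and the weather-style tree are assumptions made for this example only.

```python
def extract_rules(tree, conditions=()):
    """Yield (conditions, class_label) pairs, one per root-to-leaf path."""
    if isinstance(tree, str):  # leaf: emit the conditions collected so far
        yield list(conditions), tree
        return
    attribute, arcs = tree
    for value, subtree in arcs.items():
        yield from extract_rules(subtree, conditions + ((attribute, value),))


# Hypothetical tree: a leaf is a class label, an internal node is
# a pair (attribute, {attribute value: subtree}).
tree = ("outlook", {
    "sunny": "play",
    "rainy": "stay",
    "overcast": ("windy", {"no": "play", "yes": "stay"}),
})

rules = list(extract_rules(tree))
for conds, label in rules:
    lhs = " AND ".join(f"{a} = {v}" for a, v in conds) or "TRUE"
    print(f"IF {lhs} THEN class = {label}")
```

The number of rules equals the number of leaves, which is one reason a large, unpruned tree becomes hard to read.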