K nearest neighbor
 Classification is done by relating the unknown to the known according to some
distance/similarity function
 Stores all available cases and classifies new cases based on a similarity measure
 Known by several names
 Memory-based reasoning
 Example-based reasoning
 Instance-based reasoning
 Case-based reasoning
 Lazy learning
 kNN determines the decision boundary locally. For example, 1NN assigns each
document to the class of its closest neighbor
 kNN assigns each document to the majority class of its k closest neighbors,
where k is a parameter
 The rationale of kNN classification is the contiguity hypothesis: we expect a
test document to have the same label as the training documents located in the
local region surrounding it
 The Voronoi tessellation of a set of objects decomposes the space into Voronoi
cells, where each object's cell consists of all points that are closer to that
object than to any other object
 For 1NN it partitions the plane into convex polygons, each containing its
corresponding document
Example (test point "star", k = 3):
3NN estimates: P(circle class | star) = 1/3, P(X class | star) = 2/3,
P(diamond class | star) = 0
1NN estimate: P(circle class | star) = 1
So 3NN prefers the X class while 1NN prefers the circle class
 Advantages
 Non-parametric architecture
 Simple
 Powerful
 Requires no training time
 Disadvantages
 Memory intensive
 Classification/estimation is slow
 The distance is calculated using the Euclidean distance:
D = √((x₁ − x₂)² + (y₁ − y₂)²)
 Attribute values are first rescaled with min-max normalization so that no
single attribute dominates the distance:
Xs = (X − Min) / (Max − Min)
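Both formulas translate directly into Python; a minimal sketch (the function
names euclidean_distance and min_max_scale are illustrative, not from the slides):

import math

def euclidean_distance(p, q):
    # D = sqrt of the sum of squared coordinate differences;
    # the 2-D formula above generalized to any number of attributes
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def min_max_scale(x, x_min, x_max):
    # rescale a raw attribute value into [0, 1]
    return (x - x_min) / (x_max - x_min)

print(euclidean_distance((1, 2), (4, 6)))  # 5.0
print(min_max_scale(25.0, 0.0, 100.0))     # 0.25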
 If k=1, select the nearest neighbor
 If k>1 (see the sketch below)
 For classification, take the most frequent class among the k nearest neighbors
 For regression, average the targets of the k nearest neighbors
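A minimal sketch of both cases, reusing euclidean_distance from the sketch
above (the training-set format and function name are assumptions for illustration):

from collections import Counter

def knn_predict(train, query, k, task="classification"):
    # train is a list of (feature_vector, target) pairs
    neighbors = sorted(train, key=lambda pair: euclidean_distance(pair[0], query))[:k]
    targets = [target for _, target in neighbors]
    if task == "classification":
        return Counter(targets).most_common(1)[0][0]  # majority class
    return sum(targets) / k                           # average for regression

data = [((1.0, 1.0), "circle"), ((1.2, 0.9), "circle"), ((5.0, 5.0), "X")]
print(knn_predict(data, (1.1, 1.0), k=3))  # circle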
 An inductive learning task – use particular facts to reach more general
conclusions
 A predictive model based on a branching series of Boolean tests – each test is
less complex than a one-stage classifier
 It learns from class-labeled tuples
 Can be used as a visual aid to structure and solve sequential problems
 An internal (non-leaf) node denotes a test on an attribute, each branch
represents an outcome of the test, and each leaf node holds a class label
If we leave at 10 AM and there are no cars stalled on the road, what will our
commute time be?
[Decision tree: the root node "Leave At" branches on 8 AM, 9 AM and 10 AM.
8 AM leads directly to the leaf Long; 9 AM leads to an "Accident?" test
(Yes → Long, No → Medium); 10 AM leads to a "Stall?" test (Yes → Long, No → Short).]
 In this decision tree, we make a series of Boolean decisions and follow the
corresponding branch –
 Did we leave at 10 AM?
 Did the car stall on the road?
 Is there an accident on the road?
 By answering each of these questions as yes or no, we can come to a conclusion on
how long our commute might take
 We do not have to represent this tree graphically
 We can represent it as a set of rules; however, the rules may be harder to read
if hour == 8am
    commute time = long
else if hour == 9am
    if accident == yes
        commute time = long
    else
        commute time = medium
else if hour == 10am
    if stall == yes
        commute time = long
    else
        commute time = short
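The same rule set reads naturally as executable code; a minimal Python sketch
(the function name commute_time is illustrative):

def commute_time(hour, accident=False, stall=False):
    # one Boolean test per level of the tree
    if hour == "8am":
        return "long"
    if hour == "9am":
        return "long" if accident else "medium"
    if hour == "10am":
        return "long" if stall else "short"
    raise ValueError("hour not covered by the tree")

print(commute_time("10am", stall=False))  # short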
 The algorithm is called with three parameters – the data partition D, the
attribute list, and the attribute selection method.
 D is a set of tuples and their associated class labels
 The attribute list is a list of attributes describing the tuples
 The attribute selection method specifies a heuristic procedure for selecting the
attribute that best discriminates among the tuples
 The tree starts at node N. If all the tuples in D are of the same class, then
node N becomes a leaf and is labelled with that class
 Otherwise, the attribute selection method is used to determine the splitting
criterion (a skeleton of this recursion is sketched below).
 Node N is labelled with the splitting criterion, which serves as a test at the node.
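A skeleton of this recursive procedure in Python (a sketch under assumptions:
D is a list of (attribute_dict, class_label) pairs, and select_attribute stands
in for whatever attribute selection method is plugged in; this is not the
slides' exact pseudocode):

from collections import Counter

def build_tree(D, attributes, select_attribute):
    labels = [label for _, label in D]
    if len(set(labels)) == 1:              # all tuples in one class -> leaf N
        return ("leaf", labels[0])
    if not attributes:                     # nothing left to split on -> majority leaf
        return ("leaf", Counter(labels).most_common(1)[0][0])
    best = select_attribute(D, attributes)           # the attribute selection method
    children = {}
    for value in {row[best] for row, _ in D}:        # one branch per test outcome
        subset = [(row, label) for row, label in D if row[best] == value]
        children[value] = build_tree(subset,
                                     [a for a in attributes if a != best],
                                     select_attribute)
    return ("node", best, children)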
 The previous experience decision table showed 4 attributes – hour, weather,
accident and stall
 But the decision tree used only three attributes – hour, accident and stall
 So which attributes should be kept and which removed?
 Attribute selection methods show that weather is not a discriminating
attribute
 Occam's razor – given a number of competing hypotheses, the simplest one is
preferable
 We will focus on the ID3 algorithm
 Basic idea
 Choose the best attribute to split the remaining instances and make that attribute a
decision node
 Repeat this process recursively for each child
 Stop when
 All instances have the same target attribute value
 There are no more attributes
 There are no more instances
 ID3 splits on attributes based on their entropy.
 Entropy is a measure of uncertainty (disorder) in the data
 Entropy is minimized when all values of the target attribute are the same
 If we know that the commute time will be short, the entropy = 0
 Entropy is maximized when every value of the target attribute is equally likely
(i.e. the result is random)
 If commute time = short in 3 instances, medium in 3 instances and long in 3 instances,
entropy is maximized
 Calculation of entropy (computed in the sketch below)
 Entropy(S) = ∑(i=1 to l) −(|Si|/|S|) · log2(|Si|/|S|)
 S = the set of examples
 Si = the subset of S with value vi under the target attribute
 l = the number of distinct values of the target attribute
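The formula translates directly into Python (a minimal sketch; by convention
0 · log 0 is treated as 0, which the class counts handle automatically):

import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = sum over classes of -(|Si|/|S|) * log2(|Si|/|S|)
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["short"] * 9))                                  # 0.0, all values the same
print(entropy(["short"] * 3 + ["medium"] * 3 + ["long"] * 3))  # ~1.585, maximal for 3 values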
 If we break down the leaving time to the minute, we might get something like this
 The entropy is very low in each branch, but we get n branches with n leaves.
This would not be helpful for predictive modelling
 We use a technique called discretization: we choose cut points, such as 9 AM,
for splitting continuous attributes
[Figure: a timeline of per-minute leave times – 8:02 AM, 8:03 AM, 9:05 AM,
9:07 AM, 9:09 AM, 10:02 AM – each forming its own branch with a commute-time
label (Long / Medium / Short)]
 Consider the attribute commute time
 When we merge minute-level values into coarser bins we accept higher entropy
within each branch, but we avoid a decision tree with as many cut points as
leaves (the sketch below shows cut points at 9 AM and 10 AM)
8:00 (L), 8:02 (L), 8:07 (M), 9:00 (S), 9:20 (S), 9:25 (S), 10:00 (S), 10:02 (M)
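A minimal discretization sketch in Python, assuming cut points at 9 AM (from
the slide) and 10 AM (a hypothetical second cut chosen for illustration):

def discretize(hour, minute):
    # map a minute-level leave time onto coarse bins at the chosen cut points
    t = hour * 60 + minute
    if t < 9 * 60:
        return "8 AM bin"
    if t < 10 * 60:
        return "9 AM bin"
    return "10 AM bin"

print(discretize(8, 2))    # 8 AM bin
print(discretize(9, 20))   # 9 AM bin
print(discretize(10, 2))   # 10 AM bin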
 Binary decision trees
 Classification of an input vector is done by traversing the tree, beginning at the
root node and ending at a leaf (see the traversal sketch below)
 Each node of the tree computes an inequality
 Each leaf is assigned to a particular class
 Each split is based on a single input variable
 Each node draws a boundary that can be geometrically interpreted as a hyperplane
perpendicular to that variable's axis
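Traversal is a loop over axis-parallel tests; a sketch in Python (the node
layout – a (variable_index, threshold, left, right) tuple with string leaves –
is an assumption for illustration):

def classify(node, x):
    # each test x[i] < t is a hyperplane perpendicular to axis i
    while isinstance(node, tuple):
        i, t, left, right = node
        node = left if x[i] < t else right
    return node  # a leaf is simply its class label

tree = (0, 24.0, "class B", "class C")  # e.g. the BMI < 24 test from the figure
print(classify(tree, [21.5]))           # class B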
[Figure: a binary decision tree whose root tests BMI < 24; Yes/No branches lead
through further tests to the leaf classes B and C]
 Linear decision trees are similar to binary decision trees
 The inequality computed at each node takes a linear form that may depend on
multiple variables, e.g. aX1 + bX2
[Figure: a linear decision tree whose root tests the linear combination
aX1 + bX2, with Yes/No branches leading to further tests and leaves]
 Chi-squared Automatic Interaction Detector (CHAID)
 A non-binary decision tree
 The decision made at each node is based on a single variable, but can result in
multiple branches
 Continuous variables are grouped into a finite number of bins to create categories
 CHAID uses equal-population bins (see the binning sketch below)
 Classification and Regression Trees (CART) are binary decision trees which split on a
single variable at each node
 The CART algorithm goes through an exhaustive search of all variables and split values
to find the optimal splitting rule for each node.
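Equal-population binning can be sketched in a few lines of Python (a library
routine such as pandas.qcut does the same by quantiles; this standalone version
is an illustrative assumption):

def equal_population_bins(values, n_bins):
    # sort the values and slice them into n_bins groups of (nearly) equal size
    ordered = sorted(values)
    size = len(ordered) / n_bins
    return [ordered[round(i * size):round((i + 1) * size)] for i in range(n_bins)]

print(equal_population_bins([3, 1, 4, 1, 5, 9, 2, 6], 4))
# [[1, 1], [2, 3], [4, 5], [6, 9]]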
 There is another technique for reducing the number of attributes used in a tree –
pruning
 Two types of pruning
 Pre-pruning (forward pruning)
 Post-pruning (backward pruning)
 Pre-pruning
 We decide during the building process when to stop adding attributes (possibly based on
their information gain)
 However, this may be problematic – why?
 Sometimes attributes individually contribute little to a decision, but combined they may
have a significant impact.
 Post-pruning waits until the full decision tree has been built and then prunes
attributes.
 Two techniques:
 Subtree replacement
 Subtree raising
 Subtree replacement
[Figure: before replacement – a tree rooted at A with child B; B's subtree C has
leaves 1, 2 and 3, and B also has leaves 4 and 5]
 Node 6 replaces the subtree rooted at C
 This may increase accuracy (a sketch of the accuracy test follows below)
[Figure: after replacement – A with child B, whose children are now leaf 6 and
leaves 4 and 5]
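One common way to decide the replacement is a reduced-error test: replace a
subtree with a majority-class leaf whenever that does not hurt accuracy on
held-out data. A sketch under that assumption (not necessarily the slides'
exact procedure), reusing the classify sketch and its node format from the
binary-tree section above:

from collections import Counter

def replace_subtree_if_no_worse(subtree, validation):
    # validation: (x, label) pairs that reach this subtree in the full tree
    labels = [label for _, label in validation]
    leaf = Counter(labels).most_common(1)[0][0]  # candidate leaf, like node 6
    leaf_errors = sum(label != leaf for _, label in validation)
    tree_errors = sum(label != classify(subtree, x) for x, label in validation)
    return leaf if leaf_errors <= tree_errors else subtree  # keep whichever errs less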
 In subtree raising, an entire subtree is raised onto another node
[Figure: before raising – A with child B, whose subtree C has leaves 1, 2 and 3,
plus leaves 4 and 5; after raising – subtree C with leaves 1, 2 and 3 replaces B
directly under A]
 While a decision tree classifies quickly, the time taken to build the tree may
be higher than for any other type of classifier.
 Decision trees suffer from the problem of error propagation throughout the tree
 Since decision trees work by a series of local decisions, what happens if one of these
decisions is wrong?
 Every decision from that point on may be wrong
 We may never return to the correct path of the tree