2. Instance-based learning
• Instance-based learning is often termed lazy learning, as there is typically no
“transformation” of training instances into more general “statements”
• Instance-based learning methods simply store the training examples.
• Generalizing beyond these examples is postponed until a new instance must be classified.
• Each time a new query instance is encountered, its relationship to the previously stored
examples is examined in order to assign a target function value for the new instance.
•
• Instance-based methods are sometimes referred to as "lazy" learning methods because
they delay processing until a new instance must be classified.
3. Instance-based learning
• One disadvantage of instance-based approaches is that the cost of classifying new instances
can be high.
• This is due to the fact that nearly all computation takes place at classification time rather
than when the training examples are first encountered.
4. k-Nearest Neighbor
• Pros:
• High accuracy,
• insensitive to outliers,
• no assumptions about data
• Cons:
• Computationally expensive,
• requires a lot of memory
• Works with:
• Numeric values,
• nominal values
5. k-NEAREST Neighbor LEARNING
• This algorithm assumes all instances correspond to points in the n-dimensional space Rn.
• The nearest neighbors of an instance are defined in terms of the standard Euclidean
distance.
• more precisely, let an arbitrary instance x be described by the feature vector
• where ar(x) denotes the value of the rth attribute of instance x. Then the distance between two
instances xi and xj is defined to be d(xi, xj), where
6. Similarity and Dissimilarity
• Similarity
• Numerical measure of how alike two data objects are.
• Is higher when objects are more alike.
• Often falls in the range [0,1]:
• Examples: Cosine, Jaccard, Tanimoto,
• Dissimilarity
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity
Src: “Introduction to Data Mining” by Vipin Kumar et al
7. Distance Metric
• Distance d (p, q) between two points p and q is a dissimilarity measure if it
satisfies:
1. Positive definiteness:
d (p, q) 0 for all p and q and
d (p, q) = 0 only if p = q.
2. Symmetry: d (p, q) = d (q, p) for all p and q.
3. Triangle Inequality:
d (p, r) d (p, q) + d (q, r) for all points p, q, and r.
• Examples:
• Euclidean distance
• Minkowski distance
• Mahalanobis distance
Src: “Introduction to Data Mining” by Vipin Kumar et al
8. General approach to kNN
1. Collect: Any method.
2. Prepare: Numeric values are needed for a distance calculation. A
structured data format is best.
3. Analyze: Any method.
4. Train: Does not apply to the kNN algorithm.
5. Test: Calculate the error rate.
6. Use: This application needs to get some input data and output structured
numeric values. Next, the application runs the kNN algorithm on this
input data and determines which class the input data should belong to.
The application then takes some action on the calculated class.
10. Instance-Based Learning
• Idea:
• Similar examples have similar label.
• Classify new examples like similar training examples.
• Algorithm:
• Given some new example x for which we need to predict its class y
• Find most similar training examples
• Classify x “like” these most similar examples
• Questions:
• How to determine similarity?
• How many similar training examples to consider?
• How to resolve inconsistencies among the training examples?
11. Summary: k-NN
• The simplest machine learning algorithm.
• No explicit training.
• A non-parametric method: no parameter to be optimised.
• The algorithm relies on a distance measure.
12
Think: what could be the possible effect
of the number of training samples used
in k-NN?
12. 1-Nearest Neighbor
• One of the simplest of all machine learning classifiers
• Simple idea: label a new point the same as the closest known point
Label it red.
13. Distance Metrics
• Different metrics can change the decision surface
• Standard Euclidean distance metric:
• Two-dimensional: Dist(a,b) = sqrt((a1 – b1)2 + (a2 – b2)2)
• Multivariate: Dist(a,b) = sqrt(∑ (ai – bi)2)
Dist(a,b) =(a1 – b1)2 + (a2 – b2)2 Dist(a,b) =(a1 – b1)2 + (3a2 – 3b2)2
Adapted from “Instance-Based Learning”
lecture slides by Andrew Moore, CMU.
14. k – Nearest Neighbor
• Generalizes 1-NN to smooth away noise in the labels
• A new point is now assigned the most frequent label of its k nearest
neighbors
Label it red, when k = 3
Label it blue, when k = 7
15. Selecting the Number of Neighbors
• Increase k:
• Makes KNN less sensitive to noise
• Decrease k:
• Allows capturing finer structure of space
• Pick k not too large, but not too small (depends
on data)
16. Neighbour Number k
• Hyper-parameter: neighbour number k.
• The process of determining the neighbour number k is called
hyper-parameter selection or model selection.
• In binary classification, what can happen when k is set as an
even number?
We will talk more on hyper-parameter
selection in the next lectures.
18
17. Neighbour Number k
• Hyper-parameter: neighbour number k.
• The process of determining the neighbour number k is called
hyper-parameter selection or model selection.
• In binary classification, what can happen when k is set as an
even number?
We will talk more on hyper-parameter
selection in the next lecture.
19
Think: what could be
the possible effect of
the neighbour
number?
18. Effect of Training Samples
• Small number of training samples:
• insufficient information
• Large number of training samples:
• more information
• but, time and memory cost consuming (distance calculation, sorting)
• Noisy training samples:
• overfitting
20
19. Curse-of-Dimensionality
• Prediction accuracy can quickly degrade when number of attributes
grows.
• Irrelevant attributes easily “swamp” information from relevant attributes
• When many irrelevant attributes, similarity/distance measure becomes less
reliable
• Remedy
• Try to remove irrelevant attributes in pre-processing step
• Weight attributes differently
• Increase k (but not too much)
20. k NEAREST NEIGHBOR
• Requires 3 things:
• Feature Space(Training Data)
• Distance metric
• to compute distance between
records
• The value of k
• the number of nearest
neighbors to retrieve from which
to get majority class
• To classify an unknown record:
• Compute distance to other training
records
• Identify k nearest neighbors
• Use class labels of nearest
neighbors to determine the
class label of unknown record
?
21. K-Nearest Neighbor
• Features
• All instances correspond to points in an n-dimensional Euclidean
space
• Classification is delayed till a new instance arrives
• Classification done by comparing feature vectors of the different
points
• Target function may be discrete or real-valued
22. Summary: Neighbour Number k
• Hyper-parameter: neighbour number k.
• In binary classification, k needs to be an uneven number.
• Small k: we may model noise!
• Large k: neighbours will include too many samples from other
classes.
24
23. Example
Consider the following database defining a concept
Ex Attribute Attribute Attribute Concept
No. Size Colour Shape Satisfied
1 medium blue brick yes
2 small red wedge no
3 small red sphere yes
4 large red wedge no
5 large green pillar yes
6 large red pillar no
7 large green pillar yes
24. k NEAREST NEIGHBOR
• Requires 3 things:
• Feature Space(Training Data)
• Distance metric
• to compute distance between
records
• The value of k
• the number of nearest
neighbors to retrieve from which
to get majority class
• To classify an unknown record:
• Compute distance to other training
records
• Identify k nearest neighbors
• Use class labels of nearest
neighbors to determine the
class label of unknown record
?
25. K-Nearest Neighbor
• Features
• All instances correspond to points in an n-dimensional Euclidean
space
• Classification is delayed till a new instance arrives
• Classification done by comparing feature vectors of the different
points
• Target function may be discrete or real-valued
28. Example
• We can now use the training set to classify an unknown case (Age=48 and
Loan=$142,000) using Euclidean distance. If K=1 then the nearest neighbor is
the last case in the training set with Default=Y.
29. • We can now use the training set to classify an unknown case (Age=48 and
Loan=$142,000) using Euclidean distance. If K=1 then the nearest neighbor is the last
case in the training set with Default=Y.
• With K=3, there are two Default=Y and one Default=N out of three closest neighbors.
The prediction for the unknown case is again Default=Y.
33. In Class Exercise
• KNN Solved Example to predict Sugar of Diabetic Patient
given BMI and Age
• Test Example BMI=43.6, Age=40, Sugar=?
34.
35. Hamming Distance
• Let's learn with the help of example;
• The restaurant A sells burger optional flavors; Pepper, Ginger and
Chilly
• Using Hamming Distance, show how the majority voting would
classify with respect to KNNs
• Query of test instance is
• Pepper: False, Ginger: True, Chilly: True
37. Hamming Distance
• The training examples contain three attributes, Pepper, Ginger,
and Chilly. Each of these attributes takes either True or False as
the attribute values. Liked is the target that takes either True or
False as the value.
• In the k-nearest neighbor’s algorithm, first, we calculate the
distance between the new example and the training examples.
using this distance we find k-nearest neighbors from the training
examples.
• To calculate the distance the attribute values must be real
numbers. But in our case, the dataset set contains the
categorical values. Hence we use hamming distance measure
to find the distance between the new example and training
examples.
38. Hamming Distance
• Let x1 and x2 be the attribute values of two instances.
• Then, in the hamming distance, if the categorical values are the
same or matching that is x1 is the same as x2 then the distance
is 0, otherwise 1.
• For example,
• If the value of x1 is blue and x2 is also blue then the distance
between x1 and x2 is 0.
• If the value of x1 is blue and x2 is red then the distance
between x1 and x2 is 1.
39.
40.
41. KNN Example
•New examples:
• Example 1 (great, no, no, normal, no)
• Example 2 (mediocre, yes, no, normal, no)
Food
(3)
Chat
(2)
Fast
(2)
Price
(3)
Bar
(2)
BigTip
1 great yes yes normal no yes
2 great no yes normal no yes
3 mediocre yes no high no no
4 great yes yes normal yes yes
most similar: number 2 (1 mismatch, 4 match) yes
Second most similar example: number 1 (2 mismatch, 3 match) yes
Similarity metric: Number of matching attributes (k=2)
Yes
Most similar: number 3 (1 mismatch, 4 match) no
Second most similar example: number 1 (2 mismatch, 3 match) yes
Yes/No
42. Classification Performance Measures
44
• Classification accuracy can be unreliable when assessing
unbalanced data.
• For example, if there are 95 samples from class A and only 5
from class B in the data set, a particular classifier might classify
all the observations as class A.
• Confusion matrix and other classification measures provide
more information on the classification results.
• A self-learning practice based on the next slide.
You will practise it in lab ex2.
43. Classification Performance Measures
45
• Confusion matrix is a table with two rows and
two columns that reports the number of false
positives (FP), false negatives (FN), true
positives (TP), and true negatives (TN). In
multi-class classification, a confusion matrix
can be computed for each class.
• Other classification measures:
• Sensitivity (recall):
• Specificity:
• Precision:
• F1-score:
confusion matrix for “cat” class
TP
the number of real positives
=
TP
TP+FN
TP
TP+FP
2
1
precision
+
1
recall
=
2precision´recall
precision+recall
overall classification accurcy =
TP+TN
total number of sample
=
TP+TN
TP+NP+TN+FN
TN
the number of real negatives
=
TN
TN+FP
Self-learning material:
https://en.wikipedia.org/wiki/Confusion_matrix
44. Summary
• In this chapter, we have learned the simplest machine learning
algorithm k-NN.
• non-parametric
• no training
• We have also studied the behavior of the k-NN algorithm, e.g.,
how/why its performance changes over varying numbers of
neighbors and training samples.
• In the next chapter, we will talk about the simplest parametric
model in machine learning: a linear model trained using the least
squares approach.
46
You are asked to use k-NN
classifier for face recognition
and observe its behaviour in
lab ex2.