K Nearest Neighbour Classifier
By: Neha Kulkarni (5201)
Pune Institute of Computer Technology, Pune
Contents
 Eager learners vs Lazy learners
 What is KNN?
 Discussion about categorical attributes
 Discussion about missing values
 How to choose k?
 KNN algorithm – choosing distance measure and k
 Solving an Example
 Weka Demonstration
 Advantages and Disadvantages of KNN
 Applications of KNN
 Comparison of various classifiers
 Conclusion
 References
Eager Learners vs Lazy Learners
 Eager learners, when given a set of training tuples, construct a generalization model before receiving new (e.g., test) tuples to classify.
 Lazy learners simply store the data (or do only a little minor processing) and wait until they are given a test tuple.
 Because lazy learners store the training tuples, or “instances,” they are also referred to as instance-based learners, even though all learning is essentially based on instances.
 Lazy learners spend less time in training but more in predicting. Two examples:
- k-Nearest Neighbor Classifier
- Case-Based Classifier
k-Nearest Neighbor Classifier
 History
• It was first described in the early 1950s.
• The method is labor intensive when given large
training sets.
• Gained popularity when increased computing
power became available.
• Used widely in the areas of pattern recognition and
statistical estimation.
What is k-NN??
 Nearest-neighbor classifiers are based on
learning by analogy, that is, by comparing a given
test tuple with training tuples that are similar to it.
 The training tuples are described by n attributes.
 When k = 1, the unknown tuple is assigned the
class of the training tuple that is closest to it in
pattern space.
When k=3 or k=5??
Remarks!!
 Similarity Function Based.
 Choose an odd value of k for a 2-class problem.
 k must not be a multiple of the number of classes.
Closeness
 The Euclidean distance between two points or tuples, say
X1 = (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n), is

d(X1, X2) = \sqrt{ \sum_{i=1}^{n} (x_{1i} - x_{2i})^2 }

 Min-max normalization can be used to transform a value v of a numeric attribute A to v' in the range [0, 1] by computing

v' = (v - min_A) / (max_A - min_A)
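For illustration (not part of the original slides), a minimal Python sketch of min-max normalization followed by a Euclidean distance computation; the four tuples reuse the worked example that appears later in the deck:

```python
import numpy as np

def min_max_normalize(X):
    """Scale each numeric attribute (column) of X into the [0, 1] range."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)

def euclidean(x1, x2):
    """Euclidean distance between two tuples."""
    return np.sqrt(np.sum((x1 - x2) ** 2))

# The four training tuples from the worked example later in the deck.
X = np.array([[7.0, 7.0], [7.0, 4.0], [3.0, 4.0], [1.0, 4.0]])
Xn = min_max_normalize(X)
print(euclidean(Xn[0], Xn[2]))  # distance between the first and third tuples
```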
What if attributes are categorical??
 How can distance be computed for attribute
such as colour?
-Simple Method: compare the corresponding attribute values;
the difference is 0 if they are identical and 1 otherwise.
-Other Method: differential grading, i.e., assigning a larger
difference to colour pairs that are less alike.
What about missing values ??
 If the value of a given attribute A is missing in
tuple X1 and/or in tuple X2, we assume the
maximum possible difference.
 For categorical attributes, we take the difference
value to be 1 if either one or both of the
corresponding values of A are missing.
 If A is numeric and missing from both tuples X1
and X2, then the difference is also taken to be 1.
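A hedged sketch of these rules (assuming numeric attributes have been min-max normalized to [0, 1], so the maximum possible difference is 1):

```python
def attribute_difference(v1, v2, categorical=False):
    """Per-attribute difference with the missing-value rules above.

    Assumes numeric attributes are min-max normalized to [0, 1].
    """
    if v1 is None and v2 is None:
        return 1.0                       # missing from both tuples
    if v1 is None or v2 is None:
        if categorical:
            return 1.0                   # one (or both) categorical values missing
        # Numeric with one value present: assume the maximum possible difference.
        v = v1 if v1 is not None else v2
        return max(v, 1.0 - v)
    if categorical:
        return 0.0 if v1 == v2 else 1.0
    return abs(v1 - v2)
```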
How to determine a good value for k?
 Starting with k = 1, we use a test set to estimate the error rate of the classifier; the process is repeated, incrementing k each time.
 The k value that gives the minimum error rate may be selected.
KNN Algorithm and Example
Distance Measures
Which distance measure to use?
We use the Euclidean distance, as it treats each feature as
equally important.

Euclidean distance: d(x, y) = \sqrt{ \sum_i (x_i - y_i)^2 }
Squared Euclidean distance: d(x, y) = \sum_i (x_i - y_i)^2
Manhattan distance: d(x, y) = \sum_i |x_i - y_i|
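A small NumPy sketch of the three measures (illustrative only; the example values are hypothetical):

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def squared_euclidean(x, y):
    # Monotone in the Euclidean distance, so it ranks neighbours
    # identically while avoiding the square root.
    return np.sum((x - y) ** 2)

def manhattan(x, y):
    return np.sum(np.abs(x - y))

x, y = np.array([3.0, 7.0]), np.array([7.0, 4.0])
print(euclidean(x, y), squared_euclidean(x, y), manhattan(x, y))  # 5.0 25.0 7.0
```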
How to choose K?
 If an infinite number of samples were available, the larger
k is, the better the classification.
 k = 1 is often used for efficiency, but it is sensitive to
“noise”.
 Larger k gives smoother boundaries, which is better for generalization,
but only if locality is preserved. Locality is not preserved if we end
up looking at samples too far away, not from the same class.
 An interesting rule of thumb for finding k on large datasets: k =
sqrt(n)/2, where n is the number of examples.
 k can also be chosen through cross-validation, as in the sketch below.
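A hedged sketch of choosing k by cross-validation (assuming scikit-learn is available; the slides use Weka instead):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Estimate accuracy for each candidate k with 5-fold cross-validation
# and keep the best one.
scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in range(1, 26, 2)          # odd k only, to reduce ties
}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```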
KNN Classifier Algorithm
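A minimal sketch of the classifier this slide describes (assuming a simple majority vote over Euclidean distances; a sketch, not the deck's exact pseudocode):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, query, k=3):
    """Classify `query` by a simple majority vote among its k nearest tuples."""
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))  # O(nd)
    nearest = np.argsort(distances)[:k]                        # indices of the k closest
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X = np.array([[7, 7], [7, 4], [3, 4], [1, 4]], dtype=float)
y = np.array(["Bad", "Bad", "Good", "Good"])
print(knn_classify(X, y, np.array([3.0, 7.0])))  # "Good"
```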
Example
 We have data from a questionnaire survey and objective testing, with two attributes (acid durability and strength), to classify whether a special paper tissue is good or not. Here are four training samples:

X1 = Acid Durability (seconds) | X2 = Strength (kg/square meter) | Y = Classification
7 | 7 | Bad
7 | 4 | Bad
3 | 4 | Good
1 | 4 | Good

Now the factory produces a new paper tissue that passes the
laboratory test with X1 = 3 and X2 = 7. Guess the classification of
this new tissue.
 Step 1 : Initialize and define k. Let's say k = 3.
(Always choose k as an odd number if the number of classes is
even, to avoid a tie in the class prediction.)
 Step 2 : Compute the distance between the input sample and each
training sample.
- The coordinates of the input sample are (3, 7).
- Instead of the Euclidean distance, we calculate the squared
Euclidean distance, which ranks the neighbours identically while
avoiding the square root.

X1 = Acid Durability (seconds) | X2 = Strength (kg/square meter) | Squared Euclidean distance
7 | 7 | (7-3)² + (7-7)² = 16
7 | 4 | (7-3)² + (4-7)² = 25
3 | 4 | (3-3)² + (4-7)² = 9
1 | 4 | (1-3)² + (4-7)² = 13
 Step 3 : Sort the distances and determine the nearest
neighbours based on the k-th minimum distance:

X1 = Acid Durability (seconds) | X2 = Strength (kg/square meter) | Squared Euclidean distance | Rank (minimum distance) | Included in 3-Nearest Neighbours?
7 | 7 | 16 | 3 | Yes
7 | 4 | 25 | 4 | No
3 | 4 | 9 | 1 | Yes
1 | 4 | 13 | 2 | Yes
 Step 4 : Take the 3 nearest neighbours and gather their category Y:

X1 = Acid Durability (seconds) | X2 = Strength (kg/square meter) | Squared Euclidean distance | Rank (minimum distance) | Included in 3-Nearest Neighbours? | Y = Category of the nearest neighbour
7 | 7 | 16 | 3 | Yes | Bad
7 | 4 | 25 | 4 | No | -
3 | 4 | 9 | 1 | Yes | Good
1 | 4 | 13 | 2 | Yes | Good
 Step 5 : Apply simple majority.
 Use the simple majority of the categories of the nearest
neighbours as the prediction for the query instance.
 We have two “Good” neighbours and one “Bad”. Thus we conclude that
the new paper tissue that passes the laboratory test with
X1 = 3 and X2 = 7 falls in the “Good” category.
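The whole worked example, reproduced as a short script (a sketch; the data and k = 3 come from the slides):

```python
from collections import Counter

train = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query, k = (3, 7), 3

# Step 2: squared Euclidean distance from the query to every training sample.
dists = [((x1 - query[0]) ** 2 + (x2 - query[1]) ** 2, label)
         for (x1, x2), label in train]

# Steps 3-4: sort by distance and keep the k nearest neighbours.
neighbours = sorted(dists)[:k]     # [(9, 'Good'), (13, 'Good'), (16, 'Bad')]

# Step 5: simple majority vote.
print(Counter(label for _, label in neighbours).most_common(1)[0][0])  # Good
```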
Iris Dataset Example using Weka
 Iris dataset contains 150 sample instances belonging
to 3 classes. 50 samples belong to each of these 3
classes.
 Statistical observations :
 Let's denote the true value of interest as θ (expected)
and the value estimated using some algorithm as θ̂ (observed).
 Kappa statistic : The kappa statistic measures the
agreement of the predictions with the true classes; 1.0
signifies complete agreement. It measures the significance
of the classification by comparing the observed agreement
with the agreement expected by chance.
 Mean absolute error : MAE = (1/n) \sum_{i=1}^{n} |θ̂_i − θ_i|
 Root mean squared error : RMSE = \sqrt{ (1/n) \sum_{i=1}^{n} (θ̂_i − θ_i)² }
 Relative absolute error : RAE = \sum_i |θ̂_i − θ_i| / \sum_i |θ̄ − θ_i|, where θ̄ is the mean of the true values
 Root relative squared error : RRSE = \sqrt{ \sum_i (θ̂_i − θ_i)² / \sum_i (θ̄ − θ_i)² }
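A sketch of these four measures in Python (a hypothetical helper, assuming NumPy arrays of true and estimated values):

```python
import numpy as np

def weka_style_errors(theta, theta_hat):
    """MAE, RMSE, RAE and RRSE for true values theta and estimates theta_hat."""
    abs_err = np.abs(theta_hat - theta)
    sq_err = (theta_hat - theta) ** 2
    mean = theta.mean()                      # the mean true value, theta-bar
    return {
        "MAE": abs_err.mean(),
        "RMSE": np.sqrt(sq_err.mean()),
        "RAE": abs_err.sum() / np.abs(theta - mean).sum(),
        "RRSE": np.sqrt(sq_err.sum() / ((theta - mean) ** 2).sum()),
    }

print(weka_style_errors(np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.9, 3.2])))
```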
Complexity
 The basic kNN algorithm stores all examples.
 Suppose we have n examples, each of dimension d:
 O(d) to compute the distance to one example
 O(nd) to compute distances to all examples
 Plus O(nk) time to find the k closest examples
 Total time: O(nk + nd)
 Very expensive for a large number of samples
 But we need a large number of samples for kNN
to work well!
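A sketch of the brute-force query (illustrative; note that np.argpartition selects the k closest in O(n), slightly better than the O(nk) bound above):

```python
import numpy as np

def k_nearest(X_train, query, k):
    """Brute-force kNN query: O(nd) distances plus selection of the k closest."""
    d2 = ((X_train - query) ** 2).sum(axis=1)  # squared distances, O(nd)
    idx = np.argpartition(d2, k)[:k]           # k smallest (unordered), O(n)
    return idx[np.argsort(d2[idx])]            # order only those k, O(k log k)

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))
print(k_nearest(X, rng.normal(size=8), k=5))
```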
 Advantages of the KNN classifier :
 Can be applied to data from any distribution; for
example, the data does not have to be separable with a
linear boundary
 Very simple and intuitive
 Good classification if the number of samples is large
enough
 Disadvantages of the KNN classifier :
 Choosing k may be tricky
 The test stage is computationally expensive
 No training stage; all the work is done during the test
stage
 This is actually the opposite of what we want: usually we
can afford a training step that takes a long time, but we want
the test stage to be fast
Applications of KNN Classifier
 Used in classification
 Used to estimate missing values
 Used in pattern recognition
 Used in gene expression analysis
 Used in protein-protein interaction prediction
 Used to predict the 3D structure of proteins
 Used to measure document similarity
Comparison of various classifiers
Algorithm: C4.5
Features:
- Models built can be easily interpreted
- Easy to implement
- Can use both discrete and continuous values
- Deals with noise
Limitations:
- Small variations in the data can lead to different decision trees
- Does not work very well on small training datasets
- Prone to over-fitting

Algorithm: ID3
Features:
- Produces higher accuracy than C4.5
- Detection rate is increased and space consumption is reduced
Limitations:
- Requires a large searching time
- May generate very long rules that are difficult to prune
- Requires a large amount of memory to store the tree

Algorithm: K-Nearest Neighbour
Features:
- Classes need not be linearly separable
- Zero cost of the learning process
- Sometimes robust with regard to noisy training data
- Well suited for multimodal classes
Limitations:
- The time to find the nearest neighbours in a large training dataset can be excessive
- Sensitive to noisy or irrelevant attributes
- Performance depends on the number of dimensions used

Algorithm: Naïve Bayes
Features:
- Simple to implement
- Great computational efficiency and classification rate
- Predicts accurate results for most classification and prediction problems
Limitations:
- The precision of the algorithm decreases if the amount of data is small
- Requires a very large number of records to obtain good results

Algorithm: Support Vector Machine
Features:
- High accuracy
- Works well even if the data is not linearly separable in the base feature space
Limitations:
- Speed and size requirements are high in both training and testing
- High complexity and extensive memory requirements for classification in many cases

Algorithm: Artificial Neural Networks
Features:
- Easy to use, with few parameters to adjust
- A neural network learns, so reprogramming is not needed
- Easy to implement
- Applicable to a wide range of real-life problems
Limitations:
- Requires high processing time if the network is large
- Difficult to know how many neurons and layers are necessary
- Learning can be slow
Conclusion
 KNN is what we call lazy learning (vs. eager
learning)
 Conceptually simple, easy to understand and
explain
 Very flexible decision boundaries
 Not much learning at all!
 It can be hard to find a good distance measure
 Irrelevant features and noise can be very
detrimental
 Typically cannot handle more than a few dozen
attributes
 Computational cost: classification requires a lot of computation
References
 “Data Mining: Concepts and Techniques”, J. Han, J. Pei, 2001
 “A Comparative Analysis of Classification Techniques on Categorical Data in Data Mining”, Sakshi and S. Khare, International Journal on Recent and Innovation Trends in Computing and Communication, Volume 3, Issue 8, ISSN: 2321-8169
 “Comparison of Various Classification Algorithms on Iris Datasets Using WEKA”, Kanu Patel et al., IJAERD, Volume 1, Issue 1, February 2014, ISSN: 2348-4470