DATA MINING 1
Instance-based Classifiers
Dino Pedreschi, Riccardo Guidotti
a.a. 2022/2023
Slides edited from Tan, Steinbach, Kumar, Introduction to Data Mining
Instance-based Classifiers
• Instead of performing explicit generalization, compare new instances
with instances seen in training, which have been stored in memory.
• Sometimes called memory-based learning.
• Advantages
• Adapts its model to previously unseen data by storing a new instance or
discarding an old one.
• Disadvantages
• Lazy learner: it does not build a model explicitly.
• Classifying unknown records is relatively expensive: in the worst case, given n
training items, the complexity of classifying a single instance is O(n).
Nearest-Neighbor Classifier (K-NN)
Basic idea: If it walks like a duck, quacks
like a duck, then it’s probably a duck.
Requires three things
1. Training set of stored records
2. Distance metric to compute
distance between records
3. The value of k, the number of
nearest neighbors to retrieve
Nearest-Neighbor Classifier (K-NN)
Given a set of training records (memory),
and a test record:
1. Compute the distance from each
record in the training set to the test record.
2. Identify the k “nearest” records.
3. Use the class labels of the nearest neighbors to
determine the class label of the unknown
record (e.g., by taking a majority vote).
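The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the course's reference implementation: the function names `euclidean` and `knn_classify` and the toy training data are assumptions made for the example.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(train, test_point, k):
    """train is a list of (features, label) pairs (hypothetical layout)."""
    # Step 1: distance from every stored record to the test record.
    distances = [(euclidean(x, test_point), label) for x, label in train]
    # Step 2: keep the k records with the smallest distances.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote over the labels of those k records.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training set (made up for illustration).
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.9), "B")]

print(knn_classify(train, (1.1, 1.0), k=3))  # prints "A"
```

Every call scans the whole training set, which is the O(n) cost per query mentioned earlier. Vote ties are broken here by whichever label `Counter` saw first, an arbitrary choice of this sketch.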
Definition of Nearest Neighbor
• The k-nearest neighbors of a record x are the data points that have the k
smallest distances to x.
Choosing the Value of K
• If k is too small, the classifier is
sensitive to noise points and can
overfit the noise in the
training set.
• If k is too large, the neighborhood
may include points from other
classes.
• Rule of thumb: k = sqrt(N), where
N is the number of samples in the
training dataset.
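A small one-dimensional example can make both failure modes concrete. The data, the helper `knn_label`, and the query point are invented for this sketch. One mislabeled "B" point sits inside the "A" cluster, and N = 9, so the sqrt(N) rule of thumb gives k = 3.

```python
import math
from collections import Counter

def knn_label(train, x, k):
    """Majority label among the k training points closest to x (1-D data)."""
    nearest = sorted(train, key=lambda rec: abs(rec[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical data: an "A" cluster near 1, a "B" cluster near 5,
# plus one noisy "B" point at 1.05 inside the "A" cluster.
train = [(0.9, "A"), (1.0, "A"), (1.1, "A"), (1.05, "B"),
         (5.0, "B"), (5.1, "B"), (5.2, "B"), (5.3, "B"), (5.4, "B")]

query = 1.04
k_rule = round(math.sqrt(len(train)))  # sqrt(9) = 3

print(knn_label(train, query, 1))           # "B": follows the single noisy neighbor
print(knn_label(train, query, k_rule))      # "A": two of the three neighbors are "A"
print(knn_label(train, query, len(train)))  # "B": the global majority class wins
```

With k = 1 the prediction copies the noisy point. With k = N the neighborhood covers the whole dataset and the query's position no longer matters.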
Nearest Neighbor Classification
Compute distance between two points:
• Euclidean distance: $d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$
Determine the class from nearest neighbors
• take the majority vote of class labels among the k nearest neighbors
• weigh the vote according to distance (e.g., weight factor $w = 1/d^2$)
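Distance-weighted voting can be sketched as follows, using the inverse squared distance as the weight. The function name `weighted_knn_classify`, the toy data, and the small `eps` term (added so that a neighbor at distance zero does not cause a division by zero) are assumptions of this example, not part of the slides.

```python
import math
from collections import defaultdict

def weighted_knn_classify(train, test_point, k, eps=1e-12):
    """Each of the k nearest neighbors votes with weight 1 / d^2."""
    scored = sorted(
        ((math.dist(x, test_point), label) for x, label in train),
        key=lambda pair: pair[0],
    )[:k]
    weights = defaultdict(float)
    for d, label in scored:
        # eps is an assumption of this sketch; it guards against d == 0.
        weights[label] += 1.0 / (d * d + eps)
    return max(weights, key=weights.get)

# Hypothetical data: one very close "A" and two more distant "B" records.
train = [((0.0, 0.0), "A"), ((1.0, 0.0), "B"), ((0.0, 1.0), "B")]

print(weighted_knn_classify(train, (0.1, 0.0), k=3))  # prints "A"
```

An unweighted majority vote over the same three neighbors would return "B" (two votes against one). With weights, the single close "A" record (weight about 100) outweighs the two "B" records (combined weight a little above 2).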
Dimensionality and Scaling Issues
• Problem with the Euclidean measure: high-dimensional data suffers from the
curse of dimensionality.
• Possible remedy: normalize the vectors to unit length
• Attributes may have to be scaled to prevent distance measures from
being dominated by one of the attributes.
• Example:
• height of a person may vary from 1.5m to 1.8m
• weight of a person may vary from 10kg to 200kg
• income of a person may vary from $10K to $1M
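One common way to put attributes on comparable scales is min-max scaling, which maps each attribute to the range [0, 1]. The sketch below applies it to the height, weight, and income ranges from the example. The helper `min_max_scale` and the three sample records are assumptions, and other schemes such as standardization would serve the same purpose.

```python
def min_max_scale(records):
    """Rescale every attribute (column) to [0, 1] using its min and max."""
    columns = list(zip(*records))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(rec, lows, highs))
        for rec in records
    ]

# Hypothetical records: (height in m, weight in kg, income in $).
people = [(1.5, 10.0, 10_000.0),
          (1.8, 200.0, 1_000_000.0),
          (1.65, 105.0, 505_000.0)]

for row in min_max_scale(people):
    print(row)
```

Before scaling, a Euclidean distance between two of these records is almost entirely the income difference. After scaling, each attribute contributes at most 1 to any coordinate difference.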
Parallel Exemplar-Based Learning System (PEBLS)
• PEBLS is a nearest-neighbor learning system (k=1) designed for
applications where the instances have symbolic feature values.
• Works with both continuous and nominal features.
• For nominal features, the distance between two nominal values is
computed using Modified Value Difference Metric (MVDM)
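• $d(V_1, V_2) = \sum_i \left| \frac{n_{1i}}{n_1} - \frac{n_{2i}}{n_2} \right|$
• where $n_1$ is the number of records with nominal attribute value $V_1$, and
$n_{1i}$ is the number of those records whose target label is class $i$
($n_2$ and $n_{2i}$ are defined likewise for $V_2$).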
Distance Between Nominal Attribute Values
• d(Status=Single, Status=Married) = | 2/4 – 0/4 | + | 2/4 – 4/4 | = 1
• d(Status=Single, Status=Divorced) = | 2/4 – 1/2 | + | 2/4 – 1/2 | = 0
• d(Status=Married, Status=Divorced) = | 0/4 – 1/2 | + | 4/4 – 1/2 | = 1
• d(Refund=Yes, Refund=No) = | 0/3 – 3/7 | + | 3/3 – 4/7 | = 6/7
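The four distances above can be reproduced with a short script. The per-class counts below are read off the fractions in the worked example (for instance, 2/4 and 2/4 for Status=Single). The class names "Yes" and "No" and the function name `mvdm` are placeholders chosen for this sketch. `Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction

def mvdm(counts_a, counts_b):
    """Modified Value Difference Metric between two nominal values.

    counts_a and counts_b map each class label to the number of records
    that have the value and belong to that class.
    """
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    classes = set(counts_a) | set(counts_b)
    return sum(
        abs(Fraction(counts_a.get(c, 0), n_a) - Fraction(counts_b.get(c, 0), n_b))
        for c in classes
    )

# Class counts implied by the fractions in the worked example
# (class labels "Yes"/"No" are placeholder names).
status = {
    "Single":   {"Yes": 2, "No": 2},
    "Married":  {"Yes": 0, "No": 4},
    "Divorced": {"Yes": 1, "No": 1},
}
refund = {"Yes": {"Yes": 0, "No": 3}, "No": {"Yes": 3, "No": 4}}

print(mvdm(status["Single"], status["Married"]))    # 1
print(mvdm(status["Single"], status["Divorced"]))   # 0
print(mvdm(status["Married"], status["Divorced"]))  # 1
print(mvdm(refund["Yes"], refund["No"]))            # 6/7
```

A distance of 0 between Single and Divorced means the two values have the same class distribution, so the metric treats them as interchangeable for prediction.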
Distance Between Records
• $\delta(X, Y) = w_X w_Y \sum_{i=1}^{d} d(X_i, Y_i)$
• Each record X is assigned a weight $w_X = \frac{N_{X_{predict}}}{N_{X_{predict}}^{correct}}$, which represents its reliability
• $N_{X_{predict}}$ is the number of times X is used for prediction
• $N_{X_{predict}}^{correct}$ is the number of times the prediction using X is correct
• If $w_X \cong 1$, X makes accurate predictions most of the time
• If $w_X > 1$, then X is not reliable for making predictions. A high $w_X$ results in a
high distance, which makes it less likely that X is used to make predictions.
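The record distance can be sketched as the product of the two record weights and the sum of the per-attribute value distances, which is how the formula reads on this slide. In the sketch below, the per-value distances are copied from the Status and Refund worked example. The usage counts behind the weights, the two records, and all function names are hypothetical.

```python
from fractions import Fraction

# MVDM distances between nominal values, taken from the worked example:
# index 0 is Status, index 1 is Refund.
VALUE_DISTANCE = [
    {frozenset(["Single", "Married"]): Fraction(1),
     frozenset(["Single", "Divorced"]): Fraction(0),
     frozenset(["Married", "Divorced"]): Fraction(1)},
    {frozenset(["Yes", "No"]): Fraction(6, 7)},
]

def value_distance(attr_index, a, b):
    """Distance between two values of the same nominal attribute."""
    if a == b:
        return Fraction(0)
    return VALUE_DISTANCE[attr_index][frozenset([a, b])]

def record_weight(n_predict, n_correct):
    """Times the record was used for prediction / times it was correct."""
    return Fraction(n_predict, n_correct)

def record_distance(x, y, w_x, w_y):
    """Weighted sum of per-attribute distances between records x and y."""
    return w_x * w_y * sum(
        value_distance(i, a, b) for i, (a, b) in enumerate(zip(x, y))
    )

# Hypothetical records and usage counts.
x, y = ("Single", "Yes"), ("Married", "No")
w_x = record_weight(10, 10)  # always correct  -> weight 1
w_y = record_weight(10, 5)   # correct half the time -> weight 2

print(record_distance(x, y, w_x, w_y))  # 26/7
```

The unreliable record doubles the distance (from 13/7 to 26/7), so it is less likely to be selected as the nearest neighbor. A record that has never predicted correctly would need special handling, since `n_correct` would be zero. This sketch does not address that case.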
Characteristics of Nearest Neighbor Classifiers
• Instance-based learner: makes predictions without maintaining an
abstraction, i.e., without building a model such as a decision tree.
• It is a lazy learner: classifying a test example can be expensive because it
needs to compute the proximity between the test example and every training example.
• In contrast, eager learners spend time building the model, but
classification is then fast.
• Predictions are based on local information, so for low k they are
susceptible to noise.
• Can produce wrong predictions if inappropriate distance functions and/or
preprocessing steps are used.
k-NN for Regression
Given a set of training records
(memory), and a test record:
1. Compute the distance from each
record in the training set to the test record.
2. Identify the k “nearest” records.
3. Use the target values of the nearest
neighbors to determine the value
of the unknown record (e.g., by
averaging the values).
[Figure: training records annotated with their numeric target values]
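For regression, the only change from classification is the last step: the majority vote is replaced by an average of the neighbors' target values. The sketch below assumes one-dimensional feature vectors and made-up coordinates. The function name `knn_regress` is likewise an assumption.

```python
import math

def knn_regress(train, test_point, k):
    """Predict the mean target value of the k nearest training records.

    train is a list of (features, target) pairs (hypothetical layout).
    """
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], test_point))[:k]
    return sum(target for _, target in nearest) / k

# Hypothetical training records: (features, target value).
train = [((1.0,), 3.4), ((2.0,), 3.9), ((3.0,), 4.1),
         ((8.0,), 9.7), ((9.0,), 10.3)]

# The three nearest records to 2.1 have targets 3.9, 4.1 and 3.4,
# so the prediction is their mean, about 3.8.
print(knn_regress(train, (2.1,), k=3))
```

A distance-weighted average could be used instead of the plain mean, in the same way that votes can be weighted in classification.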
References
• Nearest Neighbor classifiers. Chapter 5.2.
Introduction to Data Mining.
Exercises - kNN