SlideShare a Scribd company logo
1 of 19
Classification Technique KNN in
           Data Mining
       ---on dataset “Iris”


      Comp722 data mining
        Kaiwen Qi, UNC
          Spring 2012
Outline
   Dataset introduction
   Data processing
   Data analysis
   KNN & Implementation
   Testing
Dataset
   Raw dataset
    Iris(http://archive.ics.uci.edu/ml/datasets/Iris)




                                                                 5 Attributes
                                                (a) Raw
    150 total records                                                     Sepal length in cm
                                                data                     (continious number)
                                                                            Sepal width in cm
                                                                          (continious number)
                  50 records Iris Setosa                                   Petal length in cm
                                                                          (continious number)

                                                                            Petal width in cm
                  50 records Iris Versicolour                             (continious number)
                                                                                   Class
                                                                        (nominal data:
                  50 records Iris Virginica                                  Iris Setosa
                                                                             Iris Versicolour
                                                                             Iris Virginica)

       (b) Data
                                                             (C) Data
       organization
Classification Goal
   Task
Data Processing
   Original data
Data Processing
• Balanced distribution
Data Analysis
   Statistics
Data Analysis
   Histogram
Data Analysis
   Histogram
KNN
   KNN algorithm




    The unknown data, the green circle, is classified to be square when
    K is 5. The distance between two points is calculated with Euclidean
    distance d(p, q)=         . .In this example, square is the majority
    in 5 nearest neighbors.
KNN
   Advantage
       the skimpiness of implementation. It is good
        at dealing with numeric attributes.
       Does not set up the model and just imports
        the dataset with very low computer overhead.
       Does not need to calculate the useful attribute
        subset. Compared with naïve Bayesian, we
        do not need to worry about lack of available
        probability data
Implementation of KNN
   Algorithm
        Algorithm: KNN. Asses a classification label from training data for an
         unlabeled data
         Input: K, the number of neighbors.
         Dataset that include training data
        Output: A string that indicates unknown tuple’s classification

    Method:
     Create a distance array whose size is K
     Initialize the array with the distances between the unlabeled tuple with
      first K records in dataset
     Let i=k+1
     calculate the distance between the unlabeled tuple with the (k+1)th
      record in dataset, if the distance is greater than the biggest distance in
      the array, replace the old max distance with the new distance; i=i+1
     repeat step (4) until i is greater than dataset size(150)
     Count the class number in the array, the class of biggest number is
      mining result
Implementation of KNN
   UML
Testing
   Testing (K=7, total 150 tuples)
Testing
   Testing (K=7, 60% data as training data)
Testing
   Input random distribution dataset



               Random dataset




      Accuracy test:
Performance
   Comparison
     Decision tree
    Advantage                                    Naïve Bayesian
    • comprehensibility
    • construct a decision tree without any     Advantage
      domain knowledge                          • relatively simply.
    • handle high dimensional                   • By simply calculating
    • By eliminating unrelated attributes         attributes frequency from
      and tree pruning, it simplifies             training datanand without
      classification calculation                  any other operations (e.g.
    Disadvantage                                  sort, search),
    • requires good quality of training data.   Disadvantage
    • usually runs in memory                    • The assumption of
    • Not good at handling continuous             independence is not right
      number features.                          • No available probability data
                                                  to calculate probability
Conclusion
   KNN is a simple algorithm with high
    classification accuracy for dataset with
    continuous attributes.
   It shows high performance with balanced
    distribution training data as input.
Thanks
Question?

More Related Content

What's hot

Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining
Phi Jack
 

What's hot (20)

Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Classification and prediction in data mining
Classification and prediction in data miningClassification and prediction in data mining
Classification and prediction in data mining
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Data mining
Data miningData mining
Data mining
 
Exploratory data analysis
Exploratory data analysis Exploratory data analysis
Exploratory data analysis
 
Datawarehouse and OLAP
Datawarehouse and OLAPDatawarehouse and OLAP
Datawarehouse and OLAP
 
Missing Data and data imputation techniques
Missing Data and data imputation techniquesMissing Data and data imputation techniques
Missing Data and data imputation techniques
 
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
 
Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining
 
Modelling and evaluation
Modelling and evaluationModelling and evaluation
Modelling and evaluation
 
Data mining
Data mining Data mining
Data mining
 
Presentation on K-Means Clustering
Presentation on K-Means ClusteringPresentation on K-Means Clustering
Presentation on K-Means Clustering
 
2. visualization in data mining
2. visualization in data mining2. visualization in data mining
2. visualization in data mining
 
Decision tree
Decision treeDecision tree
Decision tree
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Support Vector Machines for Classification
Support Vector Machines for ClassificationSupport Vector Machines for Classification
Support Vector Machines for Classification
 
Presentation on supervised learning
Presentation on supervised learningPresentation on supervised learning
Presentation on supervised learning
 
Data Mining : Concepts
Data Mining : ConceptsData Mining : Concepts
Data Mining : Concepts
 
Confusion Matrix
Confusion MatrixConfusion Matrix
Confusion Matrix
 
Data mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, ClassificationData mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, Classification
 

Viewers also liked

Dwdm naive bayes_ankit_gadgil_027
Dwdm naive bayes_ankit_gadgil_027Dwdm naive bayes_ankit_gadgil_027
Dwdm naive bayes_ankit_gadgil_027
ankitgadgil
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Research
kevinlan
 
Marekting research applications ppt
Marekting research applications pptMarekting research applications ppt
Marekting research applications ppt
ANSHU TIWARI
 
Multidimentional data model
Multidimentional data modelMultidimentional data model
Multidimentional data model
jagdish_93
 
Data warehouse architecture
Data warehouse architectureData warehouse architecture
Data warehouse architecture
pcherukumalla
 
Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniques
Saif Ullah
 
Data Warehouse Modeling
Data Warehouse ModelingData Warehouse Modeling
Data Warehouse Modeling
vivekjv
 

Viewers also liked (19)

Dwdm naive bayes_ankit_gadgil_027
Dwdm naive bayes_ankit_gadgil_027Dwdm naive bayes_ankit_gadgil_027
Dwdm naive bayes_ankit_gadgil_027
 
Naive bayes
Naive bayesNaive bayes
Naive bayes
 
ML KNN-ALGORITHM
ML KNN-ALGORITHMML KNN-ALGORITHM
ML KNN-ALGORITHM
 
Multidimensional Database Design & Architecture
Multidimensional Database Design & ArchitectureMultidimensional Database Design & Architecture
Multidimensional Database Design & Architecture
 
Data modelling 101
Data modelling 101Data modelling 101
Data modelling 101
 
Multidimensional data models
Multidimensional data  modelsMultidimensional data  models
Multidimensional data models
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Research
 
Marekting research applications ppt
Marekting research applications pptMarekting research applications ppt
Marekting research applications ppt
 
Data mining
Data miningData mining
Data mining
 
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALADATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
 
Data Modeling Basics
Data Modeling BasicsData Modeling Basics
Data Modeling Basics
 
Data Modeling PPT
Data Modeling PPTData Modeling PPT
Data Modeling PPT
 
Multidimentional data model
Multidimentional data modelMultidimentional data model
Multidimentional data model
 
Copy Testing
Copy TestingCopy Testing
Copy Testing
 
Multi dimensional model vs (1)
Multi dimensional model vs (1)Multi dimensional model vs (1)
Multi dimensional model vs (1)
 
Promotion
PromotionPromotion
Promotion
 
Data warehouse architecture
Data warehouse architectureData warehouse architecture
Data warehouse architecture
 
Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniques
 
Data Warehouse Modeling
Data Warehouse ModelingData Warehouse Modeling
Data Warehouse Modeling
 

Similar to Data mining project presentation

Instance based learning
Instance based learningInstance based learning
Instance based learning
Slideshare
 
ICCV2009: Max-Margin Ađitive Classifiers for Detection
ICCV2009: Max-Margin Ađitive Classifiers for DetectionICCV2009: Max-Margin Ađitive Classifiers for Detection
ICCV2009: Max-Margin Ađitive Classifiers for Detection
zukun
 
lecture_RNN Autoencoder.pdf
lecture_RNN Autoencoder.pdflecture_RNN Autoencoder.pdf
lecture_RNN Autoencoder.pdf
ssusercc3ff71
 
London useR Meeting 21-Jul-09
London useR Meeting 21-Jul-09London useR Meeting 21-Jul-09
London useR Meeting 21-Jul-09
bwhitcher
 
FUNCTION APPROXIMATION
FUNCTION APPROXIMATIONFUNCTION APPROXIMATION
FUNCTION APPROXIMATION
ankita pandey
 

Similar to Data mining project presentation (14)

Statistical classification: A review on some techniques
Statistical classification: A review on some techniquesStatistical classification: A review on some techniques
Statistical classification: A review on some techniques
 
Instance based learning
Instance based learningInstance based learning
Instance based learning
 
ICCV2009: Max-Margin Ađitive Classifiers for Detection
ICCV2009: Max-Margin Ađitive Classifiers for DetectionICCV2009: Max-Margin Ađitive Classifiers for Detection
ICCV2009: Max-Margin Ađitive Classifiers for Detection
 
Principal Component Analysis For Novelty Detection
Principal Component Analysis For Novelty DetectionPrincipal Component Analysis For Novelty Detection
Principal Component Analysis For Novelty Detection
 
lecture_RNN Autoencoder.pdf
lecture_RNN Autoencoder.pdflecture_RNN Autoencoder.pdf
lecture_RNN Autoencoder.pdf
 
Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)
 
London useR Meeting 21-Jul-09
London useR Meeting 21-Jul-09London useR Meeting 21-Jul-09
London useR Meeting 21-Jul-09
 
Introduction to deep learning
Introduction to deep learningIntroduction to deep learning
Introduction to deep learning
 
Intel Nervana Artificial Intelligence Meetup 1/31/17
Intel Nervana Artificial Intelligence Meetup 1/31/17Intel Nervana Artificial Intelligence Meetup 1/31/17
Intel Nervana Artificial Intelligence Meetup 1/31/17
 
Bayesian Counters
Bayesian CountersBayesian Counters
Bayesian Counters
 
FUNCTION APPROXIMATION
FUNCTION APPROXIMATIONFUNCTION APPROXIMATION
FUNCTION APPROXIMATION
 
Convolutional Patch Representations for Image Retrieval An unsupervised approach
Convolutional Patch Representations for Image Retrieval An unsupervised approachConvolutional Patch Representations for Image Retrieval An unsupervised approach
Convolutional Patch Representations for Image Retrieval An unsupervised approach
 
Machine Learning with R
Machine Learning with RMachine Learning with R
Machine Learning with R
 
Learning Classifier Systems for Class Imbalance Problems
Learning Classifier Systems  for Class Imbalance  ProblemsLearning Classifier Systems  for Class Imbalance  Problems
Learning Classifier Systems for Class Imbalance Problems
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Recently uploaded (20)

MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 

Data mining project presentation

  • 1. Classification Technique KNN in Data Mining ---on dataset “Iris” Comp722 data mining Kaiwen Qi, UNC Spring 2012
  • 2. Outline  Dataset introduction  Data processing  Data analysis  KNN & Implementation  Testing
  • 3. Dataset  Raw dataset Iris(http://archive.ics.uci.edu/ml/datasets/Iris) 5 Attributes (a) Raw 150 total records Sepal length in cm data (continious number) Sepal width in cm (continious number) 50 records Iris Setosa Petal length in cm (continious number) Petal width in cm 50 records Iris Versicolour (continious number) Class (nominal data: 50 records Iris Virginica Iris Setosa Iris Versicolour Iris Virginica) (b) Data (C) Data organization
  • 5. Data Processing  Original data
  • 7. Data Analysis  Statistics
  • 8. Data Analysis  Histogram
  • 9. Data Analysis  Histogram
  • 10. KNN  KNN algorithm The unknown data, the green circle, is classified to be square when K is 5. The distance between two points is calculated with Euclidean distance d(p, q)= . .In this example, square is the majority in 5 nearest neighbors.
  • 11. KNN  Advantage  the skimpiness of implementation. It is good at dealing with numeric attributes.  Does not set up the model and just imports the dataset with very low computer overhead.  Does not need to calculate the useful attribute subset. Compared with naïve Bayesian, we do not need to worry about lack of available probability data
  • 12. Implementation of KNN  Algorithm  Algorithm: KNN. Asses a classification label from training data for an unlabeled data Input: K, the number of neighbors. Dataset that include training data Output: A string that indicates unknown tuple’s classification Method:  Create a distance array whose size is K  Initialize the array with the distances between the unlabeled tuple with first K records in dataset  Let i=k+1  calculate the distance between the unlabeled tuple with the (k+1)th record in dataset, if the distance is greater than the biggest distance in the array, replace the old max distance with the new distance; i=i+1  repeat step (4) until i is greater than dataset size(150)  Count the class number in the array, the class of biggest number is mining result
  • 14. Testing  Testing (K=7, total 150 tuples)
  • 15. Testing  Testing (K=7, 60% data as training data)
  • 16. Testing  Input random distribution dataset Random dataset Accuracy test:
  • 17. Performance  Comparison Decision tree Advantage Naïve Bayesian • comprehensibility • construct a decision tree without any Advantage domain knowledge • relatively simply. • handle high dimensional • By simply calculating • By eliminating unrelated attributes attributes frequency from and tree pruning, it simplifies training datanand without classification calculation any other operations (e.g. Disadvantage sort, search), • requires good quality of training data. Disadvantage • usually runs in memory • The assumption of • Not good at handling continuous independence is not right number features. • No available probability data to calculate probability
  • 18. Conclusion  KNN is a simple algorithm with high classification accuracy for dataset with continuous attributes.  It shows high performance with balanced distribution training data as input.