Presented By,
Manasi C. Kadam
Sharmishtha P. Alwekar
Ganesh H. Satpute
Deepak D. Ambegaonkar
Rajesh V. Dulhani

   Under the guidance of
    Prof. G. A. Patil
    Mr. Varad Meru
Agenda
                    
   Introduction
   Clustering
   K-means clustering algorithm
   Canopy clustering algorithm
   Complexity Evaluation
   Conclusion
   Future Enhancement
   References
Introduction
                   
 Maintaining large data is a tedious task
 Types of data
     1.   Structured
     2.   Unstructured
Introduction to Data Analysis
             
 Extracting information out of data
 Two types
  1. Exploratory or descriptive
  2. Confirmative or inferential
Clustering (Aka Unsupervised Learning)
                         
 The goal is to discover the natural grouping(s) among objects
 Given n objects, find K groups based on a measure of “similarity”
 Organizing data into clusters such that there is
       • high intra-cluster similarity
       • low inter-cluster similarity
 Ideal cluster - set of points that is compact and
  isolated
 Example algorithms: K-means, K-medoids, etc.

Problems in clustering
             
   Clusters can differ in size, shape, and density
   Presence of noise
   A cluster is a subjective entity
   Automation
Clustering Algorithm
            
 Types of clustering algorithms
   1. Hierarchical
   2. Partitional
 Hierarchical: recursively finds nested clusters
    Types
     1.   Agglomerative
     2.   Divisive
 Partitional: finds all the clusters simultaneously
     Ex. K-means
K-means algorithm

           
K-means Algorithm (contd.)
             
 The goal of K-means is to minimize the sum of squared errors over all K clusters
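The objective above can be sketched in a few lines of plain Python. This is a minimal illustration, not the presenters' implementation: the function names, the random choice of initial centroids, and the stopping rule (stop when assignments no longer change) are all assumptions made for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, labels, sse) for a basic K-means run."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k random data points
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster went empty
                centroids[j] = mean_point(members)
    # Sum of squared errors: the quantity K-means tries to minimize.
    sse = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, sse
```

Each pass alternates an assignment step and an update step; neither step can increase the sum of squared errors, which is why the loop settles.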
Flowchart
   
Class Diagram of K-means
          
Parameter for K-means
          
 The most critical choice is K
    Typically the algorithm is run for various values of K and
     the most appropriate output is selected

 Different initializations can lead to different outputs
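One way to act on both points is to run K-means several times per value of K, each time from a different random start, and keep the lowest sum of squared errors for each K. The sketch below assumes that selection rule; the helper names, the number of restarts, and the fixed iteration count are illustrative choices, not taken from the slides.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_sse(points, k, seed, iters=50):
    """One K-means run from a seeded random start; returns the final SSE."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Group points by their nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Recompute each centroid; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

def best_sse_per_k(points, k_values, restarts=10):
    """For each K, keep the lowest SSE seen over several random restarts."""
    return {
        k: min(kmeans_sse(points, k, seed) for seed in range(restarts))
        for k in k_values
    }
```

The SSE always shrinks as K grows, so it cannot pick K on its own; a common heuristic is to look for the value of K after which the improvement levels off.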
Canopy Clustering
             
 Traditional clustering algorithms work well when a dataset has any one of these properties:
    A large number of clusters
    High feature dimensionality
    A large number of data points
 When a dataset has all three properties at once, computation becomes expensive.
 This calls for a new technique: canopy clustering
Canopy Clustering (contd.)
             
 Performs clustering in two stages
  1. Rough and quick stage
  2. Rigorous stage
Canopy Clustering (contd.)
            
 Rough and quick stage
    Uses an extremely inexpensive, approximate distance measure
    Divides the data into overlapping subsets called “canopies”
 Rigorous stage
    Uses a rigorous, expensive distance metric
    Clustering is applied only among points that share a canopy
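The rough stage can be sketched as follows. The slides do not state how canopies are formed, so this example assumes the two-threshold scheme described by McCallum et al.: a loose threshold `t1` decides canopy membership and a tight threshold `t2` decides which points may no longer start a canopy. The Manhattan distance used as the cheap measure and all names are illustrative assumptions.

```python
def cheap_distance(p, q):
    """Inexpensive stand-in measure: Manhattan distance (an assumption)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def build_canopies(points, t1, t2, distance=cheap_distance):
    """Split points into possibly overlapping canopies.

    t1 is the loose threshold and t2 the tight one; requires 0 <= t2 < t1.
    Returns a list of (center, members) pairs.
    """
    if not 0 <= t2 < t1:
        raise ValueError("thresholds must satisfy 0 <= t2 < t1")
    remaining = list(points)  # points that may still start a canopy
    canopies = []
    while remaining:
        center = remaining[0]
        # Every point within the loose threshold joins this canopy,
        # including points that already belong to earlier canopies.
        members = [p for p in points if distance(center, p) <= t1]
        canopies.append((center, members))
        # Points within the tight threshold can no longer start a canopy.
        remaining = [p for p in remaining if distance(center, p) > t2]
    return canopies
```

In the rigorous stage, the expensive metric would then be evaluated only for pairs of points that appear together in at least one canopy.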
Flowchart of Canopy Clustering
        





Source: Ref [2]
Output of K-means on Mathematica on Same Dataset
            
Output of K-means on R on Same Dataset
           
Output of K-means on Microsoft Excel on Same Dataset
          
Output of canopy on Excel on Same Dataset
            
Complexity
                  
 The complexity of K-means is O(nk) per iteration, where n is the number
 of objects and k is the number of centroids
 With canopies, the complexity of K-means becomes O(nkf²/c)
    c is the number of canopies
    f is the average number of canopies that each data point falls
     into
 As f is a very small number and c is comparatively
 large, the complexity is reduced
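To get a feel for the size of the reduction, the two expressions can be compared directly. The figures below are hypothetical, chosen only to illustrate the arithmetic, and the function names are assumptions made for this sketch.

```python
def kmeans_cost(n, k):
    """Distance computations per iteration for plain K-means: n * k."""
    return n * k

def canopy_kmeans_cost(n, k, f, c):
    """Distance computations per iteration with canopies: n * k * f^2 / c."""
    return n * k * f ** 2 / c

# Hypothetical workload: one million points, 1000 centroids,
# 100 canopies, each point falling into 2 canopies on average.
plain = kmeans_cost(1_000_000, 1000)
with_canopies = canopy_kmeans_cost(1_000_000, 1000, f=2, c=100)
speedup = plain / with_canopies  # equals c / f^2
```

The ratio depends only on c / f², so the saving grows with the number of canopies and shrinks quickly as the canopies overlap more.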
Conclusion
                 
 Implemented K-means Algorithm
 Verified Result on Mathematica, R
 Implemented Canopy Clustering
 Verified Result on Excel
Future Enhancement
             
 Learning Hadoop and MapReduce
 Parallelizing K-means based on MapReduce and
  comparing the implementations
 Running all the implementations of K-means on a standard dataset
References
                   
 [1] Anil K. Jain, “Data Clustering: 50 Years Beyond K-Means”
 [2] Andrew McCallum et al., “Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching”

Thank You
