Comparing Clustering Algorithms

 
     Partitioning Algorithms
     −   K-Means
     −   DBSCAN Using KD Trees

 
     Hierarchical Algorithms
     −   Agglomerative Clustering
     −   CURE
K-Means: Partitional Clustering

    Prototype based Clustering

    O(I * K * m * n) time complexity (I iterations, K clusters,
    m points, n attributes); space complexity is O((m + K) * n)

    Using KD trees for the nearest-centroid search, the
    overall time complexity reduces to O(m * log m)

    Select K initial centroids

    Repeat
    − For each point, find its closest centroid and assign
      the point to that centroid. This results in the
      formation of K clusters
    − Recompute the centroid of each cluster
    until the centroids do not change (a minimal sketch of
    this loop follows)
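A minimal sketch of this loop in plain Java for 2-D points (an illustration only; the class name KMeansSketch and the toy data are ours, and this is not the LabVIEW implementation measured below):

import java.util.Arrays;
import java.util.Random;

// Minimal 2-D K-Means sketch: assign each point to its closest centroid,
// recompute the centroids, and repeat until assignments stop changing.
public class KMeansSketch {
    public static int[] cluster(double[][] pts, int k, int maxIter) {
        Random rnd = new Random(42);
        double[][] centroids = new double[k][2];
        for (int i = 0; i < k; i++)                  // pick K initial centroids (random points)
            centroids[i] = pts[rnd.nextInt(pts.length)].clone();
        int[] assign = new int[pts.length];
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            for (int p = 0; p < pts.length; p++) {   // assignment step
                int best = 0;
                double bestD = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dx = pts[p][0] - centroids[c][0];
                    double dy = pts[p][1] - centroids[c][1];
                    double d = dx * dx + dy * dy;
                    if (d < bestD) { bestD = d; best = c; }
                }
                if (assign[p] != best) { assign[p] = best; changed = true; }
            }
            if (!changed) break;                     // centroids are stable, stop
            double[][] sum = new double[k][2];
            int[] cnt = new int[k];
            for (int p = 0; p < pts.length; p++) {   // update step: recompute centroids
                sum[assign[p]][0] += pts[p][0];
                sum[assign[p]][1] += pts[p][1];
                cnt[assign[p]]++;
            }
            for (int c = 0; c < k; c++)
                if (cnt[c] > 0)
                    centroids[c] = new double[] { sum[c][0] / cnt[c], sum[c][1] / cnt[c] };
        }
        return assign;
    }

    public static void main(String[] args) {
        double[][] pts = { {1, 1}, {1.2, 0.9}, {5, 5}, {5.1, 4.8}, {9, 1}, {8.8, 1.2} };
        System.out.println(Arrays.toString(cluster(pts, 3, 100)));
    }
}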
K-Means (Contd.)
Datasets
    - SPAETH2 2D dataset of 3360 points
K-Means (Contd.)
Performance Measurements
Compiler Used
  −   LabVIEW 8.2.1
Hardware Used
  −   Intel® Core(TM)2 IV 1.73 GHz
  −   1 GB RAM
Current Status
  −   Done
Time Taken
  −   355 ms / 3360 points
K-Means (Contd.)
Pros

  Simple

  Fast for low dimensional data

  It can find pure sub-clusters if a large number
  of clusters is specified

Cons

  K-Means cannot handle non-globular clusters or
  clusters of different sizes and densities

  K-Means will not identify outliers

  K-Means is restricted to data for which a center
  (centroid) is meaningful
Agglomerative Hierarchical Clustering

    Start with single-point (singleton) clusters and
    recursively merge the most similar clusters into one
    "parent" cluster until the termination criterion is
    reached

    Algorithms (the three inter-cluster distances are
    sketched after this list):
    −   MIN (Single Link)
    −   MAX (Complete Link)
    −   Group Average (GA)

    MIN: susceptible to noise/outliers

    MAX/GA: may not work well with non-
    globular clusters

    CURE tries to handle both problems
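A small sketch of the three inter-cluster distances named above, for two sets of 2-D points (the class name and the toy values are ours, shown only to make MIN, MAX, and group average concrete):

// Inter-cluster distances for agglomerative clustering:
// MIN (single link), MAX (complete link), and group average.
public class LinkageDistances {
    static double dist(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return Math.sqrt(dx * dx + dy * dy);
    }

    static double singleLink(double[][] A, double[][] B) {     // MIN over all pairs
        double best = Double.MAX_VALUE;
        for (double[] a : A) for (double[] b : B) best = Math.min(best, dist(a, b));
        return best;
    }

    static double completeLink(double[][] A, double[][] B) {   // MAX over all pairs
        double best = 0;
        for (double[] a : A) for (double[] b : B) best = Math.max(best, dist(a, b));
        return best;
    }

    static double groupAverage(double[][] A, double[][] B) {   // mean over all pairs
        double sum = 0;
        for (double[] a : A) for (double[] b : B) sum += dist(a, b);
        return sum / (A.length * B.length);
    }

    public static void main(String[] args) {
        double[][] A = { {0, 0}, {0, 1} };
        double[][] B = { {3, 0}, {4, 0} };
        System.out.println(singleLink(A, B) + " " + completeLink(A, B) + " " + groupAverage(A, B));
    }
}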
Data Set

    2-D data set used
    −   The SPAETH2 dataset is a related collection of
        data for cluster analysis. (Around 1500 data
        points)
Algorithm optimization

    The optimization involved implementing a minimum
    spanning tree using Kruskal’s algorithm (equivalent
    to single-link clustering)

    The union-by-rank method is used to speed up the
    algorithm (see the sketch after this slide)

    Environment:
    −   Implemented using MATLAB

    Other Tools:
    −   Gnuplot

    Present Status
    −   Single Link and Complete Link – done
    −   Group Average – in progress
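A sketch of the union-by-rank structure that keeps Kruskal's merging fast (a Java illustration with our own class name; the project's actual code was written in MATLAB):

// Disjoint-set (union-find) with union by rank and path compression,
// the structure behind fast Kruskal MST / single-link merging.
public class UnionFind {
    private final int[] parent, rank;

    public UnionFind(int n) {
        parent = new int[n];
        rank = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;   // each point starts as its own cluster
    }

    public int find(int x) {                          // find the root, compressing the path
        if (parent[x] != x) parent[x] = find(parent[x]);
        return parent[x];
    }

    public boolean union(int a, int b) {              // merge two clusters; false if already merged
        int ra = find(a), rb = find(b);
        if (ra == rb) return false;
        if (rank[ra] < rank[rb]) { int t = ra; ra = rb; rb = t; }
        parent[rb] = ra;                              // attach the shorter tree under the taller one
        if (rank[ra] == rank[rb]) rank[ra]++;
        return true;
    }

    public static void main(String[] args) {
        UnionFind uf = new UnionFind(4);
        System.out.println(uf.union(0, 1));           // true: first merge
        System.out.println(uf.union(1, 0));           // false: already in the same cluster
    }
}

In Kruskal's algorithm the point pairs (edges) are sorted by distance and processed in increasing order; each successful union is one single-link merge, and stopping after n − k successful unions leaves k clusters.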
Single Link / CURE: Globular Clusters
[Figures: clusters after 64000 iterations; the final cluster]
Single Link / CURE: Non-globular Clusters
KD Trees

    K-Dimensional Trees

    Space Partitioning Data Structure

    Splitting planes perpendicular to
    Coordinate Axes


    Useful in Nearest Neighbor
    Search

    Reduces nearest-neighbor search time
    to O(log n) per query

    Has been used in many clustering
    algorithms and other domains
Clustering algorithms use KD trees extensively to improve their
time complexity, e.g. fast K-Means and fast DBSCAN.

We considered two popular clustering algorithms that use the KD tree
approach to speed up clustering and minimize search time.

We used an open-source implementation of KD trees (available under
the GNU GPL).
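A minimal 2-D kd-tree sketch showing why nearest-neighbor queries take O(log n) time on average (our own illustrative code, not the GPL library the project reused):

import java.util.Arrays;
import java.util.Comparator;

// Minimal 2-D kd-tree: build by splitting on the median along alternating axes,
// then search by descending to the query's side and backtracking across the
// splitting plane only when a closer point could still lie there.
public class KdTreeSketch {
    static class Node { double[] p; Node left, right; int axis; }

    static Node build(double[][] pts, int lo, int hi, int axis) {
        if (lo >= hi) return null;
        final int a = axis;
        Arrays.sort(pts, lo, hi, Comparator.comparingDouble(q -> q[a])); // split on the median
        int mid = (lo + hi) / 2;
        Node n = new Node();
        n.p = pts[mid];
        n.axis = axis;
        n.left = build(pts, lo, mid, 1 - axis);
        n.right = build(pts, mid + 1, hi, 1 - axis);
        return n;
    }

    // Current best match, kept in static fields for brevity (single-threaded sketch).
    static double[] best;
    static double bestDist;

    static void nearest(Node n, double[] q) {
        if (n == null) return;
        double d = dist2(n.p, q);
        if (d < bestDist) { bestDist = d; best = n.p; }
        double diff = q[n.axis] - n.p[n.axis];
        Node near = diff < 0 ? n.left : n.right;
        Node far = diff < 0 ? n.right : n.left;
        nearest(near, q);                              // search the side containing the query
        if (diff * diff < bestDist) nearest(far, q);   // cross the plane only if it could help
    }

    static double dist2(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return dx * dx + dy * dy;
    }

    public static void main(String[] args) {
        double[][] pts = { {2, 3}, {5, 4}, {9, 6}, {4, 7}, {8, 1}, {7, 2} };
        Node root = build(pts, 0, pts.length, 0);
        best = null; bestDist = Double.MAX_VALUE;
        nearest(root, new double[] {9, 2});
        System.out.println(Arrays.toString(best));     // expect [8.0, 1.0]
    }
}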
DBSCAN (Using KD Trees)

    Density-based clustering (a cluster is a maximal set
    of density-connected points); a code sketch follows
    the pros below

    O(m) Space Complexity

    Using KD trees, the overall time complexity
    reduces from O(m^2) to O(m * log m)

Pros

    Fast for low dimensional data

    Can discover clusters of arbitrary shapes

    Robust to outliers, which are labeled as noise
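A compact DBSCAN sketch of the idea above, using a brute-force O(m^2) region query for clarity; swapping regionQuery for a kd-tree range search is what yields the O(m * log m) behavior (illustrative code with our own class name, not the project's Java implementation):

import java.util.ArrayList;
import java.util.List;

// DBSCAN sketch: a cluster is grown from any unvisited core point (a point
// with at least minPts neighbors within eps); points reachable from no core
// point are labeled as noise (-1).
public class DbscanSketch {
    static final int NOISE = -1, UNVISITED = 0;

    public static int[] dbscan(double[][] pts, double eps, int minPts) {
        int[] label = new int[pts.length];               // 0 = unvisited
        int cluster = 0;
        for (int i = 0; i < pts.length; i++) {
            if (label[i] != UNVISITED) continue;
            List<Integer> seeds = regionQuery(pts, i, eps);
            if (seeds.size() < minPts) { label[i] = NOISE; continue; }
            cluster++;
            label[i] = cluster;
            for (int s = 0; s < seeds.size(); s++) {     // expand the cluster
                int q = seeds.get(s);
                if (label[q] == NOISE) label[q] = cluster;          // border point
                if (label[q] != UNVISITED) continue;
                label[q] = cluster;
                List<Integer> qNeighbors = regionQuery(pts, q, eps);
                if (qNeighbors.size() >= minPts) seeds.addAll(qNeighbors); // q is a core point
            }
        }
        return label;
    }

    // Brute-force eps-neighborhood; a kd-tree range query makes this O(log m).
    static List<Integer> regionQuery(double[][] pts, int i, double eps) {
        List<Integer> out = new ArrayList<Integer>();
        for (int j = 0; j < pts.length; j++) {
            double dx = pts[i][0] - pts[j][0], dy = pts[i][1] - pts[j][1];
            if (dx * dx + dy * dy <= eps * eps) out.add(j);
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] pts = { {1, 1}, {1.1, 1}, {1, 1.2}, {5, 5}, {5.1, 5}, {5, 5.2}, {9, 9} };
        System.out.println(java.util.Arrays.toString(dbscan(pts, 0.5, 3))); // two clusters + one noise point
    }
}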
DBSCAN - Issues

    DBSCAN is very sensitive to its clustering
    parameters MinPts (minimum neighborhood points)
    and Eps (neighborhood radius); see the images on
    the following slides

    The algorithm is not partitionable across
    multi-processor systems

    DBSCAN fails to identify clusters if the density
    varies or if the data set is too sparse (see the
    images on the following slides)

    Sampling Affects Density Measures
DBSCAN (Contd.)
 Performance Measurements
 
         Compiler Used - Java 1.6
 
         Hardware Used - Intel Pentium IV 1.8 GHz (Duo Core), 1 GB RAM

 No. of Points            1572    3568    7502    10256

 Clustering Time (sec)     3.5    10.9    39.5     78.4

[Chart: DBSCAN Using KD Trees Performance Measures. Clustering time (sec)
vs. number of points, comparing DBSCAN using a KD tree with basic DBSCAN.]
CURE – Hierarchical Clustering

    Involves Two Pass clustering

    Uses Efficient Sampling Algorithms

    Scalable for Large Datasets

    The first pass of the algorithm is partitionable, so it
    can run concurrently on multiple processors (a higher
    number of partitions helps keep execution time near
    linear as the dataset size increases)
Source - CURE: An Efficient Clustering Algorithm for Large Databases. S.
Guha, R. Rastogi and K. Shim, 1998.

Each step is important for achieving scalability and efficiency as well as
for improving concurrency.

    Data Structures

  KD-Tree to store the data/representative points : O(log n) searching time
for nearest neighbors

  Min Heap to Store the Clusters : O(1) searching time to compute next
cluster to be processed (see the sketch below)
CURE hence has O(n) space complexity
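A sketch of the min-heap idea named above: each entry records a cluster and its distance to its closest other cluster, so the next pair to merge is always at the top of the heap (our own illustrative class and toy values, not CURE's full merge procedure):

import java.util.PriorityQueue;

// Each heap entry records a cluster id and its distance to its current
// closest other cluster; the minimum entry is the next merge candidate.
public class ClusterHeapSketch {
    static class Entry implements Comparable<Entry> {
        final int clusterId, closestId;
        final double distance;
        Entry(int clusterId, int closestId, double distance) {
            this.clusterId = clusterId; this.closestId = closestId; this.distance = distance;
        }
        public int compareTo(Entry o) { return Double.compare(distance, o.distance); }
    }

    public static void main(String[] args) {
        PriorityQueue<Entry> heap = new PriorityQueue<Entry>();
        heap.add(new Entry(0, 1, 2.5));
        heap.add(new Entry(1, 0, 2.5));
        heap.add(new Entry(2, 3, 0.7));   // clusters 2 and 3 form the closest pair
        heap.add(new Entry(3, 2, 0.7));
        Entry next = heap.poll();          // O(1) peek at the minimum, O(log n) pop
        System.out.println("merge " + next.clusterId + " with " + next.closestId);
    }
}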
CURE (Contd.)

    Outperforms basic hierarchical clustering by reducing
    the time complexity from O(n^2 * log n) to O(n^2)

    Two steps of outlier elimination
    −   After pre-clustering
    −   While assigning labels to data that was not part of
        the sample

    Captures the shape of clusters by selecting
    representative points (well-scattered points that
    determine the boundary of a cluster); see the sketch
    below
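A sketch of how representative points can be chosen and shrunk toward the cluster mean, following the idea above: pick c well-scattered points with a farthest-point heuristic, then shrink each by a factor alpha (illustrative code with our own class name and toy values, not the paper's exact implementation):

import java.util.ArrayList;
import java.util.List;

// CURE representative-points sketch: select c well-scattered points, then
// shrink each toward the cluster mean, which damps the effect of outliers
// on the cluster boundary.
public class CureRepresentatives {
    public static List<double[]> representatives(double[][] cluster, int c, double alpha) {
        int dim = cluster[0].length;
        double[] mean = new double[dim];
        for (double[] p : cluster)
            for (int d = 0; d < dim; d++) mean[d] += p[d] / cluster.length;

        List<double[]> scattered = new ArrayList<double[]>();
        scattered.add(farthestFrom(cluster, new double[][] { mean })); // start far from the mean
        while (scattered.size() < Math.min(c, cluster.length))
            scattered.add(farthestFrom(cluster, scattered.toArray(new double[0][])));

        List<double[]> reps = new ArrayList<double[]>();
        for (double[] p : scattered) {                 // shrink each point toward the mean
            double[] r = new double[dim];
            for (int d = 0; d < dim; d++) r[d] = p[d] + alpha * (mean[d] - p[d]);
            reps.add(r);
        }
        return reps;
    }

    // Point of the cluster whose minimum distance to the chosen set is largest.
    static double[] farthestFrom(double[][] cluster, double[][] chosen) {
        double[] best = cluster[0];
        double bestScore = -1;
        for (double[] p : cluster) {
            double minD = Double.MAX_VALUE;
            for (double[] q : chosen) {
                double d = 0;
                for (int k = 0; k < p.length; k++) d += (p[k] - q[k]) * (p[k] - q[k]);
                minD = Math.min(minD, d);
            }
            if (minD > bestScore) { bestScore = minD; best = p; }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] cluster = { {0, 0}, {0, 2}, {2, 0}, {2, 2}, {1, 1} };
        for (double[] r : representatives(cluster, 3, 0.3))
            System.out.println(r[0] + ", " + r[1]);
    }
}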
CURE - Benefits against Popular Algorithms

    K-Means (and centroid-based algorithms): unsuitable for
    non-spherical clusters and clusters of differing sizes.

    CLARANS: needs multiple data scans (R*-trees were
    proposed later on). CURE inherently uses KD trees to
    store the dataset and reuses them across passes.

    BIRCH: can identify only convex or spherical
    clusters of uniform size.

    DBSCAN: no parallelism, high parameter sensitivity,
    and sampling of the data may affect density measures.
CURE (Contd.)
Observations on sensitivity to parameters
  −   Random sample size: it should be ensured that
      the sample represents all existing clusters. The
      algorithm uses Chernoff bounds to calculate the
      required sample size
  −   Shrink Factor of Representative Points
  −   Number of representative points (more representative
      points increase the computation time)
  −   Number of partitions: a very high number of
      partitions (>50) would not give suitable results, as
      some partitions may not have sufficient points to
      cluster.
CURE - Performance
Compiler: Java 1.6    Hardware Used: Intel Pentium IV 1.8 GHz (Duo Core), 1 GB RAM

           No. of Points            1572    3568    7502    10256

           Clustering Time (sec)
           Partition P = 2           6.4     7.8    29.4     75.7
           Partition P = 3           6.5     7.6    21.6     43.6
           Partition P = 5           6.1     7.3    12.2     21.2


       [Chart: CURE Performance Measurements. Clustering time (sec) vs. number
       of points for partitions P = 2, P = 3, P = 5, and DBSCAN.]
Data Sets and Results

    SPAETH - http://people.scs.fsu.edu/~burkardt/f_src/spaeth/spaeth.html

    Synthetic Data - http://dbkgroup.org/handl/generators/
References


    An Efficient k-Means Clustering Algorithm: Analysis and
    Implementation - Tapas Kanungo, Nathan S. Netanyahu, Christine D.
    Piatko, Ruth Silverman, Angela Y. Wu.

    A Density-Based Algorithm for Discovering Clusters in Large
    Spatial Databases with Noise - Martin Ester, Hans-Peter Kriegel, Jörg
    Sander, Xiaowei Xu, KDD '96

    CURE: An Efficient Clustering Algorithm for Large Databases – S.
    Guha, R. Rastogi and K. Shim, 1998.

    Introduction to Clustering Techniques – Leo Wanner

    A comprehensive overview of Basic Clustering Algorithms – Glenn
    Fung

    Introduction to Data Mining – Tan/Steinbach/Kumar
Thanks!
Presenters
   −   Vasanth Prabhu Sundararaj
   −   Gnana Sundar Rajendiran
   −   Joyesh Mishra

Source: www.cise.ufl.edu/~jmishra/clustering

Tools Used
JDK 1.6, Eclipse, MATLAB, LabVIEW, Gnuplot

This presentation was made using OpenOffice 2.2.1
