4. Why I choose this paper
• The k-means algorithm always rests on an assumption
(the number of clusters), but I want to run it without
relying on human intuition or insight.
• This paper first reviews existing automatic methods
for selecting the number of clusters for the k-means
algorithm.
5. Paper Format
1) Introduction
2) reviews the main known methods for selecting K
3) analyses the factors influencing the selection of K
4) describes the proposed evaluation measure
5) presents the results of applying the proposed
measure to select K for different data sets
6) concludes the paper
7. K-means Algorithm
• The k-means algorithm is a clustering method that
originated in signal processing and is popular in
machine learning and data mining.
• k-means clustering aims to partition n observations
into k clusters in which each observation belongs to
the cluster with the nearest mean; iteration stops
once the centers move less than a threshold.
8. K-means Algorithm
1) Pick k points at random as the initial cluster centers
2) Assign every point to its nearest cluster center
3) Move each cluster center to the mean of its
assigned points
4) Repeat 2-3 until convergence
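The four steps above can be sketched in plain Python. This is an illustrative implementation, not the paper's code; the names `kmeans`, `dist2`, and all parameter defaults are my own assumptions.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100, tol=1e-12, seed=0):
    """Plain k-means (Lloyd's algorithm) on a list of numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # 1) pick k points at random
    for _ in range(max_iter):
        # 2) assign every point to its nearest cluster center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centers[i]))
            clusters[nearest].append(p)
        # 3) move each center to the mean of its assigned points
        #    (an empty cluster keeps its old center)
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        # 4) stop once no center moves more than the threshold
        if all(dist2(a, b) <= tol for a, b in zip(centers, new_centers)):
            centers = new_centers
            break
        centers = new_centers
    return centers, clusters
```

For example, `kmeans([(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)], k=2)` separates the two groups and returns their means as the final centers.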
14. Comments on the K-Means Method
• Strength
• Relatively efficient: O(tkn), where n is # instances, k is # clusters,
and t is # iterations. Normally, k, t << n.
• Often terminates at a local optimum. The global optimum may
be found using techniques such as simulated annealing or genetic
algorithms.
• Weakness
• Need to specify k, the number of clusters, in advance
• Initialization Problem
• Not suitable to discover clusters with non-convex shapes
16. What’s the problem?
• Initialization problem
• it arises when many initial centers fall in regions of high
density while too few fall in regions of low density, so the
algorithm converges to a poor local optimum
19. Existing Methods
• Values of K determined by human judgment
• Using probabilistic theory
• Akaike's information criterion
• applicable if the data sets are generated by a mixture of Gaussian distributions
• Hardy's method
• applicable if the data sets are generated by a mixture of Poisson distributions
• Monte Carlo techniques (with an associated null hypothesis)
22. Research Method
• The method has been validated on 15 artificial and
12 benchmark data sets.
• The 12 benchmark data sets come from the UCI
Machine Learning Repository.
• The 15 artificial data sets provide a representative
sample of the kinds of distributions that commonly arise.
28. Conclusion
• The new method is closely related to the approach
of K-means clustering because it takes into account
information reflecting the performance of the
algorithm
• The proposed method can suggest multiple values
of K to users for cases when different clustering
results could be obtained with various required
levels of detail
• However, this method is computationally expensive
when used with large data sets
29. Improvement
• The paper does not explain how to determine the
threshold (e.g., f(x) < 0.85); given many data sets,
a learning algorithm could be applied to learn the
threshold.
• The experimental data sets are biased: they are too
idealized and do not reflect real-world complexity.
Evaluating on randomly sampled data could be one
remedy.
• Knowing the range, or at least the maximum value,
of K in advance remains an important open issue.