Selection of K in K-means Clustering

Junghoon Kim

Why I chose this paper
• The k-means algorithm always rests on an assumption (the number of clusters must be given), but I want to run it without relying on human intuition or insight.
• The paper first reviews existing automatic methods for selecting the number of clusters for the k-means algorithm.

Paper Format
1) Introduction
2) Reviews the main known methods for selecting K
3) Analyses the factors influencing the selection of K
4) Describes the proposed evaluation measure
5) Presents the results of applying the proposed measure to select K for different data sets
6) Concludes the paper

K-means Algorithm
• k-means is a clustering method that originated in signal processing and is popular in machine learning and data mining.
• k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. The means are updated repeatedly until they move less than a threshold.

K-means Algorithm
1) Pick k points at random as the initial cluster centers
2) Assign every point to its nearest cluster center
3) Move each cluster center to the mean of its assigned points
4) Repeat steps 2-3 until convergence
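
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the paper; the function names, the `tol` and `max_iter` parameters, the seed, and the toy data are all assumptions made for the example.

```python
import numpy as np

def nearest_center(points, centers):
    """Index of the closest center (Euclidean distance) for each point."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Basic k-means on an (n, d) array; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points at random as initial centers.
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center.
        labels = nearest_center(points, centers)
        # Step 3: move each center to the mean of its assigned points.
        new_centers = centers.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old center
                new_centers[j] = members.mean(axis=0)
        # Step 4: stop once the centers move less than the threshold.
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return centers, nearest_center(points, centers)

# Hypothetical toy data: two well-separated groups on a line.
data = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
centers, labels = kmeans(data, k=2)
print(labels)
```

The cluster numbering depends on which points are drawn as initial centers, so only the grouping of the labels is meaningful, not the label values themselves.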

Clustering: Example 2, Step 1
Algorithm: k-means, Distance Metric: Euclidean Distance
[Figure: scatter plot with cluster centers k1, k2, k3. x-axis: expression in condition 1 (0 to 5); y-axis: expression in condition 2 (0 to 5)]

Clustering: Example 2, Step 2
Algorithm: k-means, Distance Metric: Euclidean Distance
[Figure: scatter plot with cluster centers k1, k2, k3. x-axis: expression in condition 1 (0 to 5); y-axis: expression in condition 2 (0 to 5)]

Clustering: Example 2, Step 3
Algorithm: k-means, Distance Metric: Euclidean Distance
[Figure: scatter plot with cluster centers k1, k2, k3. x-axis: expression in condition 1 (0 to 5); y-axis: expression in condition 2 (0 to 5)]

Clustering: Example 2, Step 4
Algorithm: k-means, Distance Metric: Euclidean Distance
[Figure: scatter plot with cluster centers k1, k2, k3. x-axis: expression in condition 1 (0 to 5); y-axis: expression in condition 2 (0 to 5)]

Clustering: Example 2, Step 5
Algorithm: k-means, Distance Metric: Euclidean Distance
[Figure: scatter plot with cluster centers k1, k2, k3. x-axis: expression in condition 1 (0 to 5); y-axis: expression in condition 2 (0 to 5)]

Comments on the K-Means Method
• Strength
• Relatively efficient: O(tkn), where n is # instances, k is # clusters, and t is # iterations. Normally, k, t << n.
• Often terminates at a local optimum. The global optimum may be found using techniques such as simulated annealing or genetic algorithms.

• Weakness
• Need to specify k, the number of clusters, in advance
• Initialization problem
• Not suitable for discovering clusters with non-convex shapes

What’s the problem?
• Initialization problem

• It occurs when many initial points fall in high-density regions and few fall in low-density regions
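
A small sketch of how the starting points change the outcome. It runs k-means twice on the same hypothetical six-point data set, once with both initial centers in the same dense group and once with one center per group, and compares the within-cluster sum of squared errors (SSE). The helper names, the data, and the random-restart wrapper are assumptions for illustration. Random restarts are a simple, commonly used mitigation and are swapped in here for brevity; they are not the simulated annealing or genetic algorithms named in the slides.

```python
import numpy as np

def lloyd(points, centers, max_iter=100):
    """Run k-means from given initial centers; return (centers, labels, sse)."""
    centers = centers.astype(float)
    for _ in range(max_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Each center moves to the mean of its members (or stays if empty).
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(updated, centers):
            break
        centers = updated
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    sse = float((dists[np.arange(len(points)), labels] ** 2).sum())
    return centers, labels, sse

def best_of_restarts(points, k, restarts=10, seed=0):
    """Keep the lowest-SSE result over several random initializations."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(restarts):
        init = points[rng.choice(len(points), size=k, replace=False)]
        result = lloyd(points, init)
        if best is None or result[2] < best[2]:
            best = result
    return best

# Hypothetical data: two tight groups far apart.
data = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.0],
                 [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
_, _, bad_sse = lloyd(data, data[[3, 4]])   # both starts in the same group
_, _, good_sse = lloyd(data, data[[0, 3]])  # one start per group
print(bad_sse, good_sse)
```

With both starts in the same group, the run settles on a split that mixes the two groups and has a much larger SSE. This is the local-optimum behaviour that makes initialization matter.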

What’s the problem?
• Hard to find clusters with non-convex shapes
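
To illustrate the non-convex limitation, the sketch below builds two concentric rings (an assumed toy data set) and runs plain k-means with k = 2. Because each point goes to its nearest center, two centers can only split the plane along a straight line, so the inner ring cannot be separated from the outer one. All names and parameters here are illustrative assumptions.

```python
import numpy as np

def ring(radius, n):
    """n evenly spaced points on a circle of the given radius."""
    angles = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    return np.column_stack([radius * np.cos(angles), radius * np.sin(angles)])

def two_means(points, centers, max_iter=100):
    """Plain k-means from the given initial centers; returns the labels."""
    centers = centers.astype(float)
    for _ in range(max_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(updated, centers):
            break
        centers = updated
    return labels

inner, outer = ring(1.0, 40), ring(5.0, 40)
points = np.vstack([inner, outer])
truth = np.array([0] * 40 + [1] * 40)  # 0 = inner ring, 1 = outer ring

# Start with one center on each ring, a favourable initialization.
labels = two_means(points, np.array([inner[0], outer[20]]))
# Cluster numbering is arbitrary, so score both possible label matchings.
agreement = max(np.mean(labels == truth), np.mean(labels != truth))
print(f"agreement with the true rings: {agreement:.2f}")
```

Even with one initial center on each ring, the agreement stays below 1.0, since no straight boundary separates a ring from the ring around it.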

What’s the problem?
• Selection of K

Existing Method
• Values of K determined from a human's viewpoint

• Using probabilistic theory
• Akaike's information criterion
• if data sets are generated by a set of Gaussian distributions

• Hardy method
• if data sets are generated by a set of Poisson distributions

• Monte Carlo techniques (associated null hypothesis)
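
For reference, Akaike's information criterion is commonly defined as below, where p is the number of estimated parameters and \hat{L} is the maximized likelihood of the fitted model. These symbols are introduced here and do not come from the slides. When candidate values of K are compared this way, the model with the lowest AIC is preferred.

```latex
\mathrm{AIC} = 2p - 2\ln\hat{L}
```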

Research Method
• The method has been validated on 15 artificial and 12 benchmark data sets.
• The 12 benchmark data sets come from the UCI Repository of Machine Learning Databases.
• The 15 artificial data sets are a representative sample of the distributions that commonly occur.

Recommendation Example
If f(X) < 0.85, K = X
else K = 1
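
The recommendation rule above can be written as a short function. This is a sketch under stated assumptions: the scores f(K) are taken as already computed (the slides do not give the formula), the function name and the example scores are hypothetical, and ranking candidates from smallest to largest f(K) is an assumption made here.

```python
def recommend_k(f_values, threshold=0.85):
    """Return candidate K values whose score f(K) is below the threshold.

    f_values maps each tried K to its evaluation score f(K).
    Falls back to [1] (a single cluster) when no K qualifies.
    """
    candidates = [k for k, f in f_values.items() if f < threshold]
    # Assumption: a smaller f(K) is a stronger suggestion, so sort by score.
    candidates.sort(key=lambda k: f_values[k])
    return candidates if candidates else [1]

# Hypothetical scores, for illustration only.
scores = {1: 1.0, 2: 0.4, 3: 0.9, 4: 0.8}
print(recommend_k(scores))
```

Returning a list rather than a single value matches the point that the method can suggest several values of K for different levels of detail.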

Conclusion
• The new method is closely related to the approach
of K-means clustering because it takes into account
information reflecting the performance of the
algorithm
• The proposed method can suggest multiple values
of K to users for cases when different clustering
results could be obtained with various required
levels of detail
• This method is computationally expensive if used
with large data sets

Improvement
• The paper does not explain how to calculate the threshold (e.g., f(x) < 0.85). If we have many data sets, we can apply a learning algorithm to determine the threshold.
• The experimental data sets are biased: they are too ideal and do not reflect the complexity of real data at all. Evaluating on randomly chosen data could be one way to address this.
• Knowing the range, or the maximum value, of K is an important issue.


2013 KSE Seminar, 2013/10/11, Jung hoon Kim
