K-Means Algorithm

Class: Algorithmic Methods of Data Mining
Program: M.Sc. Data Science
University: Sapienza University of Rome
Semester: Fall 2015
Lecturer: Carlos Castillo http://chato.cl/
Sources:
• Mohammed J. Zaki, Wagner Meira, Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, May 2014. Example 13.1.
• Evimaria Terzi: Data Mining course at Boston University
  http://www.cs.bu.edu/~evimaria/cs565-13.html
Boston University Slideshow Title Goes Here
The k-means problem
• consider a set X = {x1, ..., xn} of n points in R^d
• assume that the number of clusters k is given
• problem: find k points c1, ..., ck (called centers or means) so that the cost

  cost(c1, ..., ck) = Σ_{i=1..n} min_{j=1..k} ||xi − cj||²

  is minimized
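The cost above can be computed directly: for each point, take the squared distance to its nearest center, then sum. A minimal NumPy sketch (the function name `kmeans_cost` is my own):

```python
import numpy as np

def kmeans_cost(X, centers):
    """Sum over all points of the squared distance to the nearest center.

    X: (n, d) array of points; centers: (k, d) array of centers.
    """
    # diffs[i, j] = x_i - c_j, so sq_dists[i, j] = ||x_i - c_j||^2
    diffs = X[:, None, :] - centers[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    # each point contributes its distance to the *closest* center
    return sq_dists.min(axis=1).sum()
```

For example, with points {0, 2} on the line and a single center at 0, the cost is 0² + 2² = 4.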
The k-means problem
• k=1 and k=n are easy special cases (why?)
• an NP-hard problem if the dimension of the data is at least 2 (d ≥ 2)
• in practice, a simple iterative algorithm works quite well
The k-means algorithm
• voted among the top-10 algorithms in data mining
• one way of solving the k-means problem
The k-means algorithm
1. Randomly (or with another method) pick k cluster centers {c1, ..., ck}
2. For each j, set the cluster Xj to be the set of points in X that are closest to center cj
3. For each j, let cj be the center of cluster Xj (the mean of the vectors in Xj)
4. Repeat (go to step 2) until convergence
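The steps above (Lloyd's iteration) can be sketched in NumPy; the function name and the `max_iter` safeguard are my own choices, not part of the lecture:

```python
import numpy as np

def kmeans(X, centers, max_iter=100):
    """Lloyd's iteration: assign points to the nearest center, recompute means."""
    centers = centers.copy()
    for _ in range(max_iter):
        # step 2: assign each point to its closest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # step 3: move each center to the mean of its cluster
        # (if a cluster is empty, keep its old center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        # step 4: stop when the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

For example, on the 1-d points {1, 2, 19, 21} with initial centers 2 and 4, one assignment step groups {1, 2} and {19, 21}, and the centers converge to 1.5 and 20.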
1-dimensional clustering exercise
Exercise:
• For the data in the figure:
• Run k-means with k=2 and initial centroids u1=2, u2=4
  (Verify: the final centroids are 18 units apart)
• Try with k=3 and initialization 2, 3, 30
http://www.dataminingbook.info/pmwiki.php/Main/BookDownload Exercise 13.1
Limitations of k-means
• Clusters of different sizes
• Clusters of different densities
• Clusters of non-globular shape
• Sensitive to initialization
k-means algorithm
• finds a local optimum
• often converges quickly, but not always
• the choice of initial points can have a large influence on the result
• tends to find spherical clusters
• outliers can cause a problem
• different densities may cause a problem
Initialization
• random initialization
• random, but repeat many times and take the best solution
  • helps, but the solution can still be bad
• pick points that are distant from each other
• k-means++
  • provable guarantees
k-means++
David Arthur and Sergei Vassilvitskii,
"k-means++: The advantages of careful seeding",
SODA 2007
k-means++ algorithm
• interpolate between the two methods (random initialization and furthest-first traversal)
• let D(x) be the distance between x and the nearest center selected so far
• choose the next center with probability proportional to D(x)^a
  • a = 0: random initialization
  • a = ∞: furthest-first traversal
  • a = 2: k-means++
k-means++ algorithm
• initialization phase:
  • choose the first center uniformly at random
  • choose each next center with probability proportional to D(x)²
• iteration phase:
  • iterate as in the k-means algorithm until convergence
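The initialization phase above can be sketched as follows; this is a sketch under my own naming (`kmeans_pp_init`) and uses NumPy's random generator for the D²-weighted sampling:

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """k-means++ seeding: first center uniform, later centers ~ D(x)^2."""
    rng = np.random.default_rng(rng)
    n = len(X)
    # choose the first center uniformly at random
    centers = [X[rng.integers(n)]]
    for _ in range(k - 1):
        # D(x)^2: squared distance from each point to its nearest chosen center
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # sample the next center with probability proportional to D(x)^2
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)
```

Note that a point coinciding with an existing center has D(x)² = 0 and can never be picked again, so well-separated clusters each tend to receive a seed.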
k-means++ provable guarantee
• the approximation guarantee comes from the initialization (the first iteration) alone
• subsequent iterations can only improve the cost
Lesson learned
• there is no reason to use plain k-means instead of k-means++
• k-means++:
  • easy to implement
  • provable guarantee
  • works well in practice