SlideShare a Scribd company logo
1 of 31
Download to read offline
1
K-Means
Class Algorithmic Methods of Data Mining
Program M. Sc. Data Science
University Sapienza University of Rome
Semester Fall 2015
Lecturer Carlos Castillo http://chato.cl/
Sources:
ā— Mohammed J. Zaki, Wagner Meira, Jr., Data Mining and Analysis:
Fundamental Concepts and Algorithms, Cambridge University
Press, May 2014. Example 13.1. [download]
ā— Evimaria Terzi: Data Mining course at Boston University
http://www.cs.bu.edu/~evimaria/cs565-13.html
2
Boston University Slideshow Title Goes Here
The k-means problem
ā€¢ consider set X={x1,...,xn} of n points in Rd
ā€¢ assume that the number k is given
ā€¢ problem:
ā€¢ find k points c1,...,ck (named centers or means)
so that the cost
is minimized
3
Boston University Slideshow Title Goes Here
The k-means problem
ā€¢ k=1 and k=n are easy special cases (why?)
ā€¢ an NP-hard problem if the dimension of the
data is at least 2 (dā‰„2)
ā€¢ in practice, a simple iterative algorithm
works quite well
4
Boston University Slideshow Title Goes Here
The k-means
algorithm
ā€¢ voted among the top-10
algorithms in data mining
ā€¢ one way of solving the k-
means problem
5
K-means algorithm
6
Boston University Slideshow Title Goes Here
The k-means algorithm
1.randomly (or with another method) pick k
cluster centers {c1,...,ck}
2.for each j, set the cluster Xj to be the set of
points in X that are the closest to center cj
3.for each j let cj be the center of cluster Xj
(mean of the vectors in Xj)
1.repeat (go to step 2) until convergence
7
Boston University Slideshow Title Goes Here
Sample execution
8
1-dimensional clustering exercise
Exercise:
ā— For the data in the figure
ā— Run k-means with k=2 and initial centroids u1=2, u2=4
(Verify: last centroids are 18 units apart)
ā— Try with k=3 and initialization 2,3,30
http://www.dataminingbook.info/pmwiki.php/Main/BookDownload Exercise 13.1
9
Limitations of k-means
ā— Clusters of different size
ā— Clusters of different density
ā— Clusters of non-globular shape
ā— Sensitive to initialization
10
Boston University Slideshow Title Goes Here
Limitations of k-means: different sizes
11
Boston University Slideshow Title Goes Here
Limitations of k-means: different
density
12
Boston University Slideshow Title Goes Here
Limitations of k-means: non-spherical
shapes
13
Boston University Slideshow Title Goes Here
Effects of bad initialization
14
Boston University Slideshow Title Goes Here
k-means algorithm
ā€¢ finds a local optimum
ā€¢ often converges quickly
but not always
ā€¢ the choice of initial points can have large
influence in the result
ā€¢ tends to find spherical clusters
ā€¢ outliers can cause a problem
ā€¢ different densities may cause a problem
15
Advanced: k-means initialization
16
Boston University Slideshow Title Goes Here
Initialization
ā€¢ random initialization
ā€¢ random, but repeat many times and take the
best solution
ā€¢ helps, but solution can still be bad
ā€¢ pick points that are distant to each other
ā€¢ k-means++
ā€¢ provable guarantees
17
Boston University Slideshow Title Goes Here
k-means++
David Arthur and Sergei Vassilvitskii
k-means++: The advantages of careful
seeding
SODA 2007
18
Boston University Slideshow Title Goes Here
k-means algorithm: random
initialization
19
Boston University Slideshow Title Goes Here
k-means algorithm: random
initialization
20
Boston University Slideshow Title Goes Here
1
2
3
4
k-means algorithm:
initialization with further-first
traversal
21
Boston University Slideshow Title Goes Here
k-means algorithm:
initialization with further-first
traversal
22
Boston University Slideshow Title Goes Here
1
2
3
but... sensitive to outliers
23
Boston University Slideshow Title Goes Here
but... sensitive to outliers
24
Boston University Slideshow Title Goes Here
Here random may work well
25
Boston University Slideshow Title Goes Here
k-means++ algorithm
ā€¢ interpolate between the two methods
ā€¢ let D(x) be the distance between x and the
nearest center selected so far
ā€¢ choose next center with probability proportional to
(D(x))a = Da(x)
ļ€æ aĀ =Ā 0Ā Ā Ā Ā Ā Ā randomĀ initialization
ļ€æ aĀ =Ā āˆž furthestĀ­firstĀ traversal
ļ€æ aĀ =Ā 2Ā Ā Ā Ā Ā Ā kĀ­means++Ā 
26
Boston University Slideshow Title Goes Here
k-means++ algorithm
ā€¢ initialization phase:
ā€¢ choose the first center uniformly at random
ā€¢ choose next center with probability proportional
to D2(x)
ā€¢ iteration phase:
ā€¢ iterate as in the k-means algorithm until
convergence
27
Boston University Slideshow Title Goes Here
k-means++ initialization
1
2
3
28
Boston University Slideshow Title Goes Here
k-means++ result
29
Boston University Slideshow Title Goes Here
ā€¢ approximation guarantee comes just from the
first iteration (initialization)
ā€¢ subsequent iterations can only improve cost
k-means++ provable guarantee
30
Boston University Slideshow Title Goes Here
Lesson learned
ā€¢ no reason to use k-means and not k-means++
ā€¢ k-means++ :
ā€¢ easy to implement
ā€¢ provable guarantee
ā€¢ works well in practice
31
k-means--
ā— Algorithm 4.1 in [Chawla & Gionis SDM 2013]

More Related Content

What's hot

K means clustering
K means clusteringK means clustering
K means clusteringkeshav goyal
Ā 
K mean-clustering
K mean-clusteringK mean-clustering
K mean-clusteringAfzaal Subhani
Ā 
K MEANS CLUSTERING
K MEANS CLUSTERINGK MEANS CLUSTERING
K MEANS CLUSTERINGsingh7599
Ā 
K-Means clustring @jax
K-Means clustring @jaxK-Means clustring @jax
K-Means clustring @jaxAjay Iet
Ā 
Clustering, k-means clustering
Clustering, k-means clusteringClustering, k-means clustering
Clustering, k-means clusteringMegha Sharma
Ā 
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...Simplilearn
Ā 
CC282 Unsupervised Learning (Clustering) Lecture 7 slides for ...
CC282 Unsupervised Learning (Clustering) Lecture 7 slides for ...CC282 Unsupervised Learning (Clustering) Lecture 7 slides for ...
CC282 Unsupervised Learning (Clustering) Lecture 7 slides for ...butest
Ā 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clusteringKrish_ver2
Ā 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methodsKrish_ver2
Ā 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based ClusteringSSA KPI
Ā 
Clustering
ClusteringClustering
ClusteringRashmi Bhat
Ā 
K Means Clustering Algorithm | K Means Example in Python | Machine Learning A...
K Means Clustering Algorithm | K Means Example in Python | Machine Learning A...K Means Clustering Algorithm | K Means Example in Python | Machine Learning A...
K Means Clustering Algorithm | K Means Example in Python | Machine Learning A...Edureka!
Ā 
K mean-clustering algorithm
K mean-clustering algorithmK mean-clustering algorithm
K mean-clustering algorithmparry prabhu
Ā 
K means clustering
K means clusteringK means clustering
K means clusteringKuppusamy P
Ā 

What's hot (20)

K means clustering
K means clusteringK means clustering
K means clustering
Ā 
Presentation on K-Means Clustering
Presentation on K-Means ClusteringPresentation on K-Means Clustering
Presentation on K-Means Clustering
Ā 
K mean-clustering
K mean-clusteringK mean-clustering
K mean-clustering
Ā 
K MEANS CLUSTERING
K MEANS CLUSTERINGK MEANS CLUSTERING
K MEANS CLUSTERING
Ā 
K-Means clustring @jax
K-Means clustring @jaxK-Means clustring @jax
K-Means clustring @jax
Ā 
Clustering, k-means clustering
Clustering, k-means clusteringClustering, k-means clustering
Clustering, k-means clustering
Ā 
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
Ā 
CC282 Unsupervised Learning (Clustering) Lecture 7 slides for ...
CC282 Unsupervised Learning (Clustering) Lecture 7 slides for ...CC282 Unsupervised Learning (Clustering) Lecture 7 slides for ...
CC282 Unsupervised Learning (Clustering) Lecture 7 slides for ...
Ā 
Fuzzy Clustering(C-means, K-means)
Fuzzy Clustering(C-means, K-means)Fuzzy Clustering(C-means, K-means)
Fuzzy Clustering(C-means, K-means)
Ā 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clustering
Ā 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methods
Ā 
Clique and sting
Clique and stingClique and sting
Clique and sting
Ā 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based Clustering
Ā 
K-Means manual work
K-Means manual workK-Means manual work
K-Means manual work
Ā 
Clustering
ClusteringClustering
Clustering
Ā 
Dbscan algorithom
Dbscan algorithomDbscan algorithom
Dbscan algorithom
Ā 
KNN.pptx
KNN.pptxKNN.pptx
KNN.pptx
Ā 
K Means Clustering Algorithm | K Means Example in Python | Machine Learning A...
K Means Clustering Algorithm | K Means Example in Python | Machine Learning A...K Means Clustering Algorithm | K Means Example in Python | Machine Learning A...
K Means Clustering Algorithm | K Means Example in Python | Machine Learning A...
Ā 
K mean-clustering algorithm
K mean-clustering algorithmK mean-clustering algorithm
K mean-clustering algorithm
Ā 
K means clustering
K means clusteringK means clustering
K means clustering
Ā 

Viewers also liked

Clustering, k means algorithm
Clustering, k means algorithmClustering, k means algorithm
Clustering, k means algorithmJunyoung Park
Ā 
K means Clustering
K means ClusteringK means Clustering
K means ClusteringEdureka!
Ā 
slides CĆ©line Beji
slides CĆ©line Bejislides CĆ©line Beji
slides CĆ©line BejiChristian Robert
Ā 
Enhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial DatasetEnhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial DatasetAlaaZ
Ā 
K-Means, its Variants and its Applications
K-Means, its Variants and its ApplicationsK-Means, its Variants and its Applications
K-Means, its Variants and its ApplicationsVarad Meru
Ā 
Decision tree Using c4.5 Algorithm
Decision tree Using c4.5 AlgorithmDecision tree Using c4.5 Algorithm
Decision tree Using c4.5 AlgorithmMohd. Noor Abdul Hamid
Ā 
Kmeans initialization
Kmeans initializationKmeans initialization
Kmeans initializationdjempol
Ā 
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...Edureka!
Ā 
K means clustering algorithm
K means clustering algorithmK means clustering algorithm
K means clustering algorithmDarshak Mehta
Ā 
Large Scale Data Clustering: an overview
Large Scale Data Clustering: an overviewLarge Scale Data Clustering: an overview
Large Scale Data Clustering: an overviewVahid Mirjalili
Ā 
Detecting Algorithmic Bias (keynote at DIR 2016)
Detecting Algorithmic Bias (keynote at DIR 2016)Detecting Algorithmic Bias (keynote at DIR 2016)
Detecting Algorithmic Bias (keynote at DIR 2016)Carlos Castillo (ChaTo)
Ā 
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16MLconf
Ā 
15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learningAnil Yadav
Ā 

Viewers also liked (20)

Clustering, k means algorithm
Clustering, k means algorithmClustering, k means algorithm
Clustering, k means algorithm
Ā 
K means Clustering
K means ClusteringK means Clustering
K means Clustering
Ā 
slides CĆ©line Beji
slides CĆ©line Bejislides CĆ©line Beji
slides CĆ©line Beji
Ā 
Enhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial DatasetEnhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial Dataset
Ā 
K-Means, its Variants and its Applications
K-Means, its Variants and its ApplicationsK-Means, its Variants and its Applications
K-Means, its Variants and its Applications
Ā 
Decision tree Using c4.5 Algorithm
Decision tree Using c4.5 AlgorithmDecision tree Using c4.5 Algorithm
Decision tree Using c4.5 Algorithm
Ā 
Kmeans initialization
Kmeans initializationKmeans initialization
Kmeans initialization
Ā 
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
Ā 
K means
K meansK means
K means
Ā 
Project PPT
Project PPTProject PPT
Project PPT
Ā 
Neural nw k means
Neural nw k meansNeural nw k means
Neural nw k means
Ā 
K means clustering algorithm
K means clustering algorithmK means clustering algorithm
K means clustering algorithm
Ā 
Databeers: Big Crisis Data
Databeers: Big Crisis DataDatabeers: Big Crisis Data
Databeers: Big Crisis Data
Ā 
Large Scale Data Clustering: an overview
Large Scale Data Clustering: an overviewLarge Scale Data Clustering: an overview
Large Scale Data Clustering: an overview
Ā 
Big Crisis Data for ISPC
Big Crisis Data for ISPCBig Crisis Data for ISPC
Big Crisis Data for ISPC
Ā 
Detecting Algorithmic Bias (keynote at DIR 2016)
Detecting Algorithmic Bias (keynote at DIR 2016)Detecting Algorithmic Bias (keynote at DIR 2016)
Detecting Algorithmic Bias (keynote at DIR 2016)
Ā 
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Ā 
15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning15857 cse422 unsupervised-learning
15857 cse422 unsupervised-learning
Ā 
Data miningpresentation
Data miningpresentationData miningpresentation
Data miningpresentation
Ā 
Fairness-Aware Data Mining
Fairness-Aware Data MiningFairness-Aware Data Mining
Fairness-Aware Data Mining
Ā 

Similar to K-Means Algorithm

The Million Domain Challenge: Broadcast Email Prioritization by Cross-domain ...
The Million Domain Challenge: Broadcast Email Prioritization by Cross-domain ...The Million Domain Challenge: Broadcast Email Prioritization by Cross-domain ...
The Million Domain Challenge: Broadcast Email Prioritization by Cross-domain ...Daiki Tanaka
Ā 
On-Homomorphic-Encryption-and-Secure-Computation.ppt
On-Homomorphic-Encryption-and-Secure-Computation.pptOn-Homomorphic-Encryption-and-Secure-Computation.ppt
On-Homomorphic-Encryption-and-Secure-Computation.pptssuser85a33d
Ā 
Learning multifractal structure in large networks (Purdue ML Seminar)
Learning multifractal structure in large networks (Purdue ML Seminar)Learning multifractal structure in large networks (Purdue ML Seminar)
Learning multifractal structure in large networks (Purdue ML Seminar)Austin Benson
Ā 
Deep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender SystemsDeep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender SystemsBenjamin Le
Ā 
ECCV WS 2012 (Frank)
ECCV WS 2012 (Frank)ECCV WS 2012 (Frank)
ECCV WS 2012 (Frank)Chun-Hao Huang
Ā 
Introduction to Big Data Science
Introduction to Big Data ScienceIntroduction to Big Data Science
Introduction to Big Data ScienceAlbert Bifet
Ā 
Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford MapR Technologies
Ā 
IOEfficientParalleMatrixMultiplication_present
IOEfficientParalleMatrixMultiplication_presentIOEfficientParalleMatrixMultiplication_present
IOEfficientParalleMatrixMultiplication_presentShubham Joshi
Ā 
DLBLR talk
DLBLR talkDLBLR talk
DLBLR talkAnuj Gupta
Ā 
Deep Learning Bangalore meet up
Deep Learning Bangalore meet up Deep Learning Bangalore meet up
Deep Learning Bangalore meet up Satyam Saxena
Ā 
SP18 Generative Design - Week 8 - Optimization
SP18 Generative Design - Week 8 - OptimizationSP18 Generative Design - Week 8 - Optimization
SP18 Generative Design - Week 8 - OptimizationDanil Nagy
Ā 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...Sebastian Raschka
Ā 
More investment in Research and Development for better Education in the future?
More investment in Research and Development for better Education in the future?More investment in Research and Development for better Education in the future?
More investment in Research and Development for better Education in the future?Dhafer Malouche
Ā 
Fuzzy c means clustering protocol for wireless sensor networks
Fuzzy c means clustering protocol for wireless sensor networksFuzzy c means clustering protocol for wireless sensor networks
Fuzzy c means clustering protocol for wireless sensor networksmourya chandra
Ā 
Poggi analytics - distance - 1a
Poggi   analytics - distance - 1aPoggi   analytics - distance - 1a
Poggi analytics - distance - 1aGaston Liberman
Ā 
Lecture1
Lecture1Lecture1
Lecture1tarikh007
Ā 
Evaluation of programs codes using machine learning
Evaluation of programs codes using machine learningEvaluation of programs codes using machine learning
Evaluation of programs codes using machine learningVivek Maskara
Ā 
lecture15-neural-nets (2).pptx
lecture15-neural-nets (2).pptxlecture15-neural-nets (2).pptx
lecture15-neural-nets (2).pptxanjithaba
Ā 

Similar to K-Means Algorithm (20)

The Million Domain Challenge: Broadcast Email Prioritization by Cross-domain ...
The Million Domain Challenge: Broadcast Email Prioritization by Cross-domain ...The Million Domain Challenge: Broadcast Email Prioritization by Cross-domain ...
The Million Domain Challenge: Broadcast Email Prioritization by Cross-domain ...
Ā 
On-Homomorphic-Encryption-and-Secure-Computation.ppt
On-Homomorphic-Encryption-and-Secure-Computation.pptOn-Homomorphic-Encryption-and-Secure-Computation.ppt
On-Homomorphic-Encryption-and-Secure-Computation.ppt
Ā 
Learning multifractal structure in large networks (Purdue ML Seminar)
Learning multifractal structure in large networks (Purdue ML Seminar)Learning multifractal structure in large networks (Purdue ML Seminar)
Learning multifractal structure in large networks (Purdue ML Seminar)
Ā 
Deep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender SystemsDeep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender Systems
Ā 
ECCV WS 2012 (Frank)
ECCV WS 2012 (Frank)ECCV WS 2012 (Frank)
ECCV WS 2012 (Frank)
Ā 
Introduction to Big Data Science
Introduction to Big Data ScienceIntroduction to Big Data Science
Introduction to Big Data Science
Ā 
Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford
Ā 
IOEfficientParalleMatrixMultiplication_present
IOEfficientParalleMatrixMultiplication_presentIOEfficientParalleMatrixMultiplication_present
IOEfficientParalleMatrixMultiplication_present
Ā 
DLBLR talk
DLBLR talkDLBLR talk
DLBLR talk
Ā 
Deep Learning Bangalore meet up
Deep Learning Bangalore meet up Deep Learning Bangalore meet up
Deep Learning Bangalore meet up
Ā 
SP18 Generative Design - Week 8 - Optimization
SP18 Generative Design - Week 8 - OptimizationSP18 Generative Design - Week 8 - Optimization
SP18 Generative Design - Week 8 - Optimization
Ā 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
Ā 
More investment in Research and Development for better Education in the future?
More investment in Research and Development for better Education in the future?More investment in Research and Development for better Education in the future?
More investment in Research and Development for better Education in the future?
Ā 
Fuzzy c means clustering protocol for wireless sensor networks
Fuzzy c means clustering protocol for wireless sensor networksFuzzy c means clustering protocol for wireless sensor networks
Fuzzy c means clustering protocol for wireless sensor networks
Ā 
Poggi analytics - distance - 1a
Poggi   analytics - distance - 1aPoggi   analytics - distance - 1a
Poggi analytics - distance - 1a
Ā 
Lecture1
Lecture1Lecture1
Lecture1
Ā 
Lecture 3.pdf
Lecture 3.pdfLecture 3.pdf
Lecture 3.pdf
Ā 
Evaluation of programs codes using machine learning
Evaluation of programs codes using machine learningEvaluation of programs codes using machine learning
Evaluation of programs codes using machine learning
Ā 
lecture15-neural-nets (2).pptx
lecture15-neural-nets (2).pptxlecture15-neural-nets (2).pptx
lecture15-neural-nets (2).pptx
Ā 
Entity-Relationship Queries over Wikipedia
Entity-Relationship Queries over WikipediaEntity-Relationship Queries over Wikipedia
Entity-Relationship Queries over Wikipedia
Ā 

More from Carlos Castillo (ChaTo)

Finding High Quality Content in Social Media
Finding High Quality Content in Social MediaFinding High Quality Content in Social Media
Finding High Quality Content in Social MediaCarlos Castillo (ChaTo)
Ā 
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017Carlos Castillo (ChaTo)
Ā 
Observational studies in social media
Observational studies in social mediaObservational studies in social media
Observational studies in social mediaCarlos Castillo (ChaTo)
Ā 
Graph Partitioning and Spectral Methods
Graph Partitioning and Spectral MethodsGraph Partitioning and Spectral Methods
Graph Partitioning and Spectral MethodsCarlos Castillo (ChaTo)
Ā 
Text similarity and the vector space model
Text similarity and the vector space modelText similarity and the vector space model
Text similarity and the vector space modelCarlos Castillo (ChaTo)
Ā 
Keynote talk: Big Crisis Data, an Open Invitation
Keynote talk: Big Crisis Data, an Open InvitationKeynote talk: Big Crisis Data, an Open Invitation
Keynote talk: Big Crisis Data, an Open InvitationCarlos Castillo (ChaTo)
Ā 

More from Carlos Castillo (ChaTo) (20)

Finding High Quality Content in Social Media
Finding High Quality Content in Social MediaFinding High Quality Content in Social Media
Finding High Quality Content in Social Media
Ā 
When no clicks are good news
When no clicks are good newsWhen no clicks are good news
When no clicks are good news
Ā 
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Ā 
Discrimination Discovery
Discrimination DiscoveryDiscrimination Discovery
Discrimination Discovery
Ā 
Observational studies in social media
Observational studies in social mediaObservational studies in social media
Observational studies in social media
Ā 
Natural experiments
Natural experimentsNatural experiments
Natural experiments
Ā 
Content-based link prediction
Content-based link predictionContent-based link prediction
Content-based link prediction
Ā 
Link prediction
Link predictionLink prediction
Link prediction
Ā 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
Ā 
Graph Partitioning and Spectral Methods
Graph Partitioning and Spectral MethodsGraph Partitioning and Spectral Methods
Graph Partitioning and Spectral Methods
Ā 
Finding Dense Subgraphs
Finding Dense SubgraphsFinding Dense Subgraphs
Finding Dense Subgraphs
Ā 
Graph Evolution Models
Graph Evolution ModelsGraph Evolution Models
Graph Evolution Models
Ā 
Link-Based Ranking
Link-Based RankingLink-Based Ranking
Link-Based Ranking
Ā 
Text Indexing / Inverted Indices
Text Indexing / Inverted IndicesText Indexing / Inverted Indices
Text Indexing / Inverted Indices
Ā 
Indexing
IndexingIndexing
Indexing
Ā 
Text Summarization
Text SummarizationText Summarization
Text Summarization
Ā 
Hierarchical Clustering
Hierarchical ClusteringHierarchical Clustering
Hierarchical Clustering
Ā 
Clustering
ClusteringClustering
Clustering
Ā 
Text similarity and the vector space model
Text similarity and the vector space modelText similarity and the vector space model
Text similarity and the vector space model
Ā 
Keynote talk: Big Crisis Data, an Open Invitation
Keynote talk: Big Crisis Data, an Open InvitationKeynote talk: Big Crisis Data, an Open Invitation
Keynote talk: Big Crisis Data, an Open Invitation
Ā 

Recently uploaded

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
Ā 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
Ā 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
Ā 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
Ā 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
Ā 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
Ā 
šŸ¬ The future of MySQL is Postgres šŸ˜
šŸ¬  The future of MySQL is Postgres   šŸ˜šŸ¬  The future of MySQL is Postgres   šŸ˜
šŸ¬ The future of MySQL is Postgres šŸ˜RTylerCroy
Ā 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
Ā 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
Ā 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
Ā 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
Ā 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
Ā 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
Ā 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
Ā 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
Ā 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
Ā 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
Ā 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
Ā 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
Ā 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
Ā 

Recently uploaded (20)

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
Ā 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
Ā 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
Ā 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
Ā 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
Ā 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
Ā 
šŸ¬ The future of MySQL is Postgres šŸ˜
šŸ¬  The future of MySQL is Postgres   šŸ˜šŸ¬  The future of MySQL is Postgres   šŸ˜
šŸ¬ The future of MySQL is Postgres šŸ˜
Ā 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
Ā 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
Ā 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
Ā 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
Ā 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
Ā 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Ā 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
Ā 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Ā 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Ā 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
Ā 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Ā 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
Ā 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Ā 

K-Means Algorithm

  • 1. 1 K-Means Class Algorithmic Methods of Data Mining Program M. Sc. Data Science University Sapienza University of Rome Semester Fall 2015 Lecturer Carlos Castillo http://chato.cl/ Sources: ā— Mohammed J. Zaki, Wagner Meira, Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, May 2014. Example 13.1. [download] ā— Evimaria Terzi: Data Mining course at Boston University http://www.cs.bu.edu/~evimaria/cs565-13.html
  • 2. 2 Boston University Slideshow Title Goes Here The k-means problem ā€¢ consider set X={x1,...,xn} of n points in Rd ā€¢ assume that the number k is given ā€¢ problem: ā€¢ find k points c1,...,ck (named centers or means) so that the cost is minimized
  • 3. 3 Boston University Slideshow Title Goes Here The k-means problem ā€¢ k=1 and k=n are easy special cases (why?) ā€¢ an NP-hard problem if the dimension of the data is at least 2 (dā‰„2) ā€¢ in practice, a simple iterative algorithm works quite well
  • 4. 4 Boston University Slideshow Title Goes Here The k-means algorithm ā€¢ voted among the top-10 algorithms in data mining ā€¢ one way of solving the k- means problem
  • 6. 6 Boston University Slideshow Title Goes Here The k-means algorithm 1.randomly (or with another method) pick k cluster centers {c1,...,ck} 2.for each j, set the cluster Xj to be the set of points in X that are the closest to center cj 3.for each j let cj be the center of cluster Xj (mean of the vectors in Xj) 1.repeat (go to step 2) until convergence
  • 7. 7 Boston University Slideshow Title Goes Here Sample execution
  • 8. 8 1-dimensional clustering exercise Exercise: ā— For the data in the figure ā— Run k-means with k=2 and initial centroids u1=2, u2=4 (Verify: last centroids are 18 units apart) ā— Try with k=3 and initialization 2,3,30 http://www.dataminingbook.info/pmwiki.php/Main/BookDownload Exercise 13.1
  • 9. 9 Limitations of k-means ā— Clusters of different size ā— Clusters of different density ā— Clusters of non-globular shape ā— Sensitive to initialization
  • 10. 10 Boston University Slideshow Title Goes Here Limitations of k-means: different sizes
  • 11. 11 Boston University Slideshow Title Goes Here Limitations of k-means: different density
  • 12. 12 Boston University Slideshow Title Goes Here Limitations of k-means: non-spherical shapes
  • 13. 13 Boston University Slideshow Title Goes Here Effects of bad initialization
  • 14. 14 Boston University Slideshow Title Goes Here k-means algorithm ā€¢ finds a local optimum ā€¢ often converges quickly but not always ā€¢ the choice of initial points can have large influence in the result ā€¢ tends to find spherical clusters ā€¢ outliers can cause a problem ā€¢ different densities may cause a problem
  • 16. 16 Boston University Slideshow Title Goes Here Initialization ā€¢ random initialization ā€¢ random, but repeat many times and take the best solution ā€¢ helps, but solution can still be bad ā€¢ pick points that are distant to each other ā€¢ k-means++ ā€¢ provable guarantees
  • 17. 17 Boston University Slideshow Title Goes Here k-means++ David Arthur and Sergei Vassilvitskii k-means++: The advantages of careful seeding SODA 2007
  • 18. 18 Boston University Slideshow Title Goes Here k-means algorithm: random initialization
  • 19. 19 Boston University Slideshow Title Goes Here k-means algorithm: random initialization
  • 20. 20 Boston University Slideshow Title Goes Here 1 2 3 4 k-means algorithm: initialization with further-first traversal
  • 21. 21 Boston University Slideshow Title Goes Here k-means algorithm: initialization with further-first traversal
  • 22. 22 Boston University Slideshow Title Goes Here 1 2 3 but... sensitive to outliers
  • 23. 23 Boston University Slideshow Title Goes Here but... sensitive to outliers
  • 24. 24 Boston University Slideshow Title Goes Here Here random may work well
  • 25. 25 Boston University Slideshow Title Goes Here k-means++ algorithm ā€¢ interpolate between the two methods ā€¢ let D(x) be the distance between x and the nearest center selected so far ā€¢ choose next center with probability proportional to (D(x))a = Da(x) ļ€æ aĀ =Ā 0Ā Ā Ā Ā Ā Ā randomĀ initialization ļ€æ aĀ =Ā āˆž furthestĀ­firstĀ traversal ļ€æ aĀ =Ā 2Ā Ā Ā Ā Ā Ā kĀ­means++Ā 
  • 26. 26 Boston University Slideshow Title Goes Here k-means++ algorithm ā€¢ initialization phase: ā€¢ choose the first center uniformly at random ā€¢ choose next center with probability proportional to D2(x) ā€¢ iteration phase: ā€¢ iterate as in the k-means algorithm until convergence
  • 27. 27 Boston University Slideshow Title Goes Here k-means++ initialization 1 2 3
  • 28. 28 Boston University Slideshow Title Goes Here k-means++ result
  • 29. 29 Boston University Slideshow Title Goes Here ā€¢ approximation guarantee comes just from the first iteration (initialization) ā€¢ subsequent iterations can only improve cost k-means++ provable guarantee
  • 30. 30 Boston University Slideshow Title Goes Here Lesson learned ā€¢ no reason to use k-means and not k-means++ ā€¢ k-means++ : ā€¢ easy to implement ā€¢ provable guarantee ā€¢ works well in practice
  • 31. 31 k-means-- ā— Algorithm 4.1 in [Chawla & Gionis SDM 2013]