K-means Clustering
Ass.-Prof. Dr.rer.nat Anna Fensel
Outline
» Introduction, learning goals
» Motivation and example
» Clustering
» K-means clustering algorithm
definition, functions, iteration process, pseudocode
» Computational complexity
» Extensions
» Tools
» Application examples
» Conclusions
» References
What should you be able to do after this lecture?
» Understand the concept of clustering, and
particularly k-means clustering
» Explain the k-means clustering algorithm
» Provide diverse usage examples of the k-means
clustering algorithm
» Understand different challenges in the use of
the k-means clustering algorithm and its
extensions/variations
Motivation: Why clustering?
What is clustering?
» Finding “natural” groupings between objects
» We want to find similar objects (e.g. documents) to treat
them in the same way
We aim at:
» High intra-cluster similarity
» Low inter-cluster similarity
Motivating example: Web document search
» A web search engine often returns thousands of pages in
response to a broad query, making it difficult for users to
browse or to identify relevant information.
» Clustering methods can be used to automatically group the
retrieved documents into a list of meaningful categories.
Is clustering typically …?
A. Supervised
B. Unsupervised
What is K-means clustering?
» k-means clustering aims to partition n observations into k clusters in
which each observation belongs to the cluster with the nearest mean
» Works in multi-dimensional spaces as well
How do we measure similarity?
Give examples of similarity measures.
» Similarity is subjective
» Its measure therefore depends on the data, the use case, and the users
» In practice it is not always obvious which metric works well;
in that case, proceed by trial and error
» Examples of measures: Euclidean distance, Manhattan distance, cosine distance
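The three measures named on this slide can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names and the sample vectors are assumptions made for the example and do not come from the lecture.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 minus the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical sample vectors, chosen to be orthogonal.
p, q = (0.0, 3.0), (4.0, 0.0)
print(euclidean(p, q))        # 5.0
print(manhattan(p, q))        # 7.0
print(cosine_distance(p, q))  # 1.0
```

Note that cosine distance ignores vector length: two vectors pointing in the same direction have a distance of (approximately) zero however far apart they are in the Euclidean sense.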
How does K-means do the clustering?
K-means is a basic and very important flat clustering algorithm.
Its objective is to minimize the average squared Euclidean distance
of values from their cluster centers, where a cluster center is defined
as the mean or centroid \mu of the values x in a cluster \omega:
\mu(\omega) = \frac{1}{|\omega|} \sum_{x \in \omega} x
Selection of centroids
» The first step of K-means is to select K randomly chosen
documents, the seeds, as initial cluster centers.
» The algorithm then moves the cluster centers around in
space in order to minimize RSS (the residual sum of squares, which
measures how well the centroids represent their cluster members).
K-means illustration step-by-step
(1 dimensional)
From: StatQuest: K-means clustering:
https://www.youtube.com/watch?v=4b5d3muPQmA
Calculating the “centrality” of the centroids
[Manning & Schütze, 2008]
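A measure of how well the centroids represent the members of their clusters
is the residual sum of squares or RSS,
the squared distance of each vector from its centroid summed over all vectors:
RSS_k = \sum_{x \in \omega_k} |x - \mu(\omega_k)|^2
RSS = \sum_{k=1}^{K} RSS_k
where \omega_k is the k-th cluster and \mu(\omega_k) is its centroid.
Our goal is to minimize RSS as far as possible. Because the number of vectors is
fixed, this is equivalent to minimizing the average squared distance.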
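As a small worked example of the RSS definition, the sketch below computes the centroid of each cluster and sums the squared distances of the members from it. The helper names and the two toy clusters are assumptions made for illustration only.

```python
def centroid(cluster):
    """Component-wise mean of the vectors in one cluster."""
    dims = len(cluster[0])
    return tuple(sum(v[d] for v in cluster) / len(cluster) for d in range(dims))

def rss(clusters):
    """Squared distance of each vector from its own centroid, summed over all vectors."""
    total = 0.0
    for cluster in clusters:
        mu = centroid(cluster)
        for v in cluster:
            total += sum((x - m) ** 2 for x, m in zip(v, mu))
    return total

# Hypothetical toy data: two clusters of two 2-D points each.
clusters = [
    [(1.0, 1.0), (3.0, 1.0)],    # centroid (2, 1), contributes 1 + 1
    [(10.0, 0.0), (10.0, 4.0)],  # centroid (10, 2), contributes 4 + 4
]
print(centroid(clusters[0]))  # (2.0, 1.0)
print(rss(clusters))          # 10.0
```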
K-means algorithm summary
[Manning & Schütze, 2008]
K-means – iteration process
[Manning & Schütze, 2008]
The algorithm repeats two steps until a stopping criterion is met:
» reassigning documents to the cluster with the closest centroid,
» recomputing each centroid based on the current members of its cluster.
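The two alternating steps can be sketched as follows. This is a minimal pure-Python illustration, not the pseudocode from the cited book: the function names, the fixed random seed, the choice to stop when assignments no longer change, and the toy points are all assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(v[d] for v in vectors) / n for d in range(len(vectors[0])))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # seeds: k randomly selected points
    assignment = None
    for _ in range(max_iter):
        # Reassignment step: each point goes to the cluster with the closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the centroids would not change either
        assignment = new_assignment
        # Recomputation step: each centroid becomes the mean of its current members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = mean(members)
    return centroids, assignment

# Hypothetical toy data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignment = kmeans(points, k=2)
print(centroids)
print(assignment)
```

On this toy data the first three points end up in one cluster and the last three in the other; which cluster receives index 0 depends on the seeds drawn.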
When should the algorithm terminate?
[Manning & Schütze, 2008]
• A fixed number of iterations has been completed.
  This condition limits the runtime of the clustering algorithm, but in some
  cases the quality of the clustering will be poor because of an insufficient
  number of iterations.
• Assignment of documents to clusters (the partitioning function γ) does not
  change between iterations.
  Except for cases with a bad local minimum, this produces a good clustering,
  but runtimes may be unacceptably long.
• Centroids do not change between iterations.
  This is equivalent to γ not changing.
• Terminate when RSS falls below a threshold.
  This criterion ensures that the clustering is of a desired quality after
  termination. In practice, we need to combine it with a bound on the number
  of iterations to guarantee termination.
• Terminate when the decrease in RSS falls below a threshold θ.
  For small θ, this indicates that we are close to convergence. Again, we need
  to combine it with a bound on the number of iterations to prevent very long
  runtimes.
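The criteria above are usually combined in practice. The sketch below shows one possible combination: unchanged assignments, a small decrease in RSS, and an iteration bound as a safety net. The function name, parameter names, and the order of the checks are assumptions for illustration.

```python
def should_stop(iteration, max_iter, prev_assignment, assignment,
                prev_rss, rss, theta):
    """Return (stop, reason) for one k-means iteration.

    prev_assignment and prev_rss may be None on the first iteration.
    """
    if assignment == prev_assignment:
        return True, "assignment unchanged"
    if prev_rss is not None and prev_rss - rss < theta:
        return True, "RSS decrease below theta"
    if iteration >= max_iter:
        return True, "iteration bound reached"
    return False, "continue"

# Hypothetical values for three different situations.
print(should_stop(3, 100, [0, 0, 1], [0, 1, 1], 12.0, 7.5, 0.01))
# (False, 'continue')
print(should_stop(4, 100, [0, 1, 1], [0, 1, 1], 7.5, 7.5, 0.01))
# (True, 'assignment unchanged')
print(should_stop(100, 100, [0, 0, 1], [0, 1, 1], 12.0, 7.5, 0.01))
# (True, 'iteration bound reached')
```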
A variation: How do you know how many clusters
you should make?
» Try different numbers of clusters.
» Compare the total within-cluster variation for each number and choose the
number beyond which the variation stops decreasing substantially.
From: StatQuest: K-means clustering:
https://www.youtube.com/watch?v=4b5d3muPQmA
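Trying several cluster numbers and watching where the variation levels off can be sketched on one-dimensional data. To keep the output reproducible, this sketch seeds the centroids deterministically from the sorted data instead of choosing them at random; that seeding rule, the function name, and the toy data are assumptions for illustration.

```python
def kmeans_1d_rss(values, k, max_iter=100):
    """Run 1-D k-means and return the final RSS (total within-cluster variation)."""
    data = sorted(values)
    # Assumption: deterministic seeds spread evenly over the sorted data.
    centroids = [data[(2 * j + 1) * len(data) // (2 * k)] for j in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda i: (x - centroids[i]) ** 2)
            clusters[nearest].append(x)
        new_centroids = [
            sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sum((x - centroids[j]) ** 2 for j, c in enumerate(clusters) for x in c)

# Hypothetical data with three visible groups.
data = [1, 2, 3, 11, 12, 13, 21, 22, 23]
for k in range(1, 5):
    print(k, kmeans_1d_rss(data, k))
# The RSS drops sharply up to k = 3 and only slightly afterwards,
# which suggests three clusters for this data.
```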
K-means computational aspects
» K-means converges, but there is unfortunately no guarantee
that a global minimum in the objective function will be
reached.
This is a particular problem if a document set contains many outliers,
documents that are far from any other documents and therefore do
not fit well into any cluster. We may end up with a singleton cluster (a
cluster with only one document) even though there is probably a
clustering with lower RSS.
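The dependence on the seeds can be shown with four points at the corners of a wide rectangle. Both runs below converge, but only one reaches the lower RSS. The function name and the data are assumptions for illustration, and the seeds are passed in explicitly rather than drawn at random so that the outcome is reproducible.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def run_kmeans(points, seeds, max_iter=100):
    """K-means from explicitly given seeds; returns (centroids, rss)."""
    centroids = list(seeds)
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            clusters[j].append(p)
        new_centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    rss = sum(sq_dist(p, centroids[j]) for j, members in enumerate(clusters) for p in members)
    return centroids, rss

# Hypothetical data: corners of a 4 x 1 rectangle.
corners = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Seeds on the left and right side: clusters split left/right.
print(run_kmeans(corners, [(0.0, 0.0), (4.0, 0.0)]))
# ([(0.0, 0.5), (4.0, 0.5)], 1.0)

# Both seeds on the left side: clusters split bottom/top and stay that way.
print(run_kmeans(corners, [(0.0, 0.0), (0.0, 1.0)]))
# ([(2.0, 0.0), (2.0, 1.0)], 16.0)
```

A common remedy, not covered on this slide, is to run the algorithm from several different seed sets and keep the result with the lowest RSS.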
What is the time complexity of K-means?
[Manning & Schütze, 2008]
» Most of the time is spent on computing vector distances.
One such operation costs Θ(M), where M is the dimensionality of the vectors.
» The reassignment step computes KN distances, so its overall
complexity is Θ(KNM).
» In the recomputation step, each vector gets added to a
centroid once, so the complexity of this step is Θ(NM).
» For a fixed number of iterations I, the overall complexity is
therefore Θ(IKNM).
Extensions
There are numerous extensions to K-means clustering, e.g.:
» K-means clustering can be generalized, e.g. into a Gaussian
mixture model.
» The efficiency problem can be addressed, e.g. by K-medoids, a
variant of K-means that computes medoids instead of
centroids as cluster centers.
The medoid of a cluster is defined as the value that is closest to the centroid.
Distance computations are faster in this case.
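Following the definition on this slide, a medoid can be found by computing the centroid and then picking the cluster member nearest to it. The function names and the sample cluster are assumptions for illustration. Other texts define the medoid as the member with the smallest total distance to all other members, which can give a different result.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(cluster):
    """Component-wise mean; usually not itself a member of the cluster."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def medoid(cluster):
    """Cluster member closest to the centroid (the slide's definition)."""
    mu = centroid(cluster)
    return min(cluster, key=lambda v: sq_dist(v, mu))

# Hypothetical cluster of three 2-D points.
cluster = [(1.0, 1.0), (2.0, 1.0), (6.0, 1.0)]
print(centroid(cluster))  # (3.0, 1.0)
print(medoid(cluster))    # (2.0, 1.0)
```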
Tool support
A number of tools implementing k-means clustering are
available:
» Open source, e.g. Apache Spark, Torch, R
» Proprietary, e.g. MATLAB, Mathematica, SAP HANA
Application example: Image compression
» Aim: reduce the size of an image
» Question: in a space of how many dimensions are we working here?
Image credit: Shinichi Tamura @ Slideshare
Application example: Retail – recommendation
and yield management
» User profiles/personas: similar
purchase behavior…
» Product profiles: similar
selling patterns
» Deciding when to discount
product groups
Question: in a space of how many dimensions are we working here?
Summary
» Name the 5 most important points you have learnt from
today’s lecture.
» …
» …
» …
» …
» …
» (and let’s compare the points)
References
» Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A k-means clustering
algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics),
28(1), 100-108.
» Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information
Retrieval. Cambridge University Press.
» K-Means Clustering at Wikipedia: https://en.wikipedia.org/wiki/K-means_clustering
» StatQuest: K-means clustering:
https://www.youtube.com/watch?v=4b5d3muPQmA
» Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (2007). Section 16.1:
Gaussian Mixture Models and k-Means Clustering. In Numerical Recipes: The
Art of Scientific Computing (3rd ed.). New York: Cambridge University Press.
ISBN 978-0-521-88068-8.
www.sti-innsbruck.at
Thank you for your attention.
Questions?

Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 

K-means Clustering

  • 2. Outline » Introduction, learning goals » Motivation and example » Clustering » K-means clustering algorithm: definition, functions, iteration process, pseudocode » Computational complexity » Extensions » Tools » Application examples » Conclusions » References
  • 4. What should you be able to do after this lecture? » Understand the concept of clustering, and particularly k-means clustering » Explain the k-means clustering algorithm » Provide diverse usage examples of the k-means clustering algorithm » Understand different challenges in the use of the k-means clustering algorithm and its extensions/variations
  • 5. Motivation: Why clustering? What is clustering?
  • 6. Motivation: Why clustering? What is clustering? » Finding “natural” groupings among objects » We want to find similar objects (e.g. documents) so that we can treat them in the same way. We aim at: » High intra-cluster similarity » Low inter-cluster similarity
  • 7. Motivating example: Web document search » A web search engine often returns thousands of pages in response to a broad query, making it difficult for users to browse or to identify relevant information. » Clustering methods can be used to automatically group the retrieved documents into a list of meaningful categories.
  • 8. Is clustering typically …? A. Supervised B. Unsupervised
  • 9. Is clustering typically …? A. Supervised B. Unsupervised » Answer: B. Clustering is typically unsupervised: it groups objects without labeled training examples.
  • 11. What is K-means clustering? » K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean » Works for multi-dimensional spaces as well
  • 12. How do we measure similarity? Give examples of similarity measures. » … » …
  • 13. How do we measure similarity? Give examples of similarity measures. » Similarity is subjective » Its measure therefore depends on the data, the use case, and the users » In practice it is not always obvious which metric works well; in that case, trial and error is a reasonable approach » Examples of measures: Euclidean distance, Manhattan distance, cosine distance
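The three measures named on the slide can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names and the sample vectors are assumptions made for the example, not part of the lecture.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 minus the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical sample vectors: q points in the same direction as p.
p, q = (3.0, 4.0), (6.0, 8.0)
print(euclidean(p, q))        # 5.0
print(manhattan(p, q))        # 7.0
print(cosine_distance(p, q))  # 0.0, because only the direction matters
```

The last line shows why the choice of measure matters: p and q are far apart by Euclidean and Manhattan distance, yet identical under cosine distance.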
  • 14. How does K-means do the clustering? K-means is a very important, basic flat clustering algorithm. Its objective is to minimize the average squared Euclidean distance of values from their cluster centers, where a cluster center is defined as the mean or centroid $\vec{\mu}$ of the values in a cluster $\omega$: $\vec{\mu}(\omega) = \frac{1}{|\omega|} \sum_{\vec{x} \in \omega} \vec{x}$
  • 15. Selection of centroids » The first step of K-means is to select as initial cluster centers K randomly selected documents, the seeds. » The algorithm then moves the cluster centers around in space in order to minimize RSS (the residual sum of squares, a measure of how well the centroids represent the members of their clusters).
  • 16. K-means illustration step-by-step (1 dimensional) From: StatQuest: K-means clustering: https://www.youtube.com/watch?v=4b5d3muPQmA
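  • 17. Calculating the “centrality” of the centroids [Manning & Schütze, 2008] A measure of how well the centroids represent the members of their clusters is the residual sum of squares or RSS, the squared distance of each vector from its centroid summed over all vectors: $\mathrm{RSS}_k = \sum_{\vec{x} \in \omega_k} |\vec{x} - \vec{\mu}(\omega_k)|^2$ and $\mathrm{RSS} = \sum_{k=1}^{K} \mathrm{RSS}_k$, where $\omega_k$ is the k-th cluster and $\vec{\mu}(\omega_k)$ its centroid. Our goal is to minimize RSS, which is equivalent to minimizing the average squared distance because the number of vectors N is fixed.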
  • 18. K-means algorithm summary [Manning & Schütze, 2008]
  • 19. K-means: iteration process [Manning & Schütze, 2008] Each iteration repeats two steps: » reassignment: assigning each document to the cluster with the closest centroid » recomputation: recomputing each centroid as the mean of the current members of its cluster
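The seed selection and the two iteration steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the pseudocode from the cited book: the function names, the toy points, the fixed random seed and the handling of empty clusters (the old centroid is kept) are choices made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean (centroid) of a non-empty list of vectors."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def rss(points, centroids, assignment):
    """Residual sum of squares of the points around their assigned centroids."""
    return sum(squared_distance(p, centroids[k]) for p, k in zip(points, assignment))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # seeds: k randomly selected points
    assignment = None
    for _ in range(max_iter):
        # Reassignment: each point goes to the cluster with the closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # the assignment did not change, so the centroids will not either
        assignment = new_assignment
        # Recomputation: each centroid becomes the mean of its current members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = mean(members)
    return centroids, assignment


# Hypothetical toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignment = kmeans(points, k=2)
print(assignment)
print(rss(points, centroids, assignment))
```

Which cluster receives index 0 depends on the random seeds, so only the grouping of the points is meaningful, not the label values.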
  • 20. How to decide when to terminate the algorithm? [Manning & Schütze, 2008] » A fixed number of iterations has been completed. This condition limits the runtime of the clustering algorithm, but in some cases the quality of the clustering will be poor because of an insufficient number of iterations. » The assignment of documents to clusters (the partitioning function $\gamma$) does not change between iterations. Except for cases with a bad local minimum, this produces a good clustering, but runtimes may be unacceptably long. » The centroids $\vec{\mu}_k$ do not change between iterations. This is equivalent to $\gamma$ not changing. » RSS falls below a threshold. This criterion ensures that the clustering is of a desired quality after termination. In practice, we need to combine it with a bound on the number of iterations to guarantee termination. » The decrease in RSS falls below a threshold $\theta$. For small $\theta$, this indicates that we are close to convergence. Again, we need to combine it with a bound on the number of iterations to prevent very long runtimes.
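The last criterion, combined with an iteration bound as the slide recommends, might be checked like this. The helper name `should_stop`, its parameters and the sample RSS values are hypothetical and serve only as a sketch.

```python
def should_stop(rss_history, max_iter, theta):
    """Decide whether to stop, given the RSS value recorded after each iteration.

    Stops when the iteration bound is reached, or when the most recent
    decrease in RSS is smaller than the threshold theta.
    """
    iterations_done = len(rss_history)
    if iterations_done >= max_iter:
        return True
    if iterations_done >= 2:
        decrease = rss_history[-2] - rss_history[-1]
        return decrease < theta
    return False


# Hypothetical RSS values recorded after successive iterations.
print(should_stop([10.0, 5.0], max_iter=100, theta=1e-3))            # False
print(should_stop([10.0, 5.0, 4.9999], max_iter=100, theta=1e-3))    # True
print(should_stop([10.0, 5.0], max_iter=2, theta=1e-3))              # True
```

The iteration bound is tested first so that the procedure terminates even if RSS keeps decreasing by more than theta.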
  • 21. A variation: How do you know how many clusters you should make? » One option is to try different numbers of clusters » and choose the number at which the reduction in total within-cluster variation levels off. From: StatQuest: K-means clustering: https://www.youtube.com/watch?v=4b5d3muPQmA
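Trying several cluster numbers and watching where the variation levels off can be sketched on one-dimensional data, matching the 1-dimensional illustration earlier in the lecture. To keep the output reproducible, this sketch replaces random seeds with evenly spread seeds taken from the sorted values; that seeding rule, the function names and the sample values are assumptions made for the example.

```python
def spread_seeds(values, k):
    """Pick k distinct seeds spread evenly over the sorted values (assumed rule)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]


def kmeans_rss_1d(values, k, max_iter=100):
    """Run 1-D K-means and return the residual sum of squares at the end."""
    centers = spread_seeds(values, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: (v - centers[j]) ** 2)
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        new_centers = [
            sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min((v - c) ** 2 for c in centers) for v in values)


# Hypothetical measurements forming three visible groups.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.0, 9.2, 8.7]
rss_by_k = {k: kmeans_rss_1d(values, k) for k in range(1, 6)}
for k, total in rss_by_k.items():
    print(k, round(total, 3))
```

On this data the RSS drops sharply up to k = 3 and only marginally afterwards, which is the kind of "elbow" the slide suggests looking for.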
  • 23. K-means computational aspects » K-means converges, but there is unfortunately no guarantee that a global minimum in the objective function will be reached. This is a particular problem if a document set contains many outliers, documents that are far from any other documents and therefore do not fit well into any cluster. We may end up with a singleton cluster (a cluster with only one document) even though there is probably a clustering with lower RSS.
  • 24. What is the time complexity of K-means? [Manning & Schütze, 2008] » Most of the time is spent on computing vector distances. One such operation costs $\Theta(M)$, where M is the dimensionality of the vectors. » The reassignment step computes KN distances (K clusters, N vectors), so its overall complexity is $\Theta(KNM)$. » In the recomputation step, each vector gets added to a centroid once, so the complexity of this step is $\Theta(NM)$. » For a fixed number of iterations I, the overall complexity is therefore $\Theta(IKNM)$.
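The cost of the reassignment step can be made concrete by counting coordinate operations directly. The sizes below are hypothetical, and the counter is only a sketch of the argument that K·N distance computations over M dimensions give K·N·M operations per iteration.

```python
def count_reassignment_operations(points, centroids):
    """Count coordinate differences computed in one reassignment step."""
    operations = 0
    for point in points:
        for centroid in centroids:
            operations += len(point)  # one difference per dimension
    return operations


# Hypothetical sizes: N vectors, K clusters, M dimensions.
N, K, M = 1000, 5, 20
points = [[0.0] * M for _ in range(N)]
centroids = [[0.0] * M for _ in range(K)]
per_iteration = count_reassignment_operations(points, centroids)
print(per_iteration)  # 100000, i.e. K * N * M
```

Multiplying this per-iteration count by the number of iterations gives the overall linear dependence on each of the four factors.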
  • 25. Extensions There are numerous extensions to K-means clustering, e.g.: » K-means clustering can be generalized, e.g. into a Gaussian mixture model. » The efficiency problem can be addressed, e.g. by K-medoids, a variant of K-means that computes medoids instead of centroids as cluster centers. The medoid of a cluster is defined as the value that is closest to the centroid. Distance computations are faster in this case.
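Following the definition on this slide (the cluster member closest to the centroid), a medoid could be computed as below. The function names and the sample cluster are assumptions for illustration; other texts define the medoid differently, for example as the member with the smallest total distance to all other members.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def medoid(points):
    """Member of the cluster that lies closest to the cluster centroid."""
    center = centroid(points)
    return min(points, key=lambda p: squared_distance(p, center))


# Hypothetical cluster; its centroid (2.0, 1.5) is not itself a member.
cluster = [(0.0, 0.0), (4.0, 0.0), (1.0, 1.0), (3.0, 5.0)]
print(centroid(cluster))  # (2.0, 1.5)
print(medoid(cluster))    # (1.0, 1.0)
```

Unlike the centroid, the medoid is always one of the original data points.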
  • 26. Tool support A number of tools implementing k-means clustering are available: » Open source, e.g. Apache Spark, Torch, and R » Proprietary, e.g. MATLAB, Mathematica, SAP HANA
  • 28. Application example: Image compression » Aim: reduce the size of an image » Question: how many dimensions does the space we are working in have? Image credit: Shinichi Tamura @ Slideshare
  • 29. Application example: Retail – recommendation and yield management » User profiles/personas: similar purchase behavior… » Product profiles: similar selling patterns » Deciding when to discount product groups » Question: how many dimensions does the space we are working in have?
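One common reading of this example treats every pixel as a point in RGB color space and uses the cluster centroids as a small color palette. The sketch below covers only the final compression step, assuming the palette has already been found by K-means; the pixel values, the palette and the function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(pixels, palette):
    """Replace every pixel by the index of its nearest palette color."""
    return [
        min(range(len(palette)), key=lambda j: squared_distance(p, palette[j]))
        for p in pixels
    ]


# Hypothetical image of four RGB pixels and a two-color palette (the centroids).
pixels = [(250, 10, 10), (245, 5, 0), (10, 10, 240), (0, 20, 250)]
palette = [(248, 8, 5), (5, 15, 245)]

indices = quantize(pixels, palette)
reconstructed = [palette[i] for i in indices]
print(indices)        # [0, 0, 1, 1]
print(reconstructed)  # every pixel replaced by its palette color
```

The compressed image stores the palette once plus one small index per pixel instead of three color values per pixel, at the price of some color accuracy.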
  • 31. Summary » List the 5 most important points you have learnt from today’s lecture. » … » … » … » … » … » (and let’s compare the points)
  • 32. References » Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1), 100-108. » Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. » K-Means Clustering at Wikipedia: https://en.wikipedia.org/wiki/K-means_clustering » StatQuest: K-means clustering: https://www.youtube.com/watch?v=4b5d3muPQmA » Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (2007). "Section 16.1. Gaussian Mixture Models and k-Means Clustering". Numerical Recipes: The Art of Scientific Computing (3rd ed.). New York: Cambridge University Press. ISBN 978-0-521-88068-8.
  • 33. www.sti-innsbruck.at Thank you for your attention. Questions?