"k-means-clustering" presentation @ Papers We Love Bucharest

•Download as PPTX, PDF•

3 likes•552 views

Adrian Florea

6th presentation @ Papers We Love - Bucharest Chapter on 22nd February, 2016

Software

k-means Clustering
Papers We Love
Bucharest Chapter
February 22, 2016

What is clustering?
◉branch of Machine Learning (unsupervised learning i.e. find
underlying patterns in the data)
◉method for identifying data items that closely resemble one
another, assembling them into clusters [ODS]

Clustering applications
◉Customer segmentation

Definitions and notations
◉data to be clustered (N data points)
◉iteratively refined clustering solution
◉cluster membership vector
◉closeness cost function
◉Minkowski distance
p=1 (Manhattan), p=2 (Euclidian)

Voronoi diagrams
◉Euclidian distance ◉Manhattan distance

K-means algorithm
◉input: dataset D, number of clusters k
◉output: cluster solution C, cluster membership m
◉initialize C by randomly choose k data points sets from D
◉repeat
reasign points in D to closest cluster mean
update m
update C such that is mean of points in cluster,
◉until convergence of KM(D, C)

K-means iterations example
N=12, k=2
randomly
choose
2 clusters
calculate
cluster
means
reassign points to
closest cluster mean
update
cluster
means
loop if needed

Poor initialization/Poor clustering
Initial (I) Final (I) Initial (II) Final (II)
Initial (III) Final (III)

Pros
◉simple
◉common in practice
◉easy to adapt
◉relatively fast
◉sensitive to initial partitions
◉optimal k is difficult to choose
◉restricted to data which has
the notion of a center
◉cannot handle well non-
convex clusters
◉does not identify outliers
(because mean is not a “robust”
statistic function)
Cons

Tips & tricks
◉run the algorithm multiple times with different initial centroids
and select the best result
◉if possible, use the knowledge about the dataset in choosing k
◉trying several k and choosing the one based on closeness cost
function is naive
◉use a more appropriate distance measure for the dataset
◉use k-means together with another algorithm
◉Exploit the triangular inequality to avoid to compare each data
point with all centroids

Tips & tricks - continued
◉recursively split the least compact cluster in 2
clusters by using 2-means
◉remove outliers in preprocessing (anomaly
detection)
◉eliminate small clusters or merge close clusters
at postprocessing
◉in case of empty clusters reinitialize their
centroids or steal points from the largest cluster

Tools & frameworks
◉Frontline Systems XLMiner (Excel Add-in)
◉SciPy K-means Clustering and Vector
Quantization Modules (scipy.cluster.vq.kmeans)
◉stats package in R
◉Azure Machine Learning

Bibliography
◉ H.-H. Bock, "Origins and extensions of the k-means algorithm in cluster analysis", JEHPS, Vol.
4, No. 2, December 2008
◉ D. Chappell, “Understanding Machine Learning”, PluralSight.com, 4 February 2016
◉ J. Ghosh, A. Liu, "K-Means", pp. 21-35 in X. Wu, V. Kumar (Eds.), "The Top Ten Algorithms in
Data Mining", Chapman & Hall/CRC, 2009
◉ G.J. Hamerly, "Learning structure and concepts in data through data clustering", PhD Thesis,
University of California, San Diego, 2003
◉ J. Han, M. Kamber, J. Pei, "Chapter 10. Cluster Analysis: Basic Concepts and Methods" in "Data
Mining: Concepts and Techniques", Morgan Kaufmann, 2011
◉ S.P. Lloyd, "Least Squares Quantization in PCM", IEEE Transactions on Information Theory, Vol.
28, No. 2, 1982
◉ J. MacQueen, “Some Methods for Classification and Analysis of Multivariate Observations”, 5th
Berkeley Symposium, 1967
◉ J. McCaffrey, "Machine Learning Using C# Succinctly", Syncfusion, 2014
◉ [ODS] G. Upton, I. Cook, “A Dictionary of Statistics”, 3rd Ed., Oxford University Press, 2014
◉ ***, “k-means clustering”, Wikipedia
◉ ***, “Voronoi diagram”, Wikipedia

Any questions ?
You can find me at
Thanks!

What's hot

Enhance The K Means Algorithm On Spatial DatasetAlaaZ

K-means ClusteringAnna Fensel

KmeansNikita Goyal

Clustering: A SurveyRaffaele Capaldo

Types of clustering and different types of clustering algorithmsPrashanth Guntal

K means clusteringkeshav goyal

CC282 Unsupervised Learning (Clustering) Lecture 7 slides for ...butest

Cure, Clustering AlgorithmLino Possamai

05 k-means clusteringSubhas Kumar Ghosh

Pattern recognition binoy k means clustering108kaushik

Clustering, k-means clusteringMegha Sharma

DATA MINING:Clustering TypesAshwin Shenoy M

K-MEDOIDS CLUSTERING USING PARTITIONING AROUND MEDOIDS FOR PERFORMING FACE R...ijscmc

3.3 hierarchical methodsKrish_ver2

Data clustering GARIMA SHAKYA

Chapter 11 cluster advanced : web and text miningHouw Liong The

K mean-clustering algorithmparry prabhu

K-means clustering algorithmVinit Dantkale

Dataa miiningSUBBIAH SURESH

Clique sk_klms

What's hot (20)

Enhance The K Means Algorithm On Spatial Dataset

K-means Clustering

Kmeans

Clustering: A Survey

Types of clustering and different types of clustering algorithms

K means clustering

CC282 Unsupervised Learning (Clustering) Lecture 7 slides for ...

Cure, Clustering Algorithm

05 k-means clustering

Pattern recognition binoy k means clustering

Clustering, k-means clustering

DATA MINING:Clustering Types

K-MEDOIDS CLUSTERING USING PARTITIONING AROUND MEDOIDS FOR PERFORMING FACE R...

3.3 hierarchical methods

Data clustering

Chapter 11 cluster advanced : web and text mining

K mean-clustering algorithm

K-means clustering algorithm

Dataa miining

Clique

Viewers also liked

Melanoma Image Segmentation using Self Organized Features MapsMunnangi Anirudh

Systolic arraySyed Zaid Irshad

Lec2 finalGichelle Amon

A Dual Tree Complex Wavelet Transform Construction and Its Application to Ima...CSCJournals

Timo Honkela: Turning quantity into quality and making concepts visible using...Timo Honkela

Timo Honkela: Kohonen's Self-Organizing Maps for Intelligent Systems Developm...Timo Honkela

Timo Honkela: Self-Organizing Map as a Means for Gaining PerspectivesTimo Honkela

Self Organizing MapsDaksh Raj Chopra

MATLAB IMPLEMENTATION OF SELF-ORGANIZING MAPS FOR CLUSTERING OF REMOTE SENSIN...Daksh Raj Chopra

PptGMRIT

Neural networks Self Organizing Map by Engr. Edgar Carrillo IIEdgar Carrillo

Neural Networks: Self-Organizing Maps (SOM)Mostafa G. M. Mostafa

Bioinformatics Project Training for 2,4,6 monthbiinoida

A Simple Tutorial on Conjoint and Cluster AnalysisIterative Path

Methods of Manifold Learning for Dimension Reduction of Large Data SetsRyan B Harvey, CSDP, CSM

Customer Clustering For Retail MarketingJonathan Sedar

Dimension Reduction And Visualization Of Large High Dimensional Data Via Inte...wl820609

XL-MINER: Data ExplorationDataminingTools Inc

Introduction To XL-MinerDataminingTools Inc

Self-Organising Maps for Customer Segmentation using R - Shane Lynn - Dublin Rshanelynn

Viewers also liked (20)

Melanoma Image Segmentation using Self Organized Features Maps

Systolic array

Lec2 final

A Dual Tree Complex Wavelet Transform Construction and Its Application to Ima...

Timo Honkela: Turning quantity into quality and making concepts visible using...

Timo Honkela: Kohonen's Self-Organizing Maps for Intelligent Systems Developm...

Timo Honkela: Self-Organizing Map as a Means for Gaining Perspectives

Self Organizing Maps

MATLAB IMPLEMENTATION OF SELF-ORGANIZING MAPS FOR CLUSTERING OF REMOTE SENSIN...

Ppt

Neural networks Self Organizing Map by Engr. Edgar Carrillo II

Neural Networks: Self-Organizing Maps (SOM)

Bioinformatics Project Training for 2,4,6 month

A Simple Tutorial on Conjoint and Cluster Analysis

Methods of Manifold Learning for Dimension Reduction of Large Data Sets

Customer Clustering For Retail Marketing

Dimension Reduction And Visualization Of Large High Dimensional Data Via Inte...

XL-MINER: Data Exploration

Introduction To XL-Miner

Self-Organising Maps for Customer Segmentation using R - Shane Lynn - Dublin R

Similar to "k-means-clustering" presentation @ Papers We Love Bucharest

DataEngConf: Feature Extraction: Modern Questions and Challenges at GoogleHakka Labs

CS583-unsupervised-learning.pptHathiramN1

CS583-unsupervised-learning.ppt learningssuserb02eff

Unsupervised Learning.pptxGandhiMathy6

15857 cse422 unsupervised-learningAnil Yadav

Neural nw k meansEng. Dr. Dennis N. Mwighusa

K-means Clustering Method for the Analysis of Log Dataidescitation

Document clustering for forensic analysis an approach for improving compute...Madan Golla

iiit delhi unsupervised pdf.pdfVIKASGUPTA127897

Premeditated Initial Points for K-Means ClusteringIJCSIS Research Publications

A Comparative Study Of Various Clustering Algorithms In Data MiningNatasha Grant

A PSO-Based Subtractive Data Clustering AlgorithmIJORCS

Data clustering using kernel basedIJITCA Journal

Ensemble based Distributed K-Modes ClusteringIJERD Editor

Introduction to data mining and machine learningTilani Gunawardena PhD(UNIBAS), BSc(Pera), FHEA(UK), CEng, MIESL

Unsupervised Machine Learning Algorithm K-means-Clustering.pptxAnupama Kate

Unsupervised learning clusteringArshad Farhad

Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Salah Amean

Master's Thesis Presentation●๋•máńíکhá Gőýálツ

Chapter 10. Cluster Analysis Basic Concepts and Methods.pptSubrata Kumer Paul

Similar to "k-means-clustering" presentation @ Papers We Love Bucharest (20)

DataEngConf: Feature Extraction: Modern Questions and Challenges at Google

CS583-unsupervised-learning.ppt

CS583-unsupervised-learning.ppt learning

Unsupervised Learning.pptx

15857 cse422 unsupervised-learning

Neural nw k means

K-means Clustering Method for the Analysis of Log Data

Document clustering for forensic analysis an approach for improving compute...

iiit delhi unsupervised pdf.pdf

Premeditated Initial Points for K-Means Clustering

A Comparative Study Of Various Clustering Algorithms In Data Mining

A PSO-Based Subtractive Data Clustering Algorithm

Data clustering using kernel based

Ensemble based Distributed K-Modes Clustering

Introduction to data mining and machine learning

Unsupervised Machine Learning Algorithm K-means-Clustering.pptx

Unsupervised learning clustering

Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...

Master's Thesis Presentation

Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt

Recently uploaded

Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave

%in tembisa+277-882-255-28 abortion pills for sale in tembisamasabamasaba

Right Money Management App For Your Financial GoalsJhone kinadey

Direct Style Effect Systems -The Print[A] Example- A Comprehension AidPhilip Schwarz

%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfonteinmasabamasaba

OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...Shane Coughlan

%in Midrand+277-882-255-28 abortion pills for sale in midrandmasabamasaba

Announcing Codolex 2.0 from GDK SoftwareJim McKeeth

Microsoft AI Transformation Partner Playbook.pdfWilly Marroquin (WillyDevNET)

W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda

Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfkalichargn70th171

The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171

Define the academic and professional writing..pdfPearlKirahMaeRagusta1

call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls

Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab

%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...masabamasaba

%in Harare+277-882-255-28 abortion pills for sale in Hararemasabamasaba

Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...SelfMade bd

%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfonteinmasabamasaba

Generic or specific? Making sensible software design decisionsBert Jan Schrijver

Recently uploaded (20)

Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...

%in tembisa+277-882-255-28 abortion pills for sale in tembisa

Right Money Management App For Your Financial Goals

Direct Style Effect Systems -The Print[A] Example- A Comprehension Aid

%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein

OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...

%in Midrand+277-882-255-28 abortion pills for sale in midrand

Announcing Codolex 2.0 from GDK Software

Microsoft AI Transformation Partner Playbook.pdf

W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...

Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf

The Ultimate Test Automation Guide_ Best Practices and Tips.pdf

Define the academic and professional writing..pdf

call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️

Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...

%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...

%in Harare+277-882-255-28 abortion pills for sale in Harare

Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...

%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein

Generic or specific? Making sensible software design decisions

"k-means-clustering" presentation @ Papers We Love Bucharest

1. k-means Clustering Papers We Love Bucharest Chapter February 22, 2016

2. I am Adrian Florea Architect Lead @ IBM Bucharest Software Lab Hello!

3. What is clustering? ◉branch of Machine Learning (unsupervised learning i.e. find underlying patterns in the data) ◉method for identifying data items that closely resemble one another, assembling them into clusters [ODS]

4. Clustering applications ◉Customer segmentation

5. Clustering applications ◉Google News

6. Definitions and notations ◉data to be clustered (N data points) ◉iteratively refined clustering solution ◉cluster membership vector ◉closeness cost function ◉Minkowski distance p=1 (Manhattan), p=2 (Euclidian)

7. Voronoi diagrams ◉Euclidian distance ◉Manhattan distance

8. K-means algorithm ◉input: dataset D, number of clusters k ◉output: cluster solution C, cluster membership m ◉initialize C by randomly choose k data points sets from D ◉repeat reasign points in D to closest cluster mean update m update C such that is mean of points in cluster, ◉until convergence of KM(D, C)

9. K-means iterations example N=12, k=2 randomly choose 2 clusters calculate cluster means reassign points to closest cluster mean update cluster means loop if needed

10. Poor initialization/Poor clustering Initial (I) Final (I) Initial (II) Final (II) Initial (III) Final (III)

11. Pros ◉simple ◉common in practice ◉easy to adapt ◉relatively fast ◉sensitive to initial partitions ◉optimal k is difficult to choose ◉restricted to data which has the notion of a center ◉cannot handle well non- convex clusters ◉does not identify outliers (because mean is not a “robust” statistic function) Cons

12. Tips & tricks ◉run the algorithm multiple times with different initial centroids and select the best result ◉if possible, use the knowledge about the dataset in choosing k ◉trying several k and choosing the one based on closeness cost function is naive ◉use a more appropriate distance measure for the dataset ◉use k-means together with another algorithm ◉Exploit the triangular inequality to avoid to compare each data point with all centroids

13. Tips & tricks - continued ◉recursively split the least compact cluster in 2 clusters by using 2-means ◉remove outliers in preprocessing (anomaly detection) ◉eliminate small clusters or merge close clusters at postprocessing ◉in case of empty clusters reinitialize their centroids or steal points from the largest cluster

14. Tools & frameworks ◉Frontline Systems XLMiner (Excel Add-in) ◉SciPy K-means Clustering and Vector Quantization Modules (scipy.cluster.vq.kmeans) ◉stats package in R ◉Azure Machine Learning

15. Bibliography ◉ H.-H. Bock, "Origins and extensions of the k-means algorithm in cluster analysis", JEHPS, Vol. 4, No. 2, December 2008 ◉ D. Chappell, “Understanding Machine Learning”, PluralSight.com, 4 February 2016 ◉ J. Ghosh, A. Liu, "K-Means", pp. 21-35 in X. Wu, V. Kumar (Eds.), "The Top Ten Algorithms in Data Mining", Chapman & Hall/CRC, 2009 ◉ G.J. Hamerly, "Learning structure and concepts in data through data clustering", PhD Thesis, University of California, San Diego, 2003 ◉ J. Han, M. Kamber, J. Pei, "Chapter 10. Cluster Analysis: Basic Concepts and Methods" in "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2011 ◉ S.P. Lloyd, "Least Squares Quantization in PCM", IEEE Transactions on Information Theory, Vol. 28, No. 2, 1982 ◉ J. MacQueen, “Some Methods for Classification and Analysis of Multivariate Observations”, 5th Berkeley Symposium, 1967 ◉ J. McCaffrey, "Machine Learning Using C# Succinctly", Syncfusion, 2014 ◉ [ODS] G. Upton, I. Cook, “A Dictionary of Statistics”, 3rd Ed., Oxford University Press, 2014 ◉ ***, “k-means clustering”, Wikipedia ◉ ***, “Voronoi diagram”, Wikipedia

16. Any questions ? You can find me at Thanks!

Editor's Notes

- partition the customer database into several groups, where customers in each group have similar purchasing history and demographics

"k-means-clustering" presentation @ Papers We Love Bucharest

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to "k-means-clustering" presentation @ Papers We Love Bucharest

Similar to "k-means-clustering" presentation @ Papers We Love Bucharest (20)

Recently uploaded

Recently uploaded (20)

"k-means-clustering" presentation @ Papers We Love Bucharest

Editor's Notes