2. I am Adrian Florea
Architect Lead @ IBM Bucharest Software Lab
Hello!
3. What is clustering?
âbranch of Machine Learning (unsupervised learning i.e. find
underlying patterns in the data)
âmethod for identifying data items that closely resemble one
another, assembling them into clusters [ODS]
8. K-means algorithm
âinput: dataset D, number of clusters k
âoutput: cluster solution C, cluster membership m
âinitialize C by randomly choose k data points sets from D
ârepeat
reasign points in D to closest cluster mean
update m
update C such that is mean of points in cluster,
âuntil convergence of KM(D, C)
9. K-means iterations example
N=12, k=2
randomly
choose
2 clusters
calculate
cluster
means
reassign points to
closest cluster mean
update
cluster
means
loop if needed
11. Pros
âsimple
âcommon in practice
âeasy to adapt
ârelatively fast
âsensitive to initial partitions
âoptimal k is difficult to choose
ârestricted to data which has
the notion of a center
âcannot handle well non-
convex clusters
âdoes not identify outliers
(because mean is not a ârobustâ
statistic function)
Cons
12. Tips & tricks
ârun the algorithm multiple times with different initial centroids
and select the best result
âif possible, use the knowledge about the dataset in choosing k
âtrying several k and choosing the one based on closeness cost
function is naive
âuse a more appropriate distance measure for the dataset
âuse k-means together with another algorithm
âExploit the triangular inequality to avoid to compare each data
point with all centroids
13. Tips & tricks - continued
ârecursively split the least compact cluster in 2
clusters by using 2-means
âremove outliers in preprocessing (anomaly
detection)
âeliminate small clusters or merge close clusters
at postprocessing
âin case of empty clusters reinitialize their
centroids or steal points from the largest cluster
14. Tools & frameworks
âFrontline Systems XLMiner (Excel Add-in)
âSciPy K-means Clustering and Vector
Quantization Modules (scipy.cluster.vq.kmeans)
âstats package in R
âAzure Machine Learning
15. Bibliography
â H.-H. Bock, "Origins and extensions of the k-means algorithm in cluster analysis", JEHPS, Vol.
4, No. 2, December 2008
â D. Chappell, âUnderstanding Machine Learningâ, PluralSight.com, 4 February 2016
â J. Ghosh, A. Liu, "K-Means", pp. 21-35 in X. Wu, V. Kumar (Eds.), "The Top Ten Algorithms in
Data Mining", Chapman & Hall/CRC, 2009
â G.J. Hamerly, "Learning structure and concepts in data through data clustering", PhD Thesis,
University of California, San Diego, 2003
â J. Han, M. Kamber, J. Pei, "Chapter 10. Cluster Analysis: Basic Concepts and Methods" in "Data
Mining: Concepts and Techniques", Morgan Kaufmann, 2011
â S.P. Lloyd, "Least Squares Quantization in PCM", IEEE Transactions on Information Theory, Vol.
28, No. 2, 1982
â J. MacQueen, âSome Methods for Classification and Analysis of Multivariate Observationsâ, 5th
Berkeley Symposium, 1967
â J. McCaffrey, "Machine Learning Using C# Succinctly", Syncfusion, 2014
â [ODS] G. Upton, I. Cook, âA Dictionary of Statisticsâ, 3rd Ed., Oxford University Press, 2014
â ***, âk-means clusteringâ, Wikipedia
â ***, âVoronoi diagramâ, Wikipedia