K-Means Algorithm

Class: Algorithmic Methods of Data Mining
Program: M.Sc. Data Science
University: Sapienza University of Rome
Semester: Fall 2015
Lecturer: Carlos Castillo http://chato.cl/
Sources:
• Mohammed J. Zaki, Wagner Meira, Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, May 2014. Example 13.1.
• Evimaria Terzi: Data Mining course at Boston University
  http://www.cs.bu.edu/~evimaria/cs565-13.html
Boston University Slideshow Title Goes Here
The k-means problem
• consider a set X = {x1, ..., xn} of n points in R^d
• assume that the number of clusters k is given
• problem: find k points c1, ..., ck (called centers or means) so that the cost

  cost(c1, ..., ck) = Σ_{i=1..n} min_{j=1..k} ||xi − cj||²

  is minimized
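The cost above can be computed directly: for each point, take the squared distance to its nearest center, then sum. A minimal NumPy sketch (the function name `kmeans_cost` is my own):

```python
import numpy as np

def kmeans_cost(X, centers):
    """Sum over all points of the squared distance to the nearest center.

    X: (n, d) array of points; centers: (k, d) array of centers.
    """
    # diffs[i, j] = x_i - c_j, so sq_dists[i, j] = ||x_i - c_j||^2
    diffs = X[:, None, :] - centers[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    # each point contributes its distance to the *closest* center
    return sq_dists.min(axis=1).sum()
```

For example, with points {0, 2} on the line and a single center at 0, the cost is 0² + 2² = 4.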
The k-means problem
• k=1 and k=n are easy special cases (why?)
• an NP-hard problem if the dimension of the data is at least 2 (d ≥ 2)
• in practice, a simple iterative algorithm works quite well
The k-means algorithm
• voted among the top-10 algorithms in data mining
• one way of solving the k-means problem
The k-means algorithm
1. Randomly (or with another method) pick k cluster centers {c1, ..., ck}
2. For each j, set the cluster Xj to be the set of points in X that are closest to center cj
3. For each j, let cj be the center of cluster Xj (the mean of the vectors in Xj)
4. Repeat (go to step 2) until convergence
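The steps above (Lloyd's iteration) can be sketched in NumPy; the function name and the `max_iter` safeguard are my own choices, not part of the lecture:

```python
import numpy as np

def kmeans(X, centers, max_iter=100):
    """Lloyd's iteration: assign points to the nearest center, recompute means."""
    centers = centers.copy()
    for _ in range(max_iter):
        # step 2: assign each point to its closest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # step 3: move each center to the mean of its cluster
        # (if a cluster is empty, keep its old center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        # step 4: stop when the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

For example, on the 1-d points {1, 2, 19, 21} with initial centers 2 and 4, one assignment step groups {1, 2} and {19, 21}, and the centers converge to 1.5 and 20.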
1-dimensional clustering exercise
Exercise:
• For the data in the figure:
• Run k-means with k=2 and initial centroids u1=2, u2=4
  (Verify: the final centroids are 18 units apart)
• Try with k=3 and initialization 2, 3, 30
http://www.dataminingbook.info/pmwiki.php/Main/BookDownload Exercise 13.1
Limitations of k-means
• Clusters of different sizes
• Clusters of different densities
• Clusters of non-globular shape
• Sensitive to initialization
k-means algorithm
• finds a local optimum
• often converges quickly, but not always
• the choice of initial points can have a large influence on the result
• tends to find spherical clusters
• outliers can cause a problem
• different densities may cause a problem
Initialization
• random initialization
• random, but repeat many times and take the best solution
  • helps, but the solution can still be bad
• pick points that are distant from each other
• k-means++
  • provable guarantees
k-means++
David Arthur and Sergei Vassilvitskii,
"k-means++: The advantages of careful seeding",
SODA 2007
k-means++ algorithm
• interpolate between the two methods (random initialization and furthest-first traversal)
• let D(x) be the distance between x and the nearest center selected so far
• choose the next center with probability proportional to D(x)^a
  • a = 0: random initialization
  • a = ∞: furthest-first traversal
  • a = 2: k-means++
k-means++ algorithm
• initialization phase:
  • choose the first center uniformly at random
  • choose each next center with probability proportional to D(x)²
• iteration phase:
  • iterate as in the k-means algorithm until convergence
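The initialization phase above can be sketched as follows; this is a sketch under my own naming (`kmeans_pp_init`) and uses NumPy's random generator for the D²-weighted sampling:

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """k-means++ seeding: first center uniform, later centers ~ D(x)^2."""
    rng = np.random.default_rng(rng)
    n = len(X)
    # choose the first center uniformly at random
    centers = [X[rng.integers(n)]]
    for _ in range(k - 1):
        # D(x)^2: squared distance from each point to its nearest chosen center
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # sample the next center with probability proportional to D(x)^2
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)
```

Note that a point coinciding with an existing center has D(x)² = 0 and can never be picked again, so well-separated clusters each tend to receive a seed.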
k-means++ provable guarantee
• the approximation guarantee comes from the initialization (the first iteration) alone
• subsequent iterations can only improve the cost
Lesson learned
• there is no reason to use plain k-means instead of k-means++
• k-means++:
  • easy to implement
  • provable guarantee
  • works well in practice