INTRODUCTION
• K-means (MacQueen, 1967) is one of the simplest
unsupervised learning algorithms that solves the well-known
clustering problem.
• The main idea is to define k centroids, one for each
cluster.
• Input
• M (set of points)
• k (number of clusters)
• Output
• μ_1, …, μ_k (cluster centroids)
• k-Means clusters the M points into k clusters C_1, …, C_k by minimizing
the squared-error objective
J = Σ_{j=1}^{k} Σ_{x ∈ C_j} ||x − μ_j||²
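One standard way to minimize this squared error is Lloyd's algorithm, which alternates between assigning points to their nearest centroid and recomputing each centroid as its cluster mean. A minimal NumPy sketch (the function name, defaults, and convergence test are illustrative choices, not a fixed API):

```python
import numpy as np

def kmeans(M, k, n_iters=100, seed=0):
    # Lloyd's algorithm. Initialize centroids by picking k data points at random.
    rng = np.random.default_rng(seed)
    centroids = M[rng.choice(len(M), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        dists = np.linalg.norm(M[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = np.array([
            M[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and the squared-error objective J.
    labels = np.linalg.norm(M[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    J = float(((M - centroids[labels]) ** 2).sum())
    return centroids, labels, J
```

Note that this finds a local minimum of the squared error, which depends on the random initialization — the motivation for the restarts discussed below.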
K-MEANS IN PRACTICE
• How to choose initial centroids
• select randomly among the data points
• generate completely randomly
• How to choose k
• study the data
• run k-Means for different k (measure squared error for each k)
• Run k-means many times!
• Try many choices of initial points and keep the best result
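Both pieces of advice above — sweep over k and restart from many initial points, recording the squared error each time — can be sketched as follows. `lloyd_sse` and `elbow_curve` are hypothetical helper names, not a standard API:

```python
import numpy as np

def lloyd_sse(X, k, rng, n_iters=50):
    # One run of Lloyd's algorithm from a random start; returns the final
    # sum of squared errors (SSE) to the assigned centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # empty clusters keep their old centroid
                centroids[j] = X[labels == j].mean(axis=0)
    labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
    return float(((X - centroids[labels]) ** 2).sum())

def elbow_curve(X, ks, restarts=20, seed=0):
    # For each k, keep the best (lowest) SSE over several random restarts.
    # Plot SSE against k and look for the "elbow" where the drop levels off.
    rng = np.random.default_rng(seed)
    return {k: min(lloyd_sse(X, k, rng) for _ in range(restarts)) for k in ks}
```

The SSE always decreases as k grows, so one looks for the k past which further increases buy little — not for a minimum.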
QUESTIONS
• Euclidean distance results in spherical clusters
• What cluster shape does the Manhattan distance give?
• Think of other distance measures. What cluster shapes
will those yield?
DENSITY-BASED SPATIAL CLUSTERING OF APPLICATIONS
WITH NOISE (DBSCAN)
• DBSCAN is a density-based clustering algorithm
• In density-based clustering we partition points into dense regions separated
by not-so-dense regions.
• Important Questions:
• How do we measure density and what is a dense region?
• DBSCAN:
• Density at point p: number of points within a circle of radius Eps
• Dense Region: A circle of radius Eps that contains at least MinPts points
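These two definitions translate directly into code; a small NumPy sketch (function and parameter names are illustrative):

```python
import numpy as np

def density(points, p, eps):
    # Density at p: number of points within distance eps of p
    # (p itself counts if it is in the data set).
    return int((np.linalg.norm(points - np.asarray(p), axis=1) <= eps).sum())

def is_core(points, p, eps, min_pts):
    # p lies in a dense region if its eps-ball holds at least min_pts points.
    return density(points, p, eps) >= min_pts
```

Full DBSCAN then grows clusters by chaining together such dense (core) points and labels the leftovers as noise.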
DETERMINING EPS & MINPTS
• Idea is that for points in a cluster, their kth nearest neighbors
are at roughly the same distance
• Noise points have their kth nearest neighbor at a farther distance
• So, plot sorted distance of every point to its kth nearest
neighbor
• Find the distance d where there is a “knee” in the curve
• Eps = d, MinPts = k
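The sorted k-distance curve described above can be computed as below; spotting the knee is left to visual inspection of the returned values (the function name is illustrative):

```python
import numpy as np

def kth_neighbor_distances(X, k):
    # Pairwise distances; after sorting each row, column 0 is a point's
    # distance to itself (0), so column k is its kth nearest neighbor.
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    D.sort(axis=1)
    # Sorted descending: cluster points form a flat plateau on the right,
    # noise points stand out on the left; the knee between them suggests Eps.
    return np.sort(D[:, k])[::-1]
```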
DISTANCE METRIC FOR DOCUMENTS
• Motivations
• Identical – easy
• Modified or related (e.g., DNA, plagiarism, authorship)
• Did Francis Bacon write Shakespeare’s plays?
DOCUMENT REPRESENTATION
• Word count document representation
• Bag of words model
• Ignore order of words
• Count # of instances of each word in vocabulary
EXAMPLE
• Word: Sequence of alphanumeric characters. For example, the phrase “6.006
is fun” has 4 words.
• Word Frequencies: The word frequency D(w) of a given word w is the number of
times it occurs in a document D.
• For example, the words and word frequencies for the above phrase are as
below:
  Word:   6   the   is   006   easy   fun
  Count:  1    0     1    1     0      1
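Under these definitions, word counting is a few lines with Python's `collections.Counter`; the regular expression is one reasonable reading of "sequence of alphanumeric characters", with case folded:

```python
import re
from collections import Counter

def word_frequencies(doc):
    # A word is a maximal run of alphanumeric characters, case-folded,
    # so "6.006 is fun" splits into the 4 words: 6, 006, is, fun.
    return Counter(re.findall(r"[a-z0-9]+", doc.lower()))

D = word_frequencies("6.006 is fun")
```

A `Counter` returns 0 for absent words, matching the zero entries (e.g. "the", "easy") in the table above.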
METRIC
• Dot Product: the inner product of the vectors D1 and D2 containing the word
frequencies for all words in the 2 documents. Equivalently, this is the
projection of vector D1 onto D2 or vice versa. Mathematically this is
expressed as:
D1 · D2 = Σ_w D1(w) · D2(w)
• Angle Metric: The angle between the vectors D1 and D2 gives an
indication of overlap between the 2 documents. Mathematically this
angle is expressed as:
θ(D1, D2) = arccos( (D1 · D2) / (|D1| · |D2|) )
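Both formulas translate directly into code. The sketch below builds the frequency vectors with a simple alphanumeric word definition; `word_frequencies`, `inner_product`, and the clamp in `angle` are implementation choices, not part of the definition:

```python
import math
import re
from collections import Counter

def word_frequencies(doc):
    # Bag-of-words counts; a word is a maximal alphanumeric run, case-folded.
    return Counter(re.findall(r"[a-z0-9]+", doc.lower()))

def inner_product(D1, D2):
    # D1 . D2 = sum over words w of D1(w) * D2(w)
    return sum(D1[w] * D2[w] for w in D1 if w in D2)

def angle(doc1, doc2):
    # theta(D1, D2) = arccos( D1.D2 / (|D1| * |D2|) ), in radians:
    # 0 for documents with identical word proportions,
    # pi/2 for documents sharing no words.
    D1, D2 = word_frequencies(doc1), word_frequencies(doc2)
    cos = inner_product(D1, D2) / math.sqrt(inner_product(D1, D1) *
                                            inner_product(D2, D2))
    return math.acos(min(1.0, cos))  # clamp against floating-point overshoot
```

Iterating only over words of D1 that also appear in D2 exploits sparsity: a real vocabulary is huge, but each document touches few words.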