4. K-MEANS
One of the best-known partitional clustering methods is K-Means clustering.
Advantage: computationally simple.
Disadvantage: cluster quality depends on the choice of the initial centroids and the value of k.
The parameter k specifies the number of clusters to be formed:
A value of k is fixed at the start.
k = the number of clusters.
• k initial centroids are defined.
• The initial centroids are initialized randomly.
5. K-MEANS (PSEUDOCODE)
Grouping the data into k clusters takes several iterations.
The iteration stops when the centroids no longer change, i.e., every data point stays in the same cluster in all subsequent iterations.
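The pseudocode figure itself is not in this extract; the following is a minimal sketch of the loop in R, assuming Euclidean distance and random initial centroids (the name kmeans_sketch is illustrative, not from the slides):

```r
# Minimal K-means loop: assign points to the nearest centroid, recompute
# centroids, stop when no assignment changes. Assumes numeric data in a
# matrix X and that no cluster becomes empty.
kmeans_sketch <- function(X, k, max_iter = 100) {
  centroids <- X[sample(nrow(X), k), , drop = FALSE]  # random initial centroids
  assign <- rep(0L, nrow(X))
  for (iter in 1:max_iter) {
    # distances of every point to every centroid
    d <- as.matrix(dist(rbind(centroids, X)))[-(1:k), 1:k]
    new_assign <- max.col(-d)                 # nearest centroid per point
    if (all(new_assign == assign)) break      # converged: no change
    assign <- new_assign
    for (j in 1:k)                            # centroid = mean of its members
      centroids[j, ] <- colMeans(X[assign == j, , drop = FALSE])
  }
  list(cluster = assign, centers = centroids)
}
```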
6. K-MEANS
If the i-th attribute is numeric, the i-th centroid value is the mean of that attribute's values, 1 ≤ i ≤ n.
If the i-th attribute is categorical, the i-th centroid value is the mode of that attribute's values, 1 ≤ i ≤ n.
Example: k-Means clustering with k = 3 and three centroids m1, m2, m3.
Each cluster is associated with a centroid.
Each data point is assigned to the cluster with the nearest centroid.
A centroid is an n-dimensional vector, where n is the number of attributes in each data point.
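A small illustration of the mean/mode rule in R, using hypothetical toy attributes:

```r
# Numeric attribute -> centroid component is the mean
height <- c(170, 160, 175)
mean(height)                    # 168.33

# Categorical attribute -> centroid component is the mode (most frequent value)
color <- c("red", "blue", "red")
names(which.max(table(color)))  # "red"
```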
7. K-MEANS ALGORITHM
Given the number of clusters K, the K-means algorithm is carried out in 5 steps:
1. Determine k.
2. Set the k initial centroids (seed points).
3. Assign each data point to the cluster with the nearest centroid (minimum distance).
4. Update the centroid of each cluster (the centroid is the center of the cluster, i.e., the mean point of the cluster).
5. Return to step 3; the iteration stops when no assignment changes any more.
8. PROBLEM EXAMPLE
Suppose we have 4 types of medicines, each with two attributes (pH and weight index). Our goal is to group these objects into K = 2 groups of medicine.
Medicine  Weight  pH-Index
A         1       1
B         2       1
C         4       3
D         5       4
[Scatter plot of the four medicines A, B, C, D in the weight/pH plane]
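In R, the worked example on the next slides can be reproduced with the built-in kmeans(); passing A and B as the centers argument fixes the same initial seeds the slides use (with random seeds the cluster labels may differ):

```r
# The four medicines and their two attributes
med <- data.frame(weight = c(1, 2, 4, 5),
                  ph     = c(1, 1, 3, 4),
                  row.names = c("A", "B", "C", "D"))
# Start from seeds c1 = A and c2 = B, as in the worked example
fit <- kmeans(med, centers = med[c("A", "B"), ])
fit$cluster  # A, B -> cluster 1; C, D -> cluster 2
fit$centers  # final centroids (1.5, 1) and (4.5, 3.5)
```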
9. EXAMPLE
Step 1: Use initial seed points for partitioning
Assign each object to the cluster with the nearest seed point, measured by Euclidean distance. Take c1 = A = (1, 1) and c2 = B = (2, 1) as the initial seeds. For object D = (5, 4):
d(D, c1) = √((5 − 1)² + (4 − 1)²) = 5
d(D, c2) = √((5 − 2)² + (4 − 1)²) = √18 ≈ 4.24
So D is assigned to the cluster of c2.
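The same two distances in R, reproducing the computation for D:

```r
D <- c(5, 4); c1 <- c(1, 1); c2 <- c(2, 1)
sqrt(sum((D - c1)^2))  # 5.000000
sqrt(sum((D - c2)^2))  # 4.242641 -> D joins the cluster of c2
```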
10. EXAMPLE
Step 2: Compute new centroids of the current partition
Knowing the members of each
cluster, now we compute the new
centroid of each group based on
these new memberships.
With cluster 1 = {A} and cluster 2 = {B, C, D}:
c1 = (1, 1)
c2 = ((2 + 4 + 5)/3, (1 + 3 + 4)/3) = (11/3, 8/3) ≈ (3.67, 2.67)
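Verifying the new centroids in R:

```r
# Cluster 1 = {A}, cluster 2 = {B, C, D} after the first assignment
c1 <- c(1, 1)
c2 <- colMeans(rbind(B = c(2, 1), C = c(4, 3), D = c(5, 4)))
c2  # 3.666667 2.666667, i.e. (11/3, 8/3)
```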
11. EXAMPLE
Step 2: Renew membership based on new centroids
Compute the distance of all objects to the new centroids and reassign each object to the nearest one; the clusters become {A, B} and {C, D}.
12. EXAMPLE
Step 3: Repeat the first two steps until convergence
Knowing the members of each
cluster, now we compute the new
centroid of each group based on
these new memberships.
With cluster 1 = {A, B} and cluster 2 = {C, D}:
c1 = ((1 + 2)/2, (1 + 1)/2) = (1.5, 1)
c2 = ((4 + 5)/2, (3 + 4)/2) = (4.5, 3.5)
13. EXAMPLE
Step 3: Repeat the first two steps until convergence
Compute the distance of all objects to the new centroids.
Since no object changes its assignment, the algorithm stops.
14. EVALUATING K-MEANS PERFORMANCE
The performance of K-Means clustering can be evaluated with the Sum of Squared Error (SSE). The main idea behind SSE is to measure how homogeneous the data within each cluster are.
Homogeneity is measured by the error/distance between each data point and its centroid: the more homogeneous the data in a cluster, the smaller the distance between each data point and the centroid.
The squared errors are then summed over all clusters (Sum of Squared Error, SSE). The smaller the SSE, the better the clustering result.
15. EVALUATING K-MEANS PERFORMANCE
SSE = Σ(i = 1..K) Σ(x ∈ Ci) dist(mi, x)²
where:
K = the number of clusters
Ci = the i-th cluster
mi = the centroid of the i-th cluster
x = a data point belonging to its cluster
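A direct translation of the SSE formula into R (a sketch; with the built-in kmeans() the same quantity is returned as fit$tot.withinss):

```r
# SSE: squared Euclidean distance of each point to its own centroid, summed.
# X: data matrix, cluster: integer assignments, centers: centroid matrix.
sse <- function(X, cluster, centers) {
  sum(sapply(seq_len(nrow(X)), function(r)
    sum((X[r, ] - centers[cluster[r], ])^2)))
}
```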
18. INTRODUCTION
Hierarchical Clustering Approach
A typical clustering approach that partitions the data set sequentially
Constructs nested partitions layer by layer by grouping objects into a tree of clusters (without needing to know the number of clusters in advance)
Uses a distance matrix as the clustering criterion; a termination condition is needed
Agglomerative vs. Divisive
Two sequential clustering strategies for constructing a tree of clusters
Agglomerative: a bottom-up strategy
Initially each data object is in its own (atomic) cluster
Then merge these atomic clusters into larger and larger clusters
Divisive: a top-down strategy
Initially all objects are in one single cluster
Then the cluster is subdivided into smaller and smaller clusters
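Both strategies are available in R: hclust() in base R is agglomerative, and diana() from the cluster package is divisive. A quick contrast on arbitrary toy data:

```r
library(cluster)                   # for diana()
set.seed(1)
X  <- matrix(rnorm(12), ncol = 2)  # 6 random 2-D points
ag <- hclust(dist(X))              # agglomerative: merge singletons bottom-up
dv <- diana(dist(X))               # divisive: split the full set top-down
```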
19. INTRODUCTION
Illustrative Example
Agglomerative and divisive clustering on the data set {a, b, c, d, e}
Cluster distance
Termination condition
[Diagram: reading left to right (steps 0–4), agglomerative clustering merges a and b into (a, b), d and e into (d, e), then c joins to form (c, d, e), and finally everything merges into (a, b, c, d, e); reading right to left (steps 4–0), divisive clustering performs the same splits in reverse.]
20. CLUSTER DISTANCE MEASURES
[Diagram: single link (min), complete link (max), and average distance between two clusters]
Single link: smallest distance between an element in one cluster and an element in the other, i.e., d(Ci, Cj) = min{d(xip, xjq)}
Complete link: largest distance between an element in one cluster and an element in the other, i.e., d(Ci, Cj) = max{d(xip, xjq)}
Average: average distance between elements in one cluster and elements in the other, i.e., d(Ci, Cj) = avg{d(xip, xjq)}
21. AGGLOMERATIVE ALGORITHM
The Agglomerative algorithm is carried out in three steps:
1) Convert the object attributes to a distance matrix
2) Set each object as a cluster (thus if we have N objects, we will have N clusters at the beginning)
3) Repeat until the number of clusters is one (or a known number of clusters is reached):
   Merge the two closest clusters
   Update the distance matrix
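The three steps map directly onto base R: dist() builds the distance matrix (step 1) and hclust() performs the iterative merging (steps 2 and 3). A minimal sketch on arbitrary data:

```r
set.seed(2)
X  <- matrix(runif(12), ncol = 2)   # any small numeric data set (6 objects)
d  <- dist(X)                       # step 1: attributes -> distance matrix
hc <- hclust(d, method = "single")  # steps 2-3: start from 6 singleton
                                    # clusters, repeatedly merge the closest two
hc$merge   # which clusters were merged at each iteration
hc$height  # the distance at which each merge happened
```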
22. EXAMPLE AND DEMO
Problem: clustering analysis with the agglomerative algorithm
[The slide shows the data matrix and its Euclidean distance matrix]
30. Dendrogram tree representation
Example and Demo
1. In the beginning we have 6
clusters: A, B, C, D, E and F
2. We merge cluster D and F into
cluster (D, F) at distance 0.50
3. We merge cluster A and cluster B
into (A, B) at distance 0.71
4. We merge cluster E and (D, F)
into ((D, F), E) at distance 1.00
5. We merge cluster ((D, F), E) and C
into (((D, F), E), C) at distance 1.41
6. We merge cluster (((D, F), E), C)
and (A, B) into ((((D, F), E), C), (A, B))
at distance 2.50
7. The last cluster contains all the objects, concluding the computation
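The deck's data-matrix slide is not in this extract, but the six 2-D points below (an assumption, chosen so that their single-link merge distances match the values listed above) reproduce the dendrogram:

```r
# Assumed coordinates; their single-link merge heights match steps 2-6 above
pts <- matrix(c(1, 1,   1.5, 1.5,   5, 5,   3, 4,   4, 4,   3, 3.5),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("A", "B", "C", "D", "E", "F"), NULL))
hc <- hclust(dist(pts), method = "single")
round(hc$height, 2)  # 0.50 0.71 1.00 1.41 2.50
plot(hc)             # draws the dendrogram
```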
31. CLUSTERING IN R
library mva:
- Hierarchical clustering: hclust, heatmap
- k-means: kmeans
library class:
- Self-organizing maps: SOM
library cluster:
- pam and other functions
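Note that in current R, hclust() and kmeans() live in the base stats package (the old mva library was merged into it), so no library() call is needed for them; pam() still comes from the cluster package. For example:

```r
library(cluster)                   # for pam()
set.seed(3)
X <- matrix(rnorm(20), ncol = 2)   # 10 random 2-D points
kmeans(X, centers = 2)             # k-means (stats)
hclust(dist(X))                    # hierarchical clustering (stats)
pam(X, k = 2)                      # k-medoids (cluster)
```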
32. ASSIGNMENT T2: K-MEANS & HIERARCHICAL CLUSTERING (SECTION 4)
Individual
To be done on folio/A4 paper (handwritten)
Due next week (Week 12), 21 April 2015
33. NEXT WEEK (WEEK 12)
UNSUPERVISED LEARNING: ASSOCIATION RULE
BAYES THEOREM
Final Project: any topic (from Weeks 1-14) using R; report & demo in Week 15
FP: groups of 3-4 people
Neural Network
Clustering
Bayesian
Association Rule
34. REFERENCES
Flach, Peter. 2012. Machine Learning: The Art and Science of Algorithms that Make Sense of Data. Cambridge University Press.
Tan et al. 2006. Introduction to Data Mining. Addison Wesley.
Ke Chen, COMP24111 Machine Learning, University of Manchester. http://www.cs.man.ac.uk/~kechen/teaching.php
Wikibooks, K-Means Example. http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/K-Means