4. K-MEANS
One of the best-known partitional clustering methods is K-Means clustering.
Advantage: computationally simple.
Disadvantage: cluster quality depends on the choice of the initial centroids and the value of k.
The parameter k specifies the number of clusters to be formed:
A value of k is fixed at the start.
k = the number of clusters.
• k initial centroids are defined.
• The initial centroids are initialized randomly.
5. K-MEANS (PSEUDOCODE)
Grouping the data into k clusters takes several iterations.
The iteration stops when the centroids no longer change, i.e., every data point stays in the same cluster in all subsequent iterations.
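The pseudocode figure itself is not in this extract; the following is a minimal sketch of the loop in R, assuming Euclidean distance and random initial centroids (the name kmeans_sketch is illustrative, not from the slides):

```r
# Minimal K-means loop: assign points to the nearest centroid, recompute
# centroids, stop when no assignment changes. Assumes numeric data in a
# matrix X and that no cluster becomes empty.
kmeans_sketch <- function(X, k, max_iter = 100) {
  centroids <- X[sample(nrow(X), k), , drop = FALSE]  # random initial centroids
  assign <- rep(0L, nrow(X))
  for (iter in 1:max_iter) {
    # distances of every point to every centroid
    d <- as.matrix(dist(rbind(centroids, X)))[-(1:k), 1:k]
    new_assign <- max.col(-d)                 # nearest centroid per point
    if (all(new_assign == assign)) break      # converged: no change
    assign <- new_assign
    for (j in 1:k)                            # centroid = mean of its members
      centroids[j, ] <- colMeans(X[assign == j, , drop = FALSE])
  }
  list(cluster = assign, centers = centroids)
}
```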
6. K-MEANS
If the i-th attribute is numeric, the i-th centroid value is the mean of that attribute's values, 1 ≤ i ≤ n.
If the i-th attribute is categorical, the i-th centroid value is the mode of that attribute's values, 1 ≤ i ≤ n.
Example: k-Means clustering with k = 3 and three centroids m1, m2, m3.
Each cluster is associated with a centroid.
Each data point is assigned to the cluster with the nearest centroid.
A centroid is an n-dimensional vector, where n is the number of attributes in each data point.
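A small illustration of the mean/mode rule in R, using hypothetical toy attributes:

```r
# Numeric attribute -> centroid component is the mean
height <- c(170, 160, 175)
mean(height)                    # 168.33

# Categorical attribute -> centroid component is the mode (most frequent value)
color <- c("red", "blue", "red")
names(which.max(table(color)))  # "red"
```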
7. K-MEANS ALGORITHM
Given the number of clusters K, the K-means algorithm is carried out in 5 steps:
1. Determine k.
2. Set the k initial centroids (seed points).
3. Assign each data point to the cluster with the nearest centroid (minimum distance).
4. Update the centroid of each cluster (the centroid is the center of the cluster, i.e., the mean point of the cluster).
5. Return to step 3; the iteration stops when no assignment changes any more.
8. PROBLEM EXAMPLE
Suppose we have 4 types of medicines, each with two attributes (pH and weight index). Our goal is to group these objects into K = 2 groups of medicine.
Medicine  Weight  pH-Index
A         1       1
B         2       1
C         4       3
D         5       4
[Scatter plot of the four medicines A, B, C, D in the weight/pH plane]
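In R, the worked example on the next slides can be reproduced with the built-in kmeans(); passing A and B as the centers argument fixes the same initial seeds the slides use (with random seeds the cluster labels may differ):

```r
# The four medicines and their two attributes
med <- data.frame(weight = c(1, 2, 4, 5),
                  ph     = c(1, 1, 3, 4),
                  row.names = c("A", "B", "C", "D"))
# Start from seeds c1 = A and c2 = B, as in the worked example
fit <- kmeans(med, centers = med[c("A", "B"), ])
fit$cluster  # A, B -> cluster 1; C, D -> cluster 2
fit$centers  # final centroids (1.5, 1) and (4.5, 3.5)
```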
9. EXAMPLE
Step 1: Use initial seed points for partitioning
Assign each object to the cluster with the nearest seed point, measured by Euclidean distance. Take c1 = A = (1, 1) and c2 = B = (2, 1) as the initial seeds. For object D = (5, 4):
d(D, c1) = √((5 − 1)² + (4 − 1)²) = 5
d(D, c2) = √((5 − 2)² + (4 − 1)²) = √18 ≈ 4.24
So D is assigned to the cluster of c2.
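The same two distances in R, reproducing the computation for D:

```r
D <- c(5, 4); c1 <- c(1, 1); c2 <- c(2, 1)
sqrt(sum((D - c1)^2))  # 5.000000
sqrt(sum((D - c2)^2))  # 4.242641 -> D joins the cluster of c2
```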
10. EXAMPLE
Step 2: Compute new centroids of the current partition
Knowing the members of each
cluster, now we compute the new
centroid of each group based on
these new memberships.
With cluster 1 = {A} and cluster 2 = {B, C, D}:
c1 = (1, 1)
c2 = ((2 + 4 + 5)/3, (1 + 3 + 4)/3) = (11/3, 8/3) ≈ (3.67, 2.67)
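Verifying the new centroids in R:

```r
# Cluster 1 = {A}, cluster 2 = {B, C, D} after the first assignment
c1 <- c(1, 1)
c2 <- colMeans(rbind(B = c(2, 1), C = c(4, 3), D = c(5, 4)))
c2  # 3.666667 2.666667, i.e. (11/3, 8/3)
```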
11. EXAMPLE
Step 2: Renew membership based on new centroids
Compute the distance of all objects to the new centroids and reassign each object to the nearest one; the clusters become {A, B} and {C, D}.
12. EXAMPLE
Step 3: Repeat the first two steps until convergence
Knowing the members of each
cluster, now we compute the new
centroid of each group based on
these new memberships.
With cluster 1 = {A, B} and cluster 2 = {C, D}:
c1 = ((1 + 2)/2, (1 + 1)/2) = (1.5, 1)
c2 = ((4 + 5)/2, (3 + 4)/2) = (4.5, 3.5)
13. EXAMPLE
Step 3: Repeat the first two steps until convergence
Compute the distance of all objects to the new centroids.
Since no object changes its assignment, the algorithm stops.
14. EVALUATING K-MEANS PERFORMANCE
The performance of K-Means clustering can be evaluated with the Sum of Squared Error (SSE). The main idea behind SSE is to measure how homogeneous the data within each cluster are.
Homogeneity is measured by the error/distance between each data point and its centroid: the more homogeneous the data in a cluster, the smaller the distance between each data point and the centroid.
The squared errors are then summed over all clusters (Sum of Squared Error, SSE). The smaller the SSE, the better the clustering result.
15. EVALUATING K-MEANS PERFORMANCE
SSE = Σ(i = 1..K) Σ(x ∈ Ci) dist(mi, x)²
where:
K = the number of clusters
Ci = the i-th cluster
mi = the centroid of the i-th cluster
x = a data point belonging to its cluster
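A direct translation of the SSE formula into R (a sketch; with the built-in kmeans() the same quantity is returned as fit$tot.withinss):

```r
# SSE: squared Euclidean distance of each point to its own centroid, summed.
# X: data matrix, cluster: integer assignments, centers: centroid matrix.
sse <- function(X, cluster, centers) {
  sum(sapply(seq_len(nrow(X)), function(r)
    sum((X[r, ] - centers[cluster[r], ])^2)))
}
```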
18. INTRODUCTION
Hierarchical Clustering Approach
A typical clustering approach that partitions the data set sequentially
Constructs nested partitions layer by layer by grouping objects into a tree of clusters (without needing to know the number of clusters in advance)
Uses a distance matrix as the clustering criterion; a termination condition is needed
Agglomerative vs. Divisive
Two sequential clustering strategies for constructing a tree of clusters
Agglomerative: a bottom-up strategy
Initially each data object is in its own (atomic) cluster
Then merge these atomic clusters into larger and larger clusters
Divisive: a top-down strategy
Initially all objects are in one single cluster
Then the cluster is subdivided into smaller and smaller clusters
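Both strategies are available in R: hclust() in base R is agglomerative, and diana() from the cluster package is divisive. A quick contrast on arbitrary toy data:

```r
library(cluster)                   # for diana()
set.seed(1)
X  <- matrix(rnorm(12), ncol = 2)  # 6 random 2-D points
ag <- hclust(dist(X))              # agglomerative: merge singletons bottom-up
dv <- diana(dist(X))               # divisive: split the full set top-down
```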
19. INTRODUCTION
Illustrative Example
Agglomerative and divisive clustering on the data set {a, b, c, d, e}
Cluster distance
Termination condition
[Diagram: reading left to right (steps 0–4), agglomerative clustering merges a and b into (a, b), d and e into (d, e), then c joins to form (c, d, e), and finally everything merges into (a, b, c, d, e); reading right to left (steps 4–0), divisive clustering performs the same splits in reverse.]
20. CLUSTER DISTANCE MEASURES
[Diagram: single link (min), complete link (max), and average distance between two clusters]
Single link: smallest distance between an element in one cluster and an element in the other, i.e., d(Ci, Cj) = min{d(xip, xjq)}
Complete link: largest distance between an element in one cluster and an element in the other, i.e., d(Ci, Cj) = max{d(xip, xjq)}
Average: average distance between elements in one cluster and elements in the other, i.e., d(Ci, Cj) = avg{d(xip, xjq)}
21. AGGLOMERATIVE ALGORITHM
The Agglomerative algorithm is carried out in three steps:
1) Convert the object attributes to a distance matrix
2) Set each object as a cluster (thus if we have N objects, we will have N clusters at the beginning)
3) Repeat until the number of clusters is one (or a known number of clusters is reached):
   Merge the two closest clusters
   Update the distance matrix
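The three steps map directly onto base R: dist() builds the distance matrix (step 1) and hclust() performs the iterative merging (steps 2 and 3). A minimal sketch on arbitrary data:

```r
set.seed(2)
X  <- matrix(runif(12), ncol = 2)   # any small numeric data set (6 objects)
d  <- dist(X)                       # step 1: attributes -> distance matrix
hc <- hclust(d, method = "single")  # steps 2-3: start from 6 singleton
                                    # clusters, repeatedly merge the closest two
hc$merge   # which clusters were merged at each iteration
hc$height  # the distance at which each merge happened
```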
22. EXAMPLE AND DEMO
Problem: clustering analysis with the agglomerative algorithm
[The slide shows the data matrix and its Euclidean distance matrix]
30. Dendrogram tree representation
Example and Demo
1. In the beginning we have 6
clusters: A, B, C, D, E and F
2. We merge cluster D and F into
cluster (D, F) at distance 0.50
3. We merge cluster A and cluster B
into (A, B) at distance 0.71
4. We merge cluster E and (D, F)
into ((D, F), E) at distance 1.00
5. We merge cluster ((D, F), E) and C
into (((D, F), E), C) at distance 1.41
6. We merge cluster (((D, F), E), C)
and (A, B) into ((((D, F), E), C), (A, B))
at distance 2.50
7. The last cluster contains all the objects, concluding the computation
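The deck's data-matrix slide is not in this extract, but the six 2-D points below (an assumption, chosen so that their single-link merge distances match the values listed above) reproduce the dendrogram:

```r
# Assumed coordinates; their single-link merge heights match steps 2-6 above
pts <- matrix(c(1, 1,   1.5, 1.5,   5, 5,   3, 4,   4, 4,   3, 3.5),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("A", "B", "C", "D", "E", "F"), NULL))
hc <- hclust(dist(pts), method = "single")
round(hc$height, 2)  # 0.50 0.71 1.00 1.41 2.50
plot(hc)             # draws the dendrogram
```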
31. CLUSTERING IN R
library mva:
- Hierarchical clustering: hclust, heatmap
- k-means: kmeans
library class:
- Self-organizing maps: SOM
library cluster:
- pam and other functions
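Note that in current R, hclust() and kmeans() live in the base stats package (the old mva library was merged into it), so no library() call is needed for them; pam() still comes from the cluster package. For example:

```r
library(cluster)                   # for pam()
set.seed(3)
X <- matrix(rnorm(20), ncol = 2)   # 10 random 2-D points
kmeans(X, centers = 2)             # k-means (stats)
hclust(dist(X))                    # hierarchical clustering (stats)
pam(X, k = 2)                      # k-medoids (cluster)
```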
32. ASSIGNMENT T2: K-MEANS & HIERARCHICAL CLUSTERING (SECTION 4)
Individual
To be done on folio/A4 paper (handwritten)
Due next week (Week 12), 21 April 2015
33. NEXT WEEK (WEEK 12)
UNSUPERVISED LEARNING: ASSOCIATION RULE
BAYES THEOREM
Final Project: any topic (from Weeks 1-14) using R; report & demo in Week 15
FP: groups of 3-4 people
Neural Network
Clustering
Bayesian
Association Rule
34. REFERENCES
Flach, Peter. 2012. Machine Learning: The Art and Science of Algorithms that Make Sense of Data. Cambridge University Press.
Tan et al. 2006. Introduction to Data Mining. Addison Wesley.
Ke Chen, COMP24111 Machine Learning, University of Manchester. http://www.cs.man.ac.uk/~kechen/teaching.php
Wikibooks, K-Means Example. http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/K-Means