- 1. A Tutorial on Spectral Clustering
Part 2: Advanced/related Topics
Chris Ding
Computational Research Division
Lawrence Berkeley National Laboratory
University of California
- 2. Advanced/related Topics
• Spectral embedding: simplex cluster structure
• Perturbation analysis
• K-means clustering in embedded space
• Equivalence of K-means clustering and PCA
• Connectivity networks: scaled PCA & Green’s function
• Extension to bipartite graphs: Correspondence
analysis
• Random walks and spectral clustering
• Semi-definite programming and spectral clustering
• Spectral ordering (distance-sensitive ordering)
• Webpage spectral ranking: Page-Rank and HITS
- 3. Spectral Embedding:
Simplex Cluster Structure
• Compute K eigenvectors of the Laplacian.
• Embed objects in the K-dim eigenspace
What is the structure of the clusters?
Simplex Embedding Theorem.
Assume the objects are well characterized by the spectral
clustering objective functions. In the embedded space,
the objects aggregate around K distinct centroids:
• Centroids are located at the K corners of a simplex
• The simplex consists of K basis vectors plus the coordinate origin
• The simplex is rotated by an orthogonal transformation T
• Columns of T are eigenvectors of a K × K embedding matrix Γ
(Ding, 2004)
- 4. K-way Clustering Objectives
J = \sum_{1 \le p < q \le K} \left[ \frac{s(C_p, C_q)}{\rho(C_p)} + \frac{s(C_p, C_q)}{\rho(C_q)} \right] = \sum_k \frac{s(C_k, G - C_k)}{\rho(C_k)}

where s(C_k, C_k) = \sum_{i \in C_k, j \in C_k} w_{ij} and

\rho(C_k) = n_k = |C_k|                       for Ratio Cut
\rho(C_k) = d(C_k) = \sum_{i \in C_k} d_i     for Normalized Cut
\rho(C_k) = s(C_k, C_k)                       for MinMaxCut

G − C_k is the graph complement of C_k.
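To make these definitions concrete, here is a minimal NumPy sketch (ours, not from the tutorial) that evaluates the three objectives for a hard partition of a toy graph; the helper name cut_objective and the example matrix are illustrative assumptions.

```python
import numpy as np

def cut_objective(W, clusters, kind="ncut"):
    """Evaluate sum_k s(C_k, G - C_k) / rho(C_k) for a hard partition.

    W: symmetric similarity matrix; clusters: list of index arrays.
    kind selects rho: 'rcut' -> |C_k|, 'ncut' -> d(C_k), 'mmc' -> s(C_k, C_k).
    """
    n = W.shape[0]
    d = W.sum(axis=1)                          # node degrees d_i
    J = 0.0
    for Ck in clusters:
        comp = np.setdiff1d(np.arange(n), Ck)  # G - C_k
        s_cut = W[np.ix_(Ck, comp)].sum()      # s(C_k, G - C_k)
        if kind == "rcut":
            rho = len(Ck)                      # n_k
        elif kind == "ncut":
            rho = d[Ck].sum()                  # d(C_k)
        else:                                  # MinMaxCut
            rho = W[np.ix_(Ck, Ck)].sum()      # s(C_k, C_k)
        J += s_cut / rho
    return J

# Two dense blocks joined by one weak link.
W = np.array([[0, 1, 1, 0.1, 0],
              [1, 0, 1, 0,   0],
              [1, 1, 0, 0,   0],
              [0.1, 0, 0, 0, 1],
              [0, 0, 0, 1,   0]], dtype=float)
parts = [np.array([0, 1, 2]), np.array([3, 4])]
for kind in ("rcut", "ncut", "mmc"):
    print(kind, cut_objective(W, parts, kind))
```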
- 5. Simplex Spectral Embedding Theorem
Simplex orthogonal transform matrix: T = (t_1, \ldots, t_K)
The t_k are determined by: \Gamma t_k = \lambda_k t_k

Spectral perturbation matrix:

\Gamma = \Omega^{-1/2}
\begin{pmatrix}
h_{11} & -s_{12} & \cdots & -s_{1K} \\
-s_{21} & h_{22} & \cdots & -s_{2K} \\
\vdots & \vdots & \ddots & \vdots \\
-s_{K1} & -s_{K2} & \cdots & h_{KK}
\end{pmatrix}
\Omega^{-1/2}

where s_{pq} = s(C_p, C_q), \; h_{kk} = \sum_{p \ne k} s_{kp}, \; \Omega = \mathrm{diag}[\rho(C_1), \ldots, \rho(C_K)].
- 6. Properties of Spectral Embedding
• Original basis vectors: h_k = (0 \cdots 0, \overbrace{1 \cdots 1}^{n_k}, 0 \cdots 0)^T / \sqrt{n_k}
• Dimension of the embedding is K − 1: (q_2, \ldots, q_K)
– q_1 = (1, \ldots, 1)^T is the constant, trivial eigenvector
– Eigenvalues of Γ (= eigenvalues of D − W)
– The eigenvalues determine how well the clustering objective function characterizes the data
• Exact solution for K = 2
- 7. 2-way Spectral Embedding
(Exact Solution)
Eigenvalues:

\lambda_{Rcut} = \frac{s(C_1, C_2)}{n(C_1)} + \frac{s(C_1, C_2)}{n(C_2)}, \quad
\lambda_{Ncut} = \frac{s(C_1, C_2)}{d(C_1)} + \frac{s(C_1, C_2)}{d(C_2)}, \quad
\lambda_{MMC} = \frac{s(C_1, C_2)}{s(C_1, C_1)} + \frac{s(C_1, C_2)}{s(C_2, C_2)}

These recover the original 2-way clustering objectives.

For Normalized Cut, the orthogonal transform T rotates
h_1 = (1 \cdots 1, 0 \cdots 0)^T, \; h_2 = (0 \cdots 0, 1 \cdots 1)^T
into
q_1 = (1 \cdots 1)^T, \; q_2 = (a, \ldots, a, -b, \ldots, -b)^T

Spectral clustering is inherently self-consistent. (Ding et al., KDD'01)
- 8. Perturbation Analysis
W q = \lambda D q \;\Longleftrightarrow\; \hat{W} z = (D^{-1/2} W D^{-1/2}) z = \lambda z, \quad q = D^{-1/2} z

Assume the data has 3 dense clusters C_1, C_2, C_3 that are sparsely connected:

W = \begin{pmatrix} W_{11} & W_{12} & W_{13} \\ W_{21} & W_{22} & W_{23} \\ W_{31} & W_{32} & W_{33} \end{pmatrix}

The off-diagonal blocks are between-cluster connections, assumed small
and treated as a perturbation.
(Ding et al., KDD'01)
- 9. Perturbation Analysis
0th order:

W^{(0)} = \mathrm{diag}(W_{11}, W_{22}, W_{33}), \quad
\hat{W}^{(0)} = \mathrm{diag}(\hat{W}^{(0)}_{11}, \hat{W}^{(0)}_{22}, \hat{W}^{(0)}_{33})

1st order:

W^{(1)} = \begin{pmatrix} 0 & W_{12} & W_{13} \\ W_{21} & 0 & W_{23} \\ W_{31} & W_{32} & 0 \end{pmatrix}, \quad
\hat{W}^{(1)} = \begin{pmatrix}
\hat{W}_{11} - \hat{W}^{(0)}_{11} & \hat{W}_{12} & \hat{W}_{13} \\
\hat{W}_{21} & \hat{W}_{22} - \hat{W}^{(0)}_{22} & \hat{W}_{23} \\
\hat{W}_{31} & \hat{W}_{32} & \hat{W}_{33} - \hat{W}^{(0)}_{33}
\end{pmatrix}

where

\hat{W}_{pq} = D_{pp}^{-1/2} W_{pq} D_{qq}^{-1/2}, \quad
\hat{W}^{(0)}_{pq} = (D_{p1} + D_{p2} + D_{p3})^{-1/2} W_{pq} (D_{q1} + D_{q2} + D_{q3})^{-1/2}
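A small numeric illustration of this argument (our sketch, not from the slides): for a weakly coupled 3-block W, the leading eigenvectors of D^{-1/2} W D^{-1/2} are nearly constant within each block, as the perturbation analysis predicts. The block sizes and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [20, 20, 20]
n = sum(sizes)
W = 0.01 * rng.random((n, n))              # weak between-cluster connections
start = 0
for s in sizes:                            # dense within-cluster blocks
    W[start:start+s, start:start+s] += 0.9
    start += s
W = (W + W.T) / 2                          # symmetrize
np.fill_diagonal(W, 0)

d = W.sum(axis=1)
What = W / np.sqrt(np.outer(d, d))         # D^{-1/2} W D^{-1/2}
vals, vecs = np.linalg.eigh(What)
z = vecs[:, -3:]                           # top-3 eigenvectors z_k
q = z / np.sqrt(d)[:, None]                # q_k = D^{-1/2} z_k
# q is nearly piecewise-constant: small within-block standard deviation.
for k, (a, b) in enumerate([(0, 20), (20, 40), (40, 60)]):
    print("block", k, "std of q2 within block:", q[a:b, -2].std())
```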
- 10. K-means clustering
• Developed in the 1960s (Lloyd, MacQueen, etc.)
• Computationally efficient (order mN)
• Widely used in practice
– Benchmark to evaluate other algorithms

Given n points in m dimensions: X = (x_1, x_2, \ldots, x_n)

K-means: \min J_K = \sum_{k=1}^{K} \sum_{i \in C_k} \| x_i - c_k \|^2
- 11. K-means Clustering in
Spectral Embedded Space
Simplex spectral embedding theorem provides
theoretical basis for K-means clustering in the
embedded eigenspace
– Cluster centroids are well separated (corners of the simplex)
– K-means clustering is invariant under (i) coordinate rotation x → Tx and (ii) shift x → x + a
– Thus the orthogonal transform T in the simplex embedding is unnecessary
• Many variants of K-means exist (Ng et al.; Bach & Jordan; Zha et al.; Shi & Xu; etc.)
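A minimal sketch of the resulting pipeline (ours, not the tutorial's code; spectral_kmeans is a hypothetical helper, and a plain Lloyd loop stands in for any K-means implementation):

```python
import numpy as np

def spectral_kmeans(W, K, iters=100, seed=0):
    """Spectral embedding + K-means (a sketch).

    Embeds each object into the top-K eigenvectors of D^{-1/2} W D^{-1/2}
    and clusters with Lloyd's algorithm. The rotation T from the simplex
    embedding is irrelevant here: K-means is invariant under rotations.
    """
    d = W.sum(axis=1)
    What = W / np.sqrt(np.outer(d, d))
    _, vecs = np.linalg.eigh(What)
    Y = vecs[:, -K:]                            # K-dim spectral embedding

    rng = np.random.default_rng(seed)
    centers = Y[rng.choice(len(Y), K, replace=False)]
    for _ in range(iters):                      # Lloyd iterations
        dist = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(axis=1)
        for k in range(K):
            if (labels == k).any():
                centers[k] = Y[labels == k].mean(axis=0)
    return labels
```

With the 3-block W from the perturbation sketch above, spectral_kmeans(W, 3) recovers the three blocks.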
- 12. We have proved:
Spectral embedding + K-means clustering is the appropriate method.
We now show:
K-means itself is solved by PCA.
- 13. Equivalence of K-means Clustering
and Principal Component Analysis
• Cluster indicators specify the solution of K-means
clustering
• Principal components are eigenvectors of the
Gram (Kernel) matrix = data projections in the
principal directions of the covariance matrix
• Optimal solution of K-means clustering: the continuous
relaxation of the discrete cluster indicators of K-means is
given by the principal components
(Zha et al., NIPS'01; Ding & He, 2003)
- 14. Principal Component Analysis (PCA)
• Widely used in a large number of different fields
– Best low-rank approximation (SVD theorem; Eckart & Young, 1936): noise reduction
– Unsupervised dimension reduction
– Many generalizations
• The conventional perspective is inadequate to explain the effectiveness of PCA
• New result: principal components are cluster indicators for a well-motivated clustering objective
- 15. Principal Component Analysis
n points in m dimensions: X = (x_1, x_2, \ldots, x_n)

Principal directions u_k: covariance S = XX^T, \quad XX^T u_k = \lambda_k u_k
Principal components v_k: Gram (kernel) matrix X^T X, \quad X^T X v_k = \lambda_k v_k

Singular Value Decomposition: X = \sum_{k=1}^{m} \sqrt{\lambda_k} \, u_k v_k^T
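These identities are easy to verify numerically; the following sketch (ours) checks them with NumPy's SVD on random centered data. The shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 12))            # m=5 dims, n=12 points (columns)
X = X - X.mean(axis=1, keepdims=True)       # center, so XX^T is the covariance

U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
lam = sigma ** 2                            # lambda_k = sigma_k^2

# Principal directions u_k: eigenvectors of XX^T.
print(np.allclose(X @ X.T @ U[:, 0], lam[0] * U[:, 0]))
# Principal components v_k: eigenvectors of X^T X, same eigenvalues.
print(np.allclose(X.T @ X @ Vt[0], lam[0] * Vt[0]))
# SVD expansion X = sum_k sqrt(lambda_k) u_k v_k^T.
print(np.allclose(X, (U * sigma) @ Vt))
```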
- 16. 2-way K -means Clustering
Cluster membership indicator:

q(i) = \begin{cases} +\sqrt{n_2 / (n_1 n)} & \text{if } i \in C_1 \\ -\sqrt{n_1 / (n_2 n)} & \text{if } i \in C_2 \end{cases}

J_K = n \langle x^2 \rangle - J_D, \quad
J_D = \frac{n_1 n_2}{n} \left[ \frac{2\, d(C_1, C_2)}{n_1 n_2} - \frac{d(C_1, C_1)}{n_1^2} - \frac{d(C_2, C_2)}{n_2^2} \right]

Define the distance matrix: D = (d_{ij}), \; d_{ij} = \| x_i - x_j \|^2

J_D = -q^T \tilde{D} q = 2 q^T X^T X q

where \tilde{D} is the centered distance matrix.

\min J_K \;\Rightarrow\; \max J_D
- 17. 2-way K-means Clustering
The cluster indicator satisfies \sum_i q(i) = 0, \; \sum_i q^2(i) = 1.

Relax the restriction that q(i) take discrete values and let it take
continuous values in [−1, 1]. The solution for q is an eigenvector
of the Gram matrix.

Theorem: The (continuous) optimal solution of q is given by the
principal component v_1. Clusters C_1, C_2 are determined by:

C_1 = \{ i \mid v_1(i) < 0 \}, \quad C_2 = \{ i \mid v_1(i) \ge 0 \}

Once C_1, C_2 are computed, iterate K-means to convergence.
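A sketch of this procedure (ours): compute the principal component of the centered Gram matrix and split by sign, then refine with standard K-means. The two-blob data are an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
# Two Gaussian blobs in m=2 dims, stored as columns of X.
A = rng.standard_normal((2, 30)) + np.array([[3.0], [0.0]])
B = rng.standard_normal((2, 30)) - np.array([[3.0], [0.0]])
X = np.hstack([A, B])
X = X - X.mean(axis=1, keepdims=True)       # center the data

gram = X.T @ X                              # Gram (kernel) matrix
vals, vecs = np.linalg.eigh(gram)
v1 = vecs[:, -1]                            # principal component v_1
C1 = np.where(v1 < 0)[0]                    # C1 = {i : v1(i) < 0}
C2 = np.where(v1 >= 0)[0]                   # C2 = {i : v1(i) >= 0}
print(len(C1), len(C2))                     # ~30 / 30: the blobs are recovered
# In practice one would now iterate K-means to convergence.
```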
- 18. Multi-way K-means Clustering
Unsigned Cluster membership indicators h1, …, hK:
     C1  C2  C3
      1   0   0
      1   0   0
      0   1   0   = (h_1, h_2, h_3)
      0   0   1
- 19. Multi-way K-means Clustering
For K \ge 2:

J_K = \sum_i x_i^2 - \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i, j \in C_k} x_i^T x_j

(Unsigned) cluster membership indicators h_1, \ldots, h_K:

h_k = (0 \cdots 0, \overbrace{1 \cdots 1}^{n_k}, 0 \cdots 0)^T / \sqrt{n_k}

J_K = \sum_i x_i^2 - \sum_{k=1}^{K} h_k^T X^T X h_k

Let H_K = (h_1, \ldots, h_K). Then

J_K = \sum_i x_i^2 - \mathrm{Tr}(H_K^T X^T X H_K)
- 20. Multi-way K-means Clustering
Regularized relaxation of K-means clustering

Redundancy in h_1, \ldots, h_K: \quad \sum_{k=1}^{K} n_k^{1/2} h_k = e = (1 \cdots 1)^T

Transform to signed indicator vectors q_1, \ldots, q_K via a K × K orthogonal matrix T:

(q_1, \ldots, q_K) = (h_1, \ldots, h_K) T, \quad Q_K = H_K T

Require the 1st column of T to be (n_1^{1/2}, \ldots, n_K^{1/2})^T / n^{1/2}.
Thus q_1 = e / n^{1/2} = \mathrm{const}.
- 21. Regularized Relaxation of K-means Clustering
J_K = \mathrm{Tr}(Y^T Y) - \mathrm{Tr}(Q_{K-1}^T Y^T Y Q_{K-1}), \quad Q_{K-1} = (q_2, \ldots, q_K)

(Regularized relaxation)

Theorem: The optimal solutions of q_2, \ldots, q_K are given by the
principal components v_2, \ldots, v_K. J_K is bounded below by the
total variance minus the sum of the K − 1 principal eigenvalues of
the covariance matrix:

n \langle y^2 \rangle - \sum_{k=1}^{K-1} \lambda_k \;\le\; \min J_K \;\le\; n \langle y^2 \rangle
- 22. Scaled PCA
Similarity matrix S = (s_{ij}) (generated from XX^T)
D = \mathrm{diag}(d_1, \ldots, d_n), \quad d_i = s_{i.}

Nonlinear re-scaling: \tilde{S} = D^{-1/2} S D^{-1/2}, \quad \tilde{s}_{ij} = s_{ij} / (s_{i.} s_{j.})^{1/2}

Apply SVD on \tilde{S}:

S = D^{1/2} \tilde{S} D^{1/2} = D^{1/2} \sum_k z_k \lambda_k z_k^T D^{1/2} = D \sum_k q_k \lambda_k q_k^T D

q_k = D^{-1/2} z_k is the scaled principal component.

Subtract the trivial component \lambda_0 = 1, \; z_0 = d^{1/2} / s_{..}^{1/2}, \; q_0 = 1:

S - d d^T / s_{..} = D \sum_{k \ge 1} q_k \lambda_k q_k^T D

(Ding et al., 2002)
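A minimal sketch of scaled PCA in code (our reading of this slide; scaled_pca is a hypothetical helper):

```python
import numpy as np

def scaled_pca(S, K):
    """Leading nontrivial scaled principal components of S (a sketch).

    Re-scales S to S~ = D^{-1/2} S D^{-1/2}, diagonalizes, skips the
    trivial component (lambda_0 = 1, z_0 ~ d^{1/2}), and returns the
    next K scaled components q_k = D^{-1/2} z_k.
    """
    d = S.sum(axis=1)                        # d_i = s_i.
    S_tilde = S / np.sqrt(np.outer(d, d))    # nonlinear re-scaling
    vals, Z = np.linalg.eigh(S_tilde)
    order = np.argsort(vals)[::-1]           # largest eigenvalue first
    keep = order[1:K+1]                      # skip the trivial lambda_0 = 1
    Q = Z[:, keep] / np.sqrt(d)[:, None]     # q_k = D^{-1/2} z_k
    return vals[keep], Q
```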
- 23. Optimality Properties of Scaled PCA
Scaled principal components have optimality properties:

Ordering
– Adjacent objects along the order are similar
– Far-away objects along the order are dissimilar
– The optimal solution for the permutation indexes is given by scaled PCA

Clustering
– Maximize within-cluster similarity
– Minimize between-cluster similarity
– The optimal solution for the cluster membership indicators is given by scaled PCA
- 24. Difficulty of K-way clustering
• 2-way clustering uses a single eigenvector
• K-way clustering uses several eigenvectors
• How to recover the 0-1 cluster indicators H?

Eigenvectors Q = (q_1, \ldots, q_K) have both positive and negative entries.
Indicators H = (h_1, \ldots, h_K), with Q = HT.

Avoid computing the transformation T:
• Do K-means, which is invariant under T
• Compute the connectivity network QQ^T, which cancels T
- 25. Connectivity Network
C_{ij} = \begin{cases} 1 & \text{if } i, j \text{ belong to the same cluster} \\ 0 & \text{otherwise} \end{cases}

SPCA provides: \quad C \cong D \sum_{k=1}^{K} q_k \lambda_k q_k^T D

Green's function: \quad C \approx G = \sum_{k=2}^{K} \frac{1}{1 - \lambda_k} q_k q_k^T

Projection matrix: \quad C \approx P \equiv \sum_{k=1}^{K} q_k q_k^T

(Ding et al., 2002)
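The three approximations can be computed directly from the scaled principal components; a sketch (ours, with connectivity_networks as a hypothetical helper):

```python
import numpy as np

def connectivity_networks(S, K):
    """Three approximations of the connectivity matrix C from the
    K leading scaled principal components (a sketch of this slide)."""
    d = S.sum(axis=1)
    S_tilde = S / np.sqrt(np.outer(d, d))
    vals, Z = np.linalg.eigh(S_tilde)
    order = np.argsort(vals)[::-1][:K]            # top-K, trivial first
    lam = vals[order]
    Q = Z[:, order] / np.sqrt(d)[:, None]         # q_k = D^{-1/2} z_k

    D = np.diag(d)
    C_spca = D @ (Q * lam) @ Q.T @ D              # D sum_k q_k lam_k q_k^T D
    # Green's function skips the trivial k=1 term (lambda_1 = 1).
    C_green = (Q[:, 1:] / (1 - lam[1:])) @ Q[:, 1:].T
    C_proj = Q @ Q.T                              # projection matrix
    return C_spca, C_green, C_proj
```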
- 26. Connectivity network
• Similar to Hopfield network
• Mathematical basis: projection matrix
• Show self-aggregation clearly
• Drawback: how to recover clusters
– Apply K-means directly on C
– Use linearized assignment with cluster crossing
and spectral ordering (ICML’04)
- 27. Connectivity Network: Example 1
[Figure: similarity matrix W and the resulting connectivity matrix; λ_2 = 0.300, λ_2 = 0.268]
Between-cluster connections are suppressed, within-cluster connections
are enhanced: the effect of self-aggregation.
- 28. Connectivity of Internet Newsgroups
Five newsgroups: NG2: comp.graphics, NG9: rec.motorcycles,
NG10: rec.sport.baseball, NG15: sci.space, NG18: talk.politics.mideast.
100 articles from each group; 1000 words; tf.idf weighting; cosine similarity.

Clustering accuracy: spectral clustering 89%, direct K-means 66%.

[Figure: cosine similarity matrix vs. connectivity matrix]
- 29. Spectral embedding is not
topology preserving
700 3-D data points forming two interlocked rings.
In eigenspace, the rings shrink and separate.
- 30. Correspondence Analysis (CA)
• Mainly used in graphical display of data
• Popular in France (Benzécri, 1969)
• Long history
– Simultaneous row and column regression (Hirschfeld,
1935)
– Reciprocal averaging (Richardson & Kuder, 1933; Horst, 1935; Fisher, 1940; Hill, 1974)
– Canonical correlations, dual scaling, etc.
• The formulation is a bit complicated ("convoluted", Jolliffe, 2002, p. 342)
• "A neglected method" (Hill, 1974)
- 31. Scaled PCA on a Contingency Table
⇒ Correspondence Analysis
Nonlinear re-scaling: \tilde{P} = D_r^{-1/2} P D_c^{-1/2}, \quad \tilde{p}_{ij} = p_{ij} / (p_{i.} p_{.j})^{1/2}

Apply SVD on \tilde{P} and subtract the trivial component:

P - r c^T / p_{..} = D_r \sum_{k \ge 1} f_k \lambda_k g_k^T D_c

r = (p_{1.}, \ldots, p_{n.})^T, \quad c = (p_{.1}, \ldots, p_{.n})^T

f_k = D_r^{-1/2} u_k, \quad g_k = D_c^{-1/2} v_k

are the scaled row and column principal components (standard coordinates in CA).
- 32. Information Retrieval
Bell Lab tech memos
5 comp-sci and 4 applied-math memo titles:
C1: Human machine interface for lab ABC computer applications
C2: A survey of user opinion of computer system response time
C3: The EPS user interface management system
C4: System and human system engineering testing of EPS
C5: Relation of user-perceived response time to error management
M1: The generation of random, binary, unordered trees
M2: The intersection graph of paths in trees
M3: Graph minors IV: widths of trees and well-quasi-ordering
M4: Graph minors: A survey
- 33. Word-document matrix: row/col clustering
words      c4 c1 c3 c5 c2 m4 m3 m2 m1
human       1  1
EPS         1     1
interface      1  1
system      2     1     1
computer       1        1
user              1  1  1
response             1  1
time                 1  1
survey                  1  1
minors                     1  1
graph                      1  1  1
tree                          1  1  1
- 34. Bipartite Graph: 3 types of Connectivity networks
Stack the row and column embeddings:

Q_K = (q_1, \ldots, q_K) = \begin{pmatrix} F_K \\ G_K \end{pmatrix}, \quad
Q_K Q_K^T = \begin{pmatrix} F_K F_K^T & F_K G_K^T \\ G_K F_K^T & G_K G_K^T \end{pmatrix}

F_K F_K^T \approx \begin{pmatrix} e_{r_1} e_{r_1}^T / 2 s_{11} & 0 \\ 0 & e_{r_2} e_{r_2}^T / 2 s_{22} \end{pmatrix} \quad row-row clustering

G_K G_K^T \approx \begin{pmatrix} e_{c_1} e_{c_1}^T / 2 s_{11} & 0 \\ 0 & e_{c_2} e_{c_2}^T / 2 s_{22} \end{pmatrix} \quad column-column clustering

F_K G_K^T \approx \begin{pmatrix} e_{r_1} e_{c_1}^T / 2 s_{11} & 0 \\ 0 & e_{r_2} e_{c_2}^T / 2 s_{22} \end{pmatrix} \quad row-column association
- 35. Example
[Figure: original data matrix; row-row connectivity FF^T; column-column connectivity GG^T; row-column connectivity FG^T; λ_2 = 0.456, λ_2 = 0.477]
- 36. Internet Newsgroups
Simultaneous clustering
of documents and words
- 37. Random Walks and Normalized Cut
Similarity matrix W; \; P = D^{-1} W is a stochastic matrix.

\pi^T P = \pi^T \;\Rightarrow\; equilibrium distribution \pi = d / \textstyle\sum_i d_i

P x = \lambda x \;\Rightarrow\; W x = \lambda D x \;\Rightarrow\; (D - W) x = (1 - \lambda) D x

Random walks between A and B:

J_{NormCut} = \frac{P(A \to B)}{\pi(A)} + \frac{P(B \to A)}{\pi(B)}

(Meila & Shi, 2001)

PageRank: P = \alpha D_{out}^{-1} L + (1 - \alpha) e e^T / n
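A small numeric check of the Meila & Shi identity (our sketch, on a toy two-cluster graph of our choosing):

```python
import numpy as np

W = np.array([[0, 1, 1, 0.2],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [0.2, 0, 0, 0]], dtype=float)
d = W.sum(axis=1)
P = W / d[:, None]                          # P = D^{-1} W, row-stochastic
pi = d / d.sum()                            # equilibrium distribution ~ d

A, B = np.array([0, 1, 2]), np.array([3])
# P(A -> B) = sum_{i in A, j in B} pi_i P_ij, and similarly for B -> A.
PAB = (pi[A, None] * P[np.ix_(A, B)]).sum()
PBA = (pi[B, None] * P[np.ix_(B, A)]).sum()
J_ncut = PAB / pi[A].sum() + PBA / pi[B].sum()

# Direct Normalized Cut: s(A,B)/d(A) + s(A,B)/d(B) -- the two agree.
s_AB = W[np.ix_(A, B)].sum()
print(J_ncut, s_AB / d[A].sum() + s_AB / d[B].sum())
```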
- 38. Semi-definite Programming for
Normalized Cut
Normalized Cut: \; y_k = D^{1/2} (0 \cdots 0, \overbrace{1 \cdots 1}^{n_k}, 0 \cdots 0)^T / \| D^{1/2} h_k \|

\tilde{W} = D^{-1/2} W D^{-1/2}

Optimize: \; \min_Y \mathrm{Tr}(Y^T (I - \tilde{W}) Y), subject to Y^T Y = I

\Rightarrow \mathrm{Tr}[(I - \tilde{W}) Y Y^T] \;\Rightarrow\; \min_Z \mathrm{Tr}[(I - \tilde{W}) Z], \quad Z = Y Y^T

s.t. \; Z \ge 0 \text{ (elementwise)}, \; Z \succeq 0, \; Z d = d, \; \mathrm{Tr}\, Z = K, \; Z = Z^T

Compute Z via SDP; factor Z = Y' Y'^T; set Y'' = D^{-1/2} Y'; run K-means on Y''.
(Xing & Jordan, 2003)

Z is a connectivity network.
- 39. Spectral Ordering
Hill, 1970, spectral embedding: find a coordinate x to minimize

J = \sum_{ij} (x_i - x_j)^2 w_{ij} = x^T (D - W) x

Solutions are eigenvectors of the Laplacian.

Barnard, Pothen & Simon, 1993, envelope reduction of a sparse matrix:
find an ordering such that the envelope is minimized

\min \sum_{ij} (i - j)^2 w_{ij} \;\Rightarrow\; \min \sum_{ij} (\pi_i - \pi_j)^2 w_{ij}
- 40. Distance-sensitive ordering
The ordering is determined by permutation indexes \pi(1, \ldots, n) = (\pi_1, \ldots, \pi_n).

Example with 4 variables: for a given ordering there are three
distance-1 pairs, two distance-2 pairs, and one distance-3 pair:

0 1 2 3
  0 1 2
    0 1
      0

J_d(\pi) = \sum_{i=1}^{n-d} s_{\pi_i, \pi_{i+d}}

\min_\pi J, \quad J(\pi) = \sum_{d=1}^{n-1} d^2 J_d(\pi)

The larger the distance, the larger the weight: large-distance similarities
are reduced more than small-distance similarities.
(Ding & He, ICML'04)
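A sketch of both pieces (ours): evaluating the distance-sensitive objective for a given permutation, and the spectral ordering suggested by the theorem on the next slide. J_distance and spectral_order are hypothetical helper names.

```python
import numpy as np

def J_distance(S, perm):
    """J(pi) = sum_d d^2 J_d(pi), with J_d = sum_i s_{pi_i, pi_{i+d}}."""
    n = len(perm)
    J = 0.0
    for dist in range(1, n):
        Jd = sum(S[perm[i], perm[i + dist]] for i in range(n - dist))
        J += dist ** 2 * Jd
    return J

def spectral_order(S):
    """Order objects by the scaled principal component q_1 of
    (D - S) q = lambda D q (equivalently, the first nontrivial
    eigenvector of D^{-1/2} S D^{-1/2})."""
    d = S.sum(axis=1)
    S_tilde = S / np.sqrt(np.outer(d, d))
    vals, Z = np.linalg.eigh(S_tilde)
    q1 = Z[:, np.argsort(vals)[-2]] / np.sqrt(d)   # skip trivial top vector
    return np.argsort(q1)
```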
- 41. Distance-sensitive ordering
Theorem. The continuous optimal solution for the discrete inverse
permutation indexes is given by the scaled principal component q_1.

The shifted and scaled inverse permutation indexes are

q_i = \frac{\pi_i^{-1} - (n + 1)/2}{n/2} = \left\{ \frac{1 - n}{n}, \frac{3 - n}{n}, \ldots, \frac{n - 1}{n} \right\}

Relax the restriction on q and allow it to be continuous. The solution
for q becomes the eigenvector of

(D - S) q = \lambda D q
- 42. Re-ordering of Genes and Tissues
[Figure: gene and tissue matrix before and after re-ordering]

r = \frac{J(\pi)}{J(\mathrm{random})} = 0.18, \quad
r_{d=1} = \frac{J_{d=1}(\pi)}{J_{d=1}(\mathrm{random})} = 3.39

(C. Ding, RECOMB 2002)
- 43. Webpage Spectral Ranking
Rank webpages from the hyperlink topology.
L: adjacency matrix of the web subgraph.

PageRank (Page & Brin): rank according to the principal eigenvector \pi
(the equilibrium distribution):

\pi^T T = \pi^T, \quad T = 0.8\, D_{out}^{-1} L + 0.2\, e e^T / n

HITS (Kleinberg): rank according to the principal eigenvector of the
authority matrix:

(L^T L) q = \lambda q

The eigenvectors can be obtained in closed form.
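Both rankings are simple power iterations; a sketch (ours), assuming L[i, j] = 1 for a hyperlink i → j and adding the usual 1/n normalization of the teleportation term:

```python
import numpy as np

def pagerank(L, alpha=0.8, iters=100):
    """Power iteration for PageRank: pi^T T = pi^T at the fixpoint."""
    n = L.shape[0]
    dout = np.maximum(L.sum(axis=1), 1)         # out-degrees (avoid 0-div)
    T = alpha * (L / dout[:, None]) + (1 - alpha) / n
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):
        pi = pi @ T
        pi /= pi.sum()
    return pi

def hits_authority(L, iters=100):
    """Power iteration for the HITS authority vector: (L^T L) q = lam q."""
    q = np.ones(L.shape[1])
    for _ in range(iters):
        q = L.T @ (L @ q)
        q /= np.linalg.norm(q)
    return q

L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(pagerank(L))
print(hits_authority(L))
```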
- 44. Webpage Spectral Ranking
HITS (Kleinberg) ranking algorithm.
Assume the web graph is a fixed-degree-sequence random graph
(Aiello, Chung & Lu, 2000).

Theorem. Eigenvalues of L^T L:

\lambda_1 \cong h_1 \ge \lambda_2 \cong h_2 \ge \cdots, \quad h_i = d_i - \frac{d_i^2}{n - 1}

Eigenvectors:

u_k = \left( \frac{d_1}{\lambda_k - h_1}, \frac{d_2}{\lambda_k - h_2}, \ldots, \frac{d_n}{\lambda_k - h_n} \right)^T

The principal eigenvector u_1 is monotonically decreasing if d_1 > d_2 > d_3 > \cdots

\Rightarrow HITS ranking is identical to indegree ranking.
(Ding et al., SIAM Review '04)
- 45. Webpage Spectral Ranking
PageRank: weight normalization.
HITS: mutual reinforcement.

Combining and generalizing PageRank and HITS \Rightarrow
ranking based on the similarity graph S = L^T D_{out}^{-1} L.

Random walks on this similarity graph have the equilibrium distribution

(d_1, d_2, \ldots, d_n)^T / 2|E|

PageRank ranking is identical to indegree ranking
(to 1st order approximation, due to the combination of PageRank and HITS).
(Ding et al., SIGIR'02)
- 46. PCA: a Unified Framework
for clustering and ordering
• PCA is equivalent to K-means clustering
• Scaled PCA has two optimality properties:
– Distance-sensitive ordering
– Min-max principle clustering
• SPCA on a contingency table \Rightarrow Correspondence Analysis
– Simultaneous ordering of rows and columns
– Simultaneous clustering of rows and columns
• Resolves open problems:
– The relationship between Correspondence Analysis and PCA (open problem since the 1940s)
– The relationship between PCA and K-means clustering (open problem since the 1960s)
- 47. Spectral Clustering:
a rich spectrum of topics
A comprehensive framework for learning.
A tutorial review of spectral clustering.
The tutorial website will post all related papers (send in your papers).
- 48. Acknowledgment
Hongyuan Zha, Penn State
Horst Simon, Lawrence Berkeley Lab
Ming Gu, UC Berkeley
Xiaofeng He, Lawrence Berkeley Lab
Michael Jordan, UC Berkeley
Michael Berry, U. Tennessee, Knoxville
Inderjit Dhillon, UT Austin
George Karypis, U. Minnesota
Haesun Park, U. Minnesota
Work supported by Office of Science, Dept. of Energy