Graph mining with kernel self-organizing map
Nathalie Villa-Vialaneix
http://www.nathalievilla.org
Joint work with Fabrice Rossi, INRIA, Rocquencourt, France
Institut de Mathématiques de Toulouse - IUT de Carcassonne, Université de Perpignan, France
SanTouVal, February 1st, 2008
Nathalie Villa - nathalie.villa@math.univ-toulouse.fr SanTouVal - Feb. 2008
Table of contents
1 Motivations
2 Dissimilarities and distances between vertices
3 Kernel SOM
4 Application and comments
Exploring a large historical database
Data
1000 agrarian contracts,
from four seigneuries (about 10 villages) in southwestern France,
established between 1250 and 1350 (before the Hundred Years' War).
Historian's questions:
are social links based on family or on geography?
are there central people playing a major social role?
...
⇒ Data mining is required.
A graph clustering problem
From the database, building a weighted graph:
with 615 vertices $x_1, \dots, x_n$ := peasants found in the contracts;
with weights $(w_{i,j})_{i,j=1,\dots,n}$, where $w_{i,j}$ := number of contracts in which $x_i$ and $x_j$ are both mentioned.
Number of vertices: 615
Number of edges: 4193
Total of weights: 40 329
Diameter: 10
Density: 2.2%
Goal: clustering the vertices into homogeneous social groups to understand the structure of the peasant community.
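As a quick check of the figures above, the density can be recomputed from the number of vertices and edges. This is a minimal sketch that assumes the usual definition of density for an undirected graph (number of edges divided by the number of possible vertex pairs); the variable names are illustrative only.

```python
# Density of an undirected graph: edges / possible pairs of vertices.
# Assumption: this is the definition behind the figure quoted on the slide.
n_vertices = 615
n_edges = 4193

max_edges = n_vertices * (n_vertices - 1) // 2
density = n_edges / max_edges

print(f"possible pairs: {max_edges}")
print(f"density: {100 * density:.1f}%")
```

Under this definition the value rounds to the 2.2% reported above.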
Other fields modelled by large graphs
Computer science: World Wide Web, P2P networks...
Social networks
Biology: protein interactions, neuronal networks...
Business, management: transportation networks, industry partnerships...
Question: understanding the structure of these large graphs
Clustering: building relevant homogeneous groups;
Graph drawing: giving a global representation of the graph.
Here: Self-Organizing Map for non-vectorial data.
Usual dissimilarities between vertices
The Dice (Jaccard) index, for non-weighted graphs:
$$D(x_i, x_j) = \frac{|\Gamma(x_i) \cap \Gamma(x_j)|}{|\Gamma(x_i)| + |\Gamma(x_j)|},$$
where $\Gamma(x)$ denotes the set of neighbors of $x$;
Dissimilarities based on the shortest paths;
Dissimilarities or distances based on the Laplacian matrix: spectral clustering.
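The index above can be illustrated on a small example. The sketch below assumes that $\Gamma(x)$ is the set of neighbors of $x$ and uses the formula exactly as written on the slide (the commonly used Dice coefficient has an extra factor 2 in the numerator). The function name `dice_index` and the toy graph are hypothetical.

```python
def dice_index(adjacency, i, j):
    """|G(i) & G(j)| / (|G(i)| + |G(j)|), where G(v) is the neighbor set of v."""
    gi, gj = adjacency[i], adjacency[j]
    total = len(gi) + len(gj)
    return len(gi & gj) / total if total else 0.0

# Hypothetical toy graph, stored as vertex -> set of neighbors.
toy_graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

# b and c share the neighbor a: 1 / (2 + 2).
print(dice_index(toy_graph, "b", "c"))
# a and d have no common neighbor.
print(dice_index(toy_graph, "a", "d"))
```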
Laplacian
Definitions
For a graph with vertices $V = \{x_1, \dots, x_n\}$ having positive weights $(w_{i,j})_{i,j=1,\dots,n}$ such that, for all $i, j = 1, \dots, n$, $w_{i,j} = w_{j,i}$ and $d_i = \sum_{j=1}^{n} w_{i,j}$,
Laplacian: $L = (L_{i,j})_{i,j=1,\dots,n}$ where
$$L_{i,j} = \begin{cases} -w_{i,j} & \text{if } i \neq j \\ d_i & \text{if } i = j \end{cases}$$
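Laplacian: property I [von Luxburg, 2007]
Connected subgraphs
$\mathrm{Ker}\, L = \mathrm{Span}\{\mathbf{1}_{A_1}, \dots, \mathbf{1}_{A_k}\}$ where $\mathbf{1}_{A_i}$ is the indicator vector of the vertices of the $i$th connected component of the graph.
Example (figure: graph on vertices 1 to 5 with two connected components, $\{1, 4, 5\}$ and $\{2, 3\}$):
$$\mathrm{Ker}\, L = \mathrm{Span}\left\{ (1, 0, 0, 1, 1)^T ;\ (0, 1, 1, 0, 0)^T \right\}$$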
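The property can be checked numerically on the five-vertex example. The sketch below uses NumPy; the edges inside each component are an assumption (the original figure is not available), chosen only so that the components are {1, 4, 5} and {2, 3}.

```python
import numpy as np

# Assumed edges, chosen only to give the components {1, 4, 5} and {2, 3}.
edges = [(1, 4), (4, 5), (2, 3)]
n = 5
W = np.zeros((n, n))
for a, b in edges:
    W[a - 1, b - 1] = W[b - 1, a - 1] = 1.0

# Laplacian L = D - W, with D the diagonal matrix of degrees.
L = np.diag(W.sum(axis=1)) - W

# The dimension of Ker L is the number of (numerically) zero eigenvalues.
eigenvalues, _ = np.linalg.eigh(L)
kernel_dim = int(np.sum(np.abs(eigenvalues) < 1e-10))

# Indicator vectors of the two connected components.
ind_145 = np.array([1, 0, 0, 1, 1], dtype=float)
ind_23 = np.array([0, 1, 1, 0, 0], dtype=float)

print(kernel_dim)
print(L @ ind_145, L @ ind_23)
```

Both indicator vectors are sent to zero by L, and the kernel has one dimension per connected component.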
Laplacian: property II [Boulet et al., 2008]
Perfect community: complete subgraph (clique) whose vertices share the same neighbors outside the clique.
Laplacian and perfect communities
For a non-weighted graph,
the graph has a perfect community with $m$ vertices
⇔
$L$ has $m$ eigenvectors that all vanish on the same $n - m$ coordinates.
Application (figure not reproduced).
But: only 1/3 of the graph can be drawn this way.
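Laplacian: property III [von Luxburg, 2007]
Min Cut problem: suppose that we have a connected graph. Finding a classification of the vertices of the graph, $A_1, \dots, A_k$, such that
$$\frac{1}{2} \sum_{i=1}^{k} \sum_{j \in A_i,\ j' \notin A_i} w_{j,j'}$$
is minimum is equivalent to solving
$$H = \arg\min_{h \in \mathbb{R}^{n \times k}} \mathrm{Tr}\left(h^T L h\right) \quad \text{subject to} \quad \begin{cases} h^T h = I \\ h_i = \frac{1}{\sqrt{|A_i|}} \mathbf{1}_{A_i} \end{cases}$$
⇒ NP-complete problem.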
Laplacian: property III [von Luxburg, 2007]
Relaxation: dropping the constraint on the form of the columns $h_i$, the Min Cut problem can be approached by
$$H = \arg\min_{h \in \mathbb{R}^{n \times k}} \mathrm{Tr}\left(h^T L h\right) \quad \text{subject to} \quad h^T h = I.$$
Spectral clustering: compute the eigenvectors of $L$ associated with its $k$ smallest eigenvalues, store them as the columns of $H$, and classify the rows of $H$.
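The spectral clustering recipe can be sketched as follows with NumPy. This is an illustration under assumptions, not the method used in the talk: the rows of $H$ are grouped with a plain k-means (Lloyd iterations with a deterministic farthest-point start), and the helper names `spectral_embedding` and `kmeans` as well as the toy graph are hypothetical.

```python
import numpy as np

def spectral_embedding(W, k):
    """Columns: eigenvectors of L = D - W for the k smallest eigenvalues."""
    L = np.diag(W.sum(axis=1)) - W
    _, vectors = np.linalg.eigh(L)  # eigenvalues come in ascending order
    return vectors[:, :k]

def kmeans(points, k, n_iter=20):
    """Plain Lloyd iterations with a deterministic farthest-point start."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels

# Hypothetical toy graph: two triangles joined by one weak edge.
W = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[a, b] = W[b, a] = 1.0
W[2, 3] = W[3, 2] = 0.1

H = spectral_embedding(W, k=2)
labels = kmeans(H, k=2)
print(labels)
```

On this toy graph the two triangles end up in different groups, which is the cut of smallest weight.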
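A regularized version of L
Regularization: the diffusion matrix: for $\beta > 0$,
$$K^\beta = e^{-\beta L} = \sum_{k=0}^{+\infty} \frac{(-\beta L)^k}{k!}.$$
⇒
$$k^\beta : V \times V \to \mathbb{R}, \quad (x_i, x_j) \mapsto K^\beta_{i,j}$$
is the diffusion kernel (or heat kernel).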
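The diffusion matrix can be computed from the eigendecomposition of the symmetric matrix $L$ and compared with the truncated power series. The sketch below uses NumPy only; the function names and the toy graph (a path on three vertices) are assumptions made for the illustration.

```python
import numpy as np

def diffusion_matrix(L, beta):
    """K^beta = exp(-beta L), via the eigendecomposition of the symmetric L."""
    eigenvalues, U = np.linalg.eigh(L)
    return (U * np.exp(-beta * eigenvalues)) @ U.T

def diffusion_matrix_series(L, beta, n_terms=30):
    """Truncated power series: sum of (-beta L)^k / k! for k = 0..n_terms-1."""
    total = np.zeros_like(L)
    term = np.eye(L.shape[0])  # the k = 0 term is the identity
    for k in range(n_terms):
        total = total + term
        term = term @ (-beta * L) / (k + 1)
    return total

# Hypothetical toy graph: a path on three vertices with unit weights.
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
L = np.diag(W.sum(axis=1)) - W

K = diffusion_matrix(L, beta=0.5)
K_series = diffusion_matrix_series(L, beta=0.5)
print(np.round(K, 4))
```

Both computations agree, and each row of $K^\beta$ sums to 1 because the constant vector lies in the kernel of $L$.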
Diffusion process on the graph
If $Z_0 = (1\ 1\ 1\ \dots\ 1\ 1)^T$ is the "energy" of each vertex at time 0 and if a small fraction $\epsilon$ of this energy is propagated along the edges of the graph at each time step, then after $t$ steps, the energy of the vertices of the graph is:
$$Z_t = (I - \epsilon L)^t Z_0$$
Limits: with time step $\Delta t$, replace $t \to t/\Delta t$ and $\epsilon \to \epsilon \Delta t$; then $\Delta t \to 0$ (continuous process) gives
$$\lim_{\Delta t \to 0} Z_t = e^{-\epsilon t L} Z_0 = K^{\epsilon t} Z_0$$
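The passage from the discrete process to the continuous one can be observed numerically. The sketch below assumes the sign convention of the diffusion matrix $e^{-\beta L}$, writes `eps_t` for the product of the propagated fraction and the time, and starts from energy concentrated on a single vertex (an assumption made so that the diffusion is visible; a constant vector is left unchanged by $L$). The toy graph is hypothetical.

```python
import numpy as np

# Hypothetical toy graph: a path on three vertices with unit weights.
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
L = np.diag(W.sum(axis=1)) - W

eps_t = 0.5                      # assumed value of (fraction * time)
z0 = np.array([1.0, 0.0, 0.0])   # assumed start: all energy on the first vertex

def discrete_diffusion(L, z0, eps_t, n_steps):
    """n_steps small steps, each propagating a fraction eps_t / n_steps."""
    step = np.eye(len(z0)) - (eps_t / n_steps) * L
    return np.linalg.matrix_power(step, n_steps) @ z0

# Continuous limit exp(-eps_t L) z0, via the eigendecomposition of L.
eigenvalues, U = np.linalg.eigh(L)
z_limit = (U * np.exp(-eps_t * eigenvalues)) @ U.T @ z0

errors = [
    float(np.max(np.abs(discrete_diffusion(L, z0, eps_t, m) - z_limit)))
    for m in (10, 100, 1000)
]
print(errors)
```

The gap to the continuous limit shrinks as the number of steps grows.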
Properties
1 Diffusion on the graph: $k^\beta(x_i, x_j)$ is the quantity of energy accumulated in $x_j$ after a given time if energy 1 is injected in $x_i$ at time 0 and if diffusion is done continuously along the edges. $\beta$: intensity of diffusion;
2 Regularization operator: for $u \in \mathbb{R}^n \sim V$, $u^T K^\beta u$ is higher for vectors $u$ that vary a lot over "close" vertices of the graph. $\beta$: intensity of regularization (for small $\beta$, direct neighbors are more important);
3 Reproducing kernel property: $k^\beta$ is symmetric and positive ⇒ there exist a Hilbert space $(\mathcal{H}, \langle \cdot, \cdot \rangle)$ and $\phi : V \to \mathcal{H}$ such that
$$k^\beta(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle.$$
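The reproducing kernel property means that distances in $\mathcal{H}$ can be computed from the kernel alone. The sketch below builds one possible explicit feature map (the columns of the matrix square root of $K^\beta$, an illustrative choice) on a hypothetical toy graph and checks that it reproduces both the kernel and the induced squared distance.

```python
import numpy as np

# Hypothetical toy graph: a path on three vertices with unit weights.
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
L = np.diag(W.sum(axis=1)) - W
beta = 0.5

eigenvalues, U = np.linalg.eigh(L)
K = (U * np.exp(-beta * eigenvalues)) @ U.T

# One possible feature map: phi(x_i) is the i-th column of the square root of K.
Phi = (U * np.exp(-0.5 * beta * eigenvalues)) @ U.T

def kernel_distance_sq(K, i, j):
    """Squared distance between phi(x_i) and phi(x_j), from the kernel only."""
    return K[i, i] + K[j, j] - 2.0 * K[i, j]

explicit = float(np.sum((Phi[:, 0] - Phi[:, 2]) ** 2))
print(kernel_distance_sq(K, 0, 2), explicit)
```

This identity is what lets a prototype-based method work with $k^\beta$ without ever forming $\phi$.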
Kohonen map
Mapping the data onto a 2-dimensional map
Each neuron of the map, $i = 1, \dots, M$, is associated with a prototype, $p_i \in \mathcal{H}$;
Neurons are related to each other by a neighborhood relationship ("distance": $d$).
Classifying the vertices on the map
Each $x_i$ is associated with a neuron (cluster or class) of the map, $f(x_i)$.
Preserving the initial topology
Energy
The goal is to minimize the energy of the map:
$$E = \sum_{i=1}^{M} \int h(d(f(x), i)) \, \|x - p_i\|^2_{\mathcal{H}} \, dP(x)$$
where $h$ is a decreasing function (e.g. $h(t) = \alpha e^{-t/2\sigma^2}$).
Energy is approached by its empirical version:
$$E^n = \sum_{j=1}^{n} \sum_{i=1}^{M} h(d(f(x_j), i)) \, \|x_j - p_i\|^2_{\mathcal{H}}$$
and minimization is approached by the SOM algorithm.
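Batch kernel SOM [Villa and Rossi, 2007]
Initialize randomly $\gamma^0_{ji} \in \mathbb{R}$ ($i = 1, \dots, n$; $j = 1, \dots, M$) and $p^0_j = \sum_{i=1}^{n} \gamma^0_{ji} \phi(x_i)$.
Then, for $l = 1, \dots, n$ repeat
Assignment step
for all $x_i$,
$$f^l(x_i) = \arg\min_{j=1,\dots,M} \left\| \phi(x_i) - \sum_{i'=1}^{n} \gamma^{l-1}_{ji'} \phi(x_{i'}) \right\|_{\mathcal{H}}$$
Representation step
$$\gamma^l_j = \arg\min_{\gamma \in \mathbb{R}^n} \sum_{i=1}^{n} h(d(f^l(x_i), j)) \left\| \phi(x_i) - \sum_{u=1}^{n} \gamma_u \phi(x_u) \right\|^2_{\mathcal{H}}$$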
Batch kernel SOM [Villa and Rossi, 2007]
Both steps can be written with the kernel $k^\beta$ only:
Assignment step
for all $x_i$,
$$f^l(x_i) = \arg\min_{j=1,\dots,M} \sum_{u,u'=1}^{n} \gamma^{l-1}_{ju} \gamma^{l-1}_{ju'} k^\beta(x_u, x_{u'}) - 2 \sum_{u=1}^{n} \gamma^{l-1}_{ju} k^\beta(x_u, x_i)$$
Representation step
$$\gamma^l_{ji} = \frac{h(d(f^l(x_i), j))}{\sum_{i'=1}^{n} h(d(f^l(x_{i'}), j))}$$
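The two steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the fixed neighborhood width `sigma`, the number of iterations, the normalized random initialization, the squared Euclidean distance on the map grid and the toy graph are all choices made for the example.

```python
import numpy as np

def grid_distances(rows, cols):
    """Map distance d between neurons (assumed: squared Euclidean on the grid)."""
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    return (diff ** 2).sum(axis=2)

def batch_kernel_som(K, grid_dist, n_iter=10, sigma=1.0, seed=0):
    """Batch kernel SOM on an n x n kernel matrix K.

    Returns the assignments f (length n) and the coefficients gamma (M x n).
    """
    rng = np.random.default_rng(seed)
    n, M = K.shape[0], grid_dist.shape[0]
    gamma = rng.random((M, n))
    gamma /= gamma.sum(axis=1, keepdims=True)    # assumed: normalized random start
    h = np.exp(-grid_dist / (2.0 * sigma ** 2))  # h(t) = exp(-t / 2 sigma^2), alpha = 1
    for _ in range(n_iter):
        # Assignment step: squared distance to each prototype, up to k(x_i, x_i).
        quad = np.einsum("ju,uv,jv->j", gamma, K, gamma)    # shape (M,)
        cross = gamma @ K                                   # shape (M, n)
        f = np.argmin(quad[:, None] - 2.0 * cross, axis=0)  # shape (n,)
        # Representation step: gamma_ji proportional to h(d(f(x_i), j)).
        weights = h[:, f]                                   # weights[j, i] = h[j, f[i]]
        gamma = weights / weights.sum(axis=1, keepdims=True)
    return f, gamma

# Hypothetical toy graph: two triangles joined by one weak edge.
W = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[a, b] = W[b, a] = 1.0
W[2, 3] = W[3, 2] = 0.1
L = np.diag(W.sum(axis=1)) - W
eigenvalues, U = np.linalg.eigh(L)
K = (U * np.exp(-0.5 * eigenvalues)) @ U.T   # diffusion kernel with beta = 0.5

f, gamma = batch_kernel_som(K, grid_distances(2, 2))
print(f)
```

Only the kernel matrix is needed: the prototypes are never formed explicitly, they are represented by their coefficients `gamma`.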
Results on a 7 × 7 rectangular map
Expected developments
1 Hierarchical clustering;
2 Achieve a classification based on a density criterion (joint work
with S. Gadat);
3 Adapting the algorithm to very large graphs (thousands of
vertices).
References
Boulet, R., Jouve, B., Rossi, F., and Villa, N. (2008). Batch kernel SOM and related Laplacian methods for social network analysis. Neurocomputing. To appear.
Villa, N. and Rossi, F. (2007). A comparison between dissimilarity SOM and kernel SOM for clustering the vertices of a graph. In Proceedings of the 6th Workshop on Self-Organizing Maps (WSOM 07), Bielefeld, Germany.
von Luxburg, U. (2007). A tutorial on spectral clustering. Technical Report TR-149, Max Planck Institut für biologische Kybernetik. Available at http://www.kyb.mpg.de/publications/attachments/luxburg06_TR_v2_4139%5B1%5D.pdf.