[ICDE 2012] On Top-k Structural Similarity Search

Pei Lee, ICDE 2012
On Top-k Structural
Similarity Search
Pei Lee, Laks V.S. Lakshmanan
University of British Columbia
Vancouver, BC, Canada
Jeffrey Xu Yu
Chinese University of Hong Kong
Hong Kong, China
2014-4-16

Outline
 Problem definition
 Structural similarity
 Top-k structural similarity search
 Existing top-k structural similarity search methods
 SimRank, P-Rank
 Constraints
 TopSim: a family of efficient top-k structural similarity
search algorithms with accuracy guarantee
 Truncated TopSim, Prioritized TopSim
 Experiments
Problem Statement

Graph structures are ubiquitous
 Social networks, citation networks, web graphs, etc

What’s structural similarity?
 Structural similarity: the pairwise similarity between nodes
in a graph
 Applications: link prediction, recommendation, etc
 Intuition: two nodes are similar if their neighbors are similar
 Derived from PageRank’s intuition: a node is important if it is referenced by many other important nodes
[Figure: example graph with nodes a, b, c, d, e, f, g, h, u, v]
How to quantify the similarity between nodes u and v?
Problem Definition:
Input: G(V, E), u ∈ V, v ∈ V
Output: S(u, v)

What’s top-k structural similarity search?
Problem Definition:
Input: G(V, E), v ∈ V, k
Output: top-k similar nodes for v
 Given a node v in a huge graph
 Find the top-k nodes most similar to v
 But
 We definitely do not want to compare v with every node
 The accuracy of the results should be guaranteed

Existing Structural Similarity
Measures
 Neighbor-based approaches
 Jaccard Coefficient, Cosine Similarity, Pearson
correlation, Co-citation, etc
 Cons: no common neighbors, no similarity!
 Random walk based approaches
 SimRank (Jeh & Widom, KDD’02)
 P-Rank (Zhao et al., CIKM’09), which extends SimRank
 Cons:
 High computational cost
 Not designed for top-k similarity search
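The drawback of neighbor-based measures can be seen in a minimal Python sketch of the Jaccard coefficient over in-neighbor sets. The four-edge graph and the helper name `jaccard` are assumptions made for illustration, not taken from the slides.

```python
def jaccard(in_nbrs, x, y):
    """Jaccard coefficient of the in-neighbor sets of x and y."""
    nx, ny = in_nbrs.get(x, set()), in_nbrs.get(y, set())
    union = nx | ny
    return len(nx & ny) / len(union) if union else 0.0

# Hypothetical graph: a -> b, a -> c, b -> u, c -> v
IN_NBRS = {"b": {"a"}, "c": {"a"}, "u": {"b"}, "v": {"c"}}

print(jaccard(IN_NBRS, "b", "c"))  # b and c share the in-neighbor a
print(jaccard(IN_NBRS, "u", "v"))  # u and v share no neighbor, so the score is 0
```

Here u and v score zero even though their parents b and c are clearly related, which is the gap that random walk based measures address.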
Related Work

SimRank & P-Rank
 SimRank: two nodes are similar if they are referenced by similar nodes
[Figure: example graph with nodes a, b, c, u, v, annotated S(a, a) = 1, S(b, c) = 0.5 and S(u, v) = 0.25 (both nonzero)]
Pairwise iterative form, where I(u) denotes the in-neighbors of u:
$$S_{n+1}(u, v) = \frac{C}{|I(u)|\,|I(v)|} \sum_{i \in I(u)} \sum_{j \in I(v)} S_n(i, j)$$
Matrix form, where W is the transition matrix:
$$S_{n+1} = C \cdot W^{T} S_n W + \text{(correction matrix)}$$
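 P-Rank: two nodes are similar if they are related to similar nodes
$$S_{n+1}(u, v) = \frac{\lambda C}{|I(u)|\,|I(v)|} \sum_{i \in I(u)} \sum_{j \in I(v)} S_n(i, j) + \frac{(1 - \lambda)\, C}{|O(u)|\,|O(v)|} \sum_{i \in O(u)} \sum_{j \in O(v)} S_n(i, j)$$
where 0 < C < 1, 0 < λ < 1, and O(u) denotes the out-neighbors of u. The first term is SimRank (over in-links); the second term is reversed SimRank (over out-links).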
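A minimal sketch of the pairwise iterative SimRank form, written as naive all-pairs iteration. The edge list is an assumption chosen so that the scores agree with the values annotated on the example figure (S(b, c) = 0.5, S(u, v) = 0.25 with C = 0.5); the figure's actual edges are not recoverable from the slide text.

```python
from itertools import product

def simrank(edges, C=0.5, iterations=3):
    """Naive all-pairs SimRank: S_{n+1}(u,v) averages S_n over in-neighbor pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    in_nbrs = {n: [] for n in nodes}
    for src, dst in edges:
        in_nbrs[dst].append(src)
    # S_0(u, v) = 1 if u == v, else 0
    sim = {(x, y): 1.0 if x == y else 0.0 for x, y in product(nodes, repeat=2)}
    for _ in range(iterations):
        new = {}
        for x, y in product(nodes, repeat=2):
            if x == y:
                new[(x, y)] = 1.0  # a node is always fully similar to itself
            elif in_nbrs[x] and in_nbrs[y]:
                total = sum(sim[(i, j)] for i in in_nbrs[x] for j in in_nbrs[y])
                new[(x, y)] = C * total / (len(in_nbrs[x]) * len(in_nbrs[y]))
            else:
                new[(x, y)] = 0.0  # no in-neighbors, no similarity
        sim = new
    return sim

# Hypothetical edges: a -> b, a -> c, b -> u, c -> v
EDGES = [("a", "b"), ("a", "c"), ("b", "u"), ("c", "v")]
sim = simrank(EDGES)
print(sim[("b", "c")], sim[("u", "v")])  # 0.5 0.25
```

The loop touches every node pair in every iteration, which illustrates why this form is costly on large graphs.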

Top-k similarity search: challenges
 Matrix-based approach: (KDD’02, VLDB’08)
 Offline: compute a |V|-by-|V| similarity matrix
 SimRank/P-Rank takes O(|E|^2) time, which degenerates to
O(|V|^4) in the worst case
 Space cost: hard to store this huge similarity matrix
 Vector-based approach: (SDM’10)
 Offline: compute a vector of length |V|
 Takes O(|V| D^{2n}) time in the worst case, where n is the
number of iterations and D is the average node degree
 All these approaches need to access the whole graph to
find the exact top-k similar nodes
Challenges

Contributions
 Transform the computation of pairwise similarity on graph G
into the computation of authority on G × G, based on a
propagation & aggregation process;
 Propose TopSim, a local top-k structural similarity search
algorithm that avoids accessing the whole graph while the
accuracy is guaranteed.
 Propose Trun-TopSim-SM and Prio-TopSim-SM, which are
two approximations allowing us to trade accuracy for speed.

How TopSim works
Coupling random walk on G
Single random walk on G × G
Propagation & Aggregation
Similarity Path
Similarity Score

Product of graphs: G × G
 Given G(V, E), G × G is defined as follows:
 For nodes u and v in G, uv is a node in G × G
 For edges (e, u) and (e, v) in G, (ee, uv) is an edge in G × G
[Figure: graph G with nodes a, b, c, d, e, u, v, and its product graph G × G with nodes including uv, ce, ee, bd, uu, vu, vv, dd, cb, aa, da, ec, cc, ae, ea]
 Each node pair in G is a node in G × G
 Each edge pair in G is an edge in G × G
 No need to materialize G × G: it exists only conceptually, to facilitate analysis
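The product-graph definition can be sketched in a few lines of Python. The function name `product_graph` and the two-edge input are illustrative assumptions; the slides stress that the product graph is never materialized in practice, so this is only a way to see the definition at work.

```python
def product_graph(edges):
    """Pair every edge (a, u) of G with every edge (b, v) of G
    to obtain the edge (ab, uv) of the product graph."""
    return {((a, b), (u, v)) for (a, u) in edges for (b, v) in edges}

# Hypothetical fragment of G: e -> u and e -> v
EDGES = [("e", "u"), ("e", "v")]
PRODUCT = product_graph(EDGES)

# The edge (ee, uv) from the slide's definition is among the results.
print((("e", "e"), ("u", "v")) in PRODUCT)  # True
print(len(PRODUCT))                         # 4, i.e. |E|^2
```

The |E|^2 growth of the edge set is the reason the product graph is used for analysis only.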

Coupling random walk
 Coupling random walk: two random surfers walk simultaneously and
follow the same edge direction
 Surf1, Surf2
 A coupling random walk on G can be equivalently transformed into a single
random walk on G × G
 SimRank: S(u, v) is the first meeting probability of two random surfers
starting from u and v respectively and following backward links.

Compute similarity based on
coupling random walk
 We actually transform a similarity ranking problem on
G into an authority ranking problem on G × G
 R(uv) = S(u, v)
 Initialization:
 Source node (if u = v): R(uv) = 1 is fixed
 Target node (if u ≠ v): R(uv) = 0 and R(uv) will be updated
 How is R(uv) updated?
 Propagation & Aggregation process on G × G
 Propagation: nodes propagate their authority to their neighbors
following random walk steps
 Aggregation: nodes receive and aggregate the authorities that are
propagated-in from their neighbors.

Compute S(u,v)?
 Similarity path: a path from a source node to the target node that does not
pass through any other source node
 Probability of a transition step:
 Similarity:
 Sum of similarity paths with end node uv
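One reading of these two quantities, inferred from the SimRank definition and consistent with the worked example in this deck, is sketched below. It is an assumption rather than a transcription: a step into node uv of the product graph is taken with probability 1/(|I(u)||I(v)|), a path's probability is the product of its steps, and each step is damped by C.

```latex
P(ab \to uv) = \frac{1}{|I(u)|\,|I(v)|}, \qquad
P(w_0, \dots, w_l) = \prod_{t=1}^{l} P(w_{t-1} \to w_t)

S_n(u, v) = \sum_{l=1}^{n} C^{\,l}
  \sum_{\substack{\text{similarity paths } p \text{ of length } l \\ \text{ending at } uv}} P(p)
```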

Compute S(u,v): example
[Figure: product graph G × G with the two similarity paths ending at uv]
 If we only consider 3 steps, with C = 0.5:
Path 1: (ee, uv)
P(ee, uv) = 0.5
Path 2: (aa, bd, ce, uv)
P(aa, bd, ce, uv) = 0.5 × 1 × 0.5 = 0.25
S_3(u, v) = P(ee, uv) × C + P(aa, bd, ce, uv) × C^3 = 0.5 × 0.5 + 0.25 × 0.125 ≈ 0.28
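The arithmetic of this example can be checked with a short Python sketch that walks backward from the pair (u, v) and adds C^step times the probability mass of walks meeting for the first time at that step. The edge list is an assumption: it was chosen so that the two paths and their probabilities match the slide, because the figure's actual edges cannot be recovered from the text. The function name `path_similarity` is also hypothetical.

```python
def path_similarity(edges, u, v, C=0.5, steps=3):
    """Sum C^l * P(path) over first-meeting backward walks of length <= steps."""
    if u == v:
        return 1.0
    in_nbrs = {}
    for src, dst in edges:
        in_nbrs.setdefault(dst, []).append(src)
    frontier = {(u, v): 1.0}  # pairs that have not met yet -> probability mass
    score = 0.0
    for step in range(1, steps + 1):
        nxt = {}
        for (x, y), mass in frontier.items():
            ix, iy = in_nbrs.get(x, []), in_nbrs.get(y, [])
            if not ix or not iy:
                continue  # one surfer is stuck; this walk cannot meet
            share = mass / (len(ix) * len(iy))
            for i in ix:
                for j in iy:
                    if i == j:
                        score += share * C ** step  # first meeting at node i
                    else:
                        nxt[(i, j)] = nxt.get((i, j), 0.0) + share
        frontier = nxt
    return score

# Hypothetical edges reproducing the two paths (ee, uv) and (aa, bd, ce, uv)
EDGES = [("a", "b"), ("a", "d"), ("b", "c"), ("d", "e"),
         ("e", "d"), ("c", "u"), ("e", "u"), ("e", "v")]
S3 = path_similarity(EDGES, "u", "v")
print(S3)  # 0.28125, which the slide rounds to 0.28
```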

Optimization based on SimMap
 Observation: many similarity paths overlap
[Figure: example graph with nodes a, b, c, d, e, f, g, h, u, v, arranged in levels 0 to 3, with similarity paths marked]
 SimMap SM(u) = {(key, value)}
 key is the node visited by Surf2 at step i when Surf1 visits the node u
 value = S_i(key, u)
 SM(v) is exactly the result list
 TopSim-SM
 Example:
 Start from c
 SM(b) = {(d, 1/2), (f, 1/2)}
 SM(a) = {(e, 1/8)}
 SM(v) = {(u, 1/32)}

Family of TopSim Algorithms
Algorithm      | Quality                  | Performance
TopSim         | Exact                    | Slow if the graph is not sparse
TopSim-SM      | Exact                    | More efficient than TopSim
Trun-TopSim-SM | Trade accuracy for speed | More efficient than TopSim-SM
Prio-TopSim-SM | Trade accuracy for speed | More efficient than TopSim-SM

TopSim approximations for scale-free graphs
 Scale-free graphs
 Some nodes have very high degree
 Web graphs, citation networks, etc
 Random surfers get trapped by high-degree nodes
 The size of the SimMaps explodes
 Revisit the transition probabilities: a step into a pair of high-degree nodes
has low probability, so similarity paths through such nodes contribute little

TopSim approximations
 Basic idea:
 Only consider similarity paths with higher probability
 Truncated TopSim-SM
 If P(u_0u_0, …, u_iv_i) < η, stop and ignore this path
 Prioritized TopSim-SM
 Set a buffer size H for each SimMap;
 Only expand the top H nodes in each SimMap:
 If |SM(u)| > H, keep only the H highest-valued entries, so that |SM(u)| = H.
 See the paper for the accuracy and complexity analysis
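The two pruning rules can be sketched as small Python helpers. Both function names and the dictionary layouts (path to probability, node to score) are assumptions made for illustration; the defaults η = 0.001 and H = 100 are the values listed in the experiments.

```python
import heapq

def truncate_paths(path_probs, eta=0.001):
    """Truncation rule: drop every similarity path whose probability is below eta."""
    return {path: p for path, p in path_probs.items() if p >= eta}

def prioritize_simmap(sim_map, H=100):
    """Prioritization rule: keep only the H highest-valued entries of a SimMap."""
    if len(sim_map) <= H:
        return dict(sim_map)
    return dict(heapq.nlargest(H, sim_map.items(), key=lambda kv: kv[1]))

# Hypothetical inputs
kept_paths = truncate_paths({("aa", "uv"): 0.5, ("bb", "uv"): 0.0005})
kept_nodes = prioritize_simmap({"d": 0.5, "f": 0.25, "g": 0.125}, H=2)
print(kept_paths)
print(kept_nodes)
```

Truncation bounds the error contributed by each dropped path, while prioritization bounds the memory per SimMap; both give up exactness in exchange for speed.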

Experiments
 Datasets
 Arxiv High Energy Physics paper citation network,
including 34,546 nodes and 421,578 edges
 DBLP co-author graph, with 0.92M nodes, 6.1M edges
 DBLP citation network, with 1.5M papers and 2.1M
citations
 Live Journal social network, with 4.84M users and
68.99M friendship ties
 Factors
 C = 0.5, η = 0.001, H = 100

Accuracy of similarity scores
[Figure: two panels, accuracy ratio and accuracy loss]
(Running on Arxiv citation network)
3 steps/iterations are good enough for the accuracy of top-20 list

Precision@k
(Running on DBLP citation network)
k around 20~30 yields the highest precision
3 steps/iterations yield a high precision

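Kendall Tau distance
(care more about the ranking order …)
[Figure: a pair (a, b) is concordant when both rankings order a and b the same way, and discordant when the two rankings order them oppositely]
The higher, the better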
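A minimal sketch of counting concordant and discordant pairs between two ranked lists. It assumes the reported metric is the Kendall tau rank correlation, (concordant − discordant) / total pairs, which ranges over [−1, 1] and fits "the higher, the better"; the slides do not state the exact formula used, and the function name `kendall_tau` is hypothetical.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall tau correlation over the items that appear in both rankings."""
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    common = [item for item in rank_a if item in pos_b]
    concordant = discordant = 0
    for x, y in combinations(common, 2):
        # Same sign of the position difference means the same relative order.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / pairs if pairs else 0.0

print(kendall_tau(["a", "b", "c"], ["a", "b", "c"]))  # 1.0, identical order
print(kendall_tau(["a", "b"], ["b", "a"]))            # -1.0, reversed order
```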

Kendall Tau distance
(care more about the ranking order …)
k around 20~30 yields the highest precision
3 steps/iterations yield a high precision

Running time with different graph sizes and node degrees
TopSim algorithms are not very
sensitive to the graph size
TopSim approximations can
handle high degree nodes

Running time and accessed nodes

Excitements
 We transform a similarity problem on graph G into an
equivalent authority ranking problem on the product graph
G × G to facilitate analysis;
 We propose a family of TopSim algorithms that:
 Produce top-k results with accuracy guarantee;
 Only access a small portion of the graph;
 Handle both SimRank and P-Rank under the same top-k
framework.
 Questions?
[Figure: SimRank and P-Rank both handled by TopSim]

TopSim-SM
 Start from v and find source nodes at each step
 From level n-1 to 0
 Let Surf1 start from source node and walk to node v
 Let Surf2 start from the same source node and put the visited
nodes into SimMaps
 When Surf1 visits v, Surf2 visits exactly the nodes similar
to v in the same step
[Figure: example graph with nodes a, b, c, d, e, f, g, h, u, v, arranged in levels 0 to 3]
 Example:
 Start from c
 SM(b) = {(d, 1/2), (f, 1/2)}
 SM(a) = {(e, 1/8)}
 SM(v) = {(u, 1/32)}
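The procedure above can be sketched in Python as a simplified local search. It is a sketch under stated assumptions, not the authors' implementation: every backward walk from v gets its own SimMap-like dictionary, and the merging of overlapping paths that distinguishes TopSim-SM is not attempted. The function name `topsim`, the update rule value × C / (|I(w)||I(y)|) and the edge list are all assumptions; the update rule was chosen because it reproduces SimMap values of the form 1/2, 1/8, 1/32 shown in the example.

```python
from collections import defaultdict
import heapq

def topsim(edges, v, k, C=0.5, steps=3):
    """Local top-k search: walk back from v to source nodes, then let Surf2
    walk forward from each source while Surf1 retraces the path to v."""
    out_nbrs, in_nbrs = defaultdict(list), defaultdict(list)
    for src, dst in edges:
        out_nbrs[src].append(dst)
        in_nbrs[dst].append(src)
    scores = defaultdict(float)

    def expand(walk):
        # walk = [v, w_1, ..., w_i]; w_i is the source node of this walk
        depth = len(walk) - 1
        if depth >= 1:
            sim_map = {walk[-1]: 1.0}
            for level in range(depth - 1, -1, -1):
                w = walk[level]  # node Surf1 visits at this level
                nxt = defaultdict(float)
                for prev, val in sim_map.items():
                    for y in out_nbrs[prev]:
                        if y != w:  # surfers must not meet before the source
                            nxt[y] += val * C / (len(in_nbrs[w]) * len(in_nbrs[y]))
                sim_map = nxt
            for node, val in sim_map.items():
                scores[node] += val
        if depth < steps:
            for parent in in_nbrs[walk[-1]]:
                expand(walk + [parent])

    expand([v])
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

# Hypothetical edge list (not the graph drawn on the slide)
EDGES = [("a", "b"), ("a", "d"), ("b", "c"), ("d", "e"),
         ("e", "d"), ("c", "u"), ("e", "u"), ("e", "v")]
TOP = topsim(EDGES, "v", k=2)
print(TOP)
```

Only nodes within `steps` hops backward and then forward of v are ever touched, which is the sense in which the search is local.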
