My talk from WSDM 2012. The paper is on my webpage: http://www.cs.purdue.edu/homes/dgleich/publications/Andersen%202012%20-%20overlapping.pdf
The code is at http://www.cs.purdue.edu/homes/dgleich/codes/overlapping/
2. Problem
Find a good way to distribute a big graph
for solving things like linear systems and simulating random walks
Contributions
Theoretical demonstration that overlap helps
Proof of concept procedure to find overlapping
partitions to reduce communication (~20%)
All code available
http://www.cs.purdue.edu/~dgleich/codes/overlapping
David Gleich · Purdue
WSDM2012
3. The problem
WHAT OUR NETWORKS LOOK LIKE
WHAT OUR OTHER NETWORKS LOOK LIKE
4. The problem
COMBINING NETWORKS AND GRAPHS IS A MESS
5. “Good” data distributions are a fundamental problem in distributed computation.
How to divide the communication graph
Balance work
Balance communication
Balance data
Balance programming complexity too
6. Current solutions
                             Work        Comm.               Data          Programming
Disjoint vertex partitions   Excellent   Okay to Good        Excellent     “Think like a vertex”
2d or Edge Partitions        Excellent   Excellent           Good          “Impossible”
Where we fit!
Overlapping partitions       Okay        Good to Excellent   “Let’s see”   “Think like a cached vertex”
7. Goals
Find a set of overlapping clusters where
random walks stay in a cluster for a long time
solving diffusion-like problems requires little communication
(think PageRank, Katz, hitting times, semi-supervised learning)
8. Related work
Domain decomposition, Schwarz methods
How to solve a linear system with overlap. Szyld et al.
Communication avoiding algorithms
k-step matrix-vector products (Demmel et al.) and
growing overlap around partitions (Fritzsche, Frommer, Szyld)
Overlapping communities and link partitioning
algorithms for social network analysis
Link communities (Ahn et al.); surveys by Fortunato and Schaeffer
P2P based PageRank algorithms
Parreira, Castillo, Donato et al.
9. Overlapping clusters
Each vertex
in at least one cluster
has one home cluster
Formally, an overlapping cover is $(\mathcal{C}, \tau)$, where
$\mathcal{C}$ = set of clusters
$\tau : V \mapsto \mathcal{C}$ = map to homes
$\tau$ is a partition!
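The two conditions on this slide can be checked mechanically. The following is a minimal sketch, assuming clusters are stored as Python sets keyed by a name and the home map is a dictionary; the function name `is_valid_cover` and the 6-vertex example are hypothetical and not taken from the talk's code.

```python
from typing import Dict, Hashable, Set


def is_valid_cover(vertices: Set[Hashable],
                   clusters: Dict[str, Set[Hashable]],
                   home: Dict[Hashable, str]) -> bool:
    """Check that every vertex has a home cluster that contains it."""
    for v in vertices:
        # The home map must assign v to a cluster that exists.
        if v not in home or home[v] not in clusters:
            return False
        # The home cluster must contain v, so v is in at least one cluster.
        if v not in clusters[home[v]]:
            return False
    return True


# Hypothetical example: two clusters that share vertices 2 and 3.
vertices = {0, 1, 2, 3, 4, 5}
clusters = {"red": {0, 1, 2, 3}, "gray": {2, 3, 4, 5}}
home = {0: "red", 1: "red", 2: "red", 3: "gray", 4: "gray", 5: "gray"}

# Grouping vertices by their home cluster yields a partition of V.
partition: Dict[str, Set[Hashable]] = {}
for v, c in home.items():
    partition.setdefault(c, set()).add(v)
```

Grouping by home cluster is what makes the home map a partition even though the clusters themselves overlap.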
10. Random walks in overlapping clusters
Each vertex
in at least one cluster
has one home cluster
Red cluster keeps the walk
Red cluster sends the walk to gray cluster
Random walks go to the home cluster after leaving
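The walk rule on this slide can be simulated directly. This is a rough sketch under stated assumptions: the graph is an adjacency-list dictionary, the walk picks a uniform random neighbor, it stays in its current cluster while the new vertex is still inside it, and otherwise it moves to the home cluster of the new vertex. The function name `swap_fraction` and the 6-cycle example are hypothetical.

```python
import random


def swap_fraction(adj, clusters, home, start, steps, seed=0):
    """Return the fraction of steps on which the walk changes clusters."""
    rng = random.Random(seed)
    v = start
    current = home[v]  # the walk starts in the home cluster of `start`
    swaps = 0
    for _ in range(steps):
        v = rng.choice(adj[v])          # move to a uniform random neighbor
        if v not in clusters[current]:  # the walk left its current cluster
            current = home[v]           # so it goes to the home cluster of v
            swaps += 1
    return swaps / steps


# Hypothetical example: a 6-cycle covered by two overlapping clusters.
n = 6
adj = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
clusters = {"red": {0, 1, 2, 3}, "gray": {2, 3, 4, 5}}
home = {0: "red", 1: "red", 2: "red", 3: "gray", 4: "gray", 5: "gray"}
overlap_rate = swap_fraction(adj, clusters, home, start=0, steps=10_000)

# With a single cluster holding every vertex, the walk never swaps.
one_cluster = {"all": set(range(n))}
one_home = {i: "all" for i in range(n)}
no_swap_rate = swap_fraction(adj, one_cluster, one_home, start=0, steps=1_000)
```

The empirical fraction is only an estimate of how often a walk swaps clusters; it depends on the seed and the number of steps.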
11. An evaluation metric: Swapping probability
Is $(\mathcal{C}, \tau)$ a good overlapping cover?
Does a random walk swap clusters often?
Red cluster keeps the walk
Red cluster sends the walk to gray cluster
$\rho_\infty$ = probability that a walk changes clusters on each step
computable expression in the paper
12. Overlapping clusters
Each vertex
is in at least one cluster
has one home cluster
Vol(C) = sum of degrees of vertices in cluster C
MaxVol = upper bound on Vol(C)
TotalVol($\mathcal{C}$) = sum of Vol(C) for all clusters
VolRatio = TotalVol($\mathcal{C}$) / Vol(G) = how much extra data
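These volume measures are simple sums over degrees. A minimal sketch, assuming an adjacency-list dictionary with no self-loops; the helper names `vol` and `volume_stats` and the 6-cycle example are hypothetical.

```python
def vol(adj, vertex_set):
    """Vol(S): sum of the degrees of the vertices in S."""
    return sum(len(adj[v]) for v in vertex_set)


def volume_stats(adj, clusters):
    """Return the largest cluster volume, TotalVol, and VolRatio."""
    cluster_vols = {name: vol(adj, members) for name, members in clusters.items()}
    total_vol = sum(cluster_vols.values())
    graph_vol = vol(adj, adj.keys())
    return {
        "LargestVol": max(cluster_vols.values()),  # must not exceed MaxVol
        "TotalVol": total_vol,
        "VolRatio": total_vol / graph_vol,          # 1.0 means no extra data
    }


# Hypothetical example: a 6-cycle covered by two overlapping clusters.
n = 6
adj = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
clusters = {"red": {0, 1, 2, 3}, "gray": {2, 3, 4, 5}}
stats = volume_stats(adj, clusters)
```

In this example each cluster has volume 8 and the graph has volume 12, so the cover stores one third more data than a disjoint partition would.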
13. Swapping probability & partitioning
No overlap in this figure!
$\mathcal{P}$ is a partition
$\rho_\infty(\mathcal{P}) = \frac{1}{\mathrm{Vol}(G)} \sum_{P \in \mathcal{P}} \mathrm{Cut}(P)$
Much like a classical graph partitioning metric
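For a partition, the formula on this slide is easy to evaluate. A minimal sketch, assuming an undirected graph stored as an adjacency-list dictionary and taking Cut(P) to be the number of edges with exactly one endpoint in P; the function names and the two-triangle example are hypothetical.

```python
def cut(adj, part):
    """Number of edges with exactly one endpoint in `part`."""
    return sum(1 for u in part for v in adj[u] if v not in part)


def swapping_probability_partition(adj, parts):
    """Sum of Cut(P) over all parts, divided by Vol(G)."""
    graph_vol = sum(len(nbrs) for nbrs in adj.values())
    return sum(cut(adj, p) for p in parts) / graph_vol


# Hypothetical example: two triangles joined by the single edge 2-3.
adj = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
parts = [{0, 1, 2}, {3, 4, 5}]
rho = swapping_probability_partition(adj, parts)
```

Here each part has one cut edge and Vol(G) = 14, so the value is 2/14 = 1/7.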
14. Overlapping clusters vs. partitioning in theory
Take a cycle graph
M groups of $\ell$ vertices
MaxVol = $2\ell$
for partitioning: $\rho_\infty = \frac{1}{\ell}$ (Optimal!)
for overlapping: $\rho_\infty = \frac{4}{\Omega(\ell^2)}$
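The partitioning value can be checked by hand from the partition formula for the swapping probability. This is a short worked sketch, assuming each group is a contiguous arc of $\ell$ vertices on a cycle with $M\ell$ vertices; the overlapping bound is proved in the paper and is not rederived here.

```latex
\[
\mathrm{Vol}(P) = 2\ell = \mathrm{MaxVol}, \qquad
\mathrm{Vol}(G) = 2M\ell, \qquad
\mathrm{Cut}(P) = 2 \ \text{for each arc } P,
\]
\[
\rho_\infty(\mathcal{P})
  = \frac{1}{\mathrm{Vol}(G)} \sum_{P \in \mathcal{P}} \mathrm{Cut}(P)
  = \frac{2M}{2M\ell}
  = \frac{1}{\ell}.
\]
```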
15. Heuristics for finding good overlapping clusters
NP-hard for optimal solution
Our multi-stage heuristic!
1. Find a large set of good clusters
Use personalized PageRank clusters
2. Find “well contained” nodes (cores)
Compute expected “leave time”
3. Cover the graph with core vertices
Approximately solve a min set-cover problem
4. Combine clusters up to MaxVol
The swapping probability is sub-modular
16. Heuristics for finding good overlapping clusters
NP-hard for optimal solution
Our multi-stage heuristic!
1. Find a large set of good clusters
Use personalized PageRank clusters, or metis
(Each cluster takes “< MaxVol” work)
2. Find “well contained” nodes (cores)
Compute expected “leave time”
(Takes O(Vol) work per cluster)
3. Cover the graph with core vertices
Approximately solve a min set-cover problem
(Fast enough)
4. Combine clusters up to MaxVol
The swapping probability is sub-modular
(Fast enough)
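Step 3 asks for an approximate minimum set cover. The slides do not say which approximation is used, so the following is only a sketch of the standard greedy heuristic, assuming each candidate cluster contributes the set of its core vertices; the function name `greedy_set_cover` and the example sets are hypothetical.

```python
def greedy_set_cover(universe, candidate_sets):
    """Repeatedly pick the candidate that covers the most uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Candidate with the largest number of still-uncovered elements.
        best = max(candidate_sets,
                   key=lambda name: len(candidate_sets[name] & uncovered))
        gained = candidate_sets[best] & uncovered
        if not gained:
            raise ValueError("candidate sets do not cover the universe")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical core-vertex sets for four candidate clusters.
universe = range(6)
cores = {"a": {0, 1, 2}, "b": {2, 3}, "c": {3, 4, 5}, "d": {1, 4}}
chosen = greedy_set_cover(universe, cores)
covered = set().union(*(cores[name] for name in chosen))
```

The greedy rule is the classical logarithmic-factor approximation for set cover, which is why it is a natural stand-in when the exact procedure is not specified.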
18. Solving linear systems
Like PageRank, Katz, and
semi-supervised learning
19. All nodes solve locally using
the coordinate descent method.
20. All nodes solve locally using
the coordinate descent method.
A core vertex for the
gray cluster.
21. All nodes solve locally using
the coordinate descent method.
Red sends residuals to white.
White sends residuals to red.
22. White then uses the coordinate
descent method to adjust its solution.
Will cause communication to red/blue.
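The slides do not specify which coordinate method each node runs. As one plausible reading, here is a sketch of Gauss-Southwell coordinate relaxation on the residual, applied to a PageRank-style system $(I - \alpha P)x = (1-\alpha)v$; the function name, the 4-cycle example, and $\alpha = 0.85$ are assumptions for illustration.

```python
def coordinate_relaxation(A, b, tol=1e-10, max_steps=100_000):
    """Solve A x = b by relaxing the coordinate with the largest residual."""
    n = len(b)
    x = [0.0] * n
    r = list(b)  # residual b - A x for the starting point x = 0
    for _ in range(max_steps):
        i = max(range(n), key=lambda k: abs(r[k]))
        if abs(r[i]) < tol:
            break
        delta = r[i] / A[i][i]   # change that zeroes the i-th residual
        x[i] += delta
        for k in range(n):       # update residuals touched by column i
            r[k] -= delta * A[k][i]
    return x, r


# Hypothetical PageRank system on a 4-cycle with uniform teleportation.
n, alpha = 4, 0.85
P = [[0.5 if k in ((i - 1) % n, (i + 1) % n) else 0.0 for i in range(n)]
     for k in range(n)]
A = [[(1.0 if k == i else 0.0) - alpha * P[k][i] for i in range(n)]
     for k in range(n)]
b = [(1 - alpha) / n] * n
x, r = coordinate_relaxation(A, b)
```

By symmetry the exact solution on this graph is uniform, 0.25 per vertex. Each update only changes residuals at the neighbors of the relaxed vertex, which is what makes residual exchange between clusters the natural form of communication.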
23. That algorithm is called restricted additive Schwarz.
PageRank (We look at PageRank!)
Katz scores
semi-supervised learning
any spd or M-matrix linear system
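A sequential sketch of restricted additive Schwarz on an overlapping cover may make the idea concrete. It assumes exact local solves by Gaussian elimination rather than the coordinate method from the slides, and it runs in one process rather than distributed; the function names, the 6-cycle example, and $\alpha = 0.85$ are hypothetical.

```python
def solve_dense(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def restricted_additive_schwarz(A, b, clusters, home, iters=200):
    """Each cluster solves on all its vertices but updates only home vertices."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        update = [0.0] * n
        for name, members in clusters.items():
            idx = sorted(members)
            A_loc = [[A[i][j] for j in idx] for i in idx]
            d = solve_dense(A_loc, [r[i] for i in idx])
            for i, di in zip(idx, d):
                if home[i] == name:  # "restricted": keep home vertices only
                    update[i] = di
        x = [xi + ui for xi, ui in zip(x, update)]
    return x


# Hypothetical PageRank system on a 6-cycle with two overlapping clusters.
n, alpha = 6, 0.85
A = [[(1.0 if k == i else 0.0)
      - (alpha * 0.5 if k in ((i - 1) % n, (i + 1) % n) else 0.0)
      for i in range(n)] for k in range(n)]
b = [(1 - alpha) / n] * n
clusters = {"red": {0, 1, 2, 3}, "gray": {2, 3, 4, 5}}
home = {0: "red", 1: "red", 2: "red", 3: "gray", 4: "gray", 5: "gray"}
x = restricted_additive_schwarz(A, b, clusters, home)
```

The overlap vertices take part in both local solves, but only the cluster that is a vertex's home writes its value back, which mirrors the home map of the overlapping cover.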
24. It works!
[Figure: relative work (communication) on the vertical axis, 0 to 2, versus volume ratio on the horizontal axis, 1 to 1.7. Series: Swapping Probability (usroads), PageRank Communication (usroads), Swapping Probability (web-Google), PageRank Communication (web-Google). The Metis partitioner is the partitioning baseline at relative work 1.]
Volume ratio: how much more of the graph we need to store.
25. Edges are counted twice and some graphs have self-loops. The first group are geometric networks and the second are information networks.
Graph          |V|        |E|         max deg   |E|/|V|
onera          85567      419201      5         4.9
usroads        126146     323900      7         2.6
annulus        500000     2999258     19        6.0

email-Enron    33696      361622      1383      10.7
soc-Slashdot   77360      1015667     2540      13.1
dico           111982     2750576     68191     24.6
lcsh           144791     394186      1025      2.7
web-Google     855802     8582704     6332      10.0
as-skitter     1694616    22188418    35455     13.1
cit-Patents    3764117    33023481    793       8.8
[Figure: three panels with conductance (0 to 1) on the vertical axis]
26. The communication ratio of our best result for the PageRank
communication volume compared to METIS or GRACLUS shows
that the method works for 6 of them (perf. ratio < 1). The
communication result is not a bug.
Graph          Comm. of Partition   Comm. of Overlap   Perf. Ratio   Vol. Ratio
onera          18654                48                 0.003         2.82
usroads        3256                 0                  0.000         1.49
annulus        12074                2                  0.000         0.01
email-Enron    194536*              235316             1.210         1.7
soc-Slashdot   875435*              1.3 × 10^6         1.480         1.78
dico           1.5 × 10^6*          2.0 × 10^6         1.320         1.53
lcsh           73000*               48777              0.668         2.17
web-Google     201159*              167609             0.833         1.57
as-skitter     2.4 × 10^6           3.9 × 10^6         1.645         1.93
cit-Patents    8.7 × 10^6           7.3 × 10^6         0.845         1.34
* means Graclus gave a better partition than Metis
Finally, we evaluate our heuristic.
At left, the cluster combine procedure reduces 10^6 clusters to around 10^2. Middle, combining clusters can decrease the volume
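The performance ratio in the table appears to be the overlap communication divided by the partition communication. A small sketch that recomputes it for the rows whose counts are printed exactly (the rows given only in rounded scientific notation are left out); the variable names are hypothetical.

```python
# (graph, communication of partition, communication of overlap),
# copied from the rows of the table that list exact counts.
rows = [
    ("onera", 18654, 48),
    ("usroads", 3256, 0),
    ("annulus", 12074, 2),
    ("email-Enron", 194536, 235316),
    ("lcsh", 73000, 48777),
    ("web-Google", 201159, 167609),
]

# Perf. ratio = overlap communication / partition communication.
perf_ratio = {graph: round(overlap / part, 3) for graph, part, overlap in rows}

# A ratio below 1 means the overlapping cover communicated less.
improved = [graph for graph, ratio in perf_ratio.items() if ratio < 1]
```

The recomputed values agree with the printed column to three decimals for these rows, for example 235316 / 194536 rounds to 1.210 for email-Enron.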
27. Summary
Overlap helps reduce communication in a distributed process!
Proof of concept procedure to find overlapping partitions to reduce communication
Future work
Truly distributed implementation and evaluation
Can we exploit data redundancy to solve problems on large graphs faster?
[Figure: Copy 1 and Copy 2, each a list of src -> dst edges]
All code available
http://www.cs.purdue.edu/~dgleich/codes/overlapping