Sergey Edunov, Software Engineer at Facebook gave a great talk on how and why his company generating large-scale social graphs. The underlying reasons to start such an ambitious project are capacity planning to make sure that their system will be able to handle a graph that keeps growing year after year and fair evaluation of their system against the ones being implemented by other companies.
1. Darwini: Generating realistic large-
scale social graphs
Avery Ching
Facebook
Cheng Wang
University of Houston
Sergey Edunov
Facebook
Maja Kabiljo
Facebook
Dionysios Logothetis
Facebook
6. Algorithms
Friend of Friends counts
PageRank
Community detection
Graph partitioning
K-Core decomposition
Eigen value decomposition
Local clustering coefficient
Personalized Page Rank
9. Requirements
1. Match the graph size. If it doesn’t scale, it doesn’t work
2. Match degree distribution
3. Match joint degree and clustering coefficient (ideally dk-3
distribution)
4. Match high level application metrics
10. Existing algorithms vs requirements
Kronecker BTER Erdos-Renyi
Scalability
Degree
distribution
Joint degree &
CC
High level
metrics
11. Darwini*
*Caerostris darwini - is an orb-weaver spider that produces
one of the largest known orb webs, web size ranged from
900–28000 square centimeters
1. Built on Apache Giraph, scales to hundreds machines
2. Capable of generating graphs with trillions of edges
3. Generates graphs with specified joint degree-clustering coefficient distribution
4. Shows better accuracy in performance benchmarking against the original graph
12. Applying Darwin to the real graph
Original Graph
Measure
Darwini
Generated Graph
13. Darwini step by step
Create vertices
Assign expected degree
and clustering coefficient
Group vertices that expect
same number of triangles
together
Create random edges
within each group
Create random edges
between groups
14. Darwini: create vertices
Create N vertices and draw degree and
clustering coefficient from the joint degre-
clustering coefficient distribution
8ci, di
15. Darwini: group vertices into buckets
ce,i = cidi(di 1)
Group vertices that expected to participate in
the same number of triangles together
n min
i2B
(di) + 1 = nB,max
Limit the size of each bucket, so that we don’t exceed expected degree
16. Darwini: create triangles
Pe = 3
q
cidi(di 1)
(n 1)(n 2)
Create random edges between each pair of
vertices in each bucket with probability
After this step, we will have enough triangles to get right clustering coefficient
17. Darwini: create random edges between
buckets
For each vertex, that doesn’t have enough
edges yet, pick random vertex and create an
edge if another vertex doesn’t have enough
edges either.
Hard to find counterparts for high degree vertices
18. Adding random edges in Apache Giraph
1. Not all information readily available on every machine
2. Execution must be parallel
3. Exact match is not always necessary
4. Purely random connection is not enough to make realistic joint
degree distribution
19. Darwini: create edges for high-degree
nodes
1. Group vertices into ever increasing
groups.
2. For each pair of vertices within each
group, connect them with probability
p = |d[i] d[j]|
d[i]+d[j]