DrunkardMob: Billions of Random Walks on Just a PC

Thanks: A major part of this work was done during a
visit to Twitter’s Personalization and
Recommendations team (Fall 2012).

Aapo Kyrola
Carnegie Mellon University
Twitter: @kyrpov
Big Data – small machine
DrunkardMob - RecSys '13

This work in a Nutshell
1. Background: Random walk-based methods are popular in Recommender Systems.
2. Research problem: How to simulate random walks if your graph does not fit in memory?
3. Solution: Instead of doing one walk at a time, do billions of them at a time. Stream the graph from disk and maintain walk states in RAM.

Contents
• Introduction to random walks
• Disk-based graph systems: GraphChi
• DrunkardMob algorithm
• Experiments

All code available on GitHub:
http://github.com/graphchi/graphchi-java

Introduction: Random Walks
• Graph: G(V, E)
– V = vertices / nodes, E = edges / links.

• A walk is a random sequence of t visits to vertices:
w := source(0) → v(1) → v(2) → v(3) → … → v(t)

• Walks follow edges by default, but can
also reset or teleport with certain
probability.
– Transition probability: P(v(k+1) | v(k))
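A minimal sketch of such a walk, assuming the graph is a plain dict of adjacency lists and that a reset returns the walk to its source. The names `take_step`, `simulate_walk` and `reset_prob` are illustrative and do not come from the slides.

```python
import random

def take_step(graph, current, source, reset_prob, rng):
    """Return the next vertex for a walk currently at `current`."""
    neighbors = graph.get(current, [])
    # Reset to the source with probability reset_prob, or when stuck at a dead end
    # (the dead-end rule is an assumption of this sketch).
    if not neighbors or rng.random() < reset_prob:
        return source
    # Otherwise follow one outgoing edge, chosen uniformly at random.
    return rng.choice(neighbors)

def simulate_walk(graph, source, hops, reset_prob, seed=0):
    """Simulate one walk: source(0) -> v(1) -> ... -> v(hops)."""
    rng = random.Random(seed)
    walk = [source]
    for _ in range(hops):
        walk.append(take_step(graph, walk[-1], source, reset_prob, rng))
    return walk

if __name__ == "__main__":
    toy_graph = {0: [1, 2], 1: [2], 2: [0]}
    print(simulate_walk(toy_graph, source=0, hops=10, reset_prob=0.15))
```

Here the transition probability P(v(k+1) | v(k)) is `reset_prob` for the jump back to the source and `(1 - reset_prob) / out-degree` for each outgoing edge.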

Introduction (cont.)
• Usually we are interested in the
distribution of the visits.
– Either global distribution or for each source
separately.
– Many applications (PageRank, FolkRank,
SALSA, ...)

• Can be used to generate candidates:
– Choose top K visited vertices as candidates to
recommend.
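Candidate generation from visit counts can be sketched as follows. The helper name `top_k_candidates` and the `exclude` parameter (for example, to drop the source itself or vertices already followed) are assumptions made for illustration.

```python
from collections import Counter

def top_k_candidates(visits, k, exclude=()):
    """Return the k most frequently visited vertices, skipping excluded ones."""
    counts = Counter(v for v in visits if v not in exclude)
    return [vertex for vertex, _ in counts.most_common(k)]

if __name__ == "__main__":
    # Vertices visited by the walks started from one source (toy data).
    visits = ["a", "b", "c", "b", "d", "c", "b", "a", "c", "c"]
    print(top_k_candidates(visits, k=2, exclude={"a"}))
```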

Example: Global PageRank
• Model: random surfer who
starts from random
webpage and clicks each
link on the page with
uniform probability:
– With probability d, teleports
to a random vertex → infinite
walk.

[Figure: random surfer at a vertex with three out-links, each followed with P = (1-d) / 3; teleport to “any vertex” with P = d.]

• Pagerank(web page) ~ authority of web page.
– Can be computed using “power iteration” very
efficiently (in secs / minutes even for graphs with
billions of vertices) → Not interesting.
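For reference, a small power-iteration sketch on an in-memory toy graph. It follows the slide's convention that d is the teleport probability. The value d = 0.15, the fixed iteration count and the handling of dead ends are assumptions of this sketch.

```python
def pagerank(graph, d=0.15, iterations=50):
    """Power iteration for global PageRank; d is the teleport probability."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(iterations):
        # Every vertex receives the teleport mass d / n.
        new_rank = {v: d / n for v in graph}
        for v, out in graph.items():
            if out:
                # Split the remaining mass evenly over the out-links.
                share = (1.0 - d) * rank[v] / len(out)
                for u in out:
                    new_rank[u] += share
            else:
                # Dead end: spread the mass uniformly (assumption of this sketch).
                for u in graph:
                    new_rank[u] += (1.0 - d) * rank[v] / n
        rank = new_rank
    return rank

if __name__ == "__main__":
    toy_graph = {0: [1, 2], 1: [2], 2: [0]}
    print(pagerank(toy_graph))
```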

Personalized Pagerank
• Pagerank | home
(source) nodes:
– Compute pagerank vector
for each node separately
→ resets only to the home
node(s).
– Restrict home nodes to
some category / topic /
pages visited by a user.

• Used e.g. for social
network
recommendations.

[Figure: walk at a vertex with three out-links, each followed with P = (1-d) / 3; reset to the home vertex with P = d.]

Personalized Pagerank (cont.)
• Naïve computation of Personalized
Pagerank (PPR):
– Compute pagerank vector for each source
separately using power iteration: O(n^2)

• Approximate by sampling:
– Simulate actual walks on the graph.
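The sampling approach can be sketched for a single source as below, assuming an in-memory adjacency-list dict. The function name `estimate_ppr` and its parameters are illustrative; the estimate is simply the normalized visit count over all simulated hops.

```python
import random
from collections import Counter

def estimate_ppr(graph, source, num_walks, hops, reset_prob, seed=0):
    """Estimate the PPR vector of `source` from short walks with resets."""
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(num_walks):
        current = source
        for _ in range(hops):
            neighbors = graph[current]
            # Reset to the home vertex with probability reset_prob (or at a dead end).
            if not neighbors or rng.random() < reset_prob:
                current = source
            else:
                current = rng.choice(neighbors)
            visits[current] += 1
    total = sum(visits.values())
    return {v: c / total for v, c in visits.items()}

if __name__ == "__main__":
    toy_graph = {0: [1], 1: [2], 2: [0], 3: [0]}
    print(estimate_ppr(toy_graph, source=0, num_walks=200, hops=5, reset_prob=0.15))
```

Unlike power iteration, the cost per source depends only on the number of simulated hops, not on the size of the graph.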


Random walk in an in-memory
graph
• Compute one walk at a time (multiple in
parallel, of course):
parfor walk in walks:
    for i=1 to <number of hops>:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())


Problem: What if Graph does not
fit in memory?
Twitter network visualization,
by Akshay Java, 2009

Disk-based “single-machine” graph systems (this talk):
- “Paging” from disk is costly.

Distributed graph systems:
- Each hop across partition boundary is costly.


DISK-BASED GRAPH
SYSTEMS

Disk-based Graph Systems
• Recently frameworks that can handle
graphs with billions of edges on a single
machine, using disk, have been
proposed:
– GraphChi (Kyrola, Blelloch, Guestrin:
OSDI’12)
– TurboGraph (KDD’13)
– [X-Stream (SOSP’13) – model not suitable]

• We assume a vertex-centric model:
– Computation is done one vertex at a time.

GraphChi execution model
[Figure: vertex IDs 1..n (v1, v2, ...) divided into interval(1), interval(2), ..., interval(P), with corresponding shard(1), shard(2), ..., shard(P).]

For T iterations:
    For p=1 to P
        For vertex in interval(p)
            updateFunction(vertex)
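The loop structure above can be mimicked in memory as follows. This is only a sketch of the control flow: the equal-size split into intervals is an assumption, the helper names are hypothetical, and a real disk-based system would load the edge data for each interval before visiting its vertices.

```python
def make_intervals(num_vertices, num_intervals):
    """Split vertex IDs 0..num_vertices-1 into at most num_intervals contiguous ranges."""
    size = -(-num_vertices // num_intervals)  # ceiling division
    return [range(start, min(start + size, num_vertices))
            for start in range(0, num_vertices, size)]

def run(num_vertices, num_intervals, iterations, update_function):
    """Call update_function on every vertex, interval by interval, for each iteration."""
    intervals = make_intervals(num_vertices, num_intervals)
    for iteration in range(iterations):
        for interval in intervals:
            # A disk-based system would load this interval's edges here.
            for vertex in interval:
                update_function(vertex, iteration)

if __name__ == "__main__":
    run(10, 3, 2, lambda vertex, iteration: print(iteration, vertex))
```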

DRUNKARDMOB ALGORITHM

A random walk is often called a “Drunkard’s Walk”.


DrunkardMob: Basic Idea
• By example:
– Task: Compute personalized pagerank (PPR) for
1 million users in a social network, in parallel
• I.e., 1M different home/source nodes

– For each user, launch 1000 random walks (with
resets), in parallel
• Each walk takes 10 hops
~ Equivalent to one 10,000-hop walk (with resets) / user

– For each user, keep track of the visits done by its
1000 short walks → PPR for each user.
– Store state of each walk in RAM, process graph
from disk.
= 1B random walks in parallel → ~5 GB of RAM.
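The numbers in this example can be checked with a few lines of arithmetic. Reading "~5 GB" as 5 × 10^9 bytes is an assumption; under it, the budget works out to about 5 bytes of RAM per walk.

```python
users = 1_000_000        # home/source nodes
walks_per_user = 1_000   # short walks launched per user
hops_per_walk = 10

total_walks = users * walks_per_user            # walks simulated in parallel
hops_per_user = walks_per_user * hops_per_walk  # comparable to one long walk per user
ram_bytes = 5 * 10**9                           # "~5 GB", decimal GB assumed
bytes_per_walk = ram_bytes / total_walks

if __name__ == "__main__":
    print(total_walks, hops_per_user, bytes_per_walk)
```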

Random walks in GraphChi
• DrunkardMob algorithm
– Reverse thinking
ForEach interval p:
    walkSnapshot = getWalksForInterval(p)
    ForEach vertex in interval(p):
        mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
        ForEach walk in mywalks:
            walkManager.addHop(walk, vertex.randomNeighbor())

Note: Need to store only
current position of each walk!
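A runnable in-memory sketch of this loop is given below. It is not the GraphChi implementation: the graph is a dict rather than shards on disk, all names are illustrative, and each pass advances every walk by exactly one hop (walks are double-buffered), which is a simplification chosen here. Each walk is stored only as its source id, filed under its current vertex.

```python
import random
from collections import defaultdict

def drunkardmob(graph, intervals, sources, walks_per_source, hops, reset_prob, seed=0):
    """Advance all walks together, one interval of vertices at a time."""
    rng = random.Random(seed)
    interval_of = {v: p for p, interval in enumerate(intervals) for v in interval}
    # walks[p][vertex] = source ids of the walks currently at that vertex.
    walks = [defaultdict(list) for _ in intervals]
    for s in sources:
        walks[interval_of[s]][s].extend([s] * walks_per_source)
    visits = defaultdict(lambda: defaultdict(int))  # visits[source][vertex]
    for _ in range(hops):
        next_walks = [defaultdict(list) for _ in intervals]
        for p, interval in enumerate(intervals):
            snapshot = walks[p]  # walks located in this interval
            for vertex in interval:
                for source in snapshot.get(vertex, ()):
                    neighbors = graph[vertex]
                    if not neighbors or rng.random() < reset_prob:
                        target = source  # reset to the home vertex
                    else:
                        target = rng.choice(neighbors)
                    visits[source][target] += 1
                    next_walks[interval_of[target]][target].append(source)
        walks = next_walks
    return visits

if __name__ == "__main__":
    toy_graph = {0: [1, 2], 1: [2], 2: [0], 3: [0]}
    result = drunkardmob(toy_graph, [range(0, 2), range(2, 4)], [0, 3], 100, 5, 0.15)
    print({s: dict(v) for s, v in result.items()})
```

The "reverse thinking" shows in the loop order: the outer loops run over vertices, and the walks are looked up by the vertex they are standing on.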


WalkManager
• Store walks in buckets
– Array for each vertex would cost too much.


Encoding walks

Only 4 bytes / walk.

Keeps track of each path → knowledge base applications.
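One way a walk could fit in 4 bytes when walks are stored in per-bucket arrays is sketched below. The bit layout (24 bits for a source index, 8 bits for the vertex offset inside its bucket) and the bucket size of 256 are assumptions for illustration, not necessarily the layout shown on the slide.

```python
SOURCE_BITS = 24                  # hypothetical: index of the walk's source
OFFSET_BITS = 8                   # hypothetical: position of the vertex within its bucket
BUCKET_SIZE = 1 << OFFSET_BITS    # 256 consecutive vertex IDs per bucket

def bucket_of(vertex):
    """Bucket that holds the walks currently at `vertex`."""
    return vertex // BUCKET_SIZE

def encode_walk(source_index, vertex):
    """Pack a walk into a 32-bit integer; the bucket itself is implied by where it is stored."""
    if not 0 <= source_index < (1 << SOURCE_BITS):
        raise ValueError("source index does not fit in 24 bits")
    return (source_index << OFFSET_BITS) | (vertex % BUCKET_SIZE)

def decode_walk(encoded, bucket):
    """Recover (source_index, vertex) from the encoded walk and its bucket number."""
    source_index = encoded >> OFFSET_BITS
    vertex = bucket * BUCKET_SIZE + (encoded & (BUCKET_SIZE - 1))
    return source_index, vertex

if __name__ == "__main__":
    encoded = encode_walk(123456, 1000)
    print(encoded, decode_walk(encoded, bucket_of(1000)))
```

Because the bucket number supplies the high bits of the vertex ID, only the offset needs to be stored with each walk.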


Keeping track of walks
[Figure: GraphChi (execution interval, vertex walks table in WalkManager) alongside the Walk Distribution Tracker (DrunkardCompanion), which holds top-N visits per source (Source A, Source B).]

Keeping track of Walks
• If we don’t have enough RAM to store the
distributions:
– Cut long tails: Similar problem to estimating
top-K frequent items in data streams with
limited memory.

• Can also write hops to disk (bucket-by-bucket) and analyze later.
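As an illustration of the data-stream view, the sketch below uses the Misra-Gries frequent-items summary, a standard technique for this kind of problem. It is swapped in here as an example and is not claimed to be what DrunkardMob does. One such summary per source would hold at most `capacity` counters, and each count is an underestimate by at most n / (capacity + 1) for a stream of n visits.

```python
def misra_gries(stream, capacity):
    """Approximate the most frequent items using at most `capacity` counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < capacity:
            counters[item] = 1
        else:
            # No free counter: decrement all of them and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

if __name__ == "__main__":
    # Visits recorded for one source: two popular vertices and a long tail.
    visits = ["a"] * 50 + ["b"] * 30 + list(range(20))
    print(misra_gries(visits, capacity=2))
```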


Validity
• We assume that simulating 2000 x 5-hop
walks with resets ~ 10000-hop walk with
resets.
– Not exactly same distribution – some longer
streaks not covered.
• But those would not be relevant anyway for
recommendations!

– See Fogaras (2005) for analysis.


Related Work
• Fogaras, Racz, Csalogany, Sarlos:
“Towards scaling fully personalized
pagerank: Algorithms, lower bounds,
experiments” (2005)
– Similar idea with full external memory
implementation.
• We keep walks in memory.

• Plenty of research in approximating PPR.


EXPERIMENTS

See paper for more
experiments!


Case Study: Twitter WTF
• Implemented Twitter’s Who-to-Follow
algorithm on GraphChi (see paper)
– Based on WWW’13 paper by Gupta et al.
– Use DrunkardMob to generate set of
candidates to recommend for each user.
– See paper.


PPR: Full Twitter Graph
With a large server with SSD and 144 GB of memory:

On a Mac laptop, could estimate 500K-1M PPRs (= 0.5-1B walks) in roughly the same time.

Runtime / Graph size

Running time ~ linear with graph size

Comparison to in-memory walks

Competitive with in-memory walks. However, if you can fit
your graph in memory – no need for DrunkardMob.

Summary
• DrunkardMob allows simulating random
walks efficiently on extremely large graphs
– Uses bulk of RAM for keeping track of walks,
graph streamed from disk.
– Graph size not limited by RAM.
– Implement Twitter Who-To-Follow on your Laptop!

• Future work: Adapt to distributed graph
systems.
– Even Hadoop if you really really want.

Thank You!
• Code: http://github.com/graphchi/graphchi-java
Aapo Kyrölä
Ph.D. candidate @ CMU
http://www.cs.cmu.edu/~akyrola
Twitter: @kyrpov

Special thanks to Pankaj Gupta, Dong Wang, Aneesh
Sharma and Jayarama Shenoy @ Twitter.
