Thanks: Major part of this work done during
visit at Twitter’s Personalization and
Recommendations team (Fall-2012).

DrunkardMob: Billions of
Random Walks on Just a PC
Aapo Kyrola
Carnegie Mellon University
Twitter: @kyrpov
Big Data – small machine
DrunkardMob - RecSys '13
This work in a Nutshell
1. Background: Random-walk-based methods are popular in Recommender Systems.
2. Research problem: How to simulate random walks if your graph does not fit in memory?
3. Solution: Instead of doing one walk at a time, do billions of them at a time. Stream the graph from disk and maintain walk states in RAM.
Contents
• Introduction to random walks
• Disk-based graph systems: GraphChi
• DrunkardMob algorithm
• Experiments

All code available on GitHub:
http://github.com/graphchi/graphchi-java
Introduction: Random Walks
• Graph: G(V, E)
– V = vertices / nodes, E = edges / links.

• A walk is a random sequence of t visits to vertices:
w := source(0) → v(1) → v(2) → v(3) → … → v(t)

• Walks follow edges by default, but can also reset or teleport with a certain probability.
– Transition probability: P(v(k+1) | v(k))
Introduction (cont.)
• Usually we are interested in the distribution of the visits.
– Either the global distribution or the distribution for each source separately.
– Many applications (PageRank, FolkRank, SALSA, ...)

• Can be used to generate candidates:
– Choose the top K visited vertices as candidates to recommend.
Example: Global PageRank
• Model: a random surfer who starts from a random webpage and clicks each link on the page with uniform probability:
– With probability d, teleports to a random vertex → infinite walk.

[Figure: a surfer at a vertex with three out-links follows each link with P = (1-d) / 3 and teleports to “any vertex” with P = d.]

• PageRank(web page) ~ authority of the web page.

Can be computed using “power iteration” very efficiently (in secs / minutes even for graphs with billions of vertices) → Not interesting.
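As a small illustration of the power iteration mentioned for global PageRank, the sketch below keeps the slide's convention that d is the teleport probability. The function name `pagerank`, the adjacency-list input, and the choice to spread the mass of dangling vertices uniformly are assumptions made for this example, not details from the talk.

```python
def pagerank(out_links, d=0.15, iterations=50):
    """Global PageRank by power iteration.

    out_links[v] lists the vertices that v links to.
    d is the teleport probability (as on the slide), so each
    out-link of v is followed with probability (1 - d) / len(out_links[v]).
    """
    n = len(out_links)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        # Teleport mass: every vertex receives d / n.
        nxt = [d / n] * n
        for v, neighbors in enumerate(out_links):
            if neighbors:
                share = (1.0 - d) * rank[v] / len(neighbors)
                for u in neighbors:
                    nxt[u] += share
            else:
                # Assumption: a vertex without out-links spreads its mass uniformly.
                share = (1.0 - d) * rank[v] / n
                for u in range(n):
                    nxt[u] += share
        rank = nxt
    return rank


# A directed 3-cycle is symmetric, so every vertex ends up with rank 1/3.
cycle_ranks = pagerank([[1], [2], [0]])
```

Each iteration touches every edge once, which is why this scales to very large graphs when only a single global vector is needed.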
Personalized Pagerank
• PageRank conditioned on home (source) nodes:
– Compute a PageRank vector for each node separately → resets go only to the home node(s).
– Restrict home nodes to some category / topic / pages visited by a user.

• Used e.g. for social network recommendations.

[Figure: a walker at a vertex with three out-links follows each link with P = (1-d) / 3 and resets to the “home vertex” with P = d.]
Personalized Pagerank (cont.)
• Naïve computation of Personalized
Pagerank (PPR):
– Compute pagerank vector for each source
separately using power iteration: O(n^2)

• Approximate by sampling:
– Simulate actual walks on the graph.
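The sampling approach can be sketched for a single source as follows. This is a minimal in-memory illustration, assuming an adjacency-list graph; the function name `sample_ppr`, its parameters, and the decision to reset at vertices without out-links are hypothetical choices for the example.

```python
import random
from collections import Counter


def sample_ppr(out_links, source, num_walks=1000, hops=10, d=0.15, rng=None):
    """Estimate personalized PageRank for one source by simulating short walks.

    Each walk starts at `source`; at every hop it resets to `source`
    with probability d, otherwise it follows a uniformly random out-link.
    """
    rng = rng or random.Random(0)
    visits = Counter()
    for _ in range(num_walks):
        v = source
        for _ in range(hops):
            neighbors = out_links[v]
            # Assumption: a walk stuck at a vertex without out-links resets.
            if not neighbors or rng.random() < d:
                v = source
            else:
                v = rng.choice(neighbors)
            visits[v] += 1
    total = sum(visits.values())
    # Normalize visit counts into an empirical distribution.
    return {v: count / total for v, count in visits.items()}


# Vertex 3 links into the cycle but cannot be reached from source 0.
graph = [[1], [2], [0], [0]]
ppr = sample_ppr(graph, source=0, num_walks=200, hops=10)
top_candidates = sorted(ppr, key=ppr.get, reverse=True)[:2]
```

Sorting the estimate by visit count gives the "top K visited vertices" used as recommendation candidates.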

Random walk in an in-memory
graph
• Compute one walk at a time (multiple in parallel, of course):
parfor walk in walks:
    for i = 1 to <number of steps>:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())
Problem: What if Graph does not
fit in memory?
[Figure: Twitter network visualization, by Akshay Java, 2009]

Disk-based “single-machine” graph systems (this talk):
- “Paging” from disk is costly.

Distributed graph systems:
- Each hop across a partition boundary is costly.
DISK-BASED GRAPH SYSTEMS
Disk-based Graph Systems
• Recently, frameworks that can handle graphs with billions of edges on a single machine, using disk, have been proposed:
– GraphChi (Kyrola, Blelloch, Guestrin: OSDI’12)
– TurboGraph (KDD’13)
– [X-Stream (SOSP’13) – model not suitable]

• We assume a vertex-centric model:
– Computation is done one vertex at a time.
GraphChi execution model
[Figure: vertices 1 … n (v1, v2, …) are divided into interval(1), interval(2), …, interval(P), with a corresponding shard(1), shard(2), …, shard(P).]

For T iterations:
    For p = 1 to P:
        For vertex in interval(p):
            updateFunction(vertex)
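The nested loop of the execution model can be imitated in memory as shown below. This is only a toy sketch: `run_vertex_centric` and its parameters are hypothetical names, vertex ids are 0-based here, and a real disk-based system would load the shard for each interval where the comment indicates.

```python
def run_vertex_centric(num_vertices, num_intervals, iterations, update_function):
    """Call update_function(vertex) for every vertex, interval by interval."""
    # Ceiling division: every interval holds at most `size` consecutive vertices.
    size = -(-num_vertices // num_intervals)
    intervals = [
        range(start, min(start + size, num_vertices))
        for start in range(0, num_vertices, size)
    ]
    for _ in range(iterations):
        for interval in intervals:
            # A disk-based system would load the edges for this interval here.
            for vertex in interval:
                update_function(vertex)
    return intervals


# Record the order in which vertices are updated over two iterations.
order = []
intervals = run_vertex_centric(5, 2, 2, order.append)
```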
Random walk is often called “Drunkard’s Walk”

DRUNKARDMOB ALGORITHM

DrunkardMob: Basic Idea
• By example:
– Task: Compute personalized PageRank (PPR) for 1 million users in a social network, in parallel.
• I.e., 1M different home/source nodes.

– For each user, launch 1000 random walks (with resets), in parallel.
• Each walk takes 10 hops.
~ Equivalent to one 10,000-hop walk (with resets) per user.

– For each user, keep track of the visits made by its 1000 short walks → PPR for each user.
– Store the state of each walk in RAM, process the graph from disk.
= 1B random walks in parallel → ~5 GB of RAM.
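The numbers in this example can be checked with a few lines of arithmetic. The 4 bytes per walk come from the "Encoding walks" slide; attributing the gap between the raw 4 GB and the quoted ~5 GB to bookkeeping overhead is an assumption, since the slides do not break the figure down.

```python
users = 1_000_000        # home/source nodes
walks_per_user = 1_000   # short walks launched per user
hops_per_walk = 10       # hops taken by each walk
bytes_per_walk = 4       # per the "Encoding walks" slide

total_walks = users * walks_per_user             # walks kept in RAM at once
hops_per_user = walks_per_user * hops_per_walk   # comparable to one long walk
raw_walk_bytes = total_walks * bytes_per_walk    # walk state only, no overhead

print(total_walks)            # 1000000000
print(hops_per_user)          # 10000
print(raw_walk_bytes / 1e9)   # 4.0 (GB, before any bookkeeping overhead)
```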
Random walks in GraphChi
• DrunkardMob algorithm
– Reverse thinking: iterate over vertices and advance the walks currently at each vertex.
ForEach interval p:
    walkSnapshot = getWalksForInterval(p)
    ForEach vertex in interval(p):
        mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
        ForEach walk in mywalks:
            walkManager.addHop(walk, vertex.randomNeighbor())

Note: Only the current position of each walk needs to be stored!
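A minimal in-memory sketch of this vertex-first loop is given below. It is not the GraphChi implementation: the name `drunkard_mob`, the dictionary-based walk table, and the double-buffering that guarantees exactly one hop per walk per iteration are all simplifying assumptions. Each walk is represented only by its source id, stored under its current vertex.

```python
import random
from collections import defaultdict


def drunkard_mob(out_links, sources, walks_per_source, hops, num_intervals,
                 d=0.15, seed=0):
    """Advance all walks together by sweeping over vertices, interval by interval."""
    rng = random.Random(seed)
    n = len(out_links)
    size = -(-n // num_intervals)  # vertices per interval (ceiling division)

    # walks_at[v] holds the source id of every walk currently at vertex v.
    walks_at = defaultdict(list)
    for s in sources:
        walks_at[s].extend([s] * walks_per_source)

    visits = defaultdict(lambda: defaultdict(int))  # source -> vertex -> count
    for _ in range(hops):
        next_walks = defaultdict(list)
        for start in range(0, n, size):  # one execution interval
            # A disk-based system would stream this interval's edges here.
            for v in range(start, min(start + size, n)):
                for src in walks_at.pop(v, []):
                    neighbors = out_links[v]
                    if not neighbors or rng.random() < d:
                        dst = src  # reset to the walk's home vertex
                    else:
                        dst = rng.choice(neighbors)
                    next_walks[dst].append(src)
                    visits[src][dst] += 1
        walks_at = next_walks
    return visits


visits = drunkard_mob([[1], [2], [0]], sources=[0, 1],
                      walks_per_source=100, hops=5, num_intervals=2)
```

Because the graph is read sequentially once per sweep, the access pattern suits streaming from disk, while the random accesses go to the in-memory walk table.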
WalkManager
• Store walks in buckets
– Array for each vertex would cost too much.

Encoding walks

Only 4 bytes / walk.

Keeps track of each path → knowledge base applications.
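One way a walk could fit in 4 bytes when walks are stored in buckets is to pack a source index together with the vertex's offset inside its bucket. The sketch below illustrates that idea only; the bucket size of 256, the 24/8 bit split, and the names `encode_walk`, `decode_walk` and `WalkBuckets` are hypothetical and not taken from the slides.

```python
from array import array

BUCKET_SIZE = 256      # hypothetical: vertices per bucket
OFFSET_BITS = 8        # log2(BUCKET_SIZE)
MAX_SOURCES = 1 << 24  # hypothetical: remaining 24 bits hold the source index


def encode_walk(source_idx, vertex):
    """Pack (source index, offset of vertex within its bucket) into 32 bits."""
    if not 0 <= source_idx < MAX_SOURCES:
        raise ValueError("source index does not fit in 24 bits")
    return (source_idx << OFFSET_BITS) | (vertex % BUCKET_SIZE)


def decode_walk(walk, bucket_idx):
    """Recover (source index, vertex) from a packed walk and its bucket number."""
    source_idx = walk >> OFFSET_BITS
    vertex = bucket_idx * BUCKET_SIZE + (walk & (BUCKET_SIZE - 1))
    return source_idx, vertex


class WalkBuckets:
    """One compact integer array per bucket instead of one array per vertex."""

    def __init__(self, num_vertices):
        num_buckets = (num_vertices + BUCKET_SIZE - 1) // BUCKET_SIZE
        # 'I' is an unsigned int, 4 bytes on common platforms.
        self.buckets = [array("I") for _ in range(num_buckets)]

    def add(self, source_idx, vertex):
        self.buckets[vertex // BUCKET_SIZE].append(encode_walk(source_idx, vertex))

    def walks_in_bucket(self, bucket_idx):
        return [decode_walk(w, bucket_idx) for w in self.buckets[bucket_idx]]


table = WalkBuckets(num_vertices=1000)
table.add(source_idx=5, vertex=700)  # vertex 700 falls into bucket 2
```

The bucket number supplies the high bits of the vertex id, so only the low bits need to be stored with each walk.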
Keeping track of walks
[Figure: GraphChi processes an execution interval using the vertex walks table (WalkManager); the walk distribution tracker (DrunkardCompanion) holds the top-N visits for each source (Source A, Source B).]
Keeping track of Walks
• If we don’t have enough RAM to store the distributions:
– Cut long tails: Similar problem to estimating top-K frequent items in data streams with limited memory.

• Can also write hops to disk (bucket-by-bucket) and analyze later.
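The slides do not say which streaming technique is used to cut the long tail. As one well-known option for the frequent-items problem, the sketch below uses the Misra-Gries summary, which keeps at most k - 1 counters and is guaranteed to retain every item that occurs more than n / k times in a stream of length n. The function name and the example stream are made up for illustration.

```python
def misra_gries(stream, k):
    """Approximate frequent-item summary using at most k - 1 counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Visits recorded for one source: two popular vertices and a long tail.
visit_stream = [7] * 60 + [3] * 30 + list(range(100, 110))
summary = misra_gries(visit_stream, k=4)
```

The stored counts are underestimates, so they work for ranking candidates but not as exact visit frequencies.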
Validity
• We assume that simulating 2000 x 5-hop walks with resets ~ one 10,000-hop walk with resets.
– Not exactly the same distribution: some longer streaks are not covered.
• But those would not be relevant for recommendations anyway!

– See Fogaras (2005) for analysis.
Related Work
• Fogaras, Racz, Csalogany, Sarlos:
“Towards scaling fully personalized
pagerank: Algorithms, lower bounds,
experiments” (2005)
– Similar idea with full external memory
implementation.
• We keep walks in memory.

• Plenty of research on approximating PPR.
See paper for more
experiments!

EXPERIMENTS

Case Study: Twitter WTF
• Implemented Twitter’s Who-to-Follow algorithm on GraphChi (see paper).
– Based on the WWW’13 paper by Gupta et al.
– Use DrunkardMob to generate a set of candidates to recommend for each user.
PPR: Full Twitter Graph
With a large server with SSD and 144 GB of memory:

On a Mac laptop, could estimate 500K-1M PPRs (= 0.5-1B walks) in roughly the same time.
Runtime / Graph size

Running time ~ linear with graph size
Comparison to in-memory walks

Competitive with in-memory walks. However, if you can fit
your graph in memory – no need for DrunkardMob.
Summary
• DrunkardMob allows simulating random walks efficiently on extremely large graphs.
– Uses the bulk of RAM for keeping track of walks; the graph is streamed from disk.
– Graph size is not limited by RAM.
– Implement Twitter Who-To-Follow on your laptop!

• Future work: Adapt to distributed graph systems.
– Even Hadoop if you really, really want.
Thank You!
• Code: http://github.com/graphchi/graphchi-java
Aapo Kyrölä
Ph.D. candidate @ CMU
http://www.cs.cmu.edu/~akyrola
Twitter: @kyrpov

Special thanks to Pankaj Gupta, Dong Wang, Aneesh
Sharma and Jayarama Shenoy @ Twitter.

