2. GraphLab overview: GraphLab 1.0
● GraphLab: A New Framework for Parallel Machine Learning
– High-level abstractions for machine learning problems
– Shared-memory multiprocessor setting
– Assumes no fault tolerance is needed
– Concurrent-access processing models with sequential-consistency guarantees
3. GraphLab overview: GraphLab 1.0
● How does GraphLab 1.0 work?
– Represents the user's data as a directed graph
– Each block of data is represented by a vertex and its directed edges
– Shared data table holds globally shared state
– User functions (see the sketch below):
● Update: modifies the state of a vertex and its edges; has read-only access to the shared table
● Fold: sequentially aggregates vertex data into a key entry in the shared table
● Merge: parallelizes the Fold function by combining partial aggregates
● Apply: finalizes the key entry in the shared table
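A minimal sketch of how these four callbacks fit together; all type and function names here are illustrative stand-ins, not the actual GraphLab 1.0 API:

    #include <vector>

    // Illustrative types; not the real GraphLab 1.0 API.
    struct VertexData { double value; };
    struct EdgeData   { double weight; };

    // A scope gives an update function access to one vertex and its edges.
    struct Scope {
        VertexData& vertex;
        std::vector<EdgeData*> in_edges, out_edges;
    };

    // Update: may modify the vertex and its edges; shared-table entries
    // are visible only as read-only inputs.
    void update(Scope& scope, const double& shared_entry) {
        scope.vertex.value += shared_entry;
    }

    // Fold: sequentially folds one vertex into a shared-table accumulator.
    void fold(const VertexData& v, double& accumulator) {
        accumulator += v.value;
    }

    // Merge: combines two partial accumulators, letting Fold run in parallel.
    void merge(double& left, const double& right) {
        left += right;
    }

    // Apply: finalizes the accumulated value before it is written back to
    // the shared table, e.g. turning a sum into a mean.
    void apply(double& accumulator, int num_vertices) {
        accumulator /= num_vertices;
    }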
5. GraphLab overview: Distributed GraphLab 1.0
● Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud
– Fault tolerance using a snapshot algorithm
– Improved distributed parallel processing
– Two-stage partitioning:
● Atoms generated by ParMETIS
● Ghosts generated from the intersection (boundaries) of the atoms
– Finalize() function for vertex synchronization (see the ghost-sync sketch below)
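A minimal sketch of the ghost idea under those partitioning assumptions: an edge crossing two atoms caches the remote endpoint locally, and masters later push fresh values to every cached copy. The names (Atom, ghosts, sync_ghosts) are ours, not Distributed GraphLab's API:

    #include <unordered_map>

    struct Atom {
        std::unordered_map<int, double> owned;   // vertex id -> value (master copies)
        std::unordered_map<int, double> ghosts;  // cached copies of remote neighbors
    };

    // An edge whose endpoints live in different atoms creates a ghost on
    // each side, so update functions can read remote neighbors locally.
    void add_cross_edge(Atom& a, Atom& b, int src, int dst) {
        if (a.owned.count(src) && b.owned.count(dst)) {
            a.ghosts[dst] = b.owned.at(dst);
            b.ghosts[src] = a.owned.at(src);
        }
    }

    // After updates, master copies are pushed to the ghosts (the
    // synchronization role the slides attribute to Finalize()).
    void sync_ghosts(const Atom& from, Atom& to) {
        for (auto& [vid, value] : to.ghosts) {
            auto it = from.owned.find(vid);
            if (it != from.owned.end()) value = it->second;
        }
    }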
8. PowerGraph: Introduction
● GraphLab 2.1
● Problems with highly skewed power-law graphs:
– Workload imbalance ==> performance degradation
– Limited scalability
– Hard to partition if the graph is too large
– High storage cost
– Non-parallel computation over the edges of high-degree vertices
9. PowerGraph: New Abstraction
● Original functions:
– Update
– Finalize
– Fold
– Merge
– Apply: the synchronization Apply
● Introduces the GAS model (see the PageRank sketch below):
– Gather: over in-neighbors, out-neighbors, or all neighbors
– Apply: the GAS-model Apply
– Scatter
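A toy PageRank loop written in GAS style to make the three phases concrete; this is our own sketch of the abstraction, not PowerGraph's API:

    #include <cstdio>
    #include <vector>

    int main() {
        // Tiny directed graph: edges 0->1, 0->2, 1->2, 2->0.
        std::vector<std::vector<int>> out = {{1, 2}, {2}, {0}};
        std::vector<std::vector<int>> in  = {{2}, {0}, {0, 1}};
        std::vector<double> rank(3, 1.0 / 3.0);

        for (int iter = 0; iter < 20; ++iter) {
            std::vector<double> next(3);
            for (int v = 0; v < 3; ++v) {
                // Gather: sum contributions from in-neighbors
                // (a commutative, associative combine).
                double acc = 0.0;
                for (int u : in[v]) acc += rank[u] / out[u].size();
                // Apply: update the vertex value from the gathered sum.
                next[v] = 0.15 + 0.85 * acc;
                // Scatter: would activate out-neighbors whose input changed;
                // omitted here because every vertex runs each iteration.
            }
            rank = next;
        }
        for (int v = 0; v < 3; ++v) printf("rank[%d] = %.4f\n", v, rank[v]);
    }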
18. PowerGraph: Discussion
● Isn't it similar to the Pregel model?
– Pregel partially processes a vertex only if a message for it exists
● Gather, Apply, and Scatter are commutative and associative operations. What if the computation is not commutative?
– Sum the message values in a fixed order, so every run accumulates the same floating-point rounding error (see the example below)
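A small self-contained example of why the order matters: floating-point addition is not associative, so fixing the summation order is what makes the result reproducible:

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<double> msgs = {1e16, 1.0, -1e16};

        double forward = (msgs[0] + msgs[1]) + msgs[2];  // 0.0: the 1.0 is absorbed
        double rotated = (msgs[2] + msgs[0]) + msgs[1];  // 1.0: a different order

        // Sorting into a canonical order before summing makes the result
        // deterministic regardless of message arrival order.
        std::sort(msgs.begin(), msgs.end());
        double canonical = msgs[0] + msgs[1] + msgs[2];

        printf("forward=%g rotated=%g canonical=%g\n", forward, rotated, canonical);
    }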
19. PowerGraph and Mizan
● In Mizan we use partial replication:
[Figure: vertices a through g partitioned across workers W0 and W1; during the compute phase each worker updates its local vertices, and during the communication phase vertex a is partially replicated as a' on W1.]
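A minimal sketch of that replication scheme, assuming a master copy with per-worker replicas that accumulate locally during the compute phase and merge during the communication phase; the names are illustrative, not Mizan's actual API:

    #include <vector>

    struct Replica { double partial = 0.0; };  // e.g. a' on worker W1

    struct Master {                            // e.g. a on worker W0
        double value = 0.0;
        std::vector<Replica*> replicas;

        // Communication phase: fold every replica's locally accumulated
        // partial result into the master, then reset the replicas.
        void communicate() {
            for (Replica* r : replicas) {
                value += r->partial;
                r->partial = 0.0;
            }
        }
    };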
20. GraphChi: Introduction
● Asynchronous, disk-based version of GraphLab
● Utilizes the parallel sliding windows method
– Requires only a very small number of non-sequential disk accesses
● Supports graph updates
– Based on Kineograph, a distributed system for processing a continuous in-flow of graph updates while simultaneously running advanced graph mining algorithms
21. GraphChi: Graph Constraints
● The graph does not fit in memory
● A vertex, its edges, and their values fit in memory
22. GraphChi: Disk storage
● Compressed sparse row (CSR):
– Compressed adjacency list with indexes to the edges (see the sketch below)
– Fast access to a vertex's out-edges
● Compressed sparse column (CSC):
– CSR of the transpose graph
– Fast access to a vertex's in-edges
● Shard: stores the edges' data values
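A minimal CSR layout to make the indexing concrete; building the same arrays over the transposed graph gives CSC. This is a generic sketch, not GraphChi's on-disk format:

    #include <cstdio>
    #include <vector>

    // Out-edges of vertex v are adj[offset[v]] .. adj[offset[v+1] - 1].
    struct CSR {
        std::vector<int> offset;  // size |V| + 1, index into adj
        std::vector<int> adj;     // size |E|, edge targets
    };

    int main() {
        // Edges: 0->1, 0->2, 1->2, 2->0
        CSR g{{0, 2, 3, 4}, {1, 2, 2, 0}};
        for (int v = 0; v + 1 < (int)g.offset.size(); ++v)
            for (int i = g.offset[v]; i < g.offset[v + 1]; ++i)
                printf("%d -> %d\n", v, g.adj[i]);
    }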
23. GraphChi: Loading the graph
● The input graph is split into P disjoint vertex intervals, each associated with a shard, chosen so that the number of edges per shard is balanced (see the sketch below)
● A shard contains the data of the edges whose destination vertex falls in its interval
● The subgraph of an interval is constructed while reading its shard
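A greedy way to pick such intervals, assuming vertex degrees are known up front; split_intervals is a hypothetical helper, not GraphChi's sharder:

    #include <vector>

    // Split vertices 0..n-1 into at most P contiguous intervals with
    // roughly equal edge counts. Returns the exclusive end vertex of
    // each interval; interval lengths vary on skewed degree distributions.
    std::vector<int> split_intervals(const std::vector<int>& degree, int P) {
        long long total = 0;
        for (int d : degree) total += d;
        long long target = (total + P - 1) / P, acc = 0;

        std::vector<int> boundaries;
        for (int v = 0; v < (int)degree.size(); ++v) {
            acc += degree[v];
            if (acc >= target && (int)boundaries.size() < P - 1) {
                boundaries.push_back(v + 1);  // close the current interval
                acc = 0;
            }
        }
        boundaries.push_back((int)degree.size());  // last interval
        return boundaries;
    }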
24. GraphChi: Parallel Sliding Windows
● The vertices within an interval are processed in parallel
● P sequential disk accesses are required to process each interval
● The lengths of the intervals vary with the graph's degree distribution
● P * P sequential disk accesses are required for one superstep (see the outline below)
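An annotated outline of where those accesses come from: per interval, one read loads its memory shard and one sliding-window read touches each of the other P - 1 shards. The loading and update details are elided; only the counting is meant to be faithful:

    #include <cstdio>

    int main() {
        const int P = 4;                 // number of intervals / shards
        int sequential_accesses = 0;

        for (int p = 0; p < P; ++p) {    // process interval p
            // Memory shard: shard p holds all in-edges of interval p.
            ++sequential_accesses;       // one sequential read loads it fully
            for (int q = 0; q < P; ++q) {
                if (q == p) continue;
                // Sliding shard: read only the window of shard q whose
                // edges point into interval p; windows advance with p.
                ++sequential_accesses;   // one sequential read per window
            }
            // ... run update() on interval p's vertices, write windows back
        }
        printf("%d sequential accesses (= P*P) per superstep\n",
               sequential_accesses);
    }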
28. GraphChi: Evolving Graphs
● Adding an edge is reflected in the intervals and shards when they are next read
● Deleting an edge causes that edge to be flagged and subsequently ignored
● Edge additions and deletions are applied only after the current interval has been processed (see the sketch below)
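A minimal sketch of deferring mutations until an interval boundary; the Engine type and its methods are illustrative, not GraphChi's API:

    #include <vector>

    struct EdgeUpdate { int src, dst; bool deleted; };

    struct Engine {
        std::vector<EdgeUpdate> pending;  // buffered mutations

        // Mutations are only buffered while an interval is executing.
        void add_edge(int s, int d)    { pending.push_back({s, d, false}); }
        void delete_edge(int s, int d) { pending.push_back({s, d, true}); }

        // Called between intervals: merge insertions into the owning
        // shard and flag deleted edges so later reads ignore them.
        void finish_interval() {
            for (const EdgeUpdate& u : pending) {
                (void)u;  // shard-merge details elided in this sketch
            }
            pending.clear();
        }
    };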