2. GraphLab overview: GraphLab 1.0
● GraphLab: A New Framework for Parallel Machine Learning
– High-level abstractions for machine learning problems
– Shared-memory multiprocessor setting
– Assumes no fault tolerance is needed
– Concurrent-access processing models with sequential-consistency guarantees
3. GraphLab overview: GraphLab 1.0
● How does GraphLab 1.0 work?
– Represents the user's data as a directed graph
– Each block of data is represented by a vertex and its directed edges
– Shared data table holds globally shared state
– User functions (see the sketch below):
● Update: modifies the state of a vertex and its edges; has read-only access to the shared table
● Fold: sequentially aggregates vertex data into a key entry in the shared table
● Merge: parallelizes the Fold function by combining partial aggregates
● Apply: finalizes the key entry in the shared table
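A minimal sketch of how these four callbacks fit together; all type and function names here are illustrative stand-ins, not the actual GraphLab 1.0 API:

    #include <vector>

    // Illustrative types; not the real GraphLab 1.0 API.
    struct VertexData { double value; };
    struct EdgeData   { double weight; };

    // A scope gives an update function access to one vertex and its edges.
    struct Scope {
        VertexData& vertex;
        std::vector<EdgeData*> in_edges, out_edges;
    };

    // Update: may modify the vertex and its edges; shared-table entries
    // are visible only as read-only inputs.
    void update(Scope& scope, const double& shared_entry) {
        scope.vertex.value += shared_entry;
    }

    // Fold: sequentially folds one vertex into a shared-table accumulator.
    void fold(const VertexData& v, double& accumulator) {
        accumulator += v.value;
    }

    // Merge: combines two partial accumulators, letting Fold run in parallel.
    void merge(double& left, const double& right) {
        left += right;
    }

    // Apply: finalizes the accumulated value before it is written back to
    // the shared table, e.g. turning a sum into a mean.
    void apply(double& accumulator, int num_vertices) {
        accumulator /= num_vertices;
    }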
5. GraphLab overview: Distributed GraphLab 1.0
● Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud
– Fault tolerance using a snapshot algorithm
– Improved distributed parallel processing
– Two-stage partitioning:
● Atoms generated by ParMETIS
● Ghosts generated from the intersection (boundaries) of the atoms
– Finalize() function for vertex synchronization (see the ghost-sync sketch below)
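A minimal sketch of the ghost idea under those partitioning assumptions: an edge crossing two atoms caches the remote endpoint locally, and masters later push fresh values to every cached copy. The names (Atom, ghosts, sync_ghosts) are ours, not Distributed GraphLab's API:

    #include <unordered_map>

    struct Atom {
        std::unordered_map<int, double> owned;   // vertex id -> value (master copies)
        std::unordered_map<int, double> ghosts;  // cached copies of remote neighbors
    };

    // An edge whose endpoints live in different atoms creates a ghost on
    // each side, so update functions can read remote neighbors locally.
    void add_cross_edge(Atom& a, Atom& b, int src, int dst) {
        if (a.owned.count(src) && b.owned.count(dst)) {
            a.ghosts[dst] = b.owned.at(dst);
            b.ghosts[src] = a.owned.at(src);
        }
    }

    // After updates, master copies are pushed to the ghosts (the
    // synchronization role the slides attribute to Finalize()).
    void sync_ghosts(const Atom& from, Atom& to) {
        for (auto& [vid, value] : to.ghosts) {
            auto it = from.owned.find(vid);
            if (it != from.owned.end()) value = it->second;
        }
    }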
8. PowerGraph: Introduction
● GraphLab 2.1
● Problems with highly skewed power-law graphs:
– Workload imbalance ==> performance degradation
– Limited scalability
– Hard to partition if the graph is too large
– High storage cost
– Non-parallel computation over the edges of high-degree vertices
9. PowerGraph: New Abstraction
● Original functions:
– Update
– Finalize
– Fold
– Merge
– Apply: the synchronization Apply
● Introduces the GAS model (see the PageRank sketch below):
– Gather: over in-neighbors, out-neighbors, or all neighbors
– Apply: the GAS-model Apply
– Scatter
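A toy PageRank loop written in GAS style to make the three phases concrete; this is our own sketch of the abstraction, not PowerGraph's API:

    #include <cstdio>
    #include <vector>

    int main() {
        // Tiny directed graph: edges 0->1, 0->2, 1->2, 2->0.
        std::vector<std::vector<int>> out = {{1, 2}, {2}, {0}};
        std::vector<std::vector<int>> in  = {{2}, {0}, {0, 1}};
        std::vector<double> rank(3, 1.0 / 3.0);

        for (int iter = 0; iter < 20; ++iter) {
            std::vector<double> next(3);
            for (int v = 0; v < 3; ++v) {
                // Gather: sum contributions from in-neighbors
                // (a commutative, associative combine).
                double acc = 0.0;
                for (int u : in[v]) acc += rank[u] / out[u].size();
                // Apply: update the vertex value from the gathered sum.
                next[v] = 0.15 + 0.85 * acc;
                // Scatter: would activate out-neighbors whose input changed;
                // omitted here because every vertex runs each iteration.
            }
            rank = next;
        }
        for (int v = 0; v < 3; ++v) printf("rank[%d] = %.4f\n", v, rank[v]);
    }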
18. PowerGraph: Discussion
● Isn't it similar to the Pregel model?
– Pregel partially processes a vertex only if a message for it exists
● Gather, Apply, and Scatter are commutative and associative operations. What if the computation is not commutative?
– Sum the message values in a fixed order, so every run accumulates the same floating-point rounding error (see the example below)
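A small self-contained example of why the order matters: floating-point addition is not associative, so fixing the summation order is what makes the result reproducible:

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<double> msgs = {1e16, 1.0, -1e16};

        double forward = (msgs[0] + msgs[1]) + msgs[2];  // 0.0: the 1.0 is absorbed
        double rotated = (msgs[2] + msgs[0]) + msgs[1];  // 1.0: a different order

        // Sorting into a canonical order before summing makes the result
        // deterministic regardless of message arrival order.
        std::sort(msgs.begin(), msgs.end());
        double canonical = msgs[0] + msgs[1] + msgs[2];

        printf("forward=%g rotated=%g canonical=%g\n", forward, rotated, canonical);
    }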
19. PowerGraph and Mizan
● In Mizan we use partial replication:
[Figure: vertices a through g partitioned across workers W0 and W1; during the compute phase each worker updates its local vertices, and during the communication phase vertex a is partially replicated as a' on W1.]
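A minimal sketch of that replication scheme, assuming a master copy with per-worker replicas that accumulate locally during the compute phase and merge during the communication phase; the names are illustrative, not Mizan's actual API:

    #include <vector>

    struct Replica { double partial = 0.0; };  // e.g. a' on worker W1

    struct Master {                            // e.g. a on worker W0
        double value = 0.0;
        std::vector<Replica*> replicas;

        // Communication phase: fold every replica's locally accumulated
        // partial result into the master, then reset the replicas.
        void communicate() {
            for (Replica* r : replicas) {
                value += r->partial;
                r->partial = 0.0;
            }
        }
    };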
20. GraphChi: Introduction
● Asynchronous, disk-based version of GraphLab
● Utilizes the parallel sliding windows method
– Requires only a very small number of non-sequential disk accesses
● Supports graph updates
– Based on Kineograph, a distributed system for processing a continuous in-flow of graph updates while simultaneously running advanced graph mining algorithms
21. GraphChi: Graph Constraints
● The graph does not fit in memory
● A vertex, its edges, and their values fit in memory
22. GraphChi: Disk storage
● Compressed sparse row (CSR):
– Compressed adjacency list with indexes to the edges (see the sketch below)
– Fast access to a vertex's out-edges
● Compressed sparse column (CSC):
– CSR of the transpose graph
– Fast access to a vertex's in-edges
● Shard: stores the edges' data values
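A minimal CSR layout to make the indexing concrete; building the same arrays over the transposed graph gives CSC. This is a generic sketch, not GraphChi's on-disk format:

    #include <cstdio>
    #include <vector>

    // Out-edges of vertex v are adj[offset[v]] .. adj[offset[v+1] - 1].
    struct CSR {
        std::vector<int> offset;  // size |V| + 1, index into adj
        std::vector<int> adj;     // size |E|, edge targets
    };

    int main() {
        // Edges: 0->1, 0->2, 1->2, 2->0
        CSR g{{0, 2, 3, 4}, {1, 2, 2, 0}};
        for (int v = 0; v + 1 < (int)g.offset.size(); ++v)
            for (int i = g.offset[v]; i < g.offset[v + 1]; ++i)
                printf("%d -> %d\n", v, g.adj[i]);
    }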
23. GraphChi: Loading the graph
● The input graph is split into P disjoint vertex intervals, each associated with a shard, chosen so that the number of edges per shard is balanced (see the sketch below)
● A shard contains the data of the edges whose destination vertex falls in its interval
● The subgraph of an interval is constructed while reading its shard
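A greedy way to pick such intervals, assuming vertex degrees are known up front; split_intervals is a hypothetical helper, not GraphChi's sharder:

    #include <vector>

    // Split vertices 0..n-1 into at most P contiguous intervals with
    // roughly equal edge counts. Returns the exclusive end vertex of
    // each interval; interval lengths vary on skewed degree distributions.
    std::vector<int> split_intervals(const std::vector<int>& degree, int P) {
        long long total = 0;
        for (int d : degree) total += d;
        long long target = (total + P - 1) / P, acc = 0;

        std::vector<int> boundaries;
        for (int v = 0; v < (int)degree.size(); ++v) {
            acc += degree[v];
            if (acc >= target && (int)boundaries.size() < P - 1) {
                boundaries.push_back(v + 1);  // close the current interval
                acc = 0;
            }
        }
        boundaries.push_back((int)degree.size());  // last interval
        return boundaries;
    }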
24. GraphChi: Parallel Sliding Windows
● The vertices within an interval are processed in parallel
● P sequential disk accesses are required to process each interval
● The lengths of the intervals vary with the graph's degree distribution
● P * P sequential disk accesses are required for one superstep (see the outline below)
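An annotated outline of where those accesses come from: per interval, one read loads its memory shard and one sliding-window read touches each of the other P - 1 shards. The loading and update details are elided; only the counting is meant to be faithful:

    #include <cstdio>

    int main() {
        const int P = 4;                 // number of intervals / shards
        int sequential_accesses = 0;

        for (int p = 0; p < P; ++p) {    // process interval p
            // Memory shard: shard p holds all in-edges of interval p.
            ++sequential_accesses;       // one sequential read loads it fully
            for (int q = 0; q < P; ++q) {
                if (q == p) continue;
                // Sliding shard: read only the window of shard q whose
                // edges point into interval p; windows advance with p.
                ++sequential_accesses;   // one sequential read per window
            }
            // ... run update() on interval p's vertices, write windows back
        }
        printf("%d sequential accesses (= P*P) per superstep\n",
               sequential_accesses);
    }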
28. GraphChi: Evolving Graphs
● Adding an edge is reflected in the intervals and shards when they are next read
● Deleting an edge causes that edge to be flagged and subsequently ignored
● Edge additions and deletions are applied only after the current interval has been processed (see the sketch below)
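A minimal sketch of deferring mutations until an interval boundary; the Engine type and its methods are illustrative, not GraphChi's API:

    #include <vector>

    struct EdgeUpdate { int src, dst; bool deleted; };

    struct Engine {
        std::vector<EdgeUpdate> pending;  // buffered mutations

        // Mutations are only buffered while an interval is executing.
        void add_edge(int s, int d)    { pending.push_back({s, d, false}); }
        void delete_edge(int s, int d) { pending.push_back({s, d, true}); }

        // Called between intervals: merge insertions into the owning
        // shard and flag deleted edges so later reads ignore them.
        void finish_interval() {
            for (const EdgeUpdate& u : pending) {
                (void)u;  // shard-merge details elided in this sketch
            }
            pending.clear();
        }
    };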