To meet the challenge of processing rapidly growing graph and
network data created by modern applications, a number of distributed graph processing systems have emerged, such as Pregel and GraphLab. All these systems divide input graphs into partitions, and employ a “think like a vertex” programming model to support iterative graph computation. This vertex-centric model is easy to program and has been proved useful for many graph algorithms. However, this model hides the partitioning information from the users, thus prevents many algorithm-specific optimizations. This often results in longer execution time due to excessive network messages (e.g. in Pregel) or heavy scheduling overhead to ensure data consistency (e.g. in GraphLab). To address this limitation, we
propose a new “think like a graph” programming paradigm. Under this graph-centric model, the partition structure is opened up to the users, and can be utilized so that communication within a partition can bypass the heavy message passing or scheduling machinery. We implemented this model in a new system, called Giraph++, based on Apache Giraph, an open source implementation of Pregel. We explore the applicability of the graph-centric model to three categories of graph algorithms, and demonstrate its flexibility and superior performance, especially on well-partitioned data.
Giraph++: From "Think Like a Vertex" to "Think Like a Graph"
1. From “Think Like a
Vertex” to “Think Like
a Graph”
Yuanyuan Tian, Andrey Balmin, Severin Andreas
Corsten, Shirish Tatikonda, John McPherson
2. Large Scale Graph Processing
Graph data is everywhere and growing rapidly!
126 million blogs
50 billion Web pages
90 trillion emails
Analyzing graph data is increasingly important.
Retail: customer segmentation and recommendation
Consumer Insights: key influencer analysis
Telco: churn prediction
Biology & Health Care: disease propagation analysis
Government: Terrorists detection
What is the right programming model for large-scale graph
processing?
3. Existing Graph Processing Systems
Divide input graphs into partitions
Employ vertex-centric programming model
Programmers “think like a vertex”
Operate on a vertex and its edges
Communication to other vertices
Message passing (e.g Pregel/Giraph)
Scheduling of updates (e.g GraphLab)
4. “Think like a vertex” “Think like a graph”
“Think like a vertex” “Think like a graph”
Partition: A collection of vertices A proper subgraph
Computation: A vertex and its edges A subgraph
Communication: 1-hop at a time
e.g ABD
Multiple-hops at a time
e.g. AD
5. Graph-Centric Programming Model
Expose subgraphs to programmers
Internal vertices vs boundary vertices
Information exchange between internal vertices of a partition is
immediate
Messages are only sent from boundary vertices of a partition to
internal vertices of a different partition
internal vertex
external vertex
(primary copy)
(local copy)
message
6. Advantages of graph-centric model
Any algorithm expressed in the vertex-centric model can
be expressed in the graph-centric model
Allow lower-level access for algorithm-specific
optimizations
Use of existing off-the-shell graph algorithms on subgraphs
Local asynchronous computation
Natural translation of existing graph-centric parallel algorithms
Provide sufficiently high-level abstraction for easy of use
7. Example: Weakly Connected Component (WCC)
compute() on each subgraph
If 0th
superstep
Sequential WCC on subgraph
Send labels of boundary vertices
to their corresponding internal
vertices
Else
Use the received messages to
update the labels of vertices in the
subgraph
Merge connected components
For boundary vertices with label
change, send labels to their
corresponding internal vertices
vertex-centricmodelg
8. Hybrid Execution Model
Can we keep the simple vertex-centric programming
model but still improve performance?
Key: Differentiate internal messages and external
messages
What messages can be used in local computation?
Vertex-centric model
Only messages (external and internal) from previous superstep
Hybrid model
External messages from previous superstep
Internal messages from previous + current superstep (local
asynchronous computation)
Only apply to a limited set of graph algorithms
9. Giraph++: A New Graph Processing System
Built on top of Apache Giraph
Support vertex-centric model (VM), graph-centric model
(GM) & hybrid model (HM) in the same Giraph++ system
Contributed to Apache Giraph Project
Planed in future Giraph release
10. A Peek of Performance Results
10-node cluster
Per node: 32GB RAM, 8 cores, 7 workers
Network: 1Gbit
Dataset: 4 web graphs
# nodes: 19million ~ 428million
# edges: 298million ~ 1.0billion
Significant improvement in execution time
and network messages, especially with
better graph partitioning strategy
Up to 62X speedup in execution time!
Up to 200X reduction in network messages!
Data Set
Time(sec)
Networkmsgs
VM: vertex-centric model
GM: graph-centric model
HM: hybrid model
HP: hash partitioning
GP: graph partitioningWCC Algorithm
11. Conclusion
“Think like a vertex” “Think like a graph”
Take advantage of local graph structure in a partition
Enable complex and flexible graph algorithms
Can be exploited to various graph applications
Bring significant performance improvement, especially with
better graph partitioning strategy
A valuable complement to existing vertex-centric model
Notes de l'éditeur
The industry of large scale grpah processing is mainly dirven by the open source communities, of which Giraph is one of the most important projects. By contributing to Giraph, we influence the direction of the industry.