The document summarizes the Pregel system, which was designed for large-scale graph processing. Pregel addresses the inefficiency of MapReduce for graph problems by allowing direct message passing between vertices during synchronized iterations. It provides fault tolerance through checkpointing and a master-worker architecture. Key contributions of Pregel include its distributed programming model and APIs for message passing, combining messages to reduce overhead, global communication through aggregators, and mutating graph topology. The paper notes strengths like fault tolerance but also weaknesses such as putting responsibility on the user and lack of master failure detection.
Pregel: A System for Large-Scale Graph Processing
Paper Review
Maria Stylianou
November 2, 2012
1 Motivation
Nowadays, large-scale graphs, such as the Web graph and social networks, are among the
main sources of new computing problems, and processing them efficiently is a challenge.
MapReduce can be used, but it is very inefficient for graph algorithms because the entire
state of the graph must be passed from one stage to the next. Hence, the authors propose
Pregel, a distributed programming model designed specifically for processing large-scale
graphs while preserving efficiency, scalability and fault tolerance [1].
2 Contributions
Until Pregel, there was a gap in frameworks for large-scale graph processing that offer
scalability while being distributed and fault-tolerant; Pregel is designed with exactly
these characteristics. The authors built it for the Google cluster architecture, in which
clusters are interconnected and geographically distributed, each containing thousands of
commodity machines. Their main contributions include: 1. The design of a fault-tolerant
distributed programming framework that enables parallel execution of graph algorithms
over thousands of machines. 2. The provision of an API with direct message passing
among vertices, combiners for reducing overhead, aggregators for global communication
and monitoring, and topology mutations with resolution of conflicting requests.
3 Solution
Pregel operates as a repeated, synchronized computation over vertices. The input graph
is divided into partitions, each containing a set of vertices and their outgoing edges.
The partitions are assigned to machines; one machine acts as the master that coordinates
the worker machines. The workers then execute a series of iterations called supersteps.
In every superstep, each vertex on each worker runs the same user-defined function,
which can (a) receive messages sent during the previous superstep, (b) modify the state
of the vertex and its outgoing edges (vertices and edges are kept on the machines), and
(c) send messages to be delivered in the next superstep. At the end of each superstep
there is a global synchronization point. Vertices can become inactive, and the sequence
of iterations terminates when all vertices are inactive and there are no messages in
transit. During the computation, the master also sends ping messages to workers to
detect failures. The network is used only for sending messages, which keeps the
communication overhead low.
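As an illustration, the superstep loop described above can be sketched in a few lines of Python. This is a minimal single-process model of the execution, not Pregel's actual C++ API; all names here are hypothetical:

```python
def run_pregel(graph, values, compute):
    """graph: {vertex: [out-neighbors]}; values: mutable {vertex: value}."""
    inbox = {v: [] for v in graph}   # messages delivered this superstep
    active = set(graph)              # vertices that have not voted to halt
    superstep = 0
    while active:
        outbox = {v: [] for v in graph}
        for v in list(active):
            send = lambda dst, msg: outbox[dst].append(msg)
            # the same user-defined function runs on every vertex;
            # returning True is the vertex's vote to halt
            if compute(v, superstep, values, inbox[v], send, graph[v]):
                active.discard(v)
        inbox = outbox               # global synchronization point
        # an incoming message reactivates a halted vertex
        active |= {v for v, msgs in inbox.items() if msgs}
        superstep += 1
    return values

# Example user function: propagate the maximum value through the graph.
def max_compute(v, superstep, values, messages, send, neighbors):
    old = values[v]
    values[v] = max([old] + messages)
    if superstep == 0 or values[v] > old:
        for n in neighbors:
            send(n, values[v])
        return False   # stay active
    return True        # vote to halt
```

On a small graph such as `run_pregel({1: [2], 2: [1, 3], 3: [2]}, {1: 3, 2: 6, 3: 1}, max_compute)`, every vertex converges to the maximum value 6, and the loop ends once all vertices have voted to halt and no messages are in transit.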
4 Strong Points
S1 Fault tolerance is achieved through checkpoints, in which the state of the workers'
partitions is saved to persistent storage. Upon a machine failure during computation,
the worker machines reload their partition states from the most recent checkpoint.
S2 Combiners are an optimization that reduces network traffic and can be enabled by
the user. With this option, several messages destined for the same vertex can be
combined into a single message, reducing the overhead.
S3 Aggregators are a mechanism for global communication and monitoring. They have
several uses, such as computing global statistics, coordinating execution globally, or
supporting more advanced schemes.
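To make these two mechanisms concrete, here is a minimal sketch in Python (hypothetical function names, not Pregel's actual API): a combiner folds all messages bound for one vertex into a single message before delivery, and an aggregator reduces one contribution per vertex into a single global value visible in the next superstep.

```python
def combine_sum(messages):
    """Combiner sketch: a PageRank-style vertex only needs the sum of
    its incoming contributions, so N messages collapse into one."""
    return [sum(messages)] if messages else []

def aggregate_min(contributions):
    """Aggregator sketch: reduce one contribution per vertex into a
    single global value, e.g. the smallest value in the graph."""
    return min(contributions)

# Applying the combiner to an outbox keyed by destination vertex:
outbox = {"v1": [2, 3, 5], "v2": [4]}
combined = {dst: combine_sum(msgs) for dst, msgs in outbox.items()}
assert combined == {"v1": [10], "v2": [4]}
```

Note that the paper requires combiners to be commutative and associative, since the system gives no guarantee about the order in which messages are combined.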
5 Weak Points
W1 The user must extend Pregel considerably to tailor it to their needs. More
precisely, the user has to write code to enable combiners and to customize
aggregators. Additionally, the user is responsible for resolving conflicting
topology-mutation requests by defining handlers, which increases the complexity of
the system.
W2 No failure-detection mechanism is mentioned for the master, which makes it a single
point of failure.
W3 The evaluation presented in the paper is very limited and offers little explanation.
There is no clear comparison with other systems; an experimental comparison with
MapReduce would have been interesting. There is also no experiment evaluating the
fault tolerance of the system.
References
[1] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and
G. Czajkowski, "Pregel: a system for large-scale graph processing," in Proceedings of
the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD '10,
(New York, NY, USA), pp. 135–146, ACM, 2010.