Apache Giraph

A Distributed Graph-Processing Library
Ahmet Emre Aladağ - AGMLab
26.08.2013

What is Giraph?
● Library for large-scale graph processing.
● Runs on Apache Hadoop as map-only jobs.
● Follows the Bulk Synchronous Parallel (BSP) model.
[Figure: a vertex receives incoming messages, runs its vertex computation, and sends outgoing messages; example values shown are 0.2, 0.53, 0.32, 0.16, 0.12, 0.34.]

Uses
● PageRank-variant iterative algorithms
● Graph clustering
○ Label propagation
○ Max Clique
○ Triangle Closure
○ Finding related people, groups, interests.
● Shortest paths
○ Single-source, s-t, all-pairs
● Finding connected components

Alternatives
● MapReduce jobs on Hadoop
○ Poor fit for iterative graph algorithms: every iteration is a separate job with its own overhead.
● Google Pregel
○ Requires its own infrastructure
○ Not publicly available
○ Master is a single point of failure.
● Message Passing Interface (MPI)
○ Not fault-tolerant
○ Too generic

How Giraph differs
● Runs on an existing Hadoop cluster; no special infrastructure needed.
● Easy deployment with Amazon EMR
● Dynamic resource management
● Graph-oriented API
● Open source
● Fault tolerant: no single point of failure except the Hadoop NameNode and JobTracker
● Jython support

Mechanism
Input → InputFormat/Reader → Computation → OutputFormat/Writer → Output
● Input sources: Accumulo, HBase, HCatalog, HDFS, Hive, Neo4j, etc.
● Data formats: adjacency matrix, id-value pairs, JSON
● Output targets: Accumulo, HBase, HCatalog, HDFS, Hive, Neo4j, GraphViz, etc.

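InputFormat
● VertexInputFormat (one "id;value" pair per line)
1;3.4
2;6.1
3;2.7
● EdgeInputFormat (one "source;target" pair per line)
1;2
2;3
1;3
[Figure: the resulting graph, with vertices 1, 2, 3 holding the values 3.4, 6.1, 2.7.]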
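The two line formats above can be read with a few lines of code. The following is a minimal Python sketch, not Giraph's InputFormat API: the function names are hypothetical, and it assumes that vertex lines are "id;value" and edge lines are "source;target" with directed edges.

```python
def parse_vertices(lines):
    """Parse 'id;value' lines into a {vertex_id: value} dict."""
    vertices = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        vertex_id, value = line.split(";")
        vertices[int(vertex_id)] = float(value)
    return vertices


def parse_edges(lines, vertices):
    """Parse 'source;target' lines into an adjacency list.

    Every known vertex gets an entry, even if it has no outgoing edges.
    Edges are treated as directed (an assumption of this sketch).
    """
    adjacency = {vertex_id: [] for vertex_id in vertices}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        source, target = (int(part) for part in line.split(";"))
        adjacency.setdefault(source, []).append(target)
        adjacency.setdefault(target, [])
    return adjacency


vertices = parse_vertices(["1;3.4", "2;6.1", "3;2.7"])
adjacency = parse_edges(["1;2", "2;3", "1;3"], vertices)
```

Keeping vertex and edge parsing separate mirrors the split between the two input formats: each can come from a different file or source.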

Computation
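● Computation proceeds in supersteps separated by synchronization barriers.
● In each superstep, a vertex receives the messages sent to it in the previous superstep and sends messages to its neighbors.
● The vertex updates its value.
● The vertex votes to halt; an incoming message wakes it up again.
[Figure: Single-Source Shortest Path example]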

Shortest-Path Computation Code
Note: old API
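The code listing from this slide is not preserved in the transcript. As a stand-in, here is a minimal Python sketch of the vertex-centric single-source shortest path idea. It is a plain simulation of supersteps, messages, and vote-to-halt, not Giraph's API (old or new); the function name, the edge-weight representation, and the in-memory message boxes are all assumptions of the sketch.

```python
import math


def single_source_shortest_paths(edges, source):
    """Simulate BSP-style shortest paths.

    edges: {vertex: [(neighbor, weight), ...]}; every vertex must be a key.
    Returns {vertex: distance from source}.
    """
    distance = {vertex: math.inf for vertex in edges}
    inbox = {vertex: [] for vertex in edges}
    active = set(edges)  # every vertex is active in superstep 0
    superstep = 0

    while active:
        outbox = {vertex: [] for vertex in edges}
        for vertex in active:
            # Smallest distance offered by any incoming message.
            candidate = min(inbox[vertex], default=math.inf)
            if superstep == 0 and vertex == source:
                candidate = 0.0
            if candidate < distance[vertex]:
                distance[vertex] = candidate
                for neighbor, weight in edges[vertex]:
                    outbox[neighbor].append(candidate + weight)
            # The vertex now votes to halt (implicit: it is only
            # re-activated below if a message arrives for it).

        # Superstep barrier: messages become visible in the next superstep.
        inbox = outbox
        active = {vertex for vertex, messages in inbox.items() if messages}
        superstep += 1

    return distance


graph = {1: [(2, 1.0), (3, 4.0)], 2: [(3, 2.0)], 3: []}
distances = single_source_shortest_paths(graph, source=1)
```

The loop ends when no vertex has pending messages, which corresponds to all vertices having voted to halt.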

Aggregators
● Shared variables among the workers.
● Each vertex computation can contribute a value to an aggregator, which combines contributions with an operation such as addition or multiplication.
● Examples:
○ Holding the min/max value among all vertices
○ Holding the sum of the vertex values
○ Holding the average of the vertex values
○ Holding the sum of mean squared errors and the standard deviation
[Figure: computation at iteration k; vertices 1, 2, 3 contribute 0.2, 0.6, 0.45 to an aggregator whose sum is 1.25.]
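To make the idea concrete, here is a small Python sketch of sum, min, and max aggregators fed by three vertex values. The `Aggregator` class is hypothetical and only illustrates the concept; it is not Giraph's aggregator interface.

```python
class Aggregator:
    """A shared variable that combines contributed values (illustrative only)."""

    def __init__(self, initial, combine):
        self.value = initial
        self.combine = combine  # function (current, contribution) -> new value

    def aggregate(self, contribution):
        self.value = self.combine(self.value, contribution)


# Example vertex values for one iteration (hypothetical data).
vertex_values = {1: 0.2, 2: 0.6, 3: 0.45}

total = Aggregator(0.0, lambda current, x: current + x)
smallest = Aggregator(float("inf"), min)
largest = Aggregator(float("-inf"), max)

# Each vertex computation contributes its own value.
for value in vertex_values.values():
    total.aggregate(value)
    smallest.aggregate(value)
    largest.aggregate(value)
```

Because the combine operations here are commutative and associative, the order in which vertices contribute does not change the final value.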

MasterCompute Class
● The master's compute() always runs before the workers' compute() in each superstep (like a pre-superstep hook).
○ In the vertex compute(): aggregate vertex values, e.g. the sum of values
○ In MasterCompute: average = sum / N
● Aggregators are registered here.
● The master can set aggregator values.
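The ordering described above can be sketched as a tiny simulation. This is plain Python under simplifying assumptions, not the MasterCompute API: the aggregators live in an ordinary dict, and the function names are hypothetical.

```python
def master_compute(superstep, aggregators, num_vertices):
    """Runs once per superstep, before any vertex computation."""
    if superstep > 0:
        # The sum was filled in by the vertices during the previous superstep.
        aggregators["average"] = aggregators["sum"] / num_vertices
    aggregators["sum"] = 0.0  # reset for the coming superstep


def vertex_compute(value, aggregators):
    """Each vertex adds its value to the shared sum."""
    aggregators["sum"] += value


def run(values, num_supersteps):
    aggregators = {"sum": 0.0, "average": None}
    averages_seen = []
    for superstep in range(num_supersteps):
        master_compute(superstep, aggregators, len(values))
        averages_seen.append(aggregators["average"])
        for value in values.values():
            vertex_compute(value, aggregators)
    return averages_seen


history = run({1: 0.2, 2: 0.6, 3: 0.45}, num_supersteps=3)
```

In superstep 0 there is no average yet, because nothing has been aggregated; from superstep 1 onward the master derives it from the previous superstep's sum.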

Worker Context
● Allows for the execution of user code on a
per-worker basis.
● There's one WorkerContext per worker.
● Provides hooks that run before and after each superstep and before and after the whole application.
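The order in which such per-worker hooks fire can be sketched as follows. This is an illustrative Python model only: the class, method names, and signatures are assumptions made for the sketch and do not reproduce Giraph's WorkerContext class.

```python
class WorkerContext:
    """Per-worker hooks (illustrative names and signatures)."""

    def pre_application(self):
        pass

    def pre_superstep(self, superstep):
        pass

    def post_superstep(self, superstep):
        pass

    def post_application(self):
        pass


class RecordingContext(WorkerContext):
    """Records each hook call so the call order can be inspected."""

    def __init__(self):
        self.events = []

    def pre_application(self):
        self.events.append("pre_application")

    def pre_superstep(self, superstep):
        self.events.append(f"pre_superstep {superstep}")

    def post_superstep(self, superstep):
        self.events.append(f"post_superstep {superstep}")

    def post_application(self):
        self.events.append("post_application")


def run_worker(context, num_supersteps):
    context.pre_application()
    for superstep in range(num_supersteps):
        context.pre_superstep(superstep)
        # The vertex computations of this worker's partitions would run here.
        context.post_superstep(superstep)
    context.post_application()


context = RecordingContext()
run_worker(context, num_supersteps=2)
```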

Flexible Edge/Vertex Input
● Read edges and vertices from different sources.
● Combine multiple input sources.

Parallel Computing
● More map tasks (workers) = more parallelism
● To overcome the slowest-worker problem, multithreading is applied to input, computation, and output.
● Linear speedup in CPU-bound applications such as k-means clustering, due to multithreading
● Take a set of entire machines and use multithreading to maximize resource utilization.

Memory Optimization
● Vertices and edges are stored as serialized
byte arrays.
● Uses FastUtil-based collections for Java primitive types.
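The byte-array idea can be illustrated with a short Python sketch. It packs a vertex's edges into one contiguous buffer and decodes them lazily on iteration. The record layout (64-bit target id plus 64-bit float weight) is an assumption chosen for the example, not Giraph's actual serialization format.

```python
import struct

# One edge record: target vertex id (int64) and weight (float64), 16 bytes.
EDGE_RECORD = struct.Struct("<qd")


def serialize_edges(edges):
    """Pack [(target_id, weight), ...] into a single bytes object."""
    return b"".join(EDGE_RECORD.pack(target, weight) for target, weight in edges)


def iterate_edges(buffer):
    """Decode edges one at a time without building per-edge objects up front."""
    for target, weight in EDGE_RECORD.iter_unpack(buffer):
        yield target, weight


edge_list = [(2, 1.5), (3, 0.25)]
packed = serialize_edges(edge_list)
decoded = list(iterate_edges(packed))
```

One buffer per vertex replaces many small objects, which is the kind of saving the bullets above describe.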

Sharded Aggregators
● Each aggregator is randomly assigned to one of the workers.
● The assigned worker is in charge of gathering the values of its aggregators
from all workers, performing the aggregation, and distributing the final values
to other workers.
● Aggregation responsibilities are balanced across all workers rather than
bottlenecked by the master.
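The gather, aggregate, and distribute steps above can be sketched in Python. Note one deliberate substitution: the slide says aggregators are assigned randomly, while this sketch uses a CRC32 hash of the aggregator name so that the assignment is reproducible. All function names are hypothetical.

```python
import zlib


def owner_of(aggregator_name, num_workers):
    """Pick the worker responsible for an aggregator (hash-based here)."""
    return zlib.crc32(aggregator_name.encode("utf-8")) % num_workers


def sharded_aggregate(partials, combiners):
    """partials: one {aggregator_name: partial_value} dict per worker.

    combiners: {aggregator_name: function(a, b) -> combined value}.
    Returns the final aggregator values as seen by each worker.
    """
    num_workers = len(partials)

    # Step 1: every worker sends each partial value to that aggregator's owner.
    gathered = [{} for _ in range(num_workers)]
    for worker_partials in partials:
        for name, value in worker_partials.items():
            owner = owner_of(name, num_workers)
            gathered[owner].setdefault(name, []).append(value)

    # Step 2: each owner combines the partial values of its own aggregators.
    final = {}
    for owner_inbox in gathered:
        for name, values in owner_inbox.items():
            result = values[0]
            for value in values[1:]:
                result = combiners[name](result, value)
            final[name] = result

    # Step 3: owners distribute the final values to all workers.
    return [dict(final) for _ in range(num_workers)]


partial_values = [
    {"sum": 1.0, "max": 3.0},
    {"sum": 2.0, "max": 7.0},
    {"sum": 4.0, "max": 5.0},
]
views = sharded_aggregate(
    partial_values,
    {"sum": lambda a, b: a + b, "max": max},
)
```

Since each aggregator has exactly one owner, the combining work is spread over the workers instead of being done entirely by the master.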

Performance
● PageRank on 1 trillion edges with 200 commodity
machines: 4 minutes/iteration.
● K-means on 1 billion input vectors × 100 features into
10,000 centroids: 10 minutes.
● Linear Scalability

Currently
● Version 1.0, on the way to 1.1
● Changing rapidly: backwards-incompatible
changes
● Documentation not mature yet.
● More algorithms to be contributed.
● More data sources to be ported.
● http://giraph.apache.org for more info

References
Giraph: Large-scale graph processing infrastructure on Hadoop, 2011
Scaling Apache Giraph to a trillion edges, Avery Ching, Facebook, 2013
Scaling Apache Giraph, Nitay Joffe, Facebook, 2013.
Giraph: http://giraph.apache.org
