Fast, Scalable Graph Processing with Apache Giraph on YARN


Hello, I'm Eli Reisman!

Eli is...
•  Apache Giraph Committer and PMC Member
•  Apache Tajo Committer
•  Wrote initial port of Giraph to YARN
•  Collaborating with fellow Giraph committers on the
Giraph in Action book for Manning Publications

Eli is...
•  Only able to do all this with the support of:

Eli is a software engineer at Etsy.

Etsy enables non-technical folks to sell
handmade and vintage stuff:
We have a great blog called Code As Craft:

...but enough about me, let's talk Giraph!

Key Topics
What is Apache Giraph?
Why do I need it?
Giraph + MapReduce
Giraph + YARN
Giraph Roadmap

Giraph is a framework for performing offline
batch processing of semi-structured graph
data on a massive scale.
Giraph is loosely based upon Google's Pregel
graph processing framework.

Giraph performs iterative calculations on top of an
existing Hadoop cluster.

Giraph uses Apache ZooKeeper to enforce atomic
barrier waits and perform leader election.
[Figure: two workers report "Done!" while a third is "...Still working..." at the barrier]
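
A barrier wait can be illustrated with a small in-process sketch. This is only an analogy: it uses Python's `threading.Barrier` instead of ZooKeeper, and the names `run_supersteps` and `worker` are hypothetical, not part of Giraph. No worker starts the next superstep until every worker has finished the current one.

```python
import threading


def run_supersteps(num_workers=3, num_supersteps=2):
    """Run workers that all wait at a barrier after each superstep.

    Returns a log of (superstep, worker_id) pairs in completion order.
    """
    barrier = threading.Barrier(num_workers)
    log = []
    lock = threading.Lock()

    def worker(worker_id):
        for step in range(num_supersteps):
            with lock:
                log.append((step, worker_id))  # this worker is "Done!"
            barrier.wait()  # block until every worker has reported

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return log


if __name__ == "__main__":
    for step, worker_id in run_supersteps():
        print(f"superstep {step}: worker {worker_id} done")
```

The order of workers within a superstep can vary between runs, but every entry for superstep 0 is logged before any entry for superstep 1.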

Giraph benefits from a vibrant Apache community, and is
under active development:

Why do I need it?
Giraph makes graph algorithms easy to reason about
and implement by following the Bulk Synchronous
Parallel (BSP) programming model.
In BSP, all algorithms are implemented from the point
of view of a single vertex in the input graph
performing a single iteration of the computation.
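
As a rough illustration of the vertex-centric view, here is a minimal single-process sketch in Python. It is not Giraph's API: `compute`, `run_bsp`, and the max-value propagation task are assumptions chosen only to keep the example small. Each vertex sees its own value and its incoming messages, and nothing else.

```python
def compute(value, messages):
    """One vertex, one superstep: adopt the largest value seen so far."""
    best = max([value] + messages)
    return best, best != value


def run_bsp(initial_values, out_edges, max_supersteps=30):
    """Run supersteps until no vertex changes its value."""
    values = dict(initial_values)
    inbox = {v: [] for v in values}
    for superstep in range(max_supersteps):
        outbox = {v: [] for v in values}
        any_active = False
        for v in values:
            values[v], changed = compute(values[v], inbox[v])
            # Every vertex sends in superstep 0; later only when it changed.
            if superstep == 0 or changed:
                any_active = True
                for target in out_edges[v]:
                    outbox[target].append(values[v])
        inbox = outbox  # barrier: messages become visible next superstep
        if not any_active:
            break
    return values


if __name__ == "__main__":
    edges = {"a": ["b"], "b": ["c"], "c": ["a"]}
    print(run_bsp({"a": 3, "b": 6, "c": 2}, edges))
```

On the three-vertex cycle in the usage example, every vertex ends up holding the largest initial value, 6.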

Why do I need it?
•  Giraph makes iterative data processing more
practical for Hadoop users.
•  Giraph can avoid costly disk and network
operations that are mandatory in MR.
•  No concept of message passing in MR.

Why do I need it?
Each cycle of an iterative calculation on
Hadoop means running a full MapReduce
job.

Let's use simple PageRank as a quick
example:
http://en.wikipedia.org/wiki/PageRank

1. All vertices start with same PageRank
[Figure: three-vertex graph, each vertex labeled 1.0]

2. Each vertex distributes an equal portion of
its PageRank to all neighbors:
[Figure: edges labeled 0.5, 0.5, 1, and 1]

3. Each vertex sums its incoming values, multiplies the sum
by a weight factor (0.85), and adds a small adjustment:
0.15 × 1/(# vertices in graph)
(0.5 × 0.85) + (0.15 / 3)
(1.5 × 0.85) + (0.15 / 3)
(1 × 0.85) + (0.15 / 3)

4. This value becomes the vertex's PageRank
for the next iteration
[Figure: vertices labeled .43, .21, and .64]

5. Repeat until convergence:
(change in PR per-iteration < epsilon)
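
The five steps above can be sketched in a few lines of Python. The graph topology (`A` points to `B` and `C`, `B` to `C`, `C` to `A`) is an assumption chosen to be consistent with the edge shares 0.5, 0.5, 1, 1 in step 2; the function names and the default `epsilon` are hypothetical.

```python
DAMPING = 0.85  # the "weight factor" from step 3


def pagerank_step(ranks, out_edges):
    """Steps 2-4: distribute shares, sum them, apply weight and adjustment."""
    n = len(ranks)
    incoming = {v: 0.0 for v in ranks}
    for v, targets in out_edges.items():
        share = ranks[v] / len(targets)  # equal portion per neighbor
        for target in targets:
            incoming[target] += share
    return {v: DAMPING * incoming[v] + (1 - DAMPING) / n for v in ranks}


def pagerank(out_edges, epsilon=1e-6, max_iterations=100):
    """Step 1 and step 5: start at 1.0 and repeat until convergence."""
    ranks = {v: 1.0 for v in out_edges}
    for _ in range(max_iterations):
        new_ranks = pagerank_step(ranks, out_edges)
        delta = max(abs(new_ranks[v] - ranks[v]) for v in ranks)
        ranks = new_ranks
        if delta < epsilon:
            break
    return ranks


if __name__ == "__main__":
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(pagerank_step({v: 1.0 for v in graph}, graph))
    print(pagerank(graph))
```

With this assumed topology, the first call to `pagerank_step` evaluates exactly the three expressions from step 3: 0.475 for `B`, 1.325 for `C`, and 0.9 for `A`.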

Vertices with a higher in-degree tend to converge to a
higher PageRank

Put another way:

PageRank on MapReduce
1. Load complete input graph from disk as
[K= Vertex ID, V = out-edges and PR]
[Diagram: Map -> Sort/Shuffle -> Reduce]

2. Map: emit every input record (the full graph state),
plus one [K = edgeTarget, V = share of PR] record per out-edge

3. Sort and Shuffle this entire mess!

4. Sum incoming PR shares for each vertex,
update PR values in graph state records

5. Emit full graph state to disk...

6. ...and start over!

•  Awkward to reason about
•  I/O bound despite simple core business logic
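
To make the cost concrete, here is a single-process Python simulation of one such iteration. It is a sketch, not Hadoop code: `map_phase`, `shuffle`, `reduce_phase`, and the `"STATE"`/`"SHARE"` tags are hypothetical names, and in-memory lists stand in for the disk and network traffic a real job would incur.

```python
from collections import defaultdict

DAMPING = 0.85


def map_phase(graph):
    """graph maps vertex -> (out_edges, rank). Emits (key, value) records."""
    records = []
    for vertex, (out_edges, rank) in graph.items():
        # The full graph state is re-emitted on every iteration.
        records.append((vertex, ("STATE", out_edges)))
        for target in out_edges:
            records.append((target, ("SHARE", rank / len(out_edges))))
    return records


def shuffle(records):
    """Group every record by key, as the sort/shuffle stage would."""
    grouped = defaultdict(list)
    for key, value in records:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum incoming shares per vertex and rebuild the graph state."""
    n = len(grouped)
    graph = {}
    for vertex, values in grouped.items():
        out_edges = next(p for tag, p in values if tag == "STATE")
        total = sum(p for tag, p in values if tag == "SHARE")
        graph[vertex] = (out_edges, DAMPING * total + (1 - DAMPING) / n)
    return graph


def run_iteration(graph):
    return reduce_phase(shuffle(map_phase(graph)))


if __name__ == "__main__":
    graph = {"A": (["B", "C"], 1.0), "B": (["C"], 1.0), "C": (["A"], 1.0)}
    print(run_iteration(graph))
```

Even for this three-vertex graph, the map phase emits seven records per iteration (three state records plus four shares), and all of them pass through the shuffle.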

PageRank on Giraph
1. Hadoop Mappers are "hijacked" to host
Giraph master and worker tasks.

PageRank on Giraph
2. Input graph is loaded once, maintaining
code-data locality when possible.

PageRank on Giraph
3. All iterations are performed on data in memory,
optionally spilled to disk. Disk access is
linear/scan-based.

PageRank on Giraph
4. Output is written from the Mappers hosting
the calculation, and the job run ends.
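
For contrast with the MapReduce version, here is the same calculation in a vertex-centric style. This is a plain-Python sketch, not Giraph's Java API: the `Vertex` class, its `compute` signature, and `run` are hypothetical. The graph is built once, stays in memory across supersteps, and only messages move between vertices.

```python
DAMPING = 0.85


class Vertex:
    def __init__(self, vertex_id, out_edges):
        self.id = vertex_id
        self.out_edges = out_edges
        self.rank = 1.0

    def compute(self, messages, num_vertices, superstep):
        """One superstep for one vertex; returns (target, share) messages."""
        if superstep > 0:
            self.rank = DAMPING * sum(messages) + (1 - DAMPING) / num_vertices
        share = self.rank / len(self.out_edges)
        return [(target, share) for target in self.out_edges]


def run(out_edges, supersteps):
    # The graph is loaded once and kept in memory for the whole run.
    vertices = {v: Vertex(v, edges) for v, edges in out_edges.items()}
    inbox = {v: [] for v in vertices}
    for superstep in range(supersteps):
        outbox = {v: [] for v in vertices}
        for vertex in vertices.values():
            sent = vertex.compute(inbox[vertex.id], len(vertices), superstep)
            for target, share in sent:
                outbox[target].append(share)
        inbox = outbox  # barrier between supersteps
    return {v.id: v.rank for v in vertices.values()}


if __name__ == "__main__":
    print(run({"A": ["B", "C"], "B": ["C"], "C": ["A"]}, supersteps=2))
```

Running two supersteps on the assumed three-vertex graph yields 0.9, 0.475, and 1.325, the same values as one pass of the MapReduce formulation, without re-emitting the graph state.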

This is all well and good, but must we
manipulate Hadoop this way?

Giraph + MapReduce
•  Heap and other resources are set once, globally, for all
Mappers in the computation.
•  No control of which cluster nodes host which tasks.
•  No control over how Mappers are scheduled.
•  The Mapper and Reducer slot abstraction is at best
meaningless for Giraph, and at worst an artificial limit.

YARN
•  YARN (Yet Another Resource Negotiator) is Hadoop's
next-gen job management platform.
•  Powers MapReduce v2, but is a general purpose
framework that is not tied to the MapReduce paradigm.
•  Offers fine-grained control over each task's resource
allocations and host placement for clients that need it.

YARN Architecture

Giraph + YARN
It's a natural fit!

Giraph + YARN
•  Giraph has maintained compatibility with Hadoop since the
0.1 release by executing via the MapReduce interface.
•  Giraph has featured a "pure YARN" build profile since the
1.0 release. It supports Hadoop-2.0.3 and trunk.
*Patches to add 2.0.4 and 2.0.5 support are in review :)
•  Giraph's YARN component is easy to extend or use as
a template to port other projects!

Giraph + YARN: Roadmap
•  YARN Application Master allows for more natural and
stable bootstrapping of Giraph jobs.
•  ZooKeeper management can find a natural home in the
Application Master.
•  Giraph on YARN can stop borrowing from Hadoop and
have its own web interface.

Giraph + YARN: Roadmap
•  Variable per-task resource allocation opens up the
possibility of Supertasks to manage graph supernodes.
•  Ability to spawn or retire tasks per-iteration enables
in-flight reassignment of data partitions.
•  AppMaster-managed utility tasks become possible, such as
dedicated sub-aggregators for tree-like aggregation, or data
pre-samplers.
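
Tree-like aggregation in general can be sketched as follows. This illustrates the generic technique only, not a Giraph design: `tree_aggregate` and `fan_in` are hypothetical names, and each group combined in a round stands in for the work one sub-aggregator task might do.

```python
from functools import reduce
from operator import add


def tree_aggregate(values, combine, fan_in=2):
    """Combine values in rounds, fan_in at a time.

    Returns (result, rounds), where rounds is the depth of the tree.
    """
    level = list(values)
    if not level:
        raise ValueError("nothing to aggregate")
    rounds = 0
    while len(level) > 1:
        # Each slice is the input of one sub-aggregator in this round.
        level = [
            reduce(combine, level[i:i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
        rounds += 1
    return level[0], rounds


if __name__ == "__main__":
    print(tree_aggregate(range(1, 9), add, fan_in=2))
```

With eight partial values and a fan-in of 2, no single aggregator combines more than two inputs, and the result is ready after three rounds instead of one aggregator handling all eight.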

Giraph New Developments
•  Decoupling of logic and graph data means tasks host
computations that are pluggable per-iteration.
•  Support for Giraph job scripting, starting with Jython.
More to follow...
•  A new website, fresh docs, an upcoming Manning book, and
a large, active community mean Giraph has never been
easier to use or contribute to!

Great! Where can I learn more?
http://giraph.apache.org
Mailing List:
user@giraph.apache.org
