Dynamic graph computation and iterative processing with Apache Giraph

Dynamic graph / iterative
computation on Apache Giraph
6/3/2014
Avery Ching
Hadoop Summit

Apache Giraph
• Inspired by Google’s Pregel but runs on Hadoop
• “Think like a vertex”
• Maximum value vertex example
Processor 1
Processor 2
Time
5
5
5
5
2
5
5
5
2
1
5
5
2
1

Giraph on Hadoop / Yarn
MapReduce YARN
Giraph
Hadoop
0.20.x
Hadoop
0.20.203
Hadoop
2.0.x
Hadoop
1.x

Send page rank
value to
neighbors for
30 iterations
Calculate
updated
page rank
value from
neighbors
Page rank in Giraph
!
!
public class PageRankComputation extends BasicComputation<LongWritable,
DoubleWritable, FloatWritable, DoubleWritable> {
public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
Iterable<DoubleWritable> messages) {
if (getSuperstep() >= 1) {
double sum = 0;
for (DoubleWritable message : messages) {
sum += message.get();
}
vertex.getValue().set(DoubleWritable((0.15d / getTotalNumVertices()) + 0.85d * sum);
}
if (getSuperstep() < 30) {
sendMsgToAllEdges(new DoubleWritable(getVertexValue().get() / getNumOutEdges()));
} else {
voteToHalt();
}
}
}

Apache Giraph data ﬂow
Loading the graph
Input 
format
Split0
Split1
Split2
Split3
Master
Load/
Send
Graph
Worker0
Load/
Send
Graph
Worker1
Storing the graph
Worker0Worker1
Output 
format
Part0
Part1
Part2
Part3
Part0
Part1
Part2
Part3
Compute / Iterate
Compute/
Send
Messages
Compute/
Send
Messages
In-memory
graph
Part0
Part1
Part2
Part3Master
Worker0Worker1
Sendstats/iterate!

Pipelined computation
Master “computes”
• Sets computation, in/out message, combiner for next super step
•Can set/modify aggregator values
Time
Worker 0
Worker 1
Master
phase 1a
phase 1a
phase 1b
phase 1b
phase 2
phase 2
phase 3
phase 3

Afﬁnity propagation
Frey and Dueck “Clustering by passing messages between data points”
Science 2007
Organically discover exemplars based on similarity
Initialization Intermediate Convergence

Responsibility r(i,k)
• How well suited is k to be an exemplar for i?
Availability a(i,k)
• How appropriate for point i to choose point k as an exemplar given all
of i’s responsibilities?
Update exemplars
• Based on known responsibilities/availabilities, which vertex should be
my exemplar?
!
* Dampen responsibility, availability
3 stages

Responsibility
Every vertex i with an edge to k maintains responsibility of k for i
Sends responsibility to k in ResponsibilityMessage (senderid,
responsibility(i,k))
C
A
D
B
r(c,a)
r(d,a)
r(b,d)
r(b,a)

Availability
Vertex sums positive messages
Sends availability to i in AvailabilityMessage (senderid, availability(i,k))
C
A
D
B
a(c,a)
a(d,a)
a(b,d)
a(b,a)

Update exemplars
Dampens availabilities and scans edges to ﬁnd exemplar k
Updates self-exemplar
C
A
D
Bupdate update
update update
exemplar=a exemplar=d
exemplar=a exemplar=a

Master logic
calculate
responsibility
calculate
availability
update
exemplars
initial
state
halt
if (exemplars agree they are exemplars
&& changed exemplars < ∆) then halt,
otherwise continue

Faster than Hive?
Application Graph Size CPU Time Speedup Elapsed Time Speedup
Page rank 
(single iteration)
400B+ edges 26x 120x
Friends of
friends score  71B+ edges 12.5x 48x

Apache Giraph scalability
Scalability of workers
(200B edges)
Seconds
0
125
250
375
500
# of Workers
50 100 150 200 250 300
Giraph Ideal
Scalability of edges (50
workers)
Seconds
0
125
250
375
500
# of Edges
1E+09 7E+10 1E+11 2E+11
Giraph Ideal

Trillion social edges page rank
Minutesperiteration
0
1
2
3
4
6/30/2013 6/2/2014
Improvements
• GIRAPH-840 - Netty 4 upgrade
• G1 Collector / tuning

Why balanced partitioning
Random partitioning == good balance
BUT ignores entity afﬁnity
0 1
2
3
4 5
6
7
8 9
10
11

Balanced partitioning application
Results from one service:
Cache hit rate grew from 70% to 85%, bandwidth cut in 1/2
!
!
0
2
3
5
6 9
11
1 4 7
8
10

Balanced label propagation results
* Loosely based on Ugander and Backstrom. Balanced label
propagation for partitioning massive graphs, WSDM '13

Partitioning experimentsSecondsperiteration
0
40
80
120
160
Random 47% Local Edges
345B edge page rank
Improvements
• 56% faster!
• Native vertex remapping

Avoiding out-of-core
Example: Mutual friends calculation between
neighbors
1. Send your friends a list of your friends
2. Intersect with your friend list
!
1.23B (as of 1/2014)
200+ average friends (2011 S1)
8-byte ids (longs)
= 394 TB / 100 GB machines
3,940 machines (not including the graph)
A B
C
D
E
A:{D}
D:{A,E}
E:{D}
B:{}
C:{D}
D:{C}
A:{C}
C:{A,E}
E:{C}
!
C:{D}
D:{C}
!
!
E:{}

Superstep splitting
Subsets of sources/destinations edges per superstep
* Currently manual - future work automatic!
A
Sources: A (on), B (off)
Destinations: A (on), B (off)
B
B
B
A
A
A
Sources: A (on), B (off)
Destinations: A (off), B (on)
B
B
B
A
A
A
Sources: A (off), B (on)
Destinations: A (on), B (off)
B
B
B
A
A
A
Sources: A (off), B (on)
Destinations: A (off), B (on)
B
B
B
A
A

Giraph in production
Over 1.5 years in production
Over 100 jobs processed a week
30+ applications in our internal application repository
Sample production job - 700B+ edge

Giraph related projects
Graft: The distributed
Giraph debugger

Giraph roadmap
2/12 - 0.1 6/14 - 1.15/13 - 1.0

Future work
Automatic checkpointing make scheduling more fair
Investigate alternative computing models
•Giraph++ (IBM research)
•Giraphx (University at Buffalo, SUNY)
Performance
Lower the barrier to entry
Applications

Our team
!
Maja
Kabiljo
Sergey
Edunov
Pavan
Athivarapu
Avery
Ching
Sambavi
Muthukrishnan

Dynamic graph computation and iterative processing with Apache Giraph

Dynamic graph computation and iterative processing with Apache Giraph

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

En vedette

En vedette (20)

Similaire à Dynamic graph computation and iterative processing with Apache Giraph

Similaire à Dynamic graph computation and iterative processing with Apache Giraph (20)

Plus de DataWorks Summit

Plus de DataWorks Summit (20)

Dernier

Dernier (20)

Dynamic graph computation and iterative processing with Apache Giraph