dotScale 2016 presentation
Writing distributed graph applications is inherently hard. In this talk, Vasia gives an overview of high-level programming models and platforms for distributed graph processing. She exposes and discusses common misconceptions, shares lessons learnt, and suggests best practices.
7. INTERMEDIATE DATA: THE OFTEN DISREGARDED EVIL
▸ Naive Who(m) to Follow:
▸ compute a friends-of-friends list per user
▸ exclude existing friends
▸ rank by common connections
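The three steps above can be sketched on a single machine as follows. This is a minimal illustration, not the speaker's implementation: the function name `who_to_follow`, the `top_k` parameter, and the sample `friends` data are all hypothetical. The inner loop visits every two-hop path, which is presumably the intermediate data the slide title warns about.

```python
from collections import Counter

def who_to_follow(friends, top_k=3):
    """Naive recommendations; `friends` maps each user to a set of friends."""
    recommendations = {}
    for user, direct in friends.items():
        # Step 1: expand the friends-of-friends list (one visit per two-hop path).
        candidates = Counter()
        for friend in direct:
            for fof in friends.get(friend, ()):
                # Step 2: exclude the user and anyone already followed.
                if fof != user and fof not in direct:
                    candidates[fof] += 1
        # Step 3: rank by number of common connections (name breaks ties).
        ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
        recommendations[user] = [name for name, _ in ranked[:top_k]]
    return recommendations

# Hypothetical sample data, assumed symmetric for simplicity.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan", "eve"},
    "dan": {"bob", "cat"},
    "eve": {"cat"},
}
print(who_to_follow(friends)["ann"])  # ['dan', 'eve']
```

In a distributed setting the candidate counter would be materialized and shuffled across workers, so its size (roughly the sum of squared degrees) rather than the input size tends to dominate the cost.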
18. PREGEL EXAMPLE: PAGERANK
void compute(messages):
    // sum up received messages
    sum = 0.0
    for (m <- messages) do
        sum = sum + m
    end for
    // update vertex rank
    setValue(0.15/numVertices() + 0.85*sum)
    // distribute rank to neighbors
    for (edge <- getOutEdges()) do
        sendMessageTo(edge.target(), getValue()/numEdges)
    end for
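To make the superstep structure concrete, here is a single-machine simulation of the vertex program above. It is a sketch under assumptions that are not on the slide: ranks start at 1/n, superstep 0 only sends the initial rank without updating it, every edge target also appears as a key of `out_edges`, and the name `pregel_pagerank` is made up for this example.

```python
def pregel_pagerank(out_edges, supersteps=20, damping=0.85):
    """Synchronous vertex-centric PageRank; out_edges maps vertex -> list of targets."""
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}   # assumed initial value
    inbox = {v: [] for v in out_edges}
    for step in range(supersteps):
        outbox = {v: [] for v in out_edges}
        for v, targets in out_edges.items():
            if step > 0:
                # Sum up received messages and update the vertex rank.
                rank[v] = (1 - damping) / n + damping * sum(inbox[v])
            # Distribute the rank evenly to the out-neighbors.
            for t in targets:
                outbox[t].append(rank[v] / len(targets))
        # Superstep barrier: messages become visible only in the next superstep.
        inbox = outbox
    return rank

ranks = pregel_pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
print(sorted(ranks, key=ranks.get, reverse=True))
```

Swapping `inbox` for `outbox` at the end of each pass is what stands in for the global synchronization barrier between supersteps.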
20. SIGNAL-COLLECT EXAMPLE: PAGERANK
void signal():
    // distribute rank to neighbors
    for (edge <- getOutEdges()) do
        sendMessageTo(edge.target(), getValue()/numEdges)
    end for

void collect(messages):
    // sum up received messages
    sum = 0.0
    for (m <- messages) do
        sum = sum + m
    end for
    // update vertex rank
    setValue(0.15/numVertices() + 0.85*sum)
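The same computation can be simulated with the two phases as separate functions, which is the main structural difference from the single `compute()` of Pregel. This is a rough single-machine sketch: the driver `signal_collect_pagerank`, the initial rank of 1/n, and the requirement that every edge target is also a key of `out_edges` are assumptions made for the example.

```python
def signal(rank, out_edges):
    """Signal phase: each vertex sends rank / out-degree along every out-edge."""
    inbox = {v: [] for v in out_edges}
    for v, targets in out_edges.items():
        for t in targets:
            inbox[t].append(rank[v] / len(targets))
    return inbox

def collect(inbox, num_vertices, damping=0.85):
    """Collect phase: each vertex sums its messages and updates its rank."""
    return {
        v: (1 - damping) / num_vertices + damping * sum(messages)
        for v, messages in inbox.items()
    }

def signal_collect_pagerank(out_edges, iterations=20):
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}   # assumed initial value
    for _ in range(iterations):
        # One iteration is a signal phase followed by a collect phase.
        rank = collect(signal(rank, out_edges), n)
    return rank

ranks = signal_collect_pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
print(max(ranks, key=ranks.get))  # c
```

Note that `signal` never sees incoming messages and `collect` never sees the out-edges, so each phase works with a narrower view of the vertex than `compute()` does.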
25. THINK LIKE A (SUB)GRAPH
[Figure: two graph diagrams, each with vertices labeled 1 to 5]
- compute() on the entire partition
- Information flows freely inside each partition
- Network communication between partitions, not vertices
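As an illustration of the three points above, the sketch below computes connected components by propagating the smallest vertex id. Inside a partition, labels spread to a local fixpoint without any communication; only labels on cut edges are exchanged between rounds. The algorithm choice, the function name `partition_centric_components`, and the sample graph are assumptions for this example and do not come from the slides.

```python
def partition_centric_components(edges, partition_of):
    """Return (labels, rounds): smallest vertex id per component, global rounds used."""
    neighbors = {v: set() for v in partition_of}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    label = {v: v for v in partition_of}
    rounds, changed = 0, True
    while changed:
        changed = False
        rounds += 1
        # compute() on the entire partition: information flows freely inside it.
        local_change = True
        while local_change:
            local_change = False
            for v in neighbors:
                for w in neighbors[v]:
                    if partition_of[v] == partition_of[w] and label[w] < label[v]:
                        label[v] = label[w]
                        local_change = True
        # Network communication: only cut edges carry messages.
        messages = {}
        for v in neighbors:
            for w in neighbors[v]:
                if partition_of[v] != partition_of[w] and label[v] < label[w]:
                    messages[w] = min(messages.get(w, label[v]), label[v])
        for w, m in messages.items():
            label[w] = m
            changed = True
    return label, rounds

# Hypothetical input: a path 1-2-3-4-5-6 plus the edge 7-8, split over two partitions.
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (7, 8)]
partition_of = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1}
labels, rounds = partition_centric_components(edges, partition_of)
print(labels, rounds)
```

On this input the labels settle in two global rounds, whereas a purely vertex-centric propagation would need a number of supersteps proportional to the length of the path.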
30. CAN WE HAVE IT ALL?
▸ Data pipeline integration: built on top of an efficient distributed processing engine
▸ Graph ETL: high-level API with abstractions and methods to transform graphs
▸ Familiar programming model: support popular programming abstractions
31. HELLO, GELLY! THE APACHE FLINK GRAPH API
▸ Java and Scala APIs: seamlessly integrate with Flink’s DataSet API
▸ Transformations, library of common algorithms
val graph = Graph.fromDataSet(edges, env)
val ranks = graph.run(new PageRank(0.85, 20))
▸ Iteration abstractions: Pregel, Signal-Collect, Gather-Sum-Apply, Partition-Centric*