This document discusses batch and stream graph processing with Apache Flink. It provides an overview of distributed graph processing and Flink's graph processing APIs: Gelly for batch graph processing and Gelly-Stream for continuous graph processing on data streams. It describes how Gelly and Gelly-Stream allow for processing large and dynamic graphs in a distributed fashion using Flink's dataflow engine.
Apache Flink & Graph Processing
1. Batch & Stream Graph Processing
with Apache Flink
Vasia Kalavri
vasia@apache.org
@vkalavri
Apache Flink Meetup London
October 5th, 2016
2. Graphs capture relationships between data items
connections, interactions, purchases,
dependencies, friendships, etc.
Recommenders
Social networks
Bioinformatics
Web search
8. NAIVE WHO(M)-TO-FOLLOW
▸ Naive Who(m) to Follow:
▸ compute a friends-of-friends list per user
▸ exclude existing friends
▸ rank by common connections
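The three steps above can be sketched in plain Java. This is an illustrative, single-machine version (names like `recommend` are mine, not from the talk): collect friends-of-friends, drop the user and their existing friends, and rank candidates by the number of shared connections.

```java
import java.util.*;
import java.util.stream.*;

public class WhoToFollow {
    // Naive recommendation: friends-of-friends, minus existing friends,
    // ranked by the number of common connections.
    static List<String> recommend(Map<String, Set<String>> friends, String user) {
        Map<String, Integer> common = new HashMap<>();
        for (String f : friends.getOrDefault(user, Set.of())) {
            for (String fof : friends.getOrDefault(f, Set.of())) {
                if (!fof.equals(user) && !friends.get(user).contains(fof)) {
                    common.merge(fof, 1, Integer::sum); // one more shared connection
                }
            }
        }
        return common.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Set<String>> g = Map.of(
                "alice", Set.of("bob", "carol"),
                "bob", Set.of("alice", "dave"),
                "carol", Set.of("alice", "dave", "eve"),
                "dave", Set.of("bob", "carol"),
                "eve", Set.of("carol"));
        // dave shares 2 connections with alice, eve shares 1
        System.out.println(recommend(g, "alice")); // [dave, eve]
    }
}
```

In a distributed setting this same logic becomes a join of the friends list with itself, which is exactly the kind of pipeline the rest of the talk builds with Flink.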
15. WHEN DO YOU NEED DISTRIBUTED GRAPH PROCESSING?
▸ When you do have really big graphs
▸ When the intermediate data is big
▸ When your data is already distributed
▸ When you want to build end-to-end graph pipelines
16. HOW DO WE EXPRESS A
DISTRIBUTED GRAPH
ANALYSIS TASK?
20. PAGERANK: THE WORD COUNT OF GRAPH PROCESSING

VertexID | Out-degree | Transition Probability
1 | 2 | 1/2
2 | 2 | 1/2
3 | 0 | -
4 | 3 | 1/3
5 | 1 | 1

[figure: example graph with vertices 1-5]
21. PAGERANK: THE WORD COUNT OF GRAPH PROCESSING

VertexID | Out-degree | Transition Probability
1 | 2 | 1/2
2 | 2 | 1/2
3 | 0 | -
4 | 3 | 1/3
5 | 1 | 1

PR(3) = 0.5*PR(1) + 0.33*PR(4) + PR(5)

[figure: the rank contributions flowing into vertex 3 are highlighted step by step]
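One damped rank-update sweep can be sketched in plain Java. The edge set below is an assumption: it is one concrete graph consistent with the out-degrees in the table and with the in-edges of vertex 3 implied by the PR(3) formula; dangling mass from the sink (vertex 3) is simply ignored here.

```java
import java.util.Arrays;

public class PageRankSketch {
    // One damped PageRank sweep: newPR(v) = 0.15/n + 0.85 * sum over edges u->v of PR(u)/outdeg(u).
    static double[] step(int[][] outEdges, double[] pr) {
        int n = pr.length;
        double[] next = new double[n];
        Arrays.fill(next, 0.15 / n);           // the random-jump term
        for (int u = 0; u < n; u++) {
            for (int v : outEdges[u]) {
                // u passes an equal share of its rank along each out-edge
                next[v] += 0.85 * pr[u] / outEdges[u].length;
            }
        }
        return next;
    }

    public static void main(String[] args) {
        // vertices 1..5 stored as indices 0..4; assumed edges:
        // 1->{2,3}, 2->{1,4}, 3->{} (sink), 4->{1,3,5}, 5->{3}
        int[][] outEdges = { {1, 2}, {0, 3}, {}, {0, 2, 4}, {2} };
        double[] pr = { 0.2, 0.2, 0.2, 0.2, 0.2 }; // uniform start
        pr = step(outEdges, pr);
        // vertex 3 (index 2) receives 1/2 of PR(1), 1/3 of PR(4), and all of PR(5)
        System.out.printf("PR(3) after one step: %.4f%n", pr[2]);
    }
}
```

Iterating `step` until the ranks stop changing gives the fixed point that the Pregel and Signal-Collect versions on the next slides compute in a distributed, vertex-parallel way.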
25. PREGEL EXAMPLE: PAGERANK

void compute(messages):
  // sum up received messages
  sum = 0.0
  for (m <- messages) do
    sum = sum + m
  end for
  // update vertex rank
  setValue(0.15/numVertices() + 0.85*sum)
  // distribute rank to neighbors
  for (edge <- getOutEdges()) do
    sendMessageTo(edge.target(), getValue()/numEdges)
  end for
27. SIGNAL-COLLECT EXAMPLE: PAGERANK

// signal: distribute rank to neighbors
void signal():
  for (edge <- getOutEdges()) do
    sendMessageTo(edge.target(), getValue()/numEdges)
  end for

// collect: sum up messages and update vertex rank
void collect(messages):
  sum = 0.0
  for (m <- messages) do
    sum = sum + m
  end for
  setValue(0.15/numVertices() + 0.85*sum)
30. PREGEL VS. SIGNAL-COLLECT VS. GSA

Model | Update Function Properties | Update Function Logic | Communication Scope | Communication Logic
Pregel | arbitrary | arbitrary | any vertex | arbitrary
Signal-Collect | arbitrary | based on received messages | any vertex | based on vertex state
GSA | associative & commutative | based on neighbors' values | neighborhood | based on vertex state
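The GSA restriction in the table can be made concrete with a small sketch: the update is split into a gather over in-neighbors, a pluggable sum, and an apply. The names and structure below are illustrative, not Gelly's GSA API; the point is that because the sum must be associative and commutative, partial sums can be combined in any order (and hence in parallel).

```java
import java.util.Arrays;
import java.util.function.DoubleBinaryOperator;

public class GsaPageRank {
    // One GSA-style PageRank step: gather contributions, combine them with a
    // user-supplied associative & commutative sum, then apply the new rank.
    static double[] gsaStep(int[][] outEdges, double[] pr, DoubleBinaryOperator sum) {
        int n = pr.length;
        double[] gathered = new double[n];
        for (int u = 0; u < n; u++)
            for (int v : outEdges[u])
                // gather: each in-neighbor contributes PR(u)/outdeg(u)
                gathered[v] = sum.applyAsDouble(gathered[v], pr[u] / outEdges[u].length);
        double[] next = new double[n];
        for (int v = 0; v < n; v++)
            next[v] = 0.15 / n + 0.85 * gathered[v]; // apply: compute the new rank
        return next;
    }

    public static void main(String[] args) {
        // tiny assumed graph: 0->1, 1->0, 1->2, 2->0
        int[][] edges = { {1}, {0, 2}, {0} };
        double[] pr = gsaStep(edges, new double[]{1 / 3.0, 1 / 3.0, 1 / 3.0}, Double::sum);
        System.out.println(Arrays.toString(pr));
    }
}
```

A Pregel compute function can do arbitrary work with its messages; here the framework only ever sees `sum`, which is what lets GSA use combiners and a neighborhood-only communication scope.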
31. CAN WE HAVE IT ALL?
▸ Data pipeline integration: built on top of an efficient distributed processing engine
▸ Graph ETL: high-level API with abstractions and methods to transform graphs
▸ Familiar programming model: support popular programming abstractions
34. Meet Gelly
• Java & Scala Graph APIs on top of Flink's DataSet API

[figure: the Flink stack — Flink Core; Java and Scala APIs (batch and streaming); libraries on top including the Table API, Gelly, FlinkML, and others. Gelly provides transformations and utilities, iterative graph processing, and a graph library]
35. Gelly is NOT
• a graph database
• a specialized graph processor
36. Hello, Gelly!

Java:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Edge<Long, NullValue>> edges = getEdgesDataSet(env);
Graph<Long, Long, NullValue> graph = Graph.fromDataSet(edges, env);
DataSet<Vertex<Long, Long>> verticesWithMinIds = graph.run(
    new ConnectedComponents(maxIterations));

Scala:
val env = ExecutionEnvironment.getExecutionEnvironment
val edges: DataSet[Edge[Long, NullValue]] = getEdgesDataSet(env)
val graph = Graph.fromDataSet(edges, env)
val components = graph.run(new ConnectedComponents(maxIterations))
38. Example: mapVertices

val graph = Graph.fromDataSet(...)
// increment each vertex value by one
val updatedGraph = graph.mapVertices(v => v.getValue + 1)

[figure: a small graph shown before and after incrementing each vertex value]
39. Example: subGraph
val graph: Graph[Long, Long, Long] = ...
// keep only vertices with positive values
// and only edges with negative values
val subGraph = graph.subgraph(
vertex => vertex.getValue > 0,
edge => edge.getValue < 0
)
40. Neighborhood Methods
Apply a reduce function to the 1st-hop neighborhood of each vertex in parallel:

graph.reduceOnNeighbors(new MinValue, EdgeDirection.OUT)
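What a neighborhood reduce computes can be shown in plain Java. This is a minimal sketch, not the Gelly API: for each vertex, the values of its out-neighbors are reduced with a binary function (here, min); the edge list and vertex values are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class NeighborhoodReduce {
    // For each source vertex, reduce the values of its out-neighbors with min.
    static Map<Integer, Integer> minOnOutNeighbors(int[][] edges, Map<Integer, Integer> values) {
        Map<Integer, Integer> result = new HashMap<>();
        for (int[] e : edges) {
            int src = e[0], trg = e[1];
            // fold each out-neighbor's value into the running minimum for src
            result.merge(src, values.get(trg), Math::min);
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] edges = { {1, 2}, {1, 3}, {2, 3} };
        Map<Integer, Integer> values = Map.of(1, 10, 2, 7, 3, 4);
        System.out.println(minOnOutNeighbors(edges, values));
    }
}
```

Because the reduce runs independently per vertex, a distributed engine like Flink can execute all neighborhoods in parallel after grouping edges by source.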
41. What makes Gelly unique?
• Batch graph processing on top of a streaming
dataflow engine
• Built for end-to-end analytics
• Support for multiple iteration abstractions
• Graph algorithm building blocks
• A large open-source library of graph algorithms
42. Why streaming dataflow?
• Batch engines materialize data… even if they don't have to
• the graph is always loaded and materialized in memory, even if not needed, e.g. mapping, filtering, transformation
• Communication and computation overlap
• We can do continuous graph processing (more after the break!)
43. End-to-end analytics
• Graphs don't appear out of thin air…
• We need to support pre- and post-processing
• Gelly can be easily mixed with the DataSet API: pre-processing, graph analysis, and post-processing in the same Flink program
44. Iterative Graph Processing
• Gelly offers iterative graph processing abstractions
on top of Flink’s Delta iterations
• vertex-centric
• scatter-gather
• gather-sum-apply
• partition-centric*
46. Optimization
• the runtime is aware of the iterative execution
• no scheduling overhead between iterations
• caching and state maintenance are handled automatically
• push work "out of the loop", maintain state as index, cache loop-invariant data
47. Vertex-Centric SSSP

final class SSSPComputeFunction extends ComputeFunction {
  override def compute(vertex: Vertex, messages: MessageIterator) = {
    var minDistance = if (vertex.getId == srcId) 0 else Double.MaxValue
    while (messages.hasNext) {
      val msg = messages.next
      if (msg < minDistance)
        minDistance = msg
    }
    if (vertex.getValue > minDistance) {
      setNewVertexValue(minDistance)
      for (edge: Edge <- getEdges)
        sendMessageTo(edge.getTarget, minDistance + edge.getValue)
    }
  }
}
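The superstep loop that drives this compute function can be simulated on one machine. The sketch below is illustrative (adjacency-matrix representation and driver loop are mine): each superstep, every vertex keeps the minimum distance it has received and, if its value improved, sends distance-plus-edge-weight to its out-neighbors; the computation stops when no vertex changes.

```java
import java.util.Arrays;

public class VertexCentricSssp {
    // Single-source shortest paths via synchronous message passing, simulated
    // with a weight matrix: weight[u][v] > 0 means an edge u->v of that weight.
    static double[] sssp(double[][] weight, int src) {
        int n = weight.length;
        double[] dist = new double[n];
        Arrays.fill(dist, Double.MAX_VALUE);
        dist[src] = 0;
        boolean changed = true;
        while (changed) {                 // one iteration = one superstep
            changed = false;
            double[] next = dist.clone();
            for (int u = 0; u < n; u++) {
                if (dist[u] == Double.MAX_VALUE) continue; // nothing to send yet
                for (int v = 0; v < n; v++) {
                    if (weight[u][v] > 0 && dist[u] + weight[u][v] < next[v]) {
                        next[v] = dist[u] + weight[u][v];  // keep the smaller "message"
                        changed = true;
                    }
                }
            }
            dist = next;
        }
        return dist;
    }

    public static void main(String[] args) {
        double[][] w = {
            {0, 1, 4},
            {0, 0, 2},
            {0, 0, 0},
        };
        System.out.println(Arrays.toString(sssp(w, 0))); // [0.0, 1.0, 3.0]
    }
}
```

Gelly's delta iterations improve on this naive driver by only re-activating vertices that actually received a message in the previous superstep.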
48. Algorithm building blocks
• Allow operator re-use across graph algorithms when processing the same input with a similar configuration
49. Library of Algorithms
• PageRank
• Single Source Shortest Paths
• Label Propagation
• Weakly Connected Components
• Community Detection
• Triangle Count & Enumeration
• Local and Global Clustering Coefficient
• HITS
• Jaccard & Adamic-Adar Similarity
• Graph Summarization
• usage: val ranks = inputGraph.run(new PageRank(0.85, 20))
51. Can't we block them?

[figure: a proxy sitting between users and the web, with requests fanning out to trackers, an ad server, and a legitimate site]
52. Available Solutions

Crowd-sourced "black lists" of tracker URLs: AdBlock, DoNotTrack, EasyPrivacy
• not frequently updated
• unclear who blacklists URLs, or on what criteria
• miss "hidden" trackers or dual-role nodes
• blocking requires manual matching against the list
• can you buy your way into the whitelist?
53. DataSet
• 6 months (Nov 2014 - April 2015) of augmented
Apache logs from a web proxy
• 80m requests, 2m distinct URLs, 3k users
56. Data Pipeline

1: logs pre-processing (raw logs → cleaned logs)
2: bipartite graph creation
3: largest connected component extraction
4: hosts-projection graph creation
5: community detection
6: results

The pre- and post-processing stages use the DataSet API; the graph stages use Gelly.

Sample output (T: tracker, NT: non-tracker):
google-analytics.com: T
bscored-research.com: T
facebook.com: NT
github.com: NT
cdn.cxense.com: NT
...
57. Feeling Gelly?
• Gelly Guide
https://ci.apache.org/projects/flink/flink-docs-master/libs/gelly_guide.html
• To Petascale and Beyond @Flink Forward '16
http://flink-forward.org/kb_sessions/to-petascale-and-beyond-apache-flink-in-the-clouds/
• Web Tracker Detection @Flink Forward '15
https://www.youtube.com/watch?v=ZBCXXiDr3TU
paper: Kalavri, Vasiliki, et al. "Like a pack of wolves: Community structure of web trackers." International Conference on Passive and Active Network Measurement, 2016.
59. Real Graphs are dynamic
Graphs are created from events happening in real-time
61. How we've done graph processing so far
1. Load: read the graph from disk and partition it in memory
2. Compute: read and mutate the graph state
3. Store: write the final graph state back to disk
64. What's wrong with this model?
• It is slow
• wait until the computation is over before you see any result
• pre-processing and partitioning
• It is expensive
• lots of memory and CPU required in order to scale
• It requires re-computation for graph changes
• no efficient way to deal with updates
65. Can we do graph processing on streams?
• Maintain the dynamic graph structure
• Provide up-to-date results with low latency
• Compute on fresh state only
66. Single-pass graph streaming
• Each event is an edge addition
• Maintain only a graph summary
• Recent events are grouped in graph windows
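The "maintain only a graph summary" idea can be sketched with the simplest possible summary: vertex degrees. This is an illustrative, single-machine example (class and method names are mine): each edge event is processed exactly once and then discarded, and only the small degree map is kept.

```java
import java.util.HashMap;
import java.util.Map;

public class DegreeSummary {
    // Single-pass summary: never store the edge set, only per-vertex degrees.
    private final Map<Long, Integer> degree = new HashMap<>();

    // process one edge event from the stream
    void onEdge(long src, long trg) {
        degree.merge(src, 1, Integer::sum);
        degree.merge(trg, 1, Integer::sum);
    }

    int degreeOf(long v) {
        return degree.getOrDefault(v, 0);
    }

    public static void main(String[] args) {
        DegreeSummary s = new DegreeSummary();
        long[][] stream = { {1, 2}, {1, 3}, {2, 3}, {1, 4} };
        for (long[] e : stream) s.onEdge(e[0], e[1]);
        System.out.println(s.degreeOf(1)); // 3
    }
}
```

The summaries on the next slide (spanners, sparsifiers, sketches) follow the same pattern with cleverer state, trading exactness for bounded memory.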
68. Graph Summaries
• spanners for distance estimation
• sparsifiers for cut estimation
• sketches for homomorphic properties

[figure: running the algorithm on the full graph yields R1; running it on the graph summary yields R2 ≈ R1]
84. Stream Connected Components with Flink

DataStream<DisjointSet> cc =
  edgeStream
    .keyBy(0)                                         // partition the edge stream
    .timeWindow(Time.of(100, TimeUnit.MILLISECONDS))  // define the merging frequency
    .fold(new DisjointSet(), new UpdateCC())          // merge locally
    .flatMap(new Merger())
    .setParallelism(1);                               // merge globally
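The two merge steps rest on a union-find structure. The sketch below is illustrative, not the talk's `DisjointSet` implementation: each window folds its edges into a local structure (what `UpdateCC` does), and the single parallelism-1 task merges the local summaries into a global one (what `Merger` does).

```java
import java.util.HashMap;
import java.util.Map;

public class DisjointSetSketch {
    // Minimal union-find over vertex ids, with path compression.
    private final Map<Long, Long> parent = new HashMap<>();

    long find(long x) {
        parent.putIfAbsent(x, x);
        long p = parent.get(x);
        if (p != x) {
            p = find(p);           // compress the path to the root
            parent.put(x, p);
        }
        return p;
    }

    // fold one edge into the summary: its endpoints join the same component
    void union(long a, long b) {
        parent.put(find(a), find(b));
    }

    // merge another summary into this one (the "merge globally" step)
    void merge(DisjointSetSketch other) {
        for (Map.Entry<Long, Long> e : other.parent.entrySet())
            union(e.getKey(), e.getValue());
    }

    public static void main(String[] args) {
        DisjointSetSketch w1 = new DisjointSetSketch();  // one window's local summary
        w1.union(1, 2);
        DisjointSetSketch w2 = new DisjointSetSketch();  // another window's summary
        w2.union(2, 3);
        w1.merge(w2);
        System.out.println(w1.find(1) == w1.find(3));    // true: 1 and 3 connected
    }
}
```

Merging works in any order because union-find summaries are themselves associative, which is why the local fold and the global merge can be split across the pipeline.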
90. Introducing Gelly-Stream
Gelly-Stream enriches the DataStream API with two new ADTs:
• GraphStream:
• A representation of a data stream of edges.
• Edges can have state (e.g. weights).
• Supports property streams, transformations and aggregations.
• GraphWindow:
• A "time-slice" of a graph stream.
• It enables neighborhood aggregations.
92. Graph Stream Aggregations

[figure: edges from the graph stream are folded per window into local summaries, combined into a global summary, then reduced/transformed into a property stream of aggregate results; global aggregates can be persistent or transient]

graphStream.aggregate(
  new MyGraphAggregation(window, fold, combine, transform))
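The fold/combine/transform shape of such an aggregation can be sketched in plain Java. All names below are illustrative, not the Gelly-Stream API: each partition's edges are folded into a local summary, the local summaries are combined into a global one, and a final transform produces the result.

```java
import java.util.List;
import java.util.function.*;

public class StreamAggregationSketch {
    // Generic fold -> combine -> transform aggregation over partitioned edges.
    static <S, R> R aggregate(List<List<long[]>> partitions,
                              Supplier<S> init,
                              BiFunction<S, long[], S> fold,
                              BinaryOperator<S> combine,
                              Function<S, R> transform) {
        S global = init.get();
        for (List<long[]> part : partitions) {
            S local = init.get();
            for (long[] edge : part)
                local = fold.apply(local, edge);   // build the local summary
            global = combine.apply(global, local); // merge into the global summary
        }
        return transform.apply(global);
    }

    public static void main(String[] args) {
        // example aggregation: count edges across two partitions
        List<List<long[]>> parts = List.of(
                List.of(new long[]{1, 2}, new long[]{2, 3}),
                List.of(new long[]{3, 4}));
        long total = aggregate(parts, () -> 0L, (c, e) -> c + 1, Long::sum, c -> c);
        System.out.println(total); // 3
    }
}
```

As with the connected-components example, the split only works when `combine` is associative and commutative, so the local summaries can be merged in whatever order the stream delivers them.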