1. Spark GraphX & Pregel
Challenges and Best Practices
Ashutosh Trivedi (IIIT Bangalore)
Kaushik Ranjan (IIIT Bangalore)
Sigmoid-Meetup Bangalore
https://github.com/anantasty/SparkAlgorithms
2. Agenda
• Introduction to GraphX
– How to describe a graph
– RDDs to store Graph
– Algorithms available
• Application in graph algorithms
– Feedback Vertex Set of a Graph
– Identifying parallel parts of the solution
• Challenges we faced
• Best practices
Ashutosh & Kaushik, Sigmoid-Meetup
Bangalore Dec-2014
3. Graph Representation
class Graph[V, E] {
  // A property graph is built from a vertex table and an edge table
  def Graph(vertices: Table[(Id, V)],
            edges: Table[(Id, Id, E)])
}
• The VertexRDD[A] extends RDD[(VertexID, A)] and adds the additional constraint that each VertexID occurs only once.
• Moreover, VertexRDD[A] represents a set of vertices, each with an attribute of type A.
• The EdgeRDD[ED] extends RDD[Edge[ED]].
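As a rough illustration of the vertex and edge tables described above, the sketch below builds a small property graph with the GraphX API. The vertex names, the integer edge attributes, the local master setting, and the object name `GraphConstruction` are assumptions made for this example and are not taken from the talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object GraphConstruction {
  // Builds a four-vertex graph from hypothetical data.
  def build(sc: SparkContext): Graph[String, Int] = {
    // Vertex table: (id, attribute). Each id should appear only once.
    val vertices: RDD[(VertexId, String)] =
      sc.parallelize(Seq((1L, "A"), (2L, "B"), (3L, "C"), (4L, "D")))
    // Edge table: Edge(sourceId, destinationId, attribute).
    val edges: RDD[Edge[Int]] =
      sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(1L, 3L, 1), Edge(2L, 3L, 1), Edge(3L, 4L, 1)))
    Graph(vertices, edges)
  }

  def main(args: Array[String]): Unit = {
    // A local master is assumed so the sketch runs without a cluster.
    val conf = new SparkConf().setMaster("local[*]").setAppName("graph-construction")
    val sc = new SparkContext(conf)
    try {
      val graph = build(sc)
      println(s"vertices = ${graph.numVertices}, edges = ${graph.numEdges}")
    } finally sc.stop()
  }
}
```

`graph.vertices` is then a `VertexRDD[String]` and `graph.edges` an `EdgeRDD[Int]`, matching the two RDD types listed on the slide.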
6. Vertex and Edges
[Figure: a single vertex (A) and a single edge (joining A and B)]
7. Triplets Join Vertices and Edges
• The triplets operator joins vertices and edges:
Vertices: A, B, C, D
Edges: (A, B), (A, C), (B, C), (C, D)
Triplets: (A, B), (A, C), (B, C), (C, D), each edge carried together with the attributes of both of its endpoints
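The join performed by the triplets operator can be sketched with the GraphX `triplets` view, which yields one `EdgeTriplet` per edge with `srcAttr`, `attr`, and `dstAttr` filled in. The graph below reuses the four vertices and four edges from the slide; the numeric ids, the edge labels, and the object name `TripletsExample` are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object TripletsExample {
  // Renders every triplet as "source -[edge]-> destination".
  def describe(sc: SparkContext): Seq[String] = {
    val vertices = sc.parallelize(Seq((1L, "A"), (2L, "B"), (3L, "C"), (4L, "D")))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "AB"), Edge(1L, 3L, "AC"), Edge(2L, 3L, "BC"), Edge(3L, 4L, "CD")))
    val graph = Graph(vertices, edges)
    // Map each triplet to a String before collecting to the driver.
    graph.triplets
      .map(t => s"${t.srcAttr} -[${t.attr}]-> ${t.dstAttr}")
      .collect()
      .sorted
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("triplets"))
    try describe(sc).foreach(println) finally sc.stop()
  }
}
```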
10. Feedback Vertex Set
• A feedback vertex set of a graph is a set of vertices whose removal leaves a graph without cycles.
• Each feedback vertex set contains at least one vertex of every cycle in the graph.
• The feedback vertex set problem is NP-complete.
• Brute-force approach: enumerate every simple cycle.
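To make the definition above concrete, here is a minimal plain-Scala sketch (no Spark) that checks whether a candidate set is a feedback vertex set: it deletes the candidate vertices and then tests the remaining directed graph for cycles by repeatedly discarding edges that cannot lie on one. The edge-set representation and the names `FvsCheck`, `isAcyclic`, and `isFeedbackVertexSet` are assumptions for illustration.

```scala
object FvsCheck {
  // Directed edges are (source, destination) pairs.
  def isAcyclic(edges: Set[(Int, Int)]): Boolean = {
    @annotation.tailrec
    def peel(es: Set[(Int, Int)]): Boolean = {
      val hasIncoming = es.map(_._2)
      // An edge whose source has no incoming edge cannot be part of a cycle.
      val rest = es.filter { case (src, _) => hasIncoming(src) }
      if (rest.isEmpty) true
      else if (rest.size == es.size) false // nothing could be discarded: a cycle remains
      else peel(rest)
    }
    peel(edges)
  }

  // True when deleting `removed` (and every edge touching it) leaves no cycle.
  def isFeedbackVertexSet(edges: Set[(Int, Int)], removed: Set[Int]): Boolean =
    isAcyclic(edges.filterNot { case (s, d) => removed(s) || removed(d) })
}
```

For two cycles 1→2→1 and 1→3→1 that share vertex 1, `isFeedbackVertexSet(edges, Set(1))` holds, while `Set(2)` alone is not enough because the cycle through 3 survives.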
11. Strongly Connected Components
[Figure: directed graph on vertices 1–10, split into strongly connected components]
Each strongly connected component can be processed in parallel, since no cycle spans two components.
SC1 – (1) SC2 – (5) SC3 – (8) SC4 – (9)
12. FVS Algorithm
# Greedy recursive solution
FVS(G)
    sccGraph = scc(G)
    for each graph in sccGraph
        for each vertex
            remove vertex and recompute scc
        vertexV = vertex whose removal gives the max number of scc  # i.e. it kills the most cycles
        subGraph = subgraph(removeV)  # the graph with vertexV removed
        FVS(subGraph)
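One possible reading of the greedy pseudocode above, written as a plain-Scala sketch without Spark. It is not the authors' implementation. The adjacency-map representation (every vertex must appear as a key), the names `GreedyFvs`, `scc`, `induced`, and `fvs`, and the simple forward/backward-reachability SCC routine (chosen for brevity over Tarjan's or Kosaraju's algorithm) are all assumptions.

```scala
object GreedyFvs {
  type Graph = Map[Int, Set[Int]] // vertex -> successors; every vertex is a key

  // All vertices reachable from `start`, including `start` itself.
  private def reach(adj: Graph, start: Int): Set[Int] = {
    @annotation.tailrec
    def loop(frontier: List[Int], seen: Set[Int]): Set[Int] = frontier match {
      case Nil => seen
      case v :: rest =>
        val next = adj.getOrElse(v, Set.empty[Int]) -- seen
        loop(next.toList ::: rest, seen ++ next)
    }
    loop(List(start), Set(start))
  }

  private def reverse(g: Graph): Graph = {
    val empty: Graph = g.keys.map(_ -> Set.empty[Int]).toMap
    g.foldLeft(empty) { case (acc, (u, vs)) =>
      vs.foldLeft(acc)((a, v) => a.updated(v, a.getOrElse(v, Set.empty[Int]) + u))
    }
  }

  // SCC of v = (reachable from v) intersected with (can reach v).
  def scc(g: Graph): List[Set[Int]] = {
    val rev = reverse(g)
    @annotation.tailrec
    def loop(remaining: Set[Int], acc: List[Set[Int]]): List[Set[Int]] =
      if (remaining.isEmpty) acc.reverse
      else {
        val v = remaining.min
        val comp = reach(g, v) & reach(rev, v)
        loop(remaining -- comp, comp :: acc)
      }
    loop(g.keySet, Nil)
  }

  // Subgraph restricted to the vertices in `keep`.
  def induced(g: Graph, keep: Set[Int]): Graph =
    g.collect { case (u, vs) if keep(u) => u -> (vs & keep) }

  // Greedy: in each cyclic SCC, remove the vertex that yields the most SCCs, then recurse.
  def fvs(g: Graph): Set[Int] =
    scc(g).map { comp =>
      val sub = induced(g, comp)
      val selfLoop = sub.getOrElse(comp.head, Set.empty[Int]).contains(comp.head)
      if (comp.size == 1 && !selfLoop) Set.empty[Int] // a lone vertex without a self-loop has no cycle
      else {
        val best = comp.toList.sorted.maxBy(v => scc(induced(sub, comp - v)).size)
        fvs(induced(sub, comp - best)) + best
      }
    }.foldLeft(Set.empty[Int])(_ ++ _)
}
```

Being greedy, the result is a feedback vertex set but is not guaranteed to be a minimum one. Each component is handled independently inside `map`, which is the step the talk identifies as parallelizable.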
15. FVS – Spark Implementation
sccGraph carries one extra property on each vertex, sccID; extract it.
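The slide's own code did not survive extraction, so the following is only a sketch of how the SCC id might be extracted with GraphX. `stronglyConnectedComponents(numIter)` returns a graph whose vertex attribute is the lowest vertex id in that vertex's component, which plays the role of sccID here. The sample edges, the iteration bound, and the names `SccExtraction`, `demoGraph`, and `sccMembers` are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, VertexId}

object SccExtraction {
  // Hypothetical graph: cycle 1 -> 2 -> 3 -> 1, bridge 3 -> 4, cycle 4 -> 5 -> 4.
  def demoGraph(sc: SparkContext): Graph[Int, Int] =
    Graph.fromEdgeTuples(
      sc.parallelize(Seq((1L, 2L), (2L, 3L), (3L, 1L), (3L, 4L), (4L, 5L), (5L, 4L))), 0)

  // Groups vertex ids by the SCC id that GraphX attaches to each vertex.
  def sccMembers(g: Graph[Int, Int], numIter: Int): Map[VertexId, Set[VertexId]] =
    g.stronglyConnectedComponents(numIter)
      .vertices
      .map { case (vertex, sccId) => (sccId, Set(vertex)) }
      .reduceByKey(_ ++ _)
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("scc"))
    try sccMembers(demoGraph(sc), 10).foreach(println) finally sc.stop()
  }
}
```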
16. FVS – Spark Implementation
sccGraph = scc(G)
for each graph in sccGraph
18. FVS – Spark Implementation
for each vertex
    remove vertex and recompute scc
# Z is a list of scc counts after removing each vertex
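A hedged sketch of how the list Z could be computed with GraphX: for each vertex, `subgraph` drops it, `stronglyConnectedComponents` is recomputed, and the distinct SCC ids are counted. This loops over vertices on the driver and launches Spark jobs per vertex, so it is meant as an illustration rather than a tuned implementation. The sample graph and the names `SccCountAfterRemoval`, `sccCountWithout`, and `sccCounts` are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, VertexId}

object SccCountAfterRemoval {
  // Hypothetical graph: two cycles, 1 -> 2 -> 1 and 1 -> 3 -> 1, sharing vertex 1.
  def demoGraph(sc: SparkContext): Graph[Int, Int] =
    Graph.fromEdgeTuples(sc.parallelize(Seq((1L, 2L), (2L, 1L), (1L, 3L), (3L, 1L))), 0)

  // Number of SCCs that remain once `removed` is deleted from the graph.
  def sccCountWithout(g: Graph[Int, Int], removed: VertexId, numIter: Int): Long =
    g.subgraph(vpred = (id, _) => id != removed)
      .stronglyConnectedComponents(numIter)
      .vertices
      .map(_._2)
      .distinct()
      .count()

  // Z: vertex id -> SCC count after removing that vertex.
  def sccCounts(g: Graph[Int, Int], numIter: Int): Map[VertexId, Long] =
    g.vertices.map(_._1).collect().map(v => v -> sccCountWithout(g, v, numIter)).toMap

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("scc-count"))
    try sccCounts(demoGraph(sc), 10).foreach(println) finally sc.stop()
  }
}
```

The vertex with the largest count in Z is the greedy choice: removing it breaks the component into the most pieces.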
19. FVS – Spark Implementation
vertexV = vertex whose removal gives the max number of scc  # i.e. it kills the most cycles
21. Pregel
• Graph DB
– Data Storage
– Data Mining
• Advantages
– Large-scale distributed computations
– Parallel algorithms for graphs on multiple machines
– Fault tolerance and distributability
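22. Oldest Follower
What is the age of the oldest follower of each user?
// Assumes each vertex attribute has an `age: Int` field and that an edge src -> dst means "src follows dst"
val oldestFollowerAge = graph
  .aggregateMessages[Int](
    ctx => ctx.sendToDst(ctx.srcAttr.age),  // map: send the follower's age to the followed user
    (a, b) => math.max(a, b)                // reduce: keep the larger age
  )
// The result is a VertexRDD[Int]: (user id, age of the oldest follower)
mapReduceTriplets is now aggregateMessages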
23. New in GraphX
In aggregateMessages:
• An EdgeContext exposes the triplet fields.
• It provides functions to explicitly send messages to the source and destination vertex.
• It requires the user to indicate which fields in the triplet are actually required.
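The three points about aggregateMessages can be seen together in a small sketch that counts out-degree and in-degree in one pass. The `EdgeContext` offers `sendToSrc` and `sendToDst`, and the third argument, `TripletFields.None`, declares that no vertex or edge attribute is read. The sample graph and the names `DegreeCounts` and `degrees` are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, TripletFields, VertexId}

object DegreeCounts {
  // Returns vertex id -> (outDegree, inDegree) for a small hypothetical graph.
  def degrees(sc: SparkContext): Map[VertexId, (Int, Int)] = {
    val graph: Graph[Int, Int] =
      Graph.fromEdgeTuples(sc.parallelize(Seq((1L, 2L), (1L, 3L), (2L, 3L), (3L, 4L))), 0)
    graph.aggregateMessages[(Int, Int)](
      ctx => {
        ctx.sendToSrc((1, 0)) // one outgoing edge for the source
        ctx.sendToDst((0, 1)) // one incoming edge for the destination
      },
      (a, b) => (a._1 + b._1, a._2 + b._2), // sum the counters per vertex
      TripletFields.None                    // no triplet attribute is needed
    ).collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("degrees"))
    try degrees(sc).foreach(println) finally sc.stop()
  }
}
```

Declaring only the fields that are read lets GraphX avoid shipping unused vertex attributes to the edge partitions; a query that reads `ctx.srcAttr` would pass `TripletFields.Src` instead.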
24. Theory – it's Good
How it works – that's awesome
Graphs are recursive data structures, where the property of a vertex depends on the properties of its neighbors, which in turn depend on the properties of their neighbors.
28. Applications - GIS
• Algorithm: compute all vertices in a directed graph that can reach a given vertex.
• Can be used for watershed delineation in Geographic Information Systems.
[Figure: example directed graph] The vertices that can reach E are A and B.
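The reachability query can be sketched with the GraphX Pregel operator: mark the target, then repeatedly send a "reached" flag backwards along each edge from a marked destination to an unmarked source until no messages remain. The slide's figure did not survive, so the graph below (A→B→E→C→D) is a hypothetical one chosen so that A and B are the vertices that reach E. The numeric ids and the names `ReachesTarget`, `demoGraph`, and `ancestors` are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}

object ReachesTarget {
  // Hypothetical graph: A(1) -> B(2) -> E(5) -> C(3) -> D(4).
  def demoGraph(sc: SparkContext): Graph[String, Int] = {
    val vertices = sc.parallelize(Seq((1L, "A"), (2L, "B"), (3L, "C"), (4L, "D"), (5L, "E")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 5L, 1), Edge(5L, 3L, 1), Edge(3L, 4L, 1)))
    Graph(vertices, edges)
  }

  // Ids of all vertices with a directed path to `target`, excluding `target` itself.
  def ancestors(graph: Graph[String, Int], target: VertexId): Set[VertexId] = {
    // Vertex state: true once the vertex is known to reach the target.
    val start: Graph[Boolean, Int] = graph.mapVertices((id, _) => id == target)
    val done = start.pregel(false)(
      (_, reached, msg) => reached || msg, // vertex program: merge the incoming flag
      t => if (t.dstAttr && !t.srcAttr) Iterator((t.srcId, true)) else Iterator.empty,
      (a, b) => a || b)                    // combine messages for the same vertex
    done.vertices
      .filter { case (id, reached) => reached && id != target }
      .map(_._1)
      .collect()
      .toSet
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("reach"))
    try println(ancestors(demoGraph(sc), 5L)) finally sc.stop()
  }
}
```

A message is sent only to a source that is not yet marked, so the computation stops on its own even when the graph contains cycles.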