Processing Large Graphs

www.serendio.com

Content:
• Introduction
• Graph Processing with MapReduce
• Graph Processing with Apache Giraph
• Graph Processing with Neo4j
• Conclusion

What is a Graph?
A set of objects (vertices) connected by links (edges).
[Figure: example graph with vertices V1 to V7]

Why are Graphs Important in the Big Data World?

Application Domains
• Recommendation Systems
• Fraud Detection
• Complex Network Analysis
• Graph based Search
• Network & IT operations
• Master Data Management
• Many More…

Graph Processing with MapReduce

MapReduce: Finding Triangles
Problem: Enumerate the 3-cycle subgraphs (triangles) of a given graph

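• In the first map operation for enumerating triangles, the mapper records each edge under the endpoint with the lower degree.
• The incoming records' keys don't matter.
• In the first reduce, each bin holds the edges recorded under one vertex, and the reducer emits every pair of those edges as an open triad.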

• The second map for enumerating triangles brings together the edge
and open triad records.
• In the process, it rekeys the edge records so that both record types
are binned under the vertices they connect.

• In the second reduce, each bin contains at most one edge record and some
number of triad records (perhaps none).
• For every combination of edge record and triad record in a bin, the reduce
emits a triangle record. The output key isn’t significant.
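
The two map/reduce rounds described above can be sketched in plain Python. This is a single-process illustration of the idea, not a Hadoop job: the function name `enumerate_triangles`, the tie-break on vertex id when degrees are equal, and the sample edge list are assumptions made for the example, and the input is assumed to be a simple undirected graph without duplicate edges.

```python
from collections import defaultdict
from itertools import combinations


def enumerate_triangles(edges):
    """Enumerate triangles with two simulated map/reduce rounds."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    def rank(vertex):
        # Lower degree first; ties broken by vertex id (an assumption).
        return (degree[vertex], vertex)

    # Round 1 map: record each edge under its lower-degree endpoint.
    round1_bins = defaultdict(list)
    for u, v in edges:
        round1_bins[min(u, v, key=rank)].append((u, v))

    # Round 1 reduce: every pair of edges in a bin forms an open triad.
    triads = []
    for apex, bin_edges in round1_bins.items():
        for e1, e2 in combinations(bin_edges, 2):
            end1 = e1[0] if e1[1] == apex else e1[1]
            end2 = e2[0] if e2[1] == apex else e2[1]
            triads.append((apex, end1, end2))

    # Round 2 map: key edges and triads by the vertex pair they connect
    # (for a triad, the pair that would close it).
    round2_bins = defaultdict(lambda: {"edge": False, "apexes": []})
    for u, v in edges:
        round2_bins[frozenset((u, v))]["edge"] = True
    for apex, end1, end2 in triads:
        round2_bins[frozenset((end1, end2))]["apexes"].append(apex)

    # Round 2 reduce: an edge record plus a triad record gives a triangle.
    triangles = set()
    for pair, records in round2_bins.items():
        if records["edge"]:
            for apex in records["apexes"]:
                triangles.add(tuple(sorted((apex, *pair))))
    return triangles


# Hypothetical sample graph: one triangle (1, 2, 3) and a pendant edge.
sample_edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
sample_triangles = enumerate_triangles(sample_edges)
```

Binning edges under the lower-degree endpoint keeps the number of open triads small at high-degree vertices, which is where a naive pairing would blow up.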

MapReduce: Finding Connected Components
Problem: Find the connected components of a given graph
[Figure: example graph on vertices 1 to 6, redrawn after each iteration with the vertices' current labels]

K = vertex, V = vertex set held for K

Initial:
K  V
1  1,2
2  1,2,3,4
3  2,3
4  2,4,5
5  4,5,6
6  5,6

After iteration 1:
K  V
1  1,2,3,4
2  1,2,3,4,5
3  1,2
4  1,2,4,5,6
5  4,5,6
6  4,5

After iteration 2:
K  V
1  1,2,3,4,5,6
2  1
3  1
4  1,5,6
5  1,4
6  1,4

After iteration 3 (components):
K  V
1  1,2,3,4,5,6
2  1
3  1
4  1
5  1
6  1
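
As a rough illustration of how such an iteration can be driven, here is a minimal single-process sketch. The message pattern used here is the one commonly known as Hash-to-Min; that the slides follow exactly this scheme is an assumption, and the intermediate sets it produces need not match every table entry above. The function name `hash_to_min`, the `max_rounds` cap, and the edge list (read off the initial table) are also assumptions.

```python
from collections import defaultdict


def hash_to_min(edges, max_rounds=50):
    """Label every vertex with the smallest vertex id in its component."""
    # Initial cluster of a vertex: itself plus its neighbours.
    clusters = defaultdict(set)
    for u, v in edges:
        clusters[u] |= {u, v}
        clusters[v] |= {u, v}

    for _ in range(max_rounds):
        inbox = defaultdict(set)
        # "Map": each vertex sends its whole cluster to the cluster minimum
        # and sends the minimum to every member of the cluster.
        for vertex, cluster in clusters.items():
            smallest = min(cluster)
            inbox[smallest] |= cluster
            for member in cluster:
                inbox[member].add(smallest)
        # "Reduce": the new cluster is the union of everything received.
        updated = {vertex: inbox[vertex] for vertex in clusters}
        if updated == clusters:
            break  # nothing changed, so the iteration has converged
        clusters = updated

    return {vertex: min(cluster) for vertex, cluster in clusters.items()}


# Edges inferred from the initial table above, plus a hypothetical
# second component (7, 8) to show that labels stay separate.
example_edges = [(1, 2), (2, 3), (2, 4), (4, 5), (5, 6), (7, 8)]
component_of = hash_to_min(example_edges)
```

Each pass of the loop corresponds to one MapReduce job, which is why the number of passes matters so much for the I/O cost.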

Why Not MapReduce for Graphs?
Problems with MapReduce Graph Algorithms
• Iterative algorithms need a chain of MR jobs
• High I/O: the graph is written to and read back from disk between jobs
• Not intuitive for graph algorithms

Graph-Based Technologies in the Big Data/NoSQL Domain
Database Storage & Traversal
• Neo4j
• TitanDB
• OrientDB
Computation Engines
• Apache Giraph
• GraphLab
• Apache Spark Graph ML/GraphX

Data Management Tools with Respect to Graphs
            Tuples                        Graphs
Analytics   MapReduce, Hive, Spark, Pig   Giraph, GraphX, GraphLab
Database    Cassandra, MySQL, HBase       Neo4j, OrientDB

Graph Processing Engine with Apache Giraph

Apache Giraph
Graph Processing System
• In-memory Computation
• Inspired by Google Pregel
• Vertex-centric, high-level programming model
• Batch-oriented processing
• Based on Valiant's Bulk Synchronous Parallel (BSP) model

Bulk Synchronous Parallel Model

Apache Giraph Model
• Each vertex has
– a vertex identifier
– a variable (its value)
• Each directed edge has
– a source vertex identifier
– a target vertex identifier
– a variable (its value)
• Computation consists of
– Input
– Supersteps separated by global synchronization points
– Algorithm termination
– Output
• All vertices compute in parallel, running the same user-defined function.

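• Algorithm termination:
– Every vertex is in the inactive state
– No messages are generated for the next superstep
Vertex State Transition
[Figure: a vertex becomes inactive when it votes to halt and becomes active again when it receives a message]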
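
The superstep loop and the termination rule can be sketched as a toy single-process engine. This is not Giraph code: `run_bsp`, the shape of the `compute` callback, and the maximum-value example are assumptions chosen only to show the active/inactive transitions.

```python
def run_bsp(values, neighbours, compute, max_supersteps=100):
    """Run compute() on every runnable vertex until all have halted.

    values: dict vertex -> value; neighbours: dict vertex -> list of targets.
    compute returns (new_value, [(target, message), ...], vote_to_halt).
    """
    active = set(values)
    inbox = {vertex: [] for vertex in values}
    superstep = 0
    while superstep < max_supersteps:
        # A halted vertex is woken up again by an incoming message.
        runnable = [v for v in values if v in active or inbox[v]]
        if not runnable:
            break  # every vertex inactive and no messages in flight
        outbox = {vertex: [] for vertex in values}
        for vertex in runnable:
            new_value, outgoing, halt = compute(
                superstep, values[vertex], inbox[vertex],
                neighbours.get(vertex, []))
            values[vertex] = new_value
            for target, message in outgoing:
                outbox[target].append(message)
            if halt:
                active.discard(vertex)
            else:
                active.add(vertex)
        inbox = outbox  # global synchronization point between supersteps
        superstep += 1
    return values


def propagate_max(superstep, value, messages, targets):
    """Example vertex program: spread the largest value through the graph."""
    new_value = max([value] + messages)
    changed = new_value > value
    outgoing = [(t, new_value) for t in targets] if superstep == 0 or changed else []
    return new_value, outgoing, True  # always vote to halt


# Hypothetical chain a - b - c - d with starting values 3, 6, 2, 1.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
result = run_bsp({"a": 3, "b": 6, "c": 2, "d": 1}, chain, propagate_max)
```

Messages sent in one superstep are only delivered in the next one, which is what the swap of `inbox` and `outbox` at the barrier models.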

Architecture of Giraph on Hadoop Stack

Problem: Simple Shortest Path
[Figure: weighted example graph with the source vertex marked]

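// Runs once per vertex per superstep, inside a class extending
// BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable>.
@Override
public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
    Iterable<DoubleWritable> messages) throws IOException {
  if (getSuperstep() == 0) {
    // Every vertex starts at "infinite" distance.
    vertex.setValue(new DoubleWritable(Double.MAX_VALUE));
  }
  // Best distance offered to this vertex in this superstep.
  double minDist = isSource(vertex) ? 0d : Double.MAX_VALUE;
  for (DoubleWritable message : messages) {
    minDist = Math.min(minDist, message.get());
  }
  if (minDist < vertex.getValue().get()) {
    // Shorter path found: store it and tell the neighbours.
    vertex.setValue(new DoubleWritable(minDist));
    for (Edge<LongWritable, FloatWritable> edge : vertex.getEdges()) {
      double distance = minDist + edge.getValue().get();
      sendMessage(edge.getTargetVertexId(), new DoubleWritable(distance));
    }
  }
  // Go inactive until a new message arrives.
  vertex.voteToHalt();
}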
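
To see the same vertex logic run end to end without a cluster, the following sketch simulates the supersteps in a single Python process. The graph, its weights, the vertex names and the function `shortest_paths` are hypothetical and are not taken from the slide's figure.

```python
import math
from collections import defaultdict


def shortest_paths(edges, source):
    """edges: dict vertex -> list of (target, weight) pairs."""
    vertices = set(edges) | {t for outs in edges.values() for t, _ in outs}
    dist = {v: math.inf for v in vertices}   # value set in superstep 0
    inbox = {v: [] for v in vertices}
    superstep = 0
    while superstep == 0 or any(inbox.values()):
        outbox = defaultdict(list)
        for v in vertices:
            # After superstep 0 a vertex only runs if it received a message.
            if superstep > 0 and not inbox[v]:
                continue
            min_dist = 0.0 if v == source else math.inf
            min_dist = min([min_dist] + inbox[v])
            if min_dist < dist[v]:
                dist[v] = min_dist
                for target, weight in edges.get(v, []):
                    outbox[target].append(min_dist + weight)
            # Falling out of this branch corresponds to voteToHalt().
        inbox = {v: outbox.get(v, []) for v in vertices}
        superstep += 1
    return dist


# Hypothetical weighted graph with source "s".
graph = {
    "s": [("a", 1), ("b", 4)],
    "a": [("b", 2), ("c", 6)],
    "b": [("c", 3)],
    "c": [],
}
distances = shortest_paths(graph, "s")
```

The loop stops when a superstep produces no messages, which matches the termination condition of the model: all vertices halted and nothing in flight.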


Introduction to NoSQL
NoSQL = Not Only SQL

Graph Databases
• A database that stores data as a graph structure
• Each node knows its adjacent nodes
• As the number of nodes increases, the cost of a local step (a hop to a neighbor) remains the same
• Indexes are used for lookups
• Optimized for traversing connected data
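
The point that each node knows its adjacent nodes can be illustrated with a small comparison. This is a conceptual sketch in plain Python, not how any particular graph database stores data; the edge list and both function names are made up for the example.

```python
# Hypothetical directed edges, stored the way a relational edge table would be.
edge_table = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]


def neighbours_by_scan(table, node):
    """Relational style: scan (or index into) the whole edge table.

    Without an index the cost grows with the total number of edges.
    """
    return [target for source, target in table if source == node]


# Graph style: each node holds direct references to its neighbours.
adjacency = {}
for source, target in edge_table:
    adjacency.setdefault(source, []).append(target)


def neighbours_local(adj, node):
    """One local step: cost depends on this node's degree only."""
    return adj.get(node, [])


scan_result = neighbours_by_scan(edge_table, "A")
local_result = neighbours_local(adjacency, "A")
```

Both calls return the same neighbours; the difference is that the second one does not get slower when unrelated nodes and edges are added to the graph.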

Graph Databases: Model
[Figure: graph whose elements carry key-value properties, e.g. Key1: Value 1, Key2: Value 2]

Introduction to NoSQL
[Figure. Source: http://db-engines.com]

Neo4j
• Graph database from Neo Technology
• A schema-free labeled Property Graph Database +
Lucene Index
• Perfect for complex, highly connected data
• Reliable with real ACID Transactions
• Scalable: Billions of Nodes and Relationships, Scale
out with highly available Neo4j Cluster
• Server with REST API or Embeddable
• Declarative Query Language (Cypher)

Neo4j: Strengths & Weaknesses
Strengths
• Powerful data model
• Whiteboard friendly
• Fast for connected data
• Easy to query
Weaknesses
• Requires a conceptual shift (graph-like thinking)

Four Building Blocks
• Nodes
• Relationships
• Properties
• Labels
Example:
(:USER {Name: 'Apple', Age: 25})-[:RELATIVE {Relation: 'Owner'}]-(:PET {Name: 'Mike', Animal: 'Dog'})

Serendio Proprietary and Confidential
SQL to Graph DB: Data Model Transformation
SQL                   Graph DB
Table                 Type of node (label)
Rows of a table       Nodes
Columns of a table    Node properties
Foreign keys, joins   Relationships

SQL to Graph DB: Data Model Transformation

Table: Actor
Name         Movies Language
Rajnikant    Tamil
Maheshbabu   Telugu
Vijay        Tamil
Prabhas      Telugu

Table: Movie
Name           Lead Actor
Bahubali       Prabhas
Puli           Vijay
Shrimanthudu   Maheshbabu
Robot          Rajnikant

Graph:
(ACTOR {Name: Prabhas, Movie Language: Telugu})-[LEAD_ACTOR]-(MOVIE {Name: Bahubali})
(ACTOR {Name: Rajnikant, Movie Language: Tamil})-[LEAD_ACTOR]-(MOVIE {Name: Robot})
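
The mapping rules (table to label, row to node, column to property, foreign key to relationship) can be tried out on the two example tables. This is a minimal in-memory sketch: the dictionary layout of nodes, the function name `tables_to_graph`, and the choice to point LEAD_ACTOR from the actor to the movie are assumptions, since the slide does not show a direction.

```python
actor_rows = [
    {"Name": "Rajnikant", "Movies Language": "Tamil"},
    {"Name": "Maheshbabu", "Movies Language": "Telugu"},
    {"Name": "Vijay", "Movies Language": "Tamil"},
    {"Name": "Prabhas", "Movies Language": "Telugu"},
]
movie_rows = [
    {"Name": "Bahubali", "Lead Actor": "Prabhas"},
    {"Name": "Puli", "Lead Actor": "Vijay"},
    {"Name": "Shrimanthudu", "Lead Actor": "Maheshbabu"},
    {"Name": "Robot", "Lead Actor": "Rajnikant"},
]


def tables_to_graph(actors, movies):
    """Turn the two tables into labelled nodes and relationships."""
    nodes = []           # position in this list serves as the node id
    relationships = []   # (start node id, type, end node id)
    actor_id_by_name = {}

    # Table -> label, row -> node, columns -> node properties.
    for row in actors:
        actor_id_by_name[row["Name"]] = len(nodes)
        nodes.append({"label": "ACTOR", "properties": dict(row)})

    for row in movies:
        movie_id = len(nodes)
        nodes.append({"label": "MOVIE", "properties": {"Name": row["Name"]}})
        # Foreign key column -> relationship (direction is an assumption).
        actor_id = actor_id_by_name[row["Lead Actor"]]
        relationships.append((actor_id, "LEAD_ACTOR", movie_id))

    return nodes, relationships


graph_nodes, graph_relationships = tables_to_graph(actor_rows, movie_rows)
```

Note that the "Lead Actor" column disappears from the movie node: once the relationship exists, the foreign key value no longer needs to be stored as a property.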

How to Query a Graph Database?
• Graph Query Language
– Cypher
– Gremlin

Cypher Query Language
• Declarative
• SQL-inspired
• Pattern based
[Figure: Ramesh -FRIEND-> Suresh]
(Ramesh:PERSON)-[connect:FRIEND]->(Suresh:PERSON)

Cypher: Getting Started
Structure:
• Similar to SQL
• Most common clauses:
– MATCH: the graph pattern for matching
– WHERE: add constraints or filters
– RETURN: what to return

CRUD Operations
MATCH:
• MATCH (n) RETURN n
• MATCH (movie:Movie) RETURN movie
• MATCH (movie:Movie { title: 'Bahubali' }) RETURN movie
• MATCH (director { name:'Rajamouli' })--(movie) RETURN movie.title
• MATCH (raj:Person { name:'Rajamouli'})--(movie:Movie) RETURN movie
• MATCH (raj:Person { name:'Rajamouli'})-->(movie:Movie) RETURN movie
• MATCH (raj:Person { name:'Rajamouli'})<--(movie:Movie) RETURN movie
• MATCH (raj:Person { name:'Rajamouli'})-[:DIRECTED]->(movie:Movie) RETURN movie

CRUD Operations
WHERE:
• MATCH (n)
WHERE n:Movie
RETURN n
• MATCH (n)
WHERE n.name <> 'Prabhas'
RETURN n

CRUD Operations
Let's clean the database:
MATCH (n)
OPTIONAL MATCH (n)-[r]-()
DELETE n,r

CRUD Operations
CREATE:
Node:
• CREATE (n)
• CREATE (n),(m)
• CREATE (n:Person)
• CREATE (n:Person:Swedish)
• CREATE (n:Person { name : 'Andres', title : 'Developer' })
• CREATE (a:Person { name : 'Roman' }) RETURN a

CRUD Operations
CREATE:
Relationships:
• MATCH (a:Person),(b:Person)
WHERE a.name = 'Roman' AND b.name = 'Andres'
CREATE (a)-[r:RELTYPE]->(b)
RETURN r
• MATCH (a:Person),(b:Person)
WHERE a.name = 'Roman' AND b.name = 'Andres'
CREATE (a)-[r:RELTYPE { name : a.name + '<->' + b.name }]->(b)
RETURN r

CRUD Operations
CREATE:
Relationships:
• CREATE p =(andres { name:'Andres'}) - [:WORKS_AT] -> (neo)
<- [:WORKS_AT] - (michael { name:'Michael' })
RETURN p

CRUD Operations
UPDATE:
Labels and properties:
• MATCH (n:Person { name : 'Andres' }) SET n :Person:Coder
• MATCH (n:Person { name : 'Andres', title : 'Developer' }) SET n.title = 'Mang'

CRUD Operations
DELETE:
• MATCH (n:Person)
WHERE n.name = 'Andres'
DELETE n
• MATCH (n { name: 'Andres' })-[r]-()
DELETE n, r
• MATCH (n:Person)
DELETE n
• MATCH (n)
OPTIONAL MATCH (n)-[r]-()
DELETE n,r

Functions
Predicates:
• ALL(identifier in collection WHERE predicate)
• ANY(identifier in collection WHERE predicate)
• NONE(identifier in collection WHERE predicate)
• SINGLE(identifier in collection WHERE predicate)
• EXISTS( pattern-or-property )
Scalar Function:
• LENGTH( collection/pattern expression )
• TYPE( relationship )
• ID( property-container )
• COALESCE( expression [, expression]* )
• HEAD( expression )
• LAST( expression )
• TIMESTAMP()

Functions
Collection Function:
• NODES( path )
• RELATIONSHIPS( path )
• LABELS( node )
• FILTER(identifier in collection WHERE predicate)
• REDUCE( accumulator = initial, identifier in collection | expression )
Mathematical Function:
• ABS( expression )
• COS( expression )
• LOG( expression )
• ROUND( expression )
• SQRT( expression )

Use Case: Movie Recommendation*
Problem:
• We are running an IMDb-style website.
• We have a dataset that contains the movie ratings given by users.
• Our problem is to generate a list of movies to recommend to each individual user.
*http://neo4j.com/graphgist/a7c915c8-a3d6-43b9-8127-1836fecc6e2f

Use Case: Movie Recommendation

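Solution:
• We find pairs of people who have given similar ratings to the movies that both of them have watched.
• We then recommend movies that one person has not seen and the other has rated highly.
• A cosine similarity function calculates the similarity between users.
• k-Nearest Neighbors finds the most similar users.

• Cosine Similarity: the dot product of two users' rating vectors over the movies both have rated, divided by the product of the two vectors' lengths.
• K-NN: the k users with the highest similarity to the target user.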
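
The two building blocks can be written down in a few lines of Python. This is a sketch under assumptions: ratings are held in nested dictionaries (user to movie to rating), similarity is computed only over movies both users rated, and the names `cosine_similarity`, `nearest_neighbours` and the sample data are illustrative.

```python
import math


def cosine_similarity(ratings_a, ratings_b):
    """Cosine of the angle between two users' ratings on co-rated movies."""
    shared = set(ratings_a) & set(ratings_b)
    dot = sum(ratings_a[m] * ratings_b[m] for m in shared)
    length_a = math.sqrt(sum(ratings_a[m] ** 2 for m in shared))
    length_b = math.sqrt(sum(ratings_b[m] ** 2 for m in shared))
    if length_a == 0 or length_b == 0:
        return 0.0  # no overlap (or all-zero ratings): treat as dissimilar
    return dot / (length_a * length_b)


def nearest_neighbours(all_ratings, user, k):
    """Return the k users most similar to `user`, best first."""
    scored = [
        (other, cosine_similarity(all_ratings[user], all_ratings[other]))
        for other in all_ratings
        if other != user
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical ratings on a 1 to 5 scale.
sample_ratings = {
    "u1": {"m1": 4, "m2": 2},
    "u2": {"m1": 2, "m2": 1},   # same taste as u1, lower scale
    "u3": {"m1": 1, "m2": 5},   # opposite taste
    "u4": {"m9": 3},            # nothing in common with u1
}
neighbours_of_u1 = nearest_neighbours(sample_ratings, "u1", 2)
```

Because cosine similarity ignores vector length, `u2` counts as a perfect match for `u1` even though every rating is half as high.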

Query: Add Cosine Similarity
MATCH (p1:Person)-[x:RATED]->(m:Movie)<-[y:RATED]-(p2:Person)
WITH SUM(x.rating * y.rating) AS xyDotProduct,
  SQRT(REDUCE(xDot = 0.0, a IN COLLECT(x.rating) | xDot + a^2)) AS xLength,
  SQRT(REDUCE(yDot = 0.0, b IN COLLECT(y.rating) | yDot + b^2)) AS yLength,
  p1, p2
MERGE (p1)-[s:SIMILARITY]-(p2)
SET s.similarity = xyDotProduct / (xLength * yLength)

Query: See Who Your Nearest Neighbors Are by Similarity
MATCH (p1:Person {name:'Michael Sherman'})-[s:SIMILARITY]-(p2:Person)
WITH p2, s.similarity AS sim
ORDER BY sim DESC
LIMIT 5
RETURN p2.name AS Neighbor, sim AS Similarity

Use Case: Movie Recommendation (Contd.)
Query: Finally, the Recommendation
MATCH (b:Person)-[r:RATED]->(m:Movie), (b)-[s:SIMILARITY]-(a:Person {name:'Michael Sherman'})
WHERE NOT((a)-[:RATED]->(m))
WITH m, s.similarity AS similarity, r.rating AS rating
ORDER BY m.name, similarity DESC
WITH m.name AS movie, COLLECT(rating)[0..3] AS ratings
WITH movie, REDUCE(s = 0, i IN ratings | s + i)*1.0 / LENGTH(ratings) AS reco
ORDER BY reco DESC
RETURN movie AS Movie, reco AS Recommendation
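
The scoring step of this query can be mirrored in plain Python to make the logic easier to follow: for every movie the target user has not rated, take the ratings given by the three most similar users who did rate it, and average them. The function name `recommend`, the data layout and the sample values are assumptions for illustration; similarities are taken as already computed.

```python
def recommend(all_ratings, similarity_to_user, user, top_n=3):
    """Score unseen movies by the mean rating of the most similar raters.

    all_ratings: dict user -> {movie: rating}
    similarity_to_user: dict other user -> similarity with `user`
    """
    seen = all_ratings[user]
    ratings_by_movie = {}
    for other, similarity in similarity_to_user.items():
        for movie, rating in all_ratings.get(other, {}).items():
            if movie not in seen:                      # WHERE NOT rated by user
                ratings_by_movie.setdefault(movie, []).append((similarity, rating))

    scores = {}
    for movie, pairs in ratings_by_movie.items():
        pairs.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
        top = [rating for _, rating in pairs[:top_n]]       # COLLECT(...)[0..3]
        scores[movie] = sum(top) / len(top)

    # Best recommendation first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: "me" has only rated movie A.
demo_ratings = {
    "me": {"A": 5},
    "u1": {"A": 5, "B": 4, "C": 1},
    "u2": {"A": 4, "B": 2},
    "u3": {"B": 3, "C": 4},
    "u4": {"B": 1},
}
demo_similarity = {"u1": 0.9, "u2": 0.8, "u3": 0.5, "u4": 0.1}
recommendations = recommend(demo_ratings, demo_similarity, "me")
```

Limiting the average to the closest raters keeps a distant neighbour such as `u4` from dragging a movie's score down.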

Why is a Real-Time Recommendation Engine Important?
• Example of E-Commerce

Neo4j with Other technologies
• Data Import
– LOAD CSV
– Neo4j-import
• Graph Visualization
– Alistair Jones (Arrow)
– Alchemy.js (GraphJSON)
– Neo4j Browser
– Linkurious
– Keylines
– D3.js

Neo4j in Action
• Neo4j + Linkurious for Panama leaks

Conclusion
• The graph is an important data model for representing many real-world scenarios, because connected objects provide more information than isolated objects.
• The de facto big data technologies are inefficient for solving large-scale graph problems.
• Technologies designed to solve large-scale graph problems, both in real time and offline, are available.
• These graph technologies are mature enough to use in production.

nishant@serendio.com
Serendio provides Big Data Science Solutions &
Services for Data-Driven Enterprises.
Learn more at:
serendio.com/index.php/case-studies
Thank You!
