# Processing Large Graphs

Guest lecture on processing large-scale graphs, given at Sathyabama University.

### Processing Large Graphs

1. Processing Large Graphs (www.serendio.com)
2. Contents: • Introduction • Graph Processing with MapReduce • Graph Processing with Apache Giraph • Graph Processing with Neo4j • Conclusion
3. What is a graph? A set of objects connected by links. (Figure: seven vertices, V1 to V7, joined by edges.)
4. Why are graphs important in the Big Data world?
5. Application Domains • Recommendation Systems • Fraud Detection • Complex Network Analysis • Graph-based Search • Network & IT Operations • Master Data Management • Many more…
6. Graph Processing with MapReduce
7. MapReduce
8. MapReduce: Word Count
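
The word-count slide is a figure only, so here is a minimal single-process sketch of the map, shuffle, and reduce phases in Python. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions and are not Hadoop APIs; a real job would distribute each phase across machines.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


counts = word_count(["the graph is large", "the graph is sparse"])
```

Each word becomes a key, so all of its counts land in the same reducer; that same "group by key" step is what the graph algorithms on the following slides rely on.
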
9. MapReduce: Finding Triangles. Problem: enumerate every 3-cycle subgraph (triangle) of a given graph.
10. MapReduce: Finding Triangles • In the first map operation, the mapper records each edge under whichever of its two vertices has the lower degree. • The key of the incoming records doesn't matter.
11. MapReduce: Finding Triangles
12. MapReduce: Finding Triangles • The second map brings together the edge records and the open-triad records. • In the process, it rekeys the edge records so that both record types are binned under the pair of vertices they connect.
13. MapReduce: Finding Triangles • In the second reduce, each bin contains at most one edge record and some number of triad records (perhaps none). • For every combination of edge record and triad record in a bin, the reducer emits a triangle record. The output key isn't significant.
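
The two rounds described above can be sketched in plain Python. This is a single-process illustration under stated assumptions, not the lecture's code: the first reduce (shown only as a figure on the slides) is assumed to emit one open triad per pair of neighbours in a bin, degree ties are assumed to be broken by vertex id, and the input is assumed to be a list of unique undirected edges without self-loops. `enumerate_triangles` is a hypothetical name.

```python
from collections import defaultdict
from itertools import combinations


def enumerate_triangles(edges):
    """Return the set of triangles as sorted vertex triples."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    def rank(v):
        # Assumption: ties between equal degrees are broken by vertex id.
        return (degree[v], v)

    # Round 1 map: record each edge under its lower-degree endpoint.
    bins = defaultdict(list)
    for u, v in edges:
        low, high = sorted((u, v), key=rank)
        bins[low].append(high)

    # Round 1 reduce (assumed): every pair of neighbours in a bin is an
    # open triad that one more edge would close into a triangle.
    triads = []
    for apex, neighbours in bins.items():
        for a, b in combinations(neighbours, 2):
            triads.append((apex, a, b))

    # Round 2 map: key edges and triads by the vertex pair they connect.
    second = defaultdict(lambda: {"edge": False, "apexes": []})
    for u, v in edges:
        second[frozenset((u, v))]["edge"] = True
    for apex, a, b in triads:
        second[frozenset((a, b))]["apexes"].append(apex)

    # Round 2 reduce: an edge record plus a triad record is a triangle.
    triangles = set()
    for pair, record in second.items():
        if record["edge"]:
            for apex in record["apexes"]:
                triangles.add(tuple(sorted((apex, *pair))))
    return triangles


sample_triangles = enumerate_triangles([(1, 2), (2, 3), (1, 3), (3, 4)])
```

Binning each edge under its lower-ranked endpoint means every triangle is discovered exactly once, at its lowest-ranked vertex, and keeps the bins of high-degree vertices small.
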
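14. MapReduce: Finding Connected Components. Problem: find the connected components of a given graph. (Figure: example graph with vertices 1 to 6.)
15. MapReduce: Finding Connected Components
16. MapReduce: Finding Connected Components. (Figure: example graph with vertices 1 to 6.) Key/value table before the iteration: 1 → {1,2}; 2 → {1,2,3,4}; 3 → {2,3}; 4 → {2,4,5}; 5 → {4,5,6}; 6 → {5,6}. Key/value table after the iteration: 1 → {1,2,3,4}; 2 → {1,2,3,4,5}; 3 → {1,2}; 4 → {1,2,4,5,6}; 5 → {4,5,6}; 6 → {4,5}.
17. MapReduce: Finding Connected Components. Key/value table after the next iteration: 1 → {1,2,3,4,5,6}; 2 → {1}; 3 → {1}; 4 → {1,5,6}; 5 → {1,4}; 6 → {1,4}. Final key/value table: 1 → {1,2,3,4,5,6}; 2 → {1}; 3 → {1}; 4 → {1}; 5 → {1}; 6 → {1}.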
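
To make the iteration concrete, here is a hedged Python sketch of connected components by minimum-label propagation. It is a simpler variant than the table-based scheme on the slides (which keeps whole vertex sets per key), and `connected_components` is a hypothetical name. The edge list is an assumption read off the first key/value table above.

```python
def connected_components(vertices, edges):
    """Label every vertex with the smallest vertex id in its component."""
    label = {v: v for v in vertices}
    changed = True
    while changed:  # each pass corresponds to one MapReduce job
        changed = False
        # Map: every vertex sends its current label to its neighbours.
        inbox = {v: [label[v]] for v in vertices}
        for u, v in edges:
            inbox[u].append(label[v])
            inbox[v].append(label[u])
        # Reduce: every vertex keeps the smallest label it has seen.
        for v, received in inbox.items():
            smallest = min(received)
            if smallest < label[v]:
                label[v] = smallest
                changed = True
    return label


# Edges assumed from the adjacency lists in the first table of the slides.
slide_edges = [(1, 2), (2, 3), (2, 4), (4, 5), (5, 6)]
labels = connected_components(range(1, 7), slide_edges)
```

The loop runs until no label changes, so the number of jobs grows with the graph's diameter; this repeated job launching is the cost that the next slide raises against MapReduce.
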
18. Why not MapReduce for graphs? Problems with MapReduce graph algorithms: • Iterative MR jobs • High I/O • Not an intuitive fit for graph algorithms
19. Graph-based technologies in the Big Data/NoSQL domain • Database (storage & traversal): Neo4j, TitanDB, OrientDB • Computation engines: Apache Giraph, GraphLab, Apache Spark (Graph ML/GraphX)
20. Data management tools with respect to graphs • Analytics on tuples: MapReduce, Hive, Spark, Pig • Databases for tuples: Cassandra, MySQL, HBase • Analytics on graphs: Giraph, GraphX, GraphLab • Databases for graphs: Neo4j, OrientDB
21. Graph Processing Engine with Apache Giraph
22. Apache Giraph: Graph Processing System • In-memory computation • Inspired by Google Pregel • Vertex-centric, high-level programming model • Batch-oriented processing • Based on Valiant's Bulk Synchronous Parallel (BSP) model
23. Bulk Synchronous Parallel Model
24. Apache Giraph Model • Each vertex has: a vertex identifier and a variable (its value). • Each directed edge has: a source vertex identifier, a target vertex identifier, and a variable (its value). • A computation consists of: input, supersteps separated by global synchronization points, algorithm termination, and output. • All vertices compute in parallel, each running the same user-defined function.
25. Supersteps
26. Vertex State Transition • Algorithm termination: every vertex is in the inactive state and no messages are generated for the next superstep.
27. Architecture of Giraph on the Hadoop Stack
28. Problem: Find the Maximum Number
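
The maximum-value problem is a convenient way to see supersteps, messages, and vote-to-halt working together. The following Python sketch simulates the model in a single process; it is not Giraph code, `propagate_max` is a hypothetical name, and it assumes every edge target also appears as a key in `values`.

```python
def propagate_max(values, out_edges):
    """Spread the largest vertex value to every reachable vertex."""
    value = dict(values)
    inbox = {v: [] for v in value}
    halted = set()
    superstep = 0
    # Termination rule from the slides: stop once every vertex is
    # inactive and no messages are waiting for the next superstep.
    while len(halted) < len(value) or any(inbox.values()):
        outbox = {v: [] for v in value}
        for v in value:
            messages = inbox[v]
            if v in halted and not messages:
                continue  # an inactive vertex without mail stays asleep
            halted.discard(v)  # an incoming message reactivates a vertex
            best = max([value[v]] + messages)
            if superstep == 0 or best > value[v]:
                value[v] = best
                for target in out_edges.get(v, []):
                    outbox[target].append(best)
            halted.add(v)  # vote to halt until new mail arrives
        inbox = outbox  # global synchronization point between supersteps
        superstep += 1
    return value


chain = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
maxima = propagate_max({1: 3, 2: 6, 3: 2, 4: 1}, chain)
```

Swapping `inbox` for `outbox` only after every vertex has run is what makes the loop bulk synchronous: messages sent in one superstep become visible only in the next.
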
29. Problem: Simple Shortest Path. (Figure: weighted example graph with a marked source vertex.)
30. Problem: Simple Shortest Path. `public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex, Iterable<DoubleWritable> messages) { if (getSuperstep() == 0) { vertex.setValue(new DoubleWritable(Double.MAX_VALUE)); } double minDist = isSource(vertex) ? 0d : Double.MAX_VALUE; for (DoubleWritable message : messages) { minDist = Math.min(minDist, message.get()); } if (minDist < vertex.getValue().get()) { vertex.setValue(new DoubleWritable(minDist)); for (Edge<LongWritable, FloatWritable> edge : vertex.getEdges()) { double distance = minDist + edge.getValue().get(); sendMessage(edge.getTargetVertexId(), new DoubleWritable(distance)); } } vertex.voteToHalt(); }`
31. Problem: Simple Shortest Path. (Figure: the same example graph after the computation.)
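
For readers without a Giraph cluster, the same vertex program can be simulated in a few lines of Python. This is a sketch under assumptions: `shortest_paths` is a hypothetical name, the example graph is invented (the slide's figure is not recoverable from the transcript), and edge weights are assumed to be non-negative.

```python
import math


def shortest_paths(edges, source):
    """edges maps each vertex to a list of (target, weight) pairs."""
    vertices = set(edges) | {t for outs in edges.values() for t, _ in outs}
    dist = {v: math.inf for v in vertices}
    inbox = {v: [] for v in vertices}
    superstep = 0
    while superstep == 0 or any(inbox.values()):
        outbox = {v: [] for v in vertices}
        for v in vertices:
            if superstep > 0 and not inbox[v]:
                continue  # halted vertex with no incoming messages
            # Same logic as compute(): the source starts at 0, and every
            # vertex takes the minimum over its incoming messages.
            min_dist = 0.0 if v == source else math.inf
            min_dist = min([min_dist] + inbox[v])
            if min_dist < dist[v]:
                dist[v] = min_dist
                for target, weight in edges.get(v, []):
                    outbox[target].append(min_dist + weight)
        inbox = outbox
        superstep += 1
    return dist


# Hypothetical example graph; "e" cannot be reached from the source.
example = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2), ("d", 5)],
    "c": [("d", 1)],
    "d": [],
    "e": [("a", 1)],
}
distances = shortest_paths(example, "a")
```

A vertex sends messages only when its distance improves, so traffic dies out on its own and the computation ends when a superstep produces no messages.
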
32. Graph Database with Neo4j
33. Introduction to NoSQL. NoSQL = Not Only SQL
34. Graph Databases • A database that follows a graph structure • Each node knows its adjacent nodes • As the number of nodes increases, the cost of a local step remains the same • Indexes are used for lookups • Optimized for traversing connected data
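
The point that "each node knows its adjacent nodes" can be illustrated with a small in-memory sketch. It is only an analogy for how a graph database traverses data, not Neo4j's storage format; the `Node` class and `friends_of_friends` function are hypothetical names.

```python
class Node:
    """A node that holds direct references to its neighbours."""

    def __init__(self, name):
        self.name = name
        self.neighbours = []

    def connect(self, other):
        """Create an undirected link between two nodes."""
        self.neighbours.append(other)
        other.neighbours.append(self)


def friends_of_friends(start):
    """Two local steps from start, excluding start and its direct friends."""
    direct = set(start.neighbours)
    result = set()
    for friend in direct:
        for candidate in friend.neighbours:
            if candidate is not start and candidate not in direct:
                result.add(candidate)
    return result


a, b, c, d = Node("a"), Node("b"), Node("c"), Node("d")
a.connect(b)
b.connect(c)
a.connect(d)
fof_names = {node.name for node in friends_of_friends(a)}
```

The work done by `friends_of_friends` depends only on how many neighbours the visited nodes have, not on how many nodes exist in total, which is the property the slide describes as a constant-cost local step.
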
35. Graph Databases: Model. (Figure: nodes and relationships, each carrying key-value properties such as Key1: Value 1 and Key2: Value 2.)
36. Introduction to NoSQL (source: http://db-engines.com)
37. Neo4j • Graph database from Neo Technology • A schema-free, labeled property graph database with a Lucene index • Well suited to complex, highly connected data • Reliable, with real ACID transactions • Scalable: billions of nodes and relationships; scales out with a highly available Neo4j cluster • Runs as a server with a REST API or embedded • Declarative query language (Cypher)
38. Neo4j: Strengths & Weaknesses. Strengths: • Powerful data model • Whiteboard friendly • Fast for connected data • Easy to query. Weaknesses: • Requires a conceptual shift (graph-like thinking)
39. Four Building Blocks • Nodes • Relationships • Properties • Labels. Example: (:USER)-[:RELATIVE]->(:PET), where the USER node has Name: Apple and Age: 25, the PET node has Name: Mike and Animal: Dog, and the relationship has Relation: Owner.
40. SQL to Graph DB: Data Model Transformation • Table → type of node (label) • Rows of a table → nodes • Columns of a table → node properties • Foreign keys, joins → relationships
41. SQL to Graph DB: Data Model Transformation. Table Actor (Name, Movie Language): Rajnikant, Tamil; Maheshbabu, Telugu; Vijay, Tamil; Prabhas, Telugu. Table Movie (Name, Lead Actor): Bahubali, Prabhas; Puli, Vijay; Shrimanthudu, Maheshbabu; Robot, Rajnikant. Resulting graph (excerpt): ACTOR node {Name: Prabhas, Movie Language: Telugu} linked by LEAD_ACTOR to MOVIE node {Name: Bahubali}; ACTOR node {Name: Rajnikant, Movie Language: Tamil} linked by LEAD_ACTOR to MOVIE node {Name: Robot}.
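
The table-to-graph mapping above can be sketched as a small Python transformation. The function name `tables_to_graph`, the dictionary layout, and the relationship direction (from movie to actor) are assumptions made for illustration; the slides do not fix a direction for LEAD_ACTOR.

```python
def tables_to_graph(actor_rows, movie_rows):
    """Turn rows into nodes and the foreign key into relationships."""
    nodes = []
    relationships = []
    for row in actor_rows:
        # Table name becomes the label, columns become properties.
        nodes.append({"label": "ACTOR", "properties": dict(row)})
    for row in movie_rows:
        nodes.append({"label": "MOVIE", "properties": {"name": row["name"]}})
        # The lead_actor foreign key becomes a LEAD_ACTOR relationship
        # (direction movie -> actor is an assumption).
        relationships.append((row["name"], "LEAD_ACTOR", row["lead_actor"]))
    return nodes, relationships


actor_table = [
    {"name": "Rajnikant", "movie_language": "Tamil"},
    {"name": "Maheshbabu", "movie_language": "Telugu"},
    {"name": "Vijay", "movie_language": "Tamil"},
    {"name": "Prabhas", "movie_language": "Telugu"},
]
movie_table = [
    {"name": "Bahubali", "lead_actor": "Prabhas"},
    {"name": "Puli", "lead_actor": "Vijay"},
    {"name": "Shrimanthudu", "lead_actor": "Maheshbabu"},
    {"name": "Robot", "lead_actor": "Rajnikant"},
]
graph_nodes, graph_relationships = tables_to_graph(actor_table, movie_table)
```

Note that the foreign-key column disappears from the MOVIE node's properties: in the graph model the connection lives in the relationship, so no join is needed at query time.
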
42. How to query a graph database? • Graph query languages – Cypher – Gremlin
43. How to query a graph database? • Graph query languages – Cypher – Gremlin
44. Cypher Query Language • Declarative • SQL-inspired • Pattern based. Example: Ramesh is a FRIEND of Suresh, written as (Ramesh:PERSON)-[connect:FRIEND]->(Suresh:PERSON)
45. Cypher: Getting Started. Structure: • Similar to SQL • Most common clauses: – MATCH: the graph pattern to match – WHERE: adds constraints or filters – RETURN: what to return
46. CRUD Operations. MATCH: • MATCH (n) RETURN n • MATCH (movie:Movie) RETURN movie • MATCH (movie:Movie { title: 'Bahubali' }) RETURN movie • MATCH (director { name:'Rajamouli' })--(movie) RETURN movie.title • MATCH (raj:Person { name:'Rajamouli'})--(movie:Movie) RETURN movie • MATCH (raj:Person { name:'Rajamouli'})-->(movie:Movie) RETURN movie • MATCH (raj:Person { name:'Rajamouli'})<--(movie:Movie) RETURN movie • MATCH (raj:Person { name:'Rajamouli'})-[:DIRECTED]->(movie:Movie) RETURN movie
47. CRUD Operations. WHERE: • MATCH (n) WHERE n:Movie RETURN n • MATCH (n) WHERE n.name <> 'Prabhas' RETURN n
48. CRUD Operations. Let's clean the database: MATCH (n) OPTIONAL MATCH (n)-[r]-() DELETE n,r
49. CRUD Operations. CREATE, nodes: • CREATE (n) • CREATE (n),(m) • CREATE (n:Person) • CREATE (n:Person:Swedish) • CREATE (n:Person { name : 'Andres', title : 'Developer' }) • CREATE (a:Person { name : 'Roman' }) RETURN a
50. CRUD Operations. CREATE, relationships: • MATCH (a:Person),(b:Person) WHERE a.name = 'Roman' AND b.name = 'Andres' CREATE (a)-[r:RELTYPE]->(b) RETURN r • MATCH (a:Person),(b:Person) WHERE a.name = 'Roman' AND b.name = 'Andres' CREATE (a)-[r:RELTYPE { name : a.name + '<->' + b.name }]->(b) RETURN r
51. CRUD Operations. CREATE, relationships (a full path): • CREATE p = (andres { name:'Andres' })-[:WORKS_AT]->(neo)<-[:WORKS_AT]-(michael { name:'Michael' }) RETURN p
52. CRUD Operations. UPDATE, labels and properties: • MATCH (n:Person { name : 'Andres' }) SET n :Person:Coder • MATCH (n:Person { name : 'Andres', title : 'Developer' }) SET n.title = 'Mang'
53. CRUD Operations. DELETE: • MATCH (n:Person) WHERE n.name = 'Andres' DELETE n • MATCH (n { name: 'Andres' })-[r]-() DELETE n, r • MATCH (n:Person) DELETE n • MATCH (n) OPTIONAL MATCH (n)-[r]-() DELETE n,r
54. Functions. Predicates: • ALL(identifier in collection WHERE predicate) • ANY(identifier in collection WHERE predicate) • NONE(identifier in collection WHERE predicate) • SINGLE(identifier in collection WHERE predicate) • EXISTS( pattern-or-property ). Scalar functions: • LENGTH( collection/pattern expression ) • TYPE( relationship ) • ID( property-container ) • COALESCE( expression [, expression]* ) • HEAD( expression ) • LAST( expression ) • TIMESTAMP()
55. Functions. Collection functions: • NODES( path ) • RELATIONSHIPS( path ) • LABELS( node ) • FILTER(identifier in collection WHERE predicate) • REDUCE( accumulator = initial, identifier in collection | expression ). Mathematical functions: • ABS( expression ) • COS( expression ) • LOG( expression ) • ROUND( expression ) • SQRT( expression )
56. Use Case: Movie Recommendation*. Problem: • We run an IMDB-style website. • We have a dataset containing the movie ratings given by users. • Our task is to generate, for each individual user, a list of recommended movies. *http://neo4j.com/graphgist/a7c915c8-a3d6-43b9-8127-1836fecc6e2f
57. Use Case: Movie Recommendation
58. Use Case: Movie Recommendation. Solution: • Find pairs of people who have given similar ratings to the movies that both of them have watched. • Then recommend the movies that one person has not seen and the other has rated highly. • Use the cosine similarity function to calculate the similarity between users. • Use k-nearest neighbors (k-NN) to find similar users.
59. Use Case: Movie Recommendation • Cosine similarity and k-NN (the formulas are shown as images on the slide).
60. Use Case: Movie Recommendation. Query: add cosine similarity. MATCH (p1:Person)-[x:RATED]->(m:Movie)<-[y:RATED]-(p2:Person) WITH SUM(x.rating * y.rating) AS xyDotProduct, SQRT(REDUCE(xDot = 0.0, a IN COLLECT(x.rating) | xDot + a^2)) AS xLength, SQRT(REDUCE(yDot = 0.0, b IN COLLECT(y.rating) | yDot + b^2)) AS yLength, p1, p2 MERGE (p1)-[s:SIMILARITY]-(p2) SET s.similarity = xyDotProduct / (xLength * yLength)
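
The arithmetic inside the query can be checked outside the database with a short Python sketch. It mirrors what the Cypher pattern computes: the dot product of the two users' ratings divided by the product of the rating-vector lengths, taken only over movies both users have rated. The function name and the sample ratings are hypothetical.

```python
import math


def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity over the movies rated by both users."""
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return None  # no co-rated movie, so no SIMILARITY edge
    dot = sum(ratings_a[m] * ratings_b[m] for m in shared)
    length_a = math.sqrt(sum(ratings_a[m] ** 2 for m in shared))
    length_b = math.sqrt(sum(ratings_b[m] ** 2 for m in shared))
    if length_a == 0 or length_b == 0:
        return None  # all-zero ratings would divide by zero
    return dot / (length_a * length_b)


# Hypothetical ratings: user_b rates every shared movie at half user_a's score.
user_a = {"M1": 4, "M2": 2}
user_b = {"M1": 2, "M2": 1, "M3": 5}
similarity_ab = cosine_similarity(user_a, user_b)
```

Because the lengths are computed over shared movies only, two users whose ratings are proportional on their common movies score 1.0 even if one of them has rated many more films.
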
61. Use Case: Movie Recommendation
62. Use Case: Movie Recommendation. Query: see who your nearest neighbors are by similarity. MATCH (p1:Person {name:'Michael Sherman'})-[s:SIMILARITY]-(p2:Person) WITH p2, s.similarity AS sim ORDER BY sim DESC LIMIT 5 RETURN p2.name AS Neighbor, sim AS Similarity
63. Use Case: Movie Recommendation (continued). Query: the final recommendation. MATCH (b:Person)-[r:RATED]->(m:Movie), (b)-[s:SIMILARITY]-(a:Person {name:'Michael Sherman'}) WHERE NOT((a)-[:RATED]->(m)) WITH m, s.similarity AS similarity, r.rating AS rating ORDER BY m.name, similarity DESC WITH m.name AS movie, COLLECT(rating)[0..3] AS ratings WITH movie, REDUCE(s = 0, i IN ratings | s + i)*1.0 / LENGTH(ratings) AS reco ORDER BY reco DESC RETURN movie AS Movie, reco AS Recommendation
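
The recommendation query can be read as: for every movie the target user has not rated, take the ratings from the (up to) three most similar neighbours who did rate it, and average them. A hedged Python sketch of that logic follows; `recommend`, its parameters, and the sample data are hypothetical, and `similarity` is assumed to hold only the other users' similarity to the target.

```python
def recommend(target, ratings, similarity, k=3):
    """Rank unseen movies by the mean rating of the k most similar raters."""
    seen = set(ratings.get(target, {}))
    candidates = {}
    for person, sim in similarity.items():
        for movie, rating in ratings.get(person, {}).items():
            if movie not in seen:  # WHERE NOT (a)-[:RATED]->(m)
                candidates.setdefault(movie, []).append((sim, rating))
    scores = {}
    for movie, pairs in candidates.items():
        pairs.sort(key=lambda pair: pair[0], reverse=True)  # similarity DESC
        top = [rating for _, rating in pairs[:k]]  # COLLECT(rating)[0..3]
        scores[movie] = sum(top) / len(top)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data for illustration only.
sample_ratings = {
    "me": {"A": 5},
    "u1": {"A": 5, "B": 4, "C": 2},
    "u2": {"A": 4, "B": 5},
    "u3": {"B": 1, "C": 3},
    "u4": {"B": 2},
}
sample_similarity = {"u1": 0.9, "u2": 0.8, "u3": 0.5, "u4": 0.1}
recommendations = recommend("me", sample_ratings, sample_similarity)
```

In the sample data, movie B is rated by four neighbours but only the three most similar ones count, so the low-similarity user `u4` does not influence the score.
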
64. Why is a real-time recommendation engine important? • Example from e-commerce
65. Neo4j with Other Technologies • Data import – LOAD CSV – Neo4j-import • Graph visualization – Alistair Jones (Arrows) – Alchemy.js (GraphJSON) – Neo4j Browser – Linkurious – KeyLines – D3.js
66. Neo4j in Action • Neo4j + Linkurious for the Panama Papers leak
67. Conclusion • The graph is an important data model for representing many real-world scenarios, because connected objects provide more information than isolated objects. • The de facto big data technologies are inefficient for solving large-scale graph problems. • Technologies designed to solve large-scale graph problems, both in real time and offline, are available. • These graph technologies are mature enough to use in production.
68. Thank you! nishant@serendio.com. Serendio provides Big Data Science Solutions & Services for Data-Driven Enterprises. Learn more at: serendio.com/index.php/case-studies