Processing Large Graphs

  1. Processing Large Graphs (www.serendio.com)
  2. Content
      • Introduction
      • Graph Processing with MapReduce
      • Graph Processing with Apache Giraph
      • Graph Processing with Neo4j
      • Conclusion
  3. What Is a Graph?
      A set of objects (vertices) connected by links (edges).
      [Figure: example graph with vertices V1 to V7]
  4. Why Are Graphs Important in the Big Data World?
  5. Application Domains
      • Recommendation systems
      • Fraud detection
      • Complex network analysis
      • Graph-based search
      • Network & IT operations
      • Master data management
      • Many more…
  6. Graph Processing with MapReduce
  7. MapReduce
  8. MapReduce: Word Count
  9. MapReduce: Finding Triangles
      Problem: enumerate the 3-cycle subgraphs (triangles) of a given graph.
  10. MapReduce: Finding Triangles
      • In the first map operation for enumerating triangles, the mapper records each edge under its endpoint with the lower degree.
      • The key of the incoming records does not matter.
  11. MapReduce: Finding Triangles
  12. MapReduce: Finding Triangles
      • The second map for enumerating triangles brings together the edge records and the open-triad records.
      • In the process, it rekeys the edge records so that both record types are binned under the pair of vertices they connect.
  13. MapReduce: Finding Triangles
      • In the second reduce, each bin contains at most one edge record and some number of triad records (perhaps none).
      • For every combination of edge record and triad record in a bin, the reduce emits a triangle record. The output key is not significant.
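The two rounds described on slides 10 to 13 can be imitated in memory to see how the bins fit together. The following is a minimal Python sketch, not the deck's implementation: the function name `enumerate_triangles` is hypothetical, vertex ids are assumed to be comparable (for example integers), and breaking degree ties by vertex id is an assumption added so that each triangle is reported exactly once.

```python
from collections import defaultdict
from itertools import combinations


def enumerate_triangles(edges):
    """Imitate the two map/reduce rounds in memory (hypothetical helper)."""
    # Normalise undirected edges to sorted pairs and drop duplicates.
    edges = {tuple(sorted(e)) for e in edges}

    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    def rank(v):
        # Degree first; the vertex id tie-break is an assumption of this sketch.
        return (degree[v], v)

    # Round 1 map: record each edge under its lower-ranked endpoint.
    bins = defaultdict(list)
    for u, v in edges:
        low, high = (u, v) if rank(u) < rank(v) else (v, u)
        bins[low].append(high)

    # Round 1 reduce: every pair of edges in a bin forms an open triad.
    # Round 2 map: key each triad by the pair of vertices it would connect.
    triads = defaultdict(list)
    for apex, others in bins.items():
        for a, b in combinations(sorted(others), 2):
            triads[(a, b)].append(apex)

    # Round 2 reduce: an edge record in the same bin closes each triad.
    triangles = []
    for pair, apexes in triads.items():
        if pair in edges:
            for apex in apexes:
                triangles.append(tuple(sorted((apex, *pair))))
    return sorted(triangles)
```

Binning by the lower-degree endpoint keeps the bins of high-degree vertices small, which is what limits the number of open triads generated in the first reduce.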
  14. MapReduce: Finding Connected Components
      Problem: find the connected components of a given graph.
      [Figure: example graph with vertices 1 to 6]
  15. MapReduce: Finding Connected Components
  16. MapReduce: Finding Connected Components
      [Figure: two drawings of the example graph with vertices 1 to 6]

      K | V (first table) | V (second table)
      1 | 1,2             | 1,2,3,4
      2 | 1,2,3,4         | 1,2,3,4,5
      3 | 2,3             | 1,2
      4 | 2,4,5           | 1,2,4,5,6
      5 | 4,5,6           | 4,5,6
      6 | 5,6             | 4,5
  17. MapReduce: Finding Connected Components
      [Figure: two drawings of the example graph with updated vertex labels]

      K | V (first table) | V (second table)
      1 | 1,2,3,4,5,6     | 1,2,3,4,5,6
      2 | 1               | 1
      3 | 1               | 1
      4 | 1,5,6           | 1
      5 | 1,4             | 1
      6 | 1,4             | 1
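The iterative nature of connected components can be sketched with a simpler technique than the one traced in the slide tables: plain minimum-label propagation, where each round stands in for one MapReduce job. The function name `connected_components` is hypothetical, and the adjacency lists below are read from the first key/value table on slide 16 with the self-entries dropped.

```python
def connected_components(adjacency):
    """Minimum-label propagation; one loop pass stands in for one MR job."""
    labels = {v: v for v in adjacency}
    rounds = 0
    changed = True
    while changed:
        changed = False
        rounds += 1
        new_labels = {}
        for v, neighbours in adjacency.items():
            # "Map": neighbours offer their labels to v.
            # "Reduce": v keeps the smallest label it has seen.
            best = min([labels[v]] + [labels[n] for n in neighbours])
            new_labels[v] = best
            changed = changed or best != labels[v]
        labels = new_labels
    return labels, rounds


# Example graph from the slides: edges 1-2, 2-3, 2-4, 4-5, 5-6.
slide_graph = {1: [2], 2: [1, 3, 4], 3: [2], 4: [2, 5], 5: [4, 6], 6: [5]}
labels, rounds = connected_components(slide_graph)
```

Every vertex ends up labelled with the smallest vertex id in its component. The number of rounds grows with the graph's diameter, and in a real MapReduce setting each round is a separate job that rereads and rewrites the data.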
  18. Why Not MapReduce for Graphs?
      Problems with MapReduce graph algorithms:
      • Iterative algorithms need a chain of MR jobs
      • High I/O
      • Not intuitive for graph algorithms
  19. Graph-Based Technologies in the Big Data/NoSQL Domain
      • Databases (storage & traversal): Neo4j, TitanDB, OrientDB
      • Computation engines: Apache Giraph, GraphLab, Apache Spark Graph ML/GraphX
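  20. Data Management Tools with Respect to Graphs
      [Chart: tools arranged by workload (analytics vs. database) and data model (tuples vs. graphs)]
      • Analytics, tuples: MapReduce, Hive, Spark, Pig
      • Database, tuples: Cassandra, MySQL, HBase
      • Analytics, graphs: Giraph, GraphX, GraphLab
      • Database, graphs: Neo4j, OrientDB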
  21. Graph Processing Engine with Apache Giraph
  22. Apache Giraph: Graph Processing System
      • In-memory computation
      • Inspired by Google Pregel
      • Vertex-centric, high-level programming model
      • Batch-oriented processing
      • Based on Valiant's Bulk Synchronous Parallel (BSP) model
  23. Bulk Synchronous Parallel Model
  24. Apache Giraph Model
      • Each vertex has
          – a vertex identifier
          – a variable
      • Each directed edge has
          – a source vertex identifier
          – a target vertex identifier
          – a variable
      • A computation consists of
          – input
          – supersteps separated by global synchronization points
          – algorithm termination
          – output
      • Every vertex computes in parallel, running the same user-defined function.
  25. Supersteps
  26. Vertex State Transition
      • Algorithm termination:
          – every vertex is in the inactive state
          – no messages are generated for the next superstep
  27. Architecture of Giraph on the Hadoop Stack
  28. Problem: Find Maximum Number
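The superstep model and the termination rule from slides 24 to 26 can be illustrated with the maximum-value problem. This is a minimal single-process Python simulation, not Giraph code: the function name `propagate_max`, the dictionary-based inbox, and the sample values are all assumptions made for illustration.

```python
def propagate_max(values, neighbours):
    """Simulate BSP supersteps that spread the largest value to every vertex."""
    values = dict(values)

    # Superstep 0: every vertex is active and sends its value to its neighbours.
    inbox = {v: [] for v in values}
    for v in values:
        for n in neighbours[v]:
            inbox[n].append(values[v])
    superstep = 1

    # A halted vertex is woken only by an incoming message, so the run ends
    # when all vertices are inactive and no messages are in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in values}
        for v, messages in inbox.items():
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for n in neighbours[v]:
                    outbox[n].append(values[v])
            # Otherwise the vertex votes to halt and sends nothing.
        inbox = outbox  # global synchronization point between supersteps
        superstep += 1
    return values, superstep


# Hypothetical four-vertex chain a - b - c - d.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
result, supersteps = propagate_max({"a": 3, "b": 6, "c": 2, "d": 1}, chain)
```

Swapping `inbox` for `outbox` only after every vertex has run is what makes the model bulk synchronous: messages sent in one superstep become visible only in the next.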
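  29. Problem: Simple Shortest Path
      [Figure: weighted example graph with a marked source vertex]
  30. Problem: Simple Shortest Path

          // compute() of a Giraph BasicComputation subclass. The type parameters
          // follow Giraph's bundled shortest-paths example and are assumed here;
          // isSource() is a helper defined in the same class.
          @Override
          public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
                              Iterable<DoubleWritable> messages) throws IOException {
            if (getSuperstep() == 0) {
              vertex.setValue(new DoubleWritable(Double.MAX_VALUE));
            }
            // The source starts at distance 0, every other vertex at "infinity".
            double minDist = isSource(vertex) ? 0d : Double.MAX_VALUE;
            for (DoubleWritable message : messages) {
              minDist = Math.min(minDist, message.get());
            }
            if (minDist < vertex.getValue().get()) {
              // A shorter path was found: store it and offer it to the out-neighbours.
              vertex.setValue(new DoubleWritable(minDist));
              for (Edge<LongWritable, FloatWritable> edge : vertex.getEdges()) {
                double distance = minDist + edge.getValue().get();
                sendMessage(edge.getTargetVertexId(), new DoubleWritable(distance));
              }
            }
            vertex.voteToHalt();
          }

  31. Problem: Simple Shortest Path
      [Figure: the example graph during the shortest-path computation]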
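To trace how the vertex program behaves across supersteps without a Hadoop cluster, the same logic can be simulated in plain Python. This is a sketch under stated assumptions: `shortest_paths` is a hypothetical name, the graph below is made up for illustration (the slide's own figure did not survive extraction), and `math.inf` plays the role of `Double.MAX_VALUE`.

```python
import math


def shortest_paths(edges, source):
    """edges maps each vertex to a list of (target, weight) pairs."""
    dist = {v: math.inf for v in edges}   # superstep 0 initialisation
    inbox = {v: [] for v in edges}
    active = set(edges)                   # every vertex runs in superstep 0

    while active:
        outbox = {v: [] for v in edges}
        for v in active:
            min_dist = 0.0 if v == source else math.inf
            min_dist = min([min_dist] + inbox[v])
            if min_dist < dist[v]:
                dist[v] = min_dist
                for target, weight in edges[v]:
                    outbox[target].append(min_dist + weight)
            # The vertex now votes to halt; only a message reactivates it.
        inbox = outbox
        active = {v for v, messages in inbox.items() if messages}
    return dist


# Hypothetical weighted graph; vertex "d" is unreachable from the source.
graph = {
    "s": [("a", 1), ("b", 4)],
    "a": [("b", 2), ("c", 6)],
    "b": [("c", 3)],
    "c": [],
    "d": [],
}
distances = shortest_paths(graph, "s")
```

Messages are sent only when a vertex improves its distance, so the number of messages shrinks as the distances settle and the computation stops once none are left.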
  32. Graph Database with Neo4j
  33. Introduction to NoSQL
      NoSQL = Not Only SQL
  34. Graph Databases
      • A database that follows a graph structure
      • Each node knows its adjacent nodes
      • As the number of nodes increases, the cost of a local step stays the same
      • Indexes are used for lookups
      • Optimized for traversing connected data
  35. Graph Databases: Model
      [Figure: nodes, each carrying key/value properties (Key1: Value 1, Key2: Value 2)]
  36. Introduction to NoSQL
      (Source: http://db-engines.com)
  37. Neo4j
      • Graph database from Neo Technology
      • A schema-free, labeled property graph database with a Lucene index
      • Well suited to complex, highly connected data
      • Reliable, with real ACID transactions
      • Scalable: billions of nodes and relationships; scales out with a highly available Neo4j cluster
      • Runs as a server with a REST API or embedded
      • Declarative query language (Cypher)
  38. Neo4j: Strengths & Weaknesses
      Strengths
      • Powerful data model
      • Whiteboard friendly
      • Fast for connected data
      • Easy to query
      Weaknesses
      • Requires a conceptual shift (graph-like thinking)
  39. Four Building Blocks
      • Nodes
      • Relationships
      • Properties
      • Labels
      [Figure: a (:USER) node and a (:PET) node joined by a [:RELATIVE] relationship; property labels shown: Name: Mike, Animal: Dog, Name: Apple, Age: 25, Relation: Owner]
  40. SQL to Graph DB: Data Model Transformation (Serendio Proprietary and Confidential)

      SQL                | Graph DB
      Table              | Type of node (label)
      Rows of table      | Nodes
      Columns of table   | Node properties
      Foreign keys, joins | Relationships
  41. SQL to Graph DB: Data Model Transformation

      Table: Actor
      Name       | Movies Language
      Rajnikant  | Tamil
      Maheshbabu | Telugu
      Vijay      | Tamil
      Prabhas    | Telugu

      Table: Movie
      Name         | Lead Actor
      Bahubali     | Prabhas
      Puli         | Vijay
      Shrimanthudu | Maheshbabu
      Robot        | Rajnikant

      Graph:
      • ACTOR (Name: Prabhas, Movie Language: Telugu), LEAD_ACTOR, MOVIE (Name: Bahubali)
      • ACTOR (Name: Rajnikant, Movie Language: Tamil), LEAD_ACTOR, MOVIE (Name: Robot)
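The mapping rules on slide 40 can be applied mechanically to the two tables on slide 41. The sketch below is illustrative only: `to_property_graph` is a hypothetical helper, nodes and relationships are plain dictionaries rather than any Neo4j API, and the direction of `LEAD_ACTOR` (from actor to movie) is an assumption because the transcript does not show the arrow.

```python
actors = [
    {"Name": "Rajnikant", "Movie Language": "Tamil"},
    {"Name": "Maheshbabu", "Movie Language": "Telugu"},
    {"Name": "Vijay", "Movie Language": "Tamil"},
    {"Name": "Prabhas", "Movie Language": "Telugu"},
]
movies = [
    {"Name": "Bahubali", "Lead Actor": "Prabhas"},
    {"Name": "Puli", "Lead Actor": "Vijay"},
    {"Name": "Shrimanthudu", "Lead Actor": "Maheshbabu"},
    {"Name": "Robot", "Lead Actor": "Rajnikant"},
]


def to_property_graph(actor_rows, movie_rows):
    """Table -> label, row -> node, column -> property, foreign key -> relationship."""
    nodes, relationships, actor_ids = [], [], {}

    for row in actor_rows:
        node_id = len(nodes)
        nodes.append({"id": node_id, "label": "ACTOR", "properties": dict(row)})
        actor_ids[row["Name"]] = node_id

    for row in movie_rows:
        node_id = len(nodes)
        # The foreign-key column becomes a relationship instead of a property.
        properties = {k: v for k, v in row.items() if k != "Lead Actor"}
        nodes.append({"id": node_id, "label": "MOVIE", "properties": properties})
        relationships.append({
            "type": "LEAD_ACTOR",
            "start": actor_ids[row["Lead Actor"]],  # assumed direction: actor -> movie
            "end": node_id,
        })
    return nodes, relationships


nodes, relationships = to_property_graph(actors, movies)
```

After the transformation, the join between the two tables no longer has to be computed at query time: following a `LEAD_ACTOR` relationship from a node is a local step.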
  42. How to Query a Graph Database?
      • Graph query languages
          – Cypher
          – Gremlin
  43. How to Query a Graph Database?
      • Graph query languages
          – Cypher
          – Gremlin
  44. Cypher Query Language
      • Declarative
      • SQL-inspired
      • Pattern-based
      [Figure: Ramesh and Suresh connected by a FRIEND relationship]
          (Ramesh:PERSON)-[connect:FRIEND]->(Suresh:PERSON)
  45. Cypher: Getting Started
      Structure:
      • Similar to SQL
      • Most common clauses:
          – MATCH: the graph pattern to match
          – WHERE: adds constraints or filters
          – RETURN: what to return
  46. CRUD Operations
      MATCH:
      • MATCH (n) RETURN n
      • MATCH (movie:Movie) RETURN movie
      • MATCH (movie:Movie { title: 'Bahubali' }) RETURN movie
      • MATCH (director { name:'Rajamouli' })--(movie) RETURN movie.title
      • MATCH (raj:Person { name:'Rajamouli'})--(movie:Movie) RETURN movie
      • MATCH (raj:Person { name:'Rajamouli'})-->(movie:Movie) RETURN movie
      • MATCH (raj:Person { name:'Rajamouli'})<--(movie:Movie) RETURN movie
      • MATCH (raj:Person { name:'Rajamouli'})-[:DIRECTED]->(movie:Movie) RETURN movie
  47. CRUD Operations
      WHERE:
      • MATCH (n) WHERE n:Movie RETURN n
      • MATCH (n) WHERE n.name <> 'Prabhas' RETURN n
  48. CRUD Operations
      Let's clean the database:
          MATCH (n) OPTIONAL MATCH (n)-[r]-() DELETE n,r
  49. CRUD Operations
      CREATE: Nodes:
      • CREATE (n)
      • CREATE (n),(m)
      • CREATE (n:Person)
      • CREATE (n:Person:Swedish)
      • CREATE (n:Person { name : 'Andres', title : 'Developer' })
      • CREATE (a:Person { name : 'Roman' }) RETURN a
  50. CRUD Operations
      CREATE: Relationships:
      • MATCH (a:Person),(b:Person) WHERE a.name = 'Roman' AND b.name = 'Andres' CREATE (a)-[r:RELTYPE]->(b) RETURN r
      • MATCH (a:Person),(b:Person) WHERE a.name = 'Roman' AND b.name = 'Andres' CREATE (a)-[r:RELTYPE { name : a.name + '<->' + b.name }]->(b) RETURN r
  51. CRUD Operations
      CREATE: Relationships:
      • CREATE p = (andres { name:'Andres' })-[:WORKS_AT]->(neo)<-[:WORKS_AT]-(michael { name:'Michael' }) RETURN p
  52. CRUD Operations
      UPDATE: Labels and properties:
      • MATCH (n:Person { name : 'Andres' }) SET n :Person:Coder
      • MATCH (n:Person { name : 'Andres', title : 'Developer' }) SET n.title = 'Mang'
  53. CRUD Operations
      DELETE:
      • MATCH (n:Person) WHERE n.name = 'Andres' DELETE n
      • MATCH (n { name: 'Andres' })-[r]-() DELETE n, r
      • MATCH (n:Person) DELETE n
      • MATCH (n) OPTIONAL MATCH (n)-[r]-() DELETE n,r
  54. Functions
      Predicates:
      • ALL(identifier in collection WHERE predicate)
      • ANY(identifier in collection WHERE predicate)
      • NONE(identifier in collection WHERE predicate)
      • SINGLE(identifier in collection WHERE predicate)
      • EXISTS( pattern-or-property )
      Scalar functions:
      • LENGTH( collection/pattern expression )
      • TYPE( relationship )
      • ID( property-container )
      • COALESCE( expression [, expression]* )
      • HEAD( expression )
      • LAST( expression )
      • TIMESTAMP()
  55. Functions
      Collection functions:
      • NODES( path )
      • RELATIONSHIPS( path )
      • LABELS( node )
      • FILTER(identifier in collection WHERE predicate)
      • REDUCE( accumulator = initial, identifier in collection | expression )
      Mathematical functions:
      • ABS( expression )
      • COS( expression )
      • LOG( expression )
      • ROUND( expression )
      • SQRT( expression )
  56. Use Case: Movie Recommendation*
      Problem:
      • We run an IMDb-style website.
      • We have a dataset of movie ratings given by users.
      • The goal is to generate, for each user, a list of movies to recommend.
      *http://neo4j.com/graphgist/a7c915c8-a3d6-43b9-8127-1836fecc6e2f
  57. Use Case: Movie Recommendation
  58. Use Case: Movie Recommendation
      Solution:
      • Find pairs of people who gave similar ratings to the movies both of them have watched.
      • Then recommend movies that one person has not seen and the other has rated highly.
      • Use the cosine similarity function to measure the similarity between two users.
      • Use k-nearest neighbors (k-NN) to find the most similar users.
  59. Use Case: Movie Recommendation
      • Cosine similarity: [formula shown as an image on the slide]
      • k-NN: [formula shown as an image on the slide]
  60. Use Case: Movie Recommendation
      Query: Add cosine similarity

          MATCH (p1:Person)-[x:RATED]->(m:Movie)<-[y:RATED]-(p2:Person)
          WITH SUM(x.rating * y.rating) AS xyDotProduct,
               SQRT(REDUCE(xDot = 0.0, a IN COLLECT(x.rating) | xDot + a^2)) AS xLength,
               SQRT(REDUCE(yDot = 0.0, b IN COLLECT(y.rating) | yDot + b^2)) AS yLength,
               p1, p2
          MERGE (p1)-[s:SIMILARITY]-(p2)
          SET s.similarity = xyDotProduct / (xLength * yLength)
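What the similarity query computes for one pair of people can be checked by hand with a few lines of Python. This is a sketch, not part of the deck: `cosine_similarity` is a hypothetical helper, ratings are plain dictionaries keyed by movie title, and ratings are assumed to be positive so that neither vector length is zero. As in the query, only movies rated by both people contribute.

```python
import math


def cosine_similarity(ratings_a, ratings_b):
    """Cosine of the angle between two rating vectors over co-rated movies."""
    shared = ratings_a.keys() & ratings_b.keys()
    if not shared:
        return None  # no common movie, so the query would create no relationship

    dot = sum(ratings_a[m] * ratings_b[m] for m in shared)          # xyDotProduct
    len_a = math.sqrt(sum(ratings_a[m] ** 2 for m in shared))       # xLength
    len_b = math.sqrt(sum(ratings_b[m] ** 2 for m in shared))       # yLength
    return dot / (len_a * len_b)


# Hypothetical ratings: "Z" is ignored because only one person rated it.
similarity = cosine_similarity({"X": 4, "Y": 3}, {"X": 3, "Y": 4, "Z": 5})
```

A value close to 1 means the two people rate their common movies in nearly the same proportions; the measure ignores how many movies they share, which is worth keeping in mind when two people have only one movie in common.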
  61. Use Case: Movie Recommendation
  62. Use Case: Movie Recommendation
      Query: See who your nearest neighbors are by similarity

          MATCH (p1:Person {name:'Michael Sherman'})-[s:SIMILARITY]-(p2:Person)
          WITH p2, s.similarity AS sim
          ORDER BY sim DESC
          LIMIT 5
          RETURN p2.name AS Neighbor, sim AS Similarity
  63. Use Case: Movie Recommendation (contd.)
      Query: Finally, the recommendation

          MATCH (b:Person)-[r:RATED]->(m:Movie),
                (b)-[s:SIMILARITY]-(a:Person {name:'Michael Sherman'})
          WHERE NOT((a)-[:RATED]->(m))
          WITH m, s.similarity AS similarity, r.rating AS rating
          ORDER BY m.name, similarity DESC
          WITH m.name AS movie, COLLECT(rating)[0..3] AS ratings
          WITH movie, REDUCE(s = 0, i IN ratings | s + i)*1.0 / LENGTH(ratings) AS reco
          ORDER BY reco DESC
          RETURN movie AS Movie, reco AS Recommendation
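The logic of the recommendation query can be restated in Python to make each step explicit. This is a minimal sketch with hypothetical names and made-up data: `recommend` takes the ratings of all people and a precomputed map of each neighbour's similarity to the target person, and, like the `[0..3]` slice in the query, averages the ratings of at most three of the most similar people per movie.

```python
def recommend(target, ratings, similarity, k=3):
    """Score unseen movies by the mean rating of the k most similar raters."""
    seen = ratings[target]
    candidates = {}
    for person, rated in ratings.items():
        if person == target or person not in similarity:
            continue  # only people linked to the target by a similarity score
        for movie, rating in rated.items():
            if movie not in seen:  # WHERE NOT (a)-[:RATED]->(m)
                candidates.setdefault(movie, []).append((similarity[person], rating))

    scores = {}
    for movie, pairs in candidates.items():
        # Most similar raters first, then keep the first k ratings.
        top = sorted(pairs, key=lambda pair: pair[0], reverse=True)[:k]
        scores[movie] = sum(rating for _, rating in top) / len(top)

    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data for illustration.
ratings = {
    "me": {"A": 5},
    "p1": {"A": 5, "B": 4, "C": 2},
    "p2": {"A": 4, "B": 2},
    "p3": {"B": 5},
    "p4": {"B": 1, "C": 4},
}
similarity = {"p1": 0.9, "p2": 0.8, "p3": 0.7, "p4": 0.1}
recommendations = recommend("me", ratings, similarity)
```

Limiting each movie to its k most similar raters keeps a large number of weakly similar people from dominating the score, which is the k-NN part of the approach.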
  64. Why Is a Real-Time Recommendation Engine Important?
      • Example: e-commerce
  65. Neo4j with Other Technologies
      • Data import
          – LOAD CSV
          – neo4j-import
      • Graph visualization
          – Alistair Jones (Arrows)
          – Alchemy.js (GraphJSON)
          – Neo4j Browser
          – Linkurious
          – KeyLines
          – D3.js
  66. Neo4j in Action
      • Neo4j + Linkurious for the Panama leaks
  67. Conclusion
      • The graph is an important data model for representing many real-world scenarios, because connected objects provide more information than isolated objects.
      • The de facto big data technologies are inefficient for solving large-scale graph problems.
      • Technologies designed to solve large-scale graph problems, both in real time and offline, are available.
      • These graph technologies are mature enough to use in production.
  68. Thank You!
      nishant@serendio.com
      Serendio provides Big Data Science Solutions & Services for Data-Driven Enterprises. Learn more at: serendio.com/index.php/case-studies
