NDC Oslo 2018 - A Practical Guide to Graph Databases

With the emergence of offerings on both AWS (Neptune) and Azure (CosmosDB) within the past year, it is fair to say that graph databases are one of the hottest trends and that they are here to stay. So what are graph databases all about? You can read article after article about how great they are and how they will solve all your problems better than your relational database, but it's difficult to find any practical information about them.

This talk will start with a short primer on graph databases and the ecosystem, then quickly transition to the practical aspects of applying them to real-world business problems. We will dive into what makes a good use case and what does not, and follow this up with real-world examples of common patterns and anti-patterns of using graph databases. If you haven't been scared away by this point, we will end by showing you some of the powerful insights that graph databases can provide.

Published in: Software

  1. 1. A Practical Guide to Graph Databases
  2. 2. About Me Architect and Full Stack Developer ● 20 years of full stack experience ● Distributed high performance low latency big data platforms ● Graph Databases are kinda my thing www.bechbergerconsulting.com www.bechberger.com @bechbd www.linkedin.com/in/davebechberger
  3. 3. Graph Databases
  4. 4. Graph Databases are hot
  5. 5. Graph Theory
  6. 6. What is a Graph Datastore? ● Type of NoSQL datastore ● Uses graph structures (nodes, edges) to store data ● Efficiently represents and traverses relationships
  7. 7. The NoSQL Spectrum
  8. 8. Why use a graph database? ● Network Analysis ● Master Data Management ● Recommendation Engines ● Fraud Detection
  9. 9. Graph Ecosystem
  10. 10. The ecosystem is large and growing
  11. 11. The ecosystem is complex ● Frameworks ● RDF Triple Stores ● Labeled Property Model Databases
  12. 12. Databases vs. Frameworks Frameworks ● Data is processed not persisted ● Works on enormous datasets ● OLAP workloads Databases ● Data is persisted and processed ● Real time querying ● OLTP and OLAP workloads
  13. 13. RDF/Triple Stores vs. Labeled Property Graphs RDF Triple Stores ● Each entity is a triple ● Works with subject - predicate - object ● Comes from the semantic web ● Great for inferring relationships Labeled Property Graphs ● Entities are a node or an edge ● Works with nodes - edges - properties - labels ● Both nodes and edges contain properties ● Great for efficiently traversing relationships
  14. 14. RDF/Triple Stores vs. Labeled Property Graphs RDF Triple Stores Labeled Property Graphs
  15. 15. Graph Query Languages Gremlin ● Imperative + Declarative ● Powerful ● Steep Learning Curve GraphQL ● Most useful for REST endpoints ● Query Language for APIs SPARQL ● W3C standard for RDF ● Based on the semantic web Cypher ● Declarative ● Easy to Use ● Most Popular Language Others ● Most are extensions of SQL ● Usually specific to one system
  16. 16. Queries - Find a Friend of a Friend
      SPARQL: PREFIX foaf: <http://xmlns.com/foaf/0.1/> SELECT ?name WHERE { ?me foaf:knows ?friend . ?friend foaf:knows ?fof . ?fof foaf:name ?name . }
      Cypher: MATCH (me:Person)-[:FRIEND*2]->(fof:Person) RETURN fof.name
      Gremlin: g.V().hasLabel('person').repeat(out('friend')).times(2).dedup().values('name').toList()
      GraphQL: { friend { friend { name } } }
      SQL Variants: SELECT name FROM expand( bothE('is_friend_with').bothV().bothE('is_friend_with').bothV() )
  17. 17. Visualization ● Desktop Tool ● Web ● Both
  18. 18. Visualizations
  19. 19. To use or not to use, that is the question
  20. 20. Everything is a Graph But that doesn’t mean you should solve it with a graph
  21. 21. Explore the Questions
  22. 22. Search and Selection ● Get me everyone who works at X. ● Find me everyone with a first name like “John”. ● Find me all stores within X miles. Answer: Use an RDBMS or a Search Server
  23. 23. Related Data ● What is the easiest way for me to be introduced to an executive at X? ● How do “John” and “Paula” know each other? ● How is company X related to company Y? Answer: Use a Graph
  24. 24. Aggregation ● How many companies are in my system? ● What are my average sales for each day over the past month? ● What is the number of transactions processed by my system each day? Answer: Use an RDBMS
  25. 25. Pattern Matching ● Who in my system has a similar profile to me? ● Does this transaction look like other known fraudulent transactions? ● Is the user “J. Smith” the same as “Johan S.”? Answer: It depends, you might use a search server or a graph
  26. 26. Clustering, Centrality, and Influence ● Who is the most influential person I am connected with on LinkedIn? ● What equipment in my network will have the largest impact if it breaks? ● What parts tend to fail at the same time? Answer: Use a graph
  27. 27. Still not sure?
  28. 28. Should I use a graph? “I sold this to management as a graph project, so we are using a graph.” Based on work by Dr. Denise Gosnell: https://bit.ly/2s0qBC2
  29. 29. I’m still confused ● Do we care about the relationships between entities as much as or more than the entities themselves? ● If I were to model this in an RDBMS, would I be writing queries with multiple (5+) joins or recursive CTEs to retrieve my data? ● Is the structure of my data continuously evolving? ● Is my domain a natural fit for a graph?
  30. 30. Can’t I just do this in SQL?
  31. 31. Northwind Data Models
  32. 32. Give me all products in a category (Search/Selection)
      SQL: SELECT c.categoryName, p.productName FROM product AS p INNER JOIN category AS c ON c.categoryId = p.categoryId WHERE c.categoryName = 'Beverages'
      Gremlin: g.V().has('category', 'categoryName', 'Beverages').as('c').in('part_of').as('p').select('c', 'p').by('categoryName').by('productName')
      Cypher: MATCH (p:Product)-[:PART_OF]->(c:Category {categoryName: 'Beverages'}) RETURN c.categoryName, p.productName
  33. 33. Give me the top 5 products ordered (Aggregation)
      SQL: SELECT TOP(5) c.categoryName, p.productName, COUNT(*) AS orderCount FROM [order] AS o INNER JOIN product AS p ON p.productId = o.productId INNER JOIN category AS c ON c.categoryId = p.categoryId GROUP BY c.categoryName, p.productName ORDER BY orderCount DESC
      Gremlin: g.V().hasLabel('order').out('orders').groupCount().by('productName').order(local).by(values, decr).limit(local, 5)
      Cypher: MATCH (o:Order)-[:ORDERS]->(p:Product)-[:PART_OF]->(c:Category) RETURN c.categoryName, p.productName, count(o) ORDER BY count(o) DESC LIMIT 5
  34. 34. Find Products Purchased by others that I haven’t purchased (Related Data/Pattern Matching)
      SQL: SELECT TOP(5) product.product_name AS Recommendation, count(1) AS Frequency
           FROM product, customer_product_mapping,
             (SELECT cpm3.product_id, cpm3.customer_id
              FROM customer_product_mapping cpm, customer_product_mapping cpm2, customer_product_mapping cpm3
              WHERE cpm.customer_id = '123' AND cpm.product_id = cpm2.product_id AND cpm2.customer_id != '123'
                AND cpm3.customer_id = cpm2.customer_id
                AND cpm3.product_id NOT IN (SELECT DISTINCT product_id FROM customer_product_mapping WHERE customer_id = '123')
             ) recommended_products
           WHERE customer_product_mapping.product_id = product.product_id
             AND customer_product_mapping.product_id = recommended_products.product_id
             AND customer_product_mapping.customer_id = recommended_products.customer_id
           GROUP BY product.product_name ORDER BY Frequency DESC
      Gremlin: g.V().has("customer", "customerId", "123").as("c").
                 out("ordered").out("contains").out("is").aggregate("p").
                 in("is").in("contains").in("ordered").where(neq("c")).
                 out("ordered").out("contains").out("is").where(without("p")).
                 groupCount().order(local).by(values, decr).select(keys).limit(local, 5).
                 unfold().values("name")
      Cypher: MATCH (u:Customer {customer_id: '123'})-[:BOUGHT]->(p:Product)<-[:BOUGHT]-(peer:Customer)-[:BOUGHT]->(r:Product)
              WHERE NOT (u)-[:BOUGHT]->(r)
              RETURN r AS Recommendation, count(*) AS Frequency ORDER BY Frequency DESC LIMIT 5;
  35. 35. Give me all employees, their supervisor and level (Recursive CTE)
      SQL: WITH EmployeeHierarchy (EmployeeID, LastName, FirstName, ReportsTo, HierarchyLevel) AS (
             SELECT EmployeeID, LastName, FirstName, ReportsTo, 1 AS HierarchyLevel
             FROM Employees WHERE ReportsTo IS NULL
             UNION ALL
             SELECT e.EmployeeID, e.LastName, e.FirstName, e.ReportsTo, eh.HierarchyLevel + 1 AS HierarchyLevel
             FROM Employees e INNER JOIN EmployeeHierarchy eh ON e.ReportsTo = eh.EmployeeID)
           SELECT * FROM EmployeeHierarchy ORDER BY HierarchyLevel, LastName, FirstName
      Gremlin: g.V().hasLabel("employee").where(__.not(out("reportsTo"))).
                 repeat(__.in("reportsTo")).emit().tree().by(map {
                   def employee = it.get(); employee.value("firstName") + " " + employee.value("lastName") }).next()
      Cypher: MATCH p = (u:Employee)-[:ReportsTo*0..]->(top:Employee) WHERE NOT (top)-[:ReportsTo]->()
              OPTIONAL MATCH (u)-[:ReportsTo]->(s:Employee)
              RETURN u.firstName AS FirstName, u.lastName AS LastName, s.firstName + ' ' + s.lastName AS ReportsTo, length(p) + 1 AS HierarchyLevel
              ORDER BY HierarchyLevel, LastName, FirstName
      Based on work by http://sql2gremlin.com/
  36. 36. Where do I start?
  37. 37. Choosing a Datastore ● Framework vs. RDF vs. Property Model ● HA/Transaction Volume/Data Size ● Hosted vs On Premise
  38. 38. Datastore Concerns ● Data Consistency - ACID or BASE ● Explore your choices ● Beware the Operational Overhead
  39. 39. Data Modelling ● Whiteboard friendly - close to the conceptual model, but pragmatic ● Take into account how you are traversing data ● Use your relational model to start ● Iterate, Iterate, Iterate
  40. 40. Data Modelling Concerns ● Don’t use Symmetric Relationships ● Look out for Hidden/Anemic Relationships ● Look for Supernodes ● Schema - Use it and make it general
  41. 41. What next?
  42. 42. Summary
  43. 43. The Good ● Graphs are flexible ● Great at finding and traversing relationships ● Natural fit in many complex domains ● Query times are proportional to the amount of the graph you traverse
  44. 44. The Bad ● Different options scale very differently ● Team needs to learn a new mindset ● Still immature space
  45. 45. The Ugly ● Lack of documentation ● Large, splintered and rapidly evolving ecosystem ● Hard for new users to tell good versus bad use cases
  46. 46. Advice from the trenches... ● Graph datastores may solve your problem, but understand your problem first ● Expect some trial and error ● Your data model will evolve, plan for it ● Don’t underestimate the time it takes to bring your team up to speed ● Graph databases are not a silver bullet
  47. 47. www.bechbergerconsulting.com www.bechberger.com @bechbd www.linkedin.com/in/davebechberger Questions?
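
To make the labeled property graph model (slide 13) and the friend-of-a-friend query (slide 16) more concrete, here is a minimal Python sketch that keeps nodes and edges in plain dictionaries. The `PropertyGraph` class, the `friends_of_friends` helper, and the sample people are hypothetical and not from the talk; a real graph database would also index and persist this data.

```python
from collections import defaultdict


class PropertyGraph:
    """Tiny in-memory labeled property graph (illustrative only)."""

    def __init__(self):
        self.nodes = {}                     # node id -> {"label": ..., "props": {...}}
        self.out_edges = defaultdict(list)  # node id -> [(edge label, target id, props)]

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, "props": props}

    def add_edge(self, source, label, target, **props):
        # Edges carry their own properties, just like nodes do.
        self.out_edges[source].append((label, target, props))

    def out(self, node_id, label):
        """Follow outgoing edges with the given label, similar to Gremlin's out()."""
        return [target for (edge_label, target, _) in self.out_edges[node_id]
                if edge_label == label]


def friends_of_friends(graph, start):
    """Names reachable in exactly two 'friend' hops, deduplicated and sorted."""
    found = set()
    for friend in graph.out(start, "friend"):
        for fof in graph.out(friend, "friend"):
            found.add(graph.nodes[fof]["props"]["name"])
    return sorted(found)


# Hypothetical sample data.
g = PropertyGraph()
for node_id, name in [(1, "Ann"), (2, "Bob"), (3, "Cat"), (4, "Dan")]:
    g.add_node(node_id, "person", name=name)
g.add_edge(1, "friend", 2, since=2015)
g.add_edge(1, "friend", 3, since=2017)
g.add_edge(2, "friend", 4, since=2016)
g.add_edge(3, "friend", 4, since=2018)

print(friends_of_friends(g, 1))  # ['Dan']
```

The cost of the traversal depends only on the edges actually followed from the start node, which is the intuition behind the "query times are proportional to the amount of the graph you traverse" point on slide 43.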
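
The "Related Data" questions on slide 23, such as how “John” and “Paula” know each other, come down to finding a path between two nodes. The sketch below uses a plain breadth-first search over an adjacency dictionary; the `shortest_path` function and the `knows` data are assumptions made for illustration, not something shown in the talk.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns the shortest node list from start to goal, or None."""
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            # Walk the predecessor links back to the start.
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for neighbor in adjacency.get(current, ()):
            if neighbor not in previous:
                previous[neighbor] = current
                queue.append(neighbor)
    return None


# Hypothetical undirected "knows" relationships, stored in both directions.
knows = {
    "John": ["Maria"],
    "Maria": ["John", "Lee", "Paula"],
    "Lee": ["Maria", "Paula"],
    "Paula": ["Maria", "Lee"],
    "Omar": [],
}

print(shortest_path(knows, "John", "Paula"))  # ['John', 'Maria', 'Paula']
```

In SQL the same question would typically need a recursive CTE or a chain of self-joins whose depth is not known in advance, which is one of the signals listed on slide 29.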
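
The recommendation query on slide 34 (products bought by others that I haven't bought) can be sketched in a few lines of Python over a customer-to-products mapping. The `recommend` function and the purchase data are hypothetical. One simplification to note: this version counts each peer once per product, whereas the Cypher query counts every matching path, so frequencies can differ when a peer shares several products with the customer.

```python
from collections import Counter


def recommend(purchases, customer, limit=5):
    """Rank products bought by peers who share at least one purchase with `customer`."""
    mine = purchases.get(customer, set())
    counts = Counter()
    for peer, products in purchases.items():
        if peer == customer or not (products & mine):
            continue  # skip myself and customers with nothing in common
        for product in products - mine:
            counts[product] += 1
    # Sort by frequency (descending), then by name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Hypothetical purchase history keyed by customer id.
purchases = {
    "123": {"Chai", "Tofu"},
    "456": {"Chai", "Ikura", "Konbu"},
    "789": {"Tofu", "Ikura"},
    "999": {"Pavlova"},
}

print(recommend(purchases, "123"))  # [('Ikura', 2), ('Konbu', 1)]
```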
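
The employee hierarchy query on slide 35 assigns level 1 to employees with no supervisor and adds one level per reporting step. A rough Python equivalent, assuming each employee has at most one supervisor and the data contains no cycles, is shown below. The `hierarchy_levels` function and the sample names are illustrative assumptions.

```python
def hierarchy_levels(reports_to):
    """Map each employee to a level: 1 for the top of the hierarchy, +1 per step down."""
    levels = {}

    def level_of(employee):
        # Memoize so each employee's level is computed once.
        if employee not in levels:
            manager = reports_to[employee]
            levels[employee] = 1 if manager is None else level_of(manager) + 1
        return levels[employee]

    for employee in reports_to:
        level_of(employee)
    return levels


# Hypothetical data: employee -> supervisor (None means no supervisor).
reports_to = {
    "Fuller": None,
    "Davolio": "Fuller",
    "Buchanan": "Fuller",
    "King": "Buchanan",
}

print(hierarchy_levels(reports_to))
```

This mirrors the anchor row plus recursive step of the SQL CTE: the `None` case plays the role of `WHERE ReportsTo IS NULL`, and the recursive call plays the role of `eh.HierarchyLevel + 1`.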