Using MongoDB as a graph database - 2014 redux

** An update to the 2012 MongoUK presentation, given at NoSQL Birmingham/London meetup **

This presentation charts how Talis implemented tripod, a library that runs over the top of MongoDB, to provide access to large scale graph datasets with very high performance query access.

As Talis' own applications became web-scale, the company used tripod as a replacement for its earlier, general purpose RDF triple store, and maintained the graph-model in the code line whilst swapping in MongoDB underneath.

By prioritising what really mattered to those applications, and discarding what did not, the company was able to extract extreme performance from graph-based datasets using MongoDB running on commodity hardware.

https://github.com/talis/tripod-php
https://github.com/talis/tripod-node

  1. 1. Using MongoDB as a Graph Database Chris Clarke NoSQL Birmingham 16th October 2014
  2. 2. Graphs 101 For the uninitiated
  3. 3. John knows Jane
  4. 4. John knows Jane Jane knows John John knows Jane
  5. 5. John knows Jane
  6. 6. John knows Jane Jane ? John John knows Jane
  7. 7. John knows Jane Jane knows John knows John Jane knows
  8. 8. RDF
  9. 9. Entity Property Value John knows Jane
  10. 10. Subject Predicate Object John knows Jane
  11. 11. Subject Predicate Object John knows Jane Jane knows John
  12. 12. Subject Predicate Object http://example.com/John foaf:knows http://example.com/Jane PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  13. 13. Subject Predicate Object
      http://example.com/John foaf:knows http://example.com/Jane
      http://example.com/John foaf:name "John"
      http://example.com/John rdf:type foaf:Person
      http://example.com/Jane foaf:name "Jane"
      http://example.com/Jane rdf:type foaf:Person
      http://example.com/Jane foaf:knows http://example.com/John
      PREFIX foaf: <http://xmlns.com/foaf/0.1/>
      PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  14. 14. foaf:Person rdf:type rdf:type foaf:knows example:John example:Jane foaf:knows foaf:name foaf:name “John” “Jane”
  15. 15. “WTF! Surely this is easier in JSON!” – Jack Fullstack
  16. 16. > db.people.find() { _id: ObjectId('123'), name: 'John', knows: [ObjectId('456')] }, { _id: ObjectId('456'), name: 'Jane', knows: [ObjectId('123')] }
  17. 17. foaf:Person
  18. 18. Dataset A Dataset B example:John foaf:name “John” example:John foaf:age 24
  19. 19. Dataset A+B example:John foaf:name foaf:age “John” 24
  20. 20. SPARQL An RDF Query Language
  21. 21. PREFIX foaf: <http://xmlns.com/foaf/0.1/>
      SELECT ?name ?email
      WHERE {
        ?person a foaf:Person .
        ?person foaf:name ?name .
        ?person foaf:mbox ?email .
      }
      ORDER BY ?name
      LIMIT 50
  22. 22. Query type and result: CONSTRUCT: Graph, DESCRIBE: Graph, SELECT: Tabular, ASK: Boolean
  23. 23. Graphs and Talis A bit of history
  24. 24. Over time… • Our apps became popular. Last week they averaged 4M requests per day, with 600k+ per hour at peak times • Our dataset is growing in size, about 350M triples this week • Our apps needed more queries and more expensive queries • Our in-house triple store was EoL and out of date
  25. 25. Project Tripod http://github.com/talis/tripod-php http://github.com/talis/tripod-node
  26. 26. System characteristics • 99:1 read:write • Well shared, tenant based system. Our largest single customer has 35M triples • Graph data structures and operations (merges, sub-graphs etc.) well entrenched in the codebase, over 2M lines of code (inc. libraries) • Actually not that many distinct query shapes
  27. 27. Simple Queries, and how they influenced our core data model
  28. 28. DESCRIBE <http://example.com/John>
      Give me all the triples about John as a graph
      SELECT ?name ?age WHERE { <http://example.com/John> <foaf:name> ?name . <http://example.com/John> <foaf:age> ?age . }
      Give me properties name, age of John as tabular data
  29. 29. Subject Predicate Object
      http://example.com/John foaf:knows http://example.com/Jane
      http://example.com/John foaf:name "John"
      http://example.com/John rdf:type foaf:Person
      http://example.com/Jane foaf:name "Jane"
      http://example.com/Jane rdf:type foaf:Person
      http://example.com/Jane foaf:knows http://example.com/John
      PREFIX foaf: <http://xmlns.com/foaf/0.1/>
      PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  30. 30. Concise Bound Description of http://example.com/John
      http://example.com/John foaf:knows http://example.com/Jane
      http://example.com/John foaf:name "John"
      http://example.com/John rdf:type foaf:Person
      Concise Bound Description of http://example.com/Jane
      http://example.com/Jane foaf:name "Jane"
      http://example.com/Jane rdf:type foaf:Person
      http://example.com/Jane foaf:knows http://example.com/John
  31. 31. Concise Bound Description of http://example.com/John
      http://example.com/John foaf:knows http://example.com/Jane
      http://example.com/John foaf:name "John"
      http://example.com/John rdf:type foaf:Person
      { _id: "example:John", "foaf:knows": { u: "example:Jane" }, "rdf:type": { u: "foaf:Person" }, "foaf:name": { l: "John" } }
  32. 32. { _id: "example:John", "foaf:knows": { u: "example:Jane" }, "rdf:type": { u: "foaf:Person" }, "foaf:name": { l: "John" } }
  33. 33. { _id: "example:John", "foaf:knows": { u: "example:Jane" }, "rdf:type": { u: "foaf:Person" }, "foaf:name": { l: "John" } }
      _id is the unique primary key. There can only be one John
  34. 34. { _id: "example:John", "foaf:knows": { u: "example:Jane" }, "rdf:type": { u: "foaf:Person" }, "foaf:name": { l: "John" } }
      l means value is a literal text value
      _id is the unique primary key. There can only be one John
  35. 35. { _id: "example:John", "foaf:knows": { u: "example:Jane" }, "rdf:type": { u: "foaf:Person" }, "foaf:name": { l: "John" } }
      u means value is a URI, or another node
      l means value is a literal text value
      _id is the unique primary key. There can only be one John
  36. 36. { _id: "example:John", "foaf:knows": { u: "example:Jane" }, "rdf:type": { u: "foaf:Person" }, "foaf:name": { l: "John" } }
      DESCRIBE <http://example.com/John>
      SELECT ?name ?age WHERE { <http://example.com/John> <foaf:name> ?name . <http://example.com/John> <foaf:age> ?age . }
  37. 37. { _id: "example:John", "foaf:knows": { u: "example:Jane" }, "rdf:type": { u: "foaf:Person" }, "foaf:name": { l: "John" } }
      DESCRIBE <http://example.com/John>
      mongo$ col.findOne({_id: "example:John"});
      SELECT ?name ?age WHERE { <http://example.com/John> <foaf:name> ?name . <http://example.com/John> <foaf:age> ?age . }
      mongo$ col.findOne({_id: "example:John"}, {"foaf:name.l": 1, "foaf:age.l": 1});
  38. 38. { s: "example:John", p: "foaf:knows", o: { u: "example:Jane" } },
      { s: "example:John", p: "rdf:type", o: { u: "foaf:Person" } },
      { s: "example:John", p: "foaf:name", o: { l: "John" } }
  39. 39. { s: "example:John", p: "foaf:knows", o: { u: "example:Jane" } },
      { s: "example:John", p: "rdf:type", o: { u: "foaf:Person" } },
      { s: "example:John", p: "foaf:name", o: { l: "John" } }
      DESCRIBE <http://example.com/John>
      mongo$ var s = col.find({s: "example:John"});
      mongo$ while (s.hasNext()) { addToGraph(s.next()) }
      SELECT ?name ?age WHERE { <http://example.com/John> <foaf:name> ?name . <http://example.com/John> <foaf:age> ?age . }
      mongo$ col.find({s: "example:John", p: "foaf:name"}, {"o": 1});
      mongo$ col.find({s: "example:John", p: "foaf:age"}, {"o": 1});
  40. 40. { s: "example:John", p: "foaf:knows", o: { u: "example:Jane" } },
      { s: "example:John", p: "rdf:type", o: { u: "foaf:Person" } },
      { s: "example:John", p: "foaf:name", o: { l: "John" } }
      DESCRIBE ?person WHERE { ?person <foaf:name> "John" . }
      mongo$ var s = col.find({p: "foaf:name", "o.l": "John"}); // BasicCursor = slow
      { _id: "example:John", "foaf:knows": { u: "example:Jane" }, "rdf:type": { u: "foaf:Person" }, "foaf:name": { l: "John" } }
      DESCRIBE ?person WHERE { ?person <foaf:name> "John" . }
      mongo$ col.ensureIndex({"foaf:name.l": 1});
      mongo$ var s = col.find({"foaf:name.l": "John"}); // BTreeCursor = fast
  41. 41. Complex Queries
  42. 42. DESCRIBE <http://example.com/foo> ?sectionOrItem ?resource ?document ?authorList ?author ?usedBy ?creator ?libraryNote ?publisher WHERE { OPTIONAL { <http://example.com/foo> resource:contains ?sectionOrItem . OPTIONAL { ?sectionOrItem resource:resource ?resource . OPTIONAL { ?resource dcterms:isPartOf ?document . } OPTIONAL { ?resource bibo:authorList ?authorList . OPTIONAL { ?authorList ?p ?author . } } OPTIONAL { ?resource dcterms:publisher ?publisher . } } OPTIONAL { ?libraryNote bibo:annotates ?sectionOrItem } } . OPTIONAL { <http://example.com/foo> resource:usedBy ?usedBy } . OPTIONAL { <http://example.com/foo> sioc:has_creator ?creator } }
  43. 43. DESCRIBE <http://example.com/foo> ?sectionOrItem ?resource ?document ?authorList ?author ?usedBy ?creator ?libraryNote ?publisher WHERE { OPTIONAL { <http://example.com/foo> resource:contains ?sectionOrItem . OPTIONAL { ?sectionOrItem resource:resource ?resource . OPTIONAL { ?resource dcterms:isPartOf ?document . } OPTIONAL { ?resource bibo:authorList ?authorList . OPTIONAL { ?authorList ?p ?author . } } OPTIONAL { ?resource dcterms:publisher ?publisher . } } OPTIONAL { ?libraryNote bibo:annotates ?sectionOrItem } } . OPTIONAL { <http://example.com/foo> resource:usedBy ?usedBy } . OPTIONAL { <http://example.com/foo> sioc:has_creator ?creator } }
  44. 44. “We don’t need dynamic queries” – Project Tripod Team, sometime 2012
  45. 45. Precomputed views Remember those from the RDBMS?
  46. 46. { _id: "example:John", "foaf:knows": { u: "example:Jane" }, "rdf:type": { u: "foaf:Person" }, "foaf:name": { l: "John" } }
      { _id: "example:Jane", "foaf:knows": { u: "example:John" }, "rdf:type": { u: "foaf:Person" }, "foaf:name": { l: "Jane" } }
      DESCRIBE example:John ?knownPerson WHERE { example:John foaf:knows ?knownPerson . }
      mongo$ var john = col.findOne({_id: "example:John"});
      mongo$ var known = [].concat(john["foaf:knows"]); // a single value or an array of values
      mongo$ for (var i = 0; i < known.length; i++) { var knownPerson = col.findOne({_id: known[i].u}); }
  47. 47. System characteristics • 99:1 read:write • Well shared, tenant based system. Our largest single customer has 35M triples • Graph data structures and operations (merges, sub-graphs etc.) well entrenched in the codebase, over 2M lines of code (inc. libraries). • Actually not that many distinct query shapes.
  48. 48. { _id: { r: "example:John", t: "v_knows" }, graphs: [ { _id: "example:John", "foaf:knows": { u: "example:Jane" }, "rdf:type": { u: "foaf:Person" }, "foaf:name": { l: "John" } }, { _id: "example:Jane", "foaf:knows": { u: "example:John" }, "rdf:type": { u: "foaf:Person" }, "foaf:name": { l: "Jane" } } ] }
      DESCRIBE example:John ?knownPerson WHERE { example:John foaf:knows ?knownPerson . }
      mongo$ viewsCol.findOne({_id: {r: "example:John", t: "v_knows"}})
  49. 49. { _id: { r: "example:John", t: "v_knows" }, graphs: [ { _id: "example:John", "foaf:knows": { u: "example:Jane" }, "rdf:type": { u: "foaf:Person" }, "foaf:name": { l: "John" } }, { _id: "example:Jane", "foaf:knows": { u: "example:John" }, "rdf:type": { u: "foaf:Person" }, "foaf:name": { l: "Jane" } } ], _impactIndex: ["example:Jane", "example:John"] }
  50. 50. View specification { "_id":"v_knows", "type":["foaf:Person"], "from":"CBD_people", "joins":{ "foaf:knows":{} } }
  51. 51. More complex example { "_id":"v_resources", "type":["resourcelist:Resource"], "from":"CBD_resources", "joins":{ "dct:partOf":{ "joins": { "bibo:authorList":{ "joins" : { "followSequence":{ "maxJoins":50 } } }, "bibo:editorList":{ "joins" : { "followSequence":{ "maxJoins":50 } } }, "dct:publisher":{} } }, "dct:isPartOf":{ "joins": { "bibo:authorList":{ "joins" : { "followSequence":{ "maxJoins":50 } } }, "bibo:editorList":{ "joins" : { "followSequence":{ "maxJoins":50 } } }, "dct:publisher":{} } }, "bibo:authorList":{ "joins" : { "followSequence":{ "maxJoins":50 } } }, "bibo:editorList":{ "joins" : { "followSequence":{ "maxJoins":50 } } }, "dct:publisher":{} } }
  52. 52. What about tabular data? • We also have tables and table specs • Conceptually the same as views • Instead of an array of graphs we have computed columns for complex tabular queries • You can page, limit, offset results just like you’d expect
  53. 53. { "_id" : { "r" : “http://example.com/users/FC44E153-161C-C199-DBAB-4DDE13F76F9B/bookmarks/1ABE1B4B-A68C-90E4-41DB "type" : "t_user_resources" }, "value" : { "_impactIndex" : [ { "r" : “http://example.com/users/FC44E153-161C-C199-DBAB-4DDE13F76F9B/bookmarks/1ABE1B4B-A68C-90E4 "c" : "tenantContexts:DefaultGraph" }, { "r" : "tenantResources:7AB1D8E3-5D74-D07F-41E7-56206CFEC8EE", "c" : "tenantContexts:DefaultGraph" } ], "collection" : “http://example.com/users/FC44E153-161C-C199-DBAB-4DDE13F76F9B/bookmarks", "createdDate" : "2011-02-08T15:59:45+00:00", "resourceUri" : "tenantResources:7AB1D8E3-5D74-D07F-41E7-56206CFEC8EE", "note" : "ELECTRONIC", "title" : "Feminism & psychology", "type" : [ "resourcelist:Resource", "bibo:Journal" ] } }
  54. 54. Database layout
      talis-rs:PRIMARY> show collections
      CBD_config
      CBD_draft
      CBD_events
      CBD_jobs
      CBD_lists
      CBD_nodes
      CBD_resources
      CBD_reviews
      CBD_service
      CBD_user_lists
      CBD_user_resources
      CBD_users
      table_rows
      views
      CBD_* collections are r/w; table_rows and views are read only
  55. 55. Fast and slow saves, you decide.
  56. 56. Tripod save() • Based on change sets, you supply the old and new graphs • CBDs updated immediately. Write ahead transaction log for multi-CBD writes • Choice per save on whether to update views/tables sync or async (eventually consistent) • Async adds jobs to a Mongo based queue
  57. 57. Measure everything
  58. 58. Query volume complex vs. simple
  59. 59. Query volume graph vs. tabular
  60. 60. Query speed complex vs. simple graph query
  61. 61. Hardware • Real tin, 2x Dell low-end rack mount servers • 96GB RAM, 24 cores • RAID-10 disks, non-SSD • Keep ‘em on the same LAN as your app servers • About the same to lease per month as a couple of c3.4xlarge (30GB, 32vCPU) • We’re about to add a similar second cluster, 144GB
  62. 62. Why Mongo? RTFM, not HN comment feeds. But seriously it could have been n other document DBs
  63. 63. There’s lots more Search, named graphs (quads), data functions
  64. 64. Future roadmap • Multi-cluster <- IN PROGRESS • NodeJS port <- IN PROGRESS • Choose better solution for tlog, probably PostgreSQL • Background queue -> redis and resque • Chainable API • Spout of updates for Apache Storm • Versioned views/tables config
  65. 65. Aperture Annotate your models to persist to graph
  66. 66. Aperture Annotate your models to persist to graph
  67. 67. tripod-php code… …same in aperture
  68. 68. @talis facebook.com/talisgroup +44 (0) 121 374 2740 talis.com info@talis.com 48 Frederick Street Birmingham B1 3HN
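
Slides 30 to 37 group every triple that shares a subject into one document, the Concise Bound Description (CBD), so that DESCRIBE and simple SELECT queries become a single lookup by `_id`. The following is a minimal in-memory sketch of that idea in plain JavaScript, not tripod's own code: a `Map` stands in for the MongoDB collection, and `toCbds`, `describe` and `select` are hypothetical helper names chosen for illustration.

```javascript
// Hypothetical helpers, not part of tripod. A Map stands in for a MongoDB collection.
const triples = [
  { s: "example:John", p: "foaf:knows", o: { u: "example:Jane" } },
  { s: "example:John", p: "rdf:type", o: { u: "foaf:Person" } },
  { s: "example:John", p: "foaf:name", o: { l: "John" } },
  { s: "example:Jane", p: "foaf:knows", o: { u: "example:John" } },
  { s: "example:Jane", p: "rdf:type", o: { u: "foaf:Person" } },
  { s: "example:Jane", p: "foaf:name", o: { l: "Jane" } },
];

// Group triples into one document per subject.
// A predicate that occurs more than once becomes an array of values.
function toCbds(tripleList) {
  const cbds = new Map();
  for (const { s, p, o } of tripleList) {
    if (!cbds.has(s)) cbds.set(s, { _id: s });
    const doc = cbds.get(s);
    doc[p] = doc[p] === undefined ? o : [].concat(doc[p], o);
  }
  return cbds;
}

// DESCRIBE <subject>: one lookup by primary key, no cursor.
function describe(cbds, subject) {
  return cbds.get(subject) ?? null;
}

// SELECT of literal values: project the requested predicates from one document.
// Predicates the document does not have are simply left out of the row.
function select(cbds, subject, predicates) {
  const row = {};
  const doc = describe(cbds, subject);
  if (doc === null) return row;
  for (const p of predicates) {
    const literals = [].concat(doc[p] ?? [])
      .filter((value) => "l" in value)
      .map((value) => value.l);
    if (literals.length > 0) row[p] = literals.length === 1 ? literals[0] : literals;
  }
  return row;
}

const cbds = toCbds(triples);
console.log(describe(cbds, "example:John"));
console.log(select(cbds, "example:John", ["foaf:name", "foaf:age"]));
```

With the triple-per-document layout from slides 38 to 40, the same two queries need a cursor over several documents; here each one touches exactly one document.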

Notes

  • Not just mongodb
    Specific to our circumstances
    YMMV
  • The theory part - remember I’m not a data scientist ;-)
  • Ball and stick diagrams
    The balls are nodes and the sticks are named relationships between the nodes
    This is an undirected graph
  • This is a directed graph
    directional relationship
  • Doesn’t tell us Jane knows John
  • A toolset to work with graph data.
    Directed graphs
  • Values can be other Entities
  • This is a triple
  • The same node can be a subject or an object.
  • In RDF subjects and properties are actually URIs that can be dereferenced

    Here the predicate is part of a public vocabulary called FOAF
    Billions of triples out there on the public internets defined using FOAF

    Namespacing - makes URIs shorter
  • Here it is in ball and stick
  • Yes, you can!
    Data schema only makes sense to you
    Not graph data
    Complex graphs quickly end in renormalisation hell, or many, many follow your nose queries


  • Real data graphs quickly get complicated
  • Really easy to merge datasets from different sources that talk ABOUT THE SAME THING

    Global identifiers via URIs

  • W3 standard
  • SQL-like, to an extent.

    WHERE is Pattern matching, essentially joins

    UNIONS, Geo extensions, etc.

  • 4 main query types

  • We started working on our first application in 2008

    Talis was 3 companies back then. One built a general purpose graph store, part of technical strategy to build on it

    RDF based, integrates other data sources from around the web
  • We did caching for performance. Complicated!

    Data size outgrew our existing general purpose technology stack, became hard to operate

    Complex SPARQL queries expensive on large data sets

    In 2008 even low hundreds of ms from the DB was acceptable (with caching). Today we do 20 queries a page and expect single-digit ms or better performance.

    Our graph store was end-of-lifed
  • 2012 - project to replace generalised triple store with something more specific to our app

    FIND A NEW POD FOR OUR TRIPLES

    It’s a library, currently implemented in PHP with parts ported to Node. Sorry, our apps are PHP.
  • We didn’t consider moving from the graph. You can’t just refactor the whole codebase to relational and flip a switch overnight and expect it to work. This was a moving target.
  • Lots of very simple data

    These can be satisfied very easily and cheaply if you group all the immediate properties of a subject together

    “Concise Bound Description”

  • Earlier example
  • Graph theory concept: CBD
  • Our data model: One document per CBD
  • In more detail
  • _id indexed by default
  • Mega fast queries with single docs returned, no cursors.

    Micro secs on decent hardware.
  • Contrast that to most triple stores, they traditionally model the triple. Cayley being one of them


  • Makes queries expensive. Have to deal with cursors with 1..n documents. Have to pluck values via multiple or complex queries.


  • Gets worse when you want to find matches by value


  • JOINS
  • Typical complex query

    9 “joins”

    Document databases don’t generally like joins. map reduce?!
  • Only thing that changes in this query is the URI

    9 joins in this query = expensive

    In the whole system we probably only had 20 queries that required joins
  • A revelation from the data gods!

    The flexibility of SPARQL is great for the developer but, simply put, hard to scale

    Thousands of hours of optimisation have gone into relational DB query engines over decades

    This is why in the old design we hid everything behind a cache
  • Pre compute all possible answers to the query

    data storage cheap
  • Without pre-computed views

    This is just a single join. Very messy.

    What if John knows 50 people? n+1 queries.
  • We discard and re-generate views at write time

    There’s only about 20 of them in our whole app.
  • Pre-computed typed views

    1 query, ultra fast

  • When we do a save we do a lookup to see which views might be impacted
  • In our config

    Simple config lang with a few keywords

    This means we have to specify queries up front, not send them at run time
  • Most complex in our system. 11 joins!
  • This is a table row from our system

    Instead of graphs, key/value pairs

    Note you can have multi-value cells (type). This was a limitation of SPARQL select for us.
  • CBD collections are read-write for the developer
    table_rows, views read only, tripod driver manages regeneration

    In our system:

    50M distinct CBDs
    34M distinct views
    23M distinct table rows

    roughly 800MB per MT, including indexes, views etc.
  • This brings us nicely onto saves.

    Trade speed with eventual consistency.
  • Mongo doesn’t have transactions. TLog is a separate mongo cluster used to control transactions + rollback. Also allows us to update a nightly backup to the last applied transaction in the case of total data loss.

    TLog is in Mongo but a poor choice. Moving to Postgres.

    Async faster, but not consistent. Depends on situation

    Queue implemented in Mongo, moving to redis + resque (probably)
  • Tripod has a built-in ability to collect stats. We use StatsD + Graphite
  • A lot fewer tabular queries
  • Scale on left is ms.

    This includes database, network to web server and the time marshalling into php objects. This is where the extra time is spent for views!
  • Cost wise cloud just didn’t stack up for us, esp in 2012.

    Like for like, cloud costs about 2x as much as tin today; it was 8x in 2012

    RAM is king here

    We’re adding a second cluster with 144GB shortly

    PaaS is prohibitively expensive at scale. We had a support contract on the first cluster but on the second we’re going it alone.
  • Don’t mention write preference or I will shoot you in the head.

    Looking for a document database not a graph database

    Evaluated Couch, Riak and Postgres
    - Couchbase was a new product, just merged with Membase. Felt risky, although map/reduce queries fitted well with views/tables
    - Riak. Features we liked were commercially licensed
    - Postgres - JSON datatype was primitive at the time, worth a second look today tho

    ServerDensity David Swiss Army Knife NoSQL

    Community
    - commercial
    - & developer

    Friendly API

    Ultimately not bound to it - swapping out parts as we scale

    There’s a lot of shit written about Mongo. Don’t read HN. Instead RTFM.
  • But not enough time
  • Finally, before we go, a sneak peek of a project we’ve been working on to hide the graph entirely…
  • Our app already worked natively with graphs

    But the model in most apps is not a graph

    Aperture is our new project built on top of tripod allowing you to hide the complexity of graphs

    Plain old PHP object
  • Simply annotating it will
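
The view specification on slide 50 and the pre-computed view documents on slides 48 and 49 can be tied together with a small sketch. It is written in plain JavaScript under stated assumptions and is not tripod's implementation: a `Map` stands in for the `CBD_people` collection, `generateView` and `followJoins` are hypothetical names, and keywords such as `followSequence` and `maxJoins` from the more complex specification are ignored.

```javascript
// Illustrative only: a Map stands in for the CBD_people collection.
const cbdPeople = new Map([
  ["example:John", {
    _id: "example:John",
    "foaf:knows": { u: "example:Jane" },
    "rdf:type": { u: "foaf:Person" },
    "foaf:name": { l: "John" },
  }],
  ["example:Jane", {
    _id: "example:Jane",
    "foaf:knows": { u: "example:John" },
    "rdf:type": { u: "foaf:Person" },
    "foaf:name": { l: "Jane" },
  }],
]);

// The view specification from slide 50.
const vKnows = {
  _id: "v_knows",
  type: ["foaf:Person"],
  from: "CBD_people",
  joins: { "foaf:knows": {} },
};

// Follow each join predicate from doc, collecting every CBD reached.
// The seen map also stops cycles such as John knows Jane knows John.
function followJoins(collection, doc, joins, seen) {
  for (const [predicate, rule] of Object.entries(joins)) {
    for (const value of [].concat(doc[predicate] ?? [])) {
      if (!("u" in value) || seen.has(value.u)) continue;
      const target = collection.get(value.u);
      if (target === undefined) continue;
      seen.set(value.u, target);
      followJoins(collection, target, rule.joins ?? {}, seen);
    }
  }
}

// Build one view document for one root resource, or null if the root
// is missing or is not of a type the specification applies to.
function generateView(collection, spec, rootId) {
  const root = collection.get(rootId);
  if (root === undefined) return null;
  const rootTypes = [].concat(root["rdf:type"] ?? []).map((value) => value.u);
  if (!spec.type.some((type) => rootTypes.includes(type))) return null;
  const seen = new Map([[rootId, root]]);
  followJoins(collection, root, spec.joins ?? {}, seen);
  return {
    _id: { r: rootId, t: spec._id },
    graphs: [...seen.values()],
    _impactIndex: [...seen.keys()].sort(),
  };
}

console.log(JSON.stringify(generateView(cbdPeople, vKnows, "example:John"), null, 2));
```

Reading the result back is then one lookup on the compound `_id`, which is the point of paying for the joins at write time rather than on every read.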
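
The notes on saves say that a save looks up which views might be impacted, and that each save can regenerate them synchronously or push the work onto a queue. A rough JavaScript sketch of that flow follows. It simplifies heavily: the real `save()` works on change sets of old and new graphs and uses a transaction log, neither of which is modelled here, and `saveCbd`, `impactedViews`, `viewStore` and `jobQueue` are hypothetical names with arrays and a `Map` standing in for collections.

```javascript
// Illustrative only: in-memory stand-ins for the CBD collection, the views and the job queue.
const cbdStore = new Map();
const viewStore = [
  { _id: { r: "example:John", t: "v_knows" }, _impactIndex: ["example:Jane", "example:John"] },
  { _id: { r: "example:Mary", t: "v_knows" }, _impactIndex: ["example:Mary"] },
];
const jobQueue = [];

// Look up the ids of views whose impact index mentions the changed subject.
function impactedViews(views, subject) {
  return views
    .filter((view) => view._impactIndex.includes(subject))
    .map((view) => view._id);
}

// Write the CBD immediately, then either regenerate impacted views now (sync)
// or queue a job per view for later (async, eventually consistent).
// Returns the number of impacted views.
function saveCbd(doc, { sync = true, regenerate }) {
  cbdStore.set(doc._id, doc);
  const impacted = impactedViews(viewStore, doc._id);
  for (const viewId of impacted) {
    if (sync) regenerate(viewId);
    else jobQueue.push({ type: "regenerateView", viewId });
  }
  return impacted.length;
}

const regenerated = [];
const record = (viewId) => regenerated.push(viewId);

// Jane changes: John's v_knows view contains her, so it is regenerated straight away.
saveCbd({ _id: "example:Jane", "foaf:name": { l: "Janet" } }, { regenerate: record });

// Mary changes with sync switched off: the regeneration is queued instead.
saveCbd({ _id: "example:Mary", "foaf:name": { l: "Mary" } }, { sync: false, regenerate: record });

console.log(regenerated, jobQueue);
```

The trade-off is the one the notes describe: the synchronous path keeps views consistent at the cost of a slower save, while the queued path returns quickly and leaves views briefly stale.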