Using MongoDB as a graph database - 2014 redux

Using MongoDB as a Graph Database
Chris Clarke
NoSQL Birmingham
16th October 2014

Graphs 101
For the uninitiated

John knows Jane
Jane knows John
John knows Jane

John knows Jane
Jane ? John
John knows Jane

John knows Jane
Jane knows John
[Diagram: John and Jane joined by a "knows" edge in each direction]

Entity Property Value
John knows Jane

Subject Predicate Object
John knows Jane

John knows Jane
Jane knows John

http://example.com/John foaf:knows http://example.com/Jane
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

http://example.com/John foaf:knows http://example.com/Jane
http://example.com/John foaf:name “John”
http://example.com/John rdf:type foaf:Person
http://example.com/Jane foaf:name “Jane”
http://example.com/Jane rdf:type foaf:Person
http://example.com/Jane foaf:knows http://example.com/John
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

[Diagram: example:John and example:Jane are each rdf:type foaf:Person, are linked by foaf:knows in both directions, and have foaf:name “John” and “Jane” respectively]

“WTF! Surely this is easier in JSON!”
– Jack Fullstack

> db.people.find()
{
  _id: ObjectId('123'),
  name: 'John',
  knows: [ObjectId('456')]
},
{
  _id: ObjectId('456'),
  name: 'Jane',
  knows: [ObjectId('123')]
}

Dataset A: example:John foaf:name “John”
Dataset B: example:John foaf:age 24

Dataset A+B: example:John foaf:name “John”; example:John foaf:age 24
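
Because both datasets name John with the same identifier, merging them amounts to a set union of their triples. A minimal sketch in plain JavaScript, assuming triples are held as [subject, predicate, object] arrays; mergeGraphs is a hypothetical helper for illustration, not part of any library:

```javascript
// Hypothetical helper: merge two graphs held as arrays of
// [subject, predicate, object] triples, dropping exact duplicates.
function mergeGraphs(a, b) {
  const seen = new Set();
  const merged = [];
  for (const triple of a.concat(b)) {
    const key = JSON.stringify(triple);
    if (!seen.has(key)) {
      seen.add(key);
      merged.push(triple);
    }
  }
  return merged;
}

const datasetA = [["example:John", "foaf:name", "John"]];
const datasetB = [["example:John", "foaf:age", 24]];
const merged = mergeGraphs(datasetA, datasetB);
```

Merging a dataset with itself leaves it unchanged, which is what makes this kind of merge safe to repeat.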

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?email
WHERE {
?person a foaf:Person.
?person foaf:name ?name.
?person foaf:mbox ?email.
}
ORDER BY ?name
LIMIT 50

Query form: result type
CONSTRUCT: Graph
DESCRIBE: Graph
SELECT: Tabular
ASK: Boolean

Graphs and Talis
A bit of history

Over time…
• Our apps became popular. Last week they averaged 4M
requests per day, peaking at 600k+ per hour
• Our dataset is growing in size: about 350M triples
this week
• Our apps needed more queries, and more expensive
queries
• Our in-house triple store was end-of-life and out of date

Project Tripod
http://github.com/talis/tripod-php
http://github.com/talis/tripod-node

System characteristics
• 99:1 read:write
• Well shared, tenant-based system. Our largest
single customer has 35M triples
• Graph data structures and operations (merges, sub-graphs
etc.) well entrenched in the codebase, over
2M lines of code (inc. libraries)
• Actually not that many distinct query shapes

Simple queries, and how they influenced our core data model

DESCRIBE <http://example.com/John>
Give me all the triples about John as a graph
SELECT ?name ?age
WHERE {
<http://example.com/John> <foaf:name> ?name .
<http://example.com/John> <foaf:age> ?age .
}
Give me properties name, age of John as tabular data

Subject Predicate Object
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

Concise Bound Description of http://example.com/John
Concise Bound Description of http://example.com/Jane

Concise Bound Description of http://example.com/John
{
  _id: "example:John",
  "foaf:knows": { u: "example:Jane" },
  "rdf:type": { u: "foaf:Person" },
  "foaf:name": { l: "John" }
}

_id is the unique primary key. There can only be one John.
l means the value is a literal text value.
u means the value is a URI, i.e. another node.
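
To make the mapping concrete, here is a rough sketch of how a list of triples could be grouped into documents of the shape shown above, one per subject. The triplesToCbds helper and its {s, p, o, isUri} input format are made up for illustration, and storing repeated predicates as an array of values is an assumption rather than something these slides confirm:

```javascript
// Hypothetical helper: group triples into one document per subject.
// Each input triple is {s, p, o, isUri}.
// Assumption: a predicate with several values becomes an array of values.
function triplesToCbds(triples) {
  const docs = new Map();
  for (const { s, p, o, isUri } of triples) {
    if (!docs.has(s)) docs.set(s, { _id: s });
    const doc = docs.get(s);
    const value = isUri ? { u: o } : { l: o };
    if (doc[p] === undefined) {
      doc[p] = value;
    } else {
      // Second or later value for this predicate: promote to an array.
      doc[p] = [].concat(doc[p], value);
    }
  }
  return Array.from(docs.values());
}

const cbds = triplesToCbds([
  { s: "example:John", p: "foaf:knows", o: "example:Jane", isUri: true },
  { s: "example:John", p: "rdf:type", o: "foaf:Person", isUri: true },
  { s: "example:John", p: "foaf:name", o: "John", isUri: false },
]);
```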

DESCRIBE <http://example.com/John>
mongo$ col.findOne({_id: "example:John"});
SELECT ?name ?age
WHERE {
<http://example.com/John> <foaf:name> ?name .
<http://example.com/John> <foaf:age> ?age .
}
mongo$ col.findOne({_id: "example:John"}, {"foaf:name.l": 1, "foaf:age.l": 1});
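
The projection in the second findOne() can be derived mechanically from the predicates a SELECT asks for. A small sketch, assuming every selected value is a literal stored under the l key; literalProjection and toRow are hypothetical helpers and the sample document is illustrative:

```javascript
// Hypothetical helper: build the findOne() projection for a list of
// predicates, assuming the selected values are all literals (the "l" key).
function literalProjection(predicates) {
  const projection = {};
  for (const p of predicates) projection[p + ".l"] = 1;
  return projection;
}

// Hypothetical helper: flatten a document into a row of literal values,
// like one row of a SELECT result. Missing predicates become null.
function toRow(doc, predicates) {
  const row = {};
  for (const p of predicates) {
    row[p] = doc[p] && doc[p].l !== undefined ? doc[p].l : null;
  }
  return row;
}

// Sample document for illustration only.
const johnDoc = { _id: "example:John", "foaf:name": { l: "John" }, "foaf:age": { l: "24" } };
const projection = literalProjection(["foaf:name", "foaf:age"]);
const row = toRow(johnDoc, ["foaf:name", "foaf:age"]);
```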

{ s: "example:John", p: "foaf:knows", o: { u: "example:Jane" } },
{ s: "example:John", p: "rdf:type", o: { u: "foaf:Person" } },
{ s: "example:John", p: "foaf:name", o: { l: "John" } },

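DESCRIBE <http://example.com/John>
mongo$ var s = col.find({s: "example:John"});
mongo$ while (s.hasNext()) {
  addToGraph(s.next());
}
SELECT ?name ?age
WHERE {
<http://example.com/John> <foaf:name> ?name .
<http://example.com/John> <foaf:age> ?age .
}
mongo$ col.find({s: "example:John", p: "foaf:name"}, {"o": 1});
mongo$ col.find({s: "example:John", p: "foaf:age"}, {"o": 1});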

DESCRIBE ?person WHERE { ?person <foaf:name> “John” . }
mongo$ var s = col.find({p: "foaf:name", "o.l": "John"}); // BasicCursor = slow
DESCRIBE ?person WHERE { ?person <foaf:name> “John” . }
mongo$ col.ensureIndex({"foaf:name.l": 1});
mongo$ var s = col.find({"foaf:name.l": "John"}); // BTreeCursor = fast

DESCRIBE <http://example.com/foo> ?sectionOrItem ?resource ?document ?authorList
?author ?usedBy ?creator ?libraryNote ?publisher
WHERE
{
OPTIONAL
{
<http://example.com/foo> resource:contains ?sectionOrItem .
OPTIONAL
{
?sectionOrItem resource:resource ?resource .
OPTIONAL { ?resource dcterms:isPartOf ?document . }
OPTIONAL
{
?resource bibo:authorList ?authorList .
OPTIONAL { ?authorList ?p ?author . }
}
OPTIONAL { ?resource dcterms:publisher ?publisher . }
}
OPTIONAL { ?libraryNote bibo:annotates ?sectionOrItem }
} .
OPTIONAL { <http://example.com/foo> resource:usedBy ?usedBy } .
OPTIONAL { <http://example.com/foo> sioc:has_creator ?creator }
}

“We don’t need dynamic queries”
– Project Tripod Team, sometime 2012

Precomputed views
Remember those from the RDBMS?

{
  _id: "example:John",
  "foaf:knows": { u: "example:Jane" },
  "foaf:name": { l: "John" }
}
{
  _id: "example:Jane",
  "foaf:knows": { u: "example:John" },
  "foaf:name": { l: "Jane" }
}
DESCRIBE example:John ?knownPerson
WHERE { example:John foaf:knows ?knownPerson . }
mongo$ var john = col.findOne({_id: "example:John"});
// foaf:knows may hold one {u: ...} value or (assumed) an array of them
var known = [].concat(john["foaf:knows"]);
for (var i = 0; i < known.length; i++) {
  var knownPerson = col.findOne({_id: known[i].u});
}

System characteristics
• 99:1 read:write
• Well shared, tenant-based system. Our largest
single customer has 35M triples
• Graph data structures and operations (merges, sub-graphs
etc.) well entrenched in the codebase, over
2M lines of code (inc. libraries)
• Actually not that many distinct query shapes

{
  _id: { r: "example:John", t: "v_knows" },
  graphs: [{
  },
  {
  }]
}
DESCRIBE example:John ?knownPerson
WHERE { example:John foaf:knows ?knownPerson . }
mongo$ viewsCol.findOne({_id: {r: "example:John", t: "v_knows"}})

{
  _id: { r: "example:John", t: "v_knows" },
  graphs: [{
  },
  {
  }],
  _impactIndex: ["example:Jane", "example:John"]
}
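
The _impactIndex lists every resource whose data ended up in the view, presumably so that a change to any one of them can be traced back to the views that need regenerating. A minimal in-memory sketch of that lookup; viewsImpactedBy is a hypothetical helper and example:Bob is made-up sample data:

```javascript
// In-memory stand-in for the views collection (sample data).
const sampleViews = [
  { _id: { r: "example:John", t: "v_knows" }, _impactIndex: ["example:Jane", "example:John"] },
  { _id: { r: "example:Bob", t: "v_knows" }, _impactIndex: ["example:Bob"] },
];

// Hypothetical helper: which views contain data from the changed resource?
// Against MongoDB the equivalent lookup would be
// viewsCol.find({_impactIndex: changedId}), which matches array elements.
function viewsImpactedBy(views, changedId) {
  return views.filter((v) => v._impactIndex.includes(changedId));
}

const impacted = viewsImpactedBy(sampleViews, "example:Jane").map((v) => v._id.r);
```

Here a change to Jane touches only John's view, so only that one would need rebuilding.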

View specification
{
  "_id":"v_knows",
  "type":["foaf:Person"],
  "from":"CBD_people",
  "joins":{
    "foaf:knows":{}
  }
}
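
One way to read a specification like this: start at a resource, follow each predicate named under "joins" to the resources it points at, and collect every document visited. The sketch below does that against an in-memory Map. It is an illustration under assumptions, not the tripod-php implementation: buildView is a hypothetical name, and it ignores "type", "from", "followSequence" and "maxJoins".

```javascript
// Hypothetical view builder.
// cbds: Map of _id -> document; spec: a view specification as above.
function buildView(spec, resourceId, cbds) {
  const graphs = [];
  const seen = new Set();

  function follow(id, joins) {
    const doc = cbds.get(id);
    if (!doc || seen.has(id)) return; // unknown or already collected
    seen.add(id);
    graphs.push(doc);
    for (const predicate of Object.keys(joins || {})) {
      // A predicate may hold one {u: ...} value or an array of them (assumption).
      const values = doc[predicate] === undefined ? [] : [].concat(doc[predicate]);
      for (const value of values) {
        if (value.u) follow(value.u, joins[predicate].joins);
      }
    }
  }

  follow(resourceId, spec.joins);
  return {
    _id: { r: resourceId, t: spec._id },
    graphs: graphs,
    _impactIndex: Array.from(seen),
  };
}

const people = new Map([
  ["example:John", { _id: "example:John", "foaf:knows": { u: "example:Jane" }, "foaf:name": { l: "John" } }],
  ["example:Jane", { _id: "example:Jane", "foaf:knows": { u: "example:John" }, "foaf:name": { l: "Jane" } }],
]);
const vKnows = { _id: "v_knows", type: ["foaf:Person"], from: "CBD_people", joins: { "foaf:knows": {} } };
const view = buildView(vKnows, "example:John", people);
```

The join stops at Jane because the spec gives foaf:knows no nested joins, so the cycle back to John is never followed.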

More complex example
{
  "_id":"v_resources",
  "type":["resourcelist:Resource"],
  "from":"CBD_resources",
  "joins":{
    "dct:partOf":{
      "joins": {
        "bibo:authorList":{
          "joins" : {
            "followSequence":{
              "maxJoins":50
            }
          }
        },
        "bibo:editorList":{
          "joins" : {
            "followSequence":{
              "maxJoins":50
            }
          }
        },
        "dct:publisher":{}
      }
    },
    "dct:isPartOf":{
      "joins": {
        "bibo:authorList":{
          "joins" : {
            "followSequence":{
              "maxJoins":50
            }
          }
        },
        "bibo:editorList":{
          "joins" : {
            "followSequence":{
              "maxJoins":50
            }
          }
        },
        "dct:publisher":{}
      }
    },
    "bibo:authorList":{
      "joins" : {
        "followSequence":{
          "maxJoins":50
        }
      }
    },
    "bibo:editorList":{
      "joins" : {
        "followSequence":{
          "maxJoins":50
        }
      }
    },
    "dct:publisher":{}
  }
}

What about tabular data?
• We also have tables and table specs
• Conceptually the same as views
• Instead of an array of graphs we have computed
columns for complex tabular queries
• You can page, limit, offset results just like you’d
expect

{
"_id" : {
"r" : “http://example.com/users/FC44E153-161C-C199-DBAB-4DDE13F76F9B/bookmarks/1ABE1B4B-A68C-90E4-41DB
"type" : "t_user_resources"
},
"value" : {
"_impactIndex" : [
{
"r" : “http://example.com/users/FC44E153-161C-C199-DBAB-4DDE13F76F9B/bookmarks/1ABE1B4B-A68C-90E4
"c" : "tenantContexts:DefaultGraph"
},
{
"r" : "tenantResources:7AB1D8E3-5D74-D07F-41E7-56206CFEC8EE",
"c" : "tenantContexts:DefaultGraph"
}
],
"collection" : "http://example.com/users/FC44E153-161C-C199-DBAB-4DDE13F76F9B/bookmarks",
"createdDate" : "2011-02-08T15:59:45+00:00",
"resourceUri" : "tenantResources:7AB1D8E3-5D74-D07F-41E7-56206CFEC8EE",
"note" : "ELECTRONIC",
"title" : "Feminism & psychology",
"type" : [
"resourcelist:Resource",
"bibo:Journal"
]
}
}

Database layout
talis-rs:PRIMARY> show collections
CBD_config
CBD_draft
CBD_events
CBD_jobs
CBD_lists
CBD_nodes
CBD_resources
CBD_reviews
CBD_service
CBD_user_lists
CBD_user_resources
CBD_users
table_rows
views
(CBD_* collections: r/w; table_rows and views: read only)

Fast and slow saves, you decide.

Tripod save()
• Based on change sets: you supply the old and new
graphs
• CBDs are updated immediately, with a write-ahead
transaction log for multi-CBD writes
• Choice per save on whether to update views/tables
sync or async (eventually consistent)
• Async adds jobs to a Mongo-based queue
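
As a rough illustration of the change-set idea, the old and new graphs can be diffed into the triples to remove and the triples to add. This is a sketch under assumptions, not Tripod's actual save() code: diffGraphs is a hypothetical helper and graphs are held as [subject, predicate, object] arrays.

```javascript
// Hypothetical helper: diff two graphs held as arrays of
// [subject, predicate, object] triples into a change set.
function diffGraphs(oldGraph, newGraph) {
  const key = (t) => JSON.stringify(t);
  const oldKeys = new Set(oldGraph.map(key));
  const newKeys = new Set(newGraph.map(key));
  return {
    removals: oldGraph.filter((t) => !newKeys.has(key(t))),
    additions: newGraph.filter((t) => !oldKeys.has(key(t))),
  };
}

// Sample data: John's age changes, his name does not.
const graphBefore = [
  ["example:John", "foaf:name", "John"],
  ["example:John", "foaf:age", 24],
];
const graphAfter = [
  ["example:John", "foaf:name", "John"],
  ["example:John", "foaf:age", 25],
];
const changeSet = diffGraphs(graphBefore, graphAfter);
```

Only the subjects that appear in the change set need their documents rewritten; unchanged triples are left alone.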

[Chart: Query volume, complex vs. simple]

[Chart: Query volume, graph vs. tabular]

[Chart: Query speed, complex vs. simple graph query]

Hardware
• Real tin, 2x Dell low-end rack mount servers
• 96GB RAM, 24 cores
• RAID-10 disks, non-SSD
• Keep ‘em on the same LAN as your app servers
• About the same to lease per month as a couple of
c3.4xlarge (30GB, 32vCPU)
• We’re about to add a similar second cluster, 144GB

Why Mongo?
RTFM, not HN comment feeds.
But seriously, it could have been any of n other document DBs

There’s lots more
Search, named graphs (quads), data functions

Future roadmap
• Multi-cluster <- IN PROGRESS
• NodeJS port <- IN PROGRESS
• Choose better solution for tlog, probably PostgreSQL
• Background queue -> redis and resque
• Chainable API
• Spout of updates for Apache Storm
• Versioned views/tables config

Aperture
Annotate your models to persist to graph

tripod-php code…
…same in aperture

@talis
facebook.com/talisgroup
+44 (0) 121 374 2740
talis.com
info@talis.com
48 Frederick Street
Birmingham
B1 3HN
