2. We’re going to talk about
MongoDB Intro & Fundamentals
MongoDB for Genealogy data
Scaling MongoDB for all the generations
The Family Tree
Storing a graph in MongoDB
3. Steve @sp
A
15+ years building the internet
Father, husband, skateboarder, genealogist at ❤
Chief Solutions Architect @
responsible for drivers, integrations, web & docs
4. Company behind MongoDB
Offices in NYC, Palo Alto, London & Dublin
100+ employees
Support, consulting, training
Mgt: Google/DoubleClick, Oracle, Apple, NetApp, Mark Logic
Well Funded: Sequoia, Union Square, Flybridge
15. Cell Phones in 2012
Dual core 1.5 GHz
802.11n (300+ Mbps)
1 GB RAM
64 GB Solid State
16. MongoDB
Application
Document Oriented
{ author : "steve",
  date : new Date(),
  text : "About MongoDB...",
  tags : ["tech", "database"] }
High Performance
Fully Consistent
Horizontally Scalable
17. MongoDB philosophy
Keep functionality when we can (key/value stores are great, but we need more)
Non-relational (no joins) makes scaling horizontally practical
Document data models are good
Database technology should run anywhere: virtualized, cloud, metal, etc.
18. Under the hood
Written in C++
Runs nearly everywhere
Data serialized to BSON
Extensive use of memory-mapped files, i.e. read-through, write-through memory caching.
20. “MongoDB has the best features of key/value stores, document databases and relational databases in one.”
John Nunemaker
21. Relational made normalized data look like this
Category
• Name
• Url
User
• Name
• Email Address
Article
• Name
• Slug
• Publish date
• Text
Tag
• Name
• Url
Comment
• Comment
• Date
• Author
22. Document databases make normalized data look like this
Article
• Name
• Slug
• Publish date
• Text
• Author
User
• Name
• Email Address
Comment[]
• Comment
• Date
• Author
Tag[]
• Value
Category[]
• Value
23. But we’ve been using
a relational database
for 40 years!
26. Each document type in its own drawer
MRIs X-rays Lab Invoices Index
History Medications Lab Forms
29. 2. Group related records
Patient 1 Patient 2 Patient 3 ...
Vendor 1 Vendor 2 Vendor 3
31. Databases work the same way
Patient 1 Vendor 1
Relational
Category
• Name
• Url
User
• Name
• Email Address
Article
• Name
• Slug
• Publish date
• Text
Tag
• Name
• Url
Comment
• Comment
• Date
• Author
Document
Article
• Name
• Slug
• Publish date
• Text
• Author
User
• Name
• Email Address
Comment[]
• Comment
• Date
• Author
Tag[]
• Value
Category[]
• Value
33. Why MongoDB
My Top 10 Reasons
10. Great developer experience
9. Speaks your language
8. Scale horizontally
7. Fully consistent data w/atomic operations
6. Memory caching integrated
5. Open source
4. Flexible, rich & structured data format not just K:V
3. Ludicrously fast (without going plaid)
2. Simplify infrastructure & application
1. It’s web scale
36. CMS / Blog
Needs:
• Business needed modern data store for rapid development and scale
Solution:
• Use PHP & MongoDB
Results:
• Real time statistics
• All data, images, etc. stored together: easy access, easy deployment, easy high availability
• No need for complex migrations
• Enabled very rapid development and growth
37. Photo Meta-Data
Problem:
• Business needed more flexibility than Oracle could deliver
Solution:
• Use MongoDB instead of Oracle
Results:
• Developed application in one sprint cycle
• 500% cost reduction compared to Oracle
• 900% performance improvement compared to Oracle
38. Customer Analytics
Problem:
• Deal with massive data volume across all customer sites
Solution:
• Use MongoDB to replace Google Analytics / Omniture options
Results:
• Less than one week to build prototype and prove business case
• Rapid deployment of new features
39. Archiving
Why MongoDB:
• Existing application built on MySQL
• Lots of friction with RDBMS based archive storage
• Needed more scalable archive storage backend
Solution:
• Keep MySQL for active data (100 million)
• MongoDB for archive (2+ billion)
Results:
• No more ALTER TABLE statements taking over 2 months to run
• Sharding fixed vertical scale problem
• Very happily looking at other places to use MongoDB
40. Online Dictionary
Problem:
• MySQL could not scale to handle their 5B+ documents
Solution:
• Switched from MySQL to MongoDB
Results:
• Massive simplification of code base
• Eliminated need for external caching system
• 20x performance improvement over MySQL
41. E-commerce
Problem:
• Multi-vertical e-commerce impossible to model (efficiently) in RDBMS
Solution:
• Switched from MySQL to MongoDB
Results:
• Massive simplification of code base
• Rapid builds, halving time to market (and cost)
• Eliminated need for external caching system
• 50x+ performance improvement over MySQL
42. Tons more
MongoDB casts a wide net
people keep coming up with new and brilliant ways to use it
50. A More Complex Document
place1 = {
    name : "10gen HQ",
    address : "578 Broadway 7th Floor",
    city : "New York",
    zip : "10011",
    tags : [ "business", "awesome" ],
    latlong : [40.0, 72.0],
    tips : [ { user : "ryan",
               time : new Date("2011-06-26"),
               tip : "stop by for office hours" }
             // ... more tips
           ]
}
61. A better network FS
GridFS files are seamlessly sharded & replicated.
No OS constraints...
No file size limits
No naming constraints
No folder limits
Standard across different OSs
MongoDB automatically generates the MD5 hash of the file
63. Types of genealogy data
Events (birth, death, etc.)
Photographs
Diaries & letters
Official records
Ship passenger list
Census
Occupation
Names
Relationships
and more
64. Challenges of genealogy data
Lots of possible data points... need flexible schema
Multiple versions of same data point (3 different dates for death date, 4 variations on name).
Data related to records
Multiple versions of same nodes (intelligent nondestructive merge needed)
Need to have meta data associated
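One way to picture the "multiple versions" and "nondestructive merge" points: keep every asserted value in an array together with its source, and merge two records for the same person by taking the union of assertions instead of overwriting. This is only a sketch in plain JavaScript. The field names (`names`, `deathDates`, `value`, `source`), the sample person and the `mergeIndividuals` helper are all hypothetical and not from the talk.

```javascript
// Hypothetical shape: each data point is an array of { value, source } assertions.
const recordA = {
  names: [{ value: "Jane Doe", source: "census" }],
  deathDates: [{ value: "1902", source: "census" }],
};
const recordB = {
  names: [
    { value: "Jane Doe", source: "census" },
    { value: "Jane A. Doe", source: "parish register" },
  ],
  deathDates: [{ value: "08 JUN 1902", source: "parish register" }],
};

// Union the assertions of two records without discarding or overwriting any.
function mergeIndividuals(a, b) {
  const merged = {};
  const fields = new Set([...Object.keys(a), ...Object.keys(b)]);
  for (const field of fields) {
    const seen = new Set();
    merged[field] = [];
    for (const assertion of [...(a[field] || []), ...(b[field] || [])]) {
      const key = JSON.stringify([assertion.value, assertion.source]);
      if (!seen.has(key)) { // drop only exact duplicates
        seen.add(key);
        merged[field].push(assertion);
      }
    }
  }
  return merged;
}

const merged = mergeIndividuals(recordA, recordB);
console.log(merged.names.length, merged.deathDates.length); // 2 2
```

Both death dates and both name variants survive the merge, and the input records are left untouched, so a later reviewer can still decide which assertion to trust.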
66. Recognize
0 @I2@ INDI
1 NAME Charles Phillip /Ingalls/
1 SEX M
1 BIRT
2 DATE 10 JAN 1836
2 PLAC Cuba, Allegheny, NY
1 DEAT
2 DATE 08 JUN 1902
2 PLAC De Smet, Kingsbury, Dakota Territory
1 FAMC @F2@
1 FAMS @F3@
0 @I3@ INDI
1 NAME Caroline Lake /Quiner/
1 SEX F
1 BIRT
2 DATE 12 DEC 1839
67. GEDCOM
File format, not a database
Handles the great variety of data well
Doesn’t really scale beyond a local user.
Doesn’t provide a good mechanism for storing external documents (birth certificates, etc.).
Built to solve problem of sharing data
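Because GEDCOM is a line-based format with a level number on every line, it maps naturally onto nested documents. The sketch below is a rough illustration in plain JavaScript, not a complete GEDCOM reader: `parseGedcom` is a hypothetical helper that only understands lines shaped like `LEVEL [@XREF@] TAG [VALUE]` and ignores continuation tags, character sets and everything else in the specification.

```javascript
// Parse GEDCOM lines of the form "LEVEL [@XREF@] TAG [VALUE]" into nested objects.
function parseGedcom(text) {
  const records = [];
  const stack = []; // stack[level] is the most recent node seen at that level
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (!line) continue;
    const m = line.match(/^(\d+)\s+(?:(@[^@]+@)\s+)?(\S+)(?:\s+(.*))?$/);
    if (!m) continue; // skip lines this sketch does not understand
    const level = Number(m[1]);
    const node = { tag: m[3], children: [] };
    if (m[2]) node.xref = m[2];
    if (m[4] !== undefined) node.value = m[4];
    if (level === 0) {
      records.push(node);
    } else if (stack[level - 1]) {
      stack[level - 1].children.push(node);
    }
    stack[level] = node;
    stack.length = level + 1; // forget deeper levels left over from earlier lines
  }
  return records;
}

const sample = [
  "0 @I2@ INDI",
  "1 NAME Charles Phillip /Ingalls/",
  "1 SEX M",
  "1 BIRT",
  "2 DATE 10 JAN 1836",
  "2 PLAC Cuba, Allegheny, NY",
  "1 FAMC @F2@",
].join("\n");

const [person] = parseGedcom(sample);
const birth = person.children.find((c) => c.tag === "BIRT");
console.log(person.xref, birth.children.map((c) => `${c.tag}=${c.value}`));
```

Each level-0 record becomes one object that could be stored as a single document, with events such as `BIRT` nested inside it.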
68. Genealogy & MongoDB
Genealogy is anything but rigid and fixed
Flexible schema fits genealogy data well
Packaging things together makes sense
Relating records doesn’t require a relational database
69. Individual
• AFN
• Modification Date
Name
• First[]
• Middle[]
• Last[]
Events[]
• type
• date
• contributor[]
• record[]
Location
• city
• state
• county
• country
70. Individual
• AFN
• Modification Date
Name
• First[]
• Middle[]
• Last[]
Events[]
• type
• date
• contributor[]
• record[]
Location
• city
• state
• county
• country
• coordinates[]
User
• Name
• Email Address
• Password
• Individual_id
Record
• contributor
• type
• thumbnail
• content
• description
• tags[]
87. It’s not a tree at all,
It’s really a graph
... and an odd one at that
88. It would be easy if it
always looked like this
90. All sorts of mess
Step & adopted relationships
Duplicate nodes
Lots of missing nodes
Divorces and re-marriages
Multiple names for the same person
Multiple dates for the same event
94. Trees / graphs in MongoDB
Since MongoDB data structures are essentially objects, there is a good degree of flexibility here.
Think of how you would structure them in your application
95. Trees / graphs in MongoDB
Each node is stored as a document
Contains references to related nodes
What is “related” depends on your application
96. References vs
Relation
MongoDB uses references
Unlike foreign keys, references don’t enforce integrity
A reference is really just a reference
For many applications a reference is sufficient
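Since nothing enforces that a referenced document exists, an application that cares about integrity has to check it itself. A minimal sketch, assuming documents that carry a `parents` array of ids and using an in-memory array in place of a collection; `findDanglingReferences` is a hypothetical helper, not a MongoDB feature.

```javascript
// In-memory stand-in for a collection; "x" is referenced but never stored.
const family = [
  { _id: "a" },
  { _id: "b" },
  { _id: "e", parents: ["a", "b"] },
  { _id: "g", parents: ["e", "x"] },
];

// Return every { from, to } reference whose target document does not exist.
function findDanglingReferences(docs) {
  const ids = new Set(docs.map((d) => d._id));
  const dangling = [];
  for (const doc of docs) {
    for (const parentId of doc.parents || []) {
      if (!ids.has(parentId)) dangling.push({ from: doc._id, to: parentId });
    }
  }
  return dangling;
}

const dangling = findDanglingReferences(family);
console.log(dangling); // [ { from: 'g', to: 'x' } ]
```

For genealogy data a dangling reference is not necessarily an error: it can simply mean a known relative whose record has not been entered yet.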
97. Simple relationship
{ _id: "a" } { _id: "b" } { _id: "c" } { _id: "d" }
{ _id: "e", parents: ["a", "b" ]}
{ _id: "f", parents: ["c", "d" ]}
{ _id: "g", parents: ["e", "f" ]}
• Easy to access nodes in either direction
• Good for trees / graphs
• Can grab large sets
• Minimum amount of maintenance
• Balanced
• Implied relationships
//find all direct descendants of b:
> db.family.find({ parents : 'b' })
//find all descendants of b:
var b = db.family.find({ _id : 'b' }).toArray();
var descendantsFind = function(par) {
    var rv = [];
    for (var i in par) {
        var k = db.family.find({ parents : par[i]._id }).toArray();
        rv = rv.concat(k);
        rv = rv.concat(descendantsFind(k));
    }
    return rv;
}
descendantsFind(b);
//find all ancestors of g:
var g = db.family.findOne({ _id : 'g' });
var ancestorFind = function(child) {
    var rv = [];
    if ( ! ("parents" in child) ) { return rv; }
    var parents = db.family.find({ _id : { $in : child.parents } }).toArray();
    rv = rv.concat(parents);
    for (var i in parents) {
        rv = rv.concat(ancestorFind(parents[i]));
    }
    return rv;
}
ancestorFind(g);
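The same two traversals can be tried without a database. The sketch below swaps the `db.family` queries for an in-memory `Map` (an assumption made so the code runs standalone) and adds a `seen` set, which the slide's version does not have, so that a person reachable along two paths is only visited once. `ancestorIds` and `descendantIds` are hypothetical names.

```javascript
// In-memory stand-in for the family collection from the slide.
const family = [
  { _id: "a" }, { _id: "b" }, { _id: "c" }, { _id: "d" },
  { _id: "e", parents: ["a", "b"] },
  { _id: "f", parents: ["c", "d"] },
  { _id: "g", parents: ["e", "f"] },
];
const byId = new Map(family.map((doc) => [doc._id, doc]));

// Walk parent references upward, visiting each ancestor once.
function ancestorIds(id, seen = new Set()) {
  const doc = byId.get(id);
  for (const parentId of (doc && doc.parents) || []) {
    if (!seen.has(parentId)) {
      seen.add(parentId);
      ancestorIds(parentId, seen);
    }
  }
  return seen;
}

// Walk downward by scanning for documents that list `id` as a parent.
function descendantIds(id, seen = new Set()) {
  for (const doc of family) {
    if ((doc.parents || []).includes(id) && !seen.has(doc._id)) {
      seen.add(doc._id);
      descendantIds(doc._id, seen);
    }
  }
  return seen;
}

console.log([...ancestorIds("g")].sort());   // [ 'a', 'b', 'c', 'd', 'e', 'f' ]
console.log([...descendantIds("b")].sort()); // [ 'e', 'g' ]
```

Against a real collection each recursive step would still be a separate query, which is the cost this schema pays for its low maintenance.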
101. Bi-directional
{ _id: "a", children: ["e"] }
{ _id: "b", children: ["e"] }
{ _id: "c", children: ["f"] }
{ _id: "d", children: ["f"] }
{ _id: "e", children: ["g"], parents: ["a", "b" ]}
{ _id: "f", children: ["g"], parents: ["c", "d" ]}
{ _id: "g", children: [] , parents: ["e", "f"] }
• Doesn’t really add much beyond the first example
• More maintenance
• Duplication of each relationship
• Only real advantage is ability to grab all related nodes (both directions) with one query.
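The "more maintenance" point comes from every relationship living in two documents, so each change is two writes that the application has to keep consistent. A minimal in-memory sketch of that bookkeeping; `linkParentChild` is a hypothetical helper, and against MongoDB the two pushes would be two separate updates (for example with `$addToSet`).

```javascript
const nodes = new Map([
  ["a", { _id: "a", children: [], parents: [] }],
  ["e", { _id: "e", children: [], parents: [] }],
]);

// Record one relationship on both documents; a no-op if it is already stored.
function linkParentChild(parentId, childId) {
  const parent = nodes.get(parentId);
  const child = nodes.get(childId);
  if (!parent || !child) throw new Error("both nodes must exist");
  if (!parent.children.includes(childId)) parent.children.push(childId);
  if (!child.parents.includes(parentId)) child.parents.push(parentId);
}

linkParentChild("a", "e");
linkParentChild("a", "e"); // repeated call does not duplicate the entry
console.log(nodes.get("a").children, nodes.get("e").parents); // [ 'e' ] [ 'a' ]
```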
102. Array of Ancestors
{ _id: "a" }
{ _id: "b" }
{ _id: "c" }
{ _id: "d" }
{ _id: "e", ancestors: [ "a", "b" ], parents: ["a", "b" ]}
{ _id: "f", ancestors: [ "c", "d" ], parents: ["c", "d" ]}
{ _id: "g", ancestors: [ "a", "b", "c", "d", "e", "f" ], parents: ["e", "f"] }
• Great for small trees (or subsets).
• Could be used to store X generations of ancestors
• Optimized for retrieving entire tree
• Uses implied relationships
• No help on specifics... is this person my grandson?
• Easier retrieval at expense of costlier maintenance
//find all descendants of b:
> db.tree.find({ ancestors: 'b' })
//find all direct descendants of b:
> db.tree.find({ parents: 'b' })
//find all ancestors of g:
> g = db.tree.findOne( { _id: 'g' } )
> db.tree.find( { _id: { $in : g.ancestors } } )
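The costlier maintenance shows up at write time: the `ancestors` array has to be computed when a node is stored. A minimal sketch, assuming parents are always inserted before their children and using a `Map` in place of the `tree` collection; `insertNode` is a hypothetical helper. Attaching a new parent to an existing node would additionally require rewriting the arrays of all of its descendants, which this sketch does not attempt.

```javascript
const tree = new Map();

// Insert a node, deriving its ancestors from its parents' stored ancestors.
function insertNode(id, parents = []) {
  const ancestors = new Set();
  for (const parentId of parents) {
    const parent = tree.get(parentId);
    if (!parent) throw new Error(`unknown parent ${parentId}`);
    for (const a of parent.ancestors) ancestors.add(a);
    ancestors.add(parentId);
  }
  const doc = { _id: id, parents, ancestors: [...ancestors].sort() };
  tree.set(id, doc);
  return doc;
}

["a", "b", "c", "d"].forEach((id) => insertNode(id));
insertNode("e", ["a", "b"]);
insertNode("f", ["c", "d"]);
const g = insertNode("g", ["e", "f"]);
console.log(g.ancestors); // [ 'a', 'b', 'c', 'd', 'e', 'f' ]

// All descendants of b: every stored node that lists b among its ancestors.
const descendantsOfB = [...tree.values()]
  .filter((doc) => doc.ancestors.includes("b"))
  .map((doc) => doc._id);
console.log(descendantsOfB); // [ 'e', 'g' ]
```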
105. Relations (detailed)
{ _id : "b",
relations : [
{
id : "a",
relation : "parent",
type : "mother",
subtype : "biological" },
{
id : "c",
relation : "parent",
type : "father",
subtype : "adopted"},
{
id : "d",
relation : "parent",
type : "father",
subtype : "biological"}]}
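With the detail stored on each relation, questions such as "who are b's biological parents?" become a filter over the embedded array. A minimal sketch in plain JavaScript using the document from the slide; `relatedIds` is a hypothetical helper. In a MongoDB query, the `$elemMatch` operator plays a comparable role when several fields must match on the same array element.

```javascript
const b = {
  _id: "b",
  relations: [
    { id: "a", relation: "parent", type: "mother", subtype: "biological" },
    { id: "c", relation: "parent", type: "father", subtype: "adopted" },
    { id: "d", relation: "parent", type: "father", subtype: "biological" },
  ],
};

// Return the ids of related nodes whose relation matches every key in `criteria`.
function relatedIds(doc, criteria) {
  return doc.relations
    .filter((r) => Object.entries(criteria).every(([k, v]) => r[k] === v))
    .map((r) => r.id);
}

console.log(relatedIds(b, { relation: "parent", subtype: "biological" })); // [ 'a', 'd' ]
console.log(relatedIds(b, { type: "father" }));                            // [ 'c', 'd' ]
```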
106. Shouldn’t I store my family tree in a graph database?
They are built to store trees after all
107. Graphs are great at traversing deep in a tree
• Is this node my relative?
• Retrieve my paternal great, great, great, great grandpa
110. Unfortunately that’s not how we commonly work
Typically we are working with a node and its immediate neighbors
The significant majority of our operations aren’t traversing
If those operations are important, perhaps a hybrid graph & document solution makes sense
111. http://spf13.com
http://github.com/s
@spf13
Question
download at mongodb.org
We’re hiring!! Contact us at jobs@10gen.com
Editor's notes
Remember in 1995 there were around 10,000 websites. Mosaic, Lynx, Mozilla (pre-Netscape) and IE 2.0 were the only web browsers. Apache (Dec ’95), Java (’96), PHP (June ’95), and .NET didn’t exist yet. Linux just barely (1.0 in ’94).
By reducing transactional semantics the db provides, one can still solve an interesting set of problems where performance is very important, and horizontal scaling then becomes easier.
Store an array of the ids of the ancestors of a given document.