Modern Database Systems (for Genealogy)

@spf13

AKA
Steve Francia

Chief Evangelist @
responsible for drivers,
integrations, web & docs

What’s the Point?
๏ Goal: Discover & identify ideal
storage solution for our needs
๏ History is important
๏ Many options today
๏ Document databases are good
for Genealogy

Over 5500 years ago

2 People

World Population Growth
(last ~200 years in Billions)
1804: 1
1927: 2
1960: 3
1974: 4
1987: 5
1999: 6
2012: 7

Really Big Data
In the last 50 years...

over 4% of the world's people
were born...

in less than 1% of the time

1970

๏ Oracle
creates the relational
database
๏ Everyone happily uses it for
the next 43 years

Let’s start at
the beginning

It’s a story about...

Storing & Retrieving
Information

Even today we still use
the same mediums for
data storage

With the advent of
the computer things
really took off

1960 : DBMS Emerges
๏ Ordered set of fixed length fields
๏ Low level pointer operations (flat
files)
๏ Most popular was IMS (created at
IBM)
๏ Shockingly still in use today at IBM &
American Airlines

Lots of Problems
๏ Complex and inflexible
๏ User had to know physical structure of the
DB in order to query for information
๏ Adding a field to the DB required rewriting
the underlying access/modification scheme
๏ Records isolated (no relations)
๏ Emphasis on records to be processed, not
overall structure

1970 : Relational DB
๏ Edgar Frank “Ted” Codd
๏ Relational Database
theory
๏ Codd’s 13 rules
(aka 12 rules)

3 HUGE Advantages
๏ Data independence from hardware
and storage implementation
๏ Ability to process more than one
record at a time with a single
operation
๏ Establishing a relationship
between records

IBM vs Codd
๏ IBM bet on IMS
๏ Codd bets on relational DB
๏ Eventually 2 relational
prototypes emerge

Ingres

๏ Built at UC Berkeley
๏ Uses QUEL
๏ Inspires Sybase & MSSQL

System R
๏ Built at IBM
๏ Leads to SEQUEL... later SQL
๏ Evolved into SQL/DS which
evolved into DB2
๏ Project concludes that relational
model is viable

Oracle
๏ Larry Ellison watches IBM
๏ Starts Relational Software Inc.
๏ Oracle 1st commercial RDBMS
released in 1979
๏ Beats IBM by 2 years to market

Entity Relationship
๏ Proposed by Peter
Chen in 1976
๏ Focuses on data use
and not logical table
structure

1980s
๏ RDBMS dominates
๏ Some fields (medicine,
physics, multimedia) need
more than RDBMS offers
๏ Object Databases emerge

Object Databases
๏ Inspired by Entity Relationship
๏ More flexible than relational permits
๏ Tightly coupled with an OO
programming language (C++, later
Java)
๏ Full object: data & methods stored

1990s
๏ Internet emerges
๏ Data demand spikes
๏ Databases used for
archiving historical data

Early 2000s
๏ Internet booms
๏ RDBMS fails to scale
๏ In desperation we take a
step backwards

MemcacheD
๏ 1 dimensional
๏ No persistence
๏ No ACI or D
๏ but...

2005 ish
๏ Relational + MemcacheD
broken (and we didn’t know it)
๏ Scale redefined with high
volume & social
๏ Infrastructure reinvented with
cloud computing & SSDs

Alternatives Emerge

๏ Dynamo / Key Value
๏ Document

๏ Graph

A lot going on
Easiest to define databases in
broad terms
• What is a record?
(data model)
• CAP : CA, AP, CP ?
(infrastructure model)

Data Storage Structure
1D: Key → Value (a single pair)
2D: Key → Value, repeated per field (rows of pairs)
nD: Key → Value(s), where a value can itself hold Keys and Values (nesting)

Database structure
1D: Key Value, Dynamo
2D: Relational
nD: Document
Graph
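
To make the 1D / 2D / nD distinction concrete, here is a minimal sketch in plain JavaScript (no database involved) of the same person held three ways. The variable names and the flattened column names are illustrative assumptions, not any product's API; the sample values are borrowed from the Individual example in this deck.

```javascript
// 1D (key value): an opaque value looked up by a single key.
const keyValue = new Map();
keyValue.set('1XYK-KQJ', JSON.stringify({ first: 'john', last: 'smith' }));

// 2D (relational): rows of fixed columns, one value per cell.
const individualRows = [
  { afn: '1XYK-KQJ', first: 'john', middle: 'peter', last: 'smith' },
];
// Extra name variants need a second table and a join key (hypothetical layout).
const nameVariantRows = [
  { afn: '1XYK-KQJ', part: 'first', value: 'johannes' },
  { afn: '1XYK-KQJ', part: 'last', value: 'sandvik' },
];

// nD (document): nesting and arrays live inside one record.
const individualDoc = {
  AFN: '1XYK-KQJ',
  name: {
    first: ['john', 'johannes'],
    middle: 'peter',
    last: ['smith', 'sandvik'],
  },
};

// Reading all first-name variants in each model.
const fromKeyValue = [JSON.parse(keyValue.get('1XYK-KQJ')).first];
const fromRelational = individualRows
  .filter((r) => r.afn === '1XYK-KQJ')
  .map((r) => r.first)
  .concat(
    nameVariantRows
      .filter((r) => r.afn === '1XYK-KQJ' && r.part === 'first')
      .map((r) => r.value)
  );
const fromDocument = individualDoc.name.first;
```

The relational version needs a join to recover what the document version reads from a single field, which is the trade-off the 2D and nD columns summarize.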

CAP Theorem
Availability

Partitioning Consistency

CAP Theorem
[Diagram: an App talks to two Nodes; the link between the Nodes is broken (xx)]

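CAP Theorem
Corners: Availability, Partition Tolerant, Consistency
๏ Available + Partition Tolerant (Inconsistent): Dynamo, Key Value NoSQLs
๏ Available + Consistent (Intolerant): RDBMS
๏ Consistent + Partition Tolerant (Unavailable): MongoDB, BigTable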

Key Value
๏ 1 Dimensional storage (tuple)
๏ Query key only
๏ Bucket index (range) on keys
๏ Records cannot be updated, only replaced
๏ Often MultiMaster... meaning availability over consistency
๏ Partitioning easy thanks to single value

Cassandra, Redis, MemcacheD, Riak, DynamoDB
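
The key value properties listed here can be sketched with a tiny in-memory store. This is a hypothetical class written for illustration (`TinyKeyValueStore` and its method names are assumptions, not the API of any product named on the slide), and the sample people are made up.

```javascript
// Hypothetical in-memory key value store; not any real product's API.
class TinyKeyValueStore {
  constructor() {
    this.data = new Map();
  }
  // Writes always replace the whole value; there is no partial update.
  put(key, value) {
    this.data.set(key, value);
  }
  // Lookup is by key only; the store never inspects the value.
  get(key) {
    return this.data.get(key);
  }
  // Range scan over keys, the only "query" beyond exact lookup.
  range(startKey, endKey) {
    return [...this.data.keys()]
      .filter((k) => k >= startKey && k <= endKey)
      .sort();
  }
}

const store = new TinyKeyValueStore();
store.put('person:1001', { first: 'john', last: 'smith' });
store.put('person:1002', { first: 'mary', last: 'jones' });
store.put('record:9001', { type: 'birth certificate' });

// To change one field: read the value, modify a copy, write it all back.
const current = store.get('person:1001');
store.put('person:1001', { ...current, last: 'sandvik' });

// '~' sorts after digits and letters, so this covers every 'person:' key.
const personKeys = store.range('person:', 'person:~');
```

Finding everyone whose last name is 'jones' would require reading every value in application code, since the store can only look things up by key.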

Relational
๏ 2 Dimensional storage (map)
๏ Query any field
๏ BTree Indexes
๏ Single master meaning consistency > availability
๏ Partitioning hard due to transactions & joins

Oracle, MSSQL, MySQL, PostgreSQL, DB2

Document
๏ n Dimensional storage (hash w/ nesting)
๏ Query any field at any level
๏ BTree Indexes
๏ Single master meaning consistency > availability
๏ Partitioning easy thanks to richer data model

MongoDB, CouchDB, RethinkDB
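
"Query any field at any level" can be illustrated with a small matcher over nested documents. This is a simplified approximation of dotted-path equality matching written for this deck, not the implementation used by any of the databases above; the helper names `valuesAtPath` and `matches` are assumptions, and operators and indexes are ignored.

```javascript
// Collect every value reachable at a dotted path, fanning out over arrays.
function valuesAtPath(doc, path) {
  let current = [doc];
  for (const key of path.split('.')) {
    const next = [];
    for (const item of current) {
      // An array of subdocuments is searched element by element.
      const candidates = Array.isArray(item) ? item : [item];
      for (const c of candidates) {
        if (c !== null && typeof c === 'object' && key in c) {
          next.push(c[key]);
        }
      }
    }
    current = next;
  }
  // A final array value matches if any of its elements matches.
  return current.flatMap((v) => (Array.isArray(v) ? v : [v]));
}

// True when every path in the query has at least one equal value.
function matches(doc, query) {
  return Object.entries(query).every(([path, wanted]) =>
    valuesAtPath(doc, path).includes(wanted)
  );
}

const individual = {
  AFN: '1XYK-KQJ',
  name: {
    first: ['john', 'johannes'],
    middle: 'peter',
    last: ['smith', 'sandvik'],
  },
};

const hit = matches(individual, {
  'name.first': 'johannes',
  'name.middle': 'peter',
});
const miss = matches(individual, { 'name.last': 'jones' });
```

Because an array-valued field matches when any element matches, name variants can be stored side by side without changing how they are queried.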

Graph
๏ 1 Dimensional storage... but grouped to appear
2D
๏ Differentiated by indexes
๏ Large indexes cover many relationships
๏ Query time depends on # records returned,
not distance to get them
๏ Doesn’t require traversing to determine
relationship

Neo4j, and about 20 more that nobody talks much about

Types of
genealogy data
๏ Events (birth, death, etc)
๏ Official records
๏ Census
๏ Names
๏ Relationships
๏ Photographs
๏ Diaries & letters
๏ Ship passenger list
๏ Occupation
๏ and more

Challenges of
genealogy data
๏ Lots of possible data points... need flexible schema
๏ Multiple versions of same data point
(3 different dates for death date, 4 variations on name)
๏ Lots of data associated with physical records
๏ Multiple versions of same nodes
(intelligent nondestructive merge needed)
๏ Need to have metadata associated
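
One way to read "nondestructive merge" is that merging two records for the same person keeps every variant from both and changes neither input. The sketch below is a minimal illustration under that assumption; `mergeIndividuals`, `unionVariants`, the `mergedFrom` field, and the second AFN are all hypothetical, and every name part is treated as an array of variants.

```javascript
// Union two lists of variants, keeping first-seen order and no duplicates.
function unionVariants(a, b) {
  return [...new Set([...a, ...b])];
}

// Hypothetical merge: returns a new document and mutates neither input.
function mergeIndividuals(left, right) {
  const merged = { name: {}, mergedFrom: [left.AFN, right.AFN] };
  for (const part of ['first', 'middle', 'last']) {
    merged.name[part] = unionVariants(
      left.name[part] || [],
      right.name[part] || []
    );
  }
  return merged;
}

const a = {
  AFN: '1XYK-KQJ',
  name: { first: ['john'], middle: ['peter'], last: ['smith'] },
};
const b = {
  AFN: 'HYPOTHETICAL-2', // made-up identifier for the duplicate record
  name: { first: ['johannes', 'john'], last: ['sandvik'] },
};

const merged = mergeIndividuals(a, b);
```

Recording the source identifiers alongside the merged values leaves enough information to undo the merge later, which a destructive overwrite would not.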

Individual
• AFN
• Modification Date

Events[]
• type
• date
• contributor[]
• record[]

User
• Name
• Email Address
• Password
• Individual_id

Name
• First[]
• Middle[]
• Last[]

Location
• city
• state
• county
• country
• coordinates[]

Record
• contributor
• type
• thumbnail
• content
• description
• tags[]

Individual
individual = {
    _id : ObjectId("4f2978dfaa999d9db02618ce"),
    AFN : '1XYK-KQJ',
    name: {
        first: ['john', 'johannes'],
        middle: 'peter',
        last: ['smith', 'sandvik']
    }
}

db.individual.find(
    {"name.first" : "john", "name.middle" : "peter"})

Individual.Events
events : [ {
    death : {
        date : ISODate('1989-07-14'),
        location : {
            city: 'pensacola',
            state: 'fl',
            county: 'escambia',
            country: 'usa',
            coordinates : [30.26,87.12]},
        contributor : ObjectId("4eeac...691")}
} ]

db.individual.find(
    {"events.death.date" : ISODate("1989-07-14")})

// $near needs a 2d index on the coordinates field
db.individual.find(
    {"events.death.location.coordinates" : { $near:[30,90]}})

Event Versions
events : [ {
    birth : [ {
        date : ISODate('1928-04-06'),
        location : {
            city: 'brattleboro',
            state: 'vt',
            county: 'windham',
            country: 'usa',
            coordinates : [42.51,72.34]},
        contributor : ObjectId("4ee...00000"),
        records: ObjectId("4ed8a...7b000000")
    },
    {
        date : ISODate('1928-04-16'),
        location : {
            city: 'brattleboro',
            state: 'vt',
            county: 'windham',
            country: 'usa',
            coordinates : [42.51,72.34]},
        contributor : ObjectId("4ee...37bb"),
        records: ObjectId("4eea...0000c8")
    }]
} ]

Query with Versioned Events
events : [ {
    birth : [
        { date : ISODate('1928-04-06')},
        { date : ISODate('1928-04-16')}
    ]
} ]

db.individual.find(
    {"events.birth.date" : ISODate("1928-04-16")})

Records
record1 = {
    _id : ObjectId("4ed8aea7d8562f7d7b"),
    contributor : ObjectId("4eeab...1537bb"),
    type : 'birth certificate',
    thumbnail : BinData(0,"/9j/4AAQSkZJ...."),
    content : BinData(0,"j6b/Id11lWqs..."),
    tags : ['NY', 'certified'],
    description : "John's birth certificate"
}

MongoDB: Scale built in
๏ Intelligent replication
๏ Automatic partitioning of data
(user configurable)
๏ Horizontal Scale
๏ Targeted Queries
๏ Parallel Processing

Intelligent Replication
[Diagram: Node 3 (Primary) replicates to Node 1 (Secondary) and Node 2 (Secondary); the nodes exchange a Heartbeat]

Scalable Architecture
[Diagram: three App Servers, each paired with a Mongos; three Config Servers; three Shards, each shown with Node 1 and a Secondary]
High Availability in Shards
[Diagram labels: Shard, Shard; Primary, Secondary, Secondary; Mongod; one node marked x]

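Targeted Requests
[Diagram: four numbered steps: (1) request reaches Mongos, (2) Mongos forwards it to one of three Shards, (3) that Shard answers, (4) Mongos returns the result]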

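Parallel processing
[Diagram: six numbered steps: (1) request reaches Mongos, (2) Mongos sends it to all three Shards, (3) each Shard works on it, (4) each Shard answers, (5) Mongos combines the answers, (6) Mongos returns the result]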
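
The difference between a targeted request and parallel processing comes down to whether the query includes the shard key. The sketch below is a simplified model of that routing decision, not Mongos itself; the shard names, the choice of last name as shard key, and the key ranges are all assumptions made for illustration.

```javascript
// Hypothetical chunk map: each shard owns a range of the shard key (last name).
const chunks = [
  { shard: 'shardA', min: 'a', max: 'h' }, // min inclusive, max exclusive
  { shard: 'shardB', min: 'h', max: 'q' },
  { shard: 'shardC', min: 'q', max: '{' }, // '{' sorts just after 'z'
];

// Decide which shards a query must visit.
function route(query, shardKey = 'last') {
  const value = query[shardKey];
  if (value === undefined) {
    // No shard key in the query: ask every shard, then merge the results.
    return chunks.map((c) => c.shard);
  }
  // Shard key present: only the chunk whose range holds the value is needed.
  return chunks
    .filter((c) => value >= c.min && value < c.max)
    .map((c) => c.shard);
}

const targeted = route({ last: 'smith' });
const scattered = route({ first: 'john' });
```

Here `targeted` names a single shard while `scattered` names all three, so choosing a shard key that appears in the most common queries keeps most requests on the targeted path.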

Broad Feature Set
๏ Rich query language
๏ Native support for over 12 languages
๏ GeoSpatial
๏ Text search
๏ Aggregation & MapReduce
๏ GridFS
(distributed & replicated file storage)
๏ Integration with Hadoop, Solr & more

Last Year I
presented
on Graph in
MongoDB

http://j.mp/XvJ3dl

FamilySearch
presented in
December
2012

http://j.mp/X03TXp

http://spf13.com
http://github.com/spf13
@spf13

Questions?
download at mongodb.org
