Discover & identify the ideal storage solution for our needs by examining the history of data storage & modern database systems, including Key Value, Relational, Graph and Document databases.
This presentation was given at RootsTech 2013 in March.
2. @spf13
AKA
Steve Francia
Chief Evangelist @
responsible for drivers,
integrations, web & docs
3. What’s the Point?
๏ Goal: Discover & identify ideal
storage solution for our needs
๏ History is important
๏ Many options today
๏ Document databases are good
for Genealogy
24. 1960 : DBMS Emerges
๏ Ordered set of fixed length fields
๏ Low level pointer operations (flat
files)
๏ Most popular was IMS (created at
IBM)
๏ Shockingly still in use today at IBM &
American Airlines
25. Lots of Problems
๏ Complex and inflexible
๏ User had to know physical structure of the
DB in order to query for information
๏ Adding a field to the DB required rewriting
the underlying access/modification scheme
๏ Records isolated (no relations)
๏ Emphasis on records to be processed, not
overall structure
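To make these problems concrete, here is a minimal sketch of a flat file of fixed-length records using Python's `struct` module. The record layout, field widths and function names are assumptions made up for illustration; this is not how IMS stored data, only a picture of why the reader had to know the physical structure.

```python
import struct

# Hypothetical version 1 layout: 10-byte name, 4-byte unsigned birth year.
RECORD_V1 = struct.Struct("<10sI")

def read_year_v1(flat_file: bytes, index: int) -> int:
    # The caller must know both the record size and the field offset.
    offset = index * RECORD_V1.size + 10
    return struct.unpack_from("<I", flat_file, offset)[0]

v1_file = RECORD_V1.pack(b"Smith", 1850) + RECORD_V1.pack(b"Jones", 1900)
year_v1 = read_year_v1(v1_file, 1)

# Hypothetical version 2 adds a 2-byte state field between name and year.
# Every record gets longer and the year moves, so the old reader
# silently returns garbage until it is rewritten.
RECORD_V2 = struct.Struct("<10s2sI")
v2_file = RECORD_V2.pack(b"Smith", b"UT", 1850) + RECORD_V2.pack(b"Jones", b"NY", 1900)
year_from_old_reader = read_year_v1(v2_file, 1)
```

The old reader still runs against the new file without raising an error, which is the dangerous part: the access code and the physical layout have to change together.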
26. 1970 : Relational DB
๏ Edgar Frank “Ted” Codd
๏ Relational Database
theory
๏ Codd’s 13 rules
(aka 12 rules)
27. 3 HUGE Advantages
๏ Data independence from hardware
and storage implementation
๏ Ability to process more than one
record at a time with a single
operation
๏ Establishing a relationship
between records
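The last two advantages can be sketched with Python's built-in `sqlite3` module. The `person` and `event` tables and their columns are hypothetical, chosen only to illustrate a set-based update and a relationship expressed as a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, birth_year INTEGER);
    CREATE TABLE event (id INTEGER PRIMARY KEY,
                        person_id INTEGER REFERENCES person(id),
                        kind TEXT, year INTEGER);
    INSERT INTO person VALUES (1, 'Anna', 1850), (2, 'Bjorn', 1852), (3, 'Clara', 1901);
    INSERT INTO event VALUES (1, 1, 'marriage', 1872), (2, 2, 'marriage', 1872),
                             (3, 3, 'birth', 1901);
""")

# One statement processes every matching record; no record-at-a-time loop.
cur = conn.execute("UPDATE person SET name = upper(name) WHERE birth_year < 1900")
updated = cur.rowcount

# The relationship between records is stated in the query, not in pointers.
rows = conn.execute("""
    SELECT person.name, event.kind, event.year
    FROM person JOIN event ON event.person_id = person.id
    WHERE event.kind = 'marriage'
    ORDER BY person.id
""").fetchall()
```

Nothing in either statement mentions files, offsets or storage order, which is the data independence named in the first bullet.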
28. IBM vs Codd
๏ IBM bet on IMS
๏ Codd bets on relational DB
๏ Eventually 2 relational
prototypes emerge
29. Ingres
๏ Built at UC Berkeley
๏ Uses QUEL
๏ Inspires Sybase & MSSQL
30. System R
๏ Built at IBM
๏ Leads to SEQUEL... later SQL
๏ Evolved into SQL/DS which
evolved into DB2
๏ Project concludes that relational
model is viable
31. Oracle
๏ Larry Ellison watches IBM
๏ Starts Relational Software Inc.
๏ Oracle 1st commercial RDBMS
released in 1979
๏ Beats IBM by 2 years to market
32. Entity Relationship
๏ Proposed by Peter
Chen in 1976
๏ Focuses on data use
and not logical table
structure
33. 1980s
๏ RDBMS dominates
๏ Some fields (medicine,
physics, multimedia) need
more than RDBMS offers
๏ Object Databases emerge
34. Object Databases
๏ Inspired by Entity Relationship
๏ More flexible than relational permits
๏ Tightly coupled with OO
programming language (C++, later
Java)
๏ Full object: data & methods stored
35. 1990s
๏ Internet emerges
๏ Data demand spikes
๏ Databases used for
archiving historical data
36. Early 2000s
๏ Internet booms
๏ RDBMS fails to scale
๏ In desperation we take a
step backwards
37. MemcacheD
๏ 1 dimensional
๏ No persistence
๏ No ACI or D
๏ but...
39. 2005 ish
๏ Relational + MemcacheD
broken (and we didn’t know it)
๏ Scale redefined with high
volume & social
๏ Infrastructure reinvented with
cloud computing & SSDs
42. A lot going on
Easiest to define databases in
broad terms
• What is a record?
(data model)
• CAP : CA, AP, CP ?
(infrastructure model)
43. Data Storage Structure
[Diagram: three columns labeled 1D, 2D and nD. 1D shows single Key or Value cells, 2D shows Key / Value pairs, nD shows Keys with one or more Value(s).]
47. CAP Theorem
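[Diagram: CAP triangle with corners Availability, Consistency and Partition Tolerant. The sides are labeled Inconsistent, Intolerant and Unavailable. Systems placed on the diagram: Dynamo, Key Value NoSQLs, RDBMS, MongoDB, BigTable.]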
48. Key Value
๏ 1 Dimensional storage (tuple)
๏ Query key only
๏ Bucket index (range) on keys
๏ Records cannot be updated, only replaced
๏ Often MultiMaster... meaning availability over consistency
๏ Partitioning easy thanks to single value
Cassandra, Redis, MemcacheD, Riak, DynamoDB
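The key value properties above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the design of any of the products listed: `KeyValueStore`, `range` and `partition_for` are hypothetical names, and hashing the key is only one common way to pick a partition.

```python
import bisect
import hashlib

class KeyValueStore:
    """Toy 1-dimensional store: opaque values addressed by key only."""

    def __init__(self):
        self._data = {}
        self._sorted_keys = []

    def put(self, key, value):
        if key not in self._data:
            bisect.insort(self._sorted_keys, key)
        # The store cannot see inside the value, so a write replaces it whole.
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def range(self, start, end):
        # Range scan over keys in [start, end); values are never inspected.
        lo = bisect.bisect_left(self._sorted_keys, start)
        hi = bisect.bisect_left(self._sorted_keys, end)
        return self._sorted_keys[lo:hi]

def partition_for(key, partitions):
    # With a single value per key, the key alone decides where a record lives.
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % partitions

store = KeyValueStore()
store.put("person:1", '{"name": "Anna"}')
store.put("person:2", '{"name": "Carl"}')
store.put("person:1", '{"name": "Anna Berg"}')  # replaces the earlier value
keys_in_bucket = store.range("person:", "person;")  # ";" sorts right after ":"
```

Changing one attribute of `person:1` means reading the whole value, editing it in the application and writing it back, which is what "only replaced" implies.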
49. Relational
๏ 2 Dimensional storage (map)
๏ Query any field
๏ BTree Indexes
๏ Single master meaning consistency > availability
๏ Partitioning hard due to transactions & joins
Oracle, MSSQL, MySQL, PostgreSQL, DB2
50. Document
๏ n Dimensional storage (hash w/ nesting)
๏ Query any field at any level
๏ BTree Indexes
๏ Single master meaning consistency > availability
๏ Partitioning easy thanks to richer data model
MongoDB, CouchDB, RethinkDB
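"Query any field at any level" can be illustrated with a small pure-Python sketch over nested dictionaries and lists. The dotted paths loosely resemble the dot notation MongoDB uses for nested fields, but `get_path`, `find` and the sample documents are hypothetical, and a real document database would answer such queries from indexes rather than by scanning.

```python
def get_path(doc, path):
    """Yield every value reachable at a dotted path, descending into lists."""
    def walk(node, parts):
        if isinstance(node, list):
            # A list matches if any of its elements match.
            for item in node:
                yield from walk(item, parts)
        elif not parts:
            yield node
        elif isinstance(node, dict) and parts[0] in node:
            yield from walk(node[parts[0]], parts[1:])
    yield from walk(doc, path.split("."))

def find(collection, path, value):
    """Return the documents where the field at `path` holds `value`."""
    return [doc for doc in collection if value in get_path(doc, path)]

people = [
    {"_id": 1,
     "name": {"first": ["Anna", "Anne"], "last": ["Berg"]},
     "events": [{"type": "birth", "location": {"city": "Oslo"}}]},
    {"_id": 2,
     "name": {"first": ["Carl"], "last": ["Berg"]},
     "events": [{"type": "death", "location": {"city": "Bergen"}}]},
]

born_or_died_in_oslo = find(people, "events.location.city", "Oslo")
all_bergs = find(people, "name.last", "Berg")
```

The same call works at depth one or depth three, and neither document had to declare its shape up front.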
51. Graph
๏ 1 Dimensional storage... but grouped to appear
2D
๏ Differentiated by indexes
๏ Large indexes cover many relationships
๏ Query time depends on # records returned,
not distance to get them
๏ Doesn’t require traversing to determine
relationship
Neo4j, and about 20 more that nobody talks much about
54. Types of
genealogy data
๏ Events (birth, death, etc)
๏ Official records
๏ Census
๏ Names
๏ Relationships
๏ Photographs
๏ Diaries & letters
๏ Ship passenger list
๏ Occupation
๏ and more
55. Challenges of
genealogy data
๏ Lots of possible data points... need flexible schema
๏ Multiple versions of same data point
(3 different dates for death date, 4 variations on name).
๏ Lots of data associated with physical records
๏ Multiple versions of same nodes
(intelligent nondestructive merge needed)
๏ Need to have meta data associated
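One way to picture "multiple versions of the same data point" and a nondestructive merge is to store every field as a list of versions, each carrying its own metadata. The sketch below is an assumption made for illustration: the field names, the `sources` metadata and both helper functions are hypothetical, not a schema taken from the talk.

```python
def add_assertion(person, field, value, source):
    """Record a value for a field without discarding conflicting values."""
    versions = person.setdefault(field, [])
    for version in versions:
        if version["value"] == value:
            # Same value seen again: only the supporting metadata grows.
            if source not in version["sources"]:
                version["sources"].append(source)
            return person
    versions.append({"value": value, "sources": [source]})
    return person

def merge(a, b):
    """Nondestructive merge: keeps every version from both inputs."""
    merged = {}
    for person in (a, b):
        for field, versions in person.items():
            for version in versions:
                for source in version["sources"]:
                    add_assertion(merged, field, version["value"], source)
    return merged

a = {}
add_assertion(a, "name", "Anna Berg", "parish register")
add_assertion(a, "death_date", "1901-03-02", "parish register")

b = {}
add_assertion(b, "name", "Anna Berg", "census")
add_assertion(b, "death_date", "1901-03-04", "headstone")

merged = merge(a, b)
```

After the merge both death dates survive, each with its source, and the two inputs are left untouched so the merge can be undone or redone later.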
56. [Diagram: data model]
Individual
• AFN
• Modification Date
• contributor[]
Events[]
• type
• date
• Individual_id
• record[]
User
• Name
• Email Address
• Password
Name
• First[]
• Middle[]
• Last[]
Location
• city
• state
• county
• country
• coordinates[]
Record
• contributor
• type
• thumbnail
• content
• description
• tags[]
63. MongoDB: Scale built in
๏ Intelligent replication
๏ Automatic partitioning of data
(user configurable)
๏ Horizontal Scale
๏ Targeted Queries
๏ Parallel Processing
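The idea behind partitioning and targeted queries can be sketched as a router that maps ranges of a partition key to shards. This is a simplified illustration, not MongoDB's implementation: `RangeRouter`, the split points and the shard names are all hypothetical.

```python
import bisect

class RangeRouter:
    """Maps ranges of a partition key to shards using sorted split points."""

    def __init__(self, split_points, shards):
        if len(shards) != len(split_points) + 1:
            raise ValueError("need exactly one more shard than split points")
        self.split_points = split_points
        self.shards = shards

    def shard_for(self, key):
        # Keys below the first split point go to the first shard, and so on.
        return self.shards[bisect.bisect_right(self.split_points, key)]

    def shards_for_query(self, query, partition_key):
        if partition_key in query:
            # Targeted query: the key value pins the query to one shard.
            return [self.shard_for(query[partition_key])]
        # Without the key, every shard must be asked and can work in parallel.
        return list(self.shards)

router = RangeRouter(["h", "p"], ["shard0", "shard1", "shard2"])
targeted = router.shards_for_query({"last_name": "olsen"}, "last_name")
broadcast = router.shards_for_query({"birth_year": 1850}, "last_name")
```

Adding a shard means adding a split point and moving one range of keys, which is what lets capacity grow horizontally.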
64. Intelligent Replication
[Diagram: three nodes. Node 1 and Node 2 are Secondary, Node 3 is Primary. Connections are labeled Heartbeat and Replication.]
65. Scalable Architecture
[Diagram: three App Servers, each with a Mongos; three Config Servers; three Shards, each showing a node labeled Node 1 Secondary.]
70. Broad Feature Set
๏ Rich query language
๏ Native support for over 12 languages
๏ GeoSpatial
๏ Text search
๏ Aggregation & MapReduce
๏ GridFS
(distributed & replicated file storage)
๏ Integration with Hadoop, Solr & more