4. Horizontal Scaling
• Vertical scaling is limited
• Hard to scale vertically in the cloud
• Easier to scale out (wider) than up (higher)
5. Schema
• Modeling the same data in different ways
can change performance by orders of
magnitude
• Very often performance problems can be
solved by changing the schema
6. Embedding
• Great for read performance
• One seek to load entire object
• One roundtrip to database
• Writes can be slow if you're constantly
growing objects
7. Should you embed comments?
{
  title : "MongoDB is fun" ,
  author : "eliot" ,
  date : "2010-12-03" ,
  comments : [
    { author : "bob" , text : "..." } ,
    { author : "joe" , text : "..." }
  ]
}
db.posts.update( { title : "MongoDB is fun" } ,
  { $push : { comments : { author : "sam" , text : "..." } } } )
8. Indexes
• Index common queries
• Make sure there aren't duplicates: with an
index on (A,B), a separate index on (A) is redundant
• Right-balanced indexes keep working set
small
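The prefix rule above can be sketched outside the database: a compound index can serve queries on any leading prefix of its fields, so an index whose field list is a strict prefix of another index is redundant. This is an illustrative helper, not MongoDB internals; the function name is made up for this sketch.

```javascript
// Sketch (illustrative, not MongoDB internals): an index on (A, B) can serve
// queries on A alone, so a separate index on (A) is redundant. An index is
// redundant if its field list is a strict prefix of another index's field list.
function isRedundant(index, otherIndexes) {
  return otherIndexes.some(other =>
    other.length > index.length &&
    index.every((field, i) => field === other[i])
  );
}

console.log(isRedundant(["author"], [["author", "date"]])); // true: covered by (author, date)
console.log(isRedundant(["date"],   [["author", "date"]])); // false: "date" is not a prefix
```

Dropping the redundant index saves RAM and write overhead with no loss of query coverage.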
14. Read Scaling
• One master at any time
• Programmer determines if read hits master
or a slave
• Pro: easy to setup, can scale reads very well
• Con: reads are inconsistent on a slave
• Writes don’t scale
15. One Master, Many Slaves
• Custom Master/Slave setup
• Have as many slaves as you want
• Can put them local to application servers
• Good for 90+% read-heavy applications
(e.g., Wikipedia)
16. Replica Sets
• High Availability Cluster
• One master at any time, up to 6 slaves
• A slave is automatically promoted to master
on failure
• Drivers support auto routing of reads to
slaves if programmer allows
• Good for applications that need high write
availability but are mostly reads (e.g., a
commenting system)
17. Sharding
• Many masters, even more slaves
• Can scale reads and writes in two
dimensions
• Add slaves for (eventually consistent) read
scaling and redundancy
• Add Shards for write and data size scaling
19. Common Setup
• Typical setup is 3 shards with 3 servers per
shard: 3 masters, 6 slaves
• One massive sharded collection, plus a dozen non-sharded
• Can add sharding later to an existing replica
set with no down time
• Can have sharded and non-sharded
collections
20. Choosing a Shard Key
• Shard key determines how data is
partitioned
• Hard to change
• Most important performance decision
21. Range Based
MIN   MAX   LOCATION
A     F     shard1
F     M     shard1
M     R     shard2
R     Z     shard3
• The collection is broken into chunks by range
• Chunks default to 200 MB or 100,000 objects
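The routing table above can be sketched as a lookup: a document goes to the chunk whose [min, max) range contains its shard key value. The field names here are illustrative, not mongos internals.

```javascript
// Sketch of range-based chunk routing for the table above (illustrative;
// field names are not mongos internals).
const chunks = [
  { min: "A", max: "F", location: "shard1" },
  { min: "F", max: "M", location: "shard1" },
  { min: "M", max: "R", location: "shard2" },
  { min: "R", max: "Z", location: "shard3" },
];

// A shard key value routes to the chunk whose [min, max) range contains it.
function shardFor(key) {
  const chunk = chunks.find(c => key >= c.min && key < c.max);
  return chunk ? chunk.location : null;
}

console.log(shardFor("Bob"));   // "shard1"
console.log(shardFor("Nancy")); // "shard2"
```

When a chunk exceeds the size threshold it is split and, if needed, migrated, so the table of ranges grows over time.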
22. Use Case: User Profiles
{ email : "eliot@10gen.com" ,
  addresses : [ { state : "NY" } ]
}
• Shard by email
• Lookup by email hits 1 node
• Index on { "addresses.state" : 1 }
23. Use Case: Activity Stream
{ user_id : XXX, event_id : YYY , data : ZZZ }
• Shard by user_id
• Looking up an activity stream hits 1 node
• Writes are evenly distributed
• Index on { "event_id" : 1 } for deletes
24. Use Case: Photos
{ photo_id : ???? , data : <binary> }
What’s the right key?
• auto increment
• MD5( data )
• now() + MD5(data)
• month() + MD5(data)