This document provides an overview of MongoDB sharding. It begins with definitions of key terms like shards, chunks, config servers, and mongos. It explains how MongoDB partitions and distributes data across shards. The roles of config servers and mongos routers are outlined. Guidelines for choosing a shard key are presented, emphasizing characteristics like cardinality, write distribution, and query isolation. Best practices for setting up and using MongoDB sharding are also covered.
10. Data Store Scalability
• Custom Hardware
• Custom Software
In the past you've had two options for achieving data store scalability:
1) custom hardware (Oracle?)
2) custom software (Google, Facebook)
These things were custom because the problems were not yet common enough. The number of people on the internet 10 years ago was incredibly small compared to the number of people who will be using web services 10 years from now.
13. The MongoDB Sharding Solution
• Automatically partition your data
• Worry about failover at the partition layer
• Application independent
• Free and open source
32. Terminology
• Shards
• Chunks
• Config Servers
• mongos
A shard is a server, or a collection of servers, that holds a subset of a collection's data. That data is stored in chunks, which are split up according to a shard key.
A chunk is a group of documents whose shard key values fall in a particular range. Chunks can be moved logically from server to server.
Config servers hold information about where chunks live.
mongos is the router and balancer. It communicates with the config servers and figures out how to direct your query intelligently.
33. What exactly is a shard?
• Shard is a node of the cluster
• Can be a single mongod or an entire replica set
[Diagram: a shard as a single mongod, or as a replica set with one primary and two secondaries]
Now what do shards hold? Chunks, which are partitions of your data that live in certain ranges.
34. Partitioning
• User defines a shard key or uses hash based sharding
• Shard key defines a range of data
• The key space is like points on a line
• A range is a segment of that line
[Diagram: the key space drawn as a line from -∞ to +∞]
Remember interval notation?
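To make the line-and-segment picture concrete, here is a minimal sketch of looking up which chunk owns a shard key value. The chunk table, the shard names, and the `findChunk` helper are hypothetical illustrations rather than MongoDB internals, and the sketch assumes each chunk covers a half-open range [min, max).

```javascript
// Hypothetical chunk table: each chunk covers the half-open range [min, max).
// -Infinity and Infinity stand in for the two ends of the key space.
const chunks = [
  { min: -Infinity, max: 100, shard: "shard0000" },
  { min: 100, max: 200, shard: "shard0001" },
  { min: 200, max: Infinity, shard: "shard0000" },
];

// Return the chunk whose range contains the given shard key value.
function findChunk(chunkTable, key) {
  return chunkTable.find((c) => c.min <= key && key < c.max);
}

console.log(findChunk(chunks, 150).shard); // shard0001
console.log(findChunk(chunks, 100).shard); // shard0001 (min is inclusive)
```

Because the ranges tile the whole line without overlapping, every key value maps to exactly one chunk, and therefore to exactly one shard.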
35. Data Distribution
Initially a single chunk
Default Max Chunk Size: 64 MB
MongoDB will automatically split and migrate chunks as they reach the max size
[Diagram: mongos routers, a config server, and two shards (Shard 1 and Shard 2)]
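The splitting step can be sketched as follows. This is a toy model under stated assumptions, not MongoDB's actual split logic: the chunk shape (`min`, `max`, `keys`, `bytes`) and the `splitIfTooBig` helper are hypothetical, the split point is taken as the median key, and bytes are assumed to divide in proportion to document count.

```javascript
const MAX_CHUNK_BYTES = 64 * 1024 * 1024; // default max chunk size from the slide

// Hypothetical chunk model: a key range [min, max), the sorted shard key
// values of the documents it holds, and their total size in bytes.
function splitIfTooBig(chunk, maxBytes = MAX_CHUNK_BYTES) {
  if (chunk.bytes <= maxBytes || chunk.keys.length < 2) return [chunk];

  // Use the median key as the split point (an assumption of this sketch).
  const splitPoint = chunk.keys[Math.floor(chunk.keys.length / 2)];
  const left = chunk.keys.filter((k) => k < splitPoint);
  const right = chunk.keys.filter((k) => k >= splitPoint);

  // If every document shares one key value, there is nowhere to split.
  if (left.length === 0) return [chunk];

  // Assumption: bytes divide in proportion to the number of documents.
  const leftBytes = Math.round((chunk.bytes * left.length) / chunk.keys.length);
  return [
    { min: chunk.min, max: splitPoint, keys: left, bytes: leftBytes },
    { min: splitPoint, max: chunk.max, keys: right, bytes: chunk.bytes - leftBytes },
  ];
}

const oversized = {
  min: -Infinity,
  max: Infinity,
  keys: [10, 20, 30, 40],
  bytes: 80 * 1024 * 1024,
};
console.log(splitIfTooBig(oversized).map((c) => [c.min, c.max])); // two ranges meeting at 30
```

The early return for a chunk whose documents all share one key value hints at why cardinality matters: such a chunk cannot be split no matter how large it grows.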
39. What is a config server?
• A config server is for storing shard meta-data
• It stores chunk ranges and locations
• Run with 3 in production!
[Diagram: a single config server, or three config servers]
This is not a replica set; the three servers are purely for failover purposes.
Pro tip: use CNAMEs to identify these.
40. What is a mongos?
• Acts as a router / balancer for queries and ops
• No local data (persists all info to the config servers)
• Can run with just one or many
[Diagram: app servers connecting through one mongos, or through several]
41. MongoDB's Sharding Infrastructure
[Diagram: three app servers, three mongos routers, three shards, and three config servers]
42. Get Started With Sharding?
1. Choose a shard key (we'll talk about this later)
2. Start config servers
3. Turn on sharding
4. Profit.
44. Start the Configuration Server
[Diagram: a config server]
mongod --configsvr
Starts a configuration server on the default port (27019)
45. Start the mongos router
[Diagram: a mongos and the config server]
mongos --configdb catconf.mongodb.com:27019
46. Start the mongod
[Diagram: a mongos, the config server, and a shard running a single mongod]
mongod --shardsvr
Starts a mongod with the default shard port (27018)
Shard is not yet connected to the rest of the cluster
Could have already been a part of the cluster
47. Add the Shard
[Diagram: a mongos, the config server, and a shard running a single mongod]
On mongos:
sh.addShard('cat1.mongodb.com:27018')
For a replica set:
sh.addShard('<rsname>/<seedlist>')
49. Now enable sharding
• Enable sharding on a database:
sh.enableSharding("<dbname>")
• Shard a collection (with a key):
sh.shardCollection("<dbname>.cats", {"name": 1})
• Use a compound shard key to prevent duplicates:
sh.shardCollection("<dbname>.cats", {"name": 1, "uniqueid": 1})
50. Tag Aware Sharding
• Total control over the distribution of your data!
• Tag a range of shard keys:
sh.addTagRange(<collection>,<min>,<max>,<tag>)
• Tag a shard:
sh.addShardTag("shard0000","NYC")
51. The Balancer
• Ensures even distribution of chunks across the cluster
• Transparent to driver and application
• Very tunable but defaults are often sensible
Try to minimize clock skew with ntpd.
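The idea of evening out chunks can be sketched as a loop that moves one chunk at a time from the fullest shard to the emptiest. This is a simplified illustration, not MongoDB's balancer: the `balance` function, the shard names, and the threshold of 2 are all assumptions made for the example.

```javascript
// Toy balancer: repeatedly move one chunk from the shard with the most
// chunks to the shard with the fewest, until the gap is below the threshold.
function balance(chunkCounts, threshold = 2) {
  const counts = { ...chunkCounts };
  const migrations = [];
  const names = Object.keys(counts);
  // A gap of 1 can never be closed by moving whole chunks, so clamp to 2.
  const limit = Math.max(2, threshold);

  while (names.length >= 2) {
    let most = names[0];
    let least = names[0];
    for (const n of names) {
      if (counts[n] > counts[most]) most = n;
      if (counts[n] < counts[least]) least = n;
    }
    if (counts[most] - counts[least] < limit) break;
    counts[most] -= 1;
    counts[least] += 1;
    migrations.push({ from: most, to: least });
  }
  return { counts, migrations };
}

const { counts, migrations } = balance({ shard0000: 10, shard0001: 2, shard0002: 0 });
console.log(counts); // { shard0000: 4, shard0001: 4, shard0002: 4 }
console.log(migrations.length); // 6
```

The application never sees these moves; it only sees that the data ends up spread evenly.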
65. Things to remember!
• Shard key is immutable
• Shard key values are immutable
• Shard key must be indexed
• It is limited to 512 bytes in size
• Try to choose a field used in queries
• Only the shard key can be guaranteed unique across shards
The shard key should not be monotonically increasing!
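The warning about monotonically increasing keys can be illustrated with a small simulation. The chunk boundaries, the `toyHash` function, and the key values below are hypothetical; in particular `toyHash` is a stand-in chosen for the example and is not the hash MongoDB uses for hash based sharding.

```javascript
// Hypothetical boundaries giving four range chunks:
// (-inf, 250), [250, 500), [500, 750), [750, +inf)
const boundaries = [250, 500, 750];

// Index of the chunk whose range contains the value.
function chunkIndex(value) {
  let i = 0;
  while (i < boundaries.length && value >= boundaries[i]) i += 1;
  return i;
}

// Count how many writes each of the four chunks receives.
function countWrites(keys) {
  const counts = [0, 0, 0, 0];
  for (const k of keys) counts[chunkIndex(k)] += 1;
  return counts;
}

// Toy multiplicative hash onto 0..999 (illustration only).
function toyHash(n) {
  return (Math.imul(n, 2654435761) >>> 0) % 1000;
}

// 100 ever-increasing ids, such as a counter or a timestamp.
const ids = Array.from({ length: 100 }, (_, i) => 1000 + i);

console.log(countWrites(ids)); // [ 0, 0, 0, 100 ]: every write hits the last chunk
console.log(countWrites(ids.map(toyHash))); // writes spread across the chunks
```

With the raw increasing key, one chunk (and so one shard) absorbs all new writes, while hashing the same values scatters them across the key space.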
66. How to choose your key?
• Cardinality
• Write Distribution
• Query Isolation
• Reliability
• Index Locality
Cardinality: can your data be broken down enough?
Query isolation: query targeting to a specific shard
Reliability: shard outages
A good shard key can:
Optimize routing
Minimize (unnecessary) traffic
Allow best scaling
Consider pre-splitting.
No unique index keys unless they are part of the shard key.
Geo keys cannot be part of a shard key.
$near won't work, but the $geo commands work fine.
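Query isolation, mentioned in the notes above, can be sketched as a routing decision: a query that includes the shard key can go to one shard, while any other query has to be sent to all of them. The chunk table, the `targetShards` helper, and the choice of `name` as the shard key are assumptions for illustration, and the sketch only handles exact-match queries.

```javascript
const SHARD_KEY = "name"; // assumed shard key field for this sketch

// Hypothetical chunk table over string keys, using half-open ranges [min, max).
const chunks = [
  { min: "", max: "m", shard: "shard0000" },
  { min: "m", max: "\uffff", shard: "shard0001" },
];

// Decide which shards a simple equality query must be sent to.
function targetShards(query) {
  const allShards = [...new Set(chunks.map((c) => c.shard))];
  // Without the shard key, the router cannot narrow the search: scatter-gather.
  if (!(SHARD_KEY in query)) return allShards;
  const value = query[SHARD_KEY];
  const chunk = chunks.find((c) => c.min <= value && value < c.max);
  return chunk ? [chunk.shard] : allShards;
}

console.log(targetShards({ name: "felix" })); // [ 'shard0000' ]
console.log(targetShards({ age: 3 })); // [ 'shard0000', 'shard0001' ]
```

This is why the slides suggest choosing a field that is used in queries: the more queries carry the shard key, the fewer shards each query touches.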