Read these webinar slides to learn how selecting the right shard key can future-proof your application.
The shard key that you select can impact the performance, capability, and functionality of your database.
Webinar: Choosing the Right Shard Key for High Performance and Scale
1. Ger Hartnett
Director of Technical Services (EMEA), MongoDB @ghartnett #MongoDB
Tales from the Field
Part three: Choosing the Right Shard Key for
High-Performance and Scale
3. ●The main talk should take 30-35 minutes
●You can submit questions via the chat box
●We’ll answer as many as possible at the end
●We are recording and will send slides Friday
●This is the final webinar in a series of 3
Before we start
4. ●You work in operations
●You work in development
●You have a MongoDB system in production
●You have contacted MongoDB Technical
Services (support)
●You attended an earlier webinar in the series
(part1, part2)
A quick poll - add a word to the
chat to let me know your
perspective
5. ●We collect observations about common mistakes to share the experience of many
●Names have been changed to protect the
(mostly) innocent
●No animals were harmed during the making
of this presentation (but maybe some DBAs
and engineers had light emotional scarring)
●While you might be new to MongoDB we
have deep experience that you can leverage
Stories
6. 1. Discovering a DR flaw during a data
centre outage
2. Complex documents, memory and
an upgrade “surprise”
3. Wild success “uncovers” the wrong
shard key
The Stories (part three today)
8. Story #1: Recovering from a
disaster
●Prospect in the process of signing up for a
subscription
●Called us late on a Friday: data centre power outage and 30+ servers (11 shards) down
●When they started bringing up the first
shard, the nodes crashed with data
corruption
●17TB of data, very little free disk space,
JOURNALLING DISABLED!
9. Recovering each shard
1. Start secondary read-only
2. Mount NFS storage for repair
3. Repair former primary node
4. Iterative rsync to seed a secondary
(Diagram: replica set with a primary and two secondaries)
10. Key takeaways for you
●If you are departing significantly from the standard config, check with us (e.g. if you think journalling is a bad idea)
●Two DCs in different buildings, on different flood plains, not in the path of the same storm (e.g. secondaries in AWS)
●DR/backups are useless if you haven’t
tested them
11. Story #2: Complex documents,
memory and an upgrade “surprise”
●Well established ecommerce site selling
diverse goods in 20+ countries
●After switching to WiredTiger in production, performance dropped - the opposite of what they were expecting
12. {
  _id: 375,
  en_US : { name : ..., description : ..., <etc...> },
  en_GB : { name : ..., description : ..., <etc...> },
  fr_FR : { name : ..., description : ..., <etc...> },
  de_DE : ...,
  de_CH : ...,
  <... and so on for other locales... >
  inventory: 423
}
Product Catalog: Original Schema
13. Key Takeaways
●When doing a major version/storage-engine
upgrade, test in staging with some
proportion of production data/workload
●Sometimes putting everything into one document is counter-productive (see the sketch below for one alternative)
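One alternative shape (a hedged sketch, not the customer's actual fix) is to split the catalog entry into one document per product/locale pair, so a read for a single storefront only pulls a small document into the WiredTiger cache. Collection and field names here are illustrative:

// Hypothetical per-locale catalog document - "catalog", productId and locale
// are illustrative names, not the customer's schema
db.catalog.insert({
  _id: { productId: 375, locale: "en_US" },
  name: "...",
  description: "..."
  // inventory would live in one locale-independent document per product,
  // so it is not duplicated across locales
});

// A storefront read now touches exactly one small document
db.catalog.findOne({ _id: { productId: 375, locale: "en_GB" } });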
14. Story #3: Wild success uncovers the
wrong shard key
●Started out as the error “[Balancer] caught exception … tag ranges not valid for: db.coll”
●11 shards; they had added 2 new shards to keep up with traffic - 400+ databases
●Lots of code changes ahead of the Super Bowl
●Spotted slow 300+ second queries, decided to build some indexes without telling us
●Then production went down
17. Diagnosing the issues #1
●The red-herring hunt begins
●Transparent Huge Pages enabled in production
●Chaotic call - 20 people talking at once, then
in the middle of the call everything started
working again
●Barrage of tickets and calls
●Connection storms
19. Diagnosing the issues #2
●Got inconsistent and missing log files
●Discovered repeated scatter-gather queries
returning the same results
●Secondary reads
●Heavy load on some shards and low disk
space
21. Diagnosing the issues #3
● Shard key - string with year/month & customer id
{
  _id : ObjectId("4c4ba5e5e8aabf3"),
  count: 1025,
  changes: { … },
  modified : {
    date : "2015_02",
    customerId: 314159
  }
}
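A hedged reading of why this key struggles under write load (the shardCollection call below is an assumption based on the slide; the namespace is hypothetical):

// Assumed form of the original shard key declaration
sh.shardCollection("db.coll", { "modified.date": 1, "modified.customerId": 1 })

// Every new write carries the current year/month in "modified.date", so a
// whole month's inserts fall into the chunk range for that single month value
// and land on whichever one or two shards own it, while the other shards sit
// largely idle. Queries that omit the year/month prefix cannot be targeted
// and go scatter-gather to every shard.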
23. Diagnosing the issues #4
●First heard about the DDoS attack
●Missing tag ranges on some collections
●Stopped the balancer which reduced system
load from chunk moves
●Two clusters had a mongos each on the
same server
24. Fixing the issues
●Script to fix the tag ranges
●Proposed a finer-granularity shard key - but this was not possible because of 30TB of data
●Moved mongos to dedicated servers
●Re-enabled the balancer for short windows with waitForDelete and secondaryThrottle
●Put together scripts to pre-split and move empty chunks to quiet shards, based on traffic from the month before (see the sketch below)
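A minimal sketch of what the monthly routine can look like (the namespace, shard names, split points and customer-id bands are assumptions, not the customer's actual scripts):

// Throttle chunk migrations before the window using the documented balancer
// settings in the config database
var cfg = db.getSiblingDB("config");
cfg.settings.update(
  { _id: "balancer" },
  { $set: { _secondaryThrottle: true, _waitForDelete: true } },
  { upsert: true }
);

// Pre-split empty chunks for the coming month and spread them across the
// quieter shards while they are still empty - empty chunks are fast to move
var ns = "db.coll";                           // hypothetical namespace
var nextMonth = "2015_03";
var quietShards = ["shard0003", "shard0007"]; // hypothetical shard names
for (var i = 0; i < 10; i++) {
  var splitPoint = { "modified.date": nextMonth, "modified.customerId": i * 100000 };
  sh.splitAt(ns, splitPoint);
  sh.moveChunk(ns, splitPoint, quietShards[i % quietShards.length]);
}

// Re-enable the balancer only for the short window, then stop it again
sh.startBalancer();
// ... end of window ...
sh.stopBalancer();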
26. The diagnosis in retrospect
●The outage did not appear to have been related
to either the invalid tag ranges or the earlier
failed moves
●The step-downs did not help resolve the outage but did highlight some queries that needed to be fixed
●The DDoS was the ultimate cause of the outage - and led to the diagnosis of deeper issues
●The deepest issue was the shard key
27. Aftermath and lessons learned
●Signed up for a Named TSE
●Now doing pre-split and move before the
end of every month
●Check before making other changes (e.g. building new indexes)
28. Key takeaways for you
●Choosing a shard key is a pivotal decision -
make it carefully
●Understand your current bottleneck
●Monitor insert distribution and chunk ranges (see the sketch below)
●Look for slow queries (logs & mtools)
●Run mongos, mongod, and config servers on dedicated servers, or use containers/cgroups
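One way to watch chunk distribution from the mongo shell (a hedged sketch; "db.coll" is a hypothetical namespace):

// Count chunks per shard for one collection straight from the config metadata
db.getSiblingDB("config").chunks.aggregate([
  { $match: { ns: "db.coll" } },
  { $group: { _id: "$shard", chunks: { $sum: 1 } } }
]);

// sh.status() shows the same picture plus tag ranges and balancer state
sh.status();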
Some borrowed, some merged into a single narrative
Some of the people that inspired them may well be here in this room today
Bill's Bulk Updates randomly affected an ever-larger data set.
In order to cope with the database size, Bill added more shards.
The cluster scaled linearly, as intended.
Just because you can add horizontal capacity, does not mean it is the optimum solution
Imagine that the sample rate was going to go from once a minute to once every 5 seconds
Empty chunks are fast to move.
We had actually told them to check with us before choosing the shard key, but we only found out three months after they put it in production.