Speaker: Akira Kurogane, Senior Technical Services Engineer, MongoDB
Level: 300 (Advanced)
Track: Performance
One week your active data set consumes 90% of available RAM. The next week it's 110%. Is that a 10% or a 99% performance degradation? Let's discover what it looks like when different hardware capacity limits are hit: memory vs. disk bottlenecks, the rare CPU bottleneck, network bottlenecks, what happens when you drop a crucial index during peak load, and what happens when you run multiple WiredTiger nodes on the same server without limiting their cache sizes.
What You Will Learn:
- Performance analysis
- Post-mortem log analysis
- Capacity planning
2. WHY YOU'RE HERE TODAY
Curious about DB performance
Responsible for DB performance
You've experienced a bad "Everything in production is slow!" day at work before, and you'd like to never have one again.
3. DB PERFORMANCE IS A COMPLEX EQUATION
• "What if the queries per second rate increased by 50%
compared to now?"
• "What if the queries and aggregations get larger on average?"
• "What if the read to write ratio changes?"
• "What if I downsize the server or use cheaper disk storage to
reduce cost?"
δx = . . . . .?
5. TWO DIFFERENT PERFORMANCE-LIMITING MECHANISMS
1. For any channel
The channel's throughput capacity is saturated -> bottleneck.
Mainly influenced by: the rate of db ops/sec * the cost of the avg op
2. Storage I/O channels
The small, fast storage layer is full -> the next, slower level is used
L1 Cache -> L2 -> L3 -> RAM -> Disk
Mainly influenced by: how much data is 'active'
6. "ACTIVE DATA SET"
Active data set size is not derived simply from total data size, or server specs.
My definition: the portion of your data where 99%* of reads are expected to be completed within a fixed latency.
* or 99.9%, or 99.99%, etc., according to your preference
Example:
At PerformanceShopper.com we found 99.9% of reads are either on recently-inserted documents, or from certain small collections.
If we get in-memory latencies for that 99.9% then disk latency for the other 0.1% is fine.
Our total data size may be ~1 TB but the "active" documents are < 100 GB.
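There's no single metric for the active set itself, but the raw sizes it's compared against are easy to read from the shell. A minimal sketch (both fields are standard db.stats() / serverStatus() output; the "active" fraction you must estimate from your own access patterns):

    var totalBytes = db.stats().dataSize;  // uncompressed data size of the current db
    var cacheBytes = db.serverStatus().wiredTiger.cache["bytes currently in the cache"];
    print("total data:  " + (totalBytes / 1e9).toFixed(1) + " GB");
    print("in WT cache: " + (cacheBytes / 1e9).toFixed(1) + " GB");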
10. DEMO #2: NETWORK BOTTLENECK
A simple query that can be switched between tiny and huge result sizes.
db.collection.find(
  /* find */    { _id: X, nested_array: Y },  // nested_array is large in every document
  /* project */ { _id: true, "nested_array.$": true }
  // Oops: let's 'accidentally' forget the ".$"
)
Avg result is ~ 0.4 kB when nested_array.$ is used; ~1.8 MB when it is not
14. DEMO #3: ACTIVE DATA SET GROWS BEYOND WIREDTIGER CACHE
Collection "foo": 160 GB. Average document size is 1kb,
when uncompressed in the WiredTiger cache.
RAM on the server is 15 GB: WiredTiger cache set to
10GB.
Test query:
db.foo.find({_id: <val>})
for random _id within limited range.
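The load generator itself isn't shown in the deck; a minimal sketch of the kind of loop described, assuming sequential integer _id values (my assumption, not stated on the slide):

    // ~2 million 1 kB documents ≈ the initial 2 GB active range
    var activeRange = 2 * 1000 * 1000;
    while (true) {
      var id = Math.floor(Math.random() * activeRange); // random 'active' _id
      db.foo.find({ _id: id }).toArray();               // point read, as above
    }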
17. THE STORY IN WIREDTIGER CACHE ACTIVITY
N.b. to make decompression run slower than typical in this test I artificially constrained CPU cores to just 2.
[Chart: bytes read into the WiredTiger cache per second for the 2-core and 8-core runs; y-axis marked at 100, 200 and 300 MB/s.]
18. This test's active data set size was kept within the WiredTiger cache size.
So far this test is rigged to be pure RAM. Disk was avoided.
• Default WiredTiger cache size: 60% of RAM.
• Leaves 40% for OS and filesystem cache.
‒ Let's say 35% for the filesystem page cache.
Even with mildly compressible document data, more than twice the cache size of document data will be in RAM. It just needs to be decompressed on the fly.
19. DEMO #4: ACTIVE DATA SET GROWS INTO DISK RANGES
Continue the same test, gradually increasing the range of data being queried to ~10x the WiredTiger cache size.
Rough calculation: by the end, >70% of queries will need to wait for disk.
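Spelling that rough calculation out (my own reading of the figures from slides 14 and 18, not shown here): the WiredTiger cache holds ~10 GB of uncompressed documents, and the filesystem page cache makes perhaps another ~2x the cache size reachable without a disk read, so call it ~30 GB served from RAM in total. Once the queried range reaches ~10x the cache size (~100 GB), only ~30% of uniformly random point reads can come from RAM; the other ~70% must wait for disk.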
20. WHAT YOU SEE IN MONGODB OP COUNTERS
[Chart: op counters over time, under a linear increase in active data set size; annotated "Welcome to disk-land!" where the cliff begins.]
21. THE STORY IN WIREDTIGER CACHE ACTIVITY
[Diagram: WT cache activity before - RAM -> Decompress -> RAM. Where does the data come from now - RAM or disk?]
22. AGGREGATE SUMS OF DIFFERENT LATENCIES
Classic latency comparison numbers
--------------------------------------------
Main memory reference                  0.1 μs
Compress 1K bytes with snappy            3 μs
Read 4K randomly from SSD              150 μs
Read 1 MB sequentially from memory     250 μs
Read 1 MB seq'ly from 300 MB/s SSD   3,300 μs
Disk seek                           10,000 μs
Read 1 MB sequentially from disk    20,000 μs
(Disk seek vs. main memory reference: ~10^5 apart. Sequential 1 MB read from disk vs. memory: ~10^2 apart.)
23. THEORETICAL 100 MB READ LATENCIES
Magnetic disk
RAM vs Disk %   Latency   What the users say
100 / 0           25 ms   (normal)
99 / 1            42 ms   "Everything's really slow"
90 / 10          200 ms   "Everything's broken"
50 / 50         1000 ms   "What do you mean ETA ..."
0 / 100         2000 ms   "... is next week?!"
24. THEORETICAL 100 MB READ LATENCIES
Low-end SSD, 300 MB/s
RAM vs SSD %    Latency   What the users say
100 / 0           25 ms   (normal)
99 / 1            28 ms   (normal)
90 / 10           50 ms   "Everything's really slow"
50 / 50          175 ms   "Everything's broken"
0 / 100          330 ms   "ETA within today?"
25. WRITE LOADS
The previous demonstrations focused on read-only cases alone.
Writes are more I/O-bound than reads.
Every write involves disk access at two points:
• First, all writes go to the journal. (Commits ~10 times per second.)
• Asynchronously, WiredTiger cache blocks marked 'dirty' -> compressed -> fdatasync'ed to disk (once per minute).
Key point: focus even more on disk util% and WiredTiger cache activity than we did in the previous demonstrations.
27. CPU, NETWORK BOTTLENECKS
• It's unlikely you're suffering from these.
• But on the other hand it's not hard to check them.
• Check them, forget them, move on to storage I/O.
28. STORAGE
• On a logarithmic scale the difference between disk latency and RAM latency doesn't look so bad ...
... but here in the real, linear-time universe it is.
• Increased read MB/s into the WiredTiger cache is not a problem if it's being read from the filesystem page cache, but:
• That metric growing from near-zero to hundreds of MB/s warns you that the active data set is getting closer to 'disk-land'.
Hello all. I'm happy to be here, to have the opportunity to discuss performance topics with you.
My name is Akira and I work at MongoDB as a Technical Services Engineer; that is, I'm a member of the support team. In the support team we have a diverse set of skills, but mine lean towards server development and Linux performance.
As we go through this presentation you're going to see four demonstrations. These demonstrations are drawn from lessons I've learnt supporting MongoDB in the field.
Although they have been simplified, all still reflect real-world situations that have caught someone out.
Let's begin.
◉◉◉
In this presentation I'm only going to show perfectly-running software hosted on perfectly good hardware.
So what are you going to learn? What is the problem I will address? It's that we, the users and administrators of the software, are not very good at predicting when performance will change.
Well, I think the majority of us do have good gut feelings about how much more load we can place on our database servers,
but in truth, if we were challenged to answer "When?" and "How much?", we know our estimations are too rough.
Some of us have had the experience of discovering our estimations were wrong - very wrong - and even if it hasn't happened to you yet, don't kid yourself. It could.
In the process of trying to predict how your database server will perform in the future you're probably going to ask the following sorts of 'what if' questions.
◉◉◉◉
These are as simple as you can make them - you're just changing one (◉) variable according to DB metrics in these questions.
The problem is that each of those single DB metrics relies on multiple variables at the hardware level.
The first reason it's a complex equation: the wildly different response times of the different parts of a modern computer are unintuitive.
The human mind isn't naturally suited to processing polynomials that have constants of nanoseconds in one place and tenths of a second in another.
And to get into semantics, it's not even an equation - it's an algorithm. An algorithm with these polynomials on the inside.
The second reason it's a complex equation: There are two different top-level mechanisms influencing database performance
◉ ◉
The first and simpler rule is a generic one: when a server process saturates the capacity of any channel, that channel becomes a bottleneck.
"This is mainly influenced by: the rate of db ops multiplied by their average execution cost."
The first two demonstrations will show examples of this.
◉ ◉
The second is that when the ACTIVE DATA SET grows larger, the fraction of data that has to be accessed from the lower, slower storage levels increases.
Data reads are slower by one to two powers of ten for each level lower.
"This is mainly influenced by: how much data is 'active'"
The third and fourth demonstrations will show this mechanism.
I used the term "Active dataset" in the last slide and I will use it again. Is anyone concerned that I am not defining it clearly enough? (Raise hand) Good. You should be pestering me to explain this, it's an important concept.
◉
It is not derived simply from total data size, or server specs.
◉
It is a subjective measure. It depends on what your goals for latency are (personally, or those set by a SLA)
◉
To give one example, to give the general idea, let's imagine I have an ecommerce company called PerformanceShopper. Here's what I might say about my active data set. (Read the example)
Now I begin the demonstrations.
The first two will be classic bottleneck cases.
The third and fourth ones will show what happens when the active data set overflows available RAM.
This first demonstration will compare these two aggregations. The second one adds a sort on an unindexed field to force high CPU load.
Requesting a sort in a query or aggregation doesn't necessarily cause CPU load. If an applicable index exists then the collection data can be iterated in that order. (In this case that would be a compound index on city and first_name, in that order.)
But when there is no applicable index the query engine must perform a sort for every query. Once you have hundreds of documents per query that becomes a significant cost. A hedged sketch of the two variants follows below.
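The aggregation text doesn't survive in this extraction, so this is a hypothetical reconstruction from the narration (the collection name and match value are assumed; the city and first_name fields are named in the notes):

    // Variant A: match only; documents come back in index order, no sort cost.
    db.people.aggregate([
      { $match: { city: "Sydney" } }            // ~20,000 small documents
    ]);

    // Variant B: the same match plus a sort on an unindexed field. With no
    // { city: 1, first_name: 1 } compound index, the engine must sort all
    // ~20,000 documents in memory on every execution - pure CPU cost.
    db.people.aggregate([
      { $match: { city: "Sydney" } },
      { $sort: { first_name: 1 } }
    ]);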
These aggregations select 20,000 small documents each. What can the high CPU usage needed to sort them do to your performance? The answer is here (next slide)
◉ ◉
In these two graphs we can see a blue line representing Queries per Second in the opcounters graph, and a matching latency graph on the right.
As you can see when the problem begins the latency increases and the rate of database ops per second decreases.
You know what I've done here, but I'd like you to imagine this came upon you out of nowhere. You suspect CPU-greedy operations somewhere. What can you do to prove or disprove?
◉
Simply look at the CPU usage. If the system-wide CPU usage has gone to 100% or close to it, that's it. If it isn't .... then it isn't.
If it has, then of course you want to look for CPU-intensive operations. Use the mongod logs to look for lots of slow commands, or use the profiler (a quick sketch follows after the examples below).
Look for sorts or anything else that you could suspect of being relatively compute-intensive.
Some examples of other CPU-intensive operations include:
- Aggregations that work on large result sets.
- Authentication functions are CPU-intensive by design, so if you're needlessly opening and closing scores or hundreds of new connections each second that can give you a CPU bottleneck too.
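A quick sketch of those two checks (both are standard shell commands; the 100 ms threshold is just an example value):

    db.setProfilingLevel(1, 100);  // profile operations slower than 100 ms
    // After letting it run for a while, list the slowest operations first:
    db.system.profile.find().sort({ millis: -1 }).limit(5).pretty();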
In my time at MongoDB I have only seen one clear, undeniable case of network bottleneck, and it was achieved exactly like the mistake above.
For those who are unfamiliar with the positional $ projection operator (shown in orange here) it lets you keep only the nested_array item that matches the query clause, and discard all the others in the same nested array.
You can see that the latter query will return several thousand times as much data over the network compared to the first one.
I think we can show everything with this single set of graphs.
In the top left graph (Opcounters) please pay the most attention to the blue line. It shows the number of find commands dropping dramatically.
The top right is a graph of database operation latency.
Can you see what is strange?
The latency, measured in the query engine, has only changed a small amount, but the client is receiving far fewer results per minute.
Q: What explains the difference? A: Saturation in the network
◉
What goes up is network bytes out per second - to a very high value. In this test case I managed to get a fairly flat ceiling, which I suspect is deliberate rate limiting in AWS.
In a normal self-owned data center I'd expect the traffic between servers on the same LAN to be more variable. Still very high though - LANs these days are very fast.
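If you don't have a monitoring dashboard handy, the counter behind that graph is plain serverStatus output; a minimal sampling sketch (network.bytesOut is cumulative, so two samples give a rate):

    var t0 = db.serverStatus().network.bytesOut;
    sleep(10 * 1000);  // shell sleep(), in milliseconds
    var t1 = db.serverStatus().network.bytesOut;
    print("network out: " + ((t1 - t0) / 10 / 1e6).toFixed(1) + " MB/s");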
The previous test was so simple I was afraid you might not believe it was a meaningful demonstration.
So I've added this supplementary example to show you the effects of the previous test diluted by the effects of another benchmark test running at the same time.
Despite the mixed load the key points in a network bottleneck remain the same -
◉ a really high transmission rate,
◉ a drop in operations per second,
◉ but at the same time little difference to the average latency measured server-side.
Intermission topics:
By the way, I'm based in the Sydney office of MongoDB. Anyone here from Australia?
Akira Kurogane == Michael Castley
Before we go on I want to impress on you that the issues in the following slides are going to happen to you.
The previous demonstrations are problems that can spring upon you suddenly, but they only happen when people don't take care to test what the performance effects of their queries are.
But the following problems are something that will come upon you without you making any mistake.
No mistake other than being insufficiently vigilant about the increasing size of your database.
In the last two demonstrations we looked at the simpler mechanism of running into a single bottleneck. Now I'll bring the second factor into play - changes in Active data set size. If you have ample RAM then it will simply use more and more of that RAM, but when that is exhausted ... an increasing amount of data will be accessed from the lower, slower storage layers.
Databases are, in the great majority of use cases, bound by the latency of the I/O on the server they are running on, rather than CPU. That I/O layer is not going to be the CPU caches - they're used of course, but they have capacities of just MB rather than GB. Instead the majority of reads will be from RAM and/or disk.
For this demonstration I created a single 160 GB collection. The test server has 15 GB of RAM, and to be more specific the WiredTiger cache size is 10 GB.
In the initial stage of this demonstration I picked a range of documents that only need 2GB of space. I warmed up that small data set, then I ran the tests you will see on the following slides.
For those unfamiliar with the term OpCounters, I'm referring to those that are sourced from the MongoDB serverStatus output. These are the count of insert, update, query, getmore and delete etc. commands.
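For reference, this is where those counters live in the shell:

    db.serverStatus().opcounters
    // e.g. { insert: ..., query: ..., update: ..., delete: ..., getmore: ..., command: ... }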
In this test I started with a range of ids that needed only 2 GB of cache space. I increased the range of ids being queried by a constant amount; the size of data needed to hold those documents grew from 6, to 7, to 8, 9, 10, 11 GB, etc. It's obvious there's an effect as the cache size is exceeded.
OK let's stop, step back for a moment.
I'd like you to imagine you're seeing this graph when looking at the performance of the database at your workplace. You're disturbed by the drop in performance, as it's happening to you in real life and not just in some test case you saw at a conference. Something appears to have changed, so you ask the application developers to find out what happened. But then they tell you "The application didn't change; the queries didn't change".
With that information alone this is an unexplainable incident. If you're a DBA, having an unexplainable incident like this is exactly where you don't want to be.
Of course you need to investigate more deeply. Let's start with the mongod logs.
Although the mongod logs are the number one diagnostic resource generally, they aren't going to give us the insight we need here. But let's pass through for completeness.
The slow command log lines above are examples of the query I am running in this test. The top one is from the higher performance time, the bottom is from a lower performance time.
Let's highlight the differences as best we can ◉
There aren't many. White - unimportant. Green - important, but identical. Red - different.
The green parts show that the query engine is using the same plan type and scanned the same number of index entries and documents - so you can see the query engine is doing the same thing all through. It's just the latencies that have changed.
So if the query engine is doing the same thing let's look at the storage engine.
To observe what's happening in the storage engine I'm going to look at the WiredTiger cache activity metrics.
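The graphs that follow come from a monitoring tool, but the underlying counter is plain serverStatus output; a sketch of sampling the read-into-cache rate by hand ("bytes read into cache" is a cumulative WiredTiger statistic):

    var c0 = db.serverStatus().wiredTiger.cache["bytes read into cache"];
    sleep(10 * 1000);
    var c1 = db.serverStatus().wiredTiger.cache["bytes read into cache"];
    print("read into WT cache: " + ((c1 - c0) / 10 / 1e6).toFixed(1) + " MB/s");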
As a first picture, here's the OpCounters graph from before. Let's fade that out to show the graph of WiredTiger cache activity from the same time.
◉◉
Here we see the reason: the increasingly slow performance starting at that elbow lines up precisely with the increase in WiredTiger cache activity.
◉
In this test, on this server, you can see that having to read 200 MB into the cache every second has caused the performance to drop 10%. When it gets close to 400 MB/s the performance has dropped by about 20%.
Because there wasn't enough RAM to keep everything in cache the storage engine has to do more work every single second re-reading out-of-cache data back in, and so the queries became slower on average.
◉
I have to confess I cheated a bit, just to make a nice graph for this presentation.
I limited the mongod - this mongod running 40,000 queries per second - to just 2 cores, to magnify the effect that decompressing the out-of-cache data has on aggregate performance. With a two-core limit I was able to produce a mild CPU bottleneck.
I had to do this because the degradation effect was basically invisible when I used all 8 cores on the server. (At least 8 cores is typical for a 40k/s load.)
◉
So the little side-lesson here is: yes, WiredTiger uses compression, and compression and decompression must consume CPU, but it hardly taxes the overall multi-core power of a typical server.
And now I have to make a second confession. The previous test was rigged in a more fundamental way. It avoided disk, and thus avoided the worst latencies you can get in a server.
How is explained in this slide:
The WiredTiger cache should not be configured to use all the RAM.
By default it's 60% - that's deliberate, so there'll be a fair amount of RAM left for the kernel to use for filesystem cache. When MongoDB is busy on the server, its data files are the ones that the kernel will buffer there.
So the filesystem cache will become, in effect, an in-memory store of compressed document data.
Even with mildly compressible data it will be possible for it to hold 2 or 3 times as many documents as there are in the WiredTiger cache.
◉◉
So just by doing decompression you can have an active data set that is 2 or 3 times as large as the WiredTiger cache, and reads of that data don't need to touch the disk at all.
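Rough arithmetic behind the "2, 3 times" claim, using the slide's figures plus an assumed snappy compression ratio of ~3:1 (my assumption, not stated in the deck): with 15 GB of RAM and a 10 GB WiredTiger cache, ~35% of RAM (~5 GB) of filesystem page cache holds ~15 GB worth of uncompressed documents. Ignoring overlap between the two caches, that's on the order of 25 GB of documents reachable without a disk read - roughly 2.5x the WiredTiger cache size.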
For the fourth demo let's do the whole thing again, but keep going. Make the active data set even larger.
For comparison I will include the previous demonstration's test stages again.
They'll be on the left. Then on the right we will see the performance decrease caused by using disk.
Are you ready? Here it comes, in one hit:-
◉
(Welcome to disk-land)
◉
I have to remind you at this point that the active data set size was only being increased in a steady, linear fashion.
You too could go those few extra percent and FALL OFF this cliff. ◉
"How? Why?" you're probably asking.
To start with I'm going to show the WiredTiger cache activity again.
◉◉
Can anyone guess what happens to the cache activity when disk reads are introduced?
(Can I see a show of hands for those who will think it goes up? Stays even? Down?)
◉◉◉
As you can see the WiredTiger cache activity goes down at this time.
The new collection data being read from disk still needs to be decompressed and brought into the cache, but it's queued. It's queued waiting for blocks of file from disk to be delivered.
Time to step back and see a new angle.
This graph is the story of two-and-a-half storage layers
◉ Everything in RAM
◉ The 'pseudo-layer' of compressed data in RAM
◉ And disk
By the way, disk utilization at this time, where the performance plummets, jumps straight from near-zero figures to 100%, and then sticks there.
So, what is the best explanation, the real explanation, for this behaviour? It's that the time cost of running ten thousand or ten million database operations is really the aggregate sum of different hardware latencies.
Here I show a table I'm sure nearly everyone here is familiar with. It shows that the latencies of the different technologies in our servers are basically powers of ten apart.
◉
The difference in seek times between RAM and magnetic disk is dramatic: 10^5 apart.
◉
It makes the mere 10^2 difference in throughput rate look mild. But remember, 100 times slower is not mild if it's a factor hitting your database throughput. 100x slower is a disaster.
I'll give you a moment to read this.
I'd like the lesson of this slide to be: Even a little disk quickly poisons the average latency.
Just one percent of spinning magnetic disk can halve your performance.
And if 1/10th of the data is being sourced from disk instead of RAM you've lost 90% of your performance.
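The arithmetic behind those claims, using the classic numbers from slide 22: reading 100 MB entirely from RAM costs ~25 ms, and entirely from magnetic disk ~2,000 ms. A 99/1 mix therefore costs roughly 0.99 × 25 ms + 0.01 × 2,000 ms ≈ 45 ms - nearly double the all-RAM figure, in line with the ~42 ms row in the table. At a 90/10 mix: 0.9 × 25 + 0.1 × 2,000 ≈ 220 ms, about a tenth of the all-RAM throughput.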
In the next slide I will show the same thing, but for SSD instead of spinning magnetic disk.
You can see that even a low-spec SSD buys you a power-of-ten delay before the issue manifests itself.
READ MESSAGE ON SLIDE
Let's summarize. Please start thinking of any questions you would like to ask.
First: the CPU and network bottlenecks.
Even if you have strong suspicions it's unlikely you're suffering from these.
But on the other hand it's not too hard to check them.
For your peace of mind just check them, and move on.
(For reference when I do support cases I don't look at either of these things first, they're that uncommon.)
◉
I would say the second most important thing I shared today is that when you get into disk-land, performance drops quickly and hard.
The No. 1 important thing was the concept of ACTIVE DATA SET SIZE. It's not your total data size that matters, it's the subset of it that is being actively accessed every minute. That's what needs to be kept in RAM.
◉◉
Lastly - you don't have to be blind to an upcoming "Everything in production is slow" disaster day.
Watch your disk metrics over the long term - that's obvious - but also, as demonstrated here today:
even though you may currently be keeping 99.99% of reads serviced by RAM,
watch your WiredTiger cache activity.
If it has been a low value for a long time, but recently it's increasing ... and increasing ... then you've pushed above the uncompressed WiredTiger cache size and you're using an increasing amount of filesystem cache.
How far that will go before you run out depends on how compressible your data is.
But obviously, what you need to do when you see this is get some more RAM before you run out.
That's it - but as we have some time left, did anyone want to ask a question to:
clarify something, or
ask what would happen if we had X or Y instead of A or B in these demonstrations,
or anything else like that?