12. Page Cache: Read
[Diagram: in user space, a process calls read(fd, *buffer, count), a system call into kernel space, where the page cache sits. Example file: Page 1, Page 2, Page 3; file descriptor at 2,000, end at 10,000. For the offset+count pages the flow asks "Page in cache?" No: read from disk and store in cache. Yes: read from cache and copy to *buffer.]
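The read path in this diagram can be sketched as a toy model. This is only an illustration, not how the kernel is implemented: the ToyPageCache class, the 4,096-byte page size and the in-memory "file" are assumptions made for the example.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class ToyPageCache:
    """Toy model of the read path: check the cache per page, fall back to disk."""

    def __init__(self, file_bytes):
        self.file_bytes = file_bytes   # stands in for the file on disk
        self.pages = {}                # page number -> cached page contents
        self.disk_reads = 0            # number of pages fetched from "disk"

    def _get_page(self, page_no):
        if page_no not in self.pages:              # page in cache? No
            start = page_no * PAGE_SIZE
            self.pages[page_no] = self.file_bytes[start:start + PAGE_SIZE]
            self.disk_reads += 1                   # read from disk, store in cache
        return self.pages[page_no]                 # Yes: serve from cache

    def read(self, offset, count):
        """Return up to `count` bytes starting at `offset`."""
        end = min(offset + count, len(self.file_bytes))
        if offset >= end:
            return b""
        first, last = offset // PAGE_SIZE, (end - 1) // PAGE_SIZE
        data = b"".join(self._get_page(n) for n in range(first, last + 1))
        skip = offset - first * PAGE_SIZE
        return data[skip:skip + (end - offset)]    # copy to the caller's buffer


data = bytes(range(256)) * 48            # a 12,288-byte example file (3 pages)
cache = ToyPageCache(data)
first = cache.read(2000, 8000)           # from offset 2,000 up to 10,000
reads_after_first = cache.disk_reads     # every touched page came from disk
second = cache.read(2000, 8000)          # same range again: served from cache
```

The first read touches all three pages and has to fetch each one; the second read of the same range causes no further disk reads, which is the whole point of the cache.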
22. Mitigation Plan
Protect MongoDB with an API
Enforce index usage
Pass a query timeout (from 2.6)
Example of a simple API
def find_samples(start_time, end_time):
    # Callers can only run a time-range query on the 'time' field.
    return samples.find({'time': {'$gte': start_time,
                                  '$lt': end_time}})
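The other two items of the mitigation plan could be folded into the same function. The sketch below assumes a pymongo collection is passed in as `samples`, that an ascending index on 'time' exists, and that a default timeout of one second is acceptable; none of these come from the slide. `hint()` and `max_time_ms()` are pymongo cursor methods.

```python
def find_samples(samples, start_time, end_time, timeout_ms=1000):
    """Time-range query that forces the 'time' index and bounds its run time."""
    cursor = samples.find({'time': {'$gte': start_time, '$lt': end_time}})
    # Enforce index usage (assumes an ascending index on 'time' exists).
    cursor = cursor.hint([('time', 1)])
    # Server-side query timeout, available from MongoDB 2.6.
    return cursor.max_time_ms(timeout_ms)
```

Because callers never build queries themselves, they cannot accidentally run an unindexed scan that evicts the working set from memory.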
Why should you care about memory management?
Memory management has a huge impact on performance and costs.
This matters to both developers and DBAs. As a developer you can optimize the schema and queries for better memory usage.
As a DBA you can monitor and predict performance issues related to memory usage. I’m pretty sure every MongoDB administrator has asked themselves at least once: how much memory do I really need?
Before we dive in I want to tell you a little secret: MongoDB doesn’t actually manage memory. It leaves that responsibility to the operating system.
Within the operating system there’s a stack of components which MongoDB depends on to manage memory.
Each component relies on the component below it.
This talk is structured around this stack of components.
We’ll start from the low-level components, which are the storage devices: disks and RAM.
We’ll continue with the page cache and memory-mapped files, which are part of the operating system’s kernel.
And we’ll finish off with MongoDB’s usage of these mechanisms.
Let’s talk about storage.
There are different types of storage devices with different characteristics. We’ll review hard disk drives, solid state drives and RAM.
Let’s start by breaking these into categories: HDDs and SSDs are persistent and RAM isn’t, but RAM is really fast. That’s why every computer has both types of storage, one persistent (an HDD or an SSD) and one volatile (RAM).
Now let’s compare throughput. As I said before, RAM is fast: it can reach 6,400 MB/s for reads and writes.
SSDs are about 10 times slower than RAM. Modern SSDs can reach a read rate of 650 MB/s and a little less for writes.
HDDs are much slower, ranging from 1 MB/s to 160 MB/s for reads and writes.
The reason there’s such variance in HDD speed is that throughput is highly affected by access patterns.
Specifically with HDDs, random access is much slower than sequential access, because an HDD contains a mechanical arm that needs to move on almost every random access.
Sadly for us, databases do a lot of random I/O. This means that if you’re running a query on data that’s not in memory, and it therefore has to be read from disk, you’re seeing a penalty of about two orders of magnitude on response times.
The next characteristic is price.
To make the comparison easier we’ll compare the price per GB. It’s not surprising that there’s a correlation between price and throughput: the more you pay for each GB, the better the throughput you get. Hard drives are really cheap at 5 cents per GB, SSDs are 10 times more expensive than that, and RAM is 100 times more expensive.
This slide reveals the tradeoffs between price, capacity and performance which are key factors in choosing the right hardware configuration.
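To make the tradeoff concrete, here is a small calculation using only the throughput and price figures quoted in this talk. The 100 GB data set is a made-up size, and the numbers should be read as rough estimates rather than benchmarks.

```python
# Figures quoted in the talk; treat them as rough estimates.
THROUGHPUT_MB_S = {"RAM": 6400, "SSD": 650, "HDD (best case)": 160, "HDD (worst case)": 1}
PRICE_PER_GB_USD = {"HDD": 0.05, "SSD": 0.05 * 10, "RAM": 0.05 * 100}


def seconds_to_read(size_gb, mb_per_second):
    """Time to read `size_gb` gigabytes at a sustained rate."""
    return size_gb * 1024 / mb_per_second


def storage_cost_usd(size_gb, device):
    """Cost of holding `size_gb` gigabytes on the given device type."""
    return size_gb * PRICE_PER_GB_USD[device]


DATASET_GB = 100  # hypothetical data set size

for device, rate in THROUGHPUT_MB_S.items():
    print(f"{device}: {seconds_to_read(DATASET_GB, rate):,.0f} s to read {DATASET_GB} GB")
for device in PRICE_PER_GB_USD:
    print(f"{device}: ${storage_cost_usd(DATASET_GB, device):,.2f} for {DATASET_GB} GB")
```

With these figures, reading the whole data set takes seconds from RAM and, in the worst case, more than a day from an HDD, while holding it in RAM costs 100 times what the HDD does.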
Is this information sufficient to choose the optimal hardware configuration? I think it’s not. Your application’s requirements are also part of the equation.
For example, if your application is an archive that saves huge amounts of data that is rarely accessed, you can go for a large HDD and save a lot of money.
Later on we’ll see how you can take measurements of things like RAM and capacity, and then you’ll be able to determine what kind of hardware configuration you need.
Before looking at additional tools I want to answer a simple question: how do we know when something is wrong? What do we need to monitor?
And since we’re talking about memory, how do we know we don’t have enough of it?
Well, the phenomenon of not having enough memory is called thrashing.
When the OS is thrashing, it’s because an application is constantly accessing pages that are not in memory, so the OS is busy handling page faults and reading the pages from disk.
So the first thing to monitor is page faults, and since it’s hard to tell how many page faults are too much, you should also look at disk utilization.
There are a lot of other things that go wrong, like many queries being queued and high lock ratios, but these are just symptoms.
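A small simulation shows why page faults explode once the working set outgrows memory. It assumes a plain LRU eviction policy and uniformly random page accesses, which is a simplification of what a real kernel and a real workload do; the function name and all sizes are made up for the example.

```python
from collections import OrderedDict
import random


def fault_rate(working_set_pages, memory_pages, accesses=20_000, seed=0):
    """Fraction of uniformly random page accesses that miss an LRU cache."""
    rng = random.Random(seed)
    cache = OrderedDict()            # page number -> None, oldest entry first
    faults = 0
    for _ in range(accesses):
        page = rng.randrange(working_set_pages)
        if page in cache:
            cache.move_to_end(page)  # hit: mark as most recently used
        else:
            faults += 1              # page fault: this would be a disk read
            cache[page] = None
            if len(cache) > memory_pages:
                cache.popitem(last=False)  # evict the least recently used page
    return faults / accesses


fits = fault_rate(working_set_pages=500, memory_pages=1000)
too_big = fault_rate(working_set_pages=4000, memory_pages=1000)
print(f"working set fits in memory: {fits:.1%} of accesses fault")
print(f"working set is 4x memory:   {too_big:.1%} of accesses fault")
```

When the working set fits, only the first touch of each page faults. When it is four times the size of memory, most accesses fault, and every fault is a trip to disk.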
I usually use iostat for looking at disk utilization.
Here’s an example output of the command. The rightmost column shows the disk utilization and reveals a disk that is busy 100% of the time.
The second column shows that the disk serves 570 reads per second, and the third column shows the number of writes per second, which is zero.
If this is happening constantly, the working set does not fit in memory.
Along with iostat, I frequently use mongostat.
Mongostat comes packaged with MongoDB and uses the underlying serverStatus command. It displays a bunch of interesting metrics like the number of page faults and queued reads.
It’s pretty hard to say how many page faults are too many, but more than one or two hundred page faults per second indicates that a lot of data is being read from disk. If this happens over long periods of time it could be an indication the working set does not fit in RAM.
If the number of queued reads is larger than a hundred over long periods of time it could also be an indication the working set doesn’t fit in RAM.
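These two rules of thumb can be turned into a small check on top of the serverStatus command. This is a sketch under assumptions: it reads `extra_info.page_faults` (a cumulative counter) and `globalLock.currentQueue.readers`, whose availability depends on platform and server version, and the sample documents below are made up rather than real server output.

```python
PAGE_FAULTS_PER_SEC_LIMIT = 200   # "one or two hundred" per second, from the talk
QUEUED_READS_LIMIT = 100          # queued reads threshold, from the talk


def memory_pressure(prev_status, curr_status, interval_seconds):
    """Compare two serverStatus documents taken `interval_seconds` apart."""
    # page_faults is cumulative, so the rate is the difference over the interval.
    faults = (curr_status["extra_info"]["page_faults"]
              - prev_status["extra_info"]["page_faults"])
    faults_per_sec = faults / interval_seconds
    queued_reads = curr_status["globalLock"]["currentQueue"]["readers"]
    return {
        "page_faults_per_sec": faults_per_sec,
        "queued_reads": queued_reads,
        "suspect": (faults_per_sec > PAGE_FAULTS_PER_SEC_LIMIT
                    or queued_reads > QUEUED_READS_LIMIT),
    }


# Made-up samples; with pymongo they would come from
# client.admin.command("serverStatus"), called twice.
prev = {"extra_info": {"page_faults": 10_000},
        "globalLock": {"currentQueue": {"readers": 3}}}
curr = {"extra_info": {"page_faults": 13_000},
        "globalLock": {"currentQueue": {"readers": 120}}}
report = memory_pressure(prev, curr, interval_seconds=10)
print(report)
```

A single suspicious sample proves little. As with mongostat, the signal only matters when it persists over long periods.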
It’s often important to look at these parameters over time in order to determine if there’s a sudden spike or repeating problem. This brings me to offline monitoring.
Tools like MMS or Graphite can show you these important metrics over time.
Using one of these tools is mandatory for a production system. I cannot tell you how useful they are.
Whenever we get a ticket about a performance problem we put our Sherlock hats on and start an investigation.
We look at metrics related to our application but also, a lot of metrics related to mongo and how they change over time: we look at the number of queries, the number of documents in collections and tens of other metrics.
I’d like to show you an example workflow of a ticket.
It was a beautiful morning, 10 A.M., when I got an automated email saying that one of our shards was misbehaving: it had more than 300 queries just waiting in the queue.
I immediately opened Graphite. This is a screenshot of the number of page faults in green and the number of queued readers in blue. By looking at the history you can spot two trends:
1. First, there’s a spike of high load every hour. This is actually normal since we’re doing hourly aggregations of our data.
2. The second trend is a massive rise in page faults and queued queries at exactly 20:00. At this point there’s an impact on users, as a lot of queries take a very long time.
Why is this happening? Has the working set outgrown memory?
Let’s look at another screenshot of the same time frame. This time we look at other metrics: the number of queries in blue, the number of updates in green and the disk utilization in red.
Remember that disk utilization is measured as a percentage, so even though its graph is lower than the others we can still see that at 20:00 the disk was constantly utilized at 100%.
When looking at the updates vs. queries it’s obvious that a huge amount of updates is hurting the query performance. We were busy writing to disk.
In this case an application change was the root cause of the problem, the application simply started updating a lot more documents.
We were able to trace it to the application, and later on we changed our schema to reduce the document size and the load on the disk.
This brings me to the next topic, which is optimization.
When optimizing memory usage the main target is to reduce the amount of required memory for your application.
The smaller the collections and documents are, the faster the queries will be. This is true not just in terms of memory but also disk: if documents are smaller, less disk access is required to read them.
There are several optimizations you can do when it comes to schema:
First, shorten the keys. We started with long names like firstName, then shortened them to a single word or acronym, and finally used one or two letters, since it had a huge impact on the size of our data. By shortening the keys we reduced the size of our data by more than 50%. There is a big downside to doing this because it obscures the data, but fortunately we have an API that hides this ugly implementation detail, so it doesn’t have an impact on our users.
Another thing to consider is the tradeoff between the number of documents and their size. In many use cases it’s more efficient to store a small number of large documents than a large number of small ones.
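The key-shortening approach, with an API hiding the short names, might look like the sketch below. The mapping, the field names other than firstName and the sample document are all hypothetical, only top-level keys are translated, and JSON length is used as a rough stand-in for the stored BSON size.

```python
import json

# Hypothetical mapping from readable names to the short keys that get stored.
SHORT_KEYS = {"firstName": "fn", "lastName": "ln", "creationTime": "ct"}
LONG_KEYS = {short: long for long, short in SHORT_KEYS.items()}


def to_stored(document):
    """Translate readable top-level keys to short keys before writing."""
    return {SHORT_KEYS.get(key, key): value for key, value in document.items()}


def from_stored(document):
    """Translate short top-level keys back to readable keys after reading."""
    return {LONG_KEYS.get(key, key): value for key, value in document.items()}


user = {"firstName": "Ada", "lastName": "Lovelace", "creationTime": 1400000000}
stored = to_stored(user)

# JSON length is only a rough proxy for the size of the stored document.
saved_bytes = len(json.dumps(user)) - len(json.dumps(stored))
print(stored, "saves", saved_bytes, "bytes per document")
```

The saving is per document, which is why it adds up so quickly: key names are stored again in every single document of a collection.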
The next thing you can optimize is indices.
The first thing you should know is that unused indices are still accessed whenever documents are inserted, updated or deleted. Try to identify those and remove them.
Use sparse indices when only some of the documents will have the indexed attribute, as they use less space.
The last thing I want to talk about is how much of the index is located in memory. The answer is: it depends.
If the entire index is accessed by queries then the entire index should be located in memory. If only a single part of the index is used, only that part has to fit in memory.
Let’s look at a few examples to emphasize the difference. You can imagine an index as a segment of memory, where the red marks are locations frequently accessed by queries.
The first example is an index on a date field called creation_time. Each inserted document has a value larger than all previous ones, so the rightmost part of the index is updated.
In many such indexes only the recent history is accessed often, so only the rightmost part of the index will be located in memory.
The second example is an index on a person’s name. The index accesses will probably be distributed evenly across the entire index, so most of it will be located in memory.
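The difference between the two examples can be simulated by counting how many distinct index pages each access pattern touches. The index size, the "newest 5%" assumption for the date index and the function names are all made up for illustration.

```python
import random

INDEX_PAGES = 1000  # hypothetical index size, in pages


def pages_touched(pick_page, accesses=5000, seed=0):
    """Number of distinct index pages hit by `accesses` lookups."""
    rng = random.Random(seed)
    return len({pick_page(rng) for _ in range(accesses)})


def recent_dates(rng):
    # creation_time index: this model sends every lookup to the newest 5%
    # of the index, which are the rightmost pages.
    return rng.randrange(INDEX_PAGES - INDEX_PAGES // 20, INDEX_PAGES)


def any_name(rng):
    # name index: lookups are spread evenly across the whole index.
    return rng.randrange(INDEX_PAGES)


date_index_hot = pages_touched(recent_dates)
name_index_hot = pages_touched(any_name)
print(f"creation_time index: {date_index_hot} of {INDEX_PAGES} pages need to be in memory")
print(f"name index:          {name_index_hot} of {INDEX_PAGES} pages need to be in memory")
```

Both indexes are the same size on disk, yet the date index needs only a small slice of memory while the name index needs nearly all of it.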
So let’s summarize what we’ve learned:
1. We’ve seen how memory management works. We started from the disk and RAM, then went up the stack to the page cache, whose sole purpose is to improve read and write performance by using memory. We continued to memory-mapped files, which translate memory accesses like reads and writes into file reads and writes. And we finished with MongoDB’s usage of these mechanisms.
2. We’ve talked about the challenges this strategy presents, like predicting and measuring the size of the working set.
3. We then talked about monitoring, which is something you have to do if you have a DB running in production.
4. We finished with schema and index optimizations, which are crucial for cutting costs and improving performance.
I hope you enjoyed my talk and thanks for having me.