2. Agenda
Current state of resource management in the
Linux kernel
Beancounters overview
User memory management
I/O accounting
Kernel memory management
Network buffers accounting
Performance
3. Current state
Per-process accounting and limiting (rlimits)
− Manages individual processes
− Memory limits are mostly ignored by the kernel
Group-based management
− Absent
Global statistics
− Not suitable for group isolation
6. Beancounters basics
A beancounter manages a group of tasks
Resource counters parameters
− held – the current consumption level
− limit – the maximal allowed level of consumption
− barrier – the "shortage warn" line – each resource
controller may take some precautions
− fails – the number of allocation rejects
Beancounter is assigned once during process
lifetime
8. Beancounters controlled resources
User memory
− Length of mappings
− RSS
− Locked pages
Dirty page cache
Kernel memory
Network buffers
Miscellaneous resources
− Number of tasks
− Number of files
− Number of sockets
− Number of file locks
− Number of PTYs
− Number of signals
− Active dentry cache
10. User memory management
VMA lengths accounting
− Graceful rejects of VM region allocation
− Take precautions against overcommitment
RSS accounting
− Real memory usage
− OOM killer priorities
Dirty page cache accounting
− IO statistics and scheduling
11. VMA lengths accounting
VMAs classification
− unreclaimable: private and anonymous
− reclaimable: shared file mappings
Pages classification
− unused: parts of mapped regions
− used: touched pages
[Figure: task address space. Labels: unused pages, used pages, unreclaimable VMAs, reclaimable VMAs, “Lengths of mappings” resource, “RSS” resource]
12. VMA lengths accounting pros'n'cons
Pros
− The way to track the host commitment level
− Graceful rejects of address space growths
Cons
− Hard limiting of address space growth
13. RSS accounting
[Figure: a page after the first touch and after N touches. Labels: page, page beancounter, beancounter]
Drawbacks
− Additional pointer on the struct page
− Extra locking during page faults
14. Shared pages accounting
Account the page to the first beancounter
− Non uniform statistics for similar beancounters
Account a whole page for each beancounter
− The values accounted are not related to the actual
memory usage
Account the page's fractions to all the beancounters
− The “middle” way used in the beancounters
17. Dirty page cache accounting
[Figure: page states. Labels: first touch, N touches, dirty, unmap, last unmap, clean, IO beancounter]
26. Network buffers accounting
Mainstream accounting shortcomings
slab overhead is not included
− up to 30% for usual Ethernet frames
− unpredictable difference for non-ethernet MTU
− no way to recalculate skb->truesize
27. Implementation basics
Separate accounting for
− send and receive buffers
− TCP and all the other types of traffic
Implementation is straightforward:
− account actual memory usage for objects with
undefined or infinite lifetime
select(2) compatibility
Buffer space guarantees
30. Performance
Test name           No RSS   Full
Process creation     97%      91%
Execl Throughput     99%      91%
Pipe Throughput     100%      99%
Shell Scripts        96%      87%
File Read            99%      98%
File Write          101%      99%
RSS accounting – the bottleneck
Hi, my name is Pavel. My talk is about resource management in the kernel and the way we do it in OpenVZ.
In the coming half an hour we'll talk about the current state of resource management in the Linux kernel and outline some of its shortcomings.
After this I will introduce our resource management subsystem, the beancounters. I will cover the main part of this subsystem, the memory management. This includes user and kernel memory management, input/output accounting and network buffers accounting.
At the very end, of course, I will show the impact of the beancounters on the kernel and tell you what we're planning to do about it.
OK. Resource management can occur at three levels.
First, processes can be tracked individually, and the Linux kernel has some tools for this: the rlimits are intended to help with per-process resource management. Their disadvantages are obvious: they work on individual processes only and protect the system from accidents only. Not to mention the fact that the memory limits (e.g. RLIMIT_CORE/_RSS) are mostly ignored by the kernel.
The next level is group-based accounting, which is completely missing from the kernel. The "user" notion is used at the VFS layer only. So this level of accounting is needed in the kernel rather badly.
At the top is global management. This is the most thoroughly developed kind of management in the kernel, but it is not suitable for group isolation at all.
The operating system provides numerous resources to the executing processes. The main ones are memory, CPU time, IO and network bandwidth, and disk space.
<click>
This talk concerns only the most crucial resource: memory. As we'll see later, "memory" itself stands at the top of a deep hierarchy of resources.
Now let's look at some basics of the beancounters. The best analogy for a beancounter is the "nsproxy", which has recently appeared in the kernel.
The beancounter denotes a process group. It is assigned to a task once during the task's lifetime and accounts for all the resource allocations made by the task.
The beancounters account for many resources, each of which is characterized by four values. "Held" is the current consumption level. "Limit" is the maximal allowed level of resource consumption. "Barrier" is the value that tells the controller that the resource is about to run out and it is time to take some precautions. Finally, "fails" shows the number of rejected resource allocations.
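To make the four values concrete, here is a minimal sketch of one resource counter in C. All the names (res_counter, res_charge, res_uncharge, the result codes) are hypothetical and chosen for illustration; this is not the OpenVZ source, and the rule "report when held exceeds the barrier" is an assumption about what "taking precautions" might hook into.

```c
#include <assert.h>

/* Hypothetical model of one beancounter resource; not the OpenVZ code. */
struct res_counter {
    unsigned long held;     /* current consumption level */
    unsigned long barrier;  /* "shortage warning" line */
    unsigned long limit;    /* maximal allowed consumption */
    unsigned long fails;    /* number of rejected allocations */
};

enum charge_result { CHARGE_OK, CHARGE_OVER_BARRIER, CHARGE_REJECTED };

/* Try to account `size` units; reject if the limit would be exceeded. */
enum charge_result res_charge(struct res_counter *rc, unsigned long size)
{
    if (size > rc->limit || rc->held > rc->limit - size) {
        rc->fails++;
        return CHARGE_REJECTED;
    }
    rc->held += size;
    /* Assumption: crossing the barrier is only reported to the caller. */
    return rc->held > rc->barrier ? CHARGE_OVER_BARRIER : CHARGE_OK;
}

/* Return `size` units; never lets held underflow. */
void res_uncharge(struct res_counter *rc, unsigned long size)
{
    rc->held = size > rc->held ? 0 : rc->held - size;
}
```

With a barrier of 80 and a limit of 100, charging 50 succeeds, charging 40 more succeeds but signals that the barrier was crossed, and a further charge of 20 is rejected and counted in "fails".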
Let me give some detail on the accounting. Let's start with a single process (a purple circle). Whenever it wants, it can <click> attach itself to a beancounter. After this the process, all its <click> children and even <click> grandchildren live with this beancounter and cannot disown it.
Almost every kernel object that is created by a resource allocation request <click> from the task is accounted to that task's beancounter. That is, its "weight" or "size" is added to the "held" value of the appropriate resource on the beancounter. The object may be of almost any kind: a file, a dentry, an iptables rule, a virtual memory region, anything.
The beancounters work with the following memory-related resources.
The user memory, which includes such resources as "the lengths of mappings" and the resident set size (the RSS). The locked page set is also accounted, but it is omitted from this talk as there is nothing interesting in its design and implementation.
Dirty page cache is another kind of resource, and a very interesting one as we'll see later.
The most crucial memory type, the kernel memory, is also accounted.
Finally, if we have time, we'll talk about the network buffers management. It has some peculiarities.
<click>
However, the full list of beancounter-controlled resources also includes the numbers of tasks, files, sockets, etc., the active dentry cache and so on.
Now let's see the details of user memory management.
As I have said, there are three types of resources here: the lengths of mappings, the RSS and the dirty page cache.
Accounting the first resource gives us the ability to reject virtual memory extension gracefully, with an error returned from the mmap or brk system call. Besides, this allows us to take some precautions against node overcommitment.
RSS accounting gives us the real memory usage (we'll see a bit later how it works) and provides good group priorities for the out-of-memory killer.
Dirty page cache accounting is mainly used for gathering IO statistics and for scheduling asynchronous IO (output disk traffic).
OK, here's the first resource: the lengths of mappings.
It works with the task address space. All the VM areas it may have are split <click> into two classes: the unreclaimable areas <click>, those that are not backed by any disk file and thus will go to swap on memory shortage; and <click> the reclaimable ones, those that are backed by a disk file and whose reclamation almost always succeeds.
Next, we distinguish two <click> page types in the areas. An "unused" page <click> is a "hole" in the VM area: the place for the page is reserved, but the physical page is not yet created. This page becomes "used" <click> on a page fault. A used page is a synonym for a physical page.
Using these terms, the "lengths of mappings" resource is <click> the sum of the used pages and the unused ones in the unreclaimable areas.
The RSS resource is <click> merely the number of used pages.
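The two definitions can be written down as a short sketch. The struct and function names below are hypothetical, and the code assumes one reading of the definition: used pages are counted in every area, while unused pages are counted only in unreclaimable areas.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-area summary; not a kernel structure. */
struct vma_model {
    unsigned long total_pages;  /* pages spanned by the mapping */
    unsigned long used_pages;   /* pages already faulted in */
    bool reclaimable;           /* true if backed by a disk file */
};

/* "Lengths of mappings": all used pages plus the unused pages
 * of unreclaimable areas (assumed reading of the definition). */
unsigned long lengths_of_mappings(const struct vma_model *v, size_t n)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += v[i].used_pages;
        if (!v[i].reclaimable)
            sum += v[i].total_pages - v[i].used_pages;
    }
    return sum;
}

/* RSS: merely the number of used pages. */
unsigned long rss_pages(const struct vma_model *v, size_t n)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += v[i].used_pages;
    return sum;
}
```

For an anonymous area of 100 pages with 30 touched and a file mapping of 200 pages with 50 touched, this gives a mappings length of 150 pages and an RSS of 80 pages.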
What are the pros and cons?
The pros are that we have a way to track node overcommitment and can reject address space extension gracefully (that is, in a way the application expects during its normal operation).
The con is that once the group hits the limit it cannot grow further unless it releases some of its mappings.
Now the RSS resource. This works with pages that are <click> touched and thus obtained by the process.
The page ownership is established during page faults by attaching a page beancounter to the page. The page beancounter (a blue box) is a "tag" that is attached to a page and points to the beancounter (a green box) that owns the page.
We do not point to the beancounter directly because, after many touches from <click> different beancounters, the page has to point to many beancounters. To track this we attach a circular list of page beancounters, each one pointing to the appropriate beancounter.
The page beancounter is also responsible for the per-beancounter mapcount of the page.
The drawbacks of this approach are <click> the following. First, we have an extra pointer on the struct page. Second, we introduce extra locking in the page faults. What this results in is at the end of my talk.
The most interesting part of RSS accounting is how a shared page is accounted among the owning beancounters.
Let's start with a single page. When the first beancounter touches it, <click> it gets the whole page into its RSS resource.
Then comes the second one <click>. We take one half of the page from the first beancounter and move it to the newcomer.
The third one will <click> steal a quarter from the second one, without bothering the first.
The fourth <click> beancounter gets a quarter from the first one and doesn't touch the other owners.
That is, we account the pages in halves. Then we'll have one-eighths and so on. This algorithm's benefits <click> are the following.
The first is that we have a constant-time algorithm for adding and removing page owners: on each touch we work with only one of the existing owners and do not bother the others.
The second is that when we sum up the RSS values from all the beancounters we get the real physical memory consumption by the user space.
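The halving scheme can be modelled in a few lines. This is only a sketch under stated assumptions: it uses a fixed array instead of the circular list of page beancounters, it covers adding owners only, and the rule for choosing which existing owner to split is one hypothetical rule that reproduces the order described above. PAGE_UNITS, page_owners and the function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OWNERS 64
#define PAGE_UNITS (1u << 16)   /* fixed-point value of one whole page */

/* Owner i holds 1/2^shift[i] of the page. */
struct page_owners {
    unsigned shift[MAX_OWNERS];
    size_t count;
};

/* Largest power of two that is <= n (n must be > 0). */
static size_t floor_pow2(size_t n)
{
    size_t p = 1;
    while (p * 2 <= n)
        p *= 2;
    return p;
}

/* Add an owner by halving the share of exactly one existing owner.
 * Returns the new owner's index, or -1 if the table is full. */
int page_add_owner(struct page_owners *po)
{
    size_t n = po->count;
    if (n >= MAX_OWNERS)
        return -1;
    if (n == 0) {
        po->shift[0] = 0;               /* first toucher gets the whole page */
    } else {
        size_t p = floor_pow2(n);
        size_t victim = (p - 1) - (n - p);  /* assumed selection rule */
        po->shift[victim] += 1;         /* victim keeps half of its share */
        po->shift[n] = po->shift[victim];   /* newcomer gets the other half */
    }
    po->count = n + 1;
    return (int)n;
}

/* Share of owner i in PAGE_UNITS. */
unsigned page_share(const struct page_owners *po, size_t i)
{
    return PAGE_UNITS >> po->shift[i];
}
```

After four additions every owner holds a quarter, and at every step the shares add up to exactly PAGE_UNITS, which is the property that makes the per-beancounter RSS values sum to the real memory consumption.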
Now let's see how we track the dirty page cache.
Let's start with the known scheme of a page <click> owned by several beancounters.
When one of them writes to the page <click> and thus makes it dirty, we attach an extra tag <click> to it, the IO beancounter, which points to the page's ownership list of page beancounters and holds the beancounter that made the page dirty.
Even if the page gets unmapped from its dirtier <click>, and even if it gets completely unmapped from the user space <click>, it holds its IO beancounter until it is flushed to disk <click> and becomes clean. Note that at the time of writing the page out we know the beancounter that is responsible for this write.
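The lifetime of the IO tag can be sketched as a tiny state model. The names are hypothetical, the ownership list is left out, and the code assumes that the first beancounter to dirty a clean page stays responsible until writeback, which the talk does not spell out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-group IO statistics. */
struct beancounter {
    unsigned long dirty_held;     /* dirty pages currently charged */
    unsigned long pages_written;  /* pages flushed on its behalf */
};

/* Hypothetical page model; not struct page. */
struct page_model {
    int mapcount;                 /* remaining user-space mappings */
    struct beancounter *io_bc;    /* NULL while the page is clean */
};

/* A write makes the page dirty and tags it with the writer. */
void page_set_dirty(struct page_model *pg, struct beancounter *writer)
{
    if (pg->io_bc)
        return;   /* assumption: the first dirtier keeps the page */
    pg->io_bc = writer;
    writer->dirty_held++;
}

/* Unmapping does not touch the IO tag at all. */
void page_unmap(struct page_model *pg)
{
    if (pg->mapcount > 0)
        pg->mapcount--;
}

/* Writeback completion: the write is attributed, the tag is dropped. */
void page_writeback_done(struct page_model *pg)
{
    if (!pg->io_bc)
        return;
    pg->io_bc->dirty_held--;
    pg->io_bc->pages_written++;
    pg->io_bc = NULL;
}
```

A page dirtied by one group and then unmapped by everybody still carries its tag, so the eventual disk write is attributed to the group that caused it.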
The good points of such RSS tracking are obvious.
We have node memory utilisation statistics, we have support for asynchronous IO scheduling and we have the groundwork for page reclamation.
The price we pay for it is the performance issues, which will be covered in detail later, and the extra memory that is required to store all the page and IO beancounters seen so far.
Now the kernel memory management.
The reason we have to control it is that the normal zone of the kernel is limited. This zone is the only place where the kernel can hold its objects, and if it gets exhausted the system can stop. This problem applies mainly to 32-bit architectures, but even on 64-bit ones a single process can eat hundreds of megabytes from this zone with page tables alone.
The major (and actually the only) problem of tracking kernel objects is their freeing context. Reference counting and RCU allow kernel objects to be freed in almost any context; that is, the beancounter the current task belongs to is most often not the one that brought the object to life.
The objects that are allocated with vmalloc and the buddy page allocator can be tracked easily. We just use the (already seen) extra pointer <click> on the struct page. Vmalloc objects are considered to be owned by the <click> owner of their zeroth page.
This is simple and doesn't produce any noticeable inconvenience.
From the kernel API point of view we just have an additional GFP flag that tells the buddy allocator that we would like to account the page allocation to the beancounters.
Slab objects are different. Since one page may carry objects belonging to different beancounters, we cannot treat the page's owner as the object's owner.
To solve this we <click> place the appropriate number of pointers behind the struct slab and all its bufctls.
Each object on the slab is owned by the beancounter referenced by the pointer with the same index.
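A simplified model of the per-object owner array is shown below. The types and function names are hypothetical and the real struct slab layout is not reproduced; the point is that the uncharge path reads the owner stored next to the object and never looks at the task that happens to be freeing it, which addresses the freeing-context problem mentioned above.

```c
#include <assert.h>
#include <stddef.h>

#define OBJS_PER_SLAB 4

/* Hypothetical group with a kernel memory counter. */
struct beancounter {
    unsigned long kmem_held;
};

/* Hypothetical slab: owner[i] is the beancounter of object i. */
struct slab_model {
    size_t obj_size;
    struct beancounter *owner[OBJS_PER_SLAB];
};

/* Allocation path: remember who asked for object idx and charge it. */
int slab_charge_obj(struct slab_model *s, size_t idx, struct beancounter *bc)
{
    if (idx >= OBJS_PER_SLAB || s->owner[idx])
        return -1;
    s->owner[idx] = bc;
    bc->kmem_held += s->obj_size;
    return 0;
}

/* Free path: may run in any context, so use the stored owner. */
int slab_uncharge_obj(struct slab_model *s, size_t idx)
{
    if (idx >= OBJS_PER_SLAB || !s->owner[idx])
        return -1;
    s->owner[idx]->kmem_held -= s->obj_size;
    s->owner[idx] = NULL;
    return 0;
}
```

Two groups can hold objects on the same slab, and freeing an object only changes the counter of the group that allocated it.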
The described slab accounting may introduce some problems.
First, slabs may hold fewer objects as we steal some space for our pointers. Next, slabs may become "offslab".
That's the theory. The reality <click> differs.
Small objects from size-32 lose 10 percent of their number, but the other caches look much better. Look, we lose only 5 percent of the objects in the size-64 cache and even less in the others. And no slabs become offslab.
The advantages are obvious: we have total control over the consumption of the most crucial memory resource, the kernel memory.
All the disadvantages have actually been optimized out already. We have no performance hit and negligible extra memory consumption.
OK, let's move on to the network accounting.
First, let me explain why the existing accounting facilities are not that good.
The first thing is that the slab overhead is not taken into account. For example, Ethernet frames are mainly allocated from size-2048 and occupy up to 30 percent more space than they really need. The second thing is that the incoming traffic is not accounted at all, and finally the existing accounting is not strict: the limits set can easily be exceeded.
The basics of network accounting are the following.
We distinguish incoming and outgoing traffic, which is natural. But we also differentiate between TCP and all the other traffic. Thus we have four kinds of resources.
These four kinds share some fundamentals:
We account the actual memory usage with all the overheads for all the objects that have an undefined or infinite lifetime. For example, TCP clones that are sent to the card are not accounted as they live for a very limited time.
Then we have select compatibility: if select says that the socket is writable, the write or send system call won't be rejected by the beancounter.
And one more thing is that we provide some minimal guaranteed space for each kind of resource to allow slow progress for each socket.
However, there are some specifics for different resources.
For example, TCP outgoing paths wait for beancounter resource availability in case the limit is hit.
For simple protocols like UDP we drop the incoming packets when the limit is hit. This is OK as the protocol itself does not provide any packet delivery guarantees.
Netlink traffic is always accounted on the user-side socket regardless of the traffic source. This is done because the kernel produces traffic mainly in response to the user.
Now we're done with the accounting. Let's now talk about the performance.
The major bottleneck is the RSS accounting with its extra locking. As you can see from the table, the beancounters without this part produce very small overhead (if any) on many unixbench tests.
The RSS accounting mainly hits the speed of fork() and exec() (and thus the shell test). However, we have some ideas of how to improve this.
The beancounters are continuously evolving and even now we have something to work on.
The major question is the performance issues. The main techniques we're exploiting are pre-charging and on-demand accounting.
Pre-charging means that each task reserves some amount of a resource from the beancounter during its creation and uses it up for object allocations in the future. We have this working for kernel memory, files and network buffers and plan to account the VMA lengths with it.
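Pre-charging can be illustrated with a sketch in which each task keeps a private reserve and only falls back to the shared beancounter when the reserve runs dry. The names are hypothetical, locking is omitted, and refilling the reserve is left out for brevity.

```c
#include <assert.h>

/* Hypothetical shared counter of a group. */
struct beancounter {
    unsigned long held;
    unsigned long limit;
};

/* Hypothetical per-task reserve taken from the beancounter. */
struct task_reserve {
    struct beancounter *bc;
    unsigned long reserve;
};

/* Shared (slow) path: in a real kernel this would need a lock. */
static int bc_charge(struct beancounter *bc, unsigned long size)
{
    if (bc->held > bc->limit || size > bc->limit - bc->held)
        return -1;
    bc->held += size;
    return 0;
}

/* At task creation: move `amount` from the beancounter to the task. */
int task_precharge(struct task_reserve *t, struct beancounter *bc,
                   unsigned long amount)
{
    t->bc = bc;
    t->reserve = 0;
    if (bc_charge(bc, amount) != 0)
        return -1;
    t->reserve = amount;
    return 0;
}

/* Later allocations: the fast path touches only task-local data. */
int task_charge(struct task_reserve *t, unsigned long size)
{
    if (size <= t->reserve) {
        t->reserve -= size;
        return 0;
    }
    return bc_charge(t->bc, size);
}
```

The benefit is that small allocations are satisfied from the reserve without touching the shared counter, so the shared path is taken only at task creation and when the reserve is too small.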
On-demand accounting is a bit more tricky. The main idea is that we need to account the precise values only when we're near the limit. When the beancounter consumes little of a resource we can make do with estimates of the consumption level. We have this working for the active dentry cache and intend to implement it for RSS accounting.
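The talk does not describe the mechanism, so the following is only one possible way to get "estimates far from the limit, precise values near it": changes are collected in a local delta and folded into the shared counter in batches, and the batch size shrinks to zero as the counter approaches its limit. All names and the constants 8 and 32 are assumptions made for the example.

```c
#include <assert.h>

/* Hypothetical shared counter; held is only an estimate
 * while local deltas are pending. */
struct lazy_counter {
    long held;
    long limit;
};

/* Hypothetical local (per-task or per-CPU) pending change. */
struct local_delta {
    long pending;
};

/* Allowed imprecision: large when far from the limit, zero near it. */
static long batch_for(const struct lazy_counter *c)
{
    long room = c->limit - c->held;
    long batch;

    if (room <= 0)
        return 0;
    batch = room / 8;
    return batch > 32 ? 32 : batch;
}

/* Record a change of n units, flushing only when the delta is too big. */
void lazy_add(struct lazy_counter *c, struct local_delta *d, long n)
{
    long batch = batch_for(c);

    d->pending += n;
    if (d->pending > batch || d->pending < -batch) {
        c->held += d->pending;
        d->pending = 0;
    }
}
```

Far from the limit, small changes stay in the local delta and the shared counter is left alone; once the remaining room drops below eight units the batch becomes zero and every change is applied immediately.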
Another important question is implementing hard RSS limits with page reclamation. We have sent some proof-of-concept patches to LKML. Soon this will appear in official OpenVZ kernels.
And the last interesting question is TCP window management based on beancounter resource availability. This will allow for better management of TCP traffic.
Well, that's all. Thank you for your attention. Now I'm ready to answer your questions.