Cgroupv1 (or just "cgroups") has helped revolutionize the way that we manage and use containers over the past 8 years. In kernel 4.5, a complete overhaul is coming -- cgroupv2. This talk will go into why a new control group system was needed, the changes from cgroupv1, and practical uses that you can apply to improve the level of control you have over the processes on your servers.
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
DOXLON November 2016: Facebook Engineering on cgroupv2
1. cgroupv2: Linux’s new unified control
group system
Chris Down (cdown@fb.com)
Production Engineer, Web Foundation
2. cgroupv2: Linux’s new unified control group system Chris Down (cdown@fb.com)
About me
■ Chris Down
■ Production Engineer at Facebook
■ Working in Web Foundation
■ Working on cgroupv2 ↔ OS integration, especially inside Facebook
■ Dealing with web server and backend service reliability
■ Managing >100,000 machines at scale
■ Building tools to manage and maintain Facebook’s fleet of machines
■ Debugging production incidents, with a focus on Linux internals
3. cgroupv2: Linux’s new unified control group system Chris Down (cdown@fb.com)
In this talk
■ A short intro to control groups and where/why they are used
■ cgroup(v1), what went well, what didn’t
■ Why a new major version/API break was needed
■ Fundamental design decisions in cgroupv2
■ New features and improvements in cgroupv2
■ State of cgroupv2, what’s ready, what’s not
4. cgroupv2: Linux’s new unified control group system Chris Down (cdown@fb.com)
What are cgroups?
■ cgroup == control group
■ System for resource management on Linux
■ Provides directory hierarchy as main point of interaction (at /sys/fs/cgroup)
■ Limit, throttle, manage, and account for resource usage per control group
■ Each resource interface is provided by a controller
5. cgroupv2: Linux’s new unified control group system Chris Down (cdown@fb.com)
Practical uses
■ Isolating core workload from background resource needs
■ Web server vs. system processes (eg. Chef, metric collection, etc)
■ Time critical work vs. long-term asynchronous jobs
■ Ensuring machines with multiple workloads (eg. containers) do not allow one workload
to overpower the others
6. cgroupv2: Linux’s new unified control group system Chris Down (cdown@fb.com)
How did this work in cgroupv1?
cgroupv1 has a hierarchy per-resource, for example:
% ls /sys/fs/cgroup
cpu/ cpuacct/ cpuset/ devices/ freezer/
memory/ net_cls/ pids/
Each resource hierarchy contains cgroups for this resource:
% find /sys/fs/cgroup/pids -type d
/sys/fs/cgroup/pids/background.slice
/sys/fs/cgroup/pids/background.slice/async.slice
/sys/fs/cgroup/pids/workload.slice
7. cgroupv2: Linux’s new unified control group system Chris Down (cdown@fb.com)
How did this work in cgroupv1?
■ Separate hierarchy/cgroups for
each resource
■ Even if they have the same
name, cgroups for each
resource are distinct
■ cgroups can be nested inside
each other
/sys/fs/cgroup
resource A
cgroup 1
cgroup 2
...
resource B
cgroup 3
cgroup 4
...
resource C
cgroup 5
cgroup 6
...
8. cgroupv2: Linux’s new unified control group system Chris Down (cdown@fb.com)
How did this work in cgroupv1?
■ Limits and accounting are performed per-cgroup
■ For example, you might set memory.limit_in_bytes in cgroup 3, if resource B is
“memory”.
/sys/fs/cgroup
resource A
cgroup 1
pid 1 pid 2
resource B
cgroup 3
pid 3 pid 4
resource C
cgroup 5
pid 2 pid 3
9. cgroupv2: Linux’s new unified control group system Chris Down (cdown@fb.com)
How did this work in cgroupv1?
■ One PID is in exactly one cgroup per resource
■ For example, PID 2 is in separate cgroups for resource A and C, but in the root cgroup
for resource B since it’s not explicitly assigned
/sys/fs/cgroup
resource A
cgroup 1
pid 1 pid 2
resource B
cgroup 3
pid 3 pid 4
resource C
cgroup 5
pid 2 pid 3
10. cgroupv2: Linux’s new unified control group system Chris Down (cdown@fb.com)
Hierarchy in cgroupv1
/sys/fs/cgroup
memory/ service 1 memory.limit_in_bytes
cpu/ service 1 cpu.shares
blkio/ service 2 blkio.throttle.read_iops_device
11. cgroupv2: Linux’s new unified control group system Chris Down (cdown@fb.com)
How does this work in cgroupv2?
cgroupv2 has a unified hierarchy, for example:
% ls /sys/fs/cgroup
background.slice/ workload.slice/
Each cgroup can support multiple resource domains:
% ls /sys/fs/cgroup/background.slice
async.slice/ foo.mount/ cgroup.subtree_control
memory.high memory.max pids.current pids.max
12. cgroupv2: Linux’s new unified control group system Chris Down (cdown@fb.com)
How does this work in cgroupv2?
■ cgroups are “global” now —
not limited to one resource
■ Resources are now opt-in for
cgroups
/sys/fs/cgroup
cgroup 1
cgroup 2
...
cgroup 3
cgroup 4
...
cgroup 5
cgroup 6
...
13. cgroupv2: Linux’s new unified control group system Chris Down (cdown@fb.com)
Hierarchy in cgroupv1
/sys/fs/cgroup
memory/ service 1 memory.limit_in_bytes
cpu/ service 1 cpu.shares
blkio/ service 2 blkio.throttle.read_iops_device
14. cgroupv2: Linux’s new unified control group system Chris Down (cdown@fb.com)
Hierarchy in cgroupv2
/sys/fs/cgroup
service 1
memory.high ∗
cpu.weight
cgroup.subtree_control
memory
cpu
service 2
io.max (riops key)
cgroup.subtree_control io
∗
will discuss about this vs. memory.max later
15. cgroupv2: Linux’s new unified control group system Chris Down (cdown@fb.com)
Fundamental differences between v1 and v2
■ Unified hierarchy — resources apply to cgroups now
■ Granularity at PID, not TID level
■ Focus on simplicity/clarity over ultimate flexibility
16. cgroupv2: Linux’s new unified control group system Chris Down (cdown@fb.com)
v2 improvements: Tracking of non-immediate charges
Some types of resources are not immediately chargeable (eg. page cache writeback,
network packets going in/out)
In v1:
■ These resources are not able to be tied to a cgroup
■ Charged to the root cgroup for each resource type, so are essentially unlimited
In v2:
■ Resources spent in page cache writebacks and network are charged to the correct
cgroup
■ Can be considered as part of cgroup limits
17. cgroupv2: Linux’s new unified control group system Chris Down (cdown@fb.com)
v2 improvements: Communication with backing subsystems
In v1:
■ Most actions for non-share based resources reacted violently to hitting thresholds
■ For example, in the memory cgroup, the only option was to OOM kill or freeze
In v2:
■ Many cgroup controllers inform subsystems before problems occur
■ Subsystems can take action to remedy to avoid violent action (eg. forced direct reclaim
with memory.high, window size manipulation on reclaim failure)
■ Much easier to deal with temporary spikes in a resource’s usage
18. cgroupv2: Linux’s new unified control group system Chris Down (cdown@fb.com)
v2 improvements: Consistency between controllers
In v1:
■ Controllers often have inconsistent interfaces
■ Some controller hierarchies inherit values from parents, some don’t
■ Controllers that have similar limiting methods (eg. io/cpu) have inconsistent APIs
In v2:
■ Our crack team of API Design ExpertsTM have ensured your sheer delight∗
∗
well, at least we can pray we didn’t screw it up too badly
19. cgroupv2: Linux’s new unified control group system Chris Down (cdown@fb.com)
v2 improvements: Some reasonable configurations are now possible
In v1:
■ Some limitations could not be fixed due to backwards compatibility (eg. memory limit
types)
■ memory.{,kmem.,kmem.tcp.,memsw.,[...]}limit_in_bytes
In v2:
■ Less iterative, more designed up front
■ In the case of memory limit types, we now have universal thresholds
(memory.{high,max})
20. cgroupv2: Linux’s new unified control group system Chris Down (cdown@fb.com)
Current support
■ Controllers merged into mainline kernel
■ I/O
■ Memory
■ PID
■ Controllers still pending merge
■ CPU (thread: https://bit.ly/cgroupv2cpu)
Disagreements: Process granularity, constraints around pid placement in cgroups
■ Controllers still being worked on
■ Freezer
21. cgroupv2: Linux’s new unified control group system Chris Down (cdown@fb.com)
Chris, is Facebook really using cgroupv2 in production?
Yes! Really!
■ Currently rolled out to a non-negligible percentage of web servers
■ Running managed with systemd (see Davide Cavalca’s “Deploying systemd at scale” talk
at systemd.conf 2016)
■ Already getting better results on spiky workloads
■ Better separation of workload services from system services (eg. service routing,
metric collection)
22. cgroupv2: Linux’s new unified control group system Chris Down (cdown@fb.com)
How can I get it?
cgroupv2 is stable since Linux 4.5. Here are some useful kernel commandline flags:
■ systemd.unified_cgroup_hierarchy=1
■ systemd will mount /sys/fs/cgroup as cgroupv2
■ Available from systemd v226 onwards
■ cgroup_no_v1=all
■ The kernel will disable all v1 cgroup controllers
■ Available from Linux 4.6 onwards
Mounting (if your init system doesn’t do it for you): % mount -t cgroup2 none /sys/fs/cgroup
23. cgroupv2: Linux’s new unified control group system Chris Down (cdown@fb.com)
People at Facebook working on cgroupv2
■ FB kernel team working on upstreaming improvements and new features
■ FB operating systems team and Web Foundation working on OS integration &
production testing
■ Tupperware (scheduler & containerisation) team working on rollout to all container
users
24. cgroupv2: Linux’s new unified control group system Chris Down (cdown@fb.com)
Still have questions?
■ cgroupv2 has great user-facing docs: https://bit.ly/cgroupv2
■ I’ll be around for questions over pizza 😃