cgroupv2: Linux’s new unified control group system
Chris Down (cdown@fb.com)
Production Engineer, Web Foundation
About me
■ Chris Down
■ Production Engineer at Facebook
■ Working in Web Foundation
■ Working on cgroupv2 ↔ OS integration, especially inside Facebook
■ Dealing with web server and backend service reliability
■ Managing >100,000 machines at scale
■ Building tools to manage and maintain Facebook’s fleet of machines
■ Debugging production incidents, with a focus on Linux internals
In this talk
■ A short intro to control groups and where/why they are used
■ cgroup(v1), what went well, what didn’t
■ Why a new major version/API break was needed
■ Fundamental design decisions in cgroupv2
■ New features and improvements in cgroupv2
■ State of cgroupv2, what’s ready, what’s not
What are cgroups?
■ cgroup == control group
■ System for resource management on Linux
■ Provides directory hierarchy as main point of interaction (at /sys/fs/cgroup)
■ Limit, throttle, manage, and account for resource usage per control group
■ Each resource interface is provided by a controller
Practical uses
■ Isolating core workload from background resource needs
■ Web server vs. system processes (eg. Chef, metric collection, etc)
■ Time critical work vs. long-term asynchronous jobs
■ Ensuring machines with multiple workloads (e.g. containers) do not allow one workload to overpower the others
How did this work in cgroupv1?
cgroupv1 has a hierarchy per-resource, for example:
% ls /sys/fs/cgroup
cpu/ cpuacct/ cpuset/ devices/ freezer/
memory/ net_cls/ pids/
Each resource hierarchy contains cgroups for this resource:
% find /sys/fs/cgroup/pids -type d
/sys/fs/cgroup/pids/background.slice
/sys/fs/cgroup/pids/background.slice/async.slice
/sys/fs/cgroup/pids/workload.slice
How did this work in cgroupv1?
■ Separate hierarchy/cgroups for each resource
■ Even if they have the same name, cgroups for each resource are distinct
■ cgroups can be nested inside each other
/sys/fs/cgroup
    resource A
        cgroup 1
        cgroup 2
        ...
    resource B
        cgroup 3
        cgroup 4
        ...
    resource C
        cgroup 5
        cgroup 6
        ...
How did this work in cgroupv1?
■ Limits and accounting are performed per-cgroup
■ For example, you might set memory.limit_in_bytes in cgroup 3, if resource B is “memory”
/sys/fs/cgroup
    resource A
        cgroup 1: pid 1, pid 2
    resource B
        cgroup 3: pid 3, pid 4
    resource C
        cgroup 5: pid 2, pid 3
How did this work in cgroupv1?
■ One PID is in exactly one cgroup per resource
■ For example, PID 2 is in separate cgroups for resource A and C, but in the root cgroup for resource B since it’s not explicitly assigned
/sys/fs/cgroup
    resource A
        cgroup 1: pid 1, pid 2
    resource B
        cgroup 3: pid 3, pid 4
    resource C
        cgroup 5: pid 2, pid 3
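The per-resource membership described above can be read back from /proc/<pid>/cgroup, where each line has the form hierarchy-ID:controller-list:cgroup-path. The following is a minimal Python sketch of turning that file into a controller-to-cgroup mapping. The function name and the sample text are illustrative assumptions, not output captured from a real machine.

```python
def parse_proc_cgroup(text):
    """Map each controller name to the cgroup path a process is in.

    Expects the contents of /proc/<pid>/cgroup, one
    'hierarchy-ID:controller-list:cgroup-path' entry per line.
    """
    membership = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        _hierarchy_id, controllers, path = line.split(":", 2)
        # A cgroupv2 entry has an empty controller list ("0::/path").
        names = controllers.split(",") if controllers else ["(unified)"]
        for name in names:
            membership[name] = path
    return membership


# Hypothetical sample: like PID 2 on the slide, this process sits in named
# cgroups for two resources and in the root cgroup ("/") for memory.
SAMPLE = """\
4:pids:/workload.slice
3:memory:/
2:cpu,cpuacct:/background.slice
"""

if __name__ == "__main__":
    for controller, path in sorted(parse_proc_cgroup(SAMPLE).items()):
        print(controller, path)
```

On a cgroupv2-only system the same file collapses to a single "0::/path" line, which the sketch reports under the placeholder key "(unified)".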
Hierarchy in cgroupv1
/sys/fs/cgroup
    memory/
        service 1
            memory.limit_in_bytes
    cpu/
        service 1
            cpu.shares
    blkio/
        service 2
            blkio.throttle.read_iops_device
How does this work in cgroupv2?
cgroupv2 has a unified hierarchy, for example:
% ls /sys/fs/cgroup
background.slice/ workload.slice/
Each cgroup can support multiple resource domains:
% ls /sys/fs/cgroup/background.slice
async.slice/ foo.mount/ cgroup.subtree_control
memory.high memory.max pids.current pids.max
How does this work in cgroupv2?
■ cgroups are “global” now, not limited to one resource
■ Resources are now opt-in for cgroups
/sys/fs/cgroup
    cgroup 1
        cgroup 2
        ...
    cgroup 3
        cgroup 4
        ...
    cgroup 5
        cgroup 6
        ...
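Opting in to a resource is done by writing controller names to a cgroup's cgroup.subtree_control file, prefixed with "+" to enable or "-" to disable; this enables the controllers for that cgroup's children. The Python sketch below builds and writes such a request. The helper names are hypothetical, and a temporary directory stands in for a real cgroup so that the example runs without root.

```python
import os
import tempfile


def subtree_control_request(enable=(), disable=()):
    """Build the string written to cgroup.subtree_control.

    Controllers are prefixed with '+' to enable and '-' to disable.
    """
    parts = ["+" + name for name in enable] + ["-" + name for name in disable]
    return " ".join(parts)


def set_subtree_control(cgroup_dir, enable=(), disable=()):
    """Write the request into <cgroup_dir>/cgroup.subtree_control."""
    path = os.path.join(cgroup_dir, "cgroup.subtree_control")
    with open(path, "w") as f:
        f.write(subtree_control_request(enable, disable))
    return path


if __name__ == "__main__":
    # Assumption: a temporary directory plays the role of a cgroup such as
    # /sys/fs/cgroup/background.slice, so nothing on the host is modified.
    with tempfile.TemporaryDirectory() as fake_cgroup:
        path = set_subtree_control(fake_cgroup, enable=["memory", "pids"])
        with open(path) as f:
            print(f.read())
```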
Hierarchy in cgroupv2
/sys/fs/cgroup
    service 1
        memory.high ∗
        cpu.weight
        cgroup.subtree_control: memory cpu
    service 2
        io.max (riops key)
        cgroup.subtree_control: io
∗ memory.high vs. memory.max will be discussed later
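The "riops key" mentioned for io.max refers to that file's nested keyed format: each line names a device as MAJ:MIN followed by key=value pairs (rbps, wbps, riops, wiops), with "max" meaning no limit. A minimal Python sketch of formatting and parsing such a line follows. The function names and the device number 8:0 are assumptions chosen for illustration.

```python
IO_MAX_KEYS = ("rbps", "wbps", "riops", "wiops")


def format_io_max(major, minor, **limits):
    """Build one io.max line such as '8:0 riops=120'.

    A limit of None is written as 'max', meaning no limit.
    """
    unknown = set(limits) - set(IO_MAX_KEYS)
    if unknown:
        raise ValueError("unknown io.max keys: %s" % sorted(unknown))
    fields = []
    for key in IO_MAX_KEYS:
        if key in limits:
            value = limits[key]
            fields.append("%s=%s" % (key, "max" if value is None else value))
    return "%d:%d %s" % (major, minor, " ".join(fields))


def parse_io_max(line):
    """Split an io.max line into a device string and a dict of limits."""
    device, *pairs = line.split()
    limits = {}
    for pair in pairs:
        key, value = pair.split("=", 1)
        limits[key] = None if value == "max" else int(value)
    return device, limits


if __name__ == "__main__":
    # Hypothetical device 8:0, limited to 120 read IOPS.
    print(format_io_max(8, 0, riops=120))
```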
Fundamental differences between v1 and v2
■ Unified hierarchy — resources apply to cgroups now
■ Granularity at PID, not TID level
■ Focus on simplicity/clarity over ultimate flexibility
v2 improvements: Tracking of non-immediate charges
Some types of resources are not immediately chargeable (e.g. page cache writeback, network packets going in/out)
In v1:
■ These resources cannot be tied to a cgroup
■ They are charged to the root cgroup for each resource type, so are essentially unlimited
In v2:
■ Resources spent in page cache writeback and network are charged to the correct cgroup
■ They can be considered as part of cgroup limits
v2 improvements: Communication with backing subsystems
In v1:
■ Most actions for non-share-based resources reacted violently to hitting thresholds
■ For example, in the memory cgroup, the only options were to OOM kill or freeze
In v2:
■ Many cgroup controllers inform subsystems before problems occur
■ Subsystems can take remedial action to avoid violent action (e.g. forced direct reclaim with memory.high, window size manipulation on reclaim failure)
■ Much easier to deal with temporary spikes in a resource’s usage
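One way to observe this gentler behaviour is the cgroupv2 memory.events file, a flat "key value" list whose "high" counter records how often the cgroup was throttled into direct reclaim for exceeding memory.high. The Python sketch below parses that format and compares two readings. The sample contents and function names are made up for illustration.

```python
def parse_flat_keyed(text):
    """Parse a flat keyed cgroup file ('key value' per line) into a dict."""
    counters = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            counters[key] = int(value)
    return counters


def high_events_between(previous, current):
    """Return how many memory.high breaches occurred between two readings."""
    return current.get("high", 0) - previous.get("high", 0)


# Hypothetical readings of memory.events taken some time apart.
BEFORE = "low 0\nhigh 12\nmax 0\noom 0\n"
AFTER = "low 0\nhigh 15\nmax 0\noom 0\n"

if __name__ == "__main__":
    delta = high_events_between(parse_flat_keyed(BEFORE), parse_flat_keyed(AFTER))
    print("memory.high breaches since last reading:", delta)
```

A rising "high" count with "oom" still at zero is the pattern the slide describes: the workload was slowed down and reclaimed from, rather than killed.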
v2 improvements: Consistency between controllers
In v1:
■ Controllers often have inconsistent interfaces
■ Some controller hierarchies inherit values from parents, some don’t
■ Controllers that have similar limiting methods (eg. io/cpu) have inconsistent APIs
In v2:
■ Our crack team of API Design Experts™ have ensured your sheer delight∗
∗ well, at least we can pray we didn’t screw it up too badly
v2 improvements: Some reasonable configurations are now possible
In v1:
■ Some limitations could not be fixed due to backwards compatibility (e.g. memory limit types)
■ memory.{,kmem.,kmem.tcp.,memsw.,[...]}limit_in_bytes
In v2:
■ Less iterative, more designed up front
■ In the case of memory limit types, we now have universal thresholds (memory.{high,max})
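As a rough illustration of the two universal thresholds, the Python sketch below writes memory.high and memory.max for a cgroup, using the literal string "max" when no limit is wanted. The helper names and the byte values are assumptions, and a temporary directory stands in for a real cgroup directory so the example runs unprivileged.

```python
import os
import tempfile


def format_limit(limit_bytes):
    """Render a byte limit for a cgroupv2 file; None means unlimited."""
    return "max" if limit_bytes is None else str(int(limit_bytes))


def set_memory_thresholds(cgroup_dir, high=None, maximum=None):
    """Write memory.high and memory.max under cgroup_dir.

    Returns the values written, keyed by file name.
    """
    written = {}
    for filename, limit in (("memory.high", high), ("memory.max", maximum)):
        value = format_limit(limit)
        with open(os.path.join(cgroup_dir, filename), "w") as f:
            f.write(value)
        written[filename] = value
    return written


if __name__ == "__main__":
    # Assumption: the temporary directory plays the role of a cgroup such as
    # /sys/fs/cgroup/workload.slice; the sizes are arbitrary examples.
    with tempfile.TemporaryDirectory() as fake_cgroup:
        print(set_memory_thresholds(fake_cgroup, high=8 * 1024**3))
```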
Current support
■ Controllers merged into mainline kernel
■ I/O
■ Memory
■ PID
■ Controllers still pending merge
■ CPU (thread: https://bit.ly/cgroupv2cpu)
Disagreements: Process granularity, constraints around pid placement in cgroups
■ Controllers still being worked on
■ Freezer
Chris, is Facebook really using cgroupv2 in production?
Yes! Really!
■ Currently rolled out to a non-negligible percentage of web servers
■ Managed by systemd (see Davide Cavalca’s “Deploying systemd at scale” talk at systemd.conf 2016)
■ Already getting better results on spiky workloads
■ Better separation of workload services from system services (e.g. service routing, metric collection)
How can I get it?
cgroupv2 has been stable since Linux 4.5. Here are some useful kernel command-line flags:
■ systemd.unified_cgroup_hierarchy=1
■ systemd will mount /sys/fs/cgroup as cgroupv2
■ Available from systemd v226 onwards
■ cgroup_no_v1=all
■ The kernel will disable all v1 cgroup controllers
■ Available from Linux 4.6 onwards
Mounting (if your init system doesn’t do it for you):
% mount -t cgroup2 none /sys/fs/cgroup
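To check whether a machine was booted with these flags and ended up on the unified hierarchy, something like the following Python sketch could be used. It parses a kernel command line (as found in /proc/cmdline) and applies a simple heuristic: a cgroupv2 mount exposes a cgroup.controllers file at its root. The function names and the sample command line are hypothetical.

```python
import os


def parse_cmdline(cmdline):
    """Split a kernel command line into a {key: value-or-None} dict."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def cgroup_boot_flags(cmdline):
    """Report whether the two flags from the slide are set."""
    params = parse_cmdline(cmdline)
    return {
        "unified_hierarchy": params.get("systemd.unified_cgroup_hierarchy") == "1",
        "v1_disabled": params.get("cgroup_no_v1") == "all",
    }


def looks_like_cgroup2(mountpoint="/sys/fs/cgroup"):
    """Heuristic: the cgroupv2 root contains a cgroup.controllers file."""
    return os.path.exists(os.path.join(mountpoint, "cgroup.controllers"))


# Hypothetical command line; on a real machine read /proc/cmdline instead.
SAMPLE_CMDLINE = "root=/dev/sda1 ro systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all"

if __name__ == "__main__":
    print(cgroup_boot_flags(SAMPLE_CMDLINE))
    print("cgroupv2 mounted at /sys/fs/cgroup:", looks_like_cgroup2())
```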
People at Facebook working on cgroupv2
■ FB kernel team working on upstreaming improvements and new features
■ FB operating systems team and Web Foundation working on OS integration & production testing
■ Tupperware (scheduler & containerisation) team working on rollout to all container users
Still have questions?
■ cgroupv2 has great user-facing docs: https://bit.ly/cgroupv2
■ I’ll be around for questions over pizza 😃
DOXLON November 2016: Facebook Engineering on cgroupv2

DOXLON November 2016: Facebook Engineering on cgroupv2

  • 1. cgroupv2: Linux’s new unified control group system Chris Down (cdown@fb.com) Production Engineer, Web Foundation
  • 2. cgroupv2: Linux’s new unified control group system Chris Down (cdown@fb.com) About me ■ Chris Down ■ Production Engineer at Facebook ■ Working in Web Foundation ■ Working on cgroupv2 ↔ OS integration, especially inside Facebook ■ Dealing with web server and backend service reliability ■ Managing >100,000 machines at scale ■ Building tools to manage and maintain Facebook’s fleet of machines ■ Debugging production incidents, with a focus on Linux internals
  • 3. cgroupv2: Linux’s new unified control group system Chris Down (cdown@fb.com) In this talk ■ A short intro to control groups and where/why they are used ■ cgroup(v1), what went well, what didn’t ■ Why a new major version/API break was needed ■ Fundamental design decisions in cgroupv2 ■ New features and improvements in cgroupv2 ■ State of cgroupv2, what’s ready, what’s not
  • 4. What are cgroups?
      ■ cgroup == control group
      ■ System for resource management on Linux
      ■ Provides directory hierarchy as main point of interaction (at /sys/fs/cgroup)
      ■ Limit, throttle, manage, and account for resource usage per control group
      ■ Each resource interface is provided by a controller
  • 5. Practical uses
      ■ Isolating core workload from background resource needs
      ■ Web server vs. system processes (e.g. Chef, metric collection, etc.)
      ■ Time-critical work vs. long-term asynchronous jobs
      ■ Ensuring machines with multiple workloads (e.g. containers) do not allow one workload to overpower the others
  • 6. How did this work in cgroupv1?
      cgroupv1 has a hierarchy per-resource, for example:
          % ls /sys/fs/cgroup
          cpu/ cpuacct/ cpuset/ devices/ freezer/
          memory/ net_cls/ pids/
      Each resource hierarchy contains cgroups for this resource:
          % find /sys/fs/cgroup/pids -type d
          /sys/fs/cgroup/pids/background.slice
          /sys/fs/cgroup/pids/background.slice/async.slice
          /sys/fs/cgroup/pids/workload.slice
  • 7. How did this work in cgroupv1?
      ■ Separate hierarchy/cgroups for each resource
      ■ Even if they have the same name, cgroups for each resource are distinct
      ■ cgroups can be nested inside each other
      /sys/fs/cgroup
          resource A
              cgroup 1
              cgroup 2
              ...
          resource B
              cgroup 3
              cgroup 4
              ...
          resource C
              cgroup 5
              cgroup 6
              ...
  • 8. How did this work in cgroupv1?
      ■ Limits and accounting are performed per-cgroup
      ■ For example, you might set memory.limit_in_bytes in cgroup 3, if resource B is “memory”.
      /sys/fs/cgroup
          resource A
              cgroup 1
                  pid 1
                  pid 2
          resource B
              cgroup 3
                  pid 3
                  pid 4
          resource C
              cgroup 5
                  pid 2
                  pid 3
  • 9. How did this work in cgroupv1?
      ■ One PID is in exactly one cgroup per resource
      ■ For example, PID 2 is in separate cgroups for resource A and C, but in the root cgroup for resource B since it’s not explicitly assigned
      /sys/fs/cgroup
          resource A
              cgroup 1
                  pid 1
                  pid 2
          resource B
              cgroup 3
                  pid 3
                  pid 4
          resource C
              cgroup 5
                  pid 2
                  pid 3
  • 10. Hierarchy in cgroupv1
      /sys/fs/cgroup
          memory/
              service 1
                  memory.limit_in_bytes
          cpu/
              service 1
                  cpu.shares
          blkio/
              service 2
                  blkio.throttle.read_iops_device
  • 11. How does this work in cgroupv2?
      cgroupv2 has a unified hierarchy, for example:
          % ls /sys/fs/cgroup
          background.slice/ workload.slice/
      Each cgroup can support multiple resource domains:
          % ls /sys/fs/cgroup/background.slice
          async.slice/ foo.mount/ cgroup.subtree_control
          memory.high memory.max pids.current pids.max
  • 12. How does this work in cgroupv2?
      ■ cgroups are “global” now, not limited to one resource
      ■ Resources are now opt-in for cgroups
      /sys/fs/cgroup
          cgroup 1
              cgroup 2
              ...
          cgroup 3
              cgroup 4
              ...
          cgroup 5
              cgroup 6
              ...
  • 13. Hierarchy in cgroupv1
      /sys/fs/cgroup
          memory/
              service 1
                  memory.limit_in_bytes
          cpu/
              service 1
                  cpu.shares
          blkio/
              service 2
                  blkio.throttle.read_iops_device
  • 14. Hierarchy in cgroupv2
      /sys/fs/cgroup
          service 1
              memory.high ∗
              cpu.weight
              cgroup.subtree_control
                  memory cpu
          service 2
              io.max (riops key)
              cgroup.subtree_control
                  io
      ∗ will discuss this vs. memory.max later
  • 15. Fundamental differences between v1 and v2
      ■ Unified hierarchy: resources apply to cgroups now
      ■ Granularity at PID, not TID level
      ■ Focus on simplicity/clarity over ultimate flexibility
  • 16. v2 improvements: Tracking of non-immediate charges
      Some types of resources are not immediately chargeable (e.g. page cache writeback, network packets going in/out)
      In v1:
      ■ These resources cannot be tied to a cgroup
      ■ They are charged to the root cgroup for each resource type, so are essentially unlimited
      In v2:
      ■ Resources spent in page cache writebacks and network are charged to the correct cgroup
      ■ Can be considered as part of cgroup limits
  • 17. v2 improvements: Communication with backing subsystems
      In v1:
      ■ Most actions for non-share based resources reacted violently to hitting thresholds
      ■ For example, in the memory cgroup, the only option was to OOM kill or freeze
      In v2:
      ■ Many cgroup controllers inform subsystems before problems occur
      ■ Subsystems can take remedial action to avoid violent action (e.g. forced direct reclaim with memory.high, window size manipulation on reclaim failure)
      ■ Much easier to deal with temporary spikes in a resource’s usage
  • 18. v2 improvements: Consistency between controllers
      In v1:
      ■ Controllers often have inconsistent interfaces
      ■ Some controller hierarchies inherit values from parents, some don’t
      ■ Controllers that have similar limiting methods (e.g. io/cpu) have inconsistent APIs
      In v2:
      ■ Our crack team of API Design Experts™ have ensured your sheer delight∗
      ∗ well, at least we can pray we didn’t screw it up too badly
  • 19. v2 improvements: Some reasonable configurations are now possible
      In v1:
      ■ Some limitations could not be fixed due to backwards compatibility (e.g. memory limit types)
      ■ memory.{,kmem.,kmem.tcp.,memsw.,[...]}limit_in_bytes
      In v2:
      ■ Less iterative, more designed up front
      ■ In the case of memory limit types, we now have universal thresholds (memory.{high,max})
  • 20. Current support
      ■ Controllers merged into mainline kernel
          ■ I/O
          ■ Memory
          ■ PID
      ■ Controllers still pending merge
          ■ CPU (thread: https://bit.ly/cgroupv2cpu)
            Disagreements: process granularity, constraints around PID placement in cgroups
      ■ Controllers still being worked on
          ■ Freezer
  • 21. Chris, is Facebook really using cgroupv2 in production?
      Yes! Really!
      ■ Currently rolled out to a non-negligible percentage of web servers
      ■ Managed with systemd (see Davide Cavalca’s “Deploying systemd at scale” talk at systemd.conf 2016)
      ■ Already getting better results on spiky workloads
      ■ Better separation of workload services from system services (e.g. service routing, metric collection)
  • 22. How can I get it?
      cgroupv2 has been stable since Linux 4.5. Here are some useful kernel command-line flags:
      ■ systemd.unified_cgroup_hierarchy=1
          ■ systemd will mount /sys/fs/cgroup as cgroupv2
          ■ Available from systemd v226 onwards
      ■ cgroup_no_v1=all
          ■ The kernel will disable all v1 cgroup controllers
          ■ Available from Linux 4.6 onwards
      Mounting (if your init system doesn’t do it for you):
          % mount -t cgroup2 none /sys/fs/cgroup
  • 23. People at Facebook working on cgroupv2
      ■ FB kernel team working on upstreaming improvements and new features
      ■ FB operating systems team and Web Foundation working on OS integration & production testing
      ■ Tupperware (scheduler & containerisation) team working on rollout to all container users
  • 24. Still have questions?
      ■ cgroupv2 has great user-facing docs: https://bit.ly/cgroupv2
      ■ I’ll be around for questions over pizza 😃
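
To make the unified hierarchy from slides 11, 14 and 22 more concrete, here is a minimal shell sketch. It checks whether a path is a cgroupv2 mount, opts child cgroups into controllers through cgroup.subtree_control, and sets memory.high on a child cgroup. The function names and the CGROOT variable are hypothetical helpers invented for illustration and do not come from the talk. The mount check assumes GNU stat, which reports the filesystem type as "cgroup2fs". Running this against the real /sys/fs/cgroup needs root and a cgroupv2 mount.

```shell
#!/bin/sh
# Hypothetical helpers: names and defaults are illustrative assumptions,
# not part of the talk or of any existing tool.
CGROOT="${CGROOT:-/sys/fs/cgroup}"

# Succeeds if $1 lives on a cgroupv2 filesystem.
# Assumes GNU stat, which prints "cgroup2fs" as the type for cgroupv2.
is_cgroup2() {
    [ "$(stat -fc %T "$1" 2>/dev/null)" = "cgroup2fs" ]
}

# Enable the controllers named in $2... for the children of cgroup $1.
# Controllers are opt-in: "+name" tokens go into cgroup.subtree_control.
enable_controllers() {
    cg="$1"; shift
    tokens=""
    for c in "$@"; do
        tokens="${tokens:+$tokens }+$c"
    done
    printf '%s\n' "$tokens" > "$cg/cgroup.subtree_control"
}

# Create child cgroup $2 under $1 and set its memory.high threshold to $3.
set_memory_high() {
    mkdir -p "$1/$2" && printf '%s\n' "$3" > "$1/$2/memory.high"
}
```

A possible invocation, with the cgroup name borrowed from slide 11 and an arbitrary 1 GiB threshold, would be: `is_cgroup2 "$CGROOT" && enable_controllers "$CGROOT" memory pids && set_memory_high "$CGROOT" workload.slice 1073741824`. Only memory and pids are used here because slide 20 lists the CPU controller as not yet merged at the time of the talk.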