A brief into to Linux Containers presented to IEEE working group P2302 (InterCloud standards and portability). This deck covers:
- Definitions and motivations for containers
- Container technology stack
- Containers vs Hypervisor VMs
- Cgroups
- Namespaces
- Pivot root vs chroot
- Linux Container image basics
- Linux Container security topics
- Overview of Linux Container tooling functionality
- Thoughts on container portability and runtime configuration
- Container tooling in the industry
- Container gaps
- Sample use cases for traditional VMs
Overall, a bulk of this deck is covered in other material I have posted here. However there are a few new slides in this deck, most notability some thoughts on container portability and runtime config.
Scaling API-first – The story of a global engineering organization
Linux Container Brief for IEEE WG P2302
1. Linux Containers (LXC)
IEEE P2302 Working Group Brief
Boden Russell (brussell@us.ibm.com /
bodenru@gmail.com)
2. Definitions
Linux Containers (LXC LinuX Containers)
– Lightweight virtualization
– Realized using features provided by a modern Linux kernel
– VMs without the hypervisor (kind of)
Containerization of
– (Linux) Operating Systems (less the kernel)
– Single or multiple applications
LXC as a technology ≠ LXC “tools”
6/27/2014 2
3. Why LXC
Provision in seconds / milliseconds
Near bare metal runtime performance
VM-like agility – it’s still “virtualization”
Flexibility
– Containerize a “system”
– Containerize “application(s)”
Lightweight
– Just enough Operating System (JeOS)
– Minimal per container penalty
Open source – free – lower TCO
Supported with OOTB modern Linux kernel
Growing in popularity
6/27/2014 3
“Linux Containers are poised as the next VM in our modern Cloud era…”
Manual VM LXC
Provision Time
Days
Minutes
Seconds / ms
linpack performance @ 45000
0
50
100
150
200
250
1
3
5
7
9
11
13
15
17
19
21
23
25
27
29
31
BM
vcpus
GFlops
Google trends - LXC Google trends - docker
4. LXC Technology Stack
6/27/2014 4
UserSpaceKernelSpace
Kernel
System Call Interface
Architecture Dependent Kernel Code
GLIBC / Pseudo FS / User Space Tools & Libs
Linux Container Tooling
Linux Container Commoditization
Orchestration & Management
Hardware
cgroups
namespaces
chroots
LSM
lxc
5. Hypervisors vs. Linux Containers
6/27/2014 5
Hardware
Operating System
Hypervisor
Virtual Machine
Operating
System
Bins / libs
App App
Virtual Machine
Operating
System
Bins / libs
App App
Hardware
Hypervisor
Virtual Machine
Operating
System
Bins / libs
App App
Virtual Machine
Operating
System
Bins / libs
App App
Hardware
Operating System
Container
Bins / libs
App App
Container
Bins / libs
App App
Type 1 Hypervisor Type 2 Hypervisor Linux Containers
Containers share the OS kernel of the host and thus are lightweight.
However, each container must have the same OS kernel.
Containers are isolated, but
share OS and, where
appropriate, libs / bins.
6. Linux cgroups
History
– Work started in 2006 by google engineers
– Merged into upstream 2.6.24 kernel due to wider spread LXC usage
– A number of features still a WIP
Functionality
– Access; which devices can be used per cgroup
– Resource limiting; memory, CPU, device accessibility, block I/O, etc.
– Prioritization; who gets more of the CPU, memory, etc.
– Accounting; resource usage per cgroup
– Control; freezing & check pointing
– Injection; packet tagging
Usage
– cgroup functionality exposed as “resource controllers” (aka “subsystems”)
– Subsystems mounted on FS
– Top-level subsystem mount is the root cgroup; all procs on host
– Directories under top-level mounts created per cgroup
– Procs put in tasks file for group assignment
– Interface via read / write pseudo files in group
6/27/2014 6
Cgroups provide throttling, prioritization, control and metrics for a group of processes.
7. Linux cgroups Subsystems
cgroups provided via kernel modules
– Not always loaded / provided by default
– Locate and load with modprobe
Some features tied to kernel version
See: https://www.kernel.org/doc/Documentation/cgroups/
6/27/2014 7
Subsystem Tunable Parameters
blkio - Weighted proportional block I/O access. Group wide or per device.
- Per device hard limits on block I/O read/write specified as bytes per second or IOPS per second.
cpu - Time period (microseconds per second) a group should have CPU access.
- Group wide upper limit on CPU time per second.
- Weighted proportional value of relative CPU time for a group.
cpuset - CPUs (cores) the group can access.
- Memory nodes the group can access and migrate ability.
- Memory hardwall, pressure, spread, etc.
devices - Define which devices and access type a group can use.
freezer - Suspend/resume group tasks.
memory - Max memory limits for the group (in bytes).
- Memory swappiness, OOM control, hierarchy, etc..
hugetlb - Limit HugeTLB size usage.
- Per cgroup HugeTLB metrics.
net_cls - Tag network packets with a class ID.
- Use tc to prioritize tagged packets.
net_prio - Weighted proportional priority on egress traffic (per interface).
8. Linux namespaces
History
– Initial kernel patches in 2.4.19
– Recent 3.8 patches for user namespace support
– A number of features still a WIP
Functionality
– Provide process level isolation of global resources
• MNT (mount points, file systems, etc.)
• PID (process)
• NET (NICs, routing, etc.)
• IPC (System V IPC resources)
• UTS (host & domain name)
• USER (UID + GID)
– Process(es) in namespace have illusion they are the only processes on the system
– Generally constructs exist to permit “connectivity” with parent namespace
Usage
– Construct namespace(s) of desired type
– Create process(es) in namespace (typically done when creating namespace)
– If necessary, initialize “connectivity” to parent namespace
– Process(es) in name space internally function as if they are only proc(s) on system
6/27/2014 8
Namespaces provide isolation of Linux resources for a group of processes.
10. Linux namespaces: Common Idioms
It’s not required to use all namespaces
– Pick & choose; if your toolset allows it
Constructs exist to permit “connectivity” between parent /
child namespace
Various linux user space tools have namespace support
Linux sys API supports flexible namespace creation
6/27/2014 10
11. Linux namespaces & cgroups: Availability
6/27/2014 11
Note: user namespace support in
upstream kernel 3.8+, but
distributions rolling out phased
support:
- Map LXC UID/GID between
container and host
- Non-root LXC creation
12. Linux chroot & pivot_root
Using pivot_root with MNT namespace addresses escaping chroot
concerns
The pivot_root target directory becomes the “new root FS”
6/27/2014 12
13. LXC Images
LXC images provide a flexible means to deliver only what you need – lightweight and minimal
footprint.
Basic constraints
– Same architecture & endian
– Linux’ish Operating System; you can run different Linux distros on same host
Image types
– System; virtualize Operating System(s) – standard distro root FS less the kernel
– Application; virtualize application(s) – only package apps + dependencies (aka JeOS – Just
enough Operating System)
Bind mount host libs / bins into LXC to share host resources
Container image init process
– Container init command provided on invocation – can be an application or a full fledged
init process
– Init script customized for image – skinny SysVinit, upstart, etc.
– Reduces overhead of lxc start-up and runtime foot print
Various tools to build images
– SuSE Kiwi
– Debootstrap
– Etc.
LXC tooling options often include numerous image templates
6/27/2014 13
14. Linux Security Modules & MAC
Linux Security Modules (LSM) – kernel modules which provide a
framework for Mandatory Access Control (MAC) security implementations
MAC vs DAC
– In MAC, admin (user or process) assigns access controls to subject / initiator
– In DAC, resource owner (user) assigns access controls to individual resources
Existing LSM implementations include: AppArmor, SELinux, GRSEC, etc.
6/27/2014 14
15. Linux Capabilities
Per process privileges which define sys call
access
Can be assigned to LXC process(es)
6/27/2014 15
16. Other Security Measures
Reduce shared FS access using RO bind mounts
Linux seccomp
– Confine system calls
Keep Linux kernel up to date
User namespaces in 3.8+ kernel; mitigates container root
escape concerns
– Launching containers as non-root user
– Mapping UID / GID into container
6/27/2014 16
17. LXC Security Triangle
Private environments (trusted code)
– App packaging / deployment / management / etc, devOps, Cloud, etc…
No additional worries about security
Public environments
– Single tenant
• Same restrictions as private envs; tenant trusted code
– Multi-tenant
6/27/2014 17
Privileges, Multitenancy, Untrusted Code
SecurityMeasures
LSM, capabilities,
seccomp, RO bind mounts,
GRSEC, etc.
LXC Security Triangle
NOTE: User namespaces will
mitigate root escape
concerns
18. So You Want To Build A Container?
6/27/2014 18
19. LXC Engine: A “Hypervisor” for Containers
6/27/2014 19
Hardware
Hypervisor
Virtual Machine
Operating
System
Bins / libs
App App
Virtual Machine
Operating
System
Bins / libs
App App
Hardware
Operating System
Container
Bins / libs
App App
Container
Bins / libs
App App
Hardware
Operating System
Container
Bins / libs
App App
Container
Bins / libs
App App
LXC Engine / Tooling
Type 1 Hypervisor Linux Containers LXC Engine / Tooling
Conceptual Mapping
VM Container
Hypervisor LXC Engine
20. Docker Container Engine
6/27/2014 20
PC Mac Virtual Machine Bare Metal
Docker
Image
Develop container image once – run anywhere the docker engine is supported.
21. Typical Container Engine / Tooling Functionality
Standard container operations to realize / manage LXCs
– Run, start, stop, delete, attach, snapshot, etc.
APIs
– CLI, Web API (RESTful or other), Automation files (e.g. dockerfile)
– Security – who can use the APIs
Images
– Format (typically RAW), required dev files, etc..
– Image repository or catalog
– Container FS layers (union FS style)
– Import / export images and containers
Metrics
– CPU, memory, block IO, etc..
Eventing
– Push events on container changes
Logs
– Inspect / manage container logs
Checkpointing
– Freeze a container’s state
6/27/2014 21
Sample concepts below; not a complete list.
22. LXC Host / Engine Portability Considerations
Linux Kernel dependencies
– Non-standard kernel modules needed by app(s) on image
– LXC related kernel features tied to specific versions of the Kernel
LSM profile settings; security
– Different profile config per LSM (e.g. AppArmor, SELinux, etc.)
Container file system / storage type & capability
– FS type
– FS capabilities such as direct I/O
External storage
– FS vols bind mounted into container
Network options
– Linux bridging, OVS, etc…
Shared host libs / bins; dependencies from host OS
6/27/2014 22
Sample concepts below; not a complete list.
Not well solidified in current open source community.
23. LXC Runtime Configuration Considerations
Linux capabilities
– Applied to container process for Linux security
– Sysctl settings
Cgroups settings
– Allowed devices, limits, UID / GID mappings, etc.
Environment variables
– Exposed to shell / procs in container
Namespaces
– Which to use for container procs
Networking
– Type, IP, gateway, hostname, etc..
Init
– Which cmd / bin to run on container init
6/27/2014 23
Example: http://tinyurl.com/nj4u3le
Sample concepts below; not a complete list.
Community consolidating around docker’s libcontainer project.
24. LXC Industry Tooling
Parallels OpenVZ Linux
VServer
Libvirt-lxc Lxc (tools) Warden lmctfy Docker
Summary Commercial
product
using
OpenVZ
under the
hood
Custom
Kernel (pre
GNU 3.13)
providing
well
seasoned
LXC support.
A set of
kernel
patches
providing
LXC. Not
based on
cgroups or
namespaces.
Libvirt support
for LXC via
cgroups and
namespaces.
Lib + set of user
spaces tools
/bindings for
LXC.
LXC
management
tooling used by
CF.
Similar to LXC,
but provides
more intent
based focus.
Commoditizatio
n of LXC adding
support for
images, build
files, etc.
Part of
upstream
Kernel?
No No Partial Yes Yes Yes Yes, but
additional
patches needed
for specific
features.
Yes
License Commercial GNU GPL v2 GNU GPL v2 GNU LGPL GNU LGPL Apache v2 Apache v2 Apache v2
APIs /
Bindings
- CLI
- API
- CLI
- C
- CLI
- C
- Python
- Java
- C#
- PHP
- Python
- Lua
- GO
- CLI
- GO
- REST
- CLI
- Python
- Other 3rd
party libs
Managem
ent plane/
Dashboard
Parallels
Cloud Server
Parallels +
others
- OpenStack
- Archipel
- Virt-
Manager
- LXC web
panel
- Lexy
- OpenStack
- Shipyard
- Docker UI
6/27/2014 24
25. LXC Orchestration & Management
Docker & libvirt-lxc in OpenStack
– Manage containers heterogeneously with traditional VMs… but not w/the level
of support & features we might like
CoreOS
– Zero-touch admin Linux distro with docker images as the unit of operation
– Centralized key/value store to coordinate distributed environment
Various other 3rd party apps
– Maestro for docker
– Shipyard for docker
– Fleet for CoreOS
– Etc.
LXC migration
– Container migration via criu
But…
– Still no great way to tie all virtual resources together with LXC – e.g. storage +
networking
• IMO; an area which needs focus for LXC to become more generally applicable
6/27/2014 25
26. LXC Gaps
There are gaps…
Lack of industry tooling / support
Live migration still a WIP
Full orchestration across resources (compute / storage / networking)
Fears of security
Not a well known technology… yet
Integration with existing virtualization and Cloud tooling
Not much / any industry standards
Missing skillset
Slower upstream support due to kernel dev process
Etc.
6/27/2014 26
27. LXC: Use Cases For Traditional VMs
There are still use cases where traditional VMs are warranted.
Virtualization of non Linux based OSs
– Windows
– AIX
– Etc.
LXC not supported on host
VM requires unique kernel setup which is not applicable to other VMs on the host
(i.e. per VM kernel config)
Etc.
6/27/2014 27