2. #ContainerDayFRParis Container Day 2017
Ori Pekelman
GeekPush at Platform.sh
I am @OriPekelman everywhere (GitHub/Twitter/LinkedIn)
Co-Founder & VP of Marketing for Platform.sh, an innovative
second generation PaaS.
My role usually spans beyond the technological aspects to the
business strategy, process design and product marketing.
There is no container
3. #ContainerDayFRParis Container Day 2017
We are at Paris Container Day, so I can reasonably assume most
people here have an understanding of the underpinnings of
“containers”. But let’s have a show of hands to see how much
time we are going to spend on each slide.
Group A
I don’t know much about
containers. It sounds
interesting. I came here
to learn.
4. #ContainerDayFRParis Container Day 2017
Group B
I use Docker. In
production. It works and I
never had to care about
how it is implemented.
5. #ContainerDayFRParis Container Day 2017
Group C
I implement my own
container stuff. I have
Kernel-Fu. I know how
this stuff is built.
6. #ContainerDayFRParis Container Day 2017
1. This is meant as an entry-level talk, but I will still discuss some nuts and bolts. So
when I am unclear, interrupt me. I don’t mind.
2. I am rusty. They make me do marketing these days. So when I am wrong,
interrupt me. I don’t mind.
3. Even more so as we have the incredible honor of having people like Jessie
Frazelle with us, people who participated in building many of the nuts and some
of the bolts.
So, please, Jessie and you other experts, forgive the depths of my ignorance and any
and all lies and errors I am about to spout.
7. #ContainerDayFRParis Container Day 2017
What do containers solve? Why do we need containers?
Containers allow us to package complex software in a reusable format that is
easy to deploy, making automation easier.
Sometimes they make updating software easier (with stateless systems… just
build a new one, kill the old).
They have lower overhead in terms of memory usage than VMs, so they are less
expensive, and we can have more of them.
They allow us to reason about the systems we run at a coarser granularity. AKA
abstraction. In Greek, atom means “that which cannot be divided”. The container
is our atom.
15. #ContainerDayFRParis Container Day 2017
The boxes have a common, simple interface, that is not
influenced by their content
16. #ContainerDayFRParis Container Day 2017
From the outside we don’t care what is inside. There are no
dependencies on the exterior world.
18. #ContainerDayFRParis Container Day 2017
We can move containers. Install them. Run them. Without ever
knowing what was inside.
$ docker pull complex_piece_of_software:latest
$ docker run complex_piece_of_software:latest
19. #ContainerDayFRParis Container Day 2017
The “nuts and bolts” truth of the matter is probably the inverse.
The container does not create opacity from the outside in.
23. #ContainerDayFRParis Container Day 2017
From the outside, the kernel, UID 0, they see all. For them, there
is no container.
24. #ContainerDayFRParis Container Day 2017
It is from the “containerized” process’s point of view that the
world changes. It becomes smaller.
25. #ContainerDayFRParis Container Day 2017
When we create a container what happens is that using a bunch
of different Kernel features and modules (cgroups, namespaces,
seccomp...) we:
29. #ContainerDayFRParis Container Day 2017
And we limit what functionality the process can invoke from
the Kernel (seccomp... and more…)
31. #ContainerDayFRParis Container Day 2017
So.. let’s create a container.
There is an operating system. In our case Linux. It abstracts away the hardware.
No software on a normal computer runs “outside” of the operating system. Yup.
Even assembly / machine code. You can’t access the processor, memory or
hardware without going through it. What you run on Linux are ELF binaries.
Nothing else.
Your program interacts with its operating system through system calls. It can ask
for memory, access to stuff (like the network or the disk), or it can ask the operating
system to run some other processes. A bunch of fun stuff.
32. #ContainerDayFRParis Container Day 2017
So.. let’s create a container.
Interactions with the OS pass through system calls, but sometimes the OS gets fancy and
offers higher-level constructs to make things easy (like a pseudo-file-system). Most
often we will use libraries and full-blown integrated apps to take care of talking to
the OS. More on that later.
In Linux, processes are organized in a tree. Each process has an ID and a parent.
Everything starts with 0, which is the scheduler, and 1, which is init. Everything else
gets invoked from those on down.
In Linux we have three different calls to start a process: exec(), which we don’t really
care about here; fork(), which copies the current process with a new PID; and clone(),
which copies all or some of the current process and runs the new process as a child.
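A minimal Python sketch of the parent/child relationship, assuming a Linux (or other POSIX) machine. os.fork() is Python's wrapper around fork(); the function name fork_and_report is made up for this illustration.

```python
import os


def fork_and_report():
    """Fork once; return (parent_pid, child_pid, ppid_seen_by_child)."""
    read_fd, write_fd = os.pipe()
    parent_pid = os.getpid()
    child_pid = os.fork()  # fork(): duplicate the current process
    if child_pid == 0:
        # Child: same code, new PID; its parent is the process that forked.
        os.close(read_fd)
        os.write(write_fd, str(os.getppid()).encode())
        os._exit(0)
    # Parent: read what the child saw, then reap it.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        ppid_seen_by_child = int(pipe.read())
    os.waitpid(child_pid, 0)
    return parent_pid, child_pid, ppid_seen_by_child


if __name__ == "__main__":
    parent, child, seen = fork_and_report()
    print(f"parent={parent} child={child} child's parent={seen}")
```

The child reports its parent PID through a pipe, so the parent can check that the tree really links the two.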
33. #ContainerDayFRParis Container Day 2017
So.. let’s create a container.
So, how do we make the world seem smaller to a process?
When creating our process we can pass a couple of parameters to clone() that
tell the operating system how the process is going to live.
A bunch of these parameters (or flags) are called
CLONE_NEW[...SOMETHING...]. Some of these parameters, not all, can be
modified later on using the unshare() system call.
34. #ContainerDayFRParis Container Day 2017
So.. let’s create a container.
For example the parameter CLONE_NEWUTS tells the operating system that:
1. Our newly created process can call sethostname() and, instead of changing
the hostname for the whole OS, the kernel keeps a separate record of the
hostname just for that namespace.
2. So when the process later calls gethostname(), it gets back whatever was set
through this namespace’s sethostname().
So unlike all of its cousins and parents, this process thinks the name of the machine it
is running on is different.
We tricked it! (remember the part about lying?)
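The hostname trick can be sketched in Python, under a few assumptions: a Linux kernel that allows unprivileged user namespaces, and a sandbox that does not block unshare(). The sketch calls unshare() through ctypes inside a forked child, and adds CLONE_NEWUSER so that no root is needed. The function name and the hostname "not-a-container" are made up for this illustration; when the kernel refuses, the function simply returns None.

```python
import ctypes
import os
import socket

CLONE_NEWUTS = 0x04000000   # new UTS (hostname) namespace
CLONE_NEWUSER = 0x10000000  # new user namespace, so no root is needed


def hostname_in_new_uts_namespace(new_name):
    """Return the hostname a child sees after renaming itself inside a
    fresh UTS namespace, or None if the kernel or sandbox refuses."""
    libc = ctypes.CDLL(None, use_errno=True)
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(read_fd)
        seen = b""
        try:
            if libc.unshare(CLONE_NEWUSER | CLONE_NEWUTS) == 0:
                socket.sethostname(new_name)          # only affects our namespace
                seen = socket.gethostname().encode()  # reads it back
        except OSError:
            pass  # not permitted: report nothing
        finally:
            os.write(write_fd, seen)
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        seen = pipe.read().decode()
    os.waitpid(pid, 0)
    return seen or None


if __name__ == "__main__":
    print("host sees: ", socket.gethostname())
    print("child sees:", hostname_in_new_uts_namespace("not-a-container"))
    print("host still:", socket.gethostname())
```

The parent's hostname is untouched either way: only the child lives in the new namespace.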
35. #ContainerDayFRParis Container Day 2017
Setting up namespaces
So.. we create a new process and attach a namespace to it:
either at its creation, with the flags we pass to clone(); later,
using the unshare() system call, which can change some of the
namespaced resources; or using the setns() system call, which
moves the calling process into an existing namespace.
36. #ContainerDayFRParis Container Day 2017
So.. let’s create a container.
Having a different machine name per process is cool.
But not that useful right? That is not a container.
What else can we isolate?
37. #ContainerDayFRParis Container Day 2017
Isolating the file-system
As far as containers are concerned the most important thing is the file-system. This
is done through CLONE_NEWNS.
1. First we create the new mount namespace
2. We can then unmount the stuff from the parent namespace and mount the
various things we need to mount in our target dir (we want to get to a usable
root file system).
3. Run `pivot_root` to make $TARGETDIR the new root, and voilà!
We can have different mounts and isolate parts of the file-system! As a side note,
doing stuff like mounting requires “capabilities”, in this case CAP_SYS_ADMIN. More
often than not these will have been dropped. So this is not always trivial.
38. #ContainerDayFRParis Container Day 2017
So.. let’s create a container.
We can decide what mounts are going to be shared from the “host”. We can totally
decide that /var/lib is going to be common. Nothing disallows this.
We can use some crazy layered file system (like AUFS or OverlayFS) which will
allow us to mix stuff, some coming from the underlying OS and some ‘overridden’
just for our namespace.
Now, “container runtimes” like Docker, LXC or runc are largely about preparing an
image of a file-system that can be mounted so that a process can run in it. If you
look at the OCI (Open Container Initiative), it has two specs: one for this, the
file-system image, and one for the runtime.
39. #ContainerDayFRParis Container Day 2017
Isolating Inter-Process Communications
With CLONE_NEWIPC we limit our process’s ability to send and
receive messages to processes in the same namespace.
We don’t want our nice isolated process to talk with strangers,
right?
40. #ContainerDayFRParis Container Day 2017
Isolate Process IDs!
This is why, when you run ps -aux, you only see processes in
your own namespace and its children (the PIDs won’t match
those on the host. This is complex).
Oops, I forgot to tell you, namespaces are hierarchical. Which is
triple fun. So yes, containers can run inside other containers
ad infinitum (really up to 32 levels, but, well, you know, details).
41. #ContainerDayFRParis Container Day 2017
Isolating the Network
This is how your container gets its own IP. Yay, now it is a big
boy.
(We won’t get into this.. but this is also where a lot of suffering will happen. Remember, from the Kernel
perspective this is just another interface. We will need to use either NAT, weird bridging or some creative
uses of iptables to make sense of things. And this is clearly where we see how higher-level abstractions are a
necessity.)
42. #ContainerDayFRParis Container Day 2017
Isolate User and group IDs
This is oh so important for unprivileged containers.
Yes! Linux supports doing all of this from userspace.
This basically means that the uid running inside does not exist
outside. And that your process can feel blessedly aloof.
43. #ContainerDayFRParis Container Day 2017
man namespaces
USER_NAMESPACES(7)
A process's user and group IDs can be different inside and outside a user
namespace. In particular, a process can have a normal unprivileged user ID
outside a user namespace while at the same time having a user ID of 0
inside the namespace; in other words, the process has full privileges for
operations inside the user namespace, but is unprivileged for operations
outside the namespace.
This means quantum-state rootness! You are root and unprivileged at the
same time!
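The mapping between inside and outside IDs is visible in /proc/<pid>/uid_map, where each line reads "first ID inside, first ID outside, length". Here is a small Python sketch that parses that format and translates an inside ID to an outside one. The helper names parse_id_map and outside_id are made up for this illustration, and "outside" here means the namespace of whoever reads the file.

```python
from pathlib import Path


def parse_id_map(text):
    """Parse uid_map/gid_map content into (inside, outside, count) tuples."""
    return [tuple(int(field) for field in line.split())
            for line in text.splitlines() if line.strip()]


def outside_id(id_map, inside):
    """Translate an ID seen inside the namespace to the ID seen outside it."""
    for first_inside, first_outside, count in id_map:
        if first_inside <= inside < first_inside + count:
            return first_outside + (inside - first_inside)
    return None  # this ID is not mapped


if __name__ == "__main__":
    uid_map = Path("/proc/self/uid_map")
    if uid_map.exists():
        print(parse_id_map(uid_map.read_text()))
```

With a map such as "0 1000 1", root inside the namespace is plain user 1000 outside: the quantum-state rootness from above.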
44. #ContainerDayFRParis Container Day 2017
man namespaces
USER_NAMESPACES(7)
Each process is a member of exactly one user namespace. A process
created via fork or clone without the CLONE_NEWUSER flag is a member
of the same user namespace as its parent. A single-threaded process can
join another user namespace with setns if it has the CAP_SYS_ADMIN in
that namespace; upon doing so, it gains a full set of capabilities in that
namespace.
45. #ContainerDayFRParis Container Day 2017
Almost last, but not least. Isolate resources!
This is where this ties in to the earlier mechanism we were
talking about, cgroups.
Cgroups allow us to limit the resource usage of the process
(and its children) in terms of memory, CPU usage and IO;
CLONE_NEWCGROUP gives the process its own view of the
cgroup hierarchy.
46. #ContainerDayFRParis Container Day 2017
Really last: isolate all the things and the Kernel.
This is of unholy complexity. Short story: Linux used to be
mostly all or nothing. User 0 vs. the others. Now you have
capabilities. A long list of capabilities. Which you can now go
and set per process. And you have stuff like seccomp and
seccomp-bpf to restrict which system calls a process can make.
And you can use a bunch of modules and kernel patch sets to
make everything more robust. Like SELinux. GRSecurity. Or
AppArmor.
47. #ContainerDayFRParis Container Day 2017
seccomp
Seccomp is a mechanism in the Linux kernel that allows a process to make
a one-way transition to a restricted state where it can only perform a
limited set of system calls.
If a process attempts any other system calls, it is killed via a SIGKILL signal.
In its most restrictive mode, seccomp prevents all system calls other than
read(), write(), _exit(), and sigreturn().
This would allow a program to initialize and then drop into a restricted
mode where it could only read from/write to already-opened files.
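The one-way transition can be sketched in Python, assuming Linux with seccomp enabled. A forked child enters strict mode through prctl() (called via ctypes) and then attempts a system call that is not on the allow list. The constants come from the kernel headers; the function name is made up for this illustration. If strict mode cannot be entered, the sketch reports "unsupported" instead.

```python
import ctypes
import os
import signal
import sys

PR_SET_SECCOMP = 22       # from <linux/prctl.h>
SECCOMP_MODE_STRICT = 1   # from <linux/seccomp.h>


def try_forbidden_syscall_under_strict_seccomp():
    """Return "killed" if a child in strict seccomp mode was SIGKILLed,
    or "unsupported" if strict mode could not be entered."""
    if not sys.platform.startswith("linux"):
        return "unsupported"
    libc = ctypes.CDLL(None, use_errno=True)
    pid = os.fork()
    if pid == 0:
        try:
            mode = ctypes.c_ulong(SECCOMP_MODE_STRICT)
            if libc.prctl(PR_SET_SECCOMP, mode) != 0:
                os._exit(1)  # strict mode refused: tell the parent
            os.getppid()     # not read/write/_exit/sigreturn: SIGKILL
        finally:
            os._exit(2)      # exit_group() is not on the allow list either
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGKILL:
        return "killed"
    return "unsupported"


if __name__ == "__main__":
    print(try_forbidden_syscall_under_strict_seccomp())
```

Only the child makes the transition, so the parent stays free to observe how the child died.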
48. #ContainerDayFRParis Container Day 2017
seccomp-bpf
If seccomp is a sledgehammer, seccomp-bpf is the fine-grained version that
allows specifying a filter that is applied to every system call.
49. #ContainerDayFRParis Container Day 2017
Looking under the hood
BTW, you get a nice pseudo-filesystem with which you
can interact to inspect and control these values.
Try:
sudo ls -lai /proc/8/ns/
cat /proc/800/cgroup
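The same pseudo-filesystem can be read from a program. A minimal Python sketch, assuming a Linux /proc: it lists the namespaces a process belongs to and the cgroups it is a member of. The function names are made up for this illustration, and both return an empty result when /proc is missing or access is denied.

```python
import os


def namespaces_of(pid="self"):
    """Map namespace type to its identifier, e.g. 'uts' -> 'uts:[...]'."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        for name in sorted(os.listdir(ns_dir)):
            result[name] = os.readlink(os.path.join(ns_dir, name))
    except OSError:
        pass  # no /proc, no such PID, or not permitted
    return result


def cgroup_membership(pid="self"):
    """Return the lines of /proc/<pid>/cgroup (hierarchy-id:controllers:path)."""
    try:
        with open(f"/proc/{pid}/cgroup") as handle:
            return [line.strip() for line in handle if line.strip()]
    except OSError:
        return []


if __name__ == "__main__":
    for name, ident in namespaces_of().items():
        print(f"{name:20} {ident}")
    for line in cgroup_membership():
        print(line)
```

Two processes that print the same identifier for a namespace type share that namespace; a "containerized" process will show different ones.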
50. #ContainerDayFRParis Container Day 2017
Unlike other isolation techniques (Solaris Zones, BSD Jails, VMs)
this is an emergent thing
This is not a “first class” citizen. This was not designed. Different projects assemble
different types of isolation that have different semantics from all of these elements.
● Docker is about packaging a single executable
● LXC wants to give you what feels like a virtual machine.
● FireJail is there as a sandbox to run stuff you don’t trust. GUI much.
And this is a recent thing: user namespaces appeared in Kernel release 3.8, on 18
Feb 2013.
53. #ContainerDayFRParis Container Day 2017
The signatures and semantics of cgroups, namespaces and
seccomp are different.
Everything in this world is “race-condition” prone, and much of it, because of the
mix of tooling, is complex and hard.
Creating a Linux container, or “containerization”, is using these different
mechanisms together in a coherent way so that the end result “feels” as if
the process is running in an isolated machine.
A container runtime is a packaging of the above to make it simple.
54. #ContainerDayFRParis Container Day 2017
Container runtimes try to make something that more reliably
looks like this
56. #ContainerDayFRParis Container Day 2017
Containers as an abstraction
When you think about all these low-level knobs we can control: the machine
name, the network interfaces, the file-system, the users etc… you see
something else emerging.
When we define how to “containerize” a piece of software we are extracting its
contract.
We are defining the minimal subset of resources it needs.
And what is the minimal understanding of that piece of software that the
runtime requires to reliably run it.
57. #ContainerDayFRParis Container Day 2017
The simple Docker Contract
There were other isolation techniques before Docker. But because it exposed
such a simple contract it gained the incredible traction it had.
According to Docker the contract of a piece of software was:
● A base image (a state of a file-system). Itself can be layered.
● A working directory.
● A build step (which was basically a bash script).
● A TCP port exposed to the world.
● Environment variables.
● A command to run.
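To make the contract concrete, here is a hedged Python sketch that models the six items above as a data structure and renders them with the matching Dockerfile instructions (FROM, WORKDIR, ENV, RUN, EXPOSE, CMD). The class name DockerContract and all the example values are hypothetical; this is an illustration of the contract, not how Docker itself represents it.

```python
import json
from dataclasses import dataclass, field


@dataclass
class DockerContract:
    base_image: str                                  # a state of a file-system
    workdir: str                                     # a working directory
    build_steps: list = field(default_factory=list)  # basically a bash script
    port: int = 80                                   # a TCP port exposed to the world
    env: dict = field(default_factory=dict)          # environment variables
    command: list = field(default_factory=list)      # a command to run

    def to_dockerfile(self):
        """Render the contract as Dockerfile text."""
        lines = [f"FROM {self.base_image}", f"WORKDIR {self.workdir}"]
        lines += [f"ENV {key}={value}" for key, value in self.env.items()]
        lines += [f"RUN {step}" for step in self.build_steps]
        lines.append(f"EXPOSE {self.port}")
        lines.append("CMD " + json.dumps(self.command))
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    contract = DockerContract(
        base_image="example/base:1.0",
        workdir="/app",
        build_steps=["make build"],
        port=8080,
        env={"MODE": "prod"},
        command=["./server"],
    )
    print(contract.to_dockerfile())
```

Six fields, and nothing about dependencies, data or relationships to other containers.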
58. #ContainerDayFRParis Container Day 2017
Choosing a contract
The incredible success it had shows that the Docker software and the Docker
contract were good enough. And good enough is good. Sometimes great.
At Platform.sh we run a container-based PaaS and we chose not to use
Docker.
● Partly because the nuts-and-bolts at the time didn’t fit (it was too
new/buggy for production in 2013/14). No user namespaces until two
months ago. No immutability. Weird networking.
● Partly because we thought the contract wasn’t correct for our use-case.
59. #ContainerDayFRParis Container Day 2017
Is it an efficient contract?
● The idea of mutable, layered base-images made creating the first
generation of Docker containers easy. Which explains a lot of its
popularity. So yes.
● But it is a messy thing. This is something Docker has advanced on by
allowing immutable containers. Still, the default is that the container is
mutable. And this is what the ecosystem looks like.
● Build-oriented, reproducible, semantic base-images allow for orders of
magnitude better memory utilisation through deduplication, and orders of
magnitude simpler operations. This is not something you can easily bolt on
later. There is still strong inertia here.
60. #ContainerDayFRParis Container Day 2017
For some software (most software we cared about) this contract
doesn’t really make sense. Not in the long run. Not at scale.
In order to be useful, the contract that describes software also needs to
describe:
○ How to build it
○ Everything it depends on (you can’t run Wordpress without MySQL)
○ Its initial data structures (you can’t run Wordpress without some data
in the MySQL)
○ Its basic configuration (most software needs to understand some
things about its place in the world)
61. #ContainerDayFRParis Container Day 2017
These days there are a billion and one projects that add those
capabilities
○ First, of course, the Kubernetes ecosystem.
○ But using 30 different tools strung together doesn’t scream
“abstraction” to us; it looks more like a DIY mess. And it hardly answers the
questions:
■ What is the minimal subset of resources an app needs?
■ How can we make it run, reliably?
62. #ContainerDayFRParis Container Day 2017
The obligatory XKCD 435
○ If our intuition is correct, and the minimal viable contract to run
“arbitrary” software contains these other things, if the useful level to
reason about software is the molecule, not the atom, then we need
an organic chemistry set, not a physics set.
○ It doesn’t mean physics is wrong. Or that Docker is bad software.
63. #ContainerDayFRParis Container Day 2017
What would be a perfect contract for us?
● RO / immutable base-image that is not opaque
○ A semantic representation of system-libraries (with lock files)
○ A reproducible, semantic, build system (with lock files)
○ Potentially, a build step (which can basically be a bash script).
● RW / mutable base-image (mutable state) - which is Content Addressable
● Mapping of working directories to the RW image.
● A list of exposed network protocols and their parameters
● Build time environment variables / Run time environment variables
● Relationships (some containers make no sense -- would not run without a
database) to other containers (that should be semantic themselves).
● The capability to understand change (diff as part of the model).
64. #ContainerDayFRParis Container Day 2017
Why are abstractions important?
● Because we chose a container description system that does not depend on
the containerization method, we can swap out that part later, and this is a
domain where everything moves fast. Shiny new becomes legacy in 6
months.
○ Our reproducible build system can create our base LXC systems (which we use in production),
our VMs (which we also deploy when we need higher levels of isolation) or Docker images
(which we use in our GitLab-based CI system).
● Because we went for read-only containers separated from the R/W
mounts, we have gained factors in terms of density because of the level of
memory deduplication.
65. #ContainerDayFRParis Container Day 2017
Why are abstractions important?
● Because we are describing the “minimal application” not as a single process
but as a graph.. and because we understand the protocol layer interactions …
and what writes where to disk .. we can have consistent operations over the
cluster that are fast .. and safe.
● Which also means we do not suffer from the same limitations around running
persistent services.
● It is easier to implement HA primitives when you understand who is writing to
the disk and how, who has what ports opened etc..
● When your base system is not .yaml but .yaml + git and when your .yaml
represents something that has meaning.. you can implement change with
much less friction.
66. Platform.sh can clone an
arbitrarily complex production
cluster in less than a minute.
With all of the data.
To create ephemeral staging
clusters on the fly.
Every branch gets a URL with
basically fail-proof deployments.
67. Git-driven infrastructure
With a single git push you can
deploy an arbitrarily complex
cluster (with micro-services,
message queues and the lot).
Backup means a consistent
point-in-time snapshot of the
whole shebang.
70. #ContainerDayFRParis Container Day 2017
There is no container but the cluster
● This is a bonus slide in case I didn’t run out of time, which is fun as I had
66 slides for 30 minutes.
● At the beginning of our project we used the word Cluster to describe, well,
half of the different primitives we had. But then it all became murky. So we
started calling stuff Cluster, Kluster and Claster. Which stuck for a little bit
but faded back again.
● Now cluster is back in all its glory, and a bit like with Hebrew, my
mother tongue.. well, people just seem to be able to guess the correct
meaning of cluster from the context.
● Oh we should really refresh that cluster.