Unikernels: Rise of the Library Hypervisor
1. Unikernels: the Rise of the Library Hypervisor
Anil Madhavapeddy, @avsm
Mindy Preston, @yomimono
Martin Lucina
+the MirageOS and Docker for Mac/Win teams
Docker Inc, @docker
with contributions from IBM
Docker Distributed Systems Summit
7th October 2016, Berlin, Germany
2. Conventional hypervisors
• Run full guest operating
systems with complex
emulation needs.
• Scaffolding for device
emulation, instruction
emulation, etc.
• Hard to compose into existing
infrastructure without wrapping
a full hypervisor layer.
[Diagram: Xen Hypervisor on Hardware; Dom0 running qemu, xenstored and xenconsoled; DomU]
3. Conventional hypervisors
CVE-2016-3710: VGA emulation
missing bounds checks causes exploit.
CVE-2016-5403: unbounded virtio
memory usage causes DoS.
CVE-2016-3672: unrestricted qemu
logging causes DoS.
CVE-2015-8554: qemu-dm buffer
overrun in MSI-X causes exploit.
CVE-2015-7504: heap overflow in
pcnet emulator causes exploit.
4. How can distributed systems
use hardware protection more
flexibly and composably?
5. Recap: Unikernels
• "library operating systems"
break kernels into libraries.
• Link libraries with a boot layer,
scheduler and application.
• Portable microservices that boot
directly on hypervisors or Unix.
[Diagram labels: App on Linux on Xen on Hardware; Docker App on Linux on Hardware; unikernel Application (Configuration, Business Logic) with Libraries (HTTP, JSON, SSL, TCP/IP); targets: Xen Devices, Unix libev, Unix musl libc]
6. Recap: Unikernels
• Many benefits are lost when
deploying on existing clouds.
• Tiny binaries (200k) still require
scaffolding of a full OS to boot.
• Difficult to manage hypervisor
from inside a container as full
host privilege is needed.
7. Library Hypervisors
• Extend the "kit" model and break down hypervisor
functionality into libraries.
• Expose core functionality (CPU and memory) as library,
and other pieces (device emulation) are optional.
• Benefit: huge reduction in TCB, and better fit to
container-native infrastructure with privilege dropping.
• Drawback: no existing support in operating systems.
But let's take a closer look!
12. • Easy drag and drop installation, and
autoupdates to get latest Docker.
• Secure, sandboxed virtualisation
architecture without elevated privileges.
• Native networking support, with VPN and
network sharing compatibility.
• File sharing between container and host:
uid mapping, inotify events, etc.
Docker for Mac
Aiming for a native OSX experience
that works with existing developer
workflows.
13. • Uses the new HyperKit framework, which is in turn
based on xHyve and FreeBSD's bHyve.
• Sandbox friendly: processes largely run as non-root,
with privileges of the local user.
Virtualisation
14. [Diagram: Hypervisor.framework in the OSX kernel, over hardware virtualisation (VMX, nested paging)]
Virtualisation
15. [Diagram: Hypervisor.framework in the OSX kernel, over hardware virtualisation (VMX, nested paging); in userspace, a user process (thread/vCPU) that traps on I/O pages and manages ACPI and PCI devices]
Virtualisation
16. [Diagram: Hypervisor.framework in the OSX kernel, over hardware virtualisation (VMX, nested paging); a userspace user process hosting the Linux kernel with VirtIO IPC, VirtIO Block (QCow2) and VirtIO Net (VPNKit); Alpine Linux userspace with the latest Docker preconfigured; logs redirected to the OSX host]
Virtualisation
17. • Uses the new HyperKit framework, which is in turn
based on xHyve and FreeBSD's bHyve.
• Embeds Linux: includes a lightweight Alpine Linux
distribution optimised for fast boot and stateless
operation for containers.
Virtualisation
$ docker info
Containers: 358
Running: 13
Paused: 0
Stopped: 345
Images: 485
Server Version: 1.11.1
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge null host
Kernel Version: 4.4.9-moby
Operating System: Alpine Linux v3.3
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.858 GiB
18. HyperKit library structure
• In HyperKit, most functionality is linked as a library.
• If app doesn't need a protocol, it is not linked and
not part of the trusted computing base.
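A rough way to picture the library structure described above. This is a sketch under assumptions, not HyperKit's API: the `build_monitor` helper and the module table are hypothetical, and only the device names (VirtIO, VGA) echo labels that appear elsewhere in these slides.

```python
# Hypothetical sketch: a monitor is assembled only from the modules an app asks for.
CORE = ("cpu", "memory")  # core functionality, always linked

# Optional pieces; nothing here is linked unless requested.
AVAILABLE = {
    "virtio-net": "network device",
    "virtio-block": "block device",
    "virtio-ipc": "host IPC channel",
    "vga": "VGA emulation",
}

def build_monitor(needed):
    """Return the set of modules linked into the monitor (its trusted computing base)."""
    unknown = set(needed) - AVAILABLE.keys()
    if unknown:
        raise ValueError(f"unknown modules: {sorted(unknown)}")
    return set(CORE) | set(needed)

tcb = build_monitor(["virtio-net", "virtio-block"])
print(sorted(tcb))
```

An app that never asks for `vga` never carries the VGA emulation code, so a bug in that code cannot affect it.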
19. • Want to hide the gory details of virtualisation from
the user. The Linux VM should be "invisible".
• Not solving this leads to many user complaints:
• VPN software and corporate installations do not
like bridged virtual machines or custom routing.
Result: container traffic cannot connect to Internet.
• Services cannot be exposed on localhost or
the external interface and are instead on the Linux
VM IP address.
Result: breaks common web OAuth workflows.
Networking
24. • Challenge: Services publishing ports should be
exposed on localhost without needing VM info.
• Solution: VPNKit forwards container port requests
to an OSX service which binds them natively on its
external interface.
• Benefits:
• docker run -P on the Mac now works without
requiring any knowledge of the VM innards.
• External OAuth workflows operate with web apps.
Networking
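The forwarding idea on this slide can be sketched as a small host-side relay. This is only an illustration under assumptions: plain TCP sockets stand in for the real transport between host and VM, `forward_port` and `pipe` are hypothetical names, and error handling and socket cleanup are left out for brevity.

```python
import socket
import threading

def pipe(src, dst):
    """Copy bytes from src to dst until src closes, then half-close dst."""
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass

def forward_port(listen_port, target_host, target_port):
    """Bind listen_port on localhost and relay each connection to the target."""
    server = socket.create_server(("127.0.0.1", listen_port))

    def serve():
        while True:
            try:
                client, _ = server.accept()
            except OSError:
                return  # listener was closed
            upstream = socket.create_connection((target_host, target_port))
            # One thread per direction keeps the relay full duplex.
            threading.Thread(target=pipe, args=(client, upstream), daemon=True).start()
            threading.Thread(target=pipe, args=(upstream, client), daemon=True).start()

    threading.Thread(target=serve, daemon=True).start()
    return server
```

Because the listening socket is an ordinary host socket, a client on the host connects to localhost and needs no knowledge of where the service actually runs.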
27. • Challenge: Deal with custom VPN software on the
host that makes it difficult to bridge.
• Solution: VPNKit efficiently reconstructs container
traffic into separate TCP/IP flows and translates
them into native OSX/Windows sockets.
• Benefits:
• All network traffic is generated from normal socket
calls (e.g. gethostbyaddr) on the Mac, so it
interacts well with firewalls, VPNs, and any local
security policies.
Networking
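To make "reconstructs container traffic into separate TCP/IP flows" concrete, here is a minimal sketch that groups raw IPv4/TCP packets by their connection 4-tuple. It is not VPNKit's implementation (VPNKit builds on MirageOS libraries); `flow_key`, `group_flows` and `make_packet` are hypothetical helpers, and a real translator would go on to open one native socket per flow.

```python
import socket
import struct
from collections import defaultdict

def flow_key(packet: bytes):
    """Extract (src_ip, src_port, dst_ip, dst_port) from a raw IPv4/TCP packet."""
    version_ihl = packet[0]
    if version_ihl >> 4 != 4:
        raise ValueError("not IPv4")
    ihl = (version_ihl & 0x0F) * 4  # IPv4 header length in bytes
    if packet[9] != socket.IPPROTO_TCP:
        raise ValueError("not TCP")
    src_ip = socket.inet_ntoa(packet[12:16])
    dst_ip = socket.inet_ntoa(packet[16:20])
    src_port, dst_port = struct.unpack_from("!HH", packet, ihl)
    return (src_ip, src_port, dst_ip, dst_port)

def group_flows(packets):
    """Group packets into flows; each flow would map to one native socket."""
    flows = defaultdict(list)
    for p in packets:
        flows[flow_key(p)].append(p)
    return flows

def make_packet(src_ip, src_port, dst_ip, dst_port):
    """Build a minimal 40-byte IPv4+TCP SYN packet (checksums left at zero)."""
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 0, 0, 64,
                     socket.IPPROTO_TCP, 0,
                     socket.inet_aton(src_ip), socket.inet_aton(dst_ip))
    tcp = struct.pack("!HHLLBBHHH", src_port, dst_port, 0, 0,
                      5 << 4, 0x02, 65535, 0, 0)
    return ip + tcp

packets = [
    make_packet("10.0.0.2", 40000, "192.0.2.10", 80),
    make_packet("10.0.0.2", 40001, "192.0.2.10", 443),
    make_packet("10.0.0.2", 40000, "192.0.2.10", 80),
]
flows = group_flows(packets)
print(len(flows))
```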
28. • Native OSX application, uses HyperKit to virtualise
for domain-specific purpose ("docker run")
• Links MirageOS unikernel libraries for networking
and storage translation between OS boundaries.
• The library approach let us glue together these
components really easily.
• Docker for Mac is quite a complex distributed
system internally, but (hopefully) hidden from user.
Docker for Mac + unikernels
29. MirageOS 3 + Solo5
• Unikernels have been gathering pace; next
challenge is to make them easily deployable.
• Build handled via Docker, but docker run
shouldn't need privileges (e.g. to start a VM).
• MirageOS 3 has a new library hypervisor for
Linux, developed by IBM, Docker and
Cambridge University contributors.
mirage.io
30. MirageOS 3 + Solo5
• Source: https://github.com/Solo5/solo5
• Runs as a Unix process and opens /dev/kvm for
hardware isolation.
• ukvm is a small, modular monitor that links only what is
needed. Can be 10k in size!
• Can run privilege separated: one process opens /dev/kvm,
drops privileges and executes the unikernel.
• Boot times are the same as process fork times, since all
the device setup is handled in-process.
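The privilege-separation bullet follows a familiar Unix pattern: acquire the privileged resource first, then give up privileges before running anything untrusted. A minimal sketch of that pattern, assuming a hypothetical `open_then_drop` helper (this is not ukvm's code, and the step that actually runs the unikernel is omitted):

```python
import os

def open_then_drop(device_path, uid, gid):
    """Open a device while privileged, then drop to an unprivileged uid/gid.

    The returned file descriptor stays usable after the drop, so the rest
    of the process can use the device without any elevated rights.
    """
    fd = os.open(device_path, os.O_RDWR)
    if os.geteuid() == 0:
        # Order matters: supplementary groups, then gid, then uid.
        # Once the uid is dropped, the process can no longer change groups.
        os.setgroups([])
        os.setgid(gid)
        os.setuid(uid)
    return fd
```

On a KVM host this might be called as `open_then_drop("/dev/kvm", uid, gid)` with the ids of an unprivileged account; the choice of account is an assumption left to the deployment.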
31. MirageOS 3 + Solo5
Source: Dan Williams and Ricardo Koller, IBM Research, HotCloud 16
32. MirageOS 3 + Solo5
• Due for stable release in the next month.
• Intended to be "unikernel template" for
other projects to share hypervisor code.
• Liberally licensed under BSD/Apache2/ISC
to encourage adoption and embedding.
• BoF and tutorials tomorrow to demonstrate
it. Developers are all here and hacking!
34. How can distributed systems
use hardware protection more
flexibly and composably?
35. Questions?
Download free at
docker.com
Twitter: @avsm
https://github.com/docker/hyperkit
https://github.com/docker/vpnkit
https://github.com/docker/datakit
https://github.com/mirage/
We will be hacking tomorrow!
37. • Challenge: Share arbitrary OSX directory tree into
Linux container without requiring extensive
modification of either side.
• Solution: Use a FUSE forwarding layer and
translate Linux filesystem calls to OSX equivalents.
[Diagram: OSX Host (com.docker.osxfs: track extra metadata, translate to OSX filesystem calls), FUSE, Linux Host, Container (VOLUME)]
Filesystem Sharing
38. • Challenge: Need filesystem activation so events on
the Mac wake up container servers and vice-versa.
• Solution: osxfs uses FSEvents API and injects
inotify activation events into container.
[Diagram: OSX Host (com.docker.osxfs: FSEvents watches open files; events from Linux cause OSX apps to wake up), FUSE, Linux Host, Container (VOLUME)]
Filesystem Sharing
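One way to picture the FSEvents-to-inotify translation is a flag mapping. The constants below are the values from Apple's FSEvents header and Linux's inotify header; the mapping itself and the `to_inotify` helper are assumptions for illustration, not osxfs's actual rules.

```python
# FSEvents item flags (kFSEventStreamEventFlagItem* in FSEvents.h)
FSE_ITEM_CREATED = 0x00000100
FSE_ITEM_REMOVED = 0x00000200
FSE_ITEM_INODE_META_MOD = 0x00000400
FSE_ITEM_MODIFIED = 0x00001000

# inotify event masks (sys/inotify.h)
IN_MODIFY = 0x00000002
IN_ATTRIB = 0x00000004
IN_CREATE = 0x00000100
IN_DELETE = 0x00000200

# Hypothetical translation table: FSEvents flag -> inotify mask.
FSE_TO_INOTIFY = {
    FSE_ITEM_CREATED: IN_CREATE,
    FSE_ITEM_REMOVED: IN_DELETE,
    FSE_ITEM_INODE_META_MOD: IN_ATTRIB,
    FSE_ITEM_MODIFIED: IN_MODIFY,
}

def to_inotify(fsevent_flags):
    """Translate a bitmask of FSEvents flags into an inotify mask."""
    mask = 0
    for fse_flag, inotify_flag in FSE_TO_INOTIFY.items():
        if fsevent_flags & fse_flag:
            mask |= inotify_flag
    return mask

print(hex(to_inotify(FSE_ITEM_CREATED | FSE_ITEM_MODIFIED)))
```

FSEvents can coalesce several changes into one event, so a single host event may translate to more than one inotify bit, as in the call above.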
40. • Challenge: Deal with custom VPN software on the
host that makes it difficult to bridge.
• Solution: VPNKit efficiently reconstructs container
traffic into separate TCP/IP flows and translates
them into native OSX/Windows sockets.
[Diagram: OSX Host, Linux Host and Container (RUN <...>) columns; com.docker.hyperkit-net reconstructs traffic into TCP flows and translates to OSX socket calls; other labels: Ethernet bridge, DHCPv4, NTP]
Networking
41. [Diagram: OSX Host, Linux Host and Container columns, with labels: Privileged Port Service, Port Service, VSock Binder, VSock Listener, Userland Proxy, EXPOSE, RUN <...>]
• Challenge: Services publishing ports should be
exposed on localhost without needing VM info.
• Solution: VPNKit forwards container port requests
to an OSX service which binds them natively on its
external interface.
Networking
42. $ docker run resin/armv7hf-debian uname -a
Linux 7ed2fca7a3f0 4.1.12 #1 SMP Tue Jan 12 10:51:00
UTC 2016 armv7l GNU/Linux
$ docker run justincormack/ppc64le-debian uname -a
Linux edd13885f316 4.1.12 #1 SMP Tue Jan 12 10:51:00
UTC 2016 ppc64le GNU/Linux
Multi-CPU architectures