A guest lecture at National University of Defense Technology (NUDT) in 2016 to postgraduate students in China about emerging technologies in the Linux operating system.
7. whoami (1)
NAME
Anthony Wong - 黃彥邦
JOB
Engineering Manager, Hardware Enablement at Canonical
LINUX EXPERIENCE
First started using Linux with Red Hat Linux 4.2 in 1997
Became a Debian Developer in 1998
Has worked in the Linux industry ever since
Contributed to many FOSS projects, e.g. Debian, Ubuntu
16. Linux is more secure because...
● Default user does not have admin privilege
● Linux is diverse
● Windows dominates desktop market, majority of viruses target Windows
● Linus's Law (formulated by Eric S. Raymond and named after Linus Torvalds): "given enough eyeballs, all bugs are shallow."
18. 2 most severe security vulnerabilities in recent years
19. Heartbleed bug
● Disclosed in April 2014 in OpenSSL.
● Due to improper input validation (missing bounds check) in the
implementation of the TLS heartbeat extension - buffer over-read.
● 17% (around half a million) of the Internet's secure web servers were
believed to be vulnerable to the attack. Some estimate 500 million
computers affected.
● Allows theft of the servers' private keys and users' session cookies and
passwords.
● Affected websites include Yahoo!, Stack Overflow, GitHub, Amazon Web
Services, Wikipedia.
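The missing bounds check can be illustrated with a toy model. This is a sketch only, not OpenSSL code: the memory layout, the SECRET value and both handler functions are hypothetical names made up for the illustration.

```python
# Toy model of a Heartbleed-style buffer over-read. NOT OpenSSL code:
# SECRET, build_memory and both handlers are hypothetical.

SECRET = b"PRIVATE-KEY"


def build_memory(payload):
    """Pretend the request payload sits right next to secret data."""
    return payload + SECRET


def handle_heartbeat_vulnerable(payload, claimed_len):
    """Echo claimed_len bytes without checking it against the payload."""
    memory = build_memory(payload)
    # Over-read: when claimed_len > len(payload), adjacent bytes leak.
    return memory[:claimed_len]


def handle_heartbeat_fixed(payload, claimed_len):
    """Drop requests whose claimed length exceeds the real payload."""
    if claimed_len > len(payload):
        return b""
    return payload[:claimed_len]


leaked = handle_heartbeat_vulnerable(b"bird", 15)
safe = handle_heartbeat_fixed(b"bird", 15)
print(leaked, safe)
```

The fixed handler mirrors the idea of the patch (reject a heartbeat whose declared length does not fit the received data), not its actual code.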
20. Shellshock bug
● Disclosed on 24 September 2014
● A security hole in bash dating from version 1.03 (August 1989)
● Bash unintentionally executes commands when the commands are
concatenated to the end of function definitions stored in the values of
environment variables.
env x='() { :;}; echo vulnerable' bash -c "echo this is a test"
● Can be triggered through HTTP_USER_AGENT variable on CGI-based web
servers.
● Attackers exploited Shellshock within hours of the initial disclosure by
creating botnets to perform DDoS attacks and vulnerability scanning.
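A rough heuristic for spotting values shaped like the test payload above could look like the following sketch. The function name is hypothetical, and the check is deliberately crude (it may flag legitimate exported functions with nested braces); it is not how bash itself was patched.

```python
def looks_like_shellshock(value):
    """Heuristic: value starts like an exported bash function definition
    and has extra text after the first closing brace."""
    stripped = value.lstrip()
    if not stripped.startswith("() {"):
        return False
    close = stripped.find("}")
    if close == -1:
        return False
    # Anything left after the function body (ignoring ';' and whitespace)
    # is what a vulnerable bash would have executed.
    trailing = stripped[close + 1:].strip(" \t;\n")
    return bool(trailing)


print(looks_like_shellshock("() { :;}; echo vulnerable"))
print(looks_like_shellshock("Mozilla/5.0"))
```

A CGI front end could apply such a check to header-derived variables like HTTP_USER_AGENT before they reach a shell.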
22. What problems do we have?
● OpenSSL, for a long time, was maintained by two guys named Steve. That
means that the internet for a long period of time was secured by those
two guys.
● OpenSSH was maintained by one guy working part time.
● Bash is maintained by just 1 guy.
● The GnuPG author was going broke.
23. What problems do we have?
● From research data
○ 51% of active projects have only 1 contributor
○ 19% have 2
○ 9% have 3
○ 5% have 4
○ 3% have 5
○ Overall, 87% of projects have 5 or fewer committers per year.
○ Merely 1% of projects have 50 or more committers per year, and a
scant 0.1% have 200 or more
Source: http://redmonk.com/dberkholz/2013/04/22/the-size-of-open-source-communities-and-its-impact-upon-activity-licensing-and-hosting/
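As a quick sanity check, the per-size percentages on this slide add up to the quoted 87%. The dictionary name is just for illustration.

```python
# Share of active projects (in percent) by number of contributors,
# copied from the slide.
share_by_contributors = {1: 51, 2: 19, 3: 9, 4: 5, 5: 3}

five_or_fewer = sum(share_by_contributors.values())
print(five_or_fewer)
```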
24. But maybe Linus's Law still applies to the Linux kernel?
25. ● 21 million LOC in Linux kernel 4.5.
● 3 million LOC (17%) in the Linux kernel untouched for 10 years since 2005.
● 7.8 changes per hour!
● Linus's Law applies to the kernel, but not without its problems.
32. What is a sandbox?
● A security mechanism for separating running programs, so that they cannot harm the host machine.
● Implemented by executing the software in a restricted operating system environment, thus controlling the resources (for example, file descriptors, memory, file system space, etc.) that a process may use.
33. Sandbox related technologies
Virtual machine
Unix permissions
UID/GID
chroot
Linux Capabilities
cgroup
Namespaces
seccomp
SELinux & AppArmor
Container
34. Virtual machine
● Emulate another computer system.
● Processes are confined in the VM.
● Can act as a security boundary.
● Examples: VMware, VirtualBox, KVM, Xen, OpenVZ, Java Virtual Machine,
.NET runtime, Dalvik
● Fun fact (off-topic): there are non-general-purpose virtual machines in the
kernel, for BPF ("Berkeley Packet Filter") and ACPI.
35. UID separation
● Android assigns a unique user ID (UID) to each Android application and
runs it as that user in a separate process.
○ Unlike traditional Linux.
● On Android, the Dalvik VM is not a security boundary, so Dalvik can
interoperate with native code in the same application without any security
constraints.
36. chroot (2,8)
● Run a command or interactive shell with a special root directory.
● Commonly used for building software and packages.
● schroot lets normal users chroot and adds more features.
● Only protects filesystem, but does not restrict the use of resources like
I/O, bandwidth, disk space or CPU time.
● chrooted programs with sufficient privileges may perform a second
chroot to break out.
● Privileged users can still create device nodes and mount filesystems; chroot
cannot block their low-level access to system devices.
37. Linux Capabilities (7)
● Traditional UNIX distinguishes two categories of processes
○ privileged processes (effective user ID = 0)
○ unprivileged processes (effective UID ≠ 0)
● Linux divides the privileges traditionally associated with superuser into
capabilities.
● Provide fine-grained control over superuser permissions.
● Examples: CAP_SYS_ADMIN, CAP_NET_ADMIN, CAP_SYS_TIME,
CAP_NET_BIND_SERVICE
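Capabilities are stored as bit masks, visible as hex values such as CapEff in /proc/<PID>/status. The sketch below decodes a mask for the handful of capabilities named in these slides. The bit numbers follow linux/capability.h; the table is intentionally partial, and the function names are hypothetical.

```python
import os

# Partial table: bit number -> capability name (from linux/capability.h).
CAP_BITS = {
    10: "CAP_NET_BIND_SERVICE",
    12: "CAP_NET_ADMIN",
    13: "CAP_NET_RAW",
    21: "CAP_SYS_ADMIN",
    25: "CAP_SYS_TIME",
}


def decode_caps(mask):
    """Return the known capability names set in a capability bit mask."""
    return [name for bit, name in sorted(CAP_BITS.items()) if mask & (1 << bit)]


def effective_caps_of_self():
    """Read the effective capability mask of this process (Linux only)."""
    with open("/proc/self/status") as status:
        for line in status:
            if line.startswith("CapEff:"):
                return int(line.split()[1], 16)
    raise RuntimeError("CapEff not found")


if os.path.exists("/proc/self/status"):
    print(decode_caps(effective_caps_of_self()))
```

An unprivileged process normally prints an empty list; a root shell prints all five names.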
38. Linux Capabilities - demo
● Check whether your ping command is SUID root
● Check capabilities of /bin/ping
○ getcap /bin/ping
● Grant CAP_NET_RAW to /bin/ping
○ sudo setcap cap_net_raw+ep /bin/ping
● Remove capabilities from /bin/ping
○ sudo setcap -r /bin/ping
○ Can you still ping if /bin/ping is not SUID root and has no capabilities set?
40. cgroup
● cgroups (control groups) limit system resource usage (CPU, memory,
disk I/O, network, etc.)
● ulimit can do some of this but is not easy to manage.
● Resource limiting
○ groups can be set to not exceed a configured memory limit, which also includes the file
system cache
● Prioritization
○ some groups may get a larger share of CPU utilization or disk I/O throughput
● Accounting
○ measures a group's resource usage, which may be used, for example, for billing purposes
● Control
○ freezing groups of processes, their checkpointing and restarting
41. cgroup resource controllers (subsystems)
● memory: limits memory usage; the OOM killer kicks in to kill a process if the limit is reached.
● cpu: assign relative CPU share of a cgroup
● blkio: assign relative I/O access and upper limit for the number of I/O
operations performed by a specific device
● cpuset: assigns individual CPUs and memory nodes to cgroups
● devices: control read/write/mknod permission
● net_cls, net_prio: assign class and priority to network traffic; they do not
set limits.
● freezer cgroup: freeze/thaw group of processes. Better than
SIGSTOP/SIGCONT.
42. Using cgroup
● Imagine different cgroup subsystems (CPU, memory, block IO) are
different trees, and processes are nodes of the tree.
● Try “mount | grep cgroup” to see where the cgroup filesystem is mounted.
● You can manipulate cgroup under /sys/fs/cgroup/<subsystem>/
● Check your current cgroup status: cat /proc/self/cgroup.
● You can try systemd-cgls and systemd-cgtop.
● Another tool is cgmanager.
● Limits can be applied through systemd settings such as MemoryLimit.
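Each line of /proc/self/cgroup has the form hierarchy-ID:controller-list:path. A minimal parsing sketch follows; the function name and the sample lines are made up for illustration.

```python
def parse_cgroup_line(line):
    """Split one /proc/<PID>/cgroup line into its three fields."""
    # The path may itself contain ':', so split at most twice.
    hierarchy_id, controllers, path = line.rstrip("\n").split(":", 2)
    return {
        "hierarchy": int(hierarchy_id),
        "controllers": controllers.split(",") if controllers else [],
        "path": path,
    }


# Hypothetical sample lines; real output differs per system.
sample = ["4:cpu,cpuacct:/user.slice", "0::/init.scope"]
for entry in sample:
    print(parse_cgroup_line(entry))
```

The second sample line shows the unified (cgroup v2) hierarchy, which has ID 0 and an empty controller list.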
44. Namespaces (7)
● Gives a process an isolated view of the global system.
● Limits how much a process can see.
● Types of namespaces:
○ PID, isolates processes
○ Network, isolates network devices, stacks, ports, etc
○ Mount, isolates mount points
○ User, isolates User and Group IDs
○ UTS (Unix timesharing - host and domain name)
○ IPC (Inter-process communication)
○ Cgroup, isolates cgroup root directory
45. PID namespace
● A process can only see processes in its own namespace.
● The parent namespace can see all child processes.
● The same process has a different PID in different PID namespaces.
● PIDs in a new namespace always start at 1.
46. Network namespace
● Process has its own network stack
○ Network interfaces, including lo
○ iptables
○ Routing tables
○ Sockets
47. Mount namespace
● Isolates filesystem mount points.
● Processes in different mount namespaces have different views of the filesystem hierarchy.
● Can be used like chroot.
● Can have private mounts, e.g. its own /tmp or /var/tmp
48. User namespace
● Does UID/GID mapping, so a process's user and group IDs can be different inside and outside a user namespace.
● For example, a process has an unprivileged user ID outside the namespace but has UID 0 inside the namespace.
● That means the process has full privileges inside the namespace, but is unprivileged outside.
● Relatively new (since Linux 3.8)
Example: UID 0→5000 in the namespace maps to UID 10000→15000 outside of the namespace
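The mapping in the example can be sketched as a lookup over (inside start, outside start, length) triples, the same three fields used by /proc/<PID>/uid_map. The function name is hypothetical, and the length of 5001 assumes the ranges on the slide are inclusive.

```python
def map_uid_to_outside(inside_uid, uid_map):
    """Translate a UID inside the namespace to the UID seen outside.

    uid_map is a list of (inside_start, outside_start, length) triples.
    Returns None when the UID is not mapped.
    """
    for inside_start, outside_start, length in uid_map:
        if inside_start <= inside_uid < inside_start + length:
            return outside_start + (inside_uid - inside_start)
    return None


# Assumption: UIDs 0..5000 inclusive map to 10000..15000 inclusive.
example_map = [(0, 10000, 5001)]
print(map_uid_to_outside(0, example_map))
print(map_uid_to_outside(5001, example_map))
```

Root (UID 0) inside the namespace is therefore just the unprivileged UID 10000 from the host's point of view.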
49. More about namespaces
● Look into /proc/<PID>/ns for namespace handles.
● Namespace API:
○ clone - create a new child process possibly with new namespace
■ Use instead of fork
○ setns - join an existing namespace
○ unshare - move a process to a new namespace
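The handles under /proc/<PID>/ns are symbolic links whose targets look like pid:[4026531836]: the namespace type plus an inode number that identifies the namespace. A small sketch for reading them follows; both function names are hypothetical.

```python
import os
import re


def parse_ns_link(target):
    """Split a link target such as 'pid:[4026531836]' into (type, inode)."""
    match = re.fullmatch(r"(\w+):\[(\d+)\]", target)
    if match is None:
        raise ValueError("unexpected namespace link: %r" % target)
    return match.group(1), int(match.group(2))


def list_namespaces(pid="self"):
    """Return {handle name: (type, inode)} for a process (Linux only)."""
    ns_dir = "/proc/%s/ns" % pid
    return {
        name: parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        for name in os.listdir(ns_dir)
    }


if os.path.isdir("/proc/self/ns"):
    print(list_namespaces())
```

Two processes are in the same namespace of a given type exactly when their inode numbers for that type match.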
50. Namespaces demo
systemd-nspawn - Spawn a namespace container for debugging, testing and
building
# debootstrap --arch=amd64 sid ~/debian-sid
# systemd-nspawn -D ~/debian-sid/
# systemd-nspawn --private-network --private-users=1000 -D ~/debian-sid/
● Run a command in the container and check its user.
● Check the network with ifconfig.
● Create some files and check their ownership.
52. Seccomp
● Filters out system calls a process does not need.
● Do you need all 300+ syscalls provided by the kernel?
● Consider that a number-crunching application does not need bind(), accept()
or chroot().
● Can tremendously reduce the kernel attack surface.
● First version was merged into Linux 2.6.12 in 2005.
● Used in Chrome browser, OpenSSH, systemd, LXC, Docker, snapd
● Two modes: SECCOMP_SET_MODE_STRICT and
SECCOMP_SET_MODE_FILTER
53. Using seccomp in your code
● SECCOMP_SET_MODE_STRICT only allows read(2), write(2), _exit(2)
and sigreturn(2). The process gets SIGKILL if it calls any other syscall.
seccomp(SECCOMP_SET_MODE_STRICT, 0, NULL);
or
prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
54. Using seccomp in your code
● SECCOMP_SET_MODE_FILTER: can control what system calls are allowed.
● Added in Linux 3.5 in 2012
seccomp(SECCOMP_SET_MODE_FILTER, flags, args);
or
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, args);
The allowed system calls are defined by a Berkeley Packet Filter (BPF) program, passed via a pointer in args.
55. How do end-users use seccomp?
● What if we don’t trust the running code? We can’t trust it to use seccomp.
● Needs containers to help confine running programs.
● systemd: SystemCallFilter=<allowed syscalls>
● For snap packages, apps can be confined or unconfined (for
development). Confined apps can declare what extra capabilities they need
(through “interfaces”).
56. seccomp in Docker Example
● Pass a Docker profile (in JSON format) when running your container.
● Docker's default seccomp profile is a whitelist.
● Syscalls such as clone, ptrace, reboot, umount are not in the whitelist.
{
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {
            "name": "accept",
            "action": "SCMP_ACT_ALLOW",
            "args": []
        },
        {
            "name": "accept4",
            "action": "SCMP_ACT_ALLOW",
            "args": []
...
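How a whitelist profile of this shape decides the fate of a syscall can be sketched as follows. The two-rule profile below is a hypothetical miniature, not Docker's real default profile, and the lookup ignores the "args" conditions that real profiles may carry.

```python
import json

# Hypothetical miniature profile in the same format as the slide.
PROFILE_JSON = """
{
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"name": "accept", "action": "SCMP_ACT_ALLOW", "args": []},
        {"name": "accept4", "action": "SCMP_ACT_ALLOW", "args": []}
    ]
}
"""


def action_for(profile, syscall):
    """Return the action the profile assigns to the named syscall."""
    for rule in profile["syscalls"]:
        if rule["name"] == syscall:
            return rule["action"]
    # Not listed: the default action applies (here: fail with an errno).
    return profile["defaultAction"]


profile = json.loads(PROFILE_JSON)
print(action_for(profile, "accept"))
print(action_for(profile, "reboot"))
```

Because the default action is SCMP_ACT_ERRNO, any syscall missing from the list is refused, which is what makes the profile a whitelist.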
58. SELinux & AppArmor
● Both are Linux kernel security modules.
● Both implement mandatory access control (MAC).
● The original primary developer of SELinux is the NSA; do you trust it?
● SELinux policy is much more complex than AppArmor.
● SELinux is used in Fedora/RHEL and Android since 4.3 (permissive) and 4.4
(enforcing).
● Ubuntu uses AppArmor.
59. AppArmor example
● SELinux policy is much more complex than AppArmor. According to
research, SELinux scores 34.58 in usability while AppArmor scores 54.93.
● AppArmor is default in Ubuntu.
66. Container
Technologies I just talked about (namespaces, cgroups, capabilities)
are building blocks in many container runtimes.
68. Container runtimes
● LXC
○ LXC being the runtime, LXD being the hypervisor
● systemd-nspawn (1)
○ Spawn a namespace container for debugging, testing and building
○ Not for serious production use
● Docker
○ used LXC at first, later libcontainer, and now runc
○ DockerHub as ecosystem to share images
○ micro-service
● Rkt by CoreOS
● OpenVZ
○ Predates the container hype, does not use namespace or cgroup
○ Requires a patched kernel for full features
73. Snap
● Backed by Canonical, installed in Ubuntu 16.04 by default.
● Can be used in Fedora, Debian, Arch, Gentoo.
● Strives to be a universal application format.
● A minimal core OS provides the basic root filesystem.
● Secured by AppArmor and seccomp.
● Packages are created with a tool called snapcraft.
● Common commands are: snap install, snap remove, snap find, snap list,
very easy to use.
● You can create unconfined snaps for development or local use.
● Interfaces are the mechanism for sharing resources and granting
permissions.
74. Snap sandbox
● Snaps are installed into the regular host filesystem in
/snap/$name/$version/
● When a snap is launched:
○ A slave mount namespace is created
○ A private /tmp directory is created
○ The ubuntu-core-launcher bind mounts /bin, /lib, /lib64, /sbin, /usr from the ubuntu-core
snap
○ The ubuntu-core-launcher applies the AppArmor/seccomp confinement
○ The application is launched: it can see the host's /dev, /proc/, /sys, /media and other
mount points, but that might be mitigated by AppArmor
● But X11 is insecure! Needs Mir/Wayland!
75. Snapcraft example
name: dash
version: "0.5.9"
summary: dash shell
description: |
  The Debian Almquist Shell (dash) is a POSIX-compliant shell derived
  from ash.
  Since it executes scripts faster than bash, and has fewer library
  dependencies (making it more robust against software or hardware
  failures), it is used as the default system shell on Debian systems.
apps:
  dash:
    command: dash
    plugs: [home, camera]
76. Snap interface
# Description: Can access non-hidden files in user's $HOME. This is restricted
# because it gives file access to all of the user's $HOME.
# Usage: reserved
# Note, @{HOME} is the user's $HOME, not the snap's $HOME
# Allow read access to toplevel $HOME for the user
owner @{HOME}/ r,
# Allow read/write access to all non-hidden files that aren't in ~/snap/
# allow creating a few files not caught above
owner @{HOME}/{s,sn,sna}{,/} rwk,
# allow access to gvfs mounts (only allow writes to files, not mount point)
owner /run/user/[0-9]*/gvfs/** r,
owner /run/user/[0-9]*/gvfs/*/** w,
77. Snap Demo
● Install dash_confined_0.5.9_amd64.snap and check access to /home.
● Write to /tmp and see what happens.
● Check dmesg for AppArmor errors.
● Install dash_home_0.5.9_amd64.snap and check home access.
● Install dash_home+camera_0.5.9_amd64.snap and check /dev/video0
access.
○ Check with getfacl /dev/video0 to make sure you have access.
○ cat /dev/video0 in dash
○ You still need snap connect dash:camera ubuntu-core:camera to grant
access.
78. Flatpak
● Originally called xdg-app, mainly contributed by a Red Hat engineer.
● Can be used in Fedora, Ubuntu, Debian, Arch, Gentoo.
● Strives to be a universal application format.
● Depends on systemd for cgroup, which makes it less universal.
● Works closely with the GNOME community.
● Desktop applications need "safe" ways to interact with the system;
these are called Portals.
80. Flatpak sandbox
● All processes run as the user with no capabilities
● All processes run in a transient systemd user scope with the name
flatpak-$appid-$pid
● A filesystem namespace where:
○ / is a private tmpfs not visible anywhere else. This is pivot_root:ed into so it is the new /
and all other mounts from the host are unmounted from the namespace.
○ /usr is a bind mount of the runtime, /app is a bind mount of the application
○ /proc shows only the processes in the app sandbox
○ /sys is a read-only bind mount of the host /sys
○ /dev contains /dev/full, /dev/null, /dev/random, /dev/urandom, /dev/tty and /dev/zero
● Seccomp is used to disable unnecessary system calls
● A private pid namespace with a minimal init process that reaps zombies
81. Flatpak sandbox continues...
● A private user namespace
● A private ipc namespace
● A private network namespace with only an ipv4 loopback device
○ Optionally can use the host network namespace
● SELinux or AppArmor is NOT used.
● Needs a Wayland compositor in the session and no access to the X server to
be properly sandboxed, because X is insecure.
85. Controversy
https://lkml.org/lkml/2014/4/2/420
On Wed, Apr 2, 2014 at 11:42 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> The response is:
>
> "Generic terms are generic, not the first user owns them."
And by "their" you mean Kay Sievers.
Key, I'm f*cking tired of the fact that you don't fix problems in the
code *you* write, so that the kernel then has to work around the
problems you cause.
Greg - just for your information, I will *not* be merging any code
from Kay into the kernel until this constant pattern is fixed.
[show linus’s middle finger photo]
87. systemd.slice
● Encodes information about a slice, a concept for hierarchically
managing the resources of a group of processes.
● Management is performed by creating a node in the Linux Control Group (cgroup) tree.
● For each slice, certain resource limits may be set that apply to all
processes of all units contained in that slice.
● Default slices:
○ -.slice (root)
○ system.slice
○ user.slice
○ machine.slice
88. Control group with systemd
● Show the control group hierarchy: systemd-cgls
● Show per-group resource usage: systemd-cgtop
89. Security features in systemd
● Can be used to sandbox traditional services.
● Makes use of existing technologies to protect system services.
90. Service unit configuration
PrivateTmp=yes|no
● Private instances of /tmp and /var/tmp.
● Lifecycle is bound to service runtime.
● Uses a filesystem (mount) namespace.
● Solves /tmp races, symlink races and insecure temp files.
91. Service unit configuration
CapabilityBoundingSet=
CAP_SYS_ADMIN, CAP_KILL, CAP_MKNOD, CAP_SYS_TIME,
CAP_NET_BIND_SERVICE
● Think about an ntpd daemon that no longer needs to run as root.
● Example:
Network-manager.service:
CapabilityBoundingSet=CAP_NET_ADMIN CAP_DAC_OVERRIDE
CAP_NET_RAW CAP_NET_BIND_SERVICE CAP_SETGID CAP_SETUID
CAP_SYS_MODULE CAP_AUDIT_WRITE CAP_KILL CAP_SYS_CHROOT
92. Service unit configuration
PrivateDevices=yes|no
● Gets rid of raw devices
● Keeps only must-have devices such as /dev/null, /dev/zero and
/dev/random.
● Examples:
systemd-bus-proxyd.service:PrivateDevices=yes
systemd-hostnamed.service:PrivateDevices=yes
systemd-localed.service:PrivateDevices=yes
systemd-timesyncd.service:PrivateDevices=yes
94. Service unit configuration
PrivateNetwork=yes|no
● ‘yes’ means only a loopback device and no access to the host's network interfaces
● fwupd.service:
[Service]
Type=dbus
BusName=org.freedesktop.fwupd
ExecStart=/usr/lib/x86_64-linux-gnu/fwupd/fwupd
PrivateNetwork=yes
PrivateTmp=yes
RestrictAddressFamilies=AF_UNIX|AF_INET|AF_INET6
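To audit which of these sandboxing switches a unit turns on, a unit fragment can be parsed with an INI parser. This is a sketch under assumptions: the fragment is copied from the fwupd example above, the function and constant names are hypothetical, and configparser only approximates systemd's own unit file syntax.

```python
import configparser

# Fragment copied from the fwupd.service example on this slide.
UNIT_TEXT = """\
[Service]
Type=dbus
BusName=org.freedesktop.fwupd
ExecStart=/usr/lib/x86_64-linux-gnu/fwupd/fwupd
PrivateNetwork=yes
PrivateTmp=yes
"""

SANDBOX_KEYS = ("PrivateTmp", "PrivateDevices", "PrivateNetwork")


def enabled_sandbox_options(unit_text):
    """List the Private* options that are switched on in [Service]."""
    # strict=False tolerates repeated keys, which unit files allow.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep the original key capitalisation
    parser.read_string(unit_text)
    service = parser["Service"]
    truthy = ("yes", "true", "1", "on")
    return [key for key in SANDBOX_KEYS if service.get(key, "no").lower() in truthy]


print(enabled_sandbox_options(UNIT_TEXT))
```

For this fragment the result lists PrivateTmp and PrivateNetwork; PrivateDevices is absent and therefore treated as off.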
99. Service unit configuration
LimitNPROC=
● Limit number of processes a user can have.
● For fork bomb protection.
● Same as ulimit -u
● Example: bluetooth.service:LimitNPROC=1
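The limit behind LimitNPROC and ulimit -u is the RLIMIT_NPROC resource limit, which a process can inspect for itself. A minimal sketch follows; the describe helper is a hypothetical name.

```python
import resource


def describe(limit):
    """Render an rlimit value the way 'ulimit' would."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


# Soft and hard limit on the number of processes for the current user.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("soft:", describe(soft), "hard:", describe(hard))
```

Run inside a service started with LimitNPROC=1, both values would be reported as 1.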
104. Benefits of kernel live patching
● No need to reboot!
● System administrators are afraid of reboots.
● A reboot may need physical presence.
● Even more reluctant to reboot if the machine has been running for long
time.
● Keep the uptime record :)
106. Summary
● Looked at the problems that the Linux ecosystem is facing.
● Reviewed sandboxing technologies in Linux
○ cgroup, namespace, seccomp, Linux capabilities
○ MAC mechanism such as AppArmor and SELinux
● We looked at containers.
● How systemd can be used to protect services.
● Snap vs Flatpak, how they make use of sandboxing.
● Kernel live patching for fixing kernel bugs without reboot.
● SSL certificates are now free thanks to the Let’s Encrypt project.