The current Linux kernel /proc/PID interface is a great, time-proven and reliable way to get information about processes running on a system. Right? Well, yes and no. We found out (and you might have noticed it, too) that this interface is what makes ps and top slow when there are thousands of processes running. Besides speed, there are a number of other problems with the current /proc/PID interface.
The talk describes all of those in detail, then goes on to the alternative we are proposing for inclusion in the kernel: a new interface called task_diag. The new interface is slick, fast (5-10x speed improvement), and extendable.
Time to rethink /proc
1. Time to rethink /proc
Kir Kolyshkin / Andrey Vagin
@kolyshkin / @vagin_andrey
Texas Linux Fest, 9 July 2016
Austin, TX
2. 2
Agenda
● Intro
● History of /proc
● Limitations of current interface
● Proposed solutions
● Performance results
3. 3
$ whoami
● Linux user since 1995
● Developing containers since 2002
– author of vzctl and vzpkg
● Leading OpenVZ: 2005 to 2015
● Twitter: @kolyshkin
4. 4
● Founded in 1997
● Spun off from Parallels
● HQ in Seattle, WA
● R&D in Moscow, RU
2016
8. 8
Ideas behind CRIU
● We can't merge kernel c/r upstream, so...
let’s redo the whole thing in userspace
● Use existing interfaces where available
– /proc, ptrace, netlink, parasite code injection
● Amend the kernel where necessary
– only ~180 kernel patches
– kernel v3.11+ is sufficient
(if CONFIG_CHECKPOINT_RESTORE is set)
9. 9
History of /proc part I
● Initial solution: /dev/kmem
– May 1975, UNIX 6th edition (V6)
– http://man.cat-v.org/unix-6th/4/mem
● First “old style” /proc
– 1984, UNIX 8th edition (V8), by Tom Killian
– A process is a file! Images of running processes
– An alternative to ptrace(2)
– http://man.cat-v.org/unix_8th/4/proc
10. 10
History of /proc part II
● Most well-known old-style /proc
– 1988...1991: UNIX SVR4 (port from V8 with enhancements by Roger Faulkner and Ron Gomes)
– read(), write(), and 37 ioctl()s
● First modern style /proc
– mid-1990s, Plan 9
– Each process is a directory with multiple informational and control files
– One can use ls and cat to work with it
12. 12
Modern Linux interface: /proc/PID/*
$ ls /proc/self/
attr cwd loginuid numa_maps schedstat task
autogroup environ map_files oom_adj sessionid timers
auxv exe maps oom_score setgroups uid_map
cgroup fd mem oom_score_adj smaps wchan
clear_refs fdinfo mountinfo pagemap stack
cmdline gid_map mounts personality stat
comm io mountstats projid_map statm
coredump_filter latency net root status
cpuset limits ns sched syscall
13. 13
Limitations of /proc/PID interface
● Requires at least three syscalls per process per file
– open(), read(), close()
● Variety of formats, mostly text based
● Not enough information (/proc/PID/fd/*)
● Some formats are non-extendable
– /proc/PID/maps where the last column is optional
● Sometimes slow due to extra attributes
– /proc/PID/smaps vs /proc/PID/maps
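The first limitation can be made concrete with a minimal Python sketch that reads a file using explicit open/read/close calls and counts them. The helper names `read_whole_file` and `scan_all`, and the choice of `stat`, `status` and `cmdline`, are illustrative assumptions, not code from ps or top.

```python
import os

def read_whole_file(path, bufsize=4096):
    """Read a file with raw syscall wrappers; return (data, syscall count)."""
    syscalls = 1
    fd = os.open(path, os.O_RDONLY)            # open()
    try:
        chunks = []
        while True:
            chunk = os.read(fd, bufsize)       # read(), repeated until EOF
            syscalls += 1
            if not chunk:
                break
            chunks.append(chunk)
    finally:
        os.close(fd)                           # close()
        syscalls += 1
    return b"".join(chunks), syscalls

def scan_all(names=("stat", "status", "cmdline")):
    """Read the given files for every PID in /proc; return (tasks, syscalls)."""
    tasks = 0
    total = 0
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue                           # not a process directory
        try:
            for name in names:
                _, count = read_whole_file(f"/proc/{entry}/{name}")
                total += count
        except OSError:
            continue                           # process exited or access denied
        tasks += 1
    return tasks, total
```

Calling `scan_all()` on a Linux system gives a rough idea of how the syscall count grows with the number of processes and with the number of files read for each one.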
15. 15
Similar problem: info about sockets
● /proc
– /proc/net/netlink
– /proc/net/unix
– /proc/net/tcp
– /proc/net/packet
● Problems: not enough info, complex format, all-or-nothing
● Solution (2012): use netlink, generalize tcp_diag as sock_diag
– the extendable binary format
– allows specifying a group of attributes and sockets
16. 16
Solution 1: task_diag based on netlink socket
1. Netlink message format: binary and extendable
2. Ways to specify a set of processes
3. Optimal grouping of attributes
18. 18
Specify sets of processes
● TASK_DIAG_DUMP_ALL
– Dump all processes
● TASK_DIAG_DUMP_ALL_THREAD
– Dump all threads
● TASK_DIAG_DUMP_CHILDREN
– Dump children of a specific task
● TASK_DIAG_DUMP_THREAD
– Dump threads of a specific task
● TASK_DIAG_DUMP_ONE
– Dump one task
19. 19
Groups of attributes
● TASK_DIAG_BASE
– PID, PGID, SID, TID, comm
● TASK_DIAG_CRED
– UID, GID, groups, capabilities
● TASK_DIAG_STAT
– per-task and per-process statistics (same as taskstats, not available in /proc)
● TASK_DIAG_VMA
– mapped memory regions and their access permissions (same as maps)
● TASK_DIAG_VMA_STAT
– memory consumption for each mapping (same as smaps)
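To show how a caller might combine a process set with attribute groups, here is a purely illustrative Python model of a request. The group and strategy names mirror the slides, but the bit positions, the numeric values and the 20-byte layout (64-bit group mask, 64-bit strategy, 32-bit PID) are assumptions made for this sketch and should not be read as the actual task_diag ABI.

```python
import struct
from enum import IntEnum, IntFlag

class Show(IntFlag):
    """Attribute groups from the slide; bit positions are placeholders."""
    BASE = 1 << 0        # PID, PGID, SID, TID, comm
    CRED = 1 << 1        # UID, GID, groups, capabilities
    STAT = 1 << 2        # per-task and per-process statistics
    VMA = 1 << 3         # mapped regions and permissions
    VMA_STAT = 1 << 4    # memory consumption per mapping

class Dump(IntEnum):
    """Ways to select tasks; numeric values are placeholders."""
    ALL = 0
    ALL_THREAD = 1
    CHILDREN = 2
    THREAD = 3
    ONE = 4

# Hypothetical request layout: group mask, dump strategy, target PID.
REQUEST = struct.Struct("=QQI")

def build_request(show, dump, pid=0):
    """Pack one request asking for the given groups over the given task set."""
    return REQUEST.pack(int(show), int(dump), pid)

def describe(request):
    """Decode a packed request back into (groups, strategy, pid)."""
    show, dump, pid = REQUEST.unpack(request)
    return Show(show), Dump(dump), pid
```

A ps-like tool would ask only for `Show.BASE | Show.CRED`, so it never pays for the expensive `VMA_STAT` group, which is the point of grouping attributes by cost.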
20. 20
This is what makes it real fast
1. Netlink message format: binary and extendable
2. Ways to specify a set of processes
3. Optimal grouping of attributes
21. 21
Problems with netlink
● Designed for networking
● Not obvious where to get pid and user namespaces
● Impossible to restrict netlink sockets
– Credentials are saved when a socket is created
– Process can drop privileges, but netlink doesn't care
– The same socket can be used to get process attributes and to set IP addresses
22. 22
Change netlink socket to a transactional file
● /proc/task_diag as a transactional file
– write request → read response
● Otherwise same as netlink socket
● LKML discussion has not reached a conclusion yet
23. 23
Performance: ps
Traditional ps (using /proc/PID/* files):
$ time ./ps/pscommand ax | wc -l
50089
real 0m1.596s
user 0m0.475s
sys 0m1.126s
New ps (using task_diag):
$ time ./ps/pscommand ax | wc -l
50089
real 0m0.148s
user 0m0.069s
sys 0m0.086s
24. 24
Performance: using perf tool
> Using the fork test command:
> 10,000 processes; 10k proc with 5 threads = 50,000 tasks
> reading /proc: 11.3 sec
> task_diag: 2.2 sec
>
> @7,440 tasks, reading /proc is at 0.77 sec and task_diag at 0.096
>
> 128 instances of specjbb, 80,000+ tasks:
> reading /proc: 32.1 sec
> task_diag: 3.9 sec
>
> So overall much snappier startup times.
// David Ahern
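The speedups implied by the timings on the two performance slides can be worked out directly. The short Python sketch below only divides the numbers quoted above; the labels are my own shorthand for each measurement.

```python
# (seconds reading /proc, seconds using task_diag), copied from the slides
timings = {
    "ps ax, real time": (1.596, 0.148),
    "perf, 50,000 tasks": (11.3, 2.2),
    "perf, 7,440 tasks": (0.77, 0.096),
    "perf, 80,000+ tasks": (32.1, 3.9),
}

def speedup(proc_seconds, task_diag_seconds):
    """How many times faster task_diag was than reading /proc."""
    return proc_seconds / task_diag_seconds

for label, (proc_seconds, task_diag_seconds) in timings.items():
    print(f"{label}: {speedup(proc_seconds, task_diag_seconds):.1f}x")
```

The ratios come out at roughly 10.8x, 5.1x, 8.0x and 8.2x, in line with the 5-10x figure quoted for the new interface.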
Slackware on floppies. Kernel 1.0.9, recompiled 1.1.50 from source.
And it’s my second time here at TXLF, long way from Seattle.
Virtuozzo, the product, is essentially a supercharged version of OpenVZ, with containers and VMs working side by side and uniformly managed by the same set of tools. The storage idea is to take the individual servers’ hard drives to
OpenVZ, my baby. First steps, first words, first kernel panics. Do we have any users in the audience?
Full (system) containers for Linux
Developed since 1999, open source since 2005
Live migration since 2007
~2000 Linux kernel patches
enabling LXC, Docker, CoreOS…
biggest contributor to containers
Now reborn as Virtuozzo 7
4 years old! v.2.3 (June 2016)
Aims to replace OpenVZ kernel c/r
Saves and restores sets of running processes
Integrated into LXC, Docker*
Not just for live migration!
save HPC job or game, update kernel or hardware, balance load, speed up boot, reverse debug, inject faults
We failed to merge in-kernel c/r because that kernel code is very invasive: it touches every kernel subsystem, and no kernel maintainer wanted that in their code.
As I’m getting older, I find myself more and more interested in history.
More than 40 files and 10 directories for each process. Our tests showed that reading that many files takes a lot of time.
Oh, and here is a picture of a classic locomotive, a rail transport vehicle. Why this picture? Because it’s slooow. Are there any engineers here? I mean, real ones, not software engineers. What would be the max speed of this beast?
Variety of formats – no one wants to spend their life writing parsers for all these formats.
Text-based: consider ps showing process time. The kernel has it in binary and exposes it in /proc as a string; ps reads it and converts it back to binary (say, to use it for sorting), and finally converts it to a string again when printing.
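The round trip described here can be sketched in Python. The field positions for utime and stime (fields 14 and 15 of /proc/PID/stat) are taken from the proc(5) manual page; the function names and the M:SS output format are assumptions for illustration and do not reproduce what ps actually does.

```python
import os

def parse_stat(text):
    """Split a /proc/PID/stat line; comm is in parentheses and may hold spaces."""
    left, _, rest = text.rpartition(")")
    pid, _, comm = left.partition(" (")
    return [pid, comm] + rest.split()

def cpu_seconds(pid="self"):
    """String from the kernel -> integers in clock ticks -> seconds."""
    with open(f"/proc/{pid}/stat") as stat_file:
        fields = parse_stat(stat_file.read())
    utime, stime = int(fields[13]), int(fields[14])   # text back to binary
    return (utime + stime) / os.sysconf("SC_CLK_TCK")

def format_cputime(seconds):
    """Binary back to text for display, as minutes:seconds."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"
```

Every value crosses the text boundary twice per process, once in the kernel and once in the tool, before anything can be sorted or printed.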
An example of a non-extendable format is /proc/*/maps – the last field is the file name, and it is ... optional!
There are three defining properties of this solution. Let’s look at them in more detail.
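A small Python sketch shows why the optional last field is awkward. The column order (address range, permissions, offset, device, inode, pathname) follows proc(5); the function name and the returned dictionary keys are made up for this example.

```python
def parse_maps_line(line):
    """Parse one /proc/PID/maps line; the pathname column may be missing."""
    # Split at most five times so a pathname with spaces stays in one piece.
    parts = line.rstrip("\n").split(None, 5)
    address, perms, offset, device, inode = parts[:5]
    start, end = (int(value, 16) for value in address.split("-"))
    return {
        "start": start,
        "end": end,
        "perms": perms,
        "offset": int(offset, 16),
        "device": device,
        "inode": int(inode),
        # Anonymous mappings have no pathname column at all.
        "path": parts[5] if len(parts) > 5 else None,
    }
```

Because everything after the fifth column has to be treated as the pathname, there is no safe place to append a new column without confusing existing parsers.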
The structure is pretty generic, this is what makes this format extendable.
One important thing here is optimal grouping. If any attribute greatly affects response speed, it should be split out into its own group.
These three properties are what make the API real FAST. For those of you living in the US, here’s a picture of a European high-speed rail train: 186 miles per hour.
Another bad example of using netlink: taskstats
Final remark: open source is really awesome! Why? There are many people from many different places working on many different problems. The work that I just described is one example of such work.