6. 7
Comparison VM-s vs CT-s
● One real HW, many virtual HW,
many OS-s.
● One real HW, one kernel, many
userspace instances
● Full control on the guest OS ● Native performance: [almost] no
overhead
● High density
● KSM (Kernel SamePage Merging) ● Use resources on demand
● Dynamic resource allocation
● Naturally share pages
● Depends on hardware
(VT-x, VT-d, EPT, etc)
● Not all functionality are virtualized
● Flexibility
11. 12
How to execute CT
All allowed by default
● unshare, nsenter
● Systemd Lightweight Containers
● LXC
● Libvirt LXC
All restricted by default
● OpenVZ (vzctl-core) (FC19)
12. 13
vzctl - perform various operations on a container
# yum install -y vzctl-core
# vzctl create 101 --ostemplate fedora-15
# vzctl start 101
# vzctl exec 101 ps ax
PID TTY STAT TIME COMMAND
1 ? Ss 0:00 init
11830 ? Ss 0:00 syslogd -m 0
11897 ? Ss 0:00 /usr/sbin/sshd
11943 ? Ss 0:00 xinetd -stayalive -pidfile ...
12218 ? Ss 0:00 sendmail: accepting connections
12265 ? Ss 0:00 sendmail: Queue runner@01:00:00
13362 ? Ss 0:00 /usr/sbin/httpd
13363 ? S 0:00 _ /usr/sbin/httpd
..............................................
6416 ? Rs 0:00 ps axf
# vzctl stop 101
# vzctl destroy 101
13. 14
OpenVZ kernel only features
● Ploop (snapshot, backups, different formats)
● Second level quota
● More functional memory accounting
● PFCache (memory deduplication. Io-ops saving)
● More isolated in compare with FC19 (lack of userns)
16. 17
What is C/R and how can it be used?
C/R is the ability to save states of processes and to restore them later.
Usage scenarios:
– Failure recovery
– Live migration
– Reboot-less upgrade
– Speed up of slow-boot services
– HPC issues
17. 18
History
●
Berkeley Lab Checkpoint/Restart (BLCR) (2003)
– Load a kernel module and link with a library
● DMTCP: Distributed MultiThreaded CheckPointing (2004-2006)
– Preload a library
●
OpenVZ (2005)
– OpenVZ kernel
● Linux Checkpoint/Restart by Oren Laadan (2008)
– A non-mainline kernel
●
CRIU (2011)
OpenVZ
2005
BLCR
2003
Linux C/R
2008
CRIU
2011
DMTCP
2007
20. 21
Dump
● Parasite code
– Receive file descriptors
– Dump memory content
– Prctl(), sigaction, pending signals, timers, etc.
● Ptrace
– freeze processes
– Inject a parasite code
● Netlink
– Get information about sockets, netns
● Procfs
/proc/PID/maps, /proc/PID/map_files/,
/proc/PID/status, /proc/PID/mountinfo
21. 22
Restore
● Collect shared objects
● Restore name-spaces
● Create a process tree
– Restore SID, PGID
– Restore objects, which should be inherited
● Files, sockets, pipes, ...
● Restore per-task properties.
● Restore memory
● Call sigreturn
● Awesome
Namespaces
Processes
22. 23
Interesting moments
● How to restore shared objects?
– Send file descriptors via unix sockets
– Map files from /proc/self/map_files/ for restoring anon shared mappings
● How to restore memory mappings on the correct places?
– Map a new code block and a stack
– Unmap crtools' mappings
– Remap task's mappings on the correct places
● How to resume a process?
– Create a signal frame
– Call sigreturn()
24. 25
New features in a kernel
● Parasite code injection (by Tejun Heo)
– Read task states, that are currently retrieved by a task only about itself
● The kcmp() system call
– Helps checking which kernel objects are shared between processes
● Proc map_files directory
– Find out what exact file is mapped
– Mappings sharing info
● A bunch of prctl extensions
– Set various private stuff on task/mm objects (c/r-only feature)
● Last-pid sysctl
– Restore task with desired PID value
25. 26
New features in a kernel
● TCP repair mode
– Read intimate state of a TCP connection
and reconstructs it from scratch on a freshly created socket
● Sockets information dumping via netlink (sock_diag)
– Extendable sockets state retrieving engine
● Virtual net devices indexes
– Allows to restore network devices in a namespace
● Socket peeking offset
– Allows peeking sockets queues (reading without removing data from queue)
● Task memory tracking
– incremental snapshots, online migration
26. 27
What are already supported?
– X86_64 architecture
– Process tree linkage
– Multi-threaded apps
– All kinds of memory mappings
– Terminals, groups, sessions
– Open files (shared and unlinked)
– Established TCP connections
– Unix sockets, Packet sockets
– Name-spaces (net, mount, ipc)
– Non-posix files (epoll, inotify)
– Pipes, Fifo-s, IPC, ...
– ARM architecture
– Pending signals
– TCP time-stamps
– Iterative snapshots
– VDSO
– LXC and OpenVZ containers
In flight
– Posix timers
– Convert OpenVZ images
27. 28
How is CRIU tested?
● ZDTM – a set of unit-tests
● Real-life applications
– Apache, Nginx
– MySQL, MongoDB, Oracle
– Make && gcc
– Tar & gzip
– Screen
– Java
– LXC
– VNC server + GUI applications
28. 29
Future plans (Feb, 2013)
● Support all kinds of kernel objects
● Merge all in-flight patches in the mainstream kernel
● Integrate CRIU with OpenVZ and LXC utilities
● Iterative migration
– Migrate memory content before freezing applications
● Integration in distributions
– CRIU was accepted to Fedora 19
29. 30
How to use
● ./crtools dump -t pid [<options>]
– checkpoint a process/tree identified by pid
● ./crtools restore -t pid [<options>]
– restore - restore a process/tree identified by pid
● ./crtools show (-D dir)|(-f file) [<options>]
– show dump file(s) contents
● ./crtools check
– checks whether the kernel support is up-to-date
● ./crtools exec -t pid <syscall-string>
– exec - execute a system call by other task
BLCR is used a kernel module, doesn't checkpoint sockets, SysV IPC, zombies, etc. Applications should be linked with a library and executed via a helper. DMTCP uses an executer too, but doesn't require a kernel module. C/R in OpenVZ is used for checkpount/restore and migrate OpenVZ containers. It requires the OpenVZ kernel. Linux C/R is very similar on OpenVZ C/R. It is used for checkpoint/restore of LXC. CRIU combines all this project. It will work on the pure upstream kernel. It's able to dump a task without any preparation.