● Live-migration feature extended to support external data sources
○ Data sources (“streams”)
■ File (local or remote over HTTP or HTTPS)
■ S3 object
○ Data formats
■ Raw (“rbd export --export-format 1” or “qemu-img -f raw”)
■ QCOW or QCOW2
● Compression, encryption, backing file, external data file and extended L2 entries features are not supported
● Image becomes available immediately
○ “rbd migration prepare” sets up a link to the specified data source
○ Missing reads are satisfied over the link, overlapping writes trigger a deep copy-up
○ Hydration can happen in the background while the image is in active use (see the sketch below)
○ Beware of potential high latency to remote data sources
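A minimal end-to-end sketch; the URL, pool and image names below are just placeholders:
$ cat > /tmp/source-spec.json <<EOF
{"type": "raw", "stream": {"type": "http", "url": "https://example.com/disk.raw"}}
EOF
$ rbd migration prepare --import-only --source-spec-path /tmp/source-spec.json mypool/myimage
$ rbd migration execute mypool/myimage # optional: hydrate in the background
$ rbd migration commit mypool/myimage # detach from the source once hydration is done
The image is usable as soon as "rbd migration prepare" returns; "rbd migration execute" deep-copies the remaining data while clients keep using the image.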
INSTANT IMPORT/RESTORE
● Growing need to encrypt data client-side with a per-image key
● Layering QEMU encryption or dm-crypt on top of librbd has major limitations
○ Copy-on-write clone must be encrypted with same key as its parent
○ Breaks golden image use cases
● Initial support for LUKS encryption incorporated within librbd
○ Uses libcryptsetup for manipulating LUKS metadata and OpenSSL for encryption
○ LUKS1 (512-byte sectors) or LUKS2 (4K sectors) format
○ AES-128 or AES-256 cipher in xts-plain64 mode
○ Only flat (non-cloned) images can be encryption-formatted in Pacific
○ Clone images inherit parent encryption profile and key
● “rbd encryption format” generates LUKS master key and adds a passphrase
○ cryptsetup tool can be used to inspect LUKS metadata and add additional passphrases
BUILT-IN LUKS ENCRYPTION
● Example
$ rbd create --size 10G myimage
$ rbd encryption format myimage luks2 ./mypassphrase
$ sudo rbd device map -t nbd myimage \
    -o encryption-format=luks2,encryption-passphrase-file=./mypassphrase
$ sudo mkfs.ext4 /dev/nbd0
$ sudo mount /dev/nbd0 /mnt
● Image layout is LUKS-compatible (unless it is an encryption-formatted clone)
$ sudo rbd device map myimage
$ sudo cryptsetup luksOpen /dev/rbd0 luks-rbd0 --key-file ./mypassphrase
$ sudo mount /dev/mapper/luks-rbd0 /mnt
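Standard cryptsetup tooling then works against the mapped device, e.g. to inspect the LUKS header or add another passphrase (device name assumes the mapping above):
$ sudo cryptsetup luksDump /dev/rbd0
$ sudo cryptsetup luksAddKey /dev/rbd0 --key-file ./mypassphrase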
BUILT-IN LUKS ENCRYPTION
● A single librbd client struggled to achieve more than 20-30K 4K IOPS
○ Limitations in the internal threading architecture
■ Too many context switches per I/O
■ A single finisher thread for AIO callbacks
● librbd I/O path rewritten
○ Started in Octopus
○ Continued in Pacific with a switch to asynchronous reactor model
■ boost::asio reactor (event loop) provided by the new neorados API
■ May eventually allow tighter integration with SPDK
● Up to 3x improvement in IOPS for some benchmarks in all-flash clusters
○ Also reduced latency
SMALL I/O PERFORMANCE
● Layering dm-cache or similar on top of librbd is too risky
○ Dirty cache blocks are flushed to the backing image out of order
○ If the cache device is lost the backing image is as good as lost too
● Two extremes with nothing in between
○ Sync write is acked when persisted in RADOS (all OSDs in the PG)
■ High latency, but the backing image is always consistent and up-to-date
○ Sync write is acked when persisted in the cache
■ Low latency, but the backing image may be stale and inconsistent (~ corrupted)
● New librbd pwl_cache plugin
○ Log-structured persistent write-back cache
○ Sync write is acked when the log entry it was appended to is persisted in the cache
○ Log entries are flushed to the backing image in order
○ Backing image may be stale but always remains point-in-time consistent
○ If the cache device is lost only a small, bounded amount of updates is at risk
PERSISTENT WRITE-BACK CACHE
● “rbd_persistent_cache_mode = rwl”
○ Targeted at PMEM devices
○ Byte addressable on-disk format
■ Pool root, log entry table, contiguous data extent area
○ Uses PMDK to access the device (currently libpmemobj, plan to switch to raw libpmem)
● “rbd_persistent_cache_mode = ssd”
○ Targeted at SSD devices
○ 4K block addressable on-disk format
■ Superblock (wraps “pool root”)
■ Chunks consisting of control block (up to 32 log entries) followed by corresponding data extents
○ Uses BlueStore’s BlockDevice to access the device (libaio + O_DIRECT); see the configuration sketch below
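A minimal client-side configuration sketch for the SSD mode (the cache path and size are placeholders; the image also needs the exclusive-lock feature enabled):
$ ceph config set client rbd_plugins pwl_cache
$ ceph config set client rbd_persistent_cache_mode ssd
$ ceph config set client rbd_persistent_cache_path /mnt/pwl-cache
$ ceph config set client rbd_persistent_cache_size 10G
The same options can also be scoped per pool or per image via "rbd config pool set" and "rbd config image set".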
PERSISTENT WRITE-BACK CACHE
● Dramatic improvement in latency
○ Particularly p99 latency
■ One to two orders of magnitude in some benchmarks
● Rough edges in Pacific
○ Cache reopen issue (PMEM and SSD modes)
■ Fix expected in 16.2.5 stable release
○ Multiple stability and crash recovery issues (SSD mode)
■ Fixes expected in future stable releases
○ Very rudimentary observability
○ Misleading “rbd status” output
PERSISTENT WRITE-BACK CACHE
● New RPCs to allow for coordinated snapshot creation
○ Ensure that the filesystem and/or user application is in a clean, consistent state before taking a non-locally initiated snapshot (e.g. scheduled mirror-snapshot)
○ Snapshot creation is aborted if any client fails to quiesce
■ See rbd_default_snapshot_quiesce_mode config option
● Wired up in rbd-nbd in Pacific
○ $ sudo rbd device map -t nbd --quiesce
○ /usr/libexec/rbd-nbd/rbd-nbd_quiesce script
■ “fsfreeze -f” on quiesce
■ “fsfreeze -u” on unquiesce
○ Can be a custom binary (see the sketch below)
● May be integrated in krbd and QEMU (qemu-guest-agent) in the future
○ A bit challenging since quiesce is initiated by the block driver (very low in the stack)
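A minimal sketch of a custom hook, assuming it is invoked the same way as the stock script, with the device path and the action as arguments (verify against the shipped rbd-nbd_quiesce before relying on this):
#!/bin/sh
# $1 - mapped device (e.g. /dev/nbd0), $2 - "quiesce" or "unquiesce"
dev="$1"
action="$2"
mnt="$(findmnt -nr -o TARGET -S "$dev" | head -n 1)"
[ -n "$mnt" ] || exit 0 # nothing mounted, nothing to freeze
case "$action" in
    quiesce)   exec fsfreeze -f "$mnt" ;;
    unquiesce) exec fsfreeze -u "$mnt" ;;
esac
The custom hook path can be supplied with rbd-nbd's --quiesce-hook option at map time (alongside --quiesce).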
SNAPSHOT QUIESCE HOOKS
● Support for msgr2.1 wire protocol added in kernel 5.11
○ $ sudo rbd device map -o ms_mode=crc|secure|prefer-crc|prefer-secure
■ In place of ms_mon_client_mode and ms_client_mode config options
■ All or nothing: no separate option affecting only monitor connections
○ See ms_mon_service_mode and ms_service_mode config options
● Original msgr2 wire protocol not implemented
○ Several security, integrity and robustness issues enshrined in the protocol itself
○ Nautilus 14.2.11, Octopus 15.2.5 or Pacific required
KERNEL MESSENGER V2
● Support for reads from non-primary OSDs added in kernel 5.8
○ Random OSD in the PG
■ $ sudo rbd device map -o read_from_replica=balance
○ Closest (most local) OSD in the PG
■ Locality is calculated against the specified location in CRUSH hierarchy
■ $ sudo rbd device map -o read_from_replica=localize,crush_location='rack:R001|datacenter:DC1'
○ Does not apply to erasure coded pools
○ Safe only since Octopus (min_last_complete_ondisk propagated to replicas)
■ Don’t use before running “ceph osd require-osd-release octopus”
● Very useful for clusters stretched across data centers or AZs
○ Primary OSD may be on a higher latency and cost link
KERNEL REPLICA READS
● Support for sending compressible/incompressible hints added in kernel 5.8
○ Enable compression when pool compression_mode is passive
■ $ sudo rbd device map -o compression_hint=compressible
○ Disable compression when pool compression_mode is aggressive
■ $ sudo rbd device map -o compression_hint=incompressible
KERNEL COMPRESSION HINTS
● Ceph client code ported to Windows
○ librbd.dll, librados.dll, etc.
● wnbd.sys kernel driver provides a virtual block device
○ I/O requests are passed through to userspace via DeviceIoControl
● rbd-wnbd.exe transforms these requests and calls into librbd
○ Similar in concept to rbd-nbd on Linux
○ Can run as a Windows service
■ Mappings are persisted across reboots
■ Proper boot dependency ordering
RBD ON WINDOWS
● Encryption-formatted copy-on-write clones
○ Clone images encrypted with encryption profile or key different from parent
■ E.g. encrypted clone (key A) of encrypted clone (key B) of unencrypted golden image
● Persistent write-back cache improvements
○ Easy to get and interpret status and metrics
○ Expand crash recovery testing
● Make rbd_support manager module handle full clusters
● Replace /usr/bin/rbdmap script with systemd unit generator
● Allow applying export-diff incremental to export file
● Export/import of consistency groups
USABILITY AND QUALITY
● Improved mirroring monitoring and alerting
○ Currently free-form JSON (“rbd mirror image status”)
○ Expose metrics to Prometheus
■ Directly from rbd-mirror instead of funneling through ceph-mgr
● Scalable and more expressive (avoids PerfCounters bridge)
○ Potentially unified mirroring metrics schema
■ At least across RBD and CephFS
● Expose snapshot-based mirroring in Dashboard
● Snapshot-based mirroring of consistency groups
MULTI-SITE
● NVMeoF target gateway
○ Similar to, and an alternative to, the existing iSCSI target gateway
○ Based on SPDK NVMeoF target and SPDK RBD bdev driver
○ Modern SmartNICs can translate between NVMe and NVMeoF
■ Image can appear as a local, direct-attached NVMe disk
■ Completely transparent, no hypervisor required
■ Useful for bare-metal service providers
● rbd-nbd
○ Safe reattach after daemon restart (pending nbd.ko kernel module change)
○ Support single daemon managing multiple images/devices
● rbd-wnbd (Windows)
○ Set up sustainable CI (preferably upstream)
○ Add test suite coverage
ECOSYSTEM
● QEMU block driver
○ Add rbd_write_zeroes support (Nautilus)
○ Add rbd_encryption_load support (Pacific)
○ Switch to QEMU coroutines
ECOSYSTEM