Ceph Month 2021 - RBD Update
Ilya Dryomov
2
WHAT IS NEW IN
PACIFIC
3
● Live-migration feature extended to support external data sources
○ Data sources (“streams”)
■ File (local or remote over HTTP or HTTPS)
■ S3 object
○ Data formats
■ Raw (“rbd export --export-format 1” or “qemu-img -f raw”)
■ QCOW or QCOW2
● The compression, encryption, backing file, external data file and extended L2 entries features are not supported
● Image becomes available immediately
○ “rbd migration prepare” sets up a link to the specified data source
○ Missing reads are satisfied over the link, overlapping writes trigger a deep copy-up
○ Hydration can happen in the background while image is in active use
○ Beware of potential high latency to remote data sources
INSTANT IMPORT/RESTORE
4
● Traditional import examples
$ wget http://example.com/myimage.qcow2
$ qemu-img convert -f qcow2 -O raw myimage.qcow2 myimage.raw
$ rbd import myimage.raw myimage
$ qemu-img convert -f qcow2 -O raw 'json:{"file.driver":"http",
"file.url":"http://example.com/myimage.qcow2"}' rbd:rbd/myimage
● Migration-based instant import example
$ rbd migration prepare --import-only --source-spec '{"type":"qcow",
"stream":{"type":"http","url":"http://example.com/myimage.qcow2"}}' myimage
$ rbd migration execute myimage
$ rbd migration commit myimage
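While a prepared migration hydrates in the background, the standard migration commands can be used to watch or back out of it; a minimal sketch (image name reused from the example above):
$ rbd status myimage            # shows the migration source, destination and progress
$ rbd migration abort myimage   # instead of commit: roll back and detach from the external source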
INSTANT IMPORT/RESTORE
5
● Growing need to encrypt data client-side with a per-image key
● Layering QEMU encryption or dm-crypt on top of librbd has major limitations
○ Copy-on-write clone must be encrypted with same key as its parent
○ Breaks golden image use cases
● Initial support for LUKS encryption incorporated within librbd
○ Uses libcryptsetup for manipulating LUKS metadata and OpenSSL for encryption
○ LUKS1 (512-byte sectors) or LUKS2 (4K sectors) format
○ AES-128 or AES-256 cipher in xts-plain64 mode
○ Only flat (non-cloned) images can be encryption-formatted in Pacific
○ Clone images inherit parent encryption profile and key
● “rbd encryption format” generates LUKS master key and adds a passphrase
○ cryptsetup tool can be used to inspect LUKS metadata and add additional passphrases
BUILT-IN LUKS ENCRYPTION
6
● Example
$ rbd create --size 10G myimage
$ rbd encryption format myimage luks2 ./mypassphrase
$ sudo rbd device map -t nbd myimage
-o encryption-format=luks2,encryption-passphrase-file=./mypassphrase
$ sudo mkfs.ext4 /dev/nbd0
$ sudo mount /dev/nbd0 /mnt
● Image layout is LUKS-compatible (unless it is an encryption-formatted clone)
$ sudo rbd device map myimage
$ sudo cryptsetup luksOpen /dev/rbd0 luks-rbd0 --key-file ./mypassphrase
$ sudo mount /dev/mapper/luks-rbd0 /mnt
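As noted on the previous slide, the regular cryptsetup tool works against the mapped device; a short sketch for inspecting the LUKS header and adding a second passphrase (device name taken from the mapping above):
$ sudo cryptsetup luksDump /dev/rbd0                               # cipher, key slots, LUKS2 metadata
$ sudo cryptsetup luksAddKey /dev/rbd0 --key-file ./mypassphrase   # prompts for the additional passphrase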
BUILT-IN LUKS ENCRYPTION
7
● A single librbd client struggled to achieve more than 20-30K 4K IOPS
○ Limitations in the internal threading architecture
■ Too many context switches per I/O
■ A single finisher thread for AIO callbacks
● librbd I/O path rewritten
○ Started in Octopus
○ Continued in Pacific with a switch to asynchronous reactor model
■ boost::asio reactor (event loop) provided by the new neorados API
■ May eventually allow tighter integration with SPDK
● Up to 3x improvement in IOPS for some benchmarks in all-flash clusters
○ Also reduced latency
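For context, numbers like these are typically gathered with fio’s librbd engine; a minimal sketch of a 4K random-write run (pool, image and client names are assumptions):
$ fio --name=rbd-4k-randwrite --ioengine=rbd --clientname=admin --pool=rbd \
      --rbdname=myimage --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 \
      --runtime=60 --time_based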
SMALL I/O PERFORMANCE
8
● Layering dm-cache or similar on top of librbd is too risky
○ Dirty cache blocks are flushed to the backing image out of order
○ If the cache device is lost the backing image is as good as lost too
● Two extremes with nothing in between
○ Sync write is acked when persisted in RADOS (all OSDs in the PG)
■ High latency, but the backing image is always consistent and up-to-date
○ Sync write is acked when persisted in the cache
■ Low latency, but the backing image may be stale and inconsistent (~ corrupted)
● New librbd pwl_cache plugin
○ Log-structured persistent write-back cache
○ Sync write is acked when the log entry it was appended to is persisted in the cache
○ Log entries are flushed to the backing image in order
○ Backing image may be stale but always remains point-in-time consistent
○ If the cache device is lost only a small, bounded number of updates is at risk
PERSISTENT WRITE-BACK CACHE
9
● “rbd_persistent_cache_mode = rwl”
○ Targeted at PMEM devices
○ Byte addressable on-disk format
■ Pool root, log entry table, contiguous data extent area
○ Uses PMDK to access the device (currently libpmemobj, plan to switch to raw libpmem)
● “rbd_persistent_cache_mode = ssd”
○ Targeted at SSD devices
○ 4K block addressable on-disk format
■ Superblock (wraps “pool root”)
■ Chunks consisting of a control block (up to 32 log entries) followed by the corresponding data extents
○ Uses BlueStore’s BlockDevice to access the device (libaio + O_DIRECT)
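A sketch of enabling the cache for all images in a pool, assuming the documented client-side option names and a locally mounted cache directory:
$ rbd config pool set rbd rbd_plugins pwl_cache
$ rbd config pool set rbd rbd_persistent_cache_mode ssd           # or "rwl" for PMEM
$ rbd config pool set rbd rbd_persistent_cache_path /mnt/pwl-cache
$ rbd config pool set rbd rbd_persistent_cache_size 10G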
PERSISTENT WRITE-BACK CACHE
10
● Dramatic improvement in latency
○ Particularly p99 latency
■ One to two orders of magnitude in some benchmarks
● Rough edges in Pacific
○ Cache reopen issue (PMEM and SSD modes)
■ Fix expected in 16.2.5 stable release
○ Multiple stability and crash recovery issues (SSD mode)
■ Fixes expected in future stable releases
○ Very rudimentary observability
○ Misleading “rbd status” output
PERSISTENT WRITE-BACK CACHE
11
● New RPCs to allow for coordinated snapshot creation
○ Ensure that the filesystem and/or user application is in a clean, consistent state before taking a non-locally initiated snapshot (e.g. a scheduled mirror-snapshot)
○ Snapshot creation is aborted if any client fails to quiesce
■ See rbd_default_snapshot_quiesce_mode config option
● Wired up in rbd-nbd in Pacific
○ $ sudo rbd device map -t nbd --quiesce
○ /usr/libexec/rbd-nbd/rbd-nbd_quiesce script
■ “fsfreeze -f” on quiesce
■ “fsfreeze -u” on unquiesce
○ Can be a custom binary
● May be integrated in krbd and QEMU (qemu-guest-agent) in the future
○ A bit challenging since quiesce is initiated by the block driver (very low in the stack)
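A sketch of mapping with quiesce enabled and what the shipped hook effectively does around each snapshot (mount point is an assumption; a custom hook can be supplied with --quiesce-hook):
$ sudo rbd device map -t nbd --quiesce myimage   # snapshot creation now runs the quiesce hook
$ sudo fsfreeze -f /mnt                          # quiesce: flush dirty data, block new writes
$ sudo fsfreeze -u /mnt                          # unquiesce: resume filesystem I/O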
SNAPSHOT QUIESCE HOOKS
12
● Support for msgr2.1 wire protocol added in kernel 5.11
○ $ sudo rbd device map -o ms_mode=crc|secure|prefer-crc|prefer-secure
■ In place of ms_mon_client_mode and ms_client_mode config options
■ All or nothing: no separate option affecting only monitor connections
○ See ms_mon_service_mode and ms_service_mode config options
● Original msgr2 wire protocol not implemented
○ Several security, integrity and robustness issues enshrined in the protocol itself
○ Nautilus 14.2.11, Octopus 15.2.5 or Pacific required
KERNEL MESSENGER V2
13
● Support for reads from non-primary OSDs added in kernel 5.8
○ Random OSD in the PG
■ $ sudo rbd device map -o read_from_replica=balance
○ Closest (most local) OSD in the PG
■ Locality is calculated against the specified location in CRUSH hierarchy
■ $ sudo rbd device map -o read_from_replica=localize,
crush_location='rack:R001|datacenter:DC1'
○ Does not apply to erasure coded pools
○ Safe only since Octopus (min_last_complete_ondisk propagated to replicas)
■ Don’t use before running “ceph osd require-osd-release octopus”
● Very useful for clusters stretched across data centers or AZs
○ Primary OSD may be on a higher latency and cost link
KERNEL REPLICA READS
14
● Support for sending compressible/incompressible hints added in kernel 5.8
○ Enable compression when pool compression_mode is passive
■ $ sudo rbd device map -o compression_hint=compressible
○ Disable compression when pool compression_mode is aggressive
■ $ sudo rbd device map -o compression_hint=incompressible
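The hints only matter in relation to the pool’s compression settings; a sketch of putting a pool into passive mode so that only hinted writes get compressed (pool and image names are assumptions):
$ ceph osd pool set rbd compression_mode passive
$ ceph osd pool set rbd compression_algorithm zstd
$ sudo rbd device map -o compression_hint=compressible myimage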
KERNEL COMPRESSION HINTS
15
● Ceph client code ported to Windows
○ librbd.dll, librados.dll, etc
● wnbd.sys kernel driver provides a virtual block device
○ I/O requests are passed through to userspace via DeviceIoControl
● rbd-wnbd.exe transforms and calls into librbd
○ Similar in concept to rbd-nbd on Linux
○ Can run as a Windows service
■ Mappings are persisted across reboots
■ Proper boot dependency ordering
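A sketch of day-to-day usage on Windows, assuming the client is installed and ceph.conf plus keyring are in place (image name is an assumption):
PS> rbd.exe device map myimage     # appears as a WNBD disk in Disk Management
PS> rbd.exe device list
PS> rbd.exe device unmap myimage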
RBD ON WINDOWS
16
WHAT IS COMING IN
QUINCY
17
● Encryption-formatted copy-on-write clones
○ Clone images encrypted with encryption profile or key different from parent
■ E.g. encrypted clone (key A) of encrypted clone (key B) of unencrypted golden image
● Persistent write-back cache improvements
○ Easy to get and interpret status and metrics
○ Expand crash recovery testing
● Make rbd_support manager module handle full clusters
● Replace /usr/bin/rbdmap script with systemd unit generator
● Allow applying export-diff incremental to export file
● Export/import of consistency groups
USABILITY AND QUALITY
18
● Improved mirroring monitoring and alerting
○ Currently free-form JSON (“rbd mirror image status”)
○ Expose metrics to Prometheus
■ Directly from rbd-mirror instead of funneling through ceph-mgr
● Scalable and more expressive (avoids PerfCounters bridge)
○ Potentially unified mirroring metrics schema
■ At least across RBD and CephFS
● Expose snapshot-based mirroring in Dashboard
● Snapshot-based mirroring of consistency groups
MULTI-SITE
19
● NVMeoF target gateway
○ Similar to, and an alternative to, the existing iSCSI target gateway
○ Based on SPDK NVMeoF target and SPDK RBD bdev driver
○ Modern SmartNICs can translate between NVMe and NVMeoF
■ Image can appear as a local, direct-attached NVMe disk
■ Completely transparent, no hypervisor required
■ Useful for bare metal service providers
● rbd-nbd
○ Safe reattach after daemon restart (pending nbd.ko kernel module change)
○ Support single daemon managing multiple images/devices
● rbd-wnbd (Windows)
○ Set up sustainable CI (preferably upstream)
○ Add test suite coverage
ECOSYSTEM
20
● QEMU block driver
○ Add rbd_write_zeroes support (Nautilus)
○ Add rbd_encryption_load support (Pacific)
○ Switch to QEMU coroutines
ECOSYSTEM
21
● https://ceph.io/
● Twitter: @ceph
● Docs: http://docs.ceph.com/
● Mailing lists: http://lists.ceph.io/
○ ceph-announce@ceph.io → announcements
○ ceph-users@ceph.io → user discussion
○ dev@ceph.io → developer discussion
● IRC: irc.oftc.net
○ #ceph, #ceph-devel
● GitHub: https://github.com/ceph/
● YouTube ‘Ceph’ channel
FOR MORE INFORMATION
