BLUESTORE: A NEW, FASTER STORAGE BACKEND
FOR CEPH
SAGE WEIL
VAULT – 2016.04.21
2
OUTLINE
Ceph background and context
FileStore, and why POSIX failed us
NewStore – a hybrid approach
BlueStore – a new Ceph OSD backend
Metadata
Data
Performance
Upcoming changes
Summary
Update since Vault (4.21.2016)
MOTIVATION
4
ObjectStore
abstract interface for storing local data
EBOFS, FileStore
EBOFS
a user-space extent-based object file
system
deprecated in favor of FileStore on btrfs
in 2009
Object
data (file-like byte stream)
attributes (small key/value)
omap (unbounded key/value)
Collection
placement group shard (slice of the
RADOS pool)
sharded by 32-bit hash value
All writes are transactions
Atomic + Consistent + Durable
Isolation provided by OSD
OBJECTSTORE AND DATA MODEL
5
FileStore
PG = collection = directory
object = file
Leveldb
large xattr spillover
object omap (key/value) data
Originally just for development...
later, the only supported backend (on XFS)
/var/lib/ceph/osd/ceph-123/
current/
meta/
osdmap123
osdmap124
0.1_head/
object1
object12
0.7_head/
object3
object5
0.a_head/
object4
object6
db/
<leveldb files>
FILESTORE
6
OSD carefully manages consistency of its data
All writes are transactions
we need A+C+D; OSD provides I
Most are simple
write some bytes to object (file)
update object attribute (file xattr)
append to update log (leveldb insert)
...but others are arbitrarily large/complex
[
 {
 "op_name": "write",
 "collection": "0.6_head",
 "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
 "length": 4194304,
 "offset": 0,
 "bufferlist length": 4194304
 },
 {
 "op_name": "setattrs",
 "collection": "0.6_head",
 "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
 "attr_lens": {
 "_": 269,
 "snapset": 31
 }
 },
 {
 "op_name": "omap_setkeys",
 "collection": "0.6_head",
 "oid": "#0:60000000::::head#",
 "attr_lens": {
 "0000000005.00000000000000000006": 178,
 "_info": 847
 }
 }
]
POSIX FAILS: TRANSACTIONS
7
POSIX FAILS: TRANSACTIONS
Btrfs transaction hooks
/* trans start and trans end are dangerous, and only for
 * use by applications that know how to avoid the
 * resulting deadlocks
 */
#define BTRFS_IOC_TRANS_START _IO(BTRFS_IOCTL_MAGIC, 6)
#define BTRFS_IOC_TRANS_END _IO(BTRFS_IOCTL_MAGIC, 7)
Writeback ordering
#define BTRFS_MOUNT_FLUSHONCOMMIT (1 << 7)
What if we hit an error? ceph-osd process dies?
#define BTRFS_MOUNT_WEDGEONTRANSABORT (1 << …)
There is no rollback...
POSIX FAILS: TRANSACTIONS
8
POSIX FAILS: TRANSACTIONS
Write-ahead journal
serialize and journal every ObjectStore::Transaction
then write it to the file system
Btrfs parallel journaling
periodic sync takes a snapshot
on restart, rollback, and replay journal against appropriate snapshot
XFS/ext4 write-ahead journaling
periodic sync, then trim old journal entries
on restart, replay entire journal
lots of ugly hackery to deal with events that aren't idempotent
e.g., renames, collection delete + create, …
full data journal → we double write everything → ~halve disk throughput
POSIX FAILS: TRANSACTIONS
9
POSIX FAILS: ENUMERATION
Ceph objects are distributed by a 32-bit hash
Enumeration is in hash order
scrubbing
“backfill” (data rebalancing, recovery)
enumeration via librados client API
POSIX readdir is not well-ordered
Need O(1) “split” for a given shard/range
Build directory tree by hash-value prefix
split any directory when size > ~100 files
merge when size < ~50 files
read entire directory, sort in-memory
…
DIR_A/
DIR_A/A03224D3_qwer
DIR_A/A247233E_zxcv
…
DIR_B/
DIR_B/DIR_8/
DIR_B/DIR_8/B823032D_foo
DIR_B/DIR_8/B8474342_bar
DIR_B/DIR_9/
DIR_B/DIR_9/B924273B_baz
DIR_B/DIR_A/
DIR_B/DIR_A/BA4328D2_asdf
…
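A minimal sketch of the hash-to-path mapping implied by the tree above (illustrative only; FileStore's actual HashIndex logic is more involved). Each nesting level consumes one hex nibble of the 32-bit hash:

#include <cstdint>
#include <cstdio>
#include <string>

std::string hash_to_path(uint32_t hash, int levels) {
  char hex[9];
  std::snprintf(hex, sizeof(hex), "%08X", (unsigned)hash);
  std::string path;
  for (int i = 0; i < levels; ++i) {
    path += "DIR_";
    path += hex[i];
    path += '/';
  }
  return path + hex;  // e.g. levels=2, 0xB8474342 -> "DIR_B/DIR_8/B8474342"
}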
NEWSTORE
11
NEW STORE GOALS
More natural transaction atomicity
Avoid double writes
Efficient object enumeration
Efficient clone operation
Efficient splice (“move these bytes from object X to object Y”)
Efficient IO pattern for HDDs, SSDs, NVMe
Minimal locking, maximum parallelism (between PGs)
Advanced features
full data and metadata checksums
compression
12
NEWSTORE – WE MANAGE NAMESPACE
POSIX has the wrong metadata model for us
Ordered key/value is perfect match
well-defined object name sort order
efficient enumeration and random lookup
NewStore = rocksdb + object files
/var/lib/ceph/osd/ceph-123/
db/
<rocksdb, leveldb, whatever>
blobs.1/
0
1
...
blobs.2/
100000
100001
...
[Diagram: three OSD stacks running NewStore over RocksDB, on HDD, dual SSD, and HDD]
13
RocksDB has a write-ahead log “journal”
XFS/ext4(/btrfs) have their own journal
(tree-log)
Journal-on-journal has high overhead
each journal manages half of overall
consistency, but incurs same overhead
write(2) + fsync(2) to new blobs.2/10302
1 write + flush to block device
1 write + flush to XFS/ext4 journal
write(2) + fsync(2) on RocksDB log
1 write + flush to block device
1 write + flush to XFS/ext4 journal
NEWSTORE FAIL: CONSISTENCY OVERHEAD
14
We can't overwrite a POSIX file as part of an atomic transaction
Writing overwrite data to a new file means many files for each object
Write-ahead logging?
put overwrite data in “WAL” records in RocksDB
commit atomically with transaction
then overwrite original file data
but we're back to a double-write of overwrites...
Performance sucks again
Overwrites dominate RBD block workloads
NEWSTORE FAIL: ATOMICITY NEEDS WAL
BLUESTORE
16
BlueStore = Block + NewStore
consume raw block device(s)
key/value database (RocksDB) for metadata
data written directly to block device
pluggable block Allocator
We must share the block device with RocksDB
implement our own rocksdb::Env
implement tiny “file system” BlueFS
make BlueStore and BlueFS share the device
BLUESTORE
[Diagram: BlueStore architecture. The ObjectStore interface is implemented by BlueStore; data flows through the Allocator to raw BlockDevices, metadata through RocksDB via BlueRocksEnv and BlueFS]
17
ROCKSDB: BLUEROCKSENV + BLUEFS
class BlueRocksEnv : public
rocksdb::EnvWrapper
passes file IO operations to BlueFS
BlueFS is a super-simple “file system”
all metadata loaded in RAM on start/mount
no need to store block free list
coarse allocation unit (1 MB blocks)
all metadata is written to a journal
journal rewritten/compacted when it gets large
[Diagram: BlueFS on-disk layout: superblock, journal (entries such as “file 10”, “file 12”, “rm file 12”), and data extents; more journal space is allocated as needed]
Map “directories” to different block devices
db.wal/ – on NVRAM, NVMe, SSD
db/ –
level0 and hot SSTs on SSD
db.slow/ – cold SSTs on HDD
BlueStore periodically balances free space
18
ROCKSDB: JOURNAL RECYCLING
Problem: 1 small (4 KB) Ceph write → 3-4 disk IOs!
BlueStore: write 4 KB of user data
rocksdb: append record to WAL
write update block at end of log file
fsync: XFS/ext4/BlueFS journals inode size/alloc update to its journal
fallocate(2) doesn't help
data blocks are not pre-zeroed; fsync still has to update alloc metadata
rocksdb LogReader only understands two modes
read until end of file (need accurate file size)
read all valid records, then ignore zeros at end (need zeroed tail)
19
ROCKSDB: JOURNAL RECYCLING (2)
Put old log files on recycle list (instead of deleting them)
LogWriter
overwrite old log data with new log data
include log number in each record
LogReader
stop replaying when we get garbage (bad CRC)
or when we get a valid CRC but record is from a previous log incarnation
Now we get one log append → one IO!
Upstream in RocksDB!
but missing a bug fix (PR #881)
Works with normal file-based storage, or BlueFS
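A minimal sketch of that replay-termination rule (illustrative types and helper, not RocksDB's actual LogReader):

#include <cstdint>
#include <vector>

struct Record {
  uint32_t stored_crc;    // CRC stored with the record
  uint32_t computed_crc;  // CRC recomputed over the record payload
  uint64_t log_number;    // log number embedded in each record
  // payload ...
};

// Return the prefix of 'log' that should be replayed for the current log.
std::vector<Record> replayable(const std::vector<Record>& log,
                               uint64_t current_log_number) {
  std::vector<Record> out;
  for (const auto& r : log) {
    if (r.stored_crc != r.computed_crc) break;      // garbage: end of valid data
    if (r.log_number != current_log_number) break;  // leftover from a recycled log
    out.push_back(r);
  }
  return out;
}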
METADATA
21
BLUESTORE METADATA
Partition namespace for different metadata
S* – “superblock” metadata for the entire store
B* – block allocation metadata
C* – collection name → cnode_t
O* – object name → onode_t or enode_t
L* – write-ahead log entries, promises of future IO
M* – omap (user key/value data, stored in objects)
V* – overlay object data (obsolete?)
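As a hedged illustration of how these prefixes partition a single RocksDB keyspace (the key encoding below is simplified and hypothetical; BlueStore's real encoding takes more care with sort order and escaping):

#include <cstdint>
#include <cstdio>
#include <string>

// Fixed-width hex keeps lexicographic key order equal to numeric order.
std::string onode_key(uint64_t pool, uint32_t hash, const std::string& name) {
  char buf[32];
  std::snprintf(buf, sizeof(buf), "O%016llx%08x",
                (unsigned long long)pool, (unsigned)hash);
  return std::string(buf) + "." + name;
}

std::string cnode_key(uint64_t pool, uint32_t hash_prefix) {
  char buf[32];
  std::snprintf(buf, sizeof(buf), "C%016llx%08x",
                (unsigned long long)pool, (unsigned)hash_prefix);
  return buf;
}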
22
Per object metadata
Lives directly in key/value pair
Serializes to ~200 bytes
Unique object id (like ino_t)
Size in bytes
Inline attributes (user attr data)
Block pointers (user byte data)
Overlay metadata
Omap prefix/ID (user k/v data)
struct bluestore_onode_t {
 uint64_t nid;
 uint64_t size;
 map<string,bufferptr> attrs;
 map<uint64_t,bluestore_extent_t> block_map;
 map<uint64_t,bluestore_overlay_t> overlay_map;
 uint64_t omap_head;
};
struct bluestore_extent_t {
 uint64_t offset;
 uint32_t length;
 uint32_t flags;
};
ONODE
23
Collection metadata
Interval of object namespace
 shard pool hash name bits
C<NOSHARD,12,3d321e00> “12.e123d3” = <25>
 shard pool hash name snap gen
O<NOSHARD,12,3d321d88,foo,NOSNAP,NOGEN> = …
O<NOSHARD,12,3d321d92,bar,NOSNAP,NOGEN> = …
O<NOSHARD,12,3d321e02,baz,NOSNAP,NOGEN> = …
O<NOSHARD,12,3d321e12,zip,NOSNAP,NOGEN> = …
O<NOSHARD,12,3d321e12,dee,NOSNAP,NOGEN> = …
O<NOSHARD,12,3d321e38,dah,NOSNAP,NOGEN> = …
struct spg_t {
 uint64_t pool;
 uint32_t hash;
 shard_id_t shard;
};
struct bluestore_cnode_t {
 uint32_t bits;
};
Nice properties
Ordered enumeration of objects
We can “split” collections by adjusting metadata
CNODE
24
Extent metadata
Sometimes we share blocks between objects (usually clones/snaps)
We need to reference count those extents
We still want to split collections and repartition extent metadata by
hash
struct bluestore_enode_t {
 struct record_t {
 uint32_t length;
 uint32_t refs;
 };
 map<uint64_t,record_t> ref_map;
 void add(uint64_t offset, uint32_t len,
 unsigned ref=2);
 void get(uint64_t offset, uint32_t len);
 void put(uint64_t offset, uint32_t len,
 vector<bluestore_extent_t> *release);
};
ENODE
DATA PATH
26
Terms
Sequencer
An independent, totally ordered queue
of transactions
One per PG
TransContext
State describing an executing
transaction
DATA PATH BASICS
Two ways to write
New allocation
Any write larger than min_alloc_size
goes to a new, unused extent on disk
Once that IO completes, we commit the
transaction
WAL (write-ahead-logged)
Commit temporary promise to
(over)write data with transaction
includes data!
Do async overwrite
Clean up temporary k/v pair
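A simplified sketch of that decision; the structs and the min_alloc_size value here are illustrative, not BlueStore's actual code:

#include <cstdint>
#include <string>
#include <vector>

struct Extent    { uint64_t offset, length; };              // newly allocated space
struct WalRecord { uint64_t object_offset; std::string data; };

constexpr uint64_t min_alloc_size = 64 * 1024;              // example value only

struct TransContext {
  std::vector<Extent>    new_extents;  // path 1: write here, then commit k/v
  std::vector<WalRecord> wal;          // path 2: committed with the k/v txn
};

void queue_write(TransContext& txc, uint64_t off, const std::string& data,
                 Extent (*allocate)(uint64_t len)) {
  if (data.size() > min_alloc_size) {
    // New allocation: write to a fresh, unused extent via AIO; the metadata
    // transaction commits only after that IO completes.
    txc.new_extents.push_back(allocate(data.size()));
  } else {
    // WAL: embed the data in the k/v transaction as a promise, apply the
    // overwrite asynchronously afterwards, then delete the temporary key.
    txc.wal.push_back({off, data});
  }
}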
27
TRANSCONTEXT STATE MACHINE
[Diagram: TransContext state machine. States: PREPARE, AIO_WAIT, KV_QUEUED, KV_COMMITTING, WAL_QUEUED, WAL_AIO_WAIT, WAL_CLEANUP, WAL_CLEANUP_COMMITTING, FINISH. Each TransContext initiates some AIO, waits for the preceding TransContext(s) in its Sequencer queue to be ready, then waits for the next commit batch]
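For reference, the states from the diagram as a plain enumeration (a sketch; the WAL_CLEANUP_COMMITTING sub-state and the exact transitions are omitted):

// States a TransContext moves through, per the diagram above (simplified).
enum class TransState {
  PREPARE,        // build the transaction, initiate AIO for new allocations
  AIO_WAIT,       // wait for that AIO to complete
  KV_QUEUED,      // queued for the next RocksDB commit batch
  KV_COMMITTING,  // k/v transaction (metadata + any WAL records) committing
  WAL_QUEUED,     // deferred overwrites queued
  WAL_AIO_WAIT,   // waiting for the WAL overwrite IO
  WAL_CLEANUP,    // remove the temporary WAL key/value pairs
  FINISH
};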
28
From open(2) man page
By default, all IO is AIO + O_DIRECT
but sometimes we want to cache (e.g., POSIX_FADV_WILLNEED)
BlueStore mixes direct and buffered IO
O_DIRECT read(2) and write(2) invalidate and/or flush pages... racily
we avoid mixing them on the same pages
disable readahead: posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM)
But it's not quite enough...
moving to fully user-space cache
AIO, O_DIRECT, AND CACHING
Applications should avoid mixing O_DIRECT and normal I/O to the same
file, and especially to overlapping byte regions in the same file.
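A minimal sketch of the open + readahead setup described above (device path and error handling omitted for brevity):

#include <fcntl.h>
#include <unistd.h>

int open_block_device(const char* path) {
  // O_DIRECT for AIO; glibc exposes it with _GNU_SOURCE (on by default for g++).
  int fd = open(path, O_RDWR | O_DIRECT);
  if (fd >= 0) {
    // Disable kernel readahead on this descriptor, as noted above.
    posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
  }
  return fd;
}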
29
FreelistManager
persist list of free extents to key/value
store
prepare incremental updates for allocate
or release
Initial implementation
<offset> = <length>
keep in-memory copy
enforce an ordering on commits
del 1600=100000
put 1700=99990
small initial memory footprint, very
expensive when fragmented
New approach
<offset> = <region bitmap>
where region is N blocks (1024?)
no in-memory state
use k/v merge operator to XOR
allocation or release
merge 10=0000000011
merge 20=1110000000
RocksDB log-structured-merge tree
coalesces keys during compaction
BLOCK FREE LIST
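A sketch of the XOR merge idea using RocksDB's AssociativeMergeOperator (the bitmap encoding and this operator are illustrative; BlueStore's actual freelist merge operator differs in detail):

#include <algorithm>
#include <string>
#include <rocksdb/merge_operator.h>

// XOR the operand into the stored bitmap: flipping a bit toggles a block
// between free and allocated, so allocate and release use the same operand.
class XorMergeOperator : public rocksdb::AssociativeMergeOperator {
 public:
  bool Merge(const rocksdb::Slice& key, const rocksdb::Slice* existing_value,
             const rocksdb::Slice& value, std::string* new_value,
             rocksdb::Logger* logger) const override {
    if (!existing_value) {
      new_value->assign(value.data(), value.size());
      return true;
    }
    new_value->assign(existing_value->data(), existing_value->size());
    size_t n = std::min(new_value->size(), value.size());
    for (size_t i = 0; i < n; ++i)
      (*new_value)[i] ^= value[i];
    return true;
  }
  const char* Name() const override { return "XorMergeOperator"; }
};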
30
Allocator
abstract interface to allocate new space
StupidAllocator
bin free extents by size (powers of 2)
choose sufficiently large extent closest
to hint
highly variable memory usage
btree of free extents
implemented, works
based on ancient ebofs policy
BitmapAllocator
hierarchy of indexes
L1: 2 bits = 2^6 blocks
L2: 2 bits = 2^12 blocks
...
 00 = all free, 11 = all used,
 01 = mix
fixed memory consumption
~35 MB RAM per TB
BLOCK ALLOCATOR
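A sketch of the “bin free extents by size (powers of 2)” idea behind StupidAllocator (data structures simplified; the closest-to-hint preference is omitted):

#include <cstdint>
#include <map>

struct FreeBins {
  // bins[i] holds free extents with length in [2^i, 2^(i+1)), as offset -> length.
  std::map<uint64_t, uint64_t> bins[64];

  static unsigned bin_of(uint64_t len) { return 63 - __builtin_clzll(len); }  // len > 0

  void release(uint64_t off, uint64_t len) { bins[bin_of(len)][off] = len; }

  // Take a sufficiently large extent from the smallest bin that has one.
  bool allocate(uint64_t len, uint64_t* off) {
    for (unsigned b = bin_of(len); b < 64; ++b) {
      for (auto it = bins[b].begin(); it != bins[b].end(); ++it) {
        if (it->second < len) continue;  // only possible in the starting bin
        *off = it->first;
        uint64_t rem_off = it->first + len, rem_len = it->second - len;
        bins[b].erase(it);
        if (rem_len) release(rem_off, rem_len);  // keep the remainder free
        return true;
      }
    }
    return false;
  }
};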
31
Let's support them natively!
256 MB zones / bands
must be written sequentially, but not all
at once
libzbc supports ZAC and ZBC HDDs
host-managed or host-aware
SMRAllocator
write pointer per zone
used + free counters per zone
Bonus: almost no memory!
IO ordering
must ensure allocated writes reach disk
in order
Cleaning
store k/v hints
zone → object hash
pick emptiest closed zone, scan hints,
move objects that are still there
opportunistically rewrite objects we read
if the zone is flagged for cleaning soon
SMR HDD
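A hypothetical sketch of the per-zone bookkeeping described above (the SMRAllocator was work in progress; names and structure here are illustrative):

#include <cstdint>
#include <vector>

struct Zone {
  uint64_t write_pointer = 0;  // next sequential write offset within the zone
  uint64_t used = 0;           // live bytes; decremented when objects go away
};

struct SmrZones {
  static constexpr uint64_t zone_size = 256ull << 20;  // 256 MB zones/bands
  std::vector<Zone> zones;
  explicit SmrZones(size_t n) : zones(n) {}

  // Writes within a zone must be sequential: just advance the write pointer.
  bool allocate(size_t z, uint64_t len, uint64_t* off) {
    Zone& zone = zones[z];
    if (zone.write_pointer + len > zone_size) return false;  // zone is full
    *off = z * zone_size + zone.write_pointer;
    zone.write_pointer += len;
    zone.used += len;
    return true;
  }

  // A release only updates counters; space comes back by cleaning the
  // emptiest closed zone, as described above.
  void release(size_t z, uint64_t len) { zones[z].used -= len; }
};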
FANCY STUFF
33
Full data checksums
We scrub... periodically
We want to validate checksum on every
read
Compression
3x replication is expensive
Any scale-out cluster is expensive
WE WANT FANCY STUFF
34
Full data checksums
We scrub... periodically
We want to validate checksum on every
read
More data with extent pointer
4 KB of csum data (one 32-bit csum per 4 KB block)
bigger onode: 300 → 4396 bytes
larger csum blocks?
Compression
3x replication is expensive
Any scale-out cluster is expensive
Need largish extents to get compression
benefit (64 KB, 128 KB)
overwrites need to do read/modify/write
WE WANT FANCY STUFF
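For scale, an illustrative calculation assuming the default 4 MB RADOS object: 4 MB / 4 KB = 1024 checksum blocks, and 1024 × 4 bytes = 4 KB of checksum data, which is how a ~300-byte onode grows to the ~4396 bytes quoted above; larger csum blocks shrink that overhead proportionally.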
35
INDIRECT EXTENT STRUCTURES
map<uint64_t,bluestore_lextent_t> extent_map;
struct bluestore_lextent_t {
 uint64_t blob_id;
 uint64_t b_off, b_len;
};
map<uint64_t,bluestore_blob_t> blob_map;
struct bluestore_blob_t {
 vector<bluestore_pextent_t> extents;
 uint32_t logical_length; ///< uncompressed length
 uint32_t flags; ///< FLAGS_*
 uint16_t num_refs;
 uint8_t csum_type; ///< CSUM_*
 uint8_t csum_block_order;
 vector<char> csum_data;
};
struct bluestore_pextent_t {
 uint64_t offset, length;
};
onode_t
onode_t or enode_t
data on block device
onode may reference a piece of a blob,
or multiple pieces
csum block size can vary
ref counted (when in enode)
PERFORMANCE
37
HDD: SEQUENTIAL WRITE
[Chart: Ceph 10.1.0 BlueStore vs FileStore, sequential writes on HDD. Throughput (MB/s) vs IO size (4K-4096K); series: FS HDD, BS HDD]
38
HDD: RANDOM WRITE
[Chart: Ceph 10.1.0 BlueStore vs FileStore, random writes on HDD. Throughput (MB/s) vs IO size; series: FS HDD, BS HDD]
[Chart: Ceph 10.1.0 BlueStore vs FileStore, random writes on HDD. IOPS vs IO size; series: FS HDD, BS HDD]
39
HDD: SEQUENTIAL READ
[Chart: Ceph 10.1.0 BlueStore vs FileStore, sequential reads on HDD. Throughput (MB/s) vs IO size (4K-4096K); series: FS HDD, BS HDD]
40
HDD: RANDOM READ
[Chart: Ceph 10.1.0 BlueStore vs FileStore, random reads on HDD. Throughput (MB/s) vs IO size; series: FS HDD, BS HDD]
[Chart: Ceph 10.1.0 BlueStore vs FileStore, random reads on HDD. IOPS vs IO size; series: FS HDD, BS HDD]
41
SSD AND NVME?
NVMe journal
random writes ~2x faster
some testing anomalies (problem with test rig kernel?)
SSD only
similar to HDD result
small write benefit is more pronounced
NVMe only
more testing anomalies on test rig.. WIP
42
AVAILABILITY
Experimental backend in Jewel v10.2.z (just released)
enable experimental unrecoverable data corrupting features = bluestore rocksdb
ceph-disk --bluestore DEV
no multi-device magic provisioning just yet
The goal...
stable in Kraken (Fall '16)?
default in Luminous (Spring '17)?
43
SUMMARY
Ceph is great
POSIX was poor choice for storing objects
Our new BlueStore backend is awesome
RocksDB rocks and was easy to embed
Log recycling speeds up commits (now upstream)
Delayed merge will help too (coming soon)
THANK YOU!
Sage Weil
CEPH PRINCIPAL
ARCHITECT
sage@redhat.com
@liewegas
45
Status Update (2016.5.25)
Substantial re-implementation for enhanced features and performance
New extent structure to easily support compression and checksums (PR published yesterday)
New disk format, NOT backward compatible with Tech Preview
Overlay / WAL mechanism rebuilt to separate policy from mechanism
Bitmap allocator to reduce memory usage and unnecessary backend serialization on flash
Extends KeyValueDB to utilize Merge concept
BlueFS log redesign to avoid stall on compaction
User-space block cache
GSoC student working on SMR support
SanDisk ZetaScale open sourced and being prepped as RocksDB replacement on flash
Goal of being stable in K release is still in sight!
46
RocksDB Tuning
• RocksDB tuned for better performance
• opt.IncreaseParallelism (16);
• opt.OptimizeLevelStyleCompaction (512 * 1024 * 1024);
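The same tuning applied to a rocksdb::Options object, as a sketch (IncreaseParallelism and OptimizeLevelStyleCompaction are standard RocksDB option helpers; the surrounding open code is illustrative):

#include <string>
#include <rocksdb/db.h>
#include <rocksdb/options.h>

rocksdb::DB* open_tuned(const std::string& path) {
  rocksdb::Options opt;
  opt.create_if_missing = true;
  opt.IncreaseParallelism(16);                          // more background threads
  opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);  // ~512 MB memtable budget
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(opt, path, &db);
  return s.ok() ? db : nullptr;
}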
[Chart: 4K 100% write, TPS over runtime: RocksDB tuned options vs default options]
47
BlueStore performance: RocksDB vs ZetaScale
[Chart: RBD 4K workload TPS (in Ks) at read/write ratios 0/100, 70/30, 99/1; series: BlueStore with RocksDB, BlueStore with ZetaScale]
Observation
• BlueStore with ZetaScale provides a 2.2x improvement in read performance compared to RocksDB
• BlueStore with ZetaScale provides a 20% improvement in mixed read/write compared to RocksDB
• BlueStore with ZetaScale 100% write performance is 10% lower than with RocksDB
48
BlueStore performance: RocksDB vs ZetaScale
[Chart: 4K 100% write, TPS over runtime: RocksDB vs ZetaScale]
Observation
• ZetaScale performance is stable
• RocksDB performance is choppy due to compaction
• RocksDB write performance keeps degrading as data size grows because the compaction overhead
grows with the data size
49
BlueStore performance: RocksDB vs ZetaScale
[Chart: 70% read / 30% write, TPS over runtime; series: RocksDB read, ZetaScale read, RocksDB write, ZetaScale write]
Observation
• ZetaScale read performance is stable
• RocksDB read performance is choppy due to compaction
50
BlueStore performance: RocksDB vs ZetaScale
Read/Write Ratio  Engine     TPS (Ks)  IO util%  CPU%  IOs/s (rd+wr)  Read KB/s  Write KB/s  IOs per TPS  CPU ms per TPS  Disk BW (rd+wr) KB per TPS
0/100             RocksDB    1.59      98        171   19094          106926     38290       12.0         1.1             91.3
0/100             ZetaScale  1.42      98        106   11970          28821      48877       8.4          0.7             54.7
70/30             RocksDB    3.59      100       233   24865          143867     33430       6.9          0.6             49.4
70/30             ZetaScale  4.28      98        178   16870          62331      44989       3.9          0.4             25.1
100/0             RocksDB    11.25     100       407   36370          215546     7300        3.2          0.4             19.8
100/0             ZetaScale  24.27     100       578   50109          291947     11537       2.1          0.2             12.5
Observation
• Behind the observed ZetaScale performance advantage, highlights of system resource utilization:
IO: BlueStore with ZetaScale uses up to 43% fewer IOs than with RocksDB
CPU: BlueStore with ZetaScale uses 35% less CPU than with RocksDB
Disk BW: BlueStore with ZetaScale uses up to 50% less bandwidth than with RocksDB
51
Tasks in progress
• Performance with multiple block sizes
• 16K, 64K, 256K
• Large Scale testing
• Multiple OSDs with 128TB capacity
• Runs with worst case metadata size
• Onode size grows as the test progresses due to random writes
• RocksDB performance keeps degrading as the dataset size
increases.
• Measure performance under this condition and compare
• Tune ZetaScale for further improvement