0 © Fujitsu 2014
Best practice with Ceph as Distributed
Intelligent Unified Cloud Storage
Dieter Kasper
CTO Data Center Infrastructure and Services
Technology Office, International Business
2014-02-27 v4
1 © Fujitsu 2014
Agenda
 Introduction
 Hardware / Software layout
 Performance test cases
 Hints and Challenges for Ceph
 Summary
2 © Fujitsu 2014
Challenges of a Storage Subsystem
 Transparency
 User has the impression of a single global File System / Storage Space
 Scalable Performance, Elasticity
 No degradation of performance as the number of users or volume of data increases
 Intelligent rebalancing on capacity enhancements
 Offer same high performance for all volumes
 Availability, Reliability and Consistency
 User can access the same file system / Block Storage from different locations at the same time
 User can access the file system at any time
 Highest MTTDL (mean time to data loss)
 Fault tolerance
 System can identify and recover from failure
 Lowest Degradation during rebuild time
 Shortest Rebuild times
 Manageability & ease of use
3 © Fujitsu 2014
 Central allocation tables
 File systems
– Access requires lookup
– Hard to scale table size
+ Stable mapping
+ Expansion trivial
 Hash functions
 Web caching, Storage Virtualization
+ Calculate location
+ No tables
– Unstable mapping
– Expansion reshuffles
Conventional data placement
4 © Fujitsu 2014
Controller 1 Controller 2
 Dual redundancy
 1 node/controller can fail
 2 drives can fail (RAID 6)
 HA only inside a pair
access network
interconnect
Conventional vs. distributed model
node1
node2
node3
node4
…
 N-way resilience, e.g.
 Multiple nodes can fail
 Multiple drives can fail
 Nodes coordinate replication and recovery
5 © Fujitsu 2014
 While traditional storage systems distribute Volumes across sub-sets of Spindles, Scale-Out systems use algorithms to distribute Volumes across all/many Spindles and provide maximum utilization of all system resources
 Offer same high performance for all volumes and shortest rebuild times
Conventional vs. distributed model
6 © Fujitsu 2014
A model for dynamic “clouds” in nature
Swarm of birds or fishes
Source: Wikipedia
7 © Fujitsu 2014
Distributed intelligence
 Swarm intelligence [Wikipedia]
 (SI) is the collective behavior of decentralized, self-organized systems, natural or artificial.
 Swarm behavior [Wikipedia]
 Swarm behavior, or swarming, is a collective behavior exhibited by animals of similar size
(…) moving en masse or migrating in some direction.
 From a more abstract point of view, swarm behavior is the collective motion of a large
number of self-propelled entities.
 From the perspective of the mathematical modeler, it is an (…)
behavior arising from simple rules that are followed by individuals and does not
involve any central coordination.
8 © Fujitsu 2014
Ceph Key Design Goals
 The system is inherently dynamic:
 Decouples data and metadata
 Eliminates object list for naming and lookup by a hash-like distribution function –
CRUSH (Controlled Replication Under Scalable Hashing)
 Delegates responsibility of data migration, replication, failure detection and recovery to the OSD
(Object Storage Daemon) cluster
 Node failures are the norm, rather than an exception
 Changes in the storage cluster size (up to 10k nodes) cause automatic and fast failure recovery and
rebalancing of data with no interruption
 The characteristics of workloads are constantly shifting over time
 The Hierarchy is dynamically redistributed over 10s of MDSs (Meta Data Services) by Dynamic
Subtree Partitioning with near-linear scalability
 The system is inevitably built incrementally
 FS can be seamlessly expanded by simply adding storage nodes (OSDs)
 Proactively migrates data to new devices -> balanced distribution of data
 Utilizes all available disk bandwidth and avoids data hot spots
9 © Fujitsu 2014
[Diagram: clients send metadata ops to the MDS cluster and file/block/object IO to the OSDs; the MDSs perform their metadata IO against RADOS; cluster monitors track membership and state]
Architecture: Ceph + RADOS (1)
 Clients
 Standard Interface to use the data
(POSIX, Block Device, S3)
 Transparent for Applications
 Metadata Server Cluster
(MDSs)
 Namespace Management
 Metadata operations (open, stat,
rename, …)
 Ensure Security
 Object Storage Cluster (OSDs)
 Stores all data and metadata
 Organizes data into flexible-sized
containers, called objects
10 © Fujitsu 2014
Architecture: Ceph + RADOS (2)
Cluster Monitors (<10): cluster membership, authentication, cluster state, cluster map (topology + authentication)
Meta Data Servers (MDS, 10s): for POSIX only; namespace mgmt., metadata ops (open, stat, rename, …)
Object Storage Daemons (OSD, 10,000s): store all data / metadata; organise all data in flexibly sized containers
Clients: bulk data traffic (block, file, object)
11 © Fujitsu 2014
CRUSH Algorithm Data Placement
 No Hotspots / Less Central Lookups:
 x → [osd12, osd34] is pseudo-random, uniform (weighted) distribution
 Stable: adding devices remaps few x’s
 Storage Topology Aware
 Placement based on physical infrastructure
 e.g., devices, servers, cabinets, rows, DCs, etc.
 Easy and Flexible Rules: describe placement in terms of tree
 ”three replicas, different cabinets, same row”
 Hierarchical Cluster Map
 Storage devices are assigned weights to control the amount of data they are responsible for
storing
 Distributes data uniformly among weighted devices
 Buckets can be composed arbitrarily to construct storage hierarchy
 Data is placed in the hierarchy by recursively selecting nested bucket items via pseudo-random, hash-like functions
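As an illustration, such a rule can be expressed in a decompiled CRUSH map (ceph osd getcrushmap -o crush.bin; crushtool -d crush.bin -o crush.txt). The following is only a sketch: the bucket name row1 and the bucket type cabinet are assumptions that must exist in the actual hierarchy.

  rule three_replicas_same_row {
      ruleset 1
      type replicated
      min_size 3
      max_size 3
      step take row1                          # start below one row of the hierarchy
      step chooseleaf firstn 0 type cabinet   # one leaf OSD per distinct cabinet
      step emit
  }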
12 © Fujitsu 2014
Data placement with CRUSH
 Files/bdevs striped over objects
 4 MB objects by default
 Objects mapped to placement groups (PGs)
 pgid = hash(object) & mask
 PGs mapped to sets of OSDs
 crush(cluster, rule, pgid) = [osd2, osd3]
 Pseudo-random, statistically uniform
distribution
 ~100 PGs per node
 Fast: O(log n) calculation, no lookups
 Reliable: replicas span failure domains
 Stable: adding/removing OSDs moves few PGs
 A deterministic pseudo-random hash like function that distributes data uniformly among OSDs
 Relies on compact cluster description for new storage target w/o consulting a central allocator
[Diagram: file / block data is striped over objects, objects map to PGs, and PGs map to OSDs grouped by failure domain]
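The resulting mapping can be inspected with the ceph CLI; a small sketch with a hypothetical pool/object name and purely illustrative output:

  # ceph osd map rbd myobject
  osdmap e42 pool 'rbd' (2) object 'myobject' -> pg 2.5e9d3f9c (2.1c) -> up [2,3] acting [2,3]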
13 © Fujitsu 2014
Ceph is the most comprehensive implementation of Unified Storage
Unified Storage for Cloud based on Ceph –
Architecture and Principles
The Ceph difference
Ceph’s CRUSH algorithm
liberates storage clusters
from the scalability and
performance limitations
imposed by centralized data table
mapping. It replicates and re-balances data within the cluster dynamically, eliminating this tedious
task for administrators, while
delivering high-performance and
infinite scalability.
http://ceph.com/ceph-storage
http://www.inktank.com
Librados: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP (used by Apps)
Ceph Object Gateway (RGW): a bucket-based REST gateway, compatible with S3 and Swift (object access for Apps)
Ceph Block Device (RBD): a reliable and fully distributed block device, with a Linux kernel client and a QEMU/KVM driver (virtual disks for hosts / VMs)
Ceph File System (CephFS): a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE (files & dirs for clients)
Ceph Storage Cluster (RADOS): a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
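A minimal sketch of how the same cluster is reached over the three access paths; pool, image, mount point and monitor names are examples:

  # object access via the rados CLI
  rados -p data put my-object ./file.bin

  # block access via RBD and the kernel client
  rbd create my-volume --size 10240      # size in MB
  rbd map my-volume                      # exposes a /dev/rbd* device

  # file access via the CephFS kernel client
  mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret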
14 © Fujitsu 2014
Ceph principles
Distributed Redundant Storage
 Intelligent data Distribution across
all nodes and spindles = wide
striping (64KB – 16MB)
 Redundancy with replica=2, 3 … 8
 Thin provisioning
 Fast distributed rebuild
 Availability, Fault tolerance
 Disk, Node, Interconnect
 Automatic rebuild
 Distributed HotSpare Space
 Transparent Block, File access
 Reliability and Consistency
 Scalable Performance
 Pure PCIe-SSD for extreme
Transaction processing
[Diagram: clients (librbd / VMs, block, object and file apps) on a redundant client interconnect (IP based); 8 storage nodes, each with SAS drives and PCIe-SSD, connected by a redundant cluster interconnect (IP based)]
15 © Fujitsu 2014
Ceph processes
[Diagram: clients (librbd / VMs, block, object and file apps) on a redundant client interconnect; 8 storage nodes, each running 4 OSDs with a journal (J) per OSD; 3 MONs and 2 MDSs distributed across the nodes; redundant cluster interconnect between the nodes]
Distributed Redundant Storage
 Intelligent data Distribution across
all nodes and spindles = wide
striping (64KB – 16MB)
 Redundancy with replica=2, 3 … 8
 Thin provisioning
 Fast distributed rebuild
 Availability, Fault tolerance
 Disk, Node, Interconnect
 Automatic rebuild
 Distributed HotSpare Space
 Transparent Block, File access
 Reliability and Consistency
 Scalable Performance
 Pure PCIe-SSD for extreme
Transaction processing
16 © Fujitsu 2014
Agenda
 Introduction
 Hardware / Software layout
 Performance test cases
 Hints and Challenges for Ceph
 Summary
17 © Fujitsu 2014
Hardware test configuration
[Diagram: storage nodes rx37-3 … rx37-8, each with PCIe-SSDs and 16x SAS drives; clients rx37-1 and rx37-2 run fio against RBD and CephFS; 3 MONs and 2 MDSs on the storage nodes]
 rx37-[3-8]: Fujitsu 2U Server RX300
 2x Intel(R) Xeon(R) E5-2630 @ 2.30GHz
 128GB RAM
 2x 1GbE onboard
 2x 10GbE Intel 82599EB
 1x 40GbIB Intel TrueScale IBA7322 QDR InfiniBand
HCA
 2x 56GbIB Mellanox MT27500 Family (configurable
as 40GbE, too)
 3x Intel PCIe-SSD 910 Series 800GB
 16x SAS 6G 300GB HDD through
 LSI MegaRAID SAS 2108 [Liberator]
 rx37-[1,2]: same as above, but
 1x Intel(R) Xeon(R) E5-2630 @ 2.30GHz
 64GB RAM
 No SSDs, 2x SAS drives
56 / 40 GbIB, 40 / 10 / 1 GbE Cluster Interconnect
56 / 40 GbIB, 40 / 10 / 1 GbE Client Interconnect
18 © Fujitsu 2014
Which parameter to change, tune
[Diagram: same test setup as before; numbered callouts 1-8 mark where each parameter applies]
(1) Frontend interface: ceph.ko, rbd.ko, ceph-fuse, rbd-wrapper in user land
(2) OSD object size of data: 64k, 4m
(3) Block device options for /dev/rbdX, /dev/sdY: scheduler, rq_affinity, rotational, read_ahead_kb
(4) Interconnect: 1 / 10 / 40 GbE, 40 / 56 GbIB CM/DG
(5) Network parameters
(6) Journal: RAM disk, SSD
(7) OSD file system: xfs, btrfs
(8) OSD disk type: SAS, SSD
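A minimal sketch of how the block device options in (3) can be changed at runtime via sysfs; device names are examples and the values depend on the workload:

  echo noop > /sys/block/rbd0/queue/scheduler       # I/O scheduler
  echo 2    > /sys/block/rbd0/queue/rq_affinity     # complete I/O on the submitting CPU
  echo 0    > /sys/block/sdq/queue/rotational       # mark an SSD as non-rotational
  echo 4096 > /sys/block/rbd0/queue/read_ahead_kb   # read-ahead in KB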
19 © Fujitsu 2014
Software test configuration
 CentOS 6.4
 With vanilla Kernel 3.8.13
 fio-2.0.13
 ceph version 0.61.7 cuttlefish
1GbE: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
1GbE: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
10GbE: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
10GbE: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
40Gb: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520
56Gb: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520
56Gb: ib2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520
40GbE: eth4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216
40GbE: eth5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216
# fdisk -l /dev/sdq
Device Boot Start End Blocks Id System
/dev/sdq1 1 22754 182767616 83 Linux
/dev/sdq2 22754 24322 12591320 f Ext'd
/dev/sdq5 22755 23277 4194304 83 Linux
/dev/sdq6 23277 23799 4194304 83 Linux
/dev/sdq7 23799 24322 4194304 83 Linux
#--- 3x Journals on each SSD
# lsscsi
[0:2:0:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sda
[0:2:1:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sdb
[0:2:2:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sdc
[0:2:3:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sdd
[0:2:4:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sde
[0:2:5:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sdf
[0:2:6:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sdg
[0:2:7:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sdh
[0:2:8:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sdi
[0:2:9:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sdj
[0:2:10:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sdk
[0:2:11:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sdl
[0:2:12:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sdm
[0:2:13:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sdn
[0:2:14:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sdo
[0:2:15:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sdp
[1:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdq
[1:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdr
[1:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sds
[1:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdt
[2:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdu
[2:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdv
[2:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdw
[2:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdx
[3:0:0:0] disk INTEL(R) SSD 910 200GB a40D /dev/sdy
[3:0:1:0] disk INTEL(R) SSD 910 200GB a40D /dev/sdz
[3:0:2:0] disk INTEL(R) SSD 910 200GB a40D /dev/sdaa
[3:0:3:0] disk INTEL(R) SSD 910 200GB a40D /dev/sdab
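The three 4 GB partitions per SSD serve as OSD journals. A minimal ceph.conf sketch of how such a journal partition can be assigned to an OSD; the OSD IDs and device names are examples:

  [osd.0]
      osd data    = /var/lib/ceph/osd/ceph-0
      osd journal = /dev/sdq5
  [osd.1]
      osd data    = /var/lib/ceph/osd/ceph-1
      osd journal = /dev/sdq6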
20 © Fujitsu 2014
Agenda
 Introduction
 Hardware / Software layout
 Performance test cases
 Hints and Challenges for Ceph
 Summary
21 © Fujitsu 2014
Performance test tools
 qperf $IP_ADDR -t 10 -oo msg_size:1:8M:*2 -v tcp_lat | tcp_bw | rc_rdma_write_lat | rc_rdma_write_bw | rc_rdma_write_poll_lat
 fio --filename=$RBD|--directory=$MDIR --direct=1 --rw=$io --bs=$bs --size=10G --numjobs=$threads --runtime=60 --group_reporting --name=file1 --output=fio_${io}_${bs}_${threads}
 RBD=/dev/rbdX, MDIR=/cephfs/fio-test-dir
 io=write,randwrite,read,randread
 bs=4k,8k,4m,8m
 threads=1,64,128
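A minimal sketch of how the fio runs can be scripted over the parameter matrix above; the RBD device path is an example:

  RBD=/dev/rbd0
  for io in write randwrite read randread; do
    for bs in 4k 8k 4m 8m; do
      for threads in 1 64 128; do
        fio --filename=$RBD --direct=1 --rw=$io --bs=$bs --size=10G \
            --numjobs=$threads --runtime=60 --group_reporting \
            --name=file1 --output=fio_${io}_${bs}_${threads}
      done
    done
  done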
22 © Fujitsu 2014
qperf - latency
[Chart: tcp_lat (µs) vs. IO size (bytes) for 1GbE, 10GbE, 40GbE-1500, 40GbE-9126, 40 IPoIB_CM, 56 IPoIB_CM, 56 IB_wr_lat, 56 IB_wr_poll_lat]
23 © Fujitsu 2014
qperf - bandwidth
0
1000
2000
3000
4000
5000
6000
7000
tcp_bw (MB/s : IO_bytes)
1GbE
10GbE
40GbE-1500
40GbE-9126
40IPoIB_CM
56IPoIB_CM
56IB_wr_lat
24 © Fujitsu 2014
qperf - explanation
 tcp_lat almost the same for blocks <= 128 bytes
 tcp_lat very similar between 40GbE,40Gb IPoIB and 56Gb IPoIB
 Significant difference between 1 / 10 GbE only for blocks >= 4k
 Better latency on IB can only be achieved with rdma_write / rdma_write_poll
 Bandwidth on 1 / 10GbE very stable on possible maximum
 40GbE implementation (MLX) with unexpected fluctuation
 Under IB only with RDMA the maximum transfer rate can be achieved
 Use Sockets over RDMA
 Options without big code changes are: SDP, rsocket, SMC-R, accelio
25 © Fujitsu 2014
Test: frontend interfaces (OSD-FS: xfs, OSD object size: 4m, journal: SSD, 56 Gb IPoIB_CM, OSD disk: SSD)
[Chart: IOPS for ceph-ko, ceph-fuse, rbd-ko, rbd-wrap]
26 © Fujitsu 2014
Test: frontend interfaces (OSD-FS: xfs, OSD object size: 4m, journal: SSD, 56 Gb IPoIB_CM, OSD disk: SSD)
[Chart: bandwidth in MB/s for ceph-ko, ceph-fuse, rbd-ko, rbd-wrap]
27 © Fujitsu 2014
 Ceph-fuse seems not to support multiple I/Os in parallel with O_DIRECT, which drops the performance significantly
 RBD-wrap (= rbd in user land) shows some advantages on IOPS, but not enough to replace rbd.ko
 Ceph.ko is excellent on sequential IOPS reads, presumably because of the read-ahead of complete 4m blocks
 Stay with the official interfaces ceph.ko / rbd.ko to give fio the needed access to File and Block
 rbd.ko has some room for performance improvement for IOPS
(Test setup: frontend interfaces; OSD-FS: xfs; OSD object size: 4m; journal: SSD; 56 Gb IPoIB_CM; OSD disk: SSD)
28 © Fujitsu 2014
Test: OSD file system xfs vs. btrfs (ceph-ko / rbd-ko, OSD object size: 4m, journal: SSD, 56 Gb IPoIB_CM, OSD disk: SSD)
[Chart: IOPS for xfs ceph-ko, btrfs ceph-ko, xfs rbd-ko, btrfs rbd-ko]
29 © Fujitsu 2014
Test: OSD file system xfs vs. btrfs (ceph-ko / rbd-ko, OSD object size: 4m, journal: SSD, 56 Gb IPoIB_CM, OSD disk: SSD)
[Chart: bandwidth in MB/s for xfs ceph-ko, btrfs ceph-ko, xfs rbd-ko, btrfs rbd-ko]
30 © Fujitsu 2014
 6 months ago, with kernel 3.0.x, btrfs revealed some weaknesses in writes
 With kernel 3.6 some essential enhancements were made to the btrfs code, so almost no differences could be identified with our kernel 3.8.13
 Use btrfs in the next test cases, because btrfs has the more promising storage features: compression, data deduplication
(Test setup: ceph-ko / rbd-ko; OSD-FS: xfs / btrfs; OSD object size: 4m; journal: SSD; 56 Gb IPoIB_CM; OSD disk: SSD)
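A minimal sketch of preparing an OSD data disk with btrfs before handing it to Ceph; device and mount point are examples:

  mkfs.btrfs /dev/sdb
  mkdir -p /var/lib/ceph/osd/ceph-0
  mount -o noatime /dev/sdb /var/lib/ceph/osd/ceph-0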
31 © Fujitsu 2014
Test: OSD object size 64k vs. 4m (rbd-ko, OSD-FS: btrfs, journal: RAM, 56 Gb IPoIB_CM, OSD disk: SSD)
[Charts: IOPS and bandwidth in MB/s for 64k and 4m object sizes]
32 © Fujitsu 2014
 Small chunks of 64k can especially increase 4k/8k sequential IOPS for
reads as well as for writes
 For random IO 4m chunks are as good as 64k chunks
 But, the usage of 64k chunks will result in very low bandwidth for 4m/8m
blocks
 Create each volume with the appropriate OSD object size: ~64k if small
sequential IOPS are used, otherwise stay with ~4m
(Test setup: rbd-ko; OSD-FS: btrfs; OSD object size: 64k / 4m; journal: RAM; 56 Gb IPoIB_CM; OSD disk: SSD)
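For RBD the object size is chosen per image at creation time via the order parameter (object size = 2^order bytes); a sketch with example image names:

  rbd create small-io-vol  --size 102400 --order 16   # 64 KB objects
  rbd create streaming-vol --size 102400 --order 22   # 4 MB objects (default)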
33 © Fujitsu 2014
Test: journal on RAM disk vs. SSD (ceph-ko / rbd-ko, OSD-FS: btrfs, OSD object size: 4m, 56 Gb IPoIB_CM, OSD disk: SSD)
[Chart: IOPS for j-RAM ceph-ko, j-SSD ceph-ko, j-RAM rbd-ko, j-SSD rbd-ko]
34 © Fujitsu 2014
Test: journal on RAM disk vs. SSD (ceph-ko / rbd-ko, OSD-FS: btrfs, OSD object size: 4m, 56 Gb IPoIB_CM, OSD disk: SSD)
[Chart: bandwidth in MB/s for j-RAM ceph-ko, j-SSD ceph-ko, j-RAM rbd-ko, j-SSD rbd-ko]
35 © Fujitsu 2014
 Only under heavy write bandwidth requests can RAM show its predominance when used as a journal.
 The SSD has to handle twice the load when both the journal and the data are written to it.
(Test setup: ceph-ko / rbd-ko; OSD-FS: btrfs; OSD object size: 4m; journal: RAM / SSD; 56 Gb IPoIB_CM; OSD disk: SSD)
36 © Fujitsu 2014
Test: network interconnect (ceph-ko, OSD-FS: btrfs, OSD object size: 4m, journal: SSD, OSD disk: SSD)
[Chart: IOPS for 1GbE, 10GbE, 40GbE, 40Gb IB_CM, 56Gb IB_CM]
37 © Fujitsu 2014
Test: network interconnect (ceph-ko, OSD-FS: btrfs, OSD object size: 4m, journal: SSD, OSD disk: SSD)
[Chart: bandwidth in MB/s for 1GbE, 10GbE, 40GbE, 40Gb IB_CM, 56Gb IB_CM]
38 © Fujitsu 2014
 In case of write IOPS, 1GbE is doing extremely well
 On sequential read IOPS there is nearly no difference between 10GbE and 56Gb IPoIB
 On the bandwidth side, the measured read performance is closely in sync with the possible speed of the network. Only the TrueScale IB has some weaknesses, because it was designed for HPC and not for storage/streaming.
 If you only look at IOPS, 10GbE is a good choice
 If throughput is relevant for your use case, you should go for 56GbIB
(Test setup: ceph-ko; OSD-FS: btrfs; OSD object size: 4m; journal: SSD; network varied; OSD disk: SSD)
39 © Fujitsu 2014
Test: OSD disk SAS vs. SSD (ceph-ko / rbd-ko, OSD-FS: btrfs, OSD object size: 4m, journal: SSD, 56 Gb IPoIB_CM)
[Chart: IOPS for SAS ceph-ko, SSD ceph-ko, SAS rbd-ko, SSD rbd-ko]
40 © Fujitsu 2014
Test: OSD disk SAS vs. SSD (ceph-ko / rbd-ko, OSD-FS: btrfs, OSD object size: 4m, journal: SSD, 56 Gb IPoIB_CM)
[Chart: bandwidth in MB/s for SAS ceph-ko, SSD ceph-ko, SAS rbd-ko, SSD rbd-ko]
41 © Fujitsu 2014
Test: OSD disk SAS vs. SSD with object size 4m (IPoIB connected mode) vs. 64k (IPoIB datagram mode); rbd-ko, OSD-FS: btrfs, journal: SSD, 56 Gb IPoIB
[Chart: IOPS for SAS 4m CM, SSD 4m CM, SAS 64k DG, SSD 64k DG]
42 © Fujitsu 2014
Test: OSD disk SAS vs. SSD with object size 4m (IPoIB connected mode) vs. 64k (IPoIB datagram mode); rbd-ko, OSD-FS: btrfs, journal: SSD, 56 Gb IPoIB
[Chart: bandwidth in MB/s for SAS 4m CM, SSD 4m CM, SAS 64k DG, SSD 64k DG]
43 © Fujitsu 2014
 SAS drives are doing much better than expected. Only on small random writes are they significantly slower than SSD
 In this comparison it has to be taken into account that the SSD is receiving twice as many writes (journal + data)
 2.5” 10k 6G SAS drives seem to be an attractive alternative in combination with an SSD for the journal
(Test setup: ceph-ko / rbd-ko; OSD-FS: btrfs; OSD object size: 64k / 4m; journal: SSD; 56 Gb IPoIB CM / DG; OSD disk: SAS / SSD)
44 © Fujitsu 2014
Agenda
 Introduction
 Hardware / Software layout
 Performance test cases
 Hints and Challenges for Ceph
 Summary
45 © Fujitsu 2014
Ceph HA/DR Design
[Diagram: two sites, each with multiple storage nodes (SAS drives + SSD) behind redundant client and cluster interconnects (IP); file and block applications run at both sites]
 2 replicas will be placed locally
 The 3rd replica will go to the remote site
 On a site failure, all data will be available at the remote site to minimize the RTO (Recovery Time Objective)
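A sketch of a CRUSH rule implementing this placement, assuming the CRUSH map defines two site buckets named site1 (local) and site2 (remote):

  rule two_local_one_remote {
      ruleset 2
      type replicated
      min_size 3
      max_size 3
      step take site1
      step chooseleaf firstn 2 type host   # two replicas on local hosts
      step emit
      step take site2
      step chooseleaf firstn 1 type host   # third replica on a remote host
      step emit
  }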
46 © Fujitsu 2014
Ceph HA/DR Design
[Diagram: Gold, Silver and Bronze pools; volumes VolA-*, VolB-* and VolC-* each have a primary plus 2nd, 3rd (and 4th) replicas distributed across both sites; file and block applications run at both sites]
Service classes by disk type and redundancy (repli=2 / repli=3 / repli=4 / erasure code):
SSD: Gold
SAS: Steel, Silver
SATA: Bronze
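A minimal sketch of how such service classes can be mapped to pools; pool names, PG counts and ruleset numbers are examples:

  ceph osd pool create gold 1024 1024
  ceph osd pool set gold size 3              # replica count for the pool
  ceph osd pool set gold crush_ruleset 1     # CRUSH rule that selects SSD OSDs
  rbd create VolA-1 --pool gold --size 102400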
47 © Fujitsu 2014
Consider balanced components (repli=3)
[Diagram: data flow inside a storage node with repli=3: writes arrive via the client interconnect, are replicated over the cluster interconnect, and pass through memory, the PCIe-SSD journal and the RAID controller to the SAS OSD disks; the arrows are labelled with 1-2 GB/s per stage]
(1) 1 GB/s of writes via the client interconnect results in 6 GB/s of bandwidth to the internal disks
(2) RAID controllers today contain ~2 GB of protected DRAM, which allows a pre-acknowledge of writes (= write-back) to SAS/SATA disks
(3) The design of using a fast journal in front of raw HDDs makes sense, but today there should be an option to disable it
48 © Fujitsu 2014
Architecture Vision for Unified Storage
[Diagram: clients / hosts and VMs / apps reach the OSD + MDS cluster through three front ends: file via NFS / CIFS and the vfs / Ceph client, object via S3 / Swift / KVS through http and the RADOS gateway, block via FCoE / iSCSI through LIO and RBD]
49 © Fujitsu 2014
Challenge: How to enter existing Clouds ?
 ~ 70% of all OS are running as guests
 In ~ 90% the 1st choice is VMware-ESX, 2nd XEN + Hyper-V
 In most cases conventional shared RAID arrays attached through FC or iSCSI are used as block storage
 VSAN and Storage Spaces as SW-defined storage are already included
 An iSCSI target or even an FC target in front of Ceph RBD can be attached to existing VMs, but will enough value remain?
 So, what is the best approach to enter existing Clouds ?
 Multi tenancy on File, Block and Object level access and administration
 Accounting & Billing
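One possible entry point is a Linux LIO iSCSI target in front of a mapped RBD image; a rough sketch (image name, IQN and backstore name are examples; portal and ACL setup omitted):

  rbd map my-volume                                             # e.g. /dev/rbd0
  targetcli /backstores/block create name=rbdvol dev=/dev/rbd0
  targetcli /iscsi create iqn.2014-02.com.example:rbd-gw
  targetcli /iscsi/iqn.2014-02.com.example:rbd-gw/tpg1/luns create /backstores/block/rbdvol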
50 © Fujitsu 2014
Agenda
 Introduction
 Hardware / Software layout
 Performance test cases
 Hints and Challenges for Ceph
 Summary
51 © Fujitsu 2014
Findings & Hints
 Ceph is the most comprehensive implementation of Unified Storage. Ceph simulates “distributed swarm intelligence”, which arises from simple rules that are followed by individual processes and does not involve any central coordination.
 The CRUSH algorithm acts as an enabler for a controlled, scalable, decentralized placement of replicated data.
 Choosing the right mix of Ceph & System parameter will result in a
performance push
 Keep the data bandwidth balanced between Client/Cluster Network,
Journal and OSD devices
 Likely the Ceph code has some room for improvement to achieve higher
throughput
52 © Fujitsu 2014
Ceph principles – translated to value
 Wide striping
 all available resources of storage devices (capacity, bandwidth, IOPS of SSD, SAS, SATA) and
 all available resources of the nodes (interconnect bandwidth, buffer & cache of RAID-Controller, RAM)
 are disposable to every single application (~ federated cache)
 Replication (2,3…8, EC) available on a per pool granularity. Volumes and Directories can be assigned to pools.
 Intelligent data placement based on a defined hierarchical storage topology map
 Thin provisioning: Allocate the capacity needed, then grow the space dynamically and on demand.
 Fast distributed rebuild: All storage devices will help to rebuild the lost capacity of a single device, because the
redundant data is distributed over multiple other devices. This results in shortest rebuild times (h not days)
 Auto-healing of lost redundancies (interconnect, node, disk) including auto-rebuild using distributed Hot Spare
 Ease of use
 create volume/directory size=XX class=SSD|SAS|SATA type=IOPS|bandwidth
 resize volume size=YY
 Unified and transparent Block, File and Object access using any block device: node-internal special block device, internal SCSI device from a RAID controller or external SCSI LUNs from traditional RAID arrays
 Scalable by adding capacity and/or performance/bandwidth/IOPS including auto-rebalance
53 © Fujitsu 2014
54 © Fujitsu 2014
Fujitsu
Technology
Solutions
CTO Data Center Infrastructure and Services
Dieter.Kasper@ts.fujitsu.com

Contenu connexe

Tendances

Introducing Lattus Object Storage
Introducing Lattus Object StorageIntroducing Lattus Object Storage
Introducing Lattus Object StorageQuantum
 
A hybrid cloud approach for secure authorized
A hybrid cloud approach for secure authorizedA hybrid cloud approach for secure authorized
A hybrid cloud approach for secure authorizedNinad Samel
 
Open Source Data Deduplication
Open Source Data DeduplicationOpen Source Data Deduplication
Open Source Data DeduplicationRedWireServices
 
Abhishek Kumar - CloudStack Locking Service
Abhishek Kumar - CloudStack Locking ServiceAbhishek Kumar - CloudStack Locking Service
Abhishek Kumar - CloudStack Locking ServiceShapeBlue
 
Quantum NDX - NAS Based Data Protection
Quantum NDX - NAS Based Data Protection Quantum NDX - NAS Based Data Protection
Quantum NDX - NAS Based Data Protection Quantum
 
A hybrid cloud approach for secure authorized deduplication.
A hybrid cloud approach for secure authorized deduplication.A hybrid cloud approach for secure authorized deduplication.
A hybrid cloud approach for secure authorized deduplication.prudhvikumar madithati
 
A hybrid cloud approach for secure authorized deduplication
A hybrid cloud approach for secure authorized deduplicationA hybrid cloud approach for secure authorized deduplication
A hybrid cloud approach for secure authorized deduplicationAdz91 Digital Ads Pvt Ltd
 
Distributed Framework for Data Mining As a Service on Private Cloud
Distributed Framework for Data Mining As a Service on Private CloudDistributed Framework for Data Mining As a Service on Private Cloud
Distributed Framework for Data Mining As a Service on Private CloudIJERA Editor
 
Mesh pull backup parent pools for video-on-demand multicast trees
Mesh pull backup parent pools for video-on-demand multicast treesMesh pull backup parent pools for video-on-demand multicast trees
Mesh pull backup parent pools for video-on-demand multicast treesIJCNCJournal
 
zenoh: zero overhead pub/sub store/query compute
zenoh: zero overhead pub/sub store/query computezenoh: zero overhead pub/sub store/query compute
zenoh: zero overhead pub/sub store/query computeAngelo Corsaro
 
OSCON 15 Building Opensource wtih Open Source
OSCON 15 Building Opensource wtih Open SourceOSCON 15 Building Opensource wtih Open Source
OSCON 15 Building Opensource wtih Open SourceSusan Wu
 
NetApp Unified Scale-Out/Clustered Storage
NetApp Unified Scale-Out/Clustered StorageNetApp Unified Scale-Out/Clustered Storage
NetApp Unified Scale-Out/Clustered StorageNetApp
 

Tendances (20)

Storage
StorageStorage
Storage
 
Swiss National Supercomputing Center
Swiss National Supercomputing CenterSwiss National Supercomputing Center
Swiss National Supercomputing Center
 
Introducing Lattus Object Storage
Introducing Lattus Object StorageIntroducing Lattus Object Storage
Introducing Lattus Object Storage
 
PowerAlluxio
PowerAlluxioPowerAlluxio
PowerAlluxio
 
A hybrid cloud approach for secure authorized
A hybrid cloud approach for secure authorizedA hybrid cloud approach for secure authorized
A hybrid cloud approach for secure authorized
 
Ch11
Ch11Ch11
Ch11
 
Open Source Data Deduplication
Open Source Data DeduplicationOpen Source Data Deduplication
Open Source Data Deduplication
 
Abhishek Kumar - CloudStack Locking Service
Abhishek Kumar - CloudStack Locking ServiceAbhishek Kumar - CloudStack Locking Service
Abhishek Kumar - CloudStack Locking Service
 
Quantum NDX - NAS Based Data Protection
Quantum NDX - NAS Based Data Protection Quantum NDX - NAS Based Data Protection
Quantum NDX - NAS Based Data Protection
 
WJCAT2-13707877
WJCAT2-13707877WJCAT2-13707877
WJCAT2-13707877
 
A hybrid cloud approach for secure authorized deduplication.
A hybrid cloud approach for secure authorized deduplication.A hybrid cloud approach for secure authorized deduplication.
A hybrid cloud approach for secure authorized deduplication.
 
635 642
635 642635 642
635 642
 
A hybrid cloud approach for secure authorized deduplication
A hybrid cloud approach for secure authorized deduplicationA hybrid cloud approach for secure authorized deduplication
A hybrid cloud approach for secure authorized deduplication
 
Distributed Framework for Data Mining As a Service on Private Cloud
Distributed Framework for Data Mining As a Service on Private CloudDistributed Framework for Data Mining As a Service on Private Cloud
Distributed Framework for Data Mining As a Service on Private Cloud
 
Mesh pull backup parent pools for video-on-demand multicast trees
Mesh pull backup parent pools for video-on-demand multicast treesMesh pull backup parent pools for video-on-demand multicast trees
Mesh pull backup parent pools for video-on-demand multicast trees
 
Scaling Up vs. Scaling-out
Scaling Up vs. Scaling-outScaling Up vs. Scaling-out
Scaling Up vs. Scaling-out
 
zenoh: zero overhead pub/sub store/query compute
zenoh: zero overhead pub/sub store/query computezenoh: zero overhead pub/sub store/query compute
zenoh: zero overhead pub/sub store/query compute
 
OSCON 15 Building Opensource wtih Open Source
OSCON 15 Building Opensource wtih Open SourceOSCON 15 Building Opensource wtih Open Source
OSCON 15 Building Opensource wtih Open Source
 
NetApp Unified Scale-Out/Clustered Storage
NetApp Unified Scale-Out/Clustered StorageNetApp Unified Scale-Out/Clustered Storage
NetApp Unified Scale-Out/Clustered Storage
 
SiDe Enabled Reliable Replica Optimization
SiDe Enabled Reliable Replica OptimizationSiDe Enabled Reliable Replica Optimization
SiDe Enabled Reliable Replica Optimization
 

Similaire à Best Practices with Ceph as Distributed, Intelligent, Unified Cloud Storage - Ceph Day Frankfurt

Unveiling the Evolution: Proprietary Hardware to Agile Software-Defined Solut...
Unveiling the Evolution: Proprietary Hardware to Agile Software-Defined Solut...Unveiling the Evolution: Proprietary Hardware to Agile Software-Defined Solut...
Unveiling the Evolution: Proprietary Hardware to Agile Software-Defined Solut...MaryJWilliams2
 
Survey of distributed storage system
Survey of distributed storage systemSurvey of distributed storage system
Survey of distributed storage systemZhichao Liang
 
A Look at the Future of Storage
A Look at the Future of StorageA Look at the Future of Storage
A Look at the Future of StorageIT Brand Pulse
 
TechDay - Toronto 2016 - Hyperconvergence and OpenNebula
TechDay - Toronto 2016 - Hyperconvergence and OpenNebulaTechDay - Toronto 2016 - Hyperconvergence and OpenNebula
TechDay - Toronto 2016 - Hyperconvergence and OpenNebulaOpenNebula Project
 
2015 open storage workshop ceph software defined storage
2015 open storage workshop   ceph software defined storage2015 open storage workshop   ceph software defined storage
2015 open storage workshop ceph software defined storageAndrew Underwood
 
Ceph Day Amsterdam 2015 - Building your own disaster? The safe way to make C...
Ceph Day Amsterdam 2015 - Building your own disaster?  The safe way to make C...Ceph Day Amsterdam 2015 - Building your own disaster?  The safe way to make C...
Ceph Day Amsterdam 2015 - Building your own disaster? The safe way to make C...Ceph Community
 
EMC Isilon Multitenancy for Hadoop Big Data Analytics
EMC Isilon Multitenancy for Hadoop Big Data AnalyticsEMC Isilon Multitenancy for Hadoop Big Data Analytics
EMC Isilon Multitenancy for Hadoop Big Data AnalyticsEMC
 
Ceph Day SF 2015 - Building your own disaster? The safe way to make Ceph ready!
Ceph Day SF 2015 - Building your own disaster? The safe way to make Ceph ready!Ceph Day SF 2015 - Building your own disaster? The safe way to make Ceph ready!
Ceph Day SF 2015 - Building your own disaster? The safe way to make Ceph ready!Ceph Community
 
Ceph Day LA: Building your own disaster? The safe way to make Ceph storage re...
Ceph Day LA: Building your own disaster? The safe way to make Ceph storage re...Ceph Day LA: Building your own disaster? The safe way to make Ceph storage re...
Ceph Day LA: Building your own disaster? The safe way to make Ceph storage re...Ceph Community
 
OSDC 2010 | Use Distributed Filesystem as a Storage Tier by Fabrizio Manfred
OSDC 2010 | Use Distributed Filesystem as a Storage Tier by Fabrizio ManfredOSDC 2010 | Use Distributed Filesystem as a Storage Tier by Fabrizio Manfred
OSDC 2010 | Use Distributed Filesystem as a Storage Tier by Fabrizio ManfredNETWAYS
 
Imperative Induced Innovation - Patrick W. Dowd, Ph. D
Imperative Induced Innovation - Patrick W. Dowd, Ph. DImperative Induced Innovation - Patrick W. Dowd, Ph. D
Imperative Induced Innovation - Patrick W. Dowd, Ph. Dscoopnewsgroup
 
Open Source Data Orchestration for AI, Big Data, and Cloud
Open Source Data Orchestration for AI, Big Data, and CloudOpen Source Data Orchestration for AI, Big Data, and Cloud
Open Source Data Orchestration for AI, Big Data, and CloudAlluxio, Inc.
 
e-Infrastructure Integration-with gCube
e-Infrastructure Integration-with gCubee-Infrastructure Integration-with gCube
e-Infrastructure Integration-with gCubeFAO
 
Consolidating File Servers into the Cloud
Consolidating File Servers into the CloudConsolidating File Servers into the Cloud
Consolidating File Servers into the CloudBuurst
 
Red Hat Storage Day Boston - OpenStack + Ceph Storage
Red Hat Storage Day Boston - OpenStack + Ceph StorageRed Hat Storage Day Boston - OpenStack + Ceph Storage
Red Hat Storage Day Boston - OpenStack + Ceph StorageRed_Hat_Storage
 
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...IOSR Journals
 
Red Hat Storage 2014 - Product(s) Overview
Red Hat Storage 2014 - Product(s) OverviewRed Hat Storage 2014 - Product(s) Overview
Red Hat Storage 2014 - Product(s) OverviewMarcel Hergaarden
 

Similaire à Best Practices with Ceph as Distributed, Intelligent, Unified Cloud Storage - Ceph Day Frankfurt (20)

Unveiling the Evolution: Proprietary Hardware to Agile Software-Defined Solut...
Unveiling the Evolution: Proprietary Hardware to Agile Software-Defined Solut...Unveiling the Evolution: Proprietary Hardware to Agile Software-Defined Solut...
Unveiling the Evolution: Proprietary Hardware to Agile Software-Defined Solut...
 
Survey of distributed storage system
Survey of distributed storage systemSurvey of distributed storage system
Survey of distributed storage system
 
A Look at the Future of Storage
A Look at the Future of StorageA Look at the Future of Storage
A Look at the Future of Storage
 
TechDay - Toronto 2016 - Hyperconvergence and OpenNebula
TechDay - Toronto 2016 - Hyperconvergence and OpenNebulaTechDay - Toronto 2016 - Hyperconvergence and OpenNebula
TechDay - Toronto 2016 - Hyperconvergence and OpenNebula
 
2015 open storage workshop ceph software defined storage
2015 open storage workshop   ceph software defined storage2015 open storage workshop   ceph software defined storage
2015 open storage workshop ceph software defined storage
 
Ceph Day Amsterdam 2015 - Building your own disaster? The safe way to make C...
Ceph Day Amsterdam 2015 - Building your own disaster?  The safe way to make C...Ceph Day Amsterdam 2015 - Building your own disaster?  The safe way to make C...
Ceph Day Amsterdam 2015 - Building your own disaster? The safe way to make C...
 
EMC Isilon Multitenancy for Hadoop Big Data Analytics
EMC Isilon Multitenancy for Hadoop Big Data AnalyticsEMC Isilon Multitenancy for Hadoop Big Data Analytics
EMC Isilon Multitenancy for Hadoop Big Data Analytics
 
Ceph Day SF 2015 - Building your own disaster? The safe way to make Ceph ready!
Ceph Day SF 2015 - Building your own disaster? The safe way to make Ceph ready!Ceph Day SF 2015 - Building your own disaster? The safe way to make Ceph ready!
Ceph Day SF 2015 - Building your own disaster? The safe way to make Ceph ready!
 
Ceph Day LA: Building your own disaster? The safe way to make Ceph storage re...
Ceph Day LA: Building your own disaster? The safe way to make Ceph storage re...Ceph Day LA: Building your own disaster? The safe way to make Ceph storage re...
Ceph Day LA: Building your own disaster? The safe way to make Ceph storage re...
 
OSDC 2010 | Use Distributed Filesystem as a Storage Tier by Fabrizio Manfred
OSDC 2010 | Use Distributed Filesystem as a Storage Tier by Fabrizio ManfredOSDC 2010 | Use Distributed Filesystem as a Storage Tier by Fabrizio Manfred
OSDC 2010 | Use Distributed Filesystem as a Storage Tier by Fabrizio Manfred
 
Imperative Induced Innovation - Patrick W. Dowd, Ph. D
Imperative Induced Innovation - Patrick W. Dowd, Ph. DImperative Induced Innovation - Patrick W. Dowd, Ph. D
Imperative Induced Innovation - Patrick W. Dowd, Ph. D
 
Open Source Data Orchestration for AI, Big Data, and Cloud
Open Source Data Orchestration for AI, Big Data, and CloudOpen Source Data Orchestration for AI, Big Data, and Cloud
Open Source Data Orchestration for AI, Big Data, and Cloud
 
e-Infrastructure Integration-with gCube
e-Infrastructure Integration-with gCubee-Infrastructure Integration-with gCube
e-Infrastructure Integration-with gCube
 
Ceph as software define storage
Ceph as software define storageCeph as software define storage
Ceph as software define storage
 
Consolidating File Servers into the Cloud
Consolidating File Servers into the CloudConsolidating File Servers into the Cloud
Consolidating File Servers into the Cloud
 
E045026031
E045026031E045026031
E045026031
 
Red Hat Storage Day Boston - OpenStack + Ceph Storage
Red Hat Storage Day Boston - OpenStack + Ceph StorageRed Hat Storage Day Boston - OpenStack + Ceph Storage
Red Hat Storage Day Boston - OpenStack + Ceph Storage
 
H017144148
H017144148H017144148
H017144148
 
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
 
Red Hat Storage 2014 - Product(s) Overview
Red Hat Storage 2014 - Product(s) OverviewRed Hat Storage 2014 - Product(s) Overview
Red Hat Storage 2014 - Product(s) Overview
 

Dernier

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 

Dernier (20)

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 

Best Practices with Ceph as Distributed, Intelligent, Unified Cloud Storage - Ceph Day Frankfurt

  • 1. 0 © Fujitsu 2014 Best practice with Ceph as Distributed Intelligent Unified Cloud Storage Dieter Kasper CTO Data Center Infrastructure and Services Technology Office, International Business 2014-02-27 v4
  • 2. 1 © Fujitsu 2014 Agenda  Introduction  Hardware / Software layout  Performance test cases  Hints and Challenges for Ceph  Summary
  • 3. 2 © Fujitsu 2014 Challenges of a Storage Subsystem  Transparency  User has the impression of a single global File System / Storage Space  Scalable Performance,Elasticity  No degradation of performance as the number of users or volume of data increases  Intelligent rebalancing on capacity enhancements  Offer same high performance for all volumes  Availability, Reliability and Consistency  User can access the same file system / Block Storage from different locations at the sametime  User can access the file system at any time  Highest MTTDL (mean time to data loss)  Fault tolerance  System can identify and recover from failure  Lowest Degradation during rebuild time  Shortest Rebuild times  Manageability & ease of use
  • 4. 3 © Fujitsu 2014  Central allocation tables  File systems – Access requires lookup – Hard to scale table size + Stable mapping + Expansion trivial  Hash functions  Web caching, Storage Virtualization + Calculate location + No tables – Unstable mapping – Expansion reshuffles Conventional data placement
  • 5. 4 © Fujitsu 2014 Controller 1 Controller 2  Dual redundancy  1 node/controllercan fail  2 drives can fail (RAID 6)  HA only inside a pair access network interconnect Conventional vs. distributed model node1 node2 node3 node4 …  N-way resilience, e.g.  Multiple nodes can fail  Multiple drives can fail  Nodes coordinate replication, and recovery
  • 6. 5 © Fujitsu 2014  While traditional storage systems distribute Volumes across sub-sets of Spindles Scale-Out systems use algorithms to distribute Volumes across all/many Spindles and provide maximum utilization of all system resources  Offer same high performance for all volumes and shortest rebuild times Conventional vs. distributed model
  • 7. 6 © Fujitsu 2014 A model for dynamic “clouds” in nature Swarm of birds or fishes Source: w ikipedia
  • 8. 7 © Fujitsu 2014 Distributed intelligence  Swarm intelligence [Wikipedia]  (SI) is the collective behavior of decentralized, self-organized systems, natural or artificial.  Swarm behavior [Wikipedia]  Swarm behavior, or swarming, is a collective behavior exhibited by animals of similar size (…) moving en masse or migrating in some direction.  From a more abstract point of view, swarm behavior is the collective motion of a large number of self-propelled entities.  From the perspective of the mathematical modeler, it is an (…) behavior arising from simple rules that are followed by individuals and does not involve any central coordination.
  • 9. 8 © Fujitsu 2014 Ceph Key Design Goals  The system is inherently dynamic:  Decouples data and metadata  Eliminates object list for naming and lookup by a hash-like distribution function – CRUSH (Controlled Replication Under Scalable Hashing)  Delegates responsibility of data migration, replication, failure detection and recovery to the OSD (Object Storage Daemon) cluster  Node failures are the norm, rather than an exception  Changes in the storage cluster size (up to 10k nodes) cause automatic and fast failure recovery and rebalancing of data with no interruption  The characters of workloads are constantly shifting over time  The Hierarchy is dynamically redistributed over 10s of MDSs (Meta Data Services) by Dynamic Subtree Partitioning with near-linear scalability  The system is inevitably built incrementally  FS can be seamlessly expanded by simply adding storage nodes (OSDs)  Proactively migrates data to new devices -> balanced distribution of data  Utilizes all available disk bandwidth and avoids data hot spots
  • 10. 9 © Fujitsu 2014 Cluster monitors Clients OSDsMDSs File, Block, ObjectIO Metadata ops Metadata IO RADOS Architecture: Ceph + RADOS (1)  Clients  Standard Interface to use the data (POSIX, Block Device, S3)  Transparent for Applications  Metadata Server Cluster (MDSs)  Namespace Management  Metadata operations (open, stat, rename, …)  Ensure Security  ObjectStorage Cluster (OSDs)  Stores all data and metadata  Organizes data into flexible-sized containers, called objects
  • 11. 10 © Fujitsu 2014 Architecture: Ceph + RADOS (2) Cluster Monitors <10 cluster membership authentication cluster state cluster map Topology + Authentication Meta Data Server (MDS) 10s for POSIX only Namespace mgmt. Metadata ops (open, stat, rename, …) POSIX meta data only Object Storage Daemons (OSD) 10,000s stores all data / metadata organises all data in flexibly sized containers Clients bulk data traffic block file object
  • 12. 11 © Fujitsu 2014 CRUSH Algorithm Data Placement  No Hotspots / Less Central Lookups:  x → [osd12, osd34] is pseudo-random, uniform (weighted) distribution  Stable: adding devices remaps few x’s  Storage Topology Aware  Placement based on physical infrastructure  e.g., devices, servers, cabinets, rows, DCs, etc.  Easy and Flexible Rules: describe placement in terms of tree  ”three replicas, different cabinets, same row”  Hierarchical Cluster Map  Storage devices are assigned weights to control the amount of data they are responsible for storing  Distributes data uniformly among weighted devices  Buckets can be composed arbitrarily to construct storage hierarchy  Data is placed in the hierarchy by recursively selecting nested bucket items via pseudo-random hash like functions
  • 13. 12 © Fujitsu 2014 Data placement with CRUSH  Files/bdevs striped over objects  4 MB objects by default  Objects mapped to placementgroups (PGs)  pgid = hash(object) & mask  PGs mapped to sets of OSDs  crush(cluster, rule, pgid) = [osd2, osd3]  Pseudo-random, statistically uniform distribution  ~100 PGs per node  Fast: O(log n) calculation, no lookups  Reliable: replicas span failure domains  Stable: adding/removing OSDs moves few PGs  A deterministic pseudo-random hash like function that distributes data uniformly among OSDs  Relies on compact cluster description for new storage target w/o consulting a central allocator … … … … … OSDs (grouped by f ailure domain) Objects PGs …File / Block
  • 14. 13 © Fujitsu 2014 Unified Storage for Cloud based on Ceph – Architecture and Principles
Ceph is the most comprehensive implementation of Unified Storage.
The Ceph difference: Ceph's CRUSH algorithm liberates storage clusters from the scalability and performance limitations imposed by centralized data table mapping. It replicates and re-balances data within the cluster dynamically, eliminating this tedious task for administrators, while delivering high performance and infinite scalability. http://ceph.com/ceph-storage http://www.inktank.com
 Ceph Storage Cluster (RADOS): a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
 Librados: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP (App)
 Ceph Object Gateway (RGW): a bucket-based REST gateway, compatible with S3 and Swift (App → Object)
 Ceph Block Device (RBD): a reliable and fully distributed block device, with a Linux kernel client and a QEMU/KVM driver (Host / VM → Virtual Disk)
 Ceph File System (CephFS): a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE (Client → Files & Dirs)
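Each access method above has a thin client-side entry point; a few illustrative commands (pool, image and user names are assumptions, not part of the test setup):

# rados -p data put obj1 ./file.bin                                  # object access via the RADOS CLI
# rados -p data get obj1 ./file.out
# rbd create vol1 --pool data --size 10240                           # 10 GB thin-provisioned block device
# radosgw-admin user create --uid=demo --display-name="Demo user"    # S3/Swift credentials for RGW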
  • 15. 14 © Fujitsu 2014 Ceph principles
[Diagram: clients on a redundant client interconnect (IP), 8 storage nodes each with 16x SAS and PCIe SSD on a redundant cluster interconnect (IP); access via librbd/VMs, Block App, Object App, File App]
Distributed Redundant Storage
 Intelligent data distribution across all nodes and spindles = wide striping (64KB – 16MB)
 Redundancy with replica=2, 3 … 8
 Thin provisioning
 Fast distributed rebuild
 Availability, fault tolerance
 Disk, node, interconnect
 Automatic rebuild
 Distributed HotSpare space
 Transparent Block, File access
 Reliability and consistency
 Scalable performance
 Pure PCIe-SSD for extreme transaction processing
  • 16. 15 © Fujitsu 2014 Ceph processes
[Diagram: the same cluster, now showing the processes – storage nodes running multiple OSD daemons, each with a journal (J); 3x MON; 2x MDS; clients on the client interconnect access via librbd/VMs, Block App, Object App, File App]
Distributed Redundant Storage (same principles as on the previous slide): wide striping (64KB – 16MB), redundancy with replica=2, 3 … 8, thin provisioning, fast distributed rebuild, availability and fault tolerance (disk, node, interconnect), automatic rebuild, distributed HotSpare space, transparent Block/File access, reliability and consistency, scalable performance, pure PCIe-SSD for extreme transaction processing
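A minimal ceph.conf sketch for such a process layout, assuming monitors and MDSs on dedicated hosts and OSD journals on SSD partitions; host names, addresses and device paths are illustrative, not the actual test deployment:

[global]
    auth cluster required = cephx
    osd pool default size = 3
[mon.a]
    host = rx37-1
    mon addr = 192.168.10.1:6789
[mds.a]
    host = rx37-1
[osd.0]
    host = rx37-3
    osd journal = /dev/sdq5          # journal partition on the PCIe SSD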
  • 17. 16 © Fujitsu 2014 Agenda  Introduction  Hardware / Software layout  Performance test cases  Hints and Challenges for Ceph  Summary
  • 18. 17 © Fujitsu 2014 Hardware test configuration
[Diagram: 6 storage nodes (rx37-3 … rx37-8), each with 3x PCIe SSD and 16x SAS, plus 2 load-generator nodes (rx37-1, rx37-2) running fio against RBD and CephFS; 3x MON and 2x MDS in the cluster; 56/40Gb IB and 40/10/1GbE client and cluster interconnects]
 rx37-[3-8]: Fujitsu 2U server RX300
 2x Intel(R) Xeon(R) E5-2630 @ 2.30GHz
 128GB RAM
 2x 1GbE onboard
 2x 10GbE Intel 82599EB
 1x 40Gb IB Intel TrueScale IBA7322 QDR InfiniBand HCA
 2x 56Gb IB Mellanox MT27500 family (configurable as 40GbE, too)
 3x Intel PCIe-SSD 910 Series 800GB
 16x SAS 6G 300GB HDD through LSI MegaRAID SAS 2108 [Liberator]
 rx37-[1-2]: same as above, but
 1x Intel(R) Xeon(R) E5-2630 @ 2.30GHz
 64GB RAM
 no SSDs, 2x SAS drives
  • 19. 18 © Fujitsu 2014 Which parameters to change / tune
(1) Frontend interface: ceph.ko, rbd.ko, ceph-fuse, rbd-wrapper in user land
(2) OSD object size of data: 64k, 4m
(3) Block device options for /dev/rbdX, /dev/sdY: scheduler, rq_affinity, rotational, read_ahead_kb
(4) Interconnect: 1 / 10 / 40 GbE, 40 / 56 Gb IB CM/DG
(5) Network parameters
(6) Journal: RAM disk, SSD
(7) OSD file system: xfs, btrfs
(8) OSD disk type: SAS, SSD
[Diagram: the test setup from the previous slide, with callout numbers marking where each parameter applies]
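Item (3) in the list above maps to standard sysfs knobs; a hedged example for one RBD client device and one data disk behind the RAID controller (device names and values are illustrative):

# echo noop > /sys/block/rbd0/queue/scheduler       # let the OSDs do the reordering
# echo 2 > /sys/block/rbd0/queue/rq_affinity        # complete requests on the submitting CPU
# echo 0 > /sys/block/rbd0/queue/rotational
# echo 4096 > /sys/block/rbd0/queue/read_ahead_kb   # larger read-ahead for sequential reads
# echo deadline > /sys/block/sdb/queue/scheduler    # OSD data disk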
  • 20. 19 © Fujitsu 2014 Software test configuration
 CentOS 6.4
 With vanilla kernel 3.8.13
 fio-2.0.13
 ceph version 0.61.7 cuttlefish
1GbE: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
1GbE: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
10GbE: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
10GbE: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
40Gb: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520
56Gb: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520
56Gb: ib2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520
40GbE: eth4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216
40GbE: eth5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216
# fdisk -l /dev/sdq
Device Boot Start End Blocks Id System
/dev/sdq1 1 22754 182767616 83 Linux
/dev/sdq2 22754 24322 12591320 f Ext'd
/dev/sdq5 22755 23277 4194304 83 Linux
/dev/sdq6 23277 23799 4194304 83 Linux
/dev/sdq7 23799 24322 4194304 83 Linux
#--- 3x journals on each SSD
# lsscsi
[0:2:0:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sda
[0:2:1:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sdb
[0:2:2:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sdc
[0:2:3:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sdd
[0:2:4:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sde
[0:2:5:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sdf
[0:2:6:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sdg
[0:2:7:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sdh
[0:2:8:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sdi
[0:2:9:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sdj
[0:2:10:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sdk
[0:2:11:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sdl
[0:2:12:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sdm
[0:2:13:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sdn
[0:2:14:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sdo
[0:2:15:0] disk LSI RAID 5/6 SAS 6G 2.12 /dev/sdp
[1:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdq
[1:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdr
[1:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sds
[1:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdt
[2:0:0:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdu
[2:0:1:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdv
[2:0:2:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdw
[2:0:3:0] disk INTEL(R) SSD 910 200GB a411 /dev/sdx
[3:0:0:0] disk INTEL(R) SSD 910 200GB a40D /dev/sdy
[3:0:1:0] disk INTEL(R) SSD 910 200GB a40D /dev/sdz
[3:0:2:0] disk INTEL(R) SSD 910 200GB a40D /dev/sdaa
[3:0:3:0] disk INTEL(R) SSD 910 200GB a40D /dev/sdab
  • 21. 20 © Fujitsu 2014 Agenda  Introduction  Hardware / Software layout  Performance test cases  Hints and Challenges for Ceph  Summary
  • 22. 21 © Fujitsu 2014 Performance test tools
 qperf $IP_ADDR -t 10 -oo msg_size:1:8M:*2 -v tcp_lat | tcp_bw | rc_rdma_write_lat | rc_rdma_write_bw | rc_rdma_write_poll_lat
 fio --filename=$RBD | --directory=$MDIR --direct=1 --rw=$io --bs=$bs --size=10G --numjobs=$threads --runtime=60 --group_reporting --name=file1 --output=fio_${io}_${bs}_${threads}
 RBD=/dev/rbdX, MDIR=/cephfs/fio-test-dir
 io=write,randwrite,read,randread
 bs=4k,8k,4m,8m
 threads=1,64,128
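One concrete instantiation of the matrix above, e.g. 4k random writes with 64 jobs against a mapped RBD (device name and output file picked for illustration):

# fio --filename=/dev/rbd0 --direct=1 --rw=randwrite --bs=4k --size=10G \
      --numjobs=64 --runtime=60 --group_reporting --name=file1 --output=fio_randwrite_4k_64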
  • 23. 22 © Fujitsu 2014 qperf - latency
[Chart: tcp_lat (µs) over message size (bytes) for 1GbE, 10GbE, 40GbE (MTU 1500 and 9216), 40Gb IPoIB_CM, 56Gb IPoIB_CM, 56Gb IB rdma_write_lat and rdma_write_poll_lat]
  • 24. 23 © Fujitsu 2014 qperf - bandwidth
[Chart: tcp_bw (MB/s) over message size (bytes) for the same interconnects, plus 56Gb IB rdma_write]
  • 25. 24 © Fujitsu 2014 qperf - explanation
 tcp_lat is almost the same for blocks <= 128 bytes
 tcp_lat is very similar between 40GbE, 40Gb IPoIB and 56Gb IPoIB
 A significant difference between 1 / 10GbE shows only for blocks >= 4k
 Better latency on IB can only be achieved with rdma_write / rdma_write_poll
 Bandwidth on 1 / 10GbE is very stable at the possible maximum
 The 40GbE implementation (MLX) shows unexpected fluctuation
 Under IB, the maximum transfer rate can only be achieved with RDMA
 Use Sockets over RDMA
 Options without big code changes are: SDP, rsocket, SMC-R, accelio
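rsocket is the least invasive of the listed options because it can be preloaded under an unmodified binary; a sketch assuming librdmacm's preload library is installed (the path varies by distribution), demonstrated here with qperf rather than the Ceph daemons:

# LD_PRELOAD=/usr/lib64/rsocket/librspreload.so qperf $IP_ADDR -t 10 -oo msg_size:1:8M:*2 tcp_bw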
  • 26. 25 © Fujitsu 2014 Frontend Interfaces (OSD-FS: xfs, OSD object size: 4m, Journal: SSD, Network: 56Gb IPoIB_CM, OSD disk: SSD)
[Chart: IOPS for ceph-ko, ceph-fuse, rbd-ko, rbd-wrap]
  • 27. 26 © Fujitsu 2014 Frontend Interfaces (OSD-FS: xfs, OSD object size: 4m, Journal: SSD, Network: 56Gb IPoIB_CM, OSD disk: SSD)
[Chart: MB/s for ceph-ko, ceph-fuse, rbd-ko, rbd-wrap]
  • 28. 27 © Fujitsu 2014 Frontend Interfaces (OSD-FS: xfs, OSD object size: 4m, Journal: SSD, Network: 56Gb IPoIB_CM, OSD disk: SSD)
 ceph-fuse seems not to support multiple parallel I/Os with O_DIRECT, which drops the performance significantly
 rbd-wrap (= rbd in user land) shows some advantages on IOPS, but not enough to replace rbd.ko
 ceph.ko is excellent on sequential read IOPS, presumably because of the read-ahead of complete 4m objects
 Stay with the official interfaces ceph.ko / rbd.ko to give fio the needed access to File and Block
 rbd.ko has some room for performance improvement on IOPS
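For reference, the frontends compared here are exercised roughly like this; monitor address, pool, image and mount points are placeholders, not the actual test values:

# rbd map testvol --pool rbd                                          # kernel RBD, appears as /dev/rbdX
# mount -t ceph 192.168.10.1:6789:/ /cephfs \
        -o name=admin,secretfile=/etc/ceph/admin.secret               # kernel CephFS (ceph.ko)
# ceph-fuse -m 192.168.10.1:6789 /cephfs-fuse                         # FUSE client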
  • 29. 28 © Fujitsu 2014 OSD-FS xfs / btrfs (ceph-ko and rbd-ko, OSD object size: 4m, Journal: SSD, Network: 56Gb IPoIB_CM, OSD disk: SSD)
[Chart: IOPS for xfs ceph-ko, btrfs ceph-ko, xfs rbd-ko, btrfs rbd-ko]
  • 30. 29 © Fujitsu 2014 OSD-FS xfs / btrfs (ceph-ko and rbd-ko, OSD object size: 4m, Journal: SSD, Network: 56Gb IPoIB_CM, OSD disk: SSD)
[Chart: MB/s for xfs ceph-ko, btrfs ceph-ko, xfs rbd-ko, btrfs rbd-ko]
  • 31. 30 © Fujitsu 2014 OSD-FS xfs / btrfs (ceph-ko and rbd-ko, OSD object size: 4m, Journal: SSD, Network: 56Gb IPoIB_CM, OSD disk: SSD)
 Six months ago, with kernel 3.0.x, btrfs revealed some weaknesses on writes
 With kernel 3.6 some essential enhancements were made to the btrfs code, so almost no differences could be identified with our kernel 3.8.13
 btrfs is used in the next test cases, because it has the more promising storage features: compression, data deduplication
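If btrfs is chosen for the OSD file systems, the relevant settings are plain mkfs/mount options plus the matching ceph.conf keys; a hedged sketch (device path and mount options are illustrative, not the values used in the tests):

# mkfs.btrfs -f /dev/sdb
# in ceph.conf:
[osd]
    osd mkfs type = btrfs
    osd mount options btrfs = rw,noatime,compress=lzo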
  • 32. 31 © Fujitsu 2014 OSD object 64k / 4m (rbd-ko, OSD-FS: btrfs, Journal: RAM, Network: 56Gb IPoIB_CM, OSD disk: SSD)
[Charts: IOPS and MB/s for 64k vs. 4m OSD object size]
  • 33. 32 © Fujitsu 2014 OSD object 64k / 4m (rbd-ko, OSD-FS: btrfs, Journal: RAM, Network: 56Gb IPoIB_CM, OSD disk: SSD)
 Small chunks of 64k especially increase 4k/8k sequential IOPS, for reads as well as for writes
 For random IO, 4m chunks are as good as 64k chunks
 But the usage of 64k chunks results in very low bandwidth for 4m/8m blocks
 Create each volume with the appropriate OSD object size: ~64k if small sequential IOPS dominate, otherwise stay with ~4m
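For RBD, the object size is set per image at creation time via --order (2^order bytes), so 64k and 4m volumes can coexist in the same pool; an illustrative sketch with made-up image names:

# rbd create iopsvol --pool rbd --size 10240 --order 16   # 2^16 = 64 KB objects, for small sequential IOPS
# rbd create bwvol --pool rbd --size 10240 --order 22     # 2^22 = 4 MB objects (default), for bandwidth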
  • 34. 33 © Fujitsu 2014 Journal RAM / SSD (ceph-ko and rbd-ko, OSD-FS: btrfs, OSD object size: 4m, Network: 56Gb IPoIB_CM, OSD disk: SSD)
[Chart: IOPS for j-RAM ceph-ko, j-SSD ceph-ko, j-RAM rbd-ko, j-SSD rbd-ko]
  • 35. 34 © Fujitsu 2014 Journal RAM / SSD (ceph-ko and rbd-ko, OSD-FS: btrfs, OSD object size: 4m, Network: 56Gb IPoIB_CM, OSD disk: SSD)
[Chart: MB/s for j-RAM ceph-ko, j-SSD ceph-ko, j-RAM rbd-ko, j-SSD rbd-ko]
  • 36. 35 © Fujitsu 2014 Journal RAM / SSD (ceph-ko and rbd-ko, OSD-FS: btrfs, OSD object size: 4m, Network: 56Gb IPoIB_CM, OSD disk: SSD)
 Only under heavy write-bandwidth load can RAM show its predominance when used as a journal
 The SSD has to handle twice the load when both the journal and the data are written to it
  • 37. 36 © Fujitsu 2014 Network (ceph-ko, OSD-FS: btrfs, OSD object size: 4m, Journal: SSD, OSD disk: SSD)
[Chart: IOPS for 1GbE, 10GbE, 40GbE, 40Gb IPoIB_CM, 56Gb IPoIB_CM]
  • 38. 37 © Fujitsu 2014 Network (ceph-ko, OSD-FS: btrfs, OSD object size: 4m, Journal: SSD, OSD disk: SSD)
[Chart: MB/s for 1GbE, 10GbE, 40GbE, 40Gb IPoIB_CM, 56Gb IPoIB_CM]
  • 39. 38 © Fujitsu 2014 Network (ceph-ko, OSD-FS: btrfs, OSD object size: 4m, Journal: SSD, OSD disk: SSD)
 For write IOPS, 1GbE is doing extremely well
 For sequential read IOPS there is nearly no difference between 10GbE and 56Gb IPoIB
 On the bandwidth side, read performance is closely in sync with the possible speed of the network; only the TrueScale IB has some weaknesses, because it was designed for HPC and not for storage/streaming
 If you only look at IOPS, 10GbE is a good choice
 If throughput is relevant for your use case, you should go for 56Gb IB
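The IPoIB results above rely on connected mode (CM) to reach the 65520-byte MTU shown in the software configuration; a sketch of the per-port setup (interface name is illustrative):

# echo connected > /sys/class/net/ib0/mode
# ip link set ib0 mtu 65520
# cat /sys/class/net/ib0/mode
connected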
  • 40. 39 © Fujitsu 2014 OSD-DISK SAS / SSD (ceph-ko and rbd-ko, OSD-FS: btrfs, OSD object size: 4m, Journal: SSD, Network: 56Gb IPoIB_CM)
[Chart: IOPS for SAS ceph-ko, SSD ceph-ko, SAS rbd-ko, SSD rbd-ko]
  • 41. 40 © Fujitsu 2014 OSD-DISK SAS / SSD (ceph-ko and rbd-ko, OSD-FS: btrfs, OSD object size: 4m, Journal: SSD, Network: 56Gb IPoIB_CM)
[Chart: MB/s for SAS ceph-ko, SSD ceph-ko, SAS rbd-ko, SSD rbd-ko]
  • 42. 41 © Fujitsu 2014 OSD object 64k / 4m, IPoIB CM / DG (rbd-ko, OSD-FS: btrfs, Journal: SSD, Network: 56Gb IPoIB, OSD disk: SAS / SSD)
[Chart: IOPS for SAS 4m CM, SSD 4m CM, SAS 64k DG, SSD 64k DG]
  • 43. 42 © Fujitsu 2014 OSD object 64k / 4m, IPoIB CM / DG (rbd-ko, OSD-FS: btrfs, Journal: SSD, Network: 56Gb IPoIB, OSD disk: SAS / SSD)
[Chart: MB/s for SAS 4m CM, SSD 4m CM, SAS 64k DG, SSD 64k DG]
  • 44. 43 © Fujitsu 2014 OSD-DISK SAS / SSD (ceph-ko and rbd-ko, OSD-FS: btrfs, OSD object 64k / 4m, Journal: SSD, Network: 56Gb IPoIB CM / DG)
 SAS drives are doing much better than expected; only on small random writes are they significantly slower than SSD
 In this comparison it has to be taken into account that the SSD is getting twice as many writes (journal + data)
 2.5" 10k 6G SAS drives seem to be an attractive alternative in combination with an SSD for the journal
  • 45. 44 © Fujitsu 2014 Agenda  Introduction  Hardware / Software layout  Performance test cases  Hints and Challenges for Ceph  Summary
  • 46. 45 © Fujitsu 2014 Ceph HA/DR Design
[Diagram: two sites, each with multiple storage nodes (SAS + SSD), their own redundant client interconnect (IP) and redundant cluster interconnect (IP); File/Block apps run at both sites]
 2 replicas will be placed locally
 The 3rd replica will go to the remote site
 On a site failure, all data will be available at the remote site to minimize the RTO (Recovery Time Objective)
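The "2 replicas local, 3rd replica remote" policy can be expressed as a CRUSH rule with two take/emit blocks; a sketch assuming two root-level buckets named site-a and site-b (names are placeholders):

rule two_plus_one {
    ruleset 2
    type replicated
    min_size 3
    max_size 3
    step take site-a
    step chooseleaf firstn 2 type host   # two replicas on distinct local hosts
    step emit
    step take site-b
    step chooseleaf firstn 1 type host   # third replica at the remote site
    step emit
}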
  • 47. 46 © Fujitsu 2014 Ceph HA/DR Design
[Diagram: Gold, Silver and Bronze pools spread their volumes (primary, 2nd, 3rd, 4th replica) across both sites; clients with File/Block apps attach at either site. A class table associates device types (SSD, SAS, SATA) with service classes (Gold, Steel, Silver, Bronze) and redundancy levels (replica=2, 3, 4 or erasure code)]
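Creating such service classes boils down to one pool per class with its own replica count and CRUSH ruleset; an illustrative sketch (pool names, PG counts and rule IDs are assumptions; crush_ruleset is called crush_rule in newer releases):

# ceph osd pool create gold 1024 1024
# ceph osd pool set gold size 3
# ceph osd pool set gold crush_ruleset 1        # rule that selects SSD-backed hosts
# ceph osd pool create bronze 1024 1024
# ceph osd pool set bronze size 4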
  • 48. 47 © Fujitsu 2014 Consider balanced components (repli=3)
[Diagram: storage node with the journal on PCIe SSD and the OSDs on SAS behind a RAID controller; 1GB/s from the client interconnect turns into 2GB/s streams towards journal and disks via memory and the cluster interconnect]
(1) 1GB/s of writes via the client interconnect results in 6GB/s of bandwidth to the internal disks
(2) RAID controllers today contain ~2GB of protected DRAM, which allows pre-acknowledging writes (= write-back) to SAS/SATA disks
(3) The design of using a fast journal in front of raw HDDs makes sense, but today there should be an option to disable it
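Worked out, the 6GB/s in (1) follows from replication and journaling, assuming replica=3 and journal plus data on the same storage nodes:

1 GB/s of client writes × 3 replicas = 3 GB/s written by the OSDs in total (1 GB/s arrives from the clients, ~2 GB/s crosses the cluster interconnect)
3 GB/s × 2 (journal write + data write) = 6 GB/s of aggregate bandwidth to the internal devices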
  • 49. 48 © Fujitsu 2014 Architecture Vision for Unified Storage
[Diagram: clients / hosts and VMs / apps reach the OSDs and MDSs through three paths: NFS/CIFS via the VFS and Ceph, S3/Swift/KVS via HTTP and the RADOS gateway, FCoE/iSCSI via LIO and RBD]
  • 50. 49 © Fujitsu 2014 Challenge: How to enter existing Clouds?
 ~70% of all OS instances are running as guests
 In ~90% of the cases the 1st choice is VMware ESX, 2nd XEN + Hyper-V
 In most cases conventional shared RAID arrays attached through FC or iSCSI are used as block storage
 VSAN and Storage Spaces as SW-defined storage are already included
 An iSCSI target or even an FC target in front of Ceph RBD can be attached to existing VMs, but will enough value remain?
 So, what is the best approach to enter existing Clouds?
 Multi-tenancy on File, Block and Object level access and administration
 Accounting & billing
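One hedged sketch of the iSCSI-target path mentioned above, using LIO/targetcli on a gateway node that has the image mapped; pool, image names and IQN are made up, and targetcli syntax differs slightly between versions:

# rbd map vol1 --pool silver                       # appears as /dev/rbdX (or /dev/rbd/silver/vol1 with udev rules)
# targetcli /backstores/block create name=vol1 dev=/dev/rbd/silver/vol1
# targetcli /iscsi create iqn.2014-02.com.example:ceph-vol1
# targetcli /iscsi/iqn.2014-02.com.example:ceph-vol1/tpg1/luns create /backstores/block/vol1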
  • 51. 50 © Fujitsu 2014 Agenda  Introduction  Hardware / Software layout  Performance test cases  Hints and Challenges for Ceph  Summary
  • 52. 51 © Fujitsu 2014 Findings & Hints
 Ceph is the most comprehensive implementation of Unified Storage. Ceph simulates "distributed swarm intelligence", which arises from simple rules that are followed by individual processes and does not involve any central coordination.
 The CRUSH algorithm acts as an enabler for a controlled, scalable, decentralized placement of replicated data.
 Choosing the right mix of Ceph & system parameters will result in a performance push
 Keep the data bandwidth balanced between client/cluster network, journal and OSD devices
 Likely the Ceph code has some room for improvement to achieve higher throughput
  • 53. 52 © Fujitsu 2014 Ceph principles – translated to value
 Wide striping: all available resources of the storage devices (capacity, bandwidth, IOPS of SSD, SAS, SATA) and all available resources of the nodes (interconnect bandwidth, buffer & cache of the RAID controller, RAM) are at the disposal of every single application (~ federated cache)
 Replication (2, 3 … 8, EC) available at per-pool granularity. Volumes and directories can be assigned to pools.
 Intelligent data placement based on a defined hierarchical storage topology map
 Thin provisioning: allocate the capacity needed, then grow the space dynamically and on demand
 Fast distributed rebuild: all storage devices help to rebuild the lost capacity of a single device, because the redundant data is distributed over multiple other devices. This results in the shortest rebuild times (hours, not days)
 Auto-healing of lost redundancies (interconnect, node, disk), including auto-rebuild using distributed Hot Spare space
 Ease of use (see the sketch below)
 create volume/directory size=XX class=SSD|SAS|SATA type=IOPS|bandwidth
 resize volume size=YY
 Unified and transparent Block, File and Object access using any block device: node-internal special block device, internal SCSI device from a RAID controller, or external SCSI LUNs from traditional RAID arrays
 Scalable by adding capacity and/or performance/bandwidth/IOPS, including auto-rebalance
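The "create volume / resize volume" operations above are pseudo-syntax; once per-class pools exist, they map onto a handful of stock commands, sketched here with made-up names and sizes:

# rbd create vol01 --pool gold --size 51200      # 50 GB thin-provisioned volume in the SSD-backed 'gold' pool
# rbd resize vol01 --pool gold --size 102400     # grow the volume to 100 GB on demand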
  • 55. 54 © Fujitsu 2014 Fujitsu Technology Solutions, CTO Data Center Infrastructure and Services, Dieter.Kasper@ts.fujitsu.com