HKG15-401: Ceph and Software Defined Storage on ARM servers
---------------------------------------------------
Speakers: Yazen Ghannam, Steve Capper
Date: February 12, 2015
---------------------------------------------------
★ Session Summary ★
Running Ceph in the Linaro colocation cluster, and ongoing optimizations
--------------------------------------------------
★ Resources ★
Pathable: https://hkg15.pathable.com/meetings/250828
Video: https://www.youtube.com/watch?v=RdZojLL7ttk
Etherpad: http://pad.linaro.org/p/hkg15-401
---------------------------------------------------
★ Event Details ★
Linaro Connect Hong Kong 2015 - #HKG15
February 9-13th, 2015
Regal Airport Hotel Hong Kong Airport
---------------------------------------------------
http://www.linaro.org
http://connect.linaro.org
1. Ceph and software defined storage on ARM Servers
Presented by Yazen Ghannam <yazen.ghannam@linaro.org> and Steve Capper <steve.capper@linaro.org>
February 12, 2015
2. Outline
● Part 1: Introduction to Ceph
○ What is Ceph?
○ Lightning Introduction to Ceph Architecture
○ Replication
● Part 2: Linaro Work
○ Motivations & Goals
○ Linaro Austin Colocation Cluster
○ Performance Testing
○ Optimization Opportunities
○ Encountered Issues/Current Limitations
○ Future Work
○ Q & A
4. What is Ceph?
● Ceph is a distributed object store with no single point of failure.
● It scales up to exabyte levels of storage and runs on commodity hardware.
● Ceph data are exposed as follows:
○ Ceph Object Store: RESTful interface with Amazon S3 and OpenStack Swift compliant APIs.
○ Ceph Block Device: Linux kernel driver available for clients. Also has libvirt support.
○ Ceph Filesystem: Linux kernel driver available for clients. Also has FUSE support.
5. connect.linaro.org
At the host level...
● We have Object Storage Devices (OSDs) and Monitors.
○ Monitors keep track of the components of the Ceph cluster (i.e.,
where the OSDs are).
○ The device, host, rack, row, and room are stored by the Monitors and
used to compute a failure domain.
○ OSDs store the Ceph data objects.
● A host can run multiple OSDs, but it needs to be appropriately specced.
Lightning Introduction to Ceph Architecture (1)
5
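The hierarchy above is recorded in the cluster's CRUSH map. As a loose sketch only (the bucket names, IDs, and weights here are invented for illustration), a decompiled CRUSH map expresses it roughly like this:

    # one OSD on one host, which sits in one rack
    host node1 {
        id -2                      # bucket IDs are negative
        alg straw
        item osd.0 weight 1.00
    }
    rack rack1 {
        id -3
        alg straw
        item node1 weight 1.00
    }

Placement rules can then require replicas to land on different hosts or racks, which is what turns this hierarchy into failure domains.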
6. Lightning Introduction to Ceph Architecture (2)
At the block device level…
● An Object Storage Device (OSD) can be an entire drive, a partition, or a folder.
● OSDs must be formatted in ext4, XFS, or btrfs (experimental); see the deployment sketch below.
[Diagram: four Drive/Partition → Filesystem → OSD stacks feeding into Pools]
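A minimal sketch of standing up one such OSD with ceph-deploy (hostname and device are hypothetical; ceph-deploy partitions and formats the drive itself, defaulting to XFS in this era):

    # prepare a whole drive as an OSD on host node1
    ceph-deploy osd prepare node1:/dev/sdb
    # bring the freshly prepared OSD into the cluster
    ceph-deploy osd activate node1:/dev/sdb1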
7. Lightning Introduction to Ceph Architecture (3)
At the data organization level...
● Data are partitioned into pools.
● Pools contain a number of Placement Groups (PGs).
● Ceph data objects map to PGs (the PG is chosen by hashing the object name, modulo the pool's PG count; see the example below).
● PGs then map to multiple OSDs.
[Diagram: pool "mydata" holds objects that map into PG #1 and PG #2, each of which maps onto several OSDs]
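On a live cluster this mapping can be inspected directly; a hedged example (pool and object names hypothetical):

    # print the PG that "obj1" in pool "mydata" hashes into,
    # and the OSDs that PG currently maps onto
    ceph osd map mydata obj1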
8. Lightning Introduction to Ceph Architecture (4)
At the client level…
● Objects can be accessed directly.
● Objects can be accessed through the Ceph Object Gateway.
● Pools can be used for CephFS (requires 2 pools: data & metadata; see the sketch below).
● Pools can be used to create RADOS Block Devices.
[Diagram: pools backing a RADOS Block Device (with a filesystem on top), CephFS, direct clients, and the Ceph Object Gateway]
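A minimal sketch of the two-pool CephFS setup (pool names and PG counts are arbitrary; "ceph fs new" is the filesystem-creation command in roughly this era of Ceph):

    # CephFS needs separate data and metadata pools
    ceph osd pool create cephfs_data 128
    ceph osd pool create cephfs_metadata 128
    # tie the two pools together into a filesystem
    ceph fs new cephfs cephfs_metadata cephfs_data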
10. Replication
● Ceph pools, by default, are configured to replicate data between OSDs.
● This allows us to lose some OSDs and not lose data.
● The replication level states how many instances of the object are to reside on OSDs (set per pool, as sketched below).
● Large objects will consume significant amounts of cumulative disk space if replicated.
● An alternative to replication is to adopt erasure coding.
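A hedged example of setting the replication level (pool name hypothetical), asking for three copies of every object:

    # keep 3 replicas of each object in pool "mydata"
    ceph osd pool set mydata size 3
    # refuse I/O when fewer than 2 replicas are available
    ceph osd pool set mydata min_size 2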
12. Motivation & Goals
● Motivation
○ Ceph is intended to be massively scalable and to be used with commodity hardware.
○ Ceph clusters would ideally have lots of I/O (storage and network).
○ Ceph is a large system that interacts with many different pieces of software and hardware (e.g., kernel, libraries, network).
○ Enterprise ARMv8 vendors are targeting the high-density, highly-scalable storage solutions market with relatively strong cores and lots of available I/O.
● Goals
○ Bring up a simple Ceph cluster on commodity ARMv8 hardware.
○ Look for CPU hotspots during performance testing.
■ Start with simple workloads, especially those that are part of Ceph.
○ Focus on optimizations specific to AArch64.
13. Linaro Austin Colocation Cluster (1): Hardware
● 4 systems
○ AMD Opteron A1100 (codenamed Seattle) x3
■ With Cryptographic Extension and CRC
■ 16GB RAM
■ 10GbE available
■ Monitor/OSD nodes
○ APM X-Gene Mustang
■ 16GB RAM
■ Client/Admin node
● Each node has 1 hard drive
○ 500GB 7200RPM
○ OSD partition (390GB)
● Each node has 1 ethernet connection on a 1Gb network
[Diagram: three AMD Seattle nodes, each running a MON and an OSD (node 1 also runs the MDS), plus an APM X-Gene Mustang admin/client node, all attached to one switch]
14. Linaro Austin Colocation Cluster (2): Software
● Fedora 21
● Linux kernel 3.17
○ arm64 CRC32 module (available in 3.19)
● Ceph v0.91
○ RPM packages built from the Ceph git repo
○ Local YUM repo serving updated packages to the cluster
● ceph-deploy
○ Used to easily deploy the Ceph cluster, including package installs, setting up keys, etc. (see the sketch below)
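A hedged sketch of the kind of ceph-deploy sequence this implies (hostnames are hypothetical, and exact flags vary between ceph-deploy releases):

    # define a new cluster whose initial monitors run on the three Seattle nodes
    ceph-deploy new node1 node2 node3
    # install Ceph packages (here, from the local YUM repo) on every node
    ceph-deploy install node1 node2 node3 mustang
    # create the initial monitors and gather their keys
    ceph-deploy mon create-initial
    # push admin credentials so "ceph" commands work from each node
    ceph-deploy admin node1 node2 node3 mustang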
15. Performance Testing (1): Collecting Data
● Single Node
○ “perf record {workload}” (a concrete example follows below)
● Cluster
○ Client Node: Execute {workload}
○ OSD Nodes: “perf top” for all system data.
○ OSD Nodes: “perf top -p {osd pid}” for only the OSD process.
■ Useful when there is little system load.
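For concreteness, a hedged example of the single-node form using call-graph sampling ("rados bench" ships with Ceph; the pool name and duration are arbitrary):

    # sample the whole workload, recording call chains
    perf record -g rados bench -p data 60 write
    # browse the recorded hotspots afterwards
    perf report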
17. Performance Testing (3): Workloads, Other
● dd (a filled-in example follows below)
○ rbd create name --size #MBs
○ rbd map name -p rbd
○ mkfs [options] /dev/rbd/rbd/name
○ mount /dev/rbd/rbd/name /mnt
○ dd if=/dev/zero of=/mnt/zerofile [options]
● Write a lot of objects
○ for file in a_lot_of_files/*; do
      rados put "obj-$(basename "$file")" "$file" -p data
  done
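One concrete (entirely hypothetical) filling-in of the dd sequence above, pushing ~4GB of zeroes through a freshly created RBD image:

    rbd create test --size 4096                 # 4096 MB image
    rbd map test -p rbd                         # appears as /dev/rbd/rbd/test
    mkfs.ext4 /dev/rbd/rbd/test
    mount /dev/rbd/rbd/test /mnt
    dd if=/dev/zero of=/mnt/zerofile bs=4M count=1000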
18. Optimization Opportunities
● Known
○ CRC32C (see the hardware feature check below)
■ Ceph (upstreaming)
■ Linux kernel (already upstream, should arrive in 3.19)
● Possible
○ How memcpy is called (it is a CPU hotspot)
○ tcmalloc
○ Boost C++ libraries
○ rocksdb
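The hardware CRC32C path only helps when the CPU advertises the feature; a quick hedged check on an AArch64 machine (hwcap names come from the kernel and can vary by version):

    # look for "crc32" (and "aes" from the Cryptographic Extension) in the hwcaps
    grep -m1 Features /proc/cpuinfo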
19. Encountered Issues/Current Limitations
● Issues
○ Linux perf symbol decode
○ Python 2.7 hang when starting Ceph (now fixed)
● Limitations
○ I/O bound with a single OSD on a 7200RPM hard drive with a 1Gb network
■ Ideal: 8+ SSDs per node, each SSD with an individual OSD
■ Ideal: 10Gb network to support the nodes
○ Only several nodes forming a Ceph cluster (due to lack of hardware)
■ Ideal: 10+ nodes forming a cluster
20. Future Work
● Teuthology (for ceph-qa)
● More workload profiling:
○ CephFS
○ Ceph Object Gateway (radosgw)
● Ceph prerequisites that could be investigated on AArch64:
○ Boost C++ Libraries
○ tcmalloc
24. Erasure coding - An Example
● Given a 1GB object, let’s split it into 2 x 512MB chunks (A and B).
● Now, let’s introduce a third 512MB chunk P (for parity), and compute each individual byte P[i] as follows:
P[i] = A[i] ^ B[i]
● We can now lose any one of A, B, or P and still reconstruct our original data: XOR is its own inverse, so a lost B, for instance, is rebuilt as B[i] = A[i] ^ P[i] (a worked example follows below).
● To get this level of redundancy with replication requires 2GB of disk space, as opposed to 1.5GB with our parity coding.
[Diagram: the object split into chunks A and B, then stored as A, B, and parity chunk P]
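A one-byte worked instance (values arbitrary):

P[i] = A[i] ^ B[i] = 0x5A ^ 0x3C = 0x66
A[i] ^ P[i] = 0x5A ^ 0x66 = 0x3C = B[i]   (B recovered after losing it)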
25. Erasure coding generalized
● Erasure codes can get more elaborate. One can split an object into k data chunks and compute m coding chunks.
● This allows us to lose up to m chunks before data loss.
● The object will reside on k + m OSDs.
● Whether or not to use erasure coding is configured per pool (see the sketch below).
● The mathematics gets more complicated as m is increased and requires specialized Galois Field arithmetic routines.
● Thankfully, these have already been ported over to ARM (both 32-bit and 64-bit) using NEON.
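A hedged sketch of the per-pool configuration (profile and pool names are hypothetical; k=2, m=1 mirrors the parity example on the previous slide):

    # define an erasure-code profile: 2 data chunks plus 1 coding chunk
    ceph osd erasure-code-profile set myprofile k=2 m=1
    # create a pool that uses the profile instead of replication
    ceph osd pool create ecpool 128 128 erasure myprofile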
26. More about Linaro: http://www.linaro.org/about/
More about Linaro engineering: http://www.linaro.org/engineering/
How to join: http://www.linaro.org/about/how-to-join
Linaro members: www.linaro.org/members