HKG15-401: Ceph and Software Defined Storage on ARM servers
---------------------------------------------------
Speakers: Yazen Ghannam, Steve Capper
Date: February 12, 2015
---------------------------------------------------
★ Session Summary ★
Running Ceph in the Linaro colocation cluster, and ongoing optimizations
--------------------------------------------------
★ Resources ★
Pathable: https://hkg15.pathable.com/meetings/250828
Video: https://www.youtube.com/watch?v=RdZojLL7ttk
Etherpad: http://pad.linaro.org/p/hkg15-401
---------------------------------------------------
★ Event Details ★
Linaro Connect Hong Kong 2015 - #HKG15
February 9-13th, 2015
Regal Airport Hotel Hong Kong Airport
---------------------------------------------------
http://www.linaro.org
http://connect.linaro.org
1. Ceph and software defined storage on ARM Servers
Presented by Yazen Ghannam <yazen.ghannam@linaro.org> and Steve Capper <steve.capper@linaro.org>
February 12, 2015
2. Outline
● Part 1: Introduction to Ceph
○ What is Ceph?
○ Lightning Introduction to Ceph Architecture
○ Replication
● Part 2: Linaro Work
○ Motivations & Goals
○ Linaro Austin Colocation Cluster
○ Performance Testing
○ Optimization Opportunities
○ Encountered Issues/Current Limitations
○ Future Work
○ Q & A
4. What is Ceph?
● Ceph is a distributed object store with no single point of failure.
● It scales up to exabyte levels of storage and runs on commodity hardware.
● Ceph data are exposed as follows:
○ Ceph Object Store: RESTful interface with Amazon S3 and OpenStack Swift compliant APIs.
○ Ceph Block Device: Linux kernel driver available for clients. Also has libvirt support.
○ Ceph Filesystem: Linux kernel driver available for clients. Also has FUSE support.
5. connect.linaro.org
At the host level...
● We have Object Storage Devices (OSDs) and Monitors.
○ Monitors keep track of the components of the Ceph cluster (i.e.,
where the OSDs are).
○ The device, host, rack, row, and room are stored by the Monitors and
used to compute a failure domain.
○ OSDs store the Ceph data objects.
● A host can run multiple OSDs, but it needs to be appropriately specced.
Lightning Introduction to Ceph Architecture (1)
5
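The hierarchy above is recorded in the cluster's CRUSH map. As a loose sketch only (the bucket names, IDs, and weights here are invented for illustration), a decompiled CRUSH map expresses it roughly like this:

    # one OSD on one host, which sits in one rack
    host node1 {
        id -2                      # bucket IDs are negative
        alg straw
        item osd.0 weight 1.00
    }
    rack rack1 {
        id -3
        alg straw
        item node1 weight 1.00
    }

Placement rules can then require replicas to land on different hosts or racks, which is what turns this hierarchy into failure domains.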
6. Lightning Introduction to Ceph Architecture (2)
At the block device level…
● An Object Storage Device (OSD) can be an entire drive, a partition, or a folder.
● OSDs must be formatted in ext4, XFS, or btrfs (experimental); see the deployment sketch below.
[Diagram: four Drive/Partition → Filesystem → OSD stacks feeding into Pools]
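A minimal sketch of standing up one such OSD with ceph-deploy (hostname and device are hypothetical; ceph-deploy partitions and formats the drive itself, defaulting to XFS in this era):

    # prepare a whole drive as an OSD on host node1
    ceph-deploy osd prepare node1:/dev/sdb
    # bring the freshly prepared OSD into the cluster
    ceph-deploy osd activate node1:/dev/sdb1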
7. Lightning Introduction to Ceph Architecture (3)
At the data organization level...
● Data are partitioned into pools.
● Pools contain a number of Placement Groups (PGs).
● Ceph data objects map to PGs (the PG is chosen by hashing the object name, modulo the pool's PG count; see the example below).
● PGs then map to multiple OSDs.
[Diagram: pool "mydata" holds objects that map into PG #1 and PG #2, each of which maps onto several OSDs]
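On a live cluster this mapping can be inspected directly; a hedged example (pool and object names hypothetical):

    # print the PG that "obj1" in pool "mydata" hashes into,
    # and the OSDs that PG currently maps onto
    ceph osd map mydata obj1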
8. Lightning Introduction to Ceph Architecture (4)
At the client level…
● Objects can be accessed directly.
● Objects can be accessed through the Ceph Object Gateway.
● Pools can be used for CephFS (requires 2 pools: data & metadata; see the sketch below).
● Pools can be used to create RADOS Block Devices.
[Diagram: pools backing a RADOS Block Device (with a filesystem on top), CephFS, direct clients, and the Ceph Object Gateway]
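A minimal sketch of the two-pool CephFS setup (pool names and PG counts are arbitrary; "ceph fs new" is the filesystem-creation command in roughly this era of Ceph):

    # CephFS needs separate data and metadata pools
    ceph osd pool create cephfs_data 128
    ceph osd pool create cephfs_metadata 128
    # tie the two pools together into a filesystem
    ceph fs new cephfs cephfs_metadata cephfs_data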
10. Replication
● Ceph pools, by default, are configured to replicate data between OSDs.
● This allows us to lose some OSDs and not lose data.
● The replication level states how many instances of the object are to reside on OSDs (set per pool, as sketched below).
● Large objects will consume significant amounts of cumulative disk space if replicated.
● An alternative to replication is to adopt erasure coding.
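A hedged example of setting the replication level (pool name hypothetical), asking for three copies of every object:

    # keep 3 replicas of each object in pool "mydata"
    ceph osd pool set mydata size 3
    # refuse I/O when fewer than 2 replicas are available
    ceph osd pool set mydata min_size 2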
12. Motivation & Goals
● Motivation
○ Ceph is intended to be massively scalable and to be used with commodity hardware.
○ Ceph clusters would ideally have lots of I/O (storage and network).
○ Ceph is a large system that interacts with many different pieces of software and hardware (e.g., kernel, libraries, network).
○ Enterprise ARMv8 vendors are targeting the high-density, highly-scalable storage solutions market with relatively strong cores and lots of available I/O.
● Goals
○ Bring up a simple Ceph cluster on commodity ARMv8 hardware.
○ Look for CPU hotspots during performance testing.
■ Start with simple workloads, especially those that are part of Ceph.
○ Focus on optimizations specific to AArch64.
13. Linaro Austin Colocation Cluster (1): Hardware
● 4 systems
○ AMD Opteron A1100 (codenamed Seattle) x3
■ With Cryptographic Extension and CRC
■ 16GB RAM
■ 10GbE available
■ Monitor/OSD nodes
○ APM X-Gene Mustang
■ 16GB RAM
■ Client/Admin node
● Each node has 1 hard drive
○ 500GB 7200RPM
○ OSD partition (390GB)
● Each node has 1 ethernet connection on a 1Gb network
[Diagram: three AMD Seattle nodes, each running a MON and an OSD (node 1 also runs the MDS), plus an APM X-Gene Mustang admin/client node, all attached to one switch]
14. Linaro Austin Colocation Cluster (2): Software
● Fedora 21
● Linux kernel 3.17
○ arm64 CRC32 module (available in 3.19)
● Ceph v0.91
○ RPM packages built from the Ceph git repo
○ Local YUM repo serving updated packages to the cluster
● ceph-deploy
○ Used to easily deploy the Ceph cluster, including package installs, setting up keys, etc. (see the sketch below)
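A hedged sketch of the kind of ceph-deploy sequence this implies (hostnames are hypothetical, and exact flags vary between ceph-deploy releases):

    # define a new cluster whose initial monitors run on the three Seattle nodes
    ceph-deploy new node1 node2 node3
    # install Ceph packages (here, from the local YUM repo) on every node
    ceph-deploy install node1 node2 node3 mustang
    # create the initial monitors and gather their keys
    ceph-deploy mon create-initial
    # push admin credentials so "ceph" commands work from each node
    ceph-deploy admin node1 node2 node3 mustang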
15. Performance Testing (1): Collecting Data
● Single Node
○ “perf record {workload}” (a concrete example follows below)
● Cluster
○ Client Node: Execute {workload}
○ OSD Nodes: “perf top” for all system data.
○ OSD Nodes: “perf top -p {osd pid}” for only the OSD process.
■ Useful when there is little system load.
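For concreteness, a hedged example of the single-node form using call-graph sampling ("rados bench" ships with Ceph; the pool name and duration are arbitrary):

    # sample the whole workload, recording call chains
    perf record -g rados bench -p data 60 write
    # browse the recorded hotspots afterwards
    perf report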
17. Performance Testing (3): Workloads, Other
● dd (a filled-in example follows below)
○ rbd create name --size #MBs
○ rbd map name -p rbd
○ mkfs [options] /dev/rbd/rbd/name
○ mount /dev/rbd/rbd/name /mnt
○ dd if=/dev/zero of=/mnt/zerofile [options]
● Write a lot of objects
○ for file in a_lot_of_files/*; do
      rados put "obj-$(basename "$file")" "$file" -p data
  done
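One concrete (entirely hypothetical) filling-in of the dd sequence above, pushing ~4GB of zeroes through a freshly created RBD image:

    rbd create test --size 4096                 # 4096 MB image
    rbd map test -p rbd                         # appears as /dev/rbd/rbd/test
    mkfs.ext4 /dev/rbd/rbd/test
    mount /dev/rbd/rbd/test /mnt
    dd if=/dev/zero of=/mnt/zerofile bs=4M count=1000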
18. Optimization Opportunities
● Known
○ CRC32C (see the hardware feature check below)
■ Ceph (upstreaming)
■ Linux kernel (already upstream, should arrive in 3.19)
● Possible
○ How memcpy is called (it is a CPU hotspot)
○ tcmalloc
○ Boost C++ libraries
○ rocksdb
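The hardware CRC32C path only helps when the CPU advertises the feature; a quick hedged check on an AArch64 machine (hwcap names come from the kernel and can vary by version):

    # look for "crc32" (and "aes" from the Cryptographic Extension) in the hwcaps
    grep -m1 Features /proc/cpuinfo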
19. Encountered Issues/Current Limitations
● Issues
○ Linux perf symbol decode
○ Python 2.7 hang when starting Ceph (now fixed)
● Limitations
○ I/O bound with a single OSD on a 7200RPM hard drive with a 1Gb network
■ Ideal: 8+ SSDs per node, each SSD with an individual OSD
■ Ideal: 10Gb network to support the nodes
○ Only several nodes forming a Ceph cluster (due to lack of hardware)
■ Ideal: 10+ nodes forming a cluster
20. Future Work
● Teuthology (for ceph-qa)
● More workload profiling:
○ CephFS
○ Ceph Object Gateway (radosgw)
● Ceph prerequisites that could be investigated on AArch64:
○ Boost C++ Libraries
○ tcmalloc
24. Erasure coding - An Example
● Given a 1GB object, let’s split it into 2 x 512MB chunks (A and B).
● Now, let’s introduce a third 512MB chunk P (for parity), and compute each individual byte P[i] as follows:
P[i] = A[i] ^ B[i]
● We can now lose any one of A, B, or P and still reconstruct our original data: XOR is its own inverse, so a lost B, for instance, is rebuilt as B[i] = A[i] ^ P[i] (a worked example follows below).
● To get this level of redundancy with replication requires 2GB of disk space, as opposed to 1.5GB with our parity coding.
[Diagram: the object split into chunks A and B, then stored as A, B, and parity chunk P]
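A one-byte worked instance (values arbitrary):

P[i] = A[i] ^ B[i] = 0x5A ^ 0x3C = 0x66
A[i] ^ P[i] = 0x5A ^ 0x66 = 0x3C = B[i]   (B recovered after losing it)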
25. Erasure coding generalized
● Erasure codes can get more elaborate. One can split an object into k data chunks and compute m coding chunks.
● This allows us to lose up to m chunks before data loss.
● The object will reside on k + m OSDs.
● Whether or not to use erasure coding is configured per pool (see the sketch below).
● The mathematics gets more complicated as m is increased and requires specialized Galois Field arithmetic routines.
● Thankfully, these have already been ported over to ARM (both 32-bit and 64-bit) using NEON.
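A hedged sketch of the per-pool configuration (profile and pool names are hypothetical; k=2, m=1 mirrors the parity example on the previous slide):

    # define an erasure-code profile: 2 data chunks plus 1 coding chunk
    ceph osd erasure-code-profile set myprofile k=2 m=1
    # create a pool that uses the profile instead of replication
    ceph osd pool create ecpool 128 128 erasure myprofile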
26. More about Linaro: http://www.linaro.org/about/
More about Linaro engineering: http://www.linaro.org/engineering/
How to join: http://www.linaro.org/about/how-to-join
Linaro members: www.linaro.org/members