How does Ceph perform when used for high-performance computing? This talk covers a year of running Ceph on a (small) Cray supercomputer. I will describe how Ceph was configured for an all-NVMe deployment, and the process of analysing and optimising that configuration. I'll also give details of the efforts underway to adapt Ceph's messaging to run over high-performance network fabrics, and how this work could become the next frontier in the battle against storage performance bottlenecks.
2. StackHPC
1992-2018: The story so far...
[Timeline graphic: HPC with the Alpha processor, software-defined networking, OpenStack for HPC, HPC on OpenStack]
3. Human Brain Project
• The Human Brain Project is a flagship EU FET project
• Significant effort into massively parallel applications in neuro-simulation and analysis techniques
• Research and development of platforms to enable these applications
• A 10-year project, running 2013-2023
4. HBP Pre-Commercial Procurement
• An EU vehicle for funding R&D activities in public institutions
• Jülich Supercomputing Centre and the HBP ran three phases of competition
• Phase III winners were Cray + IBM & NVIDIA
• Technical objectives:
• Dense memory integration
• Interactive supercomputing
5. JULIA pilot system
• Cray CS400 base system
• 60 KNL nodes
• 4 visualisation nodes
• 4 data nodes
• 2 x 1.6TB Intel "Fultondale" NVMe SSDs each
• Intel Omni-Path interconnect
• Highly diverse software stack
• Diverse memory / storage system
6. Why Ceph?
• Primarily to study a novel storage / hot-tier object store
• However, we also need a POSIX-compliant production filesystem (CephFS)
• Excellent support and engagement from a diverse community
• Interesting set of interactions with cloud software (OpenStack etc.)
• Is it performant?
7. Ceph’s Performance Record
Source: Intel white paper "Using Intel® Optane™ Technology with Ceph to Build High-Performance Cloud Storage Solutions on Intel® Purley Platform"
10. JULIA Ceph Cluster Architecture
• Monitors, MDSs and MGRs were previously freestanding, now co-hosted
• 4 OSD processes per NVMe device
• 32 OSDs in total
• Using the OPA IPoIB interface for both the front-side (public) and replication (cluster) networks; see the example config below
[Diagram: data nodes data-0001 to data-0004 interconnected by 100G Omni-Path (OPA)]
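A minimal sketch of how this network layout might be expressed in ceph.conf, assuming the IPoIB interfaces sit on a hypothetical 10.100.0.0/24 subnet (the subnet is illustrative, not the actual JULIA value):

# /etc/ceph/ceph.conf (illustrative excerpt)
[global]
# front-side (client) traffic over the OPA IPoIB interface
public_network  = 10.100.0.0/24
# replication / recovery traffic uses the same IPoIB network in this setup
cluster_network = 10.100.0.0/24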
11. Source: Intel white paper "Using Intel® Optane™ Technology with Ceph to Build High-Performance Cloud Storage Solutions on Intel® Purley Platform"
12. Data Node - Raw Read
• 64K reads using fio (see the example command below)
• 4 jobs per OSD partition (32 in total)
• Aggregate performance across all partitions approx. 5200 MB/s
[Chart: raw read bandwidth per OSD partition on data-0001, in MB/s]
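A sketch of the kind of fio invocation behind these figures, assuming one OSD partition at /dev/nvme0n1p1 (the device name and queue depth are illustrative; the slide only specifies 64K reads and 4 jobs per partition):

# 64K reads against the raw partition, 4 jobs, direct I/O
fio --name=raw-read \
    --filename=/dev/nvme0n1p1 \
    --ioengine=libaio --direct=1 \
    --rw=read --bs=64k \
    --numjobs=4 --iodepth=16 \
    --runtime=60 --time_based \
    --group_reporting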
13. A Non-Uniform Network Fabric
• Single TCP stream performance, measured with iperf3 (see the example below)
• IP encapsulation on Omni-Path
• KNL appears to struggle with the performance of this kind of sequential activity
• High variability between the other classes of node as well
[Matrix: single-stream iperf3 bandwidth between the viz, knl and data node classes; paths involving KNL nodes measure roughly 6-9 Gbit/s, while paths between the other node classes range from roughly 26 to 52 Gbit/s]
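A minimal sketch of the measurement itself, assuming data-0001 is the receiving node (the hostname is illustrative):

# on the receiving node
iperf3 -s

# on the sending node: a single TCP stream over the IPoIB interface
iperf3 -c data-0001 -t 30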
14. Network and I/O Compared
[Chart, 0-6500 MB/s: best and worst single-stream TCP/IP bandwidth for Xeon and KNL nodes compared against data node NVMe read bandwidth, highlighting the variation]
15. First Ceph Upgrade
• Jewel to Luminous release
• FileStore to BlueStore backend
• Use ceph-ansible playbooks (mostly)
• ceph-ansible doesn't support multiple OSDs per block device
• Manual creation of OSDs in partitions using Ceph tools (see the sketch below)
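A rough sketch of creating one BlueStore OSD per partition by hand on Luminous; the slides do not say exactly which Ceph tool was used, and the device and partition names here are illustrative:

# split the NVMe device into four equal partitions, one per OSD process
parted -s /dev/nvme0n1 mklabel gpt \
    mkpart osd1 0% 25% \
    mkpart osd2 25% 50% \
    mkpart osd3 50% 75% \
    mkpart osd4 75% 100%

# create a BlueStore OSD on each partition (repeat for p2..p4)
ceph-volume lvm create --bluestore --data /dev/nvme0n1p1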
34. Burst Buffer Workflows
• Stage in / Stage out
• Transparent Caching
• Checkpoint / Restart
• Background data movement
• Journaling
• Swap memory
Storage volumes (namespaces) can persist longer than the jobs and be shared between multiple users, or be private and ephemeral. Access can be POSIX or object (this can also be at a flash block load/store interface).
36. Ceph and RDMA
• RoCE 25G - Yes!
• iWARP - not tested
• Omni-Path 100G - nearly…
• InfiniBand 100G - looking good…
• Integrated in the Ceph RPMs since Luminous
• (Mellanox have a bugfix tree)
/etc/ceph/ceph.conf:
    ms_type = async+rdma          (or: ms_cluster_type = async+rdma)
    ms_async_rdma_device_name = hfi1_0   (or: mlx5_0)
    ms_async_rdma_polling_us = 0

/etc/security/limits.conf:
    #<domain>  <type>  <item>    <value>
    *          -       memlock   unlimited

/usr/lib/systemd/system/ceph-*@.service:
    [Service]
    LimitMEMLOCK=infinity
    PrivateDevices=no
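A minimal sketch of applying these overrides, assuming systemd-managed daemons (the set of targets to restart depends on what runs on each host):

# re-read the edited unit files, then restart the local Ceph daemons
systemctl daemon-reload
systemctl restart ceph-mon.target ceph-mgr.target ceph-osd.target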
37. New Developments in Ceph-RDMA
• Intel: RDMA Connection Manager and iWARP support
• https://github.com/tanghaodong25/ceph/tree/rdma-cm - merged
• Nothing iWARP-specific here; it (almost) works on IB and OPA too
• Mellanox: New RDMA Messenger based on UCX
• https://github.com/Mellanox/ceph/tree/vasily-ucx - 2019?
• How about libfabric?
38. Closing Comments
• Ceph is getting there, fast…
• RDMA performance is not currently the low-hanging fruit on most setups
• Intel's benchmarking claims TCP messaging consumes 25% of CPU in high-end configurations; I see over 50% on our setup!
• New approaches to RDMA should help in key areas:
• Performance, Portability, Flexibility, Stability