How does Ceph perform when used for high-performance computing? This talk covers a year of running Ceph on a (small) Cray supercomputer. I will describe how Ceph was configured for an all-NVMe deployment, and the process of analysing and optimising that configuration. I'll also give details of the efforts underway to adapt Ceph's messaging to run over high-performance network fabrics, and how this work could become the next frontier in the battle against storage performance bottlenecks.
2. StackHPC
1992-2018: The story so far...
[Timeline graphic: HPC with the Alpha processor, software-defined networking, OpenStack for HPC, HPC on OpenStack]
3. Human Brain Project
• The Human Brain Project is a flagship EU FET project
• Significant effort into massively parallel applications in neuro-simulation and analysis techniques
• Research and development of platforms to enable these applications
• A 10-year project, running 2013-2023
4. HBP Pre-Commercial Procurement
• An EU vehicle for funding R&D activities in public institutions
• Jülich Supercomputing Centre and the HBP ran three phases of competition
• Phase III winners were Cray + IBM & NVIDIA
• Technical objectives:
• Dense memory integration
• Interactive supercomputing
5. JULIA pilot system
• Cray CS400 base system
• 60 KNL nodes
• 4 visualisation nodes
• 4 data nodes
• 2 x 1.6TB Intel "Fultondale" NVMe SSDs each
• Intel Omni-Path interconnect
• Highly diverse software stack
• Diverse memory / storage system
6. Why Ceph?
• Primarily to study a novel storage / hot-tier object store
• However, we also need a POSIX-compliant production filesystem (CephFS)
• Excellent support and engagement from a diverse community
• Interesting set of interactions with cloud software (OpenStack etc.)
• Is it performant?
7. Ceph’s Performance Record
Source: Intel white paper "Using Intel® Optane™ Technology with Ceph to Build High-Performance Cloud Storage Solutions on Intel® Purley Platform"
10. JULIA Ceph Cluster Architecture
• Monitors, MDSs and MGRs were previously freestanding, now co-hosted
• 4 OSD processes per NVMe device
• 32 OSDs in total
• Using the OPA IPoIB interface for both the front-side (public) and replication (cluster) networks; see the example config below
[Diagram: data nodes data-0001 to data-0004 interconnected by 100G Omni-Path (OPA)]
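A minimal sketch of how this network layout might be expressed in ceph.conf, assuming the IPoIB interfaces sit on a hypothetical 10.100.0.0/24 subnet (the subnet is illustrative, not the actual JULIA value):

# /etc/ceph/ceph.conf (illustrative excerpt)
[global]
# front-side (client) traffic over the OPA IPoIB interface
public_network  = 10.100.0.0/24
# replication / recovery traffic uses the same IPoIB network in this setup
cluster_network = 10.100.0.0/24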
11. Source: Intel white paper "Using Intel® Optane™ Technology with Ceph to Build High-Performance Cloud Storage Solutions on Intel® Purley Platform"
12. Data Node - Raw Read
• 64K reads using fio (see the example command below)
• 4 jobs per OSD partition (32 in total)
• Aggregate performance across all partitions approx. 5200 MB/s
[Chart: raw read bandwidth per OSD partition on data-0001, in MB/s]
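A sketch of the kind of fio invocation behind these figures, assuming one OSD partition at /dev/nvme0n1p1 (the device name and queue depth are illustrative; the slide only specifies 64K reads and 4 jobs per partition):

# 64K reads against the raw partition, 4 jobs, direct I/O
fio --name=raw-read \
    --filename=/dev/nvme0n1p1 \
    --ioengine=libaio --direct=1 \
    --rw=read --bs=64k \
    --numjobs=4 --iodepth=16 \
    --runtime=60 --time_based \
    --group_reporting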
13. A Non-Uniform Network Fabric
• Single TCP stream performance, measured with iperf3 (see the example below)
• IP encapsulation on Omni-Path
• KNL appears to struggle with the performance of this kind of sequential activity
• High variability between the other classes of node as well
[Matrix: single-stream iperf3 bandwidth between the viz, knl and data node classes; paths involving KNL nodes measure roughly 6-9 Gbit/s, while paths between the other node classes range from roughly 26 to 52 Gbit/s]
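A minimal sketch of the measurement itself, assuming data-0001 is the receiving node (the hostname is illustrative):

# on the receiving node
iperf3 -s

# on the sending node: a single TCP stream over the IPoIB interface
iperf3 -c data-0001 -t 30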
14. Network and I/O Compared
[Chart, 0-6500 MB/s: best and worst single-stream TCP/IP bandwidth for Xeon and KNL nodes compared against data node NVMe read bandwidth, highlighting the variation]
15. First Ceph Upgrade
• Jewel to Luminous release
• FileStore to BlueStore backend
• Use ceph-ansible playbooks (mostly)
• ceph-ansible doesn't support multiple OSDs per block device
• Manual creation of OSDs in partitions using Ceph tools (see the sketch below)
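A rough sketch of creating one BlueStore OSD per partition by hand on Luminous; the slides do not say exactly which Ceph tool was used, and the device and partition names here are illustrative:

# split the NVMe device into four equal partitions, one per OSD process
parted -s /dev/nvme0n1 mklabel gpt \
    mkpart osd1 0% 25% \
    mkpart osd2 25% 50% \
    mkpart osd3 50% 75% \
    mkpart osd4 75% 100%

# create a BlueStore OSD on each partition (repeat for p2..p4)
ceph-volume lvm create --bluestore --data /dev/nvme0n1p1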
34. Burst Buffer Workflows
• Stage in / Stage out
• Transparent Caching
• Checkpoint / Restart
• Background data movement
• Journaling
• Swap memory
Storage volumes (namespaces) can persist longer than the jobs and be shared between multiple users, or be private and ephemeral. Access can be POSIX or object (this can also be at a flash block load/store interface).
36. Ceph and RDMA
• RoCE 25G - Yes!
• iWARP - not tested
• Omni-Path 100G - nearly…
• InfiniBand 100G - looking good…
• Integrated in the Ceph RPMs since Luminous
• (Mellanox have a bugfix tree)
/etc/ceph/ceph.conf:
    ms_type = async+rdma          (or: ms_cluster_type = async+rdma)
    ms_async_rdma_device_name = hfi1_0   (or: mlx5_0)
    ms_async_rdma_polling_us = 0

/etc/security/limits.conf:
    #<domain>  <type>  <item>    <value>
    *          -       memlock   unlimited

/usr/lib/systemd/system/ceph-*@.service:
    [Service]
    LimitMEMLOCK=infinity
    PrivateDevices=no
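A minimal sketch of applying these overrides, assuming systemd-managed daemons (the set of targets to restart depends on what runs on each host):

# re-read the edited unit files, then restart the local Ceph daemons
systemctl daemon-reload
systemctl restart ceph-mon.target ceph-mgr.target ceph-osd.target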
37. New Developments in Ceph-RDMA
• Intel: RDMA Connection Manager and iWARP support
• https://github.com/tanghaodong25/ceph/tree/rdma-cm - merged
• Nothing iWARP-specific here; it (almost) works on IB and OPA too
• Mellanox: New RDMA Messenger based on UCX
• https://github.com/Mellanox/ceph/tree/vasily-ucx - 2019?
• How about libfabric?
38. Closing Comments
• Ceph is getting there, fast…
• RDMA performance is not currently the low-hanging fruit on most setups
• Intel's benchmarking claims TCP messaging consumes 25% of CPU in high-end configurations; I see over 50% on our setup!
• New approaches to RDMA should help in key areas:
• Performance, Portability, Flexibility, Stability