SlideShare une entreprise Scribd logo
1  sur  39
Télécharger pour lire hors ligne
StackHPC
Ceph on the Brain!
Stig Telfer, CTO, StackHPC Ltd
Ceph Days Berlin 2018
StackHPC
1992-2018: The story so far...
HPC
Alpha Processor
HPC
Software-defined networking
HPC
OpenStack for HPC
HPC on OpenStack
StackHPC
Human Brain Project
• The Human Brain Project is a flagship EU FET project
• Significant effort into massively parallel applications in neuro-
simulation and analysis techniques
• Research and development of platforms to enable these
applications
• A 10-year project - 2013-2023
HBP Pre-Commercial Procurement
• EU vehicle for funding R&D activities in public institutions
• Jülich Supercomputer Center and HBP ran three phases of competition
• Phase III winners were Cray + IBM & NVIDIA
• Technical Objectives:
• Dense memory integration
• Interactive supercomputing
JULIA pilot system
• Cray CS400 base system
• 60 KNL nodes
• 4 visualisation nodes
• 4 data nodes
• 2 x 1.6TB Fultondate SSDs
• Intel Omnipath interconnect
• Highly diverse software stack
• Diverse memory / storage system
Why Ceph?
• Primarily to study novel storage / hot-tier object store
• However, also need POSIX-compliant production filesystem
(CephFS)
• Excellent support and engagement from diverse community
• Interesting set of interactions with cloud software (OpenStack etc. )
• Is it performant?
Ceph’s Performance Record
Source: Intel white paper “Using Intel® OptaneTM Technology with Ceph to Build High- Performance Cloud Storage Solutions on Intel® Purley Platform”
StackHPC
JULIA Cluster Fabric
data-0001
data-0002
data-0003
data-0004
viz-0001 viz-0002
viz-0003 viz-0004
OPA
100G
login-1
login-2
prod-0001
prod-0002
prod-0003
prod-0004
prod-0005
prod-0006
prod-0007
prod-0008
prod-0009
prod-0010
prod-0011
prod-0012
prod-0013
prod-0014
prod-0015
prod-0016
prod-0017
prod-0018
prod-0019
prod-0020
prod-0021
prod-0022
prod-0023
prod-0024
prod-0025
prod-0026
prod-0027
prod-0028
prod-0029
prod-0030
prod-0031
prod-0032
prod-0033
prod-0034
prod-0035
prod-0036
prod-0037
prod-0038
prod-0039
prod-0040
prod-0041
prod-0042
prod-0043
prod-0044
prod-0045
prod-0046
prod-0047
prod-0048
prod-0049
prod-0050
prod-0051
prod-0052
prod-0053
prod-0054
prod-0055
prod-0056
prod-0057
prod-0058
prod-0059
prod-0060
OPA
100G
OPA
100G
StackHPC
JULIA Data Node Architecture
CPU 1 - Broadwell E5-2680
14 15 16 17
18 19 20 21
22 23 24 25
26 27
CPU 0 - Broadwell E5-2680
0 1 2 3
4 5 6 7
8 9 10 11
12 13
64GB RAM 64GB RAM
NVME0
P3600 “Fultondale”
1.6 TB
NVME1
P3600 “Fultondale”
1.6 TB
OPA
100G
QPI
StackHPC
JULIA Ceph Cluster Architecture
• Monitors, MDSs, MGRs previously
freestanding, now co-hosted
• 4 OSD processes per NVMe device
• 32 OSDs in total
• Using OPA IPoIB interface for both
front-side and replication networks
data-0001
data-0002
data-0003
data-0004
OPA
100G
Source: Intel white paper “Using Intel® OptaneTM Technology with Ceph to Build High- Performance Cloud Storage Solutions on Intel® Purley Platform”
StackHPC
Data Node - Raw Read
• 64K reads using fio
• 4 jobs per OSD partition 

(32 total)
• Aggregate performance across
all partitions approx 5200 MB/s
data-0001
500 MB/s
0 MB/s
StackHPC
A Non-Uniform Network Fabric
• Single TCP stream performance
(using iperf3)
• IP encapsulation on Omnipath
• KNL appears to struggle with
performance of sequential activity
• High variability between other
classes of node also
viz
knl
data
knl
data
viz
27.0 Gbits/sec -
49.6 Gbits/sec
5.95 Gbits/sec
6.91 Gbits/sec
8.43 Gbits/sec
6.67 Gbits/sec
8.16 Gbits/sec
35.4 Gbits/sec25.9 Gbits/sec
48.1 Gbits/sec
28.1 Gbits/sec -
51.7 Gbits/sec
StackHPC
Network and I/O Compared
500MB/s
0MB/s
1000MB/s
1500MB/s
2000MB/s
2500MB/s
3000MB/s
3500MB/s
4000MB/s
4500MB/s
5000MB/s
5500MB/s
6000MB/s
6500MB/s
TCP/iP Xeon - Best
Worst
TCP/IP KNL - Best
Worst
Data Node NVMe
(read)
variation
StackHPC
First Ceph Upgrade
• Jewel to Luminous release
• Filestore to Bluestore backend
• Use ceph-ansible playbooks (mostly)
• Doesn’t support multiple OSDs per block device
• Manual creation of OSDs in partitions using Ceph tools
StackHPC
Jewel to Luminous - Filestore
StackHPC
Luminous - Filestore to Bluestore
IPoIB
StackHPC
Ceph RADOS
Raw devices
Write Amplification - Filestore
StackHPC
Ceph RADOS
Raw devices
Write Amplification - Bluestore
StackHPC
Write Degradation Issue
Source: Intel presentation “Accelerate Ceph with Optane and 3D NAND”
StackHPC
Luminous - Scaling Out
StackHPC
Storage Nodes and Processor Sleep
76% read efficiency
StackHPC
Second Ceph Upgrade
• Luminous to Mimic release
• Bluestore backend
• Use ceph-ansible playbooks
• Use ceph-volume to create LVM OSDs
Spectre/Meltdown Mitigations
• Storage I/O performance
benchmarked 15.5% slower:



5200 MB/s 4400 MB/s
• Network performance also
adversely affected
Mimic - Scaling Out (Reads, LVM)
Mimic - Scaling Out (Reads, Partitions)
90% read
efficiency
Ceph OSD under Load…
53% CPU in Networking Stack!
StackHPC
Third Ceph Upgrade
• Mimic to Nautilus-development
• RDMA testing
StackHPC
A brief detour away from HBP…
StackHPC
Cambridge Data Accelerator
StackHPC
CPU 1 - Skylake Gold 6142
14 15 16 17
18 19 20 21
22 23 24 25
26 27
CPU 0 - Skylake Gold 6142
0 1 2 3
4 5 6 7
8 9 10 11
12 13
96GB RAM 96GB RAM
NVME
P4600
NVME
P4600
OPA
100G
QPI
NVME
P4600
NVME
P4600
NVME
P4600
NVME
P4600
6x NVME
P4600
NVME
P4600
NVME
P4600
NVME
P4600
NVME
P4600
6x NVME
P4600
OPA
100G
Accelerator Data Node
StackHPC
Burst Buffer Workflows
• Stage in / Stage out
• Transparent Caching
• Checkpoint / Restart
• Background data movement
• Journaling
• Swap memory Storage volumes - namespaces - can persist
longer than the jobs and shared with
multiple users, or private and ephemeral.
POSIX or Object (this can also be at a flash
block load/store interface)
StackHPC
Hero Performance Numbers
84% read efficiency96% write efficiency
StackHPC
Ceph and RDMA
• RoCE 25G - Yes!
• iWARP - not tested
• Omnipath 100G - Nearly…
• Infiniband 100G - Looking good…
• Integrated since Luminous Ceph RPMS
• (Mellanox have a bugfix tree)
/etc/ceph/ceph.conf:
ms_type = async+rdma or
ms_cluster_type = async+rdma
ms_async_rdma_device_name = hfi1_0 or mlx5_0
ms_async_rdma_polling_us = 0
/etc/security/limits.conf:
#<domain> <type> <item> <value>
* - memlock unlimited
/usr/lib/systemd/system/ceph-*@.service:
[Service]
LimitMEMLOCK=infinity
PrivateDevices=no
StackHPC
New Developments in Ceph-RDMA
• Intel: RDMA Connection Manager and iWARP support
• https://github.com/tanghaodong25/ceph/tree/rdma-cm - merged
• Nothing iWARP-specific here, (almost) works on IB, OPA too
• Mellanox: New RDMA Messenger based on UCX
• https://github.com/Mellanox/ceph/tree/vasily-ucx - 2019?
• How about libfabric?
StackHPC
Closing Comments
• Ceph is getting there, fast…
• RDMA performance not currently low hanging fruit on most setups
• Intel’s benchmarking claims TCP messaging consumes 25% of CPU
in high-end configurations - I see over 50% on our setup!
• New approaches to RDMA should help in key areas:
• Performance, Portability, Flexibility, Stability
Thank You!
stig@stackhpc.com
https://www.stackhpc.com
@oneswig
Cray EMEA Research Lab:
Adrian Tate, Nina Mujkanovic, Alessandro Rigazzi
Forschungszentrum Jülich:
Lena Oden, Bastian Tweddell, Dirk Pleiter 

Contenu connexe

Tendances

Ceph and OpenStack - Feb 2014
Ceph and OpenStack - Feb 2014Ceph and OpenStack - Feb 2014
Ceph and OpenStack - Feb 2014Ian Colle
 
Disk health prediction for Ceph
Disk health prediction for CephDisk health prediction for Ceph
Disk health prediction for CephCeph Community
 
Red Hat Ceph Storage Roadmap: January 2016
Red Hat Ceph Storage Roadmap: January 2016Red Hat Ceph Storage Roadmap: January 2016
Red Hat Ceph Storage Roadmap: January 2016Red_Hat_Storage
 
CEPH DAY BERLIN - MASTERING CEPH OPERATIONS: UPMAP AND THE MGR BALANCER
CEPH DAY BERLIN - MASTERING CEPH OPERATIONS: UPMAP AND THE MGR BALANCERCEPH DAY BERLIN - MASTERING CEPH OPERATIONS: UPMAP AND THE MGR BALANCER
CEPH DAY BERLIN - MASTERING CEPH OPERATIONS: UPMAP AND THE MGR BALANCERCeph Community
 
CEPH DAY BERLIN - DISK HEALTH PREDICTION AND RESOURCE ALLOCATION FOR CEPH BY ...
CEPH DAY BERLIN - DISK HEALTH PREDICTION AND RESOURCE ALLOCATION FOR CEPH BY ...CEPH DAY BERLIN - DISK HEALTH PREDICTION AND RESOURCE ALLOCATION FOR CEPH BY ...
CEPH DAY BERLIN - DISK HEALTH PREDICTION AND RESOURCE ALLOCATION FOR CEPH BY ...Ceph Community
 
Red Hat Storage Day New York - What's New in Red Hat Ceph Storage
Red Hat Storage Day New York - What's New in Red Hat Ceph StorageRed Hat Storage Day New York - What's New in Red Hat Ceph Storage
Red Hat Storage Day New York - What's New in Red Hat Ceph StorageRed_Hat_Storage
 
CEPH DAY BERLIN - WHAT'S NEW IN CEPH
CEPH DAY BERLIN - WHAT'S NEW IN CEPH CEPH DAY BERLIN - WHAT'S NEW IN CEPH
CEPH DAY BERLIN - WHAT'S NEW IN CEPH Ceph Community
 
Disaggregating Ceph using NVMeoF
Disaggregating Ceph using NVMeoFDisaggregating Ceph using NVMeoF
Disaggregating Ceph using NVMeoFShapeBlue
 
MySQL Head to Head Performance
MySQL Head to Head PerformanceMySQL Head to Head Performance
MySQL Head to Head PerformanceKyle Bader
 
My personal journey through the World of Open Source! How What Was Old Beco...
My personal journey through  the World of Open Source!  How What Was Old Beco...My personal journey through  the World of Open Source!  How What Was Old Beco...
My personal journey through the World of Open Source! How What Was Old Beco...Ceph Community
 
Which Hypervisor is Best?
Which Hypervisor is Best?Which Hypervisor is Best?
Which Hypervisor is Best?Kyle Bader
 
CEPH DAY BERLIN - UNLIMITED FILESERVER WITH SAMBA CTDB AND CEPHFS
CEPH DAY BERLIN - UNLIMITED FILESERVER WITH SAMBA CTDB AND CEPHFSCEPH DAY BERLIN - UNLIMITED FILESERVER WITH SAMBA CTDB AND CEPHFS
CEPH DAY BERLIN - UNLIMITED FILESERVER WITH SAMBA CTDB AND CEPHFSCeph Community
 
Ceph Tech Talk: Ceph at DigitalOcean
Ceph Tech Talk: Ceph at DigitalOceanCeph Tech Talk: Ceph at DigitalOcean
Ceph Tech Talk: Ceph at DigitalOceanCeph Community
 
2021.06. Ceph Project Update
2021.06. Ceph Project Update2021.06. Ceph Project Update
2021.06. Ceph Project UpdateCeph Community
 
Hyperconverged Cloud, Not just a toy anymore - Andrew Hatfield, Red Hat
Hyperconverged Cloud, Not just a toy anymore - Andrew Hatfield, Red HatHyperconverged Cloud, Not just a toy anymore - Andrew Hatfield, Red Hat
Hyperconverged Cloud, Not just a toy anymore - Andrew Hatfield, Red HatOpenStack
 
Stor4NFV: Exploration of Cloud native Storage in OPNFV - Ren Qiaowei, Wang Hui
Stor4NFV: Exploration of Cloud native Storage in OPNFV - Ren Qiaowei, Wang HuiStor4NFV: Exploration of Cloud native Storage in OPNFV - Ren Qiaowei, Wang Hui
Stor4NFV: Exploration of Cloud native Storage in OPNFV - Ren Qiaowei, Wang HuiCeph Community
 
Red Hat Storage Day Dallas - Storage for OpenShift Containers
Red Hat Storage Day Dallas - Storage for OpenShift Containers Red Hat Storage Day Dallas - Storage for OpenShift Containers
Red Hat Storage Day Dallas - Storage for OpenShift Containers Red_Hat_Storage
 

Tendances (20)

Ceph and OpenStack - Feb 2014
Ceph and OpenStack - Feb 2014Ceph and OpenStack - Feb 2014
Ceph and OpenStack - Feb 2014
 
Disk health prediction for Ceph
Disk health prediction for CephDisk health prediction for Ceph
Disk health prediction for Ceph
 
Red Hat Ceph Storage Roadmap: January 2016
Red Hat Ceph Storage Roadmap: January 2016Red Hat Ceph Storage Roadmap: January 2016
Red Hat Ceph Storage Roadmap: January 2016
 
CEPH DAY BERLIN - MASTERING CEPH OPERATIONS: UPMAP AND THE MGR BALANCER
CEPH DAY BERLIN - MASTERING CEPH OPERATIONS: UPMAP AND THE MGR BALANCERCEPH DAY BERLIN - MASTERING CEPH OPERATIONS: UPMAP AND THE MGR BALANCER
CEPH DAY BERLIN - MASTERING CEPH OPERATIONS: UPMAP AND THE MGR BALANCER
 
CEPH DAY BERLIN - DISK HEALTH PREDICTION AND RESOURCE ALLOCATION FOR CEPH BY ...
CEPH DAY BERLIN - DISK HEALTH PREDICTION AND RESOURCE ALLOCATION FOR CEPH BY ...CEPH DAY BERLIN - DISK HEALTH PREDICTION AND RESOURCE ALLOCATION FOR CEPH BY ...
CEPH DAY BERLIN - DISK HEALTH PREDICTION AND RESOURCE ALLOCATION FOR CEPH BY ...
 
Red Hat Storage Day New York - What's New in Red Hat Ceph Storage
Red Hat Storage Day New York - What's New in Red Hat Ceph StorageRed Hat Storage Day New York - What's New in Red Hat Ceph Storage
Red Hat Storage Day New York - What's New in Red Hat Ceph Storage
 
MySQL on Ceph
MySQL on CephMySQL on Ceph
MySQL on Ceph
 
CEPH DAY BERLIN - WHAT'S NEW IN CEPH
CEPH DAY BERLIN - WHAT'S NEW IN CEPH CEPH DAY BERLIN - WHAT'S NEW IN CEPH
CEPH DAY BERLIN - WHAT'S NEW IN CEPH
 
Disaggregating Ceph using NVMeoF
Disaggregating Ceph using NVMeoFDisaggregating Ceph using NVMeoF
Disaggregating Ceph using NVMeoF
 
Ceph Research at UCSC
Ceph Research at UCSCCeph Research at UCSC
Ceph Research at UCSC
 
Ceph on arm64 upload
Ceph on arm64   uploadCeph on arm64   upload
Ceph on arm64 upload
 
MySQL Head to Head Performance
MySQL Head to Head PerformanceMySQL Head to Head Performance
MySQL Head to Head Performance
 
My personal journey through the World of Open Source! How What Was Old Beco...
My personal journey through  the World of Open Source!  How What Was Old Beco...My personal journey through  the World of Open Source!  How What Was Old Beco...
My personal journey through the World of Open Source! How What Was Old Beco...
 
Which Hypervisor is Best?
Which Hypervisor is Best?Which Hypervisor is Best?
Which Hypervisor is Best?
 
CEPH DAY BERLIN - UNLIMITED FILESERVER WITH SAMBA CTDB AND CEPHFS
CEPH DAY BERLIN - UNLIMITED FILESERVER WITH SAMBA CTDB AND CEPHFSCEPH DAY BERLIN - UNLIMITED FILESERVER WITH SAMBA CTDB AND CEPHFS
CEPH DAY BERLIN - UNLIMITED FILESERVER WITH SAMBA CTDB AND CEPHFS
 
Ceph Tech Talk: Ceph at DigitalOcean
Ceph Tech Talk: Ceph at DigitalOceanCeph Tech Talk: Ceph at DigitalOcean
Ceph Tech Talk: Ceph at DigitalOcean
 
2021.06. Ceph Project Update
2021.06. Ceph Project Update2021.06. Ceph Project Update
2021.06. Ceph Project Update
 
Hyperconverged Cloud, Not just a toy anymore - Andrew Hatfield, Red Hat
Hyperconverged Cloud, Not just a toy anymore - Andrew Hatfield, Red HatHyperconverged Cloud, Not just a toy anymore - Andrew Hatfield, Red Hat
Hyperconverged Cloud, Not just a toy anymore - Andrew Hatfield, Red Hat
 
Stor4NFV: Exploration of Cloud native Storage in OPNFV - Ren Qiaowei, Wang Hui
Stor4NFV: Exploration of Cloud native Storage in OPNFV - Ren Qiaowei, Wang HuiStor4NFV: Exploration of Cloud native Storage in OPNFV - Ren Qiaowei, Wang Hui
Stor4NFV: Exploration of Cloud native Storage in OPNFV - Ren Qiaowei, Wang Hui
 
Red Hat Storage Day Dallas - Storage for OpenShift Containers
Red Hat Storage Day Dallas - Storage for OpenShift Containers Red Hat Storage Day Dallas - Storage for OpenShift Containers
Red Hat Storage Day Dallas - Storage for OpenShift Containers
 

Similaire à CEPH DAY BERLIN - CEPH ON THE BRAIN!

OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsHPCC Systems
 
3.INTEL.Optane_on_ceph_v2.pdf
3.INTEL.Optane_on_ceph_v2.pdf3.INTEL.Optane_on_ceph_v2.pdf
3.INTEL.Optane_on_ceph_v2.pdfhellobank1
 
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...Danielle Womboldt
 
Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...Ceph Community
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...Ganesan Narayanasamy
 
QCT Ceph Solution - Design Consideration and Reference Architecture
QCT Ceph Solution - Design Consideration and Reference ArchitectureQCT Ceph Solution - Design Consideration and Reference Architecture
QCT Ceph Solution - Design Consideration and Reference ArchitectureCeph Community
 
QCT Ceph Solution - Design Consideration and Reference Architecture
QCT Ceph Solution - Design Consideration and Reference ArchitectureQCT Ceph Solution - Design Consideration and Reference Architecture
QCT Ceph Solution - Design Consideration and Reference ArchitecturePatrick McGarry
 
00 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver200 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver2Yutaka Kawai
 
6 open capi_meetup_in_japan_final
6 open capi_meetup_in_japan_final6 open capi_meetup_in_japan_final
6 open capi_meetup_in_japan_finalYutaka Kawai
 
CAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablementCAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablementGanesan Narayanasamy
 
Exploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudExploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudRyousei Takano
 
FPGAs in the cloud? (October 2017)
FPGAs in the cloud? (October 2017)FPGAs in the cloud? (October 2017)
FPGAs in the cloud? (October 2017)Julien SIMON
 
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...Redis Labs
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Odinot Stanislas
 
Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDKKernel TLV
 
High performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User GroupHigh performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User GroupHungWei Chiu
 
Deploying flash storage for Ceph without compromising performance
Deploying flash storage for Ceph without compromising performance Deploying flash storage for Ceph without compromising performance
Deploying flash storage for Ceph without compromising performance Ceph Community
 

Similaire à CEPH DAY BERLIN - CEPH ON THE BRAIN! (20)

OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC Systems
 
3.INTEL.Optane_on_ceph_v2.pdf
3.INTEL.Optane_on_ceph_v2.pdf3.INTEL.Optane_on_ceph_v2.pdf
3.INTEL.Optane_on_ceph_v2.pdf
 
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...
 
Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...
 
OpenCAPI Technology Ecosystem
OpenCAPI Technology EcosystemOpenCAPI Technology Ecosystem
OpenCAPI Technology Ecosystem
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
 
QCT Ceph Solution - Design Consideration and Reference Architecture
QCT Ceph Solution - Design Consideration and Reference ArchitectureQCT Ceph Solution - Design Consideration and Reference Architecture
QCT Ceph Solution - Design Consideration and Reference Architecture
 
QCT Ceph Solution - Design Consideration and Reference Architecture
QCT Ceph Solution - Design Consideration and Reference ArchitectureQCT Ceph Solution - Design Consideration and Reference Architecture
QCT Ceph Solution - Design Consideration and Reference Architecture
 
00 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver200 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver2
 
POWER9 for AI & HPC
POWER9 for AI & HPCPOWER9 for AI & HPC
POWER9 for AI & HPC
 
6 open capi_meetup_in_japan_final
6 open capi_meetup_in_japan_final6 open capi_meetup_in_japan_final
6 open capi_meetup_in_japan_final
 
CAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablementCAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablement
 
Exploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudExploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC Cloud
 
FPGAs in the cloud? (October 2017)
FPGAs in the cloud? (October 2017)FPGAs in the cloud? (October 2017)
FPGAs in the cloud? (October 2017)
 
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
 
Ceph
CephCeph
Ceph
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
 
Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDK
 
High performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User GroupHigh performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User Group
 
Deploying flash storage for Ceph without compromising performance
Deploying flash storage for Ceph without compromising performance Deploying flash storage for Ceph without compromising performance
Deploying flash storage for Ceph without compromising performance
 

Dernier

Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 

Dernier (20)

Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 

CEPH DAY BERLIN - CEPH ON THE BRAIN!

  • 1. StackHPC Ceph on the Brain! Stig Telfer, CTO, StackHPC Ltd Ceph Days Berlin 2018
  • 2. StackHPC 1992-2018: The story so far... HPC Alpha Processor HPC Software-defined networking HPC OpenStack for HPC HPC on OpenStack
  • 3. StackHPC Human Brain Project • The Human Brain Project is a flagship EU FET project • Significant effort into massively parallel applications in neuro- simulation and analysis techniques • Research and development of platforms to enable these applications • A 10-year project - 2013-2023
  • 4. HBP Pre-Commercial Procurement • EU vehicle for funding R&D activities in public institutions • Jülich Supercomputer Center and HBP ran three phases of competition • Phase III winners were Cray + IBM & NVIDIA • Technical Objectives: • Dense memory integration • Interactive supercomputing
  • 5. JULIA pilot system • Cray CS400 base system • 60 KNL nodes • 4 visualisation nodes • 4 data nodes • 2 x 1.6TB Fultondate SSDs • Intel Omnipath interconnect • Highly diverse software stack • Diverse memory / storage system
  • 6. Why Ceph? • Primarily to study novel storage / hot-tier object store • However, also need POSIX-compliant production filesystem (CephFS) • Excellent support and engagement from diverse community • Interesting set of interactions with cloud software (OpenStack etc. ) • Is it performant?
  • 7. Ceph’s Performance Record Source: Intel white paper “Using Intel® OptaneTM Technology with Ceph to Build High- Performance Cloud Storage Solutions on Intel® Purley Platform”
  • 8. StackHPC JULIA Cluster Fabric data-0001 data-0002 data-0003 data-0004 viz-0001 viz-0002 viz-0003 viz-0004 OPA 100G login-1 login-2 prod-0001 prod-0002 prod-0003 prod-0004 prod-0005 prod-0006 prod-0007 prod-0008 prod-0009 prod-0010 prod-0011 prod-0012 prod-0013 prod-0014 prod-0015 prod-0016 prod-0017 prod-0018 prod-0019 prod-0020 prod-0021 prod-0022 prod-0023 prod-0024 prod-0025 prod-0026 prod-0027 prod-0028 prod-0029 prod-0030 prod-0031 prod-0032 prod-0033 prod-0034 prod-0035 prod-0036 prod-0037 prod-0038 prod-0039 prod-0040 prod-0041 prod-0042 prod-0043 prod-0044 prod-0045 prod-0046 prod-0047 prod-0048 prod-0049 prod-0050 prod-0051 prod-0052 prod-0053 prod-0054 prod-0055 prod-0056 prod-0057 prod-0058 prod-0059 prod-0060 OPA 100G OPA 100G
  • 9. StackHPC JULIA Data Node Architecture CPU 1 - Broadwell E5-2680 14 15 16 17 18 19 20 21 22 23 24 25 26 27 CPU 0 - Broadwell E5-2680 0 1 2 3 4 5 6 7 8 9 10 11 12 13 64GB RAM 64GB RAM NVME0 P3600 “Fultondale” 1.6 TB NVME1 P3600 “Fultondale” 1.6 TB OPA 100G QPI
  • 10. StackHPC JULIA Ceph Cluster Architecture • Monitors, MDSs, MGRs previously freestanding, now co-hosted • 4 OSD processes per NVMe device • 32 OSDs in total • Using OPA IPoIB interface for both front-side and replication networks data-0001 data-0002 data-0003 data-0004 OPA 100G
  • 11. Source: Intel white paper “Using Intel® OptaneTM Technology with Ceph to Build High- Performance Cloud Storage Solutions on Intel® Purley Platform”
  • 12. StackHPC Data Node - Raw Read • 64K reads using fio • 4 jobs per OSD partition 
 (32 total) • Aggregate performance across all partitions approx 5200 MB/s data-0001 500 MB/s 0 MB/s
  • 13. StackHPC A Non-Uniform Network Fabric • Single TCP stream performance (using iperf3) • IP encapsulation on Omnipath • KNL appears to struggle with performance of sequential activity • High variability between other classes of node also viz knl data knl data viz 27.0 Gbits/sec - 49.6 Gbits/sec 5.95 Gbits/sec 6.91 Gbits/sec 8.43 Gbits/sec 6.67 Gbits/sec 8.16 Gbits/sec 35.4 Gbits/sec25.9 Gbits/sec 48.1 Gbits/sec 28.1 Gbits/sec - 51.7 Gbits/sec
  • 14. StackHPC Network and I/O Compared 500MB/s 0MB/s 1000MB/s 1500MB/s 2000MB/s 2500MB/s 3000MB/s 3500MB/s 4000MB/s 4500MB/s 5000MB/s 5500MB/s 6000MB/s 6500MB/s TCP/iP Xeon - Best Worst TCP/IP KNL - Best Worst Data Node NVMe (read) variation
  • 15. StackHPC First Ceph Upgrade • Jewel to Luminous release • Filestore to Bluestore backend • Use ceph-ansible playbooks (mostly) • Doesn’t support multiple OSDs per block device • Manual creation of OSDs in partitions using Ceph tools
  • 17. StackHPC Luminous - Filestore to Bluestore IPoIB
  • 18. StackHPC Ceph RADOS Raw devices Write Amplification - Filestore
  • 19. StackHPC Ceph RADOS Raw devices Write Amplification - Bluestore
  • 21. Source: Intel presentation “Accelerate Ceph with Optane and 3D NAND”
  • 23. StackHPC Storage Nodes and Processor Sleep 76% read efficiency
  • 24. StackHPC Second Ceph Upgrade • Luminous to Mimic release • Bluestore backend • Use ceph-ansible playbooks • Use ceph-volume to create LVM OSDs
  • 25. Spectre/Meltdown Mitigations • Storage I/O performance benchmarked 15.5% slower:
 
 5200 MB/s 4400 MB/s • Network performance also adversely affected
  • 26. Mimic - Scaling Out (Reads, LVM)
  • 27. Mimic - Scaling Out (Reads, Partitions) 90% read efficiency
  • 28. Ceph OSD under Load…
  • 29. 53% CPU in Networking Stack!
  • 30. StackHPC Third Ceph Upgrade • Mimic to Nautilus-development • RDMA testing
  • 31. StackHPC A brief detour away from HBP…
  • 33. StackHPC CPU 1 - Skylake Gold 6142 14 15 16 17 18 19 20 21 22 23 24 25 26 27 CPU 0 - Skylake Gold 6142 0 1 2 3 4 5 6 7 8 9 10 11 12 13 96GB RAM 96GB RAM NVME P4600 NVME P4600 OPA 100G QPI NVME P4600 NVME P4600 NVME P4600 NVME P4600 6x NVME P4600 NVME P4600 NVME P4600 NVME P4600 NVME P4600 6x NVME P4600 OPA 100G Accelerator Data Node
  • 34. StackHPC Burst Buffer Workflows • Stage in / Stage out • Transparent Caching • Checkpoint / Restart • Background data movement • Journaling • Swap memory Storage volumes - namespaces - can persist longer than the jobs and shared with multiple users, or private and ephemeral. POSIX or Object (this can also be at a flash block load/store interface)
  • 35. StackHPC Hero Performance Numbers 84% read efficiency96% write efficiency
  • 36. StackHPC Ceph and RDMA • RoCE 25G - Yes! • iWARP - not tested • Omnipath 100G - Nearly… • Infiniband 100G - Looking good… • Integrated since Luminous Ceph RPMS • (Mellanox have a bugfix tree) /etc/ceph/ceph.conf: ms_type = async+rdma or ms_cluster_type = async+rdma ms_async_rdma_device_name = hfi1_0 or mlx5_0 ms_async_rdma_polling_us = 0 /etc/security/limits.conf: #<domain> <type> <item> <value> * - memlock unlimited /usr/lib/systemd/system/ceph-*@.service: [Service] LimitMEMLOCK=infinity PrivateDevices=no
  • 37. StackHPC New Developments in Ceph-RDMA • Intel: RDMA Connection Manager and iWARP support • https://github.com/tanghaodong25/ceph/tree/rdma-cm - merged • Nothing iWARP-specific here, (almost) works on IB, OPA too • Mellanox: New RDMA Messenger based on UCX • https://github.com/Mellanox/ceph/tree/vasily-ucx - 2019? • How about libfabric?
  • 38. StackHPC Closing Comments • Ceph is getting there, fast… • RDMA performance not currently low hanging fruit on most setups • Intel’s benchmarking claims TCP messaging consumes 25% of CPU in high-end configurations - I see over 50% on our setup! • New approaches to RDMA should help in key areas: • Performance, Portability, Flexibility, Stability
  • 39. Thank You! stig@stackhpc.com https://www.stackhpc.com @oneswig Cray EMEA Research Lab: Adrian Tate, Nina Mujkanovic, Alessandro Rigazzi Forschungszentrum Jülich: Lena Oden, Bastian Tweddell, Dirk Pleiter