Zoltan Arnold Nagy
IBM Research - Zurich
Disaggregating Ceph using
NVMeoF
About me
2
§ Technical Lead – Zurich Compute Cloud @ IBM Research – Zurich
– Involved in all aspects (compute, storage, networking…)
§ OpenStack since 2011 – “cactus”
§ Serves the local Zurich Research Lab's research community – some data must remain in
Switzerland/EU and/or is too large to move off-site
§ ~4.5k cores / ~90TB memory and growing
§ 10/25/100GbE
§ Ceph + GPFS
§ Ceph since 2014 – “firefly”
– Current cluster is 2.2PiB RAW
§ Mostly HDD
§ 100TB NVMe that sparked this whole investigation
– Upgraded and growing since firefly!
About IBM Research - Zurich
3
§ Established in 1956
§ 45+ different nationalities
§ Open Collaboration:
– Horizon2020: 50+ funded projects and 500+ partners
§ Two Nobel Prizes:
– 1986: Nobel Prize in Physics for the invention of the scanning
tunneling microscope by Heinrich Rohrer and Gerd K. Binnig
– 1987: Nobel Prize in Physics for the discovery of
high-temperature superconductivity by
K. Alex Müller and J. Georg Bednorz
§ 2017: European Physical Society Historic Site
§ Binnig and Rohrer Nanotechnology Centre opened in
2011 (Public Private Partnership with ETH Zürich and EMPA)
§ 7 European Research Council Grants
Motivation #1
4
§ Our existing storage nodes were great when we got them – years ago
§ 2xE5-2630v3 – 2x8 cores @ 2.4GHz
§ 2x10Gbit LACP, flat L2 network
§ Wanted to add NVMe to our current nodes
– E5-2630v3 / 64GB RAM
5
6
7
1x Intel Optane 900P
8x Samsung 970 PRO 1TB
1x Mellanox ConnectX-4
(2x100GbE – PCIe v3 limits to ~128Gbit/s)
Motivation
8
56 cores / 140 GHz total compute for 7x NVMe drives
Motivation
9
48 cores / 129.6 GHz total compute for 10 NVMe drives
Motivation
10
Conclusion on those configurations?
small block size IO: you run out of CPU
large block size IO: you run out of network
Quick math
11
§ Resources per device (lots of assumptions: idle OS, RAM, NUMA, …)
– 32 threads / 8 NVMe = 4 threads per NVMe
– 100Gbit / 8 NVMe = 12.5Gbit/s per NVMe
– 3x replication: n Gbit/s of writes on the frontend causes 2n Gbit/s of outgoing replication bandwidth
-> so we can support at most 6.25Gbit/s of writes per OSD as the theoretical maximum throughput (see the sketch below)
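The same budget as a quick Python sketch – the thread, drive and NIC counts are the storage-node numbers from the slides above, and 3x replication means a primary forwards every write to two replicas:

```python
# Back-of-the-envelope budget per NVMe device on one storage node.
# Numbers from the slide: 32 hardware threads, 8 NVMe drives, 1x 100Gbit NIC,
# 3x replication (a primary OSD forwards each write to 2 replicas).
THREADS = 32
NVME_DRIVES = 8
NIC_GBIT = 100
REPLICATION = 3

threads_per_osd = THREADS / NVME_DRIVES      # 4 threads per OSD
nic_share_gbit = NIC_GBIT / NVME_DRIVES      # 12.5 Gbit/s of NIC per OSD

# n Gbit/s of client writes makes the primary send (REPLICATION - 1) * n Gbit/s
# of replication traffic, so the outgoing NIC share is the limit:
max_write_gbit = nic_share_gbit / (REPLICATION - 1)

print(f"{threads_per_osd:g} threads/OSD, {nic_share_gbit:g} Gbit/s NIC share/OSD, "
      f"max ~{max_write_gbit:g} Gbit/s of client writes per OSD")
```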
12
Can we do better?
Don’t we have a bunch of compute nodes?
13
14
84 compute nodes per rack
(yes, you need cooling…)
Each node:
2xE5-2683v4
(2x16 cores @ 2.1GHz)
256GB RAM
Plan
15
(diagram) 8x OSD on one storage node, all sharing a single 100Gbps link
Plan
16
(diagram) 8x OSD on one storage node sharing a single 100Gbps link
vs.
8x OSD spread across compute nodes – one OSD per node, 50Gbps per node
How does the math improve for writes?
17
(diagram) Client: TX n -> primary OSD: RX n, TX 2n -> two replica OSDs: RX n each
(worked example below)
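A small sketch of the same write-path arithmetic, reusing the 6.25 Gbit/s per-OSD figure from the quick-math slide (the rate is only an example value):

```python
# Per-NIC load for n Gbit/s of client writes per primary OSD with 3x replication,
# comparing 8 OSDs behind one shared 100Gbit NIC vs one OSD per node at 50Gbit.
def primary_tx(n, replication=3):
    """Outgoing replication traffic a primary OSD generates for n Gbit/s of writes."""
    return (replication - 1) * n

n = 6.25  # Gbit/s of client writes per primary OSD (the earlier theoretical limit)

# Consolidated: all 8 primaries share one full-duplex 100Gbit port.
storage_node_tx = 8 * primary_tx(n)   # 100 Gbit/s -> the shared NIC is saturated
storage_node_rx = 8 * n               # 50 Gbit/s

# Disaggregated: one OSD per compute node, each with its own 50Gbit link.
per_node_tx = primary_tx(n)           # 12.5 Gbit/s of a 50Gbit link
per_node_rx = n                       # 6.25 Gbit/s

print(storage_node_tx, storage_node_rx, per_node_tx, per_node_rx)
```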
18
We know the protocol (NVMe) – let’s talk fabric!
Fabric topology
19
(diagram) Leaf-spine fabric:
– 1…n spines (32x100GbE each)
– compute leafs: 32x25GbE + 8x100GbE ports, 32 compute nodes per leaf
– storage leafs: 32x100GbE ports, 6-10 storage nodes per leaf
20
6x Mellanox SN2100 switches per rack (16x100GbE each),
split into 8x 4x25GbE breakouts + 8x100GbE uplinks
Each node has full bi-directional bandwidth to the spines! (port-budget check below)
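The "full bandwidth" claim follows from the leaf's port budget; a tiny check, assuming the 8 breakout ports face the nodes and the remaining 8x100GbE ports face the spines:

```python
# Oversubscription of one SN2100 compute leaf: 16x100GbE ports, of which
# 8 are broken out into 4x25GbE towards the nodes and 8 stay at 100GbE
# towards the spines.
downlink_gbit = 8 * 4 * 25   # 800 Gbit/s towards 32 nodes
uplink_gbit = 8 * 100        # 800 Gbit/s towards the spines

print(downlink_gbit / uplink_gbit)   # 1.0 -> no oversubscription
```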
Fabric latency (ib_write_lat)
21
Fabric bandwidth (ib_write_bw)
22
Ingredient 1: RoCEv2
23
§ The "R" stands for RDMA – remote direct memory access
§ "oCE" is over Converged Ethernet
– tries to be "lossless"
– PFC (L2, e.g. NIC<>switch)
– ECN (L3)
§ Applications can copy directly into each other's memory, skipping the kernel
§ Some cards can do full NVMeoF offload, meaning 0% CPU use on the target
Ingredient 2: NVMeoF
24
§ NVMe = storage protocol = how do I talk to my storage?
§ "oF" = "over Fabrics", where a fabric can be
– Fibre Channel
– RDMA over Converged Ethernet (RoCE)
– TCP
§ Basically attaches a remote disk over some fabric to your local system, pretending to be a local
NVMe device
– If the target is native NVMe, pretty ideal
– NVMeoF vs iSCSI: the same comparison applies as NVMe vs SATA/SAS/SCSI
§ Linux kernel 5.0 introduced native NVMe-oF/TCP support
§ SPDK supports being both a target and an initiator in userspace
25
SQ = Submission Queue
CQ = Completion Queue
Netdev 0x13
26
Netdev 0x13
27
Netdev 0x13
28
Netdev 0x13
29
NVMeF export
30
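The export step is a terminal screenshot in the original deck; as a rough illustration only, this is roughly what exporting a local NVMe namespace over RDMA looks like with the in-kernel nvmet target driven through configfs. The NQN, device path and address below are made-up placeholders, not the values from the talk:

```python
# Minimal sketch: export /dev/nvme0n1 over NVMeoF/RDMA with the Linux kernel
# target (nvmet), configured through configfs. Requires the nvmet and
# nvmet-rdma modules and root privileges. NQN, device and IP are placeholders.
import os
from pathlib import Path

NVMET = Path("/sys/kernel/config/nvmet")
NQN = "nqn.2019-06.example:disaggregated-subsys"   # hypothetical NQN

# 1. Create the subsystem and allow any host to connect (fine for a lab setup).
subsys = NVMET / "subsystems" / NQN
subsys.mkdir(parents=True)
(subsys / "attr_allow_any_host").write_text("1")

# 2. Add namespace 1 backed by the local NVMe drive and enable it.
ns = subsys / "namespaces" / "1"
ns.mkdir(parents=True)
(ns / "device_path").write_text("/dev/nvme0n1")
(ns / "enable").write_text("1")

# 3. Create an RDMA port on the 100GbE interface and bind the subsystem to it.
port = NVMET / "ports" / "1"
port.mkdir(parents=True)
(port / "addr_trtype").write_text("rdma")
(port / "addr_adrfam").write_text("ipv4")
(port / "addr_traddr").write_text("192.0.2.10")    # placeholder target IP
(port / "addr_trsvcid").write_text("4420")
os.symlink(subsys, port / "subsystems" / NQN)
```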
NVMeF/RDMA discovery
31
NVMeF/RDMA connect
32
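The discovery and connect slides are also screenshots; in nvme-cli terms the initiator side looks roughly like this (wrapped in Python for consistency with the other sketches; the target IP and NQN are the same placeholders as in the export sketch):

```python
# Sketch of the initiator side using nvme-cli (the nvme-rdma module must be
# loaded). Address, service ID and NQN are placeholders matching the export
# sketch above.
import subprocess

TARGET_IP = "192.0.2.10"
NQN = "nqn.2019-06.example:disaggregated-subsys"

# Ask the target's discovery service what it exports.
subprocess.run(
    ["nvme", "discover", "-t", "rdma", "-a", TARGET_IP, "-s", "4420"],
    check=True,
)

# Attach the remote namespace; it shows up as a local /dev/nvmeXnY device
# that an OSD can then be deployed on.
subprocess.run(
    ["nvme", "connect", "-t", "rdma", "-n", NQN, "-a", TARGET_IP, "-s", "4420"],
    check=True,
)
```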
Drawbacks – network complexity blows up
33
• Each interface needs an IP – it can't be fully L3
• I'd prefer a /32 loopback address + unnumbered BGP
• currently the kernel cannot specify the source address for NVMeoF connections
• it will "stick" to one of the interfaces
• TCP connections between OSD nodes are going to be imbalanced
• the source address is going to be one of the NICs (hashed by destination info)
Ceph measurements (WIP)
34
§ Single client against 8x NVMe cluster – 8 volumes:
– randread: 210.29k IOPS (~26.29k IOPS/volume), stdev 616.37 @ 5.62ms 99p / 8.4ms 99.95p
– randwrite: ~48440 IOPS (~6055 IOPS/volume), stdev 94.46625 @ 12.9ms 99p / 19.6ms 99.95p
§ Single client against 8x NVMe cluster distributed according to the plan – 8 volumes:
– randread: 321.975k IOPS (~40.25k IOPS/volume), stdev 2483 @ 1.254ms 99p / 2.38ms 99.95p
– randwrite: 43.56k IOPS (~5445 IOPS/volume), stdev 5752 @ 14.1ms 99p / 21.3ms 99.95p
(a hypothetical fio job in this spirit is sketched below)
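The exact fio job behind these numbers is not included in the deck; a hypothetical invocation in the same spirit – 4k random I/O against an RBD volume through fio's rbd engine, with placeholder pool, image and client names – would look something like this:

```python
# Hypothetical fio run approximating the benchmark: 4k random reads against one
# RBD volume via fio's built-in rbd ioengine. Pool/image/client names are
# placeholders; this is not the actual job file used for the slides.
import subprocess

subprocess.run([
    "fio",
    "--name=rbd-randread",
    "--ioengine=rbd",
    "--clientname=admin",          # cephx user (without the "client." prefix)
    "--pool=volumes",              # placeholder pool name
    "--rbdname=bench-vol-1",       # placeholder RBD image
    "--rw=randread",               # switch to randwrite for the write numbers
    "--bs=4k",
    "--iodepth=32",
    "--direct=1",
    "--time_based", "--runtime=300",
    "--percentile_list=99:99.95",  # report the same tail percentiles as the slide
], check=True)
```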
Can we still improve these numbers?
35
§ Linux 5.1+ has a new asynchronous I/O interface: io_uring
– short for "userspace ring"
– a shared ring buffer between kernel and userspace
– the goal is to replace the existing async IO interface in the long run
– for more: https://lwn.net/Articles/776703/
§ Bluestore has NVMEDevice support w/ SPDK
– Couldn’t get it to work with NVMeoF despite SPDK having full native support
Source: Netdev 0x13
36
Netdev 0x13
37
Future: targets may be replaced by ASICs?
38
External references:
39
§ RHCS lab environment: https://ceph.io/community/bluestore-default-vs-tuned-performance-comparison/
§ Micron's reference architecture: https://www.micron.com/-/media/client/global/documents/products/other-documents/micron_9300_and_red_hat_ceph_reference_architecture.pdf
§ Marvell ASIC: https://www.servethehome.com/marvell-25gbe-nvmeof-adapter-prefaces-a-super-cool-future/
§ Netdev 0x13 SPDK RDMA vs TCP: https://www.youtube.com/watch?v=HLXxE5WWRf8&feature=youtu.be&t=643
§ Zcopy: https://zcopy.wordpress.com/2010/10/08/quick-concepts-part-1-%E2%80%93-introduction-to-rdma/