2. About me
§ Technical Lead – Zurich Compute Cloud @ IBM Research – Zurich
– Involved in all aspects (compute, storage, networking, …)
§ OpenStack since 2011 – “cactus”
§ Serving the local Zurich Research Lab’s research community – some data must remain in Switzerland/EU and/or is too large to move off-site
§ ~4.5k cores / ~90TB memory and growing
§ 10/25/100GbE
§ Ceph + GPFS
§ Ceph since 2014 – “firefly”
– Current cluster is 2.2PiB RAW
§ Mostly HDD
§ 100TB NVMe that sparked this whole investigation
– Upgraded and growing since firefly!
3. About IBM Research - Zurich
§ Established in 1956
§ 45+ different nationalities
§ Open Collaboration:
– Horizon2020: 50+ funded projects and 500+ partners
§ Two Nobel Prizes:
– 1986: Nobel Prize in Physics for the invention of the scanning
tunneling microscope by Heinrich Rohrer and Gerd K. Binnig
– 1987: Nobel Prize in Physics for the discovery of
high-temperature superconductivity by
K. Alex Müller and J. Georg Bednorz
§ 2017: European Physical Society Historic Site
§ Binnig and Rohrer Nanotechnology Centre opened in
2011 (Public Private Partnership with ETH Zürich and EMPA)
§ 7 European Research Council Grants
4. Motivation #1
§ Our current nodes were great when we got them – years ago
§ 2xE5-2630v3 – 2x8 cores @ 2.4GHz
§ 2x10Gbit LACP, flat L2 network
§ Wanted to add NVMe to our current nodes
– E5-2630v3 / 64GB RAM
10. Motivation
Conclusion on those configurations?
– small block size IO: you run out of CPU
– large block size IO: you run out of network
11. Quick math
§ Resources per device (lots of assumptions: idle OS, RAM, NUMA, …)
– 32 threads / 8 NVMe = 4 threads / NVMe
– 100Gbit / 8 NVMe = 12.5Gbit/s per NVMe
– 3x replication: n Gbit/s of writes on the frontend causes 2n Gbit/s of outgoing replication traffic
-> we can support 12.5 / 2 = 6.25Gbit/s of writes per OSD as the theoretical maximum throughput!
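A worked version of that math as a tiny C sketch – the node parameters (32 threads, 100Gbit NIC, 8 NVMe, 3x replication) are the ones from these slides, everything else is illustrative:

/* Worked version of the "quick math" above. Node parameters are
 * from the slides; the program itself is just an illustration. */
#include <stdio.h>

int main(void) {
    const double threads  = 32.0;   /* 2x E5 CPUs, 8 cores x 2 HW threads each */
    const double nic_gbit = 100.0;  /* 100GbE */
    const double nvme     = 8.0;    /* NVMe devices per node */
    const double replicas = 3.0;    /* 3x replication */

    double threads_per_nvme = threads / nvme;   /* = 4 */
    double gbit_per_nvme    = nic_gbit / nvme;  /* = 12.5 */

    /* With 3x replication, n Gbit/s of frontend writes makes the
     * primary OSD send 2n Gbit/s to the other two replicas, so the
     * egress share caps writes at gbit_per_nvme / (replicas - 1). */
    double max_write = gbit_per_nvme / (replicas - 1.0); /* = 6.25 */

    printf("%.1f threads/NVMe, %.1f Gbit/s/NVMe, max %.2f Gbit/s write/OSD\n",
           threads_per_nvme, gbit_per_nvme, max_write);
    return 0;
}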
12. Can we do better?
Don’t we have a bunch of compute nodes?
23. Ingredient 1: RoCEv2
§ R stands for RDMA, which stands for “remote DMA”
§ “oCE” is “over Converged Ethernet”
– Tries to be “lossless”
– PFC (L2, for example NIC<->Switch)
– ECN (L3)
§ Applications can directly copy to each
other’s memory, skipping the kernel
§ Some cards can do full NVMeoF offload
meaning 0% CPU use on the target
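To make “directly copy to each other’s memory” concrete, here is a minimal libibverbs sketch (C) that registers a buffer a remote peer could RDMA-write into. This is an illustrative fragment under big assumptions: queue pair creation, connection management (e.g. rdma_cm) and completion handling are all omitted.

/* Register local memory for remote RDMA writes (libibverbs).
 * Illustrative sketch only – no queue pairs or connections here.
 * Build assumption: cc rdma_mr.c -libverbs */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void) {
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA-capable NIC\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) { fprintf(stderr, "device open failed\n"); return 1; }
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    if (!pd) { fprintf(stderr, "PD alloc failed\n"); return 1; }

    /* Pin + register 1 MiB that the peer may write into directly,
     * bypassing our kernel – the "remote DMA" part of RDMA. */
    size_t len = 1 << 20;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return 1; }

    /* The peer needs this rkey + buffer address to post an RDMA WRITE
     * that lands in our memory with ~0% CPU spent on our side. */
    printf("buf=%p rkey=0x%x\n", buf, mr->rkey);

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}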
24. Ingredient 2: NVMeoF
§ NVMe = storage protocol = how do I talk to my storage?
§ “oF” = “over Fabrics”, where “a fabric” can be
– Fibre Channel
– RDMA over Converged Ethernet (RoCE)
– TCP
§ Basically attaches a remote disk to your local system over some fabric, pretending to be a local NVMe device
– If target is native NVMe, pretty ideal
– NVMeoF vs iSCSI: the same comparison applies as to NVMe vs SATA/SAS/SCSI
§ Linux kernel 5.0 introduced native NVMe-oF/TCP support
§ SPDK supports both being a target and an initiator in userspace
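As a rough sketch of that kernel-native path: on Linux the initiator is configured by writing an option string to /dev/nvme-fabrics, which is what the nvme-cli tool does under the hood. The address and NQN below are hypothetical placeholders:

/* Sketch: kernel-native NVMe-oF/TCP connect by writing an option
 * string to /dev/nvme-fabrics, as nvme-cli does internally.
 * traddr/nqn are hypothetical placeholders; assumes Linux 5.0+
 * with the nvme-tcp module loaded, run as root. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const char *opts =
        "transport=tcp,traddr=192.0.2.10,trsvcid=4420,"
        "nqn=nqn.2019-06.io.example:nvme-target";

    int fd = open("/dev/nvme-fabrics", O_RDWR);
    if (fd < 0) { perror("open /dev/nvme-fabrics"); return 1; }

    /* On success the kernel creates a new /dev/nvmeX controller whose
     * namespaces then show up as ordinary local NVMe block devices. */
    if (write(fd, opts, strlen(opts)) < 0) {
        perror("NVMe-oF connect");
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}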
33. Drawbacks – network complexity blows up
• Each interface needs an IP, can’t be full L3
• I’d prefer a /32 loopback address + unnumbered BGP
– currently the kernel cannot specify the source address for NVMeoF connections
– it is going to “stick” to one of the interfaces
• TCP connections between OSD nodes are going to be imbalanced
– the source address is going to be one of the NICs (hashed by destination info)
35. Can we still improve these numbers?
§ Linux 5.1+ has a new interface replacing async IO, called “io_uring” (see the sketch after this list)
– short for userspace ring
– shared ring buffer between kernel and userspace
– The goal is to replace the async IO interface in the long run
– For more: https://lwn.net/Articles/776703/
§ Bluestore has NVMEDevice support w/ SPDK
– Couldn’t get it to work with NVMeoF despite SPDK having full native support
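For a taste of the io_uring interface mentioned above, here is a minimal liburing read – a sketch only, assuming liburing is installed (build: cc uring_read.c -luring); a real benchmark would batch many submissions and use polling or registered buffers:

/* Minimal io_uring read using liburing (Linux 5.1+). */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>

int main(void) {
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0) return 1; /* 8-entry rings */

    int fd = open("/etc/hostname", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char buf[256];
    /* Fill a submission queue entry (SQE) in the shared ring ... */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);

    /* ... tell the kernel about it (one syscall can submit many SQEs) ... */
    io_uring_submit(&ring);

    /* ... and reap the completion (CQE) from the completion ring. */
    struct io_uring_cqe *cqe;
    if (io_uring_wait_cqe(&ring, &cqe) < 0) return 1;
    printf("read %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return 0;
}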