Introduction to
NVMe over Fabrics
10/2016
v3
Simon Huang
Email: simonhuang1688@gmail.com
Agenda
• What is NVM Express™?
• What’s NVMe over Fabrics?
• Why NVMe over Fabrics?
• Expanding NVMe to Fabrics
• NVMe over Fabrics in the Data Center
• End-to-End NVMe over Fabrics
• NVMe Multi-Fabric Transport Mapping
• NVMe over Fabrics at Storage Tiers
• End-to-End NVMe Model
• Shared Server Flash
• NVMe Over Fabrics Products (Examples)
• Recap
• Backup 1 and 2
What is NVM Express™?
• Industry standard for PCIe SSDs
• High-performance, low-latency, PCIe SSD interface
• Command set + PCIe register interface
• In-box NVMe host drivers for Linux, Windows, VMware, …
• Standard h/w drive form factors, mobile to enterprise
• NVMe community is 100+ companies strong and growing
• Learn more at nvmexpress.org
What’s NVMe over Fabrics?
• Nonvolatile Memory Express (NVMe) over Fabrics is a technology specification designed to enable NVMe message-based commands to transfer data between a host computer and a target solid-state storage device or system over a network fabric such as Ethernet, Fibre Channel, or InfiniBand. (A minimal host-side connection sketch follows below.)
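A minimal, hypothetical sketch of what this looks like from the host side on Linux, using the nvme-cli tool to discover and connect to a remote subsystem over RDMA. The IP address, port, and NQN are placeholders, and nvme-cli plus an RDMA-capable NIC and the nvme-rdma module are assumed to be present:

```python
#!/usr/bin/env python3
"""Hypothetical host-side sketch: discover and connect to an NVMe-oF target
over RDMA using nvme-cli. Address, port, and NQN below are placeholders."""
import subprocess

TARGET_ADDR = "192.168.1.100"                      # placeholder target IP
TARGET_PORT = "4420"                               # conventional NVMe-oF port
TARGET_NQN = "nqn.2016-06.io.example:subsystem1"   # placeholder subsystem NQN

# Ask the discovery controller which subsystems are exported over RDMA.
subprocess.run(["nvme", "discover", "-t", "rdma",
                "-a", TARGET_ADDR, "-s", TARGET_PORT], check=True)

# Connect to one advertised subsystem; a new /dev/nvmeXnY block device appears
# on the host and can be used like a local NVMe drive.
subprocess.run(["nvme", "connect", "-t", "rdma", "-n", TARGET_NQN,
                "-a", TARGET_ADDR, "-s", TARGET_PORT], check=True)

# Tear the association down when finished.
subprocess.run(["nvme", "disconnect", "-n", TARGET_NQN], check=True)
```

Once connected, the remote namespace is indistinguishable from a local NVMe block device as far as the filesystem and applications are concerned.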
Why NVMe over Fabrics?
• End-to-End NVMe semantics across a range of topologies
– Retains NVMe efficiency and performance over network fabrics
– Eliminates unnecessary protocol translations
– Enables low-latency and high IOPS remote NVMe storage solutions
Expanding NVMe to Fabrics
• Built on the common NVMe architecture, with additional definitions to support message-based NVMe operations
• Standardization of NVMe over a range of fabric types
• Initial fabrics: RDMA (RoCE, iWARP, InfiniBand™) and Fibre Channel
• The first specification was released in June 2016
• The NVMe.org Fabrics Linux Driver WG is developing host and target drivers (a Linux target configuration sketch follows below)
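To make the target side concrete, here is a hedged sketch of exporting a local NVMe namespace with the Linux kernel NVMe-oF target (nvmet) through configfs. It assumes the nvmet and nvmet-rdma modules are loaded and root privileges; the NQN, backing device, and address are placeholders:

```python
#!/usr/bin/env python3
"""Hypothetical target-side sketch: export a local block device over NVMe-oF
(RDMA) via the Linux nvmet configfs interface. All names below are placeholders."""
import os

CFS = "/sys/kernel/config/nvmet"
NQN = "nqn.2016-06.io.example:subsystem1"   # placeholder subsystem NQN
DEV = "/dev/nvme0n1"                        # placeholder backing device
ADDR, PORT = "192.168.1.100", "4420"        # placeholder RDMA address and port

def write(path, value):
    with open(path, "w") as f:
        f.write(value)

# 1. Create the subsystem; for a lab setup, allow any host to connect.
subsys = os.path.join(CFS, "subsystems", NQN)
os.mkdir(subsys)
write(os.path.join(subsys, "attr_allow_any_host"), "1")

# 2. Back namespace 1 with the local block device and enable it.
ns = os.path.join(subsys, "namespaces", "1")
os.mkdir(ns)
write(os.path.join(ns, "device_path"), DEV)
write(os.path.join(ns, "enable"), "1")

# 3. Create an RDMA port and bind the subsystem to it.
port = os.path.join(CFS, "ports", "1")
os.mkdir(port)
write(os.path.join(port, "addr_trtype"), "rdma")
write(os.path.join(port, "addr_adrfam"), "ipv4")
write(os.path.join(port, "addr_traddr"), ADDR)
write(os.path.join(port, "addr_trsvcid"), PORT)
os.symlink(subsys, os.path.join(port, "subsystems", NQN))
```

A host can then discover and connect to this subsystem as shown in the earlier host-side sketch.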
NVMe Over Fabrics
Evolution of Non-Volatile Storage in the Data Center
End-to-End NVMe over Fabrics
Extends the efficiency of NVMe over both front-end and back-end fabrics, enabling an efficient end-to-end NVMe model (Host <-> NVMe PCIe SSD)
NVM Over Fabrics Advantages
• Industry standard interface (Multiple sources)
• Unlimited storage per server
• Scale storage independent of servers
• Highly efficient shared storage
• HA is straightforward
• Greater I/O performance
NVMe Multi-Fabric Transport Mapping
Fabric Message Based Transports
NVMe over Fabrics at Storage Tiers
End-to-End NVMe Model
• NVMe efficiencies scaled across entire fabric
Shared Server Flash - NVMe Storage
• RDMA support is required for the lowest latency
• Ethernet, InfiniBand, or OmniPath fabrics are possible
– InfiniBand and OmniPath support RDMA natively
– Ethernet has RoCE v1/v2, iWARP, and iSCSI RDMA options (see the device-check sketch after this list)
– iSCSI offload has built-in RDMA WRITE
• Disaster Recovery (DR) requires a MAN or WAN
– iWARP and iSCSI are the only options that support a MAN or WAN
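As a quick way to see which of these options a given Linux host actually has, the following hedged sketch lists the RDMA devices registered with the kernel and whether each runs over Ethernet (RoCE/iWARP) or native InfiniBand, by reading sysfs; it assumes the RDMA core drivers are loaded:

```python
#!/usr/bin/env python3
"""Hypothetical sketch: list RDMA-capable devices on a Linux host and report
their link layer (Ethernet for RoCE/iWARP NICs, InfiniBand for IB HCAs)."""
import glob
import os

for dev in sorted(glob.glob("/sys/class/infiniband/*")):
    name = os.path.basename(dev)
    for port in sorted(glob.glob(os.path.join(dev, "ports", "*"))):
        # link_layer reports "Ethernet" or "InfiniBand" for each port.
        with open(os.path.join(port, "link_layer")) as f:
            link = f.read().strip()
        print(f"{name} port {os.path.basename(port)}: {link}")
```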
NVMe Over Fabrics Products (Examples)
Arrays:
• Mangstor (NX6320/NX6325/NX6340) All-Flash Arrays
• EMC DSSD D5
Adapters:
• Chelsio Terminator 5
• QLogic FastLinQ QL45611HLCU 100Gb Intelligent Ethernet Adapter
Recap
• NVMe was built from the ground up to support a consistent
model for NVM interfaces, even across network fabrics
• Simplicity of the protocol enables hardware-automated I/O queues (NVMe transport bridge)
• No translation to or from another protocol like SCSI (in
firmware/software)
• Inherent parallelism of NVMe multiple I/O Queues is exposed
to the host
• NVMe commands and structures are transferred end-to-end
• Maintains the NVMe architecture across a range of fabric
types
Backup-1
Seagate SSD
1200.2 Series, SAS 12Gb/s
- Up to 210K RR IOPS and 25 DWPD
XM1400 Series, M.2 22110, PCIe G3 x4
- Up to 3 DWPD
XF1400 Series, U.2, PCIe G3 x4
- Up to 200K RR IOPS and 3 DWPD
XP6500 Series, AIC, PCIe G3 x8
- Up to 300K RR IOPS
XP7200 Series, AIC, PCIe G3 x16
- Up to 940K RR IOPS
XP6300 Series, AIC, PCIe G3 x8
- Up to 296K RR IOPS
Traditional Scale Out Storage
• Support for high-BW/IOPS NVMe preserves software investment, because it keeps existing software price/performance competitive
• Support for high-BW/IOPS NVMe realizes most of the NVMe speedup benefits
• Disaster Recovery (DR) requires a MAN or WAN
RDMA
• RDMA (Remote Direct Memory Access) enables one computer to access another's internal memory directly without involving the destination computer's operating system. The destination computer's network adapter moves data directly from the network into an area of application memory, bypassing the OS's own data buffers and network I/O stack. Consequently, the transfer is very fast. A downside is that no acknowledgement (ack) is sent back to the source computer to confirm that the transfer succeeded.
• There is no single general RDMA standard, so implementations are specific to particular servers, network adapters, operating systems, and applications. RDMA implementations exist for Linux and Windows Server 2012, which may use iWARP, RoCE, or InfiniBand as the carrying layer for the transfers.
iWARP - Internet Wide
Area RDMA Protocol
• iWARP (internet Wide Area RDMA Protocol) implements RDMA over Internet Protocol networks. It is layered on IETF-standard congestion-aware protocols such as TCP and SCTP, and uses a mix of layers, including DDP (Direct Data Placement), MPA (Marker PDU Aligned framing), and a separate RDMA protocol (RDMAP), to deliver RDMA services over TCP/IP. Because of this it is said to have lower throughput, higher latency, and higher CPU and memory utilisation than RoCE.
• For example: "Latency will be higher than RoCE (at least with both Chelsio and Intel/NetEffect implementations), but still well under 10 μs."
• Mellanox says no iWARP support is available at 25, 50, and 100Gbit/s Ethernet speeds. Chelsio says the IETF standard for RDMA is iWARP; it provides the same host interface as InfiniBand and is available in the same OpenFabrics Enterprise Distribution (OFED).
• Chelsio, which positions iWARP as an alternative to InfiniBand, says iWARP is the industry standard for RDMA over Ethernet. High-performance iWARP implementations are available and compete directly with InfiniBand in application benchmarks.
RoCE - RDMA over Converged Ethernet
• RoCE (RDMA over Converged Ethernet) allows remote direct memory access (RDMA) over an Ethernet network. It operates over layer 2 and layer 3 DCB-capable (DCB: Data Center Bridging) switches. Such switches comply with the IEEE 802.1 Data Center Bridging standard, a set of extensions to traditional Ethernet geared to providing a lossless data-centre transport layer that, Cisco says, helps enable the convergence of LANs and SANs onto a single unified fabric. DCB switches support the Fibre Channel over Ethernet (FCoE) networking protocol.
There are two versions:
• RoCE v1 uses the Ethernet protocol as its link layer and hence allows communication between any two hosts in the same Ethernet broadcast domain.
• RoCE v2 runs RDMA on top of UDP/IP and can therefore be routed.
Backup-2
facebook – Lightning
NVMe JBOF Architecture
Facebook Lightning Target
• Hot-plug. We want the NVMe JBOF to behave like a SAS JBOD when drives are replaced. We don't want to
follow the complicated procedure that traditional PCIe hot-plug requires. As a result, we need to be able to
robustly support surprise hot-removal and surprise hot-add without causing operating system hangs or crashes.
• Management. PCIe does not yet have an in-band enclosure and chassis management scheme like the SAS
ecosystem does. While this is coming, we chose to address this using a more traditional BMC approach, which can
be modified in the future as the ecosystem evolves.
• Signal integrity. The decision to maintain the separation of a PEB from the PDPB as well as supporting multiple
SSDs per “slot” results in some long PCIe connections through multiple connectors. Extensive simulations, layout
optimizations, and the use of low-loss but still low-cost PCB materials should allow us to achieve the bit error rate
requirements of PCIe without the use of redrivers/retimers or exotic PCB materials.
• External PCIe cables. We chose to keep the compute head node separate from the storage chassis, as this
gives us the flexibility to scale the compute-to-storage ratio as needed. It also allows us to use more powerful
CPUs, larger memory footprints, and faster network connections, all of which will be needed to take full advantage
of high-performance SSDs. As the existing PCIe cables are clunky and difficult to use, we chose to use mini-SAS HD
cables (SFF-8644). This also aligns with upcoming external cabling standards. We designed the cables such that
they include a full complement of PCIe side-band signals and a USB connection for an out-of-band management
connection.
• Power. Current 2.5" NVMe SSDs may consume up to 25W of power! This creates an unnecessary system
constraint, and we have chosen to limit the power consumption per slot to 14W. This aligns much better with the
switch oversubscription and our performance targets.
NVMe JBOF Benefits
• Manageability
• Flexibility
• Modularity
• Performance
Manageability of BMC
USB, I2C, Ethernet Out-of-band (OOB)
Flexibility in NVMe SSDs
Modularity in PCIe Switch
A common switch board for both trays
• Easily design new or different versions without modifying the rest of the infrastructure
Low IO-Watt Performance
Ultra-High I/O Performance
5X Throughput + 1200X IOPS
OCP All-Flash NVMe Storage
• 2U 60/30 NVMe SSDs
• Ultra-high IOPS and <10 µs latency
• PCIe 3.0 + U.2 or M.2 NVMe SSD support
• High density storage system with 60 SSDs (M.2)