2. Agenda
• What is NVM Express™?
• What's NVMe over Fabrics?
• Why NVMe over Fabrics?
• Expanding NVMe to Fabrics
• NVMe over Fabrics in the Data Center
• End-to-End NVMe over Fabrics
• NVMe Multi-Fabric Transport Mapping
• NVMe over Fabrics at Storage Tiers
• End-to-End NVMe Model
• Shared Server Flash
• NVMe over Fabrics Products (Examples)
• Recap
• Backup 1 and 2
3. What is NVM Express™?
• Industry standard for PCIe SSDs
• High-performance, low-latency, PCIe SSD interface
• Command set + PCIe register interface
• In-box NVMe host drivers for Linux, Windows, VMware, …
• Standard h/w drive form factors, mobile to enterprise
• NVMe community is 100+ companies strong and growing
• Learn more at nvmexpress.org
4. What’s NVMe over Fabrics?
• Nonvolatile Memory Express (NVMe) over
Fabrics is a technology specification designed
to enable NVMe message-based commands
to transfer data between a host computer
and a target solid-state storage device or
system over a network such as Ethernet,
Fibre Channel, and InfiniBand.
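As a concrete, editor-added illustration of this message-based model: an NVMe over Fabrics command capsule is essentially the fixed 64-byte NVMe submission queue entry (SQE), optionally followed by in-capsule data, carried as a single message over the fabric. The sketch below packs only a few of the real SQE fields and zero-fills the rest; it is a simplified model, not the full layout from the specification.

    import struct

    def build_capsule(opcode: int, command_id: int, nsid: int, data: bytes = b"") -> bytes:
        # Opcode (byte 0), flags (byte 1), command ID (bytes 2-3), and namespace ID
        # (bytes 4-7) follow the NVMe SQE layout; 0x40 in the flags selects
        # SGL-based data transfer, which fabrics commands use. The SGL descriptor
        # and command dwords a real command needs are left zeroed for brevity.
        sqe = struct.pack("<BBHI", opcode, 0x40, command_id, nsid).ljust(64, b"\x00")
        # A command capsule = SQE + optional in-capsule data, sent as one message.
        return sqe + data

    capsule = build_capsule(opcode=0x01, command_id=7, nsid=1, data=b"payload")  # 0x01 = NVM write
    print(len(capsule))  # 64-byte SQE plus 7 bytes of in-capsule data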
5. Why NVMe over Fabrics?
• End-to-End NVMe semantics across a range of topologies
– Retains NVMe efficiency and performance over network fabrics
– Eliminates unnecessary protocol translations
– Enables low-latency and high IOPS remote NVMe storage solutions
6. Expanding NVMe to Fabrics
• Built on common NVMe architecture with additional definitions to support
message-based NVMe operations
• Standardization of NVMe over a range of fabric types
• Initial fabrics: RDMA (RoCE, iWARP, InfiniBand™) and Fibre Channel
• The first specification was released in June 2016
• The NVMe.org Fabrics Linux Driver WG is developing host and target drivers (a connection example follows below)
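As an aside (not from the slides): with the Linux host driver and nvme-cli installed, discovering and connecting to a remote RDMA target typically looks like the following. The address and NQN shown are placeholders, and 4420 is the default NVMe-oF port.

    import subprocess

    TRADDR = "192.168.1.100"                    # target IP address (placeholder)
    TRSVCID = "4420"                            # default NVMe-oF port
    NQN = "nqn.2016-06.io.example:subsys1"      # target subsystem NQN (placeholder)

    # Ask the discovery controller which subsystems the target exports.
    subprocess.run(["nvme", "discover", "-t", "rdma", "-a", TRADDR, "-s", TRSVCID], check=True)

    # Connect; the remote namespaces then appear as local /dev/nvmeXnY block devices.
    subprocess.run(["nvme", "connect", "-t", "rdma", "-n", NQN, "-a", TRADDR, "-s", TRSVCID], check=True)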
8. End-to-End NVMe over Fabrics
Extends the efficiency of NVMe over front-end and back-end fabrics, enabling an efficient end-to-end NVMe model (host <-> NVMe PCIe SSD)
9. NVMe over Fabrics Advantages
• Industry-standard interface (multiple sources)
• Unlimited storage per server
• Scale storage independently of servers
• Highly efficient shared storage
• High availability (HA) is straightforward
• Greater I/O performance
13. Shared Server Flash - NVMe Storage
• RDMA support required for lowest latency
• Ethernet, InfiniBand, or OmniPath fabrics possible
  – IB and OmniPath support RDMA
  – Ethernet has RoCE v1/v2, iWARP, and iSCSI RDMA options
  – iSCSI offload has built-in RDMA WRITE
• Disaster Recovery (DR) requires MAN or WAN
  – iWARP and iSCSI are the only options that support MAN and WAN
15. Recap
• NVMe was built from the ground up to support a consistent
model for NVM interfaces, even across network fabrics
• Simplicity of the protocol enables hardware-automated I/O queues
  – NVMe transport bridge
• No translation to or from another protocol like SCSI (in firmware/software)
• The inherent parallelism of NVMe's multiple I/O queues is exposed to the host (see the queue-pair sketch after this slide)
• NVMe commands and structures are transferred end-to-end
• Maintains the NVMe architecture across a range of fabric
types
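To make the multi-queue point above concrete, here is a minimal, editor-added sketch of the per-core queue-pair idea: each CPU gets its own submission/completion queue pair, so cores submit I/O without locking each other, and over fabrics each pair maps onto its own transport connection (e.g. an RDMA queue pair). Names are illustrative, not taken from any driver.

    from collections import deque
    from dataclasses import dataclass, field
    import os

    @dataclass
    class QueuePair:
        """One NVMe submission/completion queue pair, typically pinned to a core."""
        sq: deque = field(default_factory=deque)   # submission queue
        cq: deque = field(default_factory=deque)   # completion queue

    # One queue pair per CPU keeps submission lock-free across cores; over
    # fabrics each pair would map to its own RDMA queue pair or connection.
    io_queues = {cpu: QueuePair() for cpu in range(os.cpu_count() or 1)}

    def submit(cpu: int, command: bytes) -> None:
        io_queues[cpu].sq.append(command)          # no contention with other cores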
17. Seagate SSD
1200.2 Series SAS 12Gb/s
- Up to 210K RR IOPS and 25 DWPD
XM1400 Series M.2 22110 PCIe Gen3 x4
- Up to 3 DWPD
XF1400 Series U.2 PCIe Gen3 x4
- Up to 200K RR IOPS and 3 DWPD
XP6500 Series AIC PCIe Gen3 x8
- Up to 300K RR IOPS
XP7200 Series AIC PCIe Gen3 x16
- Up to 940K RR IOPS
XP6300 Series AIC PCIe Gen3 x8
- Up to 296K RR IOPS
18. Traditional Scale Out Storage
• Support for high-BW/IOPS NVMe preserves software investment, because it keeps existing software price/performance competitive
• Support for high-BW/IOPS NVMe realizes most of the NVMe speedup benefits
• Disaster Recovery (DR) requires MAN or WAN
19. RDMA
• RDMA stands for Remote Direct Memory Access and enables one computer to access another's memory directly without involving the destination computer's operating system. The destination computer's network adapter moves data straight from the network into an area of application memory, bypassing the OS's own data buffers and network I/O stack, so the transfer is very fast. One consequence of this one-sided model is that the destination application receives no notification that the transfer has taken place; any such signalling has to be layered on top by the application. (A schematic sketch of a one-sided write follows below.)
• There is no single universal RDMA standard, so implementations are specific to particular network adapters, operating systems, and applications. RDMA implementations exist for Linux and Windows Server 2012 and may use iWARP, RoCE, or InfiniBand as the carrying layer for the transfers.
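The toy model below (editor-added, not a real verbs API) mimics the one-sided behaviour described in the first bullet: the target registers a memory region and hands out a remote key in advance, after which the initiator writes directly into that memory with no involvement from the target's OS or application.

    class MemoryRegion:
        """Target-side buffer registered for remote access (toy model)."""
        def __init__(self, size: int):
            self.buf = bytearray(size)
            self.rkey = 0x1234          # remote key shared with the initiator out of band

    class Initiator:
        """Initiator side of a one-sided RDMA WRITE (schematic only)."""
        def __init__(self, remote_mr: MemoryRegion, rkey: int):
            assert rkey == remote_mr.rkey   # a real NIC validates the rkey in hardware
            self.remote_mr = remote_mr

        def rdma_write(self, offset: int, data: bytes) -> None:
            # Data lands directly in the target's registered memory; the target
            # OS and application get no notification that this happened.
            self.remote_mr.buf[offset:offset + len(data)] = data

    mr = MemoryRegion(4096)
    Initiator(mr, rkey=0x1234).rdma_write(0, b"payload")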
20. iWARP - Internet Wide Area RDMA Protocol
• iWARP (internet Wide Area RDMA Protocol) implements RDMA over Internet Protocol networks. It is layered on IETF-standard, congestion-aware protocols such as TCP and SCTP, and uses a mix of layers, including DDP (Direct Data Placement), MPA (Marker PDU Aligned framing), and a separate RDMA protocol (RDMAP), to deliver RDMA services over TCP/IP (see the layering sketch below). Because of this it is said to have lower throughput and higher latency, and to require higher CPU and memory utilisation, than RoCE.
• For example: "Latency will be higher than RoCE (at least with both Chelsio and Intel/NetEffect implementations), but still well under 10 μs."
• Mellanox says no iWARP support is available at 25, 50, and 100 Gbit/s Ethernet speeds.
• Chelsio, which positions iWARP as an alternative to InfiniBand, says the IETF standard for RDMA is iWARP: it provides the same host interface as InfiniBand, is available in the same OpenFabrics Enterprise Distribution (OFED), and high-performance iWARP implementations compete directly with InfiniBand in application benchmarks.
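To make the layering in the first bullet concrete, the editor-added sketch below strings the iWARP layers together; the framing functions are illustrative stand-ins, not wire-accurate encoders.

    def rdmap(payload: bytes) -> bytes:
        return b"RDMAP|" + payload      # RDMA operations: read, write, send

    def ddp(msg: bytes) -> bytes:
        return b"DDP|" + msg            # direct placement info: where the data lands

    def mpa(seg: bytes) -> bytes:
        return b"MPA|" + seg            # markers/CRC so segments survive TCP re-streaming

    def tcp(frame: bytes) -> bytes:
        return b"TCP|" + frame          # standard, congestion-aware transport

    # RDMAP -> DDP -> MPA -> TCP/IP is the iWARP stack described above.
    wire = tcp(mpa(ddp(rdmap(b"app data"))))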
21. RoCE - RDMA over Converged Ethernet
• RoCE (RDMA over Converged Ethernet) allows remote direct
memory access (RDMA) over an Ethernet network. It operates over
layer 2 and layer 3 DCB-capable (DCB - Data Centre Bridging)
switches. Such switches comply with the IEEE 802.1 Data Center
Bridging standard, which is a set of extensions to traditional
Ethernet, geared to providing a lossless data centre transport layer
that, Cisco says, helps enable the convergence of LANs and SANs
onto a single unified fabric. DCB switches support the Fibre Channel
over Ethernet (FCoE) networking protocol.
There are two versions:
• RoCE v1 uses Ethernet as its link-layer protocol and hence allows communication between any two hosts in the same Ethernet broadcast domain.
• RoCE v2 runs RDMA on top of UDP/IP and can therefore be routed (see the encapsulation sketch below).
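As an editor-added illustration of the difference: the two versions differ only in what carries the InfiniBand transport payload. RoCE v1 rides directly in an Ethernet frame (ethertype 0x8915), so it cannot cross an IP router, while RoCE v2 wraps the same payload in UDP/IP (UDP destination port 4791), which is what makes it routable.

    def roce_v1_stack() -> list:
        # RoCE v1: IB transport payload directly in an Ethernet frame; L2-only scope.
        return ["Ethernet (ethertype 0x8915)", "IB transport headers", "payload"]

    def roce_v2_stack() -> list:
        # RoCE v2: the same payload wrapped in UDP/IP, so it can cross routers.
        return ["Ethernet", "IP", "UDP (dst port 4791)", "IB transport headers", "payload"]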
24. Facebook Lightning Target
• Hot-plug. We want the NVMe JBOF to behave like a SAS JBOD when drives are replaced. We don't want to
follow the complicated procedure that traditional PCIe hot-plug requires. As a result, we need to be able to
robustly support surprise hot-removal and surprise hot-add without causing operating system hangs or crashes.
• Management. PCIe does not yet have an in-band enclosure and chassis management scheme like the SAS
ecosystem does. While this is coming, we chose to address this using a more traditional BMC approach, which can
be modified in the future as the ecosystem evolves.
• Signal integrity. The decision to maintain the separation of a PEB from the PDPB as well as supporting multiple
SSDs per “slot” results in some long PCIe connections through multiple connectors. Extensive simulations, layout
optimizations, and the use of low-loss but still low-cost PCB materials should allow us to achieve the bit error rate
requirements of PCIe without the use of redrivers/retimers or exotic PCB materials.
• External PCIe cables. We chose to keep the compute head node separate from the storage chassis, as this
gives us the flexibility to scale the compute-to-storage ratio as needed. It also allows us to use more powerful
CPUs, larger memory footprints, and faster network connections all of which will be needed to take full advantage
of high-performance SSDs. As the existing PCIe cables are clunky and difficult to use, we chose to use mini-SAS HD
cables (SFF-8644). This also aligns with upcoming external cabling standards. We designed the cables such that
they include a full complement of PCIe side-band signals and a USB connection for an out-of-band management
connection.
• Power. Current 2.5" NVMe SSDs may consume up to 25W of power! This creates an unnecessary system
constraint, and we have chosen to limit the power consumption per slot to 14W. This aligns much better with the
switch oversubscription and our performance targets.
28. Modularity in PCIe Switch
A common switch board for both trays
• Easily design a new or different version without modifying the rest of the infrastructure
31. OCP All-Flash NVMe Storage
• 2U 60/30 NVMe SSDs
• Ultra-high IOPS and <10 µs latency
• PCIe 3.0 + U.2 or M.2 NVMe SSD support
• High-density storage system with 60 SSDs (M.2)