This document summarizes Cisco's UCS and usNIC technologies for high-performance computing. It discusses how UCS provides record-setting servers with large memory capacities, low-latency Ethernet networking, and centralized management. It then describes how usNIC allows direct userspace access to network interface cards for ultra-low latency by bypassing the operating system. Benchmarks show usNIC achieving sub-microsecond application-to-application latency.
UCS is Cisco’s x86 server line. It offers both blade and rack servers with a focus on manageability, virtualization, networking, and performance. It’s all designed to integrate smoothly with Cisco’s switching products. I’m really here to talk about usNIC, our low latency Ethernet solution for HPC.
N3K: 48 ports of 10 GbE, 12 ports of 40 GbE, 1RU
N6K: 384 ports of 10 GbE, or 96 ports of 40 GbE, 4RU
Many innovative features in UCS since we launched in 2009.
Simplifies deployment and management by cutting out specialized networks. Saves cost by reducing the number of expensive adapters that have to be plugged into each server, and the number of cables and switches that have to be purchased and installed.
usNIC lets customers finally take control of their HPC resources and save time, energy, and money by empowering IT to run the kind of compute clusters that, until now, only scientists and researchers have been operating. It also enables HPC on-demand: the same VIC that has already demonstrated world-record performance in the enterprise now delivers the speed HPC applications require, so customers can provision compute at will, from a single point, over a single network fabric.
The trick is in VLANs and QoS, which let you carve that single wire into separate slices.
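To make the slicing concrete, here is a rough sketch in plain C (nothing VIC- or NX-OS-specific, and the VLAN IDs and priority values below are invented for illustration) of the 802.1Q tag that does the carving: 12 bits of VLAN ID select the slice, and 3 PCP bits carry the QoS class.

    #include <stdint.h>
    #include <stdio.h>

    /* 802.1Q tag control information (TCI): how one physical wire is carved
     * into slices.
     *   PCP (3 bits)  - QoS priority class for the traffic on this slice
     *   DEI (1 bit)   - drop eligibility
     *   VID (12 bits) - VLAN ID, i.e. which slice the frame belongs to
     */
    static uint16_t make_tci(uint8_t pcp, uint8_t dei, uint16_t vid)
    {
        return (uint16_t)(((pcp & 0x7) << 13) | ((dei & 0x1) << 12) | (vid & 0xFFF));
    }

    int main(void)
    {
        /* Two hypothetical slices sharing the same 10 GbE wire: */
        uint16_t hpc  = make_tci(5, 0, 100); /* VLAN 100, high priority */
        uint16_t mgmt = make_tci(1, 0, 200); /* VLAN 200, low priority  */
        printf("HPC TCI: 0x%04x, mgmt TCI: 0x%04x\n", hpc, mgmt);
        return 0;
    }

On the wire, the switch's QoS policy maps those PCP values onto queues and bandwidth guarantees, which is how HPC, storage, and management traffic can share one cable without interfering.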
could poll the audience about Ethernet switch latencies
<Main point: Approximately 85% of the end-to-end latency is within the server, so let's tackle the big-ticket item>
<Click> Latency within the application depends on the application and the way it has been written and designed
<Click> The middleware layer is a big contributor as well, often taking approximately 20 µs
<Click> Kernel protocol processing is responsible for at least another 6 µs
<Click> The adapter itself adds between 3 and 6 µs, depending on the hardware vendor's design and implementation
<Click> Finally, the network elements between two servers can add up to 5 µs of latency per hop
The breakdown of these latency elements shows that approximately 85% of the latency, not counting the application latency itself, is within the server. The network contributes only 15% of the total end-to-end application latency. At Cisco, our target is to reduce the overall latency, and we are taking a holistic view in our approach.
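Rough arithmetic behind that claim, assuming a single switch hop and taking the midpoint of the adapter range: middleware ~20 µs + kernel ~6 µs + adapter ~4.5 µs ≈ 30.5 µs inside the server, versus ~5 µs in the network, and 30.5 / 35.5 ≈ 86%, which is where the ~85% figure comes from (application time excluded).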
All over *standard* Ethernet (though the VIC is required).
VT-d: Virtualization Technology for Directed I/O
IOMMU: Input/Output Memory Management Unit
SR-IOV: Single Root I/O Virtualization
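For flavor only, here is a minimal sketch of what OS bypass looks like from userspace. It uses the generic Linux verbs API rather than anything usNIC-specific, so treat the calls as illustrative of the idea: SR-IOV gives the process its own slice of the adapter, and VT-d/IOMMU lets the adapter DMA safely into a buffer the process has registered, with no per-message system calls.

    #include <stdio.h>
    #include <stdlib.h>
    #include <infiniband/verbs.h>

    /* Generic userspace device access sketch (error checks mostly elided). */
    int main(void)
    {
        int num = 0;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) { fprintf(stderr, "no devices\n"); return 1; }

        struct ibv_context *ctx = ibv_open_device(devs[0]); /* map the HW queues */
        struct ibv_pd *pd = ibv_alloc_pd(ctx);              /* protection domain */

        /* Register a buffer: the IOMMU lets the NIC DMA straight into it. */
        char *buf = malloc(4096);
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096, IBV_ACCESS_LOCAL_WRITE);

        /* From here, sends and receives are posted to queues mapped into this
         * process: no kernel protocol stack in the data path. */
        printf("opened %s, registered 4 KB at %p (lkey 0x%x)\n",
               ibv_get_device_name(devs[0]), (void *)buf, mr->lkey);

        ibv_dereg_mr(mr); free(buf); ibv_dealloc_pd(pd);
        ibv_close_device(ctx); ibv_free_device_list(devs);
        return 0;
    }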
Measurements taken on E5-2690 0 @ 2.90 GHz CPUs (Sandy Bridge) with Icehouse 40 GbE cards (PCIe Gen2, x16)
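For context on how numbers like these are usually taken: application-to-application latency is conventionally reported as half the round-trip time of a small-message ping-pong. A minimal MPI version of that measurement (not the exact harness behind these results) looks roughly like this:

    #include <mpi.h>
    #include <stdio.h>

    /* Small-message ping-pong between ranks 0 and 1; half of the average
     * round-trip time is the app-to-app latency. Illustrative only. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int iters = 10000;
        char byte = 0;
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("latency: %.2f usec\n", (t1 - t0) / iters / 2.0 * 1e6);

        MPI_Finalize();
        return 0;
    }

Run it with two ranks placed on two different servers so the message actually crosses the network.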