Titan IC presented "ODSA Use Case - SmartNIC" at the ODSA Workshop. The charter of the ODSA (Open Domain-Specific Architecture) Workgroup is to define an open specification for building Domain Specific Accelerator silicon from best-of-breed industry components, made available as chiplet dies that can be integrated like Lego blocks on an organic substrate packaging layer. The resulting multi-chip module (MCM) silicon can be produced at significantly lower development and manufacturing costs, and will deliver much-needed performance-per-watt and performance-per-dollar efficiencies in networking, security, machine learning and other applications. The ODSA Workgroup also intends to deliver implementations of the specification as board-level prototypes, RTL code and libraries.
ODSA Use Case - SmartNIC
1. Sakir Sezer – CTO
29 January 2019, at GlobalFoundries, Santa Clara
ODSA use case – Smart NIC
Harnessing Domain Specific Acceleration at the Datacenter
2. What is a Smart NIC
• Support of baseline NIC functions and features, such as MAC and L3/L4
packet filtering and forwarding
• “Smartness” of a NIC implies its capability of making semi-autonomous
decisions based on IP traffic
• However, “smartness” will NOT enable any critical advantage for
networking or the host if major workload and network related
responsibilities cannot be effectively offloaded from the host.
• Offloaded workload does not necessarily have to be networking
related, but must be network enabled, so that tasks executed on the
NIC deliver a performance and/or feature advantage that would
otherwise be too expensive or impossible to execute on the local host.
3. Key Features Defining a Smart NIC
• High-level programmability of the NIC enabling in-field
customisation and extension of NIC features.
• On-device processing of upper-layer functions (up to layer 7)
and applications that will enable localized decision making,
critical for services, networking and security.
• On-device acceleration for offloading heavy-duty tasks on
network traffic before it is forwarded to the host or to the
network, such as encryption, switching, inspection etc.
4. Generic “Smart NIC” Architecture
[Figure: block diagram of a generic Smart NIC built around a Network on Chip. Block labels:]
• Network Interface: nxGigE PHY, SERDES
• Standard NIC Functions: MAC + IP header processing
• Accelerated Flow Processing (flow classification/tracking, firewall/ACL, NAT, etc.)
• Domain Specific Accelerators (DPI, crypto (ECC/AES), compression, TCP offload)
• Embedded General Purpose Processing: ARM, RISC-V
• Embedded Memory (fast storage), Last Level Cache (LLC)
• External Memory Interface: HMC or DDR4
• External Accelerator Interface (PCIe / CCIX): AI/ML chip
• Storage: SSD (NVMe)
• Standard Host Interface: PCIe Gen4
5. Traditional vs Smart NIC approach
for Domain Specific Offload Acceleration
Traditional NIC with separate offload PCIe card
based on GPU or FPGA (e.g. AWS F1)
• Highly constrained by the host interface acting as a “bridge”
between the network and the accelerators
[Figure labels: Host (2 x Xeon), DDR4 (64-bit), NIC, SSD storage, PCIe accelerator (searching, computing, AI/ML; GPU / FPGA), all attached over PCIe]
Smart NIC with offload acceleration
without involving the host
• Accelerators can access and process data directly
from the network before forwarding the original data
(and/or any extracted information) to the host,
e.g. Storage, Security, OVS, AI/ML
[Figure labels: Smart NIC with embedded processing (ARM/RISC-V) & accelerator, accelerators (AI/ML, FPGA), NVMe SSD storage, DDR4 (64-bit), Host (2 x Xeon) attached over PCIe]
6. What is the Titan IC Regular Expression Processor
• Our RXP (Regular Expression Processor) is a programmable, custom-purpose content
processor for high-speed pattern matching, supporting PCRE/POSIX regular expressions
• Optimised for matching a large number of regex rules in parallel
• Scalable single regex processor core capable of supporting beyond 100Gb/s pattern
matching bandwidth
• Rich set of software support: compiler, API, etc.
• Customisable for target applications: Memory, Performance, Footprint, Power (ASIC)
• Complex regex-based pattern matching for:
– Traditional (ACL) and NextGen Firewall (DPI, Intrusion Detection/Prevention
(IDS/IPS)), e.g. Snort
– Application Recognition, Protocol Recognition
– Application Firewall, detection of SQL injection, Application DoS
– SDN rule lookup/matching (Multi-Table), …
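As a software analogy for matching many regex rules in one pass over the data, the sketch below compiles a small rule set into a single matcher and reports which rule fired at which offset. It is not the RXP compiler or API: the rule names and patterns are hypothetical, and Python's `re` alternation only stands in for the idea of scanning a payload once against a whole rule set.

```python
import re

# Hypothetical rule set (rule name -> regex); not taken from any real ruleset.
RULES = {
    "sql_injection": r"(?i:union\s+select)",
    "path_traversal": r"\.\./\.\./",
    "http_get": r"GET\s+/\S*\s+HTTP/1\.[01]",
}

def compile_rules(rules):
    """Combine all rules into one pattern; each rule becomes a named group."""
    parts = [f"(?P<{name}>{pattern})" for name, pattern in rules.items()]
    return re.compile("|".join(parts))

def scan(matcher, payload):
    """Scan the payload once and return (rule name, offset) for every hit."""
    return [(m.lastgroup, m.start()) for m in matcher.finditer(payload)]

matcher = compile_rules(RULES)
payload = "GET /a?id=1%20 HTTP/1.1 x UNION  SELECT y ../../etc"
for rule, offset in scan(matcher, payload):
    print(f"{rule} matched at offset {offset}")
```

One limitation of this sketch: alternation reports at most one rule per position and skips overlapping matches, so it only approximates true parallel multi-rule matching.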
7. Titan IC - 100Gb/s RXP Processor
Parameter              Value
Data width             128-bit
Clock frequency        800 MHz
Prefix capacity        16K
Number of clusters     8
TCM:CACHE              2K:2K
Total memory           27,132,864 bits
Memory macro area      14.628 mm²
Standard cell area     0.935 mm²
Total post P&R area    19.665 mm²
Power                  4.55 W
Technology: GlobalFoundries, 28nm HPP
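The 100Gb/s headline can be sanity-checked from the data width and clock frequency in the table. The sketch below assumes one data word is consumed per clock cycle, which the slide does not state, and also converts the total memory figure to MiB for readability.

```python
def datapath_gbps(width_bits: int, clock_mhz: float) -> float:
    """Raw datapath bandwidth in Gb/s, assuming one word per clock cycle."""
    return width_bits * clock_mhz * 1e6 / 1e9

def bits_to_mib(bits: int) -> float:
    """Convert a bit count to mebibytes (2**20 bytes)."""
    return bits / 8 / 2**20

print(f"Raw datapath bandwidth: {datapath_gbps(128, 800):.1f} Gb/s")
print(f"Total memory: {bits_to_mib(27_132_864):.2f} MiB")
```

Under that assumption, 128 bits at 800 MHz gives 102.4 Gb/s of raw datapath bandwidth, in line with the 100Gb/s figure in the slide title.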
8. Centralised vs Smart NIC based Network Security
Centralised
[Figure labels: Switch, four NICs, “Middle Box” Security Appliance]
“Middle Box” Security Appliance: physical, or virtualised
as NFV or AWS Virtual Appliance
Smart NIC based
[Figure labels: Security Management, Switch, four Smart NICs, each with an embedded security function (SEC)]
Security is an embedded function and integral part of
a NIC, customised for the applications on the server
Key Advantages
- Distributed, inherently resilient
- No single point of failure
- Smaller attack surface
- Tailored to the application
- Fully virtualizable without
the compute overhead (Advanced NFV)
9. Snort Use-Case For RXP
Rule
alert tcp $EXTERNAL_NET any -> $HOME_NET 1978
(msg:"APP-DETECT Apple OSX Remote Mouse usage";
flow:to_server,established;
content:"mos "; fast_pattern:only;
pcre:"/moss{2}dmsd/";
reference:url,pastebin.com/F81NCiYE;
classtype:policy-violation; sid:20443; rev:2;)
Expected Snort Performance: 4 to 5 x better
performance with Content Scanning Offload
[Figure labels: ARM, Core-1, Snort Application, DPDK Framework, Fast Pattern Rules, PCRE Rule, Match, Job, RXP Plugin, RXP API, RXP I/O]
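The rule above pairs a literal fast pattern (`content`) with a more expensive `pcre` check. A minimal sketch of that two-stage idea follows: a cheap substring test filters payloads, and only survivors reach the regex. The literal `mos ` comes from the rule's `content` option; the regex and the sample payloads are illustrative stand-ins, not the rule's actual PCRE or real traffic, and the code is not Snort's implementation.

```python
import re

FAST_PATTERN = b"mos "                    # literal from the rule's content option
SLOW_REGEX = re.compile(rb"mos\s{2}\d")   # illustrative stand-in for the pcre option

def inspect(payload: bytes) -> bool:
    """Return True only if both the fast pattern and the regex match."""
    if FAST_PATTERN not in payload:       # stage 1: cheap literal prefilter
        return False
    return SLOW_REGEX.search(payload) is not None  # stage 2: full regex

for sample in (b"mos  5m 1 2", b"mos x", b"hello"):
    print(sample, inspect(sample))
```

In a content scanning offload, stage 2 is the part a regex engine such as the RXP would take over, leaving the host with the verdicts.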
10. Smart NIC HW accelerated IDS/IPS
• Asynchronous operation
- Supporting multiple packets in-flight
• Multithreading
- Sub-blocks can be an independent HW offload or SW thread
[Figure: packet pipeline with multiple packets in flight. Stages: Packet Acquisition, Packet Decoding, Packet Pre-Processors, Inspection, Output. Labels: HW-NIC, HW Packet Processing, SW/SW Thread, SW Thread, SW Thread, Titan IC RXP (HW Plug-in)]
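The asynchronous, multiple-packets-in-flight operation described above can be sketched with a job queue and several concurrent workers. Everything here is an assumption for illustration: `offload_inspect` simulates a hardware job with a short sleep, the `b"attack"` check is a placeholder for real inspection, and the function names are hypothetical rather than part of any Titan IC or Snort API.

```python
import asyncio

async def offload_inspect(payload: bytes) -> bool:
    # Stand-in for a hardware job: the await yields the CPU while the
    # (simulated) accelerator works, so other packets can make progress.
    await asyncio.sleep(0.01)
    return b"attack" in payload

async def inspection_worker(jobs: asyncio.Queue, verdicts: dict) -> None:
    while True:
        item = await jobs.get()
        if item is None:                 # sentinel: no more packets
            return
        seq, payload = item
        verdicts[seq] = await offload_inspect(payload)

async def run_pipeline(packets, in_flight=4):
    jobs = asyncio.Queue()
    verdicts = {}
    # Each worker keeps one job outstanding, so up to in_flight packets overlap.
    workers = [asyncio.create_task(inspection_worker(jobs, verdicts))
               for _ in range(in_flight)]
    for seq, payload in enumerate(packets):   # stand-in for acquisition/decoding
        jobs.put_nowait((seq, payload))
    for _ in workers:
        jobs.put_nowait(None)
    await asyncio.gather(*workers)
    return [verdicts[seq] for seq in range(len(packets))]  # output in packet order

print(asyncio.run(run_pipeline([b"ok", b"an attack", b"fine"], in_flight=2)))
```

Sequence numbers let the output stage restore packet order even when jobs complete out of order.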
12. Design Considerations
• Standard driver stack for on-device communication
– Open source and part of the standard platform for on-chip embedded
processors and external (PCIe) hosts
• On-device communication (NOC)
– Scalable bandwidth
– Low latency
– Technology independent high-speed on-chip interface (28nm TSMC <-> 14 nm GF)
• Low-latency high-bandwidth external memory access, preferably
with embedded LLC
14. Low-level Drivers and API - Kernel vs Userspace
[Figure: driver stack split into Userspace and Kernel. Labels:]
• Application
• RXP Userspace API
• RXP driver common functions (code common across all platforms)
• Platform specific drivers: Hyperion PCIe driver, AWS F1 EDMA/PCIe driver, ODSA NOC driver
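The layering in this slide, common code on top of interchangeable platform specific drivers, can be sketched as an abstract transport interface. All class names, the loopback transport and the 8-byte job framing below are hypothetical and chosen only to show the separation; they do not describe the actual RXP driver.

```python
from abc import ABC, abstractmethod

class PlatformDriver(ABC):
    """Platform specific layer: only knows how to move bytes."""
    @abstractmethod
    def send(self, data: bytes) -> None: ...
    @abstractmethod
    def receive(self) -> bytes: ...

class LoopbackDriver(PlatformDriver):
    """Test double standing in for a PCIe, EDMA/PCIe or NOC driver."""
    def __init__(self, name: str):
        self.name = name
        self._fifo = []
    def send(self, data: bytes) -> None:
        self._fifo.append(data)
    def receive(self) -> bytes:
        return self._fifo.pop(0)

class RxpCommon:
    """Common layer: frames jobs identically on every platform (hypothetical framing)."""
    def __init__(self, driver: PlatformDriver):
        self.driver = driver
    def submit_job(self, job_id: int, payload: bytes) -> None:
        header = job_id.to_bytes(4, "little") + len(payload).to_bytes(4, "little")
        self.driver.send(header + payload)
    def fetch(self):
        raw = self.driver.receive()
        job_id = int.from_bytes(raw[:4], "little")
        length = int.from_bytes(raw[4:8], "little")
        return job_id, raw[8:8 + length]

for platform in ("hyperion_pcie", "aws_f1_edma", "odsa_noc"):
    api = RxpCommon(LoopbackDriver(platform))
    api.submit_job(7, b"abc")
    print(platform, api.fetch())
```

Porting to a new platform in this sketch means writing one more `PlatformDriver` subclass while `RxpCommon` stays untouched.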
15. ODSA Ref Model - RXP Native Interface
• Data Plane,
• Control Plane,
• Programming Plane,
• External (Shared) Memory
[Figure labels: NOC Fabric, Comm Agents, N x ARM A72 Cores (three clusters), N x RISC Cores, RXP Native Interface, Shim Layer]
Shim Layer Interface, could be
AWS F1 type I/O Shell Architecture
16. External Memory
• External memory architecture underpins overall system
performance and requires a CPU-centric approach
• Memory management is essential to deal with:
– Effective utilization of memory resources
– Software integration
• Most high-performance use-cases require cache deployment
– LLC tightly coupled to external DDR and to embedded processor
• Simple external private memory integration
17. Reducing Complexity
[Figure: chiplet-based system built around a Network on Chip. Labels:]
• Host (x86), Host Interface PCIe
• N x ARM A72 Cores (three clusters), N x RISC Cores
• LLC, Internal Bus
• DDR4 / HMC Controllers, DDR4 or HMC memory
• Accelerator (RXP), Private Mem Controller
• Comm Agents
• Chiplets; other accelerator or interface chiplets
18. In Summary
• Exciting new opportunities for rapid development and
deployment of high-performance and highly tailored solutions
• To reduce redesign impact on SW (drivers and device-specific API),
interfaces may have to be more restrictive than in state-of-the-art
NOC-type deployments.
• NOC interface and interface adaptation layer (shim) must be
provided for various use-cases (open source Verilog/System-C)
• Tools generating low-level drivers (pre-allocated common and
custom register maps) will reduce software integration effort and
cost, enabling independent third-party software development
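As an illustration of the last point, a driver-generation tool could start from a register map description and emit the low-level definitions automatically. The sketch below generates C `#define` lines from a small table; the register names, offsets and base address are invented for the example and do not describe any real device or ODSA tool.

```python
# Hypothetical register map: (name, byte offset, description).
REGISTER_MAP = [
    ("CTRL", 0x00, "control register"),
    ("STATUS", 0x04, "status register"),
    ("JOB_ADDR", 0x08, "job descriptor address"),
]

def generate_header(block: str, base: int, registers) -> str:
    """Emit a C header fragment with base address and register offsets."""
    lines = [
        f"/* Generated register map for {block}; do not edit by hand. */",
        f"#define {block}_BASE 0x{base:08X}u",
    ]
    for name, offset, description in registers:
        lines.append(f"#define {block}_{name}_OFFSET 0x{offset:02X}u  /* {description} */")
    return "\n".join(lines) + "\n"

print(generate_header("RXP", 0x40000000, REGISTER_MAP))
```

Keeping the register map as data means the same description could also drive documentation or test generation, which is one way such tooling might reduce integration effort.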