SlideShare une entreprise Scribd logo
1  sur  21
Télécharger pour lire hors ligne
HKG18-110
net_mdev: Fast-path userspace I/O
Ilias Apalodimas
Mykyta Iziumtsev
François-Frédéric Ozog
Why userland I/O
● Time sensitive networking
○ Developed mostly for Industrial IOT, automotive and audio/video applications requiring low
latency and jitter
○ Some of applications need 1 microsecond latency
○ Adapter-adapter latency across 5 cut-through switches can be 1 microsecond
○ Adapter-application latency with 500MHz-1Ghz processor: 20-40µs, jitter 200-600µs!
○ Adopted by OpenAvnu for Intel i210 (without IOMMU support)
● High speed networks
○ Kernel can’t do line rate on 40Gbit
○ VFIO-PCI already does that but unmaps the netdevice from the kernel and needs to be
reprogrammed from the userspace
Why net_mdev
● Based on MDEV VFIO framework, already a part of the linux kernel (mdev
since 4.10)
● Uses IOMMU/SMMU on supported platforms to isolate devices on respective
domains
● Greatly simplifies drivers in userspace.
○ VFIO-PCI/UIO needs 30k-50k LOC (userland driver is usually a copy of the kernel driver)
○ MDEV needs ~1500 LOC for a full driver
○ Don’t have to deal with complex hardware revisions, erratas, device resets etc. Kernel is still in
charge
○ Kernel driver changes required to enable mdev are usually less than 500 LOC
● Can expand to much more than just ethernet (Huawei is having a similar
approach for crypto, Intel is doing in on i915 GPU)
Why net_mdev
● Can support zero-copy
● Best of both worlds, you get to keep existing userspace tools
○ ifconfig/ethtool/iproute2 continue to work
○ tcpdump can be easily ported
● Freedom to use “vendor specific” memory schemes
○ Intel uses “slot mode”
○ Chelsio is using “tape mode” and coalesces Tx
● Bus agnostic (PCIe, DPAA etc)
Goals and Limits
● Any IO model usable by DPDK, ODP, VPP, any other app (AF_XDP is limited to
one model at the moment, check
https://www.spinics.net/lists/netdev/msg481758.html)
● Line rate even on 100Gbps (148Mpps)
● IOMMU/SMMU is a must when possible
● Expand to more than NICs (upstreamed mdev is used on i915 gpu)
● Provide a user-space API
● Limit to cache-coherent hardware (not all architectures support unprivileged
cache flushing/invalidating and doing syscalls for syncing will kill performance)
Code statistics
● Current implementation does not support all hardware features
NIC Original Kernel Userspace
Chelsio
T4/T5(10/40Gbit)
48KLOC 550LOC 950LOC
XL710(40Gbit) 52KLOC 400LOC 650LOC
e1000e(1Gbit) 30KLOC 250LOC 600LOC
Performance
● Rx not optimized yet (same numbers on 1-8 cpus)
● All results were achieved using 4 cpus (3 for Intel)
● XL710 and Chelsio T4 tested on core-i5. Chelsio T5 on Xeon
NIC Speed Rx Tx Line rate (64b
packets)
Chelsio T4 10Gbit 8Mpps 13Mpps 14.88Mpps
Chelsio T5 40Gbit 10.3Mpps 48Mpps 59.52Mpps
Intel XL 710 40Gbit 19Mpps 41.55Mpps 59.52Mpps
Architecture
ODP/VPP/DPDK
MDEV lib
User-space
R
X
M
M
I
O
T
X
Kernel
Kernel driver
Key differences
Kernel
User-space
SKBs
Userspace
buffers
Kernel
User-space
NOW
VFIO-PCI
Key differences
Userspace
Buffers (will
change)
User-space
Kernel
MDEV
IOMMU/SMMU/VFIO basics
● Intel(IOMMU)/Arm(SMMU)
● IOMMU group = VFIO group (R/O, topology dependant, programmed at boot
time)
● IOMMU domains = VFIO “container” (multiple domains can exist in a container)
● Groups are added to containers, ending up on the same domain
VA
HPA /
PA
IOVA /
Bus address
Operational overview (kernel)
Allocate Rx/Tx
descriptors
Add MDEV
regions
Wait for
transition
Transition received
Set NIC to
“transition”
status
Block Kernel
Tx path/Rx irq
Receive
fresh
userspace
buffers
Packet flow
starts
Traffic flowing through
kernel until we receive
“transition”
Operational walkthrough (kernel)
● Load driver with enable parameter net_mdev=1
● mdev_add_essential(): Added on each NIC driver
○ Inventory memory regions to be mapped in user-space (Rx/Tx
descriptors arrays, MMIO for doorbells). Each region is exported
using struct vfio_region_info_cap_type from the VFIO-API
○ Currently supported regions are VFIO_NET_MDEV_MMIO,
VFIO_NET_MDEV_RX_RING, VFIO_NET_MDEV_TX_RING,
VFIO_NET_MDEV_RX_BUFFER_POOL
● VFIO-MDEV creates control files in sysfs
● I/O is handled from the kernel at this point
Operational walkthrough (kernel)
On Transition
● Graceful rx/tx shutdown: netif_tx_stop_all_queues
● Keep carrier up if possible
● VFIO-MDEV module sets IFF_NET_MDEV flag
● Set hardware in known state called “transition state” in the diagram (hardware
dependent, from clear producer/consumer indexes to full reset at hw level)
Operational walkthrough (kernel)
● Set RX interrupts according to polling strategy. Using the IFF_NET_MDEV flag
we can intercept the kernel interrupt handler and redirect it to the userspace
with eventfd or similar functionality
● Allocate new Rx/Tx buffers and memory map them to the user-space
● This is planned to change, user-space will do the allocation, VFIO framework
will map it to the hardware
● Kernel can’t do I/O but is still in charge of the device
Operational overview (user-space)
Add groups to
VFIO
containers
Discover NIC
regions &
lengths
Iterate over regions and
map them
Inform NIC of
fresh buffers
Transition
complete
Packet flow
starts
MDEV
create
Operational walkthrough (user-space)
● echo $dev_uuid >
/sys/class/net/$intf/device/mdev_supported_types/$sys_drv_name/create
● VFIO_GROUP_GET_STATUS, Test the group is viable and available
● VFIO_SET_CONTAINER, Add the group to the container
○ This adds the device on the proper IOMMU domain
● VFIO_SET_IOMMU, Enable the IOMMU model we want
● VFIO_DEVICE_GET_INFO discover device type (PCI…) and regions
(Rx/Tx/MMIO)
● VFIO_DEVICE_GET_REGION_INFO get type, size and mmap each device
region
Operational walkthrough (user-space)
● Packet memory preparation
● Packet arrays or unstructured memory areas allocation
● ioctl VFIO_IOMMU_MAP_DMA with mapping parameters (BIDIRECTIONAL…).
● hardware update: hardware specific (populate memory area, ring doorbells)
● Signal transition finished (ioctl), kernel does whatever it needs to resume
operation
Future development
● Allocate memory from userspace and map it to IOMMU
○ Will require small changes to VFIO API
○ Need to be able to assign devices to domains without detaching
them from the kernel
● Expand net_mdev to more than NICs
● DCA (Direct Cache Access) / DDIO (Direct Data I/O) usage
● Check Arm SMMUv3/SVM RFC (Mellanox has PASID
NICs instead of SR-IOV)
● Collaborate with Huawei. Wrapdrive has a similar
approach for crypto devices
Resources
● POC driver for affordable e1000e NIC
● MDEV linux modifications
https://github.com/apalos/odp-linux-mdev
● MDEV in ODP
https://github.com/Linaro/odp/tree/caterpillar
Thank You
#HKG18
HKG18 keynotes and videos on: connect.linaro.org
For further information: www.linaro.org

Contenu connexe

Tendances

BUD17-309: IRQ prediction
BUD17-309: IRQ prediction BUD17-309: IRQ prediction
BUD17-309: IRQ prediction
Linaro
 
The FlexTiles Development Platform offers Dual FPGA for 3D SoC Prototyping
The FlexTiles Development Platform offers Dual FPGA for 3D SoC PrototypingThe FlexTiles Development Platform offers Dual FPGA for 3D SoC Prototyping
The FlexTiles Development Platform offers Dual FPGA for 3D SoC Prototyping
FlexTiles Team
 
Embedded c & working with avr studio
Embedded c & working with avr studioEmbedded c & working with avr studio
Embedded c & working with avr studio
Nitesh Singh
 

Tendances (20)

PX4 Seminar 03
PX4 Seminar 03PX4 Seminar 03
PX4 Seminar 03
 
Architecture of TPU, GPU and CPU
Architecture of TPU, GPU and CPUArchitecture of TPU, GPU and CPU
Architecture of TPU, GPU and CPU
 
PX4 Seminar 01
PX4 Seminar 01PX4 Seminar 01
PX4 Seminar 01
 
Linux Conference Australia 2018 : Device Tree, past, present, future
Linux Conference Australia 2018 : Device Tree, past, present, futureLinux Conference Australia 2018 : Device Tree, past, present, future
Linux Conference Australia 2018 : Device Tree, past, present, future
 
BUD17-309: IRQ prediction
BUD17-309: IRQ prediction BUD17-309: IRQ prediction
BUD17-309: IRQ prediction
 
LAS16-200: Firmware summit - Tianocore Progress and Status
LAS16-200:  Firmware summit - Tianocore Progress and StatusLAS16-200:  Firmware summit - Tianocore Progress and Status
LAS16-200: Firmware summit - Tianocore Progress and Status
 
The FlexTiles Development Platform offers Dual FPGA for 3D SoC Prototyping
The FlexTiles Development Platform offers Dual FPGA for 3D SoC PrototypingThe FlexTiles Development Platform offers Dual FPGA for 3D SoC Prototyping
The FlexTiles Development Platform offers Dual FPGA for 3D SoC Prototyping
 
Autonomous Drones Architecture - Initial proposal
Autonomous Drones Architecture - Initial proposalAutonomous Drones Architecture - Initial proposal
Autonomous Drones Architecture - Initial proposal
 
Deploy STM32 family on Zephyr - SFO17-102
Deploy STM32 family on Zephyr - SFO17-102Deploy STM32 family on Zephyr - SFO17-102
Deploy STM32 family on Zephyr - SFO17-102
 
Pc 104 express w. virtex 5-2014_5
Pc 104 express w. virtex 5-2014_5Pc 104 express w. virtex 5-2014_5
Pc 104 express w. virtex 5-2014_5
 
ELC North America 2021 Introduction to pin muxing and gpio control under linux
ELC  North America 2021 Introduction to pin muxing and gpio control under linuxELC  North America 2021 Introduction to pin muxing and gpio control under linux
ELC North America 2021 Introduction to pin muxing and gpio control under linux
 
PX4 Setup Workshop
PX4 Setup WorkshopPX4 Setup Workshop
PX4 Setup Workshop
 
BKK16-304 The State of GDB on AArch64
BKK16-304 The State of GDB on AArch64BKK16-304 The State of GDB on AArch64
BKK16-304 The State of GDB on AArch64
 
Asymmetric Multiprocessing - Kynetics ELC 2018 portland
Asymmetric Multiprocessing - Kynetics ELC 2018 portlandAsymmetric Multiprocessing - Kynetics ELC 2018 portland
Asymmetric Multiprocessing - Kynetics ELC 2018 portland
 
Reliability, Availability, and Serviceability (RAS) on ARM64 status - SFO17-203
Reliability, Availability, and Serviceability (RAS) on ARM64 status - SFO17-203Reliability, Availability, and Serviceability (RAS) on ARM64 status - SFO17-203
Reliability, Availability, and Serviceability (RAS) on ARM64 status - SFO17-203
 
Embedded platform choices
Embedded platform choicesEmbedded platform choices
Embedded platform choices
 
Heterogeneous multiprocessing on androd and i.mx7
Heterogeneous multiprocessing on androd and i.mx7Heterogeneous multiprocessing on androd and i.mx7
Heterogeneous multiprocessing on androd and i.mx7
 
LAS16-300: Mini Conference 2 Cortex-M Software - Device Configuration
LAS16-300: Mini Conference 2 Cortex-M Software - Device ConfigurationLAS16-300: Mini Conference 2 Cortex-M Software - Device Configuration
LAS16-300: Mini Conference 2 Cortex-M Software - Device Configuration
 
AAME ARM Techcon2013 006v02 Implementation Diversity
AAME ARM Techcon2013 006v02 Implementation DiversityAAME ARM Techcon2013 006v02 Implementation Diversity
AAME ARM Techcon2013 006v02 Implementation Diversity
 
Embedded c & working with avr studio
Embedded c & working with avr studioEmbedded c & working with avr studio
Embedded c & working with avr studio
 

Similaire à HKG18-110 - net_mdev: Fast path user space I/O

LF_DPDK17_mediated devices: better userland IO
LF_DPDK17_mediated devices: better userland IOLF_DPDK17_mediated devices: better userland IO
LF_DPDK17_mediated devices: better userland IO
LF_DPDK
 
Ceph Day Shanghai - On the Productization Practice of Ceph
Ceph Day Shanghai - On the Productization Practice of Ceph Ceph Day Shanghai - On the Productization Practice of Ceph
Ceph Day Shanghai - On the Productization Practice of Ceph
Ceph Community
 

Similaire à HKG18-110 - net_mdev: Fast path user space I/O (20)

LF_DPDK17_mediated devices: better userland IO
LF_DPDK17_mediated devices: better userland IOLF_DPDK17_mediated devices: better userland IO
LF_DPDK17_mediated devices: better userland IO
 
[OpenStack Day in Korea 2015] Track 1-6 - 갈라파고스의 이구아나, 인프라에 오픈소스를 올리다. 그래서 보이...
[OpenStack Day in Korea 2015] Track 1-6 - 갈라파고스의 이구아나, 인프라에 오픈소스를 올리다. 그래서 보이...[OpenStack Day in Korea 2015] Track 1-6 - 갈라파고스의 이구아나, 인프라에 오픈소스를 올리다. 그래서 보이...
[OpenStack Day in Korea 2015] Track 1-6 - 갈라파고스의 이구아나, 인프라에 오픈소스를 올리다. 그래서 보이...
 
Enduro/X Middleware
Enduro/X MiddlewareEnduro/X Middleware
Enduro/X Middleware
 
RISC-V 30908 patra
RISC-V 30908 patraRISC-V 30908 patra
RISC-V 30908 patra
 
Cloud firewall logging
Cloud firewall loggingCloud firewall logging
Cloud firewall logging
 
XPDDS17: Keynote: Shared Coprocessor Framework on ARM - Oleksandr Andrushchen...
XPDDS17: Keynote: Shared Coprocessor Framework on ARM - Oleksandr Andrushchen...XPDDS17: Keynote: Shared Coprocessor Framework on ARM - Oleksandr Andrushchen...
XPDDS17: Keynote: Shared Coprocessor Framework on ARM - Oleksandr Andrushchen...
 
Introduction to FreeRTOS
Introduction to FreeRTOSIntroduction to FreeRTOS
Introduction to FreeRTOS
 
netLec5.pdf
netLec5.pdfnetLec5.pdf
netLec5.pdf
 
Gl embedded starterkit_ethernet
Gl embedded starterkit_ethernetGl embedded starterkit_ethernet
Gl embedded starterkit_ethernet
 
Adding IEEE 802.15.4 and 6LoWPAN to an Embedded Linux Device
Adding IEEE 802.15.4 and 6LoWPAN to an Embedded Linux DeviceAdding IEEE 802.15.4 and 6LoWPAN to an Embedded Linux Device
Adding IEEE 802.15.4 and 6LoWPAN to an Embedded Linux Device
 
SoC Idling for unconf COSCUP 2016
SoC Idling for unconf COSCUP 2016SoC Idling for unconf COSCUP 2016
SoC Idling for unconf COSCUP 2016
 
Install FD.IO VPP On Intel(r) Architecture & Test with Trex*
Install FD.IO VPP On Intel(r) Architecture & Test with Trex*Install FD.IO VPP On Intel(r) Architecture & Test with Trex*
Install FD.IO VPP On Intel(r) Architecture & Test with Trex*
 
NWU and HPC
NWU and HPCNWU and HPC
NWU and HPC
 
Security of Linux containers in the cloud
Security of Linux containers in the cloudSecurity of Linux containers in the cloud
Security of Linux containers in the cloud
 
Stacks and Layers: Integrating P4, C, OVS and OpenStack
Stacks and Layers: Integrating P4, C, OVS and OpenStackStacks and Layers: Integrating P4, C, OVS and OpenStack
Stacks and Layers: Integrating P4, C, OVS and OpenStack
 
LAS16-504: Secure Storage updates in OP-TEE
LAS16-504: Secure Storage updates in OP-TEELAS16-504: Secure Storage updates in OP-TEE
LAS16-504: Secure Storage updates in OP-TEE
 
Ch2 embedded processors-i
Ch2 embedded processors-iCh2 embedded processors-i
Ch2 embedded processors-i
 
Achieving the Ultimate Performance with KVM
Achieving the Ultimate Performance with KVMAchieving the Ultimate Performance with KVM
Achieving the Ultimate Performance with KVM
 
Ceph Day Shanghai - On the Productization Practice of Ceph
Ceph Day Shanghai - On the Productization Practice of Ceph Ceph Day Shanghai - On the Productization Practice of Ceph
Ceph Day Shanghai - On the Productization Practice of Ceph
 
Achieving the Ultimate Performance with KVM
Achieving the Ultimate Performance with KVMAchieving the Ultimate Performance with KVM
Achieving the Ultimate Performance with KVM
 

Plus de Linaro

Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea GalloDeep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Linaro
 
HPC network stack on ARM - Linaro HPC Workshop 2018
HPC network stack on ARM - Linaro HPC Workshop 2018HPC network stack on ARM - Linaro HPC Workshop 2018
HPC network stack on ARM - Linaro HPC Workshop 2018
Linaro
 
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Linaro
 
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Linaro
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
Linaro
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
Linaro
 
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse HypervisorHKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
Linaro
 
HKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMUHKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMU
Linaro
 
HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation
Linaro
 
HKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted bootHKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted boot
Linaro
 

Plus de Linaro (20)

Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea GalloDeep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
 
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta VekariaArm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
 
Huawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua MoraHuawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
 
Bud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qaBud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qa
 
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
 
HPC network stack on ARM - Linaro HPC Workshop 2018
HPC network stack on ARM - Linaro HPC Workshop 2018HPC network stack on ARM - Linaro HPC Workshop 2018
HPC network stack on ARM - Linaro HPC Workshop 2018
 
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
 
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
 
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
 
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
 
HKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening KeynoteHKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening Keynote
 
HKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP WorkshopHKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP Workshop
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
 
HKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and allHKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and all
 
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse HypervisorHKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
 
HKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMUHKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMU
 
HKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8MHKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8M
 
HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation
 
HKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted bootHKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted boot
 

Dernier

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 

Dernier (20)

EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 

HKG18-110 - net_mdev: Fast path user space I/O

  • 1. HKG18-110 net_mdev: Fast-path userspace I/O Ilias Apalodimas Mykyta Iziumtsev François-Frédéric Ozog
  • 2. Why userland I/O ● Time sensitive networking ○ Developed mostly for Industrial IOT, automotive and audio/video applications requiring low latency and jitter ○ Some of applications need 1 microsecond latency ○ Adapter-adapter latency across 5 cut-through switches can be 1 microsecond ○ Adapter-application latency with 500MHz-1Ghz processor: 20-40µs, jitter 200-600µs! ○ Adopted by OpenAvnu for Intel i210 (without IOMMU support) ● High speed networks ○ Kernel can’t do line rate on 40Gbit ○ VFIO-PCI already does that but unmaps the netdevice from the kernel and needs to be reprogrammed from the userspace
  • 3. Why net_mdev ● Based on MDEV VFIO framework, already a part of the linux kernel (mdev since 4.10) ● Uses IOMMU/SMMU on supported platforms to isolate devices on respective domains ● Greatly simplifies drivers in userspace. ○ VFIO-PCI/UIO needs 30k-50k LOC (userland driver is usually a copy of the kernel driver) ○ MDEV needs ~1500 LOC for a full driver ○ Don’t have to deal with complex hardware revisions, erratas, device resets etc. Kernel is still in charge ○ Kernel driver changes required to enable mdev are usually less than 500 LOC ● Can expand to much more than just ethernet (Huawei is having a similar approach for crypto, Intel is doing in on i915 GPU)
  • 4. Why net_mdev ● Can support zero-copy ● Best of both worlds, you get to keep existing userspace tools ○ ifconfig/ethtool/iproute2 continue to work ○ tcpdump can be easily ported ● Freedom to use “vendor specific” memory schemes ○ Intel uses “slot mode” ○ Chelsio is using “tape mode” and coalesces Tx ● Bus agnostic (PCIe, DPAA etc)
  • 5. Goals and Limits ● Any IO model usable by DPDK, ODP, VPP, any other app (AF_XDP is limited to one model at the moment, check https://www.spinics.net/lists/netdev/msg481758.html) ● Line rate even on 100Gbps (148Mpps) ● IOMMU/SMMU is a must when possible ● Expand to more than NICs (upstreamed mdev is used on i915 gpu) ● Provide a user-space API ● Limit to cache-coherent hardware (not all architectures support unprivileged cache flushing/invalidating and doing syscalls for syncing will kill performance)
  • 6. Code statistics ● Current implementation does not support all hardware features NIC Original Kernel Userspace Chelsio T4/T5(10/40Gbit) 48KLOC 550LOC 950LOC XL710(40Gbit) 52KLOC 400LOC 650LOC e1000e(1Gbit) 30KLOC 250LOC 600LOC
  • 7. Performance ● Rx not optimized yet (same numbers on 1-8 cpus) ● All results were achieved using 4 cpus (3 for Intel) ● XL710 and Chelsio T4 tested on core-i5. Chelsio T5 on Xeon NIC Speed Rx Tx Line rate (64b packets) Chelsio T4 10Gbit 8Mpps 13Mpps 14.88Mpps Chelsio T5 40Gbit 10.3Mpps 48Mpps 59.52Mpps Intel XL 710 40Gbit 19Mpps 41.55Mpps 59.52Mpps
  • 11. IOMMU/SMMU/VFIO basics ● Intel(IOMMU)/Arm(SMMU) ● IOMMU group = VFIO group (R/O, topology dependant, programmed at boot time) ● IOMMU domains = VFIO “container” (multiple domains can exist in a container) ● Groups are added to containers, ending up on the same domain VA HPA / PA IOVA / Bus address
  • 12. Operational overview (kernel) Allocate Rx/Tx descriptors Add MDEV regions Wait for transition Transition received Set NIC to “transition” status Block Kernel Tx path/Rx irq Receive fresh userspace buffers Packet flow starts Traffic flowing through kernel until we receive “transition”
  • 13. Operational walkthrough (kernel) ● Load driver with enable parameter net_mdev=1 ● mdev_add_essential(): Added on each NIC driver ○ Inventory memory regions to be mapped in user-space (Rx/Tx descriptors arrays, MMIO for doorbells). Each region is exported using struct vfio_region_info_cap_type from the VFIO-API ○ Currently supported regions are VFIO_NET_MDEV_MMIO, VFIO_NET_MDEV_RX_RING, VFIO_NET_MDEV_TX_RING, VFIO_NET_MDEV_RX_BUFFER_POOL ● VFIO-MDEV creates control files in sysfs ● I/O is handled from the kernel at this point
  • 14. Operational walkthrough (kernel) On Transition ● Graceful rx/tx shutdown: netif_tx_stop_all_queues ● Keep carrier up if possible ● VFIO-MDEV module sets IFF_NET_MDEV flag ● Set hardware in known state called “transition state” in the diagram (hardware dependent, from clear producer/consumer indexes to full reset at hw level)
  • 15. Operational walkthrough (kernel) ● Set RX interrupts according to polling strategy. Using the IFF_NET_MDEV flag we can intercept the kernel interrupt handler and redirect it to the userspace with eventfd or similar functionality ● Allocate new Rx/Tx buffers and memory map them to the user-space ● This is planned to change, user-space will do the allocation, VFIO framework will map it to the hardware ● Kernel can’t do I/O but is still in charge of the device
  • 16. Operational overview (user-space) Add groups to VFIO containers Discover NIC regions & lengths Iterate over regions and map them Inform NIC of fresh buffers Transition complete Packet flow starts MDEV create
  • 17. Operational walkthrough (user-space) ● echo $dev_uuid > /sys/class/net/$intf/device/mdev_supported_types/$sys_drv_name/create ● VFIO_GROUP_GET_STATUS, Test the group is viable and available ● VFIO_SET_CONTAINER, Add the group to the container ○ This adds the device on the proper IOMMU domain ● VFIO_SET_IOMMU, Enable the IOMMU model we want ● VFIO_DEVICE_GET_INFO discover device type (PCI…) and regions (Rx/Tx/MMIO) ● VFIO_DEVICE_GET_REGION_INFO get type, size and mmap each device region
  • 18. Operational walkthrough (user-space) ● Packet memory preparation ● Packet arrays or unstructured memory areas allocation ● ioctl VFIO_IOMMU_MAP_DMA with mapping parameters (BIDIRECTIONAL…). ● hardware update: hardware specific (populate memory area, ring doorbells) ● Signal transition finished (ioctl), kernel does whatever it needs to resume operation
  • 19. Future development ● Allocate memory from userspace and map it to IOMMU ○ Will require small changes to VFIO API ○ Need to be able to assign devices to domains without detaching them from the kernel ● Expand net_mdev to more than NICs ● DCA (Direct Cache Access) / DDIO (Direct Data I/O) usage ● Check Arm SMMUv3/SVM RFC (Mellanox has PASID NICs instead of SR-IOV) ● Collaborate with Huawei. Wrapdrive has a similar approach for crypto devices
  • 20. Resources ● POC driver for affordable e1000e NIC ● MDEV linux modifications https://github.com/apalos/odp-linux-mdev ● MDEV in ODP https://github.com/Linaro/odp/tree/caterpillar
  • 21. Thank You #HKG18 HKG18 keynotes and videos on: connect.linaro.org For further information: www.linaro.org