SlideShare une entreprise Scribd logo
1  sur  42
Télécharger pour lire hors ligne
The Linux Block Layer
Built for Fast Storage
Light up your cloud!
Sagi Grimberg KernelTLV
27/6/18
1
First off, Happy 1’st birthday Roni!
2
Who am I?
• Co-founder and Principal Architect @ Lightbits Labs
• LightBits Labs is a stealth-mode startup pushing the software and hardware technology
boundaries in cloud-scale storage.
• We are looking for excellent people who enjoy a challenge for a variety of positions, including
both software and hardware. More information on our website at
http://www.lightbitslabs.com/#join or talk with me after the talk.
• Active contributor to Linux I/O and RDMA stack
• I am the maintainer of the iSCSI over RDMA (iSER) drivers
• I co-maintain the Linux NVMe subsystem
• Used to work for Mellanox on Storage, Networking, RDMA and
pretty much everything in between…
3
Where were we 10 years ago
• Only rotating storage devices exist.
• Devices were limited to hundreds of IOPs
• Devices access latency was in the milliseconds ballpark
• The Linux block layer was sufficient to handle these devices
• High performance applications found clever ways to avoid storage
access as much as possible
4
What happened? (hint: HW)
• Flash SSDs started appearing in the DataCenter
• IOPs went from Hundreds to Hundreds of thousands to Millions
• Latency went from Milliseconds to Microseconds
• Fast Interfaces evolved: PCIe (NVMe)
• Processors core count increased a lot!
• And NUMA...
5
I/O Stack
6
What are the issues?
• Existing I/O stack had a lot of data sharing
• Between different applications (running on different cores)
• Between submission and completion
• Locking for synchronization
• Zero NUMA awareness
• All stack heuristics and optimizations centered around slow
storage
• The result is very bad (even negative) scaling, spending lots of CPU
cycles and much much higher latencies.
7
I/O Stack - Little deeper
8
I/O Stack - Little deeper
9
Hmmm...
- Request are serialized
- Placed for staging
- Retrieved by the drivers
⇒ Lots of shared state!
I/O Stack - Performance
10
I/O Stack - Performance
11
Workaround: Bypass the the request layer
12
Problems with bypass:
● Give up flow control
● Give up error handling
● Give up statistics
● Give up tagging and indexing
● Give up I/O deadlines
● Give up I/O scheduling
● Crazy code duplication -
mistakes are copied because
people get stuff wrong...
Most importantly, this is not the
Linux design approach!
Enter Block Multiqueue
• The old stack does not consist of “one serialization point”
• The stack needed a complete re-write from ground up
• What do we do:
• Go look at the networking stack which solved the exact same issue 10+
years ago.
• But build from scratch for storage devices
13
Block Multiqueue - Goals
• Linear Scaling with CPU cores
• Split shared state between applications and
submission/completion
• Careful locality awareness: Cachelines, NUMA
• Pre-allocate resources as much as possible
• Provide full helper functionality - ease of implementation
• Support all existing HW
• Become THE queueing mode, not a “3’rd one”
14
Block Multiqueue - Architecture
15
Block Multiqueue - Features
• Efficient tagging
• Locality of submissions and completions
• Extremely aware to minimize cache pollutions
• Smart error handling - minimum intrusion to the hot path
• Smart cpu <-> queue mappings
• Clean API
• Easy conversion (usually just cleanup old cruft)
16
Block Multiqueue - I/O Flow
17
Block Multiqueue - Completions
18
● Applications are usually “cpu-sticky”
● If I/O completion comes on the
“correct” cpu, complete it
● Else, IPI to the “correct” cpu
Block Multiqueue - Tagging
19
• Almost every modern HW supports queueing
• Tags are used to identify individual I/Os in the presence of
out-of-order completions
• Tags are limited by capabilities of the HW, driver needs to flow
control
Block Multiqueue - Tagging
20
• PerCPU Cacheline aware scalable bitmaps
• Efficient at near-exhaustion
• Rolling wake-ups
• Maps 1x1 with HW usage - no driver specific tagging
Block Multiqueue - Pre-allocations
21
• Eliminate hot path allocations
• Allocate all the requests memory at initialization time
• Tag and request allocations are combined (no two step allocation)
• No driver per-request allocation
• Driver context and SG lists are placed in “slack space” behind the request
Block Multiqueue - Performance
22
Test-Case:
- null_blk driver
- fio
- 4K sync random read
- Dual socket system
Block Multiqueue - perf profiling
23
• Locking time is drastically reduced
• FIO reports much less “system time”
• Average and tail latencies are much lower and consistent
Next on the Agenda: SCSI, NVMe and friends
• NVMe started as a bypass driver - converted to blk-mq
• mtip32xx (Micron)
• virtio_blk, xen
• rbd (ceph)
• loop
• more...
• SCSI midlayer was a bigger project..
24
SCSI multiqueue
• Needed the concept of “shared tag sets”
• Tags are now a property of the HBA and not the storage device
• Needed a chunking of scatter-gather lists
• SCSI HBAs support huge sg lists, two much to allocate up front
• Needed “Head of queue” insertion
• For SCSI complex error handling
• Removed the “big scsi host_busy lock”
• reduced the huge contention on the scsi target “busy” atomic
• Needed Partial completion support
• Needed BIDI support (yukk..)
• Hardened the stack a lot with lots of user bug reports.
25
Block multiqueue - MSI(X) based queue mapping
26
● Motivation: Eliminate the IPI case
● Expose MSI(X) vector affinity
mappings to the block layer
● Map the HW context mappings via
the underlying device IRQ mappings
● Offer MSI(X) allocation and correct
affinity spreading via the PCI
subsystem
● Take advantage in pci based drivers
(nvme, rdma, fc, hpsa, etc..)
But wait, what about I/O schedulers?
• What we didn’t mention was that block multiqueue lacked a
proper I/O scheduler for approximately 3 years!
• A fundamental part of the I/O stack functionality is scheduling
• To optimize I/O sequentiality - Elevator algorithm
• Prevent write vs. read starvation (i.e. deadline scheduler)
• Fairness enforcement (i.e. CFQ)
• One can argue that I/O scheduling was designed for rotating media
• Optimized for reducing actuator seek time
NOT NECESSARILY TRUE - Flash can benefit scheduling!
27
Start from ground up: WriteBack Throttling
• Linux since the dawn of times sucked at buffered I/O
• Writes are naturally buffered and committed to disk in the
background
• Needs to have little or no impact on foreground activity
• What was needed:
• Plumb I/O stats for submitted reads and writes
• Track average latency in window granularity and what is currently enqueued
• Scale queue depth accordingly
• Prefer reads over non-directIO writes
28
WriteBack Throttling - Performance
29
Before... After...
Now we are ready for I/O schedules - MQ-Deadline
• Added I/O interception of requests for building schedulers on top
• First MQ conversion was for deadline scheduler
• Pretty easy and straightforward
• Just delay writes FIFO until deadline hits
• Reads FIFO are pass-through
• All percpu context - tradeoff?
• Remember: I/O scheduler can hurt synthetic workloads, but impact on
real life workloads.
30
Next: Kyber I/O Scheduler
• Targeted for fast multi-queue devices
• Lightweight
• Prefers reads over writes
• All I/Os are split into two queues (reads and writes)
• Reads are typically preferred
• Writes are throttled but not to a point of starvation
• The key is to keep submission queues short to guarantee latency
targets
• Kyber tracks I/O latency stats and adjust queue size accordingly
• Aware of flash background operations.
31
Next: BFQ I/O Scheduler
• Budget fair queueing scheduler
• A lot heavier
• Maintain Per-Process I/O budget
• Maintain bunch of Per-Process heuristics
• Yields the “best” I/O to queue at any given time
• A better fit for slower storage, especially rotating media and cheap &
deep SSDs.
32
But wait #2: What about Ultra-low latency devices
• New media is emerging with Ultra low latency (1-2 us)
• 3D-Xpoint
• Z-NAND
• Even with block MQ, the Linux I/O stack still has issues providing these
latencies
• It starts with IRQ (interrupt handling)
• If I/O is so fast, we might want to poll for completion and avoid paying the
cost of MSI(X) interrupt
33
Interrupt based I/O completion model
34
Polling based I/O completion model
35
IRQ vs. Polling
36
• Polling can remove the extra context switch from the completion
handling
So we should support polling!
37
• Add selective polling syscall interface:
• Use preadv2/pwritev2 with flag IOCB_HIGHPRI
• Saves roughly 25% of added latency
But what about CPU% - can we go hybrid?
38
• Yes!
• We have all the statistics framework in place, let’s use it for hybrid polling!
• Wake up poller after ½ of the mean latency.
Hybrid polling - Performance
39
Hybrid polling - Adjust to I/O size
40
• Block layer sees I/Os of different sizes.
• Some are 4k, some are 256K and some or 1-2MB
• We need to consider that when tracking stats for Polling considerations
• Simple solution: Bucketize stats...
• 0-4k
• 4-16k
• 16k-64k
• >64k
• Now Hybrid polling has good QoS!
To Conclude
41
• Lots of interesting stuff happening in Linux
• Linux belongs to everyone, Get involved!
• We always welcome patches and bug reports :)
42
LIGHT UP YOUR CLOUD!

Contenu connexe

Tendances

BPF Internals (eBPF)
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)Brendan Gregg
 
DMA Survival Guide
DMA Survival GuideDMA Survival Guide
DMA Survival GuideKernel TLV
 
Process Address Space: The way to create virtual address (page table) of user...
Process Address Space: The way to create virtual address (page table) of user...Process Address Space: The way to create virtual address (page table) of user...
Process Address Space: The way to create virtual address (page table) of user...Adrian Huang
 
Prerequisite knowledge for shared memory concurrency
Prerequisite knowledge for shared memory concurrencyPrerequisite knowledge for shared memory concurrency
Prerequisite knowledge for shared memory concurrencyViller Hsiao
 
Memory Mapping Implementation (mmap) in Linux Kernel
Memory Mapping Implementation (mmap) in Linux KernelMemory Mapping Implementation (mmap) in Linux Kernel
Memory Mapping Implementation (mmap) in Linux KernelAdrian Huang
 
Linux Networking Explained
Linux Networking ExplainedLinux Networking Explained
Linux Networking ExplainedThomas Graf
 
Page cache in Linux kernel
Page cache in Linux kernelPage cache in Linux kernel
Page cache in Linux kernelAdrian Huang
 
Linux Initialization Process (1)
Linux Initialization Process (1)Linux Initialization Process (1)
Linux Initialization Process (1)shimosawa
 
Kvm performance optimization for ubuntu
Kvm performance optimization for ubuntuKvm performance optimization for ubuntu
Kvm performance optimization for ubuntuSim Janghoon
 
Kernel Recipes 2015: Linux Kernel IO subsystem - How it works and how can I s...
Kernel Recipes 2015: Linux Kernel IO subsystem - How it works and how can I s...Kernel Recipes 2015: Linux Kernel IO subsystem - How it works and how can I s...
Kernel Recipes 2015: Linux Kernel IO subsystem - How it works and how can I s...Anne Nicolas
 
The linux networking architecture
The linux networking architectureThe linux networking architecture
The linux networking architecturehugo lu
 
VMworld 2013: ESXi Native Networking Driver Model - Delivering on Simplicity ...
VMworld 2013: ESXi Native Networking Driver Model - Delivering on Simplicity ...VMworld 2013: ESXi Native Networking Driver Model - Delivering on Simplicity ...
VMworld 2013: ESXi Native Networking Driver Model - Delivering on Simplicity ...VMworld
 
Linux Kernel - Virtual File System
Linux Kernel - Virtual File SystemLinux Kernel - Virtual File System
Linux Kernel - Virtual File SystemAdrian Huang
 
Slab Allocator in Linux Kernel
Slab Allocator in Linux KernelSlab Allocator in Linux Kernel
Slab Allocator in Linux KernelAdrian Huang
 
Memory management in Linux kernel
Memory management in Linux kernelMemory management in Linux kernel
Memory management in Linux kernelVadim Nikitin
 
Physical Memory Models.pdf
Physical Memory Models.pdfPhysical Memory Models.pdf
Physical Memory Models.pdfAdrian Huang
 
malloc & vmalloc in Linux
malloc & vmalloc in Linuxmalloc & vmalloc in Linux
malloc & vmalloc in LinuxAdrian Huang
 

Tendances (20)

BPF Internals (eBPF)
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)
 
DMA Survival Guide
DMA Survival GuideDMA Survival Guide
DMA Survival Guide
 
Process Address Space: The way to create virtual address (page table) of user...
Process Address Space: The way to create virtual address (page table) of user...Process Address Space: The way to create virtual address (page table) of user...
Process Address Space: The way to create virtual address (page table) of user...
 
Prerequisite knowledge for shared memory concurrency
Prerequisite knowledge for shared memory concurrencyPrerequisite knowledge for shared memory concurrency
Prerequisite knowledge for shared memory concurrency
 
Memory Mapping Implementation (mmap) in Linux Kernel
Memory Mapping Implementation (mmap) in Linux KernelMemory Mapping Implementation (mmap) in Linux Kernel
Memory Mapping Implementation (mmap) in Linux Kernel
 
Linux Networking Explained
Linux Networking ExplainedLinux Networking Explained
Linux Networking Explained
 
Page cache in Linux kernel
Page cache in Linux kernelPage cache in Linux kernel
Page cache in Linux kernel
 
Linux Initialization Process (1)
Linux Initialization Process (1)Linux Initialization Process (1)
Linux Initialization Process (1)
 
Kvm performance optimization for ubuntu
Kvm performance optimization for ubuntuKvm performance optimization for ubuntu
Kvm performance optimization for ubuntu
 
Spi drivers
Spi driversSpi drivers
Spi drivers
 
Kernel Recipes 2015: Linux Kernel IO subsystem - How it works and how can I s...
Kernel Recipes 2015: Linux Kernel IO subsystem - How it works and how can I s...Kernel Recipes 2015: Linux Kernel IO subsystem - How it works and how can I s...
Kernel Recipes 2015: Linux Kernel IO subsystem - How it works and how can I s...
 
The linux networking architecture
The linux networking architectureThe linux networking architecture
The linux networking architecture
 
Linux device drivers
Linux device drivers Linux device drivers
Linux device drivers
 
VMworld 2013: ESXi Native Networking Driver Model - Delivering on Simplicity ...
VMworld 2013: ESXi Native Networking Driver Model - Delivering on Simplicity ...VMworld 2013: ESXi Native Networking Driver Model - Delivering on Simplicity ...
VMworld 2013: ESXi Native Networking Driver Model - Delivering on Simplicity ...
 
Linux Kernel - Virtual File System
Linux Kernel - Virtual File SystemLinux Kernel - Virtual File System
Linux Kernel - Virtual File System
 
Slab Allocator in Linux Kernel
Slab Allocator in Linux KernelSlab Allocator in Linux Kernel
Slab Allocator in Linux Kernel
 
Memory management in Linux kernel
Memory management in Linux kernelMemory management in Linux kernel
Memory management in Linux kernel
 
Physical Memory Models.pdf
Physical Memory Models.pdfPhysical Memory Models.pdf
Physical Memory Models.pdf
 
Ixgbe internals
Ixgbe internalsIxgbe internals
Ixgbe internals
 
malloc & vmalloc in Linux
malloc & vmalloc in Linuxmalloc & vmalloc in Linux
malloc & vmalloc in Linux
 

Similaire à The Linux Block Layer - Built for Fast Storage

Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications OpenEBS
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsHPCC Systems
 
Porting_uClinux_CELF2008_Griffin
Porting_uClinux_CELF2008_GriffinPorting_uClinux_CELF2008_Griffin
Porting_uClinux_CELF2008_GriffinPeter Griffin
 
System On Chip (SOC)
System On Chip (SOC)System On Chip (SOC)
System On Chip (SOC)Shivam Gupta
 
High performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User GroupHigh performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User GroupHungWei Chiu
 
Intel® hyper threading technology
Intel® hyper threading technologyIntel® hyper threading technology
Intel® hyper threading technologyAmirali Sharifian
 
Realtime traffic analyser
Realtime traffic analyserRealtime traffic analyser
Realtime traffic analyserAlex Moskvin
 
Kernel Recipes 2015: Solving the Linux storage scalability bottlenecks
Kernel Recipes 2015: Solving the Linux storage scalability bottlenecksKernel Recipes 2015: Solving the Linux storage scalability bottlenecks
Kernel Recipes 2015: Solving the Linux storage scalability bottlenecksAnne Nicolas
 
Container Attached Storage (CAS) with OpenEBS - Berlin Kubernetes Meetup - Ma...
Container Attached Storage (CAS) with OpenEBS - Berlin Kubernetes Meetup - Ma...Container Attached Storage (CAS) with OpenEBS - Berlin Kubernetes Meetup - Ma...
Container Attached Storage (CAS) with OpenEBS - Berlin Kubernetes Meetup - Ma...OpenEBS
 
LMAX Disruptor - High Performance Inter-Thread Messaging Library
LMAX Disruptor - High Performance Inter-Thread Messaging LibraryLMAX Disruptor - High Performance Inter-Thread Messaging Library
LMAX Disruptor - High Performance Inter-Thread Messaging LibrarySebastian Andrasoni
 
RISC Vs CISC Computer architecture and design
RISC Vs CISC Computer architecture and designRISC Vs CISC Computer architecture and design
RISC Vs CISC Computer architecture and designyousefzahdeh
 
Spil Storage Platform (Erlang) @ EUG-NL
Spil Storage Platform (Erlang) @ EUG-NLSpil Storage Platform (Erlang) @ EUG-NL
Spil Storage Platform (Erlang) @ EUG-NLThijs Terlouw
 
4.1 Introduction 145• In this section, we first take a gander at a.pdf
4.1 Introduction 145• In this section, we first take a gander at a.pdf4.1 Introduction 145• In this section, we first take a gander at a.pdf
4.1 Introduction 145• In this section, we first take a gander at a.pdfarpowersarps
 
The Quest for the Perfect API
The Quest for the Perfect APIThe Quest for the Perfect API
The Quest for the Perfect APImicrokerneldude
 
Tuning Linux for your database FLOSSUK 2016
Tuning Linux for your database FLOSSUK 2016Tuning Linux for your database FLOSSUK 2016
Tuning Linux for your database FLOSSUK 2016Colin Charles
 
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storageWebinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storageMayaData Inc
 

Similaire à The Linux Block Layer - Built for Fast Storage (20)

Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications
 
SoC FPGA Technology
SoC FPGA TechnologySoC FPGA Technology
SoC FPGA Technology
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC Systems
 
CPU Caches
CPU CachesCPU Caches
CPU Caches
 
Porting_uClinux_CELF2008_Griffin
Porting_uClinux_CELF2008_GriffinPorting_uClinux_CELF2008_Griffin
Porting_uClinux_CELF2008_Griffin
 
OpenCAPI Technology Ecosystem
OpenCAPI Technology EcosystemOpenCAPI Technology Ecosystem
OpenCAPI Technology Ecosystem
 
System On Chip (SOC)
System On Chip (SOC)System On Chip (SOC)
System On Chip (SOC)
 
High performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User GroupHigh performace network of Cloud Native Taiwan User Group
High performace network of Cloud Native Taiwan User Group
 
Intel® hyper threading technology
Intel® hyper threading technologyIntel® hyper threading technology
Intel® hyper threading technology
 
Micro controller & Micro processor
Micro controller & Micro processorMicro controller & Micro processor
Micro controller & Micro processor
 
Realtime traffic analyser
Realtime traffic analyserRealtime traffic analyser
Realtime traffic analyser
 
Kernel Recipes 2015: Solving the Linux storage scalability bottlenecks
Kernel Recipes 2015: Solving the Linux storage scalability bottlenecksKernel Recipes 2015: Solving the Linux storage scalability bottlenecks
Kernel Recipes 2015: Solving the Linux storage scalability bottlenecks
 
Container Attached Storage (CAS) with OpenEBS - Berlin Kubernetes Meetup - Ma...
Container Attached Storage (CAS) with OpenEBS - Berlin Kubernetes Meetup - Ma...Container Attached Storage (CAS) with OpenEBS - Berlin Kubernetes Meetup - Ma...
Container Attached Storage (CAS) with OpenEBS - Berlin Kubernetes Meetup - Ma...
 
LMAX Disruptor - High Performance Inter-Thread Messaging Library
LMAX Disruptor - High Performance Inter-Thread Messaging LibraryLMAX Disruptor - High Performance Inter-Thread Messaging Library
LMAX Disruptor - High Performance Inter-Thread Messaging Library
 
RISC Vs CISC Computer architecture and design
RISC Vs CISC Computer architecture and designRISC Vs CISC Computer architecture and design
RISC Vs CISC Computer architecture and design
 
Spil Storage Platform (Erlang) @ EUG-NL
Spil Storage Platform (Erlang) @ EUG-NLSpil Storage Platform (Erlang) @ EUG-NL
Spil Storage Platform (Erlang) @ EUG-NL
 
4.1 Introduction 145• In this section, we first take a gander at a.pdf
4.1 Introduction 145• In this section, we first take a gander at a.pdf4.1 Introduction 145• In this section, we first take a gander at a.pdf
4.1 Introduction 145• In this section, we first take a gander at a.pdf
 
The Quest for the Perfect API
The Quest for the Perfect APIThe Quest for the Perfect API
The Quest for the Perfect API
 
Tuning Linux for your database FLOSSUK 2016
Tuning Linux for your database FLOSSUK 2016Tuning Linux for your database FLOSSUK 2016
Tuning Linux for your database FLOSSUK 2016
 
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storageWebinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
 

Plus de Kernel TLV

Building Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCCBuilding Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCCKernel TLV
 
SGX Trusted Execution Environment
SGX Trusted Execution EnvironmentSGX Trusted Execution Environment
SGX Trusted Execution EnvironmentKernel TLV
 
Kernel Proc Connector and Containers
Kernel Proc Connector and ContainersKernel Proc Connector and Containers
Kernel Proc Connector and ContainersKernel TLV
 
Bypassing ASLR Exploiting CVE 2015-7545
Bypassing ASLR Exploiting CVE 2015-7545Bypassing ASLR Exploiting CVE 2015-7545
Bypassing ASLR Exploiting CVE 2015-7545Kernel TLV
 
Present Absence of Linux Filesystem Security
Present Absence of Linux Filesystem SecurityPresent Absence of Linux Filesystem Security
Present Absence of Linux Filesystem SecurityKernel TLV
 
OpenWrt From Top to Bottom
OpenWrt From Top to BottomOpenWrt From Top to Bottom
OpenWrt From Top to BottomKernel TLV
 
Make Your Containers Faster: Linux Container Performance Tools
Make Your Containers Faster: Linux Container Performance ToolsMake Your Containers Faster: Linux Container Performance Tools
Make Your Containers Faster: Linux Container Performance ToolsKernel TLV
 
Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...
Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...
Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...Kernel TLV
 
File Systems: Why, How and Where
File Systems: Why, How and WhereFile Systems: Why, How and Where
File Systems: Why, How and WhereKernel TLV
 
netfilter and iptables
netfilter and iptablesnetfilter and iptables
netfilter and iptablesKernel TLV
 
KernelTLV Speaker Guidelines
KernelTLV Speaker GuidelinesKernelTLV Speaker Guidelines
KernelTLV Speaker GuidelinesKernel TLV
 
Userfaultfd: Current Features, Limitations and Future Development
Userfaultfd: Current Features, Limitations and Future DevelopmentUserfaultfd: Current Features, Limitations and Future Development
Userfaultfd: Current Features, Limitations and Future DevelopmentKernel TLV
 
Linux Kernel Cryptographic API and Use Cases
Linux Kernel Cryptographic API and Use CasesLinux Kernel Cryptographic API and Use Cases
Linux Kernel Cryptographic API and Use CasesKernel TLV
 
FD.IO Vector Packet Processing
FD.IO Vector Packet ProcessingFD.IO Vector Packet Processing
FD.IO Vector Packet ProcessingKernel TLV
 
WiFi and the Beast
WiFi and the BeastWiFi and the Beast
WiFi and the BeastKernel TLV
 
Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDKKernel TLV
 
FreeBSD and Drivers
FreeBSD and DriversFreeBSD and Drivers
FreeBSD and DriversKernel TLV
 
Specializing the Data Path - Hooking into the Linux Network Stack
Specializing the Data Path - Hooking into the Linux Network StackSpecializing the Data Path - Hooking into the Linux Network Stack
Specializing the Data Path - Hooking into the Linux Network StackKernel TLV
 

Plus de Kernel TLV (20)

DPDK In Depth
DPDK In DepthDPDK In Depth
DPDK In Depth
 
Building Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCCBuilding Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCC
 
SGX Trusted Execution Environment
SGX Trusted Execution EnvironmentSGX Trusted Execution Environment
SGX Trusted Execution Environment
 
Fun with FUSE
Fun with FUSEFun with FUSE
Fun with FUSE
 
Kernel Proc Connector and Containers
Kernel Proc Connector and ContainersKernel Proc Connector and Containers
Kernel Proc Connector and Containers
 
Bypassing ASLR Exploiting CVE 2015-7545
Bypassing ASLR Exploiting CVE 2015-7545Bypassing ASLR Exploiting CVE 2015-7545
Bypassing ASLR Exploiting CVE 2015-7545
 
Present Absence of Linux Filesystem Security
Present Absence of Linux Filesystem SecurityPresent Absence of Linux Filesystem Security
Present Absence of Linux Filesystem Security
 
OpenWrt From Top to Bottom
OpenWrt From Top to BottomOpenWrt From Top to Bottom
OpenWrt From Top to Bottom
 
Make Your Containers Faster: Linux Container Performance Tools
Make Your Containers Faster: Linux Container Performance ToolsMake Your Containers Faster: Linux Container Performance Tools
Make Your Containers Faster: Linux Container Performance Tools
 
Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...
Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...
Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...
 
File Systems: Why, How and Where
File Systems: Why, How and WhereFile Systems: Why, How and Where
File Systems: Why, How and Where
 
netfilter and iptables
netfilter and iptablesnetfilter and iptables
netfilter and iptables
 
KernelTLV Speaker Guidelines
KernelTLV Speaker GuidelinesKernelTLV Speaker Guidelines
KernelTLV Speaker Guidelines
 
Userfaultfd: Current Features, Limitations and Future Development
Userfaultfd: Current Features, Limitations and Future DevelopmentUserfaultfd: Current Features, Limitations and Future Development
Userfaultfd: Current Features, Limitations and Future Development
 
Linux Kernel Cryptographic API and Use Cases
Linux Kernel Cryptographic API and Use CasesLinux Kernel Cryptographic API and Use Cases
Linux Kernel Cryptographic API and Use Cases
 
FD.IO Vector Packet Processing
FD.IO Vector Packet ProcessingFD.IO Vector Packet Processing
FD.IO Vector Packet Processing
 
WiFi and the Beast
WiFi and the BeastWiFi and the Beast
WiFi and the Beast
 
Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDK
 
FreeBSD and Drivers
FreeBSD and DriversFreeBSD and Drivers
FreeBSD and Drivers
 
Specializing the Data Path - Hooking into the Linux Network Stack
Specializing the Data Path - Hooking into the Linux Network StackSpecializing the Data Path - Hooking into the Linux Network Stack
Specializing the Data Path - Hooking into the Linux Network Stack
 

Dernier

Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
cpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptcpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptrcbcrtm
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Hr365.us smith
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 

Dernier (20)

Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
cpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptcpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.ppt
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Odoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting ServiceOdoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting Service
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 

The Linux Block Layer - Built for Fast Storage

  • 1. The Linux Block Layer Built for Fast Storage Light up your cloud! Sagi Grimberg KernelTLV 27/6/18 1
  • 2. First off, Happy 1’st birthday Roni! 2
  • 3. Who am I? • Co-founder and Principal Architect @ Lightbits Labs • LightBits Labs is a stealth-mode startup pushing the software and hardware technology boundaries in cloud-scale storage. • We are looking for excellent people who enjoy a challenge for a variety of positions, including both software and hardware. More information on our website at http://www.lightbitslabs.com/#join or talk with me after the talk. • Active contributor to Linux I/O and RDMA stack • I am the maintainer of the iSCSI over RDMA (iSER) drivers • I co-maintain the Linux NVMe subsystem • Used to work for Mellanox on Storage, Networking, RDMA and pretty much everything in between… 3
  • 4. Where were we 10 years ago • Only rotating storage devices exist. • Devices were limited to hundreds of IOPs • Devices access latency was in the milliseconds ballpark • The Linux block layer was sufficient to handle these devices • High performance applications found clever ways to avoid storage access as much as possible 4
  • 5. What happened? (hint: HW) • Flash SSDs started appearing in the DataCenter • IOPs went from Hundreds to Hundreds of thousands to Millions • Latency went from Milliseconds to Microseconds • Fast Interfaces evolved: PCIe (NVMe) • Processors core count increased a lot! • And NUMA... 5
  • 7. What are the issues? • Existing I/O stack had a lot of data sharing • Between different applications (running on different cores) • Between submission and completion • Locking for synchronization • Zero NUMA awareness • All stack heuristics and optimizations centered around slow storage • The result is very bad (even negative) scaling, spending lots of CPU cycles and much much higher latencies. 7
  • 8. I/O Stack - Little deeper 8
  • 9. I/O Stack - Little deeper 9 Hmmm... - Request are serialized - Placed for staging - Retrieved by the drivers ⇒ Lots of shared state!
  • 10. I/O Stack - Performance 10
  • 11. I/O Stack - Performance 11
  • 12. Workaround: Bypass the the request layer 12 Problems with bypass: ● Give up flow control ● Give up error handling ● Give up statistics ● Give up tagging and indexing ● Give up I/O deadlines ● Give up I/O scheduling ● Crazy code duplication - mistakes are copied because people get stuff wrong... Most importantly, this is not the Linux design approach!
  • 13. Enter Block Multiqueue • The old stack does not consist of “one serialization point” • The stack needed a complete re-write from ground up • What do we do: • Go look at the networking stack which solved the exact same issue 10+ years ago. • But build from scratch for storage devices 13
  • 14. Block Multiqueue - Goals • Linear Scaling with CPU cores • Split shared state between applications and submission/completion • Careful locality awareness: Cachelines, NUMA • Pre-allocate resources as much as possible • Provide full helper functionality - ease of implementation • Support all existing HW • Become THE queueing mode, not a “3’rd one” 14
  • 15. Block Multiqueue - Architecture 15
  • 16. Block Multiqueue - Features • Efficient tagging • Locality of submissions and completions • Extremely aware to minimize cache pollutions • Smart error handling - minimum intrusion to the hot path • Smart cpu <-> queue mappings • Clean API • Easy conversion (usually just cleanup old cruft) 16
  • 17. Block Multiqueue - I/O Flow 17
  • 18. Block Multiqueue - Completions 18 ● Applications are usually “cpu-sticky” ● If I/O completion comes on the “correct” cpu, complete it ● Else, IPI to the “correct” cpu
  • 19. Block Multiqueue - Tagging 19 • Almost every modern HW supports queueing • Tags are used to identify individual I/Os in the presence of out-of-order completions • Tags are limited by capabilities of the HW, driver needs to flow control
  • 20. Block Multiqueue - Tagging 20 • PerCPU Cacheline aware scalable bitmaps • Efficient at near-exhaustion • Rolling wake-ups • Maps 1x1 with HW usage - no driver specific tagging
  • 21. Block Multiqueue - Pre-allocations 21 • Eliminate hot path allocations • Allocate all the requests memory at initialization time • Tag and request allocations are combined (no two step allocation) • No driver per-request allocation • Driver context and SG lists are placed in “slack space” behind the request
  • 22. Block Multiqueue - Performance 22 Test-Case: - null_blk driver - fio - 4K sync random read - Dual socket system
  • 23. Block Multiqueue - perf profiling 23 • Locking time is drastically reduced • FIO reports much less “system time” • Average and tail latencies are much lower and consistent
  • 24. Next on the Agenda: SCSI, NVMe and friends • NVMe started as a bypass driver - converted to blk-mq • mtip32xx (Micron) • virtio_blk, xen • rbd (ceph) • loop • more... • SCSI midlayer was a bigger project.. 24
  • 25. SCSI multiqueue • Needed the concept of “shared tag sets” • Tags are now a property of the HBA and not the storage device • Needed a chunking of scatter-gather lists • SCSI HBAs support huge sg lists, two much to allocate up front • Needed “Head of queue” insertion • For SCSI complex error handling • Removed the “big scsi host_busy lock” • reduced the huge contention on the scsi target “busy” atomic • Needed Partial completion support • Needed BIDI support (yukk..) • Hardened the stack a lot with lots of user bug reports. 25
  • 26. Block multiqueue - MSI(X) based queue mapping 26 ● Motivation: Eliminate the IPI case ● Expose MSI(X) vector affinity mappings to the block layer ● Map the HW context mappings via the underlying device IRQ mappings ● Offer MSI(X) allocation and correct affinity spreading via the PCI subsystem ● Take advantage in pci based drivers (nvme, rdma, fc, hpsa, etc..)
  • 27. But wait, what about I/O schedulers? • What we didn’t mention was that block multiqueue lacked a proper I/O scheduler for approximately 3 years! • A fundamental part of the I/O stack functionality is scheduling • To optimize I/O sequentiality - Elevator algorithm • Prevent write vs. read starvation (i.e. deadline scheduler) • Fairness enforcement (i.e. CFQ) • One can argue that I/O scheduling was designed for rotating media • Optimized for reducing actuator seek time NOT NECESSARILY TRUE - Flash can benefit scheduling! 27
  • 28. Start from ground up: WriteBack Throttling • Linux since the dawn of times sucked at buffered I/O • Writes are naturally buffered and committed to disk in the background • Needs to have little or no impact on foreground activity • What was needed: • Plumb I/O stats for submitted reads and writes • Track average latency in window granularity and what is currently enqueued • Scale queue depth accordingly • Prefer reads over non-directIO writes 28
  • 29. WriteBack Throttling - Performance 29 Before... After...
  • 30. Now we are ready for I/O schedules - MQ-Deadline • Added I/O interception of requests for building schedulers on top • First MQ conversion was for deadline scheduler • Pretty easy and straightforward • Just delay writes FIFO until deadline hits • Reads FIFO are pass-through • All percpu context - tradeoff? • Remember: I/O scheduler can hurt synthetic workloads, but impact on real life workloads. 30
  • 31. Next: Kyber I/O Scheduler • Targeted for fast multi-queue devices • Lightweight • Prefers reads over writes • All I/Os are split into two queues (reads and writes) • Reads are typically preferred • Writes are throttled but not to a point of starvation • The key is to keep submission queues short to guarantee latency targets • Kyber tracks I/O latency stats and adjust queue size accordingly • Aware of flash background operations. 31
  • 32. Next: BFQ I/O Scheduler • Budget fair queueing scheduler • A lot heavier • Maintain Per-Process I/O budget • Maintain bunch of Per-Process heuristics • Yields the “best” I/O to queue at any given time • A better fit for slower storage, especially rotating media and cheap & deep SSDs. 32
  • 33. But wait #2: What about Ultra-low latency devices • New media is emerging with Ultra low latency (1-2 us) • 3D-Xpoint • Z-NAND • Even with block MQ, the Linux I/O stack still has issues providing these latencies • It starts with IRQ (interrupt handling) • If I/O is so fast, we might want to poll for completion and avoid paying the cost of MSI(X) interrupt 33
  • 34. Interrupt based I/O completion model 34
  • 35. Polling based I/O completion model 35
  • 36. IRQ vs. Polling 36 • Polling can remove the extra context switch from the completion handling
  • 37. So we should support polling! 37 • Add selective polling syscall interface: • Use preadv2/pwritev2 with flag IOCB_HIGHPRI • Saves roughly 25% of added latency
  • 38. But what about CPU% - can we go hybrid? 38 • Yes! • We have all the statistics framework in place, let’s use it for hybrid polling! • Wake up poller after ½ of the mean latency.
  • 39. Hybrid polling - Performance 39
  • 40. Hybrid polling - Adjust to I/O size 40 • Block layer sees I/Os of different sizes. • Some are 4k, some are 256K and some or 1-2MB • We need to consider that when tracking stats for Polling considerations • Simple solution: Bucketize stats... • 0-4k • 4-16k • 16k-64k • >64k • Now Hybrid polling has good QoS!
  • 41. To Conclude 41 • Lots of interesting stuff happening in Linux • Linux belongs to everyone, Get involved! • We always welcome patches and bug reports :)