Brought to you by
Keeping Latency Low and Throughput High with Application-level Priority Management
Avi Kivity
CTO at ScyllaDB
Avi Kivity
CTO at ScyllaDB
Creator and ex-maintainer of Kernel-based Virtual Machine (KVM)
Creator of the Seastar I/O framework
Co-founder, CTO @ ScyllaDB
Comparing throughput and latency
Throughput computing (~ OLAP)
■ Want to maximize utilization
■ Extensive buffering to hide device/network latency
■ Total time is important
■ Fewer operations; serialization is permissible
Latency computing (~ OLTP)
■ Leave free cycles to absorb bursts
■ Cannot predict which data will be read
■ Often must write synchronously
■ Individual operation time is important
■ Many operations execute concurrently
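Why latency computing leaves free cycles can be made concrete with the textbook M/M/1 queue. This is a standard idealized model chosen for illustration, not something the talk uses: λ is the arrival rate, μ the service rate, ρ the utilization, and T the mean time an operation spends waiting plus being served.

```latex
\begin{aligned}
\rho &= \frac{\lambda}{\mu}, \qquad
T = \frac{1}{\mu - \lambda} = \frac{1/\mu}{1 - \rho} \\
\rho = 0.5 &\;\Rightarrow\; T = \frac{2}{\mu}, \qquad
\rho = 0.9 \;\Rightarrow\; T = \frac{10}{\mu}
\end{aligned}
```

In this model, pushing utilization from 50% to 90% (the throughput-computing goal) makes the average operation five times slower. That is the tension between the two columns above.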
Why mix throughput and latency computing?
■ Run different workloads on the same data (HTAP)
● Uses fewer resources than dedicated clusters
■ Maintenance operations on an OLTP workload
● Garbage collection
● Grooming a Log-Structured Merge Tree (LSM Tree)
● Cluster maintenance: add/remove/rebuild/backup/scrub nodes
General plan
1. Isolate/identify different tasks
2. Schedule tasks
Isolating tasks in threads
■ Each operation becomes a thread
● Perhaps temporarily borrowed from a thread pool
■ Let the kernel schedule these threads
■ Influence kernel choices with priority
Isolating tasks in threads
Advantages
■ Well understood
■ Large ecosystem
Disadvantages
■ Context switches are expensive
■ Communicating priority to the OS is hard
● Priority levels are not meaningful
■ Locking becomes complex and expensive
■ Priority inversion is possible
■ Kernel scheduling granularity may be too coarse
Application-level task isolation
■ Every operation is a normal object
■ Operations are multiplexed on a small number of threads
● Ideally one thread per logical core
● Both throughput and latency tasks on the same thread!
■ Concurrency framework assigns tasks to threads
■ Concurrency framework controls order
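A minimal sketch of the idea, assuming a hypothetical `Executor` class (not Seastar's API): each operation is an ordinary object (a `std::function`) on a queue, and a single thread drains the queue in an order the framework chooses. A throughput task is split into chunks that re-submit themselves, so a latency task submitted later still gets to run between chunks.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical single-threaded executor: no kernel thread per operation,
// just task objects on a queue run by one thread (ideally one per core).
class Executor {
public:
    using Task = std::function<void()>;

    void submit(Task t) { queue_.push_back(std::move(t)); }

    // Run until no work is left. Tasks may submit continuations, which are
    // appended to the same queue and run later on the same thread.
    void run() {
        while (!queue_.empty()) {
            Task t = std::move(queue_.front());
            queue_.pop_front();
            t();
        }
    }

private:
    std::deque<Task> queue_;
};

// A chunked "scan" (throughput work) and a "read" (latency work) share one
// thread; returns the order in which work actually ran.
std::vector<std::string> demo() {
    Executor ex;
    std::vector<std::string> log;
    std::function<void(int)> scan_chunk = [&](int i) {
        log.push_back("scan" + std::to_string(i));
        if (i < 2) ex.submit([&, i] { scan_chunk(i + 1); });  // next chunk
    };
    ex.submit([&] { scan_chunk(0); });
    ex.submit([&] { log.push_back("read"); });
    ex.run();
    return log;
}
```

Here `demo()` yields `scan0, read, scan1, scan2`: the read is not stuck behind the whole scan, and no lock was needed because everything ran on one thread.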
Application-level task isolation
Advantages
■ Full control
■ Low overhead with cooperative scheduling
■ Many locks become unnecessary
■ Good CPU affinity
■ Fewer surprises from the kernel
Disadvantages
■ Full control
■ Less mature ecosystem
Application-managed tasks
(Figure: a scheduler serves task queues tq1, tq2, tq3 … tqn; along the execution timeline, the thread runs tq1, tq2, tq3, tq1, tq2, tq3 in turn.)
Switching queues
■ When the queue is exhausted
● Common for latency-sensitive queues
■ When the time slice is exhausted
● Throughput-oriented queues
● Queue may have more tasks
● Tasks can be preempted
■ Poll for I/O
● io_uring_enter or equivalent
■ Make a scheduling decision
● Pick next queue
● Scheduling goal is to keep q_runtime / q_shares equal across queues
● Selection of queue is not round-robin
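The selection rule above can be sketched as follows. `TaskQueue`, `pick_next` and `simulate` are hypothetical names for illustration; the real Seastar scheduler (where such queues correspond to scheduling groups) is considerably more involved. The scheduler always picks the runnable queue with the smallest `runtime / shares`, which is what keeps that ratio roughly equal across queues.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Hypothetical task queue: `shares` is its relative weight and `runtime`
// the CPU time it has consumed so far.
struct TaskQueue {
    std::string name;
    double shares;          // relative weight, must be > 0
    double runtime = 0.0;   // accumulated run time, arbitrary units
    bool runnable = true;   // false once the queue is exhausted

    // The quantity the scheduler tries to keep equal across queues.
    double normalized() const { return runtime / shares; }
};

// Choose the runnable queue furthest behind its fair share (smallest
// runtime / shares). Not round-robin: a queue with twice the shares is
// picked about twice as often. Returns queues.size() if none is runnable.
std::size_t pick_next(const std::vector<TaskQueue>& queues) {
    std::size_t best = queues.size();
    double best_value = std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < queues.size(); ++i) {
        if (queues[i].runnable && queues[i].normalized() < best_value) {
            best_value = queues[i].normalized();
            best = i;
        }
    }
    return best;
}

// Run `slices` rounds in which the chosen queue uses a full time slice of
// length `slice`. Returns the order in which queues ran.
std::vector<std::string> simulate(std::vector<TaskQueue>& queues,
                                  int slices, double slice) {
    std::vector<std::string> order;
    for (int s = 0; s < slices; ++s) {
        // A real reactor would also poll for I/O here (io_uring_enter or
        // equivalent) before each decision; this sketch omits I/O.
        std::size_t i = pick_next(queues);
        if (i == queues.size()) break;  // nothing left to run
        queues[i].runtime += slice;
        order.push_back(queues[i].name);
    }
    return order;
}
```

With a "query" queue at 200 shares and a "compaction" queue at 100, six full slices give query four and compaction two, matching the 2:1 weights.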
Preemption techniques
■ Read the clock and compare it to the time slice deadline
● Prohibitively expensive
■ Use a timer + signal
● Works, but locking gets icky
■ Use a kernel timer to write to a user memory location
● linux-aio or io_uring
● Tricky but very efficient
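The last technique turns each preemption check into a single memory load. A minimal sketch, assuming a hypothetical global flag `need_preempt`: in the technique on the slide a kernel timer (via linux-aio or io_uring) writes to that location when the time slice ends; here a plain `std::thread` stands in for the kernel timer, which is a substitution for illustration only.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical preemption flag, written by "the timer" when the slice ends.
std::atomic<bool> need_preempt{false};

// The check a task performs: one relaxed load, no clock read, no syscall.
inline bool should_yield() {
    return need_preempt.load(std::memory_order_relaxed);
}

// A throughput task that sums a range but stops at a preemption point when
// the flag is raised. Returns the index to resume from later.
std::size_t sum_some(const std::vector<long>& v, std::size_t from, long& acc) {
    std::size_t i = from;
    for (; i < v.size(); ++i) {
        if (should_yield()) break;  // preemption point
        acc += v[i];
    }
    return i;
}

// Stand-in for the kernel timer: clear the flag, then raise it after `slice`.
std::thread arm_timer(std::chrono::microseconds slice) {
    need_preempt.store(false, std::memory_order_relaxed);
    return std::thread([slice] {
        std::this_thread::sleep_for(slice);
        need_preempt.store(true, std::memory_order_relaxed);
    });
}
```

A task would call `i = sum_some(v, i, acc)` once per time slice and return to the scheduler whenever `i < v.size()`, resuming from `i` the next time its queue is picked.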
Stall detector
■ Signal-based mechanism to detect where you “forgot” to add a preemption check
■ cf. Accidentally Quadratic
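A stall detector's bookkeeping can be sketched as below, assuming a hypothetical `StallDetector` class. On the slide the check runs from a periodic signal; this sketch takes explicit timestamps instead so the logic is deterministic, and it uses `std::string`/`std::optional`, which a real signal handler must avoid (only async-signal-safe operations are allowed there). A task that runs past the threshold without returning to the scheduler is exactly the "forgot a preemption check" case.

```cpp
#include <cassert>
#include <chrono>
#include <optional>
#include <string>

class StallDetector {
public:
    using clock = std::chrono::steady_clock;

    explicit StallDetector(std::chrono::milliseconds threshold)
        : threshold_(threshold) {}

    // Called by the scheduler around each task.
    void task_started(const std::string& name, clock::time_point now) {
        current_ = name;
        started_ = now;
    }
    void task_finished() { current_.reset(); }

    // Called periodically (from a timer signal in a real system). Returns
    // the task that has run longer than the threshold, if any.
    std::optional<std::string> check(clock::time_point now) const {
        if (current_ && now - started_ > threshold_) return current_;
        return std::nullopt;
    }

private:
    std::chrono::milliseconds threshold_;
    std::optional<std::string> current_;
    clock::time_point started_{};
};
```

A real implementation would record a backtrace at the moment of detection, so the report points at the loop that lacks a preemption check.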
Implementation in ScyllaDB
About ScyllaDB
■ Distributed OLTP NoSQL Database
■ Compatibility
● Apache Cassandra (CQL, Thrift)
● AWS DynamoDB (JSON/HTTP)
● Redis (RESP)
■ ~10X performance on the same hardware
■ Low latency, especially at higher percentiles
■ C++20, Open Source
■ Fully asynchronous; Seastar!
Dynamic Shares Adjustment
• Internal feedback loops to balance competing loads
(Figure: the Seastar Scheduler divides the resources shown (CPU, SSD, WAN) among Query, Commitlog, Memtable, Compaction and Repair; a Memory Monitor and a Compaction Backlog Monitor each send an “Adjust priority” signal back to the scheduler.)
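One such feedback loop can be sketched with a toy model. Everything here is an assumption for illustration: `BacklogController`, its share bounds, the linear mapping and the backlog units are hypothetical, not ScyllaDB's actual controller. The monitor measures the compaction backlog, the controller turns it into shares, the scheduler hands compaction a CPU fraction based on those shares, and that work reduces the backlog.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical controller: maps a normalized backlog (0 = nothing to
// compact, 1 = an assumed upper limit) to scheduler shares.
struct BacklogController {
    double min_shares = 50.0;    // some progress even with a tiny backlog
    double max_shares = 1000.0;  // upper bound on compaction's weight

    double shares_for(double backlog) const {
        backlog = std::clamp(backlog, 0.0, 1.0);
        return min_shares + backlog * (max_shares - min_shares);
    }
};

// One turn of the loop in hypothetical units: while both are busy,
// compaction gets shares / (shares + query_shares) of the CPU; the backlog
// grows by `write_rate` and shrinks by the work compaction completed.
double step(double backlog, double write_rate, double query_shares,
            const BacklogController& c) {
    double shares = c.shares_for(backlog);
    double cpu_fraction = shares / (shares + query_shares);
    return std::max(0.0, backlog + write_rate - cpu_fraction);
}
```

In this toy model, with a write rate of 0.25 and queries at 1000 shares, the backlog settles near 0.3, where compaction's CPU fraction matches the incoming writes. A fixed 50 shares would give compaction under 5% of the CPU and the backlog would grow without bound.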
Resource partitioning (QoS)
• Provide different quality of service to different users
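(Figure: the Seastar Scheduler divides the resources shown (CPU, SSD, WAN) among two separate query workloads, Query 1 and Query 2, plus Commitlog, Memtable, Compaction and Repair; a Memory Monitor and a Compaction Backlog Monitor each send an “Adjust priority” signal back to the scheduler.)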
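Giving each user's workload its own queue with its own shares yields a predictable split under contention. A minimal sketch, assuming a scheduler like the one on the "Switching queues" slide that only picks queues with work: each busy group's expected CPU fraction is its shares divided by the total shares of the busy groups, and an idle group's capacity goes to the others. The function name and group names are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>

// Expected CPU fraction per group when `busy` groups all have work queued.
std::map<std::string, double> expected_cpu_fraction(
    const std::map<std::string, double>& shares,
    const std::map<std::string, bool>& busy) {
    double total = 0.0;
    for (const auto& [group, s] : shares)
        if (busy.at(group)) total += s;
    std::map<std::string, double> out;
    for (const auto& [group, s] : shares)
        out[group] = (busy.at(group) && total > 0.0) ? s / total : 0.0;
    return out;
}
```

With Query 1 at 200 shares, Query 2 at 100 and compaction at 100, Query 1 gets half the CPU when all are busy, and about two thirds when Query 2 goes idle.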
I/O scheduling
■ Logically the same
■ But the entity being scheduled, a storage device, is much more complicated than a CPU core
■ Cross-core coordination is more difficult
■ More in Pavel’s talk
● “What We Need to Unlearn about Persistent Storage”
Brought to you by
Avi Kivity
@AviKivity
