Optimizing Servers for High
Throughput and Low Latency
Alexey Ivanov
Software Engineer at Dapper Labs
Alexey Ivanov
Software Engineer, Dapper Labs
■ Previously: Traffic, Networking, and Databases @Dropbox
■ Performance: Hardware. OS. Application. RUM.
Optimizing (web-)Servers 5 Years Later…
This is an updated version of the nginx.conf’17 talk.
Changelog:
■ New hardware features are available. AMD EPYCs and ARM64 are a thing.
■ New Linux kernel features, especially around observability.
■ Replace nginx with a generic HTTP-server/-client focus.
● (Most of the clients and servers nowadays are HTTP- or HTTP/2-based)
High-level vs Low-level Optimizations
The biggest performance gains usually come from high-level optimizations:
load-balancing, algorithms, data structures, and (especially) business logic.
A few examples from large-scale production systems.
■ The lower the variance in backend load, the better.
● Applying “Two Random Choices” load-balancing greatly reduced latencies.
■ The fastest code is “no code”.
● E.g., at Dropbox we pre-compressed static files for the web, so we spend 0% CPU on compression while
maintaining the best possible compression ratio.
■ Algorithm improvements.
● Switching from zlib to brotli saved us both CPU and storage.
■ Data locality improvements.
● Switching from B+tree to LSM-based storage improved compression efficiency and reduced
database sizes by ~2.5x.
Hardware
CPU and Memory
Generally, picking the newest processor is the best choice since it will have the
most hardware offloads:
■ AVX2, BMI, ADX, AVX-512, AES-NI, SHA-NI (x86)
● (Symmetric/Asymmetric encryption, signatures, hashing, MACs)
■ PMULL, PMULL2, SHA256H, SHA3 (ARMv8.2+)
● (finite field arithmetic, hashing, MACs)
Many things that were previously prohibitively expensive are now almost
free thanks to hardware offloads: mTLS, crypto-hashing, storage encryption.
CPU and Memory (Cont’d)
What if the budget is limited? Rules of thumb:
■ Low-latency: single NUMA node, bigger caches, disabled SMT, more GHz,
more memory channels.
■ High-throughput: more cores, enabled SMT, more memory.
Frequently, in production, high CPU usage does not mean a CPU bottleneck but a
“CPU pipeline stall” problem, i.e. a cache, TLB, or memory-bandwidth limitation.
Top-Down Analysis (TMA)
github.com/andikleen/pmu-tools
# toplev.py -l1 --single-thread --force-events ./app
BE Backend_Bound: 60.34%
This category reflects slots where no uops are being
delivered due to a lack of required resources for
accepting more uops in the Backend of the pipeline...
github.com/andikleen/pmu-tools
# toplev.py -l3 --single-thread --force-events ./app
BE Backend_Bound: 60.42%
BE/Mem Backend_Bound.Memory_Bound: 32.23%
BE/Mem Backend_Bound.Memory_Bound.L1_Bound: 32.44%
This metric represents how often CPU was stalled without
missing the L1 data cache...
BE/Core Backend_Bound.Core_Bound: 45.93%
BE/Core Backend_Bound.Core_Bound.Ports_Utilization: 45.93%
This metric represents cycles fraction application was
stalled due to Core computation issues (non divider-
related)...
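As a coarse first pass before full TMA, perf stat can hint at pipeline stalls. A minimal sketch (assumes perf is installed; ./app is a placeholder; low IPC suggests the CPU is stalled rather than busy):
# Low instructions-per-cycle plus high miss counts point at stalls:
perf stat -e cycles,instructions,cache-misses,dTLB-load-misses ./app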
NICs
Relevant only for real hardware, not clouds.
■ 25 Gbit/s or more; older NICs will likely have miscellaneous bottlenecks.
■ Open-source drivers, small firmware, an active community.
● For when (not if) issues occur.
Pressure Stall Information (PSI)
“PSI provides for the first time a canonical way to see resource pressure increases
as they develop, with new pressure metrics for three major resources—
memory, CPU, and IO.”
Source: https://facebookmicrosites.github.io/psi/docs/overview
PSI: Global and Per-cgroup (v2)
$ cat /proc/pressure/io
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
$ cat /sys/fs/cgroup/cg1/io.pressure
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
Recommended reading: Understanding Software Dynamics by Richard L. Sites.
Linux Kernel
Kernel Optimizations
The best Linux optimization is running a recent kernel version. New kernel versions bring
improvements to networking, memory management, I/O, and the rest of the Linux
subsystems.
But most importantly, they bring improvements to observability tooling.
CPU and Memory
After you’ve picked the best CPU for your workload, you’ll need to utilize it to the
max (a command sketch follows the list):
■ For Intel/AMD you would want to use the intel_pstate or amd-pstate driver.
● If you want to be more energy-efficient, consider the schedutil governor; use
performance otherwise.
■ Set NUMA affinity for your application.
■ Use transparent huge pages.
● Careful here: this may reduce performance on some workloads.
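A minimal sketch of the knobs above (assumes cpupower and numactl are installed; node 0 and ./app are placeholders):
# Pin the frequency governor (or schedutil for energy efficiency):
cpupower frequency-set -g performance
# Keep the app within a single NUMA node, for both CPU and memory:
numactl --cpunodebind=0 --membind=0 ./app
# Enable THP only where the app opts in via madvise():
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled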
Networking
The main goal of low-level tuning is to parallelize packet processing, add affinities,
increase buffer sizes, and enable hardware offloads (an ethtool sketch follows the list).
■ ethtool is your friend here: # of queues, ring buffers, offloads, coalescing.
● -L, -G, -K, -C, etc.
● -S is your friend to keep track of drops/misses/errors/overruns/etc.
■ Mellanox and Intel cards come with set_irq_affinity/mlnx_affinity.
● Do not forget to turn off irqbalance.
■ After RSS is enabled it is generally a good idea to turn on XPS and xps_rxqs.
■ Avoid RPS. RFS can also have negative consequences.
■ For low latency: try to stay within the NUMA node the NIC’s PCIe slot is attached to.
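A sketch of the ethtool knobs above (eth0 and the sizes are placeholders; supported values vary by NIC):
# Spread processing across 16 combined RX/TX queues:
ethtool -L eth0 combined 16
# Grow ring buffers to absorb bursts:
ethtool -G eth0 rx 4096 tx 4096
# Enable common offloads:
ethtool -K eth0 tso on gro on
# Keep an eye on drops/misses/errors/overruns:
ethtool -S eth0 | grep -iE 'drop|miss|err'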
Networking (Cont’d)
The main goal of high-level tuning is to remove transport-level bottlenecks.
■ Enabling BBR congestion control is generally a good idea.
■ Enabling the FQ scheduler w/ pacing is always a good idea (see the sketch below).
■ Your friends here are RUM metrics and
ss -n --extended --info or getsockopt(TCP_INFO/TCP_CC_INFO)
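A minimal sketch of enabling both (assumes a kernel with BBR available; these are runtime-only, persist them via sysctl.conf):
# FQ provides pacing, which BBR relies on with older kernels:
sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr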
iproute2
$ ss -tie
…
ts sack bbr rto:220 rtt:16.139/10.041 ato:40 mss:1448 pmtu:1500 rcvmss:1269
advmss:1428 cwnd:106 ssthresh:52 bytes_sent:9070462 bytes_retrans:3375
bytes_acked:9067087 bytes_received:5775 segs_out:6327 segs_in:551
data_segs_out:6315 data_segs_in:12
bbr:(bw:99.5Mbps,mrtt:1.912,pacing_gain:1,cwnd_gain:2) send 76.1Mbps
lastsnd:9896 lastrcv:10944 lastack:9864 pacing_rate 98.5Mbps delivery_rate
27.9Mbps delivered:6316 busy:3020ms rwnd_limited:2072ms(68.6%) retrans:0/5
dsack_dups:5 rcv_rtt:16.125 rcv_space:14400 rcv_ssthresh:65535 minrtt:1.907
…
Sysctl Cargo Culting
It is impossible to talk about network tuning w/o mentioning sysctls. Here are a
couple of relatively safe ones (collected into a config sketch after the list).
■ net.ipv4.tcp_slow_start_after_idle=0
● Should be safe if FQ w/ pacing is enabled.
■ net.ipv4.tcp_mtu_probing=1
● A must-have on the edge (along with a slightly reduced advmss).
■ net.ipv4.tcp_rmem, net.ipv4.tcp_wmem
● Should be big enough for connections to not be rcv/snd window limited.
■ net.ipv4.tcp_notsent_lowat=262144
● Or even lower if HTTP/2 prioritization is used.
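The same settings collected into a config sketch (the tcp_rmem/tcp_wmem triples below are illustrative min/default/max values; size them for your bandwidth-delay product):
# /etc/sysctl.d/90-network-tuning.conf
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_mtu_probing = 1
net.ipv4.tcp_rmem = 4096 131072 16777216
net.ipv4.tcp_wmem = 4096 16384 16777216
net.ipv4.tcp_notsent_lowat = 262144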
Recommended reading: Systems Performance and BPF Performance Tools, both by Brendan Gregg.
Application
Compiler Flags, Toolchains, and Runtimes
Keeping your compiler/runtime up-to-date is generally a good idea (example invocations follow the list).
■ Compiler upgrade, -O2, and -mtune can visibly affect performance.
● You can also try keeping -march/GOAMD64 in sync with your (cloud) hardware.
■ Link time optimization (LTO) can give a measurable perf boost.
■ Runtime upgrades can frequently give you single- to double-digit perf
improvements.
● For example, Go runtime upgrades frequently deliver memory/cpu usage improvements.
■ (Toolchain upgrades are also great from the security perspective)
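Example invocations for the flags above (a sketch; match -march/GOAMD64 to the oldest CPU in your fleet, not necessarily the build host):
# C/C++: optimize, tune for the build host, and enable LTO:
cc -O2 -mtune=native -flto -o app app.c
# Go: raise the x86-64 microarchitecture level (v3 roughly means AVX2-era CPUs):
GOAMD64=v3 go build ./...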
Profile-guided Optimization and Beyond
Most compilers are capable of PGO based on `perf record` profiles (a sample flow is sketched below).
■ Clang has AutoFDO.
■ Golang would likely have Feedback-Guided Optimization in 1.20.
You can go beyond compile-time optimization and use a post-link optimizer:
■ Facebook’s BOLT is now a part of LLVM:
https://github.com/llvm/llvm-project/tree/main/bolt
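A sample AutoFDO flow with Clang (a sketch; assumes a CPU with LBR support and the create_llvm_prof tool from the AutoFDO project):
# 1. Record a profile with branch stacks under representative load:
perf record -b -- ./app
# 2. Convert it into an LLVM sample profile:
create_llvm_prof --profile=perf.data --binary=./app --out=app.prof
# 3. Rebuild with the profile:
clang -O2 -fprofile-sample-use=app.prof -o app app.c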
Libraries
Any modern application consists of a myriad of libraries. Most servers nowadays
have allocator, TLS, compression, and serialization libraries; these are the
main candidates for tuning. For example, in the case of C/C++ servers:
■ Keeping libraries up-to-date is important.
● It doesn’t matter whether CPU supports AVX2 if your library can’t use it.
■ Changing malloc implementation is an option.
● Both jemalloc and tcmalloc have excellent tuning guides (see the sketch after this list).
■ BoringSSL can (mostly) be used as a drop-in replacement for OpenSSL.
● Often switching from RSA to ECDSA, or from AES to ChaCha (or back) can improve perf.
■ zlib has multiple performance-oriented forks.
● Intel, Cloudflare, zlib-ng.
● Sometimes more efficient algorithms like brotli or zstd can be used instead.
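For example, jemalloc exposes most of its tunables through the MALLOC_CONF environment variable; a sketch (the values are illustrative, see jemalloc's tuning guide):
# Background purging plus a shorter dirty-page decay returns memory faster:
MALLOC_CONF="background_thread:true,dirty_decay_ms:5000" ./app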
Recommended reading: Designing Data-Intensive Applications by Martin Kleppmann.
Recommended reading: Site Reliability Engineering:
Chapter 19. Load Balancing at the Frontend
Chapter 20. Load Balancing in the Datacenter
Chapter 21. Handling Overload
Chapter 22. Addressing Cascading Failures
Alexey Ivanov
rbtz@dapperlabs.com
@SaveTheRbtz