I'm going to discuss efficiency and performance optimizations across the different layers of the system. Starting from the lowest levels, hardware and drivers: these tunings can be applied to pretty much any high-load server. Then we’ll move to the Linux kernel and its TCP/IP stack: these are the knobs you want to try on any of your TCP-heavy boxes. Finally, we’ll discuss library- and application-level tunings, which are mostly applicable to HTTP servers in general and nginx/envoy specifically.
For each potential area of optimization I’ll try to give some background on latency/throughput tradeoffs (if any), monitoring guidelines, and, finally, suggest tunings for different workloads.
I'll also cover more theoretical approaches to performance analysis and newly developed tooling like `bpftrace` and new `perf` features.
3. Optimizing (web-)Servers 5 Years Later…
This is an updated version of the nginx.conf’17 talk.
Changelog:
■ New hardware features are available. AMD EPYCs and ARM64 are a thing.
■ New Linux kernel features, especially around observability.
■ Replaced the nginx focus with a generic HTTP server/client one.
● (Most clients and servers nowadays are HTTP/1.x- or HTTP/2-based)
4. The biggest performance gains usually come from high-level optimizations:
load balancing, algorithms, data structures, and (especially) business logic.
A few examples from large scale production systems.
■ The lower the variance in backend load, the better.
● Applying “Two Random Choices” load-balancing greatly reduced latencies.
■ The fastest code is “no code”.
● E.g. at Dropbox we pre-compress static files for the web, so we spend 0% CPU on
compression while maintaining the best possible ratio.
■ Algorithm improvements.
● Switching from zlib to brotli saved us both CPU and storage.
■ Data locality improvements.
● Switching from B+tree to LSM-based storage improved compression efficiency and reduced
database sizes by ~2.5x.
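A minimal sketch of the “no code” point above: spend CPU once at deploy time at the maximum compression ratio, then serve the pre-built artifact as-is (the file name here is a stand-in for a real asset):

```shell
# Stand-in asset; in production this step runs once at build/deploy time.
printf 'body { margin: 0; }\n' > style.css
gzip -9 < style.css > style.css.gz   # max ratio; brotli -q 11 does even better
ls -l style.css.gz
```

nginx can then serve the pre-built file directly via ngx_http_gzip_static_module (`gzip_static on;`); the ngx_brotli module offers the analogous `brotli_static on;`.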
High-level vs Low-level Optimizations
6. CPU and Memory
Generally, picking the newest processor is the best choice since it will have the
most hardware offloads:
■ AVX2, BMI, ADX, AVX-512, AES-NI, SHA-NI (x86)
● (Symmetric/Asymmetric encryption, signatures, hashing, MACs)
■ PMUL, PMULL2, SHA256H, SHA3 (ARMv8.2+)
● (finite field arithmetic, hashing, MACs)
Many of the things that previously were prohibitively expensive now are almost
free due to hardware offloads: mTLS, crypto-hashing, storage encryption.
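A quick way to check which of these offloads a given Linux box actually exposes (the flag names are the kernel’s; on ARM they live under “Features” instead of “flags”):

```shell
# List the crypto/SIMD-related CPU flags present on this machine.
grep -oEw 'avx2|avx512f|aes|sha_ni|adx|bmi2|pmull|sha2' /proc/cpuinfo | sort -u
# Then measure what an offload buys you, e.g.:
#   openssl speed -evp aes-256-gcm
```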
7. CPU and Memory (Cont’d)
What if budget is limited? Rules of thumb:
■ Low-latency: single NUMA node, bigger caches, disabled SMT, more GHz,
more memory channels.
■ High-throughput: more cores, enabled SMT, more memory.
Frequently, in production, high CPU usage does not mean a CPU bottleneck but a
“CPU pipeline stall” problem, i.e. a cache, TLB, or memory-bandwidth limitation.
9. github.com/andikleen/pmu-tools
# toplev.py -l1 --single-thread --force-events ./app
BE Backend_Bound: 60.34%
This category reflects slots where no uops are being
delivered due to a lack of required resources for
accepting more uops in the Backend of the pipeline...
10. github.com/andikleen/pmu-tools
# toplev.py -l3 --single-thread --force-events ./app
BE Backend_Bound: 60.42%
BE/Mem Backend_Bound.Memory_Bound: 32.23%
BE/Mem Backend_Bound.Memory_Bound.L1_Bound: 32.44%
This metric represents how often CPU was stalled without
missing the L1 data cache...
BE/Core Backend_Bound.Core_Bound: 45.93%
BE/Core Backend_Bound.Core_Bound.Ports_Utilization: 45.93%
This metric represents cycles fraction application was
stalled due to Core computation issues (non divider-
related)...
11. NICs
Relevant only for real hardware, not clouds.
■ 25 Gbit/s or more; older NICs will likely have miscellaneous bottlenecks.
■ Open-source drivers, small firmware, an active community.
● For when (not “if”) issues occur.
12. Pressure Stall Information (PSI)
“PSI provides for the first time a canonical way to see resource pressure increases
as they develop, with new pressure metrics for three major resources—
memory, CPU, and IO.”
Source: https://facebookmicrosites.github.io/psi/docs/overview
13. PSI: global and Per-cgroup (v2)
$ cat /proc/pressure/io
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
$ cat /sys/fs/cgroup/cg1/io.pressure
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
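A minimal sketch of consuming these numbers for monitoring; the input line is a hypothetical sample in the /proc/pressure/* format (“some” is the share of time at least one task was stalled on the resource):

```shell
# Extract the 10-second pressure average from a PSI-formatted line.
line='some avg10=1.23 avg60=0.45 avg300=0.10 total=123456'
echo "$line" | awk '{split($2, a, "="); print a[2]}'
# → 1.23
```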
16. Kernel Optimizations
The best Linux optimization is a recent kernel version. New kernel versions bring
improvements to networking, memory management, I/O, and the rest of the Linux
subsystems.
But most importantly they bring improvements to observability tooling.
17. CPU and Memory
After you’ve picked the best CPU for your workload, you’ll need to utilize it to the
max:
■ For Intel/AMD you would want to use intel_pstate or amd-pstate driver.
● If you want to be more energy efficient, consider the schedutil governor; use
performance otherwise.
■ Set NUMA affinity for your application.
■ Use transparent huge pages.
● Careful here: this may reduce performance on some workloads.
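The bullets above might look like this in practice; a sketch only, assuming root, NUMA node 0, and a hypothetical ./server binary:

```shell
# Sketch: governor, NUMA affinity, and opt-in transparent huge pages.
cpupower frequency-set -g performance         # or schedutil for energy efficiency
numactl --cpunodebind=0 --membind=0 ./server  # keep CPU and memory on one node
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled  # THP only where madvise()d
```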
18. Networking
The main goal of low-level tuning is to parallelize packet processing, add affinities,
increase buffer sizes, and enable hardware offloads.
■ ethtool is your friend here: # of queues, ring buffers, offloads, coalescing.
● -L, -G, -K, -C, etc.
● -S is your friend to keep track of drops/misses/errors/overruns/etc.
■ Mellanox and Intel cards come with set_irq_affinity/mlnx_affinity.
● Do not forget to turn off irqbalance.
■ After RSS is enabled it is generally a good idea to turn on XPS and xps_rxqs.
■ Avoid RPS. RFS can also have negative consequences.
■ For low latency: try to stay within the NUMA node the NIC’s PCIe link is attached to.
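A hedged ethtool session covering the flags above; eth0 and the queue/ring sizes are assumptions, so query your NIC’s limits first with `ethtool -l` and `ethtool -g`:

```shell
# Parallelize, buffer, offload, coalesce; then watch the counters.
ethtool -L eth0 combined 16                  # one queue per serving core
ethtool -G eth0 rx 4096 tx 4096              # bigger ring buffers
ethtool -K eth0 gro on tso on                # enable offloads
ethtool -C eth0 adaptive-rx on               # adaptive interrupt coalescing
ethtool -S eth0 | grep -iE 'drop|miss|err|overrun'   # track trouble over time
```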
19. Networking (Cont’d)
The main goal of high-level tuning is to remove transport-level bottlenecks.
■ Enabling BBR congestion control is generally a good idea.
■ Enabling the FQ scheduler w/ pacing is always a good idea.
■ Your friends here are RUM metrics and
ss -n --extended --info or getsockopt(TCP_INFO/TCP_CC_INFO)
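The BBR + FQ combination above can be enabled like this (a sketch; needs root, and the `tc` line assumes an eth0 device):

```shell
# Pacing-capable qdisc as the default for newly-initialized devices:
sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr
tc qdisc replace dev eth0 root fq   # apply FQ to an already-up device
ss -tin                             # verify: look for "bbr" and pacing_rate
```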
21. It is impossible to talk about network tuning w/o mentioning sysctls. Here are a
couple of relatively safe ones.
■ net.ipv4.tcp_slow_start_after_idle=0
● Should be safe if FQ w/ pacing is enabled.
■ net.ipv4.tcp_mtu_probing=1
● A must-have on the edge (along with a slightly reduced advmss).
■ net.ipv4.tcp_rmem, net.ipv4.tcp_wmem
● Should be big enough for connections to not be rcv/snd window limited.
■ net.ipv4.tcp_notsent_lowat=262144
● Or even lower if HTTP/2 prioritization is used.
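Collected into one runnable fragment (needs root; the tcp_rmem/tcp_wmem values are illustrative, so size the third, maximum value for your bandwidth-delay product):

```shell
# The sysctls from this slide; persist them under /etc/sysctl.d/ in production.
sysctl -w net.ipv4.tcp_slow_start_after_idle=0
sysctl -w net.ipv4.tcp_mtu_probing=1
sysctl -w net.ipv4.tcp_rmem='4096 131072 16777216'
sysctl -w net.ipv4.tcp_wmem='4096 16384 16777216'
sysctl -w net.ipv4.tcp_notsent_lowat=262144
```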
Sysctl Cargo Culting
24. Compiler Flags, Toolchains, and Runtimes
Keeping your compiler/runtime up-to-date is generally a good idea.
■ Compiler upgrade, -O2, and -mtune can visibly affect performance.
● You can also try keeping -march/GOAMD64 in sync with your (cloud) hardware.
■ Link time optimization (LTO) can give a measurable perf boost.
■ Runtime upgrade can frequently give you single to double digit perf
improvements.
● For example, Go runtime upgrades frequently deliver memory/cpu usage improvements.
■ (Toolchain upgrades are also great from the security perspective)
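A hedged sketch of the flags above; x86-64-v3 / GOAMD64=v3 are assumptions about the oldest CPU in your fleet:

```shell
# GCC/Clang: O2, LTO, and a microarchitecture level matching your hardware.
export CFLAGS='-O2 -flto -march=x86-64-v3'
export CXXFLAGS="$CFLAGS"
# Go: same idea, expressed via the GOAMD64 microarchitecture level.
GOAMD64=v3 go build ./...
```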
25. Profile-guided Optimization and Beyond
Most compilers are capable of PGO based on `perf record` profiles.
■ Clang has AutoFDO.
■ Go will likely get feedback-guided optimization (PGO) in 1.20.
You can go beyond compile-time optimization and use post-link optimizer:
■ Facebook’s BOLT is now a part of LLVM:
https://github.com/llvm/llvm-project/tree/main/bolt
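A sample-based PGO + BOLT pipeline might look like this (a sketch only; ./app and the file names are placeholders, and exact tool flags vary by LLVM/AutoFDO version):

```shell
# Collect an LBR-based profile from a representative workload.
perf record -b -- ./app
# AutoFDO: convert the perf profile into something clang can consume.
create_llvm_prof --binary=./app --profile=perf.data --out=app.prof
clang -O2 -fprofile-sample-use=app.prof -o app.pgo app.c
# Post-link optimization with BOLT:
perf2bolt -p perf.data -o perf.fdata ./app
llvm-bolt ./app -o app.bolt -data=perf.fdata -reorder-blocks=ext-tsp
```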
26. Libraries
Any modern application consists of a myriad of libraries. Most servers nowadays
include allocator, TLS, compression, and serialization libraries. These are the
main candidates for tuning. For example, in the case of C/C++ servers:
■ Keeping libraries up-to-date is important.
● It doesn’t matter whether the CPU supports AVX2 if your library can’t use it.
■ Changing the malloc implementation is an option.
● Both jemalloc and tcmalloc have excellent tuning guides.
■ BoringSSL can (mostly) be used as a drop-in replacement for OpenSSL.
● Often switching from RSA to ECDSA, or from AES to ChaCha (or back), can improve perf.
■ zlib has multiple performance-oriented forks.
● Intel, Cloudflare, zlib-ng.
● Sometimes more efficient algorithms like brotli or zstd can be used instead.
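The allocator swap mentioned above can be tried without recompiling; the library paths are distro-specific assumptions (Debian/Ubuntu shown) and ./server is a placeholder:

```shell
# Preload an alternative allocator into an existing binary.
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./server
# jemalloc also accepts runtime tuning via MALLOC_CONF, e.g.:
MALLOC_CONF='background_thread:true,dirty_decay_ms:1000' \
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./server
```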