DotNext 2017 conference in Moscow, RU - 2017/11/12
Talk: .NET Core Networking stack and Performance by Karel Zikmund
http://2017.dotnext-moscow.ru/en/2017/msk/talks/3hcuoycrw4egcs0mkewoio/
10. Networking – Sockets perf results
• Micro-benchmark only (disclaimer: Netty/Go impl may be inefficient)
• Linux 2 CPUs
Requests per second (thousands), by payload size:

            1 B    16 B   256 B   4 KB
.NET Core   370    369    384     198
Netty       527    540    454     124
Go          517    531    485     210

Throughput (GB/s), by payload size:

            256 B  4 KB   64 KB   1 MB
.NET Core   0.09   0.77   1.09    1.10
Netty       0.11   0.48   0.66    0.67
Go          0.12   0.82   1.10    1.11
11. Networking – Sockets perf results
• Micro-benchmark only (disclaimer: Netty/Go impl may be inefficient)
• Linux 2 CPUs

Plain sockets (GB/s), by payload size:

            256 B  4 KB   64 KB   1 MB
.NET Core   0.09   0.77   1.09    1.10
Netty       0.11   0.48   0.66    0.67
Go          0.12   0.82   1.10    1.11

SSL (GB/s), by payload size:

            256 B  4 KB   64 KB   1 MB
.NET Core   0.04   0.31   0.71    0.87
Netty       0.03   0.12   0.15    0.15
Go          0.06   0.56   0.98    1.12
12. Networking – Sockets perf on Server
• Kestrel server uses libuv today -> prototypes of a Sockets-based transport
• Early prototype (with hacks):
• 7% improvement + more potential
• Recent prototype (very preliminary data):
• 15% worse on Linux
• 20% worse on Windows
• Workarounds in Sockets -> parity with libuv perf
• Investigation in progress
13. Networking – ManagedHandler perf
• ManagedHandler
• Very early development stage
• Bugs
• Missing large features – authentication, proxy, http2
• Early measurements (simple http micro-benchmark):
• Windows: Parity with Go
• Linux: 15% gap (pending investigation)
14. Networking – SSL perf
• Historical reports on some .NET Framework scenarios: 2x slower
• Report of a Linux .NET Core 2.0 app being 4x slower
• libcurl+HttpClient pattern limitation
• With workaround: 14% overhead of SSL
• TechEmpower benchmark
• Larger http vs. https difference than for Rust/Go/Netty
• Sockets micro-benchmarks – 23% gap
• Rewrite attempt by the community (@drawaes)
• Next steps: Measure & analyze micro-benchmarks & end-to-end scenarios
15. Networking – Industry benchmarks
• TechEmpower benchmark
• More end-to-end, with DB, etc.
• Useful for overall platform performance comparison
• Round 15 (preliminary data)
• ASP.NET Core entry at #5 (jump from #14 in Round 14)
16. Importance of Performance
Platform performance shows how fast your app could be
… but it is not everything:
• Productivity
• Tooling
• Developer availability (in-house/to hire)
• Documentation
• Community
• etc.
17. Application Performance Tips
• Plan for performance during design
• Understand scenario, set goals
• Prototype and measure early
• Optimize what’s important – measure
• Understand the big picture
• Avoid micro-optimizations
• Don’t guess root cause – measure
• Minimize repro – it’s worth it!
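The "measure, don't guess" advice can start as simply as a Stopwatch loop – a minimal sketch (BenchmarkDotNet is the more robust tool for real micro-benchmarks; Work is a hypothetical workload standing in for the code under test):

```csharp
using System;
using System.Diagnostics;

class MeasureSketch
{
    // Hypothetical workload to be measured.
    static long Work(int n)
    {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += i;
        return sum;
    }

    static void Main()
    {
        const int iterations = 100_000;

        Work(1000); // warm-up, so JIT compilation doesn't skew the timing

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) Work(1000);
        sw.Stop();

        Console.WriteLine($"{sw.Elapsed.TotalMilliseconds * 1000 / iterations:F3} us/op");
    }
}
```

Even this crude loop catches the classic mistakes: timing cold-JIT code, or timing a single iteration.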
18. BCL Performance
• Fine-tuned over 15 years
• Opportunities are often trade-offs (memory vs. speed, etc.)
• Problem: Identify scenarios which matter
• OSS helps
• More eyes on code
• Motivated contributors
• More reports
• Perf improvements in .NET Core (.NET blog by Stephen Toub)
• Collections, Linq, Compression, Crypto, Math, Serialization, Networking
• Span<T> sprinkled in BCL
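An illustrative sketch of why Span<T> helps BCL code: one method can accept a slice of an array, a stackalloc buffer, or native memory without copying (example code, not from the talk):

```csharp
using System;

class SpanSketch
{
    // One overload serves array slices, stackalloc buffers, etc.
    static int Sum(ReadOnlySpan<int> values)
    {
        int total = 0;
        foreach (int v in values) total += v;
        return total;
    }

    static void Main()
    {
        int[] data = { 1, 2, 3, 4, 5 };

        // Slice without allocating or copying; writes go through to the array.
        Span<int> middle = data.AsSpan(1, 3);
        middle[0] = 20;

        Console.WriteLine(data[1]);     // 20 – same underlying memory
        Console.WriteLine(Sum(middle)); // 27

        // The same Sum also works over stack memory – no heap allocation.
        Span<int> onStack = stackalloc int[3];
        onStack.Fill(7);
        Console.WriteLine(Sum(onStack)); // 21
    }
}
```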
19. BCL Performance – What not to take?
• Specialized collections
• BCL designed for usability and decent perf for 95% of customers
• Code complexity (maintainability) vs. perf wins
• APIs for specialized operations (e.g. to save duplicate lookup)
• Creates complexity
• May leak implementation into API surface
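A concrete example of the "save a duplicate lookup" kind of API that already exists in the BCL: Dictionary<TKey,TValue>.TryGetValue hashes the key once, whereas ContainsKey followed by the indexer hashes it twice:

```csharp
using System;
using System.Collections.Generic;

class DuplicateLookupSketch
{
    static void Main()
    {
        var counts = new Dictionary<string, int> { ["apple"] = 3 };

        // Two hash lookups: one in ContainsKey, one in the indexer.
        int viaContains = counts.ContainsKey("apple") ? counts["apple"] : 0;

        // One hash lookup: TryGetValue does the test and the fetch together.
        int viaTryGet = counts.TryGetValue("apple", out var value) ? value : 0;

        Console.WriteLine(viaContains); // 3
        Console.WriteLine(viaTryGet);   // 3
    }
}
```

Exposing such combined operations is exactly the trade-off the slide describes: faster, but a wider API surface that hints at the implementation.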
20. Wrap Up
• Proactive investments into .NET Networking stack
• Consistency across platforms
• Great performance for all workloads
• Ongoing scenario/feedback-based improvements in BCL perf
• Performance in general is:
• Important
• But not the only important thing
• Tricky to get right in the right place
Editor's notes
Client stack
HttpWebRequest since .NET 1.0
Exceptions on errors (404)
Headers as strings (parsing) – error prone
4.5 added HttpClient, driven by WCF
Pipeline architecture (extensibility)
No http2 support
Original plan was to later re-wire HttpClient directly on the fundamentals – it turns out to be huge compat work and will likely never happen (compat is king on .NET Framework)
Missing APIs from the picture:
Fundamentals: NetworkingInfo, Uri
FtpWebRequest, FileWebRequest
Mail on Sockets & SslStream – now obsoleted (MailKit recommended)
WebSockets on HttpWebRequest & Sockets & websockets.dll (Win8+)
HttpListener (server) – now obsoleted by Kestrel
UWP:
WinRT APIs designed by Windows Networking team at the same time as 4.5 – almost 1:1 mapping
win9net.dll is client library - http2 support
.NET Core:
http2 support – server library
HttpWebRequest for compatibility in .NET Core 2.0 (.NET Standard 2.0), but “obsoleted”
Other “obsoleted” in 2.0: *WebRequest & Mail & HttpListener
libcurl with NSS, not just OpenSSL
Different behaviors across OSes and even Linux distros – behavioral inconsistencies
WebSocket = ManagedWebSocket on Win7 & Linux/Mac (With mini ManagedHandler on Sockets)
Note: Attempt to ship WinHttpHandler also for .NET Framework as System.Net.Http.dll (http2 on Desktop) – ambition to replace the inbox version failed due to differences
Key values: Consistency and Perf wins
Mono has OS-specific handlers (Phone OS specific capabilities around connection transition between data and Wi-Fi)
Foundation – performance (& reliability)
Usable for both server and client scenarios
Web stack – consistency & performance (& reliability)
Important for middleware scenarios (not just 1 server)
Emerging technologies – new protocols, capabilities and performance motivated
RIO = Registered I/O (Win8+)
QUIC = Quick UDP Internet Connections (TCP/TLS replacement) … Latency improvements (esp. for streaming)
Maintenance components – minimal investments – mostly for reliability and the most necessary standards support (preserve customers’ investments)
Repeatability
Non-networking micro-benchmarks – time (wallclock), memory … classic CLR perf lab disables many OS features
Cloud (Azure) attempt
You want: Full control
Column is payload size (1B -> 1MB, each column is 16x bigger than previous one)
Note: RPS as metric in second table better tells the story of scaling based on payload size
Go has assembly-written crypto
Note: .NET Core 1.1 was at 0.47 GB/s at 1MB – 2x improvement in 2.0 and 2.1 (=future)
Sockets value: Consistency & less external dependency
Early prototype – Hacks around response buffering in Kestrel – flushing tuned for libuv
Recent prototype (1 week old) – Workarounds point to potential perf improvements
Known fact: SslStream class could use some love
Note: ManagedHandler https: 7% slower than CurlHandler (without Ssl)
Feature across Language (C# compiler), Runtime (incl. JIT), and BCL
Value is not perf on its own – unsafe code can be faster (up to 25% on micro-benchmarks)
But not universal for Native memory
Pinning memory
Example of how not everything is black and white … clearly better perf, but often trade-offs
See the black line – regressions from Span<T>, then recovering the perf
Warning: Not everyone is truly building hyper-scale service
Even if you think you are, don’t forget that most scaling apps are rewritten every 2-3 years
You don’t have to be perfect on day 1, evolve
Story: Trading SW
.NET for productivity (over C++ proposal)
Later perf
More and more serious
Down to sub-ms GCs, reusing memory
Rewriting key components to native eventually
Note: .NET can be faster than C++ in certain workloads (e.g. allocations in Gen0 are super efficient)
Real app startup (demo app in both C++ and C#)
Optimize what’s important:
90-10 rule (95-5) … what contributes to app performance
Story: Arguing over data-structure choice (List vs. array) when the data is <100 items and the service has 3 hops between servers
Mistakes from guessing the root cause:
Conclusions based on 1 dump (or 2 in the better case)
Blaming platform:
Lots of memory used by the app => it must be a GC bug … if there is no memory pressure on the OS, the GC will use it
GC runs too often => it must be a GC bug … but maybe you just allocate too much and the GC is doing what it's told
Also in non-performance:
JIT_Throw on callstack => JIT bug … we renamed it to IL_Throw (it just throws the exception)
Crash during GC => GC bug … corruption is likely from interop / native/unsafe code
BTW: TTD (Time Travel Debugging) is a life saver
Thousands of APIs, hard to pick the right ones
Problem:
Using telemetry from MS/partners, partner teams' reports, working closely with customers and community
OSS:
We often ask about the scenario, to understand the big picture
Examples (improvements are quite often 30% and more):
SortedSet<T>.ctor – O(n^2) -> O(n) … 600x on 400K items
List<T>.Add – already fast, but used everywhere
ConcurrentBag/ConcurrentQueue
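The SortedSet<T> constructor win above comes from building a balanced tree directly from already-sorted input instead of inserting item by item; the core idea, sketched (illustrative, not the actual BCL implementation):

```csharp
using System;

class BalancedBuildSketch
{
    class Node
    {
        public int Value;
        public Node Left, Right;
        public Node(int value) => Value = value;
    }

    // O(n): the middle element becomes the root, the halves become subtrees.
    static Node Build(int[] sorted, int lo, int hi)
    {
        if (lo > hi) return null;
        int mid = lo + (hi - lo) / 2;
        var node = new Node(sorted[mid]);
        node.Left = Build(sorted, lo, mid - 1);
        node.Right = Build(sorted, mid + 1, hi);
        return node;
    }

    static int Depth(Node n) =>
        n == null ? 0 : 1 + Math.Max(Depth(n.Left), Depth(n.Right));

    static void Main()
    {
        var sorted = new int[1023];
        for (int i = 0; i < sorted.Length; i++) sorted[i] = i;

        var root = Build(sorted, 0, sorted.Length - 1);
        Console.WriteLine(Depth(root)); // 10 – balanced, vs. item-by-item insertion cost
    }
}
```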