5. What is latency?
• Latency impacts the user experience
• Lower latency = more responsive = better
experience
• A fast download over a high-latency link can take
longer than a slow download over a low-latency
link
6. Why measure latency?
• Efficiency:
• Improved resource usage
• Improved user experience
• Spotting and diagnosing defects
7. Where is Latency?
• Between:
• A CPU and its cache
• Client and server over a network
• Application and disk
• Anywhere a system does work
8. Where is latency?
• L1 cache reference 0.5 ns
• Branch mispredict 5 ns
• L2 cache reference 7 ns
• Mutex lock/unlock 100 ns
• Main memory reference 100 ns
• Compress 1K bytes with Zippy 10,000 ns
• Send 2K bytes over 1 Gbps network 20,000 ns
• Read 1 MB sequentially from memory 250,000 ns
• Round trip within same datacenter 500,000 ns
• Disk seek 10,000,000 ns
• Read 1 MB sequentially from network 10,000,000 ns
• Read 1 MB sequentially from disk 30,000,000 ns
• Send packet CA->Netherlands->CA 150,000,000 ns
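The scale of these numbers is easy to check for yourself. A minimal sketch (in Python, so absolute values will be far larger than the hardware figures above, but the relative orders of magnitude still show):

```python
import time

def measure_ns(fn, iterations=100_000):
    """Average wall-clock latency of fn() in nanoseconds."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        fn()
    return (time.perf_counter_ns() - start) / iterations

cache = {"key": "value"}

# An in-memory lookup sits at the fast end of the table above;
# anything touching disk or network will be orders of magnitude slower.
dict_ns = measure_ns(lambda: cache["key"])
print(f"dict lookup: ~{dict_ns:.0f} ns")
```

Measuring over many iterations averages out timer resolution and scheduling noise.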
9. Causes of network latency
• Physical limitations - speed of light, wire speeds
• Congestion at switches, routers and servers
• Packet loss due to noise, congestion, faults
10. Round Trip Times
• aka RTT
• Time to go there and back again
• Return route may be different from the outbound
11. Network Latency Tools
• Ping. Time between sending ICMP Echo Request and
receiving ICMP Echo Reply
• Traceroute. Time between sending a packet with an
incremented TTL value and receiving an ICMP Time
Exceeded packet
• tcptraceroute. traceroute using TCP packets to
configurable ports
• mtr - does ICMP, UDP and TCP traceroute
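The same idea behind tcptraceroute's per-hop timing can be sketched in a few lines: time a TCP connect, which completes after the three-way handshake, so the elapsed time is roughly one round trip. This demo runs against a local listener so it needs no network access:

```python
import socket
import time

def tcp_connect_rtt(host, port, timeout=2.0):
    """Time the TCP three-way handshake (~one RTT), in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        elapsed = time.perf_counter() - start
    return elapsed * 1000.0

# Self-contained demo: a throwaway local listener on an ephemeral port.
server = socket.socket()
server.bind(("127.0.0.1", 0))        # OS picks a free port
server.listen(1)
port = server.getsockname()[1]

rtt_ms = tcp_connect_rtt("127.0.0.1", port)
print(f"handshake took {rtt_ms:.3f} ms")
server.close()
```

Against a remote host this measures data-path RTT rather than ICMP, which sidesteps the deprioritisation problem mentioned later in the notes.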
13. Transmission Control
Protocol
• Stateful, connection oriented protocol for reliable
data transmission
• Guarantees data delivery and ordering
• Servers maintain state tables of connections
• HTTP, SMTP, SSL/TLS, IRC, SSH…
14. TCP
• Three-way handshake. 1.5 round trips to set up a
connection
15. TCP Latency Improvements
• By reducing number of round trips:
• Compress content into fewer packets. 1500 MTU
= 1460-byte payload
• TCP timestamps take an extra 12 bytes = 1448
byte payload. Timestamp can be disabled.
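The payload arithmetic above translates directly into packet counts. A small sketch (header sizes from the slide; the function name is illustrative):

```python
import math

MTU = 1500
IP_TCP_HEADERS = 40          # 20-byte IP header + 20-byte TCP header
TIMESTAMP_OPTION = 12        # TCP timestamps option, if enabled

def packets_needed(content_bytes, timestamps=True):
    """Packets required to carry content_bytes of payload."""
    payload = MTU - IP_TCP_HEADERS - (TIMESTAMP_OPTION if timestamps else 0)
    return math.ceil(content_bytes / payload)

print(packets_needed(100_000))            # 1448-byte payload per packet
print(packets_needed(100_000, False))     # 1460-byte payload per packet
```

For a 100 KB response the timestamp option costs one extra packet here; compression that shrinks the content usually saves far more than header tuning.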
16. TCP Improvements
• Move your content closer to your users:
• Make good use of local caches (e.g. browser)
• Content Delivery Networks (Cloudflare,
Cloudfront, Akamai)
• Host geographically closely
• Host at locations with low latency links
17. HTTP Latency
• Use HTTP/1.1, HTTP/2 (née SPDY)
• Ensure pipelining is enabled
• Tune TCP keep alive
• Try TCP corking (buffer the stream, then send)
and TCP_NODELAY (send small payloads without
Nagle buffering)
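Both knobs are ordinary socket options. A minimal sketch of setting them (TCP_CORK is Linux-only, hence the guard):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Disable Nagle's algorithm: small writes leave immediately instead of
# being coalesced into larger segments (lower latency, more packets).
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# TCP_CORK (Linux-only) is the opposite trade: hold partial frames until
# uncorked, so a header and body written separately share one packet.
if hasattr(socket, "TCP_CORK"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 1)

nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print(nodelay)
sock.close()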
18. HTTP Latency
• Take care over caching and provide well formed
headers
• Use tools like PageSpeed Insights to analyse
performance
• Pagespeed module to modify content on the
server
19. SSL/TLS
• Use AES and compatible libraries on processors
with AES-NI for hardware acceleration
• Elliptic curve (ECDSA) for smaller certs & keys
and better performance.
• Terminate SSL at the edge and consider using
lightweight or no encryption inside the local
network.
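Whether your TLS stack actually offers hardware-friendly AES-GCM suites can be checked from Python's `ssl` module (the filter below is a quick illustrative inspection, not a tuning recipe):

```python
import ssl

ctx = ssl.create_default_context()

# List the AES-GCM suites the default context offers; on CPUs with
# AES-NI and a recent OpenSSL these run in hardware.
aes_gcm = [c["name"] for c in ctx.get_ciphers()
           if "AES" in c["name"] and "GCM" in c["name"]]
print(aes_gcm[:3])
```

On a VM, also confirm the hypervisor exposes the `aes` CPU flag to the guest, or the library will fall back to a software implementation.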
20. User Datagram Protocol
• ‘Fire and forget’ - no inbuilt reliability,
connectionless
• No handshake
• Ordering and retransmission at the application
level
• Stateless, so no connection states to manage
• DNS, VOIP, SNMP, RIP, VPNs, Games, Mosh
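The "no handshake" point is easy to see in code: a UDP datagram just goes, and any sequencing lives in the payload. A self-contained sketch over loopback (the `seq=` convention is illustrative, not part of UDP):

```python
import socket

# A UDP "exchange" with no connection setup: one datagram out, one back.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
addr = server.getsockname()

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.settimeout(1.0)
client.sendto(b"seq=1 ping", addr)        # sent immediately, no handshake

data, peer = server.recvfrom(1024)
server.sendto(b"seq=1 pong", peer)

reply, _ = client.recvfrom(1024)
print(reply)                              # the application tracks seq itself

client.close(); server.close()
```

If the reply never arrives, `recvfrom` times out; retransmission is entirely the application's problem, exactly as the slide says.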
21. Domain Name System
• DNS lookups can hamper user experience
significantly
• Synchronous lookup before each resource
access
• Uses UDP (usually) for client/server lookups
22. DNS
• Caches are distributed nearer to the user (DNS
resolvers/forwarders)
• Great for popular sites
• Lower-traffic sites may still require an
authoritative lookup
23. DNS CNAMES
• DNS CNAMEs - name -> name -> IP
• Two DNS lookups. Two round trips.
• Never use a CNAME at a zone apex if you have
other records in that zone.
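The extra round trip per CNAME link can be modelled with a toy in-memory resolver (zone contents below are hypothetical example records):

```python
# Toy zone: each CNAME hop costs one extra lookup / round trip.
ZONE = {
    "www.example.com": ("CNAME", "cdn.example.net"),
    "cdn.example.net": ("A", "192.0.2.10"),
}

def resolve(name):
    """Follow CNAME links until an A record, counting lookups."""
    lookups = 0
    while True:
        rtype, value = ZONE[name]
        lookups += 1
        if rtype == "A":
            return value, lookups
        name = value                      # follow the CNAME

ip, lookups = resolve("www.example.com")
print(ip, lookups)                        # two lookups for one name
```

Real resolvers often return the whole chain in one response, but a cold cache can still pay per-link round trips.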
24. DNS Time to Live
• Time a DNS record is cached in
non-authoritative servers.
• Need to strike a balance between keeping the
record cached near the user and the ability to
update the record
• 1 day is a good starting point. Decrease before
record switch overs.
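The TTL trade-off is just cache expiry. A minimal sketch of a resolver-style TTL cache, using an injectable clock so the expiry behaviour is visible without waiting a day (class and names are illustrative):

```python
import time

class DnsCache:
    """Minimal TTL cache mimicking a non-authoritative resolver."""
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}

    def put(self, name, ip, ttl):
        self._store[name] = (ip, self._clock() + ttl)

    def get(self, name):
        entry = self._store.get(name)
        if entry is None:
            return None                   # miss: query the authoritative server
        ip, expires = entry
        if self._clock() >= expires:
            del self._store[name]         # TTL hit 0: must re-resolve
            return None
        return ip

now = [0.0]
cache = DnsCache(clock=lambda: now[0])
cache.put("www.example.com", "192.0.2.1", ttl=86_400)   # 1-day TTL
hit = cache.get("www.example.com")
now[0] = 86_401                           # one second past the TTL
miss = cache.get("www.example.com")
print(hit, miss)
```

Lowering the TTL before a planned record change bounds how long stale answers like `hit` can survive in caches.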
25. DNS clients
• Avoid synchronous DNS lookups where possible:
async libraries, or batch process results later
• Consider local hosts files, use config
management to distribute
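One way to avoid stalling on lookups without a dedicated async DNS library is to push the blocking `getaddrinfo` calls onto worker threads. A sketch (using `localhost` so it runs offline; real code would resolve real names):

```python
import socket
from concurrent.futures import ThreadPoolExecutor

def lookup(name):
    """Blocking getaddrinfo, run on a worker thread so callers don't stall."""
    info = socket.getaddrinfo(name, 80, proto=socket.IPPROTO_TCP)
    return name, info[0][4][0]            # first address for the name

names = ["localhost"] * 3                 # offline-safe stand-in names
with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(pool.map(lookup, names))
print(results)
```

The same pattern works for batch-resolving IPs in logs after the fact, rather than paying a synchronous lookup on the request path.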
26. DNS
• Keep DNS geographically close to users
• Use providers with anycast DNS servers
• Globally distribute records if the audience is
global
• Can make initial load significantly faster
27. QUIC
• Experimental protocol from Google for encrypted,
multiplexed streams over UDP
• Aims to reduce number of round trips
• May make it into the next TLS standard
• Supported by Chrome; a prototype server is
available
28. Client and Server hosts
• Watch for queuing - something in a queue means
not enough resource to service the request
• Disk IO historically a problem. Throughput in
IOPS. SSDs are reducing this latency.
• Be familiar with the standard system monitoring
tools
• Be wary of multi-threaded processes and locks
29. Cloud
• Get familiar with cloud providers’ tools. Useful
views from outside the hosts.
• Load test for 5+ cycles of monitoring
• Can provide protocol level information
• Test apps from the point of view of the users -
Nagios, Pingdom, hitting representative end points
• Don’t take their word for performance - measure it
Measured in seconds, typically, or milliseconds on the IO scale and ns on the CPU/memory scale. Minutes, hours, days for large processing tasks.
Or action that starts the chain of events. This might be a keypress, or a download request or following a link
The reaction to the action - displaying the keypressed on the screen, starting the download or finishing the download, depending on what it’s being used for, painting an initial page layout or loading the full page.
The end point is often viewed from the point of view of the next step; it is that which suffers from the latency.
Typing on a keyboard: a <40ms response is needed - the less the better.
The more interactive the lower the latency between user input and the response.
Some events are synchronous and must complete before the next step can start and will delay the next event.
Some latency is long enough for other tasks to go away and come back later - these are asynchronous.
CPUs waiting for IO to finish can be used for other tasks
Users get faster response to their interactions and get their work load done in a shorter time
A db request takes 5 seconds where normally it would take one
Measure over time, graphs are useful
L1, L2, L3, memory, disk
Open file, Read file, Write file Close file, Seek
A packet going from a->b is work
A car accelerating - latency between start and 85mph
More topically, there was a significant latency between Richard III dying and getting a king’s burial.
A CPU cycle is currently ~0.3ns; light takes ~3.3ns to travel 1m, i.e. roughly ten CPU cycles.
150,000,000 = 150ms
Sheer distance is a limiting factor. We’re reaching or have reached in some areas the point where light speed is the limiting factor.
Congestion is bandwidth over use - packets get queued and ultimately dropped.
Packet loss will lead to retransmission at the TCP layer.
With UDP, the application will have to deal with it.
This increases latency due to timeout before retransmission.
Route out may also differ for each run.
Because different paths are taken, it can be hard to tell if the delay is out or return
Measuring point to point latency requires clocks synchronised to the required degree of accuracy (< a few ms).
Sometimes ICMP Echo requests can be dropped for security reasons.
Both ICMP Echo and Time Exceeded may be given low priority compared to data traffic, skewing the values.
Demo
sudo mtr -P 53 -u 8.8.8.8
sudo mtr -P 80 -T www.google.com
TCP is designed for (relatively) long running connections transferring a (relatively) large amount of data
Makes sure packets are received by the next network layer (app) in the order they were sent.
Deals with retransmission after an error.
Quite complex, quite a lot of tweakable values, though largely well tuned by default in modern OSs - worth visiting for high-utilisation workloads
Most common protocol currently.
Connection tables can be seen with netstat -a on Windows and UNIX like OSs, including states.
telnet is useful for testing plain text TCP session.
In establishing connections there are 1.5 round trips to set up the connection.
To the US east coast that’s ~40ms one way, so a ~120ms setup cost per connection. Connections may be asynchronous.
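That setup cost is just arithmetic: 1.5 round trips is three one-way trips before any payload flows. A quick check, assuming the 40ms one-way figure quoted above:

```python
# TCP three-way handshake: SYN, SYN-ACK, ACK = 1.5 round trips,
# i.e. three one-way trips before the first byte of payload.
ONE_WAY_MS = 40                  # assumed one-way delay from the note
ROUND_TRIP_MS = 2 * ONE_WAY_MS

setup_ms = 1.5 * ROUND_TRIP_MS
print(f"{setup_ms:.0f} ms of setup per new connection")
```

This is why connection reuse (keep-alive, pooling) matters so much on long-haul links: the cost is paid once per connection, not once per request.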
There may be more latency at certain hops - e.g. CPE to ISP might be 20ms or more (ADSL).
Fewer round trips = less latency, and more effective bandwidth after decompression.
Compression server side can lead to latency in the compression. Pick a fast compression algorithm, or pre-compress files. Nginx supports this.
Timestamps are generally best left on as they are used to measure round-trip times.
Browser caches have the latency of the user’s local machine, which may be RAM or disk. A machine with a good network connection and a slow 5400rpm disk might be slower to get cached items than from a server, especially a local one.
CDNs require a good understanding of the data set and careful management. Cloudflare is of particular note as it’s free to start. CDNs also provide other functions like application firewall, DDOS mitigation.
Geographically closer - given the same link types, latency will be less. The closer, the lower the latency.
Not all hosting locations are created equally. A site may have a 100Gbps in and out, but if it’s heavy contended it may be slower for your app than a small link. Measure it.
SPDY / HTTP/2 add Multiplexing, compression, prioritisation
Pipelining is part of HTTP/1.1 and should be enabled by default these days. It’s a method of sending a stream of requests without waiting for each reply.
TCP keepalive will keep TCP sessions open for longer, but must be balanced against server resource usage, especially under heavy loads.
TCP_NODELAY disables Nagle’s algorithm, which buffers up small payloads and sends them in a single packet - e.g. a 1-byte payload carries a 20-byte TCP header, plus lower-level encapsulation data, so buffering saves bandwidth at the cost of latency.
If good cache headers get set from the outset, any system between the user’s screen and the server will benefit.
Pagespeed module is useful as a quick fix, but be sure to test before long term use.
Nothing is better than really knowing the application and tuning accordingly.
AES-NI support in recent OpenSSL. Check CPU config in any VMs in hosting providers.
EC-DSA Needs modern clients and servers.
Load balancers should be tuned for encryption hand off.
A well configured SSL termination should pass HTTP headers through to indicate the request came over a secure connection: X-Forwarded-Proto (de facto), Front-End-Https (MS).
UDP is used for message based protocol - typically low volume where speed is important.
No handshake - the packet is just sent.
Stateless, so no overhead for connection tracking in the kernel. Netstat only shows UDP listeners.
The application layer needs to handle errors and missing/out-of-order packets.
Smaller header than TCP (8 bytes vs 20 bytes).
VPNs - use UDP not TCP, because an outer-tunnel drop causes retransmission of both the inner and outer streams. This can lead to failure caused by amplification.
DNS is the look up of IP addresses from names
Used liberally in systems because IPs are hard to remember and server IPs can change.
The initial load of a web site will be a DNS request for the site’s IP. This is synchronous.
UDP is used for most client -> server lookups. Zone transfers use TCP due to the volume of data.
DNS makes heavy use of caches. These are closer to the user and server to reduce load on authoritative servers as well as provide a lower latency response to users.
Popular sites will mostly hit the cache: one user’s request that hits the authoritative server results in the record being cached for others.
CNAME records are a useful shortcut to point names at names.
Two DNS lookups occur with a CNAME record - one to find the canonical name, the other to retrieve the IP. 2 round trips.
As an aside, never use CNAMEs on a zone apex - the root of the domain - that has other records. Those other records will not be properly addressable - especially mail will fail, possibly intermittently.
Once a client request results in the authoritative server being polled, the result gets cached for the TTL of the record. Once the TTL counts down to 0, a fresh request to the authoritative server is made.
For internal DNS where the authoritative server is local, a lower TTL may be appropriate.
Concurrency helps machines do meaningful work while waiting for other tasks to complete. DNS requests can be plentiful and, if synchronous, the delays can stack up even with relatively low latency. Certainly don’t do lookups just for logging unless they are asynchronous.
Local hosts files allow the use of names and provide very low latency resolution, at the cost of ease of management and number of records. Configuration management tools can help with the management, and populate hosts files across servers.
Anycast is a mechanism using routing tables to send users to the closest server by latency or geography, while still having the same global IP.
If the audience is global, distribute the authoritative servers.
This will make the latency of the initial connection to the server lower and improve the overall experience.
QUIC is still experimental, but an interesting look at the way protocols may be headed
SSL over UDP, with some similarities to SPDY. It multiplexes connections and aims to reduce the number of round trips.
It may become standardised.
Not widely supported yet.
Queuing shows that there isn’t sufficient resource to satisfy demand. A degree of queuing is normal and desirable - the resource always has work available to do - but queuing means latency. What a reasonable length queue is depends on the speed at which requests get processed and what the latency expectations are.
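A useful sanity check on observed queue depths is Little's law, which ties arrivals, latency and in-flight work together. A sketch with illustrative numbers:

```python
# Little's law: L = lambda * W. The average number of requests in the
# system equals the arrival rate times the average time each spends there.
arrival_rate = 200        # requests per second (illustrative)
avg_latency_s = 0.05      # 50 ms per request, queueing included

avg_in_system = arrival_rate * avg_latency_s
print(avg_in_system)      # average requests in flight
```

If monitoring shows far more in-flight requests than the law predicts from measured arrival rate and latency, one of the three numbers is being measured wrongly.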
Disk IO, especially random access, historically has been a bottleneck. IOps are number of input/output operations per second. 7200rpm SATA disk does 75-100 IOPS. 15K SAS 175-210 IOPS. SSD are 1000s or 10,000s or even millions in some pre-market models.
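The IOPS figures above imply a per-operation latency: at queue depth 1, a device doing N operations per second spends about 1000/N milliseconds on each. A rough sketch (midpoints of the ranges above, purely illustrative):

```python
def ms_per_op(iops):
    """Approximate per-operation latency at queue depth 1."""
    return 1000.0 / iops

for device, iops in [("7200rpm SATA", 90), ("15K SAS", 190), ("SSD", 50_000)]:
    print(f"{device}: ~{ms_per_op(iops):.2f} ms/op")
```

This is why random-access workloads feel the jump to SSDs so dramatically: the per-operation wait drops by two to three orders of magnitude.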
sar, iostat, vmstat, top, mpstat tools
Multi-threaded processes use many cores and lots of CPU, but may take locks on some resources, which can cause bottlenecks.
AWS CloudWatch is very good. Once a service is under load, get used to the figures and set alerts on deviations. I’ve solved most AWS performance issues just with CloudWatch at 1-minute intervals.
Load testing needs to gather enough information over time. If the cycle period for monitoring is 5 minutes, at least 25 minutes of load should be applied. Ideally use smaller cycles and longer periods to see trends.
Protocol level stats for a load balancer might be the number of 400s or 500s from a web app, and the latency of requests from the point of view of the load balancer.
Monitor latency from the point of view of the clients. If the client base is global, monitor the end points they will hit globally.
Don’t trust what cloud providers say - measure it and prove it meets requirements for the given work load. As with any performance figures, they are often under ideal conditions and may not reflect results under complex conditions.