
On the way to low latency (2nd edition)


This is the second edition of the story about how we struggled to meet strict latency requirements in a service implemented in Java, and how we managed to do it.
The most common latency contributors are in-process locking, thread scheduling, I/O, algorithmic inefficiencies and, of course, the garbage collector.
I will share our experience of dealing with these causes and explain what you can do to prevent them from affecting production.


1. On the way to low latency. Artem Orobets, Smartling Inc
2. Long story short: we realized that latency is important for us; our fabulous architecture was supposed to work, but it didn't; these are the issues we faced on the way.
3. Those guys consider 10µs latencies slow. We have only a 100ms threshold; we are not a trading company.
4. What is low latency?
5. Latency is the time interval between the stimulation and the response.
6. What is latency? total response time = service time + time waiting for service
7. Why is it important? • SLA • Negative correlation with income
8. Latencies of about 50ms are barely noticeable to a human.
9. You mostly care about throughput.
10. How to measure it?
11. Duration of a single test run.
12. Average of test run durations.
13. Quantiles of test run durations (usually the 95th and 99th percentiles).
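The slides do not show how the percentiles are collected. As one possible illustration, here is a minimal Java sketch that records per-request latencies and reports percentiles instead of an average; it assumes the HdrHistogram library (not mentioned in the talk), and doRequest() is a placeholder for the code under test.

```java
import org.HdrHistogram.Histogram;
import java.util.concurrent.TimeUnit;

// Minimal sketch: report latency percentiles rather than an average.
// HdrHistogram is an assumed dependency, not something from the slides.
public class LatencyReport {
    // Track values from 1 microsecond up to 1 minute with 3 significant digits.
    private static final Histogram HISTOGRAM =
            new Histogram(TimeUnit.MINUTES.toMicros(1), 3);

    public static void main(String[] args) {
        for (int i = 0; i < 100_000; i++) {
            long start = System.nanoTime();
            doRequest();                                  // hypothetical operation being measured
            HISTOGRAM.recordValue(
                    TimeUnit.NANOSECONDS.toMicros(System.nanoTime() - start));
        }
        System.out.printf("95th <= %d us%n", HISTOGRAM.getValueAtPercentile(95.0));
        System.out.printf("99th <= %d us%n", HISTOGRAM.getValueAtPercentile(99.0));
        System.out.printf("max   = %d us%n", HISTOGRAM.getMaxValue());
    }

    private static void doRequest() {
        // placeholder for the code under test
    }
}
```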
14. Latency is more difficult to: • test • analyze • control
15. Design
16. Storage
17. * where latency is the 99th percentile
18. Context switch problem: in production we have about 4k connections open simultaneously.
19. Context switch problem: • thread-per-request doesn't work • too much overhead from context switching • too much memory overhead. A thread usually takes from 256KB to 1MB of memory for its stack space!
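The talk does not spell out the alternative architecture at this point, but the usual answer to thread-per-request at ~4k connections is non-blocking I/O multiplexed over a few threads. A minimal single-threaded sketch with plain Java NIO, as an illustration rather than the design used at Smartling:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.*;
import java.util.Iterator;

// Minimal sketch: one selector thread multiplexes many connections, so memory
// and context-switch cost no longer grow with the number of open connections.
public class NioEchoServer {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(8080));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        ByteBuffer buffer = ByteBuffer.allocateDirect(4096);
        while (true) {
            selector.select();                            // wait for ready channels
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    if (client != null) {
                        client.configureBlocking(false);
                        client.register(selector, SelectionKey.OP_READ);
                    }
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    buffer.clear();
                    if (client.read(buffer) == -1) {      // peer closed the connection
                        client.close();
                    } else {
                        buffer.flip();
                        client.write(buffer);             // echo the bytes back
                    }
                }
            }
        }
    }
}
```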
20. Troubleshooting framework: 1. Discovery. 2. Problem reproduction. 3. Isolate the variables that relate directly to the problem. 4. Analyze your findings to determine the cause of the problem.
21. We had fixed a lot of things that we believed were the most problematic parts. But they weren't.
22. Find evidence that proves your assumption.
23. A good tool can give you a clue: • proper logging and a log analysis tool • performance tests • monitoring
24. Performance benchmark. Throughput: 750 rps. Latency percentiles: 98.47% <= 2 ms, 99.95% <= 10 ms, 99.98% <= 16 ms, 99.99% <= 17 ms, 100.00% <= 18 ms.
25. A good tool can give you a clue.
26. KPIs are a necessity.
27. A problem that we faced
28. Some requests take almost a second, and it seems it always happens after a deploy.
29. [Java] is so lazy
30. Smoke tests: • a good practice when you have continuous delivery • they ensure all your code is initialized by the time real load comes in.
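A minimal sketch of such a post-deploy smoke/warm-up run, assuming Java 11+ and hypothetical endpoint URLs (neither is from the slides). The repeated calls exist to pull class loading, JIT compilation and connection-pool setup onto the deploy path instead of the first user request:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

// Minimal sketch: hit the critical endpoints a few times right after deploy
// so lazy initialization happens before real traffic arrives.
public class SmokeTest {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        List<String> endpoints = List.of(                 // hypothetical endpoints
                "http://localhost:8080/health",
                "http://localhost:8080/api/translate?text=warmup");

        for (int i = 0; i < 50; i++) {                    // repeat to trigger JIT compilation
            for (String url : endpoints) {
                HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                if (response.statusCode() >= 500) {
                    throw new IllegalStateException("Smoke test failed for " + url);
                }
            }
        }
    }
}
```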
31. Logging: synchronous logging is not appropriate for an asynchronous application.
32. log4j2: Asynchronous Loggers for Low-Latency Logging http://logging.apache.org/log4j/2.x/manual/async.html
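One way to turn this on, sketched under the assumption that Log4j2 and the LMAX Disruptor are on the classpath: select the all-async logger context, usually via the JVM flag -DLog4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector, or programmatically before Log4j2 initializes, as below.

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

// Minimal sketch: make every Logger asynchronous. The property must be set
// before Log4j2 initializes; the command-line flag is the more common route.
public class AsyncLoggingBootstrap {
    static {
        System.setProperty("Log4jContextSelector",
                "org.apache.logging.log4j.core.async.AsyncLoggerContextSelector");
    }

    private static final Logger LOG = LogManager.getLogger(AsyncLoggingBootstrap.class);

    public static void main(String[] args) {
        LOG.info("logging call returns without waiting for disk I/O");
    }
}
```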
33. Logging, sync vs. async. Sync: 769.05 rps; 98.47% <= 2 ms, 99.95% <= 10 ms, 99.98% <= 16 ms, 99.99% <= 17 ms, 100.00% <= 18 ms. Async: 1658 rps; 98.85% <= 1 ms, 99.95% <= 7 ms, 99.98% <= 13 ms, 99.99% <= 15 ms, 100.00% <= 18 ms.
34. Pauses of 50-150ms in the network, according to the logs.
35. They disappear when I scroll through the logs via SSH.
36. Any ideas?
37. TCP_NODELAY
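Slide 38 explains the underlying cause (Nagle's algorithm); the Java-side fix itself is a one-line socket option. A minimal client sketch, with example.com and the raw HTTP request purely illustrative, and most server frameworks expose the same TCP_NODELAY option on accepted connections:

```java
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Minimal sketch: disable Nagle's algorithm so small writes are sent
// immediately instead of being held back to be coalesced into larger packets.
public class NoDelayClient {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("example.com", 80)) {
            socket.setTcpNoDelay(true);   // TCP_NODELAY: do not buffer small packets
            OutputStream out = socket.getOutputStream();
            out.write("HEAD / HTTP/1.0\r\nHost: example.com\r\n\r\n"
                    .getBytes(StandardCharsets.US_ASCII));
            out.flush();
            System.out.println("first response byte: " + socket.getInputStream().read());
        }
    }
}
```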
38. Nagle's algorithm: • the "small packet problem" • TCP packets have a 40-byte header (20 bytes for TCP, 20 bytes for IPv4) • it combines a number of small outgoing messages and sends them all at once.
39. • Pauses of ~100 ms every couple of hours • during connection creation • doesn't reproduce on a local setup.
40. How to diagnose that?
41. tcpdump -i eth0
42. TCPDUMP 15:47:57.250119 IP (tos 0x0, ttl 64, id 44402, offset 0, flags [DF], proto TCP (6), length 569) 192.168.3.131.58749 > 93.184.216.34.80: Flags [P.], cksum 0x76b5 (correct), seq 3847355529:3847356046, ack 3021125542, win 4096, options [nop,nop,TS val 848825338 ecr 1053000005], length 517: HTTP, length: 517 GET / HTTP/1.1 Host: example.com Connection: keep-alive …
43. TCPDUMP 15:58:32.009884 IP (tos 0x0, ttl 255, id 39809, offset 0, flags [none], proto UDP (17), length 63) 192.168.3.131.56546 > 192.168.3.1.53: [udp sum ok] 52969+ A? www.google.com.ua. … 15:58:32.012844 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 127) 192.168.3.1.53 > 192.168.3.131.56546: [udp sum ok] 52969 q: A? www.google.com.ua. …
44. DNS lookups: • after hours of looking through tcpdump output • we found that DNS lookups sometimes take more than 100ms.
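The slide does not say how this was fixed. One common mitigation on the JVM, shown here as an assumption rather than the talk's solution, is to lengthen the positive DNS cache TTL so resolver round-trips stay off the request path:

```java
import java.net.InetAddress;
import java.security.Security;

// Minimal sketch: cache successful lookups for five minutes. The property must
// be set before the first lookup; it can also live in the JDK's java.security file.
public class DnsCacheConfig {
    public static void main(String[] args) throws Exception {
        Security.setProperty("networkaddress.cache.ttl", "300");

        long start = System.nanoTime();
        InetAddress first = InetAddress.getByName("example.com");   // resolver round-trip
        long firstMicros = (System.nanoTime() - start) / 1_000;

        start = System.nanoTime();
        InetAddress second = InetAddress.getByName("example.com");  // served from the cache
        long cachedMicros = (System.nanoTime() - start) / 1_000;

        System.out.printf("first: %d us, cached: %d us (%s)%n",
                firstMicros, cachedMicros, second.getHostAddress());
    }
}
```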
45. How much time can GC take?
46. GC logging: • -Xloggc:path_to_log_file • -XX:+PrintGCDetails • -XX:+PrintGCDateStamps • -XX:+PrintHeapAtGC • -XX:+PrintTenuringDistribution
47. -XX:+PrintGCDetails [GC (Allocation Failure) 260526.491: [ParNew … [Times: user=0.02 sys=0.00, real=0.01 secs]
48. -XX:+PrintHeapAtGC Heap after GC invocations=43363 (full 3): par new generation total 59008K, used 1335K; eden space 52480K, 0% used; from space 6528K, 20% used; to space 6528K, 0% used; concurrent mark-sweep generation total 2031616K, used 1830227K
49. -XX:+PrintTenuringDistribution Desired survivor size 3342336 bytes, new threshold 2 (max 2) - age 1: 878568 bytes, 878568 total - age 2: 1616 bytes, 880184 total : 53829K->1380K(59008K), 0.0083140 secs] 1884058K->1831609K(2090624K), 0.0084006 secs]
50. A large number of wrapper objects: significant allocation pressure.
51. ~100ms GC pauses in the logs.
52. -XX:+UseConcMarkSweepGC
53. Note: the CMS collector on the young generation uses the same algorithm as that of the parallel collector. Java GC documentation at oracle.com * http://www.oracle.com/webfolder/technetwork/tutorials/obe/java/gc01/index.html
54. Too many live objects during young-gen GC: • minimize survivors • watch the tenuring threshold; you might need to tune it to tenure long-lived objects faster • reduce NewSize • reduce survivor spaces.
55. Watch your GC (* time span is 2h)
56. Watch your GC
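The slides show monitoring charts here. As one way to get the same signal into an existing monitoring setup, a minimal sketch that polls the standard GC MXBeans and reports collection counts and accumulated GC time; how the numbers are exported is left to your metrics stack:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Minimal sketch: poll the GC MXBeans so collection counts and accumulated
// GC time land on the same dashboards as the latency KPIs.
public class GcWatcher implements Runnable {
    private final Map<String, Long> lastCount = new HashMap<>();
    private final Map<String, Long> lastTime = new HashMap<>();

    @Override
    public void run() {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            long time = gc.getCollectionTime();       // cumulative, in milliseconds
            long dCount = count - lastCount.getOrDefault(gc.getName(), 0L);
            long dTime = time - lastTime.getOrDefault(gc.getName(), 0L);
            lastCount.put(gc.getName(), count);
            lastTime.put(gc.getName(), time);
            System.out.printf("%s: %d collections, %d ms in GC since last check%n",
                    gc.getName(), dCount, dTime);
        }
    }

    public static void main(String[] args) {
        Executors.newSingleThreadScheduledExecutor()
                .scheduleAtFixedRate(new GcWatcher(), 10, 10, TimeUnit.SECONDS);
    }
}
```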
57. You should • have a deeper understanding of the JVM, OS, hardware … • be brave
58. aorobets@smartling.com
59. http://tech.smartling.com/ aorobets@smartling.com
