Techniques and Tools for A Coherent Discussion About Performance in Complex Systems
A Coherent Discussion About Performance

  1. Techniques and Tools for A Coherent Discussion About Performance in Complex Systems
  2. Performance Must Matter. First it must be made relevant. Then it must be made important.
  3. If you don’t care about Performance, you are in the wrong talk. @postwait should throw you out.
  4. Perhaps some justification is warranted. Performance… makes a better user experience; increases loyalty; reduces product abandonment; increases speed of product development; lowers total cost of ownership; builds more cohesive teams.
  5. Consistent Terminology. Inconsistent terminology is the best way to argue about agreeing.
  6. It’s all about latency… Throughput vs. Latency: lower latency often affords increased throughput. Latency is the focus. (Image credit: https://www.flickr.com/photos/poeloq/3140100971)
  7. Time. Generally, time should be measured in seconds. UX latency should be in milliseconds. Users can’t observe microseconds. Users quit over seconds. User experience is measured in milliseconds (with at least microsecond precision).
  8. Connectedness. Music is all about the space between the notes. Performance is about how quickly you can complete some work. In a connected service architecture, performance is also about the time spent between the service layers.
  9. Developing a Performance Culture. It is easy to develop a rather unhealthy performance culture.
  10. Focus on Small Individual Wins. (Image credit: https://www.flickr.com/photos/skynoir/8783914886)
  11. Report on and celebrate Large Collective Wins. (Image credit: https://www.flickr.com/photos/tomer_a/1130647512)
  12. Transcendent Tooling. Tooling must transcend the team and keep the conversation consistent. (Image credit: https://www.flickr.com/photos/meanestindian/2260343214)
  13. Dapper: Large-Scale Distributed Systems Tracing Infrastructure. Google published a paper (research.google.com/pubs/pub36356.html). As usual, code never saw the outside.
  14. Dapper: Large-Scale Distributed Systems Tracing Infrastructure. Google published a paper (research.google.com/pubs/pub36356.html). As usual, code never saw the outside. Diagram labels: web, api, data agg, mq, db, data store, cep, alerting.
  15. The Basics. ❖ Focused on User Interactions (not req.) ❖ Each new request is assigned a “Trace ID” ❖ The service records start/stop/etc. against a “Span ID” (first Span ID == Trace ID) ❖ In the context of a “Span ID”, each remote call gets a new Span ID, with the Parent Span ID set to the context.
  16. Example. Web Request: /do/magic (no X-B3-TraceId header). The receiving service: creates TraceId T1, SpanId T1; notes “sr” (server receive); needs to talk to service MS; creates new SpanId T2; notes “cs” (client send); sends the request to MS; notes “cr” (client receive); notes “ss” (server send); sends the response; async publishes span(s). The request to MS: GET /pixie/dust with headers X-B3-TraceId: T1, X-B3-ParentSpanId: T1, X-B3-SpanId: T2. Service MS: extracts the headers; notes “sr” (server receive); performs actions; notes “ss” (server send); responds; async publishes span(s). Publish destination: Scribe.
  17. Visualization. [Diagram: service1 and service2 timelines with the annotations sr, ss, cs, cr and the labels “cs?” and “cr?”.]
  18. Siloed Teams. [Same diagram, with team labels: Net Ops, AppTeam1, AppTeam2/DBA.]
  19. Better Responsibilities. [Same diagram, with team labels: Net Ops, AppTeam1, AppTeam2/DBA.]
  20. Zipkin: A pseudo-Dapper. Twitter sought to (re)implement Dapper. Disappointingly few improvements. Some unfortunate UX issues. Sound. Simple. Valuable.
  21. Thrift and Scribe should both die. Scribe is Terrible. Terrible. Terrible Terrible. Thrift is terrible. Scribe is “strings” in Thrift. Performance-focused people don’t use strings.
  22. Screw Scribe. The whole point is to be low overhead. We push raw thrift over Fq (github.com/circonus-labs/fq). Completely async publishing, lock free if using the C library. Consolidating Zipkin’s bad decisions: github.com/circonus-labs/fq2scribe
  23. Telling computers what to do. Zipkin is Java/Scala. Wrote C support (github.com/circonus-labs/libmtev) and Perl support (github.com/circonus-labs/circonus-tracer-perl).
  24. Real world
  25. A sample trace: data from S1
  26. A sample trace: data from S2
  27. Celebration, Day 1. Noticed unexpected topology queries. Found a data location caching issue. Shaved 350ms off every graph request.
  28. Celebration, Day 4-7. Noticed frequent 150ms stalls in internal REST. Often: 90%+. Found a libcurl issue (async resolver). Shaved 150ms*(n*0.9) off ~50% of page loads.
  29. Go To Work. You can do all of this at work. And have a deeply technical cross-team conversation about performance.
  30. Future. IPv6 piggy-backing; audit or ld/preload libs; nanosecond granularity; cap-n-proto + UDP reporting. (Image credit: https://www.flickr.com/photos/robin1966/16188457397)
  31. Thanks!
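
Slides 15 and 16 describe how trace and span IDs travel between services. The Python sketch below walks through that flow under stated assumptions: IDs are random 64-bit values rendered as hex, and headers are plain dicts with exact-case keys. The `Span` class and the `new_id`, `server_span`, and `client_span` helpers are hypothetical names chosen for illustration; they are not the API of Zipkin, Dapper, or libmtev. Only the X-B3-* header names and the sr/cs/cr/ss annotations come from the slides.

```python
import os
import time


def new_id():
    # Assumption: a 64-bit random identifier rendered as 16 hex characters.
    return os.urandom(8).hex()


class Span:
    """Hypothetical span record: IDs plus timestamped annotations."""

    def __init__(self, trace_id, span_id, parent_id=None):
        self.trace_id = trace_id
        self.span_id = span_id
        self.parent_id = parent_id
        self.annotations = []  # list of (label, microseconds since epoch)

    def note(self, label):
        self.annotations.append((label, time.time_ns() // 1000))


def server_span(headers):
    """Start the span for an incoming request and note "sr"."""
    trace_id = headers.get("X-B3-TraceId")
    if trace_id is None:
        # No trace header: a new trace starts, and first Span ID == Trace ID.
        trace_id = new_id()
        span = Span(trace_id, trace_id)
    else:
        # Continue the caller's trace using the IDs it sent.
        span = Span(trace_id, headers["X-B3-SpanId"],
                    headers.get("X-B3-ParentSpanId"))
    span.note("sr")
    return span


def client_span(context):
    """Create a child span for a remote call, note "cs", build headers."""
    child = Span(context.trace_id, new_id(), parent_id=context.span_id)
    child.note("cs")
    headers = {
        "X-B3-TraceId": child.trace_id,
        "X-B3-ParentSpanId": child.parent_id,
        "X-B3-SpanId": child.span_id,
    }
    return child, headers


# The first service handles /do/magic, which arrives without trace headers.
s1 = server_span({})
call, out_headers = client_span(s1)   # about to call GET /pixie/dust

# Service MS extracts the headers, does its work, and responds.
s2 = server_span(out_headers)
s2.note("ss")

# Back in the first service: response received, then reply to the user.
call.note("cr")
s1.note("ss")
```

In this sketch the callee records its "sr" and "ss" against the same Span ID that the caller used for "cs" and "cr", which is what lets a collector line up both sides of one remote call. Publishing the spans is left out.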

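
Slide 28 states the saving as 150ms*(n*0.9). The slide does not define n; the sketch below assumes n is the number of internal REST calls made in sequence while building one page, with roughly 90% of calls hitting the 150 ms stall. The function name and parameters are hypothetical.

```python
def expected_saving_ms(n_calls, stall_ms=150.0, stall_rate=0.9):
    """Expected time removed from one page load: stall_ms * (n * stall_rate).

    Assumes the calls run one after another, so their stalls add up.
    """
    return stall_ms * (n_calls * stall_rate)


for n in (1, 3, 10):
    print(f"n={n}: about {expected_saving_ms(n):.0f} ms saved per affected page load")
```

Under these assumptions a page that makes three internal calls loses about 405 ms to the stall. If the calls ran in parallel the saving would be smaller than this sum.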