8. Hi I’m Fred
● @phredmoyer
● Monitoring Nerd
● Writing code 20 years
● And breaking prod
● Likes Go, Perl, C, Pg
● Likes SLOs
● Doesn’t like errors
@phredmoyer
9. Talk Agenda
● SLOs and Error Budgets
● Calculating Error Budgets with Logs
● Calculating Error Budgets with Metrics
10. What is an Error Budget?
Zero Errors!
Happy Users!
11. What is an Error Budget?
Too much risk = Too many errors
Too many errors = Unhappy users
Too little risk = No code shipped
No code shipped = Unhappy users
14. What is an Error Budget?
Too much risk = Unhappy users
Just enough risk = Happy users
Too little risk = Unhappy users
15. What is an Error Budget?
Error budget = Acceptable risk
Acceptable risk = 100% - SLO
Error budget = 100% - SLO
17. SLOs, How Do They Work?
SLIs, SLOs, SLAs, oh my!
https://www.youtube.com/watch?v=tEylFyxbDLE
@lizthegrey ⇔ @sethvargo
SLI: 95th %ile requests over 5 min < 300ms
SLO: 95th %ile SLI for 1 month succeeds 99.9%
SLA: 95th %ile SLI for 1 month succeeds 99.5%
or you have to refund money
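The SLI and SLO definitions above can be sketched in code. This is a minimal illustration, not the speaker's implementation: the nearest-rank percentile method, the helper names, and the sample latencies are all assumptions made for the example.

```python
import math

def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list (assumed method)."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest-rank index
    return ordered[rank - 1]

def sli_ok(window_latencies_ms, threshold_ms=300):
    """SLI for one 5-minute window: is the 95th percentile under the threshold?"""
    return p95(window_latencies_ms) < threshold_ms

def slo_attainment(windows, threshold_ms=300):
    """Fraction of windows in the period whose SLI succeeded."""
    good = sum(1 for w in windows if sli_ok(w, threshold_ms))
    return good / len(windows)

# Three hypothetical 5-minute windows of request latencies (ms).
windows = [
    [120, 150, 180, 200, 250],
    [100, 110, 130, 140, 900],
    [90, 95, 100, 105, 110],
]
attainment = slo_attainment(windows)  # 2 of 3 windows pass
meets_slo = attainment >= 0.999
```

Over a real month there would be roughly 8,640 five-minute windows; the SLO asks that 99.9% of them pass the SLI.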
18. What is an Error Budget?
SLI: 95th %ile req over 5 min < 300ms
SLO: 95th %ile SLI for 1 month succeeds 99.9%
1M reqs in one month
Error Budget = (1-0.999)*1M = 1k requests
1k requests can exceed 300ms
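The arithmetic on this slide, as a small sketch. The function names and the figure of 250 slow requests are hypothetical and used only for illustration.

```python
def error_budget_requests(slo, total_requests):
    """Requests allowed to miss the SLI: (1 - SLO) * total requests."""
    return round((1 - slo) * total_requests)

def budget_remaining(slo, total_requests, bad_requests):
    """Budget left after some requests have already exceeded the SLI."""
    return error_budget_requests(slo, total_requests) - bad_requests

budget = error_budget_requests(0.999, 1_000_000)    # 1000 requests
remaining = budget_remaining(0.999, 1_000_000, 250)  # 750 requests
```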
19. What is an Error Budget?
Chapter 3
Embracing Risk
20. Talk Agenda
● SLOs and Error Budgets
● Calculating Error Budgets with Logs
● Calculating Error Budgets with Metrics
28. Calculating Error Budgets with Metrics
Use a counter metric (uint32/uint64)
Error Budget = 1k requests/day > 300ms
For each app error, error_budget++
If req duration > SLI (300ms), error_budget++
Alert if error_budget/total_reqs > 80% * (1-SLO)
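A rough sketch of the counter approach on this slide. The class and its parameter names are hypothetical; it also assumes a request that is both an app error and slow spends the budget once, which the slide does not specify.

```python
class ErrorBudgetCounter:
    """Counts requests that spend error budget (illustrative helper)."""

    def __init__(self, slo, sli_threshold_ms=300, alert_fraction=0.8):
        self.slo = slo
        self.sli_threshold_ms = sli_threshold_ms
        self.alert_fraction = alert_fraction
        self.total_reqs = 0
        self.error_budget = 0  # budget spent so far, in requests

    def record(self, duration_ms, app_error=False):
        """Count one request; spend budget if it errored or was too slow."""
        self.total_reqs += 1
        if app_error or duration_ms > self.sli_threshold_ms:
            self.error_budget += 1

    def should_alert(self):
        """True once spent/total exceeds alert_fraction * (1 - SLO)."""
        if self.total_reqs == 0:
            return False
        spent_ratio = self.error_budget / self.total_reqs
        return spent_ratio > self.alert_fraction * (1 - self.slo)


# 10,000 requests, 9 of them slower than 300 ms, against a 99.9% SLO.
counter = ErrorBudgetCounter(slo=0.999)
for i in range(10_000):
    counter.record(duration_ms=450 if i < 9 else 120)
alert = counter.should_alert()  # 0.0009 > 0.8 * 0.001
```

In production the two counters would usually be exported as monotonic metrics and the ratio computed by the monitoring system rather than in-process.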
29. Calculating Error Budgets with Metrics (and Logs)
Problems:
● SLI fixed threshold
● Inability to introspect historical data
● Difficult to compare different SLI behavior
30. Calculating Error Budgets with Metrics - Histograms
Use a histogram
Image source: http://www.brendangregg.com/FrequencyTrails/modes.html
31. Calculating Error Budgets with Metrics - Histograms
Linear, Cumulative, Log-Linear, Approximate…
High dynamic range, log-linear recommended
http://hdrhistogram.org/
https://github.com/circonus-labs/circonusllhist
32. Calculating Error Budgets with Metrics - Histograms
Error Budget = 1k requests/day > Xms
For each histogram bin >= X:
error_budget += bin_sample_count
Alert if error_budget/total_reqs > 80% * (1-SLO)
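The histogram version of the same calculation, sketched with a plain dict standing in for a real histogram library. The bin layout (lower bound in ms mapped to sample count), the function name, and the sample counts are assumptions for the example; the threshold X is assumed to sit on a bin boundary.

```python
def budget_spent_from_histogram(bins, threshold_ms):
    """Sum the sample counts of every bin whose lower bound is >= threshold."""
    return sum(count for lower, count in bins.items() if lower >= threshold_ms)

# Hypothetical latency histogram: bin lower bound (ms) -> sample count.
bins = {0: 500, 100: 300, 200: 150, 300: 40, 400: 8, 500: 2}
total_reqs = sum(bins.values())

spent_300 = budget_spent_from_histogram(bins, 300)  # 40 + 8 + 2
spent_400 = budget_spent_from_histogram(bins, 400)  # 8 + 2

slo = 0.99
alert = spent_300 / total_reqs > 0.8 * (1 - slo)
```

Because the histogram keeps the whole distribution, the threshold can be changed after the fact (300 ms versus 400 ms above) without re-collecting data, which a fixed-threshold counter cannot do.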
33. Calculating Error Budgets with Metrics - Histograms
Choose bin boundary for SLI (preferred) or
interpolate within boundaries
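When the SLI threshold falls inside a bin rather than on a boundary, one option is linear interpolation. The sketch below assumes samples are spread uniformly within each bin, which is an approximation and not something the talk specifies; the function name and bin tuples are hypothetical.

```python
def count_over_threshold(bins, threshold_ms):
    """Estimate samples above threshold from (lower_ms, upper_ms, count) bins.

    Bins entirely above the threshold count in full; a bin that straddles
    the threshold contributes the fraction of its width above the threshold.
    """
    over = 0.0
    for lower, upper, count in bins:
        if lower >= threshold_ms:
            over += count
        elif upper > threshold_ms:
            over += count * (upper - threshold_ms) / (upper - lower)
    return over

# Hypothetical bins: (lower bound ms, upper bound ms, sample count).
bins = [(0, 100, 500), (100, 200, 300), (200, 400, 160), (400, 800, 40)]

exact = count_over_threshold(bins, 400)      # threshold on a boundary: 40.0
estimate = count_over_threshold(bins, 300)   # half of the 200-400 bin: 120.0
```

Choosing a bin boundary as the threshold avoids the uniformity assumption, which is why the slide lists it as the preferred option.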
40. Appendix - SLOs, How Do They Work?
● Chapter 4
○ Service Level Objectives
● 99% Get RPC calls < 100ms
● https://landing.google.com/sre/sre-book/toc/index.html
41. Appendix - SLOs, How Do They Work?
● Ch 2: Implementing SLOs
● Ch 3: SLO Eng case studies
● Ch 5: Alerting on SLOs
● https://landing.google.com/sre/workbook/toc
42. Appendix - SLOs, How Do They Work?
● Chapter 21
○ The Art and Science of the Service Level Objective