Practical Service Level
Objectives With Error
Budgeting
Fred Moyer @phredmoyer
BayLISA May 16, 2019
Are Errors important?
Is Latency Important?
How many errors in your app last week?
How many requests over 500ms last week?
Your error/request ratio last week?
Are slow requests errors?
Hi I’m Fred
● @phredmoyer
● Monitoring Nerd
● Writing code 20 years
● And breaking prod
● Likes Go, Perl, C, Pg
● Likes SLOs
● Doesn’t like errors
Talk Agenda
● SLOs and Error Budgets
● Calculating Error Budgets with Logs
● Calculating Error Budgets with Metrics
What is an Error Budget?
Zero Errors!
Happy Users!
What is an Error Budget?
Too much risk = Too many errors
Too many errors = Unhappy users
Too little risk = No code shipped
No code shipped = Unhappy users
What is an Error Budget?
Too much risk = Unhappy users
Just enough risk = Happy users
Too little risk = Unhappy users
What is an Error Budget?
Error budget = Acceptable risk
Acceptable risk = 100%-SLO
Error budget = 100%-SLO
SLOs, How Do They Work?
SLIs, SLOs, SLAs, oh my!
https://www.youtube.com/watch?v=tEylFyxbDLE
@lizthegrey ⇔ @sethvargo
SLI: 95th %ile requests over 5 min < 300ms
SLO: 95th %ile SLI for 1 month succeeds 99.9%
SLA: 95th %ile SLI for 1 month succeeds 99.5%
or you have to refund money
What is an Error Budget?
SLI: 95th %ile req over 5 min < 300ms
SLO: 95th %ile SLI for 1 month succeeds 99.9%
1M reqs in one month
Error Budget = (1-0.999)*1M = 1k requests
1k requests can exceed 300ms
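The arithmetic on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk; the function name `error_budget` and its parameters are assumptions made for the example.

```python
def error_budget(slo, total_requests):
    """Number of requests allowed to miss the SLI under the given SLO.

    slo is a fraction (0.999 for 99.9%); total_requests is the request
    count for the SLO window (one month on the slide).
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    # round() absorbs floating-point noise in (1 - slo)
    return round((1.0 - slo) * total_requests)


budget = error_budget(slo=0.999, total_requests=1_000_000)
print(budget)  # 1000
```

With a 99.9% SLO and 1M requests in the window, up to 1k requests may exceed 300ms before the objective is missed.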
What is an Error Budget?
Site Reliability Engineering, Chapter 3: Embracing Risk
Talk Agenda
● SLOs and Error Budgets
● Calculating Error Budgets with Logs
● Calculating Error Budgets with Metrics
Calculating Error Budgets with Logs
Latency
Calculating Error Budgets with Logs - Latency
Error Budget = 100%-SLO = (1-0.999)*1M = 1k
Error Budget = 1k requests/day > 300ms
LogFormat "%h %l %u %O \"%{User-Agent}i\" %D"
%D - Request duration in microseconds
For each request:
If duration > SLI threshold (300ms = 300,000µs), error_budget++
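The per-request loop above could look roughly like this in Python. It is a sketch under stated assumptions: the duration is the last whitespace-separated field of each access log line (as in the format string on this slide), and it is reported in microseconds, which is how Apache's mod_log_config documents `%D`. The function name, constant, and sample lines are made up for illustration.

```python
SLI_THRESHOLD_US = 300 * 1000  # 300 ms expressed in microseconds


def count_slow_requests(lines, threshold_us=SLI_THRESHOLD_US):
    """Return (total_requests, requests_over_threshold) for access log lines."""
    total = 0
    slow = 0
    for line in lines:
        fields = line.split()
        # Skip lines that do not end in an integer duration field
        if not fields or not fields[-1].isdigit():
            continue
        total += 1
        if int(fields[-1]) > threshold_us:
            slow += 1  # this request spends one unit of error budget
    return total, slow


# Hypothetical log lines in the slide's format: host, ident, user, bytes, agent, duration
sample = [
    '127.0.0.1 - - 5120 "curl/7.64.0" 120000',
    '127.0.0.1 - - 5120 "curl/7.64.0" 450000',
    '127.0.0.1 - - 812 "Mozilla/5.0 (X11; Linux x86_64)" 300000',
]
total, slow = count_slow_requests(sample)
print(total, slow)  # 3 1
```

A request that takes exactly 300ms is not counted, because the rule on the slide is strictly "greater than".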
Calculating Error Budgets with Logs - Errors
Errors
Calculating Error Budgets with Logs - Errors
Error Budget = 1k requests/day > 300ms
[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1]
client denied by server: /export/home/live/ap/htdocs/test
For each error log entry, error_budget++
If req duration > SLI (300ms), error_budget++
Alert if error_budget/total_reqs > 80% * (1-SLO)
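The alert rule can be sketched as a small Python check. The names `budget_consumed` and `should_alert` and the default values are assumptions for this example; the counts would come from whatever log pipeline does the cumulative sums.

```python
def budget_consumed(error_entries, slow_requests, total_requests):
    """Fraction of requests that spent error budget (errors plus slow requests)."""
    if total_requests <= 0:
        return 0.0
    return (error_entries + slow_requests) / total_requests


def should_alert(error_entries, slow_requests, total_requests,
                 slo=0.999, alert_fraction=0.8):
    """True once more than alert_fraction of the budget (1 - slo) is used."""
    allowed = alert_fraction * (1.0 - slo)
    return budget_consumed(error_entries, slow_requests, total_requests) > allowed


# 700 bad requests out of 1M is 70% of a 99.9% budget: no alert yet
print(should_alert(300, 400, 1_000_000))  # False
# 900 bad requests out of 1M is 90% of the budget: alert
print(should_alert(500, 400, 1_000_000))  # True
```

Alerting at 80% rather than 100% leaves room to react before the budget is fully spent.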
Calculating Error Budgets with Logs
Cumulative sum functionality required
● Splunk
● ELK
● Mtail
○ https://github.com/google/mtail
● Honeycomb.io
● Circonus Logwatch
○ https://github.com/circonus-labs/circonus-logwatch
Talk Agenda
● SLOs and Error Budgets
● Calculating Error Budgets with Logs
● Calculating Error Budgets with Metrics
Calculating Error Budgets with Metrics
Errors
Calculating Error Budgets with Metrics
Use a counter metric (uint32/uint64)
Error Budget = 1k requests/day > 300ms
For each app error, error_budget++
If req duration > SLI (300ms), error_budget++
Alert if error_budget/total_reqs > 80% * (1-SLO)
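One way to keep these counters inside an application is sketched below in Python. The class `ErrorBudgetCounters` and its method names are hypothetical, and a real service would more likely export the counters through its metrics library. Note one choice made here: a request that is both an error and slow is counted once, whereas the two rules on the slide would count it twice.

```python
import threading


class ErrorBudgetCounters:
    """Two monotonically increasing counters guarded by a lock."""

    def __init__(self, sli_ms=300.0):
        self._lock = threading.Lock()
        self.sli_ms = sli_ms
        self.total_reqs = 0
        self.error_budget = 0

    def observe(self, duration_ms, failed):
        """Record one finished request."""
        with self._lock:
            self.total_reqs += 1
            # Count the request once if it failed or exceeded the SLI threshold
            if failed or duration_ms > self.sli_ms:
                self.error_budget += 1

    def ratio(self):
        """Fraction of requests that spent error budget so far."""
        with self._lock:
            if self.total_reqs == 0:
                return 0.0
            return self.error_budget / self.total_reqs


counters = ErrorBudgetCounters()
counters.observe(120, failed=False)  # fast and successful
counters.observe(450, failed=False)  # slow
counters.observe(80, failed=True)    # fast but failed
counters.observe(500, failed=True)   # slow and failed, counted once
print(counters.total_reqs, counters.error_budget)  # 4 3
```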
Calculating Error Budgets with Metrics (and Logs)
Problems:
● SLI fixed threshold
● Inability to introspect historical data
● Difficult to compare different SLI behavior
Calculating Error Budgets with Metrics - Histograms
Use a histogram
Image source
http://www.brendangregg.com/FrequencyTrails/modes.html
Calculating Error Budgets with Metrics - Histograms
Linear, Cumulative, Log-Linear, Approximate…
High dynamic range, log-linear recommended
http://hdrhistogram.org/
https://github.com/circonus-labs/circonusllhist
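To make "log-linear" concrete, here is a small Python sketch of one possible binning scheme: each latency is rounded down to two significant decimal digits, so bins are linear within a power of ten and grow by a factor of ten between them. This is an illustration only, not the API or exact layout of either library linked above; the function names are assumptions.

```python
def bin_lower_bound(value_us):
    """Round a non-negative integer latency (µs) down to two significant digits."""
    if value_us < 0:
        raise ValueError("latency must be non-negative")
    digits = len(str(value_us))
    if digits <= 2:
        return value_us  # already at most two significant digits
    scale = 10 ** (digits - 2)
    return (value_us // scale) * scale


def build_histogram(samples_us):
    """Map each bin's lower bound to the number of samples that fell into it."""
    bins = {}
    for value in samples_us:
        lower = bin_lower_bound(value)
        bins[lower] = bins.get(lower, 0) + 1
    return bins


print(bin_lower_bound(1834))    # 1800
print(bin_lower_bound(123456))  # 120000
print(build_histogram([1834, 1899, 2450]))  # {1800: 2, 2400: 1}
```

Integer arithmetic is used on purpose so that bin edges are exact and do not depend on floating-point rounding.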
Calculating Error Budgets with Metrics - Histograms
Error Budget = 1k requests/day > Xms
For each histogram bin >= X:
error_budget += bin_sample_count
Alert if error_budget/total_reqs > 80% * (1-SLO)
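The bin-summing loop can be sketched in Python as follows. The histogram here is a plain dict keyed by each bin's lower bound in microseconds, which is an assumption for the example, as are the function names and the sample counts.

```python
def requests_over_threshold(bins, threshold_us):
    """Sum the sample counts of every bin whose lower bound is >= the threshold."""
    return sum(count for lower, count in bins.items() if lower >= threshold_us)


def budget_ratio(bins, threshold_us):
    """Fraction of all samples that fall in bins at or above the threshold."""
    total = sum(bins.values())
    if total == 0:
        return 0.0
    return requests_over_threshold(bins, threshold_us) / total


# Hypothetical histogram: bin lower bound (µs) -> sample count
bins = {900: 5000, 1200: 3000, 1800: 40, 2400: 10}

print(requests_over_threshold(bins, 1800))  # 50
print(requests_over_threshold(bins, 2400))  # 10
```

Because the bins are stored rather than a single counter, the same data answers the question for any threshold X that sits on a bin boundary.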
Calculating Error Budgets with Metrics - Histograms
Choose bin boundary for SLI (preferred) or
interpolate within boundaries
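When the SLI threshold falls inside a bin, the count above it has to be estimated. The Python sketch below uses linear interpolation, which assumes samples are spread evenly within each bin; that assumption, the `(lower, upper, count)` tuple layout, and the function name are choices made for this example.

```python
def estimate_over_threshold(bins, threshold):
    """Estimate how many samples exceed threshold.

    bins is a list of (lower, upper, count) tuples with lower < upper.
    Bins entirely above the threshold count in full; the bin containing
    the threshold contributes the fraction of its width above it.
    """
    over = 0.0
    for lower, upper, count in bins:
        if threshold <= lower:
            over += count
        elif threshold < upper:
            over += count * (upper - threshold) / (upper - lower)
    return over


# Hypothetical bins in microseconds
bins = [(1700, 1800, 100), (1800, 1900, 40), (1900, 2000, 10)]

print(estimate_over_threshold(bins, 1800))  # 50.0, threshold on a boundary
print(estimate_over_threshold(bins, 1850))  # 30.0, half of the 1800-1900 bin plus 10
```

Choosing a threshold on a bin boundary avoids the estimate entirely, which is why the slide prefers it.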
Calculating Error Budgets with Metrics - Histograms
Error Budget ~ 1k requests/day > 1,800µs
Calculating Error Budgets with Metrics - Histograms
Error Budget ~ 1k requests/day > 2,400µs
Calculating Error Budgets with Metrics - Histograms
Benefits:
● SLI variable threshold
● Ability to analyze historical data
● Examine error budgets for different SLIs
Talk Agenda
● SLOs and Error Budgets
● Calculating Error Budgets with Logs
● Calculating Error Budgets with Metrics
Questions?
Thanks!
https://slideshare.net/redhotpenguin
https://twitter.com/phredmoyer
https://linkedin.com/in/redhotpenguin
https://github.com/redhotpenguin
Appendix - SLOs, How Do They Work?
● Chapter 4
○ Service Level Objectives
● 99% Get RPC calls < 100ms
● https://landing.google.com/sre/sre-book/toc/index.html
Appendix - SLOs, How Do They Work?
● Ch 2: Implementing SLOs
● Ch 3: SLO Eng case studies
● Ch 5: Alerting on SLOs
● https://landing.google.com/sre/workbook/toc
Appendix - SLOs, How Do They Work?
● Chapter 21
○ The Art and Science of the Service Level Objective
