Using Error Budgets to Prioritize Work

Nathen Harvey / @nathenharvey
He/Him

Incentives are not aligned
Developers: Agility
Operators: Stability

The Business
The Business / IT

Just who is the business?
The Business: Business-y things
IT: Engineer-y things

Nathen Harvey
Developer Advocate, Google
@nathenharvey

What is "reliable"?
How do you know if the application is working?

User journey: Browse Catalog → Select Item → Purchase

Request flow between Browser, Server, and Payment Processor:
● Browser → Server: /catalog
● Server → Browser: List of Items
● Browser → Server: /api/addToCart
● Server → Browser: Status
● Browser → Server: /checkout
● Server → Browser: Payment Page
● Browser → Server: /api/pay
● Server → Payment Processor: /charge
● Payment Processor → Server: status
● Server → Browser: Status Page

Ops Review
Which pages…
● Were unactionable?
● Show a pattern?
● Can be eliminated with engineering work?
Reduce your maintenance costs,
get your return in future team bandwidth.

Resolving the Latency Issue
IT - heroic effort
Product Management - kept pushing for features

Net Result
IT - showing signs of burnout
Product Management - frustrated with pace of development

Site Reliability Engineering
(SRE)
To protect, provide for, and progress
software and systems with an
ever-watchful eye on their availability,
latency, performance, and capacity.

Service level indicators
SLIs | Metrics that you use to define the SLO targets
A carefully defined quantitative measure of some aspect of the level of service that is provided.
● Often defined on the “Golden Signals” of a service: traffic, error rate, and latency
● May also use custom metrics or “saturation”
● Istio gathers these automatically for each of your services

Latency
The payment status page should load quickly
● How do we define quickly?
● When does the timer start / stop?
The proportion of HTTP POST requests for /api/pay that send their entire response within X ms measured at the load balancer.
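
The latency SLI above is a ratio of good events to valid events, which can be sketched in a few lines. This is a minimal illustration, assuming request records have already been collected at the load balancer; the `Request` type, its field names, the sample data, and the 500 ms threshold are all hypothetical (the slide leaves the threshold as X ms).

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Request:
    """One request as logged at the load balancer (hypothetical record shape)."""
    method: str
    path: str
    latency_ms: float  # time until the entire response was sent


def latency_sli(requests: Iterable[Request], threshold_ms: float) -> float:
    """Proportion of POST /api/pay requests served within threshold_ms."""
    # Valid events: only the requests this SLI is defined over.
    valid = [r for r in requests if r.method == "POST" and r.path == "/api/pay"]
    if not valid:
        return 1.0  # assumption: a window with no valid events counts as compliant
    # Good events: valid requests that finished within the threshold.
    good = sum(1 for r in valid if r.latency_ms <= threshold_ms)
    return good / len(valid)


sample = [
    Request("POST", "/api/pay", 120.0),
    Request("POST", "/api/pay", 480.0),
    Request("POST", "/api/pay", 900.0),
    Request("POST", "/api/pay", 310.0),
    Request("GET", "/catalog", 2000.0),  # ignored: not part of this SLI
]
print(latency_sli(sample, threshold_ms=500.0))  # 3 of the 4 valid requests are good
```

Pinning down the method, the path, the threshold, and the measurement point is what turns "should load quickly" into a number that can be tracked.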

Service level objective
SLOs | Targets that you set for your SLIs
A target value or range of values for a service level that is measured by an SLI.
● The Golden Signals make great SLIs to set SLOs on
● Examples:
○ Return 99.9% of calls within 200ms
○ At least 99.5% of calls return non-errors
● You may alert when out of compliance

😡😋
SLOs should capture the performance and availability levels that,
if barely met, would keep the typical customer of a service happy
“meets target SLO” ⇒ “happy customers”
“sad customers” ⇒ “misses target SLO”

SLOs
Service    SLI Type    Objective
Purchase   Latency     99% < 500ms in last 28 days

Error Budget
A principled way to define the desired reliability of a service
An acceptable level of unreliability. This is a budget that can be allocated
An agreement that helps prioritize engineering work
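
The arithmetic behind an error budget is short: the budget is whatever the SLO leaves over, that is, 1 minus the target, applied to the events in the window. The sketch below assumes a request-based budget and reuses the 99% Purchase latency objective from the SLOs slide; the function names and the traffic numbers are hypothetical.

```python
def error_budget(slo_target: float, total_events: int) -> float:
    """Number of bad events the SLO tolerates over the window."""
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget(slo_target, total_events)
    if budget == 0:
        # A 100% target (or an empty window) leaves nothing to spend.
        return 0.0 if bad_events == 0 else float("-inf")
    return 1.0 - bad_events / budget


# Hypothetical 28-day window: 1,000,000 purchase requests, 2,500 slower than 500 ms.
budget = error_budget(0.99, 1_000_000)
remaining = budget_remaining(0.99, 1_000_000, 2_500)
print(f"budget: about {budget:.0f} slow requests")
print(f"remaining: about {remaining:.0%}")
```

With these numbers the budget is roughly 10,000 slow requests, and about a quarter of it has been spent.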

What are the consequences when we
exhaust or overspend
our error budget?

Consequences may include
/ freeze feature releases
/ prioritize postmortem items
/ automate deployment pipelines *
/ improve monitoring and observability
/ require SRE consultation
/ return the pager
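
One way to make such an agreement concrete is to encode it as a small policy that a release pipeline could consult. This is only a sketch under stated assumptions: the `Action` names, the zero threshold, and the idea of gating releases on a single number are hypothetical, and a real policy would be whatever development, operations, and product management agree on.

```python
from enum import Enum


class Action(Enum):
    SHIP_FEATURES = "ship features as usual"
    FREEZE_RELEASES = "freeze feature releases and prioritize reliability work"


def release_policy(budget_remaining: float) -> Action:
    """Map the unspent fraction of the error budget to an agreed consequence."""
    # Exhausted or overspent budget triggers the agreed consequence.
    if budget_remaining <= 0.0:
        return Action.FREEZE_RELEASES
    return Action.SHIP_FEATURES


print(release_policy(0.75).value)
print(release_policy(-0.20).value)
```

Because the rule is written down ahead of time, the decision to freeze is no longer a negotiation during an incident.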

What should we spend
our error budget on?

Error budgets can accommodate
/ releasing new features
/ expected system changes
/ inevitable failure in hardware, networks, etc.
/ planned downtime
/ risky experiments

Error Budget
An agreement that helps
prioritize engineering work

Both of these are now
available in HTML
format for free!
landing.google.com/sre/books
