Site Reliability Engineering (SRE) is a set of principles, practices, and organizational constructs that seek to balance the reliability of a service with the need to continually deliver new features. An error budget is the primary construct used to help balance these seemingly competing goals.
This is an introduction to error budgets and their components: service level indicators (SLIs) and service level objectives (SLOs). We will discuss the art of creating and implementing SLOs.
Attendees will be able to:
• Describe the key concepts, namely, Error Budgets, Service Level Indicators (SLIs), and Service Level Objectives (SLOs)
• Recommend actions to take when the error budget is overconsumed
• Recommend actions to take when excess error budget remains
In the spirit of DevOps, Error Budgets and SLOs work best when they are agreed to in collaboration with many different constituents across the business. As such, this presentation is appropriate for:
• Product Owners and Product Managers
• Business decision makers
• Developers
• Operators
• And anyone else interested in building and operating services that deliver business and customer value.
30. Ops Review
Which pages…
● Were unactionable?
● Show a pattern?
● Can be eliminated with engineering work?
Reduce your maintenance costs,
get your return in future team bandwidth.
36. IT - showing signs of burnout
Product Management - frustrated with pace of
development
Net Result
37. Site Reliability Engineering
(SRE)
To protect, provide for, and progress
software and systems with an
ever-watchful eye on their availability,
latency, performance, and capacity.
39. ● Often defined on the “Golden Signals” of a service:
traffic, error rate, and latency
● May also use custom metrics or
“saturation”
● Istio gathers these automatically for each of your
services
Service level indicators
A carefully defined quantitative measure of some aspect of
the level of service that is provided.
SLIs | Metrics that you use to
define the SLO targets
45. ● How do we define “quickly”?
● When does the timer start / stop?
The payment status page should load quickly
Latency
46. The proportion of HTTP POST requests for /api/pay
that send their entire response within X ms
measured at the load balancer.
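The latency SLI above can be sketched in a few lines. This is a minimal illustration, not a production measurement pipeline: the `Request` record, the `latency_sli` function, and the sample data are hypothetical, and the 200 ms threshold simply stands in for the slide's "X ms". It assumes each record's latency was captured at the load balancer and covers the time until the entire response was sent.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical log record, assumed to be captured at the load balancer."""
    method: str
    path: str
    latency_ms: float  # time until the entire response was sent


def latency_sli(requests, threshold_ms, method="POST", path="/api/pay"):
    """Proportion of eligible requests fully answered within threshold_ms."""
    eligible = [r for r in requests if r.method == method and r.path == path]
    if not eligible:
        return None  # no eligible traffic: the SLI is undefined for this window
    good = sum(1 for r in eligible if r.latency_ms <= threshold_ms)
    return good / len(eligible)


# Made-up sample data for illustration only.
sample = [
    Request("POST", "/api/pay", 120.0),
    Request("POST", "/api/pay", 180.0),
    Request("POST", "/api/pay", 450.0),
    Request("POST", "/api/pay", 90.0),
    Request("GET", "/api/status", 900.0),  # not eligible: wrong method and path
]

# 3 of the 4 eligible requests finished within 200 ms.
print(latency_sli(sample, threshold_ms=200))  # 0.75
```

Filtering to eligible requests first matters: the slow GET request does not count against the SLI because the definition only covers HTTP POST requests for /api/pay.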
47. ● The Golden Signals make great
SLIs to set SLOs on
● Examples:
○ Return 99.9% of calls within 200ms
○ At least 99.5% of calls
return non-errors
● You may alert when out of compliance
Service level objective
A target value or range of values for a service level that is
measured by an SLI.
SLOs | Targets that you set
on your SLIs
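The two example SLOs on this slide can be checked against measured SLIs as follows. This is a sketch under stated assumptions: the function names and the request counts are hypothetical, and only the 99.9% and 99.5% targets come from the slide.

```python
def sli(good_events, total_events):
    """SLI expressed as the fraction of good events."""
    return good_events / total_events


def meets_slo(sli_value, target):
    """An SLO is met when the measured SLI is at or above its target."""
    return sli_value >= target


# Hypothetical counts for one measurement window.
total_calls = 100_000
calls_within_200ms = 99_930
non_error_calls = 99_400

# SLO from the slide: return 99.9% of calls within 200 ms.
latency_ok = meets_slo(sli(calls_within_200ms, total_calls), 0.999)

# SLO from the slide: at least 99.5% of calls return non-errors.
availability_ok = meets_slo(sli(non_error_calls, total_calls), 0.995)

print(latency_ok)       # True: 99.93% is above the 99.9% target
print(availability_ok)  # False: 99.4% is below the 99.5% target
```

A result of `False` is the "out of compliance" condition on which you may choose to alert.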
48. 😡😋
SLOs should capture the performance and availability levels that,
if barely met, would keep the typical customer of a service happy
“meets target SLO” ⇒ “happy customers”
“sad customers” ⇒ “misses target SLO”
52. Error Budget
A principled way to define the
desired reliability of a service
An acceptable level of
unreliability. This is a budget that
can be allocated
An agreement that helps
prioritize engineering work
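One common way to turn "an acceptable level of unreliability" into a number is to treat the error budget as 1 minus the SLO target, applied to the events in the measurement window. The sketch below assumes that definition; the `error_budget` function, the window size, and the failure count are hypothetical. It uses exact fractions so the arithmetic is not distorted by floating-point rounding.

```python
from fractions import Fraction


def error_budget(slo_target, total_events, bad_events):
    """Summarize budget use, assuming budget = (1 - SLO target) * total events.

    slo_target is a decimal string such as "0.999" so it converts exactly.
    """
    target = Fraction(slo_target)
    allowed_bad = (1 - target) * total_events
    return {
        "allowed_bad_events": allowed_bad,
        "remaining_bad_events": allowed_bad - bad_events,
        # Undefined when the target is 100% and no failures are allowed.
        "budget_consumed": Fraction(bad_events) / allowed_bad if allowed_bad else None,
    }


# Hypothetical window: 1,000,000 requests, 250 failures, 99.9% SLO.
budget = error_budget("0.999", total_events=1_000_000, bad_events=250)

print(budget["allowed_bad_events"])      # 1000
print(budget["remaining_bad_events"])    # 750
print(float(budget["budget_consumed"]))  # 0.25
```

In this example a quarter of the budget is spent, leaving 750 failed requests that can still be "allocated" within the window.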
53. What are the consequences when we
exhaust or overspend
our error budget?
57. Error budgets can accommodate
/ releasing new features
/ expected system changes
/ inevitable failure in hardware, networks, etc.
/ planned downtime
/ risky experiments
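An error budget policy can be written down as a simple decision rule. The sketch below is an assumption, not something stated in this deck: it follows the widely used convention of pausing feature releases once the budget is exhausted, and the 75% warning threshold and the wording of each recommendation are purely illustrative. Real policies should be agreed on by the teams involved.

```python
def release_guidance(budget_consumed):
    """Map the fraction of error budget consumed to a recommended stance.

    budget_consumed: 0.0 means untouched, 1.0 means fully spent.
    Thresholds and messages are hypothetical examples.
    """
    if budget_consumed >= 1.0:
        return "freeze: pause feature releases and prioritize reliability work"
    if budget_consumed >= 0.75:
        return "caution: slow down risky changes and review recent incidents"
    return "proceed: budget remains for releases and experiments"


for consumed in (0.25, 0.80, 1.10):
    print(f"{consumed:.0%} consumed -> {release_guidance(consumed)}")
```

The value of such a rule is less in the code than in the agreement behind it: everyone knows in advance what happens when the budget runs out.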