According to Google, SRE is what you get when you treat operations as if it’s a software problem. In this video, I briefly explain key SRE processes. Video: https://youtu.be/BdFmRJAnB6A
4. Monitoring
• Without monitoring, you have no way to tell whether the
service is even working; absent a thoughtfully designed
monitoring infrastructure, you’re flying blind
• Maybe everyone who tries to use the website gets an
error, maybe not—but you want to be aware of problems
before your users notice them
https://blog.zabbix.com/zabbix-4-2-out-now/6791/
https://www.youtube.com/watch?v=BPu_0hqHgqA
https://www.zabbix.com/
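To make the monitoring point concrete, here is a minimal sketch of the kind of check a monitoring system might evaluate over a window of recent responses. The function names and the 1% threshold are illustrative assumptions, not taken from the slides or from Zabbix.

```python
from typing import Iterable


def error_ratio(status_codes: Iterable[int]) -> float:
    """Fraction of responses in the window that are server errors (HTTP 5xx)."""
    codes = list(status_codes)
    if not codes:
        return 0.0  # no traffic observed, so there is nothing to judge
    errors = sum(1 for code in codes if 500 <= code <= 599)
    return errors / len(codes)


def should_alert(status_codes: Iterable[int], threshold: float = 0.01) -> bool:
    """True when the error ratio exceeds the threshold (1% is an assumed default)."""
    return error_ratio(status_codes) > threshold


if __name__ == "__main__":
    # Hypothetical window: 95 successful responses and 5 server errors.
    window = [200] * 95 + [503] * 5
    print(error_ratio(window), should_alert(window))
```

A real setup would feed this from collected metrics and page someone when the check fires, so that the team hears about the errors before users report them.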
6. Postmortem and Root-Cause Analysis
• The primary goals of writing a postmortem are to ensure
that:
  • the incident is documented,
  • all contributing root causes are well understood, and
  • effective preventive actions are put in place to
  reduce the likelihood and/or impact of recurrence
• Postmortems are blameless: they focus on the contributing
causes of the incident, not on who was at fault
https://landing.google.com/sre/sre-book/chapters/postmortem-culture/
7. Testing
• Testing is the mechanism you use to demonstrate specific
areas of equivalence when changes occur
• Each test that passes both before and after a change
reduces the uncertainty for which the analysis needs to
allow
• Thorough testing helps us predict the future reliability of a
given site with enough detail to be practically useful
https://landing.google.com/sre/sre-book/chapters/testing-reliability/
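The idea of demonstrating equivalence across a change can be sketched as running the same test cases against the code before and after the change and listing any disagreements. Everything here (`equivalent_on`, the two `normalize_*` functions, the sample cases) is a hypothetical illustration, not code from the SRE book.

```python
from typing import Callable, Iterable, List


def normalize_before(text: str) -> str:
    """Behaviour before the change: trim whitespace, then lowercase."""
    return text.strip().lower()


def normalize_after(text: str) -> str:
    """Behaviour after the change: lowercase, then trim whitespace."""
    return text.lower().strip()


def equivalent_on(
    cases: Iterable[str],
    before: Callable[[str], str],
    after: Callable[[str], str],
) -> List[str]:
    """Return the inputs on which the two implementations disagree."""
    return [case for case in cases if before(case) != after(case)]


CASES = ["  Foo ", "BAR", ""]

if __name__ == "__main__":
    # An empty list means every case passed both before and after the change.
    print(equivalent_on(CASES, normalize_before, normalize_after))
```

Each case that agrees narrows the set of behaviours the change could have altered; a non-empty result points at the exact inputs that need a closer look.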
8. Capacity Planning
• Intent-Based Capacity Planning
• Intent is the rationale for how a service owner wants to
run their service
• Moving from concrete resource demands to motivating
reasons in order to arrive at the true capacity planning
intent often requires several layers of abstraction
• Example
• Concrete resource demand: "I want 50 cores in clusters
X, Y, and Z for service Foo."
• Intent: "I want to run service Foo at 5 nines of
reliability."
https://landing.google.com/sre/sre-book/chapters/software-engineering-in-sre/
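To give a sense of what an intent such as "5 nines of reliability" implies, the following sketch converts a number of nines into the downtime it allows per year. The function name and the 365-day year are assumptions made for illustration.

```python
def allowed_downtime_minutes(nines: int, days: float = 365.0) -> float:
    """Minutes of downtime per period allowed by an availability of N nines.

    Five nines means 99.999% availability, i.e. an unavailability of 10**-5.
    The 365-day period is an assumption; adjust it for other windows.
    """
    unavailability = 10.0 ** (-nines)
    return unavailability * days * 24 * 60


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} nines: {allowed_downtime_minutes(n):.3f} minutes per year")
```

Five nines leaves roughly 5.26 minutes of downtime per year, which is why the intent, rather than a fixed core count, is what should drive how much capacity and redundancy the service gets.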
9. Development
• Distributed Reliability
• Data processing pipelines
• one-shot MapReduce jobs running periodically
• systems that operate in near real-time
• Data Integrity
• What you read is what you wrote
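One common way to check that a read returns what was written is to store a checksum alongside the data and verify it on every read. The sketch below uses an in-memory dictionary and SHA-256; the `ChecksummedStore` class is a hypothetical illustration, not a design described in the slides.

```python
import hashlib
from typing import Dict, Tuple


class ChecksummedStore:
    """Toy key-value store that verifies a SHA-256 digest on every read."""

    def __init__(self) -> None:
        # Maps key -> (data, hex digest computed at write time).
        self._blobs: Dict[str, Tuple[bytes, str]] = {}

    def write(self, key: str, data: bytes) -> None:
        digest = hashlib.sha256(data).hexdigest()
        self._blobs[key] = (data, digest)

    def read(self, key: str) -> bytes:
        data, digest = self._blobs[key]
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError(f"integrity check failed for {key!r}")
        return data


if __name__ == "__main__":
    store = ChecksummedStore()
    store.write("report", b"quarterly numbers")
    print(store.read("report"))
```

A checksum only detects corruption; recovering the original data still depends on backups or replicas.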
10. Product
• Finally, having made our way up the reliability pyramid,
we find ourselves at the point of having a workable
product