SRE Demystified - 14 - SRE Practices overview

SRE Demystified
SRE Practices
ganesh@ganeshniyer.com
ganesh.vigneswara@gmail.com
http://ganeshniyer.com
Dr Ganesh Neelakanta Iyer

SRE
Image source: https://image.slidesharecdn.com/devopssreatgooglescale-190121123035/95/devops-sre-at-google-scale-30-638.jpg?cb=1548074257

Practices
https://landing.google.com/sre/sre-book/chapters/part3/

Monitoring
• Without monitoring, you have no way to tell whether the
service is even working; absent a thoughtfully designed
monitoring infrastructure, you’re flying blind
• Maybe everyone who tries to use the website gets an
error, maybe not; either way, you want to be aware of problems
before your users notice them
https://blog.zabbix.com/zabbix-4-2-out-now/6791/ https://www.youtube.com/watch?v=BPu_0hqHgqA https://www.zabbix.com/
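To make "be aware of problems before your users notice them" concrete, here is a minimal sketch of a black-box check that turns a window of synthetic probe results into an alert decision. It assumes some prober (not shown) has already sent requests to the service; the ProbeResult type, the 5xx-as-failure rule, and the 1% threshold are hypothetical choices, not Zabbix or Google defaults.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProbeResult:
    """One synthetic (black-box) request against the service; hypothetical shape."""
    status_code: int
    latency_ms: float


def error_ratio(results: List[ProbeResult]) -> float:
    """Fraction of probes that returned a server error (HTTP 5xx)."""
    if not results:
        return 0.0
    failures = sum(1 for r in results if r.status_code >= 500)
    return failures / len(results)


def should_alert(results: List[ProbeResult], max_error_ratio: float = 0.01) -> bool:
    """Alert when the error ratio exceeds the (hypothetical) 1% threshold.

    An empty window also alerts: with no probe data we cannot tell whether
    the service is working at all, which is the "flying blind" case.
    """
    if not results:
        return True
    return error_ratio(results) > max_error_ratio


if __name__ == "__main__":
    window = [ProbeResult(200, 35.0)] * 97 + [ProbeResult(503, 12.0)] * 3
    # 3% errors is above the 1% threshold, so should_alert returns True.
    print(error_ratio(window), should_alert(window))
```

Treating "no data" as alert-worthy is a deliberate design choice here: a silent monitoring pipeline should not look the same as a healthy service.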

Incident Response
• Being on-call
• Effective troubleshooting
• Emergency response
• Managing incidents

Postmortem and Root-Cause Analysis
• The primary goals of writing a postmortem are to ensure:
  • that the incident is documented,
  • that all contributing root causes are well understood, and
  • that effective preventive actions are put in place to
  reduce the likelihood and/or impact of recurrence
• Blameless: a postmortem focuses on the contributing causes
without indicting any individual or team for bad or
inappropriate behavior
https://landing.google.com/sre/sre-book/chapters/postmortem-culture/
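The three goals above can be expressed as a simple completeness check on a postmortem record. This is a minimal sketch: the Postmortem and ActionItem types, their field names, and the example values are hypothetical, not the SRE book's template. In keeping with the blameless principle, action items record an owning team for the fix rather than a person who caused the incident.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ActionItem:
    description: str
    owner_team: str  # who owns the fix; blameless, so there is no "culprit" field


@dataclass
class Postmortem:
    title: str
    summary: str = ""  # what happened, impact, timeline
    contributing_causes: List[str] = field(default_factory=list)
    action_items: List[ActionItem] = field(default_factory=list)

    def unmet_goals(self) -> List[str]:
        """List which of the three postmortem goals are not yet covered."""
        unmet = []
        if not self.summary.strip():
            unmet.append("incident documented")
        if not self.contributing_causes:
            unmet.append("contributing causes understood")
        if not self.action_items:
            unmet.append("preventive actions in place")
        return unmet


if __name__ == "__main__":
    pm = Postmortem(title="Elevated errors after a configuration push")
    print(pm.unmet_goals())  # all three goals still open
    pm.summary = "Config push raised error rate for 20 minutes."
    pm.contributing_causes.append("config change rolled out without a canary")
    pm.action_items.append(ActionItem("Add a canary stage for config pushes", "release-eng"))
    print(pm.unmet_goals())  # nothing left open
```

A check like this only shows that each section exists; judging whether the analysis is sound still needs human review.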

Testing
• Testing is the mechanism you use to demonstrate specific
areas of equivalence when changes occur
• Each test that passes both before and after a change
reduces the uncertainty for which the analysis needs to
allow
• Thorough testing helps us predict the future reliability of a
given site with enough detail to be practically useful
https://landing.google.com/sre/sre-book/chapters/testing-reliability/
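The idea that tests passing both before and after a change reduce uncertainty can be sketched by running one suite against the old and new code and classifying each case. This is an illustrative sketch; the word-count functions, the test cases, and the report keys are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Case = Tuple[str, int]  # (input text, expected word count)


def passing_cases(impl: Callable[[str], int], cases: List[Case]) -> Set[int]:
    """Indices of the test cases that impl passes."""
    return {i for i, (text, expected) in enumerate(cases) if impl(text) == expected}


def equivalence_report(before: Callable[[str], int],
                       after: Callable[[str], int],
                       cases: List[Case]) -> Dict[str, List[int]]:
    """Compare the same suite run against the pre-change and post-change code."""
    passed_before = passing_cases(before, cases)
    passed_after = passing_cases(after, cases)
    return {
        # Passing both times: evidence that behavior is unchanged here.
        "equivalent": sorted(passed_before & passed_after),
        # Passed before, fails now: the change broke something.
        "regressed": sorted(passed_before - passed_after),
        # Failed before, passes now: the change fixed something.
        "fixed": sorted(passed_after - passed_before),
    }


# Hypothetical change: word counting moves from split(" ") to split().
def word_count_before(text: str) -> int:
    return len(text.split(" "))


def word_count_after(text: str) -> int:
    return len(text.split())


CASES: List[Case] = [("hello world", 2), ("a  b", 2), ("", 0), ("one", 1)]

if __name__ == "__main__":
    print(equivalence_report(word_count_before, word_count_after, CASES))
```

Here cases 0 and 3 pass on both versions, so the change is demonstrated to be equivalent there; cases 1 and 2 show where behavior intentionally changed.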

Capacity Planning
• Intent-Based Capacity Planning
• Intent is the rationale for how a service owner wants to
run their service
• Moving from concrete resource demands to motivating
reasons in order to arrive at the true capacity planning
intent often requires several layers of abstraction
• Example, from the most concrete request to the true intent:
  • "I want 50 cores in clusters X, Y, and Z for service Foo."
  • "I want to run service Foo at 5 nines of reliability."
https://landing.google.com/sre/sre-book/chapters/software-engineering-in-sre/
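One layer of that abstraction can be sketched as arithmetic: start from an intent such as "serve Foo's peak demand even if some clusters fail" and derive the concrete per-cluster request. This is a deliberately simple sketch, not Google's planning tooling; the function names, the 100-core peak, and the choice of N+1 redundancy are hypothetical, picked so the result reproduces the slide's "50 cores in clusters X, Y, and Z".

```python
import math
from typing import Dict, List


def cores_per_cluster(peak_demand_cores: int, num_clusters: int, redundancy: int = 2) -> int:
    """Cores each cluster needs so the surviving clusters can still carry
    peak demand after `redundancy` clusters are lost (N + redundancy)."""
    if num_clusters <= redundancy:
        raise ValueError("need more clusters than the number we plan to lose")
    survivors = num_clusters - redundancy
    # Round up: a fractional-core shortfall would still drop traffic at peak.
    return math.ceil(peak_demand_cores / survivors)


def concrete_request(peak_demand_cores: int, clusters: List[str], redundancy: int = 2) -> Dict[str, int]:
    """Expand the intent into the per-cluster request a service owner might file."""
    per_cluster = cores_per_cluster(peak_demand_cores, len(clusters), redundancy)
    return {name: per_cluster for name in clusters}


if __name__ == "__main__":
    # Intent: 100 cores at peak, tolerate losing any one of X, Y, Z.
    print(concrete_request(100, ["X", "Y", "Z"], redundancy=1))
```

Recording the intent instead of the bare "50 cores" means the concrete request can be recomputed when demand grows or a cluster is added, rather than renegotiated by hand.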

Development
• Distributed consensus for reliability
• Data processing pipelines, ranging from
  • one-shot MapReduce jobs running periodically, to
  • systems that operate in near real-time
• Data Integrity
  • What you read is what you wrote
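The "what you read is what you wrote" property can be checked mechanically by recording a checksum when data is written and verifying it on every read. A minimal sketch, assuming a toy in-memory store; ChecksummedStore and IntegrityError are hypothetical names. A checksum only detects corruption; recovering the data still requires a good copy such as a backup or replica.

```python
import hashlib
from typing import Dict, Tuple


class IntegrityError(Exception):
    """Raised when stored bytes no longer match the checksum taken at write time."""


class ChecksummedStore:
    """Toy in-memory key-value store that verifies data on every read."""

    def __init__(self) -> None:
        # key -> (payload, SHA-256 hex digest computed when it was written)
        self._data: Dict[str, Tuple[bytes, str]] = {}

    def write(self, key: str, value: bytes) -> None:
        self._data[key] = (value, hashlib.sha256(value).hexdigest())

    def read(self, key: str) -> bytes:
        value, digest = self._data[key]
        # Recompute and compare before returning, so corruption is caught on read.
        if hashlib.sha256(value).hexdigest() != digest:
            raise IntegrityError(f"checksum mismatch for key {key!r}")
        return value


if __name__ == "__main__":
    store = ChecksummedStore()
    store.write("user/42", b"alice")
    print(store.read("user/42"))
    # Simulate silent corruption of the stored bytes (demonstration only).
    store._data["user/42"] = (b"alicf", store._data["user/42"][1])
    try:
        store.read("user/42")
    except IntegrityError as err:
        print("detected:", err)
```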

Product
• Finally, having made our way up the reliability pyramid,
we find ourselves at the point of having a workable
product
