Lessons Learned from a Parallel Universe
David N. Blank-Edelman, Technical Evangelist, Apcera
Just within the last ten or so years, we have seen at least two separate communities evolve at the crossroads of development and operations. The first—DevOps—grew up very much in public, the second matured sequestered within the halls of “special” companies like Google and Facebook and is only now starting to gain visibility and traction in the wider world. The DevOps and Site Reliability Engineering (SRE) communities barely speak, yet both have common ancestors and much to offer each other. Let’s look at what they have in common, how they differ, and what are the key things we can learn from both.
DevOps Enterprise Summit San Francisco 2016
16. What Makes SRE, SRE (dramatic recreation)
• hire only coders
• have an SLA for your service
• measure and report performance against SLA
• Use Error Budgets and gate launches on them
• Common staffing pool for SRE and DEV
• Excess Ops work overflows to DEV team
• Cap SRE operational load at 50%
• Share 5% of ops work with DEV team
• Oncall teams at least 8 people, or 6x2
• Maximum of 2 events per oncall shift
• Post mortem for every event
• Post mortems are blameless and focus on process and technology, not people
17. What Makes SRE, SRE (dramatic recreation)
• hire only coders
• have an SLA for your service
• measure and report performance against SLA
• Use Error Budgets and gate launches on them
• Common staffing pool for SRE and DEV
• Excess Ops work overflows to DEV team
• Cap SRE operational load at 50%
• Share 5% of ops work with DEV team
• Oncall teams at least 8 people, or 6x2
• Maximum of 2 events per oncall shift
• Post mortem for every event
• Post mortems are blameless and focus on process and technology, not people
19. What Makes SRE, SRE (dramatic recreation)
• hire only coders
• have an SLA for your service
• measure and report performance against SLA
• Use Error Budgets and gate launches on them
• Common staffing pool for SRE and DEV
• Excess Ops work overflows to DEV team
• Cap SRE operational load at 50%
• Share 5% of ops work with DEV team
• Oncall teams at least 8 people, or 6x2
• Maximum of 2 events per oncall shift
• Post mortem for every event
• Post mortems are blameless and focus on process and technology, not people
20. What Makes SRE, SRE (dramatic recreation)
• hire only coders
• have an SLA for your service
• measure and report performance against SLA
• Use Error Budgets and gate launches on them
• Common staffing pool for SRE and DEV
• Excess Ops work overflows to DEV team
• Cap SRE operational load at 50%
• Share 5% of ops work with DEV team
• Oncall teams at least 8 people, or 6x2
• Maximum of 2 events per oncall shift
• Post mortem for every event
• Post mortems are blameless and focus on process and technology, not people
22. What Makes SRE, SRE (dramatic recreation)
• hire only coders
• have an SLA for your service
• measure and report performance against SLA
• Use Error Budgets and gate launches on them
• Common staffing pool for SRE and DEV
• Excess Ops work overflows to DEV team
• Cap SRE operational load at 50%
• Share 5% of ops work with DEV team
• Oncall teams at least 8 people, or 6x2
• Maximum of 2 events per oncall shift
• Post mortem for every event
• Post mortems are blameless and focus on process and technology, not people
23. What Makes SRE, SRE (dramatic recreation)
• hire only coders
• have an SLA for your service
• measure and report performance against SLA
• Use Error Budgets and gate launches on them
• Common staffing pool for SRE and DEV
• Excess Ops work overflows to DEV team
• Cap SRE operational load at 50%
• Share 5% of ops work with DEV team
• Oncall teams at least 8 people, or 6x2
• Maximum of 2 events per oncall shift
• Post mortem for every event
• Post mortems blameless and focus on process and technology, not people
31. Facebook has Production Engineers
“Production Engineers at Facebook are hybrid
software/systems engineers who ensure that
Facebook's services run smoothly and have the
capacity for future growth. They are embedded in
every one of Facebook's product and infrastructure
teams, and are core participants in every significant
engineering effort underway in the company.”
32. Facebook has Production Engineers
“Production Engineers at Facebook are hybrid
software/systems engineers who ensure that
Facebook's services run smoothly and have the
capacity for future growth. They are embedded in
every one of Facebook's product and infrastructure
teams, and are core participants in every significant
engineering effort underway in the company.”
33. Facebook has Production Engineers
“Production Engineers at Facebook are hybrid
software/systems engineers who ensure that
Facebook's services run smoothly and have the
capacity for future growth. They are embedded
in every one of Facebook's product and infrastructure
teams, and are core participants in every significant
engineering effort underway in the company.”
35. Facebook has Production Engineers
•~1-10 ratio
•18, 24, 36 months with a service
(not a S.W.A.T. team or ops monkeys)
•Also do Bootcamp
•Lead reports to Facebook head of engineering
(same person responsible for Ops and Eng)
37. Production Engineering Maturity Model
•maturity of SW/PE team’s work
•maturity of level of service
•relationship status of PE/SWE teams
•stages:
• bootstrap/build out
• scale/initial deployment
• awesomize