4. Hi there! I’m Pedro!
• Engineering Director @
• Impact-driven person
• Passionate about People, Technology, and Products
• Agile, Lean and DevOps aficionado
• 10+ years of experience running engineering teams
5. On-call :: Definition
(of a person) able to be contacted in order to provide a
professional service if necessary, but not formally on duty.
‘The team is on call 24 hours a day, and is trained in resuscitation techniques and how to use life-saving defibrillators.’
‘If you work in a global organization, you might be on call 24 hours a day for troubleshooting or consulting.’
‘You have to get up in the middle of the night if you're on call.’
21. Tool Age
• Tons of alarms
• False positives (see the broken windows theory:
https://en.wikipedia.org/wiki/Broken_windows_theory)
• MTTA not tracked
• MTTR “over 9000”
• All systems were on-call (because none officially was… so all of them were)
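A minimal sketch of how MTTA and MTTR could be tracked, assuming each incident records when it was triggered, acknowledged and resolved. The `Incident` class and its field names are hypothetical and not taken from any tool mentioned in this deck.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record; field names are assumptions for illustration.
    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def mtta(incidents: list[Incident]) -> timedelta:
    """Mean time to acknowledge: average of (acknowledged - triggered)."""
    seconds = [(i.acknowledged_at - i.triggered_at).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to resolve: average of (resolved - triggered)."""
    seconds = [(i.resolved_at - i.triggered_at).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))
```

Both functions raise `statistics.StatisticsError` on an empty list, which is arguably the right signal when nothing is being tracked yet.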
22. Tool Age
• No compensation (voluntary and pro bono)
29. Bronze Age
• We evaluated 3 scenarios: “Primary / Secondary”, “Just primary” and
“Primary / Secondary (SRE)”
• SRE team covering its own rota (the infra one) –> We rebranded the Ops
team as the SRE team
• Development teams with rotas (dedicated to their systems)
• One engineer per rota (no secondaries)
• Engineers on-call (eat your own dog food: you develop it… you
maintain it in PROD!)
31. Bronze Age
• Tools: One hotspot per rota (no smartphones so that we don’t make
people carry two devices) + VictorOps App
32. Bronze Age
• One week rotas (four rotas in total)
• Rotas start and end every Tuesday (i.e. end-of-sprint day), aligning
the rota calendar with the sprint calendar
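The Tuesday-to-Tuesday weekly rota can be sketched as a small date calculation. This is only an illustration: the function name `rota_window` is hypothetical, and it assumes the handover happens at the start of each Tuesday.

```python
from datetime import date, timedelta

TUESDAY = 1  # date.weekday() numbers Monday as 0, so Tuesday is 1


def rota_window(day: date) -> tuple[date, date]:
    """Return (start, end) of the one-week rota that contains `day`.

    `start` is the Tuesday on or before `day`; `end` is the following
    Tuesday, which is also the start of the next rota.
    """
    days_since_tuesday = (day.weekday() - TUESDAY) % 7
    start = day - timedelta(days=days_since_tuesday)
    return start, start + timedelta(days=7)
```

For example, `rota_window(date(2024, 1, 3))` (a Wednesday) gives the window from Tuesday 2 January to Tuesday 9 January 2024.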
33. Bronze Age
• Only critical systems covered by the program (defined by Engineering
and agreed with stakeholders (e.g. Product, Customer Services,
Support))
35. Bronze Age
• Incident commander defined - The Incident Commander (IC) holds
the high-level state about the incident. They structure the incident
response task force, assigning responsibilities according to need and
priority
36. Bronze Age
• Weekly fire drills (or, as Google calls them, "Wheel of Misfortune")
38. Bronze Age
• Alarms fine tuned
• Defined time to Ack under 5 minutes
• Redefined thresholds
• Distinguished Alarms from Notifications: The alarm requires immediate
action. The notification can wait for the next day or so
• Cleaned up alarms from non-production environments
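The alarm vs notification split and the 5-minute Ack target could look roughly like the sketch below. The `Alert` class, the `route` function and the `"page"` / `"notify"` / `"drop"` labels are all hypothetical names chosen for illustration; they are not VictorOps concepts.

```python
from dataclasses import dataclass
from datetime import timedelta

ACK_TARGET = timedelta(minutes=5)  # "time to Ack under 5 minutes"


@dataclass
class Alert:
    # Hypothetical alert shape; field names are assumptions.
    name: str
    environment: str
    needs_immediate_action: bool


def route(alert: Alert) -> str:
    """Decide what to do with an alert.

    - non-production alerts are dropped (they were cleaned up)
    - alarms (immediate action required) page the on-call engineer
    - notifications can wait for the next day or so
    """
    if alert.environment != "production":
        return "drop"
    return "page" if alert.needs_immediate_action else "notify"


def ack_within_target(time_to_ack: timedelta) -> bool:
    """True when the acknowledgement came in under the 5-minute target."""
    return time_to_ack < ACK_TARGET
```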
40. Bronze Age
• Volunteer-based, not compulsory (Yeah… we ran into
“trouble” and I went on-call because of that: eat your own dog food…
lead by example… I took 4 consecutive weeks on-call)
41. Bronze Age
• Engineers participating in multiple rotas
• Avoiding engineers doing rotas back to back
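One way to check a schedule for back-to-back rotas when engineers take part in several rotas is sketched below. The schedule shape (one dict per week, mapping rota name to engineer) and the function name `find_back_to_back` are assumptions made for this example.

```python
def find_back_to_back(schedule: list[dict[str, str]]) -> list[tuple[int, str]]:
    """Find engineers who are on call in two consecutive weeks.

    `schedule[i]` maps rota name -> engineer on call during week i.
    Returns (week_index, engineer) pairs, where the engineer is on call
    in week `week_index` and again in the following week, in any rota.
    """
    violations = []
    for week, (current, following) in enumerate(zip(schedule, schedule[1:])):
        repeated = set(current.values()) & set(following.values())
        for engineer in sorted(repeated):
            violations.append((week, engineer))
    return violations
```

An empty result means nobody is doing rotas back to back, even across different rotas.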
45. Bronze Age
• Little time to work on the resiliency of systems (hard to prioritize and
hard to complete action points from PMs during sprints)
46. Bronze Age
• On-call procedure
• Updating the company’s status page
• Keeping the organization/stakeholders informed with the incident status
every 5 minutes
48. Bronze Age
• Performance reviews completely disassociated from the on-call
program (no one gets a worse review because of not participating in
the program)
49. Bronze Age
• Although we have offices in different time zones, we didn’t use a
“follow the sun” strategy (lack of engineers in the US)
50. Bronze Age
• P0s are all hands on deck and we are “entitled” to call all engineers
that can help
• Panic button on Slack with Zapier integration
86. Final thoughts
• Don’t make rushed decisions because you are getting too many alerts
(e.g. turning off alarms)
87. Final thoughts
• Take advantage of the business hours (when you have the entire
engineering team at the office) to tackle issues that might come up
during out-of-business hours (when you “only” have the on-call
engineers available)
88. Final thoughts
• Being on-call doesn’t mean that you need to save the world. We don’t
need “Rambos”… so play it safe, stick to the playbooks and don’t
make risky decisions under stress
89. Final thoughts
• Don’t hesitate to jump into a (video) call to coordinate the incident
resolution (usually Slack is not enough) – sync vs async comms
90. Final thoughts
• Don’t forget to keep the stakeholders in the loop (we are in the heat
zone… but they are suffering from the sideline… and they need to
know what is happening)
91. Final thoughts
• Action items from (blameless) post mortems should be tracked to
make sure they are executed
92. Final thoughts
• Don’t fall into the wishful thinking game: if you believe/suspect that
an alarm is triggered by something harmless that you “can’t control”
(e.g. network glitch)… be ready to prove that… otherwise don’t stop
investigating the root cause
93. Final thoughts
• Always write PMs (for PEs and PIs) and bear in mind that you should
have public versions of the PM (sooner or later your customers will
ask for them)
Shifts (I like to call them rotas)
People (I like to call them heroes)
Systems (I like to call them “the critical ones”)
Because you care about your customers
Because you care about your production systems
Because you care about your engineers
Because you care about your company
Because you care about your job