According to Google, SRE is what you get when you treat operations as if it’s a software problem. In this video, I briefly explain what is and isn't toil, and how to identify, measure, and eliminate it.
YouTube video here: https://youtu.be/EgpCw15fIK8
3. Toil
• Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows
https://landing.google.com/sre/workbook/chapters/eliminating-toil/
4. What is NOT toil?
• Toil is not just “work I don’t like to do.”
• It’s also not simply equivalent to administrative chores or grungy work
• Some administrative chores have to get done but should not be categorized as toil: they are overhead
• Overhead includes tasks like team meetings, setting goals, and HR paperwork
• Cleaning up the entire alerting configuration for your service and removing clutter may be grungy, but it’s not toil
https://landing.google.com/sre/workbook/chapters/eliminating-toil/
5. Toil Defined
• Manual: manually running a script (the time spent running the script counts)
• Repetitive: toil is work you do over and over
• Automatable: if a machine could accomplish the task just as well as a human, the task is probably toil
• Tactical: handling pager alerts
• No enduring value: if your service remains in the same state after you have finished a task, the task was probably toil
• O(n) with service growth: if the work involved in a task scales up linearly with service size, traffic volume, or user count, that task is probably toil
https://landing.google.com/sre/workbook/chapters/eliminating-toil/
6. Examples
• Handling quota requests
• Applying database schema changes
• Reviewing non-critical monitoring alerts
• Copying and pasting commands from a playbook
https://cloud.google.com/blog/products/management-tools/identifying-and-tracking-toil-using-sre-principles
https://www.rundeck.com/blog/sre-anti-pattern-known-workaround-bug-closed
7. Measuring the impact of the work
• What type of work was it (quota changes, push release to production, ACL update, etc.)?
• What was the degree of difficulty: Easy (<1 hour); Medium (hours); Hard (days)? Base this on human hands-on time, not elapsed time.
• Who did the work?
https://cloud.google.com/blog/products/management-tools/identifying-and-tracking-toil-using-sre-principles
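The three questions above map onto a small record per piece of work. The following is a minimal sketch, not a prescribed format: the `WorkItem` record and the `difficulty` helper are hypothetical names, and the 8-hour boundary between Medium and Hard is an assumption, since the slide only distinguishes "hours" from "days".

```python
from dataclasses import dataclass


@dataclass
class WorkItem:
    """One piece of operational work (hypothetical record, not from the source)."""
    work_type: str          # e.g. "quota change", "release push", "ACL update"
    hands_on_hours: float   # human hands-on time, not elapsed time
    engineer: str           # who did the work


def difficulty(item: WorkItem) -> str:
    """Bucket an item as Easy (<1 hour), Medium (hours), or Hard (days).

    The 8-hour cutoff for "days" is an assumption made for this sketch.
    """
    if item.hands_on_hours < 1:
        return "Easy"
    if item.hands_on_hours < 8:
        return "Medium"
    return "Hard"


item = WorkItem(work_type="quota change", hands_on_hours=0.5, engineer="alice")
print(item.work_type, difficulty(item), item.engineer)
```

Recording hands-on hours as a number, rather than only the bucket, keeps the option of summing time per work type later.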
8. Identifying toil: Survey the team
• Averaging over the past four weeks, approximately what fraction of your time did you spend on toil?
  • Scale 0-100%
• How happy are you with the quantity of time you spend on toil?
  • Not happy / OK / No problem at all
• What are your top three sources of toil?
  • On-call Response / Interrupts / Pushes / Capacity / Other / etc.
• Do you have a long-term engineering project in your quarterly objectives?
  • Yes / No
• If so, averaging over the past four weeks, approximately what fraction of your time did you spend on your engineering project? (estimate)
  • Scale 0-100%
• In your team, is there toil you can automate away but you don’t do so, because that very toil takes time away from long-term engineering work? If so, please describe below.
  • Open response
https://cloud.google.com/blog/products/management-tools/identifying-and-tracking-toil-using-sre-principles
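Once the survey answers are collected, two of the questions are easy to summarize: the average toil percentage and the most frequently named sources. The sketch below assumes a hypothetical response format (the field names `toil_percent` and `top_sources` and the sample answers are made up for illustration).

```python
from collections import Counter
from statistics import mean

# Hypothetical survey responses; field names and values are illustrative only.
responses = [
    {"toil_percent": 60, "top_sources": ["On-call Response", "Interrupts", "Pushes"]},
    {"toil_percent": 40, "top_sources": ["Interrupts", "Capacity", "Pushes"]},
    {"toil_percent": 50, "top_sources": ["Interrupts", "On-call Response", "Other"]},
]


def summarize(survey):
    """Return the team's average toil percentage and toil sources ranked by mentions."""
    average = mean(r["toil_percent"] for r in survey)
    mentions = Counter(src for r in survey for src in r["top_sources"])
    return average, mentions.most_common()


avg_toil, ranked_sources = summarize(responses)
print(f"Average time on toil: {avg_toil}%")
for source, count in ranked_sources:
    print(f"{source}: named by {count} respondent(s)")
```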
9. Measuring Toil
• Regularly compute an estimate of how much time is being spent on various types of work
• Look for patterns or trends in your tickets, surveys, and on-call incident response, and prioritize based on the aggregate human time spent
https://www.rundeck.com/blog/sre-anti-pattern-known-workaround-bug-closed
https://cloud.google.com/blog/products/management-tools/identifying-and-tracking-toil-using-sre-principles
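Prioritizing by aggregate human time can be as simple as summing hands-on hours per type of work and sorting. This is a minimal sketch under stated assumptions: the `work_log` entries are invented sample data, and `rank_by_total_hours` is a hypothetical helper, not part of any tool named in the sources.

```python
from collections import defaultdict

# Hypothetical log of (work type, hands-on hours), e.g. gathered from tickets
# and on-call records. The values are made up for illustration.
work_log = [
    ("quota change", 0.5),
    ("release push", 3.0),
    ("quota change", 0.5),
    ("ACL update", 1.0),
    ("release push", 4.0),
]


def rank_by_total_hours(log):
    """Sum hands-on hours per work type and return them largest first."""
    totals = defaultdict(float)
    for work_type, hours in log:
        totals[work_type] += hours
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


ranking = rank_by_total_hours(work_log)
for work_type, hours in ranking:
    print(f"{work_type}: {hours:.1f} h")
```

In this sample the release pushes dominate, so they would be the first candidate for automation even though quota changes occur just as often.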
10. Eliminating Toil
• Treat your automation like any other production system
• If you have an SLO practice, use some of your error budget to automate away toil
• Complete postmortems when your automation fails, and fix it as you would any user-facing system
• You want your automation available to you in any situation, including production incidents, to free humans to do the work they’re good at
https://cloud.google.com/blog/products/management-tools/identifying-and-tracking-toil-using-sre-principles