Presentation by Damon Edwards, co-founder of Rundeck, at NewOps Days in Raleigh, NC, on December 4, 2018.
9. SysAdmins
Overloaded. Constant firefighting.
Waiting in ticket queues for everything.
Things break. Break again. And again.
Everyone is busy, but it doesn’t get any better.
Executive View
Our transformation has largely ignored Ops. Any ideas?
Have you heard of SRE? Google does it.
Everything takes too long, costs too much, and breaks too often!
10. SysAdmins
SRE (new name)
Overloaded. Constant firefighting.
Waiting in ticket queues for everything.
Things break. Break again. And again.
Everyone is busy, but it doesn’t get any better.
Executive View
Our transformation has largely ignored Ops. Any ideas?
Have you heard of SRE? Google does it.
Everything takes too long, costs too much, and breaks too often!
11. Changing job titles or adding individual skills
doesn’t make systems administrators SREs.
12. SRE is a rethinking of how Operations work gets done.
16. SLO and Error Budgets: Tools for Shared Responsibility
[Figure: 0 to 100 scale marking the Service Level Indicator, Service Level Objective, and Error Budget*]
(*Use this to improve the service)
18. SLO and Error Budgets: Tools for Shared Responsibility
[Figure: 0 to 100 scale marking the Service Level Indicator, Service Level Objective, and Error Budget, shown alongside DEV, BIZ, and Ops]
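A minimal sketch of the arithmetic behind the 0 to 100 scale, assuming the Service Level Indicator is measured as the percentage of good events out of all events. The names slo_report and SloReport are hypothetical, and the 99% objective and event counts are made-up illustration values, not figures from the talk.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float           # measured percentage of good events (0-100)
    error_budget: float  # allowed percentage of bad events (100 - SLO)
    budget_spent: float  # fraction of the error budget consumed so far


def slo_report(slo_percent: float, good_events: int, total_events: int) -> SloReport:
    """Compare a measured SLI against an SLO (hypothetical helper)."""
    if not 0 < slo_percent < 100:
        raise ValueError("SLO must be strictly between 0 and 100")
    if total_events <= 0:
        raise ValueError("need at least one event")
    sli = 100.0 * good_events / total_events
    error_budget = 100.0 - slo_percent           # the gap between the SLO and 100
    budget_spent = (100.0 - sli) / error_budget  # above 1.0 means the budget is overspent
    return SloReport(sli, error_budget, budget_spent)


# Made-up numbers: a 99% objective and 9,950 good requests out of 10,000.
report = slo_report(slo_percent=99.0, good_events=9_950, total_events=10_000)
print(report)
```

A budget_spent value below 1.0 leaves room to take risk in order to improve the service; a value above 1.0 is where the "consequences" agreed between DEV, BIZ, and Ops would apply.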
20. Principles of SRE are what set SRE apart
1. SRE needs Service Level Objectives, with consequences
2. SREs have time to make tomorrow better than today
22. Toil: Name For a Problem We’ve All Felt
“Toil is the kind of work tied to running a production
service that tends to be manual, repetitive,
automatable, tactical, devoid of enduring value, and
that scales linearly as a service grows.”
-Vivek Rau
Google
23. Toil vs. Engineering Work
Toil                    | Engineering Work
Lacks Enduring Value    | Builds Enduring Value
Rote, Repetitive        | Creative, Iterative
Tactical                | Strategic
Increases With Scale    | Enables Scaling
Can Be Automated        | Requires Human Creativity
24. Excessive Toil Prevents Fixing the System
[Figure: two capacity bars split between Toil and Engineering Work (E.W.)]
Toil at manageable percentage of capacity:
• Reduce toil
• Improve the business
Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”):
• No capacity to reduce toil
• No capacity to improve business
26. Excessive Toil Prevents Fixing the System
Downward spiral is inevitable!
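A toy model of the downward spiral, offered only as an illustration and not taken from the talk. It assumes a fixed team capacity, toil that grows by a constant amount each period as the service grows, and engineering time that removes a fixed amount of recurring toil per hour invested. The function name simulate_toil and every number are hypothetical.

```python
def simulate_toil(capacity, toil, growth, payoff, periods):
    """Return the toil hours per period under a deliberately simple model.

    Each period, whatever capacity is not consumed by toil goes to
    engineering work; each engineering hour removes `payoff` hours of
    recurring toil; then toil grows by `growth` hours with the service.
    """
    history = [toil]
    for _ in range(periods):
        engineering = max(capacity - toil, 0.0)
        reduced = max(toil - payoff * engineering, 0.0)
        toil = min(reduced + growth, capacity)  # toil cannot exceed capacity
        history.append(toil)
    return history


# Same team and service, different starting toil levels (made-up numbers).
manageable = simulate_toil(capacity=100.0, toil=30.0, growth=5.0, payoff=0.1, periods=30)
bankrupt = simulate_toil(capacity=100.0, toil=60.0, growth=5.0, payoff=0.1, periods=30)

print(f"start 30h of toil -> {manageable[-1]:.1f}h after 30 periods")
print(f"start 60h of toil -> {bankrupt[-1]:.1f}h after 30 periods")
```

With these assumed parameters the tipping point sits at 50 hours of toil: below it, engineering work outpaces toil growth; above it, toil crowds out the engineering work that would have reduced it.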
27. Toil is a naturally occurring force
General Evolution of Automation
1. No automation
2. Externally maintained system-specific automation
3. Externally maintained generic automation
4. Internally maintained system-specific automation
5. Systems that don’t need any automation
Niall Murphy
Microsoft Azure
29. Toil is a naturally occurring force
[Figure: service lifecycle from Launch (ToDos & Unknowns) to Mature, with Toil marked repeatedly along the way]
31. Principles of SRE are what set SRE apart
1. SRE needs Service Level Objectives, with consequences
2. SREs have time to make tomorrow better than today
3. SRE teams have the ability to regulate their workload
32. Principles of SRE are what set SRE apart
Stephen Thorne, “Principles of SRE,” at DevOps Enterprise Summit London 2018
https://youtu.be/c-w_GYvi0eA
34. Where to start (in the enterprise)
1. SRE needs Service Level Objectives, with consequences
2. SREs have time to make tomorrow better than today
3. SRE teams have the ability to regulate their workload
37. Where to start (in the enterprise)
1. SRE needs Service Level Objectives, with consequences
2. SREs have time to make tomorrow better than today
3. SRE teams have the ability to regulate their workload
Company-wide culture change (hard!)
Company-wide culture change (hard!)
Reduce toil.
Everybody wins!
40. Why focus on reducing toil?
1. Lots of value independent of “SRE”
41. Why focus on reducing toil?
1. Lots of value independent of “SRE”
2. Your people are your most expensive assets… stay out of their way!
42. Your people are expensive, stay out of their way!
Delivering planned work:
[Figure: two paths from Backlog through Ticket Queues to done (✅), labeled “Not this:” and “This:”]
47. Your people are expensive, stay out of their way!
Delivering planned work:
[Figure: two paths from Backlog through Ticket Queues to done (✅), labeled “Not this:” and “This:”]
Responding to incidents:
SRE OODA Loop
• Observe: Invest in the right instrumentation
• Orient: Invest in collaboration, checklists, investigatory tools
• Decide: Empower them to make decisions!
• Action: Empower them to take action!
48. Ticket queues = interruptions, waiting, and toil
[Figure: Silo A (Function A) and Silo B (Function B)]
49. Ticket queues = interruptions, waiting, and toil
[Figure: Silo A (Function A) and Silo B (Function B), with a Ticket Queue between them]
51. Ticket queues = interruptions, waiting, and toil
[Figure: Silo A (Function A) and Silo B (Function B), with a Ticket Queue between them, marked “??”]
Snowflakes: Technically acceptable, but brittle and unreproducible
55. Super easy to get started
1. Track toil levels for each team
2. Set toil limit for each team
3. Fund efforts to reduce toil (with emphasis on teams already over limit)
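The three steps above can be sketched as a small report, assuming toil is tracked as hours per team per period. The TeamToil class, the over_limit helper, the team names, the hours, and the 50% limit are all hypothetical illustration values; each organization would pick its own limit.

```python
from dataclasses import dataclass


@dataclass
class TeamToil:
    team: str
    toil_hours: float   # step 1: tracked toil for the period
    total_hours: float  # total working hours for the period

    @property
    def toil_fraction(self) -> float:
        return self.toil_hours / self.total_hours


def over_limit(teams, limit):
    """Step 3: teams above the toil limit, worst first, as funding candidates."""
    flagged = [t for t in teams if t.toil_fraction > limit]
    return sorted(flagged, key=lambda t: t.toil_fraction, reverse=True)


TOIL_LIMIT = 0.5  # step 2: an assumed limit, expressed as a fraction of capacity

teams = [
    TeamToil("platform", toil_hours=120, total_hours=400),
    TeamToil("database", toil_hours=260, total_hours=400),
    TeamToil("network", toil_hours=220, total_hours=400),
]

for t in over_limit(teams, TOIL_LIMIT):
    print(f"{t.team}: {t.toil_fraction:.0%} toil, over the {TOIL_LIMIT:.0%} limit")
```

Sorting worst first reflects the emphasis on teams already over the limit; a real tracker would also need an agreed definition of what counts as toil.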
57. “Do it, do it again, then do it again.”
[Figure, Before: each “I need you to do X” becomes a Ticket, then Wait, then Interrupt of “Your other work” to “Do X” until “Done.” Later… the same request repeats, three times in all.]
[Figure, After: each request is handled through Self-Service, and “Your other work” continues (x2, x3).]
59. “I could fix it, but I can’t get to it.”
[Figure, Before: “I could fix it if I could get to it” (Wait, Interrupt before reaching the Environment)]
[Figure, After: “I’ve got this!” (Self-Service access to the Environment)]
61. “I’m an expert, I don’t read the wiki.”
[Figure, Before: the warnings “Service has changed. Use this flag or bad things will happen!” and “Pause monitoring first or we all get woken up!” go into the docs. Later… an operator (“I’ve done this before. I’ve got this…”) runs a restart command against the Environment.]
[Figure, After: the same warnings are used to update the Restart Job. Later… the operator (“I’ve done this before. I’ve got this.”) runs the Restart Job through Self-Service ✅.]
Commands shown: “restart -doit -now” and “restart”
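A minimal sketch of the "After" idea: the current procedure lives in the job, so the expert who never rereads the docs still gets the right steps. Everything here is hypothetical. The helper names, the "billing" service, and the step log are illustrations, the command string reuses the flags from the slide's example, and run() only records the command rather than executing anything.

```python
from contextlib import contextmanager

log = []  # records each step so the sequence can be inspected


@contextmanager
def monitoring_paused(service):
    """Pause monitoring for the duration of the block, then always resume it."""
    log.append(f"pause monitoring: {service}")
    try:
        yield
    finally:
        log.append(f"resume monitoring: {service}")


def run(command):
    """Stand-in for executing a command; a real job would invoke the actual tool."""
    log.append(f"run: {command}")


def restart_job(service):
    """Self-service restart that carries the updated procedure with it."""
    with monitoring_paused(service):
        # Flags assumed from the slide's example command.
        run(f"restart -doit -now {service}")


restart_job("billing")
print("\n".join(log))
```

When the service changes again, only restart_job needs updating; callers keep pressing the same button.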
66. Strategic: Improve incident response times
Jody Mulkey at DOES ‘15 SF
https://youtu.be/USYrDaPEFtM
[Figure: DEV, STAGE, and PROD environments, each with Services, Monitoring, and Scripts/Tools; users labeled Dev & QA, NOC/Ops, and Dev; “Promote approved jobs”; Self-Service; “Empower”]
68. Strategic: Improve incident response times
• Reduced MTTR by 92%
• Reduced escalations by 50%
• Reduced overall support costs by 55%
69. Strategic: Reduce compliance burden & improve consistency
Shaun Norris at DOES ‘18 Las Vegas
https://youtu.be/d5IMvK0YHTg
70. Strategic: Reduce compliance burden & improve consistency
Optimized for compliance
• 86,000+ employees
• 60+ countries
• Highly regulated
71. Strategic: Reduce compliance burden & improve consistency
[Figure: lines of business (LOB #1, LOB #2, LOB #3, LOB …n), each with Services and Scripts/Tools in Data Center and Cloud environments, behind a shared Self-Service layer labeled Compliance and Consistency]
72. Strategic: Reduce compliance burden & improve consistency
12 months:
• Freed up 28 person years of time
• 13,000+ ops tasks in privileged environments that didn’t require a review
• ~200 fewer customer-impacting events