As presented by Damon Edwards, co-founder of Rundeck, at SREcon in Dusseldorf, Germany on 30 Aug 2018.
Video available here:
https://www.usenix.org/conference/srecon18europe/presentation/edwards
4. Ops Business
Idea
Shorter Time-to-Market
Fast Feedback
from Users
Dev Ops
Running
Services
Improved Quality
Digital and DevOps
Availability Auditing
Security Compliance
"Go faster!"
“Open up!”
“Lock it down!”
“Great for Dev, but what about Ops?”
5. Our transformation has largely
ignored Ops. Any ideas?
Have you heard of SRE?
Google does it.
11. SysAdmins
Overloaded. Constant
firefighting.
Waiting in ticket queues
for everything.
Things break. Break
again. And again.
Everyone is busy, but it
doesn’t get any better.
Our transformation has largely
ignored Ops. Any ideas?
Have you heard of SRE?
Google does it.
Everything takes too
long, costs too much,
and breaks too often!
Executive
View
12. SysAdmins
(False) SRE
Overloaded. Constant
firefighting.
Waiting in ticket queues
for everything.
Things break. Break
again. And again.
Everyone is busy, but it
doesn’t get any better.
Our transformation has largely
ignored Ops. Any ideas?
Have you heard of SRE?
Google does it.
Everything takes too
long, costs too much,
and breaks too often!
Executive
View
13. Changing job titles or adding individual skills
doesn’t make systems administrators SREs.
15. Principles of SRE are what set SRE apart
1. SRE needs Service Level Objectives, with consequences
18. SLO and Error Budgets: Tools for Shared Responsibility
0
100
Service Level Objective
Error Budget*
Service Level Indicator
(*Use this to improve the service)
DEV
BIZ
Ops
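The relationship between the three terms on this slide can be sketched as simple arithmetic. This is a minimal sketch, assuming an availability-style Service Level Indicator measured as good events over total events. The function names and the 99.9 target used in the note below are illustrative assumptions, not figures from the talk.

```python
def error_budget(slo_percent: float) -> float:
    """Allowed unreliability in percent: the gap between the SLO and 100."""
    return 100.0 - slo_percent


def budget_remaining(slo_percent: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent).

    The SLI here is assumed to be good_events / total_events.
    """
    allowed_bad = total_events * error_budget(slo_percent) / 100.0
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # A 100% SLO (or no traffic) leaves no budget to spend.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad
```

With an assumed SLO of 99.9 and 1,000,000 events, 500 bad events leave roughly half of the budget. A negative result is the point at which the agreed consequences would apply.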
22. Principles of SRE are what set SRE apart
Stephen Thorne
At DevOps Enterprise Summit
London 2018
“Principles of SRE”
https://youtu.be/c-w_GYvi0eA
1. SRE needs Service Level Objectives, with consequences
2. SREs have time to make tomorrow better than today
3. SRE teams have the ability to regulate their workload
28. Silos cause disconnects and mismatches
Silo A (“I need X”): Backlog, Information, Priorities, Tools
Requests for X
Silo B (“I do X”): Backlog, Information, Priorities, Tools
Disconnects and mismatches between the silos: Context, Process, Tooling, Capacity
31. Silos create labor pools of functional specialists
Function A
Function B
Function C
Requests fulfilled by semi-manual or manual effort
Primary management focus is on protecting team capacity
35. Silos Undermine SRE Principles
1. Org has Service Level Objectives, with consequences?
X Disjointed silos make meaningful SLOs and shared responsibility almost impossible
2. SREs have time to make tomorrow better than today?
X Siloed labor pools, disconnected processes and tools, and slow feedback loops tend to consume all available capacity
3. SRE teams have the ability to regulate their workload?
X Struggling to keep up with demand and unable to protect capacity
38. How do we cover for our cross-silo disconnects and mismatches?
Silo A Silo B
Ticket
Queue
39. ??
Silo A Silo B
We all know how well that works
Ticket
Queue
40. Request queues are an expensive way to manage work
Ticket
Queue
Queues Create…
Longer Cycle Time
Increased Risk
More Variability
More Overhead
Lower Quality
Less Motivation
Adapted from Donald G. Reinertsen, The Principles of Product Development Flow: Second Generation Lean Product Development
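One way to see why queues lengthen cycle time is the textbook M/M/1 queueing model. This is a minimal sketch, assuming a single request fulfiller with random arrivals and random service times. The function name and the rates in the note below are illustrative assumptions, not measurements from the talk.

```python
def mean_wait_in_queue(arrival_rate: float, service_rate: float) -> float:
    """Mean time a request waits before work starts, for an M/M/1 queue.

    Uses Wq = rho / (service_rate - arrival_rate), where
    rho = arrival_rate / service_rate is the fulfiller's utilization.
    Rates are in requests per unit of time; the result is in that unit.
    """
    if arrival_rate < 0 or service_rate <= 0:
        raise ValueError("rates must be positive")
    if arrival_rate >= service_rate:
        raise ValueError("queue grows without bound at or above 100% utilization")
    utilization = arrival_rate / service_rate
    return utilization / (service_rate - arrival_rate)
```

For a team that can close 10 tickets a day, 5 arrivals a day give a mean wait of 0.1 days, while 9 arrivals a day give 0.9 days. Under this model, waiting time rises steeply as the team approaches full utilization.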
45. Ticket queues become “snowflake makers”
??
Silo A Silo B
Ticket
Queue
Snowflakes
(each unique, technically acceptable but unreproducible and brittle)
49. Ticket Queues Undermine SRE Principles
1. Org has Service Level Objectives, with consequences?
X Tickets reinforce siloed behaviors and obfuscate the value stream
2. SREs have time to make tomorrow better than today?
X Longer cycle time, more variability, more overhead, lower quality, and more snowflakes consume available capacity
3. SRE teams have the ability to regulate their workload?
X Queues obfuscate the pressure being put on request fulfillers
52. Toil is the enemy of SRE
“Toil is the kind of work tied to running a production
service that tends to be manual, repetitive,
automatable, tactical, devoid of enduring value, and
that scales linearly as a service grows.”
-Vivek Rau
Google
53. Toil vs. Engineering Work
Toil | Engineering Work
Lacks Enduring Value | Builds Enduring Value
Rote, Repetitive | Creative, Iterative
Tactical | Strategic
Increases With Scale | Enables Scaling
Can Be Automated | Requires Human Creativity
54. Excessive toil prevents fixing the system
Toil at manageable percentage of capacity: Engineering Work (E.W.) is left over to reduce toil and improve the business
Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”): no capacity to reduce toil, no capacity to improve business
59. Excessive Toil Undermines SRE Principles
1. Org has Service Level Objectives, with consequences?
X Being buried in toil keeps the team from contributing engineering work to uphold their end of the shared responsibility deal
2. SREs have time to make tomorrow better than today?
X Buried in toil… no capacity for engineering work to reduce toil.
3. SRE teams have the ability to regulate their workload?
X Buried in toil… no capacity for engineering work to reduce toil.
69. All work is contextual
rm -rf $PATHNAME
Is this dangerous?
Answer is always
“it depends”
John
Allspaw
70. Where are decisions made? Who can take action?
1° escalate 2° escalate 3° escalate 4°
Context
81. Low trust + approvals = illusion of control
Ticket
System
Add up the total number of approval requests and
…subtract the info radiators (“I need to be in the loop”)
…subtract the CYAs (“Prove you followed the process”)
…subtract the too removed to judge (“mostly guessing”)
How many are you left with?
How many were the right call?
How many got rejected?
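The subtraction on this slide can be sketched as a simple tally. This is a minimal sketch, assuming each approval request has already been labeled by how it was actually handled. The label names and the function are hypothetical and exist only for illustration.

```python
from collections import Counter
from typing import Iterable

# Hypothetical labels for approvals that are not real control points:
# info radiators, CYAs, and approvers too removed to judge.
NOT_REAL_CONTROLS = {"info_radiator", "cya", "too_removed_to_judge"}


def approvals_left(labels: Iterable[str]) -> int:
    """Count approval requests that remain after the three subtractions."""
    counts = Counter(labels)
    return sum(n for label, n in counts.items() if label not in NOT_REAL_CONTROLS)
```

Running the same tally on the rejected approvals, and on those that turned out to be the right call, answers the slide's remaining two questions.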
85. Low Trust Undermines SRE Principles
1. Org has Service Level Objectives, with consequences?
X Cultures of low trust have a really difficult time with shared responsibility
2. SREs have time to make tomorrow better than today?
X People closest to problems know what to fix but tasking, priorities, and decisions are largely out of their control
3. SRE teams have the ability to regulate their workload?
X People aren’t trusted to plan or design their own work
92. Lean on Lean to find what to fix
[Figure: value stream map with steps annotated PD, TS, W, EP, M, and ?, each paired with a Countermeasure]
1. Map the end-to-end flow of information and artifacts (using a recent delivery or event)
2. Identify what slows lead times, undermines quality, and impacts flow
3. Identify countermeasures and create improvement storyboards (justification/plan)
All processes should be studied with an improvement discipline
Incidents are just as much a “process” as delivery
Look to Lean for proven improvement techniques (value stream mapping, waste analysis, improvement kata)
Make it a part of your organization’s discipline
96. Get rid of as many silos as possible
Old Silo A, Old Silo B, Old Silo C, Old Silo D
Cross-Functional Team 1
Cross-Functional Team 2
Cross-Functional Team n
Key 1: get rid of as many handoffs as possible
Key 2: “Horizontal” shared responsibility, not everyone do everything!
99. Shared responsibility matters more than org model
“Netflix” Model: Cross-Functional Team 1, Cross-Functional Team 2, Cross-Functional Team n
“Google” Model: Development Team 1, Development Team 2, Development Team n, and an SRE Team (clear handoff requirements, error budget consequences)
Same high-quality, high-velocity results!
103. Why focus on getting rid of handoffs?
1. Your people are your most valuable assets
2. The SRE skillset is expensive
3. Stay out of their way!
109. SREs are expensive, stay out of their way!
[Figure: two arrangements of Backlog and Ticket Queues, labeled “Not this:” and “This:”]
Reduce friction in the SRE OODA Loop:
Observe: Invest in the right instrumentation
Orient: Invest in collaboration, checklists, investigatory tools
Decide: Empower them to make decisions!
Action: Empower them to take action!
112. What about the handoffs you can’t get rid of?
[Figure: Old Silos A through D reorganized into Cross-Functional Teams 1, 2, … n, with the remaining Specialist Capabilities still sitting behind Ticket Queues]
113. Operations as a Service: Turn handoffs into self-service
Cross-Functional Product Team 1, 2, … n, each with Ops (embedded)
Operations as a Service (On Demand)
Ops (builds & operates)
Ops Capability (SRE, Dev, or Specialist), repeated for each capability
114. Operations as a Service: Works with any org model
Development Team 1, Development Team 2, Development Team n, and an Ops/SRE Team
Operations as a Service (On Demand)
Ops (builds & operates)
Ops Capability (SRE, Dev, or Specialist), repeated for each capability
116. Operations as a Service: Popular Uses for SRE
Environment
“I could fix it, if I could get to it”
Environment
OaaS
118. Operations as a Service: Popular Uses for SRE
“Avoiding the dogpile”
Without OaaS: “I think it’s a problem with dbcluster07-store2.uswest.acme” and everyone runs “$ top” on dbcluster07-store2.uswest.acme
With OaaS: “I think it’s a problem with dbcluster07-store2.uswest.acme” and a single “Healthcheck store2 - all” job runs “$ top” on dbcluster07-store2.uswest.acme
120. Operations as a Service: Popular Uses for SRE
“I don’t read wikis. I’m an expert.”
Without OaaS:
docs: “Service has changed. This flag is now required or bad things will happen!”
docs: “Pause monitoring first or we all get woken up!”
Later… “I’ve done this before. I’ve got this.” runs “restart -doit -now” against the Environment
With OaaS:
“Service has changed. This flag is now required or bad things will happen!” and “Pause monitoring first or we all get woken up!” go into Update Restart Job ✅
Later… “I’ve done this before. I’ve got this.” runs “restart” through OaaS against the Environment
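The idea of folding documented steps into the restart job itself can be sketched as follows. This is a minimal sketch, assuming the three steps are supplied as plain callables. Every name here is hypothetical, and resuming monitoring afterwards is an added assumption that the slide does not state.

```python
from typing import Callable, List


def make_restart_job(
    pause_monitoring: Callable[[], None],
    restart_service: Callable[[List[str]], None],
    resume_monitoring: Callable[[], None],
    required_flags: List[str],
) -> Callable[[], None]:
    """Bundle the documented steps so that callers cannot skip them."""

    def restart() -> None:
        # Step from the docs: pause monitoring before touching the service.
        pause_monitoring()
        try:
            # The job owner keeps the required flags current, not the caller.
            restart_service(list(required_flags))
        finally:
            # Assumption: monitoring is resumed even if the restart fails.
            resume_monitoring()

    return restart
```

The caller still just asks for a restart. When the service changes, the job definition is updated once and every later run picks up the change.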
122. Operations as a Service: Popular Uses for SRE
“Uneven and hidden skills”
I don’t know
how to do X.
I know how
to do X.
I know how
to do Y.
I don’t know
how to do Y.
OaaS
“Do X”
“Define Y
Procedure”
“Define X
Procedure”
“Do Y”
“Do X+Y”
124. Operations as a Service: Popular Uses for SRE
“Let me do that for you again… and again”
Without OaaS: “I need you to do X” arrives as a Ticket on top of Other work. “Done.” Later… the same request again. “Done.” Later… and again. “Sigh.. Done.”
With OaaS: the requester runs “Do X” through OaaS each time, alongside Other work 1, Other work 2, and Other work 3
128. Use tickets only for what they are good for
1. Documenting true problems/issues/exceptions
2. Routing for necessary approvals
Not as a general purpose work management system!
Ticket
System
129. But won’t Security or Compliance stop you?
Cross-Functional Product Team 1, 2, … n, each with Ops (embedded)
Operations as a Service (On Demand)
Ops (builds & operates)
Ops Capability (SRE, Dev, or Specialist), repeated for each capability
Build in Security here
Build in Compliance here
135. But what about ITIL®?
• Ask ITIL people and they say SRE is ITIL compatible
• Ask people who have seen ITIL implemented and they say “how?”
• Agile+DevOps+SRE have self-regulation and shared responsibility features that seem to undermine ITIL’s command-and-control nature
• ITIL “Standard Change” is often the focus of discussion, but it still implies an approval model
• Straight talk: are we doing contortions to defend a sunk cost?
138. “Shift Left” the ability to take action
1° escalate 2° escalate 3° escalate 4°
Push the ability to take action this direction (toward 1°)
OaaS Enablement and tooling
142. Reduce Toil
1. Track toil levels for each team
2. Set toil limits for each team
3. Fund efforts to reduce toil (with emphasis on teams over toil limits)
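The three steps above can be sketched as a small report. This is a minimal sketch, assuming toil is tracked as hours per team over some period. The class, the function, and the default limit of 0.5 are illustrative assumptions, not values given in the talk.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TeamToil:
    team: str
    toil_hours: float          # step 1: tracked toil for the period
    total_hours: float         # total working hours for the same period
    toil_limit: float = 0.5    # step 2: assumed per-team cap on toil

    @property
    def toil_fraction(self) -> float:
        if self.total_hours <= 0:
            return 0.0
        return self.toil_hours / self.total_hours


def teams_to_fund_first(teams: List[TeamToil]) -> List[TeamToil]:
    """Step 3: teams over their limit, furthest over the limit first."""
    over_limit = [t for t in teams if t.toil_fraction > t.toil_limit]
    return sorted(over_limit, key=lambda t: t.toil_fraction - t.toil_limit, reverse=True)
```

Sorting by how far each team is over its own limit keeps the funding emphasis on the teams closest to having no capacity left for engineering work.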