Ce diaporama a bien été signalé.
Nous utilisons votre profil LinkedIn et vos données d’activité pour vous proposer des publicités personnalisées et pertinentes. Vous pouvez changer vos préférences de publicités à tout moment.
Practical Operability
Techniques for Teams
Matthew Skelton
Conflux
confluxdigital.net / @ConfluxHQ
Continuous Lifecycle Lo...
Practical operability
Why do we need a focus on
operability?
5 practical operability
techniques to try
2
use modern logging, Run Book
dialogue sheets, endpoint
healthchecks, correlation IDs,
and user personas as
team collaborat...
About me
Matthew Skelton, Conflux
@matthewpskelton
matthewskelton.net
Leeds, UK
4
assemblyconf.com
Team Guide to
Software Operability
Matthew Skelton & Rob Thatcher
skeltonthatcher.com/publications
Download a free sample ...
You?
Software Developer
Tester / QA
DevOps Engineer
Team Leader
Head of Department
6
Practical
Operability Techniques
for Teams
7
Operability
making software work well
in Production
8
Operability
9
“But can’t we just give
those things to a DevOp?”
10
“But can’t we just give
those things to the
DevOp?”
11
Operability is
a shared concern
#BizDevTestSecOps
12
Operability is
a shared concern
#BizDevTestSecOps
13
14
15
A self-managed Kubernetes
cluster near you
Operability is
a shared concern
#BizDevTestSecOps
16
Practical operability techniques
1. Modern logging
2. Run Book dialogue sheets
3. Endpoint healthchecks
4. Correlation IDs...
Lack of observability for
distributed systems
18
19
Logging with Event IDs
Modern logging w/ Event IDs
Distinct application states
No “logorrhoea” (!)
Distributed tracing via logs
Build a shared un...
search by event
Event ID
{Delivered,
InTransit,
Arrived}
21
transaction
trace
Correlation ID
612999958…
22
How many distinct event
types (state transitions) in
your application?
23
24
represent distinct states
25
enum
Human-readable sets:
unique values, sparse,
immutable
C#, Java, Python, node
(Ruby, PHP, …)
26
public enum EventID
{
// Badly-initialised logging data
NotSet = 0,
// An unrecognised event has occurred
UnexpectedError ...
BasketItemAdded = 60001
BasketItemRemoved = 60002
28
log using Event IDs
with a modern
‘structured logging’ library
29
example:
https://github.com/EqualExperts/opslogger
Sean Reilly
@seanjreilly 30
Example: video processing
On-demand processing of TV and
mobile streaming adverts
Ad-agency → TV broadcaster
High throughp...
Storage I/O
Worker Job
Queue
Upload
32
33
34
Example: video processing
Discover processing bottlenecks
Trigger alerts
Report on KPIs
Target areas for improvement
35
Modern logging w/ Event IDs
reduce time to detect problems
increase team engagement
increase configurability
enhance colla...
Modern Logging:
Collaborate on Event IDs
and Correlation traces for
better system awareness
37
Operational aspects not
addressed, or addressed
too late in the cycle
38
Run Book dialogue sheets
Run Book dialogue sheets
Checklists for typical
operational considerations
Team-friendly exploration
40
runbooktemplate.infoRun Book dialogue sheets
41
System characteristics
Hours of operation
During what hours does the service or system actually need to operate? Can porti...
http://runbooktemplate.info/
43
runbooktemplate.infoRun Book dialogue sheets
44
Run Book dialogue sheets
Early discovery of
operational requirements
Input to team backlog
“Shift-left” testing
Avoid oper...
Run Book dialogue sheets:
Collaborate on operational
requirements for better
system awareness
46
“Why has my deployment
failed again?”
“Why is Pre-Prod always
so flaky?”
47
Endpoint healthchecks
Endpoint healthchecks
Simple HTTP check
Common way to assess any
service/app/component
Key operational requirement
49
endpoint healthchecks
Every runnable app/service/daemon
exposes /status/health
An HTTP GET to the endpoint returns:
200 – ...
endpoint healthchecks
Each component is responsible for
determining its own health and
viability – this is very contextual...
endpoint healthchecks
Use JSON as a response type –
parsable by both
machines and humans!
52
53
endpoint healthchecks
For databases and other non-HTTP
components, run a lightweight HTTP
service in front of the componen...
Helper service
55
https://github.com/Lugribossk/simple-dashboard
56
57
Question:
What does this look like for
Serverless?
¯_(ツ)_/¯
Endpoint healthchecks
Rapid dashboarding and
diagnosis
Reduce confusion around
environment state
“Fail fast” aka “learn so...
Endpoint healthchecks:
Collaborate on component
health status for better
system awareness
59
“Which containers
[servers] handled the
request?”
60
Correlation IDs
Correlation IDs
Unique-ish identifiers
Trace calls across machine &
container boundaries
Re-assemble HTTP
request/response...
‘Unique-ish’ identifier for each request
Passed through downstream layers
63
Unique-ish ID
64
Synchronous HTTP:
X-HEADER e.g. X-trace-id
X-trace-id: 348e1cf8
If header is present, pass it on
(Yes, RFC6648, but this i...
Asynchonous (queues, etc.):
Message Attributes, name:value pair
e.g. "trace-id":"348e1cf8"
AWS SQS: SendMessage() / Receiv...
Example: OpenTracing / PCF
3 tracing elements:
TraceID, SpanID, ParentSpan
"X-B3-TraceId" "X-B3-SpanId"
"X-B3-ParentSpan"
...
Example: OpenTracing / PCF
Always log the TraceID as-is
Log calling SpanID as ParentSpan
Log new SpanID
68
Trace
Span
ParentSpan
69
Correlation IDs
Detect bottlenecks and
unexpected interactions
Increase transparency
Learn about the system
70
Correlation IDs:
Collaborate on distributed
tracing for better system
awareness
71
Software is difficult to
operate: poor UX for Ops.
72
Lightweight user personas
Lightweight User Personas
Simple characterisation of
user needs for Dev/Test/Ops
Based on full UX user
personas but less d...
Lightweight user personas:
Ops Engineer
Test Engineer
Build & Deployment Engineer
Service Owner
75
Lightweight user personas:
Consider the User Experience (UX) of
engineers and team members using
and working with the soft...
http://www.keepitusable.com/blog/?tag=alan-cooper
77
Motivations
Goals
Frustrations
Lightweight user personas:
What data does the User Persona need
visible on a dashboard in order to
make decisions rapidly ...
https://www.geckoboard.com/blog/visualisation-upgrades-progressing-towards-a-more-useful-and-beautiful-dashboard/ 79
Lightweight User Personas
Empathise better with people
from other roles
Capture missing operational
requirements
80
Lightweight User Personas:
Collaborate on user needs
for better
system awareness
81
Summary
82
Operability
making software work well
in Production
83
84
Lack of observability
Operational aspects not known
“Why has deployment failed?”
What handled the request?
Poor UX for Ops...
Logging with Event IDs
use enum-based Event IDs to
explore runtime behaviour and
fault conditions
86
Run Book dialogue sheets
explore and establish
operational requirements as a
team, around a physical table,
together
87
Endpoint healthchecks
HTTP 200 / 500 responses to
/status/health call with JSON
details – good for tools and
humans
88
Correlation IDs
trace execution using
correlation IDs:
synchronous (HTTP X-trace-id)
async (SQS MessageAttribute)
89
Lightweight user personas
explore the UX and needs of
different roles for rapid
decisions via dashboards
90
use modern logging, Run Book
dialogue sheets, endpoint
healthchecks, correlation IDs,
and user personas as
team collaborat...
Team Guide to
Software Operability
Matthew Skelton & Rob Thatcher
skeltonthatcher.com/publications
Download a free sample ...
Resources
•Team Guide to Software Operability by Matthew
Skelton and Rob Thatcher (Skelton Thatcher
Publications, 2016) ht...
thank you
@ConfluxHQ
confluxdigital.net
Prochain SlideShare
Chargement dans…5
×

Practical operability techniques for teams - Matthew Skelton - Conflux - Continuous Lifecycle London 2018

779 vues

Publié le

In this talk, we explore five practical, tried-and-tested, real world techniques for improving operability with many kinds of software systems, including cloud, Serverless, on-premise, and IoT.

Logging as a live diagnostics vector with sparse Event IDs
Operational checklists and ‘Run Book dialogue sheets’ as a discovery mechanism for teams
Endpoint healthchecks as a way to assess runtime dependencies and complexity
Correlation IDs beyond simple HTTP calls
Lightweight ‘User Personas’ as drivers for operational dashboards
Based on our work in many industry sectors, we will share our experience of helping teams to improve the operability of their software systems through

Required audience experience

Some experience of building web-scale systems or industrial IoT/embedded systems would be helpful.

Objective of the talk

We will share our experience of helping teams to improve the operability of their software systems. Attendees will learn some practical operability approaches and how teams can expand their understanding and awareness of operability through these simple, team-friendly techniques.

From a talk given at Continuous Lifecycle London 2018: https://continuouslifecycle.london/sessions/practical-team-focused-operability-techniques-for-distributed-systems/

Publié dans : Logiciels
  • Soyez le premier à commenter

Practical operability techniques for teams - Matthew Skelton - Conflux - Continuous Lifecycle London 2018

  1. 1. Practical Operability Techniques for Teams Matthew Skelton Conflux confluxdigital.net / @ConfluxHQ Continuous Lifecycle London, 17 May 2018
  2. 2. Practical operability Why do we need a focus on operability? 5 practical operability techniques to try 2
  3. 3. use modern logging, Run Book dialogue sheets, endpoint healthchecks, correlation IDs, and user personas as team collaboration techniques 3
  4. 4. About me Matthew Skelton, Conflux @matthewpskelton matthewskelton.net Leeds, UK 4 assemblyconf.com
  5. 5. Team Guide to Software Operability Matthew Skelton & Rob Thatcher skeltonthatcher.com/publications Download a free sample chapter 5
  6. 6. You? Software Developer Tester / QA DevOps Engineer Team Leader Head of Department 6
  7. 7. Practical Operability Techniques for Teams 7
  8. 8. Operability making software work well in Production 8
  9. 9. Operability 9
  10. 10. “But can’t we just give those things to a DevOp?” 10
  11. 11. “But can’t we just give those things to the DevOp?” 11
  12. 12. Operability is a shared concern #BizDevTestSecOps 12
  13. 13. Operability is a shared concern #BizDevTestSecOps 13
  14. 14. 14
  15. 15. 15 A self-managed Kubernetes cluster near you
  16. 16. Operability is a shared concern #BizDevTestSecOps 16
  17. 17. Practical operability techniques 1. Modern logging 2. Run Book dialogue sheets 3. Endpoint healthchecks 4. Correlation IDs 5. Lightweight User Personas 17
  18. 18. Lack of observability for distributed systems 18
  19. 19. 19 Logging with Event IDs
  20. 20. Modern logging w/ Event IDs Distinct application states No “logorrhoea” (!) Distributed tracing via logs Build a shared understanding 20
  21. 21. search by event Event ID {Delivered, InTransit, Arrived} 21
  22. 22. transaction trace Correlation ID 612999958… 22
  23. 23. How many distinct event types (state transitions) in your application? 23
  24. 24. 24
  25. 25. represent distinct states 25
  26. 26. enum Human-readable sets: unique values, sparse, immutable C#, Java, Python, node (Ruby, PHP, …) 26
  27. 27. public enum EventID { // Badly-initialised logging data NotSet = 0, // An unrecognised event has occurred UnexpectedError = 10000, ApplicationStarted = 20000, ApplicationShutdownNoticeReceived = 20001, MessageQueued = 40000, MessagePeeked = 40001, BasketItemAdded = 60001, BasketItemRemoved = 60002, CreditCardDetailsSubmitted = 70001, // ... } 27
  28. 28. BasketItemAdded = 60001 BasketItemRemoved = 60002 28
  29. 29. log using Event IDs with a modern ‘structured logging’ library 29
  30. 30. example: https://github.com/EqualExperts/opslogger Sean Reilly @seanjreilly 30
  31. 31. Example: video processing On-demand processing of TV and mobile streaming adverts Ad-agency → TV broadcaster High throughput Glitch-free video & audio 31
  32. 32. Storage I/O Worker Job Queue Upload 32
  33. 33. 33
  34. 34. 34
  35. 35. Example: video processing Discover processing bottlenecks Trigger alerts Report on KPIs Target areas for improvement 35
  36. 36. Modern logging w/ Event IDs reduce time to detect problems increase team engagement increase configurability enhance collaboration 36
  37. 37. Modern Logging: Collaborate on Event IDs and Correlation traces for better system awareness 37
  38. 38. Operational aspects not addressed, or addressed too late in the cycle 38
  39. 39. Run Book dialogue sheets
  40. 40. Run Book dialogue sheets Checklists for typical operational considerations Team-friendly exploration 40
  41. 41. runbooktemplate.infoRun Book dialogue sheets 41
  42. 42. System characteristics Hours of operation During what hours does the service or system actually need to operate? Can portions or features of the system be unavailable at times if needed? Hours of operation - core features (e.g. 03:00-01:00 GMT+0) Hours of operation - secondary features (e.g. 07:00-23:00 GMT+0) Data and processing flows How and where does data flow through the system? What controls or triggers data flows? (e.g. mobile requests / scheduled batch jobs / inbound IoT sensor data ) … 42
  43. 43. http://runbooktemplate.info/ 43
  44. 44. runbooktemplate.infoRun Book dialogue sheets 44
  45. 45. Run Book dialogue sheets Early discovery of operational requirements Input to team backlog “Shift-left” testing Avoid operational problems 45
  46. 46. Run Book dialogue sheets: Collaborate on operational requirements for better system awareness 46
  47. 47. “Why has my deployment failed again?” “Why is Pre-Prod always so flaky?” 47
  48. 48. Endpoint healthchecks
  49. 49. Endpoint healthchecks Simple HTTP check Common way to assess any service/app/component Key operational requirement 49
  50. 50. endpoint healthchecks Every runnable app/service/daemon exposes /status/health An HTTP GET to the endpoint returns: 200 – "I am healthy" 500 – "I am sick" 50
  51. 51. endpoint healthchecks Each component is responsible for determining its own health and viability – this is very contextual 51
  52. 52. endpoint healthchecks Use JSON as a response type – parsable by both machines and humans! 52
  53. 53. 53
  54. 54. endpoint healthchecks For databases and other non-HTTP components, run a lightweight HTTP service in front of the component 200 / 500 responses 54
  55. 55. Helper service 55
  56. 56. https://github.com/Lugribossk/simple-dashboard 56
  57. 57. 57 Question: What does this look like for Serverless? ¯_(ツ)_/¯
  58. 58. Endpoint healthchecks Rapid dashboarding and diagnosis Reduce confusion around environment state “Fail fast” aka “learn sooner” 58
  59. 59. Endpoint healthchecks: Collaborate on component health status for better system awareness 59
  60. 60. “Which containers [servers] handled the request?” 60
  61. 61. Correlation IDs
  62. 62. Correlation IDs Unique-ish identifiers Trace calls across machine & container boundaries Re-assemble HTTP request/response later 62
  63. 63. ‘Unique-ish’ identifier for each request Passed through downstream layers 63
  64. 64. Unique-ish ID 64
  65. 65. Synchronous HTTP: X-HEADER e.g. X-trace-id X-trace-id: 348e1cf8 If header is present, pass it on (Yes, RFC6648, but this is internal only) 65
  66. 66. Asynchonous (queues, etc.): Message Attributes, name:value pair e.g. "trace-id":"348e1cf8" AWS SQS: SendMessage() / ReceiveMessage() Log the Correlation ID if present 66
  67. 67. Example: OpenTracing / PCF 3 tracing elements: TraceID, SpanID, ParentSpan "X-B3-TraceId" "X-B3-SpanId" "X-B3-ParentSpan" 67
  68. 68. Example: OpenTracing / PCF Always log the TraceID as-is Log calling SpanID as ParentSpan Log new SpanID 68
  69. 69. Trace Span ParentSpan 69
  70. 70. Correlation IDs Detect bottlenecks and unexpected interactions Increase transparency Learn about the system 70
  71. 71. Correlation IDs: Collaborate on distributed tracing for better system awareness 71
  72. 72. Software is difficult to operate: poor UX for Ops. 72
  73. 73. Lightweight user personas
  74. 74. Lightweight User Personas Simple characterisation of user needs for Dev/Test/Ops Based on full UX user personas but less detailed 74
  75. 75. Lightweight user personas: Ops Engineer Test Engineer Build & Deployment Engineer Service Owner 75
  76. 76. Lightweight user personas: Consider the User Experience (UX) of engineers and team members using and working with the software 76
  77. 77. http://www.keepitusable.com/blog/?tag=alan-cooper 77 Motivations Goals Frustrations
  78. 78. Lightweight user personas: What data does the User Persona need visible on a dashboard in order to make decisions rapidly & safely? 78
  79. 79. https://www.geckoboard.com/blog/visualisation-upgrades-progressing-towards-a-more-useful-and-beautiful-dashboard/ 79
  80. 80. Lightweight User Personas Empathise better with people from other roles Capture missing operational requirements 80
  81. 81. Lightweight User Personas: Collaborate on user needs for better system awareness 81
  82. 82. Summary 82
  83. 83. Operability making software work well in Production 83
  84. 84. 84
  85. 85. Lack of observability Operational aspects not known “Why has deployment failed?” What handled the request? Poor UX for Ops 85
  86. 86. Logging with Event IDs use enum-based Event IDs to explore runtime behaviour and fault conditions 86
  87. 87. Run Book dialogue sheets explore and establish operational requirements as a team, around a physical table, together 87
  88. 88. Endpoint healthchecks HTTP 200 / 500 responses to /status/health call with JSON details – good for tools and humans 88
  89. 89. Correlation IDs trace execution using correlation IDs: synchronous (HTTP X-trace-id) async (SQS MessageAttribute) 89
  90. 90. Lightweight user personas explore the UX and needs of different roles for rapid decisions via dashboards 90
  91. 91. use modern logging, Run Book dialogue sheets, endpoint healthchecks, correlation IDs, and user personas as team collaboration techniques 91
  92. 92. Team Guide to Software Operability Matthew Skelton & Rob Thatcher skeltonthatcher.com/publications Download a free sample chapter 92
  93. 93. Resources •Team Guide to Software Operability by Matthew Skelton and Rob Thatcher (Skelton Thatcher Publications, 2016) http://operabilitybook.com/ •Run Book template & Run Book dialogue sheets http://runbooktemplate.info/ 93
  94. 94. thank you @ConfluxHQ confluxdigital.net

×