I pushed in production :). Have a nice weekend

This talk is about SRE, availability and observability.

  1. Nunux Keeper, Nunux Reader
  2. THE FOLLOWING CONTENT HAS BEEN APPROVED FOR SITE RELIABILITY ENGINEERING. THIS TALK HAS BEEN INSPIRED BY GOOGLE WHITE PAPER.
     R (RESTRICTED): UNDER 18 REQUIRES SENIOR DEV GUARDIAN. CRUD AND JSON CONTENT, DEVOPS, GO LANGUAGE, SUPERVISED BEHAVIOR, BUZZWORDS, OBSERVABILITY, AND TOOLING INVOLVING SYSADMIN.
     landing.google.com/sre
  3. Operators: "I want Stability!" Developers: "I want Features!"
  4. DevOps in a nutshell
     • Reduce organization silos
     • Accept failure as normal
     • Implement gradual change
     • Leverage tooling and automation
     • Measure everything
  5. If you please, draw me a Software
  6. The requested NFR / operability was not found on this definition. That's all we know.
  7. def·i·ni·tion | ˌde-fə-ˈni-shən
     A non-functional requirement (NFR) is a requirement that specifies criteria that can be used to judge the operation of a system, rather than specific behaviors.
  8. Non-functional requirements
     • Availability: SLA compliance, fault-tolerant mechanisms
     • Performance: response time, profiling
     • Scalability: handle current and future loads, optimum use of resources
     • Auditing: events/entity tracking, notifications
     • Configurability: personalization per tenant, configuration management
     • Security: role/privilege-based access, data privacy and encryption
     • Extendability: event-driven architecture, code modularity
     • Monitoring: usage and technical metrics, application health check
  9. Availability of service: Define, Measure, Sale
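  10. Availability Chart

      | SLO     | Unavailability per year | Unavailability per quarter | Unavailability per month (30 days) |
      |---------|-------------------------|----------------------------|------------------------------------|
      | 95%     | 18d                     | 4.5d                       | 1.5d                               |
      | 99%     | 3d                      | 21h                        | 7h                                 |
      | 99.9%   | 8h                      | 2h                         | 43m                                |
      | 99.95%  | 4h                      | 1h                         | 21m                                |
      | 99.99%  | 52m                     | 12m                        | 4m                                 |
      | 99.999% | 5m                      | 1m                         | 25s                                |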
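The figures in the chart follow from simple arithmetic: the allowed downtime is the fraction of time not covered by the SLO, multiplied by the length of the period. A minimal sketch, assuming a 365-day year and truncation to the largest unit (the function names are illustrative, not from the talk):

```python
def downtime_budget_seconds(slo_percent: float, period_days: float) -> float:
    """Allowed unavailability, in seconds, for an SLO over a period."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    budget = (1 - slo_percent / 100) * period_days * 86400
    # Rounding hides floating-point noise such as 259199.99999999994.
    return round(budget, 6)


def humanize(seconds: float) -> str:
    """Render a duration with its largest unit, truncated rather than rounded."""
    for unit, size in (("d", 86400), ("h", 3600), ("m", 60)):
        if seconds >= size:
            return f"{int(seconds // size)}{unit}"
    return f"{int(seconds)}s"


# Example: the monthly (30 days) budget of a 99.9% SLO.
monthly_budget = humanize(downtime_budget_seconds(99.9, 30))
```

The same two calls reproduce the other rows, for instance `downtime_budget_seconds(99, 365)` for the yearly budget of a 99% SLO.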
  11. "To be monitored, you have to make yourself observable." -me
  12. Observability: Structured Logging, Metrics, Traces
  13. $ cat /var/log/server.log | grep "error"
  14. Structured Logging
      • Normalized
      • Key/Value for context
      • Structured format (JSON may be an option)
      THEN
  15. Levels
      • DEBUG: Whatever you need to trace your business logic. NOT IN PRODUCTION!
      • INFO: Everything related to the business logic that is truly worth monitoring. Ex: business transactions, business errors, …
      • WARN: Any unexpected behavior or error that can be overcome. → An action may be required. Ex: configuration weakness, service unavailable but fallback is OK, inconsistent data, …
      • ERROR: All unexpected errors, with ALL the context we are able to fetch! → An action is required! Ex: service unavailable, data loss, …
      • FATAL: Unexpected error that means the system can't continue to work. → SIGKILL. Ex: database connection lost, disk full, …
  16. The context = Behavior understanding & Efficient troubleshooting
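Normalized, key/value, JSON-formatted logs with levels can be sketched with the Python standard library alone. This is one possible shape, not the setup used in the talk: the `context` key passed through `extra`, the field names and the `payment` logger are all assumptions made for the example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Key/value context passed with extra={"context": {...}} (our convention).
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry, sort_keys=True)


def build_logger(name: str, stream) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream` at INFO and above."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)  # DEBUG stays out of production
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger("payment", buffer)
log.debug("cart recomputed")  # dropped: below INFO
log.info("payment accepted", extra={"context": {"tenant": "acme", "amount_eur": 42}})
```

Because every line is a self-describing JSON object, a collector can index `tenant` or `amount_eur` directly instead of relying on `grep`.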
  17. About Error Level
      • Create your own exceptions with the cause and… the context!
      • Suggest an action when possible.
      • Stack traces should only be tolerated for unexpected errors.
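An exception that carries its cause, its context and a suggested action might look like the sketch below. `PaymentError`, `charge` and the gateway callables are hypothetical names invented for the illustration.

```python
class PaymentError(Exception):
    """Business exception carrying key/value context and a suggested action."""

    def __init__(self, message, *, context=None, action=None):
        super().__init__(message)
        self.context = dict(context or {})
        self.action = action

    def to_log_fields(self):
        """Flatten the error into key/value pairs for a structured logger."""
        fields = {"error": str(self), **self.context}
        if self.action:
            fields["action"] = self.action
        if self.__cause__ is not None:
            fields["cause"] = repr(self.__cause__)
        return fields


def charge(order_id, gateway):
    """Call the gateway and translate low-level failures into PaymentError."""
    try:
        return gateway(order_id)
    except ConnectionError as exc:
        # "from exc" keeps the original error as the cause.
        raise PaymentError(
            "payment gateway unreachable",
            context={"order_id": order_id},
            action="retry later or switch to the fallback gateway",
        ) from exc


def broken_gateway(order_id):
    raise ConnectionError("timeout")


try:
    charge("o-1", broken_gateway)
except PaymentError as err:
    fields = err.to_log_fields()
```

The resulting `fields` dictionary can be logged as is, so the operator sees the order, the cause and the suggested action without reading a stack trace.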
  18. Performance
      • Beware of the performance impact!
      • I/O can be saturated, and consequently so can YOUR application
      • An asynchronous logger may be an option… in some cases
      → Be concise!
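As an illustration of the asynchronous option, Python's standard `logging.handlers.QueueHandler` and `QueueListener` move the slow I/O to a background thread. This is a sketch only: `build_async_logger` is an illustrative helper, and the unbounded queue is an assumption that trades memory for never blocking the caller.

```python
import io
import logging
import logging.handlers
import queue


def build_async_logger(name, sink_handler):
    """Return (logger, listener); the caller only pays for a queue put."""
    log_queue = queue.Queue(-1)  # unbounded: never blocks, but can grow
    listener = logging.handlers.QueueListener(log_queue, sink_handler)
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    listener.start()  # background thread drains the queue into the sink
    return logger, listener


buffer = io.StringIO()  # stands in for a slow file or network sink
logger, listener = build_async_logger("async-demo", logging.StreamHandler(buffer))
logger.info("order shipped")
listener.stop()  # drains what is still queued, then joins the thread
```

The "in some cases" caveat still applies: records that are queued when the process is killed are lost, so `listener.stop()` belongs in the shutdown path.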
  19. Structured Logging: Log production → Collector → DB → Query and Visualization
  20. Goals
      • Centralized system
      • Exploration tool
      • Alerting
      • Troubleshooting
      • And Dashboarding (the sexy part)
  21. Goals?
      • And metrics… you should not(*)
      (*) Well… not in that way
  22. 4 Golden Signals
      • LATENCY: Measures the duration of a unit of work
      • TRAFFIC: Measures how much work gets done per unit of time
      • ERRORS: Measures all occurrences of errors
      • SATURATION: Measures resource consumption
  23. Measuring everything!
      • System: CPU, Memory, I/O (disk/net), Runtime (Docker, JVM, Node, Go, …), Services (MySQL, PostgreSQL, Kafka, …)
      • Business events: Business transactions, Business errors, Object lifecycle, External calls (DB, API, …)
  24. Metric Types(*)
      • Counter: A value that only increases over time. Ex: number of payments
      • Gauge: A value that moves up or down over time. Ex: CPU usage
      • Histogram: Samples observations. Ex: request duration, response size
      • Summary: Samples observations, with a total count and a sum of all observed values. Ex: request duration, response size
      (*) Depending on the protocol used (statsd, InfluxDB Line Protocol, Prometheus…)
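To make the metric types concrete, here is a deliberately small in-memory sketch. It is not the API of statsd, InfluxDB or Prometheus; the class names are illustrative, and the histogram keeps non-cumulative buckets plus the count and sum mentioned for summaries.

```python
import bisect


class Counter:
    """A value that only goes up, e.g. number of payments."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """A value that moves up or down, e.g. CPU usage."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations per bucket and tracks their count and sum."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One bucket per upper bound, plus a last one for everything above.
        self.buckets = [0] * (len(self.upper_bounds) + 1)
        self.count = 0
        self.sum = 0.0

    def observe(self, value):
        # bisect_left puts value == bound into that bound's bucket ("<=").
        self.buckets[bisect.bisect_left(self.upper_bounds, value)] += 1
        self.count += 1
        self.sum += value


payments = Counter()
payments.inc()

request_duration = Histogram([0.1, 0.5, 1.0])  # seconds
for seconds in (0.05, 0.3, 0.1, 2.0):
    request_duration.observe(seconds)
```

Latency and traffic map naturally onto the histogram and the counter; saturation is usually a gauge.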
  25. With metadata(*)
      • Tags
      • Fields
      • Geographical coordinates
      Ex: Datacenter ID, Server name, Service name, Tenant, Payment method, …
      (*) Depending on the protocol used (statsd, InfluxDB Line Protocol, Prometheus…)
  26. Push vs Pull
      • Push: Pushes the metrics as they arrive (or in batches). Easy to implement. Data loss if the collector is unavailable. Ex: InfluxDB, StatsD
      • Pull: Gathers the metrics in memory and exposes an endpoint for an external collector. Platform independence. Doesn't work well for event-driven time series. Ex: Borg, Prometheus
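The trade-off between the two models can be shown with a toy example. The classes below are illustrative only: `PushClient` formats counters as statsd-style `name:value|c` lines and hands them to a `send` callable supplied by the caller, while `Registry` keeps values in memory and renders them when a collector asks.

```python
class PushClient:
    """Push model: each event is sent to the collector as it happens."""

    def __init__(self, send):
        self._send = send  # e.g. a UDP socket's send method (assumption)
        self.dropped = 0

    def inc(self, name, amount=1):
        try:
            self._send(f"{name}:{amount}|c")
        except OSError:
            # Collector unavailable: the data point is lost.
            self.dropped += 1


class Registry:
    """Pull model: values stay in memory until a collector scrapes them."""

    def __init__(self):
        self._counters = {}

    def inc(self, name, amount=1):
        self._counters[name] = self._counters.get(name, 0) + amount

    def render(self):
        """Text that an HTTP metrics endpoint could return."""
        return "\n".join(
            f"{name} {value}" for name, value in sorted(self._counters.items())
        )


def collector_down(line):
    raise ConnectionRefusedError("collector unavailable")


push = PushClient(collector_down)
push.inc("payments")  # silently lost

registry = Registry()
registry.inc("payments_total")
registry.inc("payments_total")  # still there for the next scrape
```

With the pull model a missed scrape only delays the data, whereas the push client above has no way to recover what it dropped.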
  27. Metrics: Metrics production → Collector → TSDB → Query and Visualization, Alerting
  28. Traces
      Or the ability to follow a transaction internally AND in a distributed architecture
      • Ideal for spotting:
        – Bottlenecks
        – Complex call hierarchies
        – Non-optimized code
        – Cross-system transactions
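The core idea behind a trace is that every unit of work records a span sharing one trace identifier and pointing to its parent. The sketch below is a toy tracer, not the OpenTracing API; the `Tracer` class, its field names and the `x-trace-id` header are assumptions, and it is not thread-safe.

```python
import time
import uuid
from contextlib import contextmanager


class Tracer:
    """Collects finished spans; nesting is tracked with a simple stack."""

    def __init__(self):
        self.finished = []
        self._stack = []

    @contextmanager
    def span(self, name, trace_id=None):
        parent = self._stack[-1] if self._stack else None
        record = {
            "name": name,
            # Children inherit the trace id; a root span creates or receives one.
            "trace_id": parent["trace_id"] if parent else (trace_id or uuid.uuid4().hex),
            "span_id": uuid.uuid4().hex[:16],
            "parent_id": parent["span_id"] if parent else None,
        }
        self._stack.append(record)
        start = time.perf_counter()
        try:
            yield record
        finally:
            record["duration_s"] = time.perf_counter() - start
            self._stack.pop()
            self.finished.append(record)


tracer = Tracer()
with tracer.span("POST /payments") as root:
    with tracer.span("db.insert"):
        pass
    with tracer.span("call fraud-service") as child:
        # Propagate the trace id to the next service (hypothetical header name).
        outgoing_headers = {"x-trace-id": child["trace_id"]}
```

Sorting `tracer.finished` by `duration_s` is enough to spot the bottleneck in a single request; the receiving service would pass the header value as `trace_id` to continue the same trace.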
  29. https://opentracing.io/
  30. Traces: Trace production → Transport → DB → Visualization
  31. Healthcheck
      Or the ability to say that your service is available for your end users.
      Basically: smoke tests in production
  32. Whitebox vs Blackbox
      • Gather the health status of:
        – Internals: services, DB connection, FS, Memory (Heap), thread pools, …
        – And dependencies: external APIs, …
        Gentle reminder: your SLO depends on others' SLOs
      • And send them to a metrics system.
      • Expose a summary of your health status
        – Using a standard endpoint (/healthz)
        – With a smoke test scenario
        – Tip: cache those values
      • And use an external probe.
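A whitebox health summary with cached results could be sketched as follows. The `HealthCheck` class, the check names and the 5-second TTL are assumptions for the example; `http_response` only returns the status code and body that a `/healthz` handler would send, without binding to any web framework.

```python
import time


class HealthCheck:
    """Aggregates named checks and caches the report for `ttl_s` seconds."""

    def __init__(self, checks, ttl_s=5.0, clock=time.monotonic):
        self._checks = checks  # name -> callable returning True/False
        self._ttl_s = ttl_s
        self._clock = clock
        self._cached = None
        self._expires_at = 0.0

    def status(self):
        now = self._clock()
        if self._cached is None or now >= self._expires_at:
            details = {}
            for name, check in self._checks.items():
                try:
                    details[name] = "up" if check() else "down"
                except Exception:
                    # A check that raises counts as a failed dependency.
                    details[name] = "down"
            overall = "up" if all(v == "up" for v in details.values()) else "down"
            self._cached = {"status": overall, "checks": details}
            self._expires_at = now + self._ttl_s
        return self._cached

    def http_response(self):
        """(status code, body) for a /healthz endpoint."""
        report = self.status()
        return (200 if report["status"] == "up" else 503), report


def db_connection_ok():
    return True


def fraud_api_ok():
    raise TimeoutError("no answer from the external API")


health = HealthCheck({"db": db_connection_ok, "fraud_api": fraud_api_ok})
code, body = health.http_response()
```

Caching keeps a frequent external probe from turning the health endpoint into a load generator for the database and the dependencies it checks.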
  33. Timeouts everywhere!
      • Timeout propagation
      • Circuit breaker
      Call chain: REST API → Service → DB / External Service
      Timeouts at each level: Global Timeout, Service Timeout, DB Timeout, Ext Service Timeout
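Both bullets can be illustrated with a few lines each. In this sketch (all names are illustrative), `Deadline` carries the global timeout down the call chain so that no inner call waits longer than what is left, and `CircuitBreaker` stops calling a dependency after repeated failures until a cool-down has elapsed.

```python
import time


class Deadline:
    """Global time budget shared by every call made for one request."""

    def __init__(self, budget_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def timeout_for(self, own_limit_s):
        """Timeout for the next call: its own limit, capped by what is left."""
        left = self._expires_at - self._clock()
        if left <= 0:
            raise TimeoutError("global deadline exceeded")
        return min(own_limit_s, left)


class CircuitOpenError(RuntimeError):
    pass


class CircuitBreaker:
    """Opens after `max_failures` consecutive errors, retries after a delay."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self._max_failures = max_failures
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                raise CircuitOpenError("circuit open, call skipped")
            self._opened_at = None  # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result


deadline = Deadline(budget_s=2.0)            # Global Timeout at the REST API
db_timeout = deadline.timeout_for(0.5)       # DB Timeout
ext_timeout = deadline.timeout_for(1.0)      # Ext Service Timeout
```

The values returned by `timeout_for` are meant to be passed as the `timeout` argument of whatever client performs the actual call.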
  34. Healthcheck: Health status (eclipse/microprofile-health, docker/go-healthcheck) → External probes → TSDB
  35. Error Metric ++
      • Be notified… before the client
      • Trace errors with:
        – Context
        – Code binding
        – User feedback
      • Manage the error lifecycle:
        – Identification
        – Correction
        – Validation
        – Postmortem
  36. Error Tracking: Error emission → DB → Management
  37. This talk is TLS*
      * Too Long, Slept
  38. Non-functional requirements
      • Availability: SLA compliance, fault-tolerant mechanisms
      • Performance: response time, profiling
      • Scalability: handle current and future loads, optimum use of resources
      • Auditing: events/entity tracking, notifications
      • Configurability: personalization per tenant, configuration management
      • Security: role/privilege-based access, data privacy and encryption
      • Extendability: event-driven architecture, code modularity
      • Monitoring: usage and technical metrics, application health check
  39. SLO reminder
      • 100% availability is not a realistic target!
      • Fault tolerance has a cost in terms of money, time and agility
      • We have to accept some degree of risk in order to deliver features
      • This risk should be in line with end-user satisfaction
  40. Observability reminder
  41. And…
      • Logs are for production
      • Track and hunt ALL errors
      • Put proper health checks in place
      • Set ALL your timeouts
      • Automation and tooling are the key to velocity
      • Think User Experience!
  42. Atos, the Atos logo, Atos Codex, Atos Consulting, Atos Worldgrid, Bull, Canopy, equensWorldline, Unify, Worldline and Zero Email are registered trademarks of the Atos group. March 2017. © 2017 Atos. Confidential information owned by Atos, to be used by the recipient only. This document, or any part of it, may not be reproduced, copied, circulated and/or distributed nor quoted without prior written approval from Atos.
      Thanks. For more information please contact: @ncarlier, github.com/ncarlier