
I pushed in production :). Have a nice weekend

This talk is about SRE, availability and observability.

Published in: Technology

  1. Nunux Keeper / Nunux Reader
  3. Operators: I want Stability! Developers: I want Features!
  4. DevOps in a nutshell • Reduce organizational silos • Accept failure as normal • Implement gradual change • Leverage tooling and automation • Measure everything
  5. If you please, Draw me a Software
  6. The requested NFR / operability was not found on this definition. That’s all we know.
  7. def·i·ni·tion | ˌde-fə-ˈni-shən: A non-functional requirement (NFR) is a requirement that specifies criteria that can be used to judge the operation of a system, rather than specific behaviors.
  8. Availability (SLA compliance, fault-tolerant mechanisms) • Performance (response time, profiling) • Scalability (handle current and future loads, optimum use of resources) • Auditing (events/entity tracking, notifications) • Configurability (personalization per tenant, configuration management) • Security (role/privilege-based access, data privacy and encryption) • Extendability (event-driven architecture, code modularity) • Monitoring (usage and technical metrics, application health check)
  9. Define, Measure, Sell: availability of service
  10. Availability chart
      SLO     | per year | per quarter | per month (30 days)
      95%     | 18d      | 4d 13h      | 36h
      99%     | 3d 15h   | 21h         | 7h
      99.9%   | 8h 46m   | 2h 11m      | 43m
      99.95%  | 4h 23m   | 1h 6m       | 21m
      99.99%  | 52m      | 13m         | 4m
      99.999% | 5m       | 1m 19s      | 26s
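The chart values follow directly from the SLO: allowed downtime = (1 - SLO) * period. A quick sketch (function name is illustrative):

```python
def downtime_budget(slo: float, period_hours: float) -> float:
    """Allowed downtime (in hours) for a given SLO over a period."""
    return (1.0 - slo) * period_hours

# Rough figures matching the chart (365-day year, 30-day month):
per_year = downtime_budget(0.999, 365 * 24)   # ~8.76 h/year for 99.9%
per_month = downtime_budget(0.999, 30 * 24)   # ~0.72 h/month, i.e. ~43 min
```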
  11. “To be monitored, you have to make yourself observable.” (me)
  12. Observability = Structured Logging + Metrics + Traces
  13. $ cat /var/log/server.log | grep "error"
  14. Structured Logging • Normalized • Key/value pairs for context • Structured format (JSON may be an option)
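A minimal sketch of what such a structured log line could look like, using only the Python standard library (the formatter and the `context` field name are illustrative, not prescribed by the talk):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object: normalized, key/value context."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Key/value context passed via `extra=` ends up as top-level fields.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("payments")
log.addHandler(handler)
log.setLevel(logging.INFO)

# INFO: a business transaction with its context, not free-form text.
log.info("payment accepted", extra={"context": {"tenant": "acme", "amount": 42}})
```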
  15. Levels
      DEBUG: whatever you need to trace your business logic. NOT IN PRODUCTION!
      INFO: everything related to the business logic that has real value to monitor. Ex: business transactions, business errors, …
      WARN: any unexpected behavior or error that can be overcome. An action may be required. Ex: configuration weakness, service unavailable but the fallback is OK, inconsistent data, …
      ERROR: any unexpected error, with ALL the context we are able to fetch! An action is required! Ex: service unavailable, data loss, …
      FATAL: an unexpected error that means the system cannot continue to work → SIGKILL. Ex: database connection lost, disk full, …
  16. The context = behavior understanding & efficient troubleshooting
  17. About the ERROR level • Create your own exceptions with the cause and… the context! • Suggest an action when possible. • Stack traces should only be tolerated for unexpected errors.
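One way to sketch such contextual exceptions in Python (all names here are hypothetical, not from the talk):

```python
class ContextualError(Exception):
    """Base error carrying a key/value context and a suggested action."""
    def __init__(self, message, context=None, action=None):
        super().__init__(message)
        self.context = context or {}   # everything needed to troubleshoot
        self.action = action           # suggested remediation, when known

# Typical usage: wrap a low-level failure, keep the cause, add the context.
try:
    raise TimeoutError("upstream timed out")
except TimeoutError as cause:
    err = ContextualError(
        "payment service unavailable",
        context={"service": "payments", "timeout_s": 2},
        action="check service health, then retry",
    )
    err.__cause__ = cause  # keep the original cause chained
```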
  18. Performance • Beware of the performance impact! • I/O can be saturated, and consequently so can YOUR application • An asynchronous logger may be an option… in some cases → Be concise!
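The asynchronous option mentioned above can be sketched with the stdlib `QueueHandler`/`QueueListener` pair, which moves log I/O off the request path:

```python
import logging
import logging.handlers
import queue

# The application thread only enqueues records (cheap); a background
# listener thread does the actual I/O, so a slow disk doesn't block requests.
q = queue.Queue(maxsize=10_000)
log = logging.getLogger("async-demo")
log.addHandler(logging.handlers.QueueHandler(q))
log.setLevel(logging.INFO)

listener = logging.handlers.QueueListener(q, logging.StreamHandler())
listener.start()
log.info("request handled")  # returns immediately; I/O happens elsewhere
listener.stop()              # drains the queue on shutdown
```

A bounded queue is a deliberate trade-off: under extreme load you drop log lines rather than your requests.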
  19. Structured Logging pipeline: log production → collector → DB → query and visualization
  20. Goals • Centralized system • Exploration tool • Alerting • Troubleshooting • And dashboarding (the sexy part)
  21. Goals? • And metrics… you should not(*) (*) Well… not in that way
  22. The 4 Golden Signals • LATENCY: measures the duration of a unit of work • TRAFFIC: measures how much work gets done per unit of time • ERRORS: measures all occurrences of errors • SATURATION: measures resource consumption
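A hand-rolled sketch of how three of the four signals could be captured around a unit of work (saturation is usually read from the OS/runtime rather than the request path; all names here are illustrative):

```python
import time

# Hypothetical in-process store for three of the four golden signals.
signals = {"traffic": 0, "errors": 0, "latency_s": []}

def observed(fn):
    """Wrap a unit of work: count it, time it, and record failures."""
    def wrapper(*args, **kwargs):
        signals["traffic"] += 1                 # TRAFFIC: one more unit of work
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        except Exception:
            signals["errors"] += 1              # ERRORS: every occurrence
            raise
        finally:
            # LATENCY: duration of the unit of work, success or not
            signals["latency_s"].append(time.perf_counter() - start)
    return wrapper

@observed
def handle_request(ok: bool) -> str:
    if not ok:
        raise ValueError("boom")
    return "done"
```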
  23. Measuring everything! • System: CPU, memory, I/O (disk/net), runtime (Docker, JVM, Node, Go, …), services (MySQL, PostgreSQL, Kafka, …) • Business events: business transactions, business errors, object lifecycle, external calls (DB, API, …)
  24. Metric types* • Counter: a simple counter incremented over time. Ex: number of payments • Gauge: a value moving up or down over time. Ex: CPU usage • Histogram: samples observations. Ex: request duration, response size • Summary: samples observations with a total count and a sum of all observed values. Ex: request duration, response size (*) Depending on the protocol used (StatsD, InfluxDB Line Protocol, Prometheus, …)
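The types can be sketched with plain classes (a toy model for illustration, not the API of any specific metrics client):

```python
class Counter:
    """Monotonic: only goes up (e.g. number of payments)."""
    def __init__(self):
        self.value = 0
    def inc(self, n=1):
        self.value += n

class Gauge:
    """Can go up or down (e.g. CPU usage, queue depth)."""
    def __init__(self):
        self.value = 0.0
    def set(self, v):
        self.value = v

class Summary:
    """Samples observations, keeping a total count and a running sum."""
    def __init__(self):
        self.count = 0
        self.sum = 0.0
    def observe(self, v):
        self.count += 1
        self.sum += v

payments = Counter(); payments.inc()       # one payment processed
cpu = Gauge(); cpu.set(0.42)               # current CPU usage
latency = Summary()
for duration in (0.1, 0.3):                # two request durations observed
    latency.observe(duration)
# latency.count == 2 and latency.sum ≈ 0.4, so the mean is ~0.2 s
```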
  25. With metadata* • Tags • Fields • Geographical coordinates. Ex: datacenter ID, server name, service name, tenant, payment method, … (*) Depending on the protocol used (StatsD, InfluxDB Line Protocol, Prometheus, …)
  26. Push vs Pull • Push: pushes the metrics as they arrive (or in batches). Easy to implement; data loss if the collector is unavailable. Ex: InfluxDB, StatsD • Pull: gathers the metrics in memory and provides an endpoint for an external collector. Platform independence; doesn’t work well for event-driven time series. Ex: Borg, Prometheus
  27. Metrics pipeline: metrics production → collector → TSDB → query, visualization and alerting
  28. Traces Or the ability to follow a transaction internally AND across a distributed architecture • Ideal for spotting: bottlenecks, complex call hierarchies, non-optimized code, cross-system transactions
  29. https://opentracing.io/
  30. Traces pipeline: trace production → transport → DB → visualization
  31. Healthcheck Or the ability to say that your service is available to your end users. Basically: smoke tests in production
  32. Whitebox vs Blackbox • Gather the health status of: internals (services, DB connection, FS, memory/heap, thread pools, …) and dependencies (external APIs, …). Gentle reminder: your SLO depends on others’ SLOs • Send them to a metrics system • Expose a sum-up of your health status: use a standard endpoint (/healthz), with a smoke-test scenario; tip: cache those values • And use an external probe.
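A sketch of the cached /healthz sum-up suggested above (the check names and TTL are made up for the example; real checks would ping the DB, FS, external APIs, …):

```python
import time

# Hypothetical checks; each returns True when the dependency is healthy.
def check_db() -> bool:
    return True

def check_external_api() -> bool:
    return True

CHECKS = {"db": check_db, "external_api": check_external_api}
TTL_S = 5  # cache so a probe storm doesn't hammer your dependencies
_cache = {"at": 0.0, "status": None}

def healthz():
    """Sum up the health status, cached for TTL_S seconds (the slide's tip)."""
    now = time.monotonic()
    if _cache["status"] is None or now - _cache["at"] > TTL_S:
        results = {name: fn() for name, fn in CHECKS.items()}
        _cache["status"] = {"ok": all(results.values()), "checks": results}
        _cache["at"] = now
    return _cache["status"]
```

An HTTP handler would serve this dict as JSON on /healthz with a 200 or 503 depending on `ok`, and an external (blackbox) probe would poll that endpoint.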
  33. Timeouts everywhere! • Timeout propagation • Circuit breaker. Ex: REST API (global timeout) → service (service timeout) → DB (DB timeout) / external service (ext service timeout)
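A minimal circuit-breaker sketch illustrating the fail-fast idea (thresholds and names are arbitrary): after a few consecutive failures, stop waiting on a dead dependency's timeout and fail immediately for a cooldown period.

```python
import time

class CircuitBreaker:
    """Open the circuit after `threshold` consecutive failures and fail
    fast for `cooldown_s` seconds instead of burning the caller's timeout."""
    def __init__(self, threshold=3, cooldown_s=30.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success resets the streak
        return result
```

Combined with propagated timeouts (each inner call gets a budget smaller than its caller's), this keeps one slow dependency from stalling the whole request path.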
  34. Healthcheck tooling: eclipse/microprofile-health, docker/go-healthcheck. Health status → external probes → TSDB
  35. Error Metric ++ • Be notified… before the client • Trace errors with: context, code binding, user feedback • Manage the error lifecycle: identification, correction, validation, postmortem
  36. Error Tracking pipeline: error emission → DB → management
  37. This talk is TLS* (*) Too Long, Slept
  38. Availability (SLA compliance, fault-tolerant mechanisms) • Performance (response time, profiling) • Scalability (handle current and future loads, optimum use of resources) • Auditing (events/entity tracking, notifications) • Configurability (personalization per tenant, configuration management) • Security (role/privilege-based access, data privacy and encryption) • Extendability (event-driven architecture, code modularity) • Monitoring (usage and technical metrics, application health check)
  39. SLO reminder • 100% availability is not a realistic target! • Fault tolerance has a cost in money, time and agility • We have to accept some degree of risk in order to deliver features • That risk should match end-user satisfaction
  40. Observability reminder
  41. And… • Logs are for production • Track and hunt ALL errors • Put proper health checks in place • Set ALL your timeouts • Automation and tooling are the key to velocity • Think User Experience!
  42. Atos, the Atos logo, Atos Codex, Atos Consulting, Atos Worldgrid, Bull, Canopy, equensWorldline, Unify, Worldline and Zero Email are registered trademarks of the Atos group. March 2017. © 2017 Atos. Confidential information owned by Atos, to be used by the recipient only. This document, or any part of it, may not be reproduced, copied, circulated and/or distributed nor quoted without prior written approval from Atos. Thanks! For more information please contact: @ncarlier github.com/ncarlier