Observability with Spring-based distributed systems

  1. Tommy Ludwig, Rakuten, Inc., Travel Product Development Department, Foundation Office. Spring Fest 2018, 2018-10-31
  2. • Observability: what / why • 3 pillars of observability: Logging, Metrics, Tracing • Putting it all together
  4. Observability is achieved through a set of tools and practices that aim to turn data points and context into insights. • Beyond traditional monitoring • Constant partial degradation/failure • Expect the unexpected • Answer unknown questions about your system
  5. You want to provide a great experience for users of your system. • Observability builds confidence in production • Ownership: give yourself the tools to be a good owner • MTTR (mean time to recovery) is key: failures are happening • Early detection + fast recovery + increased understanding
  6. • Finish your work faster/easier • Find and fix problems sooner (before release, before QA) • Improve your service by better understanding its behavior
  8. • Spring Boot Actuator is awesome. • You get so much out-of-the-box. • But... is it enough? Like most things, it depends. • Information is inherently instance-scoped
  9. Spring Boot Admin makes it easy to access and use each instance’s Actuator endpoints. https://github.com/codecentric/spring-boot-admin
  11. [Diagram: users (User) and databases (DB)]
  12. • Any request spans multiple processes • Need to stitch together local info and slice/drill-down • Increased points of failure • Scaling and ephemeral instances* (* Not strictly properties of a distributed system)
  14. … • 3 sides to observability • Non-functional requirements (generic/specific) • Overlap exists, but use all 3 for best insight. Source: Peter Bourgon, access date: 2018-05-18 http://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html
  15. When it comes to logging, metrics, and tracing: • Common needs just work out-of-the-box. • Custom needs can be met with a little extra effort. See also: 80-20 rule
  17. • Arbitrary messages you want to find later • Formatted to give context: logging levels, timestamp • Message examples: exceptions/stack traces, additional context, access logs, request/response bodies
  18. [Diagram: "I want to check the logs…" Logs for App1 and App2 sit on separate VMs, so they must be fetched ("Get logs") and searched ("Search logs") one VM at a time]
  19. • Does not scale: too much work and knowledge required • Multithreaded, concurrent requests intermingle logs • Low usability: searching is limited/difficult
  20. [Diagram: App1 and App2 on each VM stream logs to a central log store service; a query request against the store returns the collection of matching logs]
  21. Spring Cloud Sleuth • Adds trace ID for request correlation • Query all collected logs by any field or full-text search: time window, service, log level, trace ID, message • Centralized, request-correlated, formatted logs, indexed and searchable across your system
  22. Spring Boot • Configurable via Spring Environment (see also Spring Cloud Config): log format (make a common format across applications), log levels (logging.level.*) • Configurable via Actuator (at runtime): log levels
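As an illustration of the runtime option, here is a minimal sketch of the HTTP request that changes one logger's level through Actuator's loggers endpoint in Spring Boot 2. It only builds the request and does not send it. The base URL and the logger name `com.example.travel` are hypothetical, the endpoint is assumed to be exposed over HTTP, and `java.net.http` requires Java 11 or later.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch only: builds (but does not send) the request that asks Actuator's
// loggers endpoint to change one logger's level at runtime.
final class LogLevelRequest {

    // JSON body understood by the loggers endpoint.
    static String body(String level) {
        return "{\"configuredLevel\":\"" + level + "\"}";
    }

    // baseUrl and loggerName come from the caller; the values in main are hypothetical.
    static HttpRequest build(String baseUrl, String loggerName, String level) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/actuator/loggers/" + loggerName))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body(level)))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("http://localhost:8080", "com.example.travel", "DEBUG");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(body("DEBUG"));
    }
}
```

Because Actuator information is instance-scoped, a request like this affects only the instance it is sent to.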
  23. Spring Cloud Config (shared config properties) • Common log pattern. Travel Auto-configuration • Correlation ID added to MDC. ELK • Elasticsearch: log storage/querying/indexing • Logstash: log forwarding/parsing • Kibana: search / UI for querying Elasticsearch
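The idea of putting a correlation ID into the MDC and printing it through a common log pattern can be sketched without any logging library. The class below is a dependency-free stand-in for an MDC, not the SLF4J or Logback API, and the pattern, service name, and ID values are assumptions made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Dependency-free stand-in for a logging MDC (mapped diagnostic context):
// per-thread key/value pairs that a log pattern can print on every line.
final class CorrelatedLog {

    private static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(HashMap::new);

    static void put(String key, String value) {
        CONTEXT.get().put(key, value);
    }

    static void clear() {
        CONTEXT.get().clear();
    }

    // Hypothetical common format: level, [service,traceId,correlationId], message.
    static String format(String level, String service, String message) {
        Map<String, String> ctx = CONTEXT.get();
        return String.format("%-5s [%s,%s,%s] %s",
                level,
                service,
                ctx.getOrDefault("traceId", "-"),
                ctx.getOrDefault("correlationId", "-"),
                message);
    }

    public static void main(String[] args) {
        put("traceId", "463ac35c9f6413ad");
        put("correlationId", "cID0001");
        try {
            System.out.println(format("INFO", "booking-service", "reservation created"));
        } finally {
            clear(); // avoid leaking context into the next request on a pooled thread
        }
    }
}
```

With every application sharing one pattern, a central store can index the bracketed fields and return all lines for a single request regardless of which service wrote them.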
  25. Characteristics: • Aggregate time-series data; bounded size • Can slice based on multiple dimensions/tags/labels* Purpose: • Visualize / identify trends and deviation • Alerting based on metric queries (* See also https://www.datadoghq.com/blog/the-power-of-tagged-metrics/)
  26. Example metric | Type | Example tags
      response time | timer | uri, status, method
      number of classes loaded | gauge | (none)
      response body size | histogram | uri, status, method
      number of garbage collections | counter | cause, action
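To make the tag model concrete, the following toy counter keeps one series per metric name and tag set, then slices by a single tag. It is not Micrometer's API, only a sketch of the data model; the metric name `http.requests` and the tag values are hypothetical. `Map.copyOf` requires Java 10 or later.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Toy dimensional counter: one series per (metric name, tag set).
final class TaggedCounters {

    // metric name -> (tag set -> count)
    private final Map<String, Map<Map<String, String>, LongAdder>> series =
            new ConcurrentHashMap<>();

    void increment(String name, Map<String, String> tags) {
        series.computeIfAbsent(name, n -> new ConcurrentHashMap<>())
                .computeIfAbsent(Map.copyOf(tags), t -> new LongAdder())
                .increment();
    }

    // Slice: sum every series of the metric whose tags contain tagKey=tagValue.
    long sumWhere(String name, String tagKey, String tagValue) {
        Map<Map<String, String>, LongAdder> byTags = series.get(name);
        if (byTags == null) {
            return 0;
        }
        long total = 0;
        for (Map.Entry<Map<String, String>, LongAdder> e : byTags.entrySet()) {
            if (tagValue.equals(e.getKey().get(tagKey))) {
                total += e.getValue().sum();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        TaggedCounters counters = new TaggedCounters();
        counters.increment("http.requests", Map.of("uri", "/hello", "status", "200", "method", "GET"));
        counters.increment("http.requests", Map.of("uri", "/hello", "status", "500", "method", "GET"));
        counters.increment("http.requests", Map.of("uri", "/bye", "status", "200", "method", "POST"));
        System.out.println("status=200: " + counters.sumWhere("http.requests", "status", "200"));
        System.out.println("uri=/hello: " + counters.sumWhere("http.requests", "uri", "/hello"));
    }
}
```

The stored size depends on the number of distinct tag combinations rather than the number of requests, which is why metrics stay bounded while logs grow with traffic.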
  27. [Diagram: HTTP server requests reach the controller of my-application; its metrics are read via HTTP GET or over JMX]
  28. [Diagram: HTTP server requests pass through a load balancer (LB) to two my-application instances, each with its own controller]
  29. [Diagram: both my-application instances publish metrics to a metrics backend, which feeds alerts and visualization]
  30. • Spring Boot 2 uses Micrometer as its native metrics library • Micrometer supports many metrics backends, e.g. Atlas, Datadog, Influx, Prometheus, SignalFx, Wavefront • Instrumentation of common components is auto-configured: JVM/system, HTTP server/client requests, Spring Integration, DataSource… • Custom metrics are also easy to add
  31. • Configure via properties • management.metrics.* • Disable certain metrics • Enable percentiles/SLAs/percentile histograms • Common tags • e.g. application name, instance, stack, region, zone
  32. Travel Service Starter (included in service-parent) • Includes micrometer-registry-prometheus dependency. Travel Auto-configuration • Common metric tag for application name (spring.application.name). Travel Metrics Platform • Micrometer library for metrics instrumentation/reporting • Prometheus for metrics collection/storage/querying • Grafana for dashboards/graphing sourced by Prometheus
  33. • Visualize metrics, compare over time • Have a question you’re trying to answer • Do NOT just stare at dashboards
  34. • 4 Golden signals • Latency • Errors • Rate • Saturation
  35. • Don’t double alert! • Symptoms, not causes
  37. • Investigate a slow request • Understand dependency/call relationship between services • Where did the error occur in the request?
  38. • Local tracing: Actuator /httptrace endpoint • Latency data + request metadata { "traces" : [ { "timestamp" : "2018-05-09T13:28:32.867Z", "principal" : { "name" : "alice" }, "session" : { "id" : "728aebfe-8222-4dd2-856c-256104b20bfe" }, "request" : { "method" : "GET", "uri" : "https://api.example.com", "headers" : { "Accept" : [ "application/json" ] } }, "response" : { "status" : 200, "headers" : { "Content-Type" : [ "application/json" ] } }, "timeTaken" : 3 } ] } Source: Spring Boot Actuator Web API Documentation; access date: 2018-05-18 https://docs.spring.io/spring-boot/docs/2.0.2.RELEASE/actuator-api/html/#http-trace
  39. Distributed tracing: tracing across process boundaries • Propagate context/hierarchy; join together after • Request-scoped latency analysis across services • Metrics lack request context • Logging has local context but limited distributed info
  40. [Diagram: a user request flows through a tracing-instrumented system (service1 to service4); each service carries a tracer/instrumentation and reports to the tracing backend] ① start span / sampling decision ② propagate trace context ③ continue trace ④ report spans
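Steps ① to ③ can be sketched in plain Java using the B3 header names that Zipkin instrumentation propagates. This is an illustration under simplifying assumptions, not Brave's or Sleuth's API: the receiving side always creates a child span (real tracers may instead join the caller's span), IDs are 64-bit, and step ④, reporting spans to the backend, is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of trace-context propagation with B3 headers.
final class TraceContext {
    final String traceId;
    final String spanId;
    final String parentSpanId; // null for the root span
    final boolean sampled;

    private TraceContext(String traceId, String spanId, String parentSpanId, boolean sampled) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
        this.sampled = sampled;
    }

    // Random 64-bit ID rendered as 16 lowercase hex characters.
    private static String newId() {
        return String.format("%016x", ThreadLocalRandom.current().nextLong());
    }

    // (1) Start the root span; the sampling decision is made once, at the edge.
    static TraceContext startRoot(boolean sampled) {
        String id = newId();
        return new TraceContext(id, id, null, sampled);
    }

    // (2) Propagate: copy the context into outgoing request headers.
    Map<String, String> toHeaders() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("X-B3-TraceId", traceId);
        headers.put("X-B3-SpanId", spanId);
        if (parentSpanId != null) {
            headers.put("X-B3-ParentSpanId", parentSpanId);
        }
        headers.put("X-B3-Sampled", sampled ? "1" : "0");
        return headers;
    }

    // (3) Continue the trace downstream: same trace ID and sampling decision, new child span.
    static TraceContext continueFrom(Map<String, String> headers) {
        return new TraceContext(
                headers.get("X-B3-TraceId"),
                newId(),
                headers.get("X-B3-SpanId"),
                "1".equals(headers.get("X-B3-Sampled")));
    }

    public static void main(String[] args) {
        TraceContext service1 = startRoot(true);
        TraceContext service2 = continueFrom(service1.toHeaders());
        System.out.println("service1 -> " + service1.toHeaders());
        System.out.println("service2 -> " + service2.toHeaders());
    }
}
```

The trace ID stays constant along the call chain while each hop records its parent, which is what lets the backend reassemble the spans into one request tree.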
  42. [2010] Google Dapper, [2012] Twitter Zipkin, [2015] OpenZipkin, [2017] Zipkin Meetup #1, [2018] Apache Incubator, Today. https://zipkin.io/ Wiki: https://cwiki.apache.org/confluence/display/ZIPKIN/
  43. Source: Spring Cloud Sleuth reference documentation; access date: 2018-05-18 http://cloud.spring.io/spring-cloud-static/spring-cloud-sleuth/2.0.0.RC1/single/spring-cloud-sleuth.html#_distributed_tracing_with_zipkin Zipkin UI workshop happening this week! https://cwiki.apache.org/confluence/display/ZIPKIN/2018-10-29+Zipkin+UI+at+LINE+Tokyo
  44. [Diagram: the tracing-instrumented system (s1 to s4) sends spans over a transport to the Zipkin server (collector, storage, API, UI), which is backed by a datastore] Transport: • HTTP • Kafka • RabbitMQ. Storage: • In-memory * • MySQL * • Elasticsearch • Cassandra. Reference: https://zipkin.io/pages/architecture.html
  45. Tracing backend: Zipkin Server getting started. Spring Cloud Sleuth: spring-cloud-starter-zipkin dependency • Auto-configures tracing instrumentation (Zipkin’s Brave) • Reports recorded spans to Zipkin async/batched
  46. Travel Service Starter (included in service-parent) • Includes spring-cloud-starter-zipkin dependency (Spring Cloud Sleuth). Travel Auto-configuration • Tag root span with correlation ID. Travel Cloud Config • Zipkin server address • Sampling %, skip patterns
  48. Together you have correlated logging, metrics, and tracing across the whole system. Jump between each using common identifiers. Adapted from: Adrian Cole, “Observability 3 ways: logging metrics and tracing”; access date: 2018-05-18 https://speakerdeck.com/adriancole/observability-3-ways-logging-metrics-and-tracing
  49. spring.application.name = Zipkin service name. Configure it as a Micrometer common tag.
      http_server_requests_seconds_count{exception="None",method="GET",status="200",uri="/hello",} 4.0
      http_server_requests_seconds_sum{exception="None",method="GET",status="200",uri="/hello",} 0.02570928
      http_server_requests_seconds_max{exception="None",method="GET",status="200",uri="/hello",} 0.0
      [Figure labels: Micrometer tags, Zipkin tags]
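As a worked example of reading these samples, dividing the `_sum` value by the `_count` value gives the mean request latency for that tag combination: 0.02570928 s over 4 requests, or roughly 6.4 ms each. The sketch below does that arithmetic on the two lines from the slide. The parsing is deliberately naive and assumes each line ends with the sample value, with no trailing timestamp.

```java
// Sketch: mean latency from a Prometheus-style timer's _sum and _count samples.
final class MeanLatency {

    // Naive parse: the sample value is the last whitespace-separated token on the line.
    static double value(String sampleLine) {
        String trimmed = sampleLine.trim();
        return Double.parseDouble(trimmed.substring(trimmed.lastIndexOf(' ') + 1));
    }

    // Mean seconds per request = total seconds / number of requests.
    static double meanSeconds(String countLine, String sumLine) {
        return value(sumLine) / value(countLine);
    }

    public static void main(String[] args) {
        String count = "http_server_requests_seconds_count"
                + "{exception=\"None\",method=\"GET\",status=\"200\",uri=\"/hello\",} 4.0";
        String sum = "http_server_requests_seconds_sum"
                + "{exception=\"None\",method=\"GET\",status=\"200\",uri=\"/hello\",} 0.02570928";
        System.out.printf("mean latency: %.2f ms%n", meanSeconds(count, sum) * 1000);
    }
}
```

In practice this division is usually done in the metrics backend's query language over a time window rather than on raw scrape output.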
  50. Trace to logs: link from a trace to e.g. a Kibana search by traceId. The reverse direction (logs to trace) is also possible. https://github.com/openzipkin/zipkin/tree/master/zipkin-ui#how-do-i-find-logs-associated-with-a-particular-trace
  51. • Confirm request flow: does it match the expected design/architecture? • Check service dependencies in Zipkin • Check request flow in Zipkin; jump to logs if necessary • Filter by service name, span name, tags • Adjust log levels via Actuator if necessary
  52. • Automated tests generate a correlation ID per test case execution. • Use the correlation ID to find the related traces in Zipkin. [Diagram: correlation ID cID0001 links trace1 and trace2]
  53. • Manual tests (in non-production environments) from the browser can use the Zipkin Browser Extension to get the traceId for a browser request • Where in the request flow did the error occur, or why was it slow? • Check request flow in Zipkin; jump to logs (if necessary) • Adjust log levels via Actuator (if necessary)
  54. [Cycle diagram: Detection, Investigation, Recovery, Adjustment; triggered by an alert or inquiry] 1. Starts with an alert/report 2. Check metrics 3. Check tracing data (if needed) 4. Check logs (if needed) 5. Triage issue 6. Make adjustment to prevent recurrence
  56. • System-wide observability is crucial in distributed architectures • Tools exist and Spring makes them easy to integrate • Most common cases are covered out-of-the-box or configurable. Custom instrumentation is possible as needed. • Use the right tool for the job; synergize across tools
  57. • “Distributed Systems Observability” e-book by Cindy Sridharan: http://distributed-systems-observability-ebook.humio.com/ • Articles by Cindy Sridharan (@copyconstruct): https://medium.com/@copyconstruct • Talks by Charity Majors (@mipsytipsy): https://speakerdeck.com/charity • “Observability+” articles by JBD (@rakyll): https://medium.com/observability
