OSDC 2018 - Distributed monitoring

Modern software development is increasingly taking a “microservice” approach that has resulted in an explosion of complexity at the network level. We have more applications running distributed across different datacenters. Distributed tracing, events, and metrics are essential for observing and understanding modern microservice architectures.
This talk is a deep dive into how to monitor your distributed system. You will get tools, methodologies, and experiences that will help you understand what your applications expose and how to get value out of all this information.
Gianluca Arbezzano, SRE at InfluxData, will share how to monitor a distributed system and how to switch from a more traditional monitoring approach to observability. Stay focused on the server's role and not on its hostname, because the hostname is not really important anymore: our servers and containers are fast-moving parts, and it is easier to detach one in case of trouble than to call the server by name like a cute pet. Learn how to design an SLO for your core services and how to iterate on it. Instrument your services with tracing using tools like Zipkin or Jaeger to measure latency across your network.

  1. @gianarb - gianluca@influxdb.com Distributed monitoring: How to understand the chaos
  2. Who Am I? ● Software engineer passionate about almost everything; at the moment I work with Golang ● Open source developer, Docker Captain and CNCF Ambassador ● Site Reliability Engineer at InfluxData ● Speaker, blogger (gianarb.it) and so on... ● I love to travel, I grow my vegetables and I cook from time to time
  7. github.com/influxdata
  8. What are you trying to say? Microservices are not “The Distributed System”
  9. What are you trying to say? A queue system can be distributed too...
  10. What are you trying to say? A multi-threaded application is a distributed system
  11. What are you trying to say? The more “it is distributed” across servers, worlds, and cloud providers, the more complex it is...
  12. What are you trying to say? Containers, Docker, Kubernetes, and cloud computing accelerated application distribution...
  13. What are you trying to say? Did you migrate to a distributed mess without a real need? Blame yourself...
  14. What are you trying to say? A request rises and falls across multiple applications before getting back to the user. This is complexity.
  15. Consequences: Logs are not easy to follow when they come from distributed applications; they are not a single stream.
  16. Consequences: Events and metrics need to be correlated.
  17. Distributed Tracing: Tracing is a way to correlate logs using a set of IDs.
  19. Criticalities: We write applications in many different languages
  20. Criticalities: Across different teams
  21. Criticalities: In the end, to build a trace we need to agree on the same protocol, no matter the language or the team.
  22. Distributed Tracing: OpenTracing is a standard sponsored by the Cloud Native Computing Foundation (CNCF), developed to agree on common rules. It provides libraries across languages, and you can use many tracers, both open source and as a service.
  23. © 2017 InfluxData. All rights reserved. [Diagram labels: OpenTracing API; application logic; µ-service frameworks; Lambda functions; RPC & control-flow frameworks; existing instrumentation; tracing infrastructure; main(); Instana; Jaeger; microservice process]
  24. OpenTracing is more than 2 years old! Tracer implementations: Zipkin, Jaeger, LightStep, SkyWalking, AWS X-Ray... All sorts of companies use OpenTracing.
  25. Rapidly growing OSS and vendor adoption [slide labels: JDBI, Java Webservlet]
  26. High Cardinality: A trace contains a lot of information, indexed by request ID (called trace_id). Traces are expensive to store.
  27. Distributed Tracing: Luckily, traces don't have a long lifecycle. You usually use them to debug a problem that happened almost in real time or within a short time window.
  28. Distributed Tracing: We set a one-week retention policy; after 7 days we downsample and remove the original trace. We also downsample traces based on how many requests we receive for a specific API call.
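  29. Zipkin tracer setup with OpenTracing (Go):
      import (
          "fmt"
          "log"
          opentracing "github.com/opentracing/opentracing-go"
          zipkin "github.com/openzipkin/zipkin-go-opentracing"
      )
      // tracingConf is the application's own configuration (not shown on the slide).
      collector, err := zipkin.NewHTTPCollector(tracingConf.ZipkinEndpoint)
      if err != nil {
          log.Fatal(err)
      }
      // The format string was lost in the slide export; "127.0.0.1:%d" is an assumption.
      recorder := zipkin.NewRecorder(collector, false, fmt.Sprintf("127.0.0.1:%d", tracingConf.Port), "restapi")
      tracer, err := zipkin.NewTracer(
          recorder,
          zipkin.ClientServerSameSpan(false),
          zipkin.TraceID128Bit(false),
      )
      if err != nil {
          log.Fatal(err)
      }
      // Register the tracer so any package can reach it via opentracing.GlobalTracer().
      opentracing.SetGlobalTracer(tracer)
  30. Creating a span (Go):
      import opentracing "github.com/opentracing/opentracing-go"
      // Start a span from the global tracer and finish it when the function returns.
      tracer := opentracing.GlobalTracer()
      sp := tracer.StartSpan("api.create_user")
      defer sp.Finish()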
  31. opentracing.io
  32. Distributed Tracing
  34. Distributed Tracing - Collect traces via Telegraf
  37. No UI at the moment. :( SELECT * FROM zipkin WHERE time < now() - 1h AND trace_id = 'a4hs45hs46jd56j4s'
  38. The process of understanding: 1. Instrument 2. Observe 3. Aggregate and sample 4. Take action (via alerts or whatever)
  39. How do people and teams play this game? They should deploy their own applications.
  40. How do people and teams play this game? They should be on call and take care of the production behavior of their applications.
  41. How do people and teams play this game? They can write a “presentation” of their service (a doc): critical metrics, capacity planning (CPU-, RAM-, or disk-intensive app), service location in the system (close to other apps, SSD)
  42. How do people and teams play this game? Keep everyone in the loop and responsible for the real traffic. There is no fun in writing code without running it in production!
  43. How do people and teams play this game? Every tool we develop exposes APIs that developers can use, e.g. to create runtime alerts with Kapacitor.
  44. Servers/Containers/VMs are not pets: 1. They don't have names because they come and go based on load and needs. 2. You can't look at cute server pictures on Instagram. Yet... 3. A server has labels.
  45. Servers/Containers/VMs are not pets: Write tools that help you replace servers and processes, or use available projects like AWS Auto Scaling groups, Kubernetes and so on. DevOps is an attitude, not a role that you hire. DevOps people are developers passionate about server automation and related topics.
  48. We care about state and events: 1. User created 2. Invoice generation 3. Email sent 4. Purchase 5. Whatever...
  49. We care about data! But data is a whole other topic!
  50. Back to tracing - the cost of a retry
  51. Distributed Tracing - the cost of a retry
  52. Wrap up! ● Monitoring a distributed system is hard and you need to correlate all the things ● OpenTracing and distributed tracing ● Keep people in the loop and give them ownership of production ● DevOps is an attitude ● Servers/Containers/Processes are not pets ● Application state and events ● Listen to your system and have fun
  53. Collecting data is just the beginning: Aggregation, alerting, and downsampling are other important steps to answer a question.