Modern software development is increasingly taking a “microservice” approach that has resulted in an explosion of complexity at the network level. We have more applications running distributed across different datacenters. Distributed tracing, events, and metrics are essential for observing and understanding modern microservice architectures.
This talk is a deep dive into how to monitor your distributed system. You will get tools, methodologies, and experiences that will help you understand what your applications expose and how to get value out of all this information.
Gianluca Arbezzano, SRE at InfluxData, will share how to monitor a distributed system and how to switch from a more traditional monitoring approach to observability. Stay focused on the server's role and not on its hostname, which is not really important anymore: our servers and containers are fast-moving parts, and it is easier to detach one in case of trouble than to call the server by name like a cute pet. He will also cover how to design an SLO for your core services and how to iterate on it, and how to instrument your services with tracing using tools like Zipkin or Jaeger to measure latency across your network.
2. @gianarb - gianluca@influxdb.com
Who Am I?
● Software Engineer passionate about almost everything; at the
moment I work with Go
● Open Source developer, Docker Captain and CNCF
Ambassador
● Site Reliability Engineer at InfluxData
● Speaker, blogger (gianarb.it) and so on...
● I love to travel, I grow my own vegetables, and I cook from time to time
What are you trying to say?
The more “it is distributed” across servers,
worlds, and cloud providers, the more
complex it is...
What are you trying to say?
Containers, Docker, Kubernetes, and cloud
computing accelerated application
distribution...
What are you trying to say?
A request goes up and down across multiple
applications before it gets back to the user.
This is complexity.
Distributed Tracing
OpenTracing is a standard sponsored by the Cloud Native
Computing Foundation (CNCF), developed to agree on
common rules. It provides libraries across languages, and you
can use it with many tracers, both open source and as a service.
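To make the idea concrete, here is a minimal sketch of what a tracer records, written in plain Python rather than with the OpenTracing libraries. The `Span` class, the `start_span` helper, and the operation names are hypothetical and exist only for illustration. The point is that every span in a request carries the same trace ID and a pointer to its parent.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One timed operation inside a trace (hypothetical, not the OpenTracing API)."""
    trace_id: str
    operation: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = field(default_factory=time.monotonic)
    end: Optional[float] = None

    def finish(self) -> None:
        self.end = time.monotonic()

    @property
    def duration(self) -> float:
        if self.end is None:
            raise ValueError("span is not finished")
        return self.end - self.start


def start_span(operation: str, parent: Optional[Span] = None) -> Span:
    """Start a root span (new trace_id) or a child span (inherits the parent's trace_id)."""
    if parent is None:
        return Span(trace_id=uuid.uuid4().hex, operation=operation)
    return Span(trace_id=parent.trace_id, operation=operation, parent_id=parent.span_id)


# A request enters the frontend, which then calls a backend service.
root = start_span("frontend: GET /checkout")
child = start_span("backend: charge card", parent=root)
child.finish()
root.finish()

print(child.trace_id == root.trace_id)   # both spans belong to the same trace
print(child.parent_id == root.span_id)   # the child points back to its caller
```

In a real system the trace ID and parent span ID travel between services in request headers, which is the part the tracing libraries take care of.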
High Cardinality
A trace contains a lot of information, and
traces are indexed by request ID (called
trace_id). Every request gets its own trace_id,
so they are expensive to store.
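A small sketch of why this counts as high cardinality: a tag such as the endpoint has a handful of distinct values, while the trace ID has one distinct value per request. The event layout and the `cardinality` helper are assumptions made for illustration.

```python
def cardinality(events, key):
    """Number of distinct values seen for a given key."""
    return len({event[key] for event in events})


endpoints = ("/login", "/search", "/checkout")

# 3000 simulated requests, each with its own trace_id.
events = [
    {"trace_id": f"{i:032x}", "endpoint": endpoints[i % len(endpoints)]}
    for i in range(3000)
]

print(cardinality(events, "endpoint"))  # 3
print(cardinality(events, "trace_id"))  # 3000
```

An index keyed by endpoint stays small no matter how much traffic arrives; an index keyed by trace ID grows with every request.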
Distributed Tracing
Luckily, traces don’t have a long lifecycle. Usually, you use
them to debug a problem that happened almost in real time or
within a short time window.
Distributed Tracing
We set a one-week retention policy: after 7 days we
downsample and remove the original traces.
We also downsample them based on how many requests we
are receiving for a specific API call.
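A rough sketch of these two rules follows. Only the 7-day retention comes from the slide. The budget of 100 traces per minute, the function names, and the choice to decide from the trace ID (so that the same trace always gets the same answer) are assumptions, not a description of the actual setup.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)  # from the talk: one week


def is_expired(trace_time: datetime, now: datetime) -> bool:
    """True when the original trace is older than the retention window."""
    return now - trace_time > RETENTION


def sample_rate(requests_per_minute: int, target_traces_per_minute: int = 100) -> float:
    """Keep every trace for quiet API calls and a shrinking fraction for busy ones.

    target_traces_per_minute is an assumed budget, not a value from the talk.
    """
    if requests_per_minute <= target_traces_per_minute:
        return 1.0
    return target_traces_per_minute / requests_per_minute


def keep_trace(trace_id: str, rate: float) -> bool:
    """Deterministic decision: map the first 32 bits of a hex trace_id to [0, 1)."""
    bucket = int(trace_id[:8], 16) / 2**32
    return bucket < rate


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
print(is_expired(now - timedelta(days=8), now))  # True
print(sample_rate(50))                           # 1.0
print(sample_rate(1000))                         # 0.1
```

A busy API call that receives 1000 requests per minute keeps roughly one trace in ten, while a rarely used call keeps all of them.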
The process of understanding
1. Instrument
2. Observe
3. Aggregate and sample
4. Take action (via alerts or whatever)
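The four steps can be sketched end to end in a few lines. Everything here is a toy under stated assumptions: latencies are kept in memory, the aggregate is a nearest-rank 95th percentile, and the 500 ms threshold and endpoint names are invented for the example.

```python
import math
from collections import defaultdict

observations = defaultdict(list)  # endpoint -> observed latencies in ms


def instrument(endpoint, latency_ms):
    """Steps 1 and 2: the application reports what it observed."""
    observations[endpoint].append(latency_ms)


def percentile(values, pct):
    """Step 3: aggregate with a nearest-rank percentile."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def check(threshold_ms=500.0, pct=95):
    """Step 4: turn the aggregate into alerts."""
    alerts = []
    for endpoint, latencies in sorted(observations.items()):
        value = percentile(latencies, pct)
        if value > threshold_ms:
            alerts.append(
                f"{endpoint}: p{pct} latency {value:.0f} ms > {threshold_ms:.0f} ms"
            )
    return alerts


for ms in (120, 130, 110, 900, 950):
    instrument("/checkout", ms)
for ms in (40, 50, 45, 60, 55):
    instrument("/login", ms)

print(check())  # ['/checkout: p95 latency 950 ms > 500 ms']
```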
How do people and teams play this game?
They are on call and they take care
of the production behavior of their
applications
How do people and teams play this game?
They can write a “presentation” of their
service (a doc): critical metrics, capacity
planning (CPU-, RAM- or disk-intensive app), service
location in the system (close to other apps, SSD)
How do people and teams play this game?
Keep everyone in the loop and
responsible for the real traffic.
There is no fun in writing code without
running it in production!
How do people and teams play this game?
Every tool we develop exposes APIs that
developers can use.
E.g. create runtime alerts with Kapacitor.
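As a hedged sketch of what "create an alert through the API" could look like, the snippet below only builds the JSON body for a Kapacitor stream task and prints it. The TICKscript follows the well-known CPU alert from the Kapacitor getting-started material; the task ID, database, retention policy, threshold, and log path are illustrative assumptions. The field names (`id`, `type`, `dbrps`, `script`, `status`) and the endpoint to POST this to (`/kapacitor/v1/tasks` in the versions I am aware of) should be checked against the Kapacitor HTTP API documentation for your version.

```python
import json

# TICKscript modeled on the Kapacitor getting-started CPU alert; the
# measurement, field, threshold, and log path are illustrative assumptions.
TICKSCRIPT = """stream
    |from()
        .measurement('cpu')
    |alert()
        .crit(lambda: "usage_idle" < 10)
        .log('/tmp/cpu_alert.log')
"""


def build_task(task_id: str, database: str, retention_policy: str) -> dict:
    """Build the JSON body that defines a stream task (field names to be verified)."""
    return {
        "id": task_id,
        "type": "stream",
        "dbrps": [{"db": database, "rp": retention_policy}],
        "script": TICKSCRIPT,
        "status": "enabled",
    }


# No network call is made here; the body is only printed for inspection.
body = json.dumps(build_task("cpu_alert", "telegraf", "autogen"), indent=2)
print(body)
```

Because the alert is plain data sent over HTTP, a developer can create, change, or remove it at runtime from their own tooling instead of asking an operations team to edit a config file.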
Servers/Containers/VMs are not pets
1. They don’t have names because they come and go based
on load and needs.
2. You can’t look at cute server pictures on Instagram. Yet...
3. A server has labels.
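A minimal sketch of addressing servers by label instead of by name. The `Server` type, the label keys (`role`, `env`, `zone`), and the instance IDs are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Server:
    instance_id: str        # opaque, auto-generated, not a pet name
    labels: Dict[str, str]  # what the server is for and where it runs


def select(fleet: List[Server], **wanted: str) -> List[Server]:
    """Return every server whose labels match all the requested key/value pairs."""
    return [
        server
        for server in fleet
        if all(server.labels.get(key) == value for key, value in wanted.items())
    ]


fleet = [
    Server("i-0a1", {"role": "api", "env": "prod", "zone": "eu-1"}),
    Server("i-0b2", {"role": "api", "env": "prod", "zone": "eu-2"}),
    Server("i-0c3", {"role": "db", "env": "prod", "zone": "eu-1"}),
    Server("i-0d4", {"role": "api", "env": "staging", "zone": "eu-1"}),
]

prod_api = select(fleet, role="api", env="prod")
print([server.instance_id for server in prod_api])  # ['i-0a1', 'i-0b2']
```

Dashboards and alerts written against `role=api, env=prod` keep working when an instance is replaced, because the replacement carries the same labels under a different ID.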
Servers/Containers/VMs are not pets
Write tools that help you replace servers and processes,
or use available projects like AWS Auto Scaling groups,
Kubernetes and so on.
DevOps is an attitude, not a role that you hire for. DevOps people
are developers passionate about server automation and related
stuff.
Wrap up!
● Monitoring a distributed system is hard and you need to
correlate all the things
● OpenTracing and distributed tracing
● Keep people in the loop and give ownership of production
● DevOps is an attitude
● Servers/Containers/Processes are not pets
● Application state and events
● Listen to your system and have fun