OSDC 2018 - Distributed monitoring

@gianarb - gianluca@influxdb.com
Distributed monitoring
How to understand the chaos

Who Am I?
● Software engineer, passionate about almost everything; at the
moment I work with Golang
● Open Source developer, Docker Captain and CNCF
Ambassador
● Site Reliability Engineer at InfluxData
● Speaker, blogger (gianarb.it) and so on...
● I love to travel, I grow my vegetables and I cook from time to time

github.com/influxdata

What are you trying to say?
Microservices are not “The Distributed
System”

A queue system can be distributed too...

A multi-threaded application is a
distributed system

The more “it is distributed” across servers,
worlds and cloud providers, the more
complex it is...

Containers, Docker, Kubernetes and cloud
computing accelerated application
distribution...

Did you migrate to a distributed mess
without a real need?
Blame yourself...

A request rises and falls across multiple
applications before it gets back to the user.
This is complexity.

Consequences
Logs are not easy to follow when
they come from distributed applications:
they are not a single stream.

Consequences
Events and metrics
need to be correlated.

Distributed Tracing
Tracing is a way to correlate
logs using a set of IDs
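
To make the idea concrete, here is a minimal sketch of correlating interleaved log lines by a shared ID. The LogEntry type, its fields and the sample messages are hypothetical and only illustrate the principle; a real tracer also records timing and parent/child relations.

```go
package main

import (
	"fmt"
	"sort"
)

// LogEntry is a hypothetical structured log line. TraceID is the ID that
// every service attaches to the logs it writes for one request.
type LogEntry struct {
	TraceID string
	Service string
	Message string
}

// GroupByTrace rebuilds one ordered stream per request out of log lines
// that arrive interleaved from several services.
func GroupByTrace(entries []LogEntry) map[string][]LogEntry {
	byTrace := make(map[string][]LogEntry)
	for _, e := range entries {
		byTrace[e.TraceID] = append(byTrace[e.TraceID], e)
	}
	return byTrace
}

func main() {
	logs := []LogEntry{
		{"t1", "gateway", "request received"},
		{"t2", "gateway", "request received"},
		{"t1", "users", "user created"},
		{"t1", "mailer", "email sent"},
	}
	grouped := GroupByTrace(logs)

	// Sort the IDs so the output order is stable.
	ids := make([]string, 0, len(grouped))
	for id := range grouped {
		ids = append(ids, id)
	}
	sort.Strings(ids)

	for _, id := range ids {
		fmt.Println(id)
		for _, e := range grouped[id] {
			fmt.Printf("  %s: %s\n", e.Service, e.Message)
		}
	}
}
```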

Challenges
We write applications in many different
languages

Challenges
Across different teams

Challenges
In the end, to build a trace we need to
agree on the same protocol, no matter
the language or the team.
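
A rough sketch of what "agreeing on a protocol" means in practice: every service writes and reads the same fields when a request crosses a process boundary. The header names X-Trace-Id and X-Span-Id, the SpanContext type and the URL below are assumptions made up for this example, not the headers of any specific tracer.

```go
package main

import (
	"fmt"
	"net/http"
)

// Hypothetical header names. What matters is that every team and every
// language uses the same ones.
const (
	traceHeader = "X-Trace-Id"
	spanHeader  = "X-Span-Id"
)

// SpanContext is the minimal state that has to travel with a request.
type SpanContext struct {
	TraceID string
	SpanID  string
}

// Inject writes the context into the headers of an outgoing request.
func Inject(sc SpanContext, h http.Header) {
	h.Set(traceHeader, sc.TraceID)
	h.Set(spanHeader, sc.SpanID)
}

// Extract reads the context from an incoming request. The boolean is
// false when the caller did not send a trace ID.
func Extract(h http.Header) (SpanContext, bool) {
	sc := SpanContext{TraceID: h.Get(traceHeader), SpanID: h.Get(spanHeader)}
	return sc, sc.TraceID != ""
}

func main() {
	// The request is only built, never sent.
	out, err := http.NewRequest(http.MethodGet, "http://users.internal/create", nil)
	if err != nil {
		panic(err)
	}
	Inject(SpanContext{TraceID: "a4hs45hs46jd56j4s", SpanID: "1"}, out.Header)

	// The receiving service would run Extract on the incoming headers.
	sc, ok := Extract(out.Header)
	fmt.Println(sc.TraceID, sc.SpanID, ok)
}
```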

Distributed Tracing
OpenTracing is a standard sponsored by the Cloud Native
Computing Foundation (CNCF), developed to agree on
common rules. It provides libraries across languages, and you
can use many tracers, both open source and as a service.

© 2017 InfluxData. All rights reserved.
[Diagram: OpenTracing API. Labels: application logic, µ-service frameworks, Lambda functions, RPC & control-flow frameworks, existing instrumentation, tracing infrastructure, main(), Instana, Jaeger, microservice process]

More than 2 years old!
Tracer implementations: Zipkin, Jaeger, LightStep, SkyWalking, AWS X-Ray...
All sorts of companies use OpenTracing.

Rapidly growing OSS and vendor adoption
JDBI, Java Webservlet

High Cardinality
Traces contain a lot of information and
are indexed by request ID (called
trace_id). They are expensive to store.

Distributed Tracing
Luckily, traces don’t have a long lifecycle. Usually, you use
them to debug a problem that happened almost in real time or in
a short time window.

Distributed Tracing
We set one week as the retention policy: after 7 days we
downsample and remove the original traces.
We also downsample them based on how many requests we
are receiving for a specific API call.
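
One simple way to downsample per API call is to keep one trace out of every N for busy operations. The sketch below is an assumption about how such a sampler could look, not the one used in production; RateSampler and the operation names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// RateSampler keeps one trace out of every N for each operation name.
// Operations without a configured rate are always kept.
type RateSampler struct {
	mu    sync.Mutex
	every map[string]int // operation name -> keep 1 trace in N
	seen  map[string]int // operation name -> requests seen so far
}

// NewRateSampler builds a sampler from a per-operation rate table.
func NewRateSampler(every map[string]int) *RateSampler {
	return &RateSampler{every: every, seen: make(map[string]int)}
}

// Keep reports whether the trace of the current request should be stored.
func (s *RateSampler) Keep(operation string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()

	n, ok := s.every[operation]
	if !ok || n <= 1 {
		return true
	}
	s.seen[operation]++
	// Keeps request 1, N+1, 2N+1, ...
	return s.seen[operation]%n == 1
}

func main() {
	// A busy read endpoint is sampled; the rare write endpoint is not.
	s := NewRateSampler(map[string]int{"api.list_users": 10})

	kept := 0
	for i := 0; i < 100; i++ {
		if s.Keep("api.list_users") {
			kept++
		}
	}
	fmt.Println("api.list_users traces kept out of 100:", kept)
	fmt.Println("api.create_user kept:", s.Keep("api.create_user"))
}
```

A fixed rate is the easiest policy to reason about. A sampler driven by the current request volume would follow the same Keep interface.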

import (
	"fmt"
	"log"
	opentracing "github.com/opentracing/opentracing-go"
	zipkin "github.com/openzipkin/zipkin-go-opentracing"
)
collector, err := zipkin.NewHTTPCollector(tracingConf.ZipkinEndpoint)
if err != nil {
	log.Fatal(err)
}
recorder := zipkin.NewRecorder(collector, false,
	fmt.Sprintf("0.0.0.0:%d", tracingConf.Port), "restapi")
tracer, err := zipkin.NewTracer(
	recorder,
	zipkin.ClientServerSameSpan(false),
	zipkin.TraceID128Bit(false),
)
if err != nil {
	log.Fatal(err)
}
// Register the tracer globally so other packages can retrieve it.
opentracing.SetGlobalTracer(tracer)

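import opentracing "github.com/opentracing/opentracing-go"
// Retrieve the tracer registered at startup and open a span for this operation.
tracer := opentracing.GlobalTracer()
sp := tracer.StartSpan("api.create_user")
// Finish marks the end of the span so the tracer can record it.
defer sp.Finish()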

opentracing.io

Distributed Tracing

Distributed Tracing - Collect traces via Telegraf

No UI at the moment. :(
SELECT * FROM zipkin WHERE time < now() - 1h
AND trace_id = 'a4hs45hs46jd56j4s'

The process of understanding
1. Instrument
2. Observe
3. Aggregate and sample
4. Take action (via alerts or whatever)

How do people and teams play this game?
They should deploy their applications.

They should be on call and take care
of the production behavior of their
applications

They can write a “presentation” of their
service (a doc): critical metrics, capacity
planning (cpu, ram, disk intensive app), service
location in the system (close to other apps, ssd)

Keep everyone in the loop and
responsible for the real traffic.
There is no fun in writing code without
running it in production!

Every tool we develop exposes APIs
that developers can use.
E.g. create runtime alerts with Kapacitor.
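
As a hedged illustration of a developer creating an alert at runtime, the sketch below builds (but does not send) an HTTP request that defines a Kapacitor task. The endpoint path, the JSON field names and the port reflect my understanding of the Kapacitor HTTP API and should be checked against its documentation; the task ID, database, measurement and threshold are made up.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// DBRP is a database and retention policy pair the task reads from.
type DBRP struct {
	DB string `json:"db"`
	RP string `json:"rp"`
}

// Task is the JSON payload that describes an alerting task.
type Task struct {
	ID     string `json:"id"`
	Type   string `json:"type"`
	DBRPs  []DBRP `json:"dbrps"`
	Script string `json:"script"`
	Status string `json:"status"`
}

// cpuAlert is an example TICKscript; measurement, field and threshold are made up.
const cpuAlert = `stream
    |from()
        .measurement('cpu')
    |alert()
        .crit(lambda: "usage_idle" < 10)
        .log('/tmp/cpu_alert.log')
`

// NewTaskRequest builds the request that defines the task. The path is an
// assumption based on the Kapacitor HTTP API.
func NewTaskRequest(baseURL string, t Task) (*http.Request, error) {
	body, err := json.Marshal(t)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/kapacitor/v1/tasks", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := NewTaskRequest("http://localhost:9092", Task{
		ID:     "cpu_alert",
		Type:   "stream",
		DBRPs:  []DBRP{{DB: "telegraf", RP: "autogen"}},
		Script: cpuAlert,
		Status: "enabled",
	})
	if err != nil {
		panic(err)
	}
	// Only print the request here; http.DefaultClient.Do(req) would send it.
	fmt.Println(req.Method, req.URL.String())
}
```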

Servers/Containers/VMs are not pets
1. They don’t have names because they come and go based
on load and needs.
2. You can’t look at cute server pictures on Instagram. Yet...
3. A server has labels.
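
A small sketch of what "labels instead of names" looks like when you query your fleet: you ask for every server matching a set of labels, and you never care which individual machine answers. The Server type, the IDs and the label keys are hypothetical.

```go
package main

import "fmt"

// Server is identified by a throwaway ID and described by labels.
type Server struct {
	ID     string
	Labels map[string]string
}

// Select returns the servers whose labels match every key/value pair
// in the selector.
func Select(servers []Server, selector map[string]string) []Server {
	var matched []Server
	for _, s := range servers {
		ok := true
		for k, v := range selector {
			if s.Labels[k] != v {
				ok = false
				break
			}
		}
		if ok {
			matched = append(matched, s)
		}
	}
	return matched
}

func main() {
	fleet := []Server{
		{ID: "i-01", Labels: map[string]string{"role": "api", "region": "eu"}},
		{ID: "i-02", Labels: map[string]string{"role": "api", "region": "us"}},
		{ID: "i-03", Labels: map[string]string{"role": "db", "region": "eu"}},
	}
	for _, s := range Select(fleet, map[string]string{"role": "api", "region": "eu"}) {
		fmt.Println(s.ID)
	}
}
```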

Servers/Containers/VMs are not pets
Write tools that help you replace servers and processes,
or use available projects like AWS Auto Scaling groups,
Kubernetes and so on.
DevOps is an attitude, not a role that you hire. DevOps people are
developers passionate about server automation and related
stuff.

We care about state and events
1. User created
2. Invoice generation
3. Email sent
4. Purchase
5. Whatever...

We care about data!
But data is a whole other
topic!

Back to tracing - the cost of a retry

Distributed Tracing - the cost of a retry

Wrap up!
● Monitoring a distributed system is hard and you need to
correlate all the things
● OpenTracing and distributed tracing
● Keep people in the loop and give them ownership of production
● DevOps is an attitude
● Servers/Containers/Processes are not pets
● Application state and events
● Listen to your system and have fun

Collecting data is just the
beginning
Aggregation, alerting and downsampling are other
important steps to answer a question
