Systems Monitoring with Prometheus (Devops Ireland April 2015)

Systems Monitoring with Prometheus
Devops Ireland, April 2015
Brian Brazil
Senior Software Engineer
Boxever

What is monitoring?
Host-based checks?
• Typically shell scripts with success/fail
• Failure causes alerts
• More blackbox than whitebox
• About machines, not services

Brian’s Pet Peeve #1
Thinking in terms of machines rather than services.
It’s the future: it’s not the “Webserver machine”, it’s one
machine that happens to run a Webserver as part of the
Webserver service.

What is monitoring?
Highly granular information about a subsystem?
• Tends to be focused on one subsystem, such as
incoming http requests
• No visibility into the rest of the system

What is monitoring?
High frequency high granularity profiling?
• Great for debugging once you’ve narrowed down the
problem
• Not so useful for general monitoring

What is monitoring?
Tailing logs?
• Easy to miss something
• Tend to get very noisy
• We have computers, why are humans doing repetitive
tasks?

Step Back
Why do we want monitoring?

Why: Alerting
We want to know when things go wrong
We want to know when things aren’t quite right
We want to know in advance of problems

Why: Debugging
When something is up, you need to debug.
You want to start from a high-level problem and drill down to
what’s causing it. You need to be able to reason about things.
Sometimes you want to go from code back to metrics.

Instrumentation that you need to read the code to
understand.
Make the names such that a random person not intimately
familiar with the system would have a good chance at
guessing what it means. Specify your units.

Why: Trending
How the various bits of a system are being used.
For example, how many static requests per dynamic
request? How many sessions active at once? How many hit
a certain corner case?
For some stats, also want to know how they change over
time for capacity planning and design discussions.

A different approach
What if we instrumented everything?
• RPCs
• Interfaces between subsystems
• Business logic
• Every time you’d log something
What if we monitored systems and subsystems to know
how everything is generally doing?

That’s a lot of metrics
That could be tens of thousands of codepoints across an
entire system.
You’d need some way to make it easy to instrument all
code, not just the externally facing parts of applications.
You’d need something able to handle a million time series.

Presenting Prometheus
An open-source service monitoring system and time series
database.
Started in 2012, primarily developed at SoundCloud, with
committers also at Boxever and Docker.
Publicly announced January 2015, many contributions and
users since then.

The Server
• Can handle over a million time series in one instance
• No dependencies such as HBase or Cassandra
• Stores data on local disk
• Written in Go
• Easy to run

Data Model
Tired of aggregating and alerting off metrics like
http.responses.500.myserver.mydc.production?
Time series have structured key-value pairs, e.g.
http_responses_total{
  response_code="500",instance="myserver",
  dc="mydc",env="production"}
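A minimal sketch of rendering such a labelled series, assuming the escaping rules of the Prometheus text format for label values (backslash, double quote and newline). The helper names escape_label_value and format_series are hypothetical and not part of any client library.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline in a label value."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def format_series(name: str, labels: dict) -> str:
    """Render a metric name and its labels as name{key="value",...}."""
    pairs = ",".join(
        f'{key}="{escape_label_value(str(value))}"'
        for key, value in labels.items()
    )
    return f"{name}{{{pairs}}}"


series = format_series(
    "http_responses_total",
    {"response_code": "500", "instance": "myserver", "dc": "mydc", "env": "production"},
)
print(series)
```

Because the labels stay structured until the last moment, a separator character inside a value cannot be confused with the structure itself.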

Munging structured data in a way that loses the structure
Is it so much to ask for some escaping, or at least sanitizing
any separators in the data?

Query Language
Aggregation based on the key-value labels
Arbitrarily complex math
And all of this can be used in pre-computed rules and alerts

Query Language: Example
Column families with the 10 highest read rates per second
topk(10,
  sum by(job, keyspace, columnfamily) (
    rate(cassandra_columnfamily_readlatency[5m])
  )
)
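To make the three steps of that query concrete, here is a rough Python sketch of the same shape of computation: a per-second rate per series, a sum grouped by labels, then the k largest groups. It is a simplification, not how the server evaluates queries: the real rate() also handles counter resets and extrapolates to the edges of the window. The function names and the sample data are made up for illustration.

```python
from collections import defaultdict


def simple_rate(samples):
    """Per-second increase between the first and last sample in the window."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def topk_sum_by(series, group_labels, k):
    """Sum per-series rates by the given labels and keep the k largest groups."""
    totals = defaultdict(float)
    for labels, samples in series:
        key = tuple(labels[name] for name in group_labels)
        totals[key] += simple_rate(samples)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:k]


# Made-up samples: (timestamp in seconds, counter value) over a 5 minute window.
series = [
    ({"job": "cassandra", "keyspace": "ks1", "columnfamily": "users", "instance": "a"},
     [(0, 100.0), (300, 400.0)]),
    ({"job": "cassandra", "keyspace": "ks1", "columnfamily": "users", "instance": "b"},
     [(0, 50.0), (300, 350.0)]),
    ({"job": "cassandra", "keyspace": "ks1", "columnfamily": "events", "instance": "a"},
     [(0, 0.0), (300, 150.0)]),
]

result = topk_sum_by(series, ("job", "keyspace", "columnfamily"), k=10)
for key, value in result:
    print(key, value)
```

The instance label is dropped by the grouping, so the two "users" series are added together before ranking.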

Client Libraries
How you instrument your code
• Official: Python, Java, Go, Ruby
• Unofficial: Bash, Node.js, .NET
Text-based format, easy to write new client libraries

Client Libraries: Example
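from prometheus_client import Summary

# Summary of request durations, observed in seconds.
REQUEST_LATENCY = Summary("request_latency_seconds", "Request latency")

@REQUEST_LATENCY.time()
def process_request():
    pass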
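For readers curious what timing a function into a summary involves, the following is a standard-library-only sketch. SimpleSummary is a toy stand-in written for this illustration, not the client library's implementation: it only keeps a count and a sum of observed durations and is not thread-safe.

```python
import functools
from time import perf_counter


class SimpleSummary:
    """Toy stand-in for a summary metric: a count and a sum of observations."""

    def __init__(self, name, documentation):
        self.name = name
        self.documentation = documentation
        self.count = 0
        self.sum = 0.0

    def observe(self, amount):
        """Record one observation, for example a duration in seconds."""
        self.count += 1
        self.sum += amount

    def time(self):
        """Return a decorator that observes each call's duration in seconds."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                start = perf_counter()
                try:
                    return func(*args, **kwargs)
                finally:
                    # Observed even if the wrapped function raises.
                    self.observe(perf_counter() - start)
            return wrapper
        return decorator


REQUEST_LATENCY = SimpleSummary("request_latency_seconds", "Request latency")


@REQUEST_LATENCY.time()
def process_request():
    pass


process_request()
process_request()
print(REQUEST_LATENCY.count, REQUEST_LATENCY.sum)
```

Exposing only the running count and sum leaves the averaging to whoever reads the data, which fits the deck's later point about doing the math in the server.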

Client Libraries: In and Out
Client libraries don’t tie you to Prometheus instrumentation
Custom collectors allow pulling data from other
instrumentation systems into Prometheus client library
Similarly, can pull data out of client library and expose as
you wish

Client Libraries: What to Instrument
Everything

Things to have
• Client and server qps/errors/latency
• Every log message should be a metric
• Every failure should be a metric
• Threadpool/queue size, in progress, latency
• Business logic inputs and outputs
• Data sizes in/out
• Process cpu/ram/language internals (e.g. GC)
• Blackbox and end-to-end monitoring

Batch/Offline Processing Metrics
• Last time it succeeded
• Records processed/throughput
• Duration of major batch stages
• Heartbeats for end-to-end testing
Use the PushGateway for ephemeral jobs
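As a sketch of the batch metrics listed above, the following records the records processed, the duration and the time of last success for one run, then renders them as "name value" lines. The metric names and the run_batch and to_text_format helpers are hypothetical, and actually sending the payload to a PushGateway is left out.

```python
import time


def run_batch(records, process, clock=time.time):
    """Run process() over records and return the metrics worth exposing."""
    start = clock()
    processed = 0
    for record in records:
        process(record)
        processed += 1
    end = clock()
    return {
        "my_batch_job_records_processed": float(processed),
        "my_batch_job_duration_seconds": end - start,
        # Only reached if no exception was raised above.
        "my_batch_job_last_success_timestamp_seconds": end,
    }


def to_text_format(metrics):
    """Render metrics as 'name value' lines, one sample per line."""
    return "".join(f"{name} {value}\n" for name, value in metrics.items())


metrics = run_batch([1, 2, 3], process=lambda record: None)
payload = to_text_format(metrics)
print(payload)
```

The last-success timestamp is only set when the run completes, so a job that keeps failing shows up as a timestamp that stops advancing.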

Wrapping instrumentation libraries to make them “simpler”
Tend to confuse abstractions, encourage bad practices and
make it difficult to write correct and usable instrumentation
e.g. Prometheus values are doubles, if you only allow ints
then the end user has to do math to convert back to seconds

Speaking of Correct Instrumentation
It’s better to have math done in the server, not the client
Many instrumentation systems use exponentially decaying averages
Do you really want to do calculus during an outage?
Prometheus has monotonic counters
Races and missed scrapes don’t lose data
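A small sketch of why monotonic counters tolerate gaps: the total increase is recovered from whichever samples were collected, and a drop in value is treated as a restart from zero. This is a simplified illustration with made-up sample values, not the server's actual implementation.

```python
def increase(samples):
    """Total increase across counter samples, treating any drop as a restart."""
    total = 0.0
    previous = samples[0]
    for current in samples[1:]:
        if current >= previous:
            total += current - previous
        else:
            # The counter restarted from zero, so everything seen is new.
            total += current
        previous = current
    return total


every_scrape = [0.0, 10.0, 25.0, 40.0, 60.0]
missed_two = [0.0, 40.0, 60.0]  # the scrapes that saw 10 and 25 were missed
with_restart = [0.0, 10.0, 25.0, 5.0, 20.0]  # process restarted after 25

print(increase(every_scrape), increase(missed_two), increase(with_restart))
```

Both the complete and the gappy sample lists give the same total, whereas a value that is reset every time it is read would have lost the missed intervals.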

Integrations
More powerful data model needs integrations and
instrumentation written to take advantage of that
Machine (Node Exporter), HAProxy, CloudWatch, Statsd,
Collectd, JMX, Mesos, Consul, MySQL
Direct instrumentation: cadvisor, etcd

Dashboards
• Promdash: Ruby on Rails web app
• Console templates: More power for those who like
checking things in
• Expression browser: Ad-hoc queries
• JSON interface: Roll your own

Dashboards
What to put on dashboards?

Dashboards
Goal: Make it easy to logically drill down a problem
Most services are a graph.
Make it easy to go from a console about one service, see
which of its backends is the problem, and repeat.

Dashboard anti-patterns:
• Graph of a hundred plots
• Page of a thousand graphs
• Consoles that their creator can barely understand
• “Put it on a console somewhere”

Dashboard Guidelines
• Don’t put every possible metric on the dashboard
• Focus on the top few metrics, based on the most likely
failure modes and things you’ll want to know
• No more than 5 graphs per console, 5 plots per graph
• Have units, y-labels, legends and descriptions
• Split out or trim if dashboards are getting too complex
• It’s difficult for a dashboard to serve two masters
• Dashboards are not for alerting

Alerting
Alertmanager aggregates alerts from Prometheus servers
Supports notifications to PagerDuty, Email, Pushover
Best practices:
• Alert on symptoms not causes
• Have a way to deal with non-critical alerts

What to Alert On
Online Serving: Overall latency, errors
Offline Processing: Propagation/Processing Delay
Batch Jobs: When it last succeeded
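Two of these conditions can be sketched as plain predicates over values the monitoring system already has. The function names and thresholds here are hypothetical examples, not recommended values, and in practice such conditions would be written as alerting rules in the query language rather than in application code.

```python
def error_ratio_too_high(errors_per_second, requests_per_second, threshold=0.01):
    """Symptom: too large a share of requests are failing."""
    if requests_per_second == 0:
        return False
    return errors_per_second / requests_per_second > threshold


def batch_job_stale(last_success_timestamp, now, max_age_seconds):
    """Symptom: the batch job has not succeeded recently enough."""
    return now - last_success_timestamp > max_age_seconds


print(error_ratio_too_high(5.0, 100.0))
print(batch_job_stale(last_success_timestamp=1000.0, now=5000.0, max_age_seconds=3600.0))
```

Both checks describe what users or downstream consumers would notice, not which component caused it.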

The Live Demo
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work

The Future
Many features on roadmap:
• Service discovery
• Federation
• Long term storage
• HA Alertmanager
• More exporters, client libraries and integrations

Final Word
Systems Monitoring is your first port of call in an
emergency, keep it working without needing lots of effort.
Prometheus is awesome - that can lead to non-critical data
taking lots of management, crowding out critical metrics.
At some point, have to move non-critical things to generic
data processing system.

Finaler Word
How do you know your monitoring system is good?
• When you have superb monitoring for everything?
• When it causes a HDD to fail?
• When it finds two bugs in Go’s DNS library?
No, when the unit tests find a bug in your filesystem!

Try it out!
http://www.boxever.com/tag/monitoring has step-by-step instructions on
monitoring:
• Machines
• Cassandra
• HAProxy
• Python Batch host-based jobs
• Java applications
Problems?
We’re on #prometheus on Freenode

More Information
http://prometheus.io
http://www.boxever.com/tag/monitoring
Python Ireland, April 8th
SREcon15 Europe, May 14-15th
