3. What is monitoring?
• Host-based checks
• High frequency information about a few key metrics
• High frequency high granularity profiling
• Tailing logs
5. Why: Alerting
We want to know when things go wrong
We want to know when things aren’t quite right
We want to know in advance of problems
6. Why: Debugging
When something is up, you need to debug.
You want to go from high-level problem, and drill down to
what’s causing it. Need to be able to reason about things.
Sometimes want to go from code back to metrics.
7. Why: Trending
How the various bits of a system are being used.
For example, how many static requests per dynamic
request? How many sessions active at once? How many hit
a certain corner case?
For some stats, also want to know how they change over
time for capacity planning and design discussions.
8. A different approach
What if we instrumented everything?
• RPCs
• Interfaces between subsystems
• Business logic
• Every time you’d log something
9. A different approach
What if we monitored systems and subsystems to know
how everything is generally doing?
What if each developer didn’t have to add instrumentation
themselves - what if every library came with it built-in?
Could focus on developing, and still get good metrics!
10. A different approach
Some things to monitor:
● Client and server qps/errors/latency
● Every log message should be a metric
● Every failure should be a metric
● Threadpool/queue size, in progress, latency
● Business logic inputs and outputs
● Data sizes in/out
● Process cpu/ram/language internals (e.g. GC)
● Blackbox and end-to-end monitoring/heartbeats
● Batch job: last success time, duration, records processed
11. That’s a lot of metrics
That could be tens of thousands of codepoints across an
entire system.
You’d need some way to make it easy to instrument all
code, not just the externally facing parts of applications.
You’d need something able to handle a million time series.
12. Presenting Prometheus
An open-source service monitoring system and time series
database.
Started in 2012, primarily developed in Soundcloud with
committers also in Boxever and Docker.
Publicly announced January 2015, many contributions and
users since then.
14. Presenting Prometheus
• Client libraries that make instrumentation easy
• Support for many languages: Python, Java, Go, Ruby…
• Standalone server
• Can handle over a million time series in one instance
• No network dependencies
• Written in Go, easy to run
• Integrations
• Machine, HAProxy, CloudWatch, Statsd, Collectd, JMX, Mesos,
Consul, MySQL, cadvisor, etcd, django, elasticsearch...
15. Presenting Prometheus
• Dashboards
• Promdash: Ruby on Rails web app
• Console templates: More power for those who like checking things in
• Expression browser: Ad-hoc queries
• JSON interface: Roll your own
• Alerts
• Supports Pagerduty, Email, Pushover
17. Let’s Talk Python
First version of client hacked together in October 2014 in
an hour, mostly spent playing with meta-programming.
First official version 0.0.1 released February 2015.
Version 0.0.8 released April 2015.
19. The Basics
Two fundamental data types.
Counter: It only goes up (and resets), counts something
Gauge: It goes up and down, snapshot of state
20. Flow with your code
Instrumentation should be an integral part of your code,
similar to logging.
Don’t segregate out instrumentation to a separate class, file
or module - have it everywhere.
Instrumentation that makes this easy helps.
21. Counting exceptions in a method
from instrumentation import *
EX = 0
metrics.register(Counter.create(
‘method_ex’, lambda: EX))
def my_method():
try:
pass # Your code here
except:
global EX
EX += 1
raise
22. Counting exceptions: Prometheus
from prometheus_client import Counter
EX = Counter(
‘mymethod_exceptions_total’, 'Exceptions in mymethod’)
@EX.count_exceptions()
def my_method():
pass
23. Brian’s Pet Peeve #1
Wrapping instrumentation libraries to make them “simpler”
Tend to confuse abstractions, encourage bad practices and
make it difficult to write correct and useable instrumentation
e.g. Prometheus values are doubles, if you only allow ints
then end user has to do math to convert back to seconds
24. Speaking of Correct Instrumentation
It’s better to have math done in the server, not the client
Many instrumentation systems are exponentially decaying
Do you really want to do calculus during an outage?
Prometheus has monotonic counters
Races and missed scrapers don’t lose data
25. Counting exceptions: Context Manager
from prometheus_client import Counter
EX = Counter(
‘method_exceptions’, 'Exceptions in my method’)
def my_method():
with EX.count_exceptions():
pass
26. Decorator and Context Manager
In Python 3 have contextlib.ContextDecorator.
contextdecorator on PyPi for Python 2 - but couldn’t get it to
work.
Ended up hand coding it, an object that supports
__enter__, __exit__ and __call__.
27. Counter Basics
requests = Counter(
‘requests_total’,
‘Total number of requests’)
requests.inc()
requests.inc(42)
28. Brian’s Pet Peeve #2
Instrumentation that you need to read the code to
understand
e.g. “Total number of requests” - what type of request?
Make the names such that a random person not intimately
familiar with the system would have a good chance at
guessing what it means. Specify your units.
29. Gauge Basics
INPROGRESS = Gauge(
‘http_requests_inprogress’,
‘Total number of HTTP requests ongoing’)
def my_method:
INPROGRESS.inc()
try:
pass # Your code here
finally:
INPROGRESS.dec()
30. Gauge Basics: Convenience all the way
INPROGRESS = Gauge(
‘inprogress_requests’,
‘Total number of requests ongoing’)
@INPROGRESS.track_inprogress()
def my_method:
pass # Your code here
31. More Gauges
Many other ways to use a Gauge:
MYGAUGE.set(42)
MYGAUGE.set_to_current_time()
MYGAUGE.set_function(lambda: len(some_dict))
32. What about time?
Useful to measure how long things take.
Two options in Prometheus: Summary and Histogram.
Summary is cheap and simple.
Histogram can be expensive and is more granular.
33. Time a method
LATENCY = Summary(‘request_latency_seconds’,
‘Request latency in seconds’)
@LATENCY.time()
def process_request():
pass
Histogram is the same. There’s also a context manager.
34. How to get the data out: Summary
Summary is two counters, one for the number of requests
and the other for the amount of time spent.
Calculating rate(), aggregate and divide to get latency.
Not limited to time, can track e.g. bytes sent or objects
processed using observe() method.
35. How to get the data out: Histogram
Histogram is counter per bucket (plus Summary counters).
Get rate()s of buckets, aggregate and
histogram_quantile() will estimate the quantile.
Timeseries per bucket can add up fast.
37. Python 3 support
Simple stuff:
try:
from BaseHTTPServer import BaseHTTPRequestHandler
except ImportError:
from http.server import BaseHTTPRequestHandler
iter vs. iteritems
% vs. format
38. Python 3 support: Unicode
from __future__ import unicode_literals
Use b‘’ for raw byte literals
unicode_literals breaks __all__ on Python 2.x,
munge with encode(‘ascii`)
unicode = str for Python 3
39. Data Model
Tired of aggregating and alerting off metrics like http.
responses.500.myserver.mydc.production?
Time series have structured key-value pairs, e.g.
http_responses_total{
response_code=”500”,instance=”myserver”,
dc=”mydc”,env=”production”}
40. Brian’s Pet Peeve #3
Munging structured data in a way that loses the structure
Is it so much to ask for some escaping, or at least sanitizing
any separators in the data?
41. Labels
For any metric:
LATENCY = Summary(‘request_bytes_sent’,
‘Request bytes sent’, labels=[‘method’])
LATENCY.labels(“GET”).observe(42)
Don’t go overboard!
42. Getting The Data Out
from prometheus_client import start_http_server
start_http_server(8000)
Easy to produce output for e.g. Django.
Can also use write_to_textfile() with Node Exporter
Textfile Collector for machine-level cronjobs!
43. Query Language
Aggregation based on the key-value labels
Arbitrarily complex math
And all of this can be used in pre-computed rules and alerts
44. Query Language: Example
Column families with the 10 highest read rates per second
topk(10,
sum by(job, keyspace, columnfamily) (
rate(cassandra_columnfamily_readlatency[5m])
)
)
45. The Live Demo
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work please work
please work please work please work please work please work please work please work please work please work please work
46. Client Libraries: In and Out
Client libraries don’t tie you to Prometheus instrumentation
Custom collectors allow pulling data from other
instrumentation systems into Prometheus client library
Similarly, can pull data out of client library and expose as
you wish