The hitchhiker’s guide to Prometheus

Remco Overdijk
"A Metric, The Hitchhiker's Guide to Prometheus says, is
about the most massively useful thing someone doing
Monitoring can have. It has great practical value. You can
wave your Metric in emergencies as a distress signal, and
produce pretty Graphs at the same time."

1. The Landscape
What are we running and why?
2. Core Concepts
How does Prometheus work?
3. Demo Time!
It’s a Tools in Action talk after all, right?
4. Tips & Tricks
Getting the most out of your Prometheus Experience
5. Questions?
I’m probably going to answer “42” to most of them..
So many things to tell, so little time..

• Started out in TES, doing Metrics, Monitoring & Logging.
(Graphite, Statsd, Grafana, Nagios, Logstash, ElasticSearch, Kibana, etc. )
• Currently in DPI, doing CI/CD and bringing Gitlab/Spinnaker to the Cloud.
That requires a lot of monitoring…
• Member of the Cloud9 MML Circle, doing Prometheus
• Core Contributor to the R2D2 module that manages Prometheus and Monitoring/Alerting resources
within Cloud9
• Worked on implementing Prometheus and Grafana, while also using these stacks for monitoring
production systems.
• NightOwl for SRT Platform; I know how pagers work.
Who are you, and why are you telling us this?
Introduction

The Landscape
What are we running?

Data Center VS Cloud
VM’s and Servers VS containers in Kubernetes
               Data Center                      Cloud
Monitoring     Nagios + Thruk + Lookingglass    Prometheus
Metrics        Graphite + Statsd                Prometheus (+ InfluxDB/Thanos)
Alerting       SMS modems in physical servers   AlertManager, Iris, OnCall, Grafana
Visualization  Grafana                          Grafana
Logging        ElasticSearch + Kibana           StackDriver, ElasticSearch + Kibana

•Applications in Kubernetes are much more dynamic than we’re used to.
• No static IP addresses.
• No static number of servers (well, pods actually..)
• Kubernetes can reschedule / relocate pods at will.
• Prometheus uses Service Discovery to find targets.
•Both Nagios and Graphite have scaling issues and are too rigid.
• Prometheus is pull-based instead of push-based and doesn’t require a separate execution for every single check.
• It combines Metrics & Monitoring into a single stack, but focuses on Monitoring.
•Being inspired by Google’s Borgmon, it works out of the box with a lot of Kubernetes /
Cloud native components and the services supporting them.
•StackDriver is not a full-fledged alternative due to features, retention and cost.
Why didn’t you come up with something else?
So, why Prometheus?

•Out of the box, Prometheus also doesn’t scale endlessly without compromises
(But Thanos will)
•Scalability is solved through retention, manual sharding and vertical scaling,
which all have clear drawbacks.
•HA is solved through duplication (Polling twice from independent instances
with individual TSDB’s).
•Prometheus development is very focused, which shows in certain aspects.
Well.. No.
Is this the answer to everything then?

All the pods & services
Infrastructure Overview
[Diagram components: Kubernetes {DEV, STG, PRO} Clusters and Datacenters (YOUR App!, Kubernetes, Exporters); Prometheus (2x); Operator; PushGateway; AlertManager (3x); Grafana; Remote Storage Adapter; InfluxDB; IRIS; OnCall; SMS / Call Provider; HipChat]

Core Concepts
How does it work and what makes it tick?

- Counters
- A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only
increase or be reset to zero on restart. (1, 2, 5, 9, 0, 2, 7)
- Gauges
- A gauge is a metric that represents a single numerical value that can arbitrarily go up and down.
(1, 4, 2, 5, 8)
- Histograms
- A histogram samples observations (usually things like request durations or response sizes) and counts them in
configurable buckets. It also provides a sum of all observed values.
- Summaries
- Similar to a histogram, a summary samples observations (usually things like request durations and response
sizes). While it also provides a total count of observations and a sum of all observed values, it calculates
configurable quantiles over a sliding time window.
- Quantiles are convenient when (for example) expressing the median (0.5-quantile) and the 95th percentile (0.95-quantile).
Supported Types
Making Metrics
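
To make the counter and histogram semantics above concrete, here is a minimal Python sketch. It is not the Prometheus client library: the class `Histogram` and the function `counter_increase` are hypothetical names made up for this illustration. It shows how observations land in cumulative buckets, and how a drop in a counter is treated as a reset to zero.

```python
from bisect import bisect_left


class Histogram:
    """Toy histogram (hypothetical, for illustration): buckets, a sum and a count."""

    def __init__(self, upper_bounds):
        # Upper bounds are inclusive; the +Inf bucket catches everything else.
        self.upper_bounds = sorted(upper_bounds) + [float("inf")]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        # Find the first bucket whose upper bound is >= value.
        index = bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.total += value
        self.count += 1

    def cumulative(self):
        # Buckets are exposed cumulatively: each one includes all smaller ones.
        running, result = 0, {}
        for bound, n in zip(self.upper_bounds, self.bucket_counts):
            running += n
            result[bound] = running
        return result


def counter_increase(samples):
    """Total increase of a counter, treating any drop as a reset to zero."""
    increase, previous = 0, samples[0]
    for current in samples[1:]:
        increase += current if current < previous else current - previous
        previous = current
    return increase


h = Histogram([0.1, 0.5, 1.0])
for duration in (0.05, 0.3, 0.5, 2.0):
    h.observe(duration)
print(h.cumulative())
print(counter_increase([1, 2, 5, 9, 0, 2, 7]))
```

The last line uses the counter example from the slide: the drop from 9 to 0 is read as a restart, so the counts after it are added instead of subtracted.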

- Instead of creating separate checks for every metric that should be monitored for your
application, you expose a single (or multiple..) HTTP Endpoint containing all metrics.
- It’s your responsibility to make this endpoint Available, Fast and Reliable.
- Multiple Frameworks and Libraries can help you provision and maintain such an
endpoint.
- Axle Comes with built-in support for MicroMeter, which does everything for you.
- Backspin support is coming soon™.
- Example: http://localhost:30000/metrics
The concept of Scraping HTTP Metric Endpoints
Exposing Metrics: Push VS Pull
# HELP prometheus_tsdb_head_min_time Minimum time bound of the head block.
# TYPE prometheus_tsdb_head_min_time gauge
prometheus_tsdb_head_min_time 1.5282792e+12
# HELP prometheus_tsdb_head_samples_appended_total Total number of appended samples.
# TYPE prometheus_tsdb_head_samples_appended_total counter
prometheus_tsdb_head_samples_appended_total 2.9485092e+07
# HELP prometheus_tsdb_head_series Total number of series in the head block.
# TYPE prometheus_tsdb_head_series gauge
prometheus_tsdb_head_series 19956
# HELP prometheus_tsdb_head_series_created_total Total number of series created in the head
# TYPE prometheus_tsdb_head_series_created_total gauge
prometheus_tsdb_head_series_created_total 56888
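
The text format above is simple enough to read by hand. As a rough sketch (not the parser Prometheus itself uses), the hypothetical function `parse_exposition` below handles only label-free samples with non-empty `# HELP` and `# TYPE` comments, which is all the excerpt contains.

```python
SAMPLE = """\
# HELP prometheus_tsdb_head_series Total number of series in the head block.
# TYPE prometheus_tsdb_head_series gauge
prometheus_tsdb_head_series 19956
# HELP prometheus_tsdb_head_samples_appended_total Total number of appended samples.
# TYPE prometheus_tsdb_head_samples_appended_total counter
prometheus_tsdb_head_samples_appended_total 2.9485092e+07
"""


def parse_exposition(text):
    """Parse label-free samples plus their HELP and TYPE comments into a dict."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("# HELP ") or line.startswith("# TYPE "):
            # "# HELP <name> <text>" or "# TYPE <name> <type>"
            _, keyword, name, rest = line.split(" ", 3)
            metrics.setdefault(name, {})[keyword.lower()] = rest
        elif not line.startswith("#"):
            # "<name> <value>"; samples with labels are out of scope here.
            name, value = line.split()[:2]
            metrics.setdefault(name, {})["value"] = float(value)
    return metrics


metrics = parse_exposition(SAMPLE)
print(metrics["prometheus_tsdb_head_series"])
```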

- An actual Query Language that looks a lot more like SQL than Graphite.
- You’ll need to learn a new language, but it’s only a single language for creating Graphs and Alerts; for
monitoring and long term metrics.
- Allows for a lot of flexibility, but can be a bit harder to grasp when starting out.
- Supports functions, operators, regex, arithmetic and expressions.
- Four expression types are supported:
- Instant Vectors (like http_requests_total{environment=~"staging|testing|development", method!="GET"})
- Instant vector selectors allow the selection of a set of time series and a single sample value for each at a given timestamp
(instant): in the simplest form, only a metric name is specified. This results in an instant vector containing elements for all time
series that have this metric name.
- Range Vectors (like http_requests_total{job="prometheus"}[5m] )
- Range vector literals work like instant vector literals, except that they select a range of samples back from the current instant.
Syntactically, a range duration is appended in square brackets ([]) at the end of a vector selector to specify how far back in time
values should be fetched for each resulting range vector element.
- Scalars
- Strings
PromQL
Querying Metrics
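
To illustrate what an instant vector selector does, here is a small Python sketch of label matching. The functions `matches` and `select` and the sample series are invented for this example. The sketch assumes regex matchers are fully anchored, and it uses Python's `re` module, whose syntax differs in places from the regex dialect Prometheus uses.

```python
import re


def matches(labels, matchers):
    """Return True if the labels satisfy every (name, operator, value) matcher."""
    for name, op, value in matchers:
        actual = labels.get(name, "")  # a missing label is treated as empty
        if op == "=" and actual != value:
            return False
        if op == "!=" and actual == value:
            return False
        # Anchored regex match, hence fullmatch rather than search.
        if op == "=~" and re.fullmatch(value, actual) is None:
            return False
        if op == "!~" and re.fullmatch(value, actual) is not None:
            return False
    return True


def select(series, metric_name, matchers):
    """Return the (labels, value) pairs that an instant vector selector would keep."""
    return [
        (labels, value)
        for labels, value in series
        if labels.get("__name__") == metric_name and matches(labels, matchers)
    ]


# Hypothetical series: one sample value per label combination.
SERIES = [
    ({"__name__": "http_requests_total", "environment": "staging", "method": "POST"}, 10),
    ({"__name__": "http_requests_total", "environment": "staging", "method": "GET"}, 20),
    ({"__name__": "http_requests_total", "environment": "production", "method": "POST"}, 30),
    ({"__name__": "http_requests_total", "environment": "testing", "method": "PUT"}, 40),
]

# http_requests_total{environment=~"staging|testing|development", method!="GET"}
selected = select(
    SERIES,
    "http_requests_total",
    [("environment", "=~", "staging|testing|development"), ("method", "!=", "GET")],
)
print([value for _, value in selected])
```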

- Custom Resource Type provided by Prometheus-operator
- Abstraction of Prometheus “job” and Service Discovery
- Allows for easy ingestion of new endpoints through their k8s service
- Example:
ServiceMonitors
Getting your endpoint monitored
[Diagram components: YOUR App!, K8s Service, ServiceMonitor, Prometheus Operator, Prometheus]
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    port: https
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: k8s-app
  selector:
    matchLabels:
      k8s-app: node-exporter
---
apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: node-exporter
  name: node-exporter
spec:
  ports:
  - name: https
    port: 9100
    protocol: TCP
    targetPort: https
  selector:
    app: node-exporter
  type: ClusterIP
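
How the two objects find each other can be sketched in a few lines of Python. This is only an illustration of `matchLabels` semantics and of matching endpoint ports by name; the functions `selects` and `matched_port_names` and the dictionaries are hypothetical simplifications, not the operator's actual code.

```python
def selects(match_labels, object_labels):
    """matchLabels semantics: every listed key must be present with an equal value."""
    return all(object_labels.get(key) == value for key, value in match_labels.items())


def matched_port_names(monitor, service):
    """Names of the Service ports a ServiceMonitor would pick up, if it selects the Service."""
    if not selects(monitor["match_labels"], service["labels"]):
        return []
    return [name for name in monitor["endpoint_ports"] if name in service["port_names"]]


# Simplified views of the two objects from the slide.
service_monitor = {"match_labels": {"k8s-app": "node-exporter"}, "endpoint_ports": ["https"]}
node_exporter_service = {"labels": {"k8s-app": "node-exporter"}, "port_names": ["https"]}
unrelated_service = {"labels": {"k8s-app": "grafana"}, "port_names": ["http"]}

print(matched_port_names(service_monitor, node_exporter_service))
print(matched_port_names(service_monitor, unrelated_service))
```

Note that the ServiceMonitor selects the Service by its labels (`k8s-app`), while the Service selects its pods with its own selector (`app`). The two selectors are independent of each other.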

- The same tool you were probably already using.
- The central interface for cloud insights
- Contains a specialized query editor for Prometheus data sources.
- Prometheus currently doesn’t store metrics older than one month for performance reasons.
- Multiple solutions for long term metrics exist, but it’s a work in progress.
Dashboarding with Grafana
Creating Insights
[Diagram components: Prometheus (2x), Grafana, HipChat, Remote Storage Adapter, InfluxDB]

Trouble in Paradise
Creating Alerts, choosing your weapon
WARNINGS – Notifications during working hours
- No direct intervention is required
- Usually picked up by members of the team
developing / maintaining a system.
- Alert delivery is NOT guaranteed.
Use Grafana with HipChat or Email alerts
CRITICALS – 24x7 Text Messages with Escalation
- Actionable events that require immediate attention
by an Engineer on Duty, who does not necessarily
have intimate knowledge of your system.
- Response is required to silence/end the alert.
- Provisioned through RuleList (R2D2 / Operator)
Use AlertManager / Iris / Oncall

Yes, It’s PromQL as well!
Alert Basics
%YAML 1.1
---
kind: PrometheusAlertRule
data:
  test.rules: |
    groups:
    - name: Load
      interval: 30s
      rules:
      - alert: HighLoad
        expr: rate(web_http_responses_total[1m]) > 1
        for: 1m
        labels:
          severity: attention
        annotations:
          description: The rate of HTTP requests is too high.
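
The `for: 1m` clause means the expression has to stay true for a full minute before the alert fires; until then the alert is pending. A minimal Python sketch of that behaviour follows. The function `alert_states` is a hypothetical simplification that ignores details such as resolved-alert retention.

```python
def alert_states(condition_per_eval, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation."""
    states, active_since = [], None
    for i, is_true in enumerate(condition_per_eval):
        now = i * eval_interval_s
        if not is_true:
            # The condition cleared: the pending timer starts over.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        # Fire only once the condition has held for the whole "for" duration.
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# interval: 30s and for: 1m, as in the rule above.
print(alert_states([True, True, True, False, True], eval_interval_s=30, for_s=60))
```

A short spike that clears before the minute is up never pages anyone.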

- Alerts should be actionable: Somebody has to do something, now.
- They should be simple: Someone without intimate knowledge of the system should ideally be
able to solve the alert.
- They should be urgent and require human intervention: No point in waking someone up if they
shouldn’t have to do something, or when tomorrow afternoon would be soon enough.
- Provide accurate descriptions and a playbook where possible.
- Basic system monitoring should be based on SLI/SLO’s rather than infra metrics.
- Prefer AM/Iris/OnCall if you’re serious about your alert.
Creating the perfect alert
Alert Perfection
[Diagram components: Prometheus, AlertManager (3x), Grafana, IRIS, OnCall, SMS / Call Provider, HipChat]

• A long list of exporters is available at https://prometheus.io/docs/instrumenting/exporters/
• A number of these come preconfigured with our Kubernetes clusters and provide additional metrics
When artisanal endpoints don’t cut the cake
Exporters - Additional sources of metrics
Databases
Aerospike exporter
ClickHouse exporter
Consul exporter (official)
CouchDB exporter
ElasticSearch exporter
Memcached exporter (official)
MongoDB exporter
MSSQL server exporter
MySQL server exporter (official)
OpenTSDB Exporter
Oracle DB Exporter
PgBouncer exporter
PostgreSQL exporter
ProxySQL exporter
RavenDB exporter
Redis exporter
RethinkDB exporter
SQL exporter
Tarantool metric library
Hardware related
apcupsd exporter
Collins exporter
IoT Edison exporter
IPMI exporter
knxd exporter
Node/system metrics exporter (official)
Ubiquiti UniFi exporter
Messaging systems
Beanstalkd exporter
Gearman exporter
Kafka exporter
NATS exporter
NSQ exporter
Mirth Connect exporter
MQTT blackbox exporter
RabbitMQ exporter
RabbitMQ Management Plugin exporter
Storage
Ceph exporter
Ceph RADOSGW exporter
Gluster exporter
Hadoop HDFS FSImage exporter
Lustre exporter
ScaleIO exporter
HTTP
Apache exporter
HAProxy exporter (official)
Nginx metric library
Nginx VTS exporter
Passenger exporter
Tinyproxy exporter
Varnish exporter
WebDriver exporter
APIs
AWS ECS exporter
AWS Health exporter
AWS SQS exporter
Cloudflare exporter
DigitalOcean exporter
Docker Cloud exporter
Docker Hub exporter
GitHub exporter
InstaClustr exporter
Mozilla Observatory exporter
OpenWeatherMap exporter
Pagespeed exporter
Rancher exporter
Speedtest exporter
Logging
Fluentd exporter
Google's mtail log data extractor
Grok exporter
Other monitoring systems
Akamai Cloudmonitor exporter
AWS CloudWatch exporter (official)
Cloud Foundry Firehose exporter
Collectd exporter (official)
Google Stackdriver exporter
Graphite exporter (official)
Heka dashboard exporter
Heka exporter
InfluxDB exporter (official)
JavaMelody exporter
JMX exporter (official)
Munin exporter
Nagios / Naemon exporter
New Relic exporter
NRPE exporter
Osquery exporter
Pingdom exporter
scollector exporter
Sensu exporter
SNMP exporter (official)
StatsD exporter (official)
Miscellaneous
Bamboo exporter
BIG-IP exporter
BIND exporter
Bitbucket exporter
Blackbox exporter (official)
BOSH exporter
cAdvisor
Confluence exporter
Dovecot exporter
eBPF exporter
Jenkins exporter
JIRA exporter
Kannel exporter
Kemp LoadBalancer exporter
Meteor JS web framework exporter
Minecraft exporter module
PHP-FPM exporter
PowerDNS exporter
Process exporter
rTorrent exporter
SABnzbd exporter
Script exporter
Shield exporter
SMTP/Maildir MDA blackbox prober
SoftEther exporter
Transmission exporter
Unbound exporter
Xen exporter

• StackDriver Exporter – Get your GCP Project’s native metrics into Prometheus.
• Blackbox Exporter – Monitor Golden Signals on any system, without knowledge of its inner workings.
• Nginx exporter – used in Ingresses
• SNMP Exporter – Bring your own MIB’s.
• Statsd Exporter – Push your statsd metrics to a sidecar container
• Node Exporter – Provides system metrics for VM and Physical systems (like kubernetes nodes)
• cAdvisor – Get generic container metrics
• Etcd
• Kubernetes
• Minio (Gitlab Runner Caching)
The most commonly used
Exporters - Highlights
[Diagram components: Exporter, K8s Service, ServiceMonitor, Prometheus Operator, Prometheus]

• For situations where you are unable to serve an HTTP metrics page for a reliable period of time.
• Ideal for short running tasks like Kubernetes CronJobs, Hadoop Jobs, Scripts, etc.
• Allows you to Push (through an HTTP call) Metrics to a buffering service, which in turn exposes them to
Prometheus.
• Metrics will live forever on the Gateway, so be careful of what you push and how you name them.
• Avoid this route if possible, since it scales very badly and is NOT redundant. Bring your own endpoint if
and when possible.
• PRO-Tip: If you have an ephemeral job, also push the timestamp of last successful job completion.
The Push Gateway
Metrics for ephemeral jobs
[Diagram components: YOUR App!, Push Gateway, Prometheus]
echo "ultimate_answer 42.0" | curl --data-binary @- http://gateway:9091/metrics/job/magrathea/instance/zaphod-001/group/vogon/opex/DPI
ultimate_answer{group="vogon",instance="zaphod-001",job="magrathea",opex="DPI"} 42.0
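
The URL in the curl call carries the grouping labels as `/<label>/<value>` pairs after the job name. As a sketch of how a script could assemble the same request, here are two hypothetical helpers, `push_url` and `push_body`. They only build strings and do not send anything. Label values containing a slash need special encoding, which this sketch deliberately rejects instead of handling.

```python
def push_url(base, job, grouping):
    """Build <base>/metrics/job/<job>/<label>/<value>/... for a Pushgateway push."""
    parts = ["metrics", "job", job]
    for label, value in grouping.items():
        parts += [label, value]
    if any("/" in part for part in parts[2:]):
        raise ValueError("names or values containing '/' are not covered by this sketch")
    return base.rstrip("/") + "/" + "/".join(parts)


def push_body(samples):
    """Render samples in the text format, one per line, ending with a newline."""
    return "".join(f"{name} {value}\n" for name, value in samples.items())


url = push_url(
    "http://gateway:9091",
    "magrathea",
    {"instance": "zaphod-001", "group": "vogon", "opex": "DPI"},
)
body = push_body({"ultimate_answer": 42.0})
print(url)
print(body, end="")
```

Following the PRO-Tip above, an ephemeral job could add a second sample to `samples` holding the Unix timestamp of its last successful completion.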

• Kubernetes Running on Docker for macOS.
• Out of the box Prometheus on Kubernetes from https://github.com/coreos/prometheus-operator/tree/master/contrib/kube-prometheus
• Services are running without an Ingress, so we’re accessing them directly, using NodePorts.
• We’re going to add our own Full Featured Axle Service by creating a Deployment and a Service to match
it, adding a ServiceMonitor, watching Service Discovery do its thing, graphing one of the metrics and
creating an alert for it.
• Prometheus: http://localhost:30000/graph
• AlertManager: http://localhost:31000/#/alerts
• Grafana: http://localhost:32000/d/9dP_FHImz/pods
Getting started in 5 minutes
Today’s Quick Demo

Tips & Tricks
Getting the most out of your Prometheus Experience

• Metrics in Prometheus are multi dimensional; They consist of names and labels.
• Names are generic identifiers to tell WHAT you are measuring, in what format.
• Metric Names SHOULD have a single (base!) unit, added as a suffix describing that unit. (bytes, seconds,
meters)
• Labels describe characteristics, and are usually used to identify WHERE those metrics are coming from,
and can be multi faceted.
• Prometheus saves a separate Time Series for each name/labels combination, so you have to ensure
label cardinality does not get too high, or you will kill Prometheus in the end. (Bad examples: usernames,
internet IP addresses, hashes).
• Read https://prometheus.io/docs/practices/naming/ before you start making your own!
Keep things running smoothly by not making a mess.
Metric Naming
api_http_requests_total { type="create|update|delete", method="GET|POST|DELETE" }
api_request_duration_seconds { stage="extract|transform|load" }
api_errors_total { endpoint="listProducts|updatePricing", code="500|404|418 I'm a teapot" }
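
Why label cardinality matters is easiest to see with a little arithmetic: the number of possible time series for one metric name is the product of the number of distinct values per label. A minimal Python sketch, where the function `max_series` and the 10,000 usernames are assumptions made up for illustration:

```python
from math import prod


def max_series(label_values):
    """Upper bound on time series for one metric name: product of label value counts."""
    return prod(len(values) for values in label_values.values())


api_http_requests_total = {
    "type": ["create", "update", "delete"],
    "method": ["GET", "POST", "DELETE"],
}
print(max_series(api_http_requests_total))

# Bad idea: adding a label with one value per user multiplies everything.
with_usernames = dict(api_http_requests_total, username=[f"user{i}" for i in range(10_000)])
print(max_series(with_usernames))
```

Nine series become ninety thousand with a single unbounded label, and that is for just one metric name.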

•An SLI is a service level indicator—a carefully defined quantitative measure of some aspect of
the level of service that is provided.
•An SLO is a service level objective: a target value or range of values for a service level that is
measured by an SLI. A natural structure for SLOs is thus
[SLI ≤ target], or [lower bound ≤ SLI ≤ upper bound].
•Symptoms vs Causes: Monitor things that users will notice when using your system.
•Latency - The time it takes to service a request.
•Traffic - A measure of how much demand is being placed on your system, measured in a
high-level system-specific metric. For a web service, this measurement is usually HTTP
requests per second.
•Errors - The rate of requests that fail (like HTTP 500s).
•Saturation - How "full" your service is. A measure of your system fraction, emphasizing the
resources that are most constrained.
What should you be monitoring?
The Golden Signals
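
The SLO structure above, [SLI ≤ target] or [lower bound ≤ SLI ≤ upper bound], translates almost literally into code. A small Python sketch, where the function names `availability_sli` and `meets_slo` and all numbers are hypothetical examples:

```python
def availability_sli(total_requests, failed_requests):
    """Fraction of requests that succeeded (the Errors signal turned into an SLI)."""
    if total_requests <= 0:
        raise ValueError("no requests observed, the SLI is undefined")
    return (total_requests - failed_requests) / total_requests


def meets_slo(sli, lower=None, upper=None):
    """Check lower <= SLI <= upper, where either bound may be omitted."""
    if lower is not None and sli < lower:
        return False
    if upper is not None and sli > upper:
        return False
    return True


sli = availability_sli(total_requests=10_000, failed_requests=5)
print(sli, meets_slo(sli, lower=0.999))

# A latency SLI (seconds) checked against an upper bound only.
print(meets_slo(0.35, upper=0.3))
```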

•BlackBox Exporter for periodic requests and their Metrics (Success, Latency, Errors)
•Nginx Ingress Metrics for a man-in-the-middle view of your application (Flow, Latency, Errors)
•Your own application’s Metrics for insights, details and under-the-hood view.
Combining Metric Sources for an unbiased view
Bringing it all together
[Diagram: Prometheus scrapes Poll Metrics from the Blackbox Exporter, Ingress Metrics from the Ingress and App Metrics from Your App]
- job_name: 'blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx]  # Look for a HTTP 200 response.
  static_configs:
  - targets:
    - http://myapp.behindingress.io  # Target to probe with http

•Introducing the GenericServiceMonitor and DCServiceMonitor
•These types allow you to define endpoints outside of Kubernetes, and allow
you to monitor on-premise services.
•DCServiceMonitor works based on bol_applications and as such is bol.com
specific:
•GenericServiceMonitor works on static endpoints
My stuff runs in the DC and I want to keep it there.
So what about non-Cloud resources?
kind: Prometheus/DCServiceMonitor
name: tst-sdd-app
spec:
  port: 8080
  path: /internal/metrics

kind: Prometheus/GenericServiceMonitor
name: dev-atscale-app
spec:
  hosts:
  - ip: 1.2.3.4
    hostname: some.host.name
  port: 8080
  path: /internal/metrics
  opex: srt-bificsps

•Always initialize your metrics at zero when possible, or you won’t know the significance of the
first value.
•How do you know if your application is OK when the metrics stopped working? The up metric
might also disappear when Service Discovery no longer detects your service. Always use
absent() to check for existence of up!
•Apply (i)rate()/increase() first and sum() second, never sum() before (i)rate()/increase(),
since those are the only functions that deal safely with counter resets, and a reset can only be
detected on an individual series.
•The rate function takes a time series over a time range and, based on the first and last data
points within that range, calculates the average per-second rate of increase
(http://localhost:32000/d/h3RZO2Iik/rate-vs-irate?orgId=1 )
•By contrast irate is an instant rate. It only looks at the last two points within the
range passed to it and calculates a per-second rate.
•To complement the saturation signal, Prometheus has predict_linear() for Gauges.
•All the metrics? http://localhost:30000/federate?match[]={__name__%3D~%22[a-z].*%22}
Things you’ll encounter once you start making queries
Other tips
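
The "rate then sum" advice is easier to believe with numbers. The Python sketch below is a simplification: unlike the real rate(), it does not extrapolate to the window boundaries, and the helper names and sample data are invented for illustration. Two pods each serve about one request per second, and pod B restarts halfway through.

```python
def increase(samples):
    """Increase of a counter over (timestamp, value) samples, with reset handling."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero.
        total += cur if cur < prev else cur - prev
    return total


def rate(samples):
    """Average per-second rate over the whole window (no extrapolation)."""
    return increase(samples) / (samples[-1][0] - samples[0][0])


def irate(samples):
    """Per-second rate based on the last two samples only."""
    return rate(samples[-2:])


def summed(a, b):
    """Sample-by-sample sum of two series, as 'sum first' would see them."""
    return [(t, va + vb) for (t, va), (_, vb) in zip(a, b)]


pod_a = [(0, 100), (30, 130), (60, 160), (90, 190)]
pod_b = [(0, 500), (30, 530), (60, 10), (90, 40)]  # restarted between t=30 and t=60

rate_then_sum = rate(pod_a) + rate(pod_b)
sum_then_rate = rate(summed(pod_a, pod_b))
print(round(rate_then_sum, 2), round(sum_then_rate, 2))
print(irate(pod_b))
```

After summing, the restart of pod B looks like a reset of the whole sum from 660 to 170, so the full 170 is counted as new traffic and the result is far too high. Taking the rate per series first keeps the reset contained to pod B.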

Questions?
Don’t bother to ask me the Ultimate Question of Life, the
Universe and Everything, because you already know the answer.
(and yes, I know where my towel is.)

Remco Overdijk
roverdijk@bol.com
So Long!
And thanks for all the fish.
