SlideShare une entreprise Scribd logo
1  sur  74
Télécharger pour lire hors ligne
PromCon Munich
Julien Pivotto
@roidelapluie
Improved alerting with
Prometheus and
Alertmanager
November 8th, 2019
• This talk contains PromQL.
• This talk contains YAML.
• What you will see was built over time.
Important notes!
@roidelapluie
Context
@roidelapluie
Message Broker in the Belgian healthcare sector
• High visibility
• Sync & Async
• Legacy & New
• Lots of partners
• Multiple customers
Message Broker
@roidelapluie
Technical
Business
Monitoring
@roidelapluie Font Awesome CC-BY-4.0
Alerts are not only for incidents.
Some alerts carry business information about ongoing
events (good or bad).
Some alerts go outside of our org.
Some alerts are not for humans.
Alerting
@roidelapluie
Channels
@roidelapluie Font Awesome CC-BY-4.0
Repeat every 15m, 1h, 4h, 24h, 2d
24x7, 10x5, 12x6, 10x7, never
Legal holidays
Time frames
@roidelapluie
Updated annotations & value
Updated graphs
15m/1h repeat interval?
@roidelapluie
• Alertmanager owns the noti cations
• Webhook receivers have no logic
• Take decisions at time of alert writing
Constraints
@roidelapluie
• Avoid Alertmanager recon gurations
• Safe and easy way to write alerts
• Only send relevant alerts
• Alert on staging environments
Challenges
@roidelapluie
PromQL
@roidelapluie
- alert: a target is down
expr: up == 0
for: 5m
Gauges
@roidelapluie
Gauges
@roidelapluie
Instead of:
- alert: a target is down
expr: up == 0
for: 5m
Do:
- alert: a target is down
expr: avg_over_time(up[5m]) < .9
for: 5m
Gauges
@roidelapluie
Alert me if temperature is above 27°C
Hysteresis
@roidelapluie
- alert: temperature is above threshold
expr: temperature_celcius > 27
for: 5m
labels:
priority: high
Hysteresis
@roidelapluie
Hysteresis
@roidelapluie
Hysteresis is the dependence of the state of a system
on its history.
Hysteresis
@roidelapluie Wikipedia CC-BY-SA-3.0
- alert: temperature is above threshold
expr: |
avg_over_time(temperature_celcius[5m])
> 27
for: 5m
labels:
priority: high
alternative: max_over_time
5m might be too short
if > 5m: when is it resolved?
Hysteresis
@roidelapluie
Alert me
• if temperature is above 27°C
• only stop when it gets below 25°C
Hysteresis
@roidelapluie
(avg_over_time(temperature_celcius[5m]) > 27)
or (temperature_celcius > 25 and
count without (alertstate, alertname, priority)
ALERTS{
alertstate="firing",
alertname="temperature is above threshold"
})
Hysteresis
@roidelapluie
temperature_celcius > 27
but...
Computed threshold
@roidelapluie
- record: temperature_threshold_celcius
expr: |
27+0*temperature_celcius{
location=~".*ambiant"
}
or 25+0*temperature_celcius
Bonus: temperature_threshold_celcius can be
used in grafana!
Computed threshold
@roidelapluie
- alert: temperature is above threshold
expr: |
temperature_celcius >
temperature_threshold_celcius
Note: put threshold & alert
in the same alert group
Computed threshold
@roidelapluie
- alert: no more sms
expr: sms_available < 39000
Absence
@roidelapluie
Absence
@roidelapluie
No metric = No alert!
Metric is back = New alert!
Absence
@roidelapluie
- record: sms_available_last
expr: |
sms_available or
sms_available_last
- alert: no more sms
record: sms_available_last < 39000
- alert: no more sms data
record: absent(sms_available)
for: 1h
Absence
@roidelapluie
Con guration
@roidelapluie
recipients:
name/channel
jpivotto/mail
opsteam/ticket
appteam/message
customer/sms
dc1/jenkins
Recipients
@roidelapluie
Alertmanager receivers
- name: "opsteam/mail"
email_configs:
- to: 'ops@inuits.eu'
send_resolved: yes
html: "{{ template "inuits.html.tmpl" . }}"
text: "{{ template "inuits.txt.tmpl" . }}"
headers:
Subject: "{{ template "title.tmpl" . }}"
Hint: Subject can be a template.
Receivers
@roidelapluie
Alertmanager receivers
- name: "opsteam/mail/noresolved"
email_configs:
- to: 'ops@inuits.eu'
send_resolved: no
html: "{{ template "inuits.html.tmpl" . }}"
text: "{{ template "inuits.txt.tmpl" . }}"
headers:
Subject: "{{ template "title.tmpl" . }}"
Same, but with send_resolved: no
Receivers
@roidelapluie
Alertmanager receivers
- name: "lotsOfPeople/mail"
email_configs:
- to: 'a@inuits.eu,b@inuits.eu,c@inuits.eu'
headers:
To: a@inuits.eu
CC: b@inuits.eu
Reply-To: support@inuits.eu
c@inuits.eu is now BCC.
Email: CC, BCC
@roidelapluie
Prometheus alert
- alert: Not enough traffic
expr: ...
for: 5m
labels:
recipients: customer1/sms,opsteam/ticket
annotations:
summary: ...
resolved_summary: ...
Who gets the alert?
@roidelapluie
Alertmanager routing
- receiver: "customer1/sms"
match_re:
recipient: "(.*,)?customer1/sms(,.*)?"
continue: true
routes: [...]
- receiver: "opsteam/ticket"
match_re:
recipient: "(.*,)?opsteam/ticket(,.*)?"
continue: true
routes: [...]
Who gets the alert?
@roidelapluie
Prometheus alert
- alert: Not enough traffic
expr: ...
for: 5m
labels:
recipients: customer1/sms,opsteam/ticket
send_resolved: "no"
Resolved
@roidelapluie
Alertmanager routing
- receiver: "customer1/sms"
match_re:
recipient: "(.*,)?customer1/sms(,.*)?"
continue: true
routes:
- receiver: customer1/sms/noresolved
match:
send_resolved: "no"
Resolved
@roidelapluie
Prometheus alert
- alert: Not enough traffic
expr: ...
for: 5m
labels:
recipients: customer1/sms,opsteam/ticket
repeat_interval: 1h
Repeat interval
@roidelapluie
Alertmanager routing
- receiver: "customer1/sms"
match_re:
recipient: "(.*,)?customer1/sms(,.*)?"
continue: true
routes:
- receiver: customer1/sms
repeat_interval: 1h
match:
repeat_interval: 1h
Repeat interval
@roidelapluie
Some channels have speci c group_interval: 0s.
Some channels always send_resolved: no.
Some recipients have aliases (ticket+chat).
Extra con gurations
@roidelapluie
Extract of amtool con g routes show
─ {recipient=~"^(?:(.*,)?jpivotto/mail(,.*)?)$"} continue: true receiver: jpivotto/mail
   ├── {repeat_interval="15m",send_resolved="no"} receiver: jpivotto/mail/noresolved
   ├── {repeat_interval="15m"} receiver: jpivotto/mail
   ├── {repeat_interval="30m",send_resolved="no"} receiver: jpivotto/mail/noresolved
   ├── {repeat_interval="30m"} receiver: jpivotto/mail
   ├── {repeat_interval="1h",send_resolved="no"} receiver: jpivotto/mail/noresolved
   ├── {repeat_interval="1h"} receiver: jpivotto/mail
   ├── {repeat_interval="2h",send_resolved="no"} receiver: jpivotto/mail/noresolved
   ├── {repeat_interval="2h"} receiver: jpivotto/mail
   ├── {repeat_interval="4h",send_resolved="no"} receiver: jpivotto/mail/noresolved
   ├── {repeat_interval="4h"} receiver: jpivotto/mail
   ├── {repeat_interval="6h",send_resolved="no"} receiver: jpivotto/mail/noresolved
   ├── {repeat_interval="6h"} receiver: jpivotto/mail
   ├── {repeat_interval="12h",send_resolved="no"} receiver: jpivotto/mail/noresolved
   ├── {repeat_interval="12h"} receiver: jpivotto/mail
   ├── {repeat_interval="24h",send_resolved="no"} receiver:jpivotto/mail/noresolved
   ├── {repeat_interval="24h"} receiver: jpivotto/mail
   ├── {send_resolved="no"} receiver: jpivotto/mail/noresolved
   └── {repeat_interval=""} receiver: jpivotto/mail
Routes tree
@roidelapluie
Con g Management!
Our input
receivers:
customer:
email:
to: [customer@example.com]
cc: [service-management@inuits.eu]
bcc: [ops@inuits.eu]
sms: [+1234567890, +2345678901]
chat:
room: "#customer"
How do we achieve it?
@roidelapluie
• Script that is deployed with AM
• Knows all the recipients
• Will validate alerts yaml
• promtool
• mandatory labels
• validate receivers label
• validate repeat_interval label
Not possible to write alerts that go nowhere by
accident.
Con guration management
@roidelapluie
Time frame
@roidelapluie
Prometheus alert
- alert: a target is down
expr: up == 0
for: 5m
labels:
recipients: customer1/sms,opsteam/ticket
time_window: 13x5
Time frame
@roidelapluie
- record: daily_saving_time_belgium
expr: |
(vector(0) and (month() < 3 or month() > 10))
or
(vector(1) and (month() > 3 and month() < 10))
or
(
(
(month() %2 and (day_of_month() - day_of_week()
> (30 + +month() % 2 - 7)) and day_of_week() > 0)
or
-1*month()%2+1 and (day_of_month() -
day_of_week() <= (30 + month() % 2 - 7))
)
)
or
(vector(1) and ((month()==10 and hour() < 1) or (month()==3 and hour() > 0
or
vector(0)
- record: belgium_localtime
expr: |
time() + 3600 + 3600 * daily_saving_time_belgium
Timezone
@roidelapluie
hour(belgium_localtime)
hour() and other time-functions can take a
timestamp as argument.
Belgian hour
@roidelapluie
- record: public_holiday
expr: |
vector(1) and
day_of_month(belgium_localtime) == 25
and month(belgium_localtime) == 12
labels:
name: Xmas
Holidays
@roidelapluie
groups:
- name: Easter Meeus/Jones/Butcher Algorithm
interval: 60s
rules:
- record: easter_y
expr: year(belgium_localtime)
- record: easter_a
expr: easter_y % 19
- record: easter_b
expr: floor(easter_y / 100)
- record: easter_c
expr: easter_y % 100
- record: easter_d
expr: floor(easter_b / 4)
- record: easter_e
expr: easter_b % 4
- record: easter_f
expr: floor((easter_b +8 ) / 25)
- record: easter_g
expr: floor((easter_b - easter_f + 1 ) / 3)
- record: easter_h
expr: (19*easter_a + easter_b - easter_d - easter_g + 15 ) % 30
- record: easter_i
expr: floor(easter_c/4)
- record: easter_k
expr: easter_c%4
- record: easter_l
expr: (32 + 2*easter_e + 2*easter_i - easter_h - easter_k) % 7
- record: easter_m
expr: floor((easter_a + 11*easter_h + 22*easter_l) / 451)
- record: easter_month
expr: floor((easter_h + easter_l - 7*easter_m + 114) / 31)
- record: easter_day
expr: ((easter_h + easter_l - 7*easter_m + 114) %31) + 1
Easter
@roidelapluie Wikipedia CC-BY-SA-3.0
- record: public_holiday
expr: |
vector(1) and
day_of_month(belgium_localtime-86400) == easter_day
and month(belgium_localtime-86400) == easter_month
labels:
name: Easter Monday
- record: public_holiday
expr: |
vector(1) and
day_of_month(belgium_localtime-40*86400) == easter_day
and month(belgium_localtime-40*86400) == easter_month
labels:
name: Feast of the Ascension
Easter
@roidelapluie
- record: business_day
expr: |
vector(1) and day_of_week(belgium_localtime) > 0
and day_of_week(belgium_localtime) < 6
unless count(public_holiday)
- record: belgium_hour
expr: |
hour(belgium_localtime)
- record: business_hour
expr: |
vector(1) and belgium_hour >= 8 < 18
and business_day
Business hour
@roidelapluie
- record: extended_business_hour
expr: |
(vector(1) and belgium_hour >= 7 < 20
and business_day)
- record: extended_business_hour_sat
expr: |
extended_business_hour
or (vector(1) and belgium_hour >= 7 < 14
and day_of_week(belgium_localtime) == 6
unless count(public_holiday))
Extended business hours
@roidelapluie
(sum(rate(http_requests_total{code=~"5.."}[5m])) by (vhost)
> 10 and on () business_hour)
or
(sum(rate(http_requests_total{code=~"5.."}[5m])) by (vhost) > 1
and sum(rate(http_requests_total{code=~"2.."}[5m])) by (vhost) < 1)
and on () business_hour
Thresholds depending on time
@roidelapluie
- record: daylight
expr: |
vector(1) and belgium_hour >= 8 < 18
- record: extended_daylight
expr: |
vector(1) and belgium_hour >= 7 < 20
Day and night
@roidelapluie
- alert: Time Window - Night
expr: absent(daylight)
labels:
recipient: none
- alert: Time Window - OBH
expr: absent(business_hour)
labels:
recipient: none
- alert: Time Window - Extended Night
expr: absent(extended_daylight)
labels:
recipient: none
Alerts
@roidelapluie
- alert: Time Window - Extended OBH with Saturday
expr: absent(extended_business_hour_sat)
labels:
recipient: none
- alert: Time Window - Extended OBH
expr: absent(extended_business_hour)
labels:
recipient: none
Alerts
@roidelapluie
At this point, we will have "meaningless" alerts at night
and during business holidays.
Alerts
@roidelapluie
Inhibition is a concept of suppressing noti cations for
certain alerts if certain other alerts are already ring.
Inhibition
@roidelapluie Alertmanager documentation CC-BY-4.0
Alertmanager inhibition
inhibit_rules:
- source_match:
alertname: "Time Window - Night"
target_match:
time_window: 10x7
- source_match:
alertname: "Time Window - OBH"
target_match:
time_window: 10x5
Inhibition
@roidelapluie
Alertmanager inhibition
- source_match:
alertname: "Time Window - Extended Night"
target_match:
time_window: 13x7
- source_match:
alertname: "Time Window - Extended OBH"
target_match:
time_window: 13x5
- source_match:
alertname: "Time Window - Extended OBH with Saturday"
target_match:
time_window: 13x6
Inhibition
@roidelapluie
Alerts Relabeling
@roidelapluie
Prometheus alert
- alert: a target is down
expr: up == 0
for: 5m
labels:
recipients_prod: customer1/sms,opsteam/ticket
time_window_prod: 24x7
recipients: opsteam/chat
time_window: 8x5
Per env recipients
@roidelapluie
Prometheus con g
alerting:
alert_relabel_configs:
- source_labels: [time_window_prod,env]
regex: "(.+);prod"
target_label: time_window
replacement: '$1'
- source_labels: [time_window_dev,env]
regex: "(.+);dev"
target_label: time_window
replacement: '$1'
Repeat for other env, other labels.
Per env recipients
@roidelapluie
Prometheus con g
alerting:
alert_relabel_configs:
- source_labels: [time_window]
regex: "never"
action: drop
Be careful about the order (time_window can be
mutated by relabeling).
Drop alert
@roidelapluie
Conclusion
@roidelapluie
• Alert-Writing experience is great with this
• Prometheus and PromQL can do a lot
• Con g management lls the "gaps"
• With some e ort, we have everything we wanted
Conclusion
@roidelapluie
- name: normal rate
interval: 120s
rules:
- record: request_rate_history
expr: |
sum(rate(http_requests_total[5m])) by (env)
labels:
when: 0w
Bonus
@roidelapluie
- record: request_rate_history
expr: |
(
sum(rate(http_requests_total[5m] offset 168h)) by (env)
and on () (
daily_saving_time_belgium offset 1w
== daily_saving_time_belgium)
) or
(
sum(rate(http_requests_total[5m] offset 167h)) by (env)
and on () (
daily_saving_time_belgium offset 1w
< daily_saving_time_belgium)
) or
(
sum(rate(http_requests_total[5m] offset 169h)) by (env)
and on () (
daily_saving_time_belgium offset 1w
> daily_saving_time_belgium)
)
labels:
when: 1w
Bonus
@roidelapluie
- record: request_rate_normal
expr: |
max(bottomk(1,
topk(4, request_rate_history) by(env)
) by(env)) by(env)
Bonus
@roidelapluie
Bonus
@roidelapluie
Bonus
@roidelapluie
Really hope we will get rid of DST in 2021!
Bonus
@roidelapluie
Julien Pivotto
@roidelapluie
roidelapluie@inuits.eu
Essensteenweg 31
2930 Brasschaat
Belgium
Contact:
info@inuits.eu
+32-3-8082105

Contenu connexe

Tendances

Explore your prometheus data in grafana - Promcon 2018
Explore your prometheus data in grafana - Promcon 2018Explore your prometheus data in grafana - Promcon 2018
Explore your prometheus data in grafana - Promcon 2018Grafana Labs
 
Grafana Loki: like Prometheus, but for Logs
Grafana Loki: like Prometheus, but for LogsGrafana Loki: like Prometheus, but for Logs
Grafana Loki: like Prometheus, but for LogsMarco Pracucci
 
Monitoring with prometheus
Monitoring with prometheusMonitoring with prometheus
Monitoring with prometheusKasper Nissen
 
Grafana introduction
Grafana introductionGrafana introduction
Grafana introductionRico Chen
 
Introduction to Prometheus
Introduction to PrometheusIntroduction to Prometheus
Introduction to PrometheusJulien Pivotto
 
Grafana optimization for Prometheus
Grafana optimization for PrometheusGrafana optimization for Prometheus
Grafana optimization for PrometheusMitsuhiro Tanda
 
Comprehensive Terraform Training
Comprehensive Terraform TrainingComprehensive Terraform Training
Comprehensive Terraform TrainingYevgeniy Brikman
 
Prometheus Overview
Prometheus OverviewPrometheus Overview
Prometheus OverviewBrian Brazil
 
Prometheus - basics
Prometheus - basicsPrometheus - basics
Prometheus - basicsJuraj Hantak
 
PromQL Deep Dive - The Prometheus Query Language
PromQL Deep Dive - The Prometheus Query Language PromQL Deep Dive - The Prometheus Query Language
PromQL Deep Dive - The Prometheus Query Language Weaveworks
 
Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusGrafana Labs
 
OpenTelemetry For Developers
OpenTelemetry For DevelopersOpenTelemetry For Developers
OpenTelemetry For DevelopersKevin Brockhoff
 
Exploring the power of OpenTelemetry on Kubernetes
Exploring the power of OpenTelemetry on KubernetesExploring the power of OpenTelemetry on Kubernetes
Exploring the power of OpenTelemetry on KubernetesRed Hat Developers
 

Tendances (20)

Prometheus and Grafana
Prometheus and GrafanaPrometheus and Grafana
Prometheus and Grafana
 
Explore your prometheus data in grafana - Promcon 2018
Explore your prometheus data in grafana - Promcon 2018Explore your prometheus data in grafana - Promcon 2018
Explore your prometheus data in grafana - Promcon 2018
 
Prometheus monitoring
Prometheus monitoringPrometheus monitoring
Prometheus monitoring
 
Grafana Loki: like Prometheus, but for Logs
Grafana Loki: like Prometheus, but for LogsGrafana Loki: like Prometheus, but for Logs
Grafana Loki: like Prometheus, but for Logs
 
Monitoring with prometheus
Monitoring with prometheusMonitoring with prometheus
Monitoring with prometheus
 
Grafana introduction
Grafana introductionGrafana introduction
Grafana introduction
 
Introduction to Prometheus
Introduction to PrometheusIntroduction to Prometheus
Introduction to Prometheus
 
Grafana
GrafanaGrafana
Grafana
 
Grafana optimization for Prometheus
Grafana optimization for PrometheusGrafana optimization for Prometheus
Grafana optimization for Prometheus
 
Prometheus + Grafana = Awesome Monitoring
Prometheus + Grafana = Awesome MonitoringPrometheus + Grafana = Awesome Monitoring
Prometheus + Grafana = Awesome Monitoring
 
Monitoring With Prometheus
Monitoring With PrometheusMonitoring With Prometheus
Monitoring With Prometheus
 
Comprehensive Terraform Training
Comprehensive Terraform TrainingComprehensive Terraform Training
Comprehensive Terraform Training
 
Final terraform
Final terraformFinal terraform
Final terraform
 
Prometheus Overview
Prometheus OverviewPrometheus Overview
Prometheus Overview
 
Prometheus - basics
Prometheus - basicsPrometheus - basics
Prometheus - basics
 
Cloud Monitoring tool Grafana
Cloud Monitoring  tool Grafana Cloud Monitoring  tool Grafana
Cloud Monitoring tool Grafana
 
PromQL Deep Dive - The Prometheus Query Language
PromQL Deep Dive - The Prometheus Query Language PromQL Deep Dive - The Prometheus Query Language
PromQL Deep Dive - The Prometheus Query Language
 
Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with Prometheus
 
OpenTelemetry For Developers
OpenTelemetry For DevelopersOpenTelemetry For Developers
OpenTelemetry For Developers
 
Exploring the power of OpenTelemetry on Kubernetes
Exploring the power of OpenTelemetry on KubernetesExploring the power of OpenTelemetry on Kubernetes
Exploring the power of OpenTelemetry on Kubernetes
 

Similaire à Improved alerting with Prometheus and Alertmanager

#include customer.h#include heap.h#include iostream.docx
#include customer.h#include heap.h#include iostream.docx#include customer.h#include heap.h#include iostream.docx
#include customer.h#include heap.h#include iostream.docxAASTHA76
 
AI&ML Conference 2019 - Deep Learning from zero to hero
AI&ML Conference 2019 - Deep Learning from zero to heroAI&ML Conference 2019 - Deep Learning from zero to hero
AI&ML Conference 2019 - Deep Learning from zero to heroGianluca Carucci
 
Accelerating OpenERP accounting: Precalculated period sums
Accelerating OpenERP accounting: Precalculated period sumsAccelerating OpenERP accounting: Precalculated period sums
Accelerating OpenERP accounting: Precalculated period sumsBorja López
 
Proposed pricing model for cloud computing
Proposed pricing model for cloud computingProposed pricing model for cloud computing
Proposed pricing model for cloud computingAdeel Javaid
 
Ch02 primitive-data-definite-loops
Ch02 primitive-data-definite-loopsCh02 primitive-data-definite-loops
Ch02 primitive-data-definite-loopsJames Brotsos
 
ch02-primitive-data-definite-loops.ppt
ch02-primitive-data-definite-loops.pptch02-primitive-data-definite-loops.ppt
ch02-primitive-data-definite-loops.pptMahyuddin8
 
ch02-primitive-data-definite-loops.ppt
ch02-primitive-data-definite-loops.pptch02-primitive-data-definite-loops.ppt
ch02-primitive-data-definite-loops.pptghoitsun
 
Objects? No thanks!
Objects? No thanks!Objects? No thanks!
Objects? No thanks!corehard_by
 
Performance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIPerformance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIGeorge Markomanolis
 
Observability in a Dynamically Scheduled World
Observability in a Dynamically Scheduled WorldObservability in a Dynamically Scheduled World
Observability in a Dynamically Scheduled WorldSneha Inguva
 
C++ process new
C++ process newC++ process new
C++ process new敬倫 林
 
Cyrille Martraire: Monoids, Monoids Everywhere! at I T.A.K.E. Unconference 2015
Cyrille Martraire: Monoids, Monoids Everywhere! at I T.A.K.E. Unconference 2015Cyrille Martraire: Monoids, Monoids Everywhere! at I T.A.K.E. Unconference 2015
Cyrille Martraire: Monoids, Monoids Everywhere! at I T.A.K.E. Unconference 2015Mozaic Works
 
Control Statement.ppt
Control Statement.pptControl Statement.ppt
Control Statement.pptsanjay
 
The hitchhiker’s guide to Prometheus
The hitchhiker’s guide to PrometheusThe hitchhiker’s guide to Prometheus
The hitchhiker’s guide to PrometheusBol.com Techlab
 

Similaire à Improved alerting with Prometheus and Alertmanager (20)

#include customer.h#include heap.h#include iostream.docx
#include customer.h#include heap.h#include iostream.docx#include customer.h#include heap.h#include iostream.docx
#include customer.h#include heap.h#include iostream.docx
 
AI&ML Conference 2019 - Deep Learning from zero to hero
AI&ML Conference 2019 - Deep Learning from zero to heroAI&ML Conference 2019 - Deep Learning from zero to hero
AI&ML Conference 2019 - Deep Learning from zero to hero
 
Accelerating OpenERP accounting: Precalculated period sums
Accelerating OpenERP accounting: Precalculated period sumsAccelerating OpenERP accounting: Precalculated period sums
Accelerating OpenERP accounting: Precalculated period sums
 
Proposed pricing model for cloud computing
Proposed pricing model for cloud computingProposed pricing model for cloud computing
Proposed pricing model for cloud computing
 
Code Tuning
Code TuningCode Tuning
Code Tuning
 
C# Loops
C# LoopsC# Loops
C# Loops
 
Performance
PerformancePerformance
Performance
 
14 queuing
14 queuing14 queuing
14 queuing
 
Ch02 primitive-data-definite-loops
Ch02 primitive-data-definite-loopsCh02 primitive-data-definite-loops
Ch02 primitive-data-definite-loops
 
ScalaMeter 2012
ScalaMeter 2012ScalaMeter 2012
ScalaMeter 2012
 
ch02-primitive-data-definite-loops.ppt
ch02-primitive-data-definite-loops.pptch02-primitive-data-definite-loops.ppt
ch02-primitive-data-definite-loops.ppt
 
ch02-primitive-data-definite-loops.ppt
ch02-primitive-data-definite-loops.pptch02-primitive-data-definite-loops.ppt
ch02-primitive-data-definite-loops.ppt
 
Objects? No thanks!
Objects? No thanks!Objects? No thanks!
Objects? No thanks!
 
Performance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIPerformance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part II
 
Observability in a Dynamically Scheduled World
Observability in a Dynamically Scheduled WorldObservability in a Dynamically Scheduled World
Observability in a Dynamically Scheduled World
 
Ch3
Ch3Ch3
Ch3
 
C++ process new
C++ process newC++ process new
C++ process new
 
Cyrille Martraire: Monoids, Monoids Everywhere! at I T.A.K.E. Unconference 2015
Cyrille Martraire: Monoids, Monoids Everywhere! at I T.A.K.E. Unconference 2015Cyrille Martraire: Monoids, Monoids Everywhere! at I T.A.K.E. Unconference 2015
Cyrille Martraire: Monoids, Monoids Everywhere! at I T.A.K.E. Unconference 2015
 
Control Statement.ppt
Control Statement.pptControl Statement.ppt
Control Statement.ppt
 
The hitchhiker’s guide to Prometheus
The hitchhiker’s guide to PrometheusThe hitchhiker’s guide to Prometheus
The hitchhiker’s guide to Prometheus
 

Plus de Julien Pivotto

What's New in Prometheus and Its Ecosystem
What's New in Prometheus and Its EcosystemWhat's New in Prometheus and Its Ecosystem
What's New in Prometheus and Its EcosystemJulien Pivotto
 
Prometheus: What is is, what is new, what is coming
Prometheus: What is is, what is new, what is comingPrometheus: What is is, what is new, what is coming
Prometheus: What is is, what is new, what is comingJulien Pivotto
 
What's new in Prometheus?
What's new in Prometheus?What's new in Prometheus?
What's new in Prometheus?Julien Pivotto
 
Introduction to Grafana Loki
Introduction to Grafana LokiIntroduction to Grafana Loki
Introduction to Grafana LokiJulien Pivotto
 
Why you should revisit mgmt
Why you should revisit mgmtWhy you should revisit mgmt
Why you should revisit mgmtJulien Pivotto
 
Observing the HashiCorp Ecosystem From Prometheus
Observing the HashiCorp Ecosystem From PrometheusObserving the HashiCorp Ecosystem From Prometheus
Observing the HashiCorp Ecosystem From PrometheusJulien Pivotto
 
Monitoring in a fast-changing world with Prometheus
Monitoring in a fast-changing world with PrometheusMonitoring in a fast-changing world with Prometheus
Monitoring in a fast-changing world with PrometheusJulien Pivotto
 
5 tips for Prometheus Service Discovery
5 tips for Prometheus Service Discovery5 tips for Prometheus Service Discovery
5 tips for Prometheus Service DiscoveryJulien Pivotto
 
Prometheus and TLS - an Introduction
Prometheus and TLS - an IntroductionPrometheus and TLS - an Introduction
Prometheus and TLS - an IntroductionJulien Pivotto
 
Powerful graphs in Grafana
Powerful graphs in GrafanaPowerful graphs in Grafana
Powerful graphs in GrafanaJulien Pivotto
 
HAProxy as Egress Controller
HAProxy as Egress ControllerHAProxy as Egress Controller
HAProxy as Egress ControllerJulien Pivotto
 
SIngle Sign On with Keycloak
SIngle Sign On with KeycloakSIngle Sign On with Keycloak
SIngle Sign On with KeycloakJulien Pivotto
 
Monitoring as an entry point for collaboration
Monitoring as an entry point for collaborationMonitoring as an entry point for collaboration
Monitoring as an entry point for collaborationJulien Pivotto
 
Incident Resolution as Code
Incident Resolution as CodeIncident Resolution as Code
Incident Resolution as CodeJulien Pivotto
 
Monitor your CentOS stack with Prometheus
Monitor your CentOS stack with PrometheusMonitor your CentOS stack with Prometheus
Monitor your CentOS stack with PrometheusJulien Pivotto
 
Monitor your CentOS stack with Prometheus
Monitor your CentOS stack with PrometheusMonitor your CentOS stack with Prometheus
Monitor your CentOS stack with PrometheusJulien Pivotto
 
An introduction to Ansible
An introduction to AnsibleAn introduction to Ansible
An introduction to AnsibleJulien Pivotto
 

Plus de Julien Pivotto (20)

The O11y Toolkit
The O11y ToolkitThe O11y Toolkit
The O11y Toolkit
 
What's New in Prometheus and Its Ecosystem
What's New in Prometheus and Its EcosystemWhat's New in Prometheus and Its Ecosystem
What's New in Prometheus and Its Ecosystem
 
Prometheus: What is is, what is new, what is coming
Prometheus: What is is, what is new, what is comingPrometheus: What is is, what is new, what is coming
Prometheus: What is is, what is new, what is coming
 
What's new in Prometheus?
What's new in Prometheus?What's new in Prometheus?
What's new in Prometheus?
 
Introduction to Grafana Loki
Introduction to Grafana LokiIntroduction to Grafana Loki
Introduction to Grafana Loki
 
Why you should revisit mgmt
Why you should revisit mgmtWhy you should revisit mgmt
Why you should revisit mgmt
 
Observing the HashiCorp Ecosystem From Prometheus
Observing the HashiCorp Ecosystem From PrometheusObserving the HashiCorp Ecosystem From Prometheus
Observing the HashiCorp Ecosystem From Prometheus
 
Monitoring in a fast-changing world with Prometheus
Monitoring in a fast-changing world with PrometheusMonitoring in a fast-changing world with Prometheus
Monitoring in a fast-changing world with Prometheus
 
5 tips for Prometheus Service Discovery
5 tips for Prometheus Service Discovery5 tips for Prometheus Service Discovery
5 tips for Prometheus Service Discovery
 
Prometheus and TLS - an Introduction
Prometheus and TLS - an IntroductionPrometheus and TLS - an Introduction
Prometheus and TLS - an Introduction
 
Powerful graphs in Grafana
Powerful graphs in GrafanaPowerful graphs in Grafana
Powerful graphs in Grafana
 
YAML Magic
YAML MagicYAML Magic
YAML Magic
 
HAProxy as Egress Controller
HAProxy as Egress ControllerHAProxy as Egress Controller
HAProxy as Egress Controller
 
SIngle Sign On with Keycloak
SIngle Sign On with KeycloakSIngle Sign On with Keycloak
SIngle Sign On with Keycloak
 
Monitoring as an entry point for collaboration
Monitoring as an entry point for collaborationMonitoring as an entry point for collaboration
Monitoring as an entry point for collaboration
 
Incident Resolution as Code
Incident Resolution as CodeIncident Resolution as Code
Incident Resolution as Code
 
Monitor your CentOS stack with Prometheus
Monitor your CentOS stack with PrometheusMonitor your CentOS stack with Prometheus
Monitor your CentOS stack with Prometheus
 
Monitor your CentOS stack with Prometheus
Monitor your CentOS stack with PrometheusMonitor your CentOS stack with Prometheus
Monitor your CentOS stack with Prometheus
 
An introduction to Ansible
An introduction to AnsibleAn introduction to Ansible
An introduction to Ansible
 
Jsonnet
JsonnetJsonnet
Jsonnet
 

Dernier

Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 

Dernier (20)

Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 

Improved alerting with Prometheus and Alertmanager

  • 1. PromCon Munich Julien Pivotto @roidelapluie Improved alerting with Prometheus and Alertmanager November 8th, 2019
  • 2. • This talk contains PromQL. • This talk contains YAML. • What you will see was built over time. Important notes! @roidelapluie
  • 4. Message Broker in the Belgian healthcare sector • High visibility • Sync & Async • Legacy & New • Lots of partners • Multiple customers Message Broker @roidelapluie
  • 6. Alerts are not only for incidents. Some alerts carry business information about ongoing events (good or bad). Some alerts go outside of our org. Some alerts are not for humans. Alerting @roidelapluie
  • 8. Repeat every 15m, 1h, 4h, 24h, 2d 24x7, 10x5, 12x6, 10x7, never Legal holidays Time frames @roidelapluie
  • 9. Updated annotations & value Updated graphs 15m/1h repeat interval? @roidelapluie
  • 10. • Alertmanager owns the noti cations • Webhook receivers have no logic • Take decisions at time of alert writing Constraints @roidelapluie
  • 11. • Avoid Alertmanager recon gurations • Safe and easy way to write alerts • Only send relevant alerts • Alert on staging environments Challenges @roidelapluie
  • 13. - alert: a target is down expr: up == 0 for: 5m Gauges @roidelapluie
  • 15. Instead of: - alert: a target is down expr: up == 0 for: 5m Do: - alert: a target is down expr: avg_over_time(up[5m]) < .9 for: 5m Gauges @roidelapluie
  • 16. Alert me if temperature is above 27°C Hysteresis @roidelapluie
  • 17. - alert: temperature is above threshold expr: temperature_celcius > 27 for: 5m labels: priority: high Hysteresis @roidelapluie
  • 19. Hysteresis is the dependence of the state of a system on its history. Hysteresis @roidelapluie Wikipedia CC-BY-SA-3.0
  • 20. - alert: temperature is above threshold expr: | avg_over_time(temperature_celcius[5m]) > 27 for: 5m labels: priority: high alternative: max_over_time 5m might be too short if > 5m: when is it resolved? Hysteresis @roidelapluie
  • 21. Alert me • if temperature is above 27°C • only stop when it gets below 25°C Hysteresis @roidelapluie
  • 22. (avg_over_time(temperature_celcius[5m]) > 27) or (temperature_celcius > 25 and count without (alertstate, alertname, priority) ALERTS{ alertstate="firing", alertname="temperature is above threshold" }) Hysteresis @roidelapluie
  • 23. temperature_celcius > 27 but... Computed threshold @roidelapluie
  • 24. - record: temperature_threshold_celcius expr: | 27+0*temperature_celcius{ location=~".*ambiant" } or 25+0*temperature_celcius Bonus: temperature_threshold_celcius can be used in grafana! Computed threshold @roidelapluie
  • 25. - alert: temperature is above threshold expr: | temperature_celcius > temperature_threshold_celcius Note: put threshold & alert in the same alert group Computed threshold @roidelapluie
  • 26. - alert: no more sms expr: sms_available < 39000 Absence @roidelapluie
  • 28. No metric = No alert! Metric is back = New alert! Absence @roidelapluie
  • 29. - record: sms_available_last expr: | sms_available or sms_available_last - alert: no more sms record: sms_available_last < 39000 - alert: no more sms data record: absent(sms_available) for: 1h Absence @roidelapluie
  • 32. Alertmanager receivers - name: "opsteam/mail" email_configs: - to: 'ops@inuits.eu' send_resolved: yes html: "{{ template "inuits.html.tmpl" . }}" text: "{{ template "inuits.txt.tmpl" . }}" headers: Subject: "{{ template "title.tmpl" . }}" Hint: Subject can be a template. Receivers @roidelapluie
  • 33. Alertmanager receivers - name: "opsteam/mail/noresolved" email_configs: - to: 'ops@inuits.eu' send_resolved: no html: "{{ template "inuits.html.tmpl" . }}" text: "{{ template "inuits.txt.tmpl" . }}" headers: Subject: "{{ template "title.tmpl" . }}" Same, but with send_resolved: no Receivers @roidelapluie
  • 34. Alertmanager receivers - name: "lotsOfPeople/mail" email_configs: - to: 'a@inuits.eu,b@inuits.eu,c@inuits.eu' headers: To: a@inuits.eu CC: b@inuits.eu Reply-To: support@inuits.eu c@inuits.eu is now BCC. Email: CC, BCC @roidelapluie
  • 35. Prometheus alert - alert: Not enough traffic expr: ... for: 5m labels: recipients: customer1/sms,opsteam/ticket annotations: summary: ... resolved_summary: ... Who gets the alert? @roidelapluie
  • 36. Alertmanager routing - receiver: "customer1/sms" match_re: recipient: "(.*,)?customer1/sms(,.*)?" continue: true routes: [...] - receiver: "opsteam/ticket" match_re: recipient: "(.*,)?opsteam/ticket(,.*)?" continue: true routes: [...] Who gets the alert? @roidelapluie
  • 37. Prometheus alert - alert: Not enough traffic expr: ... for: 5m labels: recipients: customer1/sms,opsteam/ticket send_resolved: "no" Resolved @roidelapluie
  • 38. Alertmanager routing - receiver: "customer1/sms" match_re: recipient: "(.*,)?customer1/sms(,.*)?" continue: true routes: - receiver: customer1/sms/noresolved match: send_resolved: "no" Resolved @roidelapluie
  • 39. Prometheus alert - alert: Not enough traffic expr: ... for: 5m labels: recipients: customer1/sms,opsteam/ticket repeat_interval: 1h Repeat interval @roidelapluie
  • 40. Alertmanager routing - receiver: "customer1/sms" match_re: recipient: "(.*,)?customer1/sms(,.*)?" continue: true routes: - receiver: customer1/sms repeat_interval: 1h match: repeat_interval: 1h Repeat interval @roidelapluie
  • 41. Some channels have speci c group_interval: 0s. Some channels always send_resolved: no. Some recipients have aliases (ticket+chat). Extra con gurations @roidelapluie
  • 42. Extract of amtool con g routes show ─ {recipient=~"^(?:(.*,)?jpivotto/mail(,.*)?)$"} continue: true receiver: jpivotto/mail    ├── {repeat_interval="15m",send_resolved="no"} receiver: jpivotto/mail/noresolved    ├── {repeat_interval="15m"} receiver: jpivotto/mail    ├── {repeat_interval="30m",send_resolved="no"} receiver: jpivotto/mail/noresolved    ├── {repeat_interval="30m"} receiver: jpivotto/mail    ├── {repeat_interval="1h",send_resolved="no"} receiver: jpivotto/mail/noresolved    ├── {repeat_interval="1h"} receiver: jpivotto/mail    ├── {repeat_interval="2h",send_resolved="no"} receiver: jpivotto/mail/noresolved    ├── {repeat_interval="2h"} receiver: jpivotto/mail    ├── {repeat_interval="4h",send_resolved="no"} receiver: jpivotto/mail/noresolved    ├── {repeat_interval="4h"} receiver: jpivotto/mail    ├── {repeat_interval="6h",send_resolved="no"} receiver: jpivotto/mail/noresolved    ├── {repeat_interval="6h"} receiver: jpivotto/mail    ├── {repeat_interval="12h",send_resolved="no"} receiver: jpivotto/mail/noresolved    ├── {repeat_interval="12h"} receiver: jpivotto/mail    ├── {repeat_interval="24h",send_resolved="no"} receiver:jpivotto/mail/noresolved    ├── {repeat_interval="24h"} receiver: jpivotto/mail    ├── {send_resolved="no"} receiver: jpivotto/mail/noresolved    └── {repeat_interval=""} receiver: jpivotto/mail Routes tree @roidelapluie
  • 43. Con g Management! Our input receivers: customer: email: to: [customer@example.com] cc: [service-management@inuits.eu] bcc: [ops@inuits.eu] sms: [+1234567890, +2345678901] chat: room: "#customer" How do we achieve it? @roidelapluie
  • 44. • Script that is deployed with AM • Knows all the recipients • Will validate alerts yaml • promtool • mandatory labels • validate receivers label • validate repeat_interval label Not possible to write alerts that go nowhere by accident. Con guration management @roidelapluie
  • 46. Prometheus alert - alert: a target is down expr: up == 0 for: 5m labels: recipients: customer1/sms,opsteam/ticket time_window: 13x5 Time frame @roidelapluie
  • 47. - record: daily_saving_time_belgium expr: | (vector(0) and (month() < 3 or month() > 10)) or (vector(1) and (month() > 3 and month() < 10)) or ( ( (month() %2 and (day_of_month() - day_of_week() > (30 + +month() % 2 - 7)) and day_of_week() > 0) or -1*month()%2+1 and (day_of_month() - day_of_week() <= (30 + month() % 2 - 7)) ) ) or (vector(1) and ((month()==10 and hour() < 1) or (month()==3 and hour() > 0 or vector(0) - record: belgium_localtime expr: | time() + 3600 + 3600 * daily_saving_time_belgium Timezone @roidelapluie
  • 48. hour(belgium_localtime) hour() and other time-functions can take a timestamp as argument. Belgian hour @roidelapluie
  • 49. - record: public_holiday expr: | vector(1) and day_of_month(belgium_localtime) == 25 and month(belgium_localtime) == 12 labels: name: Xmas Holidays @roidelapluie
  • 50. groups: - name: Easter Meeus/Jones/Butcher Algorithm interval: 60s rules: - record: easter_y expr: year(belgium_localtime) - record: easter_a expr: easter_y % 19 - record: easter_b expr: floor(easter_y / 100) - record: easter_c expr: easter_y % 100 - record: easter_d expr: floor(easter_b / 4) - record: easter_e expr: easter_b % 4 - record: easter_f expr: floor((easter_b +8 ) / 25) - record: easter_g expr: floor((easter_b - easter_f + 1 ) / 3) - record: easter_h expr: (19*easter_a + easter_b - easter_d - easter_g + 15 ) % 30 - record: easter_i expr: floor(easter_c/4) - record: easter_k expr: easter_c%4 - record: easter_l expr: (32 + 2*easter_e + 2*easter_i - easter_h - easter_k) % 7 - record: easter_m expr: floor((easter_a + 11*easter_h + 22*easter_l) / 451) - record: easter_month expr: floor((easter_h + easter_l - 7*easter_m + 114) / 31) - record: easter_day expr: ((easter_h + easter_l - 7*easter_m + 114) %31) + 1 Easter @roidelapluie Wikipedia CC-BY-SA-3.0
  • 51. - record: public_holiday expr: | vector(1) and day_of_month(belgium_localtime-86400) == easter_day and month(belgium_localtime-86400) == easter_month labels: name: Easter Monday - record: public_holiday expr: | vector(1) and day_of_month(belgium_localtime-40*86400) == easter_day and month(belgium_localtime-40*86400) == easter_month labels: name: Feast of the Ascension Easter @roidelapluie
  • 52. - record: business_day expr: | vector(1) and day_of_week(belgium_localtime) > 0 and day_of_week(belgium_localtime) < 6 unless count(public_holiday) - record: belgium_hour expr: | hour(belgium_localtime) - record: business_hour expr: | vector(1) and belgium_hour >= 8 < 18 and business_day Business hour @roidelapluie
  • 53. - record: extended_business_hour expr: | (vector(1) and belgium_hour >= 7 < 20 and business_day) - record: extended_business_hour_sat expr: | extended_business_hour or (vector(1) and belgium_hour >= 7 < 14 and day_of_week(belgium_localtime) == 6 unless count(public_holiday)) Extended business hours @roidelapluie
  • 54. (sum(rate(http_requests_total{code=~"5.."}[5m])) by (vhost) > 10 and on () business_hour) or (sum(rate(http_requests_total{code=~"5.."}[5m])) by (vhost) > 1 and sum(rate(http_requests_total{code=~"2.."}[5m])) by (vhost) < 1) and on () business_hour Thresholds depending on time @roidelapluie
  • 55. - record: daylight expr: | vector(1) and belgium_hour >= 8 < 18 - record: extended_daylight expr: | vector(1) and belgium_hour >= 7 < 20 Day and night @roidelapluie
  • 56. - alert: Time Window - Night expr: absent(daylight) labels: recipient: none - alert: Time Window - OBH expr: absent(business_hour) labels: recipient: none - alert: Time Window - Extended Night expr: absent(extended_daylight) labels: recipient: none Alerts @roidelapluie
  • 57. - alert: Time Window - Extended OBH with Saturday expr: absent(extended_business_hour_sat) labels: recipient: none - alert: Time Window - Extended OBH expr: absent(extended_business_hour) labels: recipient: none Alerts @roidelapluie
  • 58. At this point, we will have "meaningless" alerts at night and during business holidays. Alerts @roidelapluie
  • 59. Inhibition is a concept of suppressing noti cations for certain alerts if certain other alerts are already ring. Inhibition @roidelapluie Alertmanager documentation CC-BY-4.0
  • 60. Alertmanager inhibition inhibit_rules: - source_match: alertname: "Time Window - Night" target_match: time_window: 10x7 - source_match: alertname: "Time Window - OBH" target_match: time_window: 10x5 Inhibition @roidelapluie
  • 61. Alertmanager inhibition - source_match: alertname: "Time Window - Extended Night" target_match: time_window: 13x7 - source_match: alertname: "Time Window - Extended OBH" target_match: time_window: 13x5 - source_match: alertname: "Time Window - Extended OBH with Saturday" target_match: time_window: 13x6 Inhibition @roidelapluie
  • 63. Prometheus alert - alert: a target is down expr: up == 0 for: 5m labels: recipients_prod: customer1/sms,opsteam/ticket time_window_prod: 24x7 recipients: opsteam/chat time_window: 8x5 Per env recipients @roidelapluie
  • 64. Prometheus con g alerting: alert_relabel_configs: - source_labels: [time_window_prod,env] regex: "(.+);prod" target_label: time_window replacement: '$1' - source_labels: [time_window_dev,env] regex: "(.+);dev" target_label: time_window replacement: '$1' Repeat for other env, other labels. Per env recipients @roidelapluie
  • 65. Prometheus con g alerting: alert_relabel_configs: - source_labels: [time_window] regex: "never" action: drop Be careful about the order (time_window can be mutated by relabeling). Drop alert @roidelapluie
  • 67. • Alert-Writing experience is great with this • Prometheus and PromQL can do a lot • Con g management lls the "gaps" • With some e ort, we have everything we wanted Conclusion @roidelapluie
  • 68. - name: normal rate interval: 120s rules: - record: request_rate_history expr: | sum(rate(http_requests_total[5m])) by (env) labels: when: 0w Bonus @roidelapluie
  • 69. - record: request_rate_history expr: | ( sum(rate(http_requests_total[5m] offset 168h)) by (env) and on () ( daily_saving_time_belgium offset 1w == daily_saving_time_belgium) ) or ( sum(rate(http_requests_total[5m] offset 167h)) by (env) and on () ( daily_saving_time_belgium offset 1w < daily_saving_time_belgium) ) or ( sum(rate(http_requests_total[5m] offset 169h)) by (env) and on () ( daily_saving_time_belgium offset 1w > daily_saving_time_belgium) ) labels: when: 1w Bonus @roidelapluie
  • 70. - record: request_rate_normal expr: | max(bottomk(1, topk(4, request_rate_history) by(env) ) by(env)) by(env) Bonus @roidelapluie
  • 73. Really hope we will get rid of DST in 2021! Bonus @roidelapluie
  • 74. Julien Pivotto @roidelapluie roidelapluie@inuits.eu Essensteenweg 31 2930 Brasschaat Belgium Contact: info@inuits.eu +32-3-8082105