5. Monitoring
Collecting, processing, aggregating, and displaying real-time quantitative data
about a system, such as query counts and types, error counts and types,
processing times, and server lifetimes.
● White-box monitoring
● Black-box monitoring
● Dashboard
● Alert
● Root cause
● Push
● Node and machine
6. Why Monitor?
● Analyzing long-term trends
● Comparing over time or experiment groups
● Alerting
● Building dashboards
● Conducting ad hoc retrospective analysis (i.e., debugging)
7. Please stop using Nagios (Andy Sykes)
So we can die peacefully…
Who uses it?
Why did you choose it?
Advantages:
● Incredibly simple plugin model
● Simple to use
● Many people know it
● Top result on Google, and everybody uses it :)
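The "incredibly simple plugin model" really is just a process that prints one status line and exits with a well-known code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN), with optional perfdata after a `|`. A minimal sketch of such a check (thresholds and the disk-free metric are illustrative choices, not from the talk):

```python
import shutil
import sys

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_disk_free(path="/", warn_pct=20, crit_pct=10):
    """Return (exit_code, status_line) for free disk space on `path`."""
    usage = shutil.disk_usage(path)
    free_pct = usage.free * 100 / usage.total
    # Perfdata goes after the "|" separator in the status line.
    perfdata = f"free={free_pct:.1f}%;{warn_pct};{crit_pct}"
    if free_pct < crit_pct:
        return CRITICAL, f"DISK CRITICAL - {free_pct:.1f}% free | {perfdata}"
    if free_pct < warn_pct:
        return WARNING, f"DISK WARNING - {free_pct:.1f}% free | {perfdata}"
    return OK, f"DISK OK - {free_pct:.1f}% free | {perfdata}"

if __name__ == "__main__":
    code, message = check_disk_free()
    print(message)   # Nagios parses the first line of stdout
    sys.exit(code)   # ...and the exit code decides the state
```

That simplicity is exactly why so many plugins exist, and why the model got reused by Icinga and others.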
Disadvantages:
● Doesn't scale - no clustering (Thruk hack)
● Millions of lines of configuration (check_mk hack)
● Horrible interface
● Only for static infrastructure
● Clumsy client output format - more hacks
● Perfdata…
● No API (Livestatus hack)
● You always need to hack…
13. When your monitoring sucks...
- Improve the quality of alerts
- Improve the monitoring tools, or even replace them
Wait a minute… before you start solving these problems:
14. UNDERSTAND PROBLEMS AND MEASURE THEM!!!
“To measure is to know.”
“If you cannot measure it,
you cannot improve it.”
William Thomson, Lord Kelvin
17. Over-monitoring and alarm fatigue: for whom do the
bells toll? Hospitals in the USA
- Ignoring alarm notifications
- “Yeah, that's not important”
- 72–99% of alerts are false
- Young parents vs. nurses in hospitals
- Monitoring means more money
- More is not better
- A patient could die
- Telemetry as a means of preventing, detecting, and improving
Source:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4926996/
19. What should monitoring be?
Actionable
Compatible
Essential - only the alerts that are needed
Fully automated
Proactive - it should predict failures
Easy for operators
20. State monitoring: what should it be like?
State (or black-box) monitoring currently makes the most sense for VMs and bare metal.
What should be monitored with these kinds of tools?
● Health endpoints
● Service states (like systemctl status *)
What could be monitored?
● Specific endpoints (for example, from a satellite node) with HTTP/TCP checks
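At its core, a black-box TCP check is tiny: try to open a connection and map success or failure to a Nagios-style state. A minimal sketch (host/port and the timeout are illustrative; an HTTP variant would additionally inspect the status code of a health endpoint):

```python
import socket

# Nagios-style states used by state-monitoring tools
OK, CRITICAL = 0, 2

def tcp_check(host, port, timeout=3.0):
    """Return OK if a TCP connection to host:port succeeds, CRITICAL otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK
    except OSError:
        # Covers refused connections, timeouts, and DNS failures alike.
        return CRITICAL
```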
21. Icinga2 - a Nagios fork, but rewritten in many places; it has scaling scenarios
(multi-master, with three levels of nodes: masters, satellites (e.g. a supervisor per DC),
and clients (check executors)) and plugins such as an InfluxDB metric exporter, Livestatus, etc.
What can we get from Icinga2?
● A highly available, distributed setup
● A nice, well-documented REST API
● (dynamic inventory)
● Less time needed to implement features
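For a flavor of why configuration is less painful than raw Nagios: Icinga2 has a real DSL with `apply` rules, so a service can attach itself to matching hosts instead of being listed per host. A hedged sketch (host name and address are made-up examples; `hostalive` and `http` are standard ITL check commands):

```
object Host "web01" {
  address = "192.0.2.10"      // documentation-range example address
  check_command = "hostalive"
}

apply Service "http" {
  check_command = "http"
  assign where host.address   // attach to every host that has an address set
}
```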
22. Metrics
Metric tools can be used in two ways:
1. Failure prediction
2. Graphing the data for humans - and for humans that means SIMPLE
The first case is quite simple - define rules for detecting anomalies, like more traffic than
usual, and alert if it can impact other clients.
The second case is also simple - just graphs, for debugging and a better understanding
of what is happening with the applications.
Not every metric should have an alert (and notifications)!
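A "more traffic than usual" rule can be as simple as comparing each sample to a rolling baseline. A minimal sketch of that idea (the window size and the 3x factor are arbitrary illustrative choices, not from the talk):

```python
from collections import deque

class TrafficAnomalyDetector:
    """Flag samples that exceed the recent average by a given factor."""

    def __init__(self, window=60, factor=3.0):
        self.samples = deque(maxlen=window)  # rolling baseline window
        self.factor = factor

    def is_anomaly(self, value):
        baseline = sum(self.samples) / len(self.samples) if self.samples else None
        self.samples.append(value)
        if baseline is None:
            # Not enough history yet: never alert on a cold start.
            return False
        return value > baseline * self.factor
```

In practice a metrics system (e.g. Prometheus) expresses the same comparison declaratively in its query language rather than in application code.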
23. Prometheus
Around 120 ready-to-use dashboards in the Grafana repository (e.g. the MySQL board by
Percona).
Many useful features in one tool - Prometheus has a rich query language, an
Alertmanager, support for PagerDuty, etc.
Plenty of exporters (collectors) for standard tools: MySQL, HAProxy, NGINX,
Pagespeed, BIND, Jenkins, scollector.
Third-party projects with Prometheus support: GitLab, Kubernetes, etcd, Telegraf,
jmx-exporter, collectd.
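The rich query language and Alertmanager come together in alerting rules. A hedged sketch of a rule file (the `http_requests_total` counter and the thresholds are assumed examples, not from the talk):

```yaml
groups:
  - name: example
    rules:
      - alert: HighErrorRate
        # rate() over a counter gives errors per second over the last 5m.
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 1
        for: 10m            # only fire if sustained, to avoid alarm fatigue
        labels:
          severity: page
        annotations:
          summary: "5xx rate above 1 req/s for 10 minutes"
```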
24. Logs
Servers, applications, and network and security devices generate log files.
Errors, problems, and other information are constantly logged and saved for
analysis.
Once an event is detected, the monitoring system sends an alert, either to a
person or to another software/hardware system.
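The detect-and-alert loop over logs reduces to pattern matching plus a notification hook. A toy sketch (the severity pattern and the `notify` callback are illustrative; a real stack like ELK does this at scale with parsing and indexing):

```python
import re

# Hypothetical severity markers, as found in syslog-ish lines like
# "2024-01-01T12:00:00 app[123]: ERROR connection refused"
ERROR_RE = re.compile(r"\b(ERROR|CRIT|FATAL)\b")

def scan_log(lines, notify):
    """Call notify(line) for every log line that looks like an error.

    Returns the number of matching lines.
    """
    hits = 0
    for line in lines:
        if ERROR_RE.search(line):
            notify(line)  # e.g. send to a pager or another system
            hits += 1
    return hits
```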
27. Monitoring strategy
Icinga2 for state monitoring on bare metal, VMs, and VMs in the cloud.
Prometheus for metrics and data from Kubernetes (or other container) clusters.
The ELK stack for logs.
37. Good practices for alerts
● Notify before the accident
● Actionable alarms
● Measure things that have value
● Documentation - not just one-liners in the on-call wiki
● Reduce the number of tools
● Terraform
55. Team
For changes to the monitoring infrastructure to make sense now and in the future, they
should be supported by both development and "reacting" teams.
Reacting team:
● 24/7 people watching the boards and reacting to issues, working in shifts
● An incident manager making decisions and investigating how to tune the monitoring
● People with “programming” skills responsible for deploying the IM's proposals
(writing new checks, adding small pieces of code)