Talk by: Magnus Lübeck
This talk discusses Icinga as the “one-stop shop” for finding the “single truth of systems state”. KMG Group uses a “four field” model when designing systems, in which Icinga has an important place in a section called “technical monitoring/technical performance monitoring”. We touch on two methodologies: MOPS (Metrics, Operational tools, Processes, and Standards) and Ted Dziuba’s actionable response to monitoring events.
Efficient IT operations using monitoring systems and standardized tools - Icinga Camp Zurich 2019
1. KMG Group GmbH, http://www.kmggroup.ch
Magnus Lübeck, Zürich, 2019-11-12
http://kmg.group
Icinga Day Zürich 2019
2. 2
Sysadmin since the 90’s
Unix/Oracle at Volvo
Pre sales, Sun Microsystems reseller
Oracle DBA at CERN
IT Operations manager at Accarda
IT Operations manager at Kanton LU
Owner of KMG Group GmbH
Built infrastructure and operations at
Swisscom
peaq
Serafe
This is me
5. 5
Quick overview
People, tools and processes
The four fielder
Telemetry and health
Desire lines
OSS and Free software in modern operations environments
Tool landscape
Icinga’s part in the mechano
Outline
8. 8
Telemetry is part of good systems design
Measurement points should be a mandatory part of EVERY system
This has been known for many years, across many industries
Metrics - /status, /health
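As an illustration of what a /health measurement point could look like, here is a minimal sketch using only the Python standard library. The check names ("database", "queue"), the port, and the payload fields are assumptions made for this example; they are not taken from the talk.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()


def health_payload(checks):
    """Build a health document from a dict of check name -> bool result."""
    healthy = all(checks.values())
    return {
        "status": "ok" if healthy else "degraded",
        "uptime_seconds": round(time.time() - START_TIME, 1),
        "checks": checks,
    }


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        # Hypothetical checks; a real application would probe its own dependencies.
        payload = health_payload({"database": True, "queue": True})
        body = json.dumps(payload).encode("utf-8")
        # 200 when healthy, 503 otherwise, so a plain HTTP check can alert on the status code.
        self.send_response(200 if payload["status"] == "ok" else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Start the endpoint on localhost; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

Calling serve() and requesting http://127.0.0.1:8080/health returns the JSON document. Mapping an unhealthy state to HTTP 503 is one possible design choice that lets even a very simple HTTP service check detect the problem.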
9. 9
The use of waveforms to diagnose broken things is far from new.
The triangular form is particularly useful.
Can be used in many ways
Very useful for repetitive patterns.
Metrics - /status, /health
12. 12
A fool with a tool is still a fool.
Get smart people
Use tools
Integrate the tools with your environment.
Tools can cost money
But do not have to
Operational tools
15. 15
Naming conventions
No servers named after porn stars
Baseline installations
Mini OS install
Automation / Infrastructure as code
Ansible, Chef, Puppet
Coding guidelines
Standards
16. 16
Inception on so many levels
Deals with ”less than 24/7” SLAs
You can service check your SLA
Shameless plug – SLA check
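The idea of service checking an SLA that covers less than 24/7 can be sketched as follows. This is not the SLA check from the talk; it is a hypothetical illustration in which the service window (Monday to Friday, 07:00 to 18:00), the thresholds, and all function names are assumptions. Outages outside the service window do not count against availability.

```python
from datetime import datetime, timedelta, time as dtime

# Hypothetical service window: Monday-Friday, 07:00-18:00.
SERVICE_DAYS = {0, 1, 2, 3, 4}  # datetime.weekday(): Monday is 0
SERVICE_START = dtime(7, 0)
SERVICE_END = dtime(18, 0)


def in_service_window(ts):
    """True if the timestamp falls inside the agreed service hours."""
    return ts.weekday() in SERVICE_DAYS and SERVICE_START <= ts.time() < SERVICE_END


def sla_availability(outages, period_start, period_end):
    """Percent availability, counting only minutes inside the service window.

    outages is a list of (start, end) datetime pairs.
    """
    step = timedelta(minutes=1)
    service_minutes = 0
    down_minutes = 0
    ts = period_start
    while ts < period_end:
        if in_service_window(ts):
            service_minutes += 1
            if any(start <= ts < end for start, end in outages):
                down_minutes += 1
        ts += step
    if service_minutes == 0:
        return 100.0
    return 100.0 * (service_minutes - down_minutes) / service_minutes


def sla_check(availability, warn=99.5, crit=99.0):
    """Map availability to plugin-style exit codes: 0 OK, 1 WARNING, 2 CRITICAL."""
    if availability < crit:
        return 2
    if availability < warn:
        return 1
    return 0
```

With this window, an outage on a Saturday leaves the availability untouched, while the same outage on a Tuesday morning lowers it. The minute-by-minute loop is chosen for readability rather than speed.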
19. 19
A customer of mine had
8’500 Open Critical Alerts
15’300 Warnings
Typical “cry wolf” scenario
3 possible/allowed Actions
Solve the problem
Change the threshold (change the metric, template, standard)
Remove the alert
Monitoring theory:
Bad design reduces the value of your monitoring
20. 20
Move the responsibility of delivering telemetry to the application designers and the application owners
Help them learn how to write service checks
A service delivery is not complete unless telemetry and monitoring packages are delivered
Application service check responsibility
devOps or stoneAgeOps?
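To show application owners how small a service check can be, here is a sketch that follows the common monitoring plugin convention: exit code 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN, and one line of output with optional performance data after a "|". The metric (a queue depth), the thresholds, and read_queue_depth() are hypothetical.

```python
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Return (exit_code, output_line) for a measured value and its thresholds."""
    if value is None:
        return UNKNOWN, "UNKNOWN - could not read queue depth"
    if value >= crit:
        state = CRITICAL
    elif value >= warn:
        state = WARNING
    else:
        state = OK
    # Text before "|" is for humans; text after it is performance data
    # in the form label=value;warn;crit.
    line = f"{LABELS[state]} - queue depth is {value} | queue_depth={value};{warn};{crit}"
    return state, line


def read_queue_depth():
    """Hypothetical probe; replace with a call into the real application."""
    return 42


def main():
    state, line = evaluate(read_queue_depth(), warn=100, crit=500)
    print(line)
    sys.exit(state)
```

A script like this would call main() at the bottom and be registered as a check command. Keeping evaluate() free of I/O makes the threshold logic easy to unit test.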
21. 21
Question from an auditor (ISO-27001 audit)
How do you ensure that all applications work after a patch run?
My answer:
We don’t
The big audit monster
26. 26
Manually edit config – use it when you learn Icinga
Good ways to do it
Automate an Icinga-centric configuration repository – Icinga Director
Icinga API – write the integration yourself
Automation with Ansible
Metamonitoring
By using your inventory, you know what you are monitoring
And, what you are not monitoring
Icinga client and service registration
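The metamonitoring idea, comparing the inventory with what is actually monitored, comes down to a set difference. The sketch below assumes the list of monitored hosts has already been fetched from the Icinga 2 API (for example from /v1/objects/hosts) and decoded from JSON into a dict with a "results" list whose entries carry a "name". The host names and the helper names are hypothetical.

```python
def monitored_hosts(api_response):
    """Extract host names from a decoded host-list response.

    Assumes the shape {"results": [{"name": "..."}, ...]}.
    """
    return {item["name"] for item in api_response.get("results", [])}


def coverage_gaps(inventory, monitored):
    """Compare the inventory with the monitored hosts, in both directions."""
    inventory = set(inventory)
    monitored = set(monitored)
    return {
        # In the inventory, but nobody is watching it.
        "unmonitored": sorted(inventory - monitored),
        # Monitored, but missing from the inventory.
        "unknown_to_inventory": sorted(monitored - inventory),
    }


# Hypothetical sample data for illustration only.
sample_response = {
    "results": [
        {"name": "web01", "type": "Host"},
        {"name": "db01", "type": "Host"},
        {"name": "legacy07", "type": "Host"},
    ]
}
sample_inventory = ["web01", "web02", "db01"]
gaps = coverage_gaps(sample_inventory, monitored_hosts(sample_response))
```

Here gaps["unmonitored"] contains web02 and gaps["unknown_to_inventory"] contains legacy07. Either list being non-empty is itself something worth alerting on.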
27. 27
The layer cake is your monitoring standard grouped by common denominators.
Group service checks in layers (e.g. L0 – L5)
L0 – OS Level - (Linux admin)
CPU, disk usage, ssh, ping, fs usage {/, /var, /home}
L1 – Server type – shared OS resources (Linux Admin)
iops on db fs, fs usage on /app/ora
…
L5 – Application checks – (Application Managers)
Application specific checks
The Layered Cake
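One way to make the layer cake concrete is a small data structure that maps each layer to its owner and checks. This is a sketch, not the configuration used in the talk: only the layers named on the slide (L0, L1, L5) are included, the layers in between are left out because the slide does not list them, and the function name is hypothetical.

```python
# Layers as named on the slide; intermediate layers are intentionally omitted.
LAYERS = {
    "L0": {
        "scope": "OS level",
        "owner": "Linux admin",
        "checks": ["cpu", "disk usage", "ssh", "ping",
                   "fs usage /", "fs usage /var", "fs usage /home"],
    },
    "L1": {
        "scope": "Server type, shared OS resources",
        "owner": "Linux admin",
        "checks": ["iops on db fs", "fs usage /app/ora"],
    },
    "L5": {
        "scope": "Application checks",
        "owner": "Application managers",
        "checks": ["application specific checks"],
    },
}


def checks_for(layer_names):
    """Return (layer, owner, check) tuples for the layers assigned to a host."""
    result = []
    for name in sorted(layer_names):
        layer = LAYERS[name]
        for check in layer["checks"]:
            result.append((name, layer["owner"], check))
    return result
```

A database server could then be assigned ["L0", "L1"], and the resulting tuples show both which checks apply and who is responsible for each of them.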
28. 28
The human brain is excellent at identifying harmonies and regularities.
Ingredient number 2: Sawtooth waveform