How could one create very sophisticated, open - source based monitoring solution that is very scalable and easy to deploy?
I gave this talk during on of the biggest Linux conferences in Poland: 11 Linux Session which took place in Wrocław on 5/6-04-2013
Why Teams call analytics are critical to your entire business
Monitoring with Nagios and Ganglia
1. Maciej Lasyk, Ganglia & Nagios
Maciej Lasyk
11. Sesja Linuksowa
Wrocław, 2014-04-06
1/25
Ganglia & Nagios
2. Ganglia.. what?
Ganglia – cluster / group of neurons found outside
the central nervous system
Maciej Lasyk, Ganglia & Nagios 2/25
3. Just a little about monitoring
- the need for monitoring
Maciej Lasyk, Ganglia & Nagios 3/25
4. Just a little about monitoring
- the need for monitoring
- measuring availability
Maciej Lasyk, Ganglia & Nagios 3/25
5. Just a little about monitoring
- the need for monitoring
- measuring availability
- measuring performance
Maciej Lasyk, Ganglia & Nagios 3/25
6. Just a little about monitoring
- the need for monitoring
- measuring availability
- measuring performance
- gathering additional metrics
Maciej Lasyk, Ganglia & Nagios 3/25
7. Monitoring is critical for HA
How to measure availability?
Maciej Lasyk, Ganglia & Nagios 4/25
8. Monitoring is critical for HA
How to measure availability?
A = Uptime / (Uptime + Downtime)
Maciej Lasyk, Ganglia & Nagios 4/25
9. Monitoring is critical for HA
How to measure availability?
A = Uptime / (Uptime + Downtime)
MTTD (Mean Time to Diagnose)
The average time it takes to diagnose the problem
Maciej Lasyk, Ganglia & Nagios 4/25
10. Monitoring is critical for HA
How to measure availability?
A = Uptime / (Uptime + Downtime)
MTTD (Mean Time to Diagnose)
The average time it takes to diagnose the problem
MTTR (Mean Time to Repair)
The average time it takes to fix a problem
Maciej Lasyk, Ganglia & Nagios 4/25
11. Monitoring is critical for HA
How to measure availability?
A = Uptime / (Uptime + Downtime)
MTTD (Mean Time to Diagnose)
The average time it takes to diagnose the problem
MTTR (Mean Time to Repair)
The average time it takes to fix a problem
MTTF (Mean Time to Failure)
The average time there is correct behavior
Maciej Lasyk, Ganglia & Nagios 4/25
12. Monitoring is critical for HA
How to measure availability?
A = Uptime / (Uptime + Downtime)
MTTD (Mean Time to Diagnose)
The average time it takes to diagnose the problem
MTTR (Mean Time to Repair)
The average time it takes to fix a problem
MTTF (Mean Time to Failure)
The average time there is correct behavior
MTBF (Mean Time Between Failures)
The average time between different failures of the service
Maciej Lasyk, Ganglia & Nagios 4/25
14. Monitoring is critical for HA
Maciej Lasyk, Ganglia & Nagios
A = MTTF / MTBF = MTTF / (MTTF + MTTD + MTTR)
4/25
15. What should we monitor?
Maciej Lasyk, Ganglia & Nagios
- hardware housing
- devices
- storage
- network
- hosts
- software (very deep hole)
5/25
16. What should we monitor?
Maciej Lasyk, Ganglia & Nagios
- hardware housing
- devices
- storage
- network
- hosts
- software (very deep hole)
Think dependencies!
5/25
17. When outage hits us – don't panic!
Maciej Lasyk, Ganglia & Nagios
- Notifications
6/25
18. When outage hits us – don't panic!
Maciej Lasyk, Ganglia & Nagios
- Notifications
- Escalations
L1 <-> L2 <-> L3 <-> L4 lol ;)
desktop support / devs / ops / networking /
/ storage / middleware / dc / security
6/25
19. When outage hits us – don't panic!
Maciej Lasyk, Ganglia & Nagios
- Notifications
- Escalations
L1 <-> L2 <-> L3 <-> L4 lol ;)
desktop support / devs / ops / networking /
/ storage / middleware / dc / security
- Clock is ticking – it should be simple
6/25
20. When outage hits us – don't panic!
Maciej Lasyk, Ganglia & Nagios
- Notifications
- Escalations
L1 <-> L2 <-> L3 <-> L4 lol ;)
desktop support / devs / ops / networking /
/ storage / middleware / dc / security
- Clock is ticking – it should be simple
- What if cell is offline or someone is out?
6/25
50. Maciej Lasyk, Ganglia & Nagios
Nagios recap
Notifications
- periods
- groups
- which states to be notified about?
10/25
51. Maciej Lasyk, Ganglia & Nagios
Nagios recap
Notifications
- periods
- groups
- which states to be notified about?
- escalations / rotations
10/25
52. Maciej Lasyk, Ganglia & Nagios
Nagios recap
Notifications
- periods
- groups
- which states to be notified about?
- escalations / rotations
- custom notifications method
10/25
53. Maciej Lasyk, Ganglia & Nagios
Nagios recap
Monitoring remotes
- NRPE daemons
- checks via SSH
10/25
61. Maciej Lasyk, Ganglia & Nagios
Ganglia – what is it?
Problems of big scale:
20k hosts with zylion metrics probed every 10 seconds
It is fully redundant (until you spoil it)
It is very scalable
Regexp searches and creating of views – adhoc :)
12/25
77. Maciej Lasyk, Ganglia & Nagios
Ganglia – web (events)
Events have API json based
Think – integration with whatever app :)
17/25
78. Maciej Lasyk, Ganglia & Nagios
Ganglia – web (dashboards)
- Create view -> apply as dashboard
- Create dashboard from XML
- Generate graphs and add to views
17/25
82. Maciej Lasyk, Ganglia & Nagios
Ganglia – metrics
- base / extended metrics
- own modules
18/25
83. Maciej Lasyk, Ganglia & Nagios
Ganglia – metrics
- base / extended metrics
- own modules
- c / c++
18/25
84. Maciej Lasyk, Ganglia & Nagios
Ganglia – metrics
- base / extended metrics
- own modules
- c / c++
- mod_python
18/25
85. Maciej Lasyk, Ganglia & Nagios
Ganglia – metrics
- base / extended metrics
- own modules
- c / c++
- mod_python
- spoofing
18/25
86. Maciej Lasyk, Ganglia & Nagios
Ganglia – metrics
- base / extended metrics
- own modules
- c / c++
- mod_python
- spoofing
- gmetric
- gmetric4j / java
18/25
87. Maciej Lasyk, Ganglia & Nagios
Ganglia – metrics
- base / extended metrics
- own modules
- c / c++
- mod_python
- spoofing
- gmetric
- gmetric4j / java
- Which to choose? gmetric / python / c/c++?
18/25
88. Maciej Lasyk, Ganglia & Nagios
Ganglia and logfiles?
ganglia-logtailer
- https://bitbucket.org/maplebed/ganglia-logtailer
- parser logfiles (realtime)
- pushes data to ganglia (via gmetric)
- yup – based on specific log formats
- yet still – open source so poke around ;)
19/25
90. Nagios + Ganglia: ganglia-web/nagios
Maciej Lasyk, Ganglia & Nagios
https://github.com/ganglia/ganglia-web
Sending Nagios Data to Ganglia
service_perfdata_command
Or replace Nagios checks with Ganglia!
- Check heartbeat.
- Check a single metric on a specific host.
- Check multiple metrics on a specific host.
- Check multiple metrics across a regex-defined
range of hosts
21/25
91. Maciej Lasyk, Ganglia & Nagios
Nagios + Ganglia: ganglia-web/nagios
Nagios pulls info from Ganglia via HTTP
21/25
92. Maciej Lasyk, Ganglia & Nagios
Nagios + Ganglia: ganglia-nagios-bridge
- https://github.com/ganglia/ganglia-nagios-bridge
- Python script run in e.g. in crontab
- pulls data from Ganglia XML via sockets
- parses XML
- send data to Nagios
- Nagios commits only passive checks
22/25
93. Maciej Lasyk, Ganglia & Nagios
Nagios + Ganglia: check_ganglia_metric
- https://pypi.python.org/pypi/check_ganglia_metric/
- basically Nagios plugin
- pulls data from Ganglia XML via sockets
- check_ganglia_metric.py
--gmetad_host=gmetad-server.example.com
--metric_host=host.example.com --metric_name=cpu_idle
23/25
94. Maciej Lasyk, Ganglia & Nagios
Nagios + Ganglia
Which one integration should I use?
24/25
95. Maciej Lasyk, Ganglia & Nagios
Nagios + Ganglia
Which one integration should I use?
Seriously – try yourself and test
24/25
96. Maciej Lasyk, Ganglia & Nagios
Freenode #ganglia
https://lists.sourceforge.net/lists/listinfo/ganglia-general
24.5/25
97. sources?
Maciej Lasyk, Ganglia & Nagios 25/25
- “Monitoring with Ganglia” book
- also nagios.org
- and “Web Operations” book
- plus some experience ;)
98. Maciej Lasyk
11. Sesja Linuksowa
2014-04-06, Wrocław
http://maciek.lasyk.info/sysop
maciek@lasyk.info
@docent-net
Ganglia & Nagios
Thank you :)
Maciej Lasyk, Ganglia & Nagios 25/25