Maciej Lasyk, Ganglia & Nagios
Maciej Lasyk
11. Sesja Linuksowa
Wrocław, 2014-04-06
1/25
Ganglia & Nagios
Ganglia... what?
Ganglia – clusters (groups) of neurons found outside the central nervous system
Just a little about monitoring
- the need for monitoring
- measuring availability
- measuring performance
- gathering additional metrics
Monitoring is critical for HA
How to measure availability?
A = Uptime / (Uptime + Downtime)
MTTD (Mean Time to Diagnose)
The average time it takes to diagnose the problem
MTTR (Mean Time to Repair)
The average time it takes to fix a problem
MTTF (Mean Time to Failure)
The average time the service behaves correctly before it fails
MTBF (Mean Time Between Failures)
The average time between consecutive failures of the service
A = MTTF / MTBF = MTTF / (MTTF + MTTD + MTTR)
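A worked instance of the formula above, as a minimal Python sketch. The function name and the sample durations are illustrative assumptions, not figures from the talk; all three inputs must use the same time unit.

```python
def availability(mttf: float, mttd: float, mttr: float) -> float:
    """A = MTTF / MTBF, where MTBF = MTTF + MTTD + MTTR."""
    mtbf = mttf + mttd + mttr
    if mtbf <= 0:
        raise ValueError("MTBF must be positive")
    return mttf / mtbf


# Made-up example (hours): 720 h of correct behavior,
# 0.5 h to diagnose a failure, 1.5 h to repair it.
a = availability(mttf=720.0, mttd=0.5, mttr=1.5)
print(f"A = {a:.4%}")
```

With these numbers A is 720 / 722, so shortening diagnosis time (MTTD) raises availability just as shortening repair time does.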
What should we monitor?
- hardware housing
- devices
- storage
- network
- hosts
- software (a very deep rabbit hole)
Think dependencies!
When an outage hits us – don't panic!
- Notifications
- Escalations
L1 <-> L2 <-> L3 <-> L4 lol ;)
desktop support / devs / ops / networking / storage / middleware / dc / security
- The clock is ticking, so the process should be simple
- What if a cell phone is offline or someone is out?
Monitoring: notification issues
- false positives
- major events
- failover notifications?
- tolerance & critical thresholds
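One common way to trade off false positives against thresholds is to require several consecutive bad samples before notifying anyone. The sketch below illustrates that idea only; it is not Nagios's implementation, and the class name, field names, and sample values are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Tuple

OK, WARNING, CRITICAL = 0, 1, 2  # Nagios plugin return codes


@dataclass
class ThresholdCheck:
    warn: float            # value at or above this is WARNING
    crit: float            # value at or above this is CRITICAL
    max_attempts: int = 3  # consecutive non-OK samples required before notifying
    _bad_streak: int = 0

    def classify(self, value: float) -> int:
        if value >= self.crit:
            return CRITICAL
        if value >= self.warn:
            return WARNING
        return OK

    def observe(self, value: float) -> Tuple[int, bool]:
        """Return (state, notify); notify fires once, when the streak reaches max_attempts."""
        state = self.classify(value)
        if state == OK:
            self._bad_streak = 0
            return state, False
        self._bad_streak += 1
        return state, self._bad_streak == self.max_attempts


check = ThresholdCheck(warn=80.0, crit=95.0, max_attempts=3)
for sample in [50, 97, 60, 96, 97, 98, 99]:
    print(sample, check.observe(sample))
```

The single spike (97 followed by 60) never notifies; only the third bad sample in a row does, and the fourth stays quiet so the same problem is not announced twice.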
Monitoring: reporting
- baseline
- correlation between incidents and change management
- trending info
- reporting
Monitoring: good practices
- avoid NIH (not invented here)!
- DVCS
- testing environments
- think usability!
- passive checks
- automate, don't hardcode
- security
Last but not least...
“Quis custodiet ipsos custodes?”
(Who will guard the guards themselves?)
Nagios recap
Host / Services / Contacts
- hosts, hostgroups
- services, service groups
- templates
- time periods
- host and service dependencies
- regular expressions
Checks and states
- frequencies & thresholds
- scheduling downtimes
- outages and flapping
Notifications
- periods
- groups
- which states to be notified about?
- escalations / rotations
- custom notification methods
Monitoring remotes
- NRPE daemons
- checks via SSH
Web interface
- tactical overview
- availability reports
- trends
- network maps
Networking recap
- Unicast
- Multicast
- Broadcast
Ganglia – what is it?
Problems of big scale:
20k hosts with a zillion metrics probed every 10 seconds
It is fully redundant (until you break it)
It is very scalable
Regexp searches and ad hoc creation of views :)
Ganglia – architecture
Ganglia – topologies
- Default multicast topology
- Deaf / mute multicast topology
- Unicast topology
- Gmetad topology
- Gmetad HA topology (active-active)
- Gmetad hierarchical topology
Ganglia – RRDcached
Ganglia – sFlow
Ganglia – web (grid view)
Ganglia – web (cluster view)
Ganglia – web (physical view)
Ganglia – web (host view)
Ganglia – web (compare hosts)
Ganglia – web (events)
Events have a JSON-based API
Think: integration with any app :)
Ganglia – web (dashboards)
- Create view -> apply as dashboard
- Create dashboard from XML
- Generate graphs and add to views
Ganglia – web (graphs)
Ganglia – metrics
- base / extended metrics
- custom modules
- C / C++
- mod_python
- spoofing
- gmetric
- gmetric4j / Java
- Which to choose? gmetric / Python / C/C++?
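For the Python option, a gmond Python module is a plain file that exposes `metric_init` and `metric_cleanup` and returns a list of metric descriptors. The skeleton below is a minimal sketch: the metric name, group, and the choice of the 1-minute load average are assumptions for illustration, and `os.getloadavg` is Unix-only.

```python
import os


def read_load_one(name):
    """Callback that gmond invokes with the metric name; returns the current value."""
    return float(os.getloadavg()[0])


def metric_init(params):
    """Called once when gmond loads the module; returns the metric descriptors."""
    return [{
        "name": "example_load_one",  # hypothetical metric name
        "call_back": read_load_one,
        "time_max": 90,
        "value_type": "float",
        "units": "load",
        "slope": "both",
        "format": "%f",
        "description": "1-minute load average (example)",
        "groups": "example",
    }]


def metric_cleanup():
    """Called once when gmond shuts down; nothing to release in this sketch."""
    pass


if __name__ == "__main__":
    # Lets the module be exercised from the command line without gmond.
    for descriptor in metric_init({}):
        print(descriptor["name"], descriptor["call_back"](descriptor["name"]))
```

Running the file directly is a quick way to test the callbacks before wiring the module into gmond's configuration.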
Ganglia and logfiles?
ganglia-logtailer
- https://bitbucket.org/maplebed/ganglia-logtailer
- parses logfiles (in real time)
- pushes data to Ganglia (via gmetric)
- yes, it depends on specific log formats
- still, it is open source, so poke around ;)
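The parse-then-push idea can be sketched in a few lines of Python. This is not ganglia-logtailer's code: the log format (an access-log style line with the status code after the quoted request), the metric name, and the helper names are all assumptions. The sketch only prints the gmetric command unless `dry_run` is turned off, in which case the `gmetric` binary must be installed.

```python
import re
import subprocess
from typing import Iterable, List

# Assumed log format: ... "GET /path HTTP/1.1" 500 1234
STATUS_RE = re.compile(r'" (\d{3}) ')


def count_5xx(lines: Iterable[str]) -> int:
    """Count lines whose HTTP status code starts with 5."""
    count = 0
    for line in lines:
        match = STATUS_RE.search(line)
        if match and match.group(1).startswith("5"):
            count += 1
    return count


def gmetric_argv(name: str, value: int, units: str) -> List[str]:
    """Build a gmetric command line for an unsigned integer metric."""
    return ["gmetric", f"--name={name}", f"--value={value}",
            "--type=uint32", f"--units={units}"]


def push(argv: List[str], dry_run: bool = True) -> None:
    if dry_run:
        print(" ".join(argv))
    else:
        subprocess.run(argv, check=True)


sample = [
    '10.0.0.1 - - [06/Apr/2014:10:00:00 +0200] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [06/Apr/2014:10:00:01 +0200] "GET /api HTTP/1.1" 502 0',
    '10.0.0.3 - - [06/Apr/2014:10:00:02 +0200] "POST /api HTTP/1.1" 500 0',
]
push(gmetric_argv("http_5xx", count_5xx(sample), "errors"))
```

A real tailer would also track its position in the file and handle log rotation, which this sketch leaves out.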
So... Nagios + Ganglia!
3 ways to integrate:
- ganglia-web/nagios (PHP & bash based)
https://github.com/ganglia/ganglia-web
- ganglia-nagios-bridge (Python & cron based)
https://github.com/ganglia/ganglia-nagios-bridge
- check_ganglia_metric (Python)
https://github.com/ganglia/ganglia_contrib
Nagios + Ganglia: ganglia-web/nagios
https://github.com/ganglia/ganglia-web
Sending Nagios data to Ganglia:
service_perfdata_command
Or replace Nagios checks with Ganglia!
- Check heartbeat.
- Check a single metric on a specific host.
- Check multiple metrics on a specific host.
- Check multiple metrics across a regex-defined range of hosts.
Nagios pulls info from Ganglia via HTTP
Nagios + Ganglia: ganglia-nagios-bridge
- https://github.com/ganglia/ganglia-nagios-bridge
- Python script run from, e.g., crontab
- pulls data from Ganglia XML via sockets
- parses XML
- sends data to Nagios
- Nagios only processes passive check results
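To make "passive check" concrete: one standard way to hand Nagios a result it did not schedule itself is to write a `PROCESS_SERVICE_CHECK_RESULT` line to its external command file. The Python sketch below illustrates passive checks in general and is not a description of how ganglia-nagios-bridge delivers its results; the host, service, and helper names are assumptions.

```python
import time


def passive_result_line(host, service, return_code, output, now=None):
    """Format one external-command line: [ts] PROCESS_SERVICE_CHECK_RESULT;host;service;code;output"""
    timestamp = int(time.time() if now is None else now)
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{output}\n")


def submit(line, command_file):
    """Write the line to the Nagios external command pipe.

    command_file is whatever `command_file` is set to in nagios.cfg;
    the path differs between installations, so it is a parameter here.
    """
    with open(command_file, "w") as pipe:
        pipe.write(line)


# Printing instead of submitting keeps the example runnable without Nagios.
print(passive_result_line("host.example.com", "cpu_idle", 0,
                          "OK - cpu_idle=93.1", now=1396778400), end="")
```

The service named in the line must already be defined in Nagios with passive checks enabled, otherwise the result is discarded.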
Nagios + Ganglia: check_ganglia_metric
- https://pypi.python.org/pypi/check_ganglia_metric/
- basically a Nagios plugin
- pulls data from Ganglia XML via sockets
- check_ganglia_metric.py \
    --gmetad_host=gmetad-server.example.com \
    --metric_host=host.example.com --metric_name=cpu_idle
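What "pulls data from Ganglia XML via sockets" amounts to can be sketched in Python: connect to the XML port, read until the peer closes, find the metric, and map it to a Nagios return code. This is an illustration, not the check_ganglia_metric source; the helper names, thresholds, and sample XML are assumptions, and port 8651 is gmetad's default XML port (gmond's default is 8649).

```python
import socket
import xml.etree.ElementTree as ET

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # Nagios plugin return codes


def fetch_xml(host, port=8651, timeout=10.0):
    """Read the XML dump that gmetad (or gmond) writes on connect."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def metric_value(xml_bytes, metric_host, metric_name):
    """Return the metric as a float, or None if the host or metric is absent."""
    root = ET.fromstring(xml_bytes)
    for host in root.iter("HOST"):
        if host.get("NAME") != metric_host:
            continue
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == metric_name:
                return float(metric.get("VAL"))
    return None


def check_low(value, warn, crit):
    """Alert when a value drops too low (suits cpu_idle-style metrics)."""
    if value is None:
        return UNKNOWN
    if value <= crit:
        return CRITICAL
    if value <= warn:
        return WARNING
    return OK


# Hand-written sample so the parsing can be tried without a running gmetad.
SAMPLE = b"""<GANGLIA_XML VERSION="3.6.0" SOURCE="gmetad">
<GRID NAME="example"><CLUSTER NAME="web">
<HOST NAME="host.example.com" IP="192.0.2.10">
<METRIC NAME="cpu_idle" VAL="7.5" TYPE="float" UNITS="%"/>
</HOST></CLUSTER></GRID></GANGLIA_XML>"""

value = metric_value(SAMPLE, "host.example.com", "cpu_idle")
print(value, check_low(value, warn=20.0, crit=10.0))
```

Against a live daemon, `metric_value(fetch_xml("gmetad-server.example.com"), ...)` would replace the sample; a real plugin would also print a status line and exit with the return code.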
Nagios + Ganglia
Which integration should I use?
Seriously – try them yourself and test
Freenode #ganglia
https://lists.sourceforge.net/lists/listinfo/ganglia-general
sources?
- “Monitoring with Ganglia” book
- also nagios.org
- and “Web Operations” book
- plus some experience ;)
Maciej Lasyk
11. Sesja Linuksowa
2014-04-06, Wrocław
http://maciek.lasyk.info/sysop
maciek@lasyk.info
@docent-net
Ganglia & Nagios
Thank you :)
Monitoring with Nagios and Ganglia

  • 1. Maciej Lasyk, Ganglia & Nagios Maciej Lasyk 11. Sesja Linuksowa Wrocław, 2014-04-06 1/25 Ganglia & Nagios
  • 2. Ganglia.. what? Ganglia – cluster / group of neurons found outside the central nervous system Maciej Lasyk, Ganglia & Nagios 2/25
  • 3. Just a little about monitoring - the need for monitoring Maciej Lasyk, Ganglia & Nagios 3/25
  • 4. Just a little about monitoring - the need for monitoring - measuring availability Maciej Lasyk, Ganglia & Nagios 3/25
  • 5. Just a little about monitoring - the need for monitoring - measuring availability - measuring performance Maciej Lasyk, Ganglia & Nagios 3/25
  • 6. Just a little about monitoring - the need for monitoring - measuring availability - measuring performance - gathering additional metrics Maciej Lasyk, Ganglia & Nagios 3/25
  • 7. Monitoring is critical for HA How to measure availability? Maciej Lasyk, Ganglia & Nagios 4/25
  • 8. Monitoring is critical for HA How to measure availability? A = Uptime / (Uptime + Downtime) Maciej Lasyk, Ganglia & Nagios 4/25
  • 9. Monitoring is critical for HA How to measure availability? A = Uptime / (Uptime + Downtime) MTTD (Mean Time to Diagnose) The average time it takes to diagnose the problem Maciej Lasyk, Ganglia & Nagios 4/25
  • 10. Monitoring is critical for HA How to measure availability? A = Uptime / (Uptime + Downtime) MTTD (Mean Time to Diagnose) The average time it takes to diagnose the problem MTTR (Mean Time to Repair) The average time it takes to fix a problem Maciej Lasyk, Ganglia & Nagios 4/25
  • 11. Monitoring is critical for HA How to measure availability? A = Uptime / (Uptime + Downtime) MTTD (Mean Time to Diagnose) The average time it takes to diagnose the problem MTTR (Mean Time to Repair) The average time it takes to fix a problem MTTF (Mean Time to Failure) The average time there is correct behavior Maciej Lasyk, Ganglia & Nagios 4/25
  • 12. Monitoring is critical for HA How to measure availability? A = Uptime / (Uptime + Downtime) MTTD (Mean Time to Diagnose) The average time it takes to diagnose the problem MTTR (Mean Time to Repair) The average time it takes to fix a problem MTTF (Mean Time to Failure) The average time there is correct behavior MTBF (Mean Time Between Failures) The average time between different failures of the service Maciej Lasyk, Ganglia & Nagios 4/25
  • 13. Monitoring is critical for HA Maciej Lasyk, Ganglia & Nagios 4/25
  • 14. Monitoring is critical for HA Maciej Lasyk, Ganglia & Nagios A = MTTF / MTBF = MTTF / (MTTF + MTTD + MTTR) 4/25
  • 15. What should we monitor? Maciej Lasyk, Ganglia & Nagios - hardware housing - devices - storage - network - hosts - software (very deep hole) 5/25
  • 16. What should we monitor? Maciej Lasyk, Ganglia & Nagios - hardware housing - devices - storage - network - hosts - software (very deep hole) Think dependencies! 5/25
  • 17. When outage hits us – don't panic! Maciej Lasyk, Ganglia & Nagios - Notifications 6/25
  • 18. When outage hits us – don't panic! Maciej Lasyk, Ganglia & Nagios - Notifications - Escalations L1 <-> L2 <-> L3 <-> L4 lol ;) desktop support / devs / ops / networking / / storage / middleware / dc / security 6/25
  • 19. When outage hits us – don't panic! Maciej Lasyk, Ganglia & Nagios - Notifications - Escalations L1 <-> L2 <-> L3 <-> L4 lol ;) desktop support / devs / ops / networking / / storage / middleware / dc / security - Clock is ticking – it should be simple 6/25
  • 20. When outage hits us – don't panic! Maciej Lasyk, Ganglia & Nagios - Notifications - Escalations L1 <-> L2 <-> L3 <-> L4 lol ;) desktop support / devs / ops / networking / / storage / middleware / dc / security - Clock is ticking – it should be simple - What if cell is offline or someone is out? 6/25
  • 21. Monitoring: notifications issues Maciej Lasyk, Ganglia & Nagios - false positives 7/25
  • 22. Maciej Lasyk, Ganglia & Nagios - false positives - major events Monitoring: notifications issues 7/25
  • 23. Maciej Lasyk, Ganglia & Nagios - false positives - major events - failover notifications? Monitoring: notifications issues 7/25
  • 24. Maciej Lasyk, Ganglia & Nagios - false positives - major events - failover notifications? - tolerance & critical thresholds Monitoring: notifications issues 7/25
  • 25. Monitoring: reporting Maciej Lasyk, Ganglia & Nagios - baseline 8/25
  • 26. Maciej Lasyk, Ganglia & Nagios - baseline - correlation between incidents and change management Monitoring: reporting 8/25
  • 27. Maciej Lasyk, Ganglia & Nagios - baseline - correlation between incidents and change management - trending info Monitoring: reporting 8/25
  • 28. Maciej Lasyk, Ganglia & Nagios - baseline - correlation between incidents and change management - trending info - reporting Monitoring: reporting 8/25
  • 29. Monitoring: good practices Maciej Lasyk, Ganglia & Nagios - don't NIH! 9/25
  • 30. Maciej Lasyk, Ganglia & Nagios - don't NIH! - DVCS Monitoring: good practices 9/25
  • 31. Maciej Lasyk, Ganglia & Nagios - don't NIH! - DVCS - testing envs Monitoring: good practices 9/25
  • 32. Maciej Lasyk, Ganglia & Nagios - don't NIH! - DVCS - testing envs - think usability! Monitoring: good practices 9/25
  • 33. Maciej Lasyk, Ganglia & Nagios - don't NIH! - DVCS - testing envs - think usability! - passive checks Monitoring: good practices 9/25
  • 34. Maciej Lasyk, Ganglia & Nagios - don't NIH! - DVCS - testing envs - think usability! - passive checks - automate – don't hardcode Monitoring: good practices 9/25
  • 35. Maciej Lasyk, Ganglia & Nagios - don't NIH! - DVCS - testing envs - think usability! - passive checks - automate – don't hardcode - security Monitoring: good practices 9/25
  • 36. Maciej Lasyk, Ganglia & Nagios Last but not least... “Quis custodiet ipsos custodes?” (Who will guard the guards?) Monitoring: good practices 9/25
  • 37. Maciej Lasyk, Ganglia & Nagios Nagios recap Host / Services / Contacts - hosts, hostgroups 10/25
  • 38. Maciej Lasyk, Ganglia & Nagios Nagios recap Host / Services / Contacts - hosts, hostgroups - services, service groups 10/25
  • 39. Maciej Lasyk, Ganglia & Nagios Nagios recap Host / Services / Contacts - hosts, hostgroups - services, service groups - templates 10/25
  • 40. Maciej Lasyk, Ganglia & Nagios Nagios recap Host / Services / Contacts - hosts, hostgroups - services, service groups - templates - time periods 10/25
  • 41. Maciej Lasyk, Ganglia & Nagios Nagios recap Host / Services / Contacts - hosts, hostgroups - services, service groups - templates - time periods - host and services dependencies 10/25
  • 42. Nagios recap: Host / Services / Contacts - hosts, hostgroups - services, service groups - templates - time periods - host and service dependencies - regular expressions
  • 43. Nagios recap
  • 44. Nagios recap
  • 47. Nagios recap: Checks and states - frequencies & thresholds - scheduling downtimes - outages and flapping
  • 52. Nagios recap: Notifications - periods - groups - which states to be notified about? - escalations / rotations - custom notification methods
  • 53. Nagios recap: Monitoring remotes - NRPE daemons - checks via SSH
  • 54. Nagios recap: Web interface – tactical overview
  • 55. Nagios recap: Web interface – availability reports
  • 56. Nagios recap: Web interface – trends
  • 57. Nagios recap: Web interface – network maps
  • 58. Networking recap: Unicast
  • 59. Networking recap: Multicast
  • 60. Networking recap: Broadcast
  • 61. Ganglia – what is it? - built for problems of big scale: 20k hosts with a zillion metrics probed every 10 seconds - fully redundant (until you break it) - very scalable - regexp searches and ad hoc creation of views :)
  • 62. Ganglia – architecture
  • 63. Ganglia – architecture
  • 64. Ganglia – topologies: Default multicast topology
  • 65. Ganglia – topologies: Deaf / mute multicast topology
  • 66. Ganglia – topologies: Unicast topology
  • 67. Ganglia – topologies: Gmetad topology
  • 68. Ganglia – topologies: Gmetad HA topology (active - active)
  • 69. Ganglia – topologies: Gmetad hierarchical topology
  • 70. Ganglia – RRDcached
  • 71. Ganglia – sFlow
  • 72. Ganglia – web (grid view)
  • 73. Ganglia – web (cluster view)
  • 74. Ganglia – web (physical view)
  • 75. Ganglia – web (host view)
  • 76. Ganglia – web (compare hosts)
  • 77. Ganglia – web (events) Events have a JSON-based API. Think: integration with any app :)
  • 78. Ganglia – web (dashboards) - Create view -> apply as dashboard - Create dashboard from XML - Generate graphs and add to views
  • 79. Ganglia – web (graphs)
  • 87. Ganglia – metrics - base / extended metrics - own modules - c / c++ - mod_python - spoofing - gmetric - gmetric4j / java - Which to choose? gmetric / python / c/c++?
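To make the "own modules / mod_python" option concrete, here is a minimal sketch of a gmond Python metric module. It follows the usual module shape (`metric_init` returning a list of descriptor dictionaries, a per-metric callback, and `metric_cleanup`). The metric name `example_spool_files` and the `spool_dir` parameter are made up for illustration and are not from the talk.

```python
import os

# Directory whose entries are counted; overwritten in metric_init().
# "spool_dir" is a hypothetical module parameter, not a Ganglia built-in.
_spool_dir = "."


def spool_files(name):
    """Callback invoked by gmond with the metric name; returns the current value."""
    return len(os.listdir(_spool_dir))


def metric_init(params):
    """Called once when gmond loads the module; returns the metric descriptors."""
    global _spool_dir
    _spool_dir = params.get("spool_dir", ".")
    descriptor = {
        "name": "example_spool_files",   # hypothetical metric name
        "call_back": spool_files,
        "time_max": 90,
        "value_type": "uint",
        "units": "files",
        "slope": "both",
        "format": "%u",
        "description": "Number of entries in the spool directory (example)",
        "groups": "example",
    }
    return [descriptor]


def metric_cleanup():
    """Called when gmond shuts down; nothing to release in this sketch."""
    pass


if __name__ == "__main__":
    # Stand-alone debugging run, outside gmond.
    for d in metric_init({"spool_dir": "."}):
        print(d["name"], d["call_back"](d["name"]))
```

For a value produced by a one-off script or cron job, pushing it with the `gmetric` command is usually less work than a module. A module pays off when gmond should collect the value on its own schedule.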
  • 88. Ganglia and logfiles? ganglia-logtailer - https://bitbucket.org/maplebed/ganglia-logtailer - parses logfiles (in real time) - pushes data to Ganglia (via gmetric) - yup, it is based on specific log formats - yet it is still open source, so poke around ;)
  • 89. So... Nagios + Ganglia! 3 ways of integration: - ganglia-web/nagios (PHP & bash based) https://github.com/ganglia/ganglia-web - ganglia-nagios-bridge (Python & cron based) https://github.com/ganglia/ganglia-nagios-bridge - check-ganglia-metric (Python) https://github.com/ganglia/ganglia_contrib
  • 90. Nagios + Ganglia: ganglia-web/nagios https://github.com/ganglia/ganglia-web Sending Nagios data to Ganglia: service_perfdata_command. Or replace Nagios checks with Ganglia! - Check heartbeat. - Check a single metric on a specific host. - Check multiple metrics on a specific host. - Check multiple metrics across a regex-defined range of hosts.
  • 91. Nagios + Ganglia: ganglia-web/nagios Nagios pulls info from Ganglia via HTTP
  • 92. Nagios + Ganglia: ganglia-nagios-bridge - https://github.com/ganglia/ganglia-nagios-bridge - Python script run from e.g. crontab - pulls data from Ganglia XML via sockets - parses the XML - sends data to Nagios - Nagios only processes passive checks
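The pull / parse / send flow from this slide can be sketched roughly as below. This is not how ganglia-nagios-bridge itself is implemented. It only illustrates the idea by turning Ganglia XML into Nagios passive check results in the external command format (`PROCESS_SERVICE_CHECK_RESULT`). The function name, the "lower is worse" threshold logic, and the use of the metric name as the Nagios service description are all assumptions made for the example.

```python
import time
import xml.etree.ElementTree as ET


def passive_results(xml_text, metric, warn_below, crit_below, now=None):
    """Build one Nagios passive-check command line per host reporting `metric`.

    Assumes "lower is worse" thresholds (e.g. cpu_idle) and that the Nagios
    service description equals the Ganglia metric name (illustrative choice).
    """
    now = int(time.time()) if now is None else int(now)
    lines = []
    root = ET.fromstring(xml_text)
    for host in root.iter("HOST"):
        for m in host.iter("METRIC"):
            if m.get("NAME") != metric:
                continue
            value = float(m.get("VAL"))
            if value < crit_below:
                code = 2  # CRITICAL
            elif value < warn_below:
                code = 1  # WARNING
            else:
                code = 0  # OK
            lines.append(
                "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s=%s"
                % (now, host.get("NAME"), metric, code, metric, m.get("VAL"))
            )
    return lines


if __name__ == "__main__":
    # Hand-written sample in the shape of gmetad output (heavily trimmed).
    sample = (
        '<GANGLIA_XML VERSION="3.6.0" SOURCE="gmetad"><GRID NAME="g">'
        '<CLUSTER NAME="c"><HOST NAME="web1">'
        '<METRIC NAME="cpu_idle" VAL="95.0"/></HOST></CLUSTER></GRID>'
        "</GANGLIA_XML>"
    )
    for line in passive_results(sample, "cpu_idle", warn_below=20, crit_below=5):
        print(line)
```

In a cron-driven setup, the resulting lines would be appended to the Nagios external command file, whose path depends on the local Nagios configuration.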
  • 93. Nagios + Ganglia: check_ganglia_metric - https://pypi.python.org/pypi/check_ganglia_metric/ - basically a Nagios plugin - pulls data from Ganglia XML via sockets - check_ganglia_metric.py --gmetad_host=gmetad-server.example.com --metric_host=host.example.com --metric_name=cpu_idle
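What "a Nagios plugin that pulls Ganglia XML via sockets" boils down to can be sketched as follows. This is an illustration, not the code of check_ganglia_metric. It assumes gmetad serves its XML dump on the default `xml_port` 8651, and the function names and the "lower is worse" thresholds are invented for the example. The only Nagios contract relied on is the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import socket
import xml.etree.ElementTree as ET

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def read_ganglia_xml(host, port=8651, timeout=10.0):
    """Read the full XML dump that gmetad writes to a client on connect."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmetad closes the connection after the dump
                break
            chunks.append(data)
    return b"".join(chunks)


def check_metric(xml_data, metric_host, metric_name, warn_below, crit_below):
    """Return (exit_code, message) for one metric on one host.

    Thresholds are "lower is worse", which suits metrics such as cpu_idle.
    """
    root = ET.fromstring(xml_data)
    for host in root.iter("HOST"):
        if host.get("NAME") != metric_host:
            continue
        for metric in host.iter("METRIC"):
            if metric.get("NAME") != metric_name:
                continue
            raw = metric.get("VAL")
            value = float(raw)
            if value < crit_below:
                return CRITICAL, "CRITICAL: %s = %s" % (metric_name, raw)
            if value < warn_below:
                return WARNING, "WARNING: %s = %s" % (metric_name, raw)
            return OK, "OK: %s = %s" % (metric_name, raw)
    return UNKNOWN, "UNKNOWN: %s not found for %s" % (metric_name, metric_host)
```

A real plugin would call `read_ganglia_xml("gmetad-server.example.com")`, pass the result to `check_metric`, print the message, and exit with the returned code so that Nagios can map it to a service state.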
  • 95. Nagios + Ganglia: Which integration should I use? Seriously – try them yourself and test
  • 96. Freenode #ganglia https://lists.sourceforge.net/lists/listinfo/ganglia-general
  • 97. sources? - “Monitoring with Ganglia” book - also nagios.org - and “Web Operations” book - plus some experience ;)
  • 98. Ganglia & Nagios Thank you :) Maciej Lasyk, 11. Sesja Linuksowa, 2014-04-06, Wrocław http://maciek.lasyk.info/sysop maciek@lasyk.info @docent-net