Nagios XI Best Practices

By Troy Lea
tlea@nagios.com

About Me
•Tech Support Contractor for Nagios Enterprises
•Based in Australia
•Typically cover UTC+10 from 9am to 5pm
•Nagios & XI Dev (Box293)
•Nagios MVP3

What's Covered In This Talk
•Getting the most from Nagios XI
•Time saving information
•Configuration practices
•Object definitions
•Backend setup
•Performance enhancements

Nagios XI License Entitlements
•XI license entitles 3 instances:
•Production
•Test & Dev (T&D)
•Disaster Recovery (DR)
•License activation is tied to the IP address of each XI host

What's Monitoring Nagios XI?
•How would you know your XI server died?
•“Nagios XI Server” Monitoring Wizard
•DR instance monitors production instance
•Production instance is UP & HEALTHY
•Production instance monitors DR instance
•DR instance is UP & HEALTHY

localhost services
•Do you know how your XI server is performing?
•Basic local services are included in XI base
•You should ideally be monitoring:
•Service Status (check_init_service)
•crond, httpd, mysql, ndo2db, npcd, ntpd, postgresql, snmptrapd, snmptt

localhost services
•File Counts (check_file_count)
•NPCD Perfdata spool directory
•xidpe spool directory
•Check results folder
•snmptt spool folder
•nagios user account has not expired
•(check_pass_expire.pl)
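
A file-count check on a spool directory amounts to counting the files and mapping that count to a plugin exit code. The sketch below is a hypothetical illustration and not the check_file_count plugin itself: the function names and thresholds are assumptions, and only the exit codes (0 OK, 1 WARNING, 2 CRITICAL) follow the usual Nagios plugin convention.

```python
import os

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def count_files(directory):
    """Count regular files directly inside `directory` (no recursion)."""
    with os.scandir(directory) as entries:
        return sum(1 for entry in entries if entry.is_file())


def evaluate(count, warn, crit):
    """Map a file count to (exit_code, status_line).

    `warn` and `crit` are hypothetical thresholds chosen by the caller.
    The text after the pipe is performance data.
    """
    perfdata = f"files={count};{warn};{crit}"
    if count >= crit:
        return CRITICAL, f"CRITICAL - {count} files | {perfdata}"
    if count >= warn:
        return WARNING, f"WARNING - {count} files | {perfdata}"
    return OK, f"OK - {count} files | {perfdata}"
```

A spool directory whose count keeps climbing suggests that whatever should be draining it has stalled, which is why these directories are worth watching.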

localhost services
•root mailbox size
•(box293_check_mbox)
•MySQL / MariaDB
•Database tables crashed?
•(box293_check_mysql_table_status)
•Date/Time correct?
•(box293_check_mysql_date)

localhost services
•Overall Load (check_load)
•Memory Free – Physical (check_memory)
•Swap Usage (check_swap)
•Disk Free (check_disk)

Date and Timezone!
•Configure Timezone
•Admin > Manage System Config
•Sync with trusted time source
•VM? Don’t sync with hypervisor!
•Can be the source of confusing problems

CPU
•CPU Cores vs Speed!
•Not everything is multi-threaded
•3.4 GHz vs 2.2 GHz
•Number of cores is still important
•Refer to XI hardware requirements

Memory
•Enough memory to cope in a major outage
•Event handlers consume memory quickly (gigabytes within minutes during a major outage)
•Have at least 50% more memory than needed
•Refer to XI hardware requirements

RAM Disk
•Lots of little files created/deleted/updated
•Using a RAM Disk:
•Reduces disk I/O & load
•Speeds up processing of performance data
•Speeds up processing of spooled check results
•Speeds up nagios restarts
•Refer to official procedure

Solid State Disk (SSD)
•Greatly improves overall performance
•Complements RAM Disk
•Helps read/writes with:
•Logs
•Database
•Performance Graphs
•Reports

SSD vs RAID ?
•SSD beats* a spinning disk RAID set
•*Depends on how much money you have
•Still need to RAID1 SSD for redundancy!
•SSD may not give you the required capacity
•3.8TB SAS SSD now available!!!

rrdcached
•Enabling rrdcached accumulates the spooled performance data and, after a set amount of time, writes it into the backend RRD files
•Reduces Disk I/O
•Can be a delay in data appearing in graphs

Offloaded MySQL / MariaDB
•Data constantly written to databases
•Historical and Configuration
•Offload to separate server to reduce load
•Don't forget to monitor offloaded server!!!
•Disk/CPU/Memory/Tables/Service
•Refer to earlier slides

Mod-Gearman
•Used for offloading plugins to workers
•Plugins need to be installed on all workers
•Be aware of plugins that use /tmp files!
•XI 2014 onwards uses Core 4
•Core 4 has its own workers (only local workers)
•nagios.cfg “check_workers” option

Disaster Recovery
•Failover and High Availability Solutions for Nagios XI
•Andy Brist - NWC2014 – Failover & HA
•What is really important in disaster?
•Plan and test

Backups!!!
•Admin > System Backups
•Schedule backups of XI
•Location can be local, FTP, SSH
•Remote location recommended
•Manual Backups
•Local Backup Archives via Admin menu
•/usr/local/nagiosxi/scripts/backup_xi.sh

Restoring Backups
•Official Backup and Restore procedure
•Brings system back online with ease
•Great for migrating from old XI to new XI
•Also good for:
•DR
•Test & Dev

Intervals - Host vs Services
•Host down HARD = service notifications suppressed
•What happens when host and services use the same check intervals?
•Unnecessary notifications get sent :(
•Make the host go down HARD quicker than its services!
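
The race between host and service can be estimated with a little arithmetic. This is a minimal sketch under a simplified model: the first failed check counts as attempt 1, and each further attempt comes one retry interval later until max_check_attempts is reached. The function names and the sample settings are hypothetical.

```python
def minutes_to_hard(max_check_attempts, retry_interval):
    """Minutes from the first failed check until the state turns HARD.

    Simplified model: the first failure is attempt 1, and each further
    attempt happens one retry_interval (in minutes) later.
    """
    if max_check_attempts < 1:
        raise ValueError("max_check_attempts must be at least 1")
    return (max_check_attempts - 1) * retry_interval


def host_goes_hard_first(host, service):
    """Each argument is a (max_check_attempts, retry_interval) tuple."""
    return minutes_to_hard(*host) < minutes_to_hard(*service)


# Hypothetical settings: identical values give the host no head start.
print(host_goes_hard_first((5, 1), (5, 1)))  # False
# Fewer attempts on the host lets it reach HARD two minutes earlier.
print(host_goes_hard_first((3, 1), (5, 1)))  # True
```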

Service Dependencies
•When a master service goes down:
•Prevents notifications from being sent
•Prevents service checks from execution
•Make master service go down HARD quicker than dependent services!
•Otherwise dependencies are pointless
•Master service, e.g. Ping or NRPE Version

Disable Service Checks ?
•host_down_disable_service_checks
•Nagios Core 4.1.x feature (XI 5)
•System wide setting
•Reduces load on XI host
•Think of it as automatic service dependencies on their own hosts
•Service dependencies ignored if host is down

Check Intervals - Be Realistic
•Does it need to be checked every 5 minutes?
•Disk Free Space – every 60 minutes perhaps?
•Too long = no performance data
•Different intervals to spread the load
•3, 5, 7 minute intervals
•58, 60, 62 minute intervals
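
Why staggered intervals spread the load can be seen with a small calculation. The sketch below assumes, purely for illustration, that all checks start together at minute 0; real Nagios scheduling also staggers the initial start times, so treat the numbers as a rough model. The function names are hypothetical.

```python
from functools import reduce
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def minutes_until_all_coincide(intervals):
    """Minutes until every check falls due in the same minute again."""
    return reduce(lcm, intervals)


def peak_checks_per_minute(intervals, horizon):
    """Largest number of checks due in one minute within `horizon` minutes."""
    return max(
        sum(1 for interval in intervals if minute % interval == 0)
        for minute in range(1, horizon + 1)
    )


print(minutes_until_all_coincide([3, 5, 7]))   # 105
print(peak_checks_per_minute([5, 5, 5], 60))   # 3
print(peak_checks_per_minute([3, 5, 7], 60))   # 2
```

With identical 5 minute intervals all three checks collide every 5 minutes, while the 3, 5 and 7 minute intervals only line up once every 105 minutes.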

Notification & Check Intervals
•Nagios decides whether it is allowed to send a notification each time a service check returns a HARD state
•e.g. 15 minute check interval and 60 minute notification interval
•Internal scheduling may cause only 14min 55sec to pass between checks: 4 x 14:55 = 59min 40sec … it’s < 60min!
•Notification not sent until the next check, at about 75min!
•Scheduling is skewed slightly (+/-) to reduce load!
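
The 75 minute surprise can be reproduced with a short calculation. This is a minimal sketch assuming a re-notification can only go out when a check result arrives, and only once the full notification interval has elapsed since the previous notification. The function names are hypothetical; the 895 second value is the 14min 55sec from the example.

```python
def first_renotification_seconds(check_interval_s, notification_interval_s):
    """Time of the first check at or after the notification interval elapses."""
    # Ceiling division: how many checks until the interval has fully passed.
    checks_needed = -(-notification_interval_s // check_interval_s)
    return checks_needed * check_interval_s


def as_min_sec(seconds):
    """Split a number of seconds into (minutes, seconds)."""
    return divmod(seconds, 60)


# Checks drifting to 14min 55sec: the 4th check is 20 seconds too early.
print(as_min_sec(first_renotification_seconds(895, 3600)))  # (74, 35)
# Checks at exactly 15 minutes: the 4th check lands on the 60 minute mark.
print(as_min_sec(first_renotification_seconds(900, 3600)))  # (60, 0)
```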

Use Hostgroups!
•Assign ONE service to a hostgroup of common servers
•Windows Servers
•Linux Servers
•Consistent monitoring, standards enforced!
•Directive changes - all hosts get updated
•Reduces management overhead

Use Contact Groups!
•Use contact groups in all definitions
•Makes it easy when staff join/leave
•Just add/remove the contact from groups
•Reduces administrative overhead
•Enforces your company policy
•Similar principle to host groups

Configuration Wizards
•Pros
•Great for getting up and running quickly
•No need to learn how a plugin works
•Cons
•Creates individual services
•More work later when enforcing “standards”

Templates
•Common settings applied to objects
•Helps enforce standards
•Reduces administrative overhead
•Layer multiple templates
•Can be additive or ignore inheritance
•XI Config Wizard objects use templates
•Example of common icmp check

User Macros – resource.cfg
•$USERx$ macros are good for common items like a username or password
•Allows passwords containing a ! (exclamation mark)
•Values not visible in object definitions
•$USER1$
•/usr/local/nagios/libexec

Custom Object Variables
•Allows you to create your own variables
•Can be defined in host or service objects
•e.g. hosts have their own check_nt password
•Define _CHECK_NT_PASSWORD in host object
•In command definitions reference it as:
•$_HOSTCHECK_NT_PASSWORD$
•VERY POWERFUL!
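
The naming rule behind that macro can be shown in a few lines. This sketch only illustrates how the macro name is derived from the variable name and the object type it is defined on; the helper function is hypothetical and is not part of Nagios.

```python
def custom_macro(object_type, variable_name):
    """Build the macro name Nagios exposes for a custom object variable.

    The leading underscore is dropped, the name is upper-cased, and a
    prefix for the object type is added in front.
    """
    prefixes = {"host": "_HOST", "service": "_SERVICE", "contact": "_CONTACT"}
    if not variable_name.startswith("_"):
        raise ValueError("custom variable names must start with an underscore")
    return f"${prefixes[object_type]}{variable_name[1:].upper()}$"


print(custom_macro("host", "_CHECK_NT_PASSWORD"))     # $_HOSTCHECK_NT_PASSWORD$
print(custom_macro("service", "_CHECK_NT_PASSWORD"))  # $_SERVICECHECK_NT_PASSWORD$
```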

MRTG Clean Configs
•Your MRTG configs may be collecting more than you think
•/etc/mrtg/conf.d/*.cfg files
•Created by Network Switch / Router Wizard
•Comment out unused ports
•About 37 lines per port
•Comment out unused non-interfaces (VLANs)

Plugins – Compiled vs Scripts
•Compiled runs quicker
•Official nagios-plugins are compiled
•“Custom modifications” require re-compiling
•Scripts run slower, consume more resources
•Perl plugins known to consume +CPU +RAM
•“nice” can reduce impact of plugins
•Check Profiler component by box293

Backend API - Read Only User
•API provides you with URLs for use in third party products without needing user/pass
•Requires a user account to be created
•Account should be READ ONLY

Performance Data Tool
•Component developed by box293
•Allows you to manipulate RRD files
•Great for merging RRD data
•Can also delete old RRD files for old services
•View raw data in tables
•Find it in the Nagios Exchange

Thank you!
What Is Your Best Practice?
Any Questions?

end
done
fi esac
)
}
;
od
until
.
