Secure-24 uses Zenoss as its primary monitoring tool to monitor over 9,000 devices and 1.7 million data points. It monitors key components of the Zenoss infrastructure like Zenoss daemons, RabbitMQ queues, and the event processing system to ensure proper functioning. Checks include looking for process/heartbeat issues, queue lengths in RabbitMQ, and testing event opening, processing, and closing to verify the full event flow. Remote monitoring is emphasized in case the primary Zenoss system goes down.
In this breakout session we will discuss how to monitor some important aspects of Zenoss itself.
Who am I?
- Maintaining the health of the monitoring infrastructure
- Developing ZenPacks that:
  - Add functionality to Zenoss
  - Build out new monitoring for new devices and applications
  - Extend the API to work with our other internal systems
What I hope you take from this session is how to monitor your Zenoss instance, and why monitoring it matters, so that you are the first to know about issues. Or at least I hope it gets you thinking about monitoring your instance.
Before we get into that, I would like to give you some background on Secure-24, our environment, and what we use Zenoss for.
www.secure-24.com
Secure-24 has 15 years of experience delivering managed IT operations, application hosting and cloud services.
We manage:
- Oracle E-Business Suite
- PeopleSoft
- JD Edwards
- Hyperion
- SAP
- other critical applications
Devices we use in Zenoss
- Cisco UCS and HP Proliant
- Networking
- Cisco, Juniper, F5, Riverbed and Radware Networking devices
- EMC and NetApp
- Windows and Linux
- SAP, Hyperion, Oracle, MySQL, PeopleSoft, Progress DB
- Microsoft Applications:
- Exchange, SQL Server, SharePoint, Lync
- Along with Citrix and VMware View
We used to have a variety of monitoring tools.
Different teams would have different applications
SolarWinds for networking
Nagios for Linux and Windows servers
Tidal/OEM for applications
We have moved away from these other monitoring platforms
Focusing on using Zenoss as our primary tool
Taking anything we liked from our older tools and adding that functionality into Zenoss
Zenoss daemon monitoring is set up out of the box in two different forms:
Heartbeats
Process monitoring
Heartbeats are sent out from the daemon to zenhub and then passed on to zeneventserver.
If heartbeats stop coming in, then a /Status/Heartbeat event will be created
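To make the heartbeat idea concrete, here is a minimal sketch of the kind of staleness test a heartbeat monitor performs. It is not Zenoss's own implementation; the function name, the dictionaries, and the timeout values are all assumptions made for illustration.

```python
import time


def stale_heartbeats(last_seen, timeouts, now=None):
    """Return the daemons whose last heartbeat is older than its timeout.

    last_seen: dict of daemon name -> epoch seconds of the last heartbeat
    timeouts:  dict of daemon name -> allowed silence in seconds
    (Both structures are hypothetical, not a Zenoss data model.)
    """
    now = time.time() if now is None else now
    return sorted(
        name for name, seen in last_seen.items()
        if now - seen > timeouts[name]
    )
```

Each name returned here corresponds to a daemon that would get a /Status/Heartbeat event in the flow described above.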
Monitoring daemons as processes works like any other process monitoring.
You can monitor CPU, memory, count and up/down status.
I find this very useful when deploying new ZenPacks. For example, when I deploy a new ZenPack that has a new zenpython datasource I like to watch the memory usage of the daemon over time to make sure it does not have a memory leak or other performance issues.
You will generally get a process down event before a heartbeat event.
MySQL and RabbitMQ are both critical applications. Monitoring them for at least up/down is a must.
If either of these goes down, you will not get any events.
Think about setting up an external monitoring server.
Here is where we start talking about setting up an external monitoring instance. This can be another Zenoss instance, another monitoring product or a simple server running scheduled scripts.
We migrated from Nagios to Zenoss, and since we already had Nagios servers up and running and integrated with Service-Now, we just used those to perform our external monitoring.
We started with some basic HTTP checks. They were designed to perform some simple monitoring of the two things we cared about most at the time:
Web interface (Users being able to login and use Zenoss)
Events (Users being able to view events in the Event Console)
The first check is a simple HTTP check against the Dashboard page that verifies a string on the page. This allows us to monitor zenwebserver (nginx/zope) and LDAP authentication.
The second was an HTTP check directly against zeneventserver, using the API to get a list of events. This allowed us to monitor zeneventserver and make sure it was accepting connections.
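A string-matching HTTP check like the first one can be sketched in a few lines of Python. This is an illustration only, not the Nagios check we actually ran: the function name and the `opener` parameter are assumptions, and the login step needed to exercise LDAP authentication is left out.

```python
import urllib.request


def page_contains(url, expected, timeout=10, opener=urllib.request.urlopen):
    """Return True if url answers with HTTP 200 and the body contains expected.

    opener is injectable so the check can be exercised without a live server.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return resp.status == 200 and expected in body
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False
```

A scheduler on the external server would call this with the Dashboard URL and a string that only appears after a successful page render, and alert when it returns False.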
This was a good start, but not ideal.
It is very important to monitor the RabbitMQ queues. If something happens to RabbitMQ, event processing will not work.
We wanted to start monitoring for the symptoms of the issues we were having, and one of the common symptoms was RabbitMQ queues backing up.
Zenoss has 3 major queues it uses to process events:
Events come in from the collectors and are placed into the rawevents queue by zenhub
Zeneventd then processes the messages in rawevents, applies any transforms and then places the event into zenevents queue
Zeneventserver processes the messages in zenevents and runs them through any triggers. If an event matches a trigger the event will be placed into the signal queue
Zenactiond then processes any messages in signal using the proper Notification method
So, you can have several different kinds of issues, but one of the symptoms of each is backed-up RabbitMQ queues.
With each of these issues you will see a backup of messages in RabbitMQ:
Deadlocks in zends
zenactiond daemon is down or overwhelmed
New (poor performing) transform added to environment
We use a script to connect to RabbitMQ and pull the current message count in these 3 queues, and if any counts are higher than set thresholds it will alert.
And we do it remotely.
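As a rough illustration of a remote queue-depth check (not the script from this talk), the sketch below reads message counts from the RabbitMQ management HTTP API and compares them to thresholds. It assumes the management plugin is enabled; the full queue names, threshold values, base URL, vhost and credentials are all assumptions to adjust for your installation.

```python
import base64
import json
import urllib.parse
import urllib.request

# Assumed queue names and placeholder thresholds; verify against your instance.
THRESHOLDS = {
    "zenoss.queues.zep.rawevents": 1000,
    "zenoss.queues.zep.zenevents": 1000,
    "zenoss.queues.zep.signal": 1000,
}


def queue_depth(base_url, vhost, queue, user, password, timeout=10):
    """Fetch the current message count for one queue from the management API."""
    url = "{}/api/queues/{}/{}".format(
        base_url.rstrip("/"),
        urllib.parse.quote(vhost, safe=""),   # "/zenoss" becomes "%2Fzenoss"
        urllib.parse.quote(queue, safe=""),
    )
    token = base64.b64encode("{}:{}".format(user, password).encode()).decode()
    req = urllib.request.Request(url, headers={"Authorization": "Basic " + token})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["messages"]


def over_threshold(depths, thresholds):
    """Return {queue: depth} for every queue whose depth exceeds its threshold."""
    return {q: n for q, n in depths.items() if n > thresholds[q]}
```

Keeping the threshold comparison separate from the network call makes it easy to run the same logic from a Nagios plugin, a cron job, or a second Zenoss instance.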
But this wasn’t enough. We wanted to know as soon as zeneventserver was having an issue, which led us to create some synthetic checks.
We took that a step further and created a new check that would more closely follow the event flow process.
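The shape of such a synthetic check can be sketched as: open a uniquely tagged event, wait for it to show up after processing, then close it. The sketch below is a simplified illustration rather than the published script. The three callables (`send_event`, `find_event`, `close_event`) are hypothetical hooks that you would implement against your own Zenoss API calls.

```python
import time
import uuid


def synthetic_event_check(send_event, find_event, close_event,
                          timeout=60, interval=5,
                          sleep=time.sleep, clock=time.monotonic):
    """Open a marker event, wait for it to be processed, then close it.

    send_event(marker)  -> submits an event carrying the marker string
    find_event(marker)  -> returns an event id once visible, else None
    close_event(evid)   -> closes the event
    Returns (ok, message).
    """
    marker = "synthetic-check-" + uuid.uuid4().hex
    send_event(marker)
    deadline = clock() + timeout
    while True:
        event_id = find_event(marker)
        if event_id is not None:
            close_event(event_id)
            return True, "event {} processed and closed".format(event_id)
        if clock() >= deadline:
            return False, "event {} not seen within {}s".format(marker, timeout)
        sleep(interval)
```

Because the marker is unique per run, a failed run never matches an event left over from an earlier one.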
When I was here last year at GalaxZ15, I really liked the technical discussions, the takeaways from those discussions, and the talks outside of the breakouts.
So, when Zenoss asked me to speak this year, I wanted to make a point of giving something to the community that they could use and take back with them.
The script that I mentioned today that performs the Rabbit queue monitoring and synthetic event checks can be found on the community wiki and on GitHub.
My hope is that people will find this information and script useful in some way. If they do not use it outright to monitor their Zenoss instance, then at least it can give them ideas on how to monitor their instance in a proactive way.
I am open to any feedback you have about this session and the script. Feel free to post comments on the wiki or bugs/feature requests on GitHub. And I believe you can leave feedback on this session in the GalaxZ app. I would love to hear your thoughts on both topics.