For the last two years, David Schroeder, Software Engineer at Viasat, Inc., has supported a single Sensu cluster shared by multiple teams, each with its own requirements, thresholds, and contacts. How does it all work, and how can these different uses coexist?
This talk from Sensu Summit 2018 describes how Ansible is used to configure and deploy Sensu for multiple teams, how much autonomy is granted to each one, and where the bottlenecks are.
Team Requirements
› Environment segregation
– Access limits
– Contacts
› Different deployment strategies
› Different thresholds
– Both keepalive and other checks
› Different checks, different platforms (even Windows)
› API calls
– Creating silences
– Gathering check results
"Can Sensu do #{this_thing}?"
Team Requirements
› Sensu Enterprise RBAC!
› Contact routing!
› Check parameter tokenization!
› API tokens!
› Custom configuration anywhere and everywhere!
"Sensu can do #{this_thing}!"
Ansible Structure: Per Environment
› sensu/
– group_vars/
▪ infrastructure_pdx_dev/
– main.yml
– vault.yml
---
### Environment Definitions ###########################################
host_subscriptions:
- "basic"
- "framework"
- "framework_pdx_dev"
host_environment: "framework_pdx_dev"
host_contact: "framework"
# Keepalive thresholds: number of seconds before warning or alerting
keepalive_warn: 150
keepalive_crit: 210
# Set re-notification time (in seconds) for keepalive alarms. Default is 300.
keepalive_refresh: 3600
# To add a subscription based on server role as included in the hostname,
# include the subscription name as the key, and hostname pattern as the
# value. Be sure to escape out backslashes.
role_patterns:
  framework_zeromq: "-mq00d"
  framework_utility: "^utly"
# Enable Sensu client socket commands
enable_client_socket: true
# Custom client-side configuration
custom_client_configs:
  checks:
    check_ram:
      warning: 101
      critical: 100
### Communicating with Sensu ##########################################
# Hostname or IP address of the graphite API server for graph rendering
graphite_server: "172.16.20.100"
rabbitmq_params:
  port: 5671
  user: "sensu"
  pass: "{{ vault_rabbitmq['password'] }}"
  host1: "172.16.20.101"
  host1_cert: "{{ vault_rabbitmq['host1_cert'] }}"
  host1_key: "{{ vault_rabbitmq['host1_key'] }}"
  host2: "172.16.20.102"
  host2_cert: "{{ vault_rabbitmq['host2_cert'] }}"
  host2_key: "{{ vault_rabbitmq['host2_key'] }}"
  host3: "172.16.20.103"
  host3_cert: "{{ vault_rabbitmq['host3_cert'] }}"
  host3_key: "{{ vault_rabbitmq['host3_key'] }}"
Ongoing Challenges
01 API calls: Limited availability in RBAC
02 Dashboard: Missing hosts in Events list
03 Cleanup: Old checks, forgotten hosts
04 Bottlenecks
Ongoing Challenges
01 API calls: Limited availability in LDAP RBAC
› Works through RBAC, but without subscription limitations:
– /clients
– /clients/:client/history (deprecated)
– /events (returns all events)
– /silenced (POST ignores 'begin' field)
› Does not work at all through the RBAC layer:
– /results
– /events/:client/
– /silenced/subscriptions/:subscription
– /silenced/checks/:check
– ?filter
› Good news: support in Sensu 2.0!
Ongoing Challenges
02 Dashboard: Missing hosts in Events list
› If a host matches a subscription in RBAC, but the alerting check does not, it is not visible on the Events page
Hi, David Schroeder, Viasat. Last year, I talked about migrating my team from a Nagios-based monitoring solution over to Sensu. And in that talk I touched on an unexpected side-effect of the migration...
Popularity. Other teams saw what we were doing, found it satisfactory, and wanted in on the action. My Ansible playbook for Sensu was only geared toward the one team, so I needed to expand it, make sure it worked for everybody, and there were many lessons learned along the way.
The different teams would of course need to have their own environments, reasonably separated from the other teams, with their own logins and contact profiles.
They may have different ways of deploying playbooks on their servers. That's OK, gotta work with that.
They may want keepalive alarms to re-notify every hour, or every 10 minutes. They may want the standard memory check on their systems to never warn and go critical at 99%. Others may want different thresholds.
They're of course going to have their own unique checks, and they may have different platforms to support, multiple Linux distros, even Windows. Yeah, Windows.
And, finally, they're gonna want to be able to access server states and create silence automatically, using API calls, so we need to be able to support that, too.
Fortunately, the answer to all these requests is "yes." Yes, Sensu Enterprise has built-in RBAC to limit access to certain environments per team.
Yes, we can use contact routing to support different alert recipients per team.
Yes, we can tokenize parameters to checks, and we can have them differ per environment, if that's what they need.
We can use API tokens to expose some functionality to other teams.
And Sensu's flexible approach to configuration parameters means I can inject configs anywhere, and pick them up where needed. This is a huge boon to customization, and one of the things that makes this platform so versatile.
But I'm not here to talk about these features and capabilities, I covered a lot of that in my talk last year. I want to dig into actual implementation in Ansible.
Here's how it works.
There are seven separate roles. The ones that get the most action are sensu-client and sensu-server. sensu-client is what another team would run to add new clients to a Sensu monitoring cluster, or make changes to existing clients.
Sometimes, that's all they need. There are a bunch of default checks and thresholds that cover all the usual monitoring basics. But other times, they might need to propose changes to this role, perhaps a task to lay down a configuration file used by a check, or to install a new check script. I'll talk about that process later.
sensu-server's job is to properly configure the server cluster. If there's a new check to be defined, here's where it's done. The Sensu servers also handle standalone checks like alerting on aggregates, or performing ping checks. Mostly, though, it's pub/sub.
If there are contact changes, new team, different e-mail address, new chatroom, this role handles those, too. Most of the handlers we use are custom, so they're stored and applied here, too. Some of the other roles:
sensu-enterprise is all about installing and configuring Sensu Enterprise. Outside of building a new cluster, this role is used for applying RBAC-related changes to the dashboard config and adding new API tokens.
rabbitmq-server and redis-server are only used when building a new cluster... The RabbitMQ role has the extra step of downloading the auto-generated SSL certs, which are vaulted and given to the clients.
sensu-standalone is a way to build a single server running the community edition of Sensu, it's actually the community sensu-ansible role included as a submodule, and the standalone playbook is helpful for development and testing, since the sensu-client and sensu-server roles can work over the top of it... though, I have to say, sensu-ansible has matured significantly since I last tested that out, two years ago, and I'm not sure if it still works here.
sensu-winclient is a separate beast. It's barely used, and has a sort of backwards way of deploying Windows clients.
sensu-client has actually been broken out as a Galaxy-style role, included among the rest as a sort of submodule. Some teams only use Galaxy-style roles for deployment, so they needed this available separately.
We follow Ansible's best practices in terms of directory structure, with subdirectories for group_vars/ and roles/. When a team wants to come on board, or wants to add a group of servers, which we call "environments", each logical division gets its own subdirectory in group_vars.
This example up here on the screen shows a fictional framework team with a Portland datacenter, and separate environments for dev, staging, and production. If it makes sense to have different check parameters or thresholds or contact profiles, they get their own environment, and therefore their own subdirectory under group_vars. There are a lot of these. Dozens.
Typically, if a team adds an environment, or another team comes on board, they just copy an existing directory to use as a template.
The Sensu clusters themselves get their own directories inside of group_vars/. These behave very much like the other environments, but of course have some extra stuff to configure Sensu itself.
I wanted to give you some examples of how each environment may be constructed. Inside each subdirectory under group_vars, there are two files, main and vault. main has all the parameters, vault stores the sensitive ones, like passwords and RabbitMQ certificates.
Looking at main.yml here, first, if there are any subscriptions that should be applied to all servers in this environment, those are added here. The environment name is specified, this is used for several things, like building aggregates or subdividing the metric scheme for graphite graphs.
The contact name is important, basically the name of the team which is responsible for these servers.
Separate keepalive thresholds can be set, warning and critical, and it's nice to vary the re-notify time as well. This example here is a dev environment, you don't need these servers to re-notify quite as often as, say, a production environment.
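As a rough illustration of what those three numbers mean, here is a minimal Python sketch, not Sensu's own code. It assumes a keepalive goes to warning or critical once the time since the last keepalive reaches the matching threshold, and that an open alarm re-notifies once `keepalive_refresh` seconds have passed. The function names are made up for this example; the values come from the main.yml shown on the slide.

```python
# Values taken from the example main.yml for the dev environment.
KEEPALIVE_WARN = 150      # seconds without a keepalive before warning
KEEPALIVE_CRIT = 210      # seconds without a keepalive before critical
KEEPALIVE_REFRESH = 3600  # seconds between re-notifications


def keepalive_status(seconds_since_last, warn=KEEPALIVE_WARN, crit=KEEPALIVE_CRIT):
    """Classify a client by how long ago its last keepalive arrived (assumed semantics)."""
    if seconds_since_last >= crit:
        return "critical"
    if seconds_since_last >= warn:
        return "warning"
    return "ok"


def should_renotify(seconds_since_last_notification, refresh=KEEPALIVE_REFRESH):
    """Hypothetical helper: re-send an open alarm once the refresh interval has elapsed."""
    return seconds_since_last_notification >= refresh
```

With these values a dev host that goes quiet warns after two and a half minutes, goes critical after three and a half, and then only nags once an hour.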
Digging deeper into main.yml, I've found that assigning different subscriptions based off of the hostname has been helpful, using pattern matching to add clients in this environment to one subscription or another. Like if your hostname has 'db' in the name, you may want to add a subscription to certain database checks.
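The hostname matching can be sketched in a few lines of Python. This is only an approximation of what the playbook does: it assumes each `role_patterns` value is treated as a regular expression searched anywhere in the hostname, and the function name and sample hostnames are invented for illustration. The subscriptions and patterns are the ones from the slide.

```python
import re

# From the example main.yml: subscriptions every host in the environment gets,
# plus subscription-name -> hostname-pattern pairs.
HOST_SUBSCRIPTIONS = ["basic", "framework", "framework_pdx_dev"]
ROLE_PATTERNS = {
    "framework_zeromq": "-mq00d",
    "framework_utility": "^utly",
}


def subscriptions_for(hostname, base=HOST_SUBSCRIPTIONS, role_patterns=ROLE_PATTERNS):
    """Return the base subscriptions plus any whose pattern matches the hostname."""
    subs = list(base)
    for name, pattern in role_patterns.items():
        if re.search(pattern, hostname) and name not in subs:
            subs.append(name)
    return subs
```

For example, a hypothetical host named `utly01` would pick up `framework_utility` on top of the three environment-wide subscriptions, while `web01` would get only the base list.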
Socket commands can be enabled if a team needs them. By default this is not open, and hardly anybody I support uses them, but the option is here.
Also in this file, you can set custom check parameters. So here in this example, warning is effectively disabled, and it'll only go critical when the memory percent is maxed out. There are defaults for the memory check, and this overrides them.
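To see why a warning threshold of 101 disables the warning, here is a small Python sketch of the override idea. The default thresholds below are placeholders (the talk does not state the real defaults), and the function names are hypothetical; only the `check_ram` override values come from the slide.

```python
# Hypothetical defaults for the memory check; the real defaults are not given in the talk.
DEFAULT_CHECK_PARAMS = {"check_ram": {"warning": 85, "critical": 95}}

# Per-environment override, as in the example main.yml.
CUSTOM_CLIENT_CONFIGS = {"checks": {"check_ram": {"warning": 101, "critical": 100}}}


def effective_params(check, custom=None):
    """Merge environment overrides over the defaults for one check."""
    overrides = (custom or {}).get("checks", {}).get(check, {})
    return {**DEFAULT_CHECK_PARAMS.get(check, {}), **overrides}


def ram_status(used_percent, params):
    """Compare memory usage (0-100) against the thresholds, critical first."""
    if used_percent >= params["critical"]:
        return "critical"
    if used_percent >= params["warning"]:
        return "warning"
    return "ok"
```

Because memory usage can never reach 101 percent, the warning branch is unreachable for this environment, and the check only fires when usage hits 100.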
The last part of the config in main.yml defines how it talks to Sensu. There are multiple clusters available, so telling it to use the right Graphite server and RabbitMQ servers, along with the vaulted certificates, is essential.
The Sensu cluster environments have a main.yml file which includes all the stuff that the other client environments have, RabbitMQ certs and hostname-pattern-matched subscriptions and the like, but we also have the RBAC configuration here, so I wanted to show you what that looks like.
This example shows a typical team declaration, the fictional "framework" team, giving them full UI access to all of their servers, which as we saw above, include the "framework" subscription. One subscription name per team is nice to have, so you don't have to list out, you know, dev and staging and production as separate subscriptions here.
Here is an API token example, providing a token to perform certain Sensu API calls, GETs and POSTs. It's nice to be able to lock this stuff down.
There's a file which contains the contact routing information, different parameters for different teams, with the names matching the contact name you saw in the client config.
I wanted to show an example of a check definition, too. This one is the standard memory check, which leverages the optional parameters which can be specified per environment.
This 'basic' subscription, everybody gets that, so it's nice to be able to tweak the threshold depending on what a certain team or environment needs.
I also make sure each check includes a runbook link, and if possible, a graph. Not shown here, because I couldn't fit it in this mess, is the metric definition that populates this graph. But being able to embed graphs into the check results page in the UI is a great little feature.
Had enough of that? Let's get into the workflow, the general workflow, how these different teams go from wanting a change in Sensu, to getting one.
Usually, it starts with a pull request. All the roles are hosted in git, so the different teams write their checks or make their changes in a branch and then submit a pull request. I take a look at it, a quick code review process, and once merged, it's back to the team.
If they have a new check, a new script, they re-run the sensu-client playbook on their end, at least one of the tasks in that playbook, to install the change and satisfy dependencies. Sometimes they've already done this before submitting the pull request. I work with great people who are definitely on top of things.
They give me the go-ahead, I run the sensu-server playbook, and make the new check or whatever live. It's the final step, make sure the dependencies on the clients are good first. Put the horse before the cart.
Are there problems with this whole multi-team Sensu thing? You bet! I'd be lying if I said there weren't. But let's call them challenges.
The first pertains to API calls. Teams would like access to data that is simply not available through the RBAC layer.
The second one involves host visibility in one part of the web UI, the Events page.
The third one is more of an internal thing, many teams aren't interested in cleaning up their old hosts and checks, but that needs to be solved.
And finally, there are bottlenecks in the whole workflow that I'd like to point out.
Digging into the first one: of all the lovely and versatile API calls that are available in Sensu, most don't really work if you want them to go through an RBAC layer, LDAP specifically. Either they return everything, all clients for all teams, regardless of any RBAC restrictions in place, or they don't work at all because the functionality's not exposed. Fortunately, RBAC is a first-class feature in Sensu 2.0, so I'm looking forward to working with that.
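For context, this is roughly what a team's "create a silence" call looks like, as a Python sketch that builds the request without sending it. It assumes the Sensu 1.x `/silenced` endpoint, which accepts a JSON body with fields such as `subscription`, `check`, `expire`, `reason`, and `creator`. The base URL, hostnames, and function name are hypothetical, and authentication (the API token) is deliberately left out because the talk does not show how tokens are passed.

```python
import json
import urllib.request


def build_silence_request(api_url, subscription, check=None, expire=None,
                          reason=None, creator=None):
    """Build (but do not send) a POST to the /silenced endpoint."""
    payload = {"subscription": subscription}
    # Only include optional fields that were actually provided.
    for key, value in (("check", check), ("expire", expire),
                       ("reason", reason), ("creator", creator)):
        if value is not None:
            payload[key] = value
    return urllib.request.Request(
        api_url.rstrip("/") + "/silenced",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical usage: silence one check on one client for an hour.
request = build_silence_request(
    "http://sensu.example.com:4567",
    subscription="client:utly01",
    check="check_ram",
    expire=3600,
    reason="planned maintenance",
)
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen`, plus whatever credentials the RBAC layer in front of the API expects.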
Second, if you're locked down to certain subscriptions using RBAC, you're also limited in what you can see on the Events page. A host with a matching subscription won't show up unless the check also matches that subscription.
A global check like the load average, shown here, runs everywhere, it's part of the 'basic' subscription I showed you earlier. But you can't see them in the Events page. I get why that is, it's a strict subscription match on the Events page, but it's simply a usability issue, it keeps the teams from seeing all the alerts on their hosts in one place.
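The behaviour can be modelled in a few lines of Python. This is a sketch of the effect as described in the talk, not the dashboard's actual code: it assumes an event is shown only when the user's allowed subscriptions intersect both the host's subscriptions and the check's subscribers. The field names and sample data are invented.

```python
def visible_on_events_page(event, allowed_subscriptions):
    """Model of the strict match: both the host and the check must match."""
    allowed = set(allowed_subscriptions)
    host_matches = bool(allowed & set(event["client_subscriptions"]))
    check_matches = bool(allowed & set(event["check_subscribers"]))
    return host_matches and check_matches


# A host owned by the framework team, alerting on two different checks.
host_subs = ["basic", "framework", "framework_pdx_dev"]
load_event = {"client_subscriptions": host_subs, "check_subscribers": ["basic"]}
team_event = {"client_subscriptions": host_subs, "check_subscribers": ["framework"]}
```

A user limited to the `framework` subscription would see `team_event` but not `load_event`, even though both alerts are on the same host.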
Let's say you've got a check that no longer applies to a host. The result will stick around in the UI. Or let's say you've decom'ed a host, shut it down, silenced it in Sensu. I am pretty anal-retentive about cleaning up that sort of thing, keeping the UI concise and orderly, but I can't expect everyone to be the same way. Before long, you look at the UI and see it cluttered with obsolete check results and silenced hosts with keepalive alarms. Fortunately, this one has a solution.
I've got an auto-cleanup task, runs like a check on the Sensu API servers once a day, and cleans up everything that's more than two weeks stale: checks that haven't been run in two weeks, and hosts which have been in keepalive failure for two weeks, which you might be able to make out in this example.
Everything except for old silenced entries, when something has long since recovered but is still silenced. I'm afraid to clean that up automatically, though, since it kind of presents an unpredictable user experience.
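The selection logic of such a cleanup can be sketched in Python as follows. This is not the actual cleanup check, which the talk does not show; the record shapes and function names are assumptions. It applies the two-week rule to check results and to hosts in keepalive failure, and deliberately never touches silenced entries.

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(weeks=2)


def stale_results(results, now):
    """Return (client, check) pairs whose check has not run in over two weeks."""
    return [(r["client"], r["check"]) for r in results
            if now - r["last_execution"] > STALE_AFTER]


def stale_clients(clients, now):
    """Return names of clients in keepalive failure for over two weeks."""
    return [c["name"] for c in clients
            if c["keepalive_failing_since"] is not None
            and now - c["keepalive_failing_since"] > STALE_AFTER]


# Hypothetical data: one obsolete check result and one decommissioned host.
now = datetime(2018, 9, 1)
results = [
    {"client": "web01", "check": "check_old", "last_execution": datetime(2018, 8, 1)},
    {"client": "web01", "check": "check_ram", "last_execution": datetime(2018, 8, 31)},
]
clients = [
    {"name": "gone01", "keepalive_failing_since": datetime(2018, 8, 10)},
    {"name": "flaky01", "keepalive_failing_since": datetime(2018, 8, 30)},
    {"name": "web01", "keepalive_failing_since": None},
]
```

A real task would then delete each selected result and client through the Sensu API; only the selection step is shown here.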
This cleanup check is something I would like to open up to the community, though, I think others might find it helpful.
And finally, the biggest bottleneck is me. I have to approve, merge, and deploy any changes that touch the server side. Usually, this isn't a problem, the turn-around time is maybe an hour, depending on what else is going on. And technically, there are a number of people who can do this part, there are people with the right access as well as clearly documented procedures.
But if you're not in it day to day, you're not going to be comfortable with the process. If I take a vacation, they'll wait until I'm back. Making this all CI/CD-capable, that'd be nice, at some point, but in the meantime, I'm the bottleneck, the human gate check. I thought that was worth pointing out.
That's all I've got. Hopefully, whether or not you use Ansible, this gave you some ideas on how to work with multiple teams in Sensu.
Thank you for giving me the time, and thank you, Sensu team, for a great product and a great event.
Questions?