Sensu at Brightpearl
Turning a hatred of Nagios into a love of Sensu
www.brightpearl.com

Who the hell am I?
Dave Tibbs
@LowlySysadm1n
● Systems Administrator at Brightpearl Inc
● Started at Brightpearl UK in October 2010
● Back then, there were only about 20 people in the company, and I was the only Systems Administrator/general IT dogsbody
● ~7 years' experience as a sysadmin working with various flavours of Linux

Monitoring: who needs it anyway?
● Basically everyone. If you're running production software that people depend on, you need to know what's going on with your servers
● You can't rely on screaming users to let you know when things go wrong
● Certain metrics can be very good indicators of failures before they happen: think disk space, memory consumption, failed backups, web requests/sec, etc.

Monitoring in place when I started

Right, better get some monitoring.
Nagios, then?
● It has a reputation as the default, safe choice
● It claims to be the “Industry Standard” on its website
● Historically, people were put off by the extortionate cost of enterprise software (e.g. HP OpenView), and cloud-based software today still requires a subscription.
● Hey, Nagios is free.
● Neckbeards rejoice: it's open source.

In the beginning, it was joyous.
● MONITOR ALL TEH THINGZ
● A (relatively) low server count meant it was still manageable, and alerts were easy to tune for specific servers.
● All the plugins you can imagine meant we could monitor RDS instances, internal office servers, UPSes, etc.
● Email alerts for warnings kept us abreast of things that might happen
● PagerDuty integration for critical alerts
● Configuration assisted by Chef.

But then...
● As the number of servers increases, so does the configuration required
● ...and so do the spurious alerts, where the thresholds aren't so simple to set. Hosting cost constraints mean we sometimes run close to the wire on some servers but not others.
● Because of this: NAGIOSAGEDDON in your email inbox. Soon enough, everyone is ignoring the alerts, especially the warnings, and especially if stuff is still working.

A quick note on Nagios checks.
● The monitoring host sends a check command over NRPE and waits for a response
● The queue of checks is processed one by one, so if networking to certain hosts is slow, the whole list takes longer to process.
● If the list of checks doesn't get processed before the next check is due, the queue falls further and further behind.

So Nagios sucks then?
● Well, Nagios gets some things right:
  ● The plugin model is simple (4 exit codes!) and reasonably well designed
  ● It's pretty reliable
  ● SSL support = secure
● If you're running a small office/datacentre with servers and requirements that rarely or never change, it works, but still with a lot of painful setting up
● But as soon as you deviate from this, it all goes wrong.

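The plugin contract is simple enough to show in a few lines: print one line of status text and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). Below is a minimal sketch of a disk-space plugin in Python following that convention; the function names and the 80%/90% thresholds are assumptions for illustration, not a plugin Brightpearl actually ran.

```python
"""Minimal Nagios-style disk-space plugin (a sketch, not a production plugin)."""

import shutil

# The four Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(used_pct: float, warn_pct: float, crit_pct: float) -> int:
    """Map a usage percentage onto an exit code; thresholds are inclusive."""
    if used_pct >= crit_pct:
        return CRITICAL
    if used_pct >= warn_pct:
        return WARNING
    return OK


def check_disk(path: str = "/", warn_pct: float = 80.0, crit_pct: float = 90.0) -> tuple[int, str]:
    """Return (exit_code, one-line status) for the filesystem holding `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # If we can't measure, report UNKNOWN rather than guessing OK or CRITICAL.
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero capacity"
    used_pct = 100.0 * usage.used / usage.total
    code = classify(used_pct, warn_pct, crit_pct)
    # Text after "|" is performance data that graphing tools can consume.
    return code, f"DISK {STATUS_NAMES[code]} - {used_pct:.1f}% used on {path} | used_pct={used_pct:.1f}%"


def main(argv: list[str]) -> int:
    """Run the check, print the status line and return the exit code."""
    code, message = check_disk(argv[1] if len(argv) > 1 else "/")
    print(message)
    return code


# Installed as a plugin, the script would also `import sys` and end with: sys.exit(main(sys.argv))
```

Because the contract is just an exit code plus a line of text, the same script works unchanged under any tool that speaks the Nagios plugin format, which is what makes the plugin ecosystem so reusable.
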
Yes, basically Nagios sucks.
● A lot has changed in the IT world in 15 years; Nagios hasn't.
● It's completely unscalable. There is no such thing as a Nagios cluster: more checks = more load on the master server
● The configuration format is horrible, and Chef/Puppet only slightly dulls the pain
● It has a horrendous interface, even if you pay for Nagios XI, which isn't cheap
● It assumes a static infrastructure, which in the days of the cloud is almost never the case.
● Configuration has to be duplicated in two places (with NRPE, the service definition on the Nagios server and the command definition on each monitored host)

So what to do?
● I'd reached the limit of Nagios pain and was determined to shake off the Stockholm Syndrome we all appear to have
● Alerts are pretty much ignored by all; once the flood gets large enough, they WILL end up filtered. Nagios has been stopped for days without anybody noticing.
● A monitoring system that people ignore is utterly pointless.
● Started to investigate alternatives.

Alternatives to Nagios
● Nagios XI: $$$ and apparently not much better.
● Zabbix: not as much support as Nagios, and lots of people seem to think it's worse. Configuration is possibly even more complex
● Zenoss: confusing config, issues with false positives and massive numbers of alerts
● Then I found Sensu.

What is this Sensu then?
● Much, much better model (queue-subscriber)
● Purpose-built for this, using the best tool for each job: think Graphite for graphing, PagerDuty for alerting.
● Supports existing Nagios plugins
● Integrates with Graphite and PagerDuty
● Easy to scale: automatically handles clustering.
● Great REST API: you can do most things with it

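To illustrate the queue-subscriber idea, here is a toy, in-memory Python sketch. It is not Sensu's code: in Sensu the bus is a real message broker (RabbitMQ at the time of this talk), and the class, subscription and check names here are hypothetical. The property it shows is that a check request is published to a subscription rather than to a list of hosts, so any client subscribed to it picks the request up.

```python
"""In-memory sketch of a queue/subscriber monitoring model (not Sensu's real code)."""

from collections import defaultdict
from queue import Queue


class Broker:
    """Stands in for the message bus: fans check requests out by subscription name."""

    def __init__(self) -> None:
        self.subscribers: dict[str, list["Client"]] = defaultdict(list)
        self.results: Queue = Queue()  # every client publishes its results here

    def subscribe(self, client: "Client", subscription: str) -> None:
        self.subscribers[subscription].append(client)

    def publish_request(self, subscription: str, check_name: str) -> int:
        """Send a check request to every client on the subscription; return how many got it."""
        clients = self.subscribers.get(subscription, [])
        for client in clients:
            client.inbox.put(check_name)
        return len(clients)


class Client:
    def __init__(self, name: str, broker: Broker, subscriptions: list[str]) -> None:
        self.name = name
        self.broker = broker
        self.inbox: Queue = Queue()
        for sub in subscriptions:
            broker.subscribe(self, sub)

    def run_pending(self) -> None:
        """Execute every queued request and publish a placeholder OK result."""
        while not self.inbox.empty():
            check = self.inbox.get()
            # A real client would run the check command here and capture its exit code.
            self.broker.results.put({"client": self.name, "check": check, "status": 0})


def drain_results(broker: Broker) -> list[dict]:
    """The server side: consume results so they can be handed on (here, just collected)."""
    out = []
    while not broker.results.empty():
        out.append(broker.results.get())
    return out


if __name__ == "__main__":
    bus = Broker()
    web1 = Client("web1", bus, ["base", "webserver"])
    db1 = Client("db1", bus, ["base", "database"])
    bus.publish_request("base", "check_disk")       # both clients receive this
    bus.publish_request("webserver", "check_http")  # only web1 receives this
    for c in (web1, db1):
        c.run_pending()
    for result in drain_results(bus):
        print(result)
```

The design choice worth noticing is that the publisher never enumerates hosts: a new client simply subscribes to `base` and starts receiving `check_disk` requests, with no change on the server side.
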
No really, what is it?
● Often described as a “monitoring router”
● Results of “check” scripts are passed on to one or more handlers, depending on certain conditions
● Written in Ruby (yay!)
● Configuration is all in JSON
● Four main components:
  ● Server
  ● Client
  ● API
  ● Dashboard

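A sketch of both ideas on this slide, under stated assumptions: a check definition written as JSON (the `command`, `subscribers`, `interval` and `handlers` keys follow Sensu's check-definition format, while the plugin path, thresholds and handler names are hypothetical), plus a toy router that decides which handlers receive a result. The per-handler severity rules are a hypothetical simplification of "depending on certain conditions", not Sensu's actual filtering logic.

```python
"""Sketch: a JSON check definition plus a toy 'monitoring router' for its results."""

import json

# Check definition shaped like Sensu's JSON config; the plugin path, thresholds,
# subscription and handler names are assumptions for illustration.
CHECK_CONFIG = {
    "checks": {
        "check_disk": {
            "command": "/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /",
            "subscribers": ["base"],
            "interval": 60,
            "handlers": ["mailer", "pagerduty"],
        }
    }
}

# Hypothetical routing rules: which plugin exit codes each handler cares about.
HANDLER_SEVERITIES = {
    "mailer": {1, 2},  # WARNING and CRITICAL go to email
    "pagerduty": {2},  # only CRITICAL pages someone
}


def route(check_name: str, status: int, config: dict = CHECK_CONFIG) -> list[str]:
    """Return the handlers (in configured order) that should receive this result."""
    handlers = config["checks"][check_name].get("handlers", [])
    return [h for h in handlers if status in HANDLER_SEVERITIES.get(h, set())]


if __name__ == "__main__":
    print(json.dumps(CHECK_CONFIG, indent=2))  # what would live in a .json config file
    print(route("check_disk", 1))  # ['mailer']
    print(route("check_disk", 2))  # ['mailer', 'pagerduty']
```

Note that the `command` is an ordinary Nagios plugin invocation: the check itself is unchanged, and only the decision about who hears about the result has moved into configuration.
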
Compared to Nagios, this is good
● Hosting our infrastructure in the cloud, we need our monitoring solution to be:
  ● able to cope with changing instances/infrastructure
  ● aware of new servers without us having to remember to tell it
  ● able to cope with possibly rapid expansion
● Sensu fulfills these objectives reasonably well.

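The "aware of new servers without us having to tell it" requirement is met in Sensu by clients announcing themselves with periodic keepalives, so the server learns about a new instance from its first one. The Python sketch below models that idea; it is not Sensu's implementation, and the 120/180-second thresholds, method names and instance IDs are hypothetical.

```python
"""Sketch: clients announce themselves via keepalives, so nobody has to register them."""

from dataclasses import dataclass, field


@dataclass
class Registry:
    warn_after_s: float = 120.0  # hypothetical silence threshold for WARNING
    crit_after_s: float = 180.0  # hypothetical silence threshold for CRITICAL
    last_seen: dict[str, float] = field(default_factory=dict)

    def keepalive(self, client: str, now: float) -> None:
        """Record a heartbeat; an unknown client is registered on its first keepalive."""
        self.last_seen[client] = now

    def deregister(self, client: str) -> None:
        """Forget a deliberately terminated instance so it doesn't alert as stale."""
        self.last_seen.pop(client, None)

    def status(self, client: str, now: float) -> int:
        """Nagios-style liveness code: 0 OK, 1 WARNING, 2 CRITICAL."""
        silent_for = now - self.last_seen[client]
        if silent_for >= self.crit_after_s:
            return 2
        if silent_for >= self.warn_after_s:
            return 1
        return 0

    def stale(self, now: float) -> dict[str, int]:
        """All clients currently in WARNING or CRITICAL."""
        flagged = {}
        for client in self.last_seen:
            code = self.status(client, now)
            if code:
                flagged[client] = code
        return flagged


if __name__ == "__main__":
    reg = Registry()
    reg.keepalive("i-web-1", now=0)    # new instance appears: auto-registered
    reg.keepalive("i-web-2", now=0)
    reg.keepalive("i-web-1", now=150)  # web-1 keeps checking in, web-2 goes quiet
    print(reg.stale(now=200))          # {'i-web-2': 2}
```

The trade-off in this model is that a deliberately terminated instance looks exactly like a dead one, so autoscaled instances need to be deregistered explicitly (here via the hypothetical `deregister`) to avoid false alerts.
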
So is Sensu perfect?
● No, nothing is.
● The dashboard is immature: basically still a bit rubbish
● The current release is only version 0.12, so the software as a whole is fairly immature.
● Fairly complicated install process, with dependencies on quite a bit more than Nagios (e.g. RabbitMQ and Redis). It's been Chef'd (and Chef'd well), but these dependencies seem easy to break through version inconsistencies.

But it's still immeasurably better.
● It'll scale well when our infrastructure expands
● Has performed great in a test environment
● Looking forward to rolling it out to production!

Osi security architecture in network.pptx
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh IT
 
Advantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxAdvantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptx
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slides
 

Sensu at Brightpearl

  • 1. Sensu at Brightpearl: Turning a hatred of Nagios into a love of Sensu (www.brightpearl.com)
  • 2. Who the hell am I? Dave Tibbs (@LowlySysadm1n) ● Systems Administrator at Brightpearl Inc ● Started at Brightpearl UK in October 2010 ● Back then, only about 20 people in the company – I was the only Systems Administrator/General IT Dogsbody ● ~7 years' experience as a sysadmin working with various flavours of Linux
  • 3. Monitoring – who needs it anyway? ● Basically everyone – if you're running production software that people depend on, you need to know what's going on with your servers ● You can't rely on screaming users to let you know when things go wrong ● Certain metrics can be a very good indicator of failures before they happen – think disk space, memory consumption, failed backups, web requests/sec, etc.
  • 4. Monitoring in place when I started
  • 5. Right, better get some monitoring. Nagios, then? ● Reputation of being the default, safe choice ● Claims to be “Industry Standard” on its website ● Historically, people were put off by the extortionate costs of enterprise software (e.g. HP OpenView); even now, cloud-based software still requires a subscription. ● Hey, Nagios is free. ● Neckbeards rejoice – it's open source.
  • 6. In the beginning, it was joyous. ● MONITOR ALL TEH THINGZ ● (Relatively) low server count meant it was still manageable. Easy to tune alerts to specific servers. ● All the plugins you can imagine meant we could monitor RDS instances, internal office servers, UPS, etc. ● Email alerts for warnings kept us abreast of things that might happen ● PagerDuty integration for critical alerts ● Configuration assisted with Chef.
  • 7. But then... ● As the number of servers increases, so does the configuration required ● ...and so do the spurious alerts, where the thresholds aren't so simple to set. Hosting cost constraints mean sometimes running close to the wire on some servers but not others. ● Because of this, NAGIOSAGEDDON in your email inbox. Soon enough, everyone's ignoring the alerts, especially the warnings, and especially if stuff is still working.
  • 8. A quick note on Nagios checks. ● The monitoring host sends a check command over NRPE and waits for a response ● The queue of checks is processed one by one – if networking to certain hosts is slow, the whole list takes longer to process. ● If the list of checks doesn't get processed before the next check is due...
  • 9. So Nagios sucks then? ● Well, Nagios gets some things right: the plugin model is simple (4 exit codes!) and reasonably well-designed; it's pretty reliable; SSL support = secure ● If you're running a small office/datacentre with servers and requirements that rarely or never change, it works – but still with a lot of painful setting up ● But as soon as you deviate from this, it all goes wrong.
  • 10. Yes, basically Nagios sucks. ● A lot has changed in the IT world in 15 years – Nagios hasn't. ● It's completely unscalable. There is no such thing as a Nagios cluster. More checks = more server load on the master ● The configuration format is horrible – Chef/Puppet only slightly dulls the pain ● It has a horrendous interface – even if you pay for Nagios XI, which isn't cheap ● It assumes a static infrastructure, which in the days of the cloud is almost never the case. ● Configuration has to be duplicated in two places
  • 11. So what to do? ● Reached the limit of Nagios pain – determined to shake the Stockholm Syndrome we all appear to have ● Alerts are pretty much ignored by all; once the flood gets large enough, they WILL end up filtered. Nagios has stopped for days without anybody noticing. ● A monitoring system that people ignore is utterly pointless. ● Started to investigate alternatives.
  • 12. Alternatives to Nagios ● Nagios XI – $$$ and apparently not much better. ● Zabbix – not as much support as Nagios, and lots of people seem to think it's worse. Configuration possibly even more complex ● Zenoss – confusing config, issues with false positives and massive numbers of alerts ● Then I found Sensu.
  • 13. What is this Sensu then? ● Much, much better model (queue-subscriber) ● Purpose-built for this; use the best tool for each job (think Graphite for graphing, PagerDuty for alerting). ● Supports existing Nagios plugins ● Integrates with Graphite and PagerDuty ● Easy to scale – automatically handles clustering. ● Great REST API – you can do most things with it
  • 14. No really, what is it? ● Often described as a “monitoring router” ● Results of “check” scripts are passed on to one or more handlers, depending on certain conditions ● Written in Ruby (yay!) ● Configuration is all in JSON ● Four main components: Server, Client, API and Dashboard
  • 15. Compared to Nagios, this is good ● Hosting our infrastructure in the cloud, we need our monitoring solution to be: able to cope with changing instances/infrastructure; aware of new servers without us having to remember to tell it; able to cope with possibly rapid expansion ● Sensu fulfills these objectives reasonably well.
  • 16. So is Sensu perfect? ● No, nothing is. ● The dashboard is immature – basically still a bit rubbish ● The current release is only version 0.12, so the software as a whole is fairly immature. ● Fairly complicated install process, with dependencies on quite a bit more than Nagios. It's been Chef'd (and Chef'd well), but it seems easy for these dependencies to break with version inconsistencies.
  • 17. But it's still immeasurably better. ● It'll scale well when our infrastructure expands ● Has performed great in a test environment ● Looking forward to rolling it out to production!
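Slide 9 praises the Nagios plugin model (a check is any executable that prints one line of status and exits with one of four codes: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), and slide 13 notes that Sensu runs those same plugins unchanged. A minimal sketch of such a plugin in Python, checking free disk space as slide 3 suggests; the thresholds, function names and message format are illustrative assumptions, not Brightpearl's actual checks.

```python
"""Minimal Nagios-style disk space check (illustrative sketch)."""
import shutil
import sys

# The four standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(percent_free: float, warn: float = 20.0, crit: float = 10.0) -> int:
    """Map a free-space percentage to an exit code (thresholds are hypothetical)."""
    if percent_free < crit:
        return CRITICAL
    if percent_free < warn:
        return WARNING
    return OK


def check_disk(path: str = "/") -> tuple[int, str]:
    """Return (exit_code, one-line message) for the filesystem holding `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # A check that cannot run reports UNKNOWN rather than guessing.
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    percent_free = 100.0 * usage.free / usage.total
    code = classify(percent_free)
    return code, f"DISK {LABELS[code]} - {percent_free:.1f}% free on {path}"


def main(argv: list[str]) -> int:
    """Print the status line and return the exit code for the caller to use."""
    code, message = check_disk(argv[0] if argv else "/")
    print(message)
    return code
```

To run it as a real plugin, end the script with `sys.exit(main(sys.argv[1:]))`: the monitoring system only looks at that exit code and the printed line, which is why the same script can be scheduled by Nagios over NRPE or by a Sensu client.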
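Slide 14 describes Sensu as a “monitoring router”: each check result is passed to one or more handlers depending on conditions, and the routing lives in JSON configuration. The sketch below illustrates only that idea, reusing the alert split from slide 6 (email for warnings, PagerDuty for critical alerts, with criticals also emailed here). The JSON keys, handler names and routing rule are invented for this example; they are not Sensu's configuration schema or API, and the handlers just return strings instead of sending anything.

```python
import json
from typing import Callable

# Hypothetical routing table, written as JSON to echo slide 14. The keys are
# invented for this sketch and are NOT Sensu's real configuration schema.
ROUTES_JSON = """
{
  "email":     {"statuses": [1, 2]},
  "pagerduty": {"statuses": [2]}
}
"""


def route(result: dict, routes: dict) -> list[str]:
    """Return the names of every handler whose condition matches the result."""
    return [name for name, rule in routes.items()
            if result["status"] in rule["statuses"]]


def dispatch(result: dict, routes: dict,
             handlers: dict[str, Callable[[dict], str]]) -> list[str]:
    """Pass one check result to each matching handler and collect their output."""
    return [handlers[name](result) for name in route(result, routes)]


if __name__ == "__main__":
    routes = json.loads(ROUTES_JSON)
    # Stand-in handlers: a real setup would send mail or call PagerDuty.
    handlers = {
        "email": lambda r: f"email: {r['client']} {r['check']} status {r['status']}",
        "pagerduty": lambda r: f"page: {r['client']} {r['check']}",
    }
    result = {"client": "web01", "check": "disk", "status": 2}
    for line in dispatch(result, routes, handlers):
        print(line)
```

The point of the design is that the check script knows nothing about who gets told: an OK result (status 0) matches no handler, a warning fans out to one, and a critical fans out to two, all decided by configuration rather than by the check.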

Editor's Notes

  1. EXPLAIN WHY NAGIOS CHECKS ARE BAD – an NRPE check is fired at each server; the more checks, the more they queue up. A check can fire off on a server before the previous one has completed, so we never get a result back. Chef kind of helps with configuration, but not by a lot. As there are more servers, there are more exceptions not covered so easily by configuration management. What follows NAGIOSAGEDDON? Mail queue overload and eventual crash. Alerts stop altogether, which nobody notices, because they're ignoring them.
  2. If the list of checks doesn't get processed before the next check is due... we may never get results back for the later checks in the list. Or, suppose the server is able to process the required checks within the time “window” (e.g. 1 minute for checks that are made every minute) – what if the number of checks is doubled? Tripled?
  3. Reliability – when was the last time you saw the Nagios daemon crash? It's usually things external to Nagios that are the problem. Painful setting up – there are bolt-ons like Groundworks to ease setup, but they're not that much better than arsing about with configuration files. Deviation = non-static hostnames in the cloud. Generally, in a datacentre most things are static.
  4. A lot has changed in 15 years – the biggest changes being a) everyone's running more servers and more services, and b) most people rely on the cloud = many, many non-static IP addresses. Nagios is 15 years old, give or take (released in 1999), and the design hasn't changed much in years. It's not fair to expect its developers to have predicted these changes back then, but neither has the software moved with the times. Configuration duplication – the server has to be aware of what checks it wants clients to make, and the client has to be aware of what checks it's going to be expected to run. Absolutely crazy setup.
  5. Stockholm syndrome isn't just in our company, or just me – everyone seems to have it. Reference everyone defending Nagios when it's basically shit.
  6. “Sensu” from the Japanese word for “fan” - relates to the “fanout exchange”, one of the exchange types used by RabbitMQ.
  7. Server – orchestrates check executions, processes the results, and passes the resulting events on to handlers. You can run more than one server, and tasks are distributed amongst them. Client – receives check execution requests, executes the checks, and publishes the results. API – provides a REST-like interface to Sensu data, such as registered clients and current events. You can run more than one. Dashboard – UI for Sensu. Not great.
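Slide 8 and the first two notes argue that checks processed one after another eventually overrun their interval. The arithmetic can be sketched as a toy model, assuming every check takes a fixed 0.5 s (a hypothetical figure) and runs strictly in sequence; it deliberately ignores any concurrency the real Nagios scheduler may be configured with, so it illustrates the argument rather than measuring Nagios.

```python
def checks_completed(latencies: list[float], window: float) -> int:
    """Count checks that finish inside one interval when run strictly in sequence."""
    elapsed = 0.0
    done = 0
    for latency in latencies:
        elapsed += latency
        if elapsed > window:
            break  # this check, and everything queued behind it, misses the window
        done += 1
    return done


def starved(num_checks: int, seconds_per_check: float, window: float) -> tuple[int, int]:
    """Return (completed, missed) for num_checks identical checks in one window."""
    done = checks_completed([seconds_per_check] * num_checks, window)
    return done, num_checks - done


if __name__ == "__main__":
    WINDOW = 60.0     # checks scheduled every minute, as in note 2
    PER_CHECK = 0.5   # hypothetical time per NRPE round trip, in seconds
    for n in (100, 200, 300):  # baseline, doubled, tripled
        done, missed = starved(n, PER_CHECK, WINDOW)
        print(f"{n} checks: {done} completed, {missed} missed")
```

Under these assumptions the window holds 60 / 0.5 = 120 checks: 100 checks fit, but doubling to 200 leaves 80 without a result in that interval, and tripling leaves 180. Passing mixed latencies to `checks_completed` shows the slow-network case from slide 8, where one slow host delays every check queued after it.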
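Slide 13 and note 6 describe Sensu's queue-subscriber model, built on RabbitMQ fanout exchanges: roughly, the server publishes a check request once per subscription, and every client bound to that subscription receives its own copy. The sketch below imitates that behaviour in memory with plain Python objects, assuming no real broker; the names `FanoutExchange`, `Broker`, `Client` and `schedule_check` are hypothetical, and this is not Sensu's own code.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FanoutExchange:
    """In-memory stand-in for a fanout exchange: every bound queue gets a copy."""
    queues: list[deque] = field(default_factory=list)

    def bind(self, queue: deque) -> None:
        self.queues.append(queue)

    def publish(self, message: dict) -> None:
        for queue in self.queues:
            queue.append(dict(message))  # each subscriber receives its own copy


@dataclass
class Broker:
    """Holds one fanout exchange per subscription name, created on first use."""
    exchanges: dict[str, FanoutExchange] = field(default_factory=dict)

    def exchange(self, name: str) -> FanoutExchange:
        return self.exchanges.setdefault(name, FanoutExchange())


@dataclass
class Client:
    """A monitored host that knows only its own subscriptions."""
    name: str
    subscriptions: list[str]
    inbox: deque = field(default_factory=deque)

    def register(self, broker: Broker) -> None:
        # Binding is done from the client side, so a new instance starts
        # receiving check requests without any change to the server's config.
        for sub in self.subscriptions:
            broker.exchange(sub).bind(self.inbox)


def schedule_check(broker: Broker, check: str, subscribers: list[str]) -> None:
    """Server side: publish one check request to each subscription's exchange."""
    for sub in subscribers:
        broker.exchange(sub).publish({"check": check})


if __name__ == "__main__":
    broker = Broker()
    hosts = [Client("web01", ["webserver"]), Client("web02", ["webserver"]),
             Client("db01", ["database"])]
    for host in hosts:
        host.register(broker)
    schedule_check(broker, "check_http", ["webserver"])
    for host in hosts:
        print(host.name, list(host.inbox))
```

This is the property slide 15 asks for: the server never lists hosts, so a freshly launched cloud instance that registers with the right subscription is picked up on the next scheduled check, and the duplicated server/client configuration criticised in note 4 disappears from this model.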