Scaling Nagios 4
Daniel Wittenberg
daniel.wittenberg@ipsoft.com
About Me
● Unix/Linux admin since mid 90's
● Nagios/Netsaint user since early 2000's
● Owned/operated consulting business for almost 10 years that
provided distributed monitoring using Nagios
● Previously employed by Fortune 50 Insurance company
● Currently Monitoring Platform Manager at IPsoft Inc.
About IPsoft
● Provider of Remote Infrastructure Management and automation
services
● ITIL and 6 Sigma compliance management framework
● Automation that resolves 56% of all incidents, and 90% of L1 incidents
● Monitoring, Automation, Event Correlation, Management....
● Offices around the world in ten countries
● http://www.ipsoft.com
Last year...
My Configuration
● ~700 Nagios Servers
● ~130,000 Monitored Devices
● ~3,000,000 Service Checks
● Mix of customized Nagios 3.2.3 and 4.0.0
● Scientific Linux 6.2/6.4
● Managed by Puppet 3.x
● 2/3 on VMware ESX, the rest on bare metal
● Adding new Nagios servers almost daily
What's different with Nagios 4
SPEED!
● Current testing shows Nagios 4 is on average 500% faster than 3.2.3
What's different with Nagios 4
Some things that would impact performance/stability
http://nagios.sourceforge.net/docs/nagioscore/4/en/whatsnew.html
● Embedded Perl – Gone
● external_command_buffer_slots - Gone
● -x option to not verify circular paths no longer needed in rc scripts
● Configuration Verification algorithm changes, massive startup speed increase
● Event Queue algorithm changes, helps with CPU utilization (see Andreas' 2012 presentation)
● Disk I/O reduced to virtually 0
● NEW query handler interface, better communication with core
● NEW core workers – reduces I/O, memory, CPU
● Completely re-written spec file for better installs, debug modes
Perf Testing Lab Setup
● Servers are all ESX 5 based VM's on the same cluster
● Variable CPU cores, 4GB memory
● Metrics used to consider a test failure:
● CPU Block Queue > 3
● CPU I/O Wait > 3
● CPU Idle < 10%
● Service Check Latency > 1s
● Host Check Latency > 1s
● 30 minute run time, > 3% failure rate failed the test
● Fully automated increasing work load, consistent results
● Add 1 host + 1 service check at a time, to get “best case” numbers without check latency
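The pass/fail rule above can be sketched in a few lines of Python. This is only an illustration of the thresholds on the slide, not the lab's actual harness: the `Sample` field names are made up, the unit of "I/O Wait > 3" is assumed to be whatever the sampling tool reports, and "failure rate" is assumed to mean the fraction of sampling intervals that crossed any threshold.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Sample:
    """One measurement interval; field names are illustrative, not from the slides."""
    block_queue: float        # blocked processes, e.g. the "b" column of vmstat
    io_wait: float            # CPU I/O wait as reported by the sampling tool
    cpu_idle_pct: float       # CPU idle, in percent
    service_latency_s: float  # service check latency, in seconds
    host_latency_s: float     # host check latency, in seconds


def sample_failed(s: Sample) -> bool:
    """A sample fails if any threshold from the slide is crossed."""
    return (
        s.block_queue > 3
        or s.io_wait > 3
        or s.cpu_idle_pct < 10
        or s.service_latency_s > 1
        or s.host_latency_s > 1
    )


def run_failed(samples: Sequence[Sample], max_failure_rate: float = 0.03) -> bool:
    """A run fails when more than 3% of its samples failed (assumed meaning)."""
    if not samples:
        raise ValueError("no samples collected")
    failures = sum(sample_failed(s) for s in samples)
    return failures / len(samples) > max_failure_rate


healthy = Sample(0, 0.5, 60.0, 0.2, 0.1)
overloaded = Sample(5, 0.5, 60.0, 0.2, 0.1)
print(run_failed([healthy] * 96 + [overloaded] * 4))
```

With 4 bad intervals out of 100 the run counts as failed; an automated ramp would stop there and report the last load level that passed.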
Test Lab Architecture
Test Results
CPU Cores   Service Checks (3.2.3)   Service Checks (4.0.0rc1)   Difference
1           1700                     10500                       617%
2           3300                     20800                       630%
4           6500                     35300                       543%
8           11700                    45100                       385%
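The "Difference" column appears to be the 4.0.0rc1 figure expressed as a percentage of the 3.2.3 figure (truncated to a whole number), rather than a percentage increase. A short Python check of that reading, using only the numbers from the table; the per-core figure is an extra derived value, not something shown on the slide:

```python
# (cores, service checks on 3.2.3, service checks on 4.0.0rc1), from the table
RESULTS = [
    (1, 1700, 10500),
    (2, 3300, 20800),
    (4, 6500, 35300),
    (8, 11700, 45100),
]


def ratio_percent(old: int, new: int) -> int:
    """New throughput as a whole-number percentage of old, truncated."""
    return new * 100 // old


for cores, old, new in RESULTS:
    print(
        f"{cores} core(s): {ratio_percent(old, new)}% of 3.2.3, "
        f"{new // cores} checks per core on 4.0.0rc1"
    )
```

Read this way, the table says 4.0.0rc1 sustained roughly four to six times the service checks of 3.2.3 on the same hardware, with the advantage narrowing as cores are added.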
Other software used
● Customized livestatus based on Andreas' updates for Nagios 4
● https://github.com/ageric/livestatus
● Developing custom “single pane” interface to replace CGI/Check_mk Multisite
● Developing full REST API to talk to QH, livestatus and config files
● nagios-qh.rb Query Handler interface to gather loadctl metrics
● https://www.dropbox.com/s/h6zn0ecycqb1xrc/nagios-qh.rb
● Custom load control daemon that talks to QH
● Custom Event Broker to send perf data directly to ActiveMQ for post-processing
● Custom agent, like NRPE on steroids without limitations like buffer size
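For readers who have not talked to the Query Handler (QH) before, here is a rough Python sketch of a one-shot query over its Unix socket. It is not a port of nagios-qh.rb. The socket path is hypothetical and depends on the build, and the framing used here (`#`, handler name, a space, the command, then a terminating NUL byte) is an assumption that should be checked against the Nagios Core 4 query handler documentation before relying on it.

```python
import socket

# Hypothetical path; the real location depends on how Nagios was built or packaged.
QH_SOCKET = "/usr/local/nagios/var/rw/nagios.qh"


def build_query(handler: str, command: str) -> bytes:
    """Frame a one-shot query: '#', handler, space, command, NUL (assumed framing)."""
    return f"#{handler} {command}".encode("ascii") + b"\0"


def query(handler: str, command: str, path: str = QH_SOCKET, timeout: float = 5.0) -> str:
    """Send one query over the Unix socket and return whatever the core replies."""
    chunks = []
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(path)
        sock.sendall(build_query(handler, command))
        while True:
            try:
                data = sock.recv(4096)
            except socket.timeout:
                break  # the peer kept the connection open; stop waiting
            if not data:
                break  # the peer closed the connection
            chunks.append(data)
    return b"".join(chunks).rstrip(b"\0").decode("utf-8", errors="replace")
```

On a host running Nagios 4, something like `print(query("core", "loadctl"))` would be the call to try for the load control metrics mentioned above, assuming the `core` handler and `loadctl` command behave as described in the Core documentation.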
Other performance tweaks
● Sysctl Changes
● net.ipv4.tcp_fin_timeout
● net.ipv4.tcp_keepalive_probes
● net.ipv4.tcp_tw_recycle
● net.ipv4.tcp_tw_reuse
● No longer need RAMDISK, but still in the default sysconfig/RC script for now
● Keep logging levels as low as possible
● Disable CGI's whenever possible
● Disable Environment Macros
● Don't use resource macros when you don't need to; they are not cached
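The slide names sysctl keys but not the values used, so the sketch below only reads what a host currently has rather than suggesting settings. It assumes a Linux `/proc/sys` layout; the key list is a subset of the slide's, and the function names are illustrative. Note that `net.ipv4.tcp_tw_recycle` was removed in Linux 4.12, so on newer kernels it will simply be reported as missing.

```python
from pathlib import Path
from typing import Optional

# Keys taken from the slide; no values are implied.
KEYS = [
    "net.ipv4.tcp_fin_timeout",
    "net.ipv4.tcp_tw_reuse",
    "net.ipv4.tcp_tw_recycle",
]


def sysctl_path(key: str, root: str = "/proc/sys") -> Path:
    """Map a dotted sysctl key to its file under /proc/sys."""
    return Path(root, *key.split("."))


def read_sysctl(key: str, root: str = "/proc/sys") -> Optional[str]:
    """Return the current value as text, or None if the key is absent or unreadable."""
    try:
        return sysctl_path(key, root).read_text().strip()
    except OSError:
        return None


def report(keys=KEYS) -> dict:
    """Collect the current value of each key so it can be compared across servers."""
    return {key: read_sysctl(key) for key in keys}
```

Running `report()` on each Nagios server (for example from a Puppet fact or a check plugin) gives a quick way to confirm the tuning is actually applied everywhere.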
Other performance tweaks
● /etc/security/limits.d/nagios.conf
● ipmon soft nofile 131072
● ipmon hard nofile 131072
● ipmon soft nproc 131072
● ipmon hard nproc 131072
● Nearly disable the OOM killer for the nagios process, so it is killed last
● echo '-16' > /proc/<nagios pid>/oom_adj
● Re-nice puppet to run at 10 so it has less impact
● /etc/sysconfig/puppet – NICELEVEL=10
● The same applies to any other running services that might take resources
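A limits.d entry only helps if the running process actually picked it up. The following Python sketch checks the nofile and nproc limits of the current process against the 131072 value from the slide. It uses the standard `resource` module and has to be run as the monitoring user (ipmon in this setup) to say anything useful; the function names are illustrative.

```python
import resource

WANTED = 131072  # value from the limits.d entries above


def meets(limit: int, wanted: int = WANTED) -> bool:
    """True if a limit is unlimited or at least the wanted value."""
    return limit == resource.RLIM_INFINITY or limit >= wanted


def check_limits(wanted: int = WANTED) -> dict:
    """Report soft/hard nofile and nproc limits of the current process."""
    report = {}
    for name, res in (
        ("nofile", resource.RLIMIT_NOFILE),
        ("nproc", resource.RLIMIT_NPROC),
    ):
        soft, hard = resource.getrlimit(res)
        report[name] = {
            "soft": soft,
            "hard": hard,
            "ok": meets(soft, wanted) and meets(hard, wanted),
        }
    return report


for name, info in check_limits().items():
    print(name, info)
```

If either entry reports `ok: False`, the limits file was most likely not applied to the session that started the process.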
Common Perf Tools
● vmstat / top – cpu/memory
● iostat / iotop – disk usage
● iptraf - network
● sar – cpu/memory/disk
● strace – immediate debugging, also debugging QA
● esxtop – VM stats
● tuned – can dynamically tune system
● perf record -p <pid> / perf list / perf top -u nagios
How to keep it running well
● Monitor everything...you can never have too much info!
● CPU load and CPU stats (idle/wait/user/system)
● Disk space, inodes free
● All application/system logs (apache, syslog, nagios.log, etc.)
● Hardware status
● Swap / Physical Memory Usage
● Puppet state (state.yaml)
● Apache Stats (if you have a GUI/API)
● Network performance and stats (errors, throughput, etc.)
● NTP time and drift (more important on VM's)
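As one concrete example from the list above, the CPU idle/wait/user/system split can be derived from two snapshots of the aggregate `cpu` line in `/proc/stat`. This is a minimal Python sketch assuming the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal); it is not the check the platform actually uses, and the function names are illustrative.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line: str) -> dict:
    """Parse the aggregate 'cpu' line of /proc/stat into named tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_percentages(before: dict, after: dict) -> dict:
    """Percentage of time spent in each state between two snapshots."""
    deltas = {k: after[k] - before[k] for k in before}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("snapshots are identical or out of order")
    return {k: 100.0 * d / total for k, d in deltas.items()}


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Return the first line of /proc/stat, which is the aggregate 'cpu' line."""
    with open(path) as f:
        return f.readline()


# Two made-up snapshots, a few seconds apart
a = parse_cpu_line("cpu  100 0 50 800 50 0 0 0 0 0")
b = parse_cpu_line("cpu  200 0 100 1600 100 0 0 0 0 0")
print(cpu_percentages(a, b))
```

On a live host, call `parse_cpu_line(read_cpu_line())` twice with a short sleep in between and alert on the `idle` and `iowait` percentages.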
Our Platform Architecture (simplified)
Known Issues (and complaints)
● Smaller (1-2 core) systems are easily overloaded by the number of workers
● No remote workers (yet)
● Still have to restart to add new hosts/services
● No REST API natively
● Livestatus (or similar) not native
Questions?
● Daniel.Wittenberg@ipsoft.com
● dwittenberg2008@gmail.com
● @dwittenberg2008
● www.linkedin.com/in/dwittenberg
● nagios and nagios-devel IRC
● Nagios Users and Devel mailing lists
● Always looking to hire new people so contact me!

Nagios Conference 2013 - Daniel Wittenberg - Scaling Nagios Core 4
