SlideShare a Scribd company logo
1 of 18
Download to read offline
NWC 2011
     Monitoring a Cloud Infrastructure in a Multi-Region Topology

                                                      Nicolas Brousse
                                                    nicolas@tubemogul.com

                                                      September 29th 2011




2011 TubeMogul Incorporated All rights reserved.
                                                                            1
Introduction - About the speaker
   • My name is Nicolas Brousse
   • I previously worked for many industry leading company in France
      – From Web Hosting to Online Video services
           (Lycos, MultiMania, Kewego, MediaPlazza...)
     – Heavy traffic environment and large user databases
   • I work as a Lead Operations Engineer at TubeMogul.com since 2008
   • I help TubeMogul to scale its infrastructure
     – From 20 servers to +500 servers
     – Using 4 Amazon EC2 Regions + 1 Colo
     – Monitoring with Nagios over 6,000 actives services and 1,000 passives services
     – Collecting over 80,000 metrics with Ganglia
     – Managing over 300 TB of data in Hadoop HDFS
     – Billions HTTP queries a day
   • Occasionally contribute to OpenSource projects
     – Ganglia (PHP and PERL module)
     – PHP Judy
2011 TubeMogul Incorporated All rights reserved.
                                                                                        2
Introduction - About TubeMogul

   • Created in November 2006 by John Hughes and Brett Wilson
   • Formerly a video distribution and analytics platform
   • Acquire Illuminex - a flash analytics firm - in October 2008
   • New platform call PlayTime™ :
      – TubeMogul is a Video Marketing Company
      – Built for Branding
      – Integrate real-time media buying, ad serving, targeting, optimization and brand
           measurement



   TubeMogul simplifies the delivery of video ads and maximizes the impact of
                     every dollar spent by brand marketers


                                  http://www.tubemogul.com/company/about_us


2011 TubeMogul Incorporated All rights reserved.
                                                                                          3
Our Environment
   • +10 servers hosted at LiquidWeb
   • Few VPS on Linode
   • +500 instances on Amazon EC2
       – Over 50 different servers configurations
   • Our technology stack :
       – JAVA, PHP
       – Hadoop : HDFS, MapReduce, HBase, Hive
       – Membase
       – Memcache
       – MySQL
       – And more...
   • Monitoring with Nagios
       – Using NSCA when possible
   • Graphing and Trending using Ganglia with Python plugins
       – Some legacy servers using Munin
   • Configuration Management using Puppet
2011 TubeMogul Incorporated All rights reserved.
                                                               4
Amazon Clound Environment




2011 TubeMogul Incorporated All rights reserved.
                                                   5
Amazon Clound Environment
   • We like it because....
      – We can quickly start new servers/clusters
      – We can quickly start new servers/clusters in many regions
         • US East (Virginia)
             • US West (North California)
             • Europe (Dublin)
             • Asia Pacific (Tokyo & Singapore)
      – We can use different type of instances (RAM, CPU, Disks, etc.)
      – It’s easy to automate with EC2 API
      – It’s easy to plug to a configuration management tool

   • But...
      – It can be hard to troubleshoot some failures or network problems
      – Occasionally being notified of hardware failures after the facts
      – No Multicast (Though, possible with Amazon VPC)
      – Bandwidth cost between regions can get expensive


2011 TubeMogul Incorporated All rights reserved.
                                                                           6
What’s the plan ?
   • Our monitoring must be able to scale
   • We need a better Graphing/Trending solution
   • Our monitoring configuration must be automated
      – How to monitor a cluster of servers with variables number of servers every hours ?
      – How to change configuration in multiple regions without missing something ?
   • A failure in one region shouldn’t impact other regions
   • We want to be wake-up only when it really matter
   • We have limited resources
      – Can’t spend big bucks for monitoring
      – Small operation team




2011 TubeMogul Incorporated All rights reserved.
                                                                                             7
Graphing, Trending...

       Munin                                                 Ganglia
                      munin-update                                    Gmetad
                      munin-graph

Pull                                                  Pull


                                                              Gmond    Gmond   Gmetad


                       munin-nodes                                                      Pull
                                                      Push
                   sequential polling




   2011 TubeMogul Incorporated All rights reserved.
                                                                                           8
Graphing, Trending...
   • Why we switched from Munin to Ganglia ?

      – Pretty much : Pull vs Push
         • Munin server fetch data from Munin Clients (munin-nodes)
                    – Can quickly overload the Munin server in disk I/O and CPU
                    – Data collected in sequential order impacted by previous run time and server load
             • Ganglia Client send data to representative clusters nodes. Data get federated
               periodically by a Gmetad process.
                    – Lighter on the aggregation side
                    – Clients push data at defined interval
                    – Can use threshold to send data only when it make sense
                       » using time_threshold and value_threshold in the metric

      – Ganglia is designed for Clusters and Grids
         • You can use multiple layer of gmond/gmetad process
             • You don’t need to manually add servers to your configuration



2011 TubeMogul Incorporated All rights reserved.
                                                                                                         9
Monitoring with Nagios




2011 TubeMogul Incorporated All rights reserved.
                                                   10
Automating Nagios configuration
• Puppet will configure our monitoring instance in each Region
   – We use Nagios regex : use_regexp_matching=1
   – But we don’t use true regex : use_true_regexp_matching=0
   – We use NSCA with Upstart




   – We don’t use the perfdata
   – We includes our configurations from 3 directories
    - objects => templates, contacts, commands, event_handlers
    - servers => contain a configuration file for each server
    - clusters => contain a configuration file for each cluster


2011 TubeMogul Incorporated All rights reserved.
                                                                  11
Automating Nagios configuration
Process of event when starting a new host and add it to our monitoring:

1. We start a new instance using Cerveza and Cloud-init

2. Puppet configure Gmond on the instance

3. Our monitoring server running Gmetad get data from the new instance

4. A Nagios check run every minute and look for new hosts in Ganglia

5. If a new host is found, the check script rebuild the Nagios config and
 reload Nagios

6. If the config is corrupt, the check script will send a critical alert



2011 TubeMogul Incorporated All rights reserved.
                                                                            12
Automating Nagios configuration
• Each server configuration is
  generated from a template
• Our nagios plugin
  “check_tm_clusters”, goes
  over the RRD files generated
  by Ganglia
• If a new host is found, it
  simply copy the template to
  the servers config directory
  and replace the variables as
  reported by Ganglia and
  looking at DNS entries




2011 TubeMogul Incorporated All rights reserved.
                                                   13
Reducing noise and false positive
• We disable most notification and only care of a cluster status




• Most of our checks are based on Ganglia RRD files




2011 TubeMogul Incorporated All rights reserved.
                                                                   14
Reducing noise and false positive
• It become really easy to monitor any metrics returned by Ganglia




2011 TubeMogul Incorporated All rights reserved.
                                                                     15
Reducing noise and false positive
• We can check cluster status by hosts/services but also per returned
   messages !




2011 TubeMogul Incorporated All rights reserved.
                                                                        16
Reducing noise and false positive
• We extensively use our “check_cluster” plugin
• We limit as much as possible email notification
• We use a custom variable _PAGING to identify pageable services
• Paging ONLY on Critical alerts for services/hosts with _PAGING=yes
• Use different contacts and time periods to send alerts to the right person
• We use Nagios Checker for FireFox and Chrome




2011 TubeMogul Incorporated All rights reserved.
                                                                           17
Thank You...


                                                   TubeMogul is Hiring !

    http://www.tubemogul.com/company/careers
               jobs@tubemogul.com


                                                   Follow us on Twitter
                                       @tubemogul                         @orieg

2011 TubeMogul Incorporated All rights reserved.
                                                                                   18

More Related Content

More from Nagios

Dave Williams - Nagios Log Server - Practical Experience
Dave Williams - Nagios Log Server - Practical ExperienceDave Williams - Nagios Log Server - Practical Experience
Dave Williams - Nagios Log Server - Practical ExperienceNagios
 
Mike Weber - Nagios and Group Deployment of Service Checks
Mike Weber - Nagios and Group Deployment of Service ChecksMike Weber - Nagios and Group Deployment of Service Checks
Mike Weber - Nagios and Group Deployment of Service ChecksNagios
 
Mike Guthrie - Revamping Your 10 Year Old Nagios Installation
Mike Guthrie - Revamping Your 10 Year Old Nagios InstallationMike Guthrie - Revamping Your 10 Year Old Nagios Installation
Mike Guthrie - Revamping Your 10 Year Old Nagios InstallationNagios
 
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...Nagios
 
Matt Bruzek - Monitoring Your Public Cloud With Nagios
Matt Bruzek - Monitoring Your Public Cloud With NagiosMatt Bruzek - Monitoring Your Public Cloud With Nagios
Matt Bruzek - Monitoring Your Public Cloud With NagiosNagios
 
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.Nagios
 
Eric Loyd - Fractal Nagios
Eric Loyd - Fractal NagiosEric Loyd - Fractal Nagios
Eric Loyd - Fractal NagiosNagios
 
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...Nagios
 
Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...
Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...
Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...Nagios
 
Nagios World Conference 2015 - Scott Wilkerson Opening
Nagios World Conference 2015 - Scott Wilkerson OpeningNagios World Conference 2015 - Scott Wilkerson Opening
Nagios World Conference 2015 - Scott Wilkerson OpeningNagios
 
Nrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios Core
Nrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios CoreNrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios Core
Nrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios CoreNagios
 
Nagios Log Server - Features
Nagios Log Server - FeaturesNagios Log Server - Features
Nagios Log Server - FeaturesNagios
 
Nagios Network Analyzer - Features
Nagios Network Analyzer - FeaturesNagios Network Analyzer - Features
Nagios Network Analyzer - FeaturesNagios
 
Nagios Conference 2014 - Dorance Martinez Cortes - Customizing Nagios
Nagios Conference 2014 - Dorance Martinez Cortes - Customizing NagiosNagios Conference 2014 - Dorance Martinez Cortes - Customizing Nagios
Nagios Conference 2014 - Dorance Martinez Cortes - Customizing NagiosNagios
 
Nagios Conference 2014 - Mike Weber - Nagios Rapid Deployment Options
Nagios Conference 2014 - Mike Weber - Nagios Rapid Deployment OptionsNagios Conference 2014 - Mike Weber - Nagios Rapid Deployment Options
Nagios Conference 2014 - Mike Weber - Nagios Rapid Deployment OptionsNagios
 
Nagios Conference 2014 - Eric Mislivec - Getting Started With Nagios Core
Nagios Conference 2014 - Eric Mislivec - Getting Started With Nagios CoreNagios Conference 2014 - Eric Mislivec - Getting Started With Nagios Core
Nagios Conference 2014 - Eric Mislivec - Getting Started With Nagios CoreNagios
 
Nagios Conference 2014 - Trevor McDonald - Monitoring The Physical World With...
Nagios Conference 2014 - Trevor McDonald - Monitoring The Physical World With...Nagios Conference 2014 - Trevor McDonald - Monitoring The Physical World With...
Nagios Conference 2014 - Trevor McDonald - Monitoring The Physical World With...Nagios
 
Nagios Conference 2014 - Andy Brist - Nagios XI Failover and HA Solutions
Nagios Conference 2014 - Andy Brist - Nagios XI Failover and HA SolutionsNagios Conference 2014 - Andy Brist - Nagios XI Failover and HA Solutions
Nagios Conference 2014 - Andy Brist - Nagios XI Failover and HA SolutionsNagios
 
Nagios Conference 2014 - Shamas Demoret - An Overview of Nagios Solutions
Nagios Conference 2014 - Shamas Demoret - An Overview of Nagios SolutionsNagios Conference 2014 - Shamas Demoret - An Overview of Nagios Solutions
Nagios Conference 2014 - Shamas Demoret - An Overview of Nagios SolutionsNagios
 
Nagios Conference 2014 - Shamas Demoret - Getting Started With Nagios XI
Nagios Conference 2014 - Shamas Demoret - Getting Started With Nagios XINagios Conference 2014 - Shamas Demoret - Getting Started With Nagios XI
Nagios Conference 2014 - Shamas Demoret - Getting Started With Nagios XINagios
 

More from Nagios (20)

Dave Williams - Nagios Log Server - Practical Experience
Dave Williams - Nagios Log Server - Practical ExperienceDave Williams - Nagios Log Server - Practical Experience
Dave Williams - Nagios Log Server - Practical Experience
 
Mike Weber - Nagios and Group Deployment of Service Checks
Mike Weber - Nagios and Group Deployment of Service ChecksMike Weber - Nagios and Group Deployment of Service Checks
Mike Weber - Nagios and Group Deployment of Service Checks
 
Mike Guthrie - Revamping Your 10 Year Old Nagios Installation
Mike Guthrie - Revamping Your 10 Year Old Nagios InstallationMike Guthrie - Revamping Your 10 Year Old Nagios Installation
Mike Guthrie - Revamping Your 10 Year Old Nagios Installation
 
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...
 
Matt Bruzek - Monitoring Your Public Cloud With Nagios
Matt Bruzek - Monitoring Your Public Cloud With NagiosMatt Bruzek - Monitoring Your Public Cloud With Nagios
Matt Bruzek - Monitoring Your Public Cloud With Nagios
 
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
 
Eric Loyd - Fractal Nagios
Eric Loyd - Fractal NagiosEric Loyd - Fractal Nagios
Eric Loyd - Fractal Nagios
 
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
 
Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...
Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...
Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...
 
Nagios World Conference 2015 - Scott Wilkerson Opening
Nagios World Conference 2015 - Scott Wilkerson OpeningNagios World Conference 2015 - Scott Wilkerson Opening
Nagios World Conference 2015 - Scott Wilkerson Opening
 
Nrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios Core
Nrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios CoreNrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios Core
Nrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios Core
 
Nagios Log Server - Features
Nagios Log Server - FeaturesNagios Log Server - Features
Nagios Log Server - Features
 
Nagios Network Analyzer - Features
Nagios Network Analyzer - FeaturesNagios Network Analyzer - Features
Nagios Network Analyzer - Features
 
Nagios Conference 2014 - Dorance Martinez Cortes - Customizing Nagios
Nagios Conference 2014 - Dorance Martinez Cortes - Customizing NagiosNagios Conference 2014 - Dorance Martinez Cortes - Customizing Nagios
Nagios Conference 2014 - Dorance Martinez Cortes - Customizing Nagios
 
Nagios Conference 2014 - Mike Weber - Nagios Rapid Deployment Options
Nagios Conference 2014 - Mike Weber - Nagios Rapid Deployment OptionsNagios Conference 2014 - Mike Weber - Nagios Rapid Deployment Options
Nagios Conference 2014 - Mike Weber - Nagios Rapid Deployment Options
 
Nagios Conference 2014 - Eric Mislivec - Getting Started With Nagios Core
Nagios Conference 2014 - Eric Mislivec - Getting Started With Nagios CoreNagios Conference 2014 - Eric Mislivec - Getting Started With Nagios Core
Nagios Conference 2014 - Eric Mislivec - Getting Started With Nagios Core
 
Nagios Conference 2014 - Trevor McDonald - Monitoring The Physical World With...
Nagios Conference 2014 - Trevor McDonald - Monitoring The Physical World With...Nagios Conference 2014 - Trevor McDonald - Monitoring The Physical World With...
Nagios Conference 2014 - Trevor McDonald - Monitoring The Physical World With...
 
Nagios Conference 2014 - Andy Brist - Nagios XI Failover and HA Solutions
Nagios Conference 2014 - Andy Brist - Nagios XI Failover and HA SolutionsNagios Conference 2014 - Andy Brist - Nagios XI Failover and HA Solutions
Nagios Conference 2014 - Andy Brist - Nagios XI Failover and HA Solutions
 
Nagios Conference 2014 - Shamas Demoret - An Overview of Nagios Solutions
Nagios Conference 2014 - Shamas Demoret - An Overview of Nagios SolutionsNagios Conference 2014 - Shamas Demoret - An Overview of Nagios Solutions
Nagios Conference 2014 - Shamas Demoret - An Overview of Nagios Solutions
 
Nagios Conference 2014 - Shamas Demoret - Getting Started With Nagios XI
Nagios Conference 2014 - Shamas Demoret - Getting Started With Nagios XINagios Conference 2014 - Shamas Demoret - Getting Started With Nagios XI
Nagios Conference 2014 - Shamas Demoret - Getting Started With Nagios XI
 

Recently uploaded

WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 

Recently uploaded (20)

WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

Nagios Conference 2011 - Nicolas Brousse - Monitoring A Cloud Infrastructure In A Multi-Region Topology

  • 1. NWC 2011 Monitoring a Cloud Infrastructure in a Multi-Region Topology Nicolas Brousse nicolas@tubemogul.com September 29th 2011 2011 TubeMogul Incorporated All rights reserved. 1
  • 2. Introduction - About the speaker • My name is Nicolas Brousse • I previously worked for many industry leading company in France – From Web Hosting to Online Video services (Lycos, MultiMania, Kewego, MediaPlazza...) – Heavy traffic environment and large user databases • I work as a Lead Operations Engineer at TubeMogul.com since 2008 • I help TubeMogul to scale its infrastructure – From 20 servers to +500 servers – Using 4 Amazon EC2 Regions + 1 Colo – Monitoring with Nagios over 6,000 actives services and 1,000 passives services – Collecting over 80,000 metrics with Ganglia – Managing over 300 TB of data in Hadoop HDFS – Billions HTTP queries a day • Occasionally contribute to OpenSource projects – Ganglia (PHP and PERL module) – PHP Judy 2011 TubeMogul Incorporated All rights reserved. 2
  • 3. Introduction - About TubeMogul • Created in November 2006 by John Hughes and Brett Wilson • Formerly a video distribution and analytics platform • Acquire Illuminex - a flash analytics firm - in October 2008 • New platform call PlayTime™ : – TubeMogul is a Video Marketing Company – Built for Branding – Integrate real-time media buying, ad serving, targeting, optimization and brand measurement TubeMogul simplifies the delivery of video ads and maximizes the impact of every dollar spent by brand marketers http://www.tubemogul.com/company/about_us 2011 TubeMogul Incorporated All rights reserved. 3
  • 4. Our Environment • +10 servers hosted at LiquidWeb • Few VPS on Linode • +500 instances on Amazon EC2 – Over 50 different servers configurations • Our technology stack : – JAVA, PHP – Hadoop : HDFS, MapReduce, HBase, Hive – Membase – Memcache – MySQL – And more... • Monitoring with Nagios – Using NSCA when possible • Graphing and Trending using Ganglia with Python plugins – Some legacy servers using Munin • Configuration Management using Puppet 2011 TubeMogul Incorporated All rights reserved. 4
  • 5. Amazon Clound Environment 2011 TubeMogul Incorporated All rights reserved. 5
  • 6. Amazon Clound Environment • We like it because.... – We can quickly start new servers/clusters – We can quickly start new servers/clusters in many regions • US East (Virginia) • US West (North California) • Europe (Dublin) • Asia Pacific (Tokyo & Singapore) – We can use different type of instances (RAM, CPU, Disks, etc.) – It’s easy to automate with EC2 API – It’s easy to plug to a configuration management tool • But... – It can be hard to troubleshoot some failures or network problems – Occasionally being notified of hardware failures after the facts – No Multicast (Though, possible with Amazon VPC) – Bandwidth cost between regions can get expensive 2011 TubeMogul Incorporated All rights reserved. 6
  • 7. What’s the plan ? • Our monitoring must be able to scale • We need a better Graphing/Trending solution • Our monitoring configuration must be automated – How to monitor a cluster of servers with variables number of servers every hours ? – How to change configuration in multiple regions without missing something ? • A failure in one region shouldn’t impact other regions • We want to be wake-up only when it really matter • We have limited resources – Can’t spend big bucks for monitoring – Small operation team 2011 TubeMogul Incorporated All rights reserved. 7
  • 8. Graphing, Trending... Munin Ganglia munin-update Gmetad munin-graph Pull Pull Gmond Gmond Gmetad munin-nodes Pull Push sequential polling 2011 TubeMogul Incorporated All rights reserved. 8
  • 9. Graphing, Trending... • Why we switched from Munin to Ganglia ? – Pretty much : Pull vs Push • Munin server fetch data from Munin Clients (munin-nodes) – Can quickly overload the Munin server in disk I/O and CPU – Data collected in sequential order impacted by previous run time and server load • Ganglia Client send data to representative clusters nodes. Data get federated periodically by a Gmetad process. – Lighter on the aggregation side – Clients push data at defined interval – Can use threshold to send data only when it make sense » using time_threshold and value_threshold in the metric – Ganglia is designed for Clusters and Grids • You can use multiple layer of gmond/gmetad process • You don’t need to manually add servers to your configuration 2011 TubeMogul Incorporated All rights reserved. 9
  • 10. Monitoring with Nagios 2011 TubeMogul Incorporated All rights reserved. 10
  • 11. Automating Nagios configuration • Puppet will configure our monitoring instance in each Region – We use Nagios regex : use_regexp_matching=1 – But we don’t use true regex : use_true_regexp_matching=0 – We use NSCA with Upstart – We don’t use the perfdata – We includes our configurations from 3 directories - objects => templates, contacts, commands, event_handlers - servers => contain a configuration file for each server - clusters => contain a configuration file for each cluster 2011 TubeMogul Incorporated All rights reserved. 11
  • 12. Automating Nagios configuration Process of event when starting a new host and add it to our monitoring: 1. We start a new instance using Cerveza and Cloud-init 2. Puppet configure Gmond on the instance 3. Our monitoring server running Gmetad get data from the new instance 4. A Nagios check run every minute and look for new hosts in Ganglia 5. If a new host is found, the check script rebuild the Nagios config and reload Nagios 6. If the config is corrupt, the check script will send a critical alert 2011 TubeMogul Incorporated All rights reserved. 12
  • 13. Automating Nagios configuration • Each server configuration is generated from a template • Our nagios plugin “check_tm_clusters”, goes over the RRD files generated by Ganglia • If a new host is found, it simply copy the template to the servers config directory and replace the variables as reported by Ganglia and looking at DNS entries 2011 TubeMogul Incorporated All rights reserved. 13
  • 14. Reducing noise and false positive • We disable most notification and only care of a cluster status • Most of our checks are based on Ganglia RRD files 2011 TubeMogul Incorporated All rights reserved. 14
  • 15. Reducing noise and false positive • It become really easy to monitor any metrics returned by Ganglia 2011 TubeMogul Incorporated All rights reserved. 15
  • 16. Reducing noise and false positive • We can check cluster status by hosts/services but also per returned messages ! 2011 TubeMogul Incorporated All rights reserved. 16
  • 17. Reducing noise and false positive • We extensively use our “check_cluster” plugin • We limit as much as possible email notification • We use a custom variable _PAGING to identify pageable services • Paging ONLY on Critical alerts for services/hosts with _PAGING=yes • Use different contacts and time periods to send alerts to the right person • We use Nagios Checker for FireFox and Chrome 2011 TubeMogul Incorporated All rights reserved. 17
  • 18. Thank You... TubeMogul is Hiring ! http://www.tubemogul.com/company/careers jobs@tubemogul.com Follow us on Twitter @tubemogul @orieg 2011 TubeMogul Incorporated All rights reserved. 18