How Zynga handles monitoring at
scale in its hybrid zCloud

Nov 12th, 2013
Matt West : mwest@zynga.com
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/zynga-zcloud

Presented at QCon San Francisco
www.qconsf.com
With Great Scale Come Great Challenges…
• The Sky is Falling!
– Immense alert volumes
– Irregular checks with inconsistent results
– Questionable environment coverage
• Touching the Oven!
– Investigate a set of standards to provide to customers
– Leverage an API for host information
– Use asynchronous queues for job execution at scale
• Into the Trenches!
– Implement findings with customers
• Profit!

Nagios, Gearman, and Mod-Gearman
• Nagios is susceptible to processing delays for various reasons.
• Tuning configuration parameters can help.
• Gearman / Mod-Gearman gives us access to a designated pool of workers.
– Can grow and shrink with demand from Nagios.
– Written in C and open source.
– Practical and easy to deploy.

[Figures: Non Mod-Gearman Enabled Nagios; Mod-Gearman Enabled Nagios]

Nagios, Gearman, and Mod-Gearman Continued…
• The Nagios daemon loads a NEB (Nagios Event Broker) module.
• The NEB module either inserts each job into the requested Gearman queue or skips it.
• Workers running on various servers execute commands on behalf of the Nagios instance.
• Workers then insert the results of executing the command into the results Gearman queue for processing.
• Result worker(s) consume the results queue and pass the information back into the Nagios process for any further processing and handling.
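
The flow above can be sketched in a few lines. This is only an illustration of the queue-and-worker pattern, not Mod-Gearman itself: it uses Python's in-process `queue` and `threading` modules in place of a Gearman server, and the names `worker`, `run_checks`, and the sample hosts are hypothetical. The exit codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL).

```python
import queue
import subprocess
import sys
import threading


def worker(check_queue, result_queue):
    """Pull check jobs, execute them, and push results; stop on a None sentinel."""
    while True:
        job = check_queue.get()
        if job is None:
            break
        host, command = job
        proc = subprocess.run(command, capture_output=True, text=True)
        # The exit code carries the check state, stdout carries the plugin output.
        result_queue.put((host, proc.returncode, proc.stdout.strip()))


def run_checks(jobs, num_workers=4):
    """Fan jobs out to a pool of workers and collect everything they report."""
    check_queue, result_queue = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(check_queue, result_queue))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for job in jobs:
        check_queue.put(job)
    for _ in threads:
        check_queue.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    results = []
    while not result_queue.empty():
        results.append(result_queue.get())
    return results


if __name__ == "__main__":
    # Stand-in "plugins": tiny Python one-liners instead of real check commands.
    jobs = [
        ("web01", [sys.executable, "-c", "print('OK')"]),
        ("db01", [sys.executable, "-c", "import sys; print('CRITICAL'); sys.exit(2)"]),
    ]
    for host, state, output in sorted(run_checks(jobs, num_workers=2)):
        print(host, state, output)
```

Because the scheduler only enqueues jobs and reads results, the worker pool can be resized without touching the scheduling side, which is the property the slides attribute to Mod-Gearman.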

Initial Results
• Host Check Latency Times in seconds (min/max/avg)
– Standard: 107.62 / 111.23 / 109.656
– Mod-Gearman: 0 / 6.59 / 2.753
– Savings: 100.00% / 94.08% / 97.49%
• Host Check Execution Times in seconds (min/max/avg)
– Standard: 3.02 / 4.03 / 4.014
– Mod-Gearman: 0.5 / 0.53 / 0.508
– Savings: 83.44% / 86.85% / 87.34%
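
The "Savings" rows are consistent with a simple relative reduction, (1 - Mod-Gearman / Standard) × 100. The slides do not state the formula, so treat this as an assumption that happens to reproduce the published numbers; the function name `savings_percent` is illustrative.

```python
def savings_percent(standard, mod_gearman):
    """Relative reduction of a timing value, as a percentage rounded to 2 places."""
    return round((1 - mod_gearman / standard) * 100, 2)


# Host check latency in seconds (min, max, avg), taken from the slide.
latency_standard = (107.62, 111.23, 109.656)
latency_mod_gearman = (0, 6.59, 2.753)

# Host check execution time in seconds (min, max, avg), taken from the slide.
exec_standard = (3.02, 4.03, 4.014)
exec_mod_gearman = (0.5, 0.53, 0.508)

latency_savings = [
    savings_percent(s, m) for s, m in zip(latency_standard, latency_mod_gearman)
]
exec_savings = [
    savings_percent(s, m) for s, m in zip(exec_standard, exec_mod_gearman)
]

print(latency_savings)  # [100.0, 94.08, 97.49]
print(exec_savings)     # [83.44, 86.85, 87.34]
```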
Initial Results Continued…
• Service Check Latency Times in seconds (min/max/avg)
– Standard: 47.21 / 118.17 / 110.78
– Mod-Gearman: 0 / 8.05 / 0.405
– Savings: 100.00% / 93.19% / 99.63%
• Service Check Execution Times in seconds (min/max/avg)
– Standard: 0.01 / 4.21 / 0.18
– Mod-Gearman: 0 / 3.24 / 0.121
– Savings: 100.00% / 23.04% / 32.78%
Initial Results Continued…
• Host and Service Exec Times stay fairly stable.
• Host and Service Latency Times are immediately reduced.
• Achieved even while adding more hosts and services to this cluster.
– Standard: 10452 Services, 1294 Hosts
– Mod-Gearman: 17996 Services, 1374 Hosts
– Difference: 7544 Services (+72.18%), 80 Hosts (+6.18%)

Our Monitoring Scaling Pipe Dream
• Saigon (Centralized Nagios Configuration Management)
– What if some of those initial problems weren't problems, because everyone came to us for Nagios solutions?
• Distributable result information for various customers
– What if all the customers could register an API callback for information about Nagios alerts they don't control?
• Increased usability of host information from external APIs
– What if we didn't have to wait around for host information to come back from off-site APIs due to latency issues?

Saigon and Beanstalkd
• Saigon UI Explanation
– Beanstalkd integration
– RPM Builder
– Configuration Viewing
– Configuration Tester
– Configuration Version Diffing
• Saigon API Explanation
– Caching Layer
– RESTful Syntax
– Scripted Consumers

Distributed Results with Beanstalkd
• Nagios
– Script sends data to an API to be placed into Beanstalkd.

• Beanstalkd
– Reduces work done by the Nagios server to the bare minimum.
– Possible customers for Nagios results:
– Stats and Analytics
– Lifetime Server State Change Logs
– External Break/Fix Systems
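
As a rough illustration of the enqueue step, the sketch below shows how a result could be framed for Beanstalkd. The `put <pri> <delay> <ttr> <bytes>` framing is part of the Beanstalkd text protocol; everything else is an assumption made for the example: the JSON field names, the default priority and TTR values, and the helper names `encode_put` and `encode_result` are hypothetical and do not come from the talk.

```python
import json


def encode_put(body: bytes, priority: int = 1024, delay: int = 0, ttr: int = 60) -> bytes:
    """Frame a job body as a Beanstalkd 'put' command (text protocol)."""
    header = f"put {priority} {delay} {ttr} {len(body)}\r\n".encode("ascii")
    return header + body + b"\r\n"


def encode_result(host: str, service: str, state: int, output: str) -> bytes:
    """Serialize one Nagios result as JSON and frame it for Beanstalkd.

    The payload layout is hypothetical; the talk does not describe the format.
    """
    payload = {"host": host, "service": service, "state": state, "output": output}
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return encode_put(body)


if __name__ == "__main__":
    # These bytes would be written to a socket connected to the Beanstalkd server.
    print(encode_put(b"hello"))  # b'put 1024 0 60 5\r\nhello\r\n'
    print(encode_result("web01", "HTTP", 2, "CRITICAL - connection refused"))
```

Each consumer (stats, state change logs, break/fix systems) can then reserve jobs at its own pace, so the Nagios server itself only pays for the single hand-off to the API.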

Rightscale Cache and Beanstalkd
• External calls to the Rightscale API could be untimely.
– Fairly reliable at returning small sets of data.
– Certain large requests had a 10% chance of failing.
• Implemented Host Hot Cache
– Leveraged Beanstalkd to manage sub-jobs of global jobs.
– Beanstalkd is used to keep global recurring jobs running.
– Hot Cache is completely refreshed once every 4 hours.
• Fronted by RESTful API
– Allows for single, multi, or global host invalidation or revalidation.
– Created jobs for surfacing known problems between Cloudstack, Rightscale and our Physical Hosts.
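
A minimal sketch of the cache semantics described above, under stated assumptions: it is an in-memory Python class, the name `HostHotCache`, the `fetch` callback (standing in for a slow external API call), and the injectable `clock` are all hypothetical, and entries expire lazily on read after 4 hours. The system in the talk instead refreshes the cache through recurring Beanstalkd jobs, which this sketch does not model.

```python
import time

FOUR_HOURS = 4 * 60 * 60


class HostHotCache:
    """In-memory host cache with single, multi, and global invalidation."""

    def __init__(self, fetch, ttl_seconds=FOUR_HOURS, clock=time.monotonic):
        self._fetch = fetch      # callable: host name -> host information
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be exercised in tests
        self._entries = {}       # host name -> (stored_at, host information)

    def get(self, host):
        """Return cached information, fetching it if missing or expired."""
        entry = self._entries.get(host)
        now = self._clock()
        if entry is None or now - entry[0] >= self._ttl:
            entry = (now, self._fetch(host))
            self._entries[host] = entry
        return entry[1]

    def invalidate(self, hosts=None):
        """Drop the given hosts, or every entry when hosts is None (global)."""
        if hosts is None:
            self._entries.clear()
            return
        for host in hosts:
            self._entries.pop(host, None)

    def revalidate(self, hosts):
        """Refetch the given hosts immediately, replacing any cached copy."""
        for host in hosts:
            self._entries[host] = (self._clock(), self._fetch(host))


if __name__ == "__main__":
    cache = HostHotCache(fetch=lambda host: {"host": host, "cloud": "unknown"})
    print(cache.get("web01"))    # first read calls fetch
    print(cache.get("web01"))    # second read is served from the cache
    cache.invalidate(["web01"])  # single-host invalidation
    cache.invalidate()           # global invalidation
```

Callers read from the cache rather than the external API, so a slow or failing upstream request affects only the refresh path and not every lookup.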

Software Used
• Nagios : http://nagios.org
– v3.2.3 and v3.5.0
• Gearman : http://gearman.org
– v0.25
• Mod-Gearman : https://labs.consol.de/lang/en/nagios/mod-gearman/
– v1.4.2
• Beanstalkd : http://kr.github.io/beanstalkd/
– v1.4.6
• Check-MK : http://mathias-kettner.com/check_mk.html
– v1.2.2p1

Thank you…

We can now begin the Interrogation Gauntlet… ;)


