Optimizing Your Cloud Applications in RightScale (webinar, October 13, 2011)
Your Panel Today
  • Presenting: Raphael Simon, Sr. Systems Architect, RightScale
  • Q&A: Jordan Evans, Account Manager, RightScale
Please use the “Questions” window to ask questions any time!
Agenda
  • Introduction
  • 3-tier application architecture
  • Vertical & horizontal scaling
  • RightScale monitoring and cluster graphs
  • New Relic RPM
  • Support for optimizing DB performance
  • Load testing
Cloud computing characteristics
  • Multi-tenancy
  • Shared resource pooling
  • Geo-distribution and ubiquitous network access
  • Service oriented
  • Dynamic resource provisioning
  • Self-organizing
  • Utility-based pricing
Cloud computing advantages
  • No upfront investment
  • Lower operating costs
  • Highly scalable
  • Easy access
  • Reduces business risk and maintenance costs
  • Enables process automation
3-tier application architecture: load balancers; an array of application servers; a master-slave database tier
Optimizing Your Cloud Applications in RightScale: Vertical & Horizontal Scaling
Cloud performance optimization
  • Instance size (vertical scaling)
  • Instance autoscaling (horizontal scaling)
      • Server arrays
  • RightScale support for performance optimization
      • ServerTemplates are configured to capture performance data
          • collectd RightScripts
          • Hardware & OS monitoring data
          • Specialized plugins: MySQL, HAProxy, Apache, Nginx, IIS, etc.
      • Monitoring graphs: individual, cluster, stacked, heat maps
      • Alerts & escalations
      • New Relic RPM
Scaling up, spectrum of instance sizes: compute units vs. memory
Server arrays provide horizontal scaling
Server arrays provide horizontal scaling
  • The array scales up or down based on performance votes
  • Tags allow scaling on an arbitrary decision set
  • Decision threshold controls reaction time
  • Sleep time allows new resources to have an impact
  • Scaling can be time dependent
  • Detailed setup instructions: http://bit.ly/c1oLr2
  • Fast response to changes in load conditions using alerts
  • Allocation of servers to availability zones based on weights
  • Deployment-based so configuration is consistent
  • Arrays can be pre-scaled to support anticipated demand
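The voting behavior listed on this slide can be sketched in a few lines. This is only an illustration of the idea, not RightScale's implementation: the function names, the 51% default threshold, the resize step, and the array bounds are all assumptions made for the example.

```python
from collections import Counter


def scaling_decision(votes, decision_threshold=0.51):
    """Return 'grow', 'shrink', or 'none' for one evaluation round.

    votes maps a server name to 'grow', 'shrink', or None (no vote).
    decision_threshold is the fraction of servers that must agree;
    the default is an assumed value, not a RightScale default.
    """
    if not votes:
        return "none"
    tally = Counter(v for v in votes.values() if v in ("grow", "shrink"))
    total = len(votes)
    # Growing is checked first so an overloaded array is never shrunk.
    if tally["grow"] / total >= decision_threshold:
        return "grow"
    if tally["shrink"] / total >= decision_threshold:
        return "shrink"
    return "none"


def resize(current, decision, minimum=2, maximum=20, step=2):
    """Apply a decision while keeping the array inside its bounds."""
    if decision == "grow":
        return min(maximum, current + step)
    if decision == "shrink":
        return max(minimum, current - step)
    return current


votes = {"app1": "grow", "app2": "grow", "app3": "grow", "app4": None}
decision = scaling_decision(votes)
print(decision, resize(4, decision))
```

A higher threshold makes the array react more slowly because more servers must agree. After a resize, a caller would wait for the sleep time before evaluating votes again, so that the new servers have a chance to affect the load.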
Optimizing Your Cloud Applications in RightScale: Monitoring & Cluster Graphs with RightScale
Server monitoring graphs
Cluster monitoring
  • Individual graphs
      • Good for a dozen servers
      • Displays all standard graphs with full detail
  • Stacked graphs
      • Displays the contribution of many servers to a total
      • Great to see the sum and variability of activity in a cluster
      • Difficult to make out individual servers
      • Examples: requests/sec, CPU busy cycles, I/O bytes/sec
  • Heat maps
      • Displays a bar for each server
      • Great to see uneven distribution across servers
      • Great to quickly spot performance problems across many servers
      • Difficult to read absolute values or see the total cluster activity
Cluster monitoring architecture
  • Monitoring front-end servers pull data from storage servers
  • Up to 100 servers on one graph (to be increased)
  • Diagram labels: your servers, monitoring storage servers, monitoring front-end servers
Cluster monitoring: current cluster monitoring shows one graph per server
Stacked graphs: each color band shows the contribution of one server; servers are stacked on top of one another
Heat maps: each horizontal strip shows one server; the color shows how “hot” the server is running
Heat map with 100 servers
Stacked graph of the same 100 servers
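To make the difference between the two cluster views concrete, the sketch below derives both from the same per-server samples. It is a toy text rendering under assumed names and sample data, not how the RightScale graphs (which the notes say are produced with rrdtool) are built.

```python
def stacked_bands(series):
    """Return (server, lower, upper) bands stacked on top of one another.

    series maps a server name to a list of samples of equal length.
    The upper edge of the last band is the cluster total.
    """
    bands = []
    base = None
    for server in sorted(series):
        values = series[server]
        if base is None:
            base = [0.0] * len(values)
        top = [b + v for b, v in zip(base, values)]
        bands.append((server, base, top))
        base = top
    return bands


def heat_rows(series, levels=" .:*#", max_value=100.0):
    """Return one text strip per server; denser characters mean 'hotter'."""
    rows = {}
    for server in sorted(series):
        cells = []
        for value in series[server]:
            fraction = min(max(value / max_value, 0.0), 1.0)
            index = min(int(fraction * len(levels)), len(levels) - 1)
            cells.append(levels[index])
        rows[server] = "".join(cells)
    return rows


# Hypothetical CPU-busy percentages for two servers over three intervals.
cpu = {"app1": [10, 25, 30], "app2": [15, 35, 95]}
print(stacked_bands(cpu)[-1][2])  # upper edge of the top band = cluster total
for name, strip in heat_rows(cpu).items():
    print(name, "|" + strip + "|")
```

The stacked view answers "how busy is the cluster in total", while the strips make the one hot server stand out even though its absolute value is hard to read.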
Optimizing Your Cloud Applications in RightScale: Application Performance Analytics with New Relic
New Relic RPM: Real-Time App Performance Analytics
  • Supports Ruby, PHP, Java & .NET
  • SQL & NoSQL performance
  • Web transaction tracing
  • Performance notifications
  • Availability monitoring
  • Scalability analysis
New Relic RPM: direct access from the RightScale dashboard
New Relic RPM: historical statistics over a period of time
New Relic RPM: distribution of the most time-consuming requests
New Relic RPM: statistics about response times from different countries
New Relic RPM: detailed response times by browser
New Relic RPM, 2 examples: an expensive query; the N+1 query problem
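The N+1 query problem named on this slide is easy to reproduce. The sketch below is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the tables, data, and helper names are hypothetical. It counts the statements issued by a per-row lookup loop and by a single join that returns the same result.

```python
import sqlite3


class CountingConnection:
    """Thin wrapper that counts how many statements are executed."""

    def __init__(self, conn):
        self.conn = conn
        self.count = 0

    def run(self, sql, params=()):
        self.count += 1
        return self.conn.execute(sql, params).fetchall()


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
        INSERT INTO authors VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
        INSERT INTO posts VALUES (1, 1, 'a'), (2, 1, 'b'), (3, 2, 'c'), (4, 3, 'd');
    """)
    return conn


def titles_n_plus_one(db):
    """One query for the authors, then one more query per author."""
    result = {}
    for author_id, name in db.run("SELECT id, name FROM authors ORDER BY id"):
        rows = db.run(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id", (author_id,)
        )
        result[name] = [title for (title,) in rows]
    return result


def titles_joined(db):
    """The same result from a single join."""
    result = {}
    rows = db.run(
        "SELECT a.name, p.title FROM authors a "
        "LEFT JOIN posts p ON p.author_id = a.id ORDER BY a.id, p.id"
    )
    for name, title in rows:
        result.setdefault(name, [])
        if title is not None:
            result[name].append(title)
    return result


slow = CountingConnection(make_db())
fast = CountingConnection(make_db())
assert titles_n_plus_one(slow) == titles_joined(fast)
print("loop:", slow.count, "queries; join:", fast.count, "query")
```

With N authors the loop issues N+1 statements, each paying a network round trip on a real database tier, which is why this pattern tends to show up in transaction traces as many small, similar queries.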
Optimizing Your Cloud Applications in RightScale: Optimizing Database Performance
Optimizing DB performance
  • RightScale MySQL ServerTemplates
      • Configuration files tailored to instance size
          • innodb_buffer_pool_size
          • key_buffer_size
          • thread_size
          • sort_buffer_size
  • The never-ending task of identifying current bottlenecks
      • Disk seeks
      • Performance of disk operations
      • Scale up when the working set cannot fit in memory, to avoid active swapping
  • Constant monitoring of performance graphs, logs and queries
  • Schema considerations
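As a rough illustration of "configuration files tailored to instance size", the sketch below derives two of the listed settings from the instance's RAM. The percentages are assumptions chosen for the example (a common rule of thumb for a dedicated InnoDB server), not the values used in the RightScale ServerTemplates, and the function names are hypothetical.

```python
def mysql_memory_settings(ram_mb, innodb_percent=70, key_buffer_percent=5):
    """Size memory-related MySQL settings as a share of instance RAM.

    Both percentages are illustrative assumptions, not RightScale values.
    Integer arithmetic avoids floating-point rounding surprises.
    """
    innodb_mb = ram_mb * innodb_percent // 100
    key_buffer_mb = max(16, ram_mb * key_buffer_percent // 100)
    return {
        "innodb_buffer_pool_size": f"{innodb_mb}M",
        "key_buffer_size": f"{key_buffer_mb}M",
    }


def render_my_cnf(settings):
    """Render the settings as a [mysqld] section of a my.cnf file."""
    lines = ["[mysqld]"]
    lines += [f"{name} = {value}" for name, value in sorted(settings.items())]
    return "\n".join(lines)


# 7680 MB corresponds to an instance with 7.5 GB of RAM.
print(render_my_cnf(mysql_memory_settings(7680)))
```

Moving to a larger instance then only requires regenerating the file with the new RAM figure, which is the point of keeping the sizing logic in a template rather than in hand-edited files.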
Schema considerations
  • Lookups need to be indexed
  • Sorting requires an index
  • Joins need to be done on indices
      • Become slower as tables grow
  • Compound indices should be used consistently
  • Do not abuse indices
      • Each index requires a disk write
  • Compact tables if they become fragmented
      • Deleted rows do not remove the corresponding index entries
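The first point, that lookups need to be indexed, can be checked by asking the database for its query plan. The sketch below is an illustration under stated assumptions: it uses Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN as a stand-in for MySQL's EXPLAIN, and the table, index, and helper names are hypothetical.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan description for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"user{i}") for i in range(1000)],
)

lookup = "SELECT name FROM users WHERE email = ?"
args = ("user42@example.com",)

before = query_plan(conn, lookup, args)   # no index yet: full table scan
conn.execute("CREATE INDEX idx_users_email ON users (email)")
after = query_plan(conn, lookup, args)    # same query, now served by the index

print("before:", before)
print("after: ", after)
```

The exact wording of the plan differs between SQLite versions and differs again from MySQL's EXPLAIN output, but the change from a scan to an index search is the thing to look for in both.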
Monitoring DB performance
  • Standard collectd statistics
      • User vs. wait time (disk operations)
      • Performance of disk operations
      • Scale up when the working set cannot fit in memory
  • MySQL collectd plugin
      • Monitor INSERT, SELECT, UPDATE operations
      • The breakdown of read operations can indicate missing indices
  • Monitoring the /var/log/mysqlslow.log file
      • Identify slow queries
      • Use the MySQL EXPLAIN command to identify the query plan
MySQL collectd plugin
  • Uses the MySQL SHOW STATUS command to collect statistics
  • A large set of counters divided into 10 categories: Connections, IO Requests, Select Rates, Read Rates, Key Rates, Commands Rates, Query Cache, Tables, Memory, Misc.
MySQL collectd plugin: uses the MySQL SHOW STATUS command to collect statistics
mysqlslow.log & the EXPLAIN command
MySQL performance depends on locality
  • Wait time should be minimal when the working set fits in memory
  • Performance degrades once wait time is significant
  • Graph annotations: wait time insignificant; user time dominates
MySQL reads graphs: read-random-next represents a table scan; read-next represents an index scan
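The two read series on this slide correspond to the MySQL status counters Handler_read_rnd_next and Handler_read_next reported by SHOW STATUS. The sketch below shows one way to turn two counter snapshots into per-second rates and a table-scan share. The snapshot values are made up, the helper names are hypothetical, and the 20-second interval simply mirrors the reporting interval mentioned in the speaker notes.

```python
SCAN_COUNTER = "Handler_read_rnd_next"   # rows read while scanning a table
INDEX_COUNTER = "Handler_read_next"      # rows read in index order


def read_rates(previous, current, interval_seconds):
    """Convert two SHOW STATUS snapshots into reads per second."""
    return {
        name: (current[name] - previous[name]) / interval_seconds
        for name in (SCAN_COUNTER, INDEX_COUNTER)
    }


def table_scan_share(rates):
    """Fraction of these row reads that came from table scans."""
    total = rates[SCAN_COUNTER] + rates[INDEX_COUNTER]
    return rates[SCAN_COUNTER] / total if total else 0.0


# Made-up snapshots taken 20 seconds apart.
earlier = {"Handler_read_rnd_next": 1000, "Handler_read_next": 5000}
later = {"Handler_read_rnd_next": 91000, "Handler_read_next": 15000}

rates = read_rates(earlier, later, interval_seconds=20)
print(rates, round(table_scan_share(rates), 2))
```

A persistently high scan share is a hint, not proof, that some lookup lacks an index; the slow query log and EXPLAIN are the tools for finding which one.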
Optimizing Your Cloud Applications in RightScale: Load Testing
Load testing using httperf
  • RightScale provides ServerTemplates in the marketplace: https://my.rightscale.com/library/server_templates/Httperf-Load-Tester/24714
  • Tutorial on httperf setup and configuration: http://support.rightscale.com/03-Tutorials/02-AWS/E2E_Examples/E2E_Gaming_Deployment/Adding_Httperf_Load_Tester
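For readers who have not used httperf, the sketch below builds command lines for a simple rate sweep. It only assembles and prints the commands; the host name, rates, and duration are placeholders, and the helper names are hypothetical. The options used (--server, --port, --uri, --rate, --num-conns, --num-calls, --timeout) are standard httperf options.

```python
import shlex


def httperf_command(server, uri="/", port=80, rate=50,
                    num_conns=1000, num_calls=1, timeout=5):
    """Build an httperf argument list for one fixed-rate run."""
    return [
        "httperf",
        "--server", server,
        "--port", str(port),
        "--uri", uri,
        "--rate", str(rate),            # new connections per second
        "--num-conns", str(num_conns),  # total connections for the run
        "--num-calls", str(num_calls),  # requests per connection
        "--timeout", str(timeout),      # seconds before a request counts as failed
    ]


def rate_sweep(server, rates, duration_seconds=60):
    """Yield one command per rate, each sized to last about duration_seconds."""
    for rate in rates:
        yield httperf_command(server, rate=rate, num_conns=rate * duration_seconds)


# Placeholder load balancer host name.
for command in rate_sweep("lb.example.com", [10, 20, 40]):
    print(" ".join(shlex.quote(part) for part in command))
```

Stepping the rate up while watching the monitoring graphs shows at which request rate response times or error counts start to climb, which is the capacity figure a scaling threshold should stay below.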

  • 43. Ask questions at the Genius Bar

Editor's Notes

  1. Cluster monitoring is powerful because it provides different types of views into the operation of large clusters of servers.
  2. Cluster monitoring is powerful because it provides different types of views into the operation of large clusters of servers.
  3. The architecture behind the cluster monitoring is rather extensive. Customer (i.e. your) servers send monitoring data every 20 seconds to our servers. The data points are cached in memory on those servers and flushed to disk periodically. Cluster monitoring graphs are produced on separate front-end servers, which pull the data from over 100 monitoring storage servers. The graphs are produced using rrdtool and auto-refresh.
  4. Walk through of how it works: in any deployment, go to the monitoring tab; select servers; select the metric to plot; use the familiar controls to switch time period and graph size. It displays one graph per server, here core1.rightscale.com through core8.rightscale.com. In this example the graphs show CPU utilization for the past week, where blue is busy time and green is idle.
  5. Individual graphs only work for so many servers, and they don’t show what is happening in aggregate. Stacked graphs stack the contribution of each server on top of one another. Walk through what the graph shows.
  6. Stacked graphs are great to see the aggregate, but it is often difficult to see abnormal server behavior. Heat maps show many servers on one graph by plotting one horizontal bar per server. The time axis is the same for all servers and is shown at the bottom of the graph. The color of the bar shows the value of the metric for the server. Walk through the graph. It’s easy to see that there are 6 servers sharing the load, and two servers that are different.
  7. At scale, this is how it all looks and comes together. This example is real: it shows an incident we had with our monitoring cluster a few months ago. This heat map shows 100 servers out of one of our monitoring clusters (we want to be vague here…). When there are more than 100 servers, the heat map shows a sampling of 100. Describe the sampling: most recently launched, longest running, some of each server template, rest random. Story: this heat map plots I/O wait for our monitoring servers on a day when we suddenly received a number of alerts for a few servers. The heat map shows these servers clearly as red bands starting between 7am and 8am. So we could clearly see that something was going on with a small number of servers and that it started at more or less the same time on all of them. To see what happened in aggregate, we can switch graph type…
  8. This shows the same incident as on the previous slide, but with a timescale of a week. It shows the number of servers handled by each monitoring server, i.e. each color bar shows one server. It is easy to see that some customer launched a large number of servers right at the time the overload began. Further investigation showed that, due to a bug, these servers were allocated unevenly across the cluster, causing the overload.
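The sampling rule mentioned in note 7 (most recently launched, longest running, some of each server template, rest random) can be sketched as follows. Only the four categories come from the notes; the quota for each category, the record fields, and the function name are assumptions made for illustration.

```python
import random


def sample_servers(servers, limit=100, newest=10, oldest=10,
                   per_template=2, seed=None):
    """Pick at most `limit` servers for a heat map.

    servers is a list of dicts with 'name', 'launched_at', and 'template'.
    The quotas (newest, oldest, per_template) are assumed values.
    """
    if len(servers) <= limit:
        return list(servers)
    rng = random.Random(seed)
    chosen = {}

    def take(candidates, count):
        for server in candidates:
            if count <= 0 or len(chosen) >= limit:
                break
            if server["name"] not in chosen:
                chosen[server["name"]] = server
                count -= 1

    by_launch = sorted(servers, key=lambda s: s["launched_at"])
    take(reversed(by_launch), newest)        # most recently launched
    take(by_launch, oldest)                  # longest running
    templates = {}
    for server in servers:
        templates.setdefault(server["template"], []).append(server)
    for template in sorted(templates):       # some of each server template
        take(templates[template], per_template)
    rest = [s for s in servers if s["name"] not in chosen]
    rng.shuffle(rest)                        # the remainder is random
    take(rest, limit - len(chosen))
    return list(chosen.values())


fleet = [
    {"name": f"s{i}", "launched_at": i, "template": f"t{i % 3}"}
    for i in range(250)
]
picked = sample_servers(fleet, seed=1)
print(len(picked))
```

Reserving slots for the newest and oldest servers and for every template keeps the interesting edge cases visible, while the random remainder keeps the sample representative of the cluster as a whole.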