CloudStack Scalability

     By Alex Huang
Current Status
• 10k resources managed per management server
  node
• Scales out horizontally (must disable stats
  collector)
• Real production deployment of tens of thousands
  of resources
• Internal testing with software simulators up to
  30k physical resources with 30k VMs managed by
  4 management server nodes
• We believe we can at least double that scale per
  management server node
Balancing Incoming Requests
• Each management server has two worker thread pools for incoming
  requests: effectively two servers in one.
   – Executor threads provided by tomcat
   – Job threads waiting on job queue
• All incoming requests that require mostly DB operations are short
  in duration and are executed by executor threads because incoming
  requests are already load balanced by the load balancer
• All incoming requests needing resources, which often have long
  running durations, are checked against ACL by the executor threads
  and then queued and picked up by job threads.
• The # of job threads is scaled to the # of DB connections available to
  the management server
• Requests may take a long time depending on the constraint of the
  resources but they don’t fail.
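A minimal sketch of the two-pool flow described above, assuming a toy ACL rule and made-up command names (`list_vms`, `deploy_vm`); none of these names come from CloudStack. The executor path answers DB-only requests inline, while resource requests pass the ACL check and are then queued for job threads.

```python
import itertools
import queue
import threading

# All names here are hypothetical; this is not CloudStack's API.
job_queue = queue.Queue()
job_results = {}
_job_ids = itertools.count(1)


def acl_allows(user, command):
    # Stand-in ACL rule: only "admin" may run resource commands.
    return user == "admin"


def handle_request(user, command, needs_resource):
    """Runs on an executor thread (the pool the servlet container provides)."""
    if not needs_resource:
        return ("done", command)       # short DB-only work, answered inline
    if not acl_allows(user, command):
        return ("denied", command)     # rejected before it ever reaches the queue
    job_id = next(_job_ids)
    job_queue.put((job_id, command))   # long-running work is handed to job threads
    return ("queued", job_id)


def job_worker():
    """Runs on a job thread: blocks on the queue and executes one job at a time."""
    while True:
        item = job_queue.get()
        if item is None:               # sentinel: shut this worker down
            job_queue.task_done()
            break
        job_id, command = item
        job_results[job_id] = f"finished {command}"
        job_queue.task_done()


def run_job_threads(n_threads):
    # In the slide, n_threads would follow the number of DB connections available.
    threads = [threading.Thread(target=job_worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    return threads


def stop_job_threads(threads):
    for _ in threads:
        job_queue.put(None)
    for t in threads:
        t.join()
```

Because a queued job simply waits until a job thread is free, a busy resource delays the request rather than failing it.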
The Much Harder Problem
• CloudStack performs a number of tasks on behalf of
  the users and those tasks increase with the number of
  virtual and physical resources available
   –   VM Sync
   –   SG Sync
   –   Hardware capacity monitoring
   –   Virtual resource usage statistics collection
   –   More to come
• When done in the hundreds, no big deal.
• As numbers increase, this problem magnifies.
• How to scale this horizontally across management
  servers?
Comparison of two Approaches
• Stats Collector – collects capacity statistics
   – Fires every five minutes to collect stats about host CPU and
     memory capacity
   – Smart server and dumb client model: the resource only
     collects info and the management server processes it
   – Runs the same way on every management server
• VM Sync
   – Fires every minute
   – Peer to peer model: the resource does a full sync on
     connection and delta syncs thereafter. The management
     server trusts the resource for correct information.
   – Only runs against resources connected to the management
     server node
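The VM Sync model above (one full sync at connection, deltas afterwards) can be sketched roughly as follows. The class names, the state strings, and the `"Gone"` marker are illustrative assumptions, not CloudStack's actual types.

```python
class Resource:
    """Hypothetical host agent that remembers what it has already reported."""

    def __init__(self, vm_states):
        self.vm_states = dict(vm_states)   # current state seen on the hardware
        self.reported = {}                 # last state pushed to the server

    def full_sync(self):
        # Sent once, when the resource connects.
        self.reported = dict(self.vm_states)
        return dict(self.reported)

    def delta_sync(self):
        # Sent afterwards: only the VMs whose state differs from the last report.
        changes = {vm: state for vm, state in self.vm_states.items()
                   if self.reported.get(vm) != state}
        for vm in self.reported.keys() - self.vm_states.keys():
            changes[vm] = "Gone"           # assumed marker for a VM that disappeared
        self.reported = dict(self.vm_states)
        return changes


class ManagementServer:
    """Hypothetical server side: trusts whatever the resource reports."""

    def __init__(self):
        self.view = {}

    def apply(self, update):
        for vm, state in update.items():
            if state == "Gone":
                self.view.pop(vm, None)
            else:
                self.view[vm] = state
```

When nothing changed, `delta_sync()` returns an empty dict, so an idle host costs the management server almost nothing.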
Numbers
•   Assume 10k hosts and 500k VMs (50 VMs per host)
•   Stats Collector
     – Fires off 10k requests every 5 minutes or 33 requests a second.
     – Bad but not too bad: Occupies 33 threads every second.
     – But just wait:
          •   2 management servers: 66 requests a second
          •   3 management servers: 99 requests a second
     – It gets worse as the # of management servers increases because it does not auto-balance across
       management servers
     – Oh but it gets worse still: the 10k hosts are now spread across 3 management servers.
       While 99 requests are generated, the number of threads involved is three-fold because
       requests need to be routed to the right management server.
     – It keeps the management server 20% busy even with no load from incoming requests
•   VM Sync
     – Fires off 1 request at resource connection to sync about 50 VMs
     – After that, the resource pushes updates: it knows what it has pushed before and only pushes
       changes that are out-of-band.
     – So essentially no threads occupied for a much larger data set.
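The arithmetic behind these numbers, as a small sketch. The constants are the slide's assumptions; the function name is made up, and the per-server rate is rounded down to 33 to match the slide.

```python
HOSTS = 10_000
VMS = 500_000
STATS_INTERVAL_SECONDS = 5 * 60   # stats collector fires every five minutes


def stats_requests_per_second(management_servers):
    # Every management server polls every host, so the rate grows
    # linearly with the number of management servers.
    per_server = HOSTS // STATS_INTERVAL_SECONDS   # 10,000 / 300, rounded down
    return per_server * management_servers


def vm_sync_initial_requests():
    # VM Sync sends one full-sync request per host at connection time,
    # each covering VMS // HOSTS virtual machines.
    return HOSTS, VMS // HOSTS
```

Under these assumptions the stats collector costs 33, 66, and 99 requests a second for one, two, and three management servers, while VM Sync costs a one-time 10k requests covering 50 VMs each.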
What’s the Down Side?
• Resources must reconcile VM states caused by
  management server commands with the VM states
  they collect from the physical hardware, so
  they require more CPU
• Resources must use more memory to keep
  track of what amounts to a journal of changes
  since the last sync point.
• But data centers are full of these two
  resources.
Resource Load Balancing
• As management servers are added to the cluster, resources are rebalanced
  seamlessly.
    –   MS2 signals to MS1 to hand over a resource
    –   MS1 waits for the commands on the resource to finish
    –   MS1 holds further commands in a queue
    –   MS1 signals to MS2 to take over
    –   MS2 connects
    –   MS2 signals to MS1 to complete transfer
    –   MS1 discards its resource and flows the commands being held to MS2
• Listeners are provided to business logic to listen on connection status and
  adjust work based on who’s connected.
• By only working on resources that are connected to the management
  server the process is on, work is auto-balanced between management
  servers.
• Also reduces the message routing between the management servers.
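The handover steps listed above can be sketched as a single-threaded toy model. The class and method names are hypothetical, and "waiting for in-flight commands" is assumed to have already happened, since nothing here runs concurrently.

```python
from collections import deque


class ManagementServer:
    """Hypothetical management server node; not CloudStack's actual class."""

    def __init__(self, name):
        self.name = name
        self.owned = set()      # resources currently connected to this node
        self.held = {}          # resource -> commands held during a transfer
        self.executed = []      # (resource, command) pairs run on this node

    def connect(self, resource):
        # "MS2 connects": this node now talks to the resource directly.
        self.owned.add(resource)

    def send(self, resource, command):
        if resource in self.held:
            self.held[resource].append(command)   # transfer in progress: hold it
        elif resource in self.owned:
            self.executed.append((resource, command))
        else:
            raise LookupError(f"{self.name} does not own {resource}")

    def begin_handover(self, resource):
        # MS1 side: in-flight commands are assumed finished in this sketch;
        # from now on new commands are held in a queue.
        self.held[resource] = deque()

    def complete_handover(self, resource, new_owner):
        # MS1 side: discard the resource and flow the held commands to MS2.
        self.owned.discard(resource)
        for command in self.held.pop(resource):
            new_owner.send(resource, command)
```

A transfer then reads `ms1.begin_handover(r)`, `ms2.connect(r)`, `ms1.complete_handover(r, ms2)`; any command that reached MS1 in between is executed on MS2, in order.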
Designing for Scalability
• Take advantage of the most abundant resources in a data center
  (CPU, RAM)
• Auto-scale to the least abundant resource (DB)
• Do not hold DB connections/Transactions across resource calls.
   – Use lock table implementation (Merovingian2 or
     GenericDao.acquireLockInTable() call) over database row locks in this
     situation.
   – Database row locks are still fine for quick, short lockouts.
• Balance the resource intensive tasks as # of management server
  nodes increases and decreases
   – Use job queues to balance long running processes across management
     servers
   – Make use of resource rebalancing in CloudStack to auto-balance your
     workload.
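To illustrate the general lock-table idea, here is a sketch using Python's built-in `sqlite3`. It is not the Merovingian2 or `GenericDao.acquireLockInTable()` implementation; the `op_lock` table and the function names are assumptions. The point it shows is that the lock is a committed row, so no transaction or row lock stays open while the slow resource call runs.

```python
import sqlite3


def open_db():
    db = sqlite3.connect(":memory:")
    # Hypothetical lock table: one row per held lock.
    db.execute("CREATE TABLE op_lock (name TEXT PRIMARY KEY, owner TEXT NOT NULL)")
    db.commit()
    return db


def acquire(db, name, owner):
    """Take the lock by inserting a row; the primary key rejects a second holder."""
    try:
        db.execute("INSERT INTO op_lock (name, owner) VALUES (?, ?)", (name, owner))
        db.commit()            # committed at once: no transaction is left open
        return True
    except sqlite3.IntegrityError:
        db.rollback()
        return False


def release(db, name, owner):
    db.execute("DELETE FROM op_lock WHERE name = ? AND owner = ?", (name, owner))
    db.commit()


def call_resource_with_lock(db, name, owner, resource_call):
    """Run a long resource call under the table lock, without holding a transaction."""
    if not acquire(db, name, owner):
        return None
    try:
        return resource_call()   # may take minutes; the DB is not tied up meanwhile
    finally:
        release(db, name, owner)
```

A sketch like this leaves a stale row behind if the owner crashes, so a real lock table would presumably also need some form of expiry or owner liveness check.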
Reliability

By Alex Huang
The Five W’s of Unreliability
• What is unreliable? Everything
• Who is unreliable? Developers & administrators
• When does unreliability happen? 3:04 a.m. no
  matter which time zone… Any time.
• Where does unreliability happen? In carefully
  planned, everything-has-been-considered data
  centers.
• How does unreliability happen? Rather
  nonchalantly
Dealing with Unreliability
•   Don’t assume!
•   Don’t bang your head against the wall!
•   Know when you don’t know any better.
•   Ask for help!
Designs against Unreliability
• Management servers keep a heartbeat with the DB. One
  ping a minute.
• A management server self-fences if it cannot write the
  heartbeat
• Other management servers wait to make sure the down
  management server is no longer writing to the heartbeat
  and then signal interested software to recover
• Checkpoints at every call to a resource and code to deal
  with recovering from those checkpoints
• Database records are not actually deleted to help with
  manual recovery when needed
• Write code that is idempotent
• Respect modularity when writing your code
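The heartbeat and self-fencing bullets above can be sketched with an in-memory stand-in for the heartbeat table. The class names and the three-missed-beats threshold are assumptions; the slide only states one ping a minute.

```python
HEARTBEAT_INTERVAL = 60        # seconds: "one ping a minute" from the slide
MISSED_BEATS_BEFORE_DOWN = 3   # assumption, not stated in the slide


class HeartbeatTable:
    """In-memory stand-in for the heartbeat table in the DB."""

    def __init__(self):
        self.last_beat = {}
        self.available = True

    def write(self, server, now):
        if not self.available:
            raise ConnectionError("DB unreachable")
        self.last_beat[server] = now


class ManagementServer:
    """Hypothetical node that writes heartbeats and watches its peers."""

    def __init__(self, name, table):
        self.name = name
        self.table = table
        self.fenced = False

    def beat(self, now):
        if self.fenced:
            return
        try:
            self.table.write(self.name, now)
        except ConnectionError:
            self.fenced = True   # self-fence: stop acting on resources

    def peers_down(self, now):
        # A peer counts as down only after it has stayed silent long enough
        # that it can no longer be writing its heartbeat.
        limit = HEARTBEAT_INTERVAL * MISSED_BEATS_BEFORE_DOWN
        return sorted(server for server, t in self.table.last_beat.items()
                      if server != self.name and now - t > limit)
```

Waiting several intervals before declaring a peer down is what lets the surviving nodes assume the silent one has already fenced itself before recovery starts.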

Explore the UiPath Community and ways you can benefit on your journey to auto...
 
TrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie WorldTrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie World
 
EMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? WebinarEMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? Webinar
 
Top 10 Squarespace Development Companies
Top 10 Squarespace Development CompaniesTop 10 Squarespace Development Companies
Top 10 Squarespace Development Companies
 
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptxGraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
 

CloudStack Scalability

  • 1. CloudStack Scalability By Alex Huang
  • 2. Current Status
      • 10k resources managed per management server node
      • Scales out horizontally (the stats collector must be disabled)
      • Real production deployments of tens of thousands of resources
      • Internal testing with software simulators: up to 30k physical resources with 30k VMs, managed by 4 management server nodes
      • We believe we can at least double that scale per management server node
  • 3. Balancing Incoming Requests
      • Each management server has two worker thread pools for incoming requests: effectively two servers in one.
          – Executor threads provided by Tomcat
          – Job threads waiting on the job queue
      • Incoming requests that require mostly DB operations are short in duration. They are executed directly by the executor threads, because incoming requests are already balanced by the load balancer.
      • Incoming requests that need resources, and so often run for a long time, are checked against the ACL by the executor threads, then queued and picked up by the job threads.
      • The number of job threads is scaled to the number of DB connections available to the management server.
      • Requests may take a long time, depending on the constraints of the resources, but they don't fail.
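The split between executor threads and job threads can be sketched as follows. This is a minimal Python illustration, not CloudStack code: `DB_CONNECTIONS`, `acl_allows`, `handle_request` and the request dictionary layout are all hypothetical names chosen for the example.

```python
import queue
import threading

DB_CONNECTIONS = 4  # assumed pool size; job threads are scaled to it

job_queue = queue.Queue()
results = {}
results_lock = threading.Lock()


def acl_allows(user, request):
    # Hypothetical ACL rule: only "admin" may run admin-only requests.
    return user == "admin" or not request.get("admin_only", False)


def handle_request(user, request):
    """Runs on an executor thread (provided by Tomcat in the slides)."""
    if not acl_allows(user, request):
        return "denied"
    if request["needs_resource"]:
        job_queue.put(request)  # long-running: hand over to the job threads
        return "queued"
    return "done"  # short, DB-only request finished on the executor thread


def job_worker():
    """Runs on a job thread: waits on the job queue and processes requests."""
    while True:
        request = job_queue.get()
        with results_lock:
            results[request["id"]] = "completed"
        job_queue.task_done()


def start_job_threads(count=DB_CONNECTIONS):
    threads = [threading.Thread(target=job_worker, daemon=True) for _ in range(count)]
    for thread in threads:
        thread.start()
    return threads
```

Because queued work waits for a free job thread instead of being rejected, a burst of resource-bound requests becomes slower rather than failing, which matches the last bullet of the slide.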
  • 4. The Much Harder Problem
      • CloudStack performs a number of tasks on behalf of the users, and the number of those tasks increases with the number of virtual and physical resources available:
          – VM Sync
          – SG Sync
          – Hardware capacity monitoring
          – Virtual resource usage statistics collection
          – More to come
      • With hundreds of resources, this is no big deal.
      • As the numbers increase, the problem magnifies.
      • How do we scale this horizontally across management servers?
  • 5. Comparison of Two Approaches
      • Stats Collector: collects capacity statistics
          – Fires every five minutes to collect stats about host CPU and memory capacity
          – Smart server and dumb client model: the resource only collects info and the management server processes it
          – Runs the same way on every management server
      • VM Sync
          – Fires every minute
          – Peer-to-peer model: the resource does a full sync on connection and delta syncs thereafter. The management server trusts the resource for correct information.
          – Only runs against resources connected to the management server node
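The "full sync on connection, delta syncs thereafter" model can be illustrated with a small Python sketch. The `Resource` class, its fields and the `"Removed"` marker are assumptions made for this example; the sketch also simplifies by reporting every state change, whereas the slides describe pushing only out-of-band changes.

```python
class Resource:
    """A host-side agent that remembers what it last reported."""

    def __init__(self, vm_states):
        self.vm_states = dict(vm_states)  # current state seen on the hardware
        self.last_synced = {}             # what the management server already knows

    def full_sync(self):
        # Sent once, when the resource connects to a management server.
        self.last_synced = dict(self.vm_states)
        return dict(self.vm_states)

    def delta_sync(self):
        # Sent afterwards: only the VMs whose state differs from the last report.
        delta = {
            vm: state
            for vm, state in self.vm_states.items()
            if self.last_synced.get(vm) != state
        }
        for vm in self.last_synced:
            if vm not in self.vm_states:
                delta[vm] = "Removed"
        self.last_synced = dict(self.vm_states)
        return delta
```

When nothing has changed, `delta_sync()` returns an empty dictionary, so a quiet host costs the management server almost nothing.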
  • 6. Numbers
      • Assume 10k hosts and 500k VMs (50 VMs per host)
      • Stats Collector
          – Fires off 10k requests every 5 minutes, or 33 requests a second.
          – Bad but not too bad: occupies 33 threads every second.
          – But just wait:
              • 2 management servers: 66 requests a second
              • 3 management servers: 99 requests a second
          – It gets worse as the number of management servers increases, because the collector does not auto-balance across management servers.
          – And it gets worse still: the 10k hosts are now spread across 3 management servers. While 99 requests are generated, the number of threads involved is three-fold, because requests need to be routed to the right management server.
          – It keeps the management server 20% busy even with no load from incoming requests.
      • VM Sync
          – Fires off 1 request at resource connection to sync about 50 VMs.
          – After that, the resource pushes updates: it knows what it has pushed before and only pushes changes that happened out-of-band.
          – So essentially no threads are occupied, for a much larger data set.
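The stats collector figures follow from simple arithmetic, reproduced here as a short Python sketch. The function names are illustrative, and `threads_involved` reads the slide's "three-fold" literally as one extra multiplication by the number of management servers, which is an assumption rather than a measured figure.

```python
def requests_per_second(hosts, interval_seconds, management_servers):
    # Every management server polls every host once per interval,
    # so the request rate grows linearly with the number of servers.
    per_server = hosts // interval_seconds
    return per_server * management_servers


def threads_involved(hosts, interval_seconds, management_servers):
    # Assumption: each request may be routed through the other servers,
    # so thread usage is scaled by the server count once more.
    return requests_per_second(hosts, interval_seconds, management_servers) * management_servers


for servers in (1, 2, 3):
    print(servers, requests_per_second(10_000, 300, servers))
```

With 10,000 hosts and a 300-second interval this gives 33, 66 and 99 requests a second for one, two and three management servers.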
  • 7. What's the Downside?
      • Resources must reconcile the VM states caused by management server commands with the VM states they collect from the physical hardware, so they need more CPU.
      • Resources must use more memory to keep track of what amounts to a journal of changes since the last sync point.
      • But data centers are full of these two resources.
  • 8. Resource Load Balancing
      • As a management server is added to the cluster, resources are rebalanced seamlessly.
          – MS2 signals MS1 to hand over a resource
          – MS1 waits for the commands on the resource to finish
          – MS1 holds further commands in a queue
          – MS1 signals MS2 to take over
          – MS2 connects
          – MS2 signals MS1 to complete the transfer
          – MS1 discards its resource and flows the held commands to MS2
      • Listeners are provided so that business logic can listen on connection status and adjust its work based on what is connected.
      • Because a process only works on resources connected to the management server it runs on, work is auto-balanced between management servers.
      • This also reduces message routing between the management servers.
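The handover steps can be modelled in a few lines of Python. This is a simplified, single-threaded sketch under stated assumptions: the `ManagementServer` class and its methods are hypothetical, commands are treated as finished as soon as they are sent, and the signalling between MS1 and MS2 is reduced to plain method calls.

```python
from collections import deque


class ManagementServer:
    def __init__(self, name):
        self.name = name
        self.resources = set()  # resources currently connected to this server
        self.held = {}          # resource -> commands held during a handover
        self.executed = []      # (resource, command) pairs run by this server

    def send(self, resource, command):
        if resource in self.held:
            self.held[resource].append(command)  # handover in progress: hold it
        elif resource in self.resources:
            self.executed.append((resource, command))
        else:
            raise LookupError(f"{resource} is not connected to {self.name}")

    def begin_handover(self, resource):
        # In-flight commands are assumed finished; further ones are queued.
        self.held[resource] = deque()

    def complete_handover(self, resource, new_owner):
        # Discard the resource and flow the held commands to the new owner.
        self.resources.discard(resource)
        for command in self.held.pop(resource):
            new_owner.send(resource, command)


def rebalance(resource, old_owner, new_owner):
    old_owner.begin_handover(resource)               # MS2 asks, MS1 holds commands
    new_owner.resources.add(resource)                # MS1 signals, MS2 connects
    old_owner.complete_handover(resource, new_owner) # MS2 asks MS1 to complete
```

Commands that arrive at MS1 between `begin_handover` and `complete_handover` are not lost; they are replayed on MS2 in their original order.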
  • 9. Designing for Scalability
      • Take advantage of the most abundant resources in a data center (CPU, RAM).
      • Auto-scale to the least abundant resource (DB).
      • Do not hold DB connections or transactions across resource calls.
          – Use the lock table implementation (Merovingian2 or the GenericDao.acquireLockInTable() call) instead of database row locks in this situation.
          – Database row locks are still fine for quick, short lockouts.
      • Balance the resource-intensive tasks as the number of management server nodes increases and decreases.
          – Use job queues to balance long-running processes across management servers.
          – Make use of resource rebalancing in CloudStack to auto-balance your workload.
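The advice about not holding a transaction across a resource call can be illustrated with a generic lock-table pattern. The sketch below uses Python's `sqlite3` in autocommit mode; it is not the Merovingian2 implementation, and the `op_lock` table, the function names and the owner strings are all assumptions made for the example.

```python
import sqlite3


def make_db():
    # isolation_level=None puts sqlite3 in autocommit mode, so each statement
    # is its own short transaction and nothing stays open between calls.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE op_lock (name TEXT PRIMARY KEY, owner TEXT NOT NULL)")
    return conn


def acquire_lock(conn, name, owner):
    try:
        conn.execute("INSERT INTO op_lock (name, owner) VALUES (?, ?)", (name, owner))
        return True
    except sqlite3.IntegrityError:
        return False  # the primary key already exists: someone else holds the lock


def release_lock(conn, name, owner):
    conn.execute("DELETE FROM op_lock WHERE name = ? AND owner = ?", (name, owner))


def with_table_lock(conn, name, owner, resource_call):
    """Run a possibly slow resource call while holding only a lock row."""
    if not acquire_lock(conn, name, owner):
        return None
    try:
        return resource_call()  # no DB transaction is open during this call
    finally:
        release_lock(conn, name, owner)
```

The lock is a row rather than an open transaction, so a resource call that takes minutes does not pin a database connection or block unrelated rows for that time.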
  • 11. The Five W's of Unreliability
      • What is unreliable? Everything
      • Who is unreliable? Developers & administrators
      • When does unreliability happen? 3:04 a.m., no matter which time zone… Any time.
      • Where does unreliability happen? In carefully planned, everything-has-been-considered data centers.
      • How does unreliability happen? Rather nonchalantly
  • 12. Dealing with Unreliability
      • Don't assume!
      • Don't bang your head against the wall!
      • Know when you don't know any better.
      • Ask for help!
  • 13. Designs against Unreliability
      • Management servers keep a heartbeat with the DB: one ping a minute.
      • A management server self-fences if it cannot write the heartbeat.
      • Other management servers wait to make sure the down management server is no longer writing the heartbeat, and then signal interested software to recover.
      • Checkpoints at every call to a resource, with code to recover from those checkpoints.
      • Database records are not actually deleted, to help with manual recovery when needed.
      • Write code that is idempotent.
      • Respect modularity when writing your code.
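The heartbeat and self-fencing rules can be sketched in Python with an injected clock. Everything here is illustrative: `HeartbeatTable` stands in for the database, `FENCE_WAIT` is an assumed waiting period (the slides only say that peers "wait"), and a fenced server simply stops writing.

```python
HEARTBEAT_INTERVAL = 60              # one ping a minute, as on the slide
FENCE_WAIT = 3 * HEARTBEAT_INTERVAL  # assumed grace period before recovery


class HeartbeatTable:
    """Stand-in for the heartbeat table in the database."""

    def __init__(self):
        self.rows = {}                # server name -> time of last heartbeat
        self.unreachable_for = set()  # servers that currently cannot reach the DB

    def write(self, server, now):
        if server in self.unreachable_for:
            raise ConnectionError("database unreachable")
        self.rows[server] = now


class ManagementServer:
    def __init__(self, name, table):
        self.name = name
        self.table = table
        self.fenced = False

    def ping(self, now):
        if self.fenced:
            return  # a fenced server does no further work
        try:
            self.table.write(self.name, now)
        except ConnectionError:
            self.fenced = True  # self-fence: it cannot prove it is alive

    def peer_is_down(self, peer, now):
        # Peers wait until the heartbeat has been silent long enough
        # before signalling recovery.
        last = self.table.rows.get(peer)
        return last is None or now - last > FENCE_WAIT
```

Self-fencing and the waiting period work together: the isolated server stops acting on its own, and the others only take over once its silence has lasted longer than the grace period.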