International Symposium on Architecting Critical Systems (ISARCS) 2013 talk slides. June 19th, 2013.
Full paper at http://www.nicta.com.au/pub?doc=6431
Availability Analysis for Deployment of In-Cloud Applications
1. Availability Analysis for Deployment
of In-Cloud Applications
Xiwei Xu, Qinghua Lu, Liming Zhu, Jim (Zhanwen) Li
Sherif Sakr, Hiroshi Wada, Ingo Weber
Software Systems Research Group, NICTA
ISARCS13, Vancouver
Slides at: http://www.slideshare.net/LimingZhu/
2. NICTA Copyright 2010 From imagination to impact 2
Motivation
• Uncertainties in the cloud make it challenging to architect
critical applications and understand their availability
– Shared resources, weak SLA guarantees and limited visibility
– Rare but high-consequence events
– Sporadic activities: upgrade, backup, recovery…
– Subjective uncertainties: impact of configuration choices
• We want to explicitly model the above uncertainties in
application availability analysis of cloud deployment.
– from a cloud consumer perspective
– focusing on mechanisms most relevant to critical
applications: auto-scaling, over-provisioning, backup,
recovery and maintenance.
3.
Contributions
• SRN(Stochastic Reward Net)-based availability models
• which allow you to specify:
– Deployment architecture (application placements in VM)
– Node/Aggregation level SLAs from infrastructure providers
– Auto-scaling policies and recovery strategies
– Rare events: availability zone or region down
• which give you application availability levels of different options
under different scenarios
• Model evaluation by analysing existing industry best
practices in cloud application deployment
– Quantifying the rule-of-thumb best practices
– Comparing different (best) practices
4.
Deployment Architecture Assumption
– Stateless VMs: auto-scaling groups
– Stateful VMs: hot standbys
– Backup at separate region for recovery
5.
Availability Analysis Overview
• SRN-based Models
• Architecture model and recovery model in this paper
• One SRN architecture model per availability zone
6.
Availability Analysis Overview
• Deployment decisions and patterns
– stateless/stateful application placement within VMs
– auto-scaling policies
– multi-zone configurations
7.
Availability Analysis Overview
• SLA from the cloud providers
• Node level (Rackspace) or zone level (Amazon)
8.
Availability Analysis Overview
• Recovery strategy
• Auto-regeneration of stateless VMs and different
recovery mechanisms for stateful VMs
• Different Recovery-Time/Point-Objective (RTO/RPO)
9.
Availability Analysis Overview
• Application-specific data
– Stateless VM start-up time…
– Stateful VM replication…
10.
Stochastic Reward Net
• Stochastic Reward Net (SRN)
– Stochastic Petri Net variant
– Firing delays
– Reward function
• Constructs
• Places: VM states
(Full, Running, Stopped, Failed)
• Token: VMs
• Transition
• Guard function
• Transition rate: 1) frequency of
events, 2) delay before the
transition fires
• Reward Function:
if (#Running1 > 0) 1 else 0
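To make the reward function concrete, here is a minimal sketch in Python. It is not the model from the paper: it assumes a simplified net with only Running and Failed places, independent VMs, and hypothetical per-VM failure and repair rates. Under those assumptions the net reduces to a birth-death Markov chain over the number of tokens in Running, and availability is the steady-state expectation of the reward function shown on the slide.

```python
def steady_state_running(n_vms, fail_rate, repair_rate):
    """Steady-state distribution over k = number of tokens in Running (0..n_vms).

    Assumes each VM fails and is repaired independently (hypothetical rates).
    """
    weights = [1.0]  # unnormalised probability of k = 0
    for k in range(n_vms):
        # Balance between neighbouring states of a birth-death chain:
        # pi[k] * (n_vms - k) * repair_rate == pi[k + 1] * (k + 1) * fail_rate
        weights.append(
            weights[-1] * (n_vms - k) * repair_rate / ((k + 1) * fail_rate)
        )
    total = sum(weights)
    return [w / total for w in weights]


def reward(running):
    """Reward function from the slide: 1 if at least one VM is running."""
    return 1 if running > 0 else 0


def expected_availability(n_vms, fail_rate, repair_rate):
    """Expected steady-state reward, i.e. the fraction of time the app is up."""
    pi = steady_state_running(n_vms, fail_rate, repair_rate)
    return sum(p * reward(k) for k, p in enumerate(pi))


if __name__ == "__main__":
    # Hypothetical rates, chosen only for illustration.
    for n in (1, 2, 3):
        print(n, expected_availability(n, fail_rate=0.01, repair_rate=1.0))
```

A real SRN tool would also handle guard functions, the Full and Stopped places and non-independent transitions; this sketch only shows how a reward function turns a state distribution into an availability figure.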
11.
SRN-based Availability Models
12.
Availability Models: Auto-scaling
14.
Availability Models: Stateful VM
15.
Availability Models: Disaster Recovery
• Availability zone life cycle
– Interact with the big
architecture model
• Stateless VM recovery
– Backup/AMI
• Stateful VM recovery
– Backup
– Replica
– Hot standby
16.
Case 1: Multi-zone Deployment
• Parameters
– Amazon EC2 SLA of 99.95% availability
– Zone failure rate: 0.00011, MTTR: 4.38 hours per year
– Application specific measurement of transitions
0.01% = 52.56 mins downtime per year
0.4% diff = 35 hours
0.76% diff = 66 hours
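The downtime figures above follow from simple arithmetic on an availability percentage. The sketch below assumes a 365-day year (8760 hours); the function name is hypothetical and used only for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, assuming a non-leap year


def downtime_hours_per_year(unavailability_percent):
    """Convert an unavailability (or availability difference) in percent
    into hours of downtime per year."""
    return unavailability_percent / 100.0 * HOURS_PER_YEAR


if __name__ == "__main__":
    # 99.95% SLA leaves 0.05% unavailability.
    print(round(downtime_hours_per_year(100 - 99.95), 2), "hours")
    # 0.01% expressed in minutes.
    print(round(downtime_hours_per_year(0.01) * 60, 2), "minutes")
    # Differences quoted on the slide.
    for diff in (0.4, 0.76):
        print(diff, "% ->", round(downtime_hours_per_year(diff), 2), "hours")
```

The 0.05% left over by a 99.95% SLA comes to about 4.38 hours per year, the same figure that appears in the parameters. The exact results for 0.4% and 0.76% are about 35.0 and 66.6 hours, so the slide's hour values appear to be rounded or truncated.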
17.
Case 2: Recovery across Availability Zone
• Industry rule of thumb: "Target auto-scale 30-60% until you have
50% headroom for load spikes. Lose an AZ leads to 90% utilisation."
• Impact on overall availability?
• 30-60% vs. traditional 70-90%?
• over-provisioning vs. auto-scaling?
0.29% diff = 25 hours
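The utilisation jump in the rule of thumb can be checked with a short calculation. The sketch below assumes the load is spread evenly across zones and is redistributed evenly over the surviving zones; the zone count of three is an assumption, since the quote does not state it.

```python
def utilisation_after_zone_loss(utilisation, zones, zones_lost=1):
    """Utilisation of the surviving zones once the load of the lost zones
    is redistributed evenly (assumes equal capacity per zone)."""
    remaining = zones - zones_lost
    if remaining <= 0:
        raise ValueError("at least one zone must survive")
    return utilisation * zones / remaining


if __name__ == "__main__":
    # Assumed: three availability zones.
    for target in (0.30, 0.60, 0.70, 0.90):
        after = utilisation_after_zone_loss(target, zones=3)
        print(f"{target:.0%} before -> {after:.0%} after losing one of three zones")
```

With three zones, a 60% target becomes roughly 90% after one zone is lost, which matches the quoted figure. A traditional 70-90% target would exceed 100% under the same assumption, meaning the surviving zones could not absorb the load without scaling out.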
18.
Case 3: Disaster Recovery across Regions
• Trade-off between RPO and RTO
• RPO: Recovery Point Objective
• RTO: Recovery Time Objective
Yuruware — http://www.yuruware.com/
0.2% diff = 17 hours
19.
Conclusion and Future Work
• SRN-based availability models
– Application-level availability
– Highly configurable for different deployment architectures
– Model different uncertainties and scenarios for critical systems
– Quantify and compare choices and enable what-if analysis
– Evaluated using industry best practices
• Future work
– Better evaluation!
– Integrated models on impact of upgrade, live migration, backup and
subjective uncertainties (in IEEE Cloud 13)
Q. Lu, X. Xu, L. Zhu, L. Bass, et al., "Incorporating Uncertainty into in-Cloud Application
Deployment Decisions for Availability," in IEEE Cloud 2013
Liming.Zhu@nicta.com.au
Slides available at http://www.slideshare.net/LimingZhu/
Editor's Notes
In this paper, we only show the architecture model and the recovery model due to space limitations.