Availability Analysis for Deployment
of In-Cloud Applications
Xiwei Xu, Qinghua Lu, Liming Zhu, Jim (Zhanwen) Li
Sherif Sakr, Hiroshi Wada, Ingo Weber
Software Systems Research Group, NICTA
ISARCS13, Vancouver
Slides at: http://www.slideshare.net/LimingZhu/
NICTA Copyright 2010 From imagination to impact 2
Motivation
• Uncertainties in the cloud make it hard to architect
critical applications and understand their availability
– Shared resources, weak SLA guarantees and limited visibility
– Rare but high consequence events
– Sporadic activities: upgrade, backup, recovery…
– Subjective uncertainties: impact of configuration choices
• We want to explicitly model the above uncertainties in
application availability analysis of cloud deployment.
– from a cloud consumer perspective
– focusing on mechanisms most relevant to critical
applications: auto-scaling, over-provisioning,
backup, recovery and maintenance.
Contributions
• SRN (Stochastic Reward Net)-based availability models
• which allow you to specify:
– Deployment architecture (application placements in VM)
– Node/Aggregation level SLAs from infrastructure providers
– Auto-scaling policies and recovery strategies
– Rare events: availability zone or region down
• which give you application availability levels of different options
under different scenarios
• Model evaluation by analysing existing industry best
practices in cloud application deployment
– Quantifying the rule-of-thumb best practices
– Comparing different (best) practices
Deployment Architecture Assumption
– Stateless VMs: auto-scaling groups
– Stateful VMs: hot standbys
– Backup at separate region for recovery
Availability Analysis Overview
• SRN-based Models
• Architecture model and recovery model in this paper
• One SRN architecture model per availability zone
Availability Analysis Overview
• Deployment decisions and patterns
– stateless/stateful application placement within VMs
– auto-scaling policies
– multi-zone configurations
Availability Analysis Overview
• SLA from the cloud providers
• Node level (Rackspace) or zone level (Amazon)
Availability Analysis Overview
• Recovery strategy
• Auto-regeneration of stateless VMs and different
recovery mechanisms for stateful VMs
• Different Recovery-Time/Point-Objective (RTO/RPO)
Availability Analysis Overview
• Application-specific data
– Stateless VM start-up time…
– Stateful VM replication…
Stochastic Reward Net
• Stochastic Reward Net (SRN)
– Stochastic Petri Net variant
– Firing delays
– Reward function
• Constructs
• Places: VM states
(Full, Running, Stopped, Failed)
• Token: VMs
• Transition
• Guard function
• Transition rate: 1) frequency of
events, 2) delay before the
transition fires
• Reward Function:
if (#Running1>0) 1 else 0
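As a rough illustration of how a reward function like the one above turns into an availability figure, the sketch below computes the steady-state probability that at least one VM is running. It assumes a simplified model that is not taken from the slides: n identical VMs that fail and are repaired independently at constant rates, so the number of running VMs forms a birth-death chain. The function names and the rates are hypothetical.

```python
def steady_state(n, fail_rate, repair_rate):
    """Steady-state distribution over k = number of running VMs (0..n).

    Assumed model (not from the slides): each running VM fails at
    fail_rate and each failed VM is repaired at repair_rate, independently.
    """
    weights = [1.0]  # unnormalised probability of k = 0
    for k in range(n):
        # Detailed balance between k and k + 1 running VMs:
        # pi[k] * (n - k) * repair_rate == pi[k + 1] * (k + 1) * fail_rate
        weights.append(
            weights[-1] * (n - k) * repair_rate / ((k + 1) * fail_rate)
        )
    total = sum(weights)
    return [w / total for w in weights]


def reward(running):
    """Reward function from the slide: 1 if #Running > 0, else 0."""
    return 1 if running > 0 else 0


def availability(n, fail_rate, repair_rate):
    """Expected steady-state reward: the probability the service is up."""
    pi = steady_state(n, fail_rate, repair_rate)
    return sum(p * reward(k) for k, p in enumerate(pi))


if __name__ == "__main__":
    # Hypothetical rates chosen only to keep the arithmetic easy to follow.
    print(availability(2, fail_rate=1.0, repair_rate=3.0))
```

The full SRN models in the deck have more places and guarded transitions than this chain, so the sketch only shows the principle: availability is the expected value of the reward function under the steady-state distribution of markings.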
SRN-based Availability Models
Availability Models: Auto-scaling
Availability Models: Auto-scaling
gScaleSelf1:
if(#Running1<=#Running2 && #Stopped1>0) 1 else 0
gScaleOther1:
if(#Running1>#Running2 && #Stopped2>0) 1 else 0
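The two guards above can be read as plain predicates over the current marking. The sketch below restates them in Python, assuming the marking is a dictionary from place names to token counts (that representation and the function names are illustrative, not part of the slides). One plausible reading, offered as an assumption: a new VM is started in zone 1 when zone 1 has no more running VMs than zone 2 and still has a stopped VM, and in zone 2 otherwise.

```python
def g_scale_self1(marking):
    """gScaleSelf1: enabled when zone 1 runs no more VMs than zone 2
    and zone 1 still has a stopped VM to start."""
    return 1 if (marking["Running1"] <= marking["Running2"]
                 and marking["Stopped1"] > 0) else 0


def g_scale_other1(marking):
    """gScaleOther1: enabled when zone 1 runs more VMs than zone 2
    and zone 2 still has a stopped VM to start."""
    return 1 if (marking["Running1"] > marking["Running2"]
                 and marking["Stopped2"] > 0) else 0


if __name__ == "__main__":
    # Hypothetical marking: zone 1 is behind zone 2 and has one spare VM.
    marking = {"Running1": 1, "Running2": 2, "Stopped1": 1, "Stopped2": 0}
    print(g_scale_self1(marking), g_scale_other1(marking))
```

Because the first condition of each guard is the negation of the other, the two guards can never both be enabled in the same marking.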
Availability Models: Stateful VM
Availability Models: Disaster Recovery
• Availability zone life cycle
– Interact with the big
architecture model
• Stateless VM recovery
– Backup/AMI
• Stateful VM recovery
– Backup
– Replica
– Hot standby
Case 1: Multi-zone Deployment
• Parameters
– Amazon EC2 SLA of 99.95% availability
– Zone fail rate: 0.00011, MTTR: 4.38 hours per year
– Application specific measurement of transitions
0.01% = 52.56 mins downtime per year
0.4% diff = 35 hours
0.76% diff = 66 hours
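The downtime figures quoted for the case studies follow from converting a percentage of availability into time over a year. The short sketch below shows that conversion, assuming a 365-day year (8,760 hours); the helper name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming a non-leap year


def downtime_hours(availability_diff_percent):
    """Hours per year that correspond to a given percentage of availability."""
    return availability_diff_percent / 100 * HOURS_PER_YEAR


if __name__ == "__main__":
    for diff in (0.01, 0.4, 0.76):
        hours = downtime_hours(diff)
        print(f"{diff}% = {hours:.2f} hours = {hours * 60:.2f} minutes per year")
    # Downtime allowed by a 99.95% availability SLA.
    print(f"99.95% SLA allows {downtime_hours(100 - 99.95):.2f} hours per year")
```

Under the same assumption, the 0.05% of unavailability permitted by a 99.95% SLA works out to about 4.38 hours per year, which matches the 4.38-hour figure in the parameters above.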
Case 2: Recovery across Availability Zone
• Industry rule of thumb: "Target auto-scale 30-60% until you have
50% headroom for load spikes. Lose an AZ leads to 90% utilisation."
• Impact on overall availability?
• 30-60% vs. traditional 70-90%?
• over-provisioning vs. auto-scaling?
0.29% diff = 25 hours
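The 90% figure in the rule of thumb can be reproduced with simple arithmetic. The sketch below assumes three availability zones with load spread evenly and redistributed evenly over the survivors after a zone is lost; the zone count and the function name are assumptions, not stated on the slide.

```python
def utilisation_after_zone_loss(utilisation, zones, lost=1):
    """Per-zone utilisation once the load of `lost` zones is spread
    evenly over the remaining zones (assumes identical zone capacity)."""
    if not 0 < lost < zones:
        raise ValueError("lost must be at least 1 and less than zones")
    return utilisation * zones / (zones - lost)


if __name__ == "__main__":
    # 60% target across three zones: the survivors run at roughly 90%.
    print(f"{utilisation_after_zone_loss(0.60, zones=3):.2f}")
    # A traditional 90% target leaves no room: the survivors would need 135%.
    print(f"{utilisation_after_zone_loss(0.90, zones=3):.2f}")
```

Under these assumptions, a 30-60% target keeps the surviving zones within capacity after losing one zone, while a 70-90% target does not, which is the trade-off between over-provisioning and auto-scaling that the case study quantifies.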
Case 3: Disaster Recovery across Regions
• Trade-off between RPO and RTO
• RPO: Recovery Point Objective
• RTO: Recovery Time Objective
Yuruware — http://www.yuruware.com/
0.2% diff = 17 hours
Conclusion and Future Work
• SRN-based availability models
– Application-level availability
– Highly configurable for different deployment architectures
– Model different uncertainties and scenarios for critical systems
– Quantify and compare choices and enable what-if analysis
– Evaluated using industry best practices
• Future work
– Better evaluation!
– Integrated models on impact of upgrade, live migration, backup and
subjective uncertainties (in IEEE Cloud 13)
Q. Lu, X. Xu, L. Zhu, L. Bass, et al., "Incorporating Uncertainty into in-Cloud Application
Deployment Decisions for Availability," in IEEE Cloud 2013
Liming.Zhu@nicta.com.au
Slides available at http://www.slideshare.net/LimingZhu/


Editor's Notes

  1. In this paper, we only show the architecture model and the recovery model due to space limitations.