Designing Apps for Resiliency
Masashi Narumoto
Principal Lead PM
AzureCAT patterns & practices
Agenda
• What is ‘resiliency’?
• Why is it so important?
• Process to improve resiliency
• Resiliency checklist
What is ‘Resiliency’?
• Resiliency is the ability to recover from failures and continue to
function. It's not about avoiding failures, but responding to failures in
a way that avoids downtime or data loss.
• High availability is the ability of the application to keep running in a
healthy state, without significant downtime.
• Disaster recovery is the ability to recover from rare but major incidents:
non-transient, wide-scale failures, such as a service disruption that affects an
entire region.
Why is it so important?
• Transient faults are more common in the cloud
• Dependent services may go down
• An SLA below 100% means something can go wrong at some point
• More focus on MTTR (mean time to recovery) than on MTBF (mean time between failures)
Process to improve resiliency
Plan → Design → Implement → Test → Deploy → Monitor → Respond
• Plan: define requirements
• Design: identify failures
• Implement: implement recovery strategies
• Test: inject failures, simulate failover (FO)
• Deploy: deploy apps in a reliable manner
• Monitor: monitor failures
• Respond: take actions to fix issues
Defining resiliency requirements
• Recovery Point Objective (RPO): the maximum time period in which data might be lost
• Recovery Time Objective (RTO): the duration of time in which the service must be restored after an incident
• Maximum Tolerable Outage (MTO)
[Figure: timeline with periodic data backups, a major incident, service recovered, and business recovered, with the RPO, RTO, and MTO intervals marked]
SLA (Service Level Agreement)
Composite SLA
• Two dependent services (99.95% and 99.99%): 99.95% × 99.99% = 99.94%
• With a fallback action (return data from local cache), both paths must fail at the same time: 1.0 − (0.0001 × 0.001) = 99.99999%
• Composite SLA for two regions = (1 − (1 − N)(1 − N)) × Traffic Manager SLA, where N is the composite SLA of one region
  - 1 − (1 − 0.9995) × (1 − 0.9995) = 0.99999975
  - (1 − (1 − 0.9995) × (1 − 0.9995)) × 0.9999 ≈ 0.99989975
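The arithmetic above can be reproduced with a short sketch. The helper names `serial_sla` and `parallel_sla` are illustrative and not part of any Azure tooling, and the sketch assumes component failures are independent.

```python
from functools import reduce


def serial_sla(*slas: float) -> float:
    """Composite SLA when every component must be up (dependencies in series)."""
    return reduce(lambda acc, s: acc * s, slas, 1.0)


def parallel_sla(*slas: float) -> float:
    """Composite SLA when any one component is enough (independent fallbacks)."""
    all_fail = reduce(lambda acc, s: acc * (1.0 - s), slas, 1.0)
    return 1.0 - all_fail


# Two dependent services: 99.95% x 99.99%
web_and_db = serial_sla(0.9995, 0.9999)

# Primary path (99.99%) with a fallback path (99.9%): both must fail together
db_or_fallback = parallel_sla(0.9999, 0.999)

# Two regions at 99.95% each, fronted by a 99.99% traffic router
two_regions = parallel_sla(0.9995, 0.9995)
multi_region = serial_sla(two_regions, 0.9999)

print(f"{web_and_db:.4%}  {db_or_fallback:.5%}  {multi_region:.6%}")
```

Note that the traffic router sits in series with both regions, so its SLA caps the multi-region result.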
Designing for resiliency
Example failures:
- Reading data from SQL Server fails
- A web server goes down
- An NVA (network virtual appliance) goes down
1. Identify possible failures
2. Rate the risk of each failure (impact × likelihood)
3. Design a resiliency strategy
  - Detection
  - Recovery
  - Diagnostics
Failure mode analysis
https://azure.microsoft.com/en-us/documentation/articles/guidance-resiliency-failure-mode-analysis/
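As a rough illustration of step 2, failures can be ranked by impact × likelihood. The 1 to 5 scales and the scores below are assumptions made up for this sketch, not values from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    impact: int      # assumed scale: 1 (minor) to 5 (severe)
    likelihood: int  # assumed scale: 1 (rare) to 5 (frequent)

    @property
    def risk(self) -> int:
        # Risk rating as described on the slide: impact x likelihood
        return self.impact * self.likelihood


def rank_by_risk(modes):
    """Return failure modes ordered from highest to lowest risk."""
    return sorted(modes, key=lambda m: m.risk, reverse=True)


# Hypothetical scores, for illustration only
modes = [
    FailureMode("Reading data from SQL Server fails", impact=4, likelihood=3),
    FailureMode("A web server goes down", impact=2, likelihood=4),
    FailureMode("An NVA goes down", impact=5, likelihood=2),
]
ranked = rank_by_risk(modes)
for m in ranked:
    print(m.risk, m.name)
```

The highest-rated failures are the ones to design detection, recovery, and diagnostics for first.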
Rack awareness
[Figure: web tier, middle tier, and data tier, each in its own availability set, spread across fault domains 1, 2, and 3; the data tier holds Shard #1 and Shard #2 with Replica #1 and Replica #2]
Load balance multiple instances
Application gateway for
- L7 routing
- SSL termination
Failover / Failback
• Traffic Manager with the priority routing method
• Automated failover, manual failback
• Primary region and secondary region (regional pair), each with web, application, and data tiers
Data replication
• Azure Storage geo-replica (RA-GRS)
• LocationMode = PrimaryThenSecondary
• LocationMode = SecondaryOnly
• Periodically check whether it’s back online
Retry transient failures
• See ‘Azure retry guidance’ for more details
• Total time spent retrying < E2E (end-to-end) latency requirement
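A minimal sketch of retrying within a latency budget, using exponential backoff with jitter. `TransientError`, `retry_within_budget`, and the parameters are hypothetical names for this example. A real client would rely on the retry policy of its SDK as described in the Azure retry guidance.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_within_budget(operation, budget_s, base_delay_s=0.1, max_attempts=5,
                        sleep=time.sleep, clock=time.monotonic):
    """Retry `operation` on TransientError without exceeding `budget_s` seconds."""
    deadline = clock() + budget_s
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            # Exponential backoff with full jitter: 0 .. base * 2^(attempt-1)
            delay = random.uniform(0, base_delay_s * 2 ** (attempt - 1))
            # Give up if attempts are exhausted or the next wait would
            # push past the end-to-end latency budget.
            if attempt == max_attempts or clock() + delay >= deadline:
                raise
            sleep(delay)
```

Usage: `retry_within_budget(call_remote_service, budget_s=2.0)`, where `call_remote_service` is whatever callable wraps the remote call and raises `TransientError` on a retryable failure.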
Circuit Breaker
User → Your application → Remote service (failed)
• Holding resources while retrying the operation
• Can lead to cascading failures
https://github.com/App-vNext/Polly
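The idea can be sketched in a few lines. This is not Polly's API (Polly is a .NET library). The class name, thresholds, and the single-threaded design are assumptions made for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the remote service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                # Open: fail fast instead of holding resources on a dead service
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let one trial call through
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` and can fall back (for example, to cached data) rather than queueing behind a failing dependency.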
Bulkhead
• Separate thread pools for calls to Service A, Service B, and Service C
• Separate thread pools for Workload 1 and Workload 2
• Resources to isolate: memory, CPU, disk, thread pool, connection pool, network connection
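A small sketch of thread-pool bulkheads: each downstream service gets its own bounded pool, so a stalled service cannot exhaust the threads used for the others. The `Bulkheads` class and the service names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


class Bulkheads:
    """One bounded thread pool per downstream service."""

    def __init__(self, limits):
        # limits maps a service name to its maximum number of worker threads
        self._pools = {
            name: ThreadPoolExecutor(max_workers=n, thread_name_prefix=name)
            for name, n in limits.items()
        }

    def submit(self, service, fn, *args):
        """Run fn on the pool reserved for `service` and return its Future."""
        return self._pools[service].submit(fn, *args)

    def shutdown(self):
        for pool in self._pools.values():
            pool.shutdown(wait=True)
```

Usage: create `Bulkheads({"service-a": 4, "service-b": 4})` and submit each remote call to the pool named after its target. If Service A hangs, only its four threads are tied up and calls to Service B keep flowing.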
Other design patterns for resiliency
• Compensating transaction
• Scheduler-agent-supervisor
• Throttling
• Load leveling
• Leader election
See ‘Cloud design patterns’
Principles of chaos engineering
• Build a hypothesis around steady-state behavior
• Vary real-world events
• Run experiments in production
• Automate experiments to run consistently
http://principlesofchaos.org/
[Figure: production traffic is fed to a control group and an experimental group; HW/SW failures and a spike in traffic are introduced; the difference is verified in terms of steady state]
Testing for resiliency
• Fault injection testing
  - Shut down VM instances
  - Crash processes
  - Expire certificates
  - Change access keys
  - Shut down the DNS service on domain controllers
  - Limit available system resources, such as RAM or number of threads
  - Unmount disks
  - Redeploy a VM
• Load testing
  - Use production data as much as you can
  - VSTS, JMeter
• Soak testing
  - Longer period under normal production load
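Blue/Green and Canary release
• Blue/Green deployment: the current version and the new version each run a full Web, App, and DB stack, and traffic is switched from one to the other
• Canary release: 90% of traffic goes to the current version and 10% to the new version
• Traffic is directed by a load balancer / reverse proxy
• Deployment slots at App Service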
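The 90% / 10% canary split can be sketched as a deterministic routing rule. The `route` function and the hashing scheme are assumptions for illustration. In practice the split is usually configured in the load balancer or reverse proxy rather than in application code.

```python
import hashlib


def route(user_id: str, canary_percent: int = 10) -> str:
    """Deterministically send about canary_percent% of users to the new version."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the user to a stable bucket in 0..99
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "new" if bucket < canary_percent else "current"


if __name__ == "__main__":
    sample = [route(f"user-{i}") for i in range(10_000)]
    print("share on new version:", sample.count("new") / len(sample))
```

Hashing the user ID keeps each user on the same version across requests, which makes canary metrics easier to compare against the current version.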
Dark launching
• New feature deployed to the production environment
• Toggle to enable/disable it in the user interface
Resiliency checklist
• https://azure.microsoft.com/en-us/documentation/articles/guidance-resiliency-checklist/
Other resources
http://docs.microsoft.com/Azure
Resiliency / High Availability / Disaster Recovery
Throttling
Circuit breaker
Zero downtime deployment
Eventual consistency
Data restore
Retry
Graceful degradation
Geo-replica
Multi-region deployment

Introduction to IEEE STANDARDS and its different types.pptx
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 

Designing apps for resiliency

  • 1. Designing Apps for Resiliency Masashi Narumoto Principal Lead PM AzureCAT patterns & practices
  • 2. Agenda • What is ’resiliency’? • Why it’s so important? • Process to improve resiliency • Resiliency checklist
  • 3. What is ‘Resiliency’? • Resiliency is the ability to recover from failures and continue to function. It's not about avoiding failures, but responding to failures in a way that avoids downtime or data loss. • High availability is the ability of the application to keep running in a healthy state, without significant downtime. • Disaster recovery is the ability to recover from rare but major incidents: Non-transient, wide-scale failures, such as service disruption that affects an entire region.
  • 4. Why it’s so important? • More transient faults in the cloud • Dependent service may go down • SLA < 100% means something could go wrong at some point • More focus on MTTR rather than MTBF
  • 5. Process to improve resiliency • Plan: Define requirements • Design: Identify failures • Implement: Implement recovery strategies • Test: Inject failures, simulate FO • Deploy: Deploy apps in a reliable manner • Monitor: Monitor failures • Respond: Take actions to fix issues
  • 6. Defining resiliency requirements • Timeline labels: Data backup (x3), Major incident occurs, Service recovered, Business recovered • Recovery Point Objective (RPO): The maximum time period in which data might be lost • Recovery Time Objective (RTO): Duration of time in which the service must be restored after an incident • Maximum Tolerable Outage (MTO)
  • 7. SLA (Service Level Agreement)
  • 8. Composite SLA • Composite SLA = ? • 99.95% x 99.99% = 99.94% • Cache fallback action: Return data from local cache • 1.0 − (0.0001 × 0.001) = 99.99999% • Composite SLA for two regions = (1 − (1 − N)(1 − N)) x Traffic manager SLA • 1 − (1 − 0.9995) x (1 − 0.9995) = 0.99999975 • (1 − (1 − 0.9995) x (1 − 0.9995)) x 0.9999 = 0.999899
  • 9. Designing for resiliency • Example failures: Reading data from SQL Server fails; A web server goes down; An NVA goes down • 1. Identify possible failures 2. Rate risk of each failure (impact x likelihood) 3. Design resiliency strategy: Detection, Recovery, Diagnostics
  • 11. Rack awareness • Web tier: Availability set • Middle tier: Availability set • Data tier: Availability set • Fault domain 1, Fault domain 2, Fault domain 3 • Replica #1, Replica #1, Replica #2 • Shard #1, Shard #2
  • 12. Load balance multiple instances Application gateway for - L7 routing - SSL termination
  • 13. Failover / Failback • Traffic manager: Priority routing method • Primary region: Web, Application, Data • Secondary region (regional pair): Web, Application, Data • Automated failover • Manual failback
  • 14. Data replication • Azure storage Geo replica (RA-GRS) • LocationMode = PrimaryThenSecondary • LocationMode = SecondaryOnly • Periodically check if it’s back online
  • 15. Retry transient failures See ‘Azure retry guidance’ for more details < E2E latency requirement
  • 16. Circuit Breaker • User, Your application, Remote service (Failed) • Hold resources while retrying operation • Lead to cascading failures
  • 18. Bulkhead • Service A, Service B, Service C (thread pool per service) • Workload 1, Workload 2 (thread pool per workload) • Resources: Memory, CPU, Disk, Thread pool, Connection pool, Network connection
  • 19. Other design patterns for resiliency • Compensating transaction • Scheduler-agent-supervisor • Throttling • Load leveling • Leader election See ‘Cloud design patterns’
  • 20. Principles of chaos engineering • Build a hypothesis around steady state behavior • Vary real-world events • Run experiments in production • Automate experiments to run consistently • http://principlesofchaos.org/ • Diagram: Feed production traffic to a Control Group and an Experimental Group (HW/SW failures, Spike in traffic); Verify difference in terms of steady state
  • 21. Testing for resiliency • Fault injection testing: Shut down VM instances; Crash processes; Expire certificates; Change access keys; Shut down the DNS service on domain controllers; Limit available system resources, such as RAM or number of threads; Unmount disks; Redeploy a VM • Load testing: Use production data as much as you can; VSTS, JMeter • Soak testing: Longer period under normal production load
  • 22. Blue/Green and Canary release • Blue/Green Deployment: Current version (Web, App, DB), New version (Web, App, DB) • Canary release: Current version (Web, App, DB) 90%, New version (Web, App, DB) 10% • Load Balancer • Reverse Proxy
  • 23. Deployment slots at App Service
  • 24. Dark launching New feature Toggle enable/disable User Interface Production environment
  • 27. Resiliency / High Availability / Disaster Recovery Throttling Circuit breaker Zero downtime deployment Eventual consistency Data restore Retry Graceful degradation Geo-replica Multi-region deployment
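The composite SLA arithmetic on slide 8 can be checked with a short sketch. The function names below are hypothetical (they are not from the deck), and the sketch assumes component failures are independent, which is the same simplification the slide's formulas make.

```python
def serial_sla(*slas):
    """Composite SLA when every component must be up: multiply availabilities."""
    result = 1.0
    for sla in slas:
        result *= sla
    return result


def fallback_sla(*slas):
    """Composite SLA when any one of several independent paths is enough.

    The system is down only if all paths are down at the same time,
    so the failure probabilities are multiplied and subtracted from 1.
    """
    all_down = 1.0
    for sla in slas:
        all_down *= 1.0 - sla
    return 1.0 - all_down


def two_region_sla(region_sla, traffic_manager_sla):
    """Slide formula: (1 - (1 - N)(1 - N)) x Traffic manager SLA."""
    return fallback_sla(region_sla, region_sla) * traffic_manager_sla


if __name__ == "__main__":
    # 99.95% x 99.99%
    print(f"{serial_sla(0.9995, 0.9999):.6f}")
    # 1.0 - (0.0001 x 0.001)
    print(f"{fallback_sla(0.9999, 0.999):.7f}")
    # Two regions at 99.95% each behind a 99.99% traffic manager
    print(f"{two_region_sla(0.9995, 0.9999):.8f}")
```

The second region removes most of the regional downtime, but the result is still bounded by the traffic manager's own SLA, because that component sits in series with both regions.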

Editor's notes

  1. Everybody is talking about it, but its definition is not clear. I’ll clarify what it means. Why is everybody talking about it? There are a number of reasons. The main part of this topic is how to make your app resilient. I’ll show you some examples from the checklist.
  2. DR? Data backup? These are all true statements, but none of them clearly defines what resiliency means. In order to be HA, an app doesn’t need to go down and come back online. If your app is running w/ 100% uptime w/o any failures, it’s HA, but you never know if it’s resilient. Once something bad happens, it may take days to come back online, which is not really resilient at all. DR applies to a catastrophic failure, such as something that could take down an entire DC. For example..
  3. Why is it so important? Why is everybody talking about resiliency? Transient faults happen because of commodity HW, networking, and the multi-tenant shared model. Remote services could go down at any time. 99.99% means about 4 mins of downtime a month. Do you want to sit down and wait for 4 minutes or do something else? I’d rather do something, because you never know whether it’s going to be 4 minutes or 4 hours. Based on the assumption that anything can go wrong at some point, focus has been shifting from MTBF to MTTR.
  4. We’re getting into the more interesting part. We discussed what resiliency means and why it’s so important. Now we’re getting into the ‘how’ part. This is the process to improve resiliency in your system, in 7 steps from plan to respond. Let’s talk about each step. Clearly define your requirements; otherwise you don’t know what you’re aiming for. Identify all possible failures you may see, and implement recovery strategies to bounce back from these failures. To make sure these strategies work, you need to test them by injecting failures. Deployment needs to be resilient too, because deploying a new version is the most common cause of failures. Monitoring is key to QoS. Monitor errors, latency, throughput etc. in percentiles. You need to take action quickly to mitigate the downtime.
  5. There are two common requirements when it comes to resiliency. RPO: defines the interval of data backup. RTO: defines the requirements for hot/warm/cold stand-by. MTO: how long a particular business process can be down.
  6. If you look at well-experienced customers, they define availability requirements per use case. Decompose your workload and define availability requirements (uptime, latency etc.) for each part. A higher SLA comes with cost because of redundant services/components. Measuring downtime will become an issue when you target five nines.
  7. The fact that App Service offers 99.95% doesn’t mean that the entire system has 99.95%. The other important fact is that an SLA doesn’t guarantee that the service is always up 99.95% of the time. If we don’t meet the SLA, you get money back. It’s not just a numbers game. This is where resiliency comes into play. The definition of SLA varies depending on the service.
  8. In order to design your app to be resilient, you need to identify all possible failures first. Then implement resiliency strategies against them.
  9. To help you identify all possible failures, we published a list of the most common failures on Azure. It has a few items per service, 30 to 40 items in total. Let’s take a look. In the case of DocumentDB, when you fail to read data from it, the client SDK retries the operation for you. The only transient fault it retries against is throttling (429). If you constantly get 429, consider increasing its scale (RU). DocumentDB now supports geo-replica. If the primary region fails, it will switch traffic to other regions in the list you configure. For diagnostics, you need to log all errors at the client side.
  10. You can think of a rack as a power module. If it goes down, everything that belongs to it goes down together. So it’s better to distribute VMs across different racks for the sake of redundancy. This is where availability sets come into play. Each machine in the same AS belongs to a different rack. VMSS automatically puts VMs in 5 FDs and 5 UDs, but it doesn’t support data disks yet.
  11. Avoiding SPOF is critical for resiliency. Many customers still don’t know these basics. They deploy a critical workload on a single machine. To avoid that, you need to have redundant components, so that when one goes down the others are still running. In this case, put the VMs of the same tier into the same availability set behind an LB. The LB distributes requests to the VMs in the backend address pool. The health probe can be either HTTP or TCP depending on the workload. By default it pings the root path ‘/’. You may want to expose a health endpoint to monitor all critical components.
  12. There’s a risk of data loss in FO; take a snapshot and ensure data integrity.
  13. For less frequent transient faults, set the property to PrimaryThenSecondary. It’ll switch to the secondary region for you. For more frequent or non-transient faults, set the property to SecondaryOnly; otherwise it keeps hitting the primary and getting errors from it. You need to monitor the primary region; when it comes back, set the property back to PrimaryOnly or PrimaryThenSecondary. One thing to note is that Azure storage wouldn’t fail over to the secondary until a region-wide disaster happens, which I don’t think we have had yet. This strategy is applicable to reads, not writes.
  14. Let’s take a look at a few resiliency strategies to recover from the failures you identified above. Use exponential back-off for non-interactive transactions. Use quick linear retry for interactive transactions. Anti-patterns: cascading retry (5x5 = 25); more than one immediate retry; many attempts at a regular interval (randomize the interval instead).
  15. People often say don’t waste your time, let’s circuit break and fail fast. That is only a part of the problem. The real issue is cascading failures. Also, if you keep retrying failed operations, the remote service can’t recover from the failed state.
  16. The types of resources to isolate are not limited to these, but these are the most common ones.
  17. Given the chaotic nature of the cloud and distributed systems, something is always happening somewhere, so it makes sense to follow chaos engineering principles. Define the steady state as the measurable output of a system, rather than internal attributes of the system. Introduce real-world chaotic events such as HW failure, SW failure, spike in traffic etc. The best way to validate the system at production scale is to run experiments in production. Netflix injects faults into one of their regions at least once a month to see if their system can keep up and running. Since it’s such a time-consuming task, you should automate the experiments and run them continuously. Chaos engineering is not testing; it’s validation of the system. https://www.youtube.com/watch?v=Q4nniyAarbs
  18. Tools: Chaos Monkey/Kong, Toxiproxy. https://en.wikipedia.org/wiki/Soak_testing
  19. Deploy the current and new versions into two identical environments (blue, green). Do a smoke test on the new version, then switch traffic to it. A canary release incrementally switches traffic from the current to the new version using an LB. Use Akamai or equivalent to do canary. The unique name comes from a tactic used by coal miners: they’d bring canaries with them into the coal mines to monitor the levels of carbon monoxide in the air; if the canary died, they knew that the level of toxic gas in the air was high, and they’d leave the mines. In either case you should be able to roll back if the new version doesn’t work. Graceful shutdown and switching DB/Storage are the challenges. GitHub routes requests to both blue and green and compares the results to make sure they are identical. Dark launch: deploy new features without enabling them for users. Make sure they won’t cause any issues in production, then enable them.
  20. This is how it works in App Service. You can have up to 15 deployment slots
  21. Deploy a new feature to the prod env without enabling it for users. Make sure it works within the prod infrastructure: no memory leaks, no nothing. Then enable it for users in the UI. If something bad happens, disable it in the UI. Facebook does this.
  22. All other proven practices are in this doc. You can use this list when you have ADR with your customers. Give us feedback.
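The retry guidance in the notes above (exponential back-off, a randomized interval, and a bounded number of attempts) can be sketched as follows. This is a minimal illustration and not the Azure SDK's retry policy: `TransientError`, `backoff_delays`, and `call_with_retry` are hypothetical names, and the default delays are arbitrary assumptions.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for faults worth retrying (for example, throttling)."""


def backoff_delays(max_attempts, base_delay=0.5, max_delay=30.0, rng=random.random):
    """Yield the wait before each retry: exponential growth, capped, randomized.

    The randomization (jitter) keeps many clients from retrying in lockstep.
    """
    for attempt in range(max_attempts - 1):
        ceiling = min(max_delay, base_delay * (2 ** attempt))
        yield rng() * ceiling


def call_with_retry(operation, max_attempts=4, sleep=time.sleep, **backoff):
    """Run operation, retrying only TransientError, up to max_attempts total calls."""
    delays = backoff_delays(max_attempts, **backoff)
    while True:
        try:
            return operation()
        except TransientError:
            delay = next(delays, None)
            if delay is None:
                raise  # attempts exhausted: surface the failure to the caller
            sleep(delay)
```

Only one layer of a call chain should own the retry loop; if two layers each retry five times, the innermost operation can run 25 times, which is the cascading retry anti-pattern the notes warn about. The total of all delays should also stay below the end-to-end latency requirement.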