Designing apps for resiliency
1. Designing Apps for Resiliency
Masashi Narumoto
Principal Lead PM
AzureCAT patterns & practices
2. Agenda
• What is ’resiliency’?
• Why is it so important?
• Process to improve resiliency
• Resiliency checklist
3. What is ‘Resiliency’?
• Resiliency is the ability to recover from failures and continue to
function. It's not about avoiding failures, but responding to failures in
a way that avoids downtime or data loss.
• High availability is the ability of the application to keep running in a
healthy state, without significant downtime.
• Disaster recovery is the ability to recover from rare but major incidents:
Non-transient, wide-scale failures, such as service disruption that affects an
entire region.
4. Why is it so important?
• More transient faults in the cloud
• Dependent service may go down
• SLA < 100% means something could go wrong at some point
• More focus on MTTR (mean time to recover) than MTBF (mean time between failures)
5. Process to improve resiliency
• Plan: define requirements
• Design: identify failures
• Implement: implement recovery strategies
• Test: inject failures, simulate failover
• Deploy: deploy apps in a reliable manner
• Monitor: monitor failures
• Respond: take actions to fix issues
6. Defining resiliency requirements
Timeline: periodic data backups → major incident occurs → service recovered → business recovered
• RPO (Recovery Point Objective): the maximum period in which data might be lost, i.e. the interval between the last data backup and the incident
• RTO (Recovery Time Objective): the duration within which the service must be restored after an incident
• MTO (Maximum Tolerable Outage): the maximum time the business can be down before it is recovered
8. Composite SLA
Two dependent services with SLAs of 99.95% and 99.99%:
Composite SLA = 99.95% × 99.99% = 99.94%
Fallback action: return data from a local cache when the database is unavailable
Composite SLA for the data path = 1.0 − (0.0001 × 0.001) = 99.99999%
Composite SLA for two regions = (1 − (1 − N) × (1 − N)) × Traffic Manager SLA
Example with N = 99.95% and a Traffic Manager SLA of 99.99%: (1 − (1 − 0.9995) × (1 − 0.9995)) × 0.9999 = 0.999899
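These compositions are plain probability arithmetic, so they are easy to sanity-check. A minimal sketch reproducing the numbers above (the 99.9% cache availability is an assumption implied by the 0.001 term in the formula):

```python
# Sketch: reproducing the composite-SLA arithmetic from the slide.
web_app = 0.9995   # first service's SLA
database = 0.9999  # second service's SLA
cache = 0.999      # cache availability implied by the 0.001 term (assumption)

# Serial dependency: both services must be up.
composite = web_app * database
print(f"web + database: {composite:.4%}")                # ~99.94%

# Fallback: the data path fails only if the database AND the cache fail.
data_path = 1.0 - (1.0 - database) * (1.0 - cache)
print(f"database with cache fallback: {data_path:.5%}")  # 99.99999%

# Two regions behind Traffic Manager (SLA 99.99%).
n = 0.9995
two_regions = (1.0 - (1.0 - n) * (1.0 - n)) * 0.9999
print(f"two regions via Traffic Manager: {two_regions:.8f}")  # 0.99989975
```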
9. Designing for resiliency
1. Identify possible failures, e.g.:
• Reading data from SQL Server fails
• A web server goes down
• An NVA (network virtual appliance) goes down
2. Rate the risk of each failure (impact × likelihood)
3. Design a resiliency strategy:
- Detection
- Recovery
- Diagnostics
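To make step 2 concrete, a minimal sketch that ranks the failures listed above by impact × likelihood; the 1-5 scores are purely illustrative assumptions, not part of the original material:

```python
# Sketch: rating failure modes by impact x likelihood (scores are illustrative).
failure_modes = [
    # (description, impact 1-5, likelihood 1-5)
    ("Reading data from SQL Server fails", 4, 3),
    ("A web server goes down",             3, 4),
    ("An NVA goes down",                   5, 2),
]

# Highest-risk failures first: design recovery strategies for these first.
ranked = sorted(failure_modes, key=lambda f: f[1] * f[2], reverse=True)
for description, impact, likelihood in ranked:
    print(f"risk={impact * likelihood:2d}  {description}")
```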
13. Failover / Failback
Traffic Manager with the priority routing method routes traffic to the primary region.
Primary region: Web, Application, and Data tiers.
Secondary region (regional pair): the same Web, Application, and Data tiers.
Failover to the secondary region is automated; failback to the primary is manual.
14. Data replication – Azure Storage
Geo-replica (RA-GRS)
LocationMode = PrimaryThenSecondary or LocationMode = SecondaryOnly
Periodically check whether the primary is back online
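A minimal sketch of this fallback-read pattern, deliberately not tied to any specific storage SDK: the primary/secondary readers and the health probe are injected callables that stand in for RA-GRS reads against the primary and secondary endpoints (the class, threshold, and helper names are assumptions for illustration).

```python
# Sketch of the fallback-read pattern: read from the primary, fall back to the
# secondary, stop hitting the primary after repeated failures, and switch back
# once a periodic probe says the primary is healthy again.
class GeoRedundantReader:
    PRIMARY_THEN_SECONDARY = "PrimaryThenSecondary"
    SECONDARY_ONLY = "SecondaryOnly"

    def __init__(self, read_primary, read_secondary, failure_threshold=5):
        self.read_primary = read_primary        # callable: read from primary endpoint
        self.read_secondary = read_secondary    # callable: read from secondary endpoint
        self.failure_threshold = failure_threshold
        self.location_mode = self.PRIMARY_THEN_SECONDARY
        self.primary_failures = 0

    def read(self, key):
        if self.location_mode == self.PRIMARY_THEN_SECONDARY:
            try:
                value = self.read_primary(key)
                self.primary_failures = 0
                return value
            except Exception:
                # Frequent or non-transient failures: stop hitting the primary.
                self.primary_failures += 1
                if self.primary_failures >= self.failure_threshold:
                    self.location_mode = self.SECONDARY_ONLY
        return self.read_secondary(key)

    def check_primary(self, probe):
        # Call periodically (e.g. on a timer); when the primary is healthy
        # again, switch back to PrimaryThenSecondary.
        if self.location_mode == self.SECONDARY_ONLY and probe():
            self.location_mode = self.PRIMARY_THEN_SECONDARY
            self.primary_failures = 0
```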
18. Bulkhead
Isolate resources so a failure in one service or workload cannot exhaust resources needed by the others:
• A dedicated thread pool per downstream service (Service A, Service B, Service C)
• A dedicated thread pool per workload (Workload 1, Workload 2)
Resources to isolate: memory, CPU, disk, thread pools, connection pools, network connections
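A minimal sketch of the bulkhead idea in application code, assuming three hypothetical downstream services: each gets its own small thread pool, so a hang in one cannot exhaust the threads used to call the others.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch: one small, dedicated thread pool per downstream service (bulkhead).
# Service names and pool sizes are illustrative assumptions.
bulkheads = {
    "service_a": ThreadPoolExecutor(max_workers=10),
    "service_b": ThreadPoolExecutor(max_workers=10),
    "service_c": ThreadPoolExecutor(max_workers=10),
}

def call(service, func, *args):
    # Run the call on the pool reserved for that service. If service_b hangs,
    # only its 10 workers block; calls to service_a and service_c keep flowing.
    return bulkheads[service].submit(func, *args)
```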
19. Other design patterns for resiliency
• Compensating transaction
• Scheduler-agent-supervisor
• Throttling
• Load leveling
• Leader election
See ‘Cloud design patterns’
20. Principles of chaos engineering
• Build hypothesis around steady state behavior
• Vary real-world events
• Run experiments in production
• Automate experiments to run consistently
http://principlesofchaos.org/
Diagram: production traffic is fed to both a control group and an experimental group; real-world events such as HW/SW failures and spikes in traffic are injected into the experimental group, and the difference from the control group is verified in terms of steady-state behavior.
21. Testing for resiliency
• Fault injection testing
• Shut down VM instances
• Crash processes
• Expire certificates
• Change access keys
• Shut down the DNS service on domain controllers
• Limit available system resources, such as RAM or number of threads
• Unmount disks
• Redeploy a VM
• Load testing
• Use production data as much as you can
• VSTS, JMeter
• Soak testing
• Longer period under normal production load
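As a toy illustration of the "crash processes" item in the fault-injection list above (not a real fault-injection framework): spin up a few dummy worker processes, kill one at random, then verify that the system recovers. Everything here is an illustrative stand-in; real fault injection would target your actual services in a test environment.

```python
import random
import subprocess
import sys
import time

# Toy sketch: start dummy workers, crash one at random, then observe recovery.
workers = [
    subprocess.Popen([sys.executable, "-c", "import time; time.sleep(300)"])
    for _ in range(3)
]

time.sleep(5)                  # let the "system" reach a steady state
victim = random.choice(workers)
victim.kill()                  # inject the failure
print(f"killed worker pid={victim.pid}; now verify the system recovers")

for w in workers:              # clean up the remaining dummy workers
    w.kill()
```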
22. Blue/Green and Canary release
Blue/Green deployment: two identical environments (Web, App, DB), one running the current version and one running the new version; a load balancer or reverse proxy switches traffic from the current version to the new one.
Canary release: the load balancer or reverse proxy sends 90% of traffic to the current version and 10% to the new version.
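The canary routing decision itself is just weighted selection; a minimal sketch below, where the endpoint URLs are hypothetical and in practice this weighting lives in the load balancer, reverse proxy, or CDN rather than in application code.

```python
import random

# Sketch: send ~10% of requests to the new version, the rest to the current one.
CURRENT_VERSION = "https://app-current.example.com"  # hypothetical endpoint
NEW_VERSION = "https://app-new.example.com"          # hypothetical endpoint
CANARY_WEIGHT = 0.10

def pick_backend():
    # Increase CANARY_WEIGHT gradually as confidence in the new version grows.
    return NEW_VERSION if random.random() < CANARY_WEIGHT else CURRENT_VERSION
```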
27. Resiliency / High Availability / Disaster Recovery
• Throttling
• Circuit breaker
• Zero-downtime deployment
• Eventual consistency
• Data restore
• Retry
• Graceful degradation
• Geo-replica
• Multi-region deployment
Editor's notes
Everybody is talking about it, but its definition is not clear. I'll clarify what it means.
Why is everybody talking about it? There are a number of reasons.
The main part of this topic is how to make your app resilient.
I'll show you some examples from the checklist.
DR? Data backup? These are all true statements but none of them clearly define what resiliency means.
To be HA, an app doesn't need to go down and come back online. If your app is running with 100% uptime without any failures, it's HA, but you never know if it's resilient. Once something bad happens, it may take days to come back online, which is not resilient at all.
DR is about catastrophic failures, such as something that could take down an entire DC.
For example..
Why is it so important? Why is everybody talking about resiliency?
Transient faults happen because of commodity HW, networking, and the multi-tenant shared model.
Remote services could go down at any time.
99.99% means 4 minutes of downtime a month. Do you want to sit and wait for 4 minutes, or do something else?
I'd rather do something, because you never know whether it's going to be 4 minutes or 4 hours.
Based on the assumption that anything can go wrong at some point, the focus has been shifting from MTBF to MTTR.
We're getting into the more interesting part.
We discussed what resiliency means and why it's so important.
Now we’re getting into ‘how’ part.
This is the process to improve resiliency in your system in 7 steps from plan to respond.
Let’s talk about each step.
Clearly define your requirements, otherwise you don’t know what you’re aiming for
Identify all possible failures you may see, and implement recovery strategies to bounce back from those failures.
To make sure these strategies work, you need to test them by injecting failures
Deployment needs to be resilient too, because deploying a new version is the most common cause of failures.
Monitoring is key to QoS. Monitor errors, latency, throughput, etc. in percentiles.
You need to take actions quickly to mitigate the downtime
There are a few common requirements when it comes to resiliency.
RPO: defines the interval of data backup
RTO: defines the requirements for hot/warm/cold stand-by
MTO: how long a particular business process can be down
If you look at well-experienced customers, they define availability requirements for each use case.
Decompose your workload and define availability requirements (uptime, latency, etc.) for each part.
A higher SLA comes at a cost because of redundant services/components.
Measuring downtime becomes an issue when you target five nines.
The fact that App Service offers 99.95% doesn’t mean that the entire system has 99.95%.
Another important fact is that the SLA doesn't guarantee the service is always up 99.95% of the time. You'll get money back when it violates the SLA.
It’s not just a number game. This is where resiliency comes into play.
SLA is not guaranteed. If we don’t meet SLA, you get money back.
Definition of SLA varies depending on the service.
In order to design your app to be resilient, you need to identify all possible failures first.
Then implement resilient strategies against them,
To help you identify all possible failures, we published a list of the most common failures on Azure.
It has a few items for each service, 30 to 40 items in total. Let's take a look.
In the case of DocumentDB: when you fail to read data from it, the client SDK retries the operation for you.
The only transient fault it retries against is throttling (429).
If you constantly get 429, consider increasing its scale (RUs).
DocumentDB now supports geo-replication. If the primary region fails, it will switch traffic to the other regions in the list you configure.
For diagnostics, you need to log all errors on the client side.
You can think of a rack as a power module.
If it goes down, everything that belongs to it goes down together.
So it's better to distribute VMs across different racks for redundancy's sake.
This is where availability sets come into play.
Each machine in the same AS belongs to a different rack.
VMSS automatically puts VMs into 5 FDs and 5 UDs, but it doesn't support data disks yet.
Avoiding SPOF is critical for resiliency.
Many customers still don't know these basics. They deploy critical workloads on a single machine.
For that, you need redundant components: one goes down, but the others are still running.
In this case, put the VMs in the same tier into the same availability set behind an LB.
The LB distributes requests to the VMs in the backend address pool.
The health probe can be either HTTP or TCP depending on the workload.
By default it pings the root path '/'. You may want to expose a health endpoint that monitors all critical components.
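A minimal sketch of such a health endpoint using only the Python standard library; check_database and check_cache are hypothetical placeholders for your own dependency checks, and /health and port 8080 are illustrative choices.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical dependency checks: replace with real probes of your components.
def check_database():
    return True

def check_cache():
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # Report healthy only when every critical dependency check passes.
            healthy = check_database() and check_cache()
            self.send_response(200 if healthy else 503)
        else:
            self.send_response(404)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```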
There's a risk of data loss in a failover; take a snapshot and ensure data integrity.
For less frequent transient faults, set the property to PrimaryThenSecondary. It'll switch to the secondary region for you.
For more frequent or non-transient faults, set the property to SecondaryOnly; otherwise it keeps hitting the primary and getting errors.
You need to monitor the primary region; when it comes back, set the property back to PrimaryOnly or PrimaryThenSecondary.
One thing to note is that Azure Storage won't fail over to the secondary until a region-wide disaster happens, which I don't think we have had yet.
This strategy applies to reads, not writes.
Let's take a look at a few resiliency strategies to recover from the failures you identified above.
Exponential back-off for non-interactive transactions
Quick linear retry for interactive transactions
Anti-patterns:
Cascading retries (5 x 5 = 25)
More than one immediate retry
Many attempts at a regular interval (randomize the interval instead)
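A minimal retry sketch following these guidelines: exponential back-off with a randomized (jittered) interval and a capped number of attempts. The function and parameter names are illustrative; `operation` is any callable that raises on a transient fault.

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.5):
    # Exponential back-off with jitter; avoids the fixed-interval anti-pattern.
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise                                  # give up after the last attempt
            delay = base_delay * (2 ** (attempt - 1))  # 0.5s, 1s, 2s, 4s, ...
            time.sleep(delay * random.uniform(0.5, 1.5))
```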
People often say don't waste your time: circuit-break and fail fast.
That is only part of the problem; the real issue is cascading failures.
Also, if you keep retrying failed operations, the remote service can't recover from its failed state.
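A minimal circuit-breaker sketch to go with that point: after a threshold of consecutive failures it fails fast for a cool-down period, giving the remote service a chance to recover, then lets a trial call through. Class and parameter names are illustrative.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None          # half-open: allow one trial call
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # open the circuit
            raise
        self.failures = 0                  # success closes the circuit
        return result
```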
The types of resources to isolate are not limited to these, but they are the most common ones.
Given the chaotic nature of the cloud and distributed systems, something is always happening somewhere.
It makes sense to follow chaos engineering principles.
Define the steady state as the measurable output of a system, rather than internal attributes of the system.
Introduce real-world chaotic events such as HW failures, SW failures, spikes in traffic, etc.
The best way to validate the system at production scale is to run experiments in production.
Netflix, at least once a month, injects faults into one of their regions to see if their system keeps running.
Since it's such a time-consuming task, you should automate the experiments and run them continuously.
Chaos engineering is not testing; it's validation of the system.
https://www.youtube.com/watch?v=Q4nniyAarbs
Deploy the current and new versions into two identical environments (blue, green).
Do a smoke test on the new version, then switch traffic to it.
A canary release incrementally switches traffic from the current version to the new one using the LB.
Use Akamai or an equivalent to do a canary release.
The unique name for this environment comes from a tactic used by coal miners: they’d bring canaries with them into the coal mines to monitor the levels of carbon monoxide in the air; if the canary died, they knew that the level of toxic gas in the air was high, and they’d leave the mines.
In either case you should be able to roll back if the new version doesn't work.
Graceful shutdown and switching the DB/storage are the challenges.
GitHub routes requests to both blue and green and compares the results, making sure they are identical.
Dark launch: deploy a new feature without enabling it for users. Make sure it won't cause any issues in production, then enable it.
This is how it works in App Service. You can have up to 15 deployment slots.
Deploy a new feature to the prod environment without enabling it for users.
Make sure it works within the prod infrastructure: no memory leaks, nothing.
Then enable it for users in the UI. If something bad happens, disable it in the UI.
Facebook does this.
All other proven practices are in this doc.
You can use this list when you have an ADR with your customers. Give us feedback.