@theburningmonk
@sarutule
"I COME BACK TO YOU NOW AT THE TURN OF THE TIDE"
WHAT IS RESILIENCE?
Failures in distributed systems
Failures on load: exhaustion of resources
You Shall Not Fail!™ (in the face of turbulent conditions)
what is RESILIENCE
chaos ENGINEERING
multi-region STRATEGIES
retries & TIMEOUTS
lambda SCALING
decoupled INVOCATION
PRODUCERS: Yan Cui (@theburningmonk), Sara Gerion (@sarutule)
SPEAKERS: Yan Cui (@theburningmonk), Sara Gerion (@sarutule)
SPEAKING AT: AWS Community Summit Online
SPECIAL THANKS: Phil Horn, Joe Park
Yan Cui
http://theburningmonk.com
@theburningmonk
AWS user for 10 years
Developer Advocate @
Independent Consultant
advise, training, delivery
SARA GERION
Italian living in Amsterdam, The Netherlands
Passionate about cloud, scalability, resilience
Twitter: @Sarutule
Backend engineer at DAZN
@dazneng
Director of Tech at SheSharp
@SheSharpNL
Lambda execution environment
Serverless - multiple AZs out of the box
Total resources created:
1 API Gateway
1 Lambda
Load balancing
Data replication
REST API - Lambda autoscaling
Concurrency limits:
3000 – US West (Oregon), US East (N. Virginia), Europe (Ireland)
1000 – Asia Pacific (Tokyo), Europe (Frankfurt)
500 – Other Regions
Later bursts: 500 new containers each minute
X execution environments pre-initialized (ready to respond to invocations)
Note: standard burst concurrency limits apply when over the provisioned capacity
Adjustable provisioned capacity based on CloudWatch metrics
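The burst figures on this slide can be turned into a quick back-of-the-envelope calculation. The following is a minimal sketch that uses only the numbers quoted above; the function names and the `accountLimit` parameter are hypothetical and introduced for illustration.

```javascript
// Increment quoted on the slide for "later bursts".
const PER_MINUTE_INCREMENT = 500;

// Upper bound on concurrent executions `minutes` after a spike starts.
// `initialBurst` is 3000, 1000 or 500 depending on the region (per the slide).
// `accountLimit` is a hypothetical account-level concurrency cap.
function maxConcurrencyAfter(minutes, initialBurst, accountLimit) {
  const scaled = initialBurst + PER_MINUTE_INCREMENT * Math.floor(minutes);
  return Math.min(scaled, accountLimit);
}

// Whole minutes needed to reach `target` concurrency from the initial burst.
function minutesToReach(target, initialBurst) {
  if (target <= initialBurst) return 0;
  return Math.ceil((target - initialBurst) / PER_MINUTE_INCREMENT);
}
```

Under these assumptions, a region with an initial burst of 500 needs several minutes to absorb a spike that a 3000-burst region handles at once, which is where throttling tends to appear.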

REST API - Lambda limitations & throttling
HOW TO SOLVE IT?
IT DEPENDS
The importance of retry policies
Scenario: client only needs an acknowledgement
If fast acknowledgement not possible…
Scenario: predictable spikes
Holidays, weekends, celebrations (Black Friday)
Planned launch of resources (new series available)
Sport events
Scenario: unpredictable spikes
Traffic generated by user actions
Jennifer Aniston’s first post
Possible mitigations for REST APIs
Use 1 Lambda for each endpoint
One Lambda function for each endpoint
Offload computing operations to queues
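One way to picture the offloading idea: the API handler only validates and enqueues the request, then acknowledges it, while a separate worker drains the queue at its own pace. The sketch below uses an in-memory stand-in for a queue such as SQS; `InMemoryQueue`, `acceptOrder` and `processPending` are hypothetical names and not part of any AWS API.

```javascript
// In-memory stand-in for a queue such as SQS (hypothetical, for illustration).
class InMemoryQueue {
  constructor() {
    this.messages = [];
  }
  async send(body) {
    this.messages.push(body);
  }
  receive() {
    return this.messages.shift();
  }
}

// Request path: validate, enqueue, acknowledge. No heavy work happens here.
async function acceptOrder(event, queue) {
  let order;
  try {
    order = JSON.parse(event.body);
  } catch (err) {
    return { statusCode: 400, body: JSON.stringify({ error: "invalid JSON" }) };
  }
  await queue.send(JSON.stringify(order));
  return { statusCode: 202, body: JSON.stringify({ status: "accepted" }) };
}

// Worker path: drains the queue independently of incoming API traffic.
function processPending(queue, handle) {
  let processed = 0;
  for (let msg = queue.receive(); msg !== undefined; msg = queue.receive()) {
    handle(JSON.parse(msg));
    processed += 1;
  }
  return processed;
}
```

The client receives a 202 as soon as the message is stored, so a traffic spike fills the queue instead of exhausting the concurrency available to the API.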
Possible mitigations for REST APIs
Use 1 Lambda for each endpoint
Optimise performance
Offload computing operations to an async flow (SQS, SNS, …)
Use provisioned capacity (plus autoscaling)
Raise limits with an AWS support ticket
Reminder: beware of long timeouts
API Gateway integration timeout - Default: 29s
Lambda timeout - Max: 15 minutes
SQS visibility timeout - Default: 30s, Min: 0s, Max: 12 hours
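These limits interact, so it can help to check a configuration against them before deploying. The sketch below encodes the numbers from the slide plus one rule of thumb that is an assumption here: the SQS visibility timeout should not be shorter than the timeout of the function that processes the messages. The function and field names are hypothetical.

```javascript
const API_GATEWAY_INTEGRATION_TIMEOUT_S = 29;      // from the slide
const LAMBDA_MAX_TIMEOUT_S = 15 * 60;              // from the slide
const SQS_MAX_VISIBILITY_TIMEOUT_S = 12 * 60 * 60; // from the slide

// Returns a list of human-readable warnings for a (hypothetical) config object.
function checkTimeouts({ lambdaTimeoutS, behindApiGateway, sqsVisibilityTimeoutS }) {
  const warnings = [];
  if (lambdaTimeoutS > LAMBDA_MAX_TIMEOUT_S) {
    warnings.push("Lambda timeout exceeds the 15 minute maximum");
  }
  if (behindApiGateway && lambdaTimeoutS > API_GATEWAY_INTEGRATION_TIMEOUT_S) {
    warnings.push("API Gateway gives up after 29s, before the function times out");
  }
  if (sqsVisibilityTimeoutS !== undefined) {
    if (sqsVisibilityTimeoutS > SQS_MAX_VISIBILITY_TIMEOUT_S) {
      warnings.push("SQS visibility timeout exceeds the 12 hour maximum");
    }
    // Assumed rule of thumb: a message should stay hidden while it is processed.
    if (sqsVisibilityTimeoutS < lambdaTimeoutS) {
      warnings.push("message can reappear in the queue while still being processed");
    }
  }
  return warnings;
}
```

For example, a 60-second function behind API Gateway is flagged because the client would see a gateway timeout while the function keeps running and keeps being billed.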
Single-region architectures
Multi-region: active-passive
Multi-region: active-active
Active-active & data replication
Multi-region architecture - benefits & tradeoffs
Protection against regional failures
Higher complexity
Very hard to test
CHAOS ENGINEERING
MUST KILL SERVERS!
RAWR!!
RAWR!!
“the discipline of experimenting on a system in order to build confidence in the
system’s capability to withstand turbulent conditions in production”
principlesofchaos.org
“You don't choose the moment, the moment chooses you!
You only choose how prepared you are when it does.”
Fire Chief Mike Burtch
GOAL: identify weaknesses before they manifest in system-wide, aberrant behaviors
HOW: learn about the system’s behavior by observing it during controlled experiments
game days
failure injection
MUST KILL SERVERS!
RAWR!!
RAWR!!
ahhhhhhh!!!!
HELP!!!
OMG!!!
F***!!!
phew!
STEP 1.
define steady state
i.e. “what does normal look like”
STEP 2.
hypothesize that steady state continues in both the control group and the experimental group
e.g. “the system stays up if a server dies”
STEP 3.
inject realistic failures
e.g. “slow response from 3rd-party service”
STEP 4.
try to disprove hypothesis
i.e. “look for differences between the control and experimental groups”
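The four steps can be sketched as a tiny experiment harness. This is a minimal illustration under stated assumptions, not a chaos tool: steady state is taken to be the fraction of calls that finish within a latency budget, and the injected failure is extra latency in front of a dependency. All names and parameters (`rate`, `delayMs`, `budgetMs`, `tolerance`) are hypothetical.

```javascript
// STEP 3: wrap a dependency so that a fraction of calls is artificially delayed.
function withLatencyInjection(fn, { rate, delayMs, random = Math.random }) {
  return async (...args) => {
    if (random() < rate) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
    return fn(...args);
  };
}

// STEP 1: steady state, measured as the share of calls finishing within budgetMs.
async function successRate(call, runs, budgetMs) {
  let ok = 0;
  for (let i = 0; i < runs; i++) {
    const started = Date.now();
    try {
      await call();
      if (Date.now() - started <= budgetMs) ok += 1;
    } catch (err) {
      // A thrown error counts against steady state.
    }
  }
  return ok / runs;
}

// STEPS 2 and 4: compare control and experimental groups against the hypothesis.
async function runExperiment(dependency, { runs, budgetMs, injection, tolerance = 0 }) {
  const control = await successRate(dependency, runs, budgetMs);
  const faulty = withLatencyInjection(dependency, injection);
  const experimental = await successRate(faulty, runs, budgetMs);
  return { control, experimental, hypothesisHolds: control - experimental <= tolerance };
}
```

A run where `hypothesisHolds` comes back `false` is the useful outcome: it points at a weakness, such as a missing timeout or fallback, found under controlled conditions.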
DON’T START EXPERIMENTS IN PRODUCTION
GOAL: identify weaknesses before they manifest in system-wide, aberrant behaviors
“Corporation X lost millions due to a chaos experiment that went wrong and destroyed key infrastructure, resulting in hours of downtime and unrecoverable data loss.”
Chaos Engineering doesn't cause problems. It reveals them.
Nora Jones
CONTAINMENT
run experiments during office hours
let others know what you’re doing, no surprises
avoid important dates
make the smallest change possible
have a rollback plan before you start
DON’T START EXPERIMENTS IN PRODUCTION
by Russ Miles @russmiles
source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
chaos monkey kills an EC2 instance
latency monkey induces artificial delay in APIs
chaos gorilla kills an AWS Availability Zone
chaos kong kills an entire AWS region
SERVERLESS: there are no servers to kill!
improperly tuned timeouts
missing error handling
missing fallbacks
“what if DynamoDB has an elevated error rate?”
hypothesis: the AWS SDK retries would handle it
runs experiment…
TIL: the js DynamoDB client defaults to 10 retries with base delay of 50ms
delay = Math.random() * (Math.pow(2, retryCount) * base)
this is Marc Brooker’s fav formula!
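To see why these defaults matter, the formula can be summed over all retries. The sketch below assumes `retryCount` starts at 0 for the first retry (an assumption, not stated on the slide); the helper names are hypothetical.

```javascript
// Delay before one retry, using the formula quoted on the slide.
function retryDelay(retryCount, base, random = Math.random) {
  return random() * (Math.pow(2, retryCount) * base);
}

// Upper bound on the total wait across `maxRetries` retries (random() close to 1).
function worstCaseTotalDelay(maxRetries, base) {
  let total = 0;
  for (let i = 0; i < maxRetries; i++) {
    total += Math.pow(2, i) * base;
  }
  return total;
}

// Average total wait, since Math.random() has a mean of 0.5.
function expectedTotalDelay(maxRetries, base) {
  return worstCaseTotalDelay(maxRetries, base) / 2;
}
```

Under that assumption, 10 retries with a 50ms base add up to at most 51150ms of waiting (25575ms on average), before counting the time spent on the requests themselves. Three retries with the same base wait at most 350ms.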
result: function times out after 6s
(hypothesis is disproved)
action: set max retry count + fallback
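A generic version of this action might look like the sketch below: cap the number of retries, back off with jitter between attempts, and return a fallback value once the attempts are used up. It is a standalone helper under stated assumptions rather than the AWS SDK's own retry mechanism; `withRetriesAndFallback` and its options are hypothetical names.

```javascript
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Runs `operation` up to maxRetries + 1 times, then falls back.
async function withRetriesAndFallback(operation, options) {
  const {
    maxRetries,
    baseDelayMs,
    fallback,
    sleep = defaultSleep,
    random = Math.random,
  } = options;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (attempt === maxRetries) break; // out of attempts, use the fallback
      // Exponential backoff with full jitter between attempts.
      await sleep(random() * Math.pow(2, attempt) * baseDelayMs);
    }
  }
  return fallback();
}
```

With a small `maxRetries`, the total time spent retrying stays well inside the function timeout, and the caller gets a degraded answer instead of a timeout.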
outcome: a more resilient system
“what if service X has elevated latency?”
hypothesis: our try-catch would handle it
runs experiment…
result: function times out after 6s
(hypothesis is disproved)
TIL: most HTTP client libraries have default timeout of 60s.
API Gateway has an integration timeout of 29s.
Most Lambda functions default to timeout of 3-6s.
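Because the client library's default timeout is far longer than the function's own, a try-catch never gets a chance to run. One possible mitigation is to derive the request timeout from the time left in the invocation. The sketch below assumes a Lambda-style `context` object exposing `getRemainingTimeInMillis()`; `withTimeout`, `requestTimeoutMs`, `callWithFallback` and the 500ms reserve are hypothetical.

```javascript
// Rejects if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Budget for a downstream call: remaining invocation time minus a reserve
// kept back for error handling and the fallback path.
function requestTimeoutMs(remainingMs, reserveMs = 500) {
  return Math.max(0, remainingMs - reserveMs);
}

// Calls `call()` with a timeout derived from the invocation's remaining time.
async function callWithFallback(context, call, fallback) {
  const budget = requestTimeoutMs(context.getRemainingTimeInMillis());
  try {
    return await withTimeout(call(), budget);
  } catch (err) {
    return fallback(err);
  }
}
```

With this in place the catch block runs before the function itself times out, so a slow dependency leads to a fallback response rather than a failed invocation.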
https://bit.ly/2Wvfort
outcome: a more resilient system
recap
Failures in distributed systems
Serverless - multiple AZs out of the box
Total resources created:
1 API Gateway
1 Lambda
Beware of timeouts
API Gateway integration timeout - Default: 29s
Lambda timeout - Max: 15 minutes
SQS visibility timeout - Default: 30s, Min: 0s, Max: 12 hours
Offload computing operations to queues
Active-active
“You don't choose the moment, the moment chooses you!
You only choose how prepared you are when it does.”
Fire Chief Mike Burtch

You shall not Fail! (in the face of turbulent conditions)