Patterns and practices for building resilient serverless applications.pdf

Patterns and Practices for building resilient serverless applications
presented by Yan Cui

@theburningmonk


resilience
/rɪˈzɪlɪəns/
noun
“the capacity to recover quickly from difficulties; toughness.”

it’s not about preventing failures!

everything fails, all the time

we need to build applications that can withstand failures

don’t run your application on one server…

entire data centers can go down…

run your application in multiple AZs and regions

Failures on load: exhaustion of resources
[Chart: latency against reqs/s, annotated “CPU saturation”]

Failures in distributed systems
[Diagram: user, Service A, Service B, Service C]
the user gets an HTTP 502
user: “You suck!”

microservices death stars circa 2015

Yan Cui
http://theburningmonk.com
@theburningmonk
AWS user for 10 years

http://bit.ly/yubl-serverless
Developer Advocate @
Independent Consultant: advise, training delivery

Lambda execution environment

Serverless - multiple AZs out of the box
Total resources created:
1 API Gateway
1 Lambda
don’t pay for idle redundant resources!

Load balancing

Data replication in different AZs
DynamoDB
Global Tables

There is throttling everywhere!

Beware of timeout mismatch
API Gateway integration timeout: default 29s
Lambda timeout: max 15 minutes

Lambda timeout: max 15 minutes
SQS visibility timeout: default 30s, min 0s, max 12 hours
set VisibilityTimeout to 6x Lambda timeout
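
A minimal sketch of the timeout rules above, as plain arithmetic. The function names are made up for illustration; the limits are the ones quoted on the slides (29s, 15 minutes, 12 hours).

```python
from datetime import timedelta

# Limits quoted on the slides.
API_GATEWAY_INTEGRATION_TIMEOUT = timedelta(seconds=29)
LAMBDA_MAX_TIMEOUT = timedelta(minutes=15)
SQS_MAX_VISIBILITY_TIMEOUT = timedelta(hours=12)


def recommended_visibility_timeout(lambda_timeout: timedelta) -> timedelta:
    """Apply the 6x rule of thumb, capped at the SQS maximum."""
    if lambda_timeout <= timedelta(0) or lambda_timeout > LAMBDA_MAX_TIMEOUT:
        raise ValueError("Lambda timeout must be above 0 and at most 15 minutes")
    return min(6 * lambda_timeout, SQS_MAX_VISIBILITY_TIMEOUT)


def outlives_api_gateway(lambda_timeout: timedelta) -> bool:
    """True when the function may still be running after API Gateway gave up."""
    return lambda_timeout > API_GATEWAY_INTEGRATION_TIMEOUT
```

For a function with a 30s timeout this gives a 180s visibility timeout, and flags that the function can outlive the 29s API Gateway integration timeout.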

Offload computing operations to queues
better absorb downstream problems
need a way to replay DLQ events
https://www.npmjs.com/package/lumigo-cli
great for fire-and-forget tasks

“what if the client is waiting for a response?”

“Decoupled Invocation”

[Diagram sequence: a “task results” table and a client polling for the result]
task id   created at   result
xxx       xxx          <null>
xxx       xxx          <null>
…         …            …
client polls: not ready… → 202
worker: reporting for duty!
worker: working hard… client polls again: not ready… → 202
worker: done! the result column now holds { … }
client polls again: 200 { … }
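
A rough sketch of the decoupled invocation flow above, with an in-memory dict standing in for the task results table. Everything here is hypothetical: the function names, the `taskId` field and the 404 for unknown ids are assumptions, not something the slides specify.

```python
import uuid
from typing import Any, Callable, Dict, Optional, Tuple

# In-memory stand-in for the "task results" table (assumption: a real
# system would keep this in a database).
task_results: Dict[str, Optional[Any]] = {}


def submit_task() -> Tuple[int, Dict[str, str]]:
    """Accept the request, record a pending task, reply 202 with its id."""
    task_id = str(uuid.uuid4())
    task_results[task_id] = None  # result stays <null> until the worker is done
    return 202, {"taskId": task_id}


def worker(task_id: str, do_work: Callable[[], Any]) -> None:
    """Do the slow work out of band, then store the result."""
    task_results[task_id] = do_work()


def get_result(task_id: str) -> Tuple[int, Optional[Any]]:
    """Polling endpoint: 202 while not ready, 200 with the result when done."""
    if task_id not in task_results:
        return 404, None
    result = task_results[task_id]
    if result is None:
        return 202, None  # not ready... (this sketch treats None as pending)
    return 200, result
```

The client keeps calling `get_result` with the id it got back from `submit_task` until the status turns from 202 to 200.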

wait…

a distributed transaction!
needs rollback

how do you implement distributed transactions?

The Saga pattern
A pattern for managing failures where each action has a compensating action for rollback

The Saga pattern
https://www.youtube.com/watch?v=xDuwrtwYHu8
Begin transaction
Start book hotel request
End book hotel request
Start book flight request
End book flight request
Start book car rental request
End book car rental request
End transaction

model both actions and compensating actions as Lambda functions
use Step Functions as the coordinator for the saga
[Diagram: saga workflow, labelled “Input”]
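
A minimal sketch of the saga idea in plain Python, assuming each step is a pair of callables (action, compensating action). This is not Step Functions and not the speaker's implementation; `run_saga` and the step names are hypothetical, and a local loop plays the role of the coordinator.

```python
from typing import Callable, List, Tuple

# (name, action, compensating action)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run actions in order; on failure, compensate completed steps in reverse."""
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"{name}: failed")
            # Roll back what already succeeded, most recent first.
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"{done_name}: compensated")
            return log
        completed.append(step)
        log.append(f"{name}: done")
    return log
```

If the car rental booking fails after hotel and flight succeeded, the flight is cancelled first, then the hotel. This sketch does not run the compensating action of the step that failed; whether it should is a design choice that depends on whether the failed action may have partially succeeded.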

no distributed transactions
do the work here
retry-until-success
24 hours data retention
need alerting to ensure issues are addressed quickly

Mind the poison message

retry-until-success needs to deal with poison messages
6, 3, 1, 1, 1, 1, …
only count the “same” batch
have to fetch from the stream
do it before they expire from the stream!

Mind the partial failures
[Diagram: SQS, Poller, Lambda. On success: Delete. On error: Error, DLQ]
batch fails as a unit
ReportBatchItemFailures
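
A sketch of a Lambda handler for an SQS batch that reports only the failed messages. It assumes the event source mapping has `ReportBatchItemFailures` enabled; the `process` function and the `orderId` check are hypothetical business logic made up for the example.

```python
import json
from typing import Any, Dict, List


def process(body: Dict[str, Any]) -> None:
    """Hypothetical business logic: reject messages without an 'orderId'."""
    if "orderId" not in body:
        raise ValueError("missing orderId")


def handler(event: Dict[str, Any], context: Any = None) -> Dict[str, List[Dict[str, str]]]:
    """Process each SQS record; return the ids of the ones that failed."""
    failures: List[Dict[str, str]] = []
    for record in event["Records"]:
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Only this message goes back to the queue, not the whole batch.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

An empty `batchItemFailures` list means the whole batch succeeded; messages listed in it become visible again and are retried, while the rest are deleted.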

Mind the retry storm
[Diagram: Service A hit by retry, retry, retry, retry]

retry storm

circuit breaker pattern
After X consecutive timeouts, trip the circuit
When circuit is open, fail fast
but, allow 1 request through every Y mins
If request succeeds, close the circuit
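
A minimal in-memory sketch of the rules above: X is `max_failures`, Y is `retry_after_seconds`. The class and parameter names are invented for illustration, the clock is injectable so the behaviour can be checked without waiting, and the sketch assumes single-threaded use.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised instead of calling the downstream service while the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int, retry_after_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.retry_after_seconds = retry_after_seconds
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], Any]) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.retry_after_seconds:
                raise CircuitOpenError("circuit is open, failing fast")
            # Enough time has passed: let this one trial request through.
        try:
            result = fn()
        except TimeoutError:
            self.consecutive_failures += 1
            if self.opened_at is not None or self.consecutive_failures >= self.max_failures:
                # Trip the circuit, or restart the wait after a failed trial.
                self.opened_at = self.clock()
            raise
        # Success: close the circuit and reset the counter.
        self.consecutive_failures = 0
        self.opened_at = None
        return result
```

Wrap each downstream call as `breaker.call(lambda: call_service())`, where `call_service` is whatever function talks to the dependency.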

where do I keep the state of the circuit?

in-memory
PROS
simplicity
no dependency on external service
CONS
takes longer & more requests to stop all traffic
new containers would generate more traffic

external service
PROS
minimizes no. of total requests to trip circuit
new containers respect collective decision
CONS
complexity
dependency on an external service

which approach should I use?
It depends. Maybe start with the simplest solution first?

multi-region, active-active

[Diagram: Route53, API Gateway, Lambda, DynamoDB in us-east-1]
[Diagram: eu-west-1, us-east-1, us-west-1]
[Diagram: eu-west-1, us-east-1, us-west-1 with a Global Table]
[Diagram: us-east-1 and eu-central-1; SQS, Lambda, DynamoDB, Lambda, API Gateway; SNS in each region]
[Diagram: us-east-1 and eu-central-1 with SNS in each region and a dedupe step (D)]
[Diagram: us-east-1 and eu-central-1 with SNS in each region and a Global Table]

Multi-region architecture - benefits & tradeoffs
Protection against regional failures
Higher complexity
Very hard to test

MUST KILL SERVERS!

RAWR!!

“the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production”
principlesofchaos.org

“You don't choose the moment, the moment chooses you! You only choose how prepared you are when it does.”
Fire Chief Mike Burtch

by Russ Miles @russmiles

source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361

chaos monkey kills an EC2 instance
latency monkey induces artificial delay in APIs
chaos gorilla kills an AWS Availability Zone
chaos kong kills an entire AWS region

there are no servers to kill!
SERVERLESS

Serverless gives you a lot of built-in resilience, but it’s not infallible

improperly tuned timeouts

missing error handling

missing fallbacks

“what if DynamoDB has an elevated error rate?”

“what if service X has elevated latency?”
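
One way to ask these “what if” questions without servers to kill is to inject the failure into the function itself. The decorator below is a rough sketch under that assumption; `inject_failure` and its parameters are hypothetical names, and the sleep and random sources are injectable so the effect can be checked quickly.

```python
import random
import time
from functools import wraps
from typing import Any, Callable


def inject_failure(latency_seconds: float = 0.0, error_rate: float = 0.0,
                   sleep: Callable[[float], None] = time.sleep,
                   rng: Callable[[], float] = random.random) -> Callable:
    """Wrap a function so calls are delayed and/or fail with a given probability."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            if latency_seconds > 0:
                sleep(latency_seconds)  # simulate elevated latency
            if rng() < error_rate:
                raise RuntimeError("injected failure")  # simulate elevated error rate
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

Decorating a downstream call with `@inject_failure(latency_seconds=2.0)` or `@inject_failure(error_rate=0.2)` shows whether timeouts, error handling and fallbacks behave as intended.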

GOAL: identify weaknesses before they manifest in system-wide, aberrant behaviors

https://theburningmonk.com/hire-me
Advise
Training Delivery
“Fundamentally, Yan has improved our team by increasing our
ability to derive value from AWS and Lambda in particular.”
Nick Blair
Tech Lead

productionreadyserverless.com
August 25-26th

@theburningmonk
theburningmonk.com
github.com/theburningmonk
