SlideShare une entreprise Scribd logo
1  sur  49
Télécharger pour lire hors ligne
How to Fail with Serverless
Jeremy Daly
CTO, AlertMe.news
@jeremy_daly
Jeremy Daly
• CTO at AlertMe.news
• Consult with companies building in the cloud
• 20+ year veteran of technology startups
• Started working with AWS in 2009 and started using
Lambda in 2015
• Blogger (jeremydaly.com), OSS contributor, speaker
• Publish the Off-by-none serverless newsletter
• Host of the Serverless Chats podcast
@jeremy_daly
Agenda
• Distributed systems and serverless
• Writing code FOR the cloud
• Failure modes in the cloud
• Serverless patterns to deal with failure
• Decoupling our services
@jeremy_daly
Distributed Systems are…
Systems whose components are located on
different networked computers, which communicate
and coordinate their actions by passing messages to
one another. ~Wikipedia
@jeremy_daly
They’re also really hard!
@jeremy_daly
EVERYTHING FAILS
ALL THE TIME
WernerVogels
CTO, Amazon.com
8 Fallacies of Distributed Computing
@jeremy_daly
The network
is reliable
Latency
is zero
Bandwidth
is infinite
The network is
secure
Topology
doesn’t change
There is one
administrator
Transport cost
is zero
The network is
homogeneous
🚫 🚫 🚫 🚫
🚫 🚫 🚫 🚫
Serverless applications are…
Distributed systems on steroids! 💪 💪 💪
@jeremy_daly
• Smaller, more distributed compute units
• Stateless, requiring network access to state
• Uncoordinated, requires buses, queues, pub/sub, state machines
• Heavily reliant on other networked cloud services
What does it mean to be Serverless?
• No server management
• Flexible scaling
• Pay for value / never pay for idle
• Automated high availability
• LOTS of configuration and knowledge of cloud services
• Highly event-driven
@jeremy_daly
Lots of services to communicate with!
@jeremy_daly
ElastiCache
RDS
EMR Amazon ES
Redshift
Fargate
Anything “on EC2”Lambda Cognito Kinesis
S3 DynamoDB SQS
SNS API Gateway
AppSync IoT Comprehend
“Serverless” Managed Non-Serverless
DocumentDB
(MongoDB)
Managed Streaming
for Kada
EventBridge
Reliability or High Availability is…
@jeremy_daly
A characteristic of a system, which aims to ensure an
agreed level of operational performance,
usually uptime, for a higher than normal period.
~Wikipedia
Resiliency is…
@jeremy_daly
The ability of a software solution to absorb the impact
of a problem in one or more parts of a system, while
continuing to provide an acceptable service level to
the business. ~ IBM
IT’S NOT ABOUT PREVENTING FAILURE
IT’S UNDERSTANDING HOWTO GRACEFULLY DEAL WITH IT
Writing code FOR the Cloud
Using Lambda for our business logic
• Ephemeral compute service
• Runs your code in response to events
• Automatically manages the runtime,
compute, and scaling
• Single concurrency model
• No sticky-sessions or guaranteed
lifespan
@jeremy_daly
Traditional Error Handling & Retries
try {
// Do something important
} catch (err) {
// Do some error handling
// Do some logging
// Maybe retry the operation
}
@jeremy_daly
What happens to the
original event?
What happens if the function
container crashes?
What happens if there is a
network issue?
👨✈ Losing events is very bad!
What happens if the function
never runs?
Function Error Types
• Unhandled Exception ‼
• FunctionTimeouts ⏳
• Out-of-Memory Errors 🧠
• Throttling Errors 🚦
@jeremy_daly
The cloud is better than you…
…at error handling
…at retrying failures
…at understanding network failures
…at mapping the network topology
…at handling failover and redundancy
@jeremy_daly
So why not let the
cloud do those
things for you? 🤷
Fail up the stack
• Don’t swallow errors with try/catch – fail the function
• Return errors directly to the invoking service
• Configure built-in retry mechanisms to reprocess events
• Utilize dead letter queues to capture failed events
@jeremy_daly
(sometimes 😉)
Types of Lambda Functions
• The Lambdalith
• The Fat Lambda
• The Single-Purpose Function
@jeremy_daly
😬
🤔
👍
The Mighty Lambdalith
• The entire application is in one
Lambda function
• Often times these are “lift and
shift” Express.js or Flask apps
• Events are synchronous via API
Gateway or ALB
• Partial failures are handled “in the
code”
@jeremy_daly
The Fat Lambda
• Several related methods are collocated in a single Lambda function
• Generally used to optimize the speed of synchronous operations
• Partial failures are still handled “in the code”
• Under the right circumstances, this can be useful
@jeremy_daly
The Single-Purpose Function
• Tightly scoped function that handles a single discrete piece of
business logic
• Can be invoked synchronously or asynchronously
• Failures are generally “total failures” and are passed back to the
invoking service
• Can be reused as part of other “workflows”, can scale (or throttle)
independently, and can utilize the Principle of Least Privilege
@jeremy_daly
Failure Modes in the Cloud
WARNING: Firehose of overly-technical content ahead 👩🚒
A quick word about retries…
• Retries are a vital part of distributed systems
• Most cloud services guarantee “at least once” delivery
• It is possible for the same event to be received more than once
• Retried operations should be idempotent
@jeremy_daly
Idempotent means that…
An operation can be repeated multiple times and
always provide the same result, with no side effects
to other objects in the system. ~ Computer Hope
@jeremy_daly
Idempotent operations:
• Update a database record
• Authenticate a user
• Check if a record exists and create if not
There are lots of
strategies to ensure
idempotency!
What are Dead Letter Queues (DLQs)?
• Capture messages/events that fail to process or are skipped
• Allows for alarming, inspection, and potential replay
• Can be added to SQS queues, SNS subscriptions, Lambda functions
@jeremy_daly
Lambda Invocation Types
• Synchronous – request/response model
• Asynchronous – set it and forget it
• Stream-based – push
• Poller-based – pull
@jeremy_daly
Synchronous Lambda Retry Behavior
• Functions are invoked directly using request/response method
• Failures are returned to the invoker
• Retries are delegated to the invoking application
• Some AWS services automatically retry (e.g. Alexa & Cognito)
• Other services do not retry (e.g. API Gateway, ALB, Step Functions)
• API Gateway and ALB can return errors to the client for retry
@jeremy_daly
Asynchronous Lambda Retry Behavior
• The Lambda Service accepts requests and adds them to a queue
• The invoker receives a 202 status code and disconnects
• The Lambda Service will attempt to reprocess failed events up to
2 times, configured using the MaximumRetryAttempts setting
• If the Lambda function is throttled, the event will be retried for up to
6 hours, configured using MaximumEventAgeInSeconds
• Failed and expired events can be sent to a Dead Letter Queue (DLQ)
or an on-failure destination
@jeremy_daly
Stream-based Lambda Retry Behavior
• Records are pushed synchronously to Lambda from Kinesis or
DynamoDB streams in batches (10k and 1k limits per batch)
• MaximumRetryAttempts: number of retry attempts for batches
before they can be skipped (up to 10,000)
• MaximumRecordAgeInSeconds: store records up to 7 days
• BisectBatchOnFunctionError: recursively split failed batches
(poison pill)
• Skipped records are sent to an On-failure Destination (SQS or SNS)
@jeremy_daly
Poller-based Lambda Retry Behavior
• The Lambda Poller pulls records synchronously from SQS in
batches (up to 10)
• Errors fail the entire batch
• MaxReceiveCount: number of times messages can be returned to
the queue before being sent to the DLQ (up to 1,000)
• Polling frequency is tied to function concurrency
• VisibilityTimeout should be set to at least 6 times the timeout
configured on your consuming function
@jeremy_daly
Lambda Destinations
• Only for asynchronous invocations
• Routing based on SUCCESS and/or FAILURE
• OnFailure should be favored over a standard DLQ
• Destinations can be an SQS queue, SNS topic,
Lambda function, or EventBridge event bus
@jeremy_daly
Lambda Destinations (continued)
Destination-specific JSON format
• SQS/SNS: JSON object is passed as the Message
• Lambda: JSON is passed as the payload to the function
• EventBridge: JSON is passed as the Detail in the PutEvents call
• Source is ”lambda”
• DetailType is “Lambda Function Invocation Result – Success/Failure”
• Resource fields contain the function and destination ARNs
@jeremy_daly
SQS Redrive Policies
• Only supports another SQS queue as the DLQ
• Messages are sent to the DLQ if the Maximum Receives value is
exceeded
@jeremy_daly
SNS Redrive Policies
• Dead Letter Queues are attached to Subscriptions, not Topics
• Only supports SQS queues as the DLQ
• Client-side errors (e.g. Lambda doesn’t exist) do no retry
• Messages to SQS or Lambda are retried 100,015 times over 23 days
• Messages to SMTP, SMS, and Mobile retry 50 times over 6 hours
• HTTP endpoints support customer-defined retry policies
(number of retries, delays, and backoff strategy)
@jeremy_daly
EventBridge Retry Behavior
• Will attempt to deliver events for up to 24 hours with backoff
• Lambda functions are invoked asynchronously
• Failed events are lost (this is very unlikely)
• Once events are accepted by the target service, failure modes of
those services are used
@jeremy_daly
Step Functions
• State Machines: Orchestration workflows
• Lambdas are invoked synchronously
• Retriers and Catchers allow for complex
error handling patterns
• Use “error names” with ErrorEquals for
condition error handling (States.*)
• Control retry policies with IntervalSeconds,
MaxAttempts, BackoffRate
@jeremy_daly
Complex Error Handling Pattern
Credit:Yan Cui
AWS SDK Retries
• Automatic retries and exponential backoff
@jeremy_daly
AWS SDK
Maximum retry
count
Connection
timeout
Socket timeout
Python (Boto 3) depends on service 60 seconds 60 seconds
Node.js depends on service N/A 120 seconds
Java 3 10 seconds 50 seconds
.NET 4 100 seconds 300 seconds
Go 3 N/A N/A
Error Handling Patterns
miss
Caching strategy
Client API Gateway RDSLambda
Elasticache
Key Points:
• Create new RDS connections ONLY on misses
• Make sureTTLs are set appropriately
• Include the ability to invalidate cache
@jeremy_daly
YOU STILL NEEDTO
SIZEYOUR DATABASE
CLUSTERS APPROPRIATELY
RDS
Buffer events for throttling and durability
Client API Gateway
SQS
Queue
SQS
(DLQ)
Lambda Lambda
(throttled)
ack
“Asynchronous”
Request
Synchronous
Request
@jeremy_daly
Key Points:
• SQS adds durability
• Throttled Lambdas reduce downstream pressure
• Failed events are stored for further inspection/replay
Limit the
concurrency to match
RDS throughput
x
Utilize Service
Integrations
DynamoDB
Stripe API
The Circuit Breaker
Client API Gateway Lambda
Key Points:
• Cache your cache with warm functions
• Use a reasonable failure count
• Understand idempotency!
Status
Check CLOSED
OPEN
Increment Failure Count
HALF OPEN
“Everything fails all the time.”
~WernerVogels
@jeremy_daly
Elasticache
or
Event Processing Conditional Retry Pattern
@jeremy_daly
EventBridge Processing
Lambda
EventBridge
onSuccess
onFailure
SQS
(DLQ)
Replay SQS
Replay Lambda
(Throttled)
Rule to route
permanent
failuresRule to route
“retriable” failures
Asynchronous
(use delivery delay)
Decoupling Our Services
Multicasting with SNS
Key Points:
• SNS has a “well-defined API”
• Decouples downstream processes
• Allows multiple subscribers with message filters
Client
SNS
“Asynchronous”
Request
ack
Event Service
@jeremy_daly
HTTP
SMS
Lambda
SQS
Email
SQS (DLQ)
@jeremy_daly
Multicasting with EventBridge
Key Points:
• Allows multiple subscribers with RULES, PATTERNS and FILTERS
• Forward events to other accounts
• 24 hours of automated retries
Asynchronous
“PutEvents” Request
ack
w/ event id
Amazon
EventBridge
Lambda
SQS
Client
Step Function
Event Bus
+16 others
Key Points:
• Filter events to selectively trigger services
• Manage throttling/quotas/failures per service
• Use Lambda Destinations with asynchronous events
Stripe API
@jeremy_daly
Distribute & Throttle
ack
SQS
Queue Lambda
(concurrency 25)
Client API
Gateway
Lambda
Order Service
"total": [{ "numeric": [ ”>", 0 ]}]
RDS
SQS
Queue Lambda
(concurrency 10)
SMS Alerting Service
Twilio API
SQS
Queue Lambda
(concurrency 5)
Billing Service
"detail-type": [ "ORDER COMPLETE" ]
EventBridge
Key Takeaways
• Be prepared for failure – everything fails all the time!
• Utilize the built in retry mechanisms of the cloud
• Understand failure modes to protect against data loss
• Buffer and throttle events to distributed systems
• Embrace asynchronous processes to decouple components
@jeremy_daly
Things I’m working on…
Blog: JeremyDaly.com
Podcast: ServerlessChats.com
Newsletter: Offbynone.io
DDBToolbox: DynamoDBToolbox.com
Lambda API: LambdaAPI.com
GitHub: github.com/jeremydaly
Twitter: @jeremy_daly
@jeremy_daly
Thank You!
Jeremy Daly
jeremy@jeremydaly.com
@jeremy_daly

Contenu connexe

Tendances

Sloppy Little Serverless Stories
Sloppy Little Serverless StoriesSloppy Little Serverless Stories
Sloppy Little Serverless StoriesSheenBrisals
 
Stephen Liedig: Building Serverless Backends with AWS Lambda and API Gateway
Stephen Liedig: Building Serverless Backends with AWS Lambda and API GatewayStephen Liedig: Building Serverless Backends with AWS Lambda and API Gateway
Stephen Liedig: Building Serverless Backends with AWS Lambda and API GatewaySteve Androulakis
 
Getting Started with Serverless Architectures
Getting Started with Serverless ArchitecturesGetting Started with Serverless Architectures
Getting Started with Serverless ArchitecturesAmazon Web Services
 
How LEGO.com Accelerates With Serverless
How LEGO.com Accelerates With ServerlessHow LEGO.com Accelerates With Serverless
How LEGO.com Accelerates With ServerlessSheenBrisals
 
Choosing the right messaging service for your serverless app [with lumigo]
Choosing the right messaging service for your serverless app [with lumigo]Choosing the right messaging service for your serverless app [with lumigo]
Choosing the right messaging service for your serverless app [with lumigo]Dhaval Nagar
 
To Serverless And Beyond!
To Serverless And Beyond!To Serverless And Beyond!
To Serverless And Beyond!SheenBrisals
 
ENT310 Microservices? Dynamic Infrastructure? - Adventures in Keeping Your Ap...
ENT310 Microservices? Dynamic Infrastructure? - Adventures in Keeping Your Ap...ENT310 Microservices? Dynamic Infrastructure? - Adventures in Keeping Your Ap...
ENT310 Microservices? Dynamic Infrastructure? - Adventures in Keeping Your Ap...Amazon Web Services
 
Enterprise Serverless Adoption. An Experience Report
Enterprise Serverless Adoption. An Experience ReportEnterprise Serverless Adoption. An Experience Report
Enterprise Serverless Adoption. An Experience ReportSheenBrisals
 
Getting Started with Amazon EventBridge
Getting Started with Amazon EventBridgeGetting Started with Amazon EventBridge
Getting Started with Amazon EventBridgeSrushith Repakula
 
Using AWS Lambda to Build Control Systems for Your AWS Infrastructure
Using AWS Lambda to Build Control Systems for Your AWS InfrastructureUsing AWS Lambda to Build Control Systems for Your AWS Infrastructure
Using AWS Lambda to Build Control Systems for Your AWS InfrastructureAmazon Web Services
 
WKS404 7 Things You Must Know to Build Better Alexa Skills
WKS404 7 Things You Must Know to Build Better Alexa SkillsWKS404 7 Things You Must Know to Build Better Alexa Skills
WKS404 7 Things You Must Know to Build Better Alexa SkillsAmazon Web Services
 
Thinking Asynchronously Full Vesion - Utah UG
Thinking Asynchronously Full Vesion - Utah UGThinking Asynchronously Full Vesion - Utah UG
Thinking Asynchronously Full Vesion - Utah UGEric Johnson
 
Serverless Web Apps using API Gateway, Lambda and DynamoDB
Serverless Web Apps using API Gateway, Lambda and DynamoDBServerless Web Apps using API Gateway, Lambda and DynamoDB
Serverless Web Apps using API Gateway, Lambda and DynamoDBAmazon Web Services
 
Building on AWS: Optimizing & Delivering
Building on AWS: Optimizing & DeliveringBuilding on AWS: Optimizing & Delivering
Building on AWS: Optimizing & DeliveringAmazon Web Services
 
Serverless Architectural Patterns
Serverless Architectural PatternsServerless Architectural Patterns
Serverless Architectural PatternsAmazon Web Services
 
Serverless Architecture Patterns
Serverless Architecture PatternsServerless Architecture Patterns
Serverless Architecture PatternsAmazon Web Services
 
Stop calling everything serverless!
Stop calling everything serverless!Stop calling everything serverless!
Stop calling everything serverless!Jeremy Daly
 
Serverless Computing: build and run applications without thinking about servers
Serverless Computing: build and run applications without thinking about serversServerless Computing: build and run applications without thinking about servers
Serverless Computing: build and run applications without thinking about serversAmazon Web Services
 
CQRS Evolved - CQRS + Akka.NET
CQRS Evolved - CQRS + Akka.NETCQRS Evolved - CQRS + Akka.NET
CQRS Evolved - CQRS + Akka.NETDavid Hoerster
 

Tendances (20)

Sloppy Little Serverless Stories
Sloppy Little Serverless StoriesSloppy Little Serverless Stories
Sloppy Little Serverless Stories
 
Stephen Liedig: Building Serverless Backends with AWS Lambda and API Gateway
Stephen Liedig: Building Serverless Backends with AWS Lambda and API GatewayStephen Liedig: Building Serverless Backends with AWS Lambda and API Gateway
Stephen Liedig: Building Serverless Backends with AWS Lambda and API Gateway
 
Getting Started with Serverless Architectures
Getting Started with Serverless ArchitecturesGetting Started with Serverless Architectures
Getting Started with Serverless Architectures
 
How LEGO.com Accelerates With Serverless
How LEGO.com Accelerates With ServerlessHow LEGO.com Accelerates With Serverless
How LEGO.com Accelerates With Serverless
 
Choosing the right messaging service for your serverless app [with lumigo]
Choosing the right messaging service for your serverless app [with lumigo]Choosing the right messaging service for your serverless app [with lumigo]
Choosing the right messaging service for your serverless app [with lumigo]
 
To Serverless And Beyond!
To Serverless And Beyond!To Serverless And Beyond!
To Serverless And Beyond!
 
ENT310 Microservices? Dynamic Infrastructure? - Adventures in Keeping Your Ap...
ENT310 Microservices? Dynamic Infrastructure? - Adventures in Keeping Your Ap...ENT310 Microservices? Dynamic Infrastructure? - Adventures in Keeping Your Ap...
ENT310 Microservices? Dynamic Infrastructure? - Adventures in Keeping Your Ap...
 
Enterprise Serverless Adoption. An Experience Report
Enterprise Serverless Adoption. An Experience ReportEnterprise Serverless Adoption. An Experience Report
Enterprise Serverless Adoption. An Experience Report
 
Getting Started with Amazon EventBridge
Getting Started with Amazon EventBridgeGetting Started with Amazon EventBridge
Getting Started with Amazon EventBridge
 
Using AWS Lambda to Build Control Systems for Your AWS Infrastructure
Using AWS Lambda to Build Control Systems for Your AWS InfrastructureUsing AWS Lambda to Build Control Systems for Your AWS Infrastructure
Using AWS Lambda to Build Control Systems for Your AWS Infrastructure
 
WKS404 7 Things You Must Know to Build Better Alexa Skills
WKS404 7 Things You Must Know to Build Better Alexa SkillsWKS404 7 Things You Must Know to Build Better Alexa Skills
WKS404 7 Things You Must Know to Build Better Alexa Skills
 
Thinking Asynchronously Full Vesion - Utah UG
Thinking Asynchronously Full Vesion - Utah UGThinking Asynchronously Full Vesion - Utah UG
Thinking Asynchronously Full Vesion - Utah UG
 
Serverless Web Apps using API Gateway, Lambda and DynamoDB
Serverless Web Apps using API Gateway, Lambda and DynamoDBServerless Web Apps using API Gateway, Lambda and DynamoDB
Serverless Web Apps using API Gateway, Lambda and DynamoDB
 
Building on AWS: Optimizing & Delivering
Building on AWS: Optimizing & DeliveringBuilding on AWS: Optimizing & Delivering
Building on AWS: Optimizing & Delivering
 
Serverless Architectural Patterns
Serverless Architectural PatternsServerless Architectural Patterns
Serverless Architectural Patterns
 
Serverless
ServerlessServerless
Serverless
 
Serverless Architecture Patterns
Serverless Architecture PatternsServerless Architecture Patterns
Serverless Architecture Patterns
 
Stop calling everything serverless!
Stop calling everything serverless!Stop calling everything serverless!
Stop calling everything serverless!
 
Serverless Computing: build and run applications without thinking about servers
Serverless Computing: build and run applications without thinking about serversServerless Computing: build and run applications without thinking about servers
Serverless Computing: build and run applications without thinking about servers
 
CQRS Evolved - CQRS + Akka.NET
CQRS Evolved - CQRS + Akka.NETCQRS Evolved - CQRS + Akka.NET
CQRS Evolved - CQRS + Akka.NET
 

Similaire à How to fail with serverless

AWS Lambda from the Trenches
AWS Lambda from the TrenchesAWS Lambda from the Trenches
AWS Lambda from the TrenchesYan Cui
 
Operational challenges behind Serverless architectures
Operational challenges behind Serverless architecturesOperational challenges behind Serverless architectures
Operational challenges behind Serverless architecturesLaurent Bernaille
 
Deterministic behaviour and performance in trading systems
Deterministic behaviour and performance in trading systemsDeterministic behaviour and performance in trading systems
Deterministic behaviour and performance in trading systemsPeter Lawrey
 
Azure Cloud Patterns
Azure Cloud PatternsAzure Cloud Patterns
Azure Cloud PatternsTamir Dresher
 
Patterns of Distributed Application Design
Patterns of Distributed Application DesignPatterns of Distributed Application Design
Patterns of Distributed Application DesignGlobalLogic Ukraine
 
Serverless Computing Model
Serverless Computing ModelServerless Computing Model
Serverless Computing ModelMohamed Samir
 
Site reliability in the Serverless age - Serverless Boston 2019
Site reliability in the Serverless age  - Serverless Boston 2019Site reliability in the Serverless age  - Serverless Boston 2019
Site reliability in the Serverless age - Serverless Boston 2019Erik Peterson
 
Cloud Computing with .Net
Cloud Computing with .NetCloud Computing with .Net
Cloud Computing with .NetWesley Faler
 
Operationnal challenges behind Serverless architectures by Laurent Bernaille
Operationnal challenges behind Serverless architectures by Laurent BernailleOperationnal challenges behind Serverless architectures by Laurent Bernaille
Operationnal challenges behind Serverless architectures by Laurent BernailleThe Incredible Automation Day
 
AWS Serverless patterns & best-practices in AWS
AWS Serverless  patterns & best-practices in AWSAWS Serverless  patterns & best-practices in AWS
AWS Serverless patterns & best-practices in AWSDima Pasko
 
Exploring Twitter's Finagle technology stack for microservices
Exploring Twitter's Finagle technology stack for microservicesExploring Twitter's Finagle technology stack for microservices
Exploring Twitter's Finagle technology stack for microservices💡 Tomasz Kogut
 
Planning to Fail #phpuk13
Planning to Fail #phpuk13Planning to Fail #phpuk13
Planning to Fail #phpuk13Dave Gardner
 
CQRS + Event Sourcing
CQRS + Event SourcingCQRS + Event Sourcing
CQRS + Event SourcingMike Bild
 
Low latency for high throughput
Low latency for high throughputLow latency for high throughput
Low latency for high throughputPeter Lawrey
 
Reliability of the Cloud: How AWS Achieves High Availability (ARC317-R1) - AW...
Reliability of the Cloud: How AWS Achieves High Availability (ARC317-R1) - AW...Reliability of the Cloud: How AWS Achieves High Availability (ARC317-R1) - AW...
Reliability of the Cloud: How AWS Achieves High Availability (ARC317-R1) - AW...Amazon Web Services
 
7 Common Questions About a Cloud Management Platform
7 Common Questions About a Cloud Management Platform7 Common Questions About a Cloud Management Platform
7 Common Questions About a Cloud Management PlatformRightScale
 
Lotuscript for large systems
Lotuscript for large systemsLotuscript for large systems
Lotuscript for large systemsBill Buchan
 
Eventual Consistency @WalmartLabs with Kafka, Avro, SolrCloud and Hadoop
Eventual Consistency @WalmartLabs with Kafka, Avro, SolrCloud and HadoopEventual Consistency @WalmartLabs with Kafka, Avro, SolrCloud and Hadoop
Eventual Consistency @WalmartLabs with Kafka, Avro, SolrCloud and HadoopAyon Sinha
 
Fixing twitter
Fixing twitterFixing twitter
Fixing twitterRoger Xia
 

Similaire à How to fail with serverless (20)

AWS Lambda from the Trenches
AWS Lambda from the TrenchesAWS Lambda from the Trenches
AWS Lambda from the Trenches
 
Operational challenges behind Serverless architectures
Operational challenges behind Serverless architecturesOperational challenges behind Serverless architectures
Operational challenges behind Serverless architectures
 
Deterministic behaviour and performance in trading systems
Deterministic behaviour and performance in trading systemsDeterministic behaviour and performance in trading systems
Deterministic behaviour and performance in trading systems
 
Azure Cloud Patterns
Azure Cloud PatternsAzure Cloud Patterns
Azure Cloud Patterns
 
Patterns of Distributed Application Design
Patterns of Distributed Application DesignPatterns of Distributed Application Design
Patterns of Distributed Application Design
 
Serverless Computing Model
Serverless Computing ModelServerless Computing Model
Serverless Computing Model
 
Site reliability in the Serverless age - Serverless Boston 2019
Site reliability in the Serverless age  - Serverless Boston 2019Site reliability in the Serverless age  - Serverless Boston 2019
Site reliability in the Serverless age - Serverless Boston 2019
 
Cloud Computing with .Net
Cloud Computing with .NetCloud Computing with .Net
Cloud Computing with .Net
 
Grails Services
Grails ServicesGrails Services
Grails Services
 
Operationnal challenges behind Serverless architectures by Laurent Bernaille
Operationnal challenges behind Serverless architectures by Laurent BernailleOperationnal challenges behind Serverless architectures by Laurent Bernaille
Operationnal challenges behind Serverless architectures by Laurent Bernaille
 
AWS Serverless patterns & best-practices in AWS
AWS Serverless  patterns & best-practices in AWSAWS Serverless  patterns & best-practices in AWS
AWS Serverless patterns & best-practices in AWS
 
Exploring Twitter's Finagle technology stack for microservices
Exploring Twitter's Finagle technology stack for microservicesExploring Twitter's Finagle technology stack for microservices
Exploring Twitter's Finagle technology stack for microservices
 
Planning to Fail #phpuk13
Planning to Fail #phpuk13Planning to Fail #phpuk13
Planning to Fail #phpuk13
 
CQRS + Event Sourcing
CQRS + Event SourcingCQRS + Event Sourcing
CQRS + Event Sourcing
 
Low latency for high throughput
Low latency for high throughputLow latency for high throughput
Low latency for high throughput
 
Reliability of the Cloud: How AWS Achieves High Availability (ARC317-R1) - AW...
Reliability of the Cloud: How AWS Achieves High Availability (ARC317-R1) - AW...Reliability of the Cloud: How AWS Achieves High Availability (ARC317-R1) - AW...
Reliability of the Cloud: How AWS Achieves High Availability (ARC317-R1) - AW...
 
7 Common Questions About a Cloud Management Platform
7 Common Questions About a Cloud Management Platform7 Common Questions About a Cloud Management Platform
7 Common Questions About a Cloud Management Platform
 
Lotuscript for large systems
Lotuscript for large systemsLotuscript for large systems
Lotuscript for large systems
 
Eventual Consistency @WalmartLabs with Kafka, Avro, SolrCloud and Hadoop
Eventual Consistency @WalmartLabs with Kafka, Avro, SolrCloud and HadoopEventual Consistency @WalmartLabs with Kafka, Avro, SolrCloud and Hadoop
Eventual Consistency @WalmartLabs with Kafka, Avro, SolrCloud and Hadoop
 
Fixing twitter
Fixing twitterFixing twitter
Fixing twitter
 

Dernier

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 

Dernier (20)

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 

How to fail with serverless

  • 1. How to Fail with Serverless Jeremy Daly CTO, AlertMe.news @jeremy_daly
  • 2. Jeremy Daly • CTO at AlertMe.news • Consult with companies building in the cloud • 20+ year veteran of technology startups • Started working with AWS in 2009 and started using Lambda in 2015 • Blogger (jeremydaly.com), OSS contributor, speaker • Publish the Off-by-none serverless newsletter • Host of the Serverless Chats podcast @jeremy_daly
  • 3. Agenda • Distributed systems and serverless • Writing code FOR the cloud • Failure modes in the cloud • Serverless patterns to deal with failure • Decoupling our services @jeremy_daly
  • 4. Distributed Systems are… Systems whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another. ~Wikipedia @jeremy_daly They’re also really hard!
  • 5. @jeremy_daly EVERYTHING FAILS ALL THE TIME WernerVogels CTO, Amazon.com
  • 6. 8 Fallacies of Distributed Computing @jeremy_daly The network is reliable Latency is zero Bandwidth is infinite The network is secure Topology doesn’t change There is one administrator Transport cost is zero The network is homogeneous 🚫 🚫 🚫 🚫 🚫 🚫 🚫 🚫
  • 7. Serverless applications are… Distributed systems on steroids! 💪 💪 💪 @jeremy_daly • Smaller, more distributed compute units • Stateless, requiring network access to state • Uncoordinated, requires buses, queues, pub/sub, state machines • Heavily reliant on other networked cloud services
  • 8. What does it mean to be Serverless? • No server management • Flexible scaling • Pay for value / never pay for idle • Automated high availability • LOTS of configuration and knowledge of cloud services • Highly event-driven @jeremy_daly
  • 9. Lots of services to communicate with! @jeremy_daly ElastiCache RDS EMR Amazon ES Redshift Fargate Anything “on EC2”Lambda Cognito Kinesis S3 DynamoDB SQS SNS API Gateway AppSync IoT Comprehend “Serverless” Managed Non-Serverless DocumentDB (MongoDB) Managed Streaming for Kada EventBridge
  • 10. Reliability or High Availability is… @jeremy_daly A characteristic of a system, which aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period. ~Wikipedia
  • 11. Resiliency is… @jeremy_daly The ability of a software solution to absorb the impact of a problem in one or more parts of a system, while continuing to provide an acceptable service level to the business. ~ IBM IT’S NOT ABOUT PREVENTING FAILURE IT’S UNDERSTANDING HOWTO GRACEFULLY DEAL WITH IT
  • 12. Writing code FOR the Cloud
  • 13. Using Lambda for our business logic • Ephemeral compute service • Runs your code in response to events • Automatically manages the runtime, compute, and scaling • Single concurrency model • No sticky-sessions or guaranteed lifespan @jeremy_daly
  • 14. Traditional Error Handling & Retries try { // Do something important } catch (err) { // Do some error handling // Do some logging // Maybe retry the operation } @jeremy_daly What happens to the original event? What happens if the function container crashes? What happens if there is a network issue? 👨✈ Losing events is very bad! What happens if the function never runs?
  • 15. Function Error Types • Unhandled Exception ‼ • FunctionTimeouts ⏳ • Out-of-Memory Errors 🧠 • Throttling Errors 🚦 @jeremy_daly
  • 16. The cloud is better than you… …at error handling …at retrying failures …at understanding network failures …at mapping the network topology …at handling failover and redundancy @jeremy_daly So why not let the cloud do those things for you? 🤷
  • 17. Fail up the stack • Don’t swallow errors with try/catch – fail the function • Return errors directly to the invoking service • Configure built-in retry mechanisms to reprocess events • Utilize dead letter queues to capture failed events @jeremy_daly (sometimes 😉)
  • 18. Types of Lambda Functions • The Lambdalith • The Fat Lambda • The Single-Purpose Function @jeremy_daly 😬 🤔 👍
  • 19. The Mighty Lambdalith • The entire application is in one Lambda function • Often times these are “lift and shift” Express.js or Flask apps • Events are synchronous via API Gateway or ALB • Partial failures are handled “in the code” @jeremy_daly
  • 20. The Fat Lambda • Several related methods are collocated in a single Lambda function • Generally used to optimize the speed of synchronous operations • Partial failures are still handled “in the code” • Under the right circumstances, this can be useful @jeremy_daly
  • 21. The Single-Purpose Function • Tightly scoped function that handles a single discrete piece of business logic • Can be invoked synchronously or asynchronously • Failures are generally “total failures” and are passed back to the invoking service • Can be reused as part of other “workflows”, can scale (or throttle) independently, and can utilize the Principle of Least Privilege @jeremy_daly
  • 22. Failure Modes in the Cloud WARNING: Firehose of overly-technical content ahead 👩🚒
  • 23. A quick word about retries… • Retries are a vital part of distributed systems • Most cloud services guarantee “at least once” delivery • It is possible for the same event to be received more than once • Retried operations should be idempotent @jeremy_daly
  • 24. Idempotent means that… An operation can be repeated multiple times and always provide the same result, with no side effects to other objects in the system. ~ Computer Hope @jeremy_daly Idempotent operations: • Update a database record • Authenticate a user • Check if a record exists and create if not There are lots of strategies to ensure idempotency!
  • 25. What are Dead Letter Queues (DLQs)? • Capture messages/events that fail to process or are skipped • Allows for alarming, inspection, and potential replay • Can be added to SQS queues, SNS subscriptions, Lambda functions @jeremy_daly
  • 26. Lambda Invocation Types • Synchronous – request/response model • Asynchronous – set it and forget it • Stream-based – push • Poller-based – pull @jeremy_daly
  • 27. Synchronous Lambda Retry Behavior • Functions are invoked directly using request/response method • Failures are returned to the invoker • Retries are delegated to the invoking application • Some AWS services automatically retry (e.g. Alexa & Cognito) • Other services do not retry (e.g. API Gateway, ALB, Step Functions) • API Gateway and ALB can return errors to the client for retry @jeremy_daly
  • 28. Asynchronous Lambda Retry Behavior • The Lambda Service accepts requests and adds them to a queue • The invoker receives a 202 status code and disconnects • The Lambda Service will attempt to reprocess failed events up to 2 times, configured using the MaximumRetryAttempts setting • If the Lambda function is throttled, the event will be retried for up to 6 hours, configured using MaximumEventAgeInSeconds • Failed and expired events can be sent to a Dead Letter Queue (DLQ) or an on-failure destination @jeremy_daly
  • 29. Stream-based Lambda Retry Behavior • Records are pushed synchronously to Lambda from Kinesis or DynamoDB streams in batches (10k and 1k limits per batch) • MaximumRetryAttempts: number of retry attempts for batches before they can be skipped (up to 10,000) • MaximumRecordAgeInSeconds: store records up to 7 days • BisectBatchOnFunctionError: recursively split failed batches (poison pill) • Skipped records are sent to an On-failure Destination (SQS or SNS) @jeremy_daly
  • 30. Poller-based Lambda Retry Behavior • The Lambda Poller pulls records synchronously from SQS in batches (up to 10) • Errors fail the entire batch • MaxReceiveCount: number of times messages can be returned to the queue before being sent to the DLQ (up to 1,000) • Polling frequency is tied to function concurrency • VisibilityTimeout should be set to at least 6 times the timeout configured on your consuming function @jeremy_daly
  • 31. Lambda Destinations • Only for asynchronous invocations • Routing based on SUCCESS and/or FAILURE • OnFailure should be favored over a standard DLQ • Destinations can be an SQS queue, SNS topic, Lambda function, or EventBridge event bus @jeremy_daly
  • 32. Lambda Destinations (continued) Destination-specific JSON format • SQS/SNS: JSON object is passed as the Message • Lambda: JSON is passed as the payload to the function • EventBridge: JSON is passed as the Detail in the PutEvents call • Source is ”lambda” • DetailType is “Lambda Function Invocation Result – Success/Failure” • Resource fields contain the function and destination ARNs @jeremy_daly
  • 33. SQS Redrive Policies • Only supports another SQS queue as the DLQ • Messages are sent to the DLQ if the Maximum Receives value is exceeded @jeremy_daly
  • 34. SNS Redrive Policies • Dead Letter Queues are attached to Subscriptions, not Topics • Only supports SQS queues as the DLQ • Client-side errors (e.g. Lambda doesn’t exist) do no retry • Messages to SQS or Lambda are retried 100,015 times over 23 days • Messages to SMTP, SMS, and Mobile retry 50 times over 6 hours • HTTP endpoints support customer-defined retry policies (number of retries, delays, and backoff strategy) @jeremy_daly
  • 35. EventBridge Retry Behavior • Will attempt to deliver events for up to 24 hours with backoff • Lambda functions are invoked asynchronously • Failed events are lost (this is very unlikely) • Once events are accepted by the target service, failure modes of those services are used @jeremy_daly
  • 36. Step Functions • State Machines: Orchestration workflows • Lambdas are invoked synchronously • Retriers and Catchers allow for complex error handling patterns • Use “error names” with ErrorEquals for condition error handling (States.*) • Control retry policies with IntervalSeconds, MaxAttempts, BackoffRate @jeremy_daly Complex Error Handling Pattern Credit:Yan Cui
  • 37. AWS SDK Retries • Automatic retries and exponential backoff @jeremy_daly AWS SDK Maximum retry count Connection timeout Socket timeout Python (Boto 3) depends on service 60 seconds 60 seconds Node.js depends on service N/A 120 seconds Java 3 10 seconds 50 seconds .NET 4 100 seconds 300 seconds Go 3 N/A N/A
  • 39. miss Caching strategy Client API Gateway RDSLambda Elasticache Key Points: • Create new RDS connections ONLY on misses • Make sureTTLs are set appropriately • Include the ability to invalidate cache @jeremy_daly YOU STILL NEEDTO SIZEYOUR DATABASE CLUSTERS APPROPRIATELY
  • 40. RDS Buffer events for throttling and durability Client API Gateway SQS Queue SQS (DLQ) Lambda Lambda (throttled) ack “Asynchronous” Request Synchronous Request @jeremy_daly Key Points: • SQS adds durability • Throttled Lambdas reduce downstream pressure • Failed events are stored for further inspection/replay Limit the concurrency to match RDS throughput x Utilize Service Integrations
  • 41. DynamoDB Stripe API The Circuit Breaker Client API Gateway Lambda Key Points: • Cache your cache with warm functions • Use a reasonable failure count • Understand idempotency! Status Check CLOSED OPEN Increment Failure Count HALF OPEN “Everything fails all the time.” ~WernerVogels @jeremy_daly Elasticache or
  • 42. Event Processing Conditional Retry Pattern @jeremy_daly EventBridge Processing Lambda EventBridge onSuccess onFailure SQS (DLQ) Replay SQS Replay Lambda (Throttled) Rule to route permanent failuresRule to route “retriable” failures Asynchronous (use delivery delay)
  • 44. Multicasting with SNS Key Points: • SNS has a “well-defined API” • Decouples downstream processes • Allows multiple subscribers with message filters Client SNS “Asynchronous” Request ack Event Service @jeremy_daly HTTP SMS Lambda SQS Email SQS (DLQ)
  • 45. @jeremy_daly Multicasting with EventBridge Key Points: • Allows multiple subscribers with RULES, PATTERNS and FILTERS • Forward events to other accounts • 24 hours of automated retries Asynchronous “PutEvents” Request ack w/ event id Amazon EventBridge Lambda SQS Client Step Function Event Bus +16 others
  • 46. Key Points: • Filter events to selectively trigger services • Manage throttling/quotas/failures per service • Use Lambda Destinations with asynchronous events Stripe API @jeremy_daly Distribute & Throttle ack SQS Queue Lambda (concurrency 25) Client API Gateway Lambda Order Service "total": [{ "numeric": [ ”>", 0 ]}] RDS SQS Queue Lambda (concurrency 10) SMS Alerting Service Twilio API SQS Queue Lambda (concurrency 5) Billing Service "detail-type": [ "ORDER COMPLETE" ] EventBridge
  • 47. Key Takeaways • Be prepared for failure – everything fails all the time! • Utilize the built in retry mechanisms of the cloud • Understand failure modes to protect against data loss • Buffer and throttle events to distributed systems • Embrace asynchronous processes to decouple components @jeremy_daly
  • 48. Things I’m working on… Blog: JeremyDaly.com Podcast: ServerlessChats.com Newsletter: Offbynone.io DDBToolbox: DynamoDBToolbox.com Lambda API: LambdaAPI.com GitHub: github.com/jeremydaly Twitter: @jeremy_daly @jeremy_daly