Serverless applications are increasingly distributed systems in which errors and bottlenecks can have significant downstream impact. The ephemeral nature of FaaS offerings compounds the problem, because errors can be difficult to diagnose after the fact. In this session we'll discuss instrumentation and "self-healing" architectural patterns that improve the resiliency of your application and drive better observability and performance.
2. AWS | LAMBDA FEATURES PAGE
The Promise:
AWS Lambda invokes your code only when
needed and automatically scales to support the
rate of incoming requests without requiring you
to configure anything. There is no limit to the
number of requests your code can handle.
SELF-HEALING SERVERLESS APPLICATIONS | PG2
3. AWS | LAMBDA FEATURES PAGE
The Reality:
AWS Lambda invokes your code only when
needed and automatically scales to support the
rate of incoming requests without requiring you
to configure anything. There is no limit to the
number of requests your code can handle.
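(The slide marks up the quote above with suggested edits. The inserted words that survive extraction are "architecture", "sometimes", "certain", "every", "can", "but", and "are properly", plus a few plural suffixes; where they sit in the quote is not recoverable.)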
4. What to expect
when you’re not expecting.
5. Common Serverless Failures
FOR LAMBDA-BASED ARCHITECTURES
FAILURE TYPES
• Runtime Error:
  • Uncaught Exception
  • Timeout
  • Bad State
• Scaling:
  • Concurrency Limits
  • Spawn Limits
  • Bottlenecking
6. Common Serverless Failures
FOR LAMBDA-BASED ARCHITECTURES
FAILURE TYPE
Runtime Error: Uncaught Exception
DESCRIPTION
An event triggers your Lambda to run, but your code raises an unhandled exception.
DEFAULT BEHAVIOR
Synchronous invocations:
• Function fails
• Returns error to caller
• Logs timestamp, error message, & stack trace to CloudWatch
Asynchronous invocations:
• Runs up to three times (the initial attempt plus two retries; more if reading from a stream)
• Caller is unaware of error
• Logs timestamp, error message, & stack trace to CloudWatch
7. Common Serverless Failures
FOR LAMBDA-BASED ARCHITECTURES
FAILURE TYPE
Runtime Error: Timeout
DESCRIPTION
An event triggers your Lambda to run, but execution does not complete within the configured maximum execution time. (Lambda’s default configuration is a 3-second timeout.)
DEFAULT BEHAVIOR
Synchronous invocations:
• Lambda returns error to caller (if client hasn’t timed out)
• Logs timestamp and error message to CloudWatch
Asynchronous invocations:
• Runs up to three times (the initial attempt plus two retries; more if reading from a stream)
• Caller is unaware of error
• Logs timestamp & error message to CloudWatch
8. Common Serverless Failures
FOR LAMBDA-BASED ARCHITECTURES
FAILURE TYPE
Runtime Error: Bad State
DESCRIPTION
An event triggers your Lambda to run, but the message is malformed or state is improperly provided, causing unexpected behavior.
DEFAULT BEHAVIOR
When noisy:
• Behaves as Uncaught Exception
• Visible in CloudWatch, but may be difficult to diagnose without event visibility
When silent:
• Unexpected application behavior
• Can be lost permanently
• Can tank performance and dramatically spike costs
9. Common Serverless Failures
FOR LAMBDA-BASED ARCHITECTURES
FAILURE TYPE
Scaling: Concurrency Limits
DESCRIPTION
Your application becomes throttled because it needs more Lambda instances than AWS allows your account to run concurrently. Your compute can’t scale high enough.
DEFAULT BEHAVIOR
Unbuffered invocations:
• Fails to invoke
• No retry
• Visible in CloudWatch metrics, but not in logs
Buffered invocations:
• Initially fails to invoke
• Will eventually continue reading from stream as volume drops
10. Common Serverless Failures
FOR LAMBDA-BASED ARCHITECTURES
FAILURE TYPE
Scaling: Spawn Limits
DESCRIPTION
Your application becomes throttled because it needs more new Lambda instances than AWS allows your account to spawn. Your compute can’t scale fast enough.
DEFAULT BEHAVIOR
Unbuffered invocations:
• Fails to invoke
• No retry
• Visible in CloudWatch metrics, nothing in logs (but really non-obvious)
Buffered invocations:
• Initially fails to invoke
• Will eventually continue reading from stream as volume drops
11. Common Serverless Failures
FOR LAMBDA-BASED ARCHITECTURES
FAILURE TYPE
Scaling: Bottlenecking
DESCRIPTION
Your application is throttled due to throughput pressure upstream or downstream of your Lambda. Your architecture can’t scale enough.
DEFAULT BEHAVIOR
Upstream bottlenecks:
• Fails to invoke
• No retry
• Visible in CloudWatch, as long as you know where to look
Downstream bottlenecks:
• Can throw an error, time out, and/or distribute failures to other functions
• Can cause cascading failures
• Can tank performance and dramatically spike costs
13. Self-Healing Design Principles
LEADING PRACTICES FOR RESILIENT SYSTEMS
STANDARDIZE
• Introduce universal instrumentation
• Collect event-centric diagnostics
• Give everyone visibility
PLAN FOR FAILURE
• Identify service limits
• Use self-throttling
• Consider alternative resource types
FAIL GRACEFULLY
• Reroute and unblock
• Automate known solutions
• Notify a human
Learn to fail.
15. Scenario: Uncaught Exceptions
WHEN THINGS BREAK AND YOU DON’T KNOW WHY
PROBLEM
Lambda periodically fails. Error messages and stack traces are visible in CloudWatch logs. Failing events are lost, making reproduction difficult.
KEY PRINCIPLES
• Introduce universal instrumentation
• Collect event-centric diagnostics
• Give everyone visibility
SOLUTION
• Use function wrapper or decorator pattern
• Capture and log events which fail
Decrease time to resolution by capturing event data.
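The wrapper idea from this scenario could look roughly like the sketch below. It is an illustration rather than the speaker's implementation: the decorator name `capture_failed_event`, the logger name, and the sample handler are all hypothetical. It logs the triggering event next to the stack trace and then re-raises, so Lambda's normal error handling is unchanged.

```python
import functools
import json
import logging

logger = logging.getLogger("lambda_diagnostics")  # hypothetical logger name


def capture_failed_event(handler):
    """Log the triggering event whenever the wrapped handler raises."""

    @functools.wraps(handler)
    def wrapper(event, context):
        try:
            return handler(event, context)
        except Exception:
            # default=str keeps logging from failing on values JSON cannot encode.
            logger.exception("Handler failed; event=%s", json.dumps(event, default=str))
            raise  # re-raise so Lambda still records the failure

    return wrapper


@capture_failed_event
def handler(event, context):
    # Hypothetical business logic: fails when "order_id" is missing.
    return {"order_id": event["order_id"]}
```

With the event in the log, a failing invocation can be replayed locally. Events may carry sensitive fields, so a real wrapper would likely redact them before logging.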
17. Scenario: Upstream bottleneck
WHEN YOUR LAMBDAS AREN’T GETTING INVOKED
PROBLEM
API Gateway hits throughput limits, so not every request invokes Lambda.
KEY PRINCIPLES
• Identify service limits
• Use self-throttling
• Notify a human
SOLUTION
• Implement retries with exponential backoff logic for 429 responses
• Raise alarm on: 4XXError
Don’t overlook client-side solutions to backend failures.
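A client-side retry loop with exponential backoff might be sketched as follows. Everything here is an assumption for illustration: `send_request` is a hypothetical zero-argument callable returning `(status_code, body)`, and the delay values are arbitrary. The random jitter is a common addition that keeps many clients from retrying at the same instant.

```python
import random
import time


def call_with_backoff(send_request, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
    """Call send_request() until it returns a status other than 429.

    Returns the last (status_code, body) received, which is still a 429
    if every attempt was throttled.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send_request()
        if status != 429 or attempt == max_attempts - 1:
            return status, body
        # Wait a random time up to base_delay * 2^attempt, capped at max_delay.
        delay = min(max_delay, base_delay * (2 ** attempt))
        sleep(random.uniform(0, delay))
```

The `sleep` parameter exists only so the waiting can be replaced in tests; production callers would leave it at the default.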
19. Scenario: Timeouts
WHEN EXECUTION TAKES TOO LONG
PROBLEM
Lambda is periodically timing out.
KEY PRINCIPLES
• Introduce universal instrumentation
• Use self-throttling
• Consider alternative resource types
SOLUTION
• Use function wrapper or decorator pattern
• Evaluate Fargate or alternative long-running resources
Enforce your own limits.
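One way to enforce a self-imposed limit is to check the time remaining before starting each unit of work and stop early, instead of letting Lambda kill the invocation mid-task. The sketch below is not taken from the talk: `process_with_deadline`, `handle_item`, and the one-second margin are hypothetical, while `get_remaining_time_in_millis()` is the method the Lambda context object provides in the Python runtime.

```python
def process_with_deadline(items, context, handle_item, safety_margin_ms=1000):
    """Process items until the remaining execution time drops below the margin.

    Returns (results, unprocessed) so the caller can requeue what is left.
    """
    results = []
    for index, item in enumerate(items):
        # Stop before Lambda's own timeout would interrupt the work.
        if context.get_remaining_time_in_millis() < safety_margin_ms:
            return results, list(items[index:])
        results.append(handle_item(item))
    return results, []
```

The margin has to exceed the slowest single item for this to help; work that cannot be split this way is where the slide's suggestion of Fargate or another long-running resource applies.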
21. Scenario: Stream processing gets “stuck”
WHEN FAILURES ARE BLOCKING THE REST OF THE STREAM
PROBLEM
Lambda exceptions and/or timeouts are blocking processing of a Kinesis shard.
KEY PRINCIPLES
• Reroute and unblock
• Automate known solutions
• Consider alternative resource types
SOLUTION
• Introduce state machine-type logic
• Move bad messages to alternate stream
• Potentially architect with Fargate or SNS
Small failures are preferable to large ones.
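Moving bad messages aside could be sketched like this. It assumes JSON payloads and uses two hypothetical callables, `handle_payload` and `send_to_alternate`; in a real function the second might wrap a write to another Kinesis stream. The record layout (`Records`, `kinesis.data`, `kinesis.partitionKey`) follows the event Lambda delivers for Kinesis sources.

```python
import base64
import json


def process_records(event, handle_payload, send_to_alternate):
    """Process every Kinesis record; reroute failures instead of raising.

    Returns the number of records that were rerouted.
    """
    rerouted = 0
    for record in event["Records"]:
        raw = base64.b64decode(record["kinesis"]["data"])
        try:
            handle_payload(json.loads(raw))  # assumes JSON payloads
        except Exception as error:
            # Park the bad record so the rest of the shard keeps moving.
            send_to_alternate(raw, record["kinesis"]["partitionKey"], repr(error))
            rerouted += 1
    return rerouted
```

If `send_to_alternate` itself raises, the exception still propagates and the batch is retried, so the alternate destination needs its own monitoring.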
22. Scenario: Downstream bottleneck
WHEN LAMBDA IS OUT-SCALING YOUR DATABASE
PROBLEM
Your Lambdas have scaled up but are depleting your RDS database connection pools.
KEY PRINCIPLES
• Identify service limits
• Automate known solutions
• Give everyone visibility
SOLUTION
• Always close database connections
• Scale your database
• Map your dependencies
Scale dependencies, too.
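"Always close database connections" can be made hard to forget by tying the close to a `with` block. The sketch below assumes a DB-API 2.0 style driver; `connect` is a hypothetical zero-argument factory standing in for your driver's connect call, and the `orders` table is illustrative.

```python
from contextlib import closing


def fetch_order_count(connect, query="SELECT COUNT(*) FROM orders"):
    """Run one query and always close the connection, even on errors."""
    with closing(connect()) as connection:
        with closing(connection.cursor()) as cursor:
            cursor.execute(query)
            return cursor.fetchone()[0]
```

Opening a connection per invocation adds latency but keeps the number of open connections bounded by the number of running invocations, which is the limit this scenario is about.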