This document discusses designing applications for high availability on AWS. It provides best practices for designing systems to be fault tolerant and self-healing. The key principles discussed are: 1) design for failure by avoiding single points of failure, 2) use multiple availability zones for redundancy, 3) implement auto-scaling for flexibility and fault tolerance, 4) incorporate self-healing techniques like health checks and auto-scaling policies, and 5) loosely couple components. The document explores how various AWS services like EC2, EBS, RDS, ELB, auto-scaling, S3 and Route 53 can be leveraged together to build highly available, fault tolerant systems on AWS infrastructure.
2. Designing for Availability
ME: Joel Williams, Solutions Architect at Amazon Web Services
YOU: here to learn more about designing your applications for high
availability on AWS
TODAY: about best practices and things to think about when building a
highly available application on AWS
3. What is High Availability?
Availability: Percentage of time an application operates during its work cycle
Loss of availability is known as an outage or downtime
• App is offline, unreachable, or partially available
• App is slow to use
• Planned and unplanned
Goal
• No downtime
• Always available
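The definition above can be turned into simple arithmetic. The following is a minimal sketch, not part of the original slides; the function names and the sample figures (a 720-hour month, a 99.9% target) are illustrative assumptions.

```python
def availability_percent(uptime_hours: float, downtime_hours: float) -> float:
    """Percentage of the work cycle during which the application operated."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("work cycle must be longer than zero hours")
    return 100.0 * uptime_hours / total


def allowed_downtime_hours(target_percent: float, cycle_hours: float = 365 * 24) -> float:
    """Downtime budget, in hours, implied by an availability target."""
    return cycle_hours * (1.0 - target_percent / 100.0)


# Hypothetical month: 720 hours, of which 2 hours were an outage
# (planned and unplanned downtime both count).
print(round(availability_percent(718, 2), 3))

# Yearly downtime budget for a hypothetical 99.9% target.
print(round(allowed_downtime_hours(99.9), 2))
```

Note that a slow or partially available application also counts as lost availability in the sense used above, so measured downtime should include those periods.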
4. Availability is related to
Scalability
• Ability of an application to accommodate growth without changing design
• If app cannot scale, availability may be impacted
• Scalability doesn’t guarantee availability
Fault Tolerance
• Built-in redundancy so apps can continue functioning when components fail
• Fault tolerance is crucial to HA
AWS democratizes High Availability
• Multiple servers, isolated redundant data centers, regions across the globe, fault-tolerant services, etc.
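Why redundancy matters for availability can be shown with a short calculation. This is a sketch under a strong assumption that is not stated in the slides: component failures are independent. The availability figures are made up for illustration and are not AWS numbers.

```python
def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components, any one of which can serve.

    Assumes failures are independent, which real systems only approximate.
    """
    return 1.0 - (1.0 - single) ** copies


def serial_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


# One server at a hypothetical 99% availability versus two redundant servers.
print(round(parallel_availability(0.99, 1), 6))
print(round(parallel_availability(0.99, 2), 6))

# A web tier and a database tier in series, each a single point of failure.
print(round(serial_availability(0.99, 0.99), 6))
```

The serial case shows why a single point of failure anywhere in the chain caps the availability of the whole application.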
12. Vertical Scaling
From $0.02/hr
Elastic Compute Cloud (EC2)
Basic unit of compute capacity
Range of CPU, memory & local disk options
42 Instance types available from 16 different families
Feature: Details
Flexible: Run Windows or Linux distributions
Scalable: Wide range of instance types from micro to cluster compute
Machine Images: Configurations can be saved as machine images (AMIs) from which new instances can be created
Full control: Full root or administrator rights
Secure: Full firewall control via Security Groups
Monitoring: Publishes metrics to CloudWatch
Inexpensive: On-Demand, Reserved and Spot instance types
VM Import/Export: Import and export VM images to transfer configurations in and out of EC2
Compute
Compute Storage
AWS Global Infrastructure
Database
App Services
Deployment & Administration
Networking
** MANY NEW INSTANCE TYPES
Amazon EC2 instances
26. Elastic Block Store
High performance block storage device
1GB to 1TB in size
Mount as drives to instances
Feature: Details
High performance file system: Mount EBS as drives and format as required
Flexible size: Volumes from 1 GB to 1 TB in size
Secure: Private to your instances
Performance: Use provisioned IOPS to get the desired level of IO performance
Available: Replicated within an Availability Zone
Backups: Volumes can be snapshotted for point-in-time restore
Monitoring: Detailed metrics captured via CloudWatch
Storage
[Diagram: EC2 instance, EBS volume, snapshot]
27-28. [Diagram: user; DNS resolution of www.example.com via Route 53; Internet gateway; Elastic IP; one web server EC2 instance with an EBS volume; RDS DB instance]
29-30. [Diagram: user; DNS resolution of www.example.com via Route 53; Internet gateway; web server EC2 instance with an EBS volume; a second EC2 instance; Elastic IP; RDS DB instance]
32. Elastic Load Balancing
Create highly scalable applications
Distribute load across EC2 instances in multiple
availability zones
Feature: Details
Auto-scaling: Automatically scales to handle request volume
Available: Load balance across instances in multiple availability zones
Health checks: Automatically checks health of instances and takes them in or out of service
Session stickiness: Route requests to the same instance
Secure sockets layer: Supports SSL offload from web and application servers with flexible cipher support
Monitoring: Publishes metrics to CloudWatch
** NEW CONNECTION DRAINING
AND NEW ACCESS LOGS
Compute
[Diagram: Elastic Load Balancing in front of two EC2 instances in an Auto Scaling Group]
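The health-check behaviour described above (instances are taken in or out of service automatically) can be modelled in a few lines. This is a toy sketch, not the Elastic Load Balancing API: the `LoadBalancer` class, the instance names and the round-robin choice are all assumptions made for illustration, and round robin is not a claim about how Elastic Load Balancing picks a target.

```python
class LoadBalancer:
    """Toy model: route only to instances whose last health check passed."""

    def __init__(self, instances):
        # Every registered instance starts in service.
        self.healthy = {name: True for name in instances}
        self._counter = 0

    def record_health_check(self, name, passed):
        """Take an instance out of service on failure, back in on success."""
        self.healthy[name] = passed

    def route(self):
        """Return the next in-service instance, round robin."""
        in_service = [name for name, ok in self.healthy.items() if ok]
        if not in_service:
            raise RuntimeError("no healthy instances in service")
        target = in_service[self._counter % len(in_service)]
        self._counter += 1
        return target


# Hypothetical instances in two availability zones.
lb = LoadBalancer(["web-az-a", "web-az-b"])
lb.record_health_check("web-az-a", False)   # instance in zone A fails its check
print([lb.route() for _ in range(3)])       # all traffic goes to zone B
lb.record_health_check("web-az-a", True)    # instance recovers
print(sorted({lb.route() for _ in range(2)}))
```

Because failed instances are skipped rather than removed, an instance that recovers starts receiving traffic again without any manual step.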
33. [Diagram: user; DNS resolution of www.example.com via Route 53; Internet gateway; Elastic IP; web server EC2 instance; RDS DB instance]
42. Relational Database Service
Database-as-a-Service
No need to install or manage database instances
Scalable and fault tolerant configurations
Feature: Details
Platform support: Create MySQL, SQL Server, PostgreSQL and Oracle RDBMS
Preconfigured: Get started instantly with sensible default settings
Automated patching: Keep your database platform up to date automatically
Backups: Automatic backups, point-in-time recovery and full DB backups
Provisioned IOPS: Specify IO throughput depending on requirements
Failover: Automated failover to slave hosts in the event of a failure
Replication: Easily create read replicas of your data and seamlessly replicate data across availability zones
Database
[Diagram: RDS DB instance; RDS DB instance standby (Multi-AZ); RDS DB instance read replica]
54. Auto Scaling
Automatic re-sizing of compute clusters based upon demand
Feature: Details
Control: Define minimum and maximum instance pool sizes and when scaling and cooldown occur
Integrated with CloudWatch: Use metrics gathered by CloudWatch to drive scaling
Instance types: Run Auto Scaling for On-Demand and Spot instances. Compatible with VPC
as-create-auto-scaling-group MyGroup \
  --launch-configuration MyConfig \
  --availability-zones eu-west-1a \
  --min-size 4 \
  --max-size 200
Compute – Auto Scaling
** NEW CONSOLE
[Diagram: Auto Scaling Group containing two EC2 instances]
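How minimum size, maximum size and cooldown interact can be sketched as a small simulation. This is not the Auto Scaling API; the `ScalingGroup` class, the CPU thresholds, the step of two instances and the 300-second cooldown are assumptions chosen for illustration. Only the minimum of 4 and maximum of 200 come from the command above.

```python
from dataclasses import dataclass


@dataclass
class ScalingGroup:
    """Toy model of a group with size bounds and a cooldown period."""

    min_size: int
    max_size: int
    desired: int
    cooldown_s: int = 300
    last_scaled_at: float = float("-inf")

    def apply_policy(self, adjustment: int, now: float) -> int:
        """Change desired capacity by `adjustment`, honoring cooldown and bounds."""
        if now - self.last_scaled_at < self.cooldown_s:
            return self.desired  # still cooling down, ignore the trigger
        new_desired = max(self.min_size, min(self.max_size, self.desired + adjustment))
        if new_desired != self.desired:
            self.desired = new_desired
            self.last_scaled_at = now
        return self.desired


def evaluate(group: ScalingGroup, avg_cpu: float, now: float,
             high: float = 70.0, low: float = 30.0) -> int:
    """Fire a scale-out or scale-in policy based on a monitored metric."""
    if avg_cpu > high:
        return group.apply_policy(+2, now)
    if avg_cpu < low:
        return group.apply_policy(-2, now)
    return group.desired


group = ScalingGroup(min_size=4, max_size=200, desired=4)
print(evaluate(group, avg_cpu=85, now=0))     # scale out
print(evaluate(group, avg_cpu=85, now=60))    # inside cooldown, no change
print(evaluate(group, avg_cpu=10, now=400))   # scale in
print(evaluate(group, avg_cpu=10, now=800))   # already at the minimum size
```

The cooldown keeps the group from reacting to a metric that has not yet had time to reflect the previous scaling action.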
56-57. [Diagram: user; DNS resolution of www.example.com via Route 53; Internet gateway; Elastic Load Balancing; Auto Scaling Group of four web server EC2 instances spanning Availability Zone A and Availability Zone B; RDS DB instance with a slave kept in sync by synchronous replication]
58. [Diagram: user; Route 53 (www.example.com); Internet gateway; Elastic Load Balancing; Auto Scaling Group of four web server EC2 instances spanning Availability Zone A and Availability Zone B; RDS DB instance with synchronous replication to a slave; labels "AMI" and "Auto Scaling Policy fires"]
59. [Diagram: user; Route 53 (www.example.com); Internet gateway; Elastic Load Balancing; Auto Scaling Group spanning Availability Zone A and Availability Zone B with four web server EC2 instances and two more labelled "launching"; RDS DB instance with synchronous replication to a slave]
60-61. [Diagram: user; Route 53 (www.example.com); Internet gateway; Elastic Load Balancing; Auto Scaling Group of six web server EC2 instances spanning Availability Zone A and Availability Zone B; RDS DB instance with synchronous replication to a slave]
62. [Diagram: user; Route 53 (www.example.com); Internet gateway; Elastic Load Balancing; Auto Scaling Group of six web server EC2 instances spanning Availability Zone A and Availability Zone B, two of them labelled "terminating"; RDS DB instance with synchronous replication to a slave]
63. [Diagram: user; Route 53 (www.example.com); Internet gateway; Elastic Load Balancing; Auto Scaling Group of four web server EC2 instances spanning Availability Zone A and Availability Zone B; RDS DB instance with synchronous replication to a slave]
65. RDS - Push-Button Scaling
Scale up or down to the desired instance class
Scale up to a 32-vCPU server with 244 GB of RAM with the cr1.8xlarge
66. Scaling READS: Use Cases
Reporting and ETL
Discrete read/write transactions (browsers vs buyers)
Scale out with one or more read servers in a master-slave architecture
67. Scaling READS: Tech tips
• Optimize master for OLTP and read slaves for table scans
• Resize slaves as needed to boost reporting performance
• Use short-term slaves to save cost during monthly reporting
• Promote to standalone server
• NEW - Cross Region Read Replicas with MySQL
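Splitting reads from writes at the application layer could look like the sketch below. It is an illustration only: the `ReplicatedDatabase` class and the endpoint names are hypothetical, and classifying a statement by its leading `SELECT` keyword is a deliberate simplification.

```python
import itertools


class ReplicatedDatabase:
    """Toy router: writes go to the master, reads rotate across read replicas."""

    def __init__(self, master, read_replicas):
        self.master = master
        # With no replicas configured, every statement falls back to the master.
        self._replicas = itertools.cycle(read_replicas) if read_replicas else None

    def endpoint_for(self, sql):
        """Pick an endpoint for a statement based on whether it is a read."""
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self._replicas is not None:
            return next(self._replicas)
        return self.master


# Hypothetical endpoint names.
db = ReplicatedDatabase("master", ["replica-1", "replica-2"])
print(db.endpoint_for("SELECT * FROM products"))          # browsers read
print(db.endpoint_for("SELECT * FROM products"))
print(db.endpoint_for("INSERT INTO orders VALUES (1)"))   # buyers write
```

Read replicas are updated asynchronously, so a read issued right after a write may not see it yet; reads that must reflect the latest write are usually sent to the master.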
68. Scaling for Writes on the Data Tier
At large scale, you may start to run into issues with your database around contention on writes to the master.
How can you solve it?
• Federation (splitting into multiple DBs based on function)
• Sharding (splitting one data set up across multiple hosts)
• Moving some functionality to other types of DBs (NoSQL)
69. Database Federation
• Split up databases by function/purpose
• Harder to do cross-function queries
• Essentially delays the need for something like sharding / NoSQL until much further down the line
• Won’t help with single huge functions/tables
[Diagram: ForumsDB, UsersDB, ProductsDB]
70. Sharded Horizontal Scaling
More complex at the application layer
ORM support can help
No practical limit on scalability
Operational complexity/sophistication
Shard by function or key space
RDBMS or NoSQL
User: ShardID
002345: A
002346: B
002347: C
002348: B
002349: A
[Diagram: shards A, B and C]
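The user-to-shard table above amounts to a lookup directory. A minimal sketch of how application code might resolve a shard follows; the directory contents come from the slide, while the hash-based fallback for unknown users is an assumption added for illustration and is not described in the original.

```python
import hashlib

# Directory taken from the slide: user ID -> shard ID.
SHARD_DIRECTORY = {
    "002345": "A",
    "002346": "B",
    "002347": "C",
    "002348": "B",
    "002349": "A",
}
SHARDS = ["A", "B", "C"]


def shard_for(user_id: str) -> str:
    """Return the shard that holds a user's data."""
    if user_id in SHARD_DIRECTORY:
        return SHARD_DIRECTORY[user_id]
    # Assumed fallback: spread unknown users over the key space with a stable hash.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]


print(shard_for("002347"))
print(shard_for("002348"))
```

A directory makes it possible to move individual users between shards, at the cost of one more lookup; pure hashing avoids the lookup but makes rebalancing harder when shards are added.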
73-74. [Diagram: user; Route 53 (www.example.com); Internet gateway; Elastic Load Balancing; Auto Scaling Group of four web server EC2 instances spanning Availability Zone A and Availability Zone B; RDS DB instance with synchronous replication to a slave]
75. [Diagram: user; Route 53 (www.example.com); Internet gateway; Elastic Load Balancing; Auto Scaling Group spanning Availability Zone A and Availability Zone B with four web server EC2 instances and one more labelled "launching"; RDS DB instance with synchronous replication to a slave]
76. [Diagram: user; Route 53 (www.example.com); Internet gateway; Elastic Load Balancing; Auto Scaling Group of four web server EC2 instances spanning Availability Zone A and Availability Zone B; RDS DB instance with synchronous replication to a slave]
81-82. [Diagram: user; Route 53 (www.example.com); Internet gateway; Elastic Load Balancing; Auto Scaling Group of four web server EC2 instances spanning Availability Zone A and Availability Zone B; RDS DB instance with synchronous replication to a slave; S3 static website for www.example.com]
85. Services Oriented Architecture - SOA
Move services into their own tiers/modules. Treat each of these as a wholly separate piece of your infrastructure and scale them independently.
Amazon.com and AWS do this extensively! It offers flexibility and a greater understanding of each component.
86. Loose coupling sets you free!
The looser they're coupled, the bigger they scale
• Independent components
• Design everything as a black box
• Decouple interactions
• Favor services with built-in redundancy and scalability rather than building your own
88. Amazon SQS
Reliable, highly scalable, queue service
for storing messages as they travel
between instances
Feature: Details
Reliable: Messages stored redundantly across multiple availability zones
Simple: Simple APIs to send and receive messages
Scalable: Unlimited number of messages
Secure: Authentication of queues to ensure controlled access
Application Services
[Diagram: one instance puts a message on the SQS queue and another instance gets it; an Amazon SNS topic publishes a notification to a queue that is subscribed to the topic]
91. Photo CMS with SQS
[Diagram: user; Route 53 (www.example.com); S3 bucket; webservers / CMS; SQS; workers; arrows numbered 1 to 6]
1) User / browser posts photo to S3 and is redirected to form on webservers
2) User completes form for photo and submits
3) Message is sent to SQS
4) Worker long polling SQS grabs message and creates different size photo assets
5) Thumbs are uploaded to S3 bucket
6) Worker updates database with photo assets
94. Photo CMS with SQS
[Diagram: same components and steps 1 to 6 as slide 91; a message is shown in the SQS queue]
95. Photo CMS with SQS
[Diagram: same components and steps 1 to 6 as slide 91; at step 4 the message reappears in the queue]
96. Photo CMS with SQS
[Diagram: same components and steps 1 to 6 as slide 91; the message is shown at step 4]
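The "message reappears in queue" behaviour on these slides can be modelled with an in-memory queue. This is a toy model, not the SQS API: the `InMemoryQueue` class and its methods are hypothetical, and the mechanism it imitates is the SQS visibility timeout, under which a received message is hidden for a period and becomes visible again if it is not deleted in time.

```python
class InMemoryQueue:
    """Toy queue: a received message is hidden, not removed, until it is deleted."""

    def __init__(self, visibility_timeout_s=30):
        self.visibility_timeout_s = visibility_timeout_s
        self._messages = {}   # message id -> [body, time at which it is visible again]
        self._next_id = 0

    def send(self, body):
        self._next_id += 1
        self._messages[self._next_id] = [body, 0.0]
        return self._next_id

    def receive(self, now):
        """Return (id, body) of the first visible message and hide it for a while."""
        for msg_id, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + self.visibility_timeout_s
                return msg_id, entry[0]
        return None

    def delete(self, msg_id):
        """Called by a worker only after it has finished processing."""
        self._messages.pop(msg_id, None)


queue = InMemoryQueue(visibility_timeout_s=30)
queue.send("resize photo-123")

print(queue.receive(now=0))    # worker 1 grabs the message, then crashes
print(queue.receive(now=10))   # still hidden from other workers
print(queue.receive(now=31))   # message reappears, worker 2 grabs it
queue.delete(1)                # worker 2 finishes and deletes it
print(queue.receive(now=100))  # nothing left to process
```

Because a message can be delivered more than once in this scheme, the worker's processing (creating thumbnails, updating the database) is best written so that repeating it is harmless.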
98. Photo CMS: Scaling with SQS
[Diagram: same components and steps 1 to 6 as slide 91; a backlog of messages builds up in SQS; two Auto Scaling Groups are shown]
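One way to turn a backlog of messages into a worker count is sketched below. The slides do not give a formula; the function, the per-worker throughput and the target drain time are assumptions for illustration only.

```python
import math


def desired_workers(backlog: int, per_worker_per_min: int, drain_minutes: int,
                    min_workers: int = 1, max_workers: int = 20) -> int:
    """Workers needed to drain `backlog` messages within `drain_minutes`."""
    needed = math.ceil(backlog / (per_worker_per_min * drain_minutes))
    # Keep the result inside the group's size bounds.
    return max(min_workers, min(max_workers, needed))


# Hypothetical numbers: each worker resizes 10 photos per minute,
# and the backlog should be cleared within 15 minutes.
print(desired_workers(backlog=1200, per_worker_per_min=10, drain_minutes=15))
print(desired_workers(backlog=0, per_worker_per_min=10, drain_minutes=15))
print(desired_workers(backlog=100000, per_worker_per_min=10, drain_minutes=15))
```

Scaling workers on queue depth rather than CPU lets the web tier and the worker tier grow independently, which is the point of decoupling them with a queue.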
99. Lambda
Event driven compute
Connective tissue for AWS services
Feature: Details
Stateless: Request-driven code called Lambda functions, triggered by events
Easy: Fixed OS and language - JavaScript
Management: AWS owns and manages the infrastructure
Scaling: Implicit scaling; just make requests
Compute
[Diagram: event sources for Lambda. S3 bucket: push (event notification). DynamoDB: pull (DynamoDB Stream). Kinesis: pull (Kinesis stream).]
100. Photo CMS with Lambda
[Diagram: user; Route 53 (www.example.com); S3 bucket; webservers / CMS; Lambda; arrows numbered 1 to 5]
1) User / browser posts photo to S3 and is redirected to form on webservers
2) The redirected user completes form for photo and submits
3) At the same time as the redirect, S3 event notifications fire off and are received by Lambda
4) Lambda creates different size photo assets and uploads them to S3
5) Lambda updates database with photo assets
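Steps 3 and 4 can be sketched as a handler that reads the S3 event notification and plans the thumbnail assets. The slides list JavaScript as the supported language; Python is used here only to stay consistent with the other examples. The thumbnail widths, the `thumbs/` key layout and the function names are hypothetical, and the actual resizing, upload and database update are left out.

```python
import posixpath
from urllib.parse import unquote_plus

THUMBNAIL_WIDTHS = (128, 512)  # hypothetical asset sizes


def thumbnail_keys(key):
    """Plan one output key per thumbnail width for an uploaded photo."""
    stem, ext = posixpath.splitext(key)
    return [f"thumbs/{width}/{stem}{ext or '.jpg'}" for width in THUMBNAIL_WIDTHS]


def handler(event, context):
    """Entry point invoked with an S3 event notification."""
    planned = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        planned.append({"bucket": bucket, "source": key,
                        "thumbnails": thumbnail_keys(key)})
    return planned


# Minimal hand-written event containing only the fields the handler reads.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "photos"},
                "object": {"key": "uploads/my+cat.jpg"}}}
    ]
}
print(handler(sample_event, None))
```

If thumbnails are written back to the bucket that triggers the function, the notification would need to be limited to the upload prefix so that the function does not trigger itself.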
101. 1. DESIGN FOR FAILURE
2. MULTIPLE AVAILABILITY ZONES
3. SCALING
4. SELF-HEALING
5. LOOSE COUPLING