Slides from CON314 "Automating Zero-Downtime Production Cluster Upgrades for Amazon ECS" - a chalk talk presented at AWS re:Invent 2017
Containers make it easy to deploy new code into production to update the functionality of a service, but what happens when you need to update the EC2 compute instances that your containers are running on? In this talk, we’ll deep dive into how to upgrade the EC2 infrastructure underlying a live production Amazon ECS cluster without affecting service availability. Matt Callanan, Engineering Manager at Expedia, will walk through Expedia’s “PRISM” project that safely relocates hundreds of tasks onto new Amazon EC2 instances with zero downtime to applications.
2. Matt Callanan
Engineering Manager/Tech Lead
“Cloud Acceleration Team”
Expedia
Brisbane, Australia
• mcallanan@expedia.com
• linkedin.com/in/matthewcallanan
• @mcallana
3. Cluster Management
• How to upgrade the EC2 infrastructure underlying a live production Amazon ECS cluster without affecting service availability.
• How to safely relocate hundreds of tasks onto new Amazon EC2 instances with zero downtime to applications.
4. Expedia ECS Cluster Statistics
(Map: Region 1, Region 2, Region 3, Region 4, Region 5)
2,600 ECS Services (1,100 Applications)
13,000 Containers
860 EC2 Instances (13 ECS Clusters)
5. Expedia ECS Cluster Topology
(Map: Region 1, Region 2, Region 3, Region 4, Region 5)
Production Cluster, Test Environment Cluster
480 Services
230 Instances
8. Immutable Servers
Golden AMI, built in layers:
• ecs-optimized AMI (docker, ecs-agent)
• Expedia standard image
• Docker Config
• Daemon containers
Layer sources: Amazon-provided Base AMI; Standard Chef cookbook; Custom setup baked into AMI
9. Immutable Servers
Golden AMI: ecs-optimized AMI (docker, ecs-agent), Expedia standard image, Docker Config, Daemon containers
Cluster Instance: the same layers as the Golden AMI, plus a custom bootstrap:
• ECS Cluster Config
• Start ECS Agent, Docker
• Cron: Restart ECS agent
• Cron: Custom Metrics
12. Default CloudFormation Auto-Scaling Rolling Update
Diagram legend: Old Instance, New Instance, Terminating Instance, Active Task, Relocated Task, Stopped Task
14. Default CloudFormation Auto-Scaling Rolling Update
Problem:
Tasks stop before they are relocated. Even with tasks on distinct instances, an outage can happen if new instances are not pulling images fast enough.
16. Default CloudFormation Auto-Scaling Rolling Update
Problem:
Tasks start on instances about to be terminated
27. Default CloudFormation Auto-Scaling Rolling Update
Problem:
Service “A” experiences downtime
32. Default CloudFormation Auto-Scaling Rolling Update
Problem:
Tasks bunch up on first few new instances
33. First Approach: Problem Summary
• CloudFormation AutoScaling Asynchrony
• Rolling Update not granular enough
• Outside our control
• Rollbacks with no ability for manual intervention
• Tasks stop before they are relocated
• Service outage can happen if new instances are not pulling images fast enough
• Tasks are not evenly spread across new instances
34. Second Approach
• Custom Rolling Update automation
• Programmatic Update Script using Ruby SDK
• But still took 10 (nail-biting) hours to update a cluster with 100 instances
• Any issues during the process required manual intervention
36. “PRISM” Goals
Safety
• Zero downtime for applications as their workloads get relocated onto new instances
Speed
• Complete as fast as possible
Rollbackable
• Quickly retreat back to known-good state if anything goes wrong
Idempotent
• Resumable if anything goes wrong
Avoid “thundering herd” scenario
• Drain in batches to prevent burden on Docker registry and network
• Avoid having tasks relocated to instances about to be drained
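The "drain in batches" and "resumable" goals can be illustrated with a small planning helper. This is only a sketch: the function name, the `already_drained` bookkeeping, and the instance IDs are hypothetical, and the default batch size of 3 is borrowed from the Phase 2 slides.

```ruby
require 'set'

# Hypothetical helper: plan the drain batches that are still outstanding.
# instance_ids    - every old instance that must be drained, in a stable order
# already_drained - IDs recorded as finished by an earlier, interrupted run
def remaining_batches(instance_ids, already_drained: Set.new, batch_size: 3)
  raise ArgumentError, 'batch_size must be positive' unless batch_size.positive?

  instance_ids.reject { |id| already_drained.include?(id) }
              .each_slice(batch_size)
              .to_a
end

old = %w[i-01 i-02 i-03 i-04 i-05 i-06 i-07]
p remaining_batches(old)
p remaining_batches(old, already_drained: Set['i-01', 'i-02', 'i-03'])
```

Because the plan is recomputed from what has already been drained, re-running after a failure picks up at the first unfinished batch instead of starting over.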
43. Cluster Update – Phase 1: Expand
Diagram legend: Old Instance, New Instance, Draining Instance, Active Task, Relocated Task, “DRAINING” Task
Blue CFN Stack with Blue Auto-Scaling Group, running tasks of service “A”
44. Cluster Update – Phase 1: Expand
Blue CFN Stack with Blue Auto-Scaling Group
Disable Auto-Scaling Processes
45. Cluster Update – Phase 1: Expand
Green CFN Stack with Green Auto-Scaling Group created alongside the Blue stack
Disable Auto-Scaling Processes on both groups
46. Cluster Update – Phase 1: Expand
Blue and Green stacks both present, Auto-Scaling Processes disabled on both groups
state = ‘pre-drain’
47. Cluster Update – Phase 2: Relocate Tasks (batches of 3 instances)
Blue and Green stacks both present, Auto-Scaling Processes disabled on both groups
state = ‘pre-drain’
56. Cluster Update – Phase 3: Clean Up
Only the Green CFN Stack and Green Auto-Scaling Group remain
Disable Auto-Scaling Processes
57. Cluster Update – Phase 3: Clean Up
Green CFN Stack with Green Auto-Scaling Group
Resume Auto-Scaling Processes
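The three phases can be summarised as a rough sketch. It is not the PRISM implementation. It assumes `ecs` and `asg` objects that respond like `Aws::ECS::Client` and `Aws::AutoScaling::Client` from the AWS SDK for Ruby; the class name, method names, and polling interval are hypothetical. Creating the Green stack and deleting the Blue stack are left to the caller, and the wait loop has no timeout.

```ruby
# Sketch only: expand / relocate / clean-up, with injected AWS clients.
class ClusterUpdate
  def initialize(ecs:, asg:, cluster:, batch_size: 3, poll_seconds: 15)
    @ecs = ecs
    @asg = asg
    @cluster = cluster
    @batch_size = batch_size
    @poll_seconds = poll_seconds
  end

  # Phase 1: suspend scaling on both groups, then mark every old container
  # instance so the placement constraint keeps new tasks away from it.
  def expand(blue_asg:, green_asg:, old_instances:)
    [blue_asg, green_asg].each { |name| @asg.suspend_processes(auto_scaling_group_name: name) }
    old_instances.each do |arn|
      @ecs.put_attributes(
        cluster: @cluster,
        attributes: [{ name: 'state', value: 'pre-drain',
                       target_type: 'container-instance', target_id: arn }]
      )
    end
  end

  # Phase 2: drain old instances one batch at a time, waiting for each batch
  # to report no tasks before starting the next.
  def relocate(old_instances:)
    old_instances.each_slice(@batch_size) do |batch|
      @ecs.update_container_instances_state(
        cluster: @cluster, container_instances: batch, status: 'DRAINING'
      )
      sleep(@poll_seconds) until batch.all? { |arn| idle?(arn) }
    end
  end

  # Phase 3: once the caller has removed the old stack, let the new group
  # scale again.
  def clean_up(green_asg:)
    @asg.resume_processes(auto_scaling_group_name: green_asg)
  end

  private

  # True when ECS lists no tasks for the container instance.
  def idle?(arn)
    @ecs.list_tasks(cluster: @cluster, container_instance: arn).task_arns.empty?
  end
end
```

Passing the clients in keeps the sequence testable with stand-in objects before it is pointed at a real cluster.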
59. Pre-flight Checks
• Before Cluster Create/Update
• Check the number of instances available for the target instance type
• Check the IP addresses available in the target subnets
• Check the EBS volume space available for the target volume type
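As an illustration of the subnet check only, here is a minimal sketch. It assumes an `ec2` object that responds like `Aws::EC2::Client#describe_subnets`; the helper names, the error class, and the `ips_per_instance` parameter are hypothetical. Summing across subnets is a coarse test, since instances may not spread evenly.

```ruby
# Sketch only: fail fast when the target subnets lack free IP addresses.
PreflightError = Class.new(StandardError)

# Sum the free IP addresses reported for the target subnets.
def free_ip_addresses(ec2, subnet_ids)
  ec2.describe_subnets(subnet_ids: subnet_ids)
     .subnets
     .sum(&:available_ip_address_count)
end

# Returns the free address count, or raises if the new instances would not
# fit alongside the old ones.
def check_subnet_capacity!(ec2, subnet_ids:, new_instances:, ips_per_instance: 1)
  needed = new_instances * ips_per_instance
  free = free_ip_addresses(ec2, subnet_ids)
  return free if free >= needed

  raise PreflightError, "need #{needed} IP addresses, only #{free} free in #{subnet_ids.join(', ')}"
end
```

The instance-type and EBS checks would follow the same shape: gather the available amount, compare it with what the update needs, and stop before any stack is touched.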
60. Drain on Scale-down with Lifecycle Lambda
• Lambda triggered by AutoScaling EC2_INSTANCE_TERMINATE SNS events
• Updates instance state to “DRAINING”
• ASG has a 30-minute heartbeat to keep the instance in the Terminating:Wait state for 30 minutes
• Allows ECS to safely relocate any tasks that are part of a service to another instance
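A rough sketch of the drain-on-terminate logic follows. It is not Expedia's Lambda. It assumes `ecs` and `asg` objects that respond like `Aws::ECS::Client` and `Aws::AutoScaling::Client`, and a parsed lifecycle-hook notification whose transition is `autoscaling:EC2_INSTANCE_TERMINATING`. The function names and return symbols are hypothetical, and how the function gets re-invoked while tasks are still draining is left out.

```ruby
require 'json'

# Sketch only: drain the terminating instance, keep the lifecycle hook alive
# while tasks remain, and release the instance once it is empty.
def handle_terminating(message, ecs:, asg:, cluster:)
  return :ignored unless message['LifecycleTransition'] == 'autoscaling:EC2_INSTANCE_TERMINATING'

  hook = {
    lifecycle_hook_name: message['LifecycleHookName'],
    auto_scaling_group_name: message['AutoScalingGroupName'],
    lifecycle_action_token: message['LifecycleActionToken']
  }
  # Map the EC2 instance ID to its ECS container instance.
  arns = ecs.list_container_instances(
    cluster: cluster, filter: "ec2InstanceId == #{message['EC2InstanceId']}"
  ).container_instance_arns

  unless arns.empty?
    ecs.update_container_instances_state(cluster: cluster, container_instances: arns, status: 'DRAINING')
    if arns.any? { |arn| ecs.list_tasks(cluster: cluster, container_instance: arn).task_arns.any? }
      asg.record_lifecycle_action_heartbeat(**hook)
      return :draining
    end
  end

  asg.complete_lifecycle_action(**hook, lifecycle_action_result: 'CONTINUE')
  :released
end

# Extract the parsed notifications from an SNS-triggered Lambda event.
def sns_messages(event)
  event.fetch('Records', []).map { |record| JSON.parse(record['Sns']['Message']) }
end
```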
62. Avoid Relocating to Old Instance
• Problem:
• Tasks can get rescheduled to another old instance in the ASG that is about to be replaced, so tasks can get bumped from instance to instance until all instances are replaced
• Solution:
• Deploy services with placement constraint:
placement_constraints = [{
  type: 'memberOf',
  expression: 'attribute:state !exists or attribute:state != pre-drain'
}]
• This means that a service won’t be placed on an instance that has an attribute named “state” with value “pre-drain”
• At cluster replacement time, we will stand up all new clusters, place old clusters into the “pre-drain” state, and terminate the old instances in batches.
• Relocated tasks will only be placed on new instances, avoiding the default “thundering herd” scenario
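To make the expression's effect concrete, the following local stand-in mirrors its logic on plain hashes. It is for illustration only and does not call ECS. The function name, instance names, and the "active" value are hypothetical.

```ruby
# Local stand-in for the scheduler's check, mirroring
# 'attribute:state !exists or attribute:state != pre-drain'.
def placeable?(instance_attributes)
  !instance_attributes.key?('state') || instance_attributes['state'] != 'pre-drain'
end

instances = {
  'new-1' => {},                          # fresh instance, no "state" attribute
  'old-1' => { 'state' => 'pre-drain' },  # marked during a cluster update
  'old-2' => { 'state' => 'active' }      # hypothetical other value
}
eligible = instances.select { |_id, attrs| placeable?(attrs) }.keys
p eligible # prints ["new-1", "old-2"]
```

Only the instance carrying state = pre-drain is excluded. Instances without the attribute stay eligible, which is why new instances need no extra setup.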
64. Task Definition Placement Constraint
• “state” is a custom ECS Instance Attribute
• By default, the “state” attribute doesn’t exist on instances
• Set only to “pre-drain” during cluster update
• Prevents ECS from scheduling tasks on instances that are about to be drained
• Removed from instance only in case of rollback of cluster update
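For the rollback case, removing the attribute could look like the sketch below. It assumes an `ecs` object that responds like `Aws::ECS::Client#delete_attributes`; the method name and the one-call-per-instance approach are hypothetical choices, not taken from the talk.

```ruby
# Sketch only: make old instances schedulable again by deleting the custom
# "state" attribute from each container instance.
def clear_pre_drain(ecs, cluster:, container_instances:)
  container_instances.each do |id|
    ecs.delete_attributes(
      cluster: cluster,
      attributes: [{ name: 'state', target_type: 'container-instance', target_id: id }]
    )
  end
end
```

Once the attribute is gone, the placement constraint's `!exists` branch matches again and the old instances can accept tasks.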
65. Instance Launch Considerations
• Worked with AWS Auto-Scaling team to enable more appropriate Auto-Scaling “Launch Rate”
• Start the ECS agent with exponential backoff
• Throttle on container instance registration rate: 1 per second, 60 max per minute
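Exponential backoff with jitter can be sketched as below. The talk does not show how the agent start was wrapped, so the helper names, the 1-second base, the 60-second cap, and the jitter range are all assumptions.

```ruby
# Hypothetical helper: delays double from `base` seconds up to `cap`.
def backoff_delays(attempts:, base: 1.0, cap: 60.0)
  Array.new(attempts) { |n| [base * (2**n), cap].min }
end

# Retry the block, sleeping between failed attempts. `sleeper` is injectable
# so the schedule can be exercised without actually waiting.
def with_backoff(attempts: 6, base: 1.0, cap: 60.0, sleeper: method(:sleep))
  delays = backoff_delays(attempts: attempts, base: base, cap: cap)
  delays.each_with_index do |delay, index|
    begin
      return yield(index + 1)
    rescue StandardError
      raise if index == delays.length - 1 # out of attempts, surface the error

      # Jitter: sleep between 50% and 100% of the delay so that instances
      # launched together do not retry in lockstep.
      sleeper.call(delay * (0.5 + rand / 2))
    end
  end
end

p backoff_delays(attempts: 5) # prints [1.0, 2.0, 4.0, 8.0, 16.0]
```

A bootstrap script could wrap the agent start in `with_backoff { ... }`, so a throttled registration is retried later instead of failing the instance.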
66. Gotchas
• Found a number of apps don't relocate easily
• Slows down prod upgrades and old stack decommissions
67. Related
AWS re:Invent 2017: Going Big with Containers: Customer Case Studies of Large-Scale (ENT209)
• https://www.youtube.com/watch?v=L3l_ZiYRrks
• Covers full scope of Expedia’s ECS deployment automation platform