Automating Zero-Downtime Production
Cluster Upgrades for Amazon ECS
Matt Callanan
Engineering Manager/Tech Lead
“Cloud Acceleration Team”
Expedia
Brisbane, Australia
• mcallanan@expedia.com
• linkedin.com/in/matthewcallanan
• @mcallana
Cluster Management
• How to upgrade the EC2 infrastructure underlying a live
production Amazon ECS cluster without affecting service
availability.
• How to safely relocate hundreds of tasks onto new
Amazon EC2 instances with zero downtime for
applications.
Expedia ECS Cluster Statistics
[Map: Region 1, Region 2, Region 3, Region 4, Region 5]
2,600 ECS Services (1,100 Applications)
13,000 Containers
860 EC2 Instances (13 ECS Clusters)
Expedia ECS Cluster Topology
[Map: Region 1, Region 2, Region 3, Region 4, Region 5; cluster types shown: Production Cluster, Test Environment Cluster]
480 Services
230 Instances
Production Cluster Visualization
230 Instances 480 Services 3,200 Containers
c3vis Open Source: https://github.com/ExpediaDotCom/c3vis
ECS Cluster Creation
[Diagram: a CloudFormation stack containing an Auto Scaling group of EC2 instances, which back the Amazon ECS cluster]
Immutable Servers
Golden AMI layers:
• ecs-optimized AMI (docker, ecs-agent)
• Expedia standard image
• Docker Config
• Daemon containers
Layer sources:
• Amazon-provided Base AMI
• Standard Chef cookbook
• Custom setup baked into AMI
Cluster Instance (Golden AMI plus custom bootstrap):
• ECS Cluster Config
• Start ECS Agent, Docker
• Cron: Restart ECS agent
• Cron: Custom Metrics
Zero-Downtime Cluster Upgrades
First Approach
• CloudFormation Auto-Scaling Rolling Update
Default Cloud Formation Auto-Scaling Rolling Update
Old Instance
New Instance
Terminating Instance
Active Task
Relocated Task
Stopped Task
Problem:
Tasks stop before they are relocated.
Even with tasks on distinct instances, outage can
happen if new instances are not pulling images
fast enough
Problem:
Tasks start on instances
about to be terminated
Problem:
Service “A” experiences downtime
Problem:
Tasks bunch up on first
few new instances
First Approach: Problem Summary
• Asynchrony between CloudFormation and Auto Scaling
• Rolling Update not granular enough
• Outside our control
• Rollbacks with no ability for manual
intervention
• Tasks stop before they are relocated
• Service outage can happen if new
instances are not pulling images fast
enough
• Tasks are not evenly spread across new
instances
Second Approach
• Custom Rolling Update automation
• Programmatic Update Script using
Ruby SDK
• But still took 10 (nail-biting) hours to
update a cluster with 100 instances
• Any issues during the process required
manual intervention
Third Approach
• “PRISM”
• Project Replaced in Sixty Minutes
“PRISM” Goals
Safety
• Zero-downtime for applications as their workloads get relocated onto new instances
Speed
• Complete as fast as possible
Rollbackable
• Quickly retreat back to known-good state if anything goes wrong
Idempotent
• Resumable if anything goes wrong
Avoid “thundering herd” scenario
• Drain in batches to prevent burden on Docker registry and network
• Avoid having tasks relocated to instances about to be drained
“PRISM” Phases
Phase 1: Expand
Phase 2: Relocate Tasks
Phase 3: Clean Up
Zero-Downtime Cluster Updates
[Diagram: Amazon ECS cluster backed by one CloudFormation stack (Auto Scaling group of EC2 instances)]
Phase 1: Expand Cluster
[Diagram: the same Amazon ECS cluster backed by two CloudFormation stacks, each with its own Auto Scaling group of EC2 instances]
Phase 2: Relocate Tasks
[Diagram: two stacks; one Auto Scaling group is marked “Draining…”]
Phase 3: Clean Up
[Diagram: two stacks, then a single remaining CloudFormation stack and Auto Scaling group]
Cluster Update – Phase 1: Expand
Legend: Old Instance, New Instance, Draining Instance; Active Task, Relocated Task, “DRAINING” Task
• Start: Blue Auto-Scaling Group (Blue CFN Stack) running tasks of service “A”
• Disable Auto-Scaling Processes on the Blue Auto-Scaling Group
• Add Green Auto-Scaling Group (Green CFN Stack), also with Auto-Scaling Processes disabled
• Set state = ‘pre-drain’ on the Blue (old) instances
Cluster Update – Phase 2: Relocate Tasks (batches of 3 instances)
• Blue instances are drained in batches of 3 while state = ‘pre-drain’ stays set on the remaining Blue instances
• Tasks on draining instances become “DRAINING” and are relocated onto Green (new) instances
• Auto-Scaling Processes remain disabled on both groups
Cluster Update – Phase 3: Clean Up
• Blue Auto-Scaling Group and Blue CFN Stack are removed
• Resume Auto-Scaling Processes on the Green Auto-Scaling Group
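The batching in Phase 2 can be sketched in plain Ruby. This is only an illustration of the idea, not the PRISM code: the helper names, the instance IDs, and the `drain`/`drained` callables are hypothetical stand-ins for whatever AWS calls a real tool would make.

```ruby
# Hypothetical sketch of Phase 2 batching; not the actual PRISM implementation.
BATCH_SIZE = 3 # the slides drain in batches of 3 instances

# Split the old ("blue") instance IDs into drain batches.
def plan_drain_batches(old_instance_ids, batch_size = BATCH_SIZE)
  raise ArgumentError, "batch_size must be positive" unless batch_size.positive?

  old_instance_ids.each_slice(batch_size).to_a
end

# Drain one batch at a time and only move on once the whole batch is empty.
# `drain` and `drained` are injected callables (assumptions) so the flow can
# be exercised without touching AWS.
def relocate_in_batches(old_instance_ids, drain:, drained:, pause: -> {})
  completed = []
  plan_drain_batches(old_instance_ids).each do |batch|
    batch.each { |id| drain.call(id) }
    pause.call until batch.all? { |id| drained.call(id) }
    completed << batch
  end
  completed
end

# Usage with stand-ins that treat an instance as drained as soon as it is asked to drain.
draining = []
batches = relocate_in_batches(
  %w[i-01 i-02 i-03 i-04 i-05 i-06 i-07],
  drain: ->(id) { draining << id },
  drained: ->(id) { draining.include?(id) }
)
p batches
```

Waiting for a batch to empty before starting the next one is what limits the load on the Docker registry and network at any one time.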
Enabling Aspects
Pre-flight Checks
• Before Cluster Create/Update
• Check # instances available of target instance type
• Check IP addresses available in target subnets
• Check EBS volume space available of target volume type
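A pre-flight check along these lines could be sketched as a pure function. The inputs are plain values here; in practice they would be looked up from AWS. The function name, parameter names, and the assumption of one IP address per instance are illustrative only and do not come from the talk.

```ruby
# Hypothetical pre-flight check; all names and thresholds are assumptions.
PreflightResult = Struct.new(:ok, :failures)

def preflight(required_instances:, instances_available:, free_ips_by_subnet:,
              ebs_gib_per_instance:, ebs_gib_available:)
  failures = []

  if instances_available < required_instances
    failures << "instances: need #{required_instances}, have #{instances_available}"
  end

  # Assumes one IP address per instance; other network setups may need more.
  free_ips = free_ips_by_subnet.values.sum
  failures << "IPs: need #{required_instances}, have #{free_ips}" if free_ips < required_instances

  ebs_needed = required_instances * ebs_gib_per_instance
  failures << "EBS GiB: need #{ebs_needed}, have #{ebs_gib_available}" if ebs_gib_available < ebs_needed

  PreflightResult.new(failures.empty?, failures)
end

result = preflight(
  required_instances: 10,
  instances_available: 20,
  free_ips_by_subnet: { "subnet-a" => 4, "subnet-b" => 3 },
  ebs_gib_per_instance: 100,
  ebs_gib_available: 5_000
)
puts result.ok ? "safe to proceed" : "abort: #{result.failures.join('; ')}"
```

Running every check and collecting all failures, rather than stopping at the first, lets an operator fix every shortfall before retrying.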
Drain on Scale-down with Lifecycle Lambda
• Lambda triggered by Auto Scaling
EC2_INSTANCE_TERMINATING lifecycle events, delivered via SNS
• Updates instance state to “DRAINING”
• Lifecycle hook has a 30-minute heartbeat timeout, keeping the instance in the
Terminating:Wait state for up to 30 minutes
• Allows ECS to safely relocate any tasks that are part of a
service to another instance
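The talk does not show the Lambda's code, so the following is only a sketch of the parsing and decision logic such a function might contain. The function names and the `:record_heartbeat` / `:complete_continue` symbols are hypothetical. A real handler would follow up with the ECS and Auto Scaling APIs (setting the container instance to DRAINING, then recording a heartbeat or completing the lifecycle action), which are deliberately left out here.

```ruby
require "json"

TERMINATING = "autoscaling:EC2_INSTANCE_TERMINATING"

# Extract one drain request per terminating-instance message in an
# SNS-triggered event. Messages for other transitions are skipped.
def drain_requests_from(sns_event)
  sns_event.fetch("Records", []).filter_map do |record|
    message = JSON.parse(record.dig("Sns", "Message") || "{}")
    next unless message["LifecycleTransition"] == TERMINATING

    {
      instance_id: message["EC2InstanceId"],
      asg_name: message["AutoScalingGroupName"],
      hook_name: message["LifecycleHookName"],
      token: message["LifecycleActionToken"]
    }
  end
end

# Hypothetical decision: keep the instance in Terminating:Wait while tasks
# are still running, and let termination continue once it is empty.
def next_lifecycle_action(running_task_count)
  running_task_count.positive? ? :record_heartbeat : :complete_continue
end

# Illustrative event; the IDs and token are made up.
sample_event = {
  "Records" => [
    { "Sns" => { "Message" => JSON.generate({
      "LifecycleTransition" => TERMINATING,
      "EC2InstanceId" => "i-0123456789abcdef0",
      "AutoScalingGroupName" => "blue-asg",
      "LifecycleHookName" => "ECSContainerInstanceTerminating",
      "LifecycleActionToken" => "example-token"
    }) } }
  ]
}
requests = drain_requests_from(sample_event)
puts requests.first[:instance_id]
puts next_lifecycle_action(4)
```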
AutoScaling Lifecycle Hook
resource "ECSContainerInstanceTerminating",
  Type: "AWS::AutoScaling::LifecycleHook",
  Properties: {
    AutoScalingGroupName: ref("ECSAutoScalingGroup"),
    # Fire the hook when the ASG begins terminating an instance
    LifecycleTransition: "autoscaling:EC2_INSTANCE_TERMINATING",
    # If the heartbeat times out, let termination proceed
    DefaultResult: "CONTINUE",
    RoleARN: ref("LifecycleHookRoleARN"),
    NotificationTargetARN: ref("ECSAutoscalingLifecycleTopic"),
    HeartbeatTimeout: 1800 # 30 minutes
  }
Avoid Relocating to Old Instance
• Problem:
• Tasks can get rescheduled onto another old instance in the ASG that is itself about to be replaced, so
tasks can get bumped from instance to instance until all instances are replaced
• Solution:
• Deploy services with a placement constraint
• The constraint means that a service won’t be placed on an instance that has an attribute named “state” with
value “pre-drain”
• At cluster replacement time, we stand up all the new instances, place the old instances into the “pre-drain”
state, and terminate the old instances in batches
• Relocated tasks will only be placed on new instances, avoiding the default “thundering herd”
scenario
Task Definition Placement Constraint
placement_constraints = [{
  type: 'memberOf',
  expression: 'attribute:state !exists or attribute:state != pre-drain'
}]
• “state” is a custom ECS Instance Attribute
• By default, the “state” attribute doesn’t exist on instances
• Set only to “pre-drain” during cluster update
• Prevents ECS scheduling tasks on instances that are about to be drained
• Removed from instance only in case of rollback of cluster update
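To make the constraint's effect concrete, here is a small Ruby model of how the expression filters instances. It only mimics the logic of `attribute:state !exists or attribute:state != pre-drain` on plain hashes. The `schedulable?` helper and the instance IDs are hypothetical, and ECS itself evaluates the real expression.

```ruby
# Models the placement constraint on plain hashes of instance attributes.
def schedulable?(attributes)
  !attributes.key?("state") || attributes["state"] != "pre-drain"
end

instances = {
  "i-old-1" => { "state" => "pre-drain" }, # marked during the cluster update
  "i-new-1" => {},                         # no "state" attribute (the default)
  "i-new-2" => { "ecs.instance-type" => "c4.large" }
}

eligible = instances.select { |_id, attrs| schedulable?(attrs) }.keys
p eligible # => ["i-new-1", "i-new-2"]
```

Because the attribute is absent by default, new instances need no extra setup to be eligible. Only the old instances are touched during an update.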
Instance Launch Considerations
• Worked with AWS Auto-Scaling team to enable more
appropriate Auto-Scaling “Launch Rate”
• Start the ECS agent with exponential backoff
• Throttle on container instance registration rate = 1 per
second/60 max per minute
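As a rough illustration of the last two bullets, the sketch below computes a capped exponential backoff schedule and the minimum time needed to register a fleet at one registration per second. The function names, the base delay, and the cap are assumptions. Production backoff usually adds random jitter as well, omitted here to keep the output deterministic.

```ruby
# Delays (in seconds) between ECS agent start attempts: base * 2^n, capped.
def backoff_delays(attempts:, base: 1.0, cap: 60.0)
  (0...attempts).map { |n| [base * (2**n), cap].min }
end

# Lower bound on time to register a fleet under a per-second registration limit.
def min_registration_seconds(instance_count, per_second: 1)
  (instance_count.to_f / per_second).ceil
end

p backoff_delays(attempts: 5)      # => [1.0, 2.0, 4.0, 8.0, 16.0]
p min_registration_seconds(230)    # => 230
```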
Gotchas
• Found a number of apps don't relocate easily
• Slows down prod upgrades and old stack decommissions
Related
AWS re:Invent 2017: Going Big with Containers: Customer
Case Studies of Large-Scale (ENT209)
• https://www.youtube.com/watch?v=L3l_ZiYRrks
• Covers full scope of Expedia’s ECS deployment
automation platform

 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 

Recently uploaded (20)

Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 

Automating Zero-Downtime Production Cluster Upgrades for Amazon ECS

  • 2. Matt Callanan Engineering Manager/Tech Lead “Cloud Acceleration Team” Expedia Brisbane, Australia • mcallanan@expedia.com • linkedin.com/in/matthewcallanan • @mcallana
  • 3. Cluster Management • How to upgrade the EC2 infrastructure underlying a live production Amazon ECS cluster without affecting service availability. • How to safely relocate hundreds of tasks onto new Amazon EC2 instances with zero-downtime to applications.
  • 4. Expedia ECS Cluster Statistics (diagram labels: Region 1 to Region 5) • 2,600 ECS Services (1,100 Applications) • 13,000 Containers • 860 EC2 Instances (13 ECS Clusters)
  • 5. Expedia ECS Cluster Topology (diagram labels: Region 1 to Region 5; Production Cluster; Test Environment Cluster) • 480 Services • 230 Instances
  • 6. Production Cluster Visualization 230 Instances 480 Services 3,200 Containers c3vis Open Source: https://github.com/ExpediaDotCom/c3vis
  • 7. ECS Cluster Creation (diagram labels: CloudFormation Stack, Auto Scaling Group, EC2 Instances, Amazon ECS Cluster)
  • 8. Immutable Servers • Build steps: Amazon-provided base AMI, standard Chef cookbook, custom setup baked into AMI • Golden AMI contents: ecs-optimized AMI, Expedia standard image, Docker config, daemon containers, docker, ecs-agent
  • 9. Immutable Servers • Golden AMI contents: ecs-optimized AMI, Expedia standard image, Docker config, daemon containers, docker, ecs-agent • Cluster Instance: same contents as the Golden AMI, plus custom bootstrap: ECS cluster config; start ECS agent and Docker; cron: restart ECS agent; cron: custom metrics
  • 11. First Approach • CloudFormation Auto-Scaling Rolling Update
  • 12. Default CloudFormation Auto-Scaling Rolling Update (animated diagram, slides 12 to 32; legend: Old Instance, New Instance, Terminating Instance, Active Task, Relocated Task, Stopped Task; the diagram follows the tasks of a service “A”)
  • 14. Problem: Tasks stop before they are relocated. Even with tasks on distinct instances, an outage can happen if new instances are not pulling images fast enough.
  • 16. Problem: Tasks start on instances that are about to be terminated.
  • 27. Problem: Service “A” experiences downtime.
  • 32. Problem: Tasks bunch up on the first few new instances.
  • 33. First Approach: Problem Summary • Asynchrony between CloudFormation and Auto Scaling • Rolling Update not granular enough • Outside our control • Rollbacks with no ability for manual intervention • Tasks stop before they are relocated • Service outage can happen if new instances are not pulling images fast enough • Tasks are not evenly spread across new instances
  • 34. Second Approach • Custom Rolling Update automation • Programmatic update script using the Ruby SDK • But it still took 10 (nail-biting) hours to update a cluster with 100 instances • Any issues during the process required manual intervention
  • 35. Third Approach • “PRISM” • Project Replaced in Sixty Minutes
  • 36. “PRISM” Goals • Safety: zero downtime for applications as their workloads get relocated onto new instances • Speed: complete as fast as possible • Rollbackable: quickly retreat to a known-good state if anything goes wrong • Idempotent: resumable if anything goes wrong • Avoid “thundering herd” scenario: drain in batches to prevent burden on the Docker registry and network; avoid having tasks relocated to instances that are about to be drained
  • 37. “PRISM” Phases Phase 1: Expand Phase 2: Relocate Tasks Phase 3: Clean Up
  • 38. Zero-Downtime Cluster Updates (diagram, slides 38 to 42; labels: Amazon ECS Cluster, CloudFormation Stack, Auto Scaling group, EC2 Instances)
  • 39. Zero-Downtime Cluster Updates • Phase 1: Expand Cluster (diagram shows two CloudFormation stacks, each with an Auto Scaling group and EC2 instances, in the same Amazon ECS cluster)
  • 40. Zero-Downtime Cluster Updates • Phase 2: Relocate Tasks (diagram label: Draining…)
  • 41. Zero-Downtime Cluster Updates • Phase 3: Clean Up (diagram still shows both stacks)
  • 42. Zero-Downtime Cluster Updates • Phase 3: Clean Up (diagram shows a single CloudFormation stack and Auto Scaling group)
  • 43. Cluster Update – Phase 1: Expand (animated diagram, slides 43 to 57; legend: Old Instance, New Instance, Draining Instance, Active Task, Relocated Task, “DRAINING” Task) • Starting point: Blue Auto-Scaling Group in the Blue CFN Stack
  • 44. Cluster Update – Phase 1: Expand • Disable Auto-Scaling processes on the Blue Auto-Scaling Group
  • 45. Cluster Update – Phase 1: Expand • Green CFN Stack with Green Auto-Scaling Group added • Auto-Scaling processes disabled on both groups
  • 46. Cluster Update – Phase 1: Expand • state = ‘pre-drain’
  • 47. Cluster Update – Phase 2: Relocate Tasks (batches of 3 instances) (slides 47 to 55; Auto-Scaling processes remain disabled on both groups, state = ‘pre-drain’)
  • 56. Cluster Update – Phase 3: Clean Up • Only the Green CFN Stack and Green Auto-Scaling Group remain, with Auto-Scaling processes still disabled
  • 57. Cluster Update – Phase 3: Clean Up • Resume Auto-Scaling processes on the Green Auto-Scaling Group
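The three phases above can be sketched as a small driver. This is a minimal illustration, not the PRISM implementation: the `ops` object and every method name on it (`suspend_scaling`, `create_green`, `mark_pre_drain`, `drain`, `wait_until_drained`, `delete_blue`, `resume_scaling`) are hypothetical stand-ins for the underlying AWS calls. Only the order of steps and the batch size of 3 come from the slides.

```ruby
# Hypothetical driver for the Expand / Relocate Tasks / Clean Up phases.
# `ops` is any object responding to the methods used below; none of these
# method names come from the talk or from an AWS SDK.
BATCH_SIZE = 3 # "batches of 3 instances" per the slides

def run_cluster_update(ops, blue_instances, batch_size: BATCH_SIZE)
  # Phase 1: Expand
  ops.suspend_scaling(:blue)          # disable Auto-Scaling processes on blue
  ops.create_green                    # new stack and ASG, processes disabled
  ops.mark_pre_drain(blue_instances)  # state = 'pre-drain' on old instances

  # Phase 2: Relocate tasks, one batch of old instances at a time
  blue_instances.each_slice(batch_size) do |batch|
    ops.drain(batch)
    ops.wait_until_drained(batch)
  end

  # Phase 3: Clean up
  ops.delete_blue
  ops.resume_scaling(:green)
end
```

Keeping the AWS calls behind one object makes each step easy to re-run, which fits the stated goal of being resumable if anything goes wrong.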
  • 59. Pre-flight Checks • Run before cluster create/update • Check the number of instances available of the target instance type • Check that enough IP addresses are available in the target subnets • Check that enough EBS volume space is available for the target volume type
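The pre-flight checks can be reduced to comparing what the update needs against what is available. The sketch below assumes the availability figures have already been gathered elsewhere; the hash keys and all numbers are made up for illustration and do not come from the talk.

```ruby
# Pure-Ruby sketch of the pre-flight comparison. The `needed` / `available`
# hashes and their keys are hypothetical; in practice the availability
# numbers would have to be looked up from AWS before the update starts.
def preflight_failures(needed, available)
  needed.each_with_object([]) do |(resource, amount), failures|
    have = available.fetch(resource, 0)
    failures << "#{resource}: need #{amount}, have #{have}" if have < amount
  end
end

# Illustrative numbers only.
needed    = { instances: 100, subnet_ips: 100, ebs_gib: 5_000 }
available = { instances: 120, subnet_ips: 80,  ebs_gib: 9_000 }
puts preflight_failures(needed, available)
```

An empty result means the create or update can proceed; any entry should abort the run before new instances are requested.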
  • 60. Drain on Scale-down with Lifecycle Lambda • Lambda triggered by Auto Scaling EC2_INSTANCE_TERMINATING lifecycle events delivered via SNS • Updates the container instance state to “DRAINING” • The ASG lifecycle hook has a 30-minute heartbeat timeout, which keeps the instance in the Terminating:Wait state for up to 30 minutes • Allows ECS to safely relocate any tasks that are part of a service to another instance
  • 61. AutoScaling Lifecycle Hook
      resource "ECSContainerInstanceTerminating",
        Type: "AWS::AutoScaling::LifecycleHook",
        Properties: {
          AutoScalingGroupName: ref("ECSAutoScalingGroup"),
          LifecycleTransition: "autoscaling:EC2_INSTANCE_TERMINATING",
          DefaultResult: "CONTINUE",
          RoleARN: ref("LifecycleHookRoleARN"),
          NotificationTargetARN: ref("ECSAutoscalingLifecycleTopic"),
          HeartbeatTimeout: 1800 # 30 minutes
        }
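The Lambda's core logic from slide 60 might look roughly like the following. This is a sketch under assumptions, not Expedia's code: the `ecs` argument is assumed to behave like the ECS client in the AWS SDK for Ruby (`list_container_instances` and `update_container_instances_state`), it is injected so the logic can be exercised without AWS access, and the function name, return symbols, and cluster argument are hypothetical. Heartbeats and completing the lifecycle action are left out.

```ruby
require 'json'

# Sketch only. `ecs` is assumed to expose the same methods as the ECS client
# in the AWS SDK for Ruby; verify the calls against the SDK version in use.
def drain_on_terminate(event, ecs:, cluster:)
  # The lifecycle notification arrives as a JSON string inside the SNS record.
  message = JSON.parse(event['Records'][0]['Sns']['Message'])
  unless message['LifecycleTransition'] == 'autoscaling:EC2_INSTANCE_TERMINATING'
    return :ignored
  end

  # Map the EC2 instance to its ECS container instance
  # (the filter uses the ECS cluster query language).
  arns = ecs.list_container_instances(
    cluster: cluster,
    filter: "ec2InstanceId == #{message['EC2InstanceId']}"
  ).container_instance_arns
  return :not_in_cluster if arns.empty?

  # DRAINING tells ECS to stop placing tasks here and relocate service tasks.
  ecs.update_container_instances_state(
    cluster: cluster, container_instances: arns, status: 'DRAINING'
  )
  :draining
end
```

Because the lifecycle hook's `DefaultResult` is `CONTINUE`, termination proceeds once the heartbeat timeout expires even if this function never completes the lifecycle action explicitly.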
  • 62. Avoid Relocating to Old Instance • Problem: • Tasks can get rescheduled to another old instance in the ASG that is about to be replaced, so tasks can get bumped from instance to instance until all instances are replaced • Solution: • Deploy services with a placement constraint (shown below) • This means that a service won’t be placed on an instance that has an attribute named “state” with value “pre-drain” • At cluster replacement time, we stand up all new instances, place the old instances into the “pre-drain” state, and terminate the old instances in batches • Relocated tasks will only be placed on new instances, avoiding the default “thundering herd” scenario
      placement_constraints = [
        {
          type: 'memberOf',
          expression: 'attribute:state !exists or attribute:state != pre-drain'
        }
      ]
  • 64. Task Definition Placement Constraint • “state” is a custom ECS Instance Attribute • By default, the “state” attribute doesn’t exist on instances • Set only to “pre-drain” during cluster update • Prevents ECS scheduling tasks on instances that are about to be drained • Removed from instance only in case of rollback of cluster update
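To make the constraint's effect concrete, the sketch below restates the expression as a plain Ruby predicate and builds the attribute entries that would mark old instances as “pre-drain”. The function names are hypothetical. The hash keys in `pre_drain_attributes` are assumed to match what `put_attributes` in the AWS SDK for Ruby accepts, so check them against the SDK version in use.

```ruby
PRE_DRAIN = 'pre-drain'

# Mirrors 'attribute:state !exists or attribute:state != pre-drain':
# an instance is eligible when it has no "state" attribute, or when the
# attribute holds any value other than "pre-drain".
def eligible_for_placement?(attributes)
  !attributes.key?('state') || attributes['state'] != PRE_DRAIN
end

# Builds one attribute entry per old container instance. Key names are an
# assumption about the put_attributes request shape, not taken from the talk.
def pre_drain_attributes(container_instance_arns)
  container_instance_arns.map do |arn|
    { name: 'state', value: PRE_DRAIN,
      target_type: 'container-instance', target_id: arn }
  end
end

puts eligible_for_placement?({})                       # new instance
puts eligible_for_placement?({ 'state' => PRE_DRAIN }) # old instance
```

Because the attribute is absent by default, new instances need no extra setup; only the old instances are touched during an update.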
  • 65. Instance Launch Considerations • Worked with AWS Auto-Scaling team to enable more appropriate Auto-Scaling “Launch Rate” • Start the ECS agent with exponential backoff • Throttle on container instance registration rate = 1 per second/60 max per minute
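A generic exponential backoff wrapper of the kind that could guard the ECS agent start is sketched below. The parameter defaults, the added random jitter, and the function name are illustrative assumptions; the slides only say that the agent is started with exponential backoff and that registration is throttled.

```ruby
# Retries the block with exponential backoff plus random jitter.
# `sleeper` and `rng` are injectable so the sketch can be tested without
# actually waiting. All defaults are illustrative.
def with_backoff(max_attempts: 5, base: 1.0, cap: 60.0,
                 sleeper: ->(seconds) { sleep(seconds) }, rng: Random.new)
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue StandardError
    raise if attempt >= max_attempts # give up and surface the last error

    # Delay ceiling doubles each attempt: base, 2*base, 4*base, ... up to cap.
    ceiling = [cap, base * (2**(attempt - 1))].min
    sleeper.call(rng.rand * ceiling) # jitter spreads out simultaneous retries
    retry
  end
end
```

Usage would wrap whatever command starts the agent, for example `with_backoff { start_agent }`, where `start_agent` is a hypothetical method that raises on failure. Spreading retries out this way helps many instances launching at once stay under a registration throttle.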
  • 66. Gotchas • Found a number of apps don't relocate easily • Slows down prod upgrades and old stack decommissions
  • 67. Related • AWS re:Invent 2017: Going Big with Containers: Customer Case Studies of Large-Scale (ENT209) • https://www.youtube.com/watch?v=L3l_ZiYRrks • Covers full scope of Expedia’s ECS deployment automation platform