Matt Callanan from Expedia Australia shares their experience running hundreds of production microservices with Docker on Amazon's EC2 Container Service (ECS) and attempts to answer the question "How can we manage clusters as cattle instead of pets?"
Video: https://www.youtube.com/watch?v=XvfrLIujQsc
Meetup: www.meetup.com/Devops-Brisbane/events/231760662/
2. Table of Contents
• How Do We Bootstrap Instances?
• Rolling Update with AutoScaling Group
• How Do We Update Cluster Instances?
• How Do We Detect & Remediate Broken Instances?
• How Do We Analyse Cluster-Wide Issues?
• How Do We Auto-Scale?
• Lessons Learned
• Future Work
5. How Do We Bootstrap Instances?
• Based on Amazon's ECS Optimized AMI
  • e.g. "amzn-ami-2016.03.b-amazon-ecs-optimized"
• CloudFormation userdata runs at launch time to set up:
  • Networking
  • Security
  • Log forwarding
  • Cron job: push EC2 statistics and custom metrics
• Run 'cadvisor' and 'docker-cleanup' as ECS Tasks on each instance (using 'start-task')
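A minimal sketch of that userdata flow. The cluster name, container-instance ARN, and script path are illustrative assumptions, and `run` echoes each command instead of executing it so the sequence reads as a dry run:

```shell
#!/bin/bash
# Illustrative userdata sketch -- names are assumptions, not from the deck.
# run() echoes rather than executes, so this is safe to read/try anywhere.
run() { echo "+ $*"; }

CLUSTER="prod-ecs"
# Normally discovered on-instance via the ECS agent introspection API
CONTAINER_INSTANCE_ARN="arn:aws:ecs:us-east-1:123456789012:container-instance/abc"

# Point the ECS agent at the cluster
run bash -c "echo ECS_CLUSTER=$CLUSTER >> /etc/ecs/ecs.config"

# Cron job that pushes EC2 statistics and custom metrics each minute
run bash -c "echo '* * * * * root /usr/local/bin/push_metrics.sh' > /etc/cron.d/push-metrics"

# Pin per-instance daemons onto THIS instance with start-task
for task in cadvisor docker-cleanup; do
  run aws ecs start-task --cluster "$CLUSTER" \
    --task-definition "$task" \
    --container-instances "$CONTAINER_INSTANCE_ARN"
done
```

`start-task` (rather than a service) is what lets each instance run exactly one copy of these housekeeping tasks.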
29. Zero-Downtime Instance Replacement
• Uses a Lambda to avoid outages in production during a cluster instance rolling update
• Lambda is triggered by AutoScaling EC2_INSTANCE_TERMINATE SNS events
• Lambda deregisters the instance from the ECS cluster
• Lambda also sends a heartbeat to the ASG to keep the instance in the Terminating:Wait state for 30 mins
  • This is generally enough time for ECS to reschedule any tasks that are part of a service onto another instance
• Downsides:
  • Tasks can get rescheduled onto another old instance in the ASG that is itself about to be replaced, so tasks can get bumped from instance to instance until all instances are replaced
  • 30 mins is a long time for old containers to remain registered in the services' ELBs. Any deploys during that window can cause confusion about why old and new versions of a service are running behind the ELB
  • The ECS agent pulls Docker images serially, so it can take a while to launch a batch of new tasks
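The Lambda's two responsibilities can be sketched as below. The cluster name is a placeholder, and the ECS/AutoScaling clients are passed in rather than created with boto3 purely so the sketch stays self-contained; in a real Lambda they would be module-level boto3 clients and the handler would take `(event, context)`:

```python
"""Sketch of the zero-downtime replacement Lambda described above."""
import json

CLUSTER_NAME = "prod-ecs"  # hypothetical cluster name


def container_instance_arn(ecs, instance_id):
    """Find the ECS container-instance ARN backing an EC2 instance."""
    arns = ecs.list_container_instances(cluster=CLUSTER_NAME)["containerInstanceArns"]
    described = ecs.describe_container_instances(
        cluster=CLUSTER_NAME, containerInstances=arns)
    for ci in described["containerInstances"]:
        if ci["ec2InstanceId"] == instance_id:
            return ci["containerInstanceArn"]
    return None


def handler(event, ecs, autoscaling):
    """SNS-triggered: deregister the terminating instance from ECS, then
    heartbeat the ASG so it stays in Terminating:Wait while tasks move."""
    msg = json.loads(event["Records"][0]["Sns"]["Message"])
    instance_id = msg["EC2InstanceId"]

    arn = container_instance_arn(ecs, instance_id)
    if arn:
        # force=True deregisters even with tasks still running; ECS then
        # reschedules service tasks onto the remaining instances
        ecs.deregister_container_instance(
            cluster=CLUSTER_NAME, containerInstance=arn, force=True)

    # Keep the instance in Terminating:Wait so in-flight work drains
    autoscaling.record_lifecycle_action_heartbeat(
        LifecycleHookName=msg["LifecycleHookName"],
        AutoScalingGroupName=msg["AutoScalingGroupName"],
        InstanceId=instance_id,
    )
```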
52. How Do We Detect & Remediate Broken Instances?
• Custom CloudWatch metrics
  • How long does "docker images" take? Alarm if longer than 4 seconds for 5 mins
  • How long does "docker ps" take? Alarm if longer than 4 seconds for 5 mins
  • Is the ECS agent running? Alarm if not running for 5 mins
• Manual remediation based on email alert
  • Run the "evict_instance" script
  • Terminates the instance via the ASG, which allows the Lambda to deregister it and pause termination:
  • aws autoscaling terminate-instance-in-auto-scaling-group --region $REGION --instance-id $INSTANCE_ID --no-should-decrement-desired-capacity
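One of those health metrics could be collected along these lines. The namespace and metric name are assumptions, and the CloudWatch push is stubbed with `put_metric` so the sketch runs anywhere; on a real instance it would call the AWS CLI directly:

```shell
#!/bin/bash
# Sketch: time `docker ps` and report the duration as a custom metric.
put_metric() { echo "+ $*"; }   # stub; replace with the real aws CLI call

start=$(date +%s%N)
docker ps >/dev/null 2>&1 || true   # on a broken instance this hangs or fails
end=$(date +%s%N)
duration_ms=$(( (end - start) / 1000000 ))

put_metric aws cloudwatch put-metric-data \
  --namespace "ECS/InstanceHealth" \
  --metric-name DockerPsLatencyMs \
  --unit Milliseconds \
  --value "$duration_ms"
```

The CloudWatch alarm ("longer than 4 seconds for 5 mins") is then configured against this metric, separately from the collection script.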
54. How Do We Analyse Cluster-Wide Issues?
• Centralised logging
• Forward instance logs to Splunk:
  • /var/log/cfn-*
  • /var/log/ecs*
• Query with timechart
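A hypothetical Splunk search of that shape (the source filter and field names are assumptions, not from the deck):

```spl
source="/var/log/ecs*" "error"
| timechart span=5m count by host
```

Charting error counts per host over time makes it easy to tell a single broken instance apart from a cluster-wide issue.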
56. How Do We Auto-Scale?
• Scale up:
  • CPU Reservation across entire cluster > 70% for 5 mins, or
  • Memory Reservation across entire cluster > 60% for 5 mins
• Scale down:
  • CPU Reservation < 20% for 5 mins, or
  • Memory Reservation < 40% for 5 mins
59. Lesson #1: Use Immutable Servers with CloudFormation
• cfn-update is dangerous if you don't know what you're doing
• Problem:
  • Rolled out a change that configures an extra Docker EBS volume on new instances
  • cfn-update ran simultaneously on all old instances
  • It simultaneously restarted Docker and deleted /var/lib/docker on all old instances, causing a 5-minute prod outage
• Solution:
  • Removed cfn-update from userdata
  • Rename the launch configuration every time to force CFN's ASG Rolling Update, even for minor config changes
  • Changed our mentality by renaming our "update" command to "replace_instances"
61. Lesson #2: Suspend ASG Processes During CFN Rolling Update
• CFN and ASG are independent services
• Problem:
  • Changed the ASG from 1 to 2 subnets as part of a CFN update
  • The ASG instantly tries to launch n/2 instances in the new subnet
  • Meanwhile CFN is waiting for 1 signal at a time, times out, and rolls back
• Solution:
  • Suspend processes with a CFN Update Policy:
    • 'AlarmNotification'
    • 'HealthCheck'
    • 'ReplaceUnhealthy'
    • 'AZRebalance'
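In CloudFormation template terms, the fix looks roughly like this fragment (the resource name, `MinInstancesInService`, and `PauseTime` values are assumptions; the suspended processes are the four listed above):

```yaml
ClusterAutoScalingGroup:
  Type: AWS::AutoScaling::AutoScalingGroup
  UpdatePolicy:
    AutoScalingRollingUpdate:
      MinInstancesInService: 1
      WaitOnResourceSignals: true
      PauseTime: PT15M
      SuspendProcesses:
        - AlarmNotification
        - HealthCheck
        - ReplaceUnhealthy
        - AZRebalance
```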
64. Lesson #3: Don't Use CloudFormation for Rolling Updates
• CFN's interaction with ASG is too unreliable
• Problem:
  • CFN timed out after not receiving a signal from an instance created by the ASG
  • AWS support explained there was a 3-hour issue with the Auto Scaling service that caused CloudFormation to experience increased latency when creating, updating, and deleting stacks in us-east-1
• Solution:
  • Replace the CFN rolling update with programmatic logic
  • Include health checks
  • Include deregistration logic
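The programmatic replacement can be sketched as a loop. `terminate` and `cluster_is_healthy` stand in for the real ASG and ECS/ELB calls (deregistration is still handled by the Lambda described earlier); all names here are illustrative:

```python
import time


def rolling_replace(instance_ids, terminate, cluster_is_healthy,
                    sleep=time.sleep, poll_seconds=30):
    """Replace cluster instances one at a time, gating each step on health.

    terminate(id) would wrap `aws autoscaling
    terminate-instance-in-auto-scaling-group
    --no-should-decrement-desired-capacity`, so the ASG launches a
    replacement for each instance it loses.
    """
    for instance_id in instance_ids:
        terminate(instance_id)
        # Health checks: replacement joined the ECS cluster, service tasks
        # were rescheduled, and ELB targets are back InService.
        while not cluster_is_healthy():
            sleep(poll_seconds)
```

Because each termination waits for the cluster to recover before the next one starts, a bad AMI or a stuck ECS agent halts the rollout instead of cascading.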
66. Lesson #4: Scale Down Carefully
• Problem:
  1. ASG scales up due to high Memory Reservation
  2. 5 mins later the ASG scales down due to low CPU Reservation
  3. Repeat from #1
• Solution:
  • Fix the scaling dimensions:
    • Scale up when either CPU or Memory Reservation is high
    • Scale down only when both are low
  • Tightly control CPU/memory reservations per service
  • Match the equal ratios of the instance type's resources
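The corrected decision logic fits in a few lines (thresholds are the ones from the auto-scaling slide; the function itself is a hypothetical sketch, not the team's code):

```python
def scaling_action(cpu_reservation_pct, mem_reservation_pct):
    """Scale up when EITHER dimension is high; scale down only when BOTH
    are low -- the asymmetry that stops the up/down flapping above."""
    if cpu_reservation_pct > 70 or mem_reservation_pct > 60:
        return "scale_up"
    if cpu_reservation_pct < 20 and mem_reservation_pct < 40:
        return "scale_down"
    return "steady"
```

Note how the flapping scenario is resolved: high memory with low CPU now scales up, and staying there afterwards reads as "steady" rather than triggering a scale-down on the CPU dimension alone.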
69. Future Work: "Workload Profiles"
• Predictable resource reservation
• Workload Profiles
  • Opinionated resource sizings based on the equal CPU/Memory ratio of the instance type's resources
  • App owners cannot specify CPU/memory; they can only choose from preset profiles
• Downsides:
  • Ties the cluster to an instance type family
• Example, for the "m4" family:

  Profile  | CPU (Cores) | Memory (GiB)
  ---------|-------------|-------------
  Tiny     | 0.25        | 1
  Small    | 0.5         | 2
  Medium   | 1           | 4
  Large    | 2           | 8
  X.Large  | 4           | 16
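A hypothetical encoding of that table. The invariant being enforced is the deck's point: every profile preserves the m4 family's 1 core : 4 GiB ratio, so any mix of profiles packs evenly onto m4 instances:

```python
M4_GIB_PER_CORE = 4  # m4 instances provide 4 GiB of memory per vCPU

PROFILES = {
    "tiny":    {"cpu_cores": 0.25, "memory_gib": 1},
    "small":   {"cpu_cores": 0.5,  "memory_gib": 2},
    "medium":  {"cpu_cores": 1,    "memory_gib": 4},
    "large":   {"cpu_cores": 2,    "memory_gib": 8},
    "x.large": {"cpu_cores": 4,    "memory_gib": 16},
}

# Every preset keeps the family ratio, which is what makes the packing predictable
for name, p in PROFILES.items():
    assert p["memory_gib"] / p["cpu_cores"] == M4_GIB_PER_CORE, name
```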
70. Future Work: Treat Clusters as Cattle
• Automate all manual aspects of cluster updates
• Build confidence in our automated checks:
  • Are there enough IP addresses in the target subnets?
  • Is there enough EBS volume space for N instances?
  • Are there enough instances of the desired instance type available?
• Packer for building AMIs
• Jenkins Pipeline for rolling out with confidence
71. Q & A
Thanks! Any questions?
Matt Callanan
mcallanan@expedia.com
linkedin.com/in/matthewcallanan
@mcallana