Gilles will be sharing some of the issues his team faced first hand, how they resolved them, and what they are looking to do differently in the future
* Presented at the Sydney AWS Meetup session, 6th July 2016
http://www.meetup.com/AWS-Sydney/
Hosted and organised by PolarSeven - http://polarseven.com
1. AWS Outage / Availability Zone failure in the Sydney region - 5th June 2016 -
Author: Gilles Baillet
* Disclaimer: The opinions expressed in this presentation are the author's own and do not reflect the view of his employer
2. Who am I?
Gilles Baillet
Cloud Centre of Excellence Manager
Leading a team of 5 DevOps engineers on the Ops (dark) side of DevOps
AWS Certified SysOps Associate and Solutions Architect Associate
Fan of DevOps, AWS, Data Pipeline and now Lambda
Food, drinks and travel addict, and almost married!
You can meet me at several meetups around Sydney (AWS, Docker, Elastic)
You can connect with me on LinkedIn: https://au.linkedin.com/in/gillesbaillet. I accept connections from (almost) everyone!
3. Before we start
Availability Zone Alignment
AWS randomises the assignment of AZ names (e.g. ap-southeast-2a) across AWS accounts
Our AZs are “aligned” across all our production and non-production accounts (see the sketch below)
Tip: Talk to your TAM!
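A minimal sketch of how that alignment could be checked, assuming boto3 and two hypothetical CLI profiles ("prod" and "nonprod"); the ZoneId field, exposed by later versions of the EC2 API, identifies the physical zone behind each account-local zone name (at the time of the talk, asking your TAM was the way to confirm the mapping):

```python
# Sketch only: compare AZ name -> physical zone mappings across two accounts.
# "prod" / "nonprod" are hypothetical AWS CLI profile names.
import boto3

PROFILES = ["prod", "nonprod"]

for profile in PROFILES:
    session = boto3.Session(profile_name=profile, region_name="ap-southeast-2")
    ec2 = session.client("ec2")
    for az in ec2.describe_availability_zones()["AvailabilityZones"]:
        # ZoneId (e.g. apse2-az1) is the physical zone; ZoneName is randomised per account.
        print(profile, az["ZoneName"], az.get("ZoneId", "n/a"))
```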
4. The chain of events as presented by AWS
• At 3:25PM AEST: loss of power at a regional substation
• At 4:46PM AEST: power restored
• At 6:00PM AEST: over 80% of impacted services back online
• At 1:00AM AEST: nearly all instances recovered
• TOTAL DURATION: 1h21 (to power restored) / 9h35 (to nearly all instances recovered)
http://aws.amazon.com/message/4372T8/
5. The chain of events as experienced by my company
• At 3:25PM AEST: trigger of monitoring/alerting services
• At 3:30PM AEST: conference bridge opened
• At 5:30PM AEST: most services were restored
• At 3:00AM AEST: all production services were restored
• TOTAL DURATION: 2h05 (to most services restored) / 11h35 (to all production services restored)
6. Black Swan
“An event that comes as a surprise, has a major effect, and is often inappropriately rationalized after the fact
with the benefit of hindsight. The term is based on an ancient saying which presumed black swans did not
exist, but the saying was rewritten after black swans were discovered in the wild”
https://en.wikipedia.org/wiki/Black_swan_theory
Taleb, N. N. (2007). The Black Swan: The Impact of the Highly Improbable. Random House.
7. Impact during the outage
• all services running in the impacted AZ (see the sketch after this list)
• some Auto Scaling Group processes
• a NIC failure at 3:26PM: instance restarted, no ELB health checks, healthy instance marked as unhealthy
• EC2 Console / EC2 CLI commands
• Some CloudWatch metrics
• Some services relying on a single instance (e.g. a domain controller)
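As a rough way to gauge the blast radius described above, a minimal sketch (not from the talk; the zone name is a placeholder for whichever AZ is impacted in your account's mapping) that lists every instance running in a single AZ:

```python
# Sketch only: list every instance in one AZ to gauge the blast radius of a zone failure.
# "ap-southeast-2a" is a placeholder, not a statement about which zone was impacted.
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-2")
pages = ec2.get_paginator("describe_instances").paginate(
    Filters=[{"Name": "availability-zone", "Values": ["ap-southeast-2a"]}]
)
for page in pages:
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            print(instance["InstanceId"], instance["State"]["Name"])
```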
8. Impact after the outage
• DB repair / integrity check
• Restoration of data stored on ephemeral storage
• 24 hours fixing instances in lower environments (DEV, UAT etc.)
• Clean up of rogue instances
9. Some things did work
• ELB Health checks (see the sketch after this list)
• RDS Database failover
• Some Auto Scaling Group processes
• AWS support escalation
• All critical services running on Cloud 2.0!
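The ELB health checks that kept working are driven by the timing parameters mentioned again in the notes further down (a 5-second timeout); a minimal sketch of setting them explicitly on a classic ELB, assuming a hypothetical load balancer name and health-check path:

```python
# Sketch only: set a classic ELB health check explicitly.
# Load balancer name and health-check target are hypothetical.
import boto3

elb = boto3.client("elb", region_name="ap-southeast-2")
elb.configure_health_check(
    LoadBalancerName="my-web-elb",
    HealthCheck={
        "Target": "HTTP:80/healthcheck",  # hypothetical health-check path
        "Interval": 10,                   # seconds between checks
        "Timeout": 5,                     # the 5-second timeout referenced in the notes
        "UnhealthyThreshold": 2,
        "HealthyThreshold": 2,
    },
)
```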
10. Lessons learned
• Implementation vs design
• Instance type matters
• AWS Enterprise support is worth the cost
• Cattle are awesome
• Datacenters in Sydney are not weatherproof
• 100s of companies impacted
11. What’s next?
• Review of design documents vs implementation
• Use older instance types
• Use Chaos Monkey (see the sketch after this list)
• Turn Pets into Cattle (more work for my team!)
• Deploy new VPCs across 3 AZs
• Revisit DNS client TTL versus Health Check timeout
• AWS to fix “things” on their end
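For the Chaos Monkey item above, a minimal sketch of the idea only, deliberately not the Netflix tool itself: terminate a random instance in an Auto Scaling group (the group name is hypothetical) and check that the group self-heals:

```python
# Sketch only, in the spirit of Chaos Monkey (not the Netflix tool).
# Terminates one random instance in a hypothetical Auto Scaling group.
import random
import boto3

ASG_NAME = "web-tier-asg"  # hypothetical group name

autoscaling = boto3.client("autoscaling", region_name="ap-southeast-2")
ec2 = boto3.client("ec2", region_name="ap-southeast-2")

groups = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=[ASG_NAME]
)["AutoScalingGroups"]

if groups and groups[0]["Instances"]:
    victim = random.choice(groups[0]["Instances"])["InstanceId"]
    print(f"Terminating {victim} to check that the group self-heals")
    ec2.terminate_instances(InstanceIds=[victim])
```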
An impactful event, but a fantastic opportunity to prove or disprove some design decisions, assess the impact of such events on our services, make them more resilient, and keep our customers happy.
Cloud 2.0
Cattle
French (for those…)
So good food and good drinks are always on the table
Last but not least, soon a married man
Poll
Who is familiar with the difference between Pets and Cattle?
Who is running pets?
Who is running cattle?
Who has been impacted?
Access to both primary and secondary power was lost as a result of a failure to transfer the load to generators.
by forcing passive services in ap-southeast-2b to become active
by temporarily removing some of our dependencies
by activating/implementing kill switches
A complete datacentre failure = event that most people think can’t happen
1700 – All swans are white
Until the discovery of Black Swans in WA
Linux – DNS timeout of 5 sec
/etc/resolv.conf was showing the unavailable DNS server first in the list
ELB health check timeout = 5 sec
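A minimal sketch of the timing problem described in these notes (an illustration only; the nameserver addresses are hypothetical): with the glibc defaults of timeout:5 / attempts:2, a dead nameserver listed first in /etc/resolv.conf can stall lookups for longer than a 5-second ELB health-check timeout, while tightened resolver options keep the stall short:

```python
# Sketch only: compare the resolver's worst-case stall on a dead first nameserver
# with the ELB health-check timeout. Addresses below are hypothetical.
import re

ELB_HEALTH_CHECK_TIMEOUT = 5  # seconds, as noted above


def resolver_worst_case(resolv_conf_text):
    """Rough upper bound (seconds) spent waiting on a dead first nameserver."""
    timeout, attempts = 5, 2  # glibc defaults
    for line in resolv_conf_text.splitlines():
        match = re.search(r"timeout:(\d+)", line)
        if match:
            timeout = int(match.group(1))
        match = re.search(r"attempts:(\d+)", line)
        if match:
            attempts = int(match.group(1))
    return timeout * attempts


sample = """\
nameserver 10.0.0.2  # listed first but unreachable during the outage
nameserver 10.0.1.2
options timeout:1 attempts:2 rotate
"""

print(f"Worst-case DNS stall ~{resolver_worst_case(sample)}s "
      f"vs ELB health-check timeout {ELB_HEALTH_CHECK_TIMEOUT}s")
```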
The API failure left us blind to what was happening on the infrastructure
DNS was failing, so some services failed as a result of not being able to reach their RDS database
That correlates with AWS's description of the behaviour of their infrastructure:
When the APIs initially recovered, our systems were delayed in propagating some state changes and making them available via describe API calls. This meant that some customers could not see their newly launched resources, and some existing instances appeared as stuck in pending or shutting down when customers tried to make changes to their infrastructure in the affected Availability Zone. These state delays also increased latency of adding new instances to existing Elastic Load Balancing (ELB) load balancers.
Re-designing the VPC structure across 3 availability zones is a challenge in itself, as the current subnets use the whole IP range of the VPC (see the sketch below)
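A minimal re-planning sketch for the 3-AZ layout (example CIDRs only, not our actual address plan), using Python's ipaddress module to carve the VPC range into per-AZ subnets while leaving spare blocks:

```python
# Sketch only: example CIDRs for splitting a VPC range across three AZs
# instead of letting subnets consume the whole /16.
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")  # hypothetical VPC CIDR
azs = ["ap-southeast-2a", "ap-southeast-2b", "ap-southeast-2c"]

# Carve /20 blocks out of the /16: one public and one private subnet per AZ,
# with the remaining blocks kept free for future growth.
subnets = list(vpc.subnets(new_prefix=20))
for i, az in enumerate(azs):
    public, private = subnets[2 * i], subnets[2 * i + 1]
    print(f"{az}: public {public}, private {private}")
```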