AWS Outage / Availability Zone failure in Sydney region
- 5th June 2016 -
Author: Gilles Baillet
* Disclaimer: The opinions expressed in this presentation are the author's own and do not reflect the view of his employer
Who am I?
Gilles Baillet
Cloud Centre of Excellence Manager
Leading a team of 5 DevOps engineers on the Ops (dark) side of DevOps
AWS Certified SysOps Associate and Solutions Architect Associate
Fan of DevOps, AWS, Data Pipeline and now Lambda
Food, drinks and travel addict, and almost married!
You can meet me at several meetups around Sydney (AWS, Docker, Elastic)
You can connect with me on LinkedIn: https://au.linkedin.com/in/gillesbaillet I accept connections from (almost) everyone!
Before we start
Availability Zone Alignment
Randomisation of the assignment of AZs across AWS accounts
Our AZs are “aligned” across all our production and non-production accounts
Tip: Talk to your TAM!
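To make "alignment" concrete, here is a minimal Python sketch of an alignment check. It assumes each account can be described by a mapping from its AZ names to some stable identifier for the physical zone. The account names and zone identifiers are invented for illustration, and nothing here calls AWS.

```python
# Hypothetical per-account mappings: AZ name -> physical zone identifier.
# Account names and zone identifiers are made up for illustration.
ACCOUNT_AZ_MAPS = {
    "prod":    {"ap-southeast-2a": "zone-1", "ap-southeast-2b": "zone-2"},
    "nonprod": {"ap-southeast-2a": "zone-1", "ap-southeast-2b": "zone-2"},
    "sandbox": {"ap-southeast-2a": "zone-2", "ap-southeast-2b": "zone-1"},
}


def misaligned_accounts(az_maps, reference):
    """Return the sorted names of accounts whose mapping differs from the reference account's."""
    expected = az_maps[reference]
    return sorted(name for name, mapping in az_maps.items() if mapping != expected)


print(misaligned_accounts(ACCOUNT_AZ_MAPS, reference="prod"))
```

In this made-up data the "sandbox" account would see a different physical zone behind the name ap-southeast-2a, which is the situation alignment avoids.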
The chain of events as presented by AWS
• At 3:25PM AEST: loss of power at a regional substation
• At 4:46PM AEST: power restored
• At 6:00PM AEST: over 80% of impacted services back online
• At 1:00AM AEST: nearly all instances recovered
• TOTAL DURATION: 1h21 / 9h35
http://aws.amazon.com/message/4372T8/
The chain of events as experienced by my company
• At 3:25PM AEST: trigger of monitoring/alerting services
• At 3:30PM AEST: conference bridge opened
• At 5:30PM AEST: most services were restored
• At 3:00AM AEST: all production services were restored
• TOTAL DURATION: 2h05 / 11h35
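The two "TOTAL DURATION" figures on each timeline can be re-derived from the timestamps. The sketch below assumes the 1:00AM and 3:00AM entries fall on 6 June, the day after the outage started, and measures everything from 3:25PM.

```python
from datetime import datetime

# All times are AEST, so naive datetimes are enough for the subtraction.
START = datetime(2016, 6, 5, 15, 25)  # 3:25PM, loss of power / first alerts


def elapsed(end):
    """Format the time elapsed since START as 'XhMM'."""
    minutes = int((end - START).total_seconds()) // 60
    return f"{minutes // 60}h{minutes % 60:02d}"


# (first milestone, full recovery); after-midnight times assumed to be 6 June.
aws_timeline = (datetime(2016, 6, 5, 16, 46), datetime(2016, 6, 6, 1, 0))
company_timeline = (datetime(2016, 6, 5, 17, 30), datetime(2016, 6, 6, 3, 0))

print(" / ".join(elapsed(t) for t in aws_timeline))      # 1h21 / 9h35
print(" / ".join(elapsed(t) for t in company_timeline))  # 2h05 / 11h35
```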
Black Swan
“An event that comes as a surprise, has a major effect, and is often inappropriately rationalized after the fact with the benefit of hindsight. The term is based on an ancient saying which presumed black swans did not exist, but the saying was rewritten after black swans were discovered in the wild”
https://en.wikipedia.org/wiki/Black_swan_theory
Taleb, N. N. (2007). The Black Swan: The Impact of the Highly Improbable. Random House.
Impact during the outage
• All services running in the impacted AZ
• Some Auto Scaling Group processes
• A NIC failure at 3:26PM
    • Instance restarted
    • No ELB health checks
    • Healthy instance marked as unhealthy
• EC2 Console / EC2 CLI commands
• Some CloudWatch metrics
• Some services relying on a single instance of a service (e.g. domain controller)
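The last item is the kind of weakness a simple inventory check can surface before an outage does. A minimal sketch, assuming an inventory of (service, AZ) pairs with one entry per instance; the service names and placement below are hypothetical:

```python
from collections import defaultdict

# Hypothetical inventory of (service, availability zone) pairs, one per instance.
INSTANCES = [
    ("web", "ap-southeast-2a"),
    ("web", "ap-southeast-2b"),
    ("domain-controller", "ap-southeast-2a"),
]


def single_az_services(instances):
    """Return the sorted services whose instances all sit in one AZ."""
    zones = defaultdict(set)
    for service, az in instances:
        zones[service].add(az)
    return sorted(s for s, azs in zones.items() if len(azs) == 1)


print(single_az_services(INSTANCES))
```

Any service this flags would go down with its AZ, whatever the rest of the design says.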
Impact after the outage
• DB repair / integrity check
• Restoration of data stored on ephemeral storage
• 24 hours fixing instances in lower environments (DEV, UAT etc.)
• Clean up of rogue instances
Some things did work
• ELB Health checks
• RDS Database failover
• Some Auto Scaling Group processes
• AWS support escalation
• All critical services running on Cloud 2.0!
Lessons learned
• Implementation vs design
• Instance type matters
• AWS Enterprise support is worth the cost
• Cattle are awesome
• Datacenters in Sydney are not weatherproof
• 100s of companies impacted
What’s next?
• Review of design documents vs implementation
• Use older instance types
• Use Chaos Monkey
• Turn Pets into Cattle (more work for my team!)
• Deploy new VPCs across 3 AZs
• Revisit DNS client TTL versus Health Check timeout
• AWS to fix “things” on their end
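The editor's notes give the numbers behind the DNS item: a 5 second Linux resolver timeout, the unavailable DNS server listed first in /etc/resolv.conf, and a 5 second ELB health check timeout. The sketch below shows why those two values interact badly, under a simplified model that is an assumption here: nameservers are tried in list order, each unreachable one costs a full resolver timeout, and the health-checked request needs one DNS lookup. The 0.05 s answer time is a placeholder.

```python
def lookup_seconds(nameservers_up, resolver_timeout, answer_seconds=0.05):
    """Estimate one DNS lookup under a simplified model: servers are tried in
    list order and each unreachable one costs a full resolver timeout."""
    waited = 0.0
    for is_up in nameservers_up:
        if is_up:
            return waited + answer_seconds
        waited += resolver_timeout
    return waited  # every server was unreachable


def check_passes(nameservers_up, resolver_timeout, health_check_timeout):
    """True if the lookup alone fits inside the health check timeout."""
    if not any(nameservers_up):
        return False
    return lookup_seconds(nameservers_up, resolver_timeout) < health_check_timeout


# Unavailable server listed first, 5 s resolver timeout, 5 s health check timeout.
print(check_passes([False, True], resolver_timeout=5, health_check_timeout=5))  # False
# Same servers with a hypothetical 1 s resolver timeout.
print(check_passes([False, True], resolver_timeout=1, health_check_timeout=5))  # True
```

Under this model a healthy instance fails its check purely because of resolver settings, which matches the "healthy instance marked as unhealthy" symptom.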
Questions?

What we learned from the AWS Outage


Editor's notes

  1. An impactful event, but a fantastic opportunity to prove or disprove some design decisions and to assess the impact of such events on our services, so that we can make them more resilient and keep our customers happy.
  2. Cloud 2.0. Cattle. French (for those…), so good food and good drinks are always on the table. Last but not least, soon a married man. Poll: Who is familiar with the difference between Pets and Cattle? Who is running pets? Who is running cattle? Who has been impacted?
  3. Access to both primary and secondary power was lost as a result of a failure to transfer the load to generators.
  4. Services were restored by forcing passive services in ap-southeast-2b to become active, by temporarily removing some of our dependencies, and by activating/implementing kill switches.
  5. A complete datacentre failure = an event that most people think can’t happen. 1700: all swans are white, until the discovery of black swans in WA.
  6. Linux: DNS timeout of 5 sec; /etc/resolv.conf was showing the unavailable DNS server first in the list. ELB health check timeout = 5 sec. The failure of the API left us blind to what was happening on the infrastructure. DNS was failing, as some services failed as a result of not being able to reach their RDS database.
  7. That correlates with AWS's description of the behaviour of their infrastructure: “When the APIs initially recovered, our systems were delayed in propagating some state changes and making them available via describe API calls. This meant that some customers could not see their newly launched resources, and some existing instances appeared as stuck in pending or shutting down when customers tried to make changes to their infrastructure in the affected Availability Zone. These state delays also increased latency of adding new instances to existing Elastic Load Balancing (ELB) load balancers.”
  8. Re-designing the VPC structure across 3 availability zones is a challenge in itself, as the current subnets use the whole IP range of the VPC.
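The constraint in the last note can be illustrated with Python's ipaddress module. This is a sketch under assumptions: the 10.0.0.0/16 range, the two-subnet current layout and the "three AZ blocks plus one spare" split are all hypothetical, since the deck does not give the real ranges.

```python
import ipaddress

# Hypothetical VPC range; the deck does not give the real one.
vpc = ipaddress.ip_network("10.0.0.0/16")

# Assumed current layout: two subnets that together use up the whole range.
current = list(vpc.subnets(prefixlen_diff=1))
used = sum(s.num_addresses for s in current)
print(used == vpc.num_addresses)  # True: no room left for a third AZ

# One possible re-design: four equal blocks, three for AZs and one spare.
blocks = list(vpc.subnets(prefixlen_diff=2))
per_az = dict(zip(["az-a", "az-b", "az-c"], blocks))
spare = blocks[3]
print(per_az)
print(spare)
```

When the existing subnets already cover every address, a third AZ can only be added by carving space out of them, which is why the note calls the re-design a challenge in itself.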