Failure Happens
F***, the F***ing thing is F***king F***ed*
            *Official WebOps term from Artur Bergman




            jesse@oreilly.com
This will be on the test:

 FAILURE HAPPENS!
25%   Pyromaniac



75%         Paranoid
Good
Book!
“multiple and unexpected
interactions of failures are
        inevitable”
                -Charles Perrow
Failure Happens
define:
 Nines (roughly, downtime per year)
   99%        5256 min (3.65 days)
   99.9%      526 min (8.8 hours)
   99.99%     53 min
   99.999%    5 min
   99.9999%   30 seconds
   99.99999%  3 seconds
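
The figures above can be reproduced with a little arithmetic. This is a minimal sketch, assuming a 365-day year and that the slide means downtime per year; the function name `downtime_minutes_per_year` is made up for illustration.

```python
# Allowed downtime per year for an availability target.
# Assumption: a 365-day year (525,600 minutes).
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in ("99", "99.9", "99.99", "99.999", "99.9999", "99.99999"):
        minutes = downtime_minutes_per_year(float(target))
        print(f"{target}%: {minutes:.2f} min ({minutes * 60:.1f} s)")
```

Each extra nine cuts the budget by a factor of ten, which is why the last rows are measured in seconds.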
Internet Routing... won’t.
#googlefail
YOU
Continuous Power...
       isn’t
365 Main SF
365 364.96 Main SF
Failure happens

 A single datacenter is the
 problem
 • Since they all fail at some point

 Recovery procedures after
 failure
 • Power was gone ~45 minutes
 • Most services took hours to come back
 • Some unnamed ones more than 12 hours
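
Why a single datacenter is the problem can be put in numbers. This is a rough sketch, assuming sites fail independently of each other, which is optimistic when they share geography or a provider; `combined_availability` is a hypothetical helper, not something from the talk.

```python
def combined_availability(site_availability: float, sites: int) -> float:
    """Probability that at least one of `sites` sites is up.

    Assumption: site failures are independent. Correlated failures
    (same region, same provider) make the real number worse.
    """
    if sites < 1:
        raise ValueError("need at least one site")
    return 1 - (1 - site_availability) ** sites


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} site(s) at 99% each: {combined_availability(0.99, n):.6f}")
```

The gain only holds if traffic can actually move to the surviving site, which is what the recovery times above call into question.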
Truck 1, Rackspace 0
Geography is a
Single Point of Failure
Taser-wielding robbers

C I Host's Chicago facility
robbed twice!

(the other two times were
merely "break-ins where things
were stolen")
Providers are
baskets too.
Failure Happens.
Anyone promising otherwise
 is either foolish or lying
          (or both).
Go Here!

June 22-24, 2009


         Jesse Robbins
       jesse@oreilly.com

Failure Happens: CloudCamp Interop

Editor's notes

  1. Firefighters are usually considered to be about 75% paranoid and about 25% pyromaniac.
  2. Which means this sort of thing made perfect sense to me at the time.
  3. The 365 Main site does not have a typical battery backup system. Instead, it relies on Continuous Power Supplies (CPS), which use a flywheel-driven alternator to generate electricity. The flywheel is connected to both a large diesel motor and an electric motor that runs on utility power. The flywheel is normally turned by the electric motor, and stores enough kinetic energy to power the alternator for up to 15 seconds. When utility power fails, the diesel motor is supposed to start in under 5 seconds, well before the flywheel's kinetic energy is exhausted, providing uninterrupted electrical power. The advantage of a CPS over a battery-based system is that the power going to the datacenter is decoupled from the utility power. This eliminates the complex electrical switching required by most battery-based systems, making many CPS systems simpler and sometimes more reliable.
  4. In this incident, latent defects caused three generators to fail during start-up. No customers were affected until a fourth generator failed 30 seconds later, which overloaded the surviving backup system and caused power failures in 3 of 8 customer areas. What's most interesting is that the redundant design of the system is what caused it to fail so completely. The failure of the fourth generator should have brought down only one area instead of three. This kind of cascade failure is common in complex and tightly coupled systems. In my experience, these sorts of failure modes are often identified and then promptly dismissed as being "nearly impossible". Unfortunately, the impossible often becomes reality. To put it another way... Failure Happens.
  5. Hurricane Katrina made landfall, and like many people I wanted to help.