Reliability and Resilience Patterns
05/09/2017
Dmitry Chornyi
dchornyi@opentable.com
Overview
● What are the common causes of service outages?
● Lessons learned from production incidents
● Patterns that we use to make our services resilient
About me
● Software engineer at OpenTable
● Own >20 microservices in production
● Currently on-call
About OpenTable
● Connecting 40k restaurants to 21m diners each month
● >2000 GitHub repositories
● Hundreds of microservices
“Developer looking at logs after a production
outage”
Sir Joseph Noel Paton
Oil on Canvas, 1861
Source: http://classicprogrammerpaintings.com/
My First Outage
Simple Testing Can Prevent Most Critical Failures
“... we also found that for the most catastrophic failures, almost all of them are
caused by incorrect error handling, and 58% of them are trivial mistakes or can be
exposed by statement coverage testing.”
— http://dl.acm.org/citation.cfm?id=2685068
● Release It!: Design and Deploy Production-Ready Software. Michael T. Nygard
● Drift into Failure: From Hunting Broken Components to Understanding Complex Systems. Sidney Dekker
● Site Reliability Engineering: How Google Runs Production Systems. Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy
● Systems Performance: Enterprise and the Cloud. Brendan Gregg
Designing for Failure
● Complex systems are rife with failure and are resistant to top-down control
● Moving from eliminating failure to anticipating failure in every component
● Software should be prepared for real-world production challenges and not
require constant life-support and human intervention
● Build systems and organizations that improve over time, rather than merely avoid
degrading
● Design for failure and operate to learn
Reliability vs Resilience
Reliability:
● Stiff boundaries, layers
● Defense in depth
● Redundancy
● Interference protection
● Assurance
● Accountability
Resilience:
● Withstand transients
● Recover swiftly and smoothly
● Prioritize to serve high-level goals
● Recognize and respond to
anomalies
● Adapt to change
Source: Cook 2012
Failure Modes
● A failure is a chain of cracks propagating through the system: a failure mode
● High levels of complexity provide more directions for the cracks to propagate
● Tightly coupled architectures increase the chance of propagation
● At each step in the chain, the crack can be accelerated, slowed, or stopped
● Design failure modes that drive failures away from indispensable features
Source: Nygard 2007
Bulkheads
● In a ship, bulkheads create watertight compartments, restrict fires, and separate cargo
● Partition your systems to keep failure in one part from destroying everything
● Requires more precise capacity planning
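
One way to partition capacity is to give each downstream dependency its own concurrency limit. The sketch below is illustrative only: `Bulkhead` and `BulkheadFullError` are hypothetical names, not from the talk, and the limit per compartment is an assumption you would derive from capacity planning.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    """Caps concurrent calls into one dependency (hypothetical helper)."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Do not wait for a slot: a full compartment rejects immediately,
        # so a slow dependency cannot absorb every thread in the process.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"bulkhead '{self.name}' is full")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()
```

With one `Bulkhead` per dependency, saturating the compartment for one dependency leaves the others free to serve requests.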
“Waiting for the server response”
Victor Vasnetsov, 1898
Oil, canvas
Source: http://classicprogrammerpaintings.com/
Timeouts
● Never ever block forever
● Set a timeout on any operation that can block threads
● Prefer queue-and-retry to synchronous retries and use circuit breakers
● Don’t forget to clean up resources after a timeout occurs
● How high should timeouts be? Try the 99.9th-percentile response time
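
A rough sketch of the last two bullets, assuming latencies are recorded in seconds and calls run on a thread pool. `timed`, `percentile`, and `call_with_timeout` are hypothetical helpers, and the pool size of 4 is arbitrary.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_executor = ThreadPoolExecutor(max_workers=4)


def timed(fn, samples):
    """Run fn and append its wall-clock latency (seconds) to samples."""
    start = time.monotonic()
    try:
        return fn()
    finally:
        samples.append(time.monotonic() - start)


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of observed latencies."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[min(max(rank, 1), len(ordered)) - 1]


def call_with_timeout(fn, timeout_s, *args):
    """Wait at most timeout_s seconds for fn(*args) to finish."""
    future = _executor.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # cancel() only helps if the task has not started yet; a call that
        # is already running needs its own socket or I/O timeout so the
        # worker thread is eventually released.
        future.cancel()
        raise TimeoutError(f"no response within {timeout_s}s")
```

A timeout budget could then be chosen as `percentile(samples, 99.9)` over a window of recent latencies, which needs at least a thousand samples to be meaningful.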
Resource Pools
● Pool and reuse resources whenever possible to increase efficiency, isolate
failure, limit concurrency, separate workloads
● Prefer several smaller pools to one large pool
● Keep pool size as small as possible
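
A minimal pool sketch, assuming resources are interchangeable and created up front. `ResourcePool` and `checkout` are hypothetical names; real connection pools also validate and replace broken resources, which is omitted here.

```python
import queue
from contextlib import contextmanager


class ResourcePool:
    """Fixed-size pool; callers wait a bounded time for a free resource."""

    def __init__(self, factory, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def checkout(self, timeout_s):
        try:
            resource = self._idle.get(timeout=timeout_s)
        except queue.Empty:
            # The pool size is the concurrency limit: fail instead of
            # creating more resources under load.
            raise TimeoutError("pool exhausted") from None
        try:
            yield resource
        finally:
            # Always return the resource, even if the caller raised.
            self._idle.put(resource)
```

Creating one small pool per workload, rather than sharing a large one, keeps a burst in one workload from starving the others.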
Fail Fast
● Slow failure responses tie up capacity, waste system resources, and cascade
● If the system can determine in advance that it will fail at an operation, it’s
always better to fail fast
● Check that all resources are available and healthy before beginning a
transaction
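
The pre-flight check in the last bullet could look roughly like this. `begin_transaction` and `FailFastError` are hypothetical names, and each health check is assumed to be cheap (for example, reading a cached status rather than making a network call).

```python
class FailFastError(Exception):
    """Raised before any work starts because a dependency is unavailable."""


def begin_transaction(checks, work):
    """Run work() only if every named health check passes.

    checks maps a resource name to a zero-argument callable returning bool.
    """
    unhealthy = [name for name, is_healthy in checks.items() if not is_healthy()]
    if unhealthy:
        # Reject immediately instead of failing halfway through the work.
        raise FailFastError("unavailable: " + ", ".join(unhealthy))
    return work()
```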
Load Shedding
● Define operational limits of your system and withstand excessive load spikes
● Shed load by rejecting excessive requests, executing a fallback method,
returning static data, or applying backpressure to the caller
● Explicit backpressure with handshaking to signal to callers that a service is
overloaded
● Implicit backpressure using blocking synchronous calls, semaphores, TCP
protocol
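
As an illustration, the sketch below sheds load once a fixed number of requests are in flight and answers the rest with an explicit backpressure signal. `LoadShedder` and `overloaded_response` are hypothetical; the HTTP 503 status with a `Retry-After` header is one common convention for that signal, not something the talk prescribes.

```python
import threading


class LoadShedder:
    """Admits at most max_in_flight concurrent requests; sheds the rest."""

    def __init__(self, max_in_flight):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def handle(self, request_fn, fallback_fn):
        with self._lock:
            admitted = self._in_flight < self._max
            if admitted:
                self._in_flight += 1
        if not admitted:
            # Over the operational limit: answer cheaply instead of queuing.
            return fallback_fn()
        try:
            return request_fn()
        finally:
            with self._lock:
                self._in_flight -= 1


def overloaded_response():
    # Explicit backpressure: tell the caller to back off and when to retry.
    return {"status": 503, "headers": {"Retry-After": "1"}}
```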
Circuit Breakers
● Electrical fuses: detect excess usage, fail first, and open the circuit
● Wrap dangerous operations with a component that can circumvent calls when
the system is not healthy—opposite of retries
● Automatically open the circuit when error threshold is exceeded, provide a
fallback mechanism, autorecover when system heals
● Degrade application functionality in response to failure
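
A simplified, single-threaded sketch of the closed, open, and half-open behavior described above. `CircuitBreaker` is a hypothetical class, the threshold counts consecutive failures (an assumption; production libraries often use error rates over a window), and the clock is injectable so the behavior can be tested without sleeping.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the circuit is open and no fallback was supplied."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback=None):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout_s:
                # Open: skip the dangerous call entirely.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("circuit is open")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                # Trip the breaker, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            if fallback is not None:
                return fallback()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Returning the fallback while the circuit is open is what lets the application degrade functionality rather than fail outright.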
Client-Side Load Balancing
● Load balancing distributes load, handles failover, reduces coupling
● Prefer client-side load balancing
● Lower latency, higher throughput, partition tolerance, advanced routing
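
A minimal sketch of a client that balances across instances itself, assuming the instance list comes from some service discovery mechanism not shown here. `ClientSideBalancer` is a hypothetical class, round-robin is only one possible routing policy, and the sketch is not thread-safe.

```python
import itertools


class ClientSideBalancer:
    """Round-robin over known instances, skipping ones marked unhealthy."""

    def __init__(self, instances):
        self._instances = list(instances)
        self._unhealthy = set()
        self._cursor = itertools.cycle(range(len(self._instances)))

    def mark_down(self, instance):
        self._unhealthy.add(instance)

    def mark_up(self, instance):
        self._unhealthy.discard(instance)

    def pick(self):
        # Look at each instance at most once per pick.
        for _ in range(len(self._instances)):
            candidate = self._instances[next(self._cursor)]
            if candidate not in self._unhealthy:
                return candidate
        raise RuntimeError("no healthy instances")

    def call(self, request_fn):
        """Try request_fn(instance), failing over on connection errors."""
        last_error = None
        for _ in range(len(self._instances)):
            instance = self.pick()
            try:
                return request_fn(instance)
            except ConnectionError as exc:
                self.mark_down(instance)
                last_error = exc
        raise RuntimeError("all instances failed") from last_error
```

Something would also need to call `mark_up` once an instance recovers, for example a periodic health probe.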
“MySQL maintenance”
Vasily Perov
Oil on canvas, 1866
Source: http://classicprogrammerpaintings.com/
Keep Utilization Low
Source: Performance by Design
Queuing Effects
● In every system, exactly one constraint determines the system’s capacity
● Once it is reached, all other parts of the system will queue up or drop work
● Response time = processing time + latency (time spent in the queue)
● In practice queues are only found in two states: empty or full
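
To see why utilization matters, the sketch below uses the textbook M/M/1 queue (a single server with random arrivals and service times). That model is an assumption made here for illustration; the slides do not name one. Under it, mean response time is the service time divided by (1 - utilization).

```python
def mm1_response_time(service_time_s, utilization):
    """Mean response time of an M/M/1 queue: S / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1.0 - utilization)


def mm1_queue_wait(service_time_s, utilization):
    """Mean time spent waiting in the queue: response time minus service time."""
    return mm1_response_time(service_time_s, utilization) - service_time_s
```

In this model, with a 1-second service time, the mean response time is 2 seconds at 50% utilization and 4 seconds at 75%, and it grows without bound as utilization approaches 100%.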
Graceful Degradation
● Define features that your service absolutely needs to provide
● Route failure modes away from the critical path of these features
● Feature flags to shut down parts of your service
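
A small sketch of these three bullets together. All names are hypothetical: availability is treated as the critical feature, recommendations as optional, and `FeatureFlags` stands in for whatever flag system is in use.

```python
class FeatureFlags:
    """In-memory flag store (hypothetical); unknown flags count as off."""

    def __init__(self, **flags):
        self._flags = dict(flags)

    def disable(self, name):
        self._flags[name] = False

    def enabled(self, name):
        return self._flags.get(name, False)


def render_restaurant_page(flags, load_availability, load_recommendations):
    # Critical path: if this fails, the request fails.
    page = {"availability": load_availability(), "recommendations": []}
    # Optional feature: can be switched off, and its errors are contained.
    if flags.enabled("recommendations"):
        try:
            page["recommendations"] = load_recommendations()
        except Exception:
            pass  # degrade to the empty default set above
    return page
```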
Steady State
● The system should be able to run indefinitely without human intervention
● Typical interventions: manual disk cleanups, nightly restarts
● For every mechanism that accumulates a resource, some other mechanism
must recycle that resource
● Purge old DB data, rotate log files, expire cache, decommission infrastructure
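
For an in-memory cache, pairing accumulation with recycling might look like the sketch below. `ExpiringCache` is a hypothetical class with a single TTL for all entries and a hard size cap; the clock is injectable for testing.

```python
import time
from collections import OrderedDict


class ExpiringCache:
    """Cache whose growth is bounded by both entry count and entry age."""

    def __init__(self, max_entries, ttl_s, clock=time.monotonic):
        self._max_entries = max_entries
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries = OrderedDict()  # key -> (expires_at, value)

    def put(self, key, value):
        self._purge_expired()
        # Re-insert at the end so insertion order matches expiry order.
        self._entries.pop(key, None)
        self._entries[key] = (self._clock() + self._ttl_s, value)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the oldest entry

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]
            return default
        return value

    def _purge_expired(self):
        now = self._clock()
        # All entries share one TTL, so the oldest entries expire first.
        while self._entries:
            key, (expires_at, _) = next(iter(self._entries.items()))
            if expires_at > now:
                break
            del self._entries[key]

    def __len__(self):
        return len(self._entries)
```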
Service Autonomy
● Expose yourself to latency as rarely as possible
● Use asynchronous communication to reduce temporal coupling
○ No synchronous calls to other services on the request path
○ No transactions that span multiple services
○ Prefetch and cache reads, queue writes
○ Event sourcing and CQRS
○ SOA Saga Pattern and Service Choreography
Source: Dahan 2006
Separation of Concerns
● Separate gateways to third parties from the main transaction services
● API gateways can be used to implement load shedding, timeouts, circuit breakers,
handshaking, failure-injection
● Also a good place for security, metrics, logging, and other cross-cutting concerns
● Particularly valuable for legacy systems
Understand your Platform
● You don’t have to be an engineer to be a racing driver, but you do have to have
Mechanical Sympathy.
● Understand how hardware, OS, and VM work in order to create efficient software
● Abstractions are leaky: CPU cores, caches, RAM, HDD, network, JVM, GC, thread
affinity, Docker, virtualization, data structures
Source: Mechanical Sympathy
Test for Failure
● Build test harnesses that can provoke socket, protocol, and application errors
● Run longevity tests using realistic data volumes to find steady state violations
● Use traffic bursts to measure latency variance and queuing
● Make failure a first-class citizen: game days, chaos monkey, failure injection
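
Failure injection at the application level can be as simple as wrapping a client call. The sketch below is illustrative: `FaultInjector` is a hypothetical helper, and injecting a `ConnectionError` by default is an assumption about which error the code under test should tolerate.

```python
import random


class FaultInjector:
    """Wraps callables so a configurable fraction of calls raise an error."""

    def __init__(self, failure_rate, error_factory=None, rng=None):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self._failure_rate = failure_rate
        self._error_factory = error_factory or (
            lambda: ConnectionError("injected fault"))
        self._rng = rng or random.Random()

    def wrap(self, fn):
        def wrapper(*args, **kwargs):
            # random() returns a value in [0, 1), so a rate of 1.0 always
            # fails and a rate of 0.0 never does.
            if self._rng.random() < self._failure_rate:
                raise self._error_factory()
            return fn(*args, **kwargs)
        return wrapper
```

Passing a seeded `random.Random` as `rng` makes a test run reproducible.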
Incident Response
● Define an incident management framework
● Clear understanding of responsibilities: command, operational work,
communication, planning
● Follow a systematic troubleshooting process
● Blameless postmortems
Antifragile Organization
● Banishing error also banishes innovation and adaptation
● Trade the precise robustness of complicated systems for the sloppy resilience
of complex systems
● Remove organizational scar tissue, clean out, automate, reduce handoffs
● Diversity, loose coupling, slack, decentralized anticipation, communication
● Operational discretion at lower levels in the organization
● Regulation, compliance, oversight, inspection are mismatched to complexity
Source: Dekker 2011
“Github Major Service Outage”
Georges Seurat, 1884
Oil on canvas
Source: http://classicprogrammerpaintings.com/
References
Nygard, M. T. (2007). Release It!: Design and Deploy Production-Ready Software (Pragmatic Programmers)
Dekker, S. (2011). Drift into Failure: From Hunting Broken Components to Understanding Complex Systems
Beyer, B., Jones, C., Petoff, J., Murphy, N.R. (2016). Site Reliability Engineering: How Google Runs Production Systems
Cook, R. (2012). How Complex Systems Fail
Holtman, J., & Gunther, N. J. (2008). Getting in the Zone for Successful Scalability
Gunther, N. (2010). Quantifying Scalability FTW
Schwartz, B. (2015). Everything You Need To Know About Queueing Theory
Herbert, F. (2014). Planning for Overload
Thompson, M. (2012). Applying Back Pressure When Overloaded
Dean, J. (2012). Achieving Rapid Response Times in Large Online Services
Dahan, U. (2006). Autonomous Services and Enterprise Entity Aggregation
Thompson, M. Mechanical Sympathy
Rasmussen, J. (1997). Risk management in a dynamic society: a modelling problem
Andrus, K. (2015). Breaking Bad at Netflix: Building Failure as a Service
Tarjan, P. (2017). Scaling your API with rate limiters