Resilience testing
Ran Levy, Ran Peled
Jan 2020
MyHeritage
MyHeritage is the leading global
discovery platform for family history
and DNA testing
110M Users
42 Languages Supported
3.7 Tree profiles
3.8M DNA Database
11B Historical Records
Family trees & historical records
MyHeritage DNA
MyHeritage Health
Resilience testing – WHY?
We’ll start with a showcase
Showcase
Incident:
Downtime of the entire web site for 1
minute due to a failure in DNA DB
(non critical DB, except for DNA use
case).
Root cause:
Navigation bar code (which appears
in every web page) tried to
determine if the user appears in
DNA whitelist.
Modern Architecture
Chaos Engineering (we are not there yet)
Chaos Engineering is the discipline of experimenting on a system
in order to build confidence in the system’s capability
to withstand turbulent conditions in production. https://principlesofchaos.org/
● Start by defining measurable ‘steady state’
● Hypothesize that this steady state will continue
in both the control group and the experimental group.
● Introduce variables that reflect real world events
like servers that crash, hard drives that malfunction, network connections that are severed, etc.
● Try to disprove the hypothesis
by looking for a difference in steady state between the control group and the experimental group.
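The comparison step can be sketched as below. This is a minimal illustration, assuming the steady state is measured as a request success rate and that a fixed tolerance is acceptable; the names `GroupStats` and `hypothesis_holds`, the sample numbers, and the 1% tolerance are all hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class GroupStats:
    """Request counts observed for one group during the experiment."""
    requests: int
    successes: int

    @property
    def success_rate(self) -> float:
        return self.successes / self.requests if self.requests else 0.0


def hypothesis_holds(control: GroupStats, experiment: GroupStats,
                     tolerance: float = 0.01) -> bool:
    """Return True if the experimental group's success rate stays within
    `tolerance` of the control group's success rate."""
    return abs(control.success_rate - experiment.success_rate) <= tolerance


# Hypothetical numbers: the control group sees no injected faults,
# the experimental group has one dependency made unavailable.
control = GroupStats(requests=10_000, successes=9_990)
experiment = GroupStats(requests=10_000, successes=9_400)
print(hypothesis_holds(control, experiment))  # False: steady state was disturbed
```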
Path to Chaos Engineering
The path to chaos engineering starts long before experimenting in production.
WARNING: DO NOT ATTEMPT CHAOS ENGINEERING TOO EARLY!
How to start? (Or: how did we start?)
Resilience Testing
Software resilience testing is a method of
software testing that focuses on ensuring that
applications will perform well in real-life or chaotic
conditions.
Developer machine
Testing environment(s)
● Start with defining your testing environment. Prefer isolated environments.
[Diagram: testing environments. Labels: AWS Dev account; AWS Staging account (Staging services, Staging web servers); Production (Production services, Prod web servers); EKS cluster with Health services; AWS managed services (Aurora, Kafka, KMS, Secrets); AWS Staging servers and a developer machine, each running Health services in a Dockerized local env.]
Unit and integration testing
● The easiest way to start with resilience testing.
● Risk free methodology.
● Easy setup and easy to implement.
● Resilience testing is done in parallel to development.
Unit and integration testing
Examples:
● Make database mock return error.
● Make microservice mock return invalid data or throw exceptions.
● Make code dependency mocks return the documented error codes and exceptions.
● Simulate timeouts, slow response, …
Verify code behaves “as advertised”, for example
● Sensible defaults for failed fetch operations.
● Writing to secondary location upon write failure (when such fallback is applicable).
● Documented error code returned or exception thrown to caller.
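A unit test along these lines could look like the sketch below, written with Python's standard `unittest.mock`. It is loosely modeled on the navigation-bar incident from the showcase; `DnaDbError`, `is_dna_eligible`, and the `is_whitelisted` method are hypothetical names chosen for illustration, not the actual MyHeritage code.

```python
import unittest
from unittest import mock


class DnaDbError(Exception):
    """Hypothetical error raised when the DNA database is unreachable."""


def is_dna_eligible(db, user_id: int) -> bool:
    """Return the user's DNA eligibility, defaulting to False on DB failure."""
    try:
        return bool(db.is_whitelisted(user_id))
    except DnaDbError:
        # Sensible default: hide the DNA menu entry rather than fail the page.
        return False


class EligibilityResilienceTest(unittest.TestCase):
    def test_db_error_falls_back_to_default(self):
        db = mock.Mock()
        # Make the database mock return an error.
        db.is_whitelisted.side_effect = DnaDbError("connection refused")
        self.assertFalse(is_dna_eligible(db, user_id=42))
        db.is_whitelisted.assert_called_once_with(42)

    def test_healthy_db_result_is_passed_through(self):
        db = mock.Mock()
        db.is_whitelisted.return_value = True
        self.assertTrue(is_dna_eligible(db, user_id=42))


# Run the two tests in-process and keep the result for inspection.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(EligibilityResilienceTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The same pattern extends to timeouts and invalid data by changing what the mock raises or returns.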
End-to-end resilience testing
● We continued with more sophisticated end-to-end scenarios.
● We aimed to verify expected behavior under deterministic fault conditions.
● Started with a limited test scope, in a testing environment (so users are not affected).
End-to-end resilience testing
A test defines
● Use case (user flow), e.g. “Show search results page”
● Injected fault(s), e.g. “Records Database instance is down”
● Expected behavior, for example
○ No effect on user (by using a redundant service).
○ A limited performance hit (e.g. calculate on the fly instead of using cache).
○ Expected service degradation (e.g. missing UI component on a page).
○ No service to the user, but effect is localized.
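The three parts of a test definition map naturally onto a small data structure. The sketch below is one possible shape, assuming tests are declared as plain data; the class names and the choice of "expected service degradation" for the search example are illustrative assumptions, not something the slides specify.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Expected(Enum):
    """The four kinds of expected behavior listed above."""
    NO_EFFECT = "no effect on user"
    LIMITED_PERFORMANCE_HIT = "limited performance hit"
    DEGRADED_SERVICE = "expected service degradation"
    LOCALIZED_OUTAGE = "no service, but effect is localized"


@dataclass(frozen=True)
class ResilienceTest:
    use_case: str            # user flow under test
    faults: Tuple[str, ...]  # injected fault(s)
    expected: Expected       # expected behavior


def describe(test: ResilienceTest) -> str:
    """One-line summary suitable for a test report."""
    return (f"{test.use_case} | faults: {', '.join(test.faults)} "
            f"| expect: {test.expected.value}")


# Hypothetical test definition built from the examples on the slide.
search_test = ResilienceTest(
    use_case="Show search results page",
    faults=("Records Database instance is down",),
    expected=Expected.DEGRADED_SERVICE,
)
print(describe(search_test))
```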
Test setup
Faults can happen anywhere, where do we start?
Focus on failing dependencies of the user-facing web app
● Most important is to protect the user experience
● Also easiest to implement quickly:
○ Faults need to be injected in only one location, the web server.
○ Can run on Staging AWS
○ Can run on Dallas Staging/RC server without affecting production
[Diagram: the PHP web app and its dependencies: several microservices, several databases (DB), and Memcache.]
Tools
● We continued by exploring tools and running a POC with the most promising ones.
● Our main selection guideline was to avoid code changes.
Tools
● Chaos Monkey / Chaos Kong
Netflix’s OSS tools for simulating faults. Intended to run in production and turn off EC2 instances
or an entire AWS region. https://github.com/Netflix/chaosmonkey
● Gremlin
Fault injection framework (infra and Java app level). SaaS-based control plane. Commercial,
with an entry-level free tier. Does not cover test scenarios, only injection.
https://www.gremlin.com/docs/
● Envoy
Envoy is a general-purpose proxy, used internally by Istio. It has HTTP fault injection capabilities,
though this is not its main purpose. Supports aborting requests, injecting predefined HTTP
return codes, adding delays, rate limits, etc. Does not support TCP-level fault injection (except
rate limit and the specific MongoDB filter).
https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/fault_filter
Tools
● Chaos Toolkit
Python-based resilience testing framework. Plugins to control many environments,
including Spring Boot services, AWS, and K8s, and to probe data from Prometheus and
OpenTracing. Focuses on test specification.
https://docs.chaostoolkit.org/reference/concepts/
● ToxiProxy
Our current selection. Used as the network driver of Chaos Toolkit.
https://github.com/Shopify/toxiproxy
ToxiProxy
“Toxiproxy is a framework for simulating network conditions”
A TCP proxy built for interfering with connections
[Diagram: APP → ToxiProxy → DB]
Configure app to go via proxy, e.g.
stats_db: localhost:20000
Configure proxy mapping, e.g.
stats: localhost:20000 => data121:3306
Configure proxy toxic, e.g.
stats: delay=10sec
Toxic types:
● latency
● down
● bandwidth
● slow_close
● timeout
● …
ToxiProxy setup
CLI interface – examples
# run toxiproxy locally (detached)
sudo docker run -d --net=host --name=toxiproxy \
  artifactory.dal.myhrtg.net:8082/shopify/toxiproxy
# set up proxy named “web” from TCP port 8080 to 80
sudo docker exec -it toxiproxy \
  /go/bin/toxiproxy-cli create web -l localhost:8080 -u localhost:80
# set up toxic of 10,000 ms delay on “web” proxy
sudo docker exec -it toxiproxy \
  /go/bin/toxiproxy-cli toxic add web -t latency -a latency=10000
# test
curl localhost:8080
# see that response is received after 10 seconds
REST interface
# e.g. get configured proxies
curl http://127.0.0.1:8474/proxies
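The CLI steps above can also be driven through the REST interface from a test script. The sketch below uses only the Python standard library and assumes the Toxiproxy 2.x HTTP API (`POST /proxies` and `POST /proxies/{name}/toxics`) as described in the Toxiproxy README; the helper names are hypothetical, and the payload fields should be checked against the Toxiproxy version actually deployed.

```python
import json
import urllib.request

TOXIPROXY_API = "http://127.0.0.1:8474"  # API address used in the curl example


def proxy_payload(name: str, listen: str, upstream: str) -> dict:
    """Request body for POST /proxies."""
    return {"name": name, "listen": listen, "upstream": upstream, "enabled": True}


def latency_toxic_payload(latency_ms: int, jitter_ms: int = 0) -> dict:
    """Request body for POST /proxies/{name}/toxics adding response latency."""
    return {
        "type": "latency",
        "stream": "downstream",
        "toxicity": 1.0,
        "attributes": {"latency": latency_ms, "jitter": jitter_ms},
    }


def post(path: str, body: dict) -> dict:
    """Send a JSON POST to the Toxiproxy API and return the decoded reply."""
    request = urllib.request.Request(
        TOXIPROXY_API + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)


def setup_web_latency() -> None:
    """Mirror of the CLI example: proxy 8080 -> 80 with a 10,000 ms delay."""
    post("/proxies", proxy_payload("web", "localhost:8080", "localhost:80"))
    post("/proxies/web/toxics", latency_toxic_payload(10_000))
```

Calling `setup_web_latency()` requires a running ToxiProxy instance; the payload builders can be exercised on their own.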
Testing - manual
• We started development of environment setup scripts and ran a few manual
experiments.
1) Start ToxiProxy on the test server
Manually run ToxiProxy Docker container, or use this helper:
/srv/MyHeritageControl/resilience/container_ctrl/start_toxiproxy.sh
2) Manually patch PHP config to go through the proxy, or use this helper:
cd /srv/MyHeritageControl/resilience \
&& npm i \
&& node resilience_switcher.js -s http://<server address>
3) Set up toxics via CLI or REST-API
4) Test
Manual testing gave
immediate and
actionable insights
Real world example 1
Test case and expected result
DNA Database unavailability should not have any effect on non-DNA pages like the
home page.
Test results
Home page does not load. In fact, any page with navigation menu does not load…
Analysis
The DNA eligibility check, which is used by the navigation code, depends on
DNA database availability.
Solution
• try … catch block, to have a default behavior on DB unavailability.
• Limit wait time for query.
Real world example 2
Test case and expected result
Slow response time from the Cassandra AccountStore cluster should not affect the site
(or, at worst, should slow it by up to X seconds per request)
Test results
5 sec delay: Site does not load
0.5 sec delay: Site loads slowly; some user info missing
Possible solution
Shorten Cassandra access timeouts, and make sure site loads even with timeouts.
In addition, introduce circuit-breaker on Cassandra access.
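The circuit breaker mentioned in the possible solution can be sketched as follows. This is a generic, minimal version of the pattern and not the implementation used at MyHeritage; the class name, thresholds, and the simulated `flaky_read` function are assumptions made for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def flaky_read():
    # Simulated slow cluster: every read times out.
    raise TimeoutError("Cassandra read timed out")


breaker = CircuitBreaker(max_failures=2, reset_after=60.0)
outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky_read)
    except CircuitOpenError:
        outcomes.append("skipped")
    except TimeoutError:
        outcomes.append("timeout")
print(outcomes)  # ['timeout', 'timeout', 'skipped']
```

Once the breaker is open, callers fail immediately instead of waiting on the slow dependency, which is what keeps the rest of the page loading.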
Testing - Automations
• We continued by writing automated end-to-end tests.
• Test setup configured ToxiProxy and applied the relevant code configuration changes.
• Test scenario defines the actual failure to inject:
def toxiproxy_settings = [
    ['records;all_down', '@resilience_records_db_down'],
    ['solrAggregator_url;{"type":"latency","attributes":{"latency":300000}}', '@resilience_solr_aggregator_delay']]
• Test teardown reverted the configuration.
It’s a journey
We continued with:
Documenting
Spreading the knowledge
Guidelines definitions
Resilience guidelines - Limits
Tripping over limits is a common resilience issue. Often hard to fix in production!
Examples:
● Software: max row id, auto-increment, max URL length, connection limits.
● Hardware/infra: disk space, bandwidth, CPU.
Mitigations:
● Awareness, defensive programming.
● Load testing (reads, writes, startup, shutdown, resharding, backup, etc).
● Monitor getting close to limits
○ Size the slack according to how long it takes to adjust.
○ Alerts are also good documentation - add extra info, link to explanation and mitigation
info.
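The idea of sizing slack by the time needed to react can be made concrete with a small calculation. The sketch below assumes a constant growth rate, which is a simplification; the function names and the disk figures are hypothetical.

```python
def hours_until_limit(current: float, limit: float, growth_per_hour: float) -> float:
    """Projected hours until `current` reaches `limit` at a constant growth rate."""
    if growth_per_hour <= 0:
        return float("inf")
    return max(limit - current, 0.0) / growth_per_hour


def should_alert(current: float, limit: float, growth_per_hour: float,
                 hours_to_fix: float) -> bool:
    """Alert when the remaining time is no longer than the time needed to react."""
    return hours_until_limit(current, limit, growth_per_hour) <= hours_to_fix


# Hypothetical disk: 1000 GB capacity, 850 GB used, growing by 2 GB per hour.
remaining = hours_until_limit(current=850, limit=1000, growth_per_hour=2)
print(remaining)                        # 75.0 hours left
print(should_alert(850, 1000, 2, 48))   # False: a 48 h fix still fits
print(should_alert(850, 1000, 2, 96))   # True: a 96 h fix would be too late
```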
Resilience guidelines - spreading slowness
Slowness tends to spread to other components and create a growing blast radius.
Examples:
● overload - CPU, I/O, lack of memory.
● slow caller connections.
● slow dependencies.
● misbehaving retry loops.
Mitigations:
● Fail fast.
Tune timeouts (DB queries, microservice calls, etc); Limit retries with exponential backoff and jitter;
Circuit Breakers; cascading deadlines.
● Visibility
Dashboards for quick response: errors, latency, utilization, saturation. Drill down analysis tools.
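The "fail fast" mitigation (limited retries with exponential backoff and jitter, bounded by a deadline) can be sketched as below. This uses the "full jitter" variant, where each delay is drawn uniformly between zero and the exponential bound; the function names, defaults, and the single overall deadline are illustrative assumptions, and a real cascading deadline would be passed down from the caller.

```python
import random
import time


def backoff_delays(retries: int, base: float = 0.1, cap: float = 5.0,
                   rng=random.random):
    """Yield one delay per retry: rng() * min(cap, base * 2**attempt)."""
    for attempt in range(retries):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_retries(func, retries: int = 3, deadline: float = 2.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Call func, retrying on failure, but never past the overall deadline."""
    start = clock()
    last_error = None
    # First attempt has no delay; each retry waits for a jittered backoff.
    for delay in [0.0, *backoff_delays(retries)]:
        if clock() - start + delay > deadline:
            break  # fail fast instead of waiting past the deadline
        sleep(delay)
        try:
            return func()
        except Exception as exc:
            last_error = exc
    raise TimeoutError("gave up after retries or deadline") from last_error


# Hypothetical dependency that fails twice and then recovers.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(call_with_retries(flaky_call))  # succeeds on the third attempt
```

The jitter spreads retries from many clients over time, so a recovering dependency is not hit by synchronized retry waves.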
Resilience guidelines - coordinated demand
Overload is often not isolated, and impacts many parts of the system.
Examples:
● From users: campaigns, media coverage, seasonal.
● From systems: cron jobs, batch jobs, scheduled client updates, coming back after
outage, rebalancing.
Mitigations:
● Degraded service: throttle, serve only paying users, show partial results.
● Queued processing: limit demand by number of consumers.
● Testing.
It’s a journey: Future work
Integrate Resilience into development cycle
● Spec, software design, test plans, visibility and monitoring.
Expand resilience tests coverage, fix issues
Testing infra
● Simplify configuration and automations development.
● Downstream failures (microservice-to-microservice, MS to DB).
● Environmental toxics (e.g. high CPU load, disk space shortage).
Resilience infrastructure
● E.g. Cascading deadlines, Backoff and Jitter retry strategies.
Chaos testing
● Can start small and controlled. E.g. Fail X% of memcache reads on prod, to measure
effect on DB load; fail X% of client requests to check retry behavior, etc.
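Failing a fixed share of cache reads could be done with a thin wrapper around the cache client, as in the sketch below. This is an illustration only: `FaultyCache` is a hypothetical name, a plain dict stands in for the memcache client, and a real experiment would also need a kill switch and monitoring of DB load.

```python
import random


class FaultyCache:
    """Wraps a cache client and turns a fraction of reads into misses."""

    def __init__(self, cache, fail_ratio: float, rng=random.random):
        if not 0.0 <= fail_ratio <= 1.0:
            raise ValueError("fail_ratio must be between 0 and 1")
        self.cache = cache
        self.fail_ratio = fail_ratio
        self.rng = rng
        self.injected = 0  # number of reads turned into misses

    def get(self, key):
        if self.rng() < self.fail_ratio:
            self.injected += 1
            return None  # simulated miss: the caller falls back to the DB
        return self.cache.get(key)


# A dict stands in for the real cache client in this sketch.
backing = {"user:1": "cached profile"}
cache = FaultyCache(backing, fail_ratio=0.05, rng=random.Random(0).random)
misses = sum(1 for _ in range(10_000) if cache.get("user:1") is None)
print(f"{misses} of 10000 reads were failed on purpose")
```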
References
● Chaos Engineering book
On O’Reilly (requires login), or as a downloaded PDF
● Google’s SRE book
https://landing.google.com/sre/sre-book/toc/
● Principles of Chaos Engineering book
https://principlesofchaos.org/
● Taxonomy of Black Swans talk (review of public post mortems and conclusions)
https://www.infoq.com/presentations/taxonomy-black-swans/
Thank You

VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 

Resilience Testing

  • 1. Resilience testing Ran Levy, Ran Peled Jan 2020
  • 2. MyHeritage. MyHeritage is the leading global discovery platform for family history and DNA testing.
  • 4. Family trees & historical records; MyHeritage DNA; MyHeritage Health
  • 5. Resilience testing – WHY? We’ll start with a showcase
  • 6. Showcase. Incident: Downtime of the entire website for 1 minute due to a failure in the DNA DB (a non-critical DB, except for the DNA use case). Root cause: Navigation bar code (which appears on every web page) tried to determine whether the user appears in the DNA whitelist.
  • 8. Chaos Engineering (we are not there yet). Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. https://principlesofchaos.org/ ● Start by defining measurable ‘steady state’ ● Hypothesize that this steady state will continue in both the control group and the experimental group. ● Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc. ● Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
  • 9. Path to Chaos Engineering. The path to chaos engineering starts long before experimenting in production. WARNING: DO NOT ATTEMPT CHAOS ENGINEERING TOO EARLY!
  • 10. How to start? (Or: how did we start?) Resilience Testing: Software resilience testing is a method of software testing that focuses on ensuring that applications will perform well in real-life or chaotic conditions.
  • 11. Testing environment(s) ● Start by defining your testing environment. Prefer isolated environments. [Diagram labels: Developer machine, Dockerized local env, Health services, AWS Dev account, AWS Staging account, Staging services, Staging web servers, AWS Staging server, EKS cluster, AWS managed services (Aurora, Kafka, KMS, Secrets), Production, Production services, Prod web server.]
  • 12. Unit and integration testing ● The easiest way to start with resilience testing. ● Risk-free methodology. ● Easy to set up and easy to implement. ● Resilience testing is done in parallel with development.
  • 13. Unit and integration testing. Examples: ● Make the database mock return an error. ● Make the microservice mock return invalid data or throw exceptions. ● Make code dependency mocks return the documented error codes and exceptions. ● Simulate timeouts, slow responses, … Verify that the code behaves “as advertised”, for example: ● Sensible defaults for failed fetch operations. ● Writing to a secondary location upon write failure (when such a fallback is applicable). ● Documented error code returned or exception thrown to the caller.
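The mock-based checks on this slide can be sketched roughly as follows. This is only an illustration in Python using the standard `unittest.mock` module; `load_display_name`, `AccountStoreError`, the `fetch_name` method, and the `"Guest"` default are hypothetical names invented for the example, not taken from the talk.

```python
from unittest import mock


class AccountStoreError(Exception):
    """Hypothetical error raised by a data-access layer."""


def load_display_name(user_id, account_store):
    """Fetch a display name; fall back to a sensible default if the fetch fails."""
    try:
        return account_store.fetch_name(user_id)
    except (AccountStoreError, TimeoutError):
        # Sensible default for a failed fetch operation.
        return "Guest"


def run_resilience_unit_tests():
    # Fault 1: the database mock returns an error.
    failing_store = mock.Mock()
    failing_store.fetch_name.side_effect = AccountStoreError("store is down")
    assert load_display_name(42, failing_store) == "Guest"

    # Fault 2: the dependency mock simulates a timeout.
    slow_store = mock.Mock()
    slow_store.fetch_name.side_effect = TimeoutError("query timed out")
    assert load_display_name(42, slow_store) == "Guest"

    # Control: a healthy dependency still returns real data.
    healthy_store = mock.Mock()
    healthy_store.fetch_name.return_value = "Ran"
    assert load_display_name(42, healthy_store) == "Ran"
    return "ok"
```

Calling `run_resilience_unit_tests()` exercises all three cases; in a real suite each case would normally be its own test function.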
  • 14. End-to-end resilience testing ● We continued with more sophisticated end-to-end scenarios. ● We aimed to verify expected behavior under deterministic faulty conditions. ● We started with a limited test scope, in a testing environment (so users are not affected).
  • 15. End-to-end resilience testing. A test defines: ● Use case (user flow), e.g. “Show search results page” ● Injected fault(s), e.g. “Records Database instance is down” ● Expected behavior, for example ○ No effect on user (by using a redundant service). ○ A limited performance hit (e.g. calculate on the fly instead of using cache). ○ Expected service degradation (e.g. missing UI component on a page). ○ No service to the user, but effect is localized.
  • 16. Test setup. Faults can happen anywhere; where do we start? Focus on failing dependencies of the user-facing web app ● Most important is to protect the user experience. ● Also the easiest to implement quickly: ○ Need to inject faults in only one location, the web server. ○ Can run on Staging AWS. ○ Can run on the Dallas Staging/RC server without affecting production. [Diagram labels: web server, PHP web app, microservices, DBs, Memcache.]
  • 17. Tools ● We continued by exploring tools and running POCs with the most promising ones. ● Our main selection guideline was to avoid code changes.
  • 18. Tools ● Chaos Monkey / Chaos Kong: Netflix’s OSS for simulating faults. Aimed at running in production, turning off EC2 instances or an entire AWS region. https://github.com/Netflix/chaosmonkey ● Gremlin: Fault injection framework (infra and Java app level). SaaS-based control plane. Commercial, with an entry-level free tier. Does not cover test scenarios, only injection. https://www.gremlin.com/docs/ ● Envoy: Envoy is a general-purpose proxy, used internally by Istio. It has HTTP fault injection capabilities, though this is not its main purpose. Supports aborting requests, injecting predefined HTTP return codes, adding delays, rate limits, etc. Does not support TCP-level fault injection (except rate limit and the specific MongoDB filter). https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/fault_filter
  • 19. Tools ● Chaos Toolkit: Python-based resilience test framework. Plugins to control many environments, including Spring Boot services, AWS, K8s, and probing data from Prometheus and OpenTracing. Focus on test specification. https://docs.chaostoolkit.org/reference/concepts/ ● ToxiProxy: Our current selection. Used as the network driver of Chaos Toolkit. https://github.com/Shopify/toxiproxy
  • 20. ToxiProxy. “Toxiproxy is a framework for simulating network conditions”. A TCP proxy built for interfering with connections. [Diagram: APP → ToxiProxy → DB] Configure the app to go via the proxy, e.g. stats_db: localhost:20000. Configure the proxy mapping, e.g. stats: localhost:20000 => data121:3306. Configure a proxy toxic, e.g. stats: delay=10sec. Toxic types: ● latency ● down ● bandwidth ● slow_close ● timeout ● …
  • 21. ToxiProxy setup. CLI interface – examples:
      # run toxiproxy locally (detached)
      sudo docker run -d --net=host --name=toxiproxy artifactory.dal.myhrtg.net:8082/shopify/toxiproxy
      # set up proxy named “web” from TCP port 8080 to 80
      sudo docker exec -it toxiproxy /go/bin/toxiproxy-cli create web -l localhost:8080 -u localhost:80
      # set up toxic of 10,000 ms delay on “web” proxy
      sudo docker exec -it toxiproxy /go/bin/toxiproxy-cli toxic add web -t latency -a latency=10000
      # test
      curl localhost:8080
      # see that response is received after 10 seconds
    REST interface:
      # e.g. get configured proxies
      curl http://127.0.0.1:8474/proxies
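The same proxy and toxic setup could also be driven through the REST interface from a script. The sketch below is an assumption-laden illustration rather than the team's tooling: it uses only Python's standard library, assumes Toxiproxy's default API address `127.0.0.1:8474` as on the slide, and the helper names (`build_request`, `create_proxy_request`, `add_latency_toxic_request`, `send`) are invented for the example. The request builders are kept separate from `send` so the payloads can be inspected without a running Toxiproxy.

```python
import json
import urllib.request

TOXIPROXY_API = "http://127.0.0.1:8474"  # default Toxiproxy API address


def build_request(method, path, payload=None):
    """Build (but do not send) an HTTP request for the Toxiproxy API."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = urllib.request.Request(TOXIPROXY_API + path, data=data, method=method)
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


def create_proxy_request(name, listen, upstream):
    """Equivalent of: toxiproxy-cli create <name> -l <listen> -u <upstream>."""
    payload = {"name": name, "listen": listen, "upstream": upstream, "enabled": True}
    return build_request("POST", "/proxies", payload)


def add_latency_toxic_request(proxy, latency_ms):
    """Equivalent of: toxiproxy-cli toxic add <proxy> -t latency -a latency=<ms>."""
    payload = {"type": "latency", "attributes": {"latency": latency_ms}}
    return build_request("POST", f"/proxies/{proxy}/toxics", payload)


def send(req, timeout_s=5):
    """Send a prepared request; requires a running Toxiproxy instance."""
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, `send(create_proxy_request("web", "localhost:8080", "localhost:80"))` followed by `send(add_latency_toxic_request("web", 10000))` would mirror the two CLI commands above.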
  • 22. Testing - manual. We started developing environment setup scripts and ran a few manual experiments.
      1) Start ToxiProxy on the test server. Manually run the ToxiProxy Docker container, or use this helper: /srv/MyHeritageControl/resilience/container_ctrl/start_toxiproxy.sh
      2) Manually patch the PHP config to go through the proxy, or use this helper: cd /srv/MyHeritageControl/resilience && npm i && node resilience_switcher.js -s http://<server address>
      3) Set up toxics via the CLI or REST API
      4) Test
  • 23. Manual testing gave immediate and actionable insights
  • 24. Real world example 1. Test case and expected result: DNA Database unavailability should not have any effect on non-DNA pages like the home page. Test results: Home page does not load. In fact, any page with a navigation menu does not load… Analysis: The DNA eligibility check, which is used by the navigation code, is sensitive to DNA database availability. Solution: • A try … catch block, to have a default behavior on DB unavailability. • Limit the wait time for the query.
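The two-part fix (a default on failure plus a bounded wait) can be illustrated with a small Python sketch. The talk's application is PHP, so this only demonstrates the idea; `call_with_default` is a hypothetical helper, and the timeouts are arbitrary. Note that in this sketch a timed-out call keeps running in its worker thread; only the caller stops waiting, so a real implementation would also set a timeout on the query itself.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Small shared pool for time-bounded dependency calls (size is arbitrary).
_executor = ThreadPoolExecutor(max_workers=4)


def call_with_default(fn, default, timeout_s):
    """Run fn, but return `default` if it raises or takes longer than timeout_s."""
    future = _executor.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        # Covers both a timeout while waiting and an error raised by fn.
        return default


def slow_eligibility_check():
    """Stand-in for a DNA whitelist query against a slow database."""
    time.sleep(0.3)
    return True


def broken_eligibility_check():
    """Stand-in for a DNA whitelist query against an unavailable database."""
    raise ConnectionError("DNA DB is down")
```

With these stand-ins, `call_with_default(slow_eligibility_check, False, 0.05)` and `call_with_default(broken_eligibility_check, False, 0.5)` both fall back to `False`, so a navigation bar built on top of the check would still render.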
  • 25. Real world example 2. Test case and expected result: Slow response time from the Cassandra AccountStore cluster should not affect the site (or, at worst, slow it by up to X seconds per request). Test results: 5 sec delay: Site does not load. 0.5 sec delay: Site loads slowly; some user info is missing. Possible solution: Shorten Cassandra access timeouts, and make sure the site loads even with timeouts. In addition, introduce a circuit breaker on Cassandra access.
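A circuit breaker of the kind proposed here can be sketched in a few lines. This is a minimal, single-threaded illustration under assumed parameters, not the implementation used on the site; the class name, thresholds, and the injectable `clock` are choices made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being tried."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout_s:
                # Open: fail fast instead of waiting on a slow dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                # Trip (or re-trip after a failed trial call).
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The caller would catch `CircuitOpenError` and serve a degraded page (for example, with some user info missing) rather than blocking on every request.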
  • 26. Testing - Automations. • We continued by writing end-to-end automation tests. • Test setup configured ToxiProxy and the relevant code configuration changes. • The test scenario defines the actual failure to inject:
      def toxiproxy_settings = [
        ['records;all_down','@resilience_records_db_down'],
        ['solrAggregator_url; {"type":"latency","attributes":{"latency":300000}}','@resilience_solr_aggregator_delay']]
    • Test teardown reverted the configuration.
  • 27. It’s a journey. We continued with: ● Documenting ● Spreading the knowledge ● Defining guidelines
  • 28. Resilience guidelines - Limits. Tripping over limits is a common resilience issue. Often hard to fix in production! Examples: ● Software: max row id, auto-increment, max URL length, connection limits. ● Hardware/infra: disk space, bandwidth, CPU. Mitigations: ● Awareness, defensive programming. ● Load testing (reads, writes, startup, shutdown, resharding, backup, etc). ● Monitor getting close to limits ○ Adjust slack according to time to adjust. ○ Alerts are also good documentation - add extra info, link to explanation and mitigation info.
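One possible reading of “adjust slack according to time to adjust” is to alert when the projected time until a limit is hit drops below the time needed to fix it. The sketch below encodes that interpretation, which is an assumption rather than something stated on the slide; it also assumes linear growth, and the function names and numbers are illustrative only.

```python
def hours_until_limit(current, limit, growth_per_hour):
    """Project how many hours remain before `current` reaches `limit`.

    Assumes linear growth; returns infinity if usage is flat or shrinking.
    """
    if growth_per_hour <= 0:
        return float("inf")
    return max(0.0, (limit - current) / growth_per_hour)


def should_alert(current, limit, growth_per_hour, hours_needed_to_fix):
    """Alert when the remaining slack is no longer enough to react in time."""
    return hours_until_limit(current, limit, growth_per_hour) <= hours_needed_to_fix
```

For instance, a disk at 800 GB of 1000 GB growing by 10 GB per hour has 20 hours left, so it should alert if adding capacity takes a day but not if it takes 8 hours.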
  • 29. Resilience guidelines - spreading slowness. Slowness tends to spread to other components and create a growing blast radius. Examples: ● Overload: CPU, I/O, lack of memory. ● Slow caller connections. ● Slow dependencies. ● Misbehaving retry loops. Mitigations: ● Fail fast: tune timeouts (DB queries, microservice calls, etc); limit retries, with exponential backoff and jitter; circuit breakers; cascading deadlines. ● Visibility: dashboards for quick response (errors, latency, utilization, saturation); drill-down analysis tools.
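Limited retries with exponential backoff and jitter, bounded by an overall deadline, might look like the following sketch. The “full jitter” variant (a uniform random delay up to the exponential cap) is one common choice picked for this example; the function names, defaults, and the injectable `sleep`, `clock`, and `rng` parameters are assumptions made so the behavior is easy to test.

```python
import random
import time


def backoff_delays(base_s, cap_s, max_retries, rng=random.random):
    """Yield one delay per retry: uniform in [0, min(cap_s, base_s * 2**attempt))."""
    for attempt in range(max_retries):
        yield rng() * min(cap_s, base_s * (2 ** attempt))


def call_with_retries(fn, max_retries=3, base_s=0.1, cap_s=2.0, deadline_s=5.0,
                      sleep=time.sleep, clock=time.monotonic, rng=random.random):
    """Call fn, retrying on any exception, within a retry budget and a deadline."""
    give_up_at = clock() + deadline_s
    delays = backoff_delays(base_s, cap_s, max_retries, rng)
    while True:
        try:
            return fn()
        except Exception:
            delay = next(delays, None)
            # Stop when retries are exhausted or the next wait would pass the deadline.
            if delay is None or clock() + delay > give_up_at:
                raise
            sleep(delay)
```

In a cascading-deadline setup, the caller would pass the time it has left as `deadline_s`, so a downstream call never retries longer than its upstream is willing to wait.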
  • 30. Resilience guidelines - coordinated demand. Overload is often not isolated, and impacts many parts of the system. Examples: ● From users: campaigns, media coverage, seasonal. ● From systems: cron jobs, batch jobs, scheduled client updates, coming back after outage, rebalancing. Mitigations: ● Degraded service: throttle, serve only paying users, show partial results. ● Queued processing: limit demand by number of consumers. ● Testing.
  • 31. It’s a journey: Future work Integrate Resilience into development cycle ● Spec, software design, test plans, visibility and monitoring. Expand resilience tests coverage, fix issues Testing infra ● Simplify configuration and automations development. ● Downstream failures (microservice-to-microservice, MS to DB). ● Environmental toxics (e.g. high CPU load, disk space shortage). Resilience infrastructure ● E.g. Cascading deadlines, Backoff and Jitter retry strategies. Chaos testing ● Can start small and controlled. E.g. Fail X% of memcache reads on prod, to measure effect on DB load; fail X% of client requests to check retry behavior, etc.
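The “fail X% of memcache reads” idea can be illustrated with a thin wrapper that turns a configurable fraction of cache reads into misses, so the extra database load can be counted. This is a toy sketch using a plain dict as the cache backend; `FaultyCache`, `read_through`, and the injectable `rng` are hypothetical names, and a real experiment would wrap the actual cache client and be gated by a kill switch.

```python
import random


class FaultyCache:
    """Wrap a cache backend and turn a fraction of reads into simulated misses."""

    def __init__(self, backend, fail_ratio, rng=random.random):
        self._backend = backend        # any object with a dict-like get(key)
        self._fail_ratio = fail_ratio  # e.g. 0.05 to fail 5% of reads
        self._rng = rng
        self.injected = 0              # number of injected misses so far

    def get(self, key):
        if self._rng() < self._fail_ratio:
            self.injected += 1
            return None  # simulate a cache miss
        return self._backend.get(key)


def read_through(cache, db_loader, key):
    """Read from the cache, falling back to the database loader on a miss."""
    value = cache.get(key)
    if value is None:
        value = db_loader(key)
    return value
```

Counting calls to `db_loader` at different `fail_ratio` values gives a rough measure of how much load the database would absorb if the cache degraded.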
  • 32. References ● Chaos Engineering book, on O’Reilly (requires login), or downloaded PDF ● Google’s SRE book https://landing.google.com/sre/sre-book/toc/ ● Principles of Chaos Engineering https://principlesofchaos.org/ ● Taxonomy of Black Swans talk (review of public post mortems and conclusions) https://www.infoq.com/presentations/taxonomy-black-swans/