My talk at the Bessemer VP R&D / CTO yearly event (Jan 2020).
The presentation discusses major concepts in resilience testing and MyHeritage's path to Chaos Engineering.
6. Showcase
Incident:
Downtime of the entire web site for 1 minute, due to a failure in the DNA DB (a non-critical DB, except for the DNA use case).
Root cause:
The navigation bar code (which appears on every web page) tried to determine whether the user appears in the DNA whitelist.
8. Chaos Engineering (we are not there yet)
Chaos Engineering is the discipline of experimenting on a system
in order to build confidence in the system’s capability
to withstand turbulent conditions in production. https://principlesofchaos.org/
● Start by defining measurable ‘steady state’
● Hypothesize that this steady state will continue
in both the control group and the experimental group.
● Introduce variables that reflect real world events
like servers that crash, hard drives that malfunction, network connections that are severed, etc.
● Try to disprove the hypothesis
by looking for a difference in steady state between the control group and the experimental group.
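The four steps above can be sketched as a minimal experiment harness. This is an illustrative Python sketch, not MyHeritage code; the steady-state metric and the fault-injection hooks are hypothetical placeholders you would wire up to real monitoring and injection tools:

```python
import statistics

def run_chaos_experiment(measure_steady_state, inject_fault, remove_fault,
                         samples=5, tolerance=0.1):
    """Compare a steady-state metric between a control run and an
    experimental run with a fault injected. Returns True if the
    hypothesis (steady state continues) survives the experiment."""
    # 1-2: measure the steady state and hypothesize it will continue.
    control = [measure_steady_state() for _ in range(samples)]
    # 3: introduce a variable that reflects a real-world event.
    inject_fault()
    try:
        experiment = [measure_steady_state() for _ in range(samples)]
    finally:
        remove_fault()  # always clean up the injected fault
    # 4: try to disprove the hypothesis by comparing the two groups.
    baseline = statistics.mean(control)
    observed = statistics.mean(experiment)
    return abs(observed - baseline) <= tolerance * baseline
```

A real harness would also abort the experiment the moment the steady state degrades beyond an agreed blast radius.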
9. Path to Chaos Engineering
The path to chaos engineering starts much earlier than experimenting in production.
WARNING: DO NOT ATTEMPT CHAOS ENGINEERING TOO EARLY!
10. How to start? (Or: how did we start?)
Resilience Testing
Software resilience testing is a method of software testing that focuses on ensuring that applications perform well under real-life or chaotic conditions.
11. Testing environment(s)
● Start with defining your testing environment. Prefer isolated environments.
[Diagram: developer machines running Dockerized local environments with health services, AWS Staging servers, an AWS Dev account, an AWS Staging account (staging web servers, staging services, an EKS cluster of health services, and AWS managed services: Aurora, Kafka, KMS, Secrets), and Production (prod web servers, production services).]
12. Unit and integration testing
● The easiest way to start with resilience testing.
● A risk-free methodology.
● Easy to set up and easy to implement.
● Resilience testing is done in parallel to development.
13. Unit and integration testing
Examples:
● Make the database mock return an error.
● Make microservice mocks return invalid data or throw exceptions.
● Make code-dependency mocks return the documented error codes and exceptions.
● Simulate timeouts, slow responses, …
Verify code behaves “as advertised”, for example
● Sensible defaults for failed fetch operations.
● Writing to secondary location upon write failure (when such fallback is applicable).
● Documented error code returned or exception thrown to caller.
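The first example pair can be exercised with standard mocking. A minimal sketch using Python's unittest.mock; `fetch_user_badges` is a hypothetical helper invented for illustration, and the empty-list default stands in for whatever "sensible default" fits the feature:

```python
from unittest.mock import Mock

def fetch_user_badges(db, user_id):
    """Return the user's badges, or a sensible default if the DB fails."""
    try:
        return db.query("SELECT badge FROM badges WHERE user=%s", user_id)
    except Exception:
        return []  # sensible default: render the page without badges

def test_badges_default_on_db_error():
    # Make the database mock return an error...
    db = Mock()
    db.query.side_effect = ConnectionError("db down")
    # ...and verify the code behaves "as advertised".
    assert fetch_user_badges(db, 42) == []
```

The same pattern covers the other bullets: set `side_effect` to a timeout exception, or `return_value` to malformed data, and assert the documented behavior.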
14. End-to-end resilience testing
● We continued with more sophisticated end-to-end scenarios.
● We aimed to verify expected behavior under deterministic faulty conditions.
● Started with limited test scope, in testing environment (do not affect users).
15. End-to-end resilience testing
A test defines
● Use case (user flow), e.g. “Show search results page”
● Injected fault(s), e.g. “Records Database instance is down”
● Expected behavior, for example
○ No effect on the user (by using a redundant service).
○ A limited performance hit (e.g. calculate on the fly instead of using cache).
○ Expected service degradation (e.g. a missing UI component on a page).
○ No service to the user, but the effect is localized.
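A test definition like the above is just structured data. An illustrative sketch in Python; the field names are assumptions for this slide, not the actual test framework:

```python
from dataclasses import dataclass, field

@dataclass
class ResilienceTest:
    use_case: str                                # user flow under test
    faults: list = field(default_factory=list)   # injected fault(s)
    expected: str = "no user-visible effect"     # expected behavior

# The example from the slide, expressed as data:
search_test = ResilienceTest(
    use_case="Show search results page",
    faults=["Records Database instance is down"],
    expected="no effect on user (redundant service takes over)",
)
```

Keeping the definition declarative makes it easy to run the same use case against many different injected faults.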
16. Test setup
Faults can happen anywhere, so where do we start?
Focus on failing dependencies of the user-facing web app
● Most important is to protect the user experience
● Also easiest to implement quickly:
○ Need to inject faults in only one location, the web server.
○ Can run on Staging AWS
○ Can run on the Dallas Staging/RC server without affecting production
[Diagram: the PHP web app calling several microservices, databases, and Memcache.]
17. Tools
● We continued with tools exploration and a POC with the most promising ones.
● Our major guideline in the selection was avoiding code changes.
18. Tools
● Chaos Monkey / Chaos Kong
Netflix’s OSS for simulating faults. Aimed at running in production and turning off EC2 instances or entire AWS regions. https://github.com/Netflix/chaosmonkey
● Gremlin
Fault injection framework (infra and Java app level). SaaS-based control plane. Commercial, with an entry-level free tier. Does not cover test scenarios, only injection.
https://www.gremlin.com/docs/
● Envoy
Envoy is a general-purpose proxy, used internally by Istio. It has HTTP fault injection capabilities, though this is not its main purpose. Supports aborting requests, injecting predefined HTTP return codes, adding delays, rate limits, etc. Does not support TCP-level fault injection (except rate limiting and the specific MongoDB filter).
https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/fault_filter
19. Tools
● Chaos Toolkit
Python-based resilience test framework. Plugins to control many environments, including Spring Boot services, AWS, and K8s, and to probe data from Prometheus and OpenTracing. Focuses on test specification.
https://docs.chaostoolkit.org/reference/concepts/
● ToxiProxy
Our current selection. Used as the network driver of Chaos Toolkit.
https://github.com/Shopify/toxiproxy
20. ToxiProxy
“Toxiproxy is a framework for simulating network conditions”
A TCP proxy built for interfering with connections
[Diagram: APP → ToxiProxy → DB]
Configure app to go via proxy, e.g.
stats_db: localhost:20000
Configure proxy mapping, e.g.
stats: localhost:20000 => data121:3306
Configure proxy toxic, e.g.
stats: delay=10sec
Toxic types:
● latency
● down
● bandwidth
● slow_close
● timeout
● …
21. ToxiProxy setup
CLI interface – examples
# run toxiproxy locally (detached)
sudo docker run -d --net=host --name=toxiproxy artifactory.dal.myhrtg.net:8082/shopify/toxiproxy
# set up proxy named “web” from TCP port 8080 to 80
sudo docker exec -it toxiproxy /go/bin/toxiproxy-cli create web -l localhost:8080 -u localhost:80
# set up toxic of 10,000 ms delay on “web” proxy
sudo docker exec -it toxiproxy /go/bin/toxiproxy-cli toxic add web -t latency -a latency=10000
# test
curl localhost:8080
# see that the response is received after 10 seconds
REST interface
# e.g. get configured proxies
curl http://127.0.0.1:8474/proxies
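The same setup as the CLI example can be driven over the REST interface. This sketch only builds the request URLs and JSON bodies for Toxiproxy's documented endpoints (`POST /proxies`, `POST /proxies/{name}/toxics`); actually sending them with curl or an HTTP client is left out:

```python
import json

TOXIPROXY = "http://127.0.0.1:8474"

def create_proxy_request(name, listen, upstream):
    """URL and body for creating a proxy (like `toxiproxy-cli create`)."""
    return (f"{TOXIPROXY}/proxies",
            json.dumps({"name": name, "listen": listen, "upstream": upstream}))

def add_toxic_request(proxy, toxic_type, attributes):
    """URL and body for adding a toxic (like `toxiproxy-cli toxic add`)."""
    return (f"{TOXIPROXY}/proxies/{proxy}/toxics",
            json.dumps({"type": toxic_type, "attributes": attributes}))

# e.g. the 10,000 ms latency toxic on the "web" proxy:
url, body = add_toxic_request("web", "latency", {"latency": 10000})
```

Scripting against the REST API is what makes the later test automation possible: setup creates proxies and toxics, teardown deletes them.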
22. Testing - manual
• We started development of environment setup scripts and ran a few manual experiments.
1) Start ToxiProxy on the test server
Manually run ToxiProxy Docker container, or use this helper:
/srv/MyHeritageControl/resilience/container_ctrl/start_toxiproxy.sh
2) Manually patch PHP config to go through the proxy, or use this helper:
cd /srv/MyHeritageControl/resilience
&& npm i
&& node resilience_switcher.js -s http://<server address>
3) Set up toxics via CLI or REST-API
4) Test
24. Real world example 1
Test case and expected result
DNA Database unavailability should not have any effect on non-DNA pages like the
home page.
Test results
Home page does not load. In fact, any page with the navigation menu does not load…
Analysis
The DNA eligibility check, which is used by the navigation code, is sensitive to DNA database availability.
Solution
• A try … catch block, to provide a default behavior on DB unavailability.
• Limit the wait time for the query.
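The fix amounts to wrapping the whitelist lookup so that a DNA DB failure (or slowness) degrades to "not in the whitelist" instead of failing the whole page. A sketch of the idea in Python; the actual fix was in the PHP web app, and the table, helper, and timeout value here are hypothetical:

```python
def is_in_dna_whitelist(db, user_id, timeout_sec=0.2):
    """Default to False (no DNA features shown) if the DNA DB is down
    or slow, so the navigation bar never takes the page down with it."""
    try:
        row = db.query("SELECT 1 FROM dna_whitelist WHERE user=%s",
                       user_id, timeout=timeout_sec)  # bounded wait time
        return row is not None
    except Exception:
        return False  # degraded but safe default on DB unavailability
```

The two bullets of the solution map directly onto the code: the except branch is the default behavior, and the `timeout` argument is the bounded wait.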
25. Real world example 2
Test case and expected result
Slow response time from the Cassandra AccountStore cluster should not affect the site (or, at worst, should slow it by up to X seconds per request).
Test results
5 sec delay: Site does not load
0.5 sec delay: Site loads slowly; some user info missing
Possible solution
Shorten Cassandra access timeouts, and make sure site loads even with timeouts.
In addition, introduce circuit-breaker on Cassandra access.
26. Testing - Automations
• We continued with writing end-to-end automation tests.
• Test setup configured Toxiproxy and the relevant code configuration changes.
• Test scenario defines the actual failure to inject:
def toxiproxy_settings = [
['records;all_down','@resilience_records_db_down'],
['solrAggregator_url;{"type":"latency","attributes":{"latency":300000}}','@resilience_solr_aggregator_delay']]
• Test teardown reverted the configuration.
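The setup/teardown pattern above maps naturally onto a context manager, so a failing test can never leave a toxic configured. An illustrative Python sketch; `apply` and `revert` stand in for the real Toxiproxy and configuration calls:

```python
from contextlib import contextmanager

@contextmanager
def injected_fault(apply, revert):
    """Apply a fault for the duration of a test scenario and always
    revert it, even if the test body raises."""
    apply()    # test setup: configure the toxic / config change
    try:
        yield
    finally:
        revert()  # test teardown: revert, no matter what happened
```

Usage: `with injected_fault(add_latency_toxic, remove_latency_toxic): run_scenario()`.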
27. It’s a journey
We continued with:
Documenting
Spreading the knowledge
Defining guidelines
28. Resilience guidelines - Limits
Tripping over limits is a common resilience issue. Often hard to fix in production!
Examples:
● Software: max row id, auto-increment, max URL length, connection limits.
● Hardware/infra: disk space, bandwidth, CPU.
Mitigations:
● Awareness, defensive programming.
● Load testing (reads, writes, startup, shutdown, resharding, backup, etc).
● Monitor getting close to limits
○ Adjust slack according to the time needed to react.
○ Alerts are also good documentation - add extra info, and link to explanation and mitigation info.
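"Monitor getting close to limits" can be as simple as alerting on a headroom ratio. A sketch with illustrative names and thresholds (the 20% slack is an assumption, to be tuned per the "time needed to react" point above):

```python
def limit_headroom(current, maximum):
    """Fraction of a limit still available (1.0 = untouched, 0.0 = hit)."""
    return 1.0 - current / maximum

def should_alert(current, maximum, threshold=0.2):
    """Alert while there is still time to react (20% slack assumed)."""
    return limit_headroom(current, maximum) < threshold

# e.g. a signed 32-bit auto-increment id column:
MAX_INT32 = 2**31 - 1
```

The same check applies to disk space, connection limits, or any of the other limits listed, each with its own slack.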
29. Resilience guidelines - spreading slowness
Slowness tends to spread to other components, creating a growing blast radius.
Examples:
● Overload: CPU, I/O, lack of memory.
● Slow caller connections.
● Slow dependencies.
● Misbehaving retry loops.
Mitigations:
● Fail fast.
Tune timeouts (DB queries, microservice calls, etc); Limit retries with exponential backoff and jitter;
Circuit Breakers; cascading deadlines.
● Visibility
Dashboards for quick response: errors, latency, utilization, saturation. Drill down analysis tools.
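"Limit retries with exponential backoff and jitter" can be sketched in a few lines. Illustrative Python, not production code; the full-jitter variant shown here draws each delay uniformly from [0, capped exponential] so that retrying clients do not synchronize:

```python
import random
import time

def call_with_retries(fn, retries=3, base=0.1, cap=5.0, sleep=time.sleep):
    """Retry fn with capped exponential backoff and full jitter."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # fail fast once the retry budget is exhausted
            # full jitter: uniform in [0, min(cap, base * 2**attempt)]
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The `sleep` parameter is injectable so the backoff schedule itself can be unit-tested, in the spirit of the earlier resilience-testing slides.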
30. Resilience guidelines - coordinated demand
Overload is often not isolated, and impacts many parts of the system.
Examples:
● From users: campaigns, media coverage, seasonal.
● From systems: cron jobs, batch jobs, scheduled client updates, coming back after an outage, rebalancing.
Mitigations:
● Degraded service: throttle, serve only paying users, show partial results.
● Queued processing: limit demand by number of consumers.
● Testing.
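The "degraded service: throttle" mitigation is commonly implemented as a token bucket: requests over the sustained rate (plus a burst allowance) are shed rather than allowed to overload the system. A generic illustrative sketch, not the MyHeritage implementation:

```python
import time

class TokenBucket:
    """Allow at most `rate` requests/second with a burst of `capacity`;
    callers arriving with no token left are shed (degraded service)."""

    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate, self.capacity, self.now = rate, capacity, now
        self.tokens, self.last = capacity, now()

    def allow(self):
        # Refill tokens for the time elapsed since the last call.
        t = self.now()
        self.tokens = min(self.capacity,
                          self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # shed this request
```

A rejected request can still get a degraded response (partial results, cached data) instead of a hard error.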
31. It’s a journey: Future work
Integrate Resilience into development cycle
● Spec, software design, test plans, visibility and monitoring.
Expand resilience test coverage, fix issues
Testing infra
● Simplify configuration and automations development.
● Downstream failures (microservice-to-microservice, MS to DB).
● Environmental toxics (e.g. high CPU load, disk space shortage).
Resilience infrastructure
● E.g. Cascading deadlines, Backoff and Jitter retry strategies.
Chaos testing
● Can start small and controlled. E.g. Fail X% of memcache reads on prod, to measure
effect on DB load; fail X% of client requests to check retry behavior, etc.
32. References
● Chaos Engineering book
On O’Reilly (requires login), or as a downloaded PDF
● Google’s SRE book
https://landing.google.com/sre/sre-book/toc/
● Principles of Chaos Engineering
https://principlesofchaos.org/
● Taxonomy of Black Swans talk (review of public post mortems and conclusions)
https://www.infoq.com/presentations/taxonomy-black-swans/