Slides from my ContainerCamp UK 2017 session.
These slides present a practical chaos engineering approach to resilience testing of Docker-based software systems.
2. Who am I?
‣Alexei Ledenev (@alexeiled)
‣Chief of Research @codefresh.io
‣Open Source Projects
‣github.com/alexei-led/pumba
‣github.com/codefresh-io/microci
‣#docker #k8s #aws #gcloud
3. Complex Systems
"Sooner or later, any complex system will fail, and software systems are no exception.
Failure can occur anytime and almost anywhere. So you should never get too comfortable."
4. Last Year Outages
• IBM Cloud, January 26
• GitLab, January 31
• AWS, February 28
• Microsoft Azure, March 16
• ...
• Visit http://outage.report/
6. What can we do to achieve better quality?
More testing? Better monitoring?
Functional Testing
Performance Testing
Integration Testing
Penetration Testing
Acceptance Testing
Log Analytics
Monitoring Alerts
Failure Predictions
8. CAP Theorem
“Of three properties of shared-data systems (Consistency, Availability and tolerance to network Partitions) only two can be achieved at any given moment in time.”
Eric Brewer
9. Chaos Engineering
• Embrace failure!
• Chaos engineering is an empirical approach to resilience testing of distributed software systems
• Chaos Experiment
- define a "normal/steady" state of the system (e.g. by monitoring a set of system and business metrics)
- pseudo-randomly inject faults (e.g. by terminating VMs, killing containers or changing network behavior)
- try to discover system weaknesses by looking for deviations from the expected or steady-state behavior
The harder it is to disrupt the steady state, the more confidence we have in the behavior of the system.
http://principlesofchaos.org/
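The three experiment steps above can be sketched as a small shell harness. This is only an illustration, not part of Pumba: `run_experiment`, `within_tolerance`, and the probe and inject commands passed to them are hypothetical names, and the "steady state" is reduced to a single integer metric compared against a percentage tolerance.

```shell
#!/bin/sh
# Sketch of one chaos experiment. The probe and inject commands are
# hypothetical and supplied by the caller; nothing here is Pumba-specific.

# within_tolerance BASELINE OBSERVED PERCENT
# Succeeds when OBSERVED is within PERCENT% of BASELINE (integers only).
within_tolerance() {
  wt_diff=$(( $2 - $1 ))
  if [ "$wt_diff" -lt 0 ]; then wt_diff=$(( 0 - wt_diff )); fi
  [ $(( wt_diff * 100 )) -le $(( $1 * $3 )) ]
}

# run_experiment PROBE INJECT PERCENT
# PROBE prints one integer metric; INJECT causes a fault.
run_experiment() {
  re_baseline=$("$1")    # 1. record the steady state
  "$2"                   # 2. inject a fault
  re_observed=$("$1")    # 3. measure again and compare
  if within_tolerance "$re_baseline" "$re_observed" "$3"; then
    echo "steady state held: baseline=$re_baseline observed=$re_observed"
  else
    echo "weakness found: baseline=$re_baseline observed=$re_observed"
    return 1
  fi
}
```

In practice the probe might wrap a monitoring query and the inject command might wrap one of the `pumba` invocations shown later in the deck; both are left to the caller here.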
12. What is Pumba(a)?
1. Pumbaa is a well-known supporting character (warthog) from Disney’s animated film The Lion King
2. In Swahili, pumbaa means “to be foolish, silly, weak-minded, careless, negligent”
3. It's also an open source Chaos Testing tool for Docker containers
1. https://github.com/gaia-adm/pumba
2. Linux, Windows, MacOS, Docker
13. What can Pumba do?
• Pumba disturbs the Docker runtime environment by injecting different failures
• The "victim" containers can be specified by name(s) or by regex
• Random selection is also supported (with the `--random` flag)
• A repeat interval and a duration can be set to better control the chaos
• Pumba can disturb a single Docker host, a Swarm cluster, or a Kubernetes cluster
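Victim selection by regex plus random choice can be approximated with standard tools. This is a rough sketch, not Pumba's implementation: `match_names` and `pick_one` are hypothetical helpers, and `grep -E` uses POSIX extended regular expressions, which agree with RE2 only for simple patterns such as `^cf`.

```shell
#!/bin/sh
# match_names PATTERN: keep the names on stdin that match an extended regex.
match_names() { grep -E -- "$1"; }

# pick_one: print one line from stdin, chosen pseudo-randomly
# (prints nothing when stdin is empty).
pick_one() {
  awk 'BEGIN { srand() }
       { line[NR] = $0 }
       END { if (NR) print line[int(rand() * NR) + 1] }'
}

# Possible use against a live Docker host (not executed here):
#   docker ps --format '{{.Names}}' | match_names '^cf' | pick_one
```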
14. Pumba Docker Chaos Commands
1. stop a running Docker container
2. kill: send a termination (or other) signal to the main process within a Docker container
3. remove "victim" containers, along with their links and volumes
4. pause all processes within a "victim" Docker container for a specified time
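For orientation, each of the four actions has a rough manual counterpart in the plain Docker CLI. The sketch below only prints those commands (a dry run); `chaos_cmd` is a hypothetical helper, the mapping is an approximation rather than a description of how Pumba works internally, and link removal is not covered.

```shell
#!/bin/sh
# chaos_cmd ACTION CONTAINER [ARG]: print the Docker CLI command(s) that
# approximate each action. ARG is the signal for "kill" and the number of
# seconds for "pause". Printing instead of executing keeps this a dry run.
chaos_cmd() {
  case "$1" in
    stop)  echo "docker stop $2" ;;
    kill)  echo "docker kill --signal ${3:-SIGKILL} $2" ;;
    rm)    echo "docker rm --force --volumes $2" ;;
    pause) echo "docker pause $2 && sleep ${3:-5} && docker unpause $2" ;;
    *)     echo "unknown action: $1" >&2; return 1 ;;
  esac
}

chaos_cmd kill mysql SIGTERM
chaos_cmd pause queue 15
```

Piping the printed line to `sh` would execute it against the local Docker daemon, which is deliberately left out of the sketch.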
16. Examples
# stop (suspend) a random container every 10 minutes
$ pumba --random --interval 10m kill --signal SIGSTOP
# every 15 minutes kill `mysql` container and
# every hour remove containers starting with "cf"
$ pumba --interval 15m kill --signal SIGTERM mysql &
$ pumba --interval 1h rm re2:^cf &
# every 5 min randomly kill "worker1" or "worker2" containers
# and every 3 minutes pause "queue" container for 15s
$ pumba --random --interval 5m kill --signal SIGKILL worker1 worker2 &
$ pumba --interval 3m pause --duration 15s queue &
17. Pumba Network Chaos Commands
1. Pumba can emulate network failures at container level (filter by IP too)
2. delay egress traffic for the specified containers
3. add packet-loss based on different probability loss models (2-3-4 state
Markov, Gilbert, Simple Gilbert and Bernoulli)
4. rate limit egress traffic for the specified containers
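The loss models and rate limiting listed above correspond to options of the Linux `netem` queuing discipline. The dry-run sketch below prints `tc` commands using the `loss random`, `loss gemodel`, `loss state` and `rate` forms documented in tc-netem(8). The helper `netem_cmd`, the interface `eth0` and all percentages are illustrative assumptions, and the exact commands Pumba issues may differ.

```shell
#!/bin/sh
# netem_cmd IFACE SPEC...: print the tc command that would install a
# netem qdisc with the given options on IFACE (dry run, nothing is changed).
netem_cmd() {
  nc_iface=$1
  shift
  echo "tc qdisc add dev $nc_iface root netem $*"
}

netem_cmd eth0 loss random 5%                # Bernoulli: independent 5% loss
netem_cmd eth0 loss gemodel 1% 10%           # Simple Gilbert: p and r
netem_cmd eth0 loss gemodel 1% 10% 70%       # Gilbert: p, r and 1-h
netem_cmd eth0 loss gemodel 1% 10% 70% 0.1%  # Gilbert-Elliott: p, r, 1-h, 1-k
netem_cmd eth0 loss state 1% 5%              # 2-state Markov: p13 and p31
netem_cmd eth0 rate 100kbit                  # egress rate limit
```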
18. Examples
# add a 3-second delay to all outgoing packets
# on the (default) network device of the `mydb` Docker container for 5 minutes
$ pumba netem --duration 5m delay --time 3000 mydb
# add a delay of 3000ms ± 30ms,
# with the next random element depending 20% on the last one,
# for all outgoing packets on device `eth1` of all Docker containers
# whose names start with `hp`, for 5 minutes
$ pumba netem --duration 5m --interface eth1 delay \
    --time 3000 --jitter 30 --correlation 20 re2:^hp
# add a delay of 3000ms ± 40ms, where the variation in delay
# follows a normal distribution,
# for all outgoing packets on the main network device of a randomly
# chosen Docker container from the specified list, for 5 minutes
$ pumba --random netem --duration 5m delay --time 3000 \
    --jitter 40 --distribution normal \
    container1 container2 container3
19. Pumba Netem under the hood
• The Linux kernel offers a native framework for routing, bridging, firewalling, address translation and much else.
• Before a packet leaves the output interface, it passes through Linux Traffic Control (tc). This component is a powerful tool for scheduling, shaping, classifying and prioritizing traffic.
• The basic component of Linux Traffic Control is the queuing discipline (qdisc). The simplest implementation of a qdisc is first in first out (FIFO). There are others too.
• The network emulation (netem) project adds queuing disciplines that emulate wide area network properties such as latency, jitter, loss, duplication, corruption and reordering.
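To make the qdisc mechanics concrete, here is a dry-run sketch of the add, wait, delete sequence for a netem delay, using the `delay TIME [JITTER [CORRELATION]] [distribution ...]` form from tc-netem(8). `delay_spec` and `emulate` are hypothetical helpers, and the sequence is an assumption about what a timed emulation roughly looks like, not a trace of Pumba itself. To affect a single container, such commands would have to run inside that container's network namespace.

```shell
#!/bin/sh
# delay_spec TIME_MS [JITTER_MS [CORRELATION_PCT [DISTRIBUTION]]]
# Build the netem "delay" option string; pass "" to skip an optional field.
delay_spec() {
  ds_spec="delay ${1}ms"
  if [ -n "${2:-}" ]; then ds_spec="$ds_spec ${2}ms"; fi
  if [ -n "${3:-}" ]; then ds_spec="$ds_spec ${3}%"; fi
  if [ -n "${4:-}" ]; then ds_spec="$ds_spec distribution $4"; fi
  echo "$ds_spec"
}

# emulate IFACE SECONDS SPEC: print the commands that would install the
# netem qdisc, keep it for SECONDS, then remove it again.
emulate() {
  echo "tc qdisc add dev $1 root netem $3"
  echo "sleep $2"
  echo "tc qdisc del dev $1 root netem"
}

# 3000ms ± 30ms with 20% correlation on eth1 for 5 minutes
emulate eth1 300 "$(delay_spec 3000 30 20)"
```

Removing the qdisc at the end matters: if the `del` step is skipped, the emulated delay stays on the interface after the experiment.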