9. Cloud Case Study
@bruce_m_wong
XSA-108 Security Vulnerability
~10% of EC2 instances rebooted
Spread over 5 days
One availability zone at a time
20. Resilience needs to be tested
Testing is hard
Large and growing data sets
Internet-scale traffic
Innovation and New features
Change is constant
21. Resilience needs to be tested
Validate resilience design
Don’t wait for the next outage
Uncontrolled
Unpredictable
Hope is not a strategy
22. Types of Chaos
Instances Fail
Lessons
• Be as stateless as possible
• Autoscaling groups are good
• Invest in automation to rebuild state when necessary
• Running Chaos Monkey on Cassandra (C*)
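The instance-failure lesson can be illustrated with a minimal sketch of random instance termination inside one group. This is not the Chaos Monkey API; the function names and instance IDs below are hypothetical, and termination is only simulated on a list.

```python
import random


def pick_victim(instances, rng=random):
    """Pick one instance at random from a group, or None if the group is empty."""
    if not instances:
        return None
    return rng.choice(instances)


def terminate(instances, victim):
    """Simulate termination by returning the group without the victim."""
    return [i for i in instances if i != victim]


# Hypothetical instance IDs for a single autoscaling group.
group = ["i-a", "i-b", "i-c"]
victim = pick_victim(group, random.Random(0))  # seeded for a repeatable drill
survivors = terminate(group, victim)
```

A stateless service should keep serving from `survivors` while the autoscaling group replaces the lost instance; a stateful one needs the rebuild automation mentioned above.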
23. Types of Chaos
Many Instances can Fail
Lessons
• Cassandra works as expected
• Moving Traffic back to steady state is just as hard
• Infrastructure Management tools can be a bottleneck
24. Types of Chaos
Natural Disasters Happen
Lessons
• Cassandra works as expected
• Moving Traffic back to steady state is just as hard
• Infrastructure Management can be a bottleneck
• Smaller Blast-Radius Benefits
• Traffic + Capacity orchestration is hard
25. Types of Chaos
Latency
Still Learning
• Functional fallbacks don’t account for system limitations
• Thread pools
• Connection pools
• Slow can be hard to find
• Slow can be hard to contain
• Unbounded Queues are BAD
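One way to make a fallback respect thread-pool limits is to cap in-flight calls and apply a latency budget, returning the fallback when either limit is hit. The sketch below assumes a cap of 2 calls and a 0.1 s budget; all names and values are illustrative, not taken from the talk.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

MAX_IN_FLIGHT = 2       # assumed cap on concurrent calls to the dependency
TIMEOUT_SECONDS = 0.1   # assumed latency budget per call

_pool = ThreadPoolExecutor(max_workers=MAX_IN_FLIGHT)
_slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)


def call_with_fallback(fn, fallback):
    """Run fn on the bounded pool; return fallback if saturated or too slow."""
    if not _slots.acquire(blocking=False):
        return fallback  # pool saturated: shed load instead of queueing
    future = _pool.submit(fn)
    # The slot is freed only when the worker finishes, so a slow call keeps
    # occupying capacity even after its caller has given up on it.
    future.add_done_callback(lambda _: _slots.release())
    try:
        return future.result(timeout=TIMEOUT_SECONDS)
    except FutureTimeout:
        return fallback  # slow dependency: answer with the fallback


def slow_dependency():
    """Hypothetical dependency that has become slow."""
    time.sleep(0.5)
    return "fresh"


def fast_dependency():
    """Hypothetical healthy dependency."""
    return "fresh"
```

Two slow calls in a row fill both slots, so a third call gets the fallback immediately even if its own dependency is healthy. That is the containment problem in miniature: the semaphore keeps the damage bounded rather than letting requests pile up.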
26. Unbounded Queues
Come in many forms; to name a few:
Threads
Memory
Disk
Bounded by physical limitations
VERY difficult to find
Elastic is not Infinite
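A minimal sketch of the alternative, using Python's standard `queue.Queue` with an explicit `maxsize` so that overload surfaces as a rejection instead of silent growth. The bound of 3 and the helper name `offer` are assumptions for illustration.

```python
import queue

# Explicit bound; maxsize=0 (the default) would make the queue unbounded.
work = queue.Queue(maxsize=3)


def offer(q, item):
    """Try to enqueue without blocking; return False when the queue is full."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False


# Five items offered to a queue that holds three.
accepted = [offer(work, n) for n in range(5)]
# accepted == [True, True, True, False, False]
```

The rejected offers are the signal: the caller can retry, shed the work, or alert, none of which is possible when the queue quietly absorbs everything until a physical limit is hit.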
27. For Example: Memory and Data
Data is important
In-Memory Queue grows and shrinks
Failure Mode #1 – Out of memory
NOT A MEMORY LEAK!
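The difference from a leak is that the queue drains whenever consumers keep up and grows only while they fall behind. A small simulation with made-up constant rates (items per second) shows both regimes:

```python
def queue_depth_over_time(arrival_rate, service_rate, seconds, start=0):
    """Return the queue depth after each second for constant rates."""
    depth = start
    history = []
    for _ in range(seconds):
        # Depth changes by the gap between arrivals and completions,
        # and can never go below zero.
        depth = max(0, depth + arrival_rate - service_rate)
        history.append(depth)
    return history


# Healthy: consumers outpace producers, so a backlog of 50 drains.
normal = queue_depth_over_time(arrival_rate=100, service_rate=120,
                               seconds=5, start=50)
# normal == [30, 10, 0, 0, 0]

# Degraded: consumers slow down, so depth grows linearly without limit.
degraded = queue_depth_over_time(arrival_rate=100, service_rate=60,
                                 seconds=5, start=0)
# degraded == [40, 80, 120, 160, 200]
```

In the degraded case nothing is leaking; every queued item is legitimate data. Memory use simply tracks how long the slowdown lasts, which is why it ends in an out-of-memory failure.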
28. For Example: Memory and Data
Data is important
If Queue gets to size X
Write to disk
Flush later
Failure Mode #2
Disk Full
File Descriptors Saturated
29. For Example: Memory and Data
Data is important
…
But not as important as uptime
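If uptime wins over completeness, one option is a fixed-capacity buffer that discards the oldest entries under pressure and counts what it dropped. This is a sketch of that trade-off, not the approach described in the talk; the class name and capacity are hypothetical.

```python
from collections import deque


class DropOldestBuffer:
    """Fixed-capacity buffer that discards the oldest item when full."""

    def __init__(self, capacity):
        self._items = deque(maxlen=capacity)
        self.dropped = 0  # count of discarded items, kept for monitoring

    def add(self, item):
        if len(self._items) == self._items.maxlen:
            self.dropped += 1  # the append below evicts the oldest item
        self._items.append(item)

    def drain(self):
        """Return all buffered items in arrival order and empty the buffer."""
        items = list(self._items)
        self._items.clear()
        return items


buf = DropOldestBuffer(capacity=3)
for event in range(5):
    buf.add(event)
# buf.dropped == 2 and buf.drain() == [2, 3, 4]
```

Memory use is capped by `capacity` regardless of how far consumers fall behind, and the `dropped` counter makes the data loss visible instead of hiding it.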
30. Starting Chaos
Start small, very small.
Start with simple, stateless systems
Start manually, with coordination
Failure Injection Fridays
Build confidence
Outages are opportunities