2. Lothar
I am a solutions architect and digital disruptor.
Since 2009, I have worked at the intersection of
cloud and analytics. Digital disruption is coming
to ever more sectors, and I want to understand its
technological, societal, and economic impacts.
Before 2009, I managed large project budgets,
later became an architect, built a digital
radiology system, and migrated Miles & More.
@lwieske news.trivadis.com/blog
5. Cloud native technologies empower organizations to
build and run scalable applications in modern,
dynamic environments such as public, private, and
hybrid clouds. Containers, service meshes,
microservices, immutable infrastructure, and
declarative APIs exemplify this approach.
8. 2012: Netflix Open Sourced Chaos Monkey.
2016: Netflix Completed Transition To a 100% AWS Infrastructure
Cloud Changed the Way Netflix Runs the Company
9. Netflix Handled Amazon Maintenance Update
• Amazon performed a major maintenance update at the end of September 2014 to patch a
security vulnerability in the Xen hypervisor, affecting about 10% of its global fleet of cloud servers.
• Netflix has a long history of using its Simian Army (Chaos Monkey, Gorilla, and Kong) to force
reboots of its servers in order to see how the overall system reacts and what can be done to
improve resilience. The problem this time was that the operation would affect some of its
database servers, specifically 218 Cassandra nodes. It is one thing to perform a live restart of a
server streaming a video; it is much harder to do the same to a stateful database.
• Out of Netflix's 2700+ production Cassandra nodes, 218 were rebooted.
• 22 Cassandra nodes were on hardware that did not reboot successfully.
• They were detected and replaced with minimal human intervention.
• Netflix experienced 0 downtime that weekend.
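To put the numbers above in proportion, here is a toy sketch of the "detect and replace" idea. It is not Netflix's tooling: the `Node` class and the `reboot` and `replace_unhealthy` helpers are hypothetical, and 2700 is used as a lower bound for "2700+".

```python
from dataclasses import dataclass


@dataclass
class Node:
    node_id: int
    healthy: bool = True


def reboot(nodes, targets, fail_ids):
    """Reboot the targeted nodes; those also listed in fail_ids do not come back."""
    for node in nodes:
        if node.node_id in targets and node.node_id in fail_ids:
            node.healthy = False


def replace_unhealthy(nodes, next_id):
    """Swap every unhealthy node for a fresh one; return how many were replaced."""
    replaced = 0
    for i, node in enumerate(nodes):
        if not node.healthy:
            nodes[i] = Node(next_id + replaced)
            replaced += 1
    return replaced


# Figures from the slide (2700 is a lower bound for "2700+").
nodes = [Node(i) for i in range(2700)]
targets = set(range(218))   # nodes rebooted by the maintenance
fail_ids = set(range(22))   # nodes on hardware that did not reboot

reboot(nodes, targets, fail_ids)
replaced = replace_unhealthy(nodes, next_id=2700)

rebooted_pct = round(100 * len(targets) / len(nodes), 1)   # share of fleet rebooted
failed_pct = round(100 * len(fail_ids) / len(targets), 1)  # share of reboots that failed
print(replaced, rebooted_pct, failed_pct)
```

Under these assumptions, roughly 8% of the fleet was rebooted and about one in ten of those reboots failed, yet the fleet size is unchanged after replacement.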
12. PRINCIPLES OF CHAOS ENGINEERING
• The following principles describe an ideal application of Chaos Engineering, applied to the processes
of experimentation described above. The degree to which these principles are pursued strongly
correlates to the confidence we can have in a distributed system at scale.
• Build a Hypothesis around Steady State Behavior
• Vary Real-world Events
• Run Experiments in Production
• Automate Experiments to Run Continuously
• Minimize Blast Radius
• Experimenting in production has the potential to cause unnecessary customer pain. While there
must be an allowance for some short-term negative impact, it is the responsibility and obligation of
the Chaos Engineer to ensure the fallout from experiments is minimized and contained.
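The principles above can be sketched as a minimal experiment loop. This is an illustration under stated assumptions, not a real chaos tool: `run_experiment`, its parameters, and the in-memory `healthy` fleet are all hypothetical names, and the 90% health threshold is an arbitrary steady-state definition.

```python
import random


def run_experiment(instances, steady_state, inject, rollback,
                   blast_radius_pct=5, seed=None):
    """Inject a failure into a small sample and check the steady-state hypothesis."""
    # Hypothesis: the steady state holds before and during the experiment.
    if not steady_state():
        return {"ran": False, "hypothesis_held": None, "victims": []}

    # Minimize blast radius: touch only a small share of instances (at least one).
    count = max(1, len(instances) * blast_radius_pct // 100)
    victims = random.Random(seed).sample(instances, count)

    try:
        for victim in victims:
            inject(victim)          # vary a real-world event, e.g. an instance crash
        held = steady_state()       # does the system still behave normally?
    finally:
        for victim in victims:
            rollback(victim)        # always contain the fallout
    return {"ran": True, "hypothesis_held": held, "victims": victims}


# Toy system: 40 instances; "steady state" means at least 90% are healthy.
healthy = {f"node-{i}": True for i in range(40)}


def steady_state():
    return sum(healthy.values()) / len(healthy) >= 0.9


def inject(name):
    healthy[name] = False


def rollback(name):
    healthy[name] = True


result = run_experiment(list(healthy), steady_state, inject, rollback,
                        blast_radius_pct=5, seed=1)
print(result["hypothesis_held"], len(result["victims"]))
```

Running such a function on a schedule would correspond to "Automate Experiments to Run Continuously"; the `finally` block is what keeps a failed experiment from leaving damage behind.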
14. Chaos Engineering Is Not Just Tools.
Culture Is Part Of Your System.
Complexity Is Part Of Your System.
Testing In Production? Yes You Can!
You Should Chaos Engineer Everything Cloud
and Microservices – Among Others