"Kafka often finds itself as the backbone of a company’s systems, but the failure modes and signals leading to those failures are not always well understood. Chaos Engineering espouses empiricism, experimentation over testing, and verification over validation. We can prime a Kafka cluster as a Chaos Experiment by putting it under a controlled load test called a ‘squeeze test’. This session gives attendees confidence in the steps needed to build experiments to prove Kafka cluster(s) can fulfill the needs of the business.
We start by demonstrating how to build a “steady state” hypothesis based on cluster sizing, best practices and expected usage, monitoring configuration, and perfunctory performance testing. We then develop an example hypothesis: as the load on the Kafka cluster increases towards the tipping point, we will receive monitoring alerts/signals for key metrics.
Attendees learn in detail how real world events were varied for the experiment, including design goals, hard trade-offs, and safety mechanisms necessary for the load tool to adhere to Chaos Engineering principles. We show how the results were analyzed to support or debunk the hypothesis. Finally, we lay out the next steps for attendees’ Chaos Engineering journey."
4. What does it all mean?
● Chaos Engineering
○ “the facilitation of experiments to uncover systemic weaknesses.”
● Experiment
○ “an operation or procedure carried out under controlled conditions in order to discover an unknown effect, to test or establish a hypothesis, or to illustrate a known law”
● Use experiments to create new knowledge
○ Tests make assertions about known properties
● Experiments verify behavior; they do not validate it
VERICA | CONTINUOUS VERIFICATION
5. Chaos Engineering Principles
● Define “steady state”
● Form a hypothesis
● Introduce variables
● Attempt to disprove hypothesis
6. Advanced Principles
● Build hypothesis around steady-state behavior
● Vary real-world events
● Run experiments in production
● Automate experiments to run continuously
● Minimize blast radius
7. Why should we care about Chaos Engineering?
8. Complex Systems
● Businesses require capabilities/properties/features
● Requires complexity from systems
● Can’t avoid complexity
● Embrace and navigate complexity
● As complexity increases, we can’t maintain a mental model
9. ● Kafka sits at the core of our businesses
● Kafka is a complex system
● More complex systems built on top of Kafka
● Cloud infrastructure isn’t always what we expect
● Know the safety margins of our systems
Chaofka?
13. Hypothesis
“As the load on the Kafka cluster increases, the standard workload can continue to successfully process each batch of messages before the next batch begins.”
● How do we measure this?
○ Monitoring
■ Message/data rates
■ CPU/Memory/Net/Disk usage
○ Application status
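One way to turn the hypothesis into something measurable is to compare how long each batch took with the interval between batches. The sketch below is illustrative only and is not taken from the talk: `BatchRun`, `load_factor`, and `batch_interval_s` are hypothetical names, and the timings would come from whatever application status or monitoring data is available.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class BatchRun:
    """One observed batch of the standard workload (hypothetical record)."""
    load_factor: float   # assumed multiplier over baseline load during this batch
    started_at: float    # seconds since the experiment began
    finished_at: float   # seconds since the experiment began

    @property
    def duration(self) -> float:
        return self.finished_at - self.started_at


def hypothesis_holds(runs: Iterable[BatchRun], batch_interval_s: float) -> bool:
    """True if every batch finished before the next one was due to begin."""
    return all(run.duration < batch_interval_s for run in runs)


def first_violation(runs: Iterable[BatchRun], batch_interval_s: float) -> Optional[float]:
    """Lowest load factor at which a batch overran its interval, if any."""
    for run in sorted(runs, key=lambda r: r.load_factor):
        if run.duration >= batch_interval_s:
            return run.load_factor
    return None


# Example: batches are scheduled every 60 seconds.
observed = [
    BatchRun(load_factor=1.0, started_at=0, finished_at=30),
    BatchRun(load_factor=2.0, started_at=60, finished_at=105),
    BatchRun(load_factor=3.0, started_at=120, finished_at=185),
]
```

With these sample numbers the third batch takes 65 seconds against a 60 second interval, so the hypothesis is disproved at a load factor of 3.0. The broker metrics listed above (message rates, CPU, memory, network, disk) would then be inspected around that point.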
14. Introducing Variables
● How do we increase load? Enter Horus!
○ Scalable & configurable
○ Safety features
■ Halting could be triggered by
● Cluster or client metrics
● Other conditions
● Manual intervention
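The halting conditions above can be pictured as a list of checks evaluated after every load step. The following is a minimal sketch of that idea and not Horus's actual implementation: `ramp_load`, `max_metric`, `manual_stop`, and the metric name `broker_cpu_pct` are all hypothetical, and `apply_load` / `read_metrics` stand in for whatever the load tool and monitoring stack provide.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Metrics = Dict[str, float]
HaltCheck = Callable[[Metrics], Optional[str]]  # returns a reason, or None to continue


def max_metric(name: str, limit: float) -> HaltCheck:
    """Halt when a cluster or client metric rises above a limit."""
    def check(metrics: Metrics) -> Optional[str]:
        value = metrics.get(name)
        if value is not None and value > limit:
            return f"{name}={value} exceeded {limit}"
        return None
    return check


def manual_stop(stop_requested: Callable[[], bool]) -> HaltCheck:
    """Halt when an operator has asked for the experiment to stop."""
    def check(_metrics: Metrics) -> Optional[str]:
        return "manual stop requested" if stop_requested() else None
    return check


def ramp_load(
    steps: Sequence[float],
    apply_load: Callable[[float], None],
    read_metrics: Callable[[], Metrics],
    checks: Sequence[HaltCheck],
) -> Tuple[List[float], Optional[str]]:
    """Apply each load step in order; drop load to zero at the first tripped check.

    Returns the steps that were applied and the halt reason (None if none tripped).
    """
    applied: List[float] = []
    for rate in steps:
        apply_load(rate)
        applied.append(rate)
        metrics = read_metrics()
        for check in checks:
            reason = check(metrics)
            if reason:
                apply_load(0)  # safety: remove the extra load immediately
                return applied, reason
    apply_load(0)  # ramp finished without tripping a check
    return applied, None


# Example with a fake cluster whose CPU usage tracks the applied rate.
state = {"rate": 0.0}
applied, reason = ramp_load(
    steps=[100, 500, 900],
    apply_load=lambda rate: state.update(rate=float(rate)),
    read_metrics=lambda: {"broker_cpu_pct": state["rate"] / 10},
    checks=[max_metric("broker_cpu_pct", 80), manual_stop(lambda: False)],
)
```

Keeping every halt condition behind the same small interface means a new safety rule can be added without touching the ramp loop, which helps keep the blast radius of the experiment under control.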
18. The Future!
● Context sensitive
○ One size does not fit all
● Start small
● Start in non-production environment
● Minimize blast radius
● Unleash the Chaos!
19. References and Resources
● Rosenthal, Casey and Jones, Nora. Chaos Engineering: System Resiliency in Practice. 1st ed., O’Reilly, 2020.
● Hausmann, Steffen. “Best practices for right-sizing your Apache Kafka clusters to optimize performance and cost.” AWS Big Data Blog, 17 Mar. 2022, https://aws.amazon.com/blogs/big-data/best-practices-for-right-sizing-your-apache-kafka-clusters-to-optimize-performance-and-cost/
● https://principlesofchaos.org/
● https://www.verica.io
● https://www.thevoid.community/