Don’t Let Kafka Be A Cluster: Kafka Chaos Experimentation with Justin Fetherolf

Don’t Let Kafka Be A Cluster: Kafka Chaos Experimentation
Justin Fetherolf
Welcome!
VERICA | CONTINUOUS VERIFICATION
● What is Chaos Engineering?
● Why should we care?
● Experiment design
● Experiment results
● What’s next?
What is Chaos Engineering?
What does it all mean?
● Chaos Engineering
○ “the facilitation of experiments to uncover systemic weaknesses.”
● Experiment
○ “an operation or procedure carried out under controlled conditions in
order to discover an unknown effect, to test or establish a hypothesis, or
to illustrate a known law”
● Use experiments to create new knowledge
○ Tests make assertions about known properties
● Experiments verify behavior; they do not validate it
Chaos Engineering Principles
● Define “steady state”
● Form a hypothesis
● Introduce variables
● Attempt to disprove hypothesis
Advanced Principles
● Build hypothesis around steady-state behavior
● Vary real-world events
● Run experiments in production
● Automate experiments to run continuously
● Minimize blast radius
Why should we care about Chaos Engineering?
Complex Systems
● Businesses require capabilities/properties/features
● Requires complexity from systems
● Can’t avoid complexity
● Embrace and navigate complexity
● As complexity increases, can’t maintain mental model
Chaofka?
● Kafka sits at the core of our businesses
● Kafka is a complex system
● More complex systems built on top of Kafka
● Cloud infrastructure isn’t always what we expect
● Know the safety margins of our systems
Our Kafka Experiment
Steady State
● Cluster
○ 5-node EKS w/ 1 broker per node - 5-broker Kafka cluster
○ t2.xlarge instance types
■ “moderate” network - 83.4 - 107.3 MiB/s
■ 20 GB “gp2” EBS volumes - 128 MiB/s
○ Metrics to Prometheus/Grafana
● Batch style workload; ~3 min @ 2.5 MiB/s every 5 min
○ ~5 million messages produced and consumed
○ 3 partitions, 3 replicas
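The batch figures above imply a few derived numbers that matter for the hypothesis later on, in particular how much slack each batch has before the next one starts. The following is a minimal sketch, assuming the batch runs for exactly 3 minutes at exactly 2.5 MiB/s (the slide gives both as approximate); all variable names are illustrative.

```python
# Back-of-the-envelope numbers for the steady-state batch workload.
# Inputs are the approximate slide values; names are illustrative only.
BATCH_RATE_MIB_S = 2.5       # throughput while a batch is running
BATCH_DURATION_S = 3 * 60    # "~3 min" per batch
BATCH_INTERVAL_S = 5 * 60    # a new batch starts "every 5 min"

# Total data moved by one batch.
data_per_batch_mib = BATCH_RATE_MIB_S * BATCH_DURATION_S

# Idle time left before the next batch is due to start.
slack_s = BATCH_INTERVAL_S - BATCH_DURATION_S

# Fraction of each interval the workload is active.
duty_cycle = BATCH_DURATION_S / BATCH_INTERVAL_S

# Throughput averaged over the whole interval, idle time included.
average_rate_mib_s = data_per_batch_mib / BATCH_INTERVAL_S

print(f"data per batch: {data_per_batch_mib:.0f} MiB")
print(f"slack per interval: {slack_s} s")
print(f"duty cycle: {duty_cycle:.0%}")
print(f"average rate: {average_rate_mib_s:.2f} MiB/s")
```

Under these assumptions a batch moves about 450 MiB and leaves roughly 2 minutes of slack, which is the margin the added load has to consume before the hypothesis fails.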
Steady State Metrics
Hypothesis
“As the load on the Kafka cluster increases, the standard workload
can continue to successfully process each batch of messages
before the next batch begins.”
● How do we measure this?
○ Monitoring
■ Message/data rates
■ CPU/Memory/Net/Disk usage
○ Application status
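One way to turn the hypothesis into something checkable is to record when each batch starts and finishes and compare the duration against the batch interval. This is only a sketch under stated assumptions: the `BatchRun` record, the `hypothesis_holds` function, and the 300-second interval (taken from the "every 5 min" workload) are hypothetical and not taken from the talk's tooling.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class BatchRun:
    """Timing of one batch, in seconds since the experiment started."""
    started_at: float
    finished_at: Optional[float]  # None if the batch never completed


def hypothesis_holds(runs: Iterable[BatchRun], interval_s: float = 300.0) -> bool:
    """Return True if every batch finished before the next one was due.

    A batch that never finished, or that took at least one full
    interval, counts as evidence against the hypothesis.
    """
    return all(
        run.finished_at is not None
        and run.finished_at - run.started_at < interval_s
        for run in runs
    )


# Example: the third batch overruns its 5-minute window.
observed = [
    BatchRun(started_at=0.0, finished_at=181.0),
    BatchRun(started_at=300.0, finished_at=520.0),
    BatchRun(started_at=600.0, finished_at=915.0),
]
print(hypothesis_holds(observed[:2]))
print(hypothesis_holds(observed))
```

Timing alone does not show why a batch slowed down, so the message rates and CPU/memory/network/disk metrics listed above remain necessary for interpreting a failure.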
Introducing Variables
● How do we increase load? Enter Horus!
○ Scalable & configurable
○ Safety features
■ Halting could be triggered by
● Cluster or client metrics
● Other conditions
● Manual intervention
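The halting triggers listed above can be pictured as a set of conditions evaluated against the latest metrics, plus a manual override. The sketch below is not Horus's API, which the slides do not show; the function names, metric names, and thresholds are all hypothetical and chosen only to illustrate the idea.

```python
from typing import Callable, Mapping, Sequence

Metrics = Mapping[str, float]
HaltCondition = Callable[[Metrics], bool]


def above(metric: str, limit: float) -> HaltCondition:
    """Build a condition that trips when `metric` exceeds `limit`.

    A missing metric is treated as not tripping here; a stricter
    setup might prefer to halt when data is missing.
    """
    return lambda metrics: metrics.get(metric, 0.0) > limit


def should_halt(
    metrics: Metrics,
    conditions: Sequence[HaltCondition],
    manual_stop: bool = False,
) -> bool:
    """Halt on manual intervention or when any condition trips."""
    return manual_stop or any(condition(metrics) for condition in conditions)


# Hypothetical thresholds on cluster and client metrics.
conditions = [
    above("broker_cpu_percent", 90.0),
    above("consumer_lag_messages", 1_000_000.0),
]

healthy = {"broker_cpu_percent": 55.0, "consumer_lag_messages": 12_000.0}
overloaded = {"broker_cpu_percent": 97.0, "consumer_lag_messages": 12_000.0}

print(should_halt(healthy, conditions))
print(should_halt(overloaded, conditions))
print(should_halt(healthy, conditions, manual_stop=True))
```

Keeping the conditions as plain functions means "other conditions" can be added without changing the halting logic itself.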
Load Scaling Configuration
● 4 distinct, increasing client sets; 15 minutes each
● 5 partition, 5 replica topic
● 10 - 40 producers, in steps of 10
○ 500 msg/s; 1024 byte/msg
● 7 consumer groups; 3 consumers each
● 4.88 - 19.5 MiB/s total production traffic
● 34.18 - 136.72 MiB/s total consumer traffic
● Increased replication traffic
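The traffic ranges above follow from the per-producer settings. The sketch below reproduces them, assuming every consumer group reads each message exactly once and that a replication factor of 5 means the leader's data is copied to 4 followers. The replication figure is an estimate derived from that assumption, not a number from the slides, and the function and constant names are illustrative.

```python
MIB = 1024 * 1024

MSG_RATE_PER_PRODUCER = 500   # messages per second, per producer
MSG_SIZE_BYTES = 1024         # bytes per message
CONSUMER_GROUPS = 7           # each group is assumed to read every message once
REPLICATION_FACTOR = 5        # leader plus 4 follower copies (assumption)


def traffic_mib_s(producers: int) -> tuple[float, float, float]:
    """Return (produced, consumed, replicated) traffic in MiB/s."""
    produced = producers * MSG_RATE_PER_PRODUCER * MSG_SIZE_BYTES / MIB
    consumed = produced * CONSUMER_GROUPS
    replicated = produced * (REPLICATION_FACTOR - 1)
    return produced, consumed, replicated


# The four client sets: 10, 20, 30, and 40 producers.
for producers in range(10, 41, 10):
    produced, consumed, replicated = traffic_mib_s(producers)
    print(
        f"{producers} producers: "
        f"{produced:.2f} MiB/s produced, "
        f"{consumed:.2f} MiB/s consumed, "
        f"{replicated:.2f} MiB/s replicated"
    )
```

The first and last steps work out to 4.88 and 19.53 MiB/s produced and 34.18 and 136.72 MiB/s consumed, matching the ranges on the slide.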
Experiment Results
What’s Next?
The Future!
● Context sensitive
○ One size does not fit all
● Start small
● Start in non-production environment
● Minimize blast radius
● Unleash the Chaos!
References and Resources
● Rosenthal, Casey and Jones, Nora. Chaos Engineering: System Resiliency in
Practice. 1st ed., O’Reilly, 2020.
● Hausmann, Steffen. “Best practices for right-sizing your Apache Kafka
clusters to optimize performance and cost.” AWS Big Data Blog, 17 Mar. 2022,
https://aws.amazon.com/blogs/big-data/best-practices-for-right-sizing-your-apache-kafka-clusters-to-optimize-performance-and-cost/
● https://principlesofchaos.org/
● https://www.verica.io
● https://www.thevoid.community/
Justin Fetherolf
Sr. Software Engineer
https://www.verica.io