Don’t Let Kafka Be A Cluster: Kafka Chaos Experimentation with Justin Fetherolf


"Kafka often finds itself as the backbone of a company’s systems, but the failure modes and signals leading to those failures are not always well understood. Chaos Engineering espouses empiricism, experimentation over testing, and verification over validation. We can prime a Kafka cluster as a Chaos Experiment by putting it under a controlled load test called a ‘squeeze test’. This session gives attendees confidence in the steps needed to build experiments to prove Kafka cluster(s) can fulfill the needs of the business. We start by demonstrating how to build a “steady state” hypothesis based on cluster sizing, best practices and expected usage, monitoring configuration, and perfunctory performance testing. We then develop an example hypothesis that as the load on the Kafka cluster increases towards the tipping point we will receive monitoring alerts/signals for key metrics. Attendees learn in detail how real world events were varied for the experiment, including design goals, hard trade-offs, and safety mechanisms necessary for the load tool to adhere to Chaos Engineering principles. We show how the results were analyzed to support or debunk the hypothesis. Finally, we lay out the next steps for attendees’ Chaos Engineering journey."


Don’t Let Kafka Be A Cluster:
Kafka Chaos Experimentation
Justin Fetherolf

Welcome!
VERICA | CONTINUOUS VERIFICATION
● What is Chaos Engineering?
● Why should we care?
● Experiment design
● Experiment results
● What’s next?

What is Chaos Engineering?

What does it all mean?
● Chaos Engineering
○ “the facilitation of experiments to uncover systemic weaknesses.”
● Experiment
○ “an operation or procedure carried out under controlled conditions in order to discover an unknown effect, to test or establish a hypothesis, or to illustrate a known law”
● Use experiments to create new knowledge
○ Tests make assertions about known properties
● Experiments verify behavior; they do not validate it

Chaos Engineering Principles
● Define “steady state”
● Form a hypothesis
● Introduce variables
● Attempt to disprove hypothesis

Advanced Principles
● Build hypothesis around steady-state behavior
● Vary real-world events
● Run experiments in production
● Automate experiments to run continuously
● Minimize blast radius

Why should we care about Chaos Engineering?

Complex Systems
● Businesses require capabilities/properties/features
● Requires complexity from systems
● Can’t avoid complexity
● Embrace and navigate complexity
● As complexity increases, can’t maintain mental model

Chaofka?
● Kafka sits at the core of our businesses
● Kafka is a complex system
● More complex systems built on top of Kafka
● Cloud infrastructure isn’t always what we expect
● Know the safety margins of our systems

Our Kafka Experiment

Steady State
● Cluster
○ 5-node EKS cluster with 1 broker per node (a 5-broker Kafka cluster)
○ t2.xlarge instance types
■ “moderate” network: 83.4 to 107.3 MiB/s
■ 20 GB “gp2” EBS volumes: 128 MiB/s
○ Metrics to Prometheus/Grafana
● Batch-style workload: ~3 min at 2.5 MiB/s every 5 min
○ ~5 million messages produced and consumed
○ 3 partitions, 3 replicas
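
As a rough check on these numbers, the sketch below works out how much data one batch moves and how small a share of the quoted network range the workload uses. The constants come from the slide; the function name and dictionary keys are illustrative assumptions, and the calculation ignores replication and consumer traffic, so it understates the real load on each broker.

```python
# Figures quoted on the slide; everything derived from them is approximate.
BATCH_SECONDS = 3 * 60         # ~3 min of traffic per batch
INTERVAL_SECONDS = 5 * 60      # a new batch starts every 5 min
BATCH_RATE_MIB_S = 2.5         # average rate while a batch is running
NETWORK_MIB_S = (83.4, 107.3)  # "moderate" network range for t2.xlarge
EBS_MIB_S = 128.0              # gp2 volume throughput


def steady_state_summary():
    """Derive simple steady-state figures from the slide's numbers."""
    return {
        # Data moved by one batch at the quoted average rate.
        "batch_mib": BATCH_SECONDS * BATCH_RATE_MIB_S,
        # Fraction of each 5-minute window spent processing.
        "duty_cycle": BATCH_SECONDS / INTERVAL_SECONDS,
        # Slack left before the next batch begins.
        "idle_seconds": INTERVAL_SECONDS - BATCH_SECONDS,
        # Producer traffic as a share of the low end of the network range.
        "network_share": BATCH_RATE_MIB_S / NETWORK_MIB_S[0],
        # Producer traffic as a share of the EBS volume throughput.
        "disk_share": BATCH_RATE_MIB_S / EBS_MIB_S,
    }


if __name__ == "__main__":
    for key, value in steady_state_summary().items():
        print(f"{key}: {value:.3f}")
```

The two minutes of idle time per window is the margin the hypothesis later depends on: the experiment asks how much extra load the cluster absorbs before that margin disappears.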

Steady State Metrics

Hypothesis
“As the load on the Kafka cluster increases, the standard workload can continue to successfully process each batch of messages before the next batch begins.”
● How do we measure this?
○ Monitoring
■ Message/data rates
■ CPU/Memory/Net/Disk usage
○ Application status
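
One way to turn the hypothesis into something checkable is to record when each batch starts and finishes and flag any batch that is still running when the next one is due. This is a minimal sketch under assumptions: the `BatchRun` type, the function name, and the 300-second interval (taken from the 5-minute batch cadence) are illustrative, not part of the talk's tooling.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BatchRun:
    started_at: float   # seconds since the experiment began
    finished_at: float  # seconds since the experiment began


def hypothesis_holds(
    runs: List[BatchRun], interval_seconds: float = 300.0
) -> Tuple[bool, List[BatchRun]]:
    """Return (holds, violations).

    A batch violates the hypothesis if it has not finished by the time
    the next batch is scheduled to begin (start + interval).
    """
    violations = [
        run for run in runs
        if run.finished_at >= run.started_at + interval_seconds
    ]
    return len(violations) == 0, violations


if __name__ == "__main__":
    observed = [
        BatchRun(started_at=0.0, finished_at=180.0),    # normal ~3 min batch
        BatchRun(started_at=300.0, finished_at=640.0),  # overran its window
    ]
    ok, late = hypothesis_holds(observed)
    print("hypothesis holds:", ok)
    print("late batches:", late)
```

The monitoring metrics listed above explain why a batch was late; this check only records whether it was.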

Introducing Variables
● How do we increase load? Enter Horus!
○ Scalable & configurable
○ Safety features
■ Halting could be triggered by
● Cluster or client metrics
● Other conditions
● Manual intervention
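
The halting conditions above could be modelled as a list of named rules evaluated against the latest metrics, with manual intervention taking priority. The sketch below is hypothetical and does not reflect Horus's actual interface: the metric keys, thresholds, and function names are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Metrics = Dict[str, float]
HaltRule = Tuple[str, Callable[[Metrics], bool]]

# Hypothetical rules and thresholds; a real experiment would derive these
# from its own steady-state measurements.
DEFAULT_RULES: List[HaltRule] = [
    ("under-replicated partitions",
     lambda m: m.get("under_replicated_partitions", 0) > 0),
    ("consumer lag too high",
     lambda m: m.get("consumer_lag_messages", 0) > 1_000_000),
]


def halt_reason(
    metrics: Metrics,
    manual_stop: bool = False,
    rules: Optional[List[HaltRule]] = None,
) -> Optional[str]:
    """Return why the load tool should halt, or None to keep running."""
    if manual_stop:
        return "manual intervention"
    for name, triggered in (DEFAULT_RULES if rules is None else rules):
        if triggered(metrics):
            return name
    return None


if __name__ == "__main__":
    print(halt_reason({"under_replicated_partitions": 0}))
    print(halt_reason({"consumer_lag_messages": 2_500_000}))
    print(halt_reason({}, manual_stop=True))
```

Checking rules like these between load steps is one way to keep the blast radius small: the tool stops adding pressure as soon as any safety condition trips.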

Load Scaling Configuration
● 4 distinct, increasing client sets; 15 minutes each
● One topic with 5 partitions and 5 replicas
● 10 to 40 producers, in steps of 10
○ 500 msg/s; 1024 byte/msg
● 7 consumer groups; 3 consumers each
● 4.88 - 19.5 MiB/s total production traffic
● 34.18 - 136.72 MiB/s total consumer traffic
● Increased replication traffic
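
The traffic figures on this slide follow from the client counts. The sketch below reproduces them, assuming the four client sets use 10, 20, 30, and 40 producers and that each of the 7 consumer groups reads every produced message; the function name and defaults are illustrative.

```python
MIB = 1024 * 1024  # bytes per MiB


def load_steps(
    producer_counts=(10, 20, 30, 40),
    msgs_per_sec=500,
    msg_bytes=1024,
    consumer_groups=7,
):
    """Return (producers, produce MiB/s, consume MiB/s) for each client set."""
    steps = []
    for producers in producer_counts:
        produce = producers * msgs_per_sec * msg_bytes / MIB
        # Assumption: every consumer group reads the full produced stream.
        consume = produce * consumer_groups
        steps.append((producers, round(produce, 2), round(consume, 2)))
    return steps


if __name__ == "__main__":
    for producers, produce, consume in load_steps():
        print(f"{producers} producers: {produce} MiB/s in, {consume} MiB/s out")
```

At the top step, total consumer traffic (about 136.72 MiB/s across the cluster) is larger than the quoted network range of a single t2.xlarge node, which is the kind of margin the experiment is designed to probe.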

Experiment Results

What’s Next?

The Future!
● Context sensitive
○ One size does not fit all
● Start small
● Start in non-production environment
● Minimize blast radius
● Unleash the Chaos!

References and Resources
● Rosenthal, Casey and Jones, Nora. Chaos Engineering: System Resiliency in Practice. 1st ed., O’Reilly, 2020.
● Hausmann, Steffen. “Best practices for right-sizing your Apache Kafka clusters to optimize performance and cost.” AWS Big Data Blog, 17 Mar. 2022, https://aws.amazon.com/blogs/big-data/best-practices-for-right-sizing-your-apache-kafka-clusters-to-optimize-performance-and-cost/
● https://principlesofchaos.org/
● https://www.verica.io
● https://www.thevoid.community/

Justin Fetherolf
Sr. Software Engineer
https://www.verica.io

Contenu connexe

Similaire à Don’t Let Kafka Be A Cluster: Kafka Chaos Experimentation with Justin Fetherolf

Audience: Advanced About: Real world lessons and war stories about Catalyst IT’s experience in rolling out an OpenStack based public cloud in New Zealand. This presentation will provide tips and advice that may save you a lot of time, money and nights of sleep if you are planning to run OpenStack in the future. It may also bring some insights to people that are already running OpenStack in production. Topics covered will include: selection of hardware for optimal costs, techniques that drive quality and service levels up, common deployment mistakes, in place upgrades, how to identify the maturity level of each project and decide what is ready for production, and much more! Speaker Bio: Bruno Lago – Entrepreneur, Catalyst IT Limited Bruno Lago is a solutions architect that has been involved with the Catalyst Cloud (New Zealand’s first public cloud based on OpenStack) from its inception. He is passionate about open source software, cloud computing and disruptive technologies. OpenStack Australia Day - Sydney 2016 https://events.aptira.com/openstack-australia-day-sydney-2016/

Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT

OpenStack

Netflix accounts for more than a third of all traffic heading into American homes at peak hours. Making sure users are getting the best possible experience at all times is no simple feat and performance is at the core of this experience. In order to ensure performance and maintain development agility in a highly decentralized environment/(organization?), Netflix employs a multitude of strategies, such as production canary analysis, fully automated performance tests, simple zero-downtime deployments and rollbacks, auto-scaling clusters and a fault-tolerant stateless service architecture. We will present a set of use cases that demonstrate how and why different groups employ different strategies to achieve a common goal, great performance and stability, and detail how these strategies are incorporated into development, test and DevOps with minimal overhead.

Ensuring Performance in a Fast-Paced Environment (CMG 2014)

Martin Spier

This talk demonstrates advanced testing practices coming from the STAMP research project and applied to the XWiki open source project: - Testing for coverage with Jacoco and defining a viable strategy for slowly improving the situation - Testing the quality of your tests with Descartes Mutation testing - Automatically enriching your test suite with DSpot - Testing various configurations with Docker containers and Jenkins - Generating tests automatically from production stack traces

New types of tests for Java projects

Vincent Massol

"Confluent Cloud is a cloud-native service based on Apache Kafka. We run tens of thousands of clusters across all major cloud service providers (AWS, GCP and Azure). In this talk, we will go over our journey to make Confluent Cloud 10x faster than Apache Kafka. We will talk about how we designed our various workloads, the complexities involved in our cloud-native service, the challenges we faced, and the various pitfalls we ran into. We will also cover the interesting learnings, which in hindsight, are first principles from this multi-year journey. By attending this talk, attendees will be able to take our learnings from making Confluent Cloud latencies 10x better and possibly apply similar principles to their cloud native data streaming systems."

Our Multi-Year Journey to a 10x Faster Confluent Cloud

HostedbyConfluent

Advanced Java Testing @ POSS 2019

Vincent Massol

Distributed Performance testing by funkload

Akhil Singh

Ansible, integration testing, and you.

Bob Killen

The journey to Native Cloud Architecture & Microservices, tracing the footste...

Mek Srunyu Stittri

Cloud Architecture & Distributed Systems Trivia

Dr.-Ing. Michael Menzel

Since its beginning, the Performance Advisory Council aims to promote engagement between various experts from around the world, to create relevant, value-added content sharing between members. For Neotys, to strengthen our position as a thought leader in load & performance testing. During this event, 12 participants convened in Chamonix (France) exploring several topics on the minds of today’s performance tester such as DevOps, Shift Left/Right, Test Automation, Blockchain and Artificial Intelligence.

Andreas Grabner - Performance as Code, Let's Make It a Standard

Neotys_Partner

In a continuous delivery environment web application updates are pushed out fast and frequently. Implementing that environment requires many different pieces: version control, automated testing, and automated deployment. It’s a lot to wrap your head around, but there are tangible benefits for small schools, including new opportunities to collaborate among institutions or with student developers. In this presentation we will demonstrate how to build a lightweight continuous integration and delivery stack using free and open source tools: GitLab for version control, GitLab CI and Docker for testing, and Docker and Capistrano for deployment. We will walk through how each piece is separately important and how combining them creates a simple yet powerful deployment strategy. We will also describe concrete examples of how we are using these tools to share application development with students and each other.

Lightweight continuous delivery for small schools

Charles Fulton

First steps with kubernetes

Vinícius Kroth

Chaos Engineering

Anshul Patel

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2qoUklo. Mark Price talks about techniques for making performance testing a first-class citizen in a Continuous Delivery pipeline. He covers a number of war stories experienced by the team building one of the world's most advanced trading exchanges. Filmed at qconlondon.com. Mark Price is a Senior Performance Engineer at Improbable.io, working on optimizing and scaling reality-scale simulations. Previously, he worked as Lead Performance Engineer at LMAX Exchange, where he helped to optimize the platform to become one of the world's fastest FX exchanges.

Continuous Performance Testing

C4Media

Keystone Data Pipeline manages several thousand Flink pipelines, with variable workloads. These pipelines are simple routers which consume from Kafka and write to one of three sinks. In order to alleviate our operational overhead, we’ve implemented autoscaling for our routers. Autoscaling has reduced our resource usage by 25% - 45% (varying by region and time), and has reduced our on call burden. This talk will take an in depth look at the mathematics, algorithms, and infrastructure details for implementing autoscaling of simple pipelines at scale. It will also discuss future work for autoscaling complex pipelines.

Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas

Flink Forward

In this presentation Yury Tsarev will review practical examples of how to build infrastructure-as-code with a strong test-driven approach. While equipped with an opinionated tools selection, the audience will be provided with a generic framework to build upon where the components are fully replaceable. Further, Yury strongly believes that infrastructure code should be treated like any other code. This means applying a test driven development model, storing it in a source control system and building a regression test suite. He suggests doing this with Test Kitchen, a pluggable and extensible test orchestrator that originated in the Chef community. Using Test Kitchen’s disposable modules it is possible to test both mutable (e.g. based on Puppet/Chef/Ansible) as well as immutable infrastructure (e.g. Terraform based). Serverspec/Inspec can verify that the configuration code behaves properly. In addition, shell mocking can be used to bypass external dependencies and create hermetic infra tests. Having such a powerful test infra toolset enables DevOps/SRE teams to practice TDD, create strong CI/CD pipelines and reduce overhead by never having to test manually again.

Atmosphere 2018: Yury Tsarev - TEST DRIVEN INFRASTRUCTURE FOR HIGHLY PERFORMI...

PROIDEA

Apache Big Data Europe 2015: Selected Talks

Andrii Gakhov

Unit testing and test-driven development are practices that makes it easy and efficient to create well-structured and well-working code. However, many software projects didn't create unit tests from the beginning. In this presentation I will show a test automation strategy that works well for legacy code, and how to implement such a strategy on a project. The strategy focuses on characterization tests and refactoring, and the slides contain a detailed example of how to carry through a major refactoring in many tiny steps

Unit testing legacy code

Lars Thorup

Rally--OpenStack Benchmarking at Scale

Mirantis

Automated Testing Environment by Bugzilla, Testopia and Jenkins

walkerchang

Similaire à Don’t Let Kafka Be A Cluster: Kafka Chaos Experimentation with Justin Fetherolf (20)

Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT

Ensuring Performance in a Fast-Paced Environment (CMG 2014)

New types of tests for Java projects

Our Multi-Year Journey to a 10x Faster Confluent Cloud

Advanced Java Testing @ POSS 2019

Distributed Performance testing by funkload

Ansible, integration testing, and you.

The journey to Native Cloud Architecture & Microservices, tracing the footste...

Cloud Architecture & Distributed Systems Trivia

Andreas Grabner - Performance as Code, Let's Make It a Standard

Lightweight continuous delivery for small schools

First steps with kubernetes

Chaos Engineering

Continuous Performance Testing

Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas

Atmosphere 2018: Yury Tsarev - TEST DRIVEN INFRASTRUCTURE FOR HIGHLY PERFORMI...

Apache Big Data Europe 2015: Selected Talks

Unit testing legacy code

Rally--OpenStack Benchmarking at Scale

Automated Testing Environment by Bugzilla, Testopia and Jenkins

Plus de HostedbyConfluent

"In this talk, attendees will be provided with an introduction to Kafka Connect and the basics of Single Message Transforms (SMTs) and how they can be used to transform data streams in a simple and efficient way. SMTs are a powerful feature of Kafka Connect that allow custom logic to be applied to individual messages as they pass through the data pipeline. The session will explain how SMTs work, the types of transformations they can be used for, and how they can be applied in a modular and composable way. Further, the session will discuss where SMTs fit in with Kafka Connect and when they should be used. Examples will be provided of how SMTs can be used to solve common data integration challenges, such as data enrichment, filtering, and restructuring. Attendees will also learn about the limitations of SMTs and when it might be more appropriate to use other tools or frameworks. Additionally, an overview of the alternatives to SMTs, such as Kafka Streams and KSQL, will be provided. This will help attendees make an informed decision about which approach is best for their specific use case. Whether attendees are developers, data engineers, or data scientists, this talk will provide valuable insights into how Kafka Connect and SMTs can help streamline data processing workflows. Attendees will come away with a better understanding of how these tools work and how they can be used to solve common data integration challenges."

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...

HostedbyConfluent

"While Apache Kafka lacks native support for topic renaming, there are scenarios where renaming topics becomes necessary. This presentation will delve into the utilization of MirrorMaker 2.0 as a solution for renaming Kafka topics. It will illustrate how MirrorMaker 2.0 can efficiently facilitate the migration of messages from the old topic to the new one and how Kafka Connect Metrics can be employed to monitor the mirroring progress. The discussion will encompass the complexity of renaming Kafka topics, addressing certain limitations, and exploring potential workarounds when using MirrorMaker 2.0 for this purpose. Despite not being originally designed for topic renaming, MirrorMaker 2.0 has a suitable solution for renaming Kafka topics. Blog Post : https://engineering.hellofresh.com/renaming-a-kafka-topic-d6ff3aaf3f03"

Renaming a Kafka Topic | Kafka Summit London

HostedbyConfluent

"Trendyol, Turkey's leading e-commerce company, is committed to positively impacting the lives of millions of customers. Our decision-making processes are entirely driven by data. As a data warehouse team, our primary goal is to provide accurate and up-to-date data, enabling the extraction of valuable business insights. We utilize the benefits provided by Kafka and Kafka Connect to facilitate the transfer of data from the source to our analytical environment. We recently transitioned our Kafka Connect clusters from on-premise VMs to Kubernetes. This shift was driven by our desire to effectively manage rapid growth(marked by a growing number of producers, consumers, and daily messages), ensuring proper monitoring and consistency. Consistency is crucial, especially in instances where we employ Single Message Transforms to manipulate records like filtering based on their keys or converting a JSON Object into a JSON string. Monitoring our cluster's health is key and we achieve this through Grafana dashboards and alerts generated through kube-state-metrics. Additionally, Kafka Connect's JMX metrics, coupled with NewRelic, are employed for comprehensive monitoring. The session will aim to explain our approach to NRT data ingestion, outlining the role of Kafka and Kafka Connect, our transition journey to K8s, and methods employed to monitor the health of our clusters."

Evolution of NRT Data Ingestion Pipeline at Trendyol

HostedbyConfluent

"Join our lightning talk to delve into the strategies vital for maintaining a resilient Kafka service. While proactive monitoring is key for issue prevention, failures will still occur. Rapid detection tools will enable you to identify and resolve problems before they impact end-users. This session explores the techniques employed by Kafka cloud providers for this detection, many of which are also applicable if you are managing independent Kafka clusters or applications. The talk focuses on health-checking, a powerful tool that encompasses an application and its monitoring to validate Kafka environment availability. The session navigates through Kafka health-check methods, sharing best practices, identifying common pitfalls, and highlighting the monitoring of critical performance metrics like throughput and latency for early issue detection. Attendees will gain valuable insights into the art of health-checking their Kafka environment, equipping them with the tools to identify and address issues before they escalate into critical problems. We invite all Kafka enthusiasts to join us in this talk to foster a deeper understanding of Kafka health-checking and ensure the continued smooth operation of your Kafka environment."

Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques

HostedbyConfluent

"Stream processing systems traditionally gave their users the choice between at least once processing and at most once processing: accepting duplicate data or missing data. But ideally we would provide exactly-once processing, where every event in the input data is represented exactly once in the output. Kafka provides a transaction API that enables exactly-once when using Kafka as your source and sink. But this API has turned out to not be well suited for use by high level streaming systems, requiring various work arounds to still provide transactional processing. In this talk, I’ll cover how the transaction API works, and how systems like Arroyo and Flink have used it to build exactly-once support, and how improvements to the transactional API will enable better end-to-end support for consistent stream processing."

Exactly-once Stream Processing with Arroyo and Kafka

HostedbyConfluent

"In this talk, we will explore the exciting world of IoT and computer vision by presenting a unique project: Fish Plays Pokemon. Using an ESP Eye camera connected to an ESP32 and other IoT devices, to monitor fish's movements in an aquarium. This project showcases the power of IoT and computer vision, demonstrating how even a fish can play a popular video game. We will discuss the challenges we faced during development, including real-time processing, IoT device integration, and Kafka message consumption. By the end of the talk, attendees will have a better understanding of how to combine IoT, computer vision, and the usage of a serverless cloud to create innovative projects. They will also learn how to integrate IoT devices with Kafka to simulate keyboard behavior, opening up endless possibilities for real-time interactions between the physical and digital worlds."

Fish Plays Pokemon | Kafka Summit London

HostedbyConfluent

Tiered Storage 101 | Kafla Summit London

HostedbyConfluent

"Real-time 24/7 monitoring and verification of massive data is challenging – even more so for the world’s second largest manufacturer of memory chips and semiconductors. Tolerance levels are incredibly small, any small defect needs to be identified and dealt with immediately. The goal of semiconductor manufacturing is to improve yield and minimize unnecessary work. However, even with real-time data collection, the data was not easy to manipulate by users and it took many days to enable stream processing requests – limiting its usefulness and value to the business. You’ll hear why SK hynix switched to Confluent and how we developed a self-service stream process portal on top of it. Now users have an easy-to-use service to manipulate the data they want. Results have been impressive, stream processing requests are available the same day – previously taking 5 days! We were also able to drive down costs by 10% as stream processing requests no longer require additional hardware. What you’ll take away from our talk: - What were the pain points in the previous environment - How we transitioned to Confluent without service downtime - Creating a self-service stream processing portal built on top of Connect and ksqlDB - Use case of stream process portal"

Building a Self-Service Stream Processing Portal: How And Why

HostedbyConfluent

"Discover how default configurations might impact ingestion times, especially when dealing with large files. We'll explore a real-world scenario with a 20,000,000+ line file, assessing metrics and exploring the bottleneck in the default setup. Understand the intricacies of batch size calculations and how to optimize them based on your unique data characteristics. Walk away with actionable insights as we showcase a practical example, turning a 7-hour ingestion process into a mere 30 minutes for over 30,000,000 records in a Kafka topic. Uncover metrics, configurations, and best practices to elevate the performance of your Kafka Connect CSV source connectors. Don't miss this opportunity to optimize your data pipeline and ensure smooth, efficient data flow."

From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...

HostedbyConfluent

"In order to meet the current and ever-increasing demand for near-zero RPO/RTO systems, a focus on resiliency is critical. While Kafka offers built-in resiliency features, a perfect blend of client and cluster resiliency is necessary in order to achieve a highly resilient Kafka client application. At Fidelity Investments, Kafka is used for a variety of event streaming needs such as core brokerage trading platforms, log aggregation, communication platforms, and data migrations. In this lightening talk, we will discuss the governance framework that has enabled producers and consumers to achieve their SLAs during unprecedented failure scenarios. We will highlight how we automated resiliency tests through chaos engineering and tightly integrated observability dashboards for Kafka clients to analyze and optimize client configurations. And finally, we will summarize the chaos test suite and the ""test, test and test"" mantra that are helping Fidelity Investments reach its goal of a future with zero down-time."

Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...

HostedbyConfluent

"There are various strategies for securely connecting to Kafka clusters between different networks or over the public internet. Many cloud providers even offer endpoints that privately route traffic between networks and are not exposed to the internet. But, depending on your network setup and how you are running Kafka, these options ... might not be an option! In this session, we’ll discuss how you can use SSH bastions or a self managed PrivateLink endpoint to establish connectivity to your Kafka clusters without exposing brokers directly to the internet. We explain the required network configuration, and show how we at Materialize have contributed to librdkafka to simplify these scenarios and avoid fragile workarounds."

Navigating Private Network Connectivity Options for Kafka Clusters

HostedbyConfluent

"In my talk, we will examine all the stages of building our self-service Streaming Data Platform based on Apache Flink and Kafka Connect, from the selection of a solution for stateful streaming data processing, right up to the successful design of a robust self-service platform, covering the challenges that we’ve met. I will share our experience in providing non-Java developers with a company-wide self-service solution, which allows them to quickly and easily develop their streaming data pipelines. Additionally, I will highlight specific business use cases that would not have been implemented without our platform.0 characters0 characters"

Apache Flink: Building a Company-wide Self-service Streaming Data Platform

HostedbyConfluent

"Almost everyone has heard about large language models, and tens of millions of people have tried out OpenAI ChatGPT and Google Bard. However, the intricate architecture and underlying mathematics driving these remarkable systems remain elusive to many. LLM's are fascinating - so let's grab a drink and find out how these systems are built and dive deep into their inner workings. In the length of time it to enjoy a round of drinks, you'll understand the inner workings of these models. We'll take our first sip of word vectors, enjoy the refreshing taste of the transformer, and drain a glass understanding how these models are trained on phenomenally large quantities of data. Large language models for your streaming application - explained with a little maths and a lot of pub stories"

Explaining How Real-Time GenAI Works in a Noisy Pub

HostedbyConfluent

"Monitoring is a fundamental operation when running Kafka and Kafka applications in production. There are numerous metrics available when using Kafka, however the sheer number is overwhelming, making it challenging to know where to start and how to properly utilise them. This session will introduce you to some of the key metrics that should be monitored and best practices in fine tuning your monitoring. We will delve into which metrics are the indicators for cluster’s availability and performance and are the most helpful when debugging client applications."

TL;DR Kafka Metrics | Kafka Summit London

HostedbyConfluent

Kafka Streams relies on state restoration for maintaining standby tasks as failure recovery mechanism as well as for restoring the state after rebalance scenarios. When you are scaling up or down your application instances, it is necessary to know the current state of the restoration process for each active and standby task in order to prevent a long restoration process as much as possible. During this presentation, you will get an understanding of how KIP-869 provides valuable information about the current active task restoration after a rebalance and KIP-988 opens a window to the continuous process of standby restoration. When you encounter a situation in which you need to choose whether or not to scale up or down your application instances, both KIPs will be an invaluable ally for you.

A Window Into Your Kafka Streams Tasks | KSL

HostedbyConfluent

"In this talk, we will dive into the world of Kafka producer configs and explore how to understand and optimize them for better performance. We will cover the different types of configs, their impact on performance, and how to tune them to achieve the best results. Whether you're new to Kafka or a seasoned pro, this session will provide valuable insights and practical tips for improving your Kafka producer performance. - Introduction to Kafka producer internal and workflow - Understanding the producer configs like linger.ms, batch.size, buffer.memory and their impact on performance - Learning about producer configs like max.block.ms, delivery.timeout.ms, request.timeout.ms and retries to make producer more resilient. - Discuss configs like enable.idempotence, max.in.flight.requests.per.connection and transaction related configs to achieve delivery guarantees. - Q&A session with attendees to address specific questions and concerns."

Don’t Let Kafka Be A Cluster: Kafka Chaos Experimentation with Justin Fetherolf

1. Don’t Let Kafka Be A Cluster: Kafka Chaos Experimentation
Justin Fetherolf

2. Welcome!
VERICA | CONTINUOUS VERIFICATION
● What is Chaos Engineering?
● Why should we care?
● Experiment design
● Experiment results
● What’s next?

3. What is Chaos Engineering?

4. What does it all mean?
● Chaos Engineering
○ “the facilitation of experiments to uncover systemic weaknesses.”
● Experiment
○ “an operation or procedure carried out under controlled conditions in order to discover an unknown effect, to test or establish a hypothesis, or to illustrate a known law”
● Use experiments to create new knowledge
○ Tests make assertions about known properties
● Experiments verify behavior; they do not validate it

5. Chaos Engineering Principles
● Define “steady state”
● Form a hypothesis
● Introduce variables
● Attempt to disprove the hypothesis

6. Advanced Principles
● Build hypothesis around steady-state behavior
● Vary real-world events
● Run experiments in production
● Automate experiments to run continuously
● Minimize blast radius

7. Why should we care about Chaos Engineering?

8. Complex Systems
● Businesses require capabilities/properties/features
● These require complexity from systems
● Can’t avoid complexity
● Embrace and navigate complexity
● As complexity increases, we can’t maintain a mental model

9. Chaofka?
● Kafka sits at the core of our businesses
● Kafka is a complex system
● More complex systems are built on top of Kafka
● Cloud infrastructure isn’t always what we expect
● Know the safety margins of our systems

10. Our Kafka Experiment

11. Steady State
● Cluster
○ 5-node EKS w/ 1 broker per node: a 5-broker Kafka cluster
○ t2.xlarge instance types
■ “moderate” network: 83.4 to 107.3 MiB/s
■ 20 GB “gp2” EBS volumes: 128 MiB/s
○ Metrics to Prometheus/Grafana
● Batch style workload; ~3 min @ 2.5 MiB/s every 5 min
○ ~5 million messages produced and consumed
○ 3 partitions, 3 replicas
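The steady-state figures above imply a few derived numbers that are useful when judging headroom. The Python sketch below works them out. The constant and function names are illustrative, and the disk comparison is a deliberately pessimistic assumption (it pretends every replicated write lands on a single volume), not something stated on the slide.

```python
# Steady-state workload figures taken from the slide.
BATCH_RATE_MIB_S = 2.5        # produce rate while a batch is running
BATCH_SECONDS = 3 * 60        # each batch lasts about 3 minutes
CYCLE_SECONDS = 5 * 60        # a new batch starts every 5 minutes
REPLICAS = 3                  # replication factor of the workload topic
NETWORK_FLOOR_MIB_S = 83.4    # low end of the "moderate" network range
EBS_MIB_S = 128.0             # gp2 volume throughput figure


def steady_state_summary() -> dict:
    """Derive volume, idle time, and rough utilisation from the slide figures."""
    batch_volume = BATCH_RATE_MIB_S * BATCH_SECONDS
    # Every message is stored once per replica, so cluster-wide writes scale
    # with the replication factor.
    replicated_write = BATCH_RATE_MIB_S * REPLICAS
    return {
        "batch_volume_mib": batch_volume,
        "idle_seconds": CYCLE_SECONDS - BATCH_SECONDS,
        "average_rate_mib_s": batch_volume / CYCLE_SECONDS,
        "replicated_write_mib_s": replicated_write,
        # Assumption: compare against the lowest quoted network figure.
        "network_fraction": BATCH_RATE_MIB_S / NETWORK_FLOOR_MIB_S,
        # Assumption: worst case, all replicated writes hit one volume.
        "worst_case_disk_fraction": replicated_write / EBS_MIB_S,
    }


if __name__ == "__main__":
    for name, value in steady_state_summary().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions the steady-state workload uses only a small fraction of the quoted network and disk limits, and each cycle leaves about two minutes of slack before the next batch starts.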

12. Steady State Metrics

13. Hypothesis
“As the load on the Kafka cluster increases, the standard workload can continue to successfully process each batch of messages before the next batch begins.”
● How do we measure this?
○ Monitoring
■ Message/data rates
■ CPU/Memory/Net/Disk usage
○ Application status
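One way to turn the hypothesis into something checkable is to compare each batch's processing time with the interval between batch starts. The Python sketch below does this. The function names are hypothetical, the 5-minute interval comes from the steady-state slide, and the sample durations are invented for illustration; they are not results from the experiment.

```python
from typing import Optional, Sequence

BATCH_INTERVAL_S = 5 * 60  # a new batch starts every 5 minutes (steady state)


def first_late_batch(
    durations_s: Sequence[float], interval_s: float = BATCH_INTERVAL_S
) -> Optional[int]:
    """Return the index of the first batch that was still running when the
    next batch was due to start, or None if every batch finished in time."""
    for index, duration in enumerate(durations_s):
        if duration >= interval_s:
            return index
    return None


def hypothesis_holds(
    durations_s: Sequence[float], interval_s: float = BATCH_INTERVAL_S
) -> bool:
    """The hypothesis is supported only if no batch overran its interval."""
    return first_late_batch(durations_s, interval_s) is None


if __name__ == "__main__":
    # Invented sample data: batch durations in seconds as load increases.
    observed = [182.0, 195.5, 240.0, 310.2]
    late = first_late_batch(observed)
    if late is None:
        print("hypothesis supported: every batch finished in time")
    else:
        print(f"hypothesis disproved at batch {late}: {observed[late]} s")
```

A single overrunning batch is enough to disprove the hypothesis, which matches the principle of attempting to disprove rather than confirm.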

14. Introducing Variables
● How do we increase load? Enter Horus!
○ Scalable & configurable
○ Safety features
■ Halting could be triggered by
● Cluster or client metrics
● Other conditions
● Manual intervention
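The slide does not show how Horus implements its safety features, so the following is only a generic Python sketch of the idea: a load tool that checks a list of halt conditions, plus a manual stop flag, before continuing. Every name here (`HaltCondition`, `halt_reason`, the metric keys, and the thresholds) is hypothetical and is not Horus's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional, Sequence

Metrics = Mapping[str, float]


@dataclass(frozen=True)
class HaltCondition:
    """A named predicate over the latest cluster or client metrics."""
    name: str
    triggered: Callable[[Metrics], bool]


def halt_reason(
    metrics: Metrics,
    conditions: Sequence[HaltCondition],
    manual_stop: bool = False,
) -> Optional[str]:
    """Return why the load run should halt, or None if it may continue."""
    if manual_stop:
        return "manual intervention"
    for condition in conditions:
        if condition.triggered(metrics):
            return condition.name
    return None


# Hypothetical metric names and thresholds, for illustration only.
CONDITIONS = [
    HaltCondition(
        "under-replicated partitions",
        lambda m: m.get("under_replicated_partitions", 0) > 0,
    ),
    HaltCondition(
        "producer error rate",
        lambda m: m.get("producer_error_rate", 0.0) > 0.01,
    ),
]

if __name__ == "__main__":
    sample = {"under_replicated_partitions": 2, "producer_error_rate": 0.0}
    print(halt_reason(sample, CONDITIONS))
```

Checking the manual flag first means an operator can always stop the run regardless of what the metrics say, which is one way to keep the blast radius under control.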

15. Load Scaling Configuration
● 4 distinct, increasing client sets; 15 minutes each
● 5 partition, 5 replica topic
● 10 - 40 producers, in steps of 10
○ 500 msg/s; 1024 byte/msg
● 7 consumer groups; 3 consumers each
● 4.88 - 19.5 MiB/s total production traffic
● 34.18 - 136.72 MiB/s total consumer traffic
● Increased replication traffic
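The traffic figures on this slide follow directly from the producer settings, and the Python sketch below reproduces them. Function names are illustrative. The replication estimate is an assumption added here: the slide only says replication traffic increased, and the sketch assumes each of the 4 follower replicas fetches every produced byte once.

```python
MIB = 1024 * 1024


def production_mib_per_s(
    producers: int, msgs_per_s: int = 500, bytes_per_msg: int = 1024
) -> float:
    """Total produce traffic for a client set, in MiB/s."""
    return producers * msgs_per_s * bytes_per_msg / MIB


def consumer_mib_per_s(producers: int, consumer_groups: int = 7) -> float:
    """Each consumer group reads every message once."""
    return consumer_groups * production_mib_per_s(producers)


def replication_mib_per_s(producers: int, replicas: int = 5) -> float:
    """Assumed estimate: every follower replica copies each produced byte."""
    return (replicas - 1) * production_mib_per_s(producers)


if __name__ == "__main__":
    # The four client sets: 10, 20, 30, and 40 producers.
    for producers in range(10, 41, 10):
        print(
            f"{producers} producers: "
            f"produce {production_mib_per_s(producers):.2f} MiB/s, "
            f"consume {consumer_mib_per_s(producers):.2f} MiB/s, "
            f"replicate ~{replication_mib_per_s(producers):.2f} MiB/s"
        )
```

At 40 producers the consumer traffic alone (about 136.72 MiB/s) exceeds the quoted 83.4 to 107.3 MiB/s network range of a single t2.xlarge instance. That is not a problem in itself, since the load is spread across five brokers, but it indicates how far the final client set pushes beyond the steady state.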

16. Experiment Results

17. What’s Next?

18. The Future!
● Context sensitive
○ One size does not fit all
● Start small
● Start in non-production environment
● Minimize blast radius
● Unleash the Chaos!

19. References and Resources
● Rosenthal, Casey and Jones, Nora. Chaos Engineering: System Resiliency in Practice. 1st ed., O’Reilly, 2020.
● Hausmann, Steffen. “Best practices for right-sizing your Apache Kafka clusters to optimize performance and cost.” AWS Big Data Blog, 17 Mar. 2022, https://aws.amazon.com/blogs/big-data/best-practices-for-right-sizing-your-apache-kafka-clusters-to-optimize-performance-and-cost/
● https://principlesofchaos.org/
● https://www.verica.io
● https://www.thevoid.community/

20. Justin Fetherolf
Sr. Software Engineer
https://www.verica.io
