Kafka Streams Rebalances and Assignments:
The Whole Story
John Roesler
john@confluent.io
Alieh Saeedi
asaeedi@confluent.io
Agenda
1. Intro and setup
2. Background
3. Walkthrough
4. Takeaways
Ice Breaker
How many people here use Kafka Streams?
How many of you are scared of rebalances?
Pop quiz!
Scenario:
● Tuesday morning, notice app has a rebalance (State transition from RUNNING to
REBALANCING)
● Rebalance completes (State transition from REBALANCING to RUNNING)
● Another rebalance 10 minutes later
● Cycle continues all day
Is this:
1. Bad
2. Good
3. Not enough information
What will you do?
1. Wait for another day
2. Try restarting some/all instances
3. Nothing: my on-call shift is over!
Wait a Minute…
What is a Rebalance?!
Kafka Streams
Task Assignment
1. JoinGroup (Subscription)
2. SyncGroup (Assignment)
Rebalancing:
Moving task ownership from one instance to another
(because of changes in consumer group or topic subscriptions)
Why are Rebalances so important?
❌ In the normal course of events rebalances are fairly undesirable.
○ When partitions are moved from one consumer to another, the consumer loses its
current state;
■ if it was caching any data, it will need to refresh its caches
■ this slows down the application until the consumer sets up its state again.
✅ Rebalances provide the consumer group with high availability and elastic scalability
○ Easily and safely add and remove consumers
○ Survive consumer crash
Throughout this talk we will discuss how to safely handle rebalances and how to avoid
unnecessary ones.
How does Kafka make the rebalancing process less painful?
● 2.4: Optimistically continue processing during rebalances
○ Incremental Cooperative Protocol
● 2.6: Continue processing on stateful tasks while re-distributing state in the background
○ Smooth Scaling Protocol
Old Protocol (before 2.4, 2019)
Incremental Cooperative Protocol (2.4 in 2019)
Smooth Scaling Protocol (2.6 in 2020)
Pop quiz!
Scenario:
● Tuesday morning, notice app has a rebalance (State transition from RUNNING to
REBALANCING)
● Rebalance completes (State transition from REBALANCING to RUNNING), but some instances
don't get an assignment
● Another rebalance 10 minutes later
● Cycle continues all day
Is this:
1. Bad
2. Good
3. Not enough information
What will you do?
1. Wait for another day
2. Try restarting some/all instances
3. Nothing: my on-call shift is over!
How to dig in?
● First challenge: logs are spread over all 10 different instances
○ Collect all logs into a common logging platform
○ Collate by timestamp, but it is not always easy to see the right causal ordering
○ Solution: line them up by Generation ID
2022-09-14 14:33:44,991 INFO … Successfully joined group with generation Generation{generationId=3188, … }
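Lining logs up by Generation ID can be sketched as below. This is a minimal illustration, not a real tool: the instance names and the second log line are hypothetical, and the regex only assumes the `generationId=<number>` fragment shown on the slide.

```python
import re
from collections import defaultdict

# Matches the generation ID in the "Successfully joined group" log line.
GENERATION_RE = re.compile(r"generationId=(\d+)")


def joins_by_generation(logs_by_instance):
    """Map generation ID -> list of (instance, timestamp) that joined it."""
    joins = defaultdict(list)
    for instance, lines in logs_by_instance.items():
        for line in lines:
            match = GENERATION_RE.search(line)
            if match:
                # Assumes the line starts with "YYYY-MM-DD HH:MM:SS,mmm" (23 chars).
                timestamp = line[:23]
                joins[int(match.group(1))].append((instance, timestamp))
    # Sort by generation so the causal order is explicit.
    return dict(sorted(joins.items()))


# Hypothetical sample: instance names and the line for "C" are made up.
sample_logs = {
    "A": [
        "2022-09-14 14:33:44,552 INFO … Request joining group due to: group is already rebalancing",
        "2022-09-14 14:33:44,991 INFO … Successfully joined group with generation Generation{generationId=3188, … }",
    ],
    "C": [
        "2022-09-14 14:33:44,995 INFO … Successfully joined group with generation Generation{generationId=3188, … }",
    ],
}

for generation, members in joins_by_generation(sample_logs).items():
    print(generation, members)
```

Grouping on the generation rather than the timestamp avoids being misled by clock skew between instances.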
Cluster state in each generation
● Second challenge: Keeping track of which instances are in the cluster and the
tasks that they're assigned
○ Solution: find this log line: Assigned tasks […] including stateful [...] to
clients as:...
1f061087-7be9-48ef-9300-f1e8d88027ef=[activeTasks: ([0_0]) standbyTasks: ([0_1])]
f7420f7b-4a31-4a79-bf9b-279c259414a3=[activeTasks: ([0_1]) standbyTasks: ([0_2])]
0f134dd0-4ae3-4603-88d5-fe097fc9e9f5=[activeTasks: ([0_2]) standbyTasks: ([0_0])]
For readability, the rest of the walkthrough abbreviates these client IDs as A, B, and C:
A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]
B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0])]
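Tracking cluster state per generation is easier once the assignment lines are parsed into a structure. The sketch below assumes the line layout exactly as printed on the slide; the function names and the A/B/C relabeling helper are illustrative, not part of Kafka Streams.

```python
import re

# Layout assumed from the slide: <client>=[activeTasks: ([...]) standbyTasks: ([...])]
ASSIGNMENT_RE = re.compile(
    r"^(?P<client>\S+)\s*=\[activeTasks: \(\[(?P<active>[^\]]*)\]\) "
    r"standbyTasks: \(\[(?P<standby>[^\]]*)\]\)\]$"
)


def parse_tasks(raw):
    """Turn "0_0, 0_1" into ["0_0", "0_1"]; an empty string yields []."""
    return sorted(task.strip() for task in raw.split(",") if task.strip())


def parse_assignment(lines):
    """Map client ID -> {"active": [...], "standby": [...]}."""
    assignment = {}
    for line in lines:
        match = ASSIGNMENT_RE.match(line.strip())
        if match:
            assignment[match.group("client")] = {
                "active": parse_tasks(match.group("active")),
                "standby": parse_tasks(match.group("standby")),
            }
    return assignment


def short_names(assignment):
    """Relabel long client IDs as A, B, C, ... in first-seen order."""
    return {chr(ord("A") + i): tasks for i, tasks in enumerate(assignment.values())}


sample_lines = [
    "1f061087-7be9-48ef-9300-f1e8d88027ef=[activeTasks: ([0_0]) standbyTasks: ([0_1])]",
    "f7420f7b-4a31-4a79-bf9b-279c259414a3=[activeTasks: ([0_1]) standbyTasks: ([0_2])]",
    "0f134dd0-4ae3-4603-88d5-fe097fc9e9f5=[activeTasks: ([0_2]) standbyTasks: ([0_0])]",
]

print(short_names(parse_assignment(sample_lines)))
```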
Start at the beginning
● As with any troubleshooting, try to find the first event.
● In our case, the cluster ran for a while without rebalancing and then had a burst of
rebalances, so we'll figure out the generation and cause of that first rebalance
2022-09-14 14:33:44,552 INFO … Request joining group due to: group is already rebalancing
2022-09-14 14:33:44,991 INFO … Successfully joined group with generation Generation{generationId=3188, … }
Generation 3188
What happened?
● It looks like one of the instances (B) dropped out of the group
Generation 3188 (assignment before and after the rebalance):
A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]
B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0])]
…
2022-09-14 14:33:44,552 INFO … Request joining group due to: group is already rebalancing
A =[activeTasks: ([0_0, 0_1]) standbyTasks: ([0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0, 0_1])]
…
How do we react?
● B was active for 0_1 and standby for 0_2
● A had a standby for 0_1, so it will take over as active
● Need a new standby for 0_1, so C picks it up
● Need a new standby for 0_2, so A picks it up
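The reaction to B dropping out can be read directly off a diff of the two assignments. The sketch below is an illustration of that comparison, with the before and after states copied from the walkthrough; `diff_assignments` is a hypothetical helper, not a Kafka Streams API.

```python
def diff_assignments(before, after):
    """Summarize what changed between two generations' assignments."""
    dropped = sorted(set(before) - set(after))
    promoted = []      # (task, instance): standby that became active
    new_standbys = []  # (task, instance): standby the instance did not host before
    for instance, tasks in after.items():
        old = before.get(instance, {"active": [], "standby": []})
        for task in tasks["active"]:
            if task in old["standby"]:
                promoted.append((task, instance))
        for task in tasks["standby"]:
            if task not in old["standby"] and task not in old["active"]:
                new_standbys.append((task, instance))
    return {
        "dropped": dropped,
        "promoted": sorted(promoted),
        "new_standbys": sorted(new_standbys),
    }


before = {
    "A": {"active": ["0_0"], "standby": ["0_1"]},
    "B": {"active": ["0_1"], "standby": ["0_2"]},
    "C": {"active": ["0_2"], "standby": ["0_0"]},
}
after = {
    "A": {"active": ["0_0", "0_1"], "standby": ["0_2"]},
    "C": {"active": ["0_2"], "standby": ["0_0", "0_1"]},
}

print(diff_assignments(before, after))
```

For generation 3188 this reports B as dropped, 0_1 promoted on A, and new standbys for 0_1 on C and 0_2 on A.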
B is back!
● Generation 3189: B is back in the cluster
● Leave everything where it is while B warms up on 0_1 and 0_2
● Schedule a probing rebalance to check on B in 10 minutes
2022-09-14 14:50:44,552 INFO … Request joining group due to: group is already rebalancing
A =[activeTasks: ([0_0, 0_1]) standbyTasks: ([0_2])]
B =[activeTasks: ([]) standbyTasks: ([0_1, 0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0, 0_1])]
2022-09-14 14:50:44,552 INFO … Decided on assignment … with followup probing rebalance
Checking in on B
● Generation 3190: probing rebalance to check on the progress of the warmup
● Not ready yet
2022-09-14 15:00:44,552 INFO … Request joining group due to: Scheduled probing rebalance
2022-09-14 15:00:44,552 INFO … Decided on assignment … with followup probing rebalance
2022-09-14 15:00:44,552 INFO … Finished unstable assignment of tasks, a followup rebalance will be scheduled
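The probing check can be pictured as a comparison of each warming-up task's lag against a threshold. This is a simplified sketch, not the actual assignor logic: it assumes the Streams configs acceptable.recovery.lag (default 10000) and probing.rebalance.interval.ms (default 10 minutes), and the lag numbers are invented.

```python
# Defaults of the corresponding Streams configs, used here as plain constants.
ACCEPTABLE_RECOVERY_LAG = 10_000          # acceptable.recovery.lag
PROBING_REBALANCE_INTERVAL_MS = 600_000   # probing.rebalance.interval.ms


def probing_decision(warmup_lags, acceptable_lag=ACCEPTABLE_RECOVERY_LAG):
    """Decide whether the assignment is stable or needs another probing rebalance.

    warmup_lags maps task ID -> how many changelog records the warming-up
    copy still has to restore (hypothetical input for this sketch).
    """
    still_warming = sorted(
        task for task, lag in warmup_lags.items() if lag > acceptable_lag
    )
    if still_warming:
        return {
            "stable": False,
            "still_warming": still_warming,
            "next_probe_in_ms": PROBING_REBALANCE_INTERVAL_MS,
        }
    return {"stable": True, "still_warming": [], "next_probe_in_ms": None}


# Generation 3190 in the walkthrough: B is not caught up on 0_1 yet (made-up lags).
print(probing_decision({"0_1": 250_000, "0_2": 4_000}))
# A later probe, once B has caught up on both tasks.
print(probing_decision({"0_1": 500, "0_2": 0}))
```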
B is ready for action!
● Generation 3191: B is warmed up!
● Make B active on 0_1 (and swap A back to standby)
● Drop extra standbys (0_2 on A and 0_1 on C)
A =[activeTasks: ([0_0, 0_1]) standbyTasks: ([0_2])]
B =[activeTasks: ([]) standbyTasks: ([0_1, 0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0, 0_1])]
2022-09-14 15:10:44,552 INFO … Request joining group due to: Scheduled probing rebalance
A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]
B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0])]
Bonus round
● Generation 3192: surprise rebalance a few ms later!
● Can't swap ownership of 0_1 in one rebalance
● As soon as A revokes 0_1, it triggers another rebalance so B can start processing
2022-09-14 15:10:44,565 INFO … Request joining group due to tasks changing ownership
A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]
B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0])]
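Why the swap needs two rebalances can be shown with a small model of the cooperative rule: a task is only handed to a new owner once its previous owner has revoked it. The function below is a sketch of that rule for active tasks only, not the real assignor.

```python
def cooperative_round(owned, target):
    """Run one cooperative rebalance round for active tasks.

    owned:  instance -> set of active tasks it currently holds
    target: instance -> set of active tasks it should end up with
    Returns (assignment for this round, whether a follow-up rebalance is needed).
    """
    current_owner = {task: inst for inst, tasks in owned.items() for task in tasks}
    assignment = {}
    followup_needed = False
    for instance, tasks in target.items():
        granted = set()
        for task in tasks:
            owner = current_owner.get(task)
            if owner is None or owner == instance:
                granted.add(task)
            else:
                # Still held elsewhere: the old owner revokes it in this round,
                # and the new owner receives it in the follow-up rebalance.
                followup_needed = True
        assignment[instance] = granted
    return assignment, followup_needed


owned = {"A": {"0_0", "0_1"}, "B": set(), "C": {"0_2"}}
target = {"A": {"0_0"}, "B": {"0_1"}, "C": {"0_2"}}

round_one, needs_followup = cooperative_round(owned, target)
print(round_one, needs_followup)   # A gives up 0_1, B does not have it yet

round_two, needs_followup = cooperative_round(round_one, target)
print(round_two, needs_followup)   # B now owns 0_1
```

The first round corresponds to generation 3191 and the second to the "surprise" generation 3192.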
Recap
● Although we saw five rebalances, there was really just one bad event.
● All the others were just the cluster healing itself
● Longest outage was detecting the failure.
○ Task 0_1 was down for session.timeout.ms when instance B timed out
● It's not always easy to identify the first failure based on timing, but you can
eliminate probing and cooperative rebalances
● Sometimes there's more than one failure going on at the same time
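Eliminating probing and cooperative rebalances can be done mechanically from the join reason in the log. The sketch below relies only on the reason strings shown in this walkthrough; the category labels are made up for illustration.

```python
import re

# Captures the text after "Request joining group due to" (with or without a colon).
REASON_RE = re.compile(r"Request joining group due to:? (.*)$")


def classify(line):
    """Label a rebalance by its logged join reason, or None if the line has none."""
    match = REASON_RE.search(line)
    if not match:
        return None
    reason = match.group(1).strip()
    if "probing rebalance" in reason.lower():
        return "probing"                # scheduled warmup check, expected
    if "tasks changing ownership" in reason:
        return "cooperative follow-up"  # second half of a task handoff, expected
    return "needs a look"               # candidate for the original event


sample = [
    "2022-09-14 14:33:44,552 INFO … Request joining group due to: group is already rebalancing",
    "2022-09-14 15:00:44,552 INFO … Request joining group due to: Scheduled probing rebalance",
    "2022-09-14 15:10:44,565 INFO … Request joining group due to tasks changing ownership",
]

for line in sample:
    print(classify(line), "<-", line[:23])
```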
Pop quiz!
Scenario:
● Tuesday morning, notice app has a rebalance (State transition from RUNNING to
REBALANCING)
● Rebalance completes (State transition from REBALANCING to RUNNING), but some instances
don't get an assignment
● Another rebalance 10 minutes later
● Cycle continues all day
Is this:
1. Bad
2. (Mostly) Good
3. Not enough information
What will you do?
1. Wait for another day
2. Try restarting some/all instances
3. Nothing: my on-call shift is over!
Tuning the system
● More concurrent warmups reduce the total recovery time (e.g. for indeed it was 20
hours): a higher load for a shorter time is preferable to a moderate load for a long time
● Reduce task warmup time (by allowing higher throughput from the broker)
● Improve HA with more standbys
● Tune consumer group protocol settings to avoid missed polls, heartbeats, etc.: this makes
the group less sensitive (and less responsive)
What to monitor
● Poll, heartbeat
● State
● When to alert (SLA violation) vs. warn (frequent rebalances, etc.)
Tuning: Sensitivity Tradeoff
● Configs
○ max.poll.interval.ms
○ heartbeat.interval.ms, session.timeout.ms
● Less Sensitive: Prevent unnecessary rebalances
○ Occasional long I/O
○ Long GC pauses
○ Flaky networks
● More Sensitive: Detect failures faster
○ Might get unnecessary rebalance
○ Failover faster to meet uptime SLAs
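The tradeoff can be made concrete by comparing two configuration profiles. The values below are illustrative assumptions, not recommendations; the check follows the consumer documentation's guidance that heartbeat.interval.ms is typically no more than a third of session.timeout.ms.

```python
# Two hypothetical profiles using the consumer configs named above.
LESS_SENSITIVE = {
    "session.timeout.ms": 60_000,
    "heartbeat.interval.ms": 5_000,
    "max.poll.interval.ms": 600_000,
}
MORE_SENSITIVE = {
    "session.timeout.ms": 10_000,
    "heartbeat.interval.ms": 3_000,
    "max.poll.interval.ms": 120_000,
}


def check_timeouts(config):
    """Return a list of problems with the heartbeat/session relationship."""
    problems = []
    if config["heartbeat.interval.ms"] * 3 > config["session.timeout.ms"]:
        problems.append(
            "heartbeat.interval.ms should be at most 1/3 of session.timeout.ms"
        )
    return problems


def worst_case_detection_ms(config):
    """Upper bound on how long each kind of failure can go unnoticed."""
    return {
        # Instance crashed or partitioned: no heartbeats arrive at all.
        "crashed_instance": config["session.timeout.ms"],
        # Thread stuck in processing: heartbeats continue, poll() does not.
        "stuck_processing": config["max.poll.interval.ms"],
    }


for name, profile in (("less sensitive", LESS_SENSITIVE), ("more sensitive", MORE_SENSITIVE)):
    print(name, check_timeouts(profile), worst_case_detection_ms(profile))
```

The less sensitive profile tolerates a long GC pause without a rebalance, at the price of a longer outage for tasks on an instance that really did die.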
Tuning: Speed up probing rebalance phase
● Real Scenario!
○ App had 60 tasks
○ Warming up a task took 30-40 min
○ Migrating two tasks at a time
○ 60 tasks * 40 minutes / 2 warmups = 20 hours of probing rebalances
● Solution:
○ More concurrent warmups: max.warmup.replicas
■ reduce the total recovery time (instead of having moderate load for a long
time, have a higher load for a shorter time)
○ Reduce task warmup time (by allowing higher throughput from the broker)
■ max.poll.records, max.partition.fetch.bytes, fetch.max.bytes
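The recovery-time arithmetic from this scenario can be written out as a small function, which makes it easy to see the effect of raising max.warmup.replicas. It assumes every task takes the same time to warm up and that warmups run in full batches, which is a simplification.

```python
import math


def recovery_hours(num_tasks, warmup_minutes, max_warmup_replicas):
    """Estimate total time spent in probing rebalances, in hours."""
    # Tasks are migrated in batches of at most max_warmup_replicas.
    batches = math.ceil(num_tasks / max_warmup_replicas)
    return batches * warmup_minutes / 60


# The scenario from the slide: 60 tasks, 40 minutes each, 2 at a time.
print(recovery_hours(60, 40, 2))   # 20.0
# Same workload with more concurrent warmups (illustrative value).
print(recovery_hours(60, 40, 10))  # 4.0
```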
What to monitor?
● last-poll-seconds-ago (consumer-metrics): compare against max.poll.interval.ms
(Consumer config)
● last-heartbeat-seconds-ago (consumer-coordinator-metrics): compare against
session.timeout.ms and heartbeat.interval.ms (Consumer configs)
● last-rebalance-seconds-ago (consumer-coordinator-metrics): compare against
probing.rebalance.interval.ms (Streams config)
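One way to turn these metrics into signals is to compare each one against its matching config. The sketch below is an assumption-heavy illustration: the 80% warning fraction, the finding levels, and the sample metric values are all hypothetical, and how the metrics are scraped (JMX or otherwise) is left out.

```python
def evaluate(metrics, config, warn_fraction=0.8):
    """Return (level, message) findings from consumer metrics and configs."""
    findings = []
    poll_ms = metrics["last-poll-seconds-ago"] * 1000
    if poll_ms > warn_fraction * config["max.poll.interval.ms"]:
        findings.append(("warn", "poll gap is close to max.poll.interval.ms"))
    heartbeat_ms = metrics["last-heartbeat-seconds-ago"] * 1000
    if heartbeat_ms > warn_fraction * config["session.timeout.ms"]:
        findings.append(("warn", "heartbeat gap is close to session.timeout.ms"))
    rebalance_ms = metrics["last-rebalance-seconds-ago"] * 1000
    if rebalance_ms < config["probing.rebalance.interval.ms"]:
        # A recent rebalance may simply be a probing rebalance during warmup.
        findings.append(("info", "rebalanced within the last probing interval"))
    return findings


# Consumer and Streams defaults, written out as plain numbers.
config = {
    "max.poll.interval.ms": 300_000,
    "session.timeout.ms": 45_000,
    "probing.rebalance.interval.ms": 600_000,
}
# Hypothetical metric readings for one instance.
metrics = {
    "last-poll-seconds-ago": 290,
    "last-heartbeat-seconds-ago": 2,
    "last-rebalance-seconds-ago": 120,
}

for level, message in evaluate(metrics, config):
    print(level, message)
```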
Prediction: What will you see at the next Kafka Summit?
● KIP-848: The Next Generation of the Consumer Rebalance Protocol
○ Adds ability to compute assignments in the broker
● Add generation id to all rebalance logs
Go forth and rebalance!

Contenu connexe

Tendances

Mind the App: How to Monitor Your Kafka Streams Applications | Bruno Cadonna,...
Mind the App: How to Monitor Your Kafka Streams Applications | Bruno Cadonna,...Mind the App: How to Monitor Your Kafka Streams Applications | Bruno Cadonna,...
Mind the App: How to Monitor Your Kafka Streams Applications | Bruno Cadonna,...HostedbyConfluent
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Databricks
 
Managing multiple event types in a single topic with Schema Registry | Bill B...
Managing multiple event types in a single topic with Schema Registry | Bill B...Managing multiple event types in a single topic with Schema Registry | Bill B...
Managing multiple event types in a single topic with Schema Registry | Bill B...HostedbyConfluent
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
 
Flink SQL & TableAPI in Large Scale Production at Alibaba
Flink SQL & TableAPI in Large Scale Production at AlibabaFlink SQL & TableAPI in Large Scale Production at Alibaba
Flink SQL & TableAPI in Large Scale Production at AlibabaDataWorks Summit
 
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)Kai Wähner
 
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...confluent
 
Common issues with Apache Kafka® Producer
Common issues with Apache Kafka® ProducerCommon issues with Apache Kafka® Producer
Common issues with Apache Kafka® Producerconfluent
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraFlink Forward
 
Distributed stream processing with Apache Kafka
Distributed stream processing with Apache KafkaDistributed stream processing with Apache Kafka
Distributed stream processing with Apache Kafkaconfluent
 
Highly Available Kafka Consumers and Kafka Streams on Kubernetes with Adrian ...
Highly Available Kafka Consumers and Kafka Streams on Kubernetes with Adrian ...Highly Available Kafka Consumers and Kafka Streams on Kubernetes with Adrian ...
Highly Available Kafka Consumers and Kafka Streams on Kubernetes with Adrian ...HostedbyConfluent
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!Guido Schmutz
 
Spring Boot+Kafka: the New Enterprise Platform
Spring Boot+Kafka: the New Enterprise PlatformSpring Boot+Kafka: the New Enterprise Platform
Spring Boot+Kafka: the New Enterprise PlatformVMware Tanzu
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaJiangjie Qin
 
Using Queryable State for Fun and Profit
Using Queryable State for Fun and ProfitUsing Queryable State for Fun and Profit
Using Queryable State for Fun and ProfitFlink Forward
 
The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluen...
The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluen...The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluen...
The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluen...confluent
 
Airflow Clustering and High Availability
Airflow Clustering and High AvailabilityAirflow Clustering and High Availability
Airflow Clustering and High AvailabilityRobert Sanders
 
From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...
From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...
From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...confluent
 
Storage Capacity Management on Multi-tenant Kafka Cluster with Nurettin Omeroglu
Storage Capacity Management on Multi-tenant Kafka Cluster with Nurettin OmerogluStorage Capacity Management on Multi-tenant Kafka Cluster with Nurettin Omeroglu
Storage Capacity Management on Multi-tenant Kafka Cluster with Nurettin OmerogluHostedbyConfluent
 
Expose your event-driven data to the outside world using webhooks powered by ...
Expose your event-driven data to the outside world using webhooks powered by ...Expose your event-driven data to the outside world using webhooks powered by ...
Expose your event-driven data to the outside world using webhooks powered by ...HostedbyConfluent
 

Tendances (20)

Mind the App: How to Monitor Your Kafka Streams Applications | Bruno Cadonna,...
Mind the App: How to Monitor Your Kafka Streams Applications | Bruno Cadonna,...Mind the App: How to Monitor Your Kafka Streams Applications | Bruno Cadonna,...
Mind the App: How to Monitor Your Kafka Streams Applications | Bruno Cadonna,...
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
 
Managing multiple event types in a single topic with Schema Registry | Bill B...
Managing multiple event types in a single topic with Schema Registry | Bill B...Managing multiple event types in a single topic with Schema Registry | Bill B...
Managing multiple event types in a single topic with Schema Registry | Bill B...
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
Flink SQL & TableAPI in Large Scale Production at Alibaba
Flink SQL & TableAPI in Large Scale Production at AlibabaFlink SQL & TableAPI in Large Scale Production at Alibaba
Flink SQL & TableAPI in Large Scale Production at Alibaba
 
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)
 
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
 
Common issues with Apache Kafka® Producer
Common issues with Apache Kafka® ProducerCommon issues with Apache Kafka® Producer
Common issues with Apache Kafka® Producer
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
Distributed stream processing with Apache Kafka
Distributed stream processing with Apache KafkaDistributed stream processing with Apache Kafka
Distributed stream processing with Apache Kafka
 
Highly Available Kafka Consumers and Kafka Streams on Kubernetes with Adrian ...
Highly Available Kafka Consumers and Kafka Streams on Kubernetes with Adrian ...Highly Available Kafka Consumers and Kafka Streams on Kubernetes with Adrian ...
Highly Available Kafka Consumers and Kafka Streams on Kubernetes with Adrian ...
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!
 
Spring Boot+Kafka: the New Enterprise Platform
Spring Boot+Kafka: the New Enterprise PlatformSpring Boot+Kafka: the New Enterprise Platform
Spring Boot+Kafka: the New Enterprise Platform
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache Kafka
 
Using Queryable State for Fun and Profit
Using Queryable State for Fun and ProfitUsing Queryable State for Fun and Profit
Using Queryable State for Fun and Profit
 
The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluen...
The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluen...The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluen...
The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluen...
 
Airflow Clustering and High Availability
Airflow Clustering and High AvailabilityAirflow Clustering and High Availability
Airflow Clustering and High Availability
 
From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...
From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...
From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...
 
Storage Capacity Management on Multi-tenant Kafka Cluster with Nurettin Omeroglu
Storage Capacity Management on Multi-tenant Kafka Cluster with Nurettin OmerogluStorage Capacity Management on Multi-tenant Kafka Cluster with Nurettin Omeroglu
Storage Capacity Management on Multi-tenant Kafka Cluster with Nurettin Omeroglu
 
Expose your event-driven data to the outside world using webhooks powered by ...
Expose your event-driven data to the outside world using webhooks powered by ...Expose your event-driven data to the outside world using webhooks powered by ...
Expose your event-driven data to the outside world using webhooks powered by ...
 

Plus de HostedbyConfluent

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Renaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit LondonRenaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit LondonHostedbyConfluent
 
Evolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at TrendyolEvolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at TrendyolHostedbyConfluent
 
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking TechniquesEnsuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking TechniquesHostedbyConfluent
 
Exactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and KafkaExactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and KafkaHostedbyConfluent
 
Fish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit LondonFish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit LondonHostedbyConfluent
 
Tiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit LondonTiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit LondonHostedbyConfluent
 
Building a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And WhyBuilding a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And WhyHostedbyConfluent
 
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...HostedbyConfluent
 
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...HostedbyConfluent
 
Navigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka ClustersNavigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka ClustersHostedbyConfluent
 
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data PlatformApache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data PlatformHostedbyConfluent
 
Explaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy PubExplaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy PubHostedbyConfluent
 
TL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit LondonTL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit LondonHostedbyConfluent
 
A Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSLA Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSLHostedbyConfluent
 
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing PerformanceMastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing PerformanceHostedbyConfluent
 
Data Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and BeyondData Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and BeyondHostedbyConfluent
 
Code-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink AppsCode-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink AppsHostedbyConfluent
 
Debezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC EcosystemDebezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC EcosystemHostedbyConfluent
 
Beyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local DisksBeyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local DisksHostedbyConfluent
 

Kafka Streams Rebalances and Assignments: The Whole Story with Alieh Saeedi & John Roesler

  • 1. Kafka Streams Rebalances and Assignments: The Whole Story John Roesler john@confluent.io Alieh Saeedi asaeedi@confluent.io
  • 4. Ice Breaker
    How many people here use Kafka Streams?
    How many of you are scared of rebalances?
  • 7. Pop quiz!
    Scenario:
    ● Tuesday morning, notice app has a rebalance (State transition from RUNNING to REBALANCING)
    ● Rebalance completes (State transition from REBALANCING to RUNNING)
    ● Another rebalance 10 minutes later
    ● Cycle continues all day
    Is this:
    1. Bad
    2. Good
    3. Not enough information
    What will you do?
    1. Wait for another day
    2. Try restarting some/all instances
    3. Nothing: my on-call shift is over!
  • 9. Wait a Minute… What is a Rebalance?!
  • 13. Rebalancing: Moving task ownership from one instance to another (because of changes in consumer group or topic subscriptions)
  • 14. Why are Rebalances so important?
    ❌ In the normal course of events rebalances are fairly undesirable.
    ○ When partitions are moved from one consumer to another, the consumer loses its current state;
    ■ if it was caching any data, it will need to refresh its caches
    ■ slowing down the application until the consumer sets up its state again.
    ✅ Rebalances provide the consumer group with high availability and elastic scalability
    ○ Easily and safely add and remove consumers
    ○ Survive consumer crash
    Throughout this talk we will discuss how to safely handle rebalances and how to avoid unnecessary ones.
  • 15. How does Kafka make the rebalancing process less painful?
    ● 2.4: Optimistically continue processing during rebalances
    ○ Incremental Cooperative Protocol
    ● 2.6: Continue processing on stateful tasks while re-distributing state in the background
    ○ Smooth Scaling Protocol
  • 16. Old Protocol (before 2.4, 2019)
  • 19. Smooth Scaling Protocol (2.6 in 2020)
  • 21. Pop quiz!
    Scenario:
    ● Tuesday morning, notice app has a rebalance (State transition from RUNNING to REBALANCING)
    ● Rebalance completes (State transition from REBALANCING to RUNNING), but some instances don't get an assignment
    ● Another rebalance 10 minutes later
    ● Cycle continues all day
    Is this:
    1. Bad
    2. Good
    3. Not enough information
    What will you do?
    1. Wait for another day
    2. Try restarting some/all instances
    3. Nothing: my on-call shift is over!
  • 22. How to dig in?
    ● First challenge: logs are spread over all 10 different instances
    ● collect all logs into a common logging platform
    ● collate by timestamp, but not always easy to see the right causal ordering
    ○ Solution: line them up by Generation ID
    2022-09-14 14:33:44,991 INFO … Successfully joined group with generation Generation{generationId=3188, … }
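The "line them up by Generation ID" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it presumes logs from every instance have already been collected as `(instance, line)` pairs, the function name `joins_by_generation` is hypothetical, and the sample line for generation 3189 is the slide's line with only the id changed.

```python
import re
from collections import defaultdict

# Pulls the generation id out of lines shaped like the one quoted on the slide.
GENERATION_RE = re.compile(r"generationId=(\d+)")


def joins_by_generation(records):
    """records: iterable of (instance_name, log_line) pairs.

    Returns {generation_id: [instance_name, ...]} for every
    "Successfully joined group" line found, in input order.
    """
    joined = defaultdict(list)
    for instance, line in records:
        match = GENERATION_RE.search(line)
        if match and "Successfully joined group" in line:
            joined[int(match.group(1))].append(instance)
    return dict(joined)


# The first line is copied from the slide; instance names are placeholders.
SLIDE_LINE = ("2022-09-14 14:33:44,991 INFO … Successfully joined group "
              "with generation Generation{generationId=3188, … }")
records = [
    ("A", SLIDE_LINE),
    ("C", SLIDE_LINE),
    ("A", SLIDE_LINE.replace("3188", "3189")),  # hypothetical later generation
    ("A", "some unrelated log line"),
]
by_generation = joins_by_generation(records)
```

Grouping by generation instead of by timestamp sidesteps clock skew between instances, which is the causal-ordering problem the slide points at.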
  • 23. Cluster state in each generation
    ● Second challenge: Keeping track of which instances are in the cluster and the tasks that they're assigned
    ○ Solution: find this log line:
    Assigned tasks […] including stateful [...] to clients as:...
    1f061087-7be9-48ef-9300-f1e8d88027ef=[activeTasks: ([0_0]) standbyTasks: ([0_1])]
    f7420f7b-4a31-4a79-bf9b-279c259414a3=[activeTasks: ([0_1]) standbyTasks: ([0_2])]
    0f134dd0-4ae3-4603-88d5-fe097fc9e9f5=[activeTasks: ([0_2]) standbyTasks: ([0_0])]
  • 24. Cluster state in each generation
    ● Second challenge: Keeping track of which instances are in the cluster and the tasks that they're assigned
    ○ Solution: find this log line:
    Assigned tasks […] including stateful [...] to clients as:...
    A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]
    B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]
    C =[activeTasks: ([0_2]) standbyTasks: ([0_0])]
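A small parser makes these per-client assignment lines easier to compare across generations. The sketch below assumes the lines look exactly like the ones on the slide; the real log format may differ between Kafka Streams versions, and `parse_assignment` is a hypothetical helper name.

```python
import re

# Assumes the "<client>=[activeTasks: ([...]) standbyTasks: ([...])]" shape
# shown on the slide; adjust the pattern if your version logs it differently.
ASSIGNMENT_RE = re.compile(
    r"(?P<client>\S+)\s*=\[activeTasks: \(\[(?P<active>[^\]]*)\]\) "
    r"standbyTasks: \(\[(?P<standby>[^\]]*)\]\)\]"
)


def _tasks(text):
    """Split "0_0, 0_1" into ["0_0", "0_1"]; an empty string gives []."""
    return [task.strip() for task in text.split(",") if task.strip()]


def parse_assignment(line):
    """Return (client, active_tasks, standby_tasks), or None if no match."""
    match = ASSIGNMENT_RE.search(line)
    if match is None:
        return None
    return (match.group("client"),
            _tasks(match.group("active")),
            _tasks(match.group("standby")))


parsed = parse_assignment("A =[activeTasks: ([0_0, 0_1]) standbyTasks: ([0_2])]")
```

Mapping the long client UUIDs to short names such as A, B, and C after parsing is what makes the later slides readable.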
  • 25. Start at the beginning
    ● As with any troubleshooting, try to find the first event.
    ● In our case, the cluster went a while before rebalancing, and then had a bunch of them,
    ● so we'll figure out the generation and cause of that first rebalance
  • 26. Start at the beginning
    Generation 3188:
    2022-09-14 14:33:44,552 INFO … Request joining group due to: group is already rebalancing
    2022-09-14 14:33:44,991 INFO … Successfully joined group with generation Generation{generationId=3188, … }
  • 27. What happened?
    ● It looks like one of the instances (B) dropped out of the group
    Generation 3188:
    A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]
    B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]
    C =[activeTasks: ([0_2]) standbyTasks: ([0_0])]
    …
    2022-09-14 14:33:44,552 INFO … Request joining group due to: group is already rebalancing
    A =[activeTasks: ([0_0, 0_1]) standbyTasks: ([0_2])]
    C =[activeTasks: ([0_2]) standbyTasks: ([0_0, 0_1])]
    …
  • 30. How do we react (0_2)?
    ● B was active for 0_1 and standby for 0_2
    ● A had a standby for 0_1, so it will take over as active
    ● Need a new standby for 0_1, so C picks it up
    ● Need a new standby for 0_2, so A picks it up
    Generation 3188:
    A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]
    B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]
    C =[activeTasks: ([0_2]) standbyTasks: ([0_0])]
    …
    2022-09-14 14:33:44,552 INFO … Request joining group due to: group is already rebalancing
    A =[activeTasks: ([0_0, 0_1]) standbyTasks: ([0_2])]
    C =[activeTasks: ([0_2]) standbyTasks: ([0_0, 0_1])]
    …
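The reasoning on this slide amounts to diffing two assignments. As a rough sketch (the function name and the dictionary layout are assumptions, not anything Kafka Streams provides), the two assignment snapshots from the slide can be compared like this:

```python
def diff_assignments(before, after):
    """before/after: {client: {"active": set, "standby": set}}.

    Returns a list of human-readable changes, ordered by client name.
    """
    empty = {"active": set(), "standby": set()}
    changes = []
    for client in sorted(set(before) | set(after)):
        if client not in after:
            changes.append(f"{client} left the group")
            continue
        if client not in before:
            changes.append(f"{client} joined the group")
        old, new = before.get(client, empty), after[client]
        for role in ("active", "standby"):
            for task in sorted(new[role] - old[role]):
                changes.append(f"{client} gained {role} {task}")
            for task in sorted(old[role] - new[role]):
                changes.append(f"{client} dropped {role} {task}")
    return changes


# Assignments copied from the slide, before and after B dropped out.
before = {
    "A": {"active": {"0_0"}, "standby": {"0_1"}},
    "B": {"active": {"0_1"}, "standby": {"0_2"}},
    "C": {"active": {"0_2"}, "standby": {"0_0"}},
}
after = {
    "A": {"active": {"0_0", "0_1"}, "standby": {"0_2"}},
    "C": {"active": {"0_2"}, "standby": {"0_0", "0_1"}},
}
changes = diff_assignments(before, after)
```

The resulting list mirrors the bullets: A is promoted from standby to active on 0_1, C picks up the new standby for 0_1, and A picks up the standby for 0_2.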
  • 33. B is back!
    ● Generation 3189: B is back in the cluster
    ● Leave everything where it is while B warms up on 0_1 and 0_2
    ● Schedule a probing rebalance to check on B in 10 minutes
    2022-09-14 14:50:44,552 INFO … Request joining group due to: group is already rebalancing
    A =[activeTasks: ([0_0, 0_1]) standbyTasks: ([0_2])]
    B =[activeTasks: ([]) standbyTasks: ([0_1, 0_2])]
    C =[activeTasks: ([0_2]) standbyTasks: ([0_0, 0_1])]
    2022-09-14 14:50:44,552 INFO … Decided on assignment … with followup probing rebalance
  • 35. Checking in on B
    ● Generation 3190: probing rebalance to check on the progress of the warmup
    ● Not ready yet
    2022-09-14 15:00:44,552 INFO … Request joining group due to: Scheduled probing rebalance
    2022-09-14 15:00:44,552 INFO … Decided on assignment … with followup probing rebalance
    2022-09-14 15:00:44,552 INFO … Finished unstable assignment of tasks, a followup rebalance will be scheduled
  • 38. B is ready for action!
    ● Generation 3191: B is warmed up!
    ● Make B active on 0_1 (and swap A back to standby)
    ● Drop extra standbys (0_2 on A and 0_1 on C)
    A =[activeTasks: ([0_0, 0_1]) standbyTasks: ([0_2])]
    B =[activeTasks: ([]) standbyTasks: ([0_1, 0_2])]
    C =[activeTasks: ([0_2]) standbyTasks: ([0_0, 0_1])]
    2022-09-14 15:10:44,552 INFO … Request joining group due to: Scheduled probing rebalance
    A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]
    B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]
    C =[activeTasks: ([0_2]) standbyTasks: ([0_0])]
  • 41. Bonus round
    ● Surprise rebalance a few ms later!
    ● Can't swap ownership of 0_1 in one rebalance
    ● As soon as A revokes 0_1, it triggers another rebalance so B can start processing
    Generation 3192:
    2022-09-14 15:10:44,565 INFO … Request joining group due to tasks changing ownership
    A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]
    B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]
    C =[activeTasks: ([0_2]) standbyTasks: ([0_3])]
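Why the ownership swap needs two rebalances can be illustrated with a toy model. This is a deliberately simplified sketch of the cooperative idea, not the actual Kafka Streams assignor: `cooperative_round` is a hypothetical function that withholds any task still owned by a different client and reports that a follow-up rebalance is needed.

```python
def cooperative_round(current_owner, target_owner):
    """current_owner/target_owner: {task: client} for active tasks.

    Returns (assignment_for_this_round, followup_needed). A task whose
    current owner differs from its target owner is left unassigned this
    round so the old owner can revoke it first.
    """
    this_round = {}
    followup_needed = False
    for task, new_client in target_owner.items():
        old_client = current_owner.get(task)
        if old_client is None or old_client == new_client:
            this_round[task] = new_client
        else:
            # Still owned elsewhere: revoke now, hand over next round.
            followup_needed = True
    return this_round, followup_needed


# Active-task ownership before and after the swap, as on the slides.
current = {"0_0": "A", "0_1": "A", "0_2": "C"}
target = {"0_0": "A", "0_1": "B", "0_2": "C"}

first, needs_followup = cooperative_round(current, target)
second, still_needed = cooperative_round(first, target)
```

In this model the first round only takes 0_1 away from A, and the immediate second round gives it to B, which matches the "surprise" rebalance a few milliseconds later.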
  • 43. Recap
    ● Although we saw five rebalances, there was really just one bad event.
    ● All the others were just the cluster healing itself
    ● Longest outage was detecting the failure.
    ○ Task 0_1 was down for session.timeout.ms when instance B timed out
    ● It's not always easy to identify the first failure based on timing, but you can eliminate probing and cooperative rebalances
    ● Sometimes there's more than one failure going on at the same time
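Eliminating probing and cooperative rebalances can be done mechanically from the "Request joining group due to" lines. The sketch below assumes the reason strings match the ones shown on the earlier slides; the labels `probing`, `cooperative`, and `investigate` are names chosen here for illustration.

```python
MARKER = "Request joining group due to"


def classify_rebalance(line):
    """Label a rebalance-request log line, or return None if it is not one."""
    if MARKER not in line:
        return None
    # The slides show the reason both with and without a colon after "to".
    reason = line.split(MARKER, 1)[1].lstrip(": ").strip()
    if "probing rebalance" in reason:
        return "probing"          # scheduled check on warmup progress
    if "tasks changing ownership" in reason:
        return "cooperative"      # second step of a cooperative handover
    return "investigate"          # everything else deserves a closer look


# Reason lines copied from the walkthrough, generations 3188 to 3192.
lines = [
    "2022-09-14 14:33:44,552 INFO … Request joining group due to: group is already rebalancing",
    "2022-09-14 14:50:44,552 INFO … Request joining group due to: group is already rebalancing",
    "2022-09-14 15:00:44,552 INFO … Request joining group due to: Scheduled probing rebalance",
    "2022-09-14 15:10:44,552 INFO … Request joining group due to: Scheduled probing rebalance",
    "2022-09-14 15:10:44,565 INFO … Request joining group due to tasks changing ownership",
]
labels = [classify_rebalance(line) for line in lines]
```

Of the five rebalances, only the two membership changes remain to look at (B leaving, then B returning), and the first of those is the actual failure.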
  • 44. Pop quiz!
    Scenario:
    ● Tuesday morning, notice app has a rebalance (State transition from RUNNING to REBALANCING)
    ● Rebalance completes (State transition from REBALANCING to RUNNING), but some instances don't get an assignment
    ● Another rebalance 10 minutes later
    ● Cycle continues all day
    Is this:
    1. Bad
    2. (Mostly) Good
    3. Not enough information
    What will you do?
    1. Wait for another day
    2. Try restarting some/all instances
    3. Nothing: my on-call shift is over!
  • 46. (Alieh) Tuning the system (5-10 minutes) (old slide)
    https://docs.google.com/document/d/1XMVUlRJUHTuas2aUqP1nY_nOuorMM2w8JPhhzdBoszc/edit#heading=h.2j5e73a1tone
    Tuning in general is great, but ideally we'd tie it to the scenario somehow.
    ● more concurrent warmups -> reduce the total recovery time (e.g. for indeed it was 20 hours) -> didn't like moderate load for a long time; prefer higher load for a shorter time
    ● reduce task warmup time (by allowing higher throughput from the broker)
    ● improve HA -> more standbys
    ● consumer group protocol settings to avoid missed polls/heartbeats/etc. -> make it less sensitive (and less responsive)
    What to monitor
    ● poll, heartbeat
    ● state
    ● when to alert (SLA violation) vs. warn (frequent rebalances, etc.)
  • 47. Tuning: Sensitivity Tradeoff
    ● Configs
    ○ max.poll.interval.ms
    ○ heartbeat.interval.ms, session.timeout.ms
    ● Less Sensitive: Prevent unnecessary rebalances
    ○ Occasional long I/O
    ○ Long GC pauses
    ○ Flaky networks
    ● More Sensitive: Detect failures faster
    ○ Might get unnecessary rebalance
    ○ Failover faster to meet uptime SLAs
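To make the tradeoff concrete, here are two illustrative consumer config profiles with a basic sanity check. The values are placeholders chosen for this sketch, not recommendations from the talk; the one-third ratio between `heartbeat.interval.ms` and `session.timeout.ms` follows the general guidance in the Kafka consumer configuration documentation.

```python
# Illustrative values only; tune against your own SLAs and workload.
LESS_SENSITIVE = {
    "session.timeout.ms": 60_000,     # tolerate long GC pauses, flaky networks
    "heartbeat.interval.ms": 10_000,
    "max.poll.interval.ms": 600_000,  # tolerate occasional long I/O per poll loop
}
MORE_SENSITIVE = {
    "session.timeout.ms": 10_000,     # detect a dead instance sooner
    "heartbeat.interval.ms": 3_000,
    "max.poll.interval.ms": 60_000,
}


def heartbeat_ratio_ok(config):
    """Heartbeats should fit at least three times into the session timeout."""
    return config["heartbeat.interval.ms"] * 3 <= config["session.timeout.ms"]


def crash_detection_seconds(config):
    """Rough upper bound on how long a hard crash goes unnoticed."""
    return config["session.timeout.ms"] / 1000
```

With these placeholder numbers, the more sensitive profile notices a crashed instance in about 10 seconds instead of 60, at the price of rebalancing on shorter pauses.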
  • 48. Tuning: Speed up probing rebalance phase
    ● Real Scenario!
    ○ App had 60 tasks
    ○ Warming up a task took 30-40 min
    ○ Migrating two tasks at a time
    60 tasks * 40 minutes / 2 warmups = 20 hours of probing rebalances
    ● Solution:
    ● More concurrent warmups: max.warmup.replicas
    ○ reduce the total recovery time (instead of having moderate load for a long time, have a higher load for a shorter time)
    ● Reduce task warmup time (by allowing higher throughput from the broker)
    ○ max.poll.records, max.partition.fetch.bytes, fetch.max.bytes
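The 20-hour figure is simple arithmetic, and the same formula shows what raising `max.warmup.replicas` buys. The helper below is a back-of-the-envelope sketch: it assumes every task takes the same time to warm up and that warmups run in full batches, and the value 10 in the second call is a hypothetical setting.

```python
import math


def probing_recovery_hours(tasks, warmup_minutes, concurrent_warmups):
    """Estimate total warmup time when tasks migrate in fixed-size batches."""
    rounds = math.ceil(tasks / concurrent_warmups)
    return rounds * warmup_minutes / 60


slide_scenario = probing_recovery_hours(60, 40, 2)    # 30 rounds of 40 minutes
more_warmups = probing_recovery_hours(60, 40, 10)     # hypothetical setting
```

Under these assumptions the slide's scenario comes out at 20 hours, while ten concurrent warmups would take about 4 hours, trading a longer period of moderate load for a shorter period of higher load.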
  • 49. What to monitor?
    Diagram labels: Probing Rebalance, last.rebalance.seconds.ago; Last poll seconds ago, Default max.poll.interval.ms; Last heartbeat seconds ago, Default session.timeout.ms, Heartbeat interval
    ● last-poll-seconds-ago: consumer-metrics
    ● max.poll.interval.ms: Consumer config
    ● last-heartbeat-seconds-ago: consumer-metrics
    ● session.timeout.ms: Consumer config
    ● heartbeat.interval.ms: Consumer config
    ● last.rebalance.seconds.ago: consumer-coordinator-metrics
    ● probing.rebalance.interval.ms: Streams config
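One way to turn these metrics into warn and alert signals is to compare each "seconds ago" metric with the config that bounds it. The sketch below assumes the metric value has already been read (for example via JMX); the function names and the 50% warning threshold are hypothetical choices for illustration.

```python
def poll_budget_used(last_poll_seconds_ago, max_poll_interval_ms):
    """Fraction of max.poll.interval.ms consumed since the last poll."""
    return (last_poll_seconds_ago * 1000) / max_poll_interval_ms


def poll_status(fraction, warn_at=0.5):
    """Map the consumed fraction to ok / warn / alert."""
    if fraction >= 1.0:
        return "alert"   # the consumer is about to be, or was, kicked out
    if fraction >= warn_at:
        return "warn"    # processing is slow but still inside the budget
    return "ok"


MAX_POLL_INTERVAL_MS = 300_000  # illustrative value for this example
healthy = poll_status(poll_budget_used(30, MAX_POLL_INTERVAL_MS))
slow = poll_status(poll_budget_used(200, MAX_POLL_INTERVAL_MS))
stuck = poll_status(poll_budget_used(301, MAX_POLL_INTERVAL_MS))
```

The same comparison can be applied to last-heartbeat-seconds-ago against session.timeout.ms.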
  • 50. Prediction: What will you see in the next Kafka Summit?
    ● KIP-848: The Next Generation of the Consumer Rebalance Protocol
    ○ Adds ability to compute assignments in the broker
    ● Add generation id to all rebalance logs
  • 51. Go forth and rebalance! John Roesler john@confluent.io Alieh Saeedi asaeedi@confluent.io