Kafka Streams Rebalances and Assignments: The Whole Story with Alieh Saeedi & John Roesler

Kafka Streams Rebalances and Assignments:
The Whole Story
John Roesler
john@confluent.io
Alieh Saeedi
asaeedi@confluent.io

Agenda
1. Intro and setup
2. Background
3. Walkthrough
4. Takeaways

Ice Breaker
How many people here use Kafka Streams?
How many of you are scared of rebalances?

Pop quiz!
Scenario:
● Tuesday morning, notice app has a rebalance (State transition from RUNNING to REBALANCING)
● Rebalance completes (State transition from REBALANCING to RUNNING)
● Another rebalance 10 minutes later
● Cycle continues all day
Is this:
1. Bad
2. Good
3. Not enough information
What will you do?
1. Wait for another day
2. Try restarting some/all instances
3. Nothing: my on-call shift is over!

Wait a Minute…
What is a Rebalance?!

[Diagram: rebalance protocol. 1. JoinGroup (Subscription); 2. SyncGroup (Assignment); Task Assignment]

Rebalancing:
Moving task ownership from one instance to another
(because of changes in the consumer group or topic subscriptions)

Why are Rebalances so important?
❌ In the normal course of events, rebalances are fairly undesirable.
○ When partitions are moved from one consumer to another, the consumer loses its current state:
■ if it was caching any data, it will need to refresh its caches,
■ slowing down the application until the consumer sets up its state again.
✅ Rebalances provide the consumer group with high availability and elastic scalability
○ Easily and safely add and remove consumers
○ Survive consumer crashes
Throughout this talk we will discuss how to safely handle rebalances and how to avoid unnecessary ones.

How does Kafka make the rebalancing process less painful?
● 2.4: Optimistically continue processing during rebalances
○ Incremental Cooperative Protocol
● 2.6: Continue processing on stateful tasks while redistributing state in the background
○ Smooth Scaling Protocol

Old Protocol (before 2.4, 2019)

Incremental Cooperative Protocol (2.4 in 2019)

Smooth Scaling Protocol (2.6 in 2020)

Pop quiz!
Scenario:
● Tuesday morning, notice app has a rebalance (State transition from RUNNING to REBALANCING)
● Rebalance completes (State transition from REBALANCING to RUNNING), but some instances don't get an assignment
Is this:
1. Bad
2. Good
What will you do?

How to dig in?
● First challenge: logs are spread over all 10 different instances
● Collect all logs into a common logging platform
● Collate by timestamp, but it's not always easy to see the right causal ordering
○ Solution: line them up by Generation ID
2022-09-14 14:33:44,991 INFO … Successfully joined group with generation Generation{generationId=3188, … }
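Lining logs up by generation can be scripted once they are collected in one place. The following is a minimal sketch, assuming each log line starts with a timestamp in the format shown above and that the join line keeps the `generationId=` shape from this slide. The function name `joins_by_generation` and the instance labels are hypothetical, chosen only for illustration.

```python
import re
from collections import defaultdict

# Pulls the generation ID out of a "Successfully joined group" line.
JOINED = re.compile(
    r"Successfully joined group with generation Generation\{generationId=(\d+)"
)

def joins_by_generation(lines_by_instance):
    """Map generation ID -> list of (instance, timestamp) that joined it.

    lines_by_instance: dict of instance label -> iterable of log lines.
    Lines that are not join lines are ignored.
    """
    generations = defaultdict(list)
    for instance, lines in lines_by_instance.items():
        for line in lines:
            match = JOINED.search(line)
            if match:
                timestamp = line[:23]  # e.g. "2022-09-14 14:33:44,991"
                generations[int(match.group(1))].append((instance, timestamp))
    # Sort by generation so the causal order is visible at a glance.
    return dict(sorted(generations.items()))
```

Grouping by generation rather than by timestamp sidesteps clock skew between instances: every member that took part in the same rebalance reports the same generation ID.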

Cluster state in each generation
● Second challenge: Keeping track of which instances are in the cluster and the tasks that they're assigned
○ Solution: find this log line: Assigned tasks […] including stateful [...] to clients as:...
1f061087-7be9-48ef-9300-f1e8d88027ef=[activeTasks: ([0_0]) standbyTasks: ([0_1])]
f7420f7b-4a31-4a79-bf9b-279c259414a3=[activeTasks: ([0_1]) standbyTasks: ([0_2])]
0f134dd0-4ae3-4603-88d5-fe097fc9e9f5=[activeTasks: ([0_2]) standbyTasks: ([0_0])]

Cluster state in each generation
● To keep things readable, we'll refer to the three instances as A, B, and C
A =[activeTasks: ([0_0]) standbyTasks: ([0_1])]
B =[activeTasks: ([0_1]) standbyTasks: ([0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0])]
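The per-client assignment entries are regular enough to parse mechanically. Below is a rough sketch, assuming each entry has exactly the `client=[activeTasks: ([...]) standbyTasks: ([...])]` shape shown on these slides; `parse_assignment` and `short_names` are hypothetical helper names, not part of Kafka Streams.

```python
import re

# One assignment entry: client ID, active task list, standby task list.
ENTRY = re.compile(
    r"^\s*(?P<client>\S+)\s*=\[activeTasks: \(\[(?P<active>[^\]]*)\]\) "
    r"standbyTasks: \(\[(?P<standby>[^\]]*)\]\)\]"
)

def _tasks(text):
    """Split "0_0, 0_1" into a sorted list of task IDs ([] if empty)."""
    return sorted(t.strip() for t in text.split(",") if t.strip())

def parse_assignment(lines):
    """Return {client: {"active": [...], "standby": [...]}} for matching lines."""
    assignment = {}
    for line in lines:
        match = ENTRY.match(line)
        if match:
            assignment[match.group("client")] = {
                "active": _tasks(match.group("active")),
                "standby": _tasks(match.group("standby")),
            }
    return assignment

def short_names(assignment, aliases):
    """Rename clients to A, B, C, ... in first-seen order.

    `aliases` is a dict that is updated in place, so passing the same dict
    for every generation keeps the letters stable across rebalances.
    """
    renamed = {}
    for client, tasks in assignment.items():
        if client not in aliases:
            aliases[client] = chr(ord("A") + len(aliases))
        renamed[aliases[client]] = tasks
    return renamed
```

Keeping one `aliases` dict for the whole investigation means an instance that drops out and later rejoins with the same client ID gets its old letter back.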

Start at the beginning
● As with any troubleshooting, try to find the first event.
● In our case, the cluster went a while before rebalancing and then had a bunch of them, so we'll figure out the generation and cause of that first rebalance

2022-09-14 14:33:44,552 INFO … Request joining group due to: group is already rebalancing
2022-09-14 14:33:44,991 INFO … Successfully joined group with generation Generation{generationId=3188, … }

What happened?
● It looks like one of the instances (B) dropped out of the group
Generation 3188:
…
A =[activeTasks: ([0_0, 0_1]) standbyTasks: ([0_2])]
C =[activeTasks: ([0_2]) standbyTasks: ([0_0, 0_1])]
…
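Spotting who dropped out is a matter of comparing two consecutive assignments. Here is a small sketch, assuming both assignments are already available as dictionaries of the form `{member: {"active": [...], "standby": [...]}}` and that the assignment before generation 3188 was the three-instance A/B/C layout shown earlier. `diff_generations` is a hypothetical name.

```python
def diff_generations(before, after):
    """Compare two assignments.

    Returns (members that left, members that joined, moved active tasks),
    where moved maps task ID -> (previous owner, new owner).
    """
    left = sorted(set(before) - set(after))
    joined = sorted(set(after) - set(before))
    owner_before = {t: m for m, tasks in before.items() for t in tasks["active"]}
    owner_after = {t: m for m, tasks in after.items() for t in tasks["active"]}
    moved = {
        task: (owner_before.get(task), owner_after.get(task))
        for task in sorted(set(owner_before) | set(owner_after))
        if owner_before.get(task) != owner_after.get(task)
    }
    return left, joined, moved

# Assignment before the first rebalance (assumed) and in generation 3188.
before = {
    "A": {"active": ["0_0"], "standby": ["0_1"]},
    "B": {"active": ["0_1"], "standby": ["0_2"]},
    "C": {"active": ["0_2"], "standby": ["0_0"]},
}
gen_3188 = {
    "A": {"active": ["0_0", "0_1"], "standby": ["0_2"]},
    "C": {"active": ["0_2"], "standby": ["0_0", "0_1"]},
}
left, joined, moved = diff_generations(before, gen_3188)
```

With these inputs, `left` contains only B and `moved` shows active task 0_1 going from B to A, which matches the reading on the slide.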

How do we react?
● B was active for 0_1 and standby for 0_2

How do we react (0_1)?
● A had a standby for 0_1, so it will take over as active
● Need a new standby for 0_1, so C picks it up

How do we react (0_2)?
● Need a new standby for 0_2, so A picks it up

B is back!
● Generation 3189: B is back in the cluster
B =[activeTasks: ([]) standbyTasks: ([0_1, 0_2])]
2022-09-14 14:50:44,552 INFO … Decided on assignment … with followup probing rebalance

● Leave everything where it is while B warms up on 0_1 and 0_2
● Schedule a probing rebalance to check on B in 10 minutes

Checking in on B
● Generation 3190: probing rebalance to check on the progress of the warmup
● Not ready yet
2022-09-14 15:00:44,552 INFO … Request joining group due to: Scheduled probing rebalance
2022-09-14 15:00:44,552 INFO … Finished unstable assignment of tasks, a followup rebalance will be scheduled

B is ready for action!
● Generation 3191: B is warmed up!
● Make B active on 0_1 (and swap A back to standby)
● Drop extra standbys (0_2 on A and 0_1 on C)

Bonus round
● Generation 3192: surprise rebalance a few ms later!
2022-09-14 15:10:44,565 INFO … Request joining group due to tasks changing ownership
● Can't swap ownership of 0_1 in one rebalance
● As soon as A revokes 0_1, it triggers another rebalance so B can start processing
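The two-step handoff can be modelled in a few lines. This is a deliberately simplified sketch of the cooperative idea, not the Kafka Streams assignor: it assumes a task that changes owner is only revoked in the current round and handed to its new owner in the follow-up round. `cooperative_step` and the dictionaries are hypothetical.

```python
def cooperative_step(current_owners, target_owners):
    """Run one cooperative rebalance round.

    Both arguments map task ID -> owning member (None means unowned).
    Returns (owners after this round, whether a follow-up round is needed).
    """
    after = {}
    followup_needed = False
    for task, target in target_owners.items():
        current = current_owners.get(task)
        if current is None or current == target:
            # Task is free or stays put: it can be assigned right away.
            after[task] = target
        else:
            # Task must first be revoked from its current owner.
            after[task] = None
            followup_needed = True
    return after, followup_needed

# Active ownership before generation 3191 and the desired end state.
current = {"0_0": "A", "0_1": "A", "0_2": "C"}
target = {"0_0": "A", "0_1": "B", "0_2": "C"}

round_1, needs_followup = cooperative_step(current, target)   # generation 3191
round_2, needs_another = cooperative_step(round_1, target)    # generation 3192
```

In this model the first round leaves 0_1 unowned and asks for a follow-up, and the second round gives 0_1 to B, which is why the extra rebalance shows up a few milliseconds later.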

Recap
● Although we saw five rebalances, there was really just one bad event.
● All the others were just the cluster healing itself.
● Longest outage was detecting the failure.
○ Task 0_1 was down for session.timeout.ms when instance B timed out
● It's not always easy to identify the first failure based on timing, but you can eliminate probing and cooperative rebalances
● Sometimes there's more than one failure going on at the same time
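Eliminating probing and cooperative rebalances can be done by sorting the "Request joining group" lines by their stated reason. The sketch below assumes the reason strings look like the ones quoted in this walkthrough; other client versions may word them differently, so the marker table is an assumption to adjust, and `classify` is a hypothetical helper.

```python
import re

# The reason text follows "due to", with or without a colon.
REASON = re.compile(r"Request joining group due to:?\s*(.+)")

# Reasons seen in this walkthrough that are part of normal healing.
FOLLOW_UP_MARKERS = {
    "Scheduled probing rebalance": "probing",
    "tasks changing ownership": "cooperative",
}

def classify(line):
    """Return "probing", "cooperative", "investigate", or None.

    None means the line is not a "Request joining group" line at all.
    """
    match = REASON.search(line)
    if match is None:
        return None
    reason = match.group(1).strip()
    for marker, kind in FOLLOW_UP_MARKERS.items():
        if marker in reason:
            return kind
    return "investigate"
```

Whatever is left in the "investigate" bucket is the short list of rebalances worth tracing back to a root cause.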

Pop quiz!
Scenario:
● Tuesday morning, notice app has a rebalance (State transition from RUNNING to REBALANCING)
● Rebalance completes (State transition from REBALANCING to RUNNING), but some instances don't get an assignment
Is this:
1. Bad
2. (Mostly) Good
What will you do?

(Alieh) Tuning the system (5-10 minutes) (old slide)
https://docs.google.com/document/d/1XMVUlRJUHTuas2aUqP1nY_nOuorMM2w8JPhhzdBoszc/edit#heading=h.2j5e73a1tone
Tuning in general is great, but ideally we'd tie it to the scenario somehow.
● More concurrent warmups -> reduce the total recovery time (e.g. for indeed it was 20 hours) -> didn't like moderate load for a long time; prefer higher load for a shorter time
● Reduce task warmup time (by allowing higher throughput from the broker)
● Improve HA -> more standbys
● Consumer group protocol settings to avoid missed polls/heartbeats/etc. -> make it less sensitive (and less responsive)
What to monitor
● poll, heartbeat
● state
● when to alert (SLA violation) vs. warn (frequent rebalances, etc.)

Tuning: Sensitivity Tradeoff
● Configs
○ max.poll.interval.ms
○ heartbeat.interval.ms, session.timeout.ms
● Less Sensitive: Prevent unnecessary rebalances
○ Occasional long I/O
○ Long GC pauses
○ Flaky networks
● More Sensitive: Detect failures faster
○ Might get unnecessary rebalance
○ Failover faster to meet uptime SLAs
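The tradeoff can be made concrete by looking at how long each kind of failure may go unnoticed. This is a rough sketch under simple assumptions: a crashed or unreachable instance is detected once its session expires, and an instance that is alive but stuck in processing is detected once it misses the poll deadline. The function names and the sample values are illustrative only.

```python
def detection_bounds_ms(config):
    """Rough upper bounds on how long a failure can go unnoticed."""
    return {
        # Heartbeats stop, so the group notices when the session expires.
        "instance_gone": config["session.timeout.ms"],
        # Heartbeats continue, but poll() is not called in time.
        "stuck_processing": config["max.poll.interval.ms"],
    }

def heartbeat_ratio_ok(config):
    """Check the usual guidance: heartbeat interval at most 1/3 of the session timeout."""
    return config["heartbeat.interval.ms"] * 3 <= config["session.timeout.ms"]

# Illustrative values, not a recommendation.
config = {
    "session.timeout.ms": 45_000,
    "heartbeat.interval.ms": 3_000,
    "max.poll.interval.ms": 300_000,
}
bounds = detection_bounds_ms(config)
```

Raising either timeout makes the group more tolerant of GC pauses and flaky networks, at the price of a longer worst-case outage for the affected tasks.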

Tuning: Speed up probing rebalance phase
● Real Scenario!
○ App had 60 tasks
○ Warming up a task took 30-40 min
○ Migrating two tasks at a time
■ 60 tasks * 40 minutes / 2 warmups = 1200 minutes = 20 hours of probing rebalances
● Solution:
○ More concurrent warmups: max.warmup.replicas
■ Reduce the total recovery time (instead of having moderate load for a long time, have a higher load for a shorter time)
○ Reduce task warmup time (by allowing higher throughput from the broker)
■ max.poll.records, max.partition.fetch.bytes, fetch.max.bytes
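The 20-hour figure follows from simple arithmetic, which also shows what raising max.warmup.replicas could buy. A back-of-the-envelope sketch, assuming every task takes the same time to warm up and that warmups in one round run fully in parallel without slowing each other down (in practice, more concurrent warmups put more load on the brokers):

```python
import math

def total_warmup_minutes(tasks, minutes_per_task, max_warmup_replicas):
    """Estimate total migration time when tasks warm up in fixed-size rounds."""
    rounds = math.ceil(tasks / max_warmup_replicas)
    return rounds * minutes_per_task

# The scenario from the slide: 60 tasks, 40 minutes each, 2 at a time.
baseline = total_warmup_minutes(60, 40, 2)   # 1200 minutes = 20 hours
# Same workload with 6 concurrent warmups (illustrative value).
faster = total_warmup_minutes(60, 40, 6)     # 400 minutes
```

Under these assumptions, tripling the number of concurrent warmups cuts the recovery window to a third, which is the "higher load for a shorter time" trade described above.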

What to monitor?
[Chart: last-rebalance-seconds-ago across probing rebalances; last poll seconds ago against the default max.poll.interval.ms; last heartbeat seconds ago against the default session.timeout.ms and the heartbeat interval]
● last-poll-seconds-ago: consumer-metrics
● max.poll.interval.ms: Consumer config
● last-heartbeat-seconds-ago: consumer-coordinator-metrics
● session.timeout.ms: Consumer config
● heartbeat.interval.ms: Consumer config
● last-rebalance-seconds-ago: consumer-coordinator-metrics
● probing.rebalance.interval.ms: Streams config
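One way to turn these metrics into warnings is to compare each "seconds ago" value with the timeout it counts toward. The sketch below assumes the metric values have already been scraped into a dictionary (how they are collected is out of scope here). The `health` function and the 80% threshold are arbitrary illustrations, not a recommendation.

```python
def health(metrics, config, warn_fraction=0.8):
    """Return warnings for metrics that are close to their timeout.

    metrics: values in seconds, keyed by metric name.
    config: values in milliseconds, keyed by config name.
    warn_fraction: how close to the limit counts as worrying (assumed 0.8).
    """
    warnings = []
    poll_limit_s = config["max.poll.interval.ms"] / 1000
    if metrics["last-poll-seconds-ago"] >= warn_fraction * poll_limit_s:
        warnings.append("poll is close to max.poll.interval.ms")
    session_limit_s = config["session.timeout.ms"] / 1000
    if metrics["last-heartbeat-seconds-ago"] >= warn_fraction * session_limit_s:
        warnings.append("heartbeat is close to session.timeout.ms")
    return warnings

# Illustrative sample: the last poll was 250 s ago against a 300 s limit.
sample = health(
    {"last-poll-seconds-ago": 250, "last-heartbeat-seconds-ago": 2},
    {"max.poll.interval.ms": 300_000, "session.timeout.ms": 45_000},
)
```

Warnings like these are candidates for the "warn" tier, with actual SLA violations reserved for alerts.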

Prediction: What will you see at the next Kafka Summit?
● KIP-848: The Next Generation of the Consumer Rebalance Protocol
○ Adds ability to compute assignments in the broker
● Add generation id to all rebalance logs

Go forth and rebalance!
John Roesler
john@confluent.io
Alieh Saeedi
asaeedi@confluent.io
