Speaker: Jun Rao, VP of Apache Kafka and Co-founder of Confluent
The controller is the brain of Apache Kafka®. A big part of what the controller does is to maintain the consistency of the replicas and determine which replica can serve the clients, especially during individual broker failures.
In this talk, Jun will outline the main data flow in the controller—in particular, when a broker fails, how the controller automatically promotes another replica as the leader to serve the clients, and when a broker is started, how the controller resumes the replication pipeline in the restarted broker. Jun will then describe recent improvements to the controller that allow it to handle certain edge cases correctly and increase its performance, which allows for more partitions in a Kafka cluster.
Jun Rao is the co-founder of Confluent, a company that provides a streaming data platform on top of Apache Kafka. Previously, Jun was a senior staff engineer at LinkedIn, where he led the development of Kafka, and a researcher at IBM's Almaden Research Center, where he conducted research on database and distributed systems. Jun is the PMC chair of Apache Kafka and a committer of Apache Cassandra. He writes at https://cnfl.io/blog-jun-rao.
3. High-Level Data Flow in Replication
[Diagram: topic1-part1 replicated across brokers 1–3. (1) The producer sends a record to the leader replica on broker 1; (2) the follower replicas on brokers 2 and 3 fetch the record from the leader; (3) once all in-sync replicas have the record, the leader commits it; (4) the leader acks the producer. The consumer reads committed records from the leader.]
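The numbered steps can be sketched as a minimal, single-process model. All names here (`Replica`, `Partition`, `produce`) are illustrative, not Kafka's actual classes, and the network hops are collapsed into method calls: the leader appends, followers fetch, and the producer is acked only once every in-sync replica has the record.

```python
# Illustrative model of leader/follower replication and commit/ack.
# Hypothetical names; real Kafka does each step over the network.

class Replica:
    def __init__(self, broker_id):
        self.broker_id = broker_id
        self.log = []                 # this replica's copy of the partition log

class Partition:
    def __init__(self, leader, followers):
        self.leader = leader
        self.followers = followers

    def produce(self, record):
        """Steps 1-4: append to leader, followers fetch, commit, then ack."""
        self.leader.log.append(record)            # 1: producer -> leader
        for f in self.followers:                  # 2: followers fetch from leader
            f.log.append(record)
        committed = all(record in r.log           # 3: commit once all ISR have it
                        for r in [self.leader] + self.followers)
        return "ack" if committed else "retry"    # 4: ack the producer

# Usage: leader on broker 1, followers on brokers 2 and 3 (as in the diagram).
p = Partition(Replica(1), [Replica(2), Replica(3)])
print(p.produce("m1"))  # ack
```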
4. What’s the controller
• One broker in a cluster acts as the controller
• Monitor the liveness of brokers
• Elect new leaders on broker failure
• Communicate new leaders to brokers
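The election step can be sketched roughly as follows. This is a simplification, assuming (as Kafka's default unclean-leader-election-disabled behavior does) that the new leader is the first live replica, in assignment order, that is still in the partition's ISR; the function name and signature are illustrative.

```python
# Sketch of controller-side leader election on broker failure.
# Simplified; names are hypothetical, not Kafka's actual API.

def elect_leader(assigned_replicas, isr, live_brokers):
    """Return the new leader's broker id, or None if no live ISR member
    exists (in which case the partition goes offline)."""
    for broker in assigned_replicas:          # preserve assignment order
        if broker in isr and broker in live_brokers:
            return broker
    return None

# Broker 1 (the old leader) fails; the controller promotes broker 2.
print(elect_leader([1, 2, 3], isr={1, 2, 3}, live_brokers={2, 3}))  # 2
```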
7. Controlled shutdown
[Diagram: broker 0 hosts the controller, which coordinates through ZooKeeper. Before shutdown, broker 1 leads partitions t-0 and t-1 and broker 2 follows both. On SIGTERM, broker 1 sends a controlled-shutdown request to the controller (1); the controller elects broker 2 as the new leader of both partitions (2), records the new leader (broker 2) in ZooKeeper under /topics/t/0 and /topics/t/1 (3), communicates the new leaders to broker 2 (4), and finally lets broker 1 complete its shutdown (5).]
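The controller's core bookkeeping during a controlled shutdown can be sketched as below. This is a sketch under simplifying assumptions (the function name and data shapes are hypothetical): for each partition led by the shutting-down broker, pick another in-sync replica as leader; the real controller additionally writes each change to ZooKeeper and notifies the brokers.

```python
# Sketch of leadership movement during controlled shutdown.
# Hypothetical names; the real controller also updates ZooKeeper
# and sends leader-change requests to the remaining brokers.

def controlled_shutdown(shutting_down, leaders, isr):
    """Return an updated partition -> leader map with every partition
    led by `shutting_down` moved to another in-sync replica (or None
    if no other ISR member exists)."""
    new_leaders = {}
    for partition, leader in leaders.items():
        if leader == shutting_down:
            candidates = [b for b in isr[partition] if b != shutting_down]
            new_leaders[partition] = candidates[0] if candidates else None
        else:
            new_leaders[partition] = leader
    return new_leaders

# Broker 1 leads t-0 and t-1; both move to broker 2 on shutdown.
print(controlled_shutdown(
    shutting_down=1,
    leaders={"t-0": 1, "t-1": 1},
    isr={"t-0": [1, 2], "t-1": [1, 2]}))  # {'t-0': 2, 't-1': 2}
```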
8. Issues with controlled shutdown (pre 1.1)
• Writes to ZK are serial (impact: longer shutdown time)
• Communication of new leaders not batched (impact: client timeout)
[Diagram: the post-shutdown state from the previous slide: broker 2 now leads t-0 and t-1 and the controller on broker 0 issues the ZooKeeper writes to /topics/t/0 and /topics/t/1 one at a time, with a separate leader-change notification per partition.]
12. Issues with controller failover (pre 1.1)
• Reads from ZK are serial (impact: availability)
• Zombie old controller (impact: inconsistency)
[Diagram: the old controller dies; broker 2 claims the /controller path in ZooKeeper and becomes the new controller (1-2), then reloads partition state (/topic/t1/0 leader:1, /topic/t1/1 leader:3, …, /topic/t1/9 leader:2) from ZooKeeper one read at a time (3), while the zombie old controller may still send requests to the brokers.]
13. Performance improvements in 1.1
• Controller uses async ZK api for reads/writes
• Controller communicates new leaders to brokers in batches
[Diagram: old (serial) approach: the ZK operations for part 1 through part 4 complete one after another; new (pipelined) approach: the operations for part 1 through part 4 overlap in flight.]
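A back-of-envelope model shows why pipelining helps so much. In the serial case every operation pays a full round trip to ZooKeeper; pipelined, the round trips overlap, so the total is roughly one round trip plus the per-operation processing cost. The numbers below are purely illustrative assumptions, not measurements from the talk.

```python
# Illustrative latency model for serial vs. pipelined ZK operations.

def serial_time(n_ops, rtt):
    """Serial: each operation waits for the previous one's round trip."""
    return n_ops * rtt

def pipelined_time(n_ops, rtt, per_op):
    """Pipelined: requests overlap in flight; total is roughly one round
    trip plus the server-side per-operation processing time."""
    return rtt + n_ops * per_op

# Assumed numbers: 10,000 ZK writes, 2 ms round trip, 0.05 ms per-op cost.
print(serial_time(10_000, 0.002))              # ~20 seconds
print(pipelined_time(10_000, 0.002, 0.00005))  # ~0.5 seconds
```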
14. Controlled shutdown (post 1.1)
• Writes to ZK pipelined
• Communication of new leaders batched
[Diagram: the same controlled-shutdown flow: broker 1 hands leadership of t-0 and t-1 to broker 2 via the controller on broker 0, but now the ZooKeeper writes to /topics/t/0 and /topics/t/1 are pipelined and the new leaders are delivered to broker 2 in a single batched request.]
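Batching can be sketched as a grouping step: instead of one message per partition, all leader changes destined for the same broker are collapsed into one request. The function and data shapes below are hypothetical stand-ins for the controller's batched leader-change requests.

```python
from collections import defaultdict

# Sketch of batching leader updates per destination broker.
# Hypothetical names; real Kafka sends batched LeaderAndIsr requests.

def batch_leader_updates(updates, replicas):
    """Group leader changes into one message per destination broker.
    `updates` maps partition -> new leader; `replicas` maps partition ->
    brokers hosting a replica (each must learn the new leader)."""
    per_broker = defaultdict(dict)
    for partition, leader in updates.items():
        for broker in replicas[partition]:
            per_broker[broker][partition] = leader
    return dict(per_broker)

# Two partitions move to broker 2: one batched message per broker,
# instead of one message per partition per broker.
print(batch_leader_updates(
    updates={"t-0": 2, "t-1": 2},
    replicas={"t-0": [1, 2], "t-1": [1, 2]}))
```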
16. Results for controlled shutdown
• 5 ZK nodes and 5 brokers on different racks
• 25K topics, each with 1 partition and 2 replicas
• 10K partitions per broker
                          Kafka 1.0.0   Kafka 1.1.0
Controlled shutdown time  6.5 minutes   3 seconds
17. Results for controller failover
• 5 ZK nodes and 5 brokers on different racks
• 2K topics, each with 50 partitions and 1 replica
• Controller failover: reload 100K partitions from ZK
                   Kafka 1.0.0   Kafka 1.1.0
State reload time  28 seconds    14 seconds
18. Fencing zombie requests from controller
• Zombie controller
  • ZK session expiration: better handling in the controller (1.1)
  • Controller path deletion: writes to ZK conditioned on controller epoch (2.1)
• Zombie request from broker restart
  • Broker epoch (KIP-380 in 2.2)
  • Also fixed the missing ZK watcher issue
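The epoch-based fencing idea can be sketched as follows: every controller generation carries a monotonically increasing epoch, and a broker rejects any request whose epoch is older than the newest it has seen, so a zombie controller cannot overwrite newer state. The class and method names are illustrative, not Kafka's actual request handling code.

```python
# Sketch of epoch-based fencing of zombie controller requests.
# Hypothetical names; the real mechanism rides on Kafka's control requests.

class Broker:
    def __init__(self):
        self.controller_epoch = 0   # newest controller epoch seen so far
        self.state = {}             # partition -> leader, as told by controller

    def handle_leader_update(self, epoch, partition, leader):
        if epoch < self.controller_epoch:
            return "fenced"             # stale epoch: a zombie controller
        self.controller_epoch = epoch   # remember the newest epoch
        self.state[partition] = leader
        return "applied"

b = Broker()
print(b.handle_leader_update(2, "t-0", 1))  # applied (new controller, epoch 2)
print(b.handle_leader_update(1, "t-0", 3))  # fenced (zombie with old epoch 1)
print(b.state)  # {'t-0': 1}
```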
19. Summary
• Significant performance improvement in controller in 1.1
• Allow 10X more partitions in a Kafka cluster
• Better fencing of zombie requests from controller (1.1, 2.1, 2.2)
• More details and remaining work in KAFKA-5027