Copyright 2021, Confluent, Inc. All rights reserved. This document may not be reproduced in any manner without the express written permission of Confluent, Inc.
Agenda for today
Today we are going to cover:
● How to model latency in Apache Kafka and the existing tradeoff
● How to effectively measure Apache Kafka latency
● What you can do as a user to optimise your deployment effectively
Hopefully, after this talk you will take home a good toolset to effectively measure,
understand and optimise your system with latency in mind.
Tales of Apache Kafka latency
Measuring performance and latency in distributed systems is certainly not an easy task: there
are way too many moving parts.
The most important properties to consider in Apache Kafka are:
● Durability, Availability, Throughput and, of course, Latency
NOTE: it is not really possible to achieve great values in all of them at once!
The different latencies of Apache Kafka
Apache Kafka is a distributed system, and many different “latencies” can be measured.
Produce time
The time from when the application produces a record
(KafkaProducer.send()) until a request containing
the message is sent to an Apache Kafka broker.
Important configuration variables:
● batch.size
● linger.ms
● compression.type
● max.in.flight.requests.per.connection
By default the producer is optimised for latency.
Batching can improve throughput, but could introduce an artificial delay.
A batch might wait longer if the producer has already reached
max.in.flight.requests.per.connection for that broker.
The use of compression might help with both throughput and latency.
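As an illustration, here is a minimal sketch of the producer settings discussed above, built with plain java.util.Properties so it runs without a broker or the kafka-clients jar. The values are illustrative assumptions, not recommendations:

```java
import java.util.Properties;

public class ProduceTimeConfig {
    // Sketch of the produce-time knobs discussed above; values are illustrative.
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put("batch.size", "16384");       // upper bound on batch bytes, not a wait target
        props.put("linger.ms", "5");            // small artificial delay to improve batching
        props.put("compression.type", "lz4");   // can help both throughput and latency
        props.put("max.in.flight.requests.per.connection", "5");
        return props;
    }

    public static void main(String[] args) {
        producerProps().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Setting linger.ms back to 0 recovers the latency-first default at the cost of smaller batches.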
Publish time
The time between when the producer sends a
batch of messages and when the corresponding
message gets appended to the log on the leader.
This time includes:
● network and I/O processing
● queue time (request and response queues)
With low load, most of the time is usually spent in
network and I/O processing.
As the brokers become more loaded, queue
time usually dominates.
Commit time
Kafka consumers can only read fully replicated
messages, so this accounts for all the time
necessary for a message to land in all
in-sync replicas.
Important configuration variables:
● replica.fetch.min.bytes
● replica.fetch.wait.max.ms
The time it takes a record to commit is
equal to the time it takes the slowest
in-sync follower to replicate it.
The default configuration is optimised for
latency.
Commit times are usually impacted by
replication factor and load.
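A hedged sketch of the broker-side settings named above, again using plain java.util.Properties so it runs standalone. The values shown are, to the best of my knowledge, the Kafka defaults:

```java
import java.util.Properties;

public class CommitTimeConfig {
    // Broker-side follower fetch tuning discussed above; values are the defaults.
    public static Properties brokerProps() {
        Properties props = new Properties();
        props.put("replica.fetch.min.bytes", "1");      // respond as soon as any data is available
        props.put("replica.fetch.wait.max.ms", "500");  // max time a follower fetch may be parked
        return props;
    }

    public static void main(String[] args) {
        brokerProps().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Raising replica.fetch.min.bytes would favour follower fetch batching at the cost of commit time, which is why the defaults are kept latency-friendly.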
Fetch time
The time it takes for a record to be fetched from
a partition; in Java, a successful call to the
KafkaConsumer.poll() method.
Important configuration variables:
● fetch.min.bytes
● fetch.max.wait.ms
The default configuration is optimised for
latency.
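The consumer-side counterpart as a runnable sketch, using plain java.util.Properties (the values shown are, to the best of my knowledge, the Kafka defaults):

```java
import java.util.Properties;

public class FetchTimeConfig {
    // Consumer-side fetch tuning discussed above; values are the defaults.
    public static Properties consumerProps() {
        Properties props = new Properties();
        props.put("fetch.min.bytes", "1");      // don't wait for data to accumulate on the broker
        props.put("fetch.max.wait.ms", "500");  // upper bound on broker-side wait when no data
        return props;
    }

    public static void main(String[] args) {
        consumerProps().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

A throughput-oriented consumer would raise fetch.min.bytes so each poll returns bigger batches, accepting up to fetch.max.wait.ms of extra delay.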
The distributed system fallacy
Durability vs Latency
Acknowledgements (acks)
If the broker becomes slower at returning acknowledgements, producing throughput usually
decreases, as the waiting time increases (max.in.flight.requests.per.connection).
Using acks=all usually means increasing the number of producers.
Configuring min.insync.replicas is important for availability; however, it is not relevant for
latency, as replication happens for all in-sync replicas either way and so does not impact the commit time.
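To make the tradeoff concrete, a minimal sketch of the two ends of the acks spectrum, again with plain java.util.Properties; the min.insync.replicas value of 2 is an illustrative assumption:

```java
import java.util.Properties;

public class AcksConfig {
    // "1": leader ack only, lower produce latency.
    // "all": wait for all in-sync replicas, higher durability, higher latency.
    public static Properties forAcks(String acks) {
        Properties props = new Properties();
        props.put("acks", acks);
        if ("all".equals(acks)) {
            // Topic/broker-side setting: how many in-sync replicas must exist for
            // a write to succeed. Matters for availability, not for commit latency.
            props.put("min.insync.replicas", "2");
        }
        return props;
    }

    public static void main(String[] args) {
        System.out.println(forAcks("all"));
    }
}
```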
Throughput vs Latency, the eternal question
Improving batching without artificial delays
When applications produce messages that are not sent to the same partition, batching suffers,
as those messages cannot be grouped together. So it is better to make applications aware of
this when deciding which key to use.
If this is not possible, since AK 2.4 you can take advantage of the sticky partitioner (KIP-480).
This partitioner will “stick” to a partition until a batch is full, making better use of batching.
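As a sketch: since AK 2.4 the default partitioner is already sticky for records with a null key, and KIP-480 also added UniformStickyPartitioner, which applies the sticky behaviour even to keyed records. The snippet below only builds the configuration with plain java.util.Properties, so it runs without the kafka-clients jar:

```java
import java.util.Properties;

public class StickyPartitionerConfig {
    // Records with a null key get sticky partitioning by default since AK 2.4.
    // UniformStickyPartitioner (KIP-480) forces sticky behaviour even for keyed
    // records, trading key-based ordering for better batching.
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put("partitioner.class",
                  "org.apache.kafka.clients.producer.UniformStickyPartitioner");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(producerProps().getProperty("partitioner.class"));
    }
}
```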
What about the number of clients?
More clients generally mean more load on the brokers: even if the total throughput remains the
same, there will be more metadata requests and connections to handle.
More clients also have an impact on tail latency, as they increase the number of
produce and fetch requests sent to a Kafka broker at a time.
When more partitions could increase latencies
Partitions are the unit of scalability in Kafka, for both reading and writing.
However, too many partitions can have a negative impact on latency: more partitions can
mean worse batching performance, more replication overhead, bigger metadata
requests, larger commit times and increased CPU load.
This can increase end-to-end latency for all clients, including the ones using topics with
fewer partitions.