Here are slides from my talk on introducing exactly once semantics to Apache Kafka. The talk was given at the Kafka Summit NYC, 8 May 2017.
The slides dive into the design of transactions in Apache Kafka.
15. 15
Why improve?
• Stream processing is becoming an ever bigger part of the
data landscape.
• Apache Kafka is the heart of the streams platform.
• Strengthening Kafka’s semantics expands the universe of
streaming applications.
33. 33
TL;DR
• Sequence numbers and producer ids:
• enable de-dup
• are in the log.
• Hence de-dup works transparently across leader
changes.
• Will not de-dup application-level resends.
• Works transparently – no API changes.
50. 50
Some notes on consuming transactions
• Two ‘isolation levels’ : read_committed, and
read_uncommitted.
• Messages read in offset order.
• read_committed consumers read to the point where there
are no open transactions.
51. 51
TL;DR
• Transaction coordinator and transaction log maintain
transaction state.
• Use the new producer APIs for transactions.
• Consumers can read only committed messages.
53. 53
What’s new, part 3: Performance boost!
• Up to +20% producer throughput
• Up to +50% consumer throughput
• Up to -20% disk utilization
• Savings start when you batch
• Details: https://bit.ly/kafka-eos-perf
61. 61
TL;DR
• With a batch size of 2, the new format starts saving
space.
• Savings are maximal for large batches of small
messages.
• Hence higher throughput when IO bound.
• Works as soon as you upgrade to the new format.
66. 66
Putting it together
• We understood Kafka’s existing delivery semantics
• Understood why we want to improve them
• Learned how these have been strengthened
• Learned how the new semantics work
67. 67
When is it available?
Available to try in Kafka 0.11, June 2017.
Stress the application level resends bit. Encourage people to rely on the producer retries and not re-send messages from their apps.
New concept – control messages.
Mention that commit markers are special message with log the producer id and the result of the transaction.
These messages are not passed on to application – the client interprets them and acts accordingly.
Mention ’read_uncommitted’
Mention that the buffering is broker side.
Mention transaction expiration.
Specify that you don’t need to use any of the new features to get these performance savings.
max.inflight is required for idempotence. It will cause a slowdown because you now have a sync producer.