2. Kafka
• Distributed, scalable, durable, fault-tolerant, high throughput publish-subscribe messaging system
• Originally developed at LinkedIn
• Created by Jay Kreps, Neha Narkhede and Jun Rao, who later founded Confluent
• Unified platform for handling all the real-time data feeds;
LinkedIn defines four categories of messages: queuing, metrics, logs and tracking data, each living in its own
cluster.
Common Use Cases
1. Stream Processing
2. Website Activity Tracking
3. Metrics Collection and Monitoring
4. Log Aggregation
3. Comparison with other systems
JMS, RabbitMQ, …                    | Apache Kafka
Push model                          | Pull model
Persistent message with TTL         | Retention policy
Guaranteed delivery                 | Guaranteed “consumability”
Hard to scale                       | Scalable
Fault tolerance: active-passive     | Fault tolerance: ISR (In-Sync Replicas)
4. • Topic
• Stream of records of particular category.
• Each record consists of a key, a value, and a timestamp
(from 0.10.0)
• Producer
• Publishing messages to a topic
• Broker
• Where the messages are stored
• Consumer
• Subscribes to one or more topics
• Consumes messages from the assigned partitions
• Multiple producers and consumers can publish and retrieve
messages at the same time.
LinkedIn runs over 1100 Kafka brokers organized into more than 60 clusters.
Key Components
[Diagram: a Kafka cluster in which a producer writes to brokers (Kafka Broker 1 … Kafka Broker n), each broker holding topic partitions (Topic Partition 0 … Topic Partition n); a consumer reads from the brokers, keeping metadata such as partition offsets; ZooKeeper coordinates the brokers]
5. Zookeeper and Kafka
• Electing a controller
• Cluster membership - tracking when a new broker is added or an existing broker fails
• Topic Configuration
• (0.9.0) - Quotas - how much data is each client allowed to read and write
• (0.9.0) - ACLs - who is allowed to read and write to which topic
• In ZooKeeper, all write requests are routed through the leader and changes are broadcast to all followers. This change
broadcast is termed atomic broadcast.
9. Message Format
Messages
• Messages are stored as key-value pairs
• Message ids are incremental but not consecutive
• Messages are identified by a 64-bit integer offset
CRC | Magic Byte | Attributes | Key Length | Key | Value Length | Value
10. Current Message Format
• MessageAndOffset => MessageSize Offset Message
MessageSize => int32
Offset => int64
• Message => Crc MagicByte Attributes KeyLength Key ValueLength Value
Crc => int32
MagicByte => int8
Attributes => int8
KeyLength => int32
Key => bytes
ValueLength => int32
Value => bytes
• The magic byte (int8) contains the version id of the message, currently set to 0.
• The attribute byte (int8) holds metadata attributes about the message. The lowest 2 bits contain the compression codec used for
the message. The other bits are currently set to 0.
• The key / value field can be omitted if the keylength / valuelength field is set to -1.
• For a compressed message, the offset field stores the last wrapped message's offset.
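As a concrete illustration, the layout above can be serialized with the JDK alone. This is a hypothetical sketch of the v0 format (the class name and helper are assumptions, not broker code); the CRC is computed over everything after the Crc field itself:

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Sketch of the v0 message layout: Crc, MagicByte, Attributes,
// KeyLength, Key, ValueLength, Value. Illustrative only.
public class MessageV0 {
    public static byte[] encode(byte[] key, byte[] value) {
        int keyLen = (key == null) ? -1 : key.length;       // -1 means "no key"
        int valueLen = (value == null) ? -1 : value.length; // -1 means "no value"
        int size = 4 + 1 + 1 + 4 + Math.max(keyLen, 0) + 4 + Math.max(valueLen, 0);
        ByteBuffer buf = ByteBuffer.allocate(size);
        buf.position(4);            // reserve room for the CRC field
        buf.put((byte) 0);          // magic byte: version 0
        buf.put((byte) 0);          // attributes: lowest 2 bits = no compression
        buf.putInt(keyLen);
        if (key != null) buf.put(key);
        buf.putInt(valueLen);
        if (value != null) buf.put(value);
        CRC32 crc = new CRC32();    // CRC over everything after the Crc field
        crc.update(buf.array(), 4, size - 4);
        buf.putInt(0, (int) crc.getValue());
        return buf.array();
    }
}
```

With a 1-byte key and a 1-byte value this yields 4+1+1+4+1+4+1 = 15 bytes; an omitted value drops its payload byte.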
11. Topic
• A topic is a stream of messages of a particular category
• A topic can have zero, one, or many consumers that subscribe to the data written to it
12. Partition
• A topic can be divided into partitions, which may be distributed across brokers.
• An ordered, immutable sequence of records that is continually appended to - a structured
commit log
13. Partition continued…
• Distribute the data across brokers (think sharding)
• Simplify parallelization
• Ensure sequencing of related messages[ordering]
• Kafka only provides a total order over messages within a partition, not between different
partitions in a topic
• Replicated
-rw-r--r-- 1 kafka kafka 0 Nov 4 16:03 edhTopic-45/00000000000000000000.log
-rw-r--r-- 1 kafka kafka 10485760 Nov 4 16:03 edhTopic-45/00000000000000000000.index
-rw-r--r-- 1 kafka kafka 0 Nov 4 16:03 edhTopic-61/00000000000000000000.log
-rw-r--r-- 1 kafka kafka 10485760 Nov 4 16:03 edhTopic-61/00000000000000000000.index
14. Log
• Simplest possible storage abstraction.
• A log is implemented as a set of segment files of approximately the same size (e.g., 1GB).
• Append-only, totally-ordered sequence of records ordered by time.
• A log, like a filesystem, is easy to optimize for linear read and write patterns.
• The log can group small reads and writes together into larger, high-throughput operations
15. Log continued…
• Every segment of a log (the *.log files) has its corresponding index (the *.index files) with the
same name, which represents the segment's base offset.
• The log file contains the actual messages structured in a message format.
• Messages begin with a 64-bit integer offset
• Properties:
• log.roll.hours
The maximum time before a new log segment is rolled out (in hours); secondary to the log.roll.ms
property
• log.roll.ms
The maximum time before a new log segment is rolled out (in milliseconds). If not set, the value in
log.roll.hours is used
• log.segment.bytes
The maximum size of a single log file
16. Segment File
• A segment with a base offset of [base_offset] would be stored in two files, a [base_offset].index
and a [base_offset].log file.
• The broker simply appends the message to the last (active) segment file.
• A segment file is flushed to the disk after a configurable number of messages has been published
• log.flush.interval.messages
• Or after a certain amount of time elapsed
• log.flush.interval.ms
• Messages are exposed to consumers only after they are flushed.
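Putting the segment-roll and flush properties together, a server.properties fragment might look like this (values are illustrative, not recommendations):

```properties
# roll a new segment after 1 GiB or 7 days, whichever comes first
log.segment.bytes=1073741824
log.roll.hours=168
# flush to disk after 10000 messages or 1 second, whichever comes first
log.flush.interval.messages=10000
log.flush.interval.ms=1000
```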
17. Index file
• Maps the logical offset of a message to its physical location in the log file
• Each entry in the index file holds just 2 fields, each of them 32 bits long:
• 4 Bytes: Relative Offset
• 4 Bytes: Physical Position
<relative offset, physical position>
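The lookup the broker performs against these (relative offset, physical position) pairs can be sketched as a binary search. This is a toy model with assumed names, not Kafka's code; it returns the byte position of the largest indexed offset at or below the target:

```java
// Toy offset index: a flat int array holding (relativeOffset, position)
// pairs in ascending offset order, searched the way the broker locates
// the starting file position for a requested offset.
public class OffsetIndex {
    private final int[] entries; // even slots: relative offset, odd slots: byte position

    public OffsetIndex(int[] entries) { this.entries = entries; }

    // Returns the byte position of the largest indexed offset <= relativeOffset.
    public int lookup(int relativeOffset) {
        int lo = 0, hi = entries.length / 2 - 1, best = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (entries[2 * mid] <= relativeOffset) {
                best = entries[2 * mid + 1]; // candidate position, look right
                lo = mid + 1;
            } else {
                hi = mid - 1;                // indexed offset too large, look left
            }
        }
        return best;
    }
}
```

From the returned position the broker scans the log file forward to the exact message, which is why the index can be sparse.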
21. Producer
• The producer sends data directly to the broker that is the leader for the partition, without any intervening
routing tier
• It requests metadata about which servers are alive and where the leaders for the partitions of a topic are at
any given time, so the producer can appropriately direct its requests.
The client controls which partition it publishes messages to. This can be done at random, implementing a kind of random load balancing, or it
can be done by some semantic partitioning function
23. Constructing a Kafka producer
• Bootstrap servers
• Key serializer
• Value serializer
Properties kafkaProps = new Properties();
kafkaProps.put("bootstrap.servers", "broker1:9092,broker2:9092");
kafkaProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
kafkaProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
24. Partitioning
• The client controls which partition it publishes messages to
• Random partitioning
• when the partitioning key is not specified or null, A producer will pick a random partition and
stick to it for some time (default is 10 mins) before switching to another one
• If a key exists and the default partitioner is used
• partition = hash(key) % no. of partitions
• Can implement custom partitioning
• Implement the interface org.apache.kafka.clients.producer.Partitioner
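A minimal sketch of key-based partitioning, using Java's hashCode() for brevity (an assumption for illustration; the real default partitioner in the Java client hashes the serialized key bytes with murmur2):

```java
// Illustrative partitioner: deterministically map a key to one of N partitions.
// Not Kafka's exact algorithm.
public class SimplePartitioner {
    public static int partition(String key, int numPartitions) {
        if (key == null) {
            // real producers pick a random "sticky" partition for null keys
            return 0;
        }
        // mask off the sign bit so the modulo result is never negative
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```

The key property is determinism: the same key always lands in the same partition, which is what gives per-key ordering.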
25. • Fire-and-forget
• send a message to the server and don’t really care whether it arrives
• Synchronous Send
• wait to find out if the send() was successful or not
• Asynchronous Send
• send messages asynchronously and still handle error scenarios:
• a callback function gets triggered when a response is received from the Kafka broker.
//fire-and-forget
producer.send(record);
//synchronous send
producer.send(record).get();
//asynchronous send with callback
producer.send(record, new DemoProducerCallback());
26. Batching
• Batching the data to be sent to same partition of a topic
• allows the accumulation of more bytes to send, and fewer, larger I/O operations on the servers.
• Requests sent to brokers will contain multiple batches, one for each partition with data available
to be sent.
• Fixed number of messages or to wait no longer than some fixed latency bound (say 10 ms)
• This can be achieved by:
batch.size = 16384
linger.ms = 5
27. Buffer memory
• Amount of memory the producer will use to buffer messages waiting to be sent to brokers.
• If records are sent faster than they can be delivered to the server, the producer will either block or
throw an exception, based on the preference specified by block.on.buffer.full (superseded by max.block.ms in newer clients).
• Property:
buffer.memory
28. Durability
• When the producer sends messages to Kafka it can require different levels of consistency:
• acks = 0
the producer doesn’t wait for confirmation
• acks = 1
wait for an acknowledgement from the leader
• acks = all
wait for acknowledgement from all ISR ~ message commit
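The batching, buffering and durability settings above can be combined in one producer configuration; an illustrative fragment (values are examples, not recommendations):

```properties
batch.size=16384        # max bytes per per-partition batch
linger.ms=5             # wait up to 5 ms for a batch to fill
buffer.memory=33554432  # 32 MB of producer-side buffer
acks=all                # wait for all in-sync replicas
```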
31. Consumer Poll Loop
• Handles coordination, partition
rebalances, heartbeats and data
fetching
• consumer.wakeup() is the safe way to break out of the poll loop from another thread
try {
    while (true) {
        // poll() returns after at most 100 ms if no data arrives
        ConsumerRecords<String, String> records = consumer.poll(100);
        for (ConsumerRecord<String, String> record : records) {
            // process each record
        }
    }
} finally {
    consumer.close();
}
32. How does a consumer consume?
• A consumer always consumes messages from a particular partition sequentially, and if the
consumer acknowledges a particular message offset, it implies that the consumer has consumed all
prior messages
• The consumer sends a pull request to the broker to have the bytes ready to consume
• Each request carries the offset of the message to consume from
33. Consumer Group
[Diagram: the four partitions of Topic T1, each assigned to one of the four consumers (Consumer 1 … Consumer 4) in Consumer Group g1]
When multiple consumers are subscribed to a topic and belong to the same consumer group, then each
consumer in the group will receive messages from a different subset of the partitions in the topic
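The one-partition-per-consumer idea can be modeled as a simple assignment function. This round-robin sketch is an illustration (names assumed; real assignors such as the range assignor differ in detail), mapping each partition to exactly one consumer in the group:

```java
// Toy partition assignment: spread partitions over group members
// round-robin, so every partition has exactly one owning consumer.
public class RoundRobinAssignment {
    public static int[] assign(int numPartitions, int numConsumers) {
        int[] owner = new int[numPartitions]; // owner[p] = consumer index for partition p
        for (int p = 0; p < numPartitions; p++) {
            owner[p] = p % numConsumers;
        }
        return owner;
    }
}
```

With 4 partitions and 4 consumers each consumer owns one partition; with fewer consumers, some own several, and extra consumers beyond the partition count sit idle.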
34. Committing Offsets
• Automatic Commit
• Allow the consumer to do it for you
• Commit Current Offset
• Exercise more control over the time offsets are
committed
• Offsets will only be committed when the
application explicitly chooses to do so
• Retry the commit until it either succeeds or
encounters a non-retriable failure
• Asynchronous commit
• commitAsync() will not retry.
• commitAsync() also gives you an option to pass
in a callback that will be triggered when the
broker responds
• The callback is commonly used to log commit failures
enable.auto.commit = true
auto.commit.interval.ms = 5000 [default: 5 seconds]
enable.auto.commit = false
consumer.commitSync();
enable.auto.commit = false
consumer.commitAsync(new DemoOffsetCommitCallback()); // implements OffsetCommitCallback
35. Message Delivery Semantics
• At Most Once
• Commit and process
• Might lose some messages when processing fails.
• At Least Once
• Process and Commit
• Might get Duplicates when Commit fails
• Exactly-Once
• With exactly-once semantics, messages are pulled one or more times, processed only once, and delivery is
guaranteed.
• Exactly-once semantics is ideal for operational applications, as it guarantees no duplicates or missing data.
• Many enterprise applications, like those used for credit card processing, require exactly-once semantics.
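The difference between the first two semantics comes down to the order of commit and process around a crash. This toy simulation (pure Java, no broker; all names are assumptions) replays a log from the last committed offset after a crash and shows the loss vs. duplicate outcomes:

```java
import java.util.ArrayList;
import java.util.List;

// Model a consumer that crashes on one record, then restarts once
// from its last committed offset.
public class DeliveryModel {
    // commitFirst = true  -> commit-then-process (at-most-once)
    // commitFirst = false -> process-then-commit (at-least-once)
    public static List<String> run(String[] log, int crashAt, boolean commitFirst) {
        List<String> processed = new ArrayList<>();
        int committed = 0;
        // records before the crash are handled normally
        for (int i = 0; i < crashAt; i++) {
            processed.add(log[i]);
            committed = i + 1;
        }
        // the crash record: only the first of the two steps completes
        if (commitFirst) {
            committed = crashAt + 1;      // committed, crash before processing: lost
        } else {
            processed.add(log[crashAt]);  // processed, crash before commit: will repeat
        }
        // restart: resume from the last committed offset
        for (int i = committed; i < log.length; i++) {
            processed.add(log[i]);
        }
        return processed;
    }
}
```

For a log {a, b, c} crashing on b: commit-then-process yields [a, c] (b lost), process-then-commit yields [a, b, b, c] (b duplicated).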
37. Retention policy
• For a specific amount of time (the most specific unit wins: ms over minutes over hours)
• log.retention.hours
• log.retention.minutes
• log.retention.ms
• For a specific total size of messages in a partition
• log.retention.bytes
38. Log Compaction
• Compaction is a process that retains, for each message key, just one message, usually the latest one.
• config/server.properties:
log.cleaner.enable=true
• To enable log cleaning on a particular topic
log.cleanup.policy=compact
39. key points about log compaction
• The “min.cleanable.dirty.ratio” is a setting at the topic and broker level that determines how
“dirty” a topic needs to be before it is cleaned. You can set it to 0.01 to be aggressive in cleaning
• Log compaction runs on its own threads, and it defaults to 1 thread. It isn’t unusual for a cleaner
thread to die.
• Compaction is done in the background by periodically recopying log segments
• Log compaction never happens on the LAST (active) segment. Segments can be rolled over based on
time or size, or both. The default time-based rollover is 7 days
40. How Compaction works
As of Kafka 0.9.0.1, Configuration parameter log.cleaner.enable is now true by default. This means topics with a
cleanup.policy=compact will now be compacted by default, and 128 MB of heap will be allocated to the cleaner
process via log.cleaner.dedupe.buffer.size
41. Deletion of messages
• Compaction also allows for deletes.
• A message with a key and a null payload will be treated as a delete from the log.
• This delete marker will cause any prior message with that key to be removed (as would any new
message with that key), but delete markers are special in that they will themselves be cleaned out
of the log after a period of time to free up space
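Compaction plus tombstones can be modeled as "keep the latest value per key; a null value deletes the key". A toy sketch of that retention rule (not broker code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Model what a compacted topic retains: the latest value per key,
// with a null value (tombstone) removing the key entirely.
public class CompactionModel {
    public static Map<String, String> compact(String[][] log) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (String[] record : log) {
            String key = record[0], value = record[1];
            if (value == null) {
                latest.remove(key);      // tombstone: delete the key
            } else {
                latest.put(key, value);  // newer value shadows older ones
            }
        }
        return latest;
    }
}
```

In the real broker the tombstone itself also lingers in the log for a retention period so that consumers replaying the topic see the delete.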