How to use Kafka for storing intermediate data and as a pub/sub model, with each of the Producer/Consumer/Topic configs covered in depth, along with its internal workings.
2. A lack of etiquette and manners is a huge turn-off.
KnolX Etiquettes
Punctuality
Join the session 5 minutes prior to
the session start time. We start on
time and conclude on time!
Feedback
Make sure to submit constructive
feedback for all sessions, as it is
very helpful for the presenter.
Silent Mode
Keep your mobile devices in silent
mode; feel free to step out of the
session in case you need to attend
an urgent call.
Avoid Disturbance
Avoid unwanted chit-chat during
the session.
4. Introduction
● Apache Kafka is used primarily to build real-time streaming data pipelines.
● It stores streams of data and is built on a pub/sub model.
● Some streams receive tens of thousands of records per second, while others receive only one or two records per hour.
● This makes Kafka a key tool for storing the intermediate data/events of a big data application.
● These streams of data are stored in Kafka as topics.
● Once a stream is stored in Kafka, it can be consumed by multiple applications for different use cases, such as storing it in a database or running analytics.
● Apache Kafka acts as a cushion between two applications, especially when the downstream application is slower.
6. Kafka Features
● Low Latency:- Kafka offers low latency, down to around 10 milliseconds.
● High Throughput:- thanks to its low latency, it can handle high-velocity, high-volume data.
● Fault Tolerance:- it handles node/machine failures within the cluster.
● Durability:- achieved through its replication feature.
● Distributed:- Kafka has a distributed architecture, which makes it scalable.
● Real-Time Handling:- Kafka can power real-time data pipelines.
● Batching:- Kafka supports batching and works for batch-like use cases.
7. Kafka Components
● Apache Kafka stores data streams in topics.
● The Producer API is used to write/publish data to a Kafka topic.
● The Consumer API is used to consume data from the topic for further use.
8. Kafka Topic
Similar to a table in a database, Kafka uses topics to organise the messages of a particular category.
Unlike a database table, a Kafka topic cannot be queried. Instead, we create a producer to write data and a
consumer to read it, in sequential order.
Data in a topic is deleted once its retention period expires.
Important Kafka topic configs:-
Number of Partitions:-
Replication Factor:-
Message Size:-
Log Cleanup Policy:-
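As a sketch, these configs come together at topic creation time. Assuming a broker on localhost:9092, an illustrative topic name, and example values (not recommendations), a topic could be created like this:

```shell
# Illustrative only: topic name and values are examples.
kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic orders \
  --partitions 3 \
  --replication-factor 3 \
  --config max.message.bytes=2097152 \
  --config cleanup.policy=delete
```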
9. Config Details
The number of partitions governs the parallelism of the application.
To do parallel computation we need multiple consumer instances, and since one partition cannot feed data
to multiple consumers within a group, we have to increase the partition count to achieve that parallelism.
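As a toy sketch (not Kafka's real partition assignor, which offers range, round-robin, and sticky strategies), dividing partitions among the consumers of one group shows why the partition count caps parallelism:

```python
# Toy round-robin assignment of partitions to consumers in one group.
# A consumer beyond the partition count simply receives nothing.
def assign(num_partitions, num_consumers):
    assignment = {c: [] for c in range(num_consumers)}
    for p in range(num_partitions):
        assignment[p % num_consumers].append(p)
    return assignment

# 3 partitions, 2 consumers: both get work.
print(assign(3, 2))  # {0: [0, 2], 1: [1]}
# 3 partitions, 4 consumers: consumer 3 sits idle.
print(assign(3, 4))  # {0: [0], 1: [1], 2: [2], 3: []}
```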
10. Continue _ _ _ _
Replication Factor is the number of copies of the data kept across different brokers. It helps us deal
with data loss when a broker goes offline or fails: the replicated copies serve the data.
In the typical case we use a replication factor of 3.
Increasing the replication factor further hurts performance, while keeping it lower risks
losing data.
Message Size:- Kafka has a default limit of 1 MB per message in the topic.
In a few scenarios we need to send messages larger than 1 MB. In that case we can raise
the maximum message size, e.g. to 10 MB:
replica.fetch.max.bytes=10485760
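The line above covers only the replica fetch size. As a sketch (values illustrative, ~10 MB each), raising the message limit usually means aligning the broker, replica, producer, and consumer settings together:

```properties
# Broker (server.properties): max size the broker accepts,
# and the max size replicas can fetch from the leader.
message.max.bytes=10485760
replica.fetch.max.bytes=10485760
# Producer: max size of a single request.
max.request.size=10485760
# Consumer: max bytes fetched per partition.
max.partition.fetch.bytes=10485760
```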
11. Continue _ _ _ _
The log cleanup policy makes sure that older messages in the topic get cleaned up, which
frees up disk space on the broker.
It is controlled by the following two configurations.
log.retention.hours :- The most common way Kafka retains messages is by time. The default is
specified in the configuration file using the log.retention.hours parameter, and it is set to
168 hours, the equivalent of one week.
log.retention.bytes :- Another way to expire messages is based on the total number of bytes of
messages retained. This value is set using the log.retention.bytes parameter, and it is applied per partition.
The default is -1, meaning there is no size limit and only the time limit is applied.
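The size-based policy can be sketched as a toy model: treat a partition as a list of segment files (oldest first) and delete from the head until the total is back under log.retention.bytes. (Real Kafka deletes whole segments and never the active one, which the guard below imitates.)

```python
# Toy model of size-based retention per partition.
def enforce_retention(segment_sizes, retention_bytes):
    segments = list(segment_sizes)  # oldest first
    # Never delete the last (active) segment.
    while len(segments) > 1 and sum(segments) > retention_bytes:
        segments.pop(0)             # delete the oldest segment
    return segments

# Four 100-byte segments with a 250-byte limit: the two oldest go.
print(enforce_retention([100, 100, 100, 100], 250))  # [100, 100]
```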
12. Kafka Producer
Once a topic has been created in Kafka, the next step is to send data into it. This is
where Kafka producers come in.
A Kafka producer sends messages to a topic, and messages are distributed to partitions
according to a mechanism such as key hashing.
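A minimal sketch of key hashing: the real Java producer uses murmur2 over the serialized key; zlib.crc32 stands in here purely to get a deterministic demo. The point is that the same key always maps to the same partition (as long as the partition count does not change), which preserves per-key ordering.

```python
import zlib

def partition_for(key, num_partitions):
    # Deterministic hash of the key, modulo the partition count.
    return zlib.crc32(key) % num_partitions

# The same key always lands on the same partition...
assert partition_for(b"user-42", 6) == partition_for(b"user-42", 6)
# ...but changing the partition count can move existing keys.
print(partition_for(b"user-42", 6), partition_for(b"user-42", 7))
```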
13. Continue _ _ _ _
Kafka messages are created by the producer. A Kafka message consists of the following elements:
14. Continue _ _ _ _
Key:- Optional in a Kafka message; it can be null. A key may be a string, number, or any object, and the
key is serialized into binary format.
Value:- Represents the content of the message and can also be null. The value format is arbitrary and is likewise
serialized into binary format.
Compression Type:- Kafka messages can be compressed. The compression options are none, gzip, lz4, snappy,
and zstd.
Headers:- Key-value pairs added to the message, used especially for tracing.
Partition + Offset:- Once a message is sent to a Kafka topic, it receives a partition number and an offset id. The
combination of topic + partition + offset uniquely identifies the message.
Timestamp:- Added to the message either by the user or by the system.
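The elements above can be pictured as one record structure. This is an illustrative shape only, with field names of our choosing, not the client library's API:

```python
from dataclasses import dataclass, field
from typing import Optional, Dict

@dataclass
class Record:
    key: Optional[bytes]                  # optional, may be None
    value: Optional[bytes]                # payload, may also be None
    headers: Dict[str, bytes] = field(default_factory=dict)
    partition: int = -1                   # assigned on write
    offset: int = -1                      # assigned on write
    timestamp_ms: int = 0                 # set by user or system

msg = Record(key=b"user-42", value=b'{"amount": 10}', headers={"trace": b"abc"})
print((msg.partition, msg.offset))  # (-1, -1) until the broker assigns them
```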
15. Continue _ _ _ _
// Must-have config (bootstrapServers and stringSerializer are defined elsewhere)
val producerProps: Map[String, String] = Map(
  "bootstrap.servers" -> bootstrapServers,
  "key.serializer" -> stringSerializer,
  "value.serializer" -> stringSerializer,
  // safe producer
  "enable.idempotence" -> "true",
  "acks" -> "all",
  "retries" -> Integer.MAX_VALUE.toString,
  "max.in.flight.requests.per.connection" -> "5",
  // high-throughput producer, at the expense of a bit of latency and CPU usage
  "compression.type" -> "snappy",
  "linger.ms" -> "20",
  "batch.size" -> (32 * 1024).toString // 32 KB
)
16. Ack
Safe Producer:-
acks is the number of brokers that need to acknowledge receiving a message before it is considered a
successful write.
acks=0 :- the producer considers a message "written successfully" the moment it is sent, without
waiting for the broker to accept it at all. This is the fastest approach, but data loss is possible.
acks=1 :- the producer considers a message "written successfully" once it is acknowledged by
the leader only.
acks=all :- the producer considers a message "written successfully" once it is accepted by all in-sync
replicas (ISR).
17. Retry
Retries ensure that messages are not dropped when sending to Apache Kafka fails.
For Kafka >= 2.1 the default retries value is Integer.MAX_VALUE: retries = 2147483647.
This doesn't mean the producer keeps retrying forever: the total attempt is bounded by delivery.timeout.ms.
The default setting for the timeout is 2 minutes: delivery.timeout.ms=120000.
max.in.flight.requests.per.connection :- to keep strict ordering without idempotence we have to
set the max in-flight requests to 1, but that hurts performance. With enable.idempotence=true
(as in the safe-producer config), ordering is preserved even at the default of 5, so 5 is the
high-performance choice.
18. Compression
Producers group messages into a batch before sending.
If the producer is sending compressed messages, all the messages in a single producer batch are compressed
together and sent as the "value" of a "wrapper message".
Compression is more effective the bigger the batch of messages being sent to Kafka.
The compression options are compression.type = none, gzip, lz4, snappy, and zstd.
19. Batching
By default, Kafka producers try to send records as soon as possible (linger.ms = 0).
If we want to increase throughput, we enable batching.
Batching is mainly controlled by two producer settings: linger.ms and batch.size.
The default batch.size is 16 KB; with linger.ms raised to, say, 20 ms, the producer waits
until the batch reaches 16 KB or 20 ms have elapsed, whichever comes first, before sending.
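The interplay of the two settings can be sketched as a toy flush decision (our own helper, not the client's API): a batch is sent when it reaches batch.size bytes or linger.ms has elapsed, whichever comes first.

```python
def flush_reason(batch_bytes, waited_ms, batch_size=16_384, linger_ms=20):
    # Returns why a batch would be flushed now, or None to keep filling it.
    if batch_bytes >= batch_size:
        return "size"
    if waited_ms >= linger_ms:
        return "linger"
    return None

print(flush_reason(20_000, 5))  # "size": batch filled before linger expired
print(flush_reason(1_000, 25))  # "linger": timer expired first
print(flush_reason(1_000, 5))   # None: keep accumulating
```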
20. Kafka Consumer
Applications that pull event data from one or more Kafka topics are known as Kafka consumers.
A consumer can read from one or more partitions at a time in Apache Kafka.
Data is read in order within each partition.
21. Delivery Semantics
A consumer reading from a Kafka partition may choose when to commit offsets.
At Most Once Delivery:- offsets are committed as soon as a message batch is received after calling poll().
If processing then fails, the message is lost; it will not be read again because its offset has already been
committed.
At Least Once Delivery:- to avoid losing messages, we commit the offset only after processing is done, but
retries can then lead to duplicate processing.
This is suitable for consumers that cannot afford any data loss.
Exactly Once Delivery:- here we want each message processed exactly once, with no duplicates. This applies
to payments and similar sensitive use cases. In Kafka Streams it is enabled with:
processing.guarantee=exactly_once_v2
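The difference between the first two semantics boils down to when the offset is committed relative to processing. A toy model (our own helper, not a Kafka API) of where a consumer resumes after crashing mid-record:

```python
def restart_offset(crash_offset, commit_first):
    # Offsets below crash_offset were fully processed; the consumer
    # dies while processing `crash_offset`.
    if commit_first:
        # At-most-once: offset committed before processing, so the
        # crashed record is skipped on restart (lost).
        return crash_offset + 1
    # At-least-once: commit happens after processing, so the crashed
    # record is read again on restart (possible duplicate).
    return crash_offset

print(restart_offset(2, commit_first=True))   # 3 -> offset 2 is lost
print(restart_offset(2, commit_first=False))  # 2 -> offset 2 is reprocessed
```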
22. Polling
Kafka consumers poll the Kafka broker to receive batches of data.
Polling allows consumers to control:-
● Where in the log they want to consume from
● How fast they want to consume
● The ability to replay events
The consumer sends a heartbeat at a regular interval (heartbeat.interval.ms). If it stops sending heartbeats,
the group coordinator waits for session.timeout.ms and then triggers a rebalance.
Consumers poll brokers periodically using the .poll() method. If two .poll() calls are separated by more than
max.poll.interval.ms, the consumer is removed from the group.
The default value of max.poll.interval.ms is 5 minutes.
23. Continue _ _ _
max.poll.records (default 500) :- Controls the maximum number of records a single call to poll() will fetch.
This is the key config controlling how fast data flows into our application.
If your application processes data slowly, set max.poll.records lower;
if it processes data quickly, you can set the value higher.
If processing a particular batch takes longer than max.poll.interval.ms, the consumer will be
removed from the group. To make sure this doesn't happen, reduce the max.poll.records value.
24. Auto Offsets Reset
● A consumer is expected to read from a log continuously.
● But due to a network error or an application bug, it may stop processing messages, and
● once the application restarts, the current offset may no longer be available (deleted due to the retention policy).
● The consumer then has two options: read from the beginning or read from the latest offset.
This behavior is controlled by auto.offset.reset, which can take the following values:-
latest :- the consumer reads messages from the tail of the partition
earliest :- the consumer reads from the oldest available offset in the partition
none :- throw an exception to the consumer if no previous offset is found
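Pulling the consumer settings discussed above together, a minimal configuration sketch might look like this (broker address, group id, and values are illustrative):

```properties
bootstrap.servers=localhost:9092
group.id=analytics-app
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=org.apache.kafka.common.serialization.StringDeserializer
# Read from the beginning if no committed offset exists.
auto.offset.reset=earliest
# Records per poll() call; lower this if processing is slow.
max.poll.records=500
# Max gap between poll() calls before a rebalance (5 minutes).
max.poll.interval.ms=300000
heartbeat.interval.ms=3000
session.timeout.ms=45000
```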