Kafka Deep Dive

How to use Kafka for storing intermediate data and as a pub/sub model, with a deep look at the Producer, Consumer, and Topic configs and its internal workings.


1. Deep Dive into Kafka for Big Data Applications
Presented By: Amit Kumar

2. KnolX Etiquette
Lack of etiquette and manners is a huge turn-off.
● Punctuality: Join the session 5 minutes before the start time. We start on time and conclude on time!
● Feedback: Please submit constructive feedback for every session; it is very helpful for the presenter.
● Silent Mode: Keep your mobile devices on silent mode, and feel free to step out of the session if you need to attend an urgent call.
● Avoid Disturbance: Avoid unwanted chit-chat during the session.

3. Our Agenda
01 Kafka Introduction
02 Topic Config
03 Producer Config
04 Consumer Config
05 Demo

4. Introduction
● Apache Kafka is used primarily to build real-time data streaming pipelines.
● It stores streams of data and is built on a pub/sub model.
● Some streams receive tens of thousands of records per second, while others receive only one or two records per hour.
● This makes Kafka a key tool for storing the intermediate data/events of a big data application.
● These streams of data are stored in Kafka as Kafka topics.
● Once a stream is stored in Kafka, it can be consumed by multiple applications for different use cases, such as persisting to a database or analytics.
● Apache Kafka also acts as a cushion between two applications, especially when the downstream application is slower.

5. Kafka Features
● Low Latency: It offers low latency, as low as around 10 milliseconds.
● High Throughput: Thanks to its low latency it can handle high-velocity, high-volume data.
● Fault Tolerance: It handles node/machine failures within the cluster.
● Durability: Provided by its replication feature.
● Distributed: Kafka has a distributed architecture, which makes it scalable.
● Real-Time Handling: Kafka can handle real-time data pipelines.
● Batching: Kafka also works with batch-like use cases and supports batching.

6. Kafka Components
● Apache Kafka stores data streams in topics.
● The Producer API is used to write/publish data to a Kafka topic.
● The Consumer API is used to consume data from a topic for further use.

7. Kafka Topic
A topic is similar to a database table: Kafka uses topics to organise messages of a particular category. Unlike a database table, though, a Kafka topic cannot be queried; instead we create a producer to write data and a consumer to read it, both in sequential order. Data in a topic is deleted according to the retention period.
Important Kafka topic configs (a creation sketch follows below):
● Number of Partitions
● Replication Factor
● Message Size
● Log CleanUp Policy

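A minimal sketch of creating a topic with an explicit partition count and replication factor through the AdminClient. The broker address localhost:9092 and the topic name user-events are assumptions for illustration, not part of the original deck.

  import java.util.Properties
  import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}
  import scala.jdk.CollectionConverters._

  object CreateTopic extends App {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker address

    val admin = AdminClient.create(props)
    try {
      // 3 partitions for consumer parallelism, replication factor 3 for fault tolerance.
      val topic = new NewTopic("user-events", 3, 3.toShort)
      admin.createTopics(List(topic).asJava).all().get()
    } finally admin.close()
  }
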
8. Config Details
Number of Partitions: The partition count governs the parallelism of the application. To do parallel computation we need multiple consumer instances, and since one partition cannot feed data to more than one consumer in the same group, we have to increase the partition count to achieve that parallelism.

9. Continued
Replication Factor: The replication factor is the number of copies of the data kept on different brokers. It helps us deal with data loss when a broker goes offline or fails, because a replica can then serve the data. In the ideal case we use a replication factor of 3: increasing it further hurts performance, while decreasing it risks losing data.
Message Size: Kafka has a default limit of 1 MB per message in a topic. In a few scenarios we need to send messages larger than 1 MB; in that case we can raise the limit, for example up to around 10 MB, raising the broker-side fetch limit accordingly:
replica.fetch.max.bytes=10485880

10. Continued
Log CleanUp Policy: The log cleanup policy makes sure that older messages in a topic get cleaned up, freeing disk space on the broker. It is controlled by the following two configurations (topic-level overrides are sketched below):
● log.retention.hours: The most common way to configure how long Kafka retains messages is by time. The default is specified in the configuration file using the log.retention.hours parameter, and it is set to 168 hours, the equivalent of one week.
● log.retention.bytes: Another way to expire messages is based on the total number of bytes of messages retained. This value is set using the log.retention.bytes parameter, and it is applied per partition. The default is -1, meaning that there is no limit and only a time limit is applied.

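The settings above are broker-wide defaults; individual topics can override them. A hedged sketch, reusing the admin client from the earlier example and the assumed user-events topic, that sets topic-level retention and message-size overrides (the values are illustrative only):

  import org.apache.kafka.clients.admin.{AlterConfigOp, ConfigEntry}
  import org.apache.kafka.common.config.ConfigResource
  import scala.jdk.CollectionConverters._

  // "admin" is the AdminClient created in the earlier sketch.
  val topicResource = new ConfigResource(ConfigResource.Type.TOPIC, "user-events")
  val ops = List(
    new AlterConfigOp(new ConfigEntry("retention.ms", "604800000"), AlterConfigOp.OpType.SET),     // 7 days
    new AlterConfigOp(new ConfigEntry("retention.bytes", "1073741824"), AlterConfigOp.OpType.SET), // 1 GB per partition
    new AlterConfigOp(new ConfigEntry("max.message.bytes", "10485880"), AlterConfigOp.OpType.SET)  // ~10 MB messages
  )
  admin.incrementalAlterConfigs(Map(topicResource -> ops.asJavaCollection).asJava).all().get()
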
11. Kafka Producer
Once a topic has been created in Kafka, the next step is to send data into the topic. This is where Kafka producers come in. A Kafka producer sends messages to a topic, and messages are distributed across partitions according to a mechanism such as key hashing.

12. Continued
Kafka messages are created by the producer. A Kafka message consists of the following elements:

13. Continued
● Key: The key is optional in a Kafka message and can be null. A key may be a string, a number, or any object, and it is then serialized into binary format.
● Value: The value represents the content of the message and can also be null. The value format is arbitrary and is also serialized into binary format.
● Compression Type: Kafka messages can be compressed. The compression options are none, gzip, lz4, snappy, and zstd.
● Headers: These are key-value pairs added especially for tracing the message.
● Partition + Offset: Once a message is sent to a Kafka topic, it receives a partition number and an offset id. The combination of topic + partition + offset uniquely identifies the message.
● Timestamp: A timestamp is added to the message either by the user or by the system.

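A small sketch of how these elements map onto a ProducerRecord. The topic, key, value, and header contents are hypothetical, chosen only to illustrate each field:

  import org.apache.kafka.clients.producer.ProducerRecord
  import java.nio.charset.StandardCharsets

  val record = new ProducerRecord[String, String](
    "user-events",                                      // topic (assumed)
    null,                                               // partition: null lets the partitioner decide (key hashing)
    java.lang.Long.valueOf(System.currentTimeMillis()), // timestamp supplied by the user
    "user-42",                                          // key: drives partition assignment
    """{"action":"login"}"""                            // value: the message content
  )
  // Header: a key-value pair, here used for tracing.
  record.headers().add("trace-id", "abc-123".getBytes(StandardCharsets.UTF_8))
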
14. Continued
  // bootstrapServers and stringSerializer are assumed to be defined elsewhere,
  // e.g. "localhost:9092" and classOf[StringSerializer].getName.
  val producerConfig: Map[String, String] = Map(
    // Must-have config
    "bootstrap.servers" -> bootstrapServers,
    "key.serializer" -> stringSerializer,
    "value.serializer" -> stringSerializer,
    // Safe producer
    "enable.idempotence" -> "true",
    "acks" -> "all",
    "retries" -> Integer.MAX_VALUE.toString,
    "max.in.flight.requests.per.connection" -> "5",
    // High-throughput producer, at the expense of a bit of latency and CPU usage
    "compression.type" -> "snappy",
    "linger.ms" -> "20",
    "batch.size" -> (32 * 1024).toString // 32 KB
  )

15. Ack
Safe Producer: acks is the number of brokers that need to acknowledge receiving a message before it is considered a successful write.
● acks=0: The producer considers messages "written successfully" the moment they are sent, without waiting for the broker to accept them at all. This is the fastest approach, but data loss is possible.
● acks=1: The producer considers messages "written successfully" when the message is acknowledged by the leader only.
● acks=all: The producer considers messages "written successfully" when the message is accepted by all in-sync replicas (ISR).

16. Retry
Retries ensure that no messages are dropped when sent to Apache Kafka.
● For Kafka > 2.1 the default retries value is the max int: retries = 2147483647. This doesn't mean the producer will keep retrying forever; retrying is bounded by delivery.timeout.ms, whose default is 2 minutes: delivery.timeout.ms=120000.
● max.in.flight.requests.per.connection: If we want ordering to be maintained we have to set the max in-flight requests to 1, but that hurts performance; to keep performance high we should set it to 5 (with enable.idempotence=true, ordering is still preserved with up to 5 in-flight requests).

17. Compression
Producers group messages into a batch before sending. If the producer is sending compressed messages, all the messages in a single producer batch are compressed together and sent as the "value" of a "wrapper message". Compression is more effective the bigger the batch of messages being sent to Kafka. The compression options are compression.type = none, gzip, lz4, snappy, and zstd.

18. Batching
By default, Kafka producers try to send records as soon as possible. If we want to increase throughput we have to enable batching. Batching is mainly controlled by two producer settings, linger.ms and batch.size. The default batch.size is 16 KB and the default linger.ms is 0; with linger.ms raised to 20 ms (as in the config shown earlier), the producer waits until either the batch size is reached or 20 ms have passed before sending. A runnable producer sketch wiring these settings together follows below.

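A minimal end-to-end sketch of the "safe and high-throughput" producer described on the last few slides. The broker address, topic name, key, and value are assumptions for illustration:

  import java.util.Properties
  import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
  import org.apache.kafka.common.serialization.StringSerializer

  object SafeFastProducer extends App {
    val props = new Properties()
    // Must-have config (assumed local broker).
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    // Safe producer: idempotence, acks=all, retries bounded by delivery.timeout.ms.
    props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true")
    props.put(ProducerConfig.ACKS_CONFIG, "all")
    props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE.toString)
    props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "5")
    // High throughput: compression plus batching.
    props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy")
    props.put(ProducerConfig.LINGER_MS_CONFIG, "20")
    props.put(ProducerConfig.BATCH_SIZE_CONFIG, (32 * 1024).toString) // 32 KB

    val producer = new KafkaProducer[String, String](props)
    try {
      producer.send(new ProducerRecord[String, String]("user-events", "user-42", """{"action":"login"}"""))
      producer.flush()
    } finally producer.close()
  }
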
19. Kafka Consumer
Applications that pull event data from one or more Kafka topics are known as Kafka consumers. Consumers can read from one or more partitions at a time in Apache Kafka, and data is read in order within each partition.

20. Delivery Semantics
A consumer reading from a Kafka partition may choose when to commit offsets.
● At Most Once Delivery: Offsets are committed as soon as a message batch is received, right after calling poll(). If processing fails, the messages are lost: they will not be read again because their offsets have already been committed.
● At Least Once Delivery: Here we don't want to lose messages, so we commit the offsets only after processing is done; however, retries can lead to duplicate processing. This is suitable for consumers that cannot afford any data loss (a sketch of this pattern follows below).
● Exactly Once Delivery: Here we want each message to be processed exactly once, with no duplicates. This applies to payments and similarly sensitive use cases; in Kafka Streams it is enabled with processing.guarantee=exactly_once.

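A minimal sketch of the at-least-once pattern: auto-commit is disabled and offsets are committed only after the batch has been processed. The broker address, group id, and topic name are assumptions:

  import java.time.Duration
  import java.util.Properties
  import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
  import org.apache.kafka.common.serialization.StringDeserializer
  import scala.jdk.CollectionConverters._

  object AtLeastOnceConsumer extends App {
    val props = new Properties()
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")   // assumed broker
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "user-events-processor")     // hypothetical group id
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")           // commit manually, after processing

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(List("user-events").asJava)
    try {
      while (true) {
        val records = consumer.poll(Duration.ofMillis(500))
        records.asScala.foreach(r => println(s"processing ${r.key()} -> ${r.value()}"))
        // Offsets are committed only after the whole batch has been processed (at-least-once).
        if (!records.isEmpty) consumer.commitSync()
      }
    } finally consumer.close()
  }
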
21. Polling
Kafka consumers poll the Kafka broker to receive batches of data. Polling allows consumers to control:
● From where in the log they want to consume
● How fast they want to consume
● The ability to replay events
The consumer sends a heartbeat at a regular interval (heartbeat.interval.ms). If it stops sending heartbeats, the group coordinator waits for session.timeout.ms and then triggers a rebalance. Consumers poll brokers periodically using the .poll() method. If two .poll() calls are separated by more than max.poll.interval.ms, the consumer is removed from the group. The default value of max.poll.interval.ms is 5 minutes.

22. Continued
max.poll.records (default 500): This controls the maximum number of records that a single call to poll() will fetch. It is an important config because it controls how fast data arrives in our application. If your application processes data slowly, set max.poll.records lower; if it processes data quickly, you can set it higher. If processing a particular batch takes longer than max.poll.interval.ms, the consumer will be removed from the group; to make sure this doesn't happen, reduce the max.poll.records value (see the snippet below).

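A hedged example of tuning these two settings in the consumer properties from the sketch above; the values are illustrative, chosen for a consumer that processes each record slowly:

  // Smaller batches for a slow processor, plus a longer max.poll.interval.ms
  // so that processing one batch never exceeds the poll interval.
  props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100")
  props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "600000") // 10 minutes
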
23. Auto Offset Reset
● A consumer is expected to read from the log continuously.
● But due to a network error or a bug in the application, we may fail to process messages for a while, and
● once we restart the application, the previously committed offset may no longer be available (deleted due to the retention policy).
● The consumer then has two options: read from the beginning or read from the latest offset. This behaviour is controlled by auto.offset.reset, which can take the following values (see the snippet below):
● latest: the consumer reads messages from the tail of the partition
● earliest: the consumer reads from the oldest available offset in the partition
● none: an exception is thrown to the consumer if no previous offset is found

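Added to the consumer properties from the earlier sketch; choosing earliest here is an illustrative assumption, not a recommendation from the deck:

  // Start from the oldest available offset when no committed offset exists for the group.
  props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
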
24. Thank You! Get in touch with us:
