Presented By: Amit Kumar
Deep Dive into Kafka for Big Data Application
Lack of etiquette and manners is a huge turn off.
KnolX Etiquettes
Punctuality
Join the session 5 minutes prior to
the session start time. We start on
time and conclude on time!
Feedback
Make sure to submit constructive feedback for all sessions, as it is very helpful for the presenter.
Silent Mode
Keep your mobile devices in silent mode; feel free to step out of the session if you need to attend an urgent call.
Avoid Disturbance
Avoid unwanted chit chat during
the session.
Our Agenda
01 Kafka Introduction
02 Topic Config
03 Producer Config
04 Consumer Config
05 Demo
Introduction
● Apache Kafka is used primarily to build real-time data streaming pipelines.
● It is used to store streams of data and is built on a pub/sub model.
● Some streams receive tens of thousands of records per second, while others receive only one or two records per hour.
● This makes Kafka an important tool for storing the intermediate data/events of a big data application.
● These streams of data are stored in Kafka as Kafka topics.
● Once a stream is stored in Kafka, it can be consumed by multiple applications for different use cases, such as storing in a database or analytics.
● Apache Kafka also works as a cushion between two applications, especially when the downstream application is slower.
Kafka Features
● Low Latency:- It offers low latency, i.e., up to 10 milliseconds.
● High Throughput:- Due to its low latency, it can handle high-velocity, high-volume data.
● Fault Tolerance:- It handles node/machine failures within the cluster.
● Durability:- Data is durable thanks to its replication feature.
● Distributed:- Kafka has a distributed architecture, which makes it scalable.
● Real-Time Handling:- Kafka is able to handle real-time data pipelines.
● Batching:- Kafka works with batch-like use cases and supports batching.
Kafka Components
● Apache Kafka stores data streams in topics.
● The Producer API is used to write/publish data to a Kafka topic.
● The Consumer API is used to consume data from the topic for further use.
Kafka Topic
A topic is similar to a database table: Kafka uses topics to organise messages of a particular category.
Unlike a database table, we cannot query a Kafka topic; we create a producer to write data and a consumer to read it back, in sequential order.
Data in topics is deleted as per the retention period.
Important Kafka topic configs:-
● Number of Partitions
● Replication Factor
● Message Size
● Log Cleanup Policy
Config Details
The number of partitions governs the parallelism of the application.
To do parallel computation we need multiple consumer instances, and since one partition can't feed data to multiple consumers in the same group, we have to increase the partition count to achieve this, as in the sketch below.
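A minimal sketch of creating such a topic programmatically with Kafka's AdminClient (the broker address and the topic name "orders" are assumptions for illustration):

import java.util.Properties
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

val props = new Properties()
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed local broker
val admin = AdminClient.create(props)
// 6 partitions let up to 6 consumers in one group read in parallel;
// replication factor 3 keeps two extra copies for fault tolerance
admin.createTopics(java.util.List.of(new NewTopic("orders", 6, 3.toShort))).all().get()
admin.close()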
Continue _ _ _ _
The replication factor is the number of copies of the data kept on different brokers. It helps us deal with data loss when a broker goes offline or fails: the replicated copies serve the data in its place.
In the ideal case we set the replication factor to 3. Increasing it further hurts performance, while keeping it lower risks losing data.
Message Size:- Kafka has a default limit of 1 MB per message in a topic.
In a few scenarios we need to send data larger than 1 MB. In that case we can raise the default message size, for example to around 10 MB:
replica.fetch.max.bytes=10485880
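Raising the limit involves more than one setting; a sketch of the related broker, topic, and consumer configs (the values are illustrative):

message.max.bytes=10485880          # broker: largest record batch the broker will accept
replica.fetch.max.bytes=10485880    # broker: replicas must be able to fetch the larger batches
max.message.bytes=10485880          # per-topic override of the broker default
max.partition.fetch.bytes=10485880  # consumer: max data fetched per partition per request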
Continue _ _ _ _
The log cleanup policy makes sure that older messages in a topic are cleaned up, freeing disk space on the broker.
It is controlled by the following two configurations.
log.retention.hours :- The most common configuration for how long Kafka will retain
messages is by time. The default is specified in the configuration file using the
log.retention.hours parameter, and it is set to 168 hours, the equivalent of one week.
log.retention.bytes:- Another way to expire messages is based on the total number of bytes of
messages retained. This value is set using the log.retention.bytes parameter, and it is applied per partition.
The default is -1, meaning that there is no limit and only a time limit is applied.
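Per-topic overrides of these broker defaults can be applied at runtime; a sketch using the kafka-configs.sh CLI (the topic name and values are illustrative):

kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name orders \
  --add-config retention.ms=86400000,retention.bytes=1073741824

retention.ms and retention.bytes are the topic-level counterparts of log.retention.hours and log.retention.bytes.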
Kafka Producer
Once a topic has been created with Kafka, the next step is to send data into the topic. This is
where Kafka Producers come in.
A Kafka producer sends messages to a topic, and messages are distributed to partitions according to a mechanism such as key hashing, as in the sketch below.
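A minimal producer sketch (the broker address and topic are assumptions); records with the same key hash to the same partition, preserving per-key ordering:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // assumed local broker
props.put("key.serializer", classOf[StringSerializer].getName)
props.put("value.serializer", classOf[StringSerializer].getName)

val producer = new KafkaProducer[String, String](props)
// the key "user-42" is hashed to pick the partition
producer.send(new ProducerRecord[String, String]("orders", "user-42", "order created"))
producer.flush()
producer.close()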
Continue _ _ _ _
Kafka messages are created by the producer. A Kafka message consists of the following elements:
Continue _ _ _ _
● Key:- Optional and may be null. A key can be a string, a number, or any object, and is serialized into binary format.
● Value:- The content of the message; it can also be null. The value format is arbitrary and is likewise serialized into binary format.
● Compression Type:- Kafka messages can be compressed. The options are none, gzip, lz4, snappy, and zstd.
● Headers:- Optional key-value pairs, added especially for tracing the message.
● Partition + Offset:- Once a message is sent to a Kafka topic, it receives a partition number and an offset id. The combination of topic + partition + offset uniquely identifies the message.
● Timestamp:- A timestamp is added to the message, either by the user or by the system.
Continue _ _ _ _
// Must-have config
"bootstrap.servers" -> bootstrapServers,
"key.serializer" -> stringSerializer,
"value.serializer" -> stringSerializer,
// safe producer
"enable.idempotence" -> "true",
"acks" -> "all",
"retries" -> Integer.MAX_VALUE.toString,
"max.in.flight.requests.per.connection" -> "5",
// high-throughput producer, at the expense of a bit of latency and CPU usage
"compression.type" -> "snappy",
"linger.ms" -> "20",
"batch.size" -> Integer.toString(32 * 1024) // 32 KB
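These entries form a Scala Map; a minimal sketch of handing it to a producer (the map name producerConfig is an assumption):

import java.util.Properties
import org.apache.kafka.clients.producer.KafkaProducer

val props = new Properties()
producerConfig.foreach { case (k, v) => props.put(k, v) } // producerConfig is the Map above
val producer = new KafkaProducer[String, String](props)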
Ack
Safe Producer:-
acks is the number of brokers that need to acknowledge receiving the message before it is considered a successful write.
acks=0: producers consider messages "written successfully" the moment the message is sent, without waiting for the broker to accept it at all. This is the fastest approach, but data loss is possible.
acks=1: producers consider messages "written successfully" when the message is acknowledged by the leader only.
acks=all: producers consider messages "written successfully" when the message is accepted by all in-sync replicas (ISR).
Retry
Retries ensure that no messages are dropped when sent to Apache Kafka.
For Kafka > 2.1 the default retries value is max int: retries=2147483647.
That doesn't mean it will keep retrying forever; retrying is bounded by delivery.timeout.ms.
The default setting for the timeout is 2 minutes: delivery.timeout.ms=120000.
max.in.flight.requests.per.connection: setting it to 1 keeps ordering maintained, but it hurts performance; to keep performance high we should set it to 5 (with enable.idempotence=true, ordering is still preserved at 5).
Compression
Producers group messages in a batch before sending.
If the producer is sending compressed messages, all the messages in a single producer batch are compressed
together and sent as the "value" of a "wrapper message".
Compression is more effective the bigger the batch of messages being sent to Kafka.
Compression options are compression.type = none, gzip, lz4, snappy, and zstd.
Batching
By default, Kafka producers try to send records as soon as possible.
If we want to increase throughput, we have to enable batching.
Batching is mainly controlled by two producer settings: linger.ms and batch.size.
The default value of linger.ms is 0 ms (the example above raises it to 20 ms) and the default batch.size is 16 KB.
With linger.ms=20, the producer waits until either 16 KB of messages have accumulated or 20 ms have passed, whichever comes first, before sending.
Kafka Consumer
Applications that pull event data from one or more Kafka topics are known as Kafka consumers.
Consumers can read from one or more partitions at a time in Apache Kafka.
Data is read in order within each partition, as the sketch below illustrates.
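A minimal consumer sketch (broker address, group id, and topic are assumptions for illustration):

import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer
import scala.jdk.CollectionConverters._

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // assumed local broker
props.put("group.id", "order-processors")        // hypothetical consumer group
props.put("key.deserializer", classOf[StringDeserializer].getName)
props.put("value.deserializer", classOf[StringDeserializer].getName)

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(java.util.List.of("orders"))
while (true) {
  // poll returns a batch; records within a partition arrive in order
  val records = consumer.poll(Duration.ofMillis(500))
  for (record <- records.asScala)
    println(s"${record.partition}/${record.offset}: ${record.value}")
}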
Delivery Semantics
A consumer reading from a Kafka partition may choose when to commit offsets.
At Most Once Delivery:- Offsets are committed as soon as a message batch is received, right after calling poll(). If processing fails, the message is lost: it will not be read again because its offset has already been committed.
At Least Once Delivery:- Here we don't want to lose messages, so we commit the offset only after processing is done; due to retries, however, this can lead to duplicate processing. This is suitable for consumers that cannot afford any data loss (a sketch follows below).
Exactly Once Delivery:- Here we want each message processed exactly once, with no duplicate data; it applies to payments and similarly sensitive use cases. In Kafka Streams this is enabled with:
processing.guarantee=exactly_once
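A minimal at-least-once sketch, reusing the consumer from the previous slide with enable.auto.commit=false (the process function is a hypothetical placeholder for real work):

import java.time.Duration
import org.apache.kafka.clients.consumer.{ConsumerRecord, KafkaConsumer}
import scala.jdk.CollectionConverters._

def process(r: ConsumerRecord[String, String]): Unit = println(r.value) // hypothetical work

def runAtLeastOnce(consumer: KafkaConsumer[String, String]): Unit =
  while (true) {
    val records = consumer.poll(Duration.ofMillis(500))
    records.asScala.foreach(process)
    consumer.commitSync() // a crash before this line re-reads the batch: duplicates possible, loss is not
  }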
Polling
Kafka consumers poll the Kafka broker to receive batches of data.
Polling allows consumers to control:-
● From where in the log they want to consume
● How fast they want to consume
● Ability to replay events
The consumer sends a heartbeat at a regular interval (heartbeat.interval.ms). If it stops sending heartbeats, the group coordinator waits for session.timeout.ms and then triggers a rebalance.
Consumers poll brokers periodically using the .poll() method. If two .poll() calls are separated by more than max.poll.interval.ms, the consumer is disconnected from the group.
The default value of max.poll.interval.ms is 5 minutes.
Continue _ _ _
max.poll.records (default 500):- Controls the maximum number of records that a single call to poll() will fetch.
This is an important config that controls how fast data flows into our application.
If your application processes data slowly, set max.poll.records lower; if it processes data quickly, you can set it higher.
If processing a particular batch takes longer than max.poll.interval.ms, the consumer will be disconnected from the group; to make sure this doesn't happen, reduce max.poll.records. A tuning sketch follows.
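A sketch of tuning these two settings for a slow processor (the values are illustrative):

max.poll.records=100          # fewer records per poll(), so each batch finishes sooner
max.poll.interval.ms=600000   # allow up to 10 minutes between polls before a rebalance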
Auto Offsets Reset
● A consumer is expected to read from a log continuously.
● But due to a network error or an application bug, we may stop processing messages, and
● once we restart the application the current offset may no longer be available (deleted due to the retention policy); then
● the consumer has two options: read from the beginning or read from the latest offset.
This behavior is controlled by auto.offset.reset, which can take the following values:-
latest:- the consumer reads new messages from the tail of the partition
earliest:- the consumer reads from the oldest available offset in the partition
none:- throws an exception to the consumer if no previous offset is found
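For example, to replay a topic from the beginning with a fresh consumer group, a sketch of the relevant consumer property:

props.put("auto.offset.reset", "earliest") // only applies when the group has no valid committed offset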
Thank You !
Get in touch with us: