Apache Kafka
Introduction
Kumar Shivam
A distributed streaming platform
History
• Apache Kafka is an open-source stream-processing software platform developed by LinkedIn and donated to the Apache
Software Foundation, written in Scala and Java.
• Kafka can connect to external systems (for data import/export) via Kafka Connect and provides Kafka Streams, a Java
stream processing library.
• Kafka uses a binary TCP-based protocol
Use cases
• Messaging system
• Activity Tracking
• Gather metrics from many different locations
• Application logs gathering
• Stream processing (with the Kafka streams API or Spark for example)
• De-coupling of system dependencies.
• Integration with Spark, Flink, Storm, Hadoop, and many other big data technologies.
Application data flow (without using Kafka)
Application data flow (using Kafka)
Companies Use cases
• Netflix - uses Kafka to apply recommendations in real time while users watch TV shows.
• Uber - uses Kafka to gather user, taxi, and trip data in real time, to compute and forecast demand, and to compute surge pricing in real time.
• LinkedIn - uses Kafka to prevent spam and to collect user interactions for better connection recommendations in real time.
• Spotify - Kafka is used at Spotify as part of their log delivery system.
• Coursera - At Coursera, Kafka powers education at scale, serving as the data pipeline for real-time learning analytics/dashboards.
• Oracle - Oracle provides native connectivity to Kafka from its Enterprise Service Bus product, OSB (Oracle Service Bus), which allows developers to leverage OSB's built-in mediation capabilities to implement staged data pipelines.
• Trivago - Trivago uses Kafka for stream processing in Storm as well as for processing application logs.
• Zalando - As the leading online fashion retailer in Europe, Zalando uses Kafka as an ESB (Enterprise Service Bus), which helps it transition from a monolithic to a microservices architecture. Using Kafka for processing event streams enables its technical teams to do near-real-time business intelligence.
Kafka in ERP
Jargon
• Topics (category)
• Partition
• Offset
• Replicas
• Broker
• Cluster
• Producers
• Consumer
• Leader
• Follower
Topic (Category)
A stream of messages belonging to a particular category is called a topic. Data is stored in topics.
Partition
• Topics are split into partitions.
• A partition contains messages in an immutable, ordered sequence.
• A partition is implemented as a set of segment files of equal size.
• Data once written to a partition is immutable.
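As a quick illustration, here is a minimal sketch (not from the original deck) of creating a topic with several partitions using the Java AdminClient; the broker address, topic name, and settings are assumptions for the example.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed address of one broker; adjust to your cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical topic "orders" with 3 partitions and replication factor 1.
            NewTopic topic = new NewTopic("orders", 3, (short) 1);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}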
Offset
Each message stored in a partition is assigned an incrementing ID (a unique sequential ID) called an "offset".
Replicas
• A replica is a backup of a partition.
• Replication factor: the number of copies of the data kept across multiple brokers.
Replicas
• Topic X, Partition 0 is available on Broker 0, and similarly for Partition 1.
• Problem:
• On Broker 2, we are keeping both actual data (Topic X, Partition 1) and replicated data (Topic X, Partition 0).
• Solution:
• Choose one broker's copy of each partition as the leader and the rest as followers.
Brokers (Containers)
• A broker is the system responsible for maintaining the published data.
• A broker holds multiple topics with multiple partitions.
• Brokers are stateless.
• One Kafka broker can handle roughly a million reads/writes per second.
• A broker handles terabytes of messages without a performance hit.
• Each broker in the cluster is identified by an ID.
• Kafka brokers are also known as bootstrap brokers, because connecting to any one broker means connecting to the entire cluster (see the sketch below).
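A minimal sketch of the bootstrap-broker idea using the Java AdminClient: connecting to a single assumed broker address is enough to discover every broker in the cluster. All names here are illustrative.

import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.Node;

public class DescribeClusterExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Connecting to one (bootstrap) broker is enough to discover the whole cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            for (Node node : admin.describeCluster().nodes().get()) {
                // Each broker in the cluster is identified by an ID.
                System.out.println("Broker id=" + node.id() + " at " + node.host() + ":" + node.port());
            }
        }
    }
}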
Kafka Clusters
• A Kafka deployment with more than one broker is called a Kafka cluster.
• A Kafka cluster can be expanded without downtime.
• These clusters are used to manage the persistence and replication of message data.
• A cluster typically consists of multiple brokers to maintain load balance.
Kafka Ecosystem
Producer
• A producer publishes messages to one or more Kafka topics.
Consumer
• Consumers read data from brokers.
• A consumer subscribes to one or more topics and consumes published messages by pulling data from the brokers (see the sketch below).
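A minimal consumer sketch in Java, assuming a local broker and a hypothetical topic and group ID; it subscribes, pulls records from the brokers, and prints each record's partition and offset.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");                // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("orders"));                // hypothetical topic
            while (true) {
                // The consumer pulls batches of records from the brokers.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}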
Leaders
• The leader is the node responsible for all reads and writes for a given partition.
Follower
• Nodes that follow the leader's instructions are called followers.
• If the leader fails, one of the followers automatically becomes the new leader.
ZooKeeper
• ZooKeeper manages and coordinates the Kafka brokers.
• It is used to notify producers and consumers about the presence or failure of any broker in the Kafka system.
• On a failure, producers and consumers can then decide to coordinate their work with another broker.
Kafka Producers
• How does the producer write data to the cluster?
• Message keys
• Acknowledgment
• A message key lets the producer control ordering. The key gives the producer two choices (see the sketch after the key examples below):
• Send the data across all partitions: if the key is NULL, the data is sent without a key and is distributed across the partitions in a round-robin manner.
• Send the data to a specific partition: if the key is not NULL, the key is attached to the data, and all messages with the same key are always delivered to the same partition.
Without a key
• Scenario where a producer writes data to the Kafka cluster without specifying a key.
With a key
• Scenario where a producer specifies a key such as Prod_id (e.g. Prod_id_1, Prod_id_2).
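A minimal Java producer sketch covering both scenarios; the broker address, topic name, and Prod_id keys are illustrative.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // No key (key = null): records are spread across partitions
            // (round robin, or sticky batching in newer client versions).
            producer.send(new ProducerRecord<>("orders", "order without key"));

            // With a key: all records sharing the same key (e.g. "Prod_id_1") hash to the
            // same partition, so their relative order is preserved within that partition.
            producer.send(new ProducerRecord<>("orders", "Prod_id_1", "order for producer 1"));
            producer.send(new ProducerRecord<>("orders", "Prod_id_2", "order for producer 2"));
        }
    }
}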
Acknowledgment
• In order to write data to the Kafka cluster, the producer also chooses an acknowledgment level (message sent vs. message received).
Case 1
• The producer sends data to the broker but does not receive any acknowledgment.
• acks = 0: the producer sends the data to the broker and does not wait for an acknowledgment.
Case 2 (half duplex)
• The producer sends data to the broker and receives an acknowledgment from the leader only.
• acks = 1: the producer waits for the leader's acknowledgment; the leader acknowledges once it has successfully received the data.
• For example, if Broker 1 holds the leader for the partition, the producer sends data to Broker 1; after Broker 1 confirms that it has received the data, the leader sends the acknowledgment back to the producer (acks = 1).
Case 3 (full duplex)
• The producer sends data to the broker and receives acknowledgment from both the leader and its followers.
• acks = all: the acknowledgment is sent only after both the leader and its followers have received the data.
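A minimal sketch of how the acknowledgment level is chosen on the producer side, assuming the standard Java client; the broker address and topic are placeholders.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AcksExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // acks=0   : fire and forget, no acknowledgment is awaited (Case 1).
        // acks=1   : wait for the partition leader's acknowledgment only (Case 2).
        // acks=all : wait until the leader and its in-sync followers have the record (Case 3).
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "key", "value"));       // hypothetical topic
        }
    }
}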
Kafka Core APIs (Producer and Consumer)
Comparison: Apache Kafka vs. Apache Spark
• Developers - Kafka: originally developed by LinkedIn, later donated to the Apache Software Foundation. Spark: originally developed at the University of California, later donated to the Apache Software Foundation.
• Infrastructure - Kafka (Streams): a Java client library, so it can execute wherever Java is supported. Spark: executes on top of the Spark stack, which can run standalone, on YARN, or container-based.
• Data sources - Kafka: processes data from Kafka itself via topics and streams. Spark: ingests data from various sources such as files, Kafka, sockets, etc.
• Processing model - Kafka: processes events as they arrive (event-at-a-time, continuous processing). Spark: micro-batch model; it splits incoming streams into small batches for further processing.
• Latency - Kafka: lower latency than Apache Spark. Spark: higher latency.
• ETL transformation - Kafka: not supported. Spark: supported.
• Fault tolerance - Kafka: fault tolerance is complex. Spark: fault tolerance is easy.
• Language support - Kafka: mainly Java. Spark: multiple languages such as Java, Scala, R, and Python.
• Use cases - Kafka: The New York Times, Zalando, Trivago, etc. use Kafka Streams to store and distribute data. Spark: Booking.com and Yelp (ad platform) use Spark Streaming to handle millions of ad requests per day.
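Since the comparison treats Kafka Streams as a Java client library with event-at-a-time processing, here is a minimal, illustrative topology sketch; the application ID, broker address, and topic names are assumptions.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "demo-streams-app");    // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Event-at-a-time processing: each record from "orders" is transformed as it arrives
        // and written to "orders-uppercased" (both topic names are illustrative).
        KStream<String, String> orders = builder.stream("orders");
        orders.mapValues(value -> value.toUpperCase()).to("orders-uppercased");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}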
Interact with Apache Kafka clusters in Azure
HDInsight using a REST proxy
How can we use Spark, Kafka, and Cassandra to build a robust analytical platform?
• Concerns
1. High data flow
Concern 1: A lot of orders are placed on the Walmart website every second, and item availability also changes frequently. Keeping this data up to date (which can amount to 100 MB per second) means streaming the information to the analytics platform in real time.
Solution: Kafka is a distributed, scalable, fault-tolerant messaging system that provides streaming support out of the box.
How can we use Spark, Kafka, and Cassandra to build a robust analytical platform?
• Concerns
2. Storing terabytes of data with frequent updates
Concern 2: To store the item-availability data, we needed a datastore that can process a huge number of upserts without compromising performance. To generate reports, the data also had to be processed every few hours, so reads had to be fast too.
Solution: Although an RDBMS can store large amounts of data, it cannot provide reliable upsert and read performance at this scale. We had good experience with Cassandra in the past, so it was the first choice. Apache Cassandra offers excellent write and read performance and, like Kafka, it is distributed, highly scalable, and fault-tolerant.
How can we use Spark, Kafka, and Cassandra to build a robust analytical platform?
• Concerns
3. Processing a huge amount of data
Concern 3: Data processing had to be carried out at two places in the pipeline:
1. During writes, where we have to stream data from Kafka, process it, and save it to Cassandra.
2. While generating business reports, where we have to read the complete Cassandra table, join it with other data sources, and aggregate it across multiple columns.
Solution: Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
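A minimal sketch of the write path under these assumptions: Spark Structured Streaming with the spark-sql-kafka connector on the classpath, an assumed broker address, and a hypothetical topic. The console sink stands in for the Cassandra write, which in the real pipeline would go through the Spark Cassandra connector.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class KafkaToSparkSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("availability-stream")          // hypothetical app name
                .getOrCreate();

        // Read the item-availability stream from Kafka (requires the spark-sql-kafka package).
        Dataset<Row> updates = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")   // assumed broker
                .option("subscribe", "item-availability")              // hypothetical topic
                .load()
                .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");

        // In the described pipeline the sink would be Cassandra (via the Spark Cassandra
        // connector); the console sink below just keeps the sketch self-contained.
        StreamingQuery query = updates.writeStream()
                .format("console")
                .start();
        query.awaitTermination();
    }
}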
How can we use Spark, Kafka, and Cassandra to build a robust analytical platform?
Spark batch job
Security
• Data encryption among brokers and between clients and brokers
• Using SSL/TLS
• Authentication between clients and brokers
• Using SSL (mutual authentication)
• Using SASL (e.g. Kerberos or SCRAM-SHA)
• Authorization of read/write operations by clients
• ACLs on topics
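A hedged sketch of what the corresponding client-side settings might look like using the standard Kafka configuration keys; all hostnames, file paths, and credentials are placeholders.

import java.util.Properties;

public class SecureClientConfigSketch {
    public static Properties secureProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9093");                          // assumed TLS listener

        // Encryption in transit plus SASL authentication (SCRAM in this sketch).
        props.put("security.protocol", "SASL_SSL");
        props.put("ssl.truststore.location", "/path/to/client.truststore.jks");  // placeholder path
        props.put("ssl.truststore.password", "changeit");                        // placeholder secret
        props.put("sasl.mechanism", "SCRAM-SHA-512");
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.scram.ScramLoginModule required "
                        + "username=\"app-user\" password=\"app-secret\";");     // placeholder credentials

        // Authorization is enforced broker-side with ACLs, e.g. (illustrative CLI):
        // kafka-acls.sh --bootstrap-server broker1:9093 --add --allow-principal User:app-user \
        //   --operation Read --operation Write --topic orders
        return props;
    }
}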
Thank you!
Keep in touch.
https://www.linkedin.com/in/kumar-shivam-3a07807b/
Kshivam@firstam.com
https://github.com/ThirstyBrain