- Ranganathan B
Trend of software development
Src: http://eil.stanford.edu/publications/david_liu/david_dissertation.pdf
Pre-Kafka days in LinkedIn
src: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
Any problems? How would you solve them?
Existing queue solutions
beanstalkd
Crossroads.io
Darner
Delayed::Job
Kestrel
queue_classic
Resque
RestMQ
Zaqar
Kafka is
● publish-subscribe messaging service
● distributed commit/write-ahead log
● decouples data pipelines
● ordered per partition
Producers produce and consumers consume, in a large-scale distributed way.
Kafka characteristics
● fast - O(1) disk structures give constant-time performance regardless of log size
● high throughput - 3.25 million writes per second
● scalable - (300+ brokers at LinkedIn)
● durable
● distributed
● replicated (fault tolerance)
Pre-Kafka days in LinkedIn
src: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
Introducing Kafka at LinkedIn
src: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
Why not RabbitMQ/ActiveMQ/….?
● Existing queues - http://queues.io/
● For highly distributed messaging, Kafka stands out.
● Messages are delivered to consumers in order within each partition.
● Client APIs for many languages, plus integration APIs.
● Fast writes (sequential appends through the page cache) and fast reads (efficient transfer from page cache to network sockets - zero-copy optimization)
Zero copy
src: https://www.ibm.com/developerworks/linux/library/j-zerocopy/
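As a rough illustration of the zero-copy idea: Kafka itself relies on Java's FileChannel.transferTo, but the same operating-system facility is reachable from Python through socket.sendfile. The sketch below is not Kafka code; the helper name send_file_zero_copy and the local socket pair standing in for a broker-to-consumer connection are assumptions for the demo.

```python
import os
import socket
import tempfile


def send_file_zero_copy(path, sock):
    """Send a whole file over a connected socket and return the bytes sent."""
    with open(path, "rb") as f:
        # socket.sendfile uses os.sendfile where the platform supports it,
        # so file data need not be copied through user-space buffers.
        # It falls back to plain send() elsewhere.
        return sock.sendfile(f)


# Demo: a temporary file plays the role of a log segment.
payload = b"message-1\nmessage-2\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    log_path = tmp.name

# A connected socket pair stands in for broker -> consumer (assumption).
broker_side, consumer_side = socket.socketpair()
sent = send_file_zero_copy(log_path, broker_side)
broker_side.close()

received = b""
while True:
    chunk = consumer_side.recv(4096)
    if not chunk:
        break
    received += chunk
consumer_side.close()
os.remove(log_path)
```

The point of the design is that the data goes from page cache to socket without passing through application buffers.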
Timeline
● Originally developed at LinkedIn
● Open sourced in 2011, as version 0.6
● Graduated from the Apache Incubator - Oct 2012
● Written in Scala
● Latest stable - 0.8.2.0
Messaging Terminology
● A unit of data is called a message.
● Producers publish messages.
● Messages are stored in topics.
● Topics are partitioned and replicated across Kafka servers.
● Each Kafka server in a cluster is called a broker.
● Consumers consume messages from brokers.
Producers send messages over the network to the Kafka cluster, which in turn serves consumers.
[Figure: producers connect over TCP to the brokers of a Kafka cluster; the brokers serve consumers over TCP]
Topics
Messages are flushed to disk based on
● number of messages: log.flush.interval.messages
● time: log.flush.interval.ms
Messages are removed based on
● time: log.retention.hours
● size: log.retention.bytes
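Time- and size-based removal can be sketched as follows. This is a simplified model, not broker code: the function name apply_retention, its parameters, and the representation of a log as (last_modified, size) segments are assumptions made for illustration.

```python
def apply_retention(segments, now, max_age, max_bytes):
    """Return the segments that survive retention.

    segments: list of (last_modified, size_bytes) tuples, oldest first.
    max_age and now share one time unit; max_bytes caps the total size.
    """
    # Time-based rule: drop segments older than max_age.
    kept = [seg for seg in segments if now - seg[0] <= max_age]

    # Size-based rule: drop the oldest segments until the total fits.
    total = sum(size for _, size in kept)
    while kept and total > max_bytes:
        total -= kept[0][1]
        kept = kept[1:]
    return kept


segments = [(0, 100), (50, 100), (90, 100)]
survivors = apply_retention(segments, now=100, max_age=60, max_bytes=150)
```

Whole segments are dropped from the old end of the log, so removal never touches the messages currently being written.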
Partition
Partition logic can be
● default (kafka.producer.DefaultPartitioner - based on a hash of the key)
● custom (by requirement, e.g. partition by user-id when further processing is keyed on user-id)
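A minimal sketch of the two partitioning styles, assuming a hash-modulo scheme. The function names are hypothetical, and zlib.crc32 merely stands in for the key's hash code; it is not the hash Kafka uses.

```python
import zlib


def default_partition(key: bytes, num_partitions: int) -> int:
    """Hash-based choice: the same key always maps to the same partition."""
    # crc32 is a stand-in hash (assumption), chosen because it is
    # stable across runs, unlike Python's built-in hash() for bytes.
    return zlib.crc32(key) % num_partitions


def user_partition(user_id: int, num_partitions: int) -> int:
    """Custom choice: route by numeric user id."""
    # All events of one user land in one partition, so they stay ordered.
    return user_id % num_partitions


first = default_partition(b"user-42", 3)
second = default_partition(b"user-42", 3)
```

Either way, the property that matters is determinism: equal keys must always map to the same partition, or per-key ordering is lost.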
Partition
● Ordered, immutable sequence of messages
● Each message is assigned a unique offset within its partition
● Serves
o Horizontal scaling
o Parallel consumer reads (each partition is consumed in order)
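The partition properties above can be modelled in a few lines. This is an in-memory toy, not how a broker stores data; the class name PartitionLog and its methods are assumptions for illustration.

```python
class PartitionLog:
    """Append-only sequence of messages; the list index is the offset."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Add a message at the end and return its offset."""
        offset = len(self._messages)
        self._messages.append(message)
        return offset

    def read(self, offset, max_messages=10):
        """Return (offset, message) pairs starting at the given offset."""
        window = self._messages[offset:offset + max_messages]
        return list(enumerate(window, start=offset))


log = PartitionLog()
first_offset = log.append("a")
second_offset = log.append("b")
```

Existing entries are never modified, so any reader can re-read from any offset and see the same sequence.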
Replication
src: https://engineering.linkedin.com/kafka/intra-cluster-replication-apache-kafka
Producers - push
● batching
● compression
● sync (waits for ack) or async (automatic batching, flushed at, say, 60k or 10 ms)
● sequential writes - guaranteed order per partition
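A sketch of the async batching idea: buffer messages and flush when either a size or a time threshold is reached. The class BatchingProducer, its parameters, and the injected send callback are all hypothetical. Unlike a real producer, this toy checks the time threshold only when produce() is called instead of using a background thread.

```python
import time


class BatchingProducer:
    """Buffers messages and hands them to send() in batches."""

    def __init__(self, send, max_batch=3, linger_seconds=0.01,
                 clock=time.monotonic):
        self._send = send              # callable taking a list of messages
        self._max_batch = max_batch    # flush when this many are buffered
        self._linger = linger_seconds  # or when the oldest waited this long
        self._clock = clock            # injectable clock, eases testing
        self._buffer = []
        self._first_at = None

    def produce(self, message):
        if not self._buffer:
            self._first_at = self._clock()
        self._buffer.append(message)
        too_many = len(self._buffer) >= self._max_batch
        too_old = self._clock() - self._first_at >= self._linger
        if too_many or too_old:
            self.flush()

    def flush(self):
        """Send whatever is buffered, preserving order."""
        if self._buffer:
            self._send(list(self._buffer))
            self._buffer.clear()


# Usage with a fake clock so the example is deterministic.
sent_batches = []
fake_now = [0.0]
producer = BatchingProducer(sent_batches.append, max_batch=3,
                            linger_seconds=0.01, clock=lambda: fake_now[0])
for msg in ["a", "b", "c"]:
    producer.produce(msg)      # third message triggers the size-based flush
producer.produce("d")
fake_now[0] = 1.0
producer.produce("e")          # time threshold exceeded, so this flushes
```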
Consumers - pull
● Consumer group: the consumers in a group share the work like a queue
● Position is tracked by offset, controlled by the consumer and persisted at intervals to the topic __consumer_offsets
● can rewind offset
● Guaranteed order per partition
● More partitions enable better parallel reads
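The pull model can be sketched as a consumer that owns its position and may rewind it, plus a helper that spreads partitions over a group. The names Consumer and assign are hypothetical. The round-robin assignment is a simplification chosen for brevity and is not necessarily the strategy a Kafka client uses.

```python
class Consumer:
    """Pull-based reader that tracks its own position in one partition."""

    def __init__(self, log):
        self.log = log        # list of messages; index is the offset
        self.position = 0     # next offset to read
        self.committed = 0    # last position saved durably

    def poll(self, max_messages=2):
        batch = self.log[self.position:self.position + max_messages]
        self.position += len(batch)
        return batch

    def commit(self):
        """Persist the current position (here: just remember it)."""
        self.committed = self.position

    def seek(self, offset):
        """Rewind or skip ahead to an arbitrary offset."""
        self.position = offset


def assign(partitions, consumers):
    """Spread partitions over group members, one owner per partition."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


consumer = Consumer(["m0", "m1", "m2"])
first_batch = consumer.poll()
consumer.commit()
consumer.seek(0)               # rewind and read everything again
replayed = consumer.poll(3)
group = assign([0, 1, 2], ["c1", "c2"])
```

Because each partition has exactly one owner in the group, adding partitions raises the ceiling on parallel readers.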
Message delivery guarantees
● At most once - Read -> Save position -> Process
● At least once - Read -> Process -> Save position
● Exactly once - Write the output and the position to the same place, atomically.
With Kafka:
● At least once - (Default)
● At most once - (Disable Producer retries and save offset before processing)
● Exactly once - Store the offset together with the output in the destination system
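The three guarantees differ only in the order of "process" and "save position". The simulation below assumes a single crash between the two steps and a restart from the saved position. The function names and the crash model are hypothetical, and no Kafka API is involved.

```python
def consume(messages, save_first, crash_at):
    """Replay a consumer that crashes once, between its two steps."""
    saved = 0          # durable position
    processed = []     # visible side effects
    crashed = False
    while saved < len(messages):
        offset = saved                 # (re)start from the saved position
        if save_first:
            saved = offset + 1         # at most once: save, then process
        else:
            processed.append(messages[offset])  # at least once: process first
        if offset == crash_at and not crashed:
            crashed = True
            continue                   # crash: the second step never runs
        if save_first:
            processed.append(messages[offset])
        else:
            saved = offset + 1
    return processed


def consume_exactly_once(messages, crash_at):
    """Output and position live in one store and change in one write."""
    store = {"offset": 0, "output": []}
    crashed = False
    while store["offset"] < len(messages):
        offset = store["offset"]
        if offset == crash_at and not crashed:
            crashed = True
            continue                   # crash before the write: nothing saved
        # Single atomic update of both output and position.
        store = {"offset": offset + 1,
                 "output": store["output"] + [messages[offset]]}
    return store["output"]


msgs = ["a", "b", "c"]
at_most_once = consume(msgs, save_first=True, crash_at=1)
at_least_once = consume(msgs, save_first=False, crash_at=1)
exactly_once = consume_exactly_once(msgs, crash_at=1)
```

In this run, at-most-once loses "b", at-least-once processes "b" twice, and the combined write yields each message once.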
QuickStart
● Download 0.8.2.0 and untar
● bin/zookeeper-server-start.sh config/zookeeper.properties
● bin/kafka-server-start.sh config/server.properties
● bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
● bin/kafka-topics.sh --list --zookeeper localhost:2181
● bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
● bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
Setting up a multi-broker cluster
● cp config/server.properties config/server-1.properties
● cp config/server.properties config/server-2.properties
config/server-1.properties:
broker.id=1
port=9093
log.dir=/tmp/kafka-logs-1
config/server-2.properties:
broker.id=2
port=9094
log.dir=/tmp/kafka-logs-2
Setting up a multi-broker cluster (contd.)
● bin/kafka-server-start.sh config/server-1.properties
● bin/kafka-server-start.sh config/server-2.properties
● bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 3 --topic my-replicated-topic
● bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic
● bin/kafka-console-producer.sh --broker-list localhost:9092 --topic my-replicated-topic
● bin/kafka-console-consumer.sh --zookeeper localhost:2181 --from-beginning --topic my-replicated-topic
Use cases
● Messaging (comparable with ActiveMQ, RabbitMQ)
● Website activity tracking
● Operational monitoring
● Log Aggregation
● Stream processing (along with Storm/Samza)
● Event sourcing
● Commit log (comparable with BookKeeper)
Equivalent to Kafka
Thank you

Apache Kafka + Zookeeper = 3.5 million writes per second


Editor's notes

  1. Event Sourcing ensures that all changes to application state are stored as a sequence of events. Not just can we query these events, we can also use the event log to reconstruct past states, and as a foundation to automatically adjust the state to cope with retroactive changes.