Apache Kafka + Zookeeper = 3.5 million writes per second

Trend of software development
Src: http://eil.stanford.edu/publications/david_liu/david_dissertation.pdf

Pre-Kafka days at LinkedIn
src: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

Pre-Kafka days at LinkedIn
Any problems? How would you solve them?

Existing Queues solutions
beanstalkd
Crossroads.io
Darner
Delayed::Job
Kestrel
queue_classic
Resque
RestMQ
Zaqar

Kafka is
● publish-subscribe messaging service
● distributed commit/write-ahead log
● decouples data pipelines
● ordered within each partition
Producers produce and consumers consume, at large scale and in a distributed way

Kafka characteristics
● fast - O(1)
● high throughput - 3.25 million writes per second
● scalable - (300+ brokers at LinkedIn)
● durable
● distributed
● replicated (fault tolerance)

Introducing Kafka at LinkedIn

Why not RabbitMQ/ActiveMQ/….?
● Existing queues - http://queues.io/
● For highly distributed messages, Kafka stands out.
● Messages are consumed in order within each partition.
● Client APIs for many languages, plus integration APIs.
● Fast reads (efficient use of the page cache) and fast writes (efficient transfer from the page cache to network sockets - zero-copy optimization)

Zero copy
src: https://www.ibm.com/developerworks/linux/library/j-zerocopy/
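
As a rough illustration of the zero-copy idea, the sketch below hands a file to the kernel for transfer to a socket using Python's socket.sendfile, which uses os.sendfile where the platform supports it and falls back to plain send otherwise. Kafka itself runs on the JVM, where the linked article describes FileChannel.transferTo for the same purpose. The function name send_log_segment, the sample data, and the in-process socket pair are assumptions made for the demo.

```python
import os
import socket
import tempfile


def send_log_segment(path: str, conn: socket.socket) -> int:
    """Send a whole file over a connected stream socket; return bytes sent."""
    with open(path, "rb") as segment:
        # socket.sendfile lets the kernel move file bytes to the socket
        # (via os.sendfile) when possible, avoiding a copy through user space.
        return conn.sendfile(segment)


def demo() -> bytes:
    # Hypothetical stand-in for a log segment on disk.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"msg-0\nmsg-1\nmsg-2\n")
        path = f.name
    broker_side, consumer_side = socket.socketpair()
    try:
        sent = send_log_segment(path, broker_side)
        broker_side.close()  # signals end-of-stream to the reader
        received = b""
        while chunk := consumer_side.recv(4096):
            received += chunk
        assert sent == len(received)
        return received
    finally:
        broker_side.close()
        consumer_side.close()
        os.remove(path)
```

The point of the design is that the application never touches the payload: it only tells the kernel which file to send and where.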

Timeline
● Originally developed at LinkedIn
● Open sourced in 2011, as version 0.6
● Graduated from the Apache Incubator - Oct 2012
● Written in Scala
● Latest stable - 0.8.2.0

Messaging Terminology
● A unit of data is called a message.
● Producers publish messages.
● Messages are stored in topics.
● Topics are partitioned and replicated across Kafka servers.
● Each Kafka server in a cluster is called a broker.
● Consumers consume messages from brokers.

Producers send messages over the network to the Kafka cluster, which in turn serves consumers.
[Diagram: producers connect over TCP to the brokers of a Kafka cluster; consumers connect to the brokers over TCP]

Topics
….
Flush messages to disk based on
● number of messages: log.flush.interval
● time: log.default.flush.interval.ms, topic.flush.intervals.ms
Remove old messages based on
● size: log.retention.size
● time: log.retention.hours
….
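
Size-based removal can be pictured as dropping the oldest log segments until the total fits under a limit. The sketch below is a simplified model, not Kafka's implementation: the function name enforce_size_retention and the use of byte strings as "segments" are assumptions for illustration.

```python
from collections import deque
from typing import Deque, List


def enforce_size_retention(segments: Deque[bytes], max_log_bytes: int) -> List[bytes]:
    """Drop the oldest segments until the log fits; return the dropped ones."""
    dropped: List[bytes] = []
    total = sum(len(s) for s in segments)
    while segments and total > max_log_bytes:
        oldest = segments.popleft()  # the oldest data goes first
        total -= len(oldest)
        dropped.append(oldest)
    return dropped
```

Removal works on whole segments, never on individual messages, which keeps deletion cheap.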

Partition
…. ….
Partition logic can be
● default (kafka.producer.DefaultPartitioner - based on a hash of the key)
● custom (by requirement, e.g. user-id - if we need more processing based on user-id)
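
Both styles of partition logic come down to mapping a key to a partition number. A minimal sketch, assuming CRC32 as a stand-in hash (Kafka's own partitioner uses its own hash function) and hypothetical function names:

```python
import zlib


def default_partition(key: bytes, num_partitions: int) -> int:
    """Hash-based choice: the same key always lands on the same partition."""
    # crc32 is a stand-in; it is NOT the hash Kafka's DefaultPartitioner uses.
    return zlib.crc32(key) % num_partitions


def user_partition(user_id: int, num_partitions: int) -> int:
    """Custom choice: keep all messages of one user in one partition."""
    return user_id % num_partitions
```

Because all messages with the same key go to one partition, per-partition ordering becomes per-key ordering.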

Partition
● Ordered, immutable sequence of messages
● Each message is assigned a unique offset
● Enables
o Horizontal scaling
o Parallel consumer reads (each partition is consumed in offset order)
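
A partition can be modelled as an append-only list in which a message's offset is simply its index. The class below is an in-memory sketch with hypothetical names; a real partition is stored as files on disk.

```python
from typing import List, Tuple


class PartitionLog:
    """Ordered, append-only sequence of messages addressed by offset."""

    def __init__(self) -> None:
        self._messages: List[bytes] = []

    def append(self, message: bytes) -> int:
        """Store a message at the end and return its offset."""
        offset = len(self._messages)
        self._messages.append(message)
        return offset

    def read(self, offset: int, max_messages: int = 10) -> List[Tuple[int, bytes]]:
        """Return up to max_messages (offset, message) pairs starting at offset."""
        chunk = self._messages[offset:offset + max_messages]
        return [(offset + i, m) for i, m in enumerate(chunk)]
```

Existing entries are never modified, so readers at different offsets do not interfere with each other.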

Replication
src: https://engineering.linkedin.com/kafka/intra-cluster-replication-apache-kafka

Producers - push
● batching
● compression
● sync (wait for ack) or async (automatic batching, e.g. send at 60k or after 10 ms)
● sequential writes - guaranteed order per partition
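
The async batching idea can be sketched as a buffer that is sent once it is big enough or old enough. The limits below read the slide's "60k or 10ms" as roughly 60 KB or 10 ms, which is an assumption, and the class name, parameters, and injected clock are hypothetical. For simplicity the age check runs only when a message is published; a real producer would use a background thread.

```python
import time
from typing import Callable, List


class BatchingProducer:
    """Buffers messages and sends them as one batch by size or age."""

    def __init__(self, send: Callable[[List[bytes]], None],
                 max_batch_bytes: int = 60_000, max_wait_s: float = 0.010,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._send = send
        self._max_batch_bytes = max_batch_bytes
        self._max_wait_s = max_wait_s
        self._clock = clock
        self._batch: List[bytes] = []
        self._batch_bytes = 0
        self._first_queued_at = 0.0

    def publish(self, message: bytes) -> None:
        if not self._batch:
            self._first_queued_at = self._clock()
        self._batch.append(message)
        self._batch_bytes += len(message)
        too_big = self._batch_bytes >= self._max_batch_bytes
        too_old = self._clock() - self._first_queued_at >= self._max_wait_s
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered, preserving publish order."""
        if self._batch:
            self._send(self._batch)
            self._batch = []
            self._batch_bytes = 0
```

Batching trades a little latency for far fewer network round trips.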

Consumers - pull
● Consumer group - consumers in a group share a topic's partitions (queue semantics)
● Position is an offset, controlled by the consumer and persisted at intervals to the topic __consumer_offsets
● Can rewind the offset
● Guaranteed order per partition
● More partitions enable more parallel reads
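
The pull model above can be sketched in two parts: spreading partitions over the members of a consumer group, and a consumer that tracks, commits, and rewinds its own offset. The round-robin assignment is a simplification and not necessarily the strategy Kafka uses; all names are hypothetical.

```python
from typing import Dict, List


def assign_partitions(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Give each partition to exactly one consumer in the group (round-robin)."""
    assignment: Dict[str, List[int]] = {c: [] for c in consumers}
    for i, partition in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


class PartitionConsumer:
    """Pulls from one partition and owns its read position."""

    def __init__(self, log: List[str]) -> None:
        self._log = log
        self.position = 0   # next offset to read
        self.committed = 0  # last position that was persisted

    def poll(self, max_messages: int = 2) -> List[str]:
        batch = self._log[self.position:self.position + max_messages]
        self.position += len(batch)
        return batch

    def commit(self) -> None:
        self.committed = self.position

    def seek(self, offset: int) -> None:
        """Rewind (or skip ahead) by moving the read position."""
        self.position = offset
```

Since one partition is read by one group member at a time, the number of partitions caps the parallelism of a group.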

Message delivery guarantees
● At most once - Read -> Save position -> Process
● At least once - Read -> Process -> Save position
● Exactly once - Write output and position in the same place, atomically
With Kafka:
● At least once - (Default)
● At most once - (Disable producer retries and save the offset before processing)
● Exactly once - Store the offset together with the output in the destination system
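
The difference between the three guarantees is only the order of "process" and "save position" around a crash. The simulation below is a toy model with hypothetical names: it crashes once at a chosen offset, restarts from the saved position, and returns what ended up processed.

```python
from typing import List, Optional


class Crash(Exception):
    """Simulated consumer failure between two steps."""


def run(log: List[str], mode: str, crash_at: int) -> List[str]:
    processed: List[str] = []
    saved = 0  # durable position that survives the crash

    def attempt(crash_offset: Optional[int]) -> None:
        nonlocal saved
        offset = saved
        while offset < len(log):
            message = log[offset]
            if mode == "at_most_once":
                saved = offset + 1            # save position first
                if offset == crash_offset:
                    raise Crash()
                processed.append(message)     # then process
            elif mode == "at_least_once":
                processed.append(message)     # process first
                if offset == crash_offset:
                    raise Crash()
                saved = offset + 1            # then save position
            else:  # "exactly_once"
                if offset == crash_offset:
                    raise Crash()
                # Output and position are modelled as ONE atomic write.
                processed.append(message)
                saved = offset + 1
            offset += 1

    try:
        attempt(crash_at)
    except Crash:
        attempt(None)  # restart from the saved position, no second crash
    return processed
```

With a log of a, b, c and a crash at offset 1, at-most-once loses b, at-least-once processes b twice, and the atomic variant processes each message once.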

QuickStart
● Download 0.8.2.0 and untar
● bin/zookeeper-server-start.sh config/zookeeper.properties
● bin/kafka-server-start.sh config/server.properties
● bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
● bin/kafka-topics.sh --list --zookeeper localhost:2181
● bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
● bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning

Setting multi-broker cluster
● cp config/server.properties config/server-1.properties
● cp config/server.properties config/server-2.properties
config/server-1.properties:
broker.id=1
port=9093
log.dir=/tmp/kafka-logs-1
config/server-2.properties:
broker.id=2
port=9094
log.dir=/tmp/kafka-logs-2

Setting multi-broker cluster (contd.)
● bin/kafka-server-start.sh config/server-1.properties
● bin/kafka-server-start.sh config/server-2.properties
● bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 3 --topic my-replicated-topic
● bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic
● bin/kafka-console-producer.sh --broker-list localhost:9092 --topic my-replicated-topic
● bin/kafka-console-consumer.sh --zookeeper localhost:2181 --from-beginning --topic my-replicated-topic

Use cases
● Messaging (comparable with ActiveMQ, RabbitMQ)
● Website activity tracking
● Operational monitoring
● Log Aggregation
● Stream processing (along with Storm/Samza)
● Event sourcing
● Commit log (comparable with BookKeeper)
