Apache Kafka Women Who Code Meetup
1. APACHE KAFKA & USECASES WITHIN
SEARCH SYSTEM @WALMARTLABS
- Snehal Nagmote @WalmartLabs
-WOMEN WHO CODE SUNNYVALE MEETUP
2. Today's Agenda
Apache Kafka Intro
Kafka Core Concepts
Kafka Producer - In Detail
Kafka Consumer - In Detail
Kafka Ecosystem
Kafka Streams and Kafka Connect
Search Use Cases @WalmartLabs
3. Kafka decouples Data Pipelines
What is Apache Kafka?
[Diagram: four source systems (producers) publish to Kafka brokers; the consumers are Hadoop, security systems, real-time monitoring, and a data warehouse.]
4. Key terminology
Kafka maintains feeds of messages in categories called
topics.
Processes that publish messages to a Kafka topic are
called producers.
Processes that subscribe to topics and process the feed of
published messages are called consumers.
Kafka is run as a cluster comprised of one or more servers
each of which is called a broker.
Communication between all components is done via a simple,
high-performance binary API over TCP.
7. Why is Kafka so fast?
Kafka achieves its high throughput and low latency primarily through two key
concepts:
1) Batching of individual messages to amortize network overhead and
append/consume chunks together
2) Zero-copy I/O via Java’s NIO FileChannel transferTo method, which on Linux
uses the sendfile() system call and skips unnecessary copies
It also relies heavily on the Linux page cache:
The I/O scheduler batches consecutive small writes into
bigger physical writes, which improves throughput.
The I/O scheduler attempts to re-sequence writes to minimize
movement of the disk head, which improves throughput.
The page cache automatically uses all the free memory on the machine.
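The zero-copy point can be illustrated with a minimal sketch of FileChannel.transferTo. This is not Kafka's code: Kafka transfers log segments to a network socket, while this sketch copies between two temporary files so that it stays self-contained, and whether the JVM actually uses sendfile() for a given pair of channels depends on the platform. The class name ZeroCopySketch and its methods are made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySketch {
    /** Copies src to dst with FileChannel.transferTo and returns the bytes moved. */
    public static long transfer(Path src, Path dst) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.WRITE)) {
            long position = 0;
            long size = in.size();
            // transferTo may move fewer bytes than requested, so loop until done.
            while (position < size) {
                position += in.transferTo(position, size - position, out);
            }
            return position;
        }
    }

    /** Writes a small temp file, transfers it, and returns the copied text. */
    public static String demo() {
        try {
            Path src = Files.createTempFile("segment", ".log");
            Path dst = Files.createTempFile("copy", ".log");
            Files.write(src, "hello kafka".getBytes(StandardCharsets.UTF_8));
            transfer(src, dst);
            return new String(Files.readAllBytes(dst), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The data never passes through a byte[] in user code, which is the part of the technique that matters here.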
10. Overview of Part 2: Kafka core concepts
A first look
Topics, partitions, replicas, offsets
Producers, brokers, consumers
Putting it all together
11. A first look
The who is who
Producers write data to brokers.
Consumers read data from brokers.
All this is distributed.
The data
Data is stored in topics.
Topics are split into partitions, which are replicated.
12. Topics
[Diagram: producers A1, A2, ... An append new messages to a topic on the broker(s); consumer groups C1 and C2 read from it, from older messages to newer ones.]
Producers always append to the “tail”.
Consumers use an “offset pointer” to track/control their read progress
(and decide the pace of consumption).
Topic: feed name to which messages are published
Retention policy: time-based or size-based
Configs: retention.ms (time-based) or retention.bytes (size-based)
13. Creation of Topic
Creating a topic
CLI:
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 3 --topic test-topic
Auto-create via auto.create.topics.enable = true
Modifying a topic
https://kafka.apache.org/documentation.html#basic_ops_modify_topic
Delete a topic
14. Partitions
A topic consists of partitions.
Partition: ordered + immutable sequence of
messages
15. Partitions
The number of partitions of a topic is configurable.
The number of partitions determines the maximum consumer (group)
parallelism.
Consumer group A, with 2 consumers, reads from a 4-partition topic
Consumer group B, with 4 consumers, reads from the same topic
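The two groups above can be sketched as a toy assignment function. This is a simplification under stated assumptions: Kafka's real assignors work differently in detail, and the class AssignmentSketch and its round-robin rule are hypothetical. It only shows that each partition has exactly one owner per group, so consumers beyond the partition count receive nothing.

```java
import java.util.ArrayList;
import java.util.List;

public class AssignmentSketch {
    /** Spreads partitions 0..numPartitions-1 over the consumers of one group, round-robin. */
    public static List<List<Integer>> assign(int numPartitions, int numConsumers) {
        List<List<Integer>> assignment = new ArrayList<>();
        for (int c = 0; c < numConsumers; c++) {
            assignment.add(new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            // Each partition gets exactly one owner within the group.
            assignment.get(p % numConsumers).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        System.out.println("Group A (2 consumers): " + assign(4, 2));
        System.out.println("Group B (4 consumers): " + assign(4, 4));
        System.out.println("6 consumers:           " + assign(4, 6));
    }
}
```

With 6 consumers on the 4-partition topic, the last two lists stay empty: adding consumers past the partition count does not add parallelism.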
16. Partition offsets
Offset: messages in the partitions are each assigned a
unique (per partition) and sequential id called the
offset
Consumers track their pointers via (offset, partition,
topic) tuples
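A minimal in-memory sketch of a single partition may help make offsets concrete. It is a toy model, not Kafka's storage or consumer API: PartitionLogSketch, append, and poll are hypothetical names, and real consumers fetch in batches and commit offsets separately.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartitionLogSketch {
    // Append-only sequence of messages; the list index plays the role of the offset.
    private final List<String> messages = new ArrayList<>();
    // Next offset to read, tracked separately for each consumer group.
    private final Map<String, Integer> groupOffsets = new HashMap<>();

    /** Appends at the tail and returns the offset assigned to the message. */
    public int append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    /** Returns the next unread message for the group, or null if it is caught up. */
    public String poll(String group) {
        int offset = groupOffsets.getOrDefault(group, 0);
        if (offset >= messages.size()) {
            return null;
        }
        // Reading advances only this group's pointer; the message stays in the log.
        groupOffsets.put(group, offset + 1);
        return messages.get(offset);
    }

    public static void main(String[] args) {
        PartitionLogSketch log = new PartitionLogSketch();
        log.append("a");
        log.append("b");
        System.out.println("C1 reads " + log.poll("C1"));
        System.out.println("C1 reads " + log.poll("C1"));
        System.out.println("C2 reads " + log.poll("C2"));
    }
}
```

Group C2 starts from offset 0 even though C1 has already read both messages, because each group keeps its own pointer.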
17. Replicas of a partition
Replicas: “backups” of a partition
They exist solely to prevent data loss.
Follower replicas are never read from or written to by clients; all reads and writes go through the partition leader.
They do NOT help to increase producer or consumer
parallelism!
Kafka tolerates (numReplicas - 1) dead brokers before
losing data
numReplicas == 2 -> 1 broker can die
18. Topics vs. Partitions vs. Replicas
http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/
19. Inspecting the current state of a topic
--describe the topic
Leader: brokerID of the currently elected leader broker
Replica IDs = broker IDs
ISR = “in-sync replica”, replicas that are in sync with the leader
In this example:
Broker 0 is leader for partition 1.
Broker 1 is leader for partitions 0 and 2.
All replicas are in-sync with their respective leader partitions.
$ kafka-topics.sh --zookeeper zookeeper1:2181 --describe --topic test-topic
Topic:test-topic PartitionCount:3 ReplicationFactor:2 Configs:
Topic: test-topic Partition: 0 Leader: 1 Replicas: 1,0 Isr: 1,0
Topic: test-topic Partition: 1 Leader: 0 Replicas: 0,1 Isr: 0,1
Topic: test-topic Partition: 2 Leader: 1 Replicas: 1,0 Isr: 1,0
20. Let’s recap
The who is who
Producers write data to brokers.
Consumers read data from brokers.
All this is distributed.
The data
Data is stored in topics.
Topics are split into partitions which are replicated.
22. Producers - Writing data to Kafka
Producers publish to a topic of their choosing (push)
Load can be distributed
Typically by “round-robin”
Can also do “semantic partitioning” based on a key in the
message
Brokers load balance by partition
Support async (less durable) sending
All nodes can answer metadata requests about:
Which servers are alive
Where leaders are for the partitions of a topic
23. Producer - 0.9
All calls are non-blocking and asynchronous
2 options for checking for failures:
Immediately block for response: send().get()
Do follow-up work in a Callback; close the producer after an error threshold
Be careful about buffering these failures. Future work? KAFKA-1955
Don’t forget to close the producer! producer.close() will block
until in-flight requests complete
retries (producer config) defaults to 0
message.send.max.retries (old Scala producer config) defaults to 3
In-flight requests could lead to message re-ordering
24. Understand the Kafka producer
[Diagram: the user thread calls send() with a record (topic="topic0", value="hello"); the record passes through the Serializer and the Partitioner, which uses the topic metadata to set partition=0; the compressor then appends it to a batch for (topic0, 0) in the Record Accumulator, next to the callbacks (CB). The accumulator's memory is shown as used / free.]
Tasks done by the user threads:
● Serialization
● Partitioning
● Compression
User: producer.send(new ProducerRecord("topic0", "hello"), callback);
25. Understand the Kafka producer
Sender:
1. polls batches from the batch queues (one batch / partition)
2. groups batches based on the leader broker
3. sends the grouped batches to the brokers
4. Pipelining if max.in.flight.requests.per.connection > 1
[Diagram: the sender thread drains batches from the Record Accumulator's per-partition queues (topic0, 0; topic0, 1; ... topic0, M), one batch per partition, and groups them into request 0 for Broker 0 and request 1 for Broker 1; the callbacks (CB) are kept with the batches.]
26. Producer APIs (0.9)
/**
 * Send the given record asynchronously and return a future which will eventually
 * contain the response information.
 * @return A future which will eventually contain the response information
 */
public Future<RecordMetadata> send(ProducerRecord<K, V> record);
/**
 * Send a record and invoke the given callback when the record has been
 * acknowledged by the server
 */
public Future<RecordMetadata> send(ProducerRecord<K, V> record, Callback callback);
27. Producer – Message Reordering
Message reordering might happen if:
● max.in.flight.requests.per.connection > 1, AND
● retries are enabled
To Prevent Reordering
● max.in.flight.requests.per.connection=1
● close producer in callback with close(0) on send failure
[Diagram: the producer sends message 0 and message 1 to the Kafka broker; message 0 fails and is retried, so it lands after message 1.]
28. Producer – Configs
batch.size – size based batching
linger.ms – time based batching
More batching -> better compression -> higher throughput, but higher latency
compression.type
max.in.flight.requests.per.connection (affects ordering)
acks (affects durability)
29. Producers – Data Loss
Caveats - the producer can cause data loss when
block.on.buffer.full=false
retries are exhausted
messages are sent without acks=all
Solutions
block.on.buffer.full=true, so the producer blocks instead of
dropping messages when its buffer is full
retries=Integer.MAX_VALUE
acks=all
resend in the callback when a message send fails
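The settings listed under Solutions can be collected into a producer configuration. The sketch below only builds a java.util.Properties object and does not create a producer, so it runs without the Kafka client library. The class name, the method name, and the localhost:9092 address are placeholders; retries is an int-valued setting, so the largest int is used. It also sets max.in.flight.requests.per.connection=1, the ordering safeguard from the reordering slide.

```java
import java.util.Properties;

public class DurableProducerConfigSketch {
    /** Builds producer settings that favor durability over throughput. */
    public static Properties durableProducerProps(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        // Wait until all in-sync replicas have the message.
        props.put("acks", "all");
        // Keep retrying instead of giving up after a fixed number of attempts.
        props.put("retries", String.valueOf(Integer.MAX_VALUE));
        // Block the caller when the buffer is full rather than failing the send.
        props.put("block.on.buffer.full", "true");
        // One request in flight at a time, so retries cannot reorder messages.
        props.put("max.in.flight.requests.per.connection", "1");
        return props;
    }

    public static void main(String[] args) {
        // "localhost:9092" is a placeholder broker address.
        durableProducerProps("localhost:9092")
                .forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

These choices trade latency and throughput for safety, in line with the acks table on the next slide.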
30. Producers – Ack Settings
Durability can be configured with the producer configuration
request.required.acks (named acks in the new Java producer)
0: The message is written to the network (buffer)
1: The message is written to the leader
all: The producer gets an ack after all ISRs receive the data; the
message is committed
acks      Throughput  Latency  Durability
0         high        low      No guarantee
1         medium      medium   Leader
-1 (all)  low         high     ISR
31. Write operations behind the scenes
When writing to a topic in Kafka, producers write directly to the partition
leaders (brokers) of that topic
Remember: Writes always go to the leader ISR of a partition!
This raises two questions:
How to know the “right” partition for a given topic?
How to know the current leader broker/replica of a partition?
32. 1) How to know the “right” partition when sending?
In Kafka, a producer (i.e. the client) decides to which target partition a
message will be sent.
Can be random ~ load balancing across receiving brokers.
Can be semantic based on message “key”, e.g. by user ID or domain name.
Here, Kafka guarantees that all data for the same key will go to the same partition, so
consumers can make locality assumptions.
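Key-based (“semantic”) partitioning can be sketched as a hash of the key modulo the partition count. This is an illustration only: Kafka's default partitioner hashes the serialized key bytes with murmur2, whereas this sketch uses String.hashCode as a stand-in, and KeyPartitionSketch and partitionFor are hypothetical names.

```java
public class KeyPartitionSketch {
    /** Maps a key to a partition in the range 0..numPartitions-1. */
    public static int partitionFor(String key, int numPartitions) {
        // Clear the sign bit so the modulo result is never negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        // The same key always maps to the same partition.
        System.out.println("user-42 -> " + partitionFor("user-42", numPartitions));
        System.out.println("user-42 -> " + partitionFor("user-42", numPartitions));
        System.out.println("user-7  -> " + partitionFor("user-7", numPartitions));
    }
}
```

The mapping holds only while the partition count stays fixed; changing the number of partitions changes which partition a key maps to.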
33. 2) How to know the current leader of a
partition?
Producers: broker discovery aka bootstrapping
Producers don’t talk to ZooKeeper, so it’s not through ZK.
Broker discovery is achieved by providing producers with a “bootstrapping”
broker list, cf. bootstrap.servers
These brokers inform the producer about all alive brokers and where to find current partition leaders. The
bootstrap brokers do use ZK for that.
Impacts on failure handling
In Kafka 0.8 the bootstrap list is static/immutable during producer run-time.
The current bootstrap approach has improved in Kafka 0.9. This change will make
the life of Ops easier.
34. Consumer – Reading data
Multiple Consumers can read from the same topic
Each consumer is responsible for managing its own offset
Messages stay on Kafka…they are not removed after
they are consumed
[Diagram: a log holding offsets 1234567 to 1234577; the producer sends and the broker writes at the tail, while three consumers each fetch from the log.]
35. Consumer
Consumers can go away
[Diagram: a log holding offsets 1234567 to 1234577; the producer sends and the broker writes at the tail, while only two consumers fetch.]
36. Consumer
And then come back
[Diagram: a log holding offsets 1234567 to 1234577; the producer sends and the broker writes at the tail, and three consumers fetch again.]
37. Consumer - Groups
Consumers can be organized into Consumer Groups
Common Patterns:
1) All consumer instances in one group
Acts like a traditional queue with load balancing
2) All consumer instances in different groups
All messages are broadcast to all consumer instances
3) “Logical Subscriber” – Many consumer instances in a
group
Consumers are added for scalability and fault tolerance
Each consumer instance reads from one or more partitions of a
topic
A group cannot have more active consumer instances than partitions; extra instances sit idle
38. Consumer - Groups
[Diagram: a Kafka cluster with partitions P0 to P3 spread across Broker 1 and Broker 2; consumers C1 to C6 are split between Consumer Group A and Consumer Group B.]
Consumer groups provide isolation to topics and partitions.
39. Consumer – Group Rebalancing
Consumer groups can rebalance themselves.
[Diagram: a Kafka cluster with partitions P0 to P3 on Broker 1 and Broker 2, read by Consumer Group A and Consumer Group B (consumers C1 to C6); one consumer is marked with an X.]
42. Delivery Semantics
At least once (default)
Messages are never lost but may be redelivered
At most once
Messages may be lost but are never redelivered
Exactly once
Messages are delivered once and only once
43. Delivery Semantics
At least once
Messages are never lost but may be redelivered
At most once
Messages may be lost but are never redelivered
Exactly once (much harder; impossible??)
Messages are delivered once and only once
44. Kafka + X for processing the data?
Kafka + Spark Streaming
Near real time indexing in Search (WalmartLabs)
Kafka Streams – Library for Stream Processing
Kafka + Storm/Heron often used in combination, e.g. Twitter
Samza (since Aug ’13) – also by LinkedIn
“Normal” Java multi-threaded setups
Akka actors with Scala or Java, e.g. Ooyala
Kafka + Camus/Gobblin for Kafka->Hadoop ingestion
https://cwiki.apache.org/confluence/display/KAFKA/Powered+By
45. Kafka Ecosystem
Kafka Connect - Used for writing sources and sinks that either
continuously ingest data into Kafka or continuously export data from
Kafka into external systems.
Kafka Streams - Built-in stream processing library of the Apache
Kafka project
Confluent Control Center -
Kafka Manager - A tool for managing Apache Kafka.
Kafka MirrorMaker - A tool to mirror a source Kafka cluster into a
target (mirror) Kafka cluster.
HiveKa – Hive queries on Kafka Topic http://hiveka.weebly.com/
Src : https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem
46. Search Use Cases @WalmartLabs
Filter Pick Up Today - Search filter that allows you to
buy online and pick up from a store
Easily scales up to 15-20k events/sec
Near real time indexing pipeline
Uses 9 Kafka topics, ~100s of GB of data
NRT Walmart Sponsored Ads
Application Log Processing
Hubble logs data - Collecting User Activity Data
During the holidays, scales up to almost 40-50k events/sec
47. Pick Up Today- Challenges
Processing
• 200 Million events/day
• 5k events/sec on avg
• 12k events/sec at peak
Storage
• 1 item ->5000 stores (5 million*5000)
• Fast retrieval