APACHE KAFKA & USE CASES WITHIN
SEARCH SYSTEM @WALMARTLABS
- Snehal Nagmote @WalmartLabs
-WOMEN WHO CODE SUNNYVALE MEETUP
Today's Agenda
 Apache Kafka Intro
 Kafka Core Concepts
 Kafka Producer - In Detail
 Kafka Consumer - In Detail
 Kafka Ecosystem
 Kafka Streams and Kafka Connect
 Search Use Cases @WalmartLabs
What is Apache Kafka?
Kafka decouples data pipelines.
[Diagram: multiple source systems publish data into Kafka via producers; brokers store it; consumers deliver it to Hadoop, security systems, real-time monitoring, and the data warehouse.]
Key terminology
 Kafka maintains feeds of messages in categories called
topics.
 Processes that publish messages to a Kafka topic are
called producers.
 Processes that subscribe to topics and process the feed of
published messages are called consumers.
 Kafka is run as a cluster of one or more servers, each of which is called a broker.
 Communication between all components uses a simple, high-performance binary API over TCP.
Architecture
[Diagram: producers publish to a Kafka cluster of brokers, coordinated by ZooKeeper; consumers read from the brokers.]
Kafka vs. Other Messaging Systems
src: https://softwaremill.com/mqperf/
Why is Kafka so Fast?
 Kafka achieves its high throughput and low latency primarily through two key concepts (see the sketch below):
 1) Batching of individual messages, to amortize network overhead and append/consume chunks together
 2) Zero-copy I/O using sendfile (Java’s NIO FileChannel.transferTo method)
 Uses the Linux sendfile() system call, which skips unnecessary copies
 Relies heavily on the Linux page cache, which automatically uses all the free memory on the machine
 The I/O scheduler batches consecutive small writes into bigger physical writes, which improves throughput
 The I/O scheduler attempts to re-sequence writes to minimize movement of the disk head, which also improves throughput
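To make the zero-copy idea concrete, here is a minimal Java sketch of the FileChannel.transferTo path Kafka relies on; the file name, host, and port are placeholder assumptions, not anything from Kafka itself:

import java.io.FileInputStream;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class ZeroCopyDemo {
    public static void main(String[] args) throws IOException {
        // Hypothetical log segment and destination; adjust for your setup.
        try (FileChannel file = new FileInputStream("segment.log").getChannel();
             SocketChannel socket = SocketChannel.open(
                     new InetSocketAddress("localhost", 9999))) {
            long position = 0;
            long remaining = file.size();
            // transferTo lets the kernel move bytes file -> socket directly
            // (sendfile on Linux), skipping the user-space copy entirely.
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}

Compared with a read-into-buffer-then-write loop, this avoids two extra copies and two context switches per chunk, which is where much of Kafka’s consumer-path throughput comes from.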
Zero Copy
[Diagram: the traditional multi-copy read/write path vs. the sendfile path, which copies from the page cache directly to the NIC buffer.]
Part 2: Kafka core concepts
Overview of Part 2: Kafka core concepts
 A first look
 Topics, partitions, replicas, offsets
 Producers, brokers, consumers
 Putting it all together
A first look
 The who is who
 Producers write data to brokers.
 Consumers read data from brokers.
 All this is distributed.
 The data
 Data is stored in topics.
 Topics are split into partitions, which are replicated.
Topics
 Topic: a feed name to which messages are published
 Producers (A1, A2, …, An) always append new messages to the “tail”; older messages sit behind newer ones
 Consumer groups (e.g. C1, C2) use an “offset pointer” to track/control their read progress (and decide the pace of consumption)
 Retention policy: time based and/or size based
 Configs: retention.ms (time based) or retention.bytes (size based)
Creation of Topic
 Creating a topic
 CLI (see below)
 Auto-create via auto.create.topics.enable = true
 Modifying a topic
 https://kafka.apache.org/documentation.html#basic_ops_modify_topic
 Deleting a topic
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 3 --topic test-topic
Partitions
 A topic consists of partitions.
 Partition: an ordered, immutable sequence of messages
Partitions
 The number of partitions per topic is configurable
 The partition count determines the maximum consumer (group) parallelism
 Consumer group A, with 2 consumers, reads from a 4-partition topic
 Consumer group B, with 4 consumers, reads from the same topic
Partition offsets
 Offset: messages in a partition are each assigned a unique (per-partition), sequential id called the offset
 Consumers track their position via (offset, partition, topic) tuples
Replicas of a partition
 Replicas: “backups” of a partition
 They exist solely to prevent data loss.
 Replicas are never read from and never written to by clients.
 They do NOT increase producer or consumer parallelism!
 Kafka tolerates (numReplicas - 1) dead brokers before losing data
 e.g. numReplicas == 2 → 1 broker can die
Topics vs. Partitions vs. Replicas
http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/
Inspecting the current state of a topic
 --describe the topic
 Leader: broker ID of the currently elected leader broker
 Replica IDs = broker IDs
 ISR = “in-sync replica”: replicas that are in sync with the leader
 In this example:
 Broker 0 is the leader for partition 1.
 Broker 1 is the leader for partitions 0 and 2.
 All replicas are in sync with their respective leader partitions.
$ kafka-topics.sh --zookeeper zookeeper1:2181 --describe --topic test-topic
Topic:test-topic PartitionCount:3 ReplicationFactor:2 Configs:
Topic: test-topic Partition: 0 Leader: 1 Replicas: 1,0 Isr: 1,0
Topic: test-topic Partition: 1 Leader: 0 Replicas: 0,1 Isr: 0,1
Topic: test-topic Partition: 2 Leader: 1 Replicas: 1,0 Isr: 1,0
Let’s recap
 The who is who
 Producers write data to brokers.
 Consumers read data from brokers.
 All this is distributed.
 The data
 Data is stored in topics.
 Topics are split into partitions which are replicated.
Kafka Producers
Producers – Writing data to Kafka
 Producers publish to a topic of their choosing (push)
 Load can be distributed
 Typically by “round-robin”
 Can also do “semantic partitioning” based on a key in the
message
 Brokers load balance by partition
 Support async (less durable) sending
 All nodes can answer metadata requests about:
 Which servers are alive
 Where leaders are for the partitions of a topic
Producer - 0.9
 All calls are non-blocking and async
 Two options for checking for failures (a sketch of both follows below):
 Immediately block for the response: send().get()
 Do follow-up work in a Callback; close the producer after an error threshold
 Be careful about buffering these failures. Future work? KAFKA-1955
 Don’t forget to close the producer! producer.close() will block until in-flight requests complete
 retries (producer config) defaults to 0
 message.send.max.retries (server config) defaults to 3
 In-flight requests could lead to message re-ordering
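A minimal sketch of the two failure-handling styles with the 0.9 Java producer; the broker address and topic name are placeholders, not anything prescribed by the deck:

import java.util.Properties;
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class ProducerDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // Option 1: block immediately for the result (simple, lower throughput).
        RecordMetadata meta =
                producer.send(new ProducerRecord<>("test-topic", "hello")).get();

        // Option 2: handle the outcome asynchronously in a callback.
        producer.send(new ProducerRecord<>("test-topic", "world"), new Callback() {
            public void onCompletion(RecordMetadata metadata, Exception e) {
                if (e != null) {
                    // Log, count toward an error threshold, or resend here.
                    e.printStackTrace();
                }
            }
        });

        // close() blocks until in-flight requests complete.
        producer.close();
    }
}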
Understand the Kafka producer
User thread: producer.send(new ProducerRecord(“topic0”, “hello”), callback);
 Serialization, partitioning, and compression are done by the user threads: using cached topic metadata, the Serializer and Partitioner turn (topic=“topic0”, value=“hello”) into (topic=“topic0”, partition=0, value=“hello”)
 The record and its callback are appended to a per-partition batch (batch 0, batch 1, …) in the Record Accumulator, whose compressor tracks used and free buffer space
Sender thread:
1. polls batches from the batch queues (one batch per partition)
2. groups batches based on the leader broker
3. sends the grouped batches to the brokers (e.g. request 0 to broker 0, request 1 to broker 1)
4. pipelines requests if max.in.flight.requests.per.connection > 1
Producer APIs – (0.9)
/**
 * Send the given record asynchronously and return a future which will
 * eventually contain the response information.
 * @return A future which will eventually contain the response information
 */
public Future<RecordMetadata> send(ProducerRecord<K, V> record);

/**
 * Send a record and invoke the given callback when the record has been
 * acknowledged by the server
 */
public Future<RecordMetadata> send(ProducerRecord<K, V> record, Callback callback);
Producer – Message Reordering
Message reordering might happen if:
● max.in.flight.requests.per.connection > 1, AND
● retries are enabled
(e.g. the broker receives message 1 while message 0 fails; the retried message 0 then lands after message 1)
To prevent reordering:
● max.in.flight.requests.per.connection=1
● close the producer in the callback with close(0) on send failure
Producer – Configs
 batch.size – size-based batching
 linger.ms – time-based batching
 More batching → better compression → higher throughput → higher latency
 compression.type
 max.in.flight.requests.per.connection (affects ordering)
 acks (affects durability)
(see the example configuration below)
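A hedged example of how these knobs might be set for a throughput-leaning but ordered producer; the specific values are illustrative assumptions, not recommendations from the deck:

import java.util.Properties;

public class ProducerConfigDemo {
    public static Properties throughputLeaningConfig() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        props.put("batch.size", "65536");        // size-based batching: 64 KB batches
        props.put("linger.ms", "20");            // time-based batching: wait up to 20 ms
        props.put("compression.type", "snappy"); // compress whole batches
        props.put("acks", "all");                // durability: wait for all ISRs
        props.put("max.in.flight.requests.per.connection", "1"); // preserve ordering
        props.put("retries", "3");               // retries are safe with in-flight = 1
        return props;
    }
}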
Producers – Data Loss
 Caveats – the producer can cause data loss when
 block.on.buffer.full=false
 retries are exhausted
 sending messages without acks=all
 Solutions
 block.on.buffer.full=true – make sure the producer does not throw away messages when the buffer is full
 retries=Long.MAX_VALUE
 acks=all
 resend in the callback when a message send fails
Producers – Ack Settings
 Durability can be configured with the producer configuration request.required.acks
 0: the message is written to the network (buffer) only
 1: the message is written to the leader
 all: the producer gets an ack after all ISRs receive the data; the message is committed

acks | Throughput | Latency | Durability
-----|------------|---------|-------------
 0   | high       | low     | no guarantee
 1   | medium     | medium  | leader only
 -1  | low        | high    | all ISRs
Write operations behind the scenes
 When writing to a topic in Kafka, producers write directly to the partition
leaders (brokers) of that topic
 Remember: writes always go to the leader replica of a partition (which is in the ISR)!
 This raises two questions:
 How to know the “right” partition for a given topic?
 How to know the current leader broker/replica of a partition?
1) How to know the “right” partition when sending?
 In Kafka, the producer – i.e. the client – decides to which target partition a message will be sent.
 Can be random ~ load balancing across receiving brokers.
 Can be semantic, based on a message “key”, e.g. by user ID or domain name.
 Here, Kafka guarantees that all data for the same key will go to the same partition, so consumers can make locality assumptions (see the sketch below).
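A small sketch of semantic partitioning with a keyed record; the topic name, helper, and user-ID key are made up for the example. The default partitioner hashes a non-null key, so every record for the same user ID lands in the same partition:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedSendDemo {
    static void sendUserEvent(KafkaProducer<String, String> producer,
                              String userId, String event) {
        // With a non-null key, the default partitioner hashes the key,
        // so all events for userId go to the same partition, in order.
        producer.send(new ProducerRecord<>("user-events", userId, event));
    }
}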
2) How to know the current leader of a
partition?
 Producers: broker discovery aka bootstrapping
 Producers don’t talk to ZooKeeper, so it’s not through ZK.
 Broker discovery is achieved by providing producers with a “bootstrapping”
broker list, cf. bootstrap.servers
 These brokers inform the producer about all alive brokers and where to find current partition leaders. The
bootstrap brokers do use ZK for that.
 Impacts on failure handling
 In Kafka 0.8 the bootstrap list is static/immutable during producer run-time.
 The current bootstrap approach has improved in Kafka 0.9. This change will make
the life of Ops easier.
Consumer – Reading data
 Multiple consumers can read from the same topic
 Each consumer is responsible for managing its own offset
 Messages stay on Kafka… they are not removed after they are consumed
[Diagram: a producer writes messages at offsets 1234567–1234577; three consumers fetch independently, each tracking its own position in the log.]
Consumer
 Consumers can go away…
[Diagram: the same log, with one consumer gone and the remaining two still fetching.]
Consumer
 …and then come back
[Diagram: the consumer returns and resumes fetching from its stored offset.]
Consumer - Groups
 Consumers can be organized into Consumer Groups
Common Patterns:
 1) All consumer instances in one group
 Acts like a traditional queue with load balancing
 2) All consumer instances in different groups
 All messages are broadcast to all consumer instances
 3) “Logical Subscriber” – Many consumer instances in a
group
 Consumers are added for scalability and fault tolerance
 Each consumer instance reads from one or more partitions for a
topic
 There cannot be more active consumer instances in a group than partitions (extra instances sit idle)
Consumer - Groups
[Diagram: a Kafka cluster (brokers 1 and 2) with partitions P0–P3; Consumer Group A (C1, C2) and Consumer Group B (C3–C6) each consume all four partitions independently.]
 Consumer groups provide isolation to topics and partitions
Consumer – Group Rebalancing
 Consumer groups can rebalance themselves
[Diagram: when a consumer in Group A dies, its partitions are automatically reassigned to the group’s remaining consumers.]
Rebalancing – In detail
 New consumer Java API in 0.9.0
 The pattern: configure → subscribe → process (see the sketch below)
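A minimal configure/subscribe/process loop with the 0.9 consumer API; the group id, topic, and broker address are placeholder assumptions:

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerDemo {
    public static void main(String[] args) {
        // Configure
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "search-indexer");          // placeholder
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Subscribe: the group coordinator assigns partitions and
            // rebalances them automatically as members join or leave.
            consumer.subscribe(Arrays.asList("test-topic"));

            // Process
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(100);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}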
Delivery Semantics
 At least once (the default)
 Messages are never lost but may be redelivered
 At most once
 Messages may be lost but are never redelivered
 Exactly once
 Messages are delivered once and only once – much harder (impossible??)
(a commit-placement sketch follows below)
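Where the consumer commits its offsets determines which of the first two semantics you get. A hedged sketch, assuming enable.auto.commit=false; processRecord is a stand-in for your own logic:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DeliverySemanticsDemo {
    // At least once: process first, commit after. A crash between the two
    // replays the uncommitted records on restart (duplicates possible).
    static void atLeastOnce(KafkaConsumer<String, String> consumer) {
        ConsumerRecords<String, String> records = consumer.poll(100);
        for (ConsumerRecord<String, String> r : records) {
            processRecord(r);
        }
        consumer.commitSync();
    }

    // At most once: commit first, process after. A crash between the two
    // skips the committed-but-unprocessed records (loss possible).
    static void atMostOnce(KafkaConsumer<String, String> consumer) {
        ConsumerRecords<String, String> records = consumer.poll(100);
        consumer.commitSync();
        for (ConsumerRecord<String, String> r : records) {
            processRecord(r);
        }
    }

    static void processRecord(ConsumerRecord<String, String> r) {
        System.out.println(r.value()); // stand-in for real processing
    }
}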
Kafka + X for processing the data?
 Kafka + Spark Streaming
 Near-real-time indexing in Search (WalmartLabs)
 Kafka Streams – library for stream processing
 Kafka + Storm/Heron, often used in combination, e.g. Twitter
 Samza (since Aug ’13) – also by LinkedIn
 “Normal” Java multi-threaded setups
 Akka actors with Scala or Java, e.g. Ooyala
 Kafka + Camus/Gobblin for Kafka → Hadoop ingestion
https://cwiki.apache.org/confluence/display/KAFKA/Powered+By
Kafka Ecosystem
 Kafka Connect – used for writing sources and sinks that either continuously ingest data into Kafka or continuously ingest data from Kafka into external systems
 Kafka Streams – built-in stream-processing library of the Apache Kafka project
 Confluent Control Center – Confluent’s web-based tool for managing and monitoring Kafka
 Kafka Manager – a tool for managing Apache Kafka
 Kafka MirrorMaker – a tool to mirror a source Kafka cluster into a target (mirror) Kafka cluster
 HiveKa – Hive queries on Kafka topics http://hiveka.weebly.com/
Src : https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem
Search Use Cases @WalmartLabs
 Filter “Pick Up Today” – search filter that lets you buy online and pick up from a store
 Easily scales up to 15-20k events/sec
 Near-real-time indexing pipeline
 Uses 9 Kafka topics, ~hundreds of GB of data
 NRT Walmart Sponsored Ads
 Application log processing
 Hubble log data – collecting user-activity data
 During holidays, scales up to almost 40-50k events/sec
Pick Up Today – Challenges
 Processing
 • 200 million events/day
 • 5k events/sec on average
 • 12k events/sec at peak
 Storage
 • 1 item → 5,000 stores (5 million items × 5,000 stores = 25 billion item-store entries)
 • Fast retrieval
Pick Up Today – High Level Overview
Implementation Considerations
 Reprocessing
 Delivery Semantics
 Recovery from Streaming Failures
 Monitoring
 Consumer Lag Monitoring
 Burrow (from LinkedIn)
 Consumer Offset Checker
Streaming Component Interaction
Questions?