Kafkaesque days
at LinkedIn in 2015
Joel Koshy
Kafka Summit 2016
Kafkaesque
adjective Kaf·ka·esque ˌkäf-kə-ˈesk, ˌkaf-
: of, relating to, or suggestive of Franz Kafka or his writings; especially : having
a nightmarishly complex, bizarre, or illogical quality
Merriam-Webster
Kafka @ LinkedIn
What @bonkoif said:
More clusters
More use-cases
More problems …
Kafka @ LinkedIn
Incidents that we will cover
● Offset rewinds
● Data loss
● Cluster unavailability
● (In)compatibility
● Blackout
Offset rewinds
What are offset rewinds?
[Diagram: the log covers a range of valid offsets; offsets below that range point to already-purged messages and offsets above it to messages yet to arrive, so both are invalid]
If a consumer gets an OffsetOutOfRangeException:
What are offset rewinds?
[Diagram: on an OffsetOutOfRangeException, auto.offset.reset ← earliest resets the consumer to the beginning of the valid range, while auto.offset.reset ← latest resets it to the end]
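As a concrete illustration, here is how that reset policy is set on the new Java consumer (0.9+). This is a minimal sketch; the broker address, group id, and topic are made up, and note that the old Scala consumer uses the values smallest/largest instead.

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OffsetResetExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");   // made-up broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "mirror-maker");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        // On an out-of-range offset: "earliest" jumps to the start of the valid range
        // (re-consumes old data), "latest" jumps to the end (skips messages).
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Arrays.asList("some-log_event"));
            // poll() as usual ...
        }
    }
}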
What are offset rewinds… and why do they matter?
[Diagram: a Hadoop push job produces to Kafka (CORP); a mirror maker mirrors the data to Kafka (PROD), from which Stork consumes to send email campaigns]
Real-life incident courtesy of xkcd
offset rewind
Offset rewinds: the first incident
Deployment: Deployed Multiproduct kafka-mirror-maker 0.1.13 to DCX by jkoshy
CRT Notifications <crt-notifications-noreply@linkedin.com> Fri, Jul 10, 2015 at 8:27 PM
Multiproduct 0.1.13 of kafka-mirror-maker has been Deployed to DCX by jkoshy on Wednesday, Jul 8, 2015 at 10:14 AM
(The notification arrived on Friday, Jul 10, two days after the deployment it describes: a duplicate email.)
What are offset rewinds… and why do they matter?
[Same pipeline diagram, annotated: good practice to have some filtering logic here]
Offset rewinds: detection - just use this
[Monitoring dashboard screenshots]
Offset rewinds: a typical cause
[Diagram: the consumer's position initially lies within the log's valid offset range]
Unclean leader election truncates the log
… and the consumer's offset goes out of range
But there were no ULEs when this happened
… and we set auto.offset.reset to latest
Offset management - a quick overview
[Diagram: consumers in a consumer group consume from the brokers via fetch requests]
[Diagram: each consumer periodically sends an OffsetCommitRequest to the offset manager (a broker)]
[Diagram: after a rebalance, each consumer sends an OffsetFetchRequest to the offset manager]
Offset management - a quick overview
__consumer_offsets topic (group, topic-partition, offset):
  mirror-maker   PageViewEvent-0   240
  mirror-maker   LoginEvent-8      456
  mirror-maker   LoginEvent-8      512
  mirror-maker   PageViewEvent-0   321
● New offset commits append to the topic
● The offset manager maintains an offset cache to serve offset fetch requests quickly:
  mirror-maker   PageViewEvent-0   321
  mirror-maker   LoginEvent-8      512
● Old offsets are purged via log compaction
● When a new broker becomes the leader (i.e., offset manager) it loads offsets from the topic into its cache
See this deck for more details
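To make the commit/fetch round trip concrete, here is a minimal sketch using the new Java consumer (0.9+). The broker address, group id, topic, and offset value are illustrative; commitSync() issues an OffsetCommitRequest, and committed() an OffsetFetchRequest answered from the offset manager's cache.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class OffsetCommitFetchExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");   // made-up broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "mirror-maker");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        TopicPartition tp = new TopicPartition("PageViewEvent", 0);
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));

            // OffsetCommitRequest: appends (group, topic-partition, offset) to __consumer_offsets.
            consumer.commitSync(Collections.singletonMap(tp, new OffsetAndMetadata(321L)));

            // OffsetFetchRequest: served from the offset manager's cache.
            OffsetAndMetadata committed = consumer.committed(tp);
            System.out.println("committed offset = " + (committed == null ? "none" : committed.offset()));
        }
    }
}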
Back to the incident…
... <rebalance>
2015/07/10 02:08:14.252 [some-log_event,13], initOffset 9581205
... <rebalance>
2015/07/10 02:24:11.965 [some-log_event,13], initOffset 9581223
... <rebalance>
2015/07/10 02:32:16.131 [some-log_event,13], initOffset 6811737
...
2015/07/10 02:32:16.174 [ConsumerFetcherThread] [ConsumerFetcherThread-mirror-maker-9de01f48-0-287],
Current offset 6811737 for partition [some-log_event,13] out of range; reset offset to 9581225
While debugging offset rewinds, do this first: dump the offsets topic
./bin/kafka-console-consumer.sh --topic __consumer_offsets --zookeeper <zookeeperConnect> \
  --formatter 'kafka.coordinator.GroupMetadataManager$OffsetsMessageFormatter' \
  --consumer.config config/consumer.properties
(must set exclude.internal.topics=false in consumer.properties)
Inside the __consumer_offsets topic
...
[mirror-maker,metrics_event,1]::OffsetAndMetadata[83511737,NO_METADATA,1433178005711]
[mirror-maker,some-log_event,13]::OffsetAndMetadata[6811737,NO_METADATA,1433178005711]    ← Jun 1 !!
...
[mirror-maker,some-log_event,13]::OffsetAndMetadata[9581223,NO_METADATA,1436495051231]    ← Jul 10 (today)
...
So why did the offset manager return a stale offset?
Offset manager logs:
2015/07/10 02:31:57.941 ERROR [OffsetManager] [kafka-scheduler-1] [kafka-server] [] [Offset Manager on Broker 191]: Error in loading offsets from [__consumer_offsets,63]
java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.String
    at kafka.server.OffsetManager$.kafka$server$OffsetManager$$readMessageValue(OffsetManager.scala:576)
The leader had moved, and the new offset manager hit KAFKA-2117 while loading offsets: the load failed after reading only the old offsets, so the cache was left with the stale entry ([mirror-maker,some-log_event,13] → 6811737) instead of the recent one.
… caused a ton of offset resets
2015/07/10 02:08:14.252 [some-log_event,13], initOffset 9581205
...
2015/07/10 02:24:11.965 [some-log_event,13], initOffset 9581223
...
2015/07/10 02:32:16.131 [some-log_event,13], initOffset 6811737
...
2015/07/10 02:32:16.174 [ConsumerFetcherThread] [ConsumerFetcherThread-mirror-maker-9de01f48-0-287],
Current offset 6811737 for partition [some-log_event,13] out of range; reset offset to 9581225
[Diagram: the log for [some-log_event,13] at the time; the stale offset 6811737 fell in the purged region below the log start, and the log end offset was 9581225]
… but why the duplicate email?
2015/07/10 02:08:15.524 [crt-event,12], initOffset 11464
...
2015/07/10 02:31:40.827 [crt-event,12], initOffset 11464
...
2015/07/10 02:32:17.739 [crt-event,12], initOffset 9539    ← also from Jun 1
...
[Diagram: the valid offset range for [crt-event,12] spanned 0 to 11464, so the stale offset 9539 was old … but still valid!]
⇒ the mirror maker rewound to 9539 and re-copied those messages, which re-triggered the deployment email
Time-based retention does not work well for low-volume topics
Addressed by KIP-32/KIP-33
Offset rewinds: the second incident
Mirror makers got wedged → were restarted → sent duplicate emails to (a few) members
Consumer logs
2015/04/29 17:22:48.952 <rebalance started>
...
2015/04/29 17:36:37.790 <rebalance ended> initOffset -1 (for various partitions)
Broker (offset manager) logs
2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84]
...
2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!)
⇒ log cleaner had failed a while ago…
but why did offset fetch return -1?
Offset management - a quick overview
How are stale offsets (for dead consumers) cleaned up?
Offset cache:
  dead-group     PageViewEvent-0   321   (timestamp older than a week)
  active-group   LoginEvent-8      512   (recent timestamp)
  …              …                 …
A cleanup task periodically scans the cache, appends tombstones to __consumer_offsets for dead-group, and deletes its entry from the offset cache.
Back to the incident...
2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84]
...
2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!)
[Diagram: partway through the load, the cache holds old offsets with very old timestamps (e.g., mirror-maker PageViewEvent-0 → 45, LoginEvent-8 → 12)]
The cleanup task happened to run during the load
[Diagram: the cleanup appended tombstones for those apparently-dead entries, which ended up overriding the most recent offsets (PageViewEvent-0 → 321, LoginEvent-8 → 512)]
Root cause of this rewind
● Log cleaner had failed (separate bug)
○ ⇒ offsets topic grew big
○ ⇒ offset load on leader movement took a while
● Cache cleanup ran during the load
○ which appended tombstones
○ and overrode the most recent offsets
● (Fixed in KAFKA-2163)
Offset rewinds: wrapping it up
● Monitor log cleaner health
● If you suspect a rewind:
○ Check for unclean leader elections
○ Check for offset manager movement (i.e., __consumer_offsets partitions had leader changes)
○ Take a dump of the offsets topic
○ … stare long and hard at the logs (both consumer and offset manager)
● auto.offset.reset ← closest ?
● Better lag monitoring via Burrow
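To automate part of the "stare at the logs" step, a small script over the offsets-topic dump can flag offsets that move backwards. Below is a minimal sketch that reads the OffsetsMessageFormatter output (the format shown earlier) from stdin; the class is illustrative, not an actual LinkedIn tool.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RewindDetector {
    // e.g. [mirror-maker,some-log_event,13]::OffsetAndMetadata[9581223,NO_METADATA,1436495051231]
    private static final Pattern LINE = Pattern.compile(
            "\\[(.+?),(.+?),(\\d+)\\]::OffsetAndMetadata\\[(\\d+),.*");

    public static void main(String[] args) throws Exception {
        Map<String, Long> lastOffset = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = LINE.matcher(line.trim());
                if (!m.matches()) continue;
                String key = m.group(1) + "/" + m.group(2) + "-" + m.group(3);   // group/topic-partition
                long offset = Long.parseLong(m.group(4));
                Long previous = lastOffset.put(key, offset);
                if (previous != null && offset < previous) {
                    System.out.println("possible rewind for " + key + ": " + previous + " -> " + offset);
                }
            }
        }
    }
}

Since commits for a given group and partition appear in the dump in append order, a decreasing offset is a candidate rewind; deliberate resets will also show up, so treat it as a signal to investigate rather than proof.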
Critical data loss
[Diagram: four data centers: PROD-A, PROD-B, CORP-X, CORP-Y]
Data loss: the first incident
[Diagram: in each PROD data center (PROD-A, PROD-B), producers write to a local Kafka cluster, which is mirrored into an aggregate Kafka cluster; the aggregate Kafka clusters in the CORP data centers (CORP-X, CORP-Y) also mirror this data and feed Hadoop]
Audit trail
[Same pipeline diagram: alongside the data, each tier also emits audit counts]
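Conceptually, each tier counts the events it handles per topic and time bucket and publishes those counts to an audit topic; a front-end then compares counts across tiers. Below is a minimal sketch of the counting side using the Java producer. The tier name, the audit topic name ("KafkaAuditEvent"), the bucket size, and the plain-string payload are made up for illustration and differ from LinkedIn's actual auditor and audit schema.

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AuditCounter {
    private final Map<String, Long> counts = new HashMap<>();
    private final KafkaProducer<String, String> producer;
    private final String tier;

    public AuditCounter(String bootstrapServers, String tier) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
        this.tier = tier;
    }

    // Call this for every event the tier handles.
    public void record(String topic, long eventTimeMs) {
        long bucket = eventTimeMs / (10 * 60 * 1000L);           // 10-minute buckets
        counts.merge(topic + "@" + bucket, 1L, Long::sum);
    }

    // Periodically flush the counts to the audit topic.
    public void emit() {
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            String payload = tier + "," + e.getKey() + "," + e.getValue();
            producer.send(new ProducerRecord<>("KafkaAuditEvent", e.getKey(), payload));
        }
        counts.clear();
        producer.flush();
    }

    public void close() {
        producer.close();
    }
}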
Data loss: detection (example 1)
[Audit dashboard over the pipeline diagram, showing a count discrepancy]
Data loss: detection (example 2)
[Audit dashboard over the pipeline diagram, showing a count discrepancy]
Data loss or audit issue? (The actual incident)
[Pipeline diagram with audit checks: sporadic discrepancies in the Kafka-aggregate-CORP-X counts for several topics; however, the Hadoop-X tier is complete]
Verified actual data completeness by recounting events in a few low-volume topics
… so definitely an audit-only issue
Likely caused by dropping audit events
Possible sources of discrepancy:
● Cluster auditor
● Cluster itself (i.e., data loss in audit topic)
● Audit front-end
Possible causes
[Diagram: the cluster auditor consumes all topics from the CORP-X aggregate cluster and emits audit counts; the audit front-end consumes the audit topic and inserts the counts into the audit DB]
Cluster auditor
● Counting incorrectly
  ○ but same version of auditor everywhere and only CORP-X has issues
● Not consuming all data for audit or failing to send all audit events
  ○ but no errors in auditor logs
● … and auditor bounces did not help
Data loss in audit topic
● … but no unclean leader elections
● … and no data loss in sampled topics (counted manually)
Audit front-end fails to insert audit events into DB
● … but other tiers (e.g., CORP-Y) are correct
● … and no errors in logs
Attempt to reproduce
● Emit counts to new test tier
[Diagram: a second cluster auditor instance consumes all topics from CORP-X and emits its counts under tier test, alongside the existing CORP-X tier]
… fortunately worked:
● test tier counts were also sporadically off
… and debug
[Same diagram: the test-tier cluster auditor consuming all topics from CORP-X]
● Enabled select TRACE logs to log audit events before sending
● Audit counts were correct
● … and successfully emitted
● Verified from broker public access logs that the audit event was sent
● … but on closer look realized it was not the leader for that partition of the audit topic
● So why did it not return NotLeaderForPartition?
That broker was part of another cluster!
[Diagram: the audit events were being siphoned into some other Kafka cluster]
… and we had a VIP misconfiguration
[Diagram: the VIP in front of the CORP-X aggregate cluster contained a stray broker entry belonging to some other Kafka cluster]
So audit events leaked into the other cluster
[Diagram: the audit-topic metadata request goes through the VIP, so the metadata response can come from the other cluster; audit counts are then emitted there]
● Auditor still uses the old producer
● Periodically refreshes metadata (via VIP) for the audit topic
● ⇒ sometimes fetches metadata from the other cluster
● and leaks audit events to that cluster until at least the next metadata refresh
Some takeaways
● Could have been worse if mirror-makers to CORP-X had been bounced
○ (Since mirror makers could have started siphoning actual data to the other cluster)
● Consider using round-robin DNS instead of VIPs
○ … which is also necessary for using per-IP connection limits
Data loss: the second incident
Prolonged period of data loss from our Kafka REST proxy
Alerts fire that a broker in tracking cluster had gone offline
NOC engages SYSOPS to investigate
NOC engages Feed SREs and Kafka SREs to investigate drop (not loss) in a subset of page views
On investigation, Kafka SRE finds no problems with Kafka (excluding the down broker), but notes an overall drop in
tracking messages starting shortly after the broker failure
NOC engages Traffic SRE to investigate why their tracking events had stopped
Traffic SREs say that they don't see errors on their side, and add that they use the Kafka REST proxy
Kafka SRE finds no immediate errors in Kafka REST logs but bounces the service as a precautionary measure
Tracking events return to normal (expected) counts after the bounce
Prolonged period of data loss from our Kafka REST proxy
Reproducing the issue
[Diagram: a producer-performance client writing to brokers A and B; the producer's accumulator holds per-partition batches, and the sender keeps in-flight requests to the partition leaders]
● Isolate the broker that leads partition 1 using iptables; leadership moves to the other broker
● The new producer did not implement a request timeout
  ○ ⇒ it keeps awaiting the response from the old leader
  ○ ⇒ and stays unaware of the leader change until the next metadata refresh
● So the client continues to send to partition 1
● Batches pile up for partition 1 and eat up accumulator memory
● Subsequent sends drop or block, per the block.on.buffer.full config
Reproducing the issue
● netstat
tcp 0 0 ::ffff:127.0.0.1:35938 ::ffff:127.0.0.1:9092 ESTABLISHED 3704/java
● Producer metrics
○ zero retry/error rate
● Thread dump
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(long, TimeUnit)
org.apache.kafka.clients.producer.internals.BufferPool.allocate(int)
org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[],
CompressionType, Callback)
● Resolved by KAFKA-2120 (KIP-19)
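For reference, a producer configured after that fix can bound both how long an in-flight request may go unanswered and how long send() may block on a full accumulator. A minimal sketch; the broker names and timeout values are illustrative, not recommendations.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

public class TimeoutAwareProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-a:9092,broker-b:9092");   // made-up brokers
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        // Fail an in-flight request if no response arrives in time, so the producer
        // refreshes metadata and discovers the new leader instead of waiting forever.
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, "30000");
        // Bound how long send() may block when the accumulator is full
        // (replaces the older block.on.buffer.full behavior).
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "10000");

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            // produce as usual; timed-out sends now surface as exceptions in the send callback
            // rather than piling up silently in the accumulator
        }
    }
}

With a request timeout in place, the stuck in-flight request eventually fails, the producer refreshes its metadata and finds the new leader, and batches no longer pile up silently.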
Cluster unavailability
(This is an abridged version of my earlier talk.)
The incident
Occurred a few days after upgrading to pick up quotas and SSL
[Timeline: x25 → x38, picking up multi-port support (KAFKA-1809, KAFKA-1928), various quota patches, and SSL (KAFKA-1690); dates shown: April 5, June 3, August 18, October 13]
The incident
Broker (which happened to be controller) failed in our queuing Kafka cluster
The incident
Multiple applications begin to report “issues”: socket timeouts to Kafka cluster
Posts search was one such impacted application
The incident
Two brokers report high request and response queue sizes
The incident
Two brokers report high request queue size and request latencies
The incident
● Other observations
○ High CPU load on those brokers
○ Throughput degrades to ~ half the normal throughput
○ Tons of broken pipe exceptions in server logs
○ Application owners report socket timeouts in their logs
Remediation
Shifted site traffic to another data center
“Kafka outage ⇒ member impact”
Multi-colo is critical!
Remediation
● Controller moves did not help
● Firewall the affected brokers
● The above helped, but cluster fell over again after dropping the rules
● Suspect misbehaving clients on broker failure
○ … but x25 never exhibited this issue
sudo iptables -A INPUT -p tcp --dport <broker-port> -s <other-broker> -j ACCEPT
sudo iptables -A INPUT -p tcp --dport <broker-port> -j DROP
Remediation
Friday night ⇒ roll back to x25 and debug later
… but SREs had to babysit the rollback
[Diagram: rolling downgrade, one broker at a time: move leaders off the broker, firewall it, downgrade it to x25, then move leaders back]
● Test cluster
○ Tried killing controller
○ Multiple rolling bounces
○ Could not reproduce
● Upgraded the queuing cluster to x38 again
○ Could not reproduce
● So nothing…
Attempts at reproducing the issue
Unraveling queue backups…
Life-cycle of a Kafka request
[Diagram: network layer: an acceptor hands new client connections to processors; processors read requests into the request queue; API handler threads handle them, parking long-poll requests in purgatory and holding quota-violating responses in the quota manager; responses go onto per-processor response queues, and the processors write them back to the clients]
Total time = queue-time (await handling) + local-time + remote-time (handle request) + quota-time (hold if quota violated) + response-queue-time (await processor) + response-send-time (write response)
Investigating high request times
● First look for high local time
○ then high response send time
■ then high remote (purgatory) time → generally non-issue (but caveats described later)
● High request queue/response queue times are effects, not causes
High local times during incident (e.g., fetch)
How are fetch requests handled?
● Get physical offsets to be read from local log during response
● If fetch from follower (i.e., replica fetch):
○ If follower was out of ISR and just caught-up then expand ISR (ZooKeeper write)
○ Maybe satisfy eligible delayed produce requests (with acks -1)
● Else (i.e., consumer fetch):
○ Record/update byte-rate of this client
○ Throttle the request on quota violation
Could these cause high local times?
● Get physical offsets to be read from local log during response → should be fast
● If fetch from follower (i.e., replica fetch):
  ○ If follower was out of ISR and just caught-up then expand ISR (ZooKeeper write) → should be fast
  ○ Maybe satisfy eligible delayed produce requests (with acks -1) → not using acks -1
● Else (i.e., consumer fetch):
  ○ Record/update byte-rate of this client → test this…
  ○ Throttle the request on quota violation → delayed outside API thread
Maintains byte-rate metrics on a per-client-id basis
2015/10/10 03:20:08.393 [] [] [] [logger] Completed request:Name: FetchRequest; Version: 0;
CorrelationId: 0; ClientId: 2c27cc8b_ccb7_42ae_98b6_51ea4b4dccf2; ReplicaId: -1; MaxWait: 0
ms; MinBytes: 0 bytes from connection <clientIP>:<brokerPort>-<localAddr>;totalTime:6589,
requestQueueTime:6589,localTime:0,remoteTime:0,responseQueueTime:0,sendTime:0,
securityProtocol:PLAINTEXT,principal:ANONYMOUS
Quota metrics
??!
Quota metrics - a quick benchmark
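// Benchmark sketch: time recordAndMaybeThrottle for N distinct client-ids
// (the quota manager keeps a separate sensor/metric per client-id).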
for (clientId ← 0 until N) {
timer.time {
quotaMetrics.recordAndMaybeThrottle(clientId, 0, DefaultCallBack)
}
}
Quota metrics - a quick benchmark
[Benchmark graphs: the time taken by recordAndMaybeThrottle grows with the number of distinct client-id metrics]
Fixed in KAFKA-2664
meanwhile in our queuing cluster…
[Graphs, annotated: due to climbing client-id counts]
Rolling bounce of cluster forced the issue to recur on brokers that had high client-id metric counts
○ Used jmxterm to check per-client-id metric counts before experiment
○ Hooked up profiler to verify during incident
■ Generally avoid profiling/heapdumps in production due to interference
○ Did not see in earlier rolling bounce due to only a few client-id metrics at the time
How to fix high local times
● Optimize the request’s handling, e.g.:
○ cached topic metadata as opposed to ZooKeeper reads (see KAFKA-901)
○ and KAFKA-1356
● Make it asynchronous
○ E.g., we will do this for StopReplica in KAFKA-1911
● Put it in a purgatory (usually if response depends on some condition); but be
aware of the caveats:
○ Higher memory pressure if request purgatory size grows
○ Expired requests are handled in purgatory expiration thread (which is good)
○ but satisfied requests are handled in API thread of satisfying request ⇒ if a request satisfies
several delayed requests then local time can increase for the satisfying request
● Request queue size
● Response queue sizes
● Request latencies:
○ Total time
○ Local time
○ Response send time
○ Remote time
● Request handler pool idle ratio
Monitor these closely!
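These are all exposed over JMX; below is a minimal polling sketch. The JMX URL is made up, and the MBean and attribute names are assumptions based on 0.8/0.9-era brokers (Yammer metrics), so verify them against your broker version.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RequestMetricsProbe {
    public static void main(String[] args) throws Exception {
        // Hypothetical broker JMX endpoint.
        String url = "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi";
        try (JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(url))) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();

            // Request queue size (a gauge exposing a "Value" attribute).
            System.out.println("RequestQueueSize = " + conn.getAttribute(
                    new ObjectName("kafka.network:type=RequestChannel,name=RequestQueueSize"), "Value"));

            // Per-request-type latency histograms: total, local, remote, response-send.
            for (String name : new String[]{"TotalTimeMs", "LocalTimeMs", "RemoteTimeMs", "ResponseSendTimeMs"}) {
                ObjectName produce = new ObjectName(
                        "kafka.network:type=RequestMetrics,name=" + name + ",request=Produce");
                System.out.println("Produce " + name + " p99 = " + conn.getAttribute(produce, "99thPercentile"));
            }

            // Request handler pool idle ratio (a meter; OneMinuteRate approximates the idle fraction).
            System.out.println("RequestHandlerAvgIdlePercent = " + conn.getAttribute(
                    new ObjectName("kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent"),
                    "OneMinuteRate"));
        }
    }
}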
Breaking compatibility
The first incident: new clients old clusters
[Diagram: the test cluster and the certification cluster (both on the old version) emit metric events to the metrics cluster (old version)]
[Diagram: the test cluster is upgraded to the new version; its new clients still emit metric events to the old-version metrics cluster]
org.apache.kafka.common.protocol.types.SchemaException: Error reading field 'throttle_time_ms': java.nio.BufferUnderflowException
    at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:73)
    at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:397)
    ...
(The new clients expect the throttle_time_ms field that quota-aware brokers include in responses; old brokers do not send it.)
New clients old clusters: remediation
[Diagram: the test and certification clusters are upgraded first, while the metrics cluster stays on the old version]
Set acks to zero (with acks=0 the producer does not read produce responses, so it avoids the parse error)
[Diagram: the metrics cluster is upgraded last]
Reset acks to 1
(BTW this just hit us again with the protocol changes in KIP-31/KIP-32)
KIP-35 would help a ton!
The second incident: new endpoints
Older broker versions register in ZooKeeper with:
{ "version": 1,
  "jmx_port": 9999,
  "timestamp": 2233345666,
  "host": "localhost",
  "port": 9092 }
x14 brokers register with version 2 and an endpoints list:
{ "version": 2,
  "jmx_port": 9999,
  "timestamp": 2233345666,
  "host": "localhost",
  "port": 9092,
  "endpoints": ["plaintext://localhost:9092"] }
Old clients ignore the endpoints field; x14 clients see version 2 ⇒ use endpoints.
x36 brokers (with SSL enabled) also advertise an SSL endpoint:
{ "version": 2,
  "jmx_port": 9999,
  "timestamp": 2233345666,
  "host": "localhost",
  "port": 9092,
  "endpoints": ["plaintext://localhost:9092", "ssl://localhost:9093"] }
x14 clients then fail to parse the registration:
java.lang.IllegalArgumentException: No enum constant org.apache.kafka.common.protocol.SecurityProtocol.SSL
    at java.lang.Enum.valueOf(Enum.java:238)
    at org.apache.kafka.common.protocol.SecurityProtocol.valueOf(SecurityProtocol.java:24)
New endpoints: remediation
x36 brokers register with "version": 1 while keeping the endpoints list, so that:
● old clients and x14 clients see v1 ⇒ ignore endpoints
● x36 clients see v1 ⇒ use endpoints if present
● Fix in KAFKA-2584
● Also related: KAFKA-3100
Power outage
Widespread FS corruption after power outage
● Mount settings at the time
○ type ext4 (rw,noatime,data=writeback,commit=120)
● Restarts were successful but brokers subsequently hit corruption
● Subsequent restarts also hit corruption in index files
Summary
● Monitoring beyond per-broker/controller metrics
  ○ Validate SLAs
  ○ Continuously test admin functionality (in test clusters)
● Automate release validation
● https://github.com/linkedin/streaming
[Diagram: Kafka monitor: a monitor instance's producer and consumer exercise the Kafka cluster and report ackLatencyMs, e2eLatencyMs, duplicateRate, retryRate, failureRate, lossRate, and availability %; other monitor instances use AdminUtils to run checkReassign and checkPLE]
Q&A
We are hiring! Software developers and Site Reliability Engineers at all levels
Streams infrastructure @ LinkedIn
● Kafka pub-sub ecosystem
● Stream processing platform built on Apache Samza
● Next Gen Change capture technology (incubating)
Contact: Kartik Paramasivam
LinkedIn Data Infrastructure meetup
Where: LinkedIn campus, 2061 Stierlin Ct., Mountain View, CA
When: May 11 at 6.30 PM
Register: http://bit.ly/1Sv8ach
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 

Dernier (20)

Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 

Kafkaesque days at LinkedIn in 2015 (transcript)

  • 1. Kafkaesque days at LinkedIn in 2015 Joel Koshy Kafka Summit 2016
  • 2. Kafkaesque adjective Kaf·ka·esque ˌkäf-kə-ˈesk, ˌkaf- : of, relating to, or suggestive of Franz Kafka or his writings; especially : having a nightmarishly complex, bizarre, or illogical quality Merriam-Webster
  • 4. What @bonkoif said: More clusters More use-cases More problems … Kafka @ LinkedIn
  • 5. Incidents that we will cover ● Offset rewinds ● Data loss ● Cluster unavailability ● (In)compatibility ● Blackout
  • 7. What are offset rewinds? valid offsets invalid offsetsinvalid offsets yet to arrive messages purged messages
  • 8. If a consumer gets an OffsetOutOfRangeException: What are offset rewinds? valid offsets invalid offsetsinvalid offsets auto.offset.reset ← earliest auto.offset.reset ← latest
  • 9. What are offset rewinds… and why do they matter? HADOOP Kafka (CORP) Push job Kafka (PROD) Stork Mirror Maker Email campaigns
  • 10. What are offset rewinds… and why do they matter? HADOOP Kafka Push job Kafka (PROD) Stork Mirror Maker Email campaigns Real-life incident courtesy of xkcd offset rewind
  • 11. Deployment: Deployed Multiproduct kafka-mirror-maker 0.1.13 to DCX by jkoshy CRT Notifications <crt-notifications-noreply@linkedin.com> Fri, Jul 10, 2015 at 8:27 PM Multiproduct 0.1.13 of kafka-mirror-maker has been Deployed to DCX by jkoshy Offset rewinds: the first incident
  • 12. Deployment: Deployed Multiproduct kafka-mirror-maker 0.1.13 to DCX by jkoshy CRT Notifications <crt-notifications-noreply@linkedin.com> Fri, Jul 10, 2015 at 8:27 PM Multiproduct 0.1.13 of kafka-mirror-maker has been Deployed to DCX by jkoshy on Wednesday, Jul 8, 2015 at 10:14 AM Offset rewinds: the first incident
  • 13. What are offset rewinds… and why do they matter? HADOOP Kafka (CORP) Push job Kafka (PROD) Stork Mirror Maker Email campaigns Good practice to have some filtering logic here
  • 16. Offset rewinds: detection - just use this
  • 17. Offset rewinds: a typical cause
  • 18. Offset rewinds: a typical cause valid offsets invalid offsetsinvalid offsets consumer position
  • 19. Offset rewinds: a typical cause valid offsets invalid offsetsinvalid offsets consumer position Unclean leader election truncates the log
  • 20. Offset rewinds: a typical cause valid offsets invalid offsetsinvalid offsets consumer position Unclean leader election truncates the log … and consumer’s offset goes out of range
  • 21. But there were no ULEs when this happened
  • 22. But there were no ULEs when this happened … and we set auto.offset.reset to latest
  • 23. Offset management - a quick overview (broker) Consumer Consumer group Consumer Consumer (broker)(broker) Consume (fetch requests)
  • 24. Offset management - a quick overview Offset Manager (broker) Consumer Consumer group Consumer Consumer Periodic OffsetCommitRequest (broker)(broker)
  • 25. Offset management - a quick overview Offset Manager (broker) Consumer Consumer group Consumer Consumer OffsetFetchRequest (after rebalance) (broker) (broker)
  • 26. Offset management - a quick overview mirror-maker PageViewEvent-0 240 mirror-maker LoginEvent-8 456 mirror-maker LoginEvent-8 512 mirror-maker PageViewEvent-0 321 __consumer_offsets topic
  • 27. Offset management - a quick overview mirror-maker PageViewEvent-0 240 mirror-maker LoginEvent-8 456 mirror-maker LoginEvent-8 512 mirror-maker PageViewEvent-0 321 __consumer_offsets topic New offset commits append to the topic
  • 28. Offset management - a quick overview mirror-maker PageViewEvent-0 240 mirror-maker LoginEvent-8 456 mirror-maker LoginEvent-8 512 mirror-maker PageViewEvent-0 321 __consumer_offsets topic New offset commits append to the topic mirror-maker PageViewEvent-0 321 mirror-maker LoginEvent-8 512 … … Maintain offset cache to serve offset fetch requests quickly
  • 29. Offset management - a quick overview mirror-maker PageViewEvent-0 240 mirror-maker LoginEvent-8 456 mirror-maker LoginEvent-8 512 mirror-maker PageViewEvent-0 321 __consumer_offsets topic New offset commits append to the topic mirror-maker PageViewEvent-0 321 mirror-maker LoginEvent-8 512 … … Purge old offsets via log compaction Maintain offset cache to serve offset fetch requests quickly
  • 30. Offset management - a quick overview mirror-maker PageViewEvent-0 240 mirror-maker LoginEvent-8 456 mirror-maker LoginEvent-8 512 mirror-maker PageViewEvent-0 321 __consumer_offsets topic When a new broker becomes the leader (i.e., offset manager) it loads offsets into its cache
  • 31. Offset management - a quick overview mirror-maker PageViewEvent-0 240 mirror-maker LoginEvent-8 456 mirror-maker LoginEvent-8 512 mirror-maker PageViewEvent-0 321 __consumer_offsets topic mirror-maker PageViewEvent-0 321 mirror-maker LoginEvent-8 512 … … See this deck for more details
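
As a rough sketch of how these requests look from the client side (not code from the incident): with the Java consumer, an explicit commit becomes an OffsetCommitRequest to the group's offset manager, and committed() becomes an OffsetFetchRequest. The broker address, group id, and topic below are made up.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    public class OffsetCommitSketch {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");     // made-up broker address
        props.put("group.id", "mirror-maker");                 // hypothetical group
        props.put("enable.auto.commit", "false");              // commit explicitly below
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
          TopicPartition tp = new TopicPartition("PageViewEvent", 0);
          consumer.assign(Collections.singletonList(tp));
          ConsumerRecords<byte[], byte[]> records = consumer.poll(1000);
          for (ConsumerRecord<byte[], byte[]> r : records) {
            // process record ...
          }
          // OffsetCommitRequest: appended to __consumer_offsets by the group's offset manager
          consumer.commitSync(Collections.singletonMap(tp,
              new OffsetAndMetadata(consumer.position(tp))));
          // OffsetFetchRequest: served from the offset manager's in-memory cache
          OffsetAndMetadata committed = consumer.committed(tp);
          System.out.println("committed offset = " + (committed == null ? "none" : committed.offset()));
        }
      }
    }
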
  • 32. Back to the incident… 2015/07/10 02:32:16.174 [ConsumerFetcherThread] [ConsumerFetcherThread-mirror-maker-9de01f48-0-287], Current offset 6811737 for partition [some-log_event,13] out of range; reset offset to 9581225
  • 33. Back to the incident… ... <rebalance> 2015/07/10 02:08:14.252 [some-log_event,13], initOffset 9581205 ... <rebalance> 2015/07/10 02:24:11.965 [some-log_event,13], initOffset 9581223 ... <rebalance> 2015/07/10 02:32:16.131 [some-log_event,13], initOffset 6811737 ... 2015/07/10 02:32:16.174 [ConsumerFetcherThread] [ConsumerFetcherThread-mirror-maker-9de01f48-0-287], Current offset 6811737 for partition [some-log_event,13] out of range; reset offset to 9581225
  • 34. ./bin/kafka-console-consumer.sh --topic __consumer_offsets --zookeeper <zookeeperConnect> --formatter "kafka.coordinator.GroupMetadataManager$OffsetsMessageFormatter" --consumer.config config/consumer.properties (must set exclude.internal.topics=false in consumer.properties) While debugging offset rewinds, do this first!
  • 36. So why did the offset manager return a stale offset? Offset manager logs: 2015/07/10 02:31:57.941 ERROR [OffsetManager] [kafka-scheduler-1] [kafka-server] [] [Offset Manager on Broker 191]: Error in loading offsets from [__consumer_offsets,63] java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.String at kafka.server.OffsetManager$.kafka$server$OffsetManager$$readMessageValue(OffsetManager.scala:576)
  • 37. So why did the offset manager return a stale offset? Offset manager logs: 2015/07/10 02:31:57.941 ERROR [OffsetManager] [kafka-scheduler-1] [kafka-server] [] [Offset Manager on Broker 191]: Error in loading offsets from [__consumer_offsets,63] java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.String at kafka.server.OffsetManager$.kafka$server$OffsetManager$$readMessageValue(OffsetManager.scala:576) ... ... mirror-maker some-log_event, 13 6811737 ... ... Leader moved and new offset manager hit KAFKA-2117 while loading offsets old offsets recent offsets
  • 38. … caused a ton of offset resets 2015/07/10 02:08:14.252 [some-log_event,13], initOffset 9581205 ... 2015/07/10 02:24:11.965 [some-log_event,13], initOffset 9581223 ... 2015/07/10 02:32:16.131 [some-log_event,13], initOffset 6811737 ... 2015/07/10 02:32:16.174 [ConsumerFetcherThread] [ConsumerFetcherThread-mirror-maker-9de01f48-0-287], Current offset 6811737 for partition [some-log_event,13] out of range; reset offset to 9581225 [some-log_event, 13] 846232 9581225 purged
  • 39. … but why the duplicate email? Deployment: Deployed Multiproduct kafka-mirror-maker 0.1.13 to DCX by jkoshy CRT Notifications <crt-notifications-noreply@linkedin.com> Fri, Jul 10, 2015 at 8:27 PM Multiproduct 0.1.13 of kafka-mirror-maker has been Deployed to DCX by jkoshy
  • 40. … but why the duplicate email? 2015/07/10 02:08:15.524 [crt-event,12], initOffset 11464 ... 2015/07/10 02:31:40.827 [crt-event,12], initOffset 11464 ... 2015/07/10 02:32:17.739 [crt-event,12], initOffset 9539 ... Also from Jun 1
  • 41. … but why the duplicate email? 2015/07/10 02:08:15.524 [crt-event,12], initOffset 11464 ... 2015/07/10 02:31:40.827 [crt-event,12], initOffset 11464 ... 2015/07/10 02:32:17.739 [crt-event,12], initOffset 9539 ... [crt-event, 12] 0 11464 … but still valid!
  • 42. Time-based retention does not work well for low-volume topics Addressed by KIP-32/KIP-33
  • 43. Offset rewinds: the second incident Mirror makers got wedged ⇒ restarted ⇒ sent duplicate emails to (a few) members
  • 44. Offset rewinds: the second incident Consumer logs 2015/04/29 17:22:48.952 <rebalance started> ... 2015/04/29 17:36:37.790 <rebalance ended>initOffset -1 (for various partitions)
  • 45. Offset rewinds: the second incident Consumer logs 2015/04/29 17:22:48.952 <rebalance started> ... 2015/04/29 17:36:37.790 <rebalance ended>initOffset -1 (for various partitions) Broker (offset manager) logs 2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84] ... 2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!)
  • 46. Offset rewinds: the second incident Consumer logs 2015/04/29 17:22:48.952 <rebalance started> ... 2015/04/29 17:36:37.790 <rebalance ended>initOffset -1 (for various partitions) Broker (offset manager) logs 2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84] ... 2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!) ⇒ log cleaner had failed a while ago… but why did offset fetch return -1?
  • 47. Offset management - a quick overview How are stale offsets (for dead consumers) cleaned up? dead-group PageViewEvent-0 321 timestamp older than a week active-group LoginEvent-8 512 recent timestamp … … __consumer_offsets Offset cache cleanup task
  • 48. Offset management - a quick overview How are stale offsets (for dead consumers) cleaned up? dead-group PageViewEvent-0 321 timestamp older than a week active-group LoginEvent-8 512 recent timestamp … … __consumer_offsets Offset cache cleanup task Append tombstones for dead-group and delete entry in offset cache
  • 49. Back to the incident... 2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84] ... 2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!) mirror-maker PageViewEvent-0 45 very old timestamp mirror-maker LoginEvent-8 12 very old timestamp ... ... ... old offsets recent offsets load offsets
  • 50. Back to the incident... 2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84] ... 2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!) mirror-maker PageViewEvent-0 45 very old timestamp mirror-maker LoginEvent-8 12 very old timestamp ... ... ... old offsets recent offsets load offsets Cleanup task happened to run during the load
  • 51. Back to the incident... 2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84] ... 2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!) ... ... ... old offsets recent offsets load offsets
  • 52. Back to the incident... 2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84] ... 2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!) mirror-maker PageViewEvent-0 321 recent timestamp mirror-maker LoginEvent-8 512 recent timestamp ... ... ... old offsets recent offsets load offsets
  • 53. Back to the incident... 2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84] ... 2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!) ... ... ... old offsets recent offsets load offsets
  • 54. Root cause of this rewind ● Log cleaner had failed (separate bug) ○ ⇒ offsets topic grew big ○ ⇒ offset load on leader movement took a while ● Cache cleanup ran during the load ○ which appended tombstones ○ and overrode the most recent offsets ● (Fixed in KAFKA-2163)
  • 55. Offset rewinds: wrapping it up ● Monitor log cleaner health ● If you suspect a rewind: ○ Check for unclean leader elections ○ Check for offset manager movement (i.e., __consumer_offsets partitions had leader changes) ○ Take a dump of the offsets topic ○ … stare long and hard at the logs (both consumer and offset manager) ● auto.offset.reset ← closest ? ● Better lag monitoring via Burrow
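
Burrow is the real answer here (it evaluates committed offsets against broker log-end offsets over time). Purely to illustrate the idea, a one-shot lag check with the Java consumer might look like the sketch below; it assumes a 0.10.1+ client for endOffsets(), and the broker address, group id, and topic are made up.

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import java.util.stream.Collectors;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    public class LagCheckSketch {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // made-up broker address
        props.put("group.id", "mirror-maker");              // the group whose committed offsets we inspect
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
          List<TopicPartition> partitions = consumer.partitionsFor("PageViewEvent").stream()
              .map(pi -> new TopicPartition(pi.topic(), pi.partition()))
              .collect(Collectors.toList());
          Map<TopicPartition, Long> logEnd = consumer.endOffsets(partitions);  // broker log end offsets
          for (TopicPartition tp : partitions) {
            OffsetAndMetadata committed = consumer.committed(tp);  // OffsetFetch from the offset manager
            long lag = committed == null ? -1 : logEnd.get(tp) - committed.offset();
            System.out.println(tp + " lag=" + lag + (lag < 0 ? " (no committed offset: possible reset)" : ""));
          }
        }
      }
    }

A committed offset that suddenly jumps backwards for an active group is the signature of a rewind.
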
  • 57. P R O D B P R O D A C O R P Y C O R P X Data loss: the first incident Kafka aggregate Kafka local Kafka aggregate Hadoop Producers Kafka aggregate Kafka local Kafka aggregate Hadoop Producers
  • 59. Data loss: detection (example 1) P R O D B P R O D A C O R P Y C O R P X Kafka aggregate Kafka local Kafka aggregate Hadoop Producers Kafka aggregate Kafka local Kafka aggregate Hadoop Producers
  • 60. Data loss: detection (example 1) P R O D B P R O D A C O R P Y C O R P X Kafka aggregate Kafka local Kafka aggregate Hadoop Producers Kafka aggregate Kafka local Kafka aggregate Hadoop Producers
  • 61. Data loss: detection (example 2) P R O D B P R O D A C O R P Y C O R P X Kafka aggregate Kafka local Kafka aggregate Hadoop Producers Kafka aggregate Kafka local Kafka aggregate Hadoop Producers
  • 62. Data loss? (The actual incident) P R O D B P R O D A C O R P Y C O R P X Kafka aggregate Kafka local Kafka aggregate Hadoop Producers Kafka aggregate Kafka local Kafka aggregate Hadoop Producers
  • 63. Data loss or audit issue? (The actual incident) P R O D B P R O D A C O R P Y C O R P X Kafka aggregate Kafka local Kafka aggregate Hadoop Producers Kafka aggregate Kafka local Kafka aggregate Hadoop Producers Sporadic discrepancies in Kafka- aggregate-CORP-X counts for several topics However, Hadoop-X tier is complete ✔ ✔ ✔ ✔ ✔ ✔✔
  • 64. Verified actual data completeness by recounting events in a few low-volume topics … so definitely an audit-only issue Likely caused by dropping audit events
  • 65. Verified actual data completeness by recounting events in a few low-volume topics … so definitely an audit-only issue Possible sources of discrepancy: ● Cluster auditor ● Cluster itself (i.e., data loss in audit topic) ● Audit front-end Likely caused by dropping audit events
  • 66. Possible causes C O R P X Kafka aggregate Hadoop Cluster auditor consume all topics emit audit counts Cluster auditor ● Counting incorrectly ○ but same version of auditor everywhere and only CORP-X has issues ● Not consuming all data for audit or failing to send all audit events ○ but no errors in auditor logs ● … and auditor bounces did not help
  • 67. Data loss in audit topic ● … but no unclean leader elections ● … and no data loss in sampled topics (counted manually) Possible causes C O R P X Kafka aggregate Hadoop Cluster auditor consume all topics emit audit counts
  • 68. Audit front-end fails to insert audit events into DB ● … but other tiers (e.g., CORP-Y) are correct ● … and no errors in logs Possible causes C O R P X Kafka aggregate Hadoop Audit front-end consume audit Audit DB insert from CORP-Y
  • 69. ● Emit counts to new test tier Attempt to reproduce C O R P X Kafka aggregate Hadoop Cluster auditor consume all topics Tier CORP-X Cluster auditor Tier test
  • 70. … fortunately worked: ● Emit counts to new test tier ● test tier counts were also sporadically off Attempt to reproduce C O R P X Kafka aggregate Hadoop Cluster auditor consume all topics Tier CORP-X Cluster auditor Tier test
  • 71. ● Enabled select TRACE logs to log audit events before sending ● Audit counts were correct ● … and successfully emitted … and debug C O R P X Kafka aggregate Hadoop Cluster auditor consume all topics Tier CORP-X Cluster auditor Tier test
  • 72. ● Enabled select TRACE logs to log audit events before sending ● Audit counts were correct ● … and successfully emitted ● Verified from broker public access logs that audit event was sent … and debug C O R P X Kafka aggregate Hadoop Cluster auditor consume all topics Tier CORP-X Cluster auditor Tier test
  • 73. ● Enabled select TRACE logs to log audit events before sending ● Audit counts were correct ● … and successfully emitted ● Verified from broker public access logs that audit event was sent ● … but on closer look realized it was not the leader for that partition of the audit topic … and debug C O R P X Kafka aggregate Hadoop Cluster auditor consume all topics Tier CORP-X Cluster auditor Tier test
  • 74. ● Enabled select TRACE logs to log audit events before sending ● Audit counts were correct ● … and successfully emitted ● Verified from broker public access logs that audit event was sent ● … but on closer look realized it was not the leader for that partition of the audit topic ● So why did it not return NotLeaderForPartition? … and debug C O R P X Kafka aggregate Hadoop Cluster auditor consume all topics Tier CORP-X Cluster auditor Tier test
  • 75. That broker was part of another cluster! C O R P X Kafka aggregate Hadoop Cluster auditor Some other Kafka cluster Tier test siphoned audit events
  • 76. … and we had a VIP misconfiguration C O R P X Kafka aggregate Hadoop Cluster auditor Some other Kafka cluster V I P stray broker entry
  • 77. ● Auditor still uses the old producer ● Periodically refreshes metadata (via VIP) for the audit topic ● ⇒ sometimes fetches metadata from the other cluster So audit events leaked into the other cluster C O R P X Kafka aggregate Hadoop Cluster auditor Some other Kafka cluster V I P AuditTopic Metadata Request Metadata response
  • 78. ● Auditor still uses the old producer ● Periodically refreshes metadata (via VIP) for the audit topic ● ⇒ sometimes fetches metadata from the other cluster ● and leaks audit events to that cluster until at least next metadata refresh So audit events leaked into the other cluster C O R P X Kafka aggregate Hadoop Cluster auditor Some other Kafka cluster V I P emit audit counts
  • 79. Some takeaways ● Could have been worse if mirror-makers to CORP-X had been bounced ○ (Since mirror makers could have started siphoning actual data to the other cluster) ● Consider using round-robin DNS instead of VIPs ○ … which is also necessary for using per-IP connection limits
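
With the newer Java clients, the broker list is only used for bootstrapping metadata, so pointing clients at a few brokers directly (or at a round-robin DNS name) instead of a VIP is a small configuration change. A hedged sketch; the host names are made up.

    import java.util.Properties;

    public class BootstrapConfigSketch {
      public static Properties producerProps() {
        Properties props = new Properties();
        // Bootstrap against several brokers (or one round-robin DNS name) instead of a VIP;
        // the client discovers the rest of the cluster from metadata after the first connection.
        props.put("bootstrap.servers",
            "kafka01.example.com:9092,kafka02.example.com:9092,kafka03.example.com:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        return props;
      }
    }
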
  • 80. Data loss: the second incident Prolonged period of data loss from our Kafka REST proxy
  • 81. Data loss: the second incident (a prolonged period of data loss from our Kafka REST proxy). Alerts fire that a broker in the tracking cluster had gone offline. NOC engages SYSOPS to investigate. NOC engages Feed SREs and Kafka SREs to investigate a drop (not loss) in a subset of page views. On investigation, Kafka SRE finds no problems with Kafka (excluding the down broker), but notes an overall drop in tracking messages starting shortly after the broker failure. NOC engages Traffic SRE to investigate why their tracking events had stopped. Traffic SRE say that they don’t see errors on their side, and add that they use the Kafka REST proxy. Kafka SRE finds no immediate errors in the Kafka REST logs but bounces the service as a precautionary measure. Tracking events return to normal (expected) counts after the bounce.
  • 84. Sender Accumulator Reproducing the issue Broker A Broker B Partition 1 Partition 2 Partition n send Leader for partition 1 in-flight requests
  • 85. Sender Accumulator Reproducing the issue Broker A Broker B Partition 1 Partition 2 Partition n send New leader for partition 1 in-flight requests Old leader for partition 1
  • 86. Sender Accumulator Reproducing the issue Broker A Broker B Partition 1 Partition 2 Partition n send New leader for partition 1 in-flight requests New producer did not implement a request timeout Old leader for partition 1
  • 87. Sender Accumulator Reproducing the issue Broker A Broker B Partition 1 Partition 2 Partition n send in-flight requests New producer did not implement a request timeout ⇒ awaiting response ⇒ unaware of leader change until next metadata refresh New leader for partition 1 Old leader for partition 1
  • 88. Sender Accumulator Reproducing the issue Broker A Broker B Partition 1 Partition 2 Partition n send in-flight requests So client continues to send to partition 1 New leader for partition 1 Old leader for partition 1
  • 89. Sender Accumulator Reproducing the issue Broker A Broker B Partition 2 Partition n send batches pile up in partition 1 and eat up accumulator memory in-flight requests New leader for partition 1 Old leader for partition 1
  • 90. Sender Accumulator Reproducing the issue Broker B Partition 2 Partition n send in-flight requests subsequent sends drop/block per block.on.buffer.full config New leader for partition 1 Old leader for partition 1 Broker A
  • 91. Reproducing the issue ● netstat tcp 0 0 ::ffff:127.0.0.1:35938 ::ffff:127.0.0.1:9092 ESTABLISHED 3704/java ● Producer metrics ○ zero retry/error rate ● Thread dump java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(long, TimeUnit) org.apache.kafka.clients.producer.internals.BufferPool.allocate(int) org.apache.kafka.clients.producer.internals.RecordAccumulator.append(TopicPartition, byte[], byte[], CompressionType, Callback) ● Resolved by KAFKA-2120 (KIP-19)
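
KIP-19 (KAFKA-2120) added client-side timeouts so that a producer in this state eventually fails the batch instead of letting it pile up silently. A sketch of the relevant knobs on a post-KIP-19 Java producer; the values and topic name are illustrative, not recommendations.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class BoundedProducerSketch {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // made-up broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("acks", "1");
        props.put("request.timeout.ms", "30000"); // added by KIP-19: expire in-flight/batched requests
        props.put("max.block.ms", "10000");       // bound how long send() blocks when the accumulator is full
        props.put("buffer.memory", "33554432");   // accumulator size (32 MB here)

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
          producer.send(new ProducerRecord<>("PageViewEvent", "hello".getBytes()),
              (metadata, exception) -> {
                // With the timeouts above, a wedged broker surfaces here as an exception
                // instead of the producer silently blocking or dropping forever.
                if (exception != null) {
                  System.err.println("send failed: " + exception);
                }
              });
        }
      }
    }
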
  • 92. Cluster unavailability (This is an abridged version of my earlier talk.)
  • 93. The incident Occurred a few days after upgrading to pick up quotas and SSL Multi-port KAFKA-1809 KAFKA-1928 SSL KAFKA-1690 x25 x38 October 13 Various quota patches June 3 April 5 August 18
  • 94. The incident Broker (which happened to be controller) failed in our queuing Kafka cluster
  • 95. The incident Multiple applications begin to report “issues”: socket timeouts to Kafka cluster Posts search was one such impacted application
  • 96. The incident Two brokers report high request and response queue sizes
  • 97. The incident Two brokers report high request queue size and request latencies
  • 98. The incident ● Other observations ○ High CPU load on those brokers ○ Throughput degrades to ~ half the normal throughput ○ Tons of broken pipe exceptions in server logs ○ Application owners report socket timeouts in their logs
  • 99. Remediation Shifted site traffic to another data center. “Kafka outage ⇒ member impact. Multi-colo is critical!”
  • 100. Remediation ● Controller moves did not help ● Firewall the affected brokers ● The above helped, but cluster fell over again after dropping the rules ● Suspect misbehaving clients on broker failure ○ … but x25 never exhibited this issue sudo iptables -A INPUT -p tcp --dport <broker-port> -s <other-broker> -j ACCEPT sudo iptables -A INPUT -p tcp --dport <broker-port> -j DROP
  • 101. Remediation Friday night ⇒ roll-back to x25 and debug later … but SREs had to babysit the rollback x38 x38 x38 x38 Rolling downgrade
  • 102. Remediation Friday night ⇒ roll-back to x25 and debug later … but SREs had to babysit the rollback x38 x38 x38 x38 Rolling downgrade Move leaders
  • 103. Remediation Friday night ⇒ roll-back to x25 and debug later … but SREs had to babysit the rollback x38 x38 x38 x38 Rolling downgrade Firewall
  • 104. Remediation Friday night ⇒ roll-back to x25 and debug later … but SREs had to babysit the rollback x38 x38 x38 Rolling downgrade Firewall x25
  • 105. Remediation Friday night ⇒ roll-back to x25 and debug later … but SREs had to babysit the rollback x38 x38 x38 Rolling downgrade x25 Move leaders
  • 106. ● Test cluster ○ Tried killing controller ○ Multiple rolling bounces ○ Could not reproduce ● Upgraded the queuing cluster to x38 again ○ Could not reproduce ● So nothing… Attempts at reproducing the issue
  • 108. Life-cycle of a Kafka request: the network layer (acceptor plus processor threads that read requests from client connections into the request queue and write responses from per-processor response queues) and the API layer (API handler threads, purgatory, quota manager). Total time = request-queue-time (await handling) + local-time (handle request) + remote-time (long-poll requests) + quota-time (hold if quota violated) + response-queue-time (await processor) + response-send-time (write response)
  • 109. Investigating high request times ● First look for high local time ○ then high response send time ■ then high remote (purgatory) time → generally non-issue (but caveats described later) ● High request queue/response queue times are effects, not causes
  • 110. High local times during incident (e.g., fetch)
  • 111. How are fetch requests handled? ● Get physical offsets to be read from local log during response ● If fetch from follower (i.e., replica fetch): ○ If follower was out of ISR and just caught-up then expand ISR (ZooKeeper write) ○ Maybe satisfy eligible delayed produce requests (with acks -1) ● Else (i.e., consumer fetch): ○ Record/update byte-rate of this client ○ Throttle the request on quota violation
  • 112. Could these cause high local times? ● Get physical offsets to be read from local log during response ● If fetch from follower (i.e., replica fetch): ○ If follower was out of ISR and just caught-up then expand ISR (ZooKeeper write) ○ Maybe satisfy eligible delayed produce requests (with acks -1) ● Else (i.e., consumer fetch): ○ Record/update byte-rate of this client ○ Throttle the request on quota violation Not using acks -1 Should be fast Should be fast Delayed outside API thread Test this…
  • 113. Maintains byte-rate metrics on a per-client-id basis 2015/10/10 03:20:08.393 [] [] [] [logger] Completed request:Name: FetchRequest; Version: 0; CorrelationId: 0; ClientId: 2c27cc8b_ccb7_42ae_98b6_51ea4b4dccf2; ReplicaId: -1; MaxWait: 0 ms; MinBytes: 0 bytes from connection <clientIP>:<brokerPort>-<localAddr>;totalTime:6589, requestQueueTime:6589,localTime:0,remoteTime:0,responseQueueTime:0,sendTime:0, securityProtocol:PLAINTEXT,principal:ANONYMOUS Quota metrics ??!
  • 114. Quota metrics - a quick benchmark for (clientId ← 0 until N) { timer.time { quotaMetrics.recordAndMaybeThrottle(clientId, 0, DefaultCallBack) } }
  • 115. Quota metrics - a quick benchmark
  • 116. Quota metrics - a quick benchmark Fixed in KAFKA-2664
  • 117. meanwhile in our queuing cluster… due to climbing client-id counts
  • 118. Rolling bounce of cluster forced the issue to recur on brokers that had high client-id metric counts ○ Used jmxterm to check per-client-id metric counts before experiment ○ Hooked up profiler to verify during incident ■ Generally avoid profiling/heapdumps in production due to interference ○ Did not see in earlier rolling bounce due to only a few client-id metrics at the time
  • 119. How to fix high local times ● Optimize the request’s handling. For e.g.,: ○ cached topic metadata as opposed to ZooKeeper reads (see KAFKA-901) ○ and KAFKA-1356 ● Make it asynchronous ○ E.g., we will do this for StopReplica in KAFKA-1911 ● Put it in a purgatory (usually if response depends on some condition); but be aware of the caveats: ○ Higher memory pressure if request purgatory size grows ○ Expired requests are handled in purgatory expiration thread (which is good) ○ but satisfied requests are handled in API thread of satisfying request ⇒ if a request satisfies several delayed requests then local time can increase for the satisfying request
  • 120. ● Request queue size ● Response queue sizes ● Request latencies: ○ Total time ○ Local time ○ Response send time ○ Remote time ● Request handler pool idle ratio Monitor these closely!
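
All of these are exposed as JMX metrics on each broker. A hedged sketch of polling a few of them remotely; the MBean and attribute names follow the 0.8.x/0.9-era naming (verify against your broker version), and the host and JMX port are made up.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class RequestMetricsSketch {
      public static void main(String[] args) throws Exception {
        // Assumes the broker exposes remote JMX on port 9999
        JMXServiceURL url =
            new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker01.example.com:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
          MBeanServerConnection mbsc = connector.getMBeanServerConnection();

          // Request queue size: an effect rather than a cause, but a good first alarm
          ObjectName requestQueue =
              new ObjectName("kafka.network:type=RequestChannel,name=RequestQueueSize");
          System.out.println("RequestQueueSize = " + mbsc.getAttribute(requestQueue, "Value"));

          // Per-request latency breakdown: look at local time first when total time is high
          ObjectName fetchLocalTime = new ObjectName(
              "kafka.network:type=RequestMetrics,name=LocalTimeMs,request=FetchConsumer");
          System.out.println("Fetch local time p99 = "
              + mbsc.getAttribute(fetchLocalTime, "99thPercentile"));

          // Request handler pool idle ratio
          ObjectName idleRatio = new ObjectName(
              "kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent");
          System.out.println("Handler idle ratio = " + mbsc.getAttribute(idleRatio, "OneMinuteRate"));
        } finally {
          connector.close();
        }
      }
    }
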
  • 122. The first incident: new clients old clusters Test cluster (old version) Certification cluster (old version) Metrics cluster (old version) metric events metric events
  • 123. The first incident: new clients old clusters Test cluster (new version) Certification cluster (old version) Metrics cluster (old version) metric events metric events org.apache.kafka.common.protocol.types.SchemaException: Error reading field 'throttle_time_ms': java.nio.BufferUnderflowException at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:73) at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:397) ...
  • 124. New clients old clusters: remediation Test cluster (new version) Certification cluster (new version) Metrics cluster (old version) metric events metric events Set acks to zero
  • 125. New clients old clusters: remediation Test cluster (new version) Certification cluster (new version) Metrics cluster (new version) metric events metric events Reset acks to 1
  • 126. New clients old clusters: remediation (BTW this just hit us again with the protocol changes in KIP-31/KIP-32) KIP-35 would help a ton!
  • 127. The second incident: new endpoints { "version":1, "jmx_port":9999, "timestamp":2233345666, "host":"localhost", "port":9092 } x14 older broker versions ZooKeeper registration { "version":2, "jmx_port":9999, "timestamp":2233345666, "host":"localhost", "port":9092, "endpoints": [ {"plaintext://localhost:9092"} ] } x14 client old client ignore endpoints v2 ⇒ use endpoints
  • 128. The second incident: new endpoints { "version":1, "jmx_port":9999, "timestamp":2233345666, "host":"localhost", "port":9092 } x14 older broker versions ZooKeeper registration { "version":2, "jmx_port":9999, "timestamp":2233345666, "host":"localhost", "port":9092, "endpoints": [ {"plaintext://localhost:9092"} ] } x36 { "version":2, "jmx_port":9999, "timestamp":2233345666, "host":"localhost", "port":9092, "endpoints": [ {"plaintext://localhost:9092"}, {"ssl://localhost:9093"} ] } x14 client old client java.lang.IllegalArgumentException: No enum constant org.apache.kafka.common.protocol.SecurityProtocol.SSL at java.lang.Enum.valueOf(Enum.java:238) at org.apache.kafka.common.protocol.SecurityProtocol.valueOf(SecurityProtocol.java:24)
  • 129. New endpoints: remediation { "version":1, "jmx_port":9999, "timestamp":2233345666, "host":"localhost", "port":9092 } x14 older broker versions ZooKeeper registration { "version":2, "jmx_port":9999, "timestamp":2233345666, "host":"localhost", "port":9092, "endpoints": [ {"plaintext://localhost:9092"} ] } x36 { "version":2 → 1, "jmx_port":9999, "timestamp":2233345666, "host":"localhost", "port":9092, "endpoints": [ {"plaintext://localhost:9092"}, {"ssl://localhost:9093"} ] } x14 client old client v1 ⇒ ignore endpoints
  • 130. New endpoints: remediation { "version":1, "jmx_port":9999, "timestamp":2233345666, "host":"localhost", "port":9092 } x14 older broker versions ZooKeeper registration { "version":2, "jmx_port":9999, "timestamp":2233345666, "host":"localhost", "port":9092, "endpoints": [ {"plaintext://localhost:9092"} ] } x36 { "version":2 → 1, "jmx_port":9999, "timestamp":2233345666, "host":"localhost", "port":9092, "endpoints": [ {"plaintext://localhost:9092"}, {"ssl://localhost:9093"} ] } x14 client x36 client old client v1 ⇒ ignore endpoints v1 ⇒ use endpoints if present
  • 131. New endpoints: remediation ● Fix in KAFKA-2584 ● Also related: KAFKA-3100
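
When chasing this kind of client/broker mismatch it helps to look at the registration JSON directly. A hedged sketch using the ZooKeeper Java client; the connect string is assumed to be the usual localhost:2181, and /brokers/ids is the standard registration path.

    import java.nio.charset.StandardCharsets;
    import java.util.List;
    import org.apache.zookeeper.ZooKeeper;

    public class BrokerRegistrationSketch {
      public static void main(String[] args) throws Exception {
        // 30s session timeout, no-op watcher; assumes ZooKeeper at localhost:2181
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
        try {
          List<String> brokerIds = zk.getChildren("/brokers/ids", false);
          for (String id : brokerIds) {
            byte[] data = zk.getData("/brokers/ids/" + id, false, null);
            // The JSON payload carries "version", "host", "port" and (v2+) "endpoints";
            // the incident above came from clients tripping over fields they did not expect.
            System.out.println("broker " + id + ": " + new String(data, StandardCharsets.UTF_8));
          }
        } finally {
          zk.close();
        }
      }
    }
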
  • 133. Widespread FS corruption after power outage ● Mount settings at the time ○ type ext4 (rw,noatime,data=writeback,commit=120) ● Restarts were successful but brokers subsequently hit corruption ● Subsequent restarts also hit corruption in index files
  • 135. ● Monitoring beyond per-broker/controller metrics ○ Validate SLAs ○ Continuously test admin functionality (in test clusters) ● Automate release validation ● https://github.com/linkedin/streaming Kafka monitor Kafka cluster producer Monitor instance ackLatencyMs e2eLatencyMs duplicateRate retryRate failureRate lossRate consumer Availability %
  • 136. ● Monitoring beyond per-broker/controller metrics ○ Validate SLAs ○ Continuously test admin functionality (in test clusters) ● Automate release validation ● https://github.com/linkedin/streaming Kafka monitor Kafka cluster producer Monitor instance ackLatencyMs e2eLatencyMs duplicateRate retryRate failureRate lossRate consumer Monitor instance Admin Utils Monitor instance checkReassign checkPLE
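
The real tool is linked above; purely to illustrate the produce-then-consume idea behind ackLatencyMs and e2eLatencyMs, here is a hedged toy sketch (single partition, made-up topic name, 0.10+ Java clients, not the actual kafka-monitor code).

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;
    import org.apache.kafka.common.TopicPartition;

    public class E2eLatencySketch {
      public static void main(String[] args) throws Exception {
        String topic = "kafka-monitor-topic";  // hypothetical monitoring topic

        Properties pp = new Properties();
        pp.put("bootstrap.servers", "localhost:9092");
        pp.put("acks", "1");
        pp.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pp.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Properties cp = new Properties();
        cp.put("bootstrap.servers", "localhost:9092");
        cp.put("group.id", "e2e-latency-check");
        cp.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cp.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(pp);
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp)) {
          TopicPartition tp = new TopicPartition(topic, 0);
          consumer.assign(Collections.singletonList(tp));
          consumer.seekToEnd(Collections.singletonList(tp));
          consumer.position(tp);  // resolve the seek before producing the probe

          long sendTimeMs = System.currentTimeMillis();
          RecordMetadata md = producer.send(
              new ProducerRecord<>(topic, 0, "probe", Long.toString(sendTimeMs))).get();
          long ackLatencyMs = System.currentTimeMillis() - sendTimeMs;

          // Poll until the probe comes back; a real monitor also tracks loss/duplicates and times out
          while (true) {
            for (ConsumerRecord<String, String> r : consumer.poll(1000)) {
              if (r.offset() == md.offset()) {
                long e2eLatencyMs = System.currentTimeMillis() - Long.parseLong(r.value());
                System.out.println("ackLatencyMs=" + ackLatencyMs + " e2eLatencyMs=" + e2eLatencyMs);
                return;
              }
            }
          }
        }
      }
    }
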
  • 137. Q&A
  • 138. Software developers and Site Reliability Engineers at all levels Streams infrastructure @ LinkedIn ● Kafka pub-sub ecosystem ● Stream processing platform built on Apache Samza ● Next Gen Change capture technology (incubating) Contact Kartik Paramasivam Where LinkedIn campus 2061 Stierlin Ct., Mountain View, CA When May 11 at 6.30 PM Register http://bit.ly/1Sv8ach We are hiring! LinkedIn Data Infrastructure meetup