Presented at the inaugural Kafka Summit (2016), hosted by Confluent in San Francisco
Abstract:
Kafka is a backbone for various data pipelines and asynchronous messaging at LinkedIn and beyond. 2015 was an exciting year at LinkedIn in that we hit a new level of scale with Kafka: we now process more than 1 trillion published messages per day across nearly 1300 brokers. We ran into some interesting production issues at this scale, and I will dive into some of the most critical incidents that we encountered at LinkedIn in the past year:
Data loss: We have extremely stringent SLAs on latency and completeness that were violated on a few occasions. Some of these incidents were due to subtle configuration problems or even missing features.
Offset resets: As of early 2015, Kafka-based offset management was still a relatively new feature and we occasionally hit offset resets. Troubleshooting these incidents turned out to be extremely tricky and resulted in various fixes in offset management/log compaction as well as our monitoring.
Cluster unavailability due to high request/response latencies: Such incidents demonstrate how even subtle performance regressions and monitoring gaps can lead to an eventual cluster meltdown.
Power failures! What happens when an entire data center goes down? We experienced this first hand and it was not so pretty.
and more…
This talk will go over how we detected, investigated and remediated each of these issues and summarize some of the features in Kafka that we are working on that will help eliminate or mitigate such incidents in the future.
2. Kafkaesque
adjective Kaf·ka·esque ˌkäf-kə-ˈesk, ˌkaf-
: of, relating to, or suggestive of Franz Kafka or his writings; especially : having
a nightmarishly complex, bizarre, or illogical quality
Merriam-Webster
7. What are offset rewinds?
[Diagram: a partition's log as an offset line. Purged messages (left) and yet-to-arrive messages (right) are invalid offsets; in between lie the valid offsets]
8. What are offset rewinds?
If a consumer gets an OffsetOutOfRangeException, it resets its position according to auto.offset.reset:
[Diagram: same offset line. auto.offset.reset ← earliest moves the consumer to the start of the valid range; auto.offset.reset ← latest moves it to the end]
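To make the reset policy concrete, here is a minimal sketch using the Java consumer (the broker address and group name are placeholders, not LinkedIn's):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ResetPolicyExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "mirror-maker");
        // On OffsetOutOfRangeException: "earliest" rewinds to the start of the
        // valid range (re-consumption, possible duplicates); "latest" jumps to
        // the end (possible message loss).
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            // subscribe and poll as usual
        }
    }
}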
9. What are offset rewinds… and why do they matter?
[Diagram: Hadoop → push job → Kafka (CORP) → Mirror Maker → Kafka (PROD) → Stork → email campaigns]
10. What are offset rewinds… and why do they matter?
[Same pipeline diagram, with an offset rewind marked at the mirroring stage; real-life incident courtesy of xkcd]
11. Offset rewinds: the first incident
Deployment: Deployed Multiproduct kafka-mirror-maker 0.1.13 to DCX by jkoshy
CRT Notifications <crt-notifications-noreply@linkedin.com> Fri, Jul 10, 2015 at 8:27 PM
Multiproduct 0.1.13 of kafka-mirror-maker has been Deployed to DCX by jkoshy
12. Offset rewinds: the first incident
Deployment: Deployed Multiproduct kafka-mirror-maker 0.1.13 to DCX by jkoshy
CRT Notifications <crt-notifications-noreply@linkedin.com> Fri, Jul 10, 2015 at 8:27 PM
Multiproduct 0.1.13 of kafka-mirror-maker has been Deployed to DCX by jkoshy on Wednesday, Jul 8, 2015 at 10:14 AM
(A notification received on Jul 10 for a deployment done on Jul 8: a two-day-old email, re-sent.)
13. What are offset rewinds… and why do they matter?
[Pipeline diagram again: Hadoop → push job → Kafka (CORP) → Mirror Maker → Kafka (PROD) → Stork → email campaigns, with a note on the consuming side: good practice to have some filtering logic here]
18. Offset rewinds: a typical cause
[Diagram: offset line with the consumer's position inside the valid range]
19. Offset rewinds: a typical cause
[Same diagram: the log has been truncated, shrinking the valid range]
Unclean leader election truncates the log
20. Offset rewinds: a typical cause
[Same diagram: the consumer's position now lies beyond the truncated log]
Unclean leader election truncates the log
… and consumer’s offset goes out of range
27. Offset management - a quick overview
[Diagram: the __consumer_offsets topic as an append-only log of commit messages keyed by (group, topic-partition): (mirror-maker, PageViewEvent-0) → 240, (mirror-maker, LoginEvent-8) → 456, (mirror-maker, LoginEvent-8) → 512, (mirror-maker, PageViewEvent-0) → 321]
New offset commits append to the topic
28. Offset management - a quick overview
[Same diagram, plus an in-memory offset cache holding the latest entry per key: (mirror-maker, PageViewEvent-0) → 321, (mirror-maker, LoginEvent-8) → 512]
New offset commits append to the topic
Maintain offset cache to serve offset fetch requests quickly
29. Offset management - a quick overview
[Same diagram]
New offset commits append to the topic
Maintain offset cache to serve offset fetch requests quickly
Purge old offsets via log compaction
30. Offset management - a quick overview
[Same diagram]
When a new broker becomes the leader (i.e., offset manager) it loads offsets into its cache
31. Offset management - a quick overview
[Same diagram, with the rebuilt cache: (mirror-maker, PageViewEvent-0) → 321, (mirror-maker, LoginEvent-8) → 512]
See this deck for more details
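To tie the diagrams to client code, a hedged sketch (the values mirror the diagram; this is not the deck's code) of a commit, which the broker appends to __consumer_offsets keyed by (group, topic, partition) before updating its offset cache:

import java.util.Collections;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

// Reusing the consumer from the earlier sketch: commit offset 321 for
// PageViewEvent-0. The offset manager appends this entry to __consumer_offsets,
// and its cache then serves 321 for this group's fetches of that partition.
consumer.commitSync(Collections.singletonMap(
        new TopicPartition("PageViewEvent", 0), new OffsetAndMetadata(321L)));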
32. Back to the incident…
2015/07/10 02:32:16.174 [ConsumerFetcherThread] [ConsumerFetcherThread-mirror-maker-9de01f48-0-287], Current offset 6811737 for partition [some-log_event,13] out of range; reset offset to 9581225
33. Back to the incident…
... <rebalance>
2015/07/10 02:08:14.252 [some-log_event,13], initOffset 9581205
... <rebalance>
2015/07/10 02:24:11.965 [some-log_event,13], initOffset 9581223
... <rebalance>
2015/07/10 02:32:16.131 [some-log_event,13], initOffset 6811737
...
2015/07/10 02:32:16.174 [ConsumerFetcherThread] [ConsumerFetcherThread-mirror-maker-9de01f48-0-287], Current offset 6811737 for partition [some-log_event,13] out of range; reset offset to 9581225
34. While debugging offset rewinds, do this first!
./bin/kafka-console-consumer.sh --topic __consumer_offsets --zookeeper <zookeeperConnect> --formatter 'kafka.coordinator.GroupMetadataManager$OffsetsMessageFormatter' --consumer.config config/consumer.properties
(must set exclude.internal.topics=false in consumer.properties)
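Each line of the dump decodes one commit message. The exact shape varies across versions, but on brokers of that era it looks roughly like this (illustrative values, not from the incident):

[mirror-maker,PageViewEvent,0]::[OffsetMetadata[321,NO_METADATA],CommitTime 1436493136174,ExpirationTime 1437097936174]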
36. So why did the offset manager return a stale offset?
Offset manager logs:
2015/07/10 02:31:57.941 ERROR [OffsetManager] [kafka-scheduler-1] [kafka-server] [] [Offset Manager on Broker 191]: Error in loading offsets from [__consumer_offsets,63]
java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.String
at kafka.server.OffsetManager$.kafka$server$OffsetManager$$readMessageValue(OffsetManager.scala:576)
37. So why did the offset manager return a stale offset?
Offset manager logs:
2015/07/10 02:31:57.941 ERROR [OffsetManager] [kafka-scheduler-1] [kafka-server] [] [Offset Manager on Broker 191]: Error in loading offsets from [__consumer_offsets,63]
java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.String
at kafka.server.OffsetManager$.kafka$server$OffsetManager$$readMessageValue(OffsetManager.scala:576)
[Diagram: the offsets topic's segments (old offsets … recent offsets); the stale entry (mirror-maker, some-log_event-13) → 6811737 was read from an old segment]
Leader moved and new offset manager hit KAFKA-2117 while loading offsets
38. … caused a ton of offset resets
2015/07/10 02:08:14.252 [some-log_event,13], initOffset 9581205
...
2015/07/10 02:24:11.965 [some-log_event,13], initOffset 9581223
...
2015/07/10 02:32:16.131 [some-log_event,13], initOffset 6811737
...
2015/07/10 02:32:16.174 [ConsumerFetcherThread] [ConsumerFetcherThread-mirror-maker-9de01f48-0-287], Current offset 6811737 for partition [some-log_event,13] out of range; reset offset to 9581225
[Diagram: the log for partition [some-log_event,13], marking offsets 846232 and 9581225; older messages have been purged]
39. … but why the duplicate email?
Deployment: Deployed Multiproduct kafka-mirror-maker 0.1.13 to DCX by jkoshy
CRT Notifications <crt-notifications-noreply@linkedin.com> Fri, Jul 10, 2015 at 8:27 PM
Multiproduct 0.1.13 of kafka-mirror-maker has been Deployed to DCX by jkoshy
40. … but why the duplicate email?
2015/07/10 02:08:15.524 [crt-event,12], initOffset 11464
...
2015/07/10 02:31:40.827 [crt-event,12], initOffset 11464
...
2015/07/10 02:32:17.739 [crt-event,12], initOffset 9539
...
Also from Jun 1
41. … but why the duplicate email?
2015/07/10 02:08:15.524 [crt-event,12], initOffset 11464
...
2015/07/10 02:31:40.827 [crt-event,12], initOffset 11464
...
2015/07/10 02:32:17.739 [crt-event,12], initOffset 9539
...
[Diagram: partition [crt-event,12]; valid offsets span 0 to 11464, so the stale offset 9539 was old… but still valid!]
43. Offset rewinds: the second incident
mirror makers got wedged → restarted → sent duplicate emails to (few) members
44. Offset rewinds: the second incident
Consumer logs
2015/04/29 17:22:48.952 <rebalance started>
...
2015/04/29 17:36:37.790 <rebalance ended> initOffset -1 (for various partitions)
45. Offset rewinds: the second incident
Consumer logs
2015/04/29 17:22:48.952 <rebalance started>
...
2015/04/29 17:36:37.790 <rebalance ended> initOffset -1 (for various partitions)
Broker (offset manager) logs
2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84]
...
2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!)
46. Offset rewinds: the second incident
Consumer logs
2015/04/29 17:22:48.952 <rebalance started>
...
2015/04/29 17:36:37.790 <rebalance ended> initOffset -1 (for various partitions)
Broker (offset manager) logs
2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84]
...
2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!)
⇒ log cleaner had failed a while ago…
but why did offset fetch return -1?
47. Offset management - a quick overview
How are stale offsets (for dead consumers) cleaned up?
[Diagram: offset cache entries: (dead-group, PageViewEvent-0) → 321 with a timestamp older than a week; (active-group, LoginEvent-8) → 512 with a recent timestamp. A cleanup task scans the cache against the __consumer_offsets topic]
48. Offset management - a quick overview
How are stale offsets (for dead consumers) cleaned up?
[Same diagram: the cleanup task appends tombstones for dead-group to __consumer_offsets and deletes its entry from the offset cache]
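The one-week horizon and the cleanup cadence in this example map to broker settings; a server.properties sketch (these are the standard offsets-retention knobs of that era; the values here are illustrative, not LinkedIn's):

offsets.retention.minutes=10080              # committed offsets older than 7 days become eligible for cleanup
offsets.retention.check.interval.ms=600000   # how often the cleanup task runs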
49. Back to the incident...
2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84]
...
2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!)
[Diagram: loading offsets from the topic (old offsets … recent offsets); entries from old segments, e.g., (mirror-maker, PageViewEvent-0) → 45 and (mirror-maker, LoginEvent-8) → 12, both with very old timestamps, are read first]
50. Back to the incident...
2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84]
...
2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!)
[Same diagram]
Cleanup task happened to run during the load
51. Back to the incident...
2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84]
...
2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!)
[Same diagram: the cleanup task appends tombstones for those groups to the end of the topic]
52. Back to the incident...
2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84]
...
2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!)
[Same diagram: the load then reads the recent entries: (mirror-maker, PageViewEvent-0) → 321 and (mirror-maker, LoginEvent-8) → 512, both with recent timestamps]
53. Back to the incident...
2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84]
...
2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!)
[Same diagram: last of all, the load reads the cleanup's tombstones, which wipe the just-loaded recent offsets from the cache ⇒ offset fetches return -1]
54. Root cause of this rewind
● Log cleaner had failed (separate bug)
○ ⇒ offsets topic grew big
○ ⇒ offset load on leader movement took a while
● Cache cleanup ran during the load
○ which appended tombstones
○ and overrode the most recent offsets
● (Fixed in KAFKA-2163)
55. Offset rewinds: wrapping it up
● Monitor log cleaner health
● If you suspect a rewind:
○ Check for unclean leader elections
○ Check for offset manager movement (i.e., __consumer_offsets partitions had leader changes)
○ Take a dump of the offsets topic
○ … stare long and hard at the logs (both consumer and offset manager)
● auto.offset.reset ← closest ?
● Better lag monitoring via Burrow
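Burrow infers consumer health from the committed offsets themselves rather than a single lag threshold; a sketch of polling it over HTTP (host, port, cluster, and group are placeholders, and the path follows Burrow's v2 API of that era, which may differ in later versions):

curl http://burrow-host:8000/v2/kafka/prod-cluster/consumer/mirror-maker/lag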
59. Data loss: detection (example 1)
[Diagram: in each production data center (PROD-A, PROD-B), producers → local Kafka → aggregate Kafka; the aggregates are mirrored into aggregate Kafka clusters in the corp data centers (CORP-X, CORP-Y), each feeding Hadoop. Audit counts are compared across tiers to detect loss]
60. Data loss: detection (example 1)
[Same multi-colo pipeline diagram]
61. Data loss: detection (example 2)
[Same multi-colo pipeline diagram]
62. Data loss? (The actual incident)
[Same multi-colo pipeline diagram]
63. Data loss or audit issue? (The actual incident)
[Same diagram: audit counts at every other tier check out (✔)]
Sporadic discrepancies in Kafka-aggregate-CORP-X counts for several topics
However, Hadoop-X tier is complete
64. Verified actual data completeness by recounting events in a few low-volume topics
… so definitely an audit-only issue
Likely caused by dropping audit events
65. Verified actual data completeness by recounting events in a few low-volume topics
… so definitely an audit-only issue
Possible sources of discrepancy:
● Cluster auditor
● Cluster itself (i.e., data loss in audit topic)
● Audit front-end
Likely caused by dropping audit events
67. Possible causes
Data loss in audit topic
● … but no unclean leader elections
● … and no data loss in sampled topics (counted manually)
[Diagram: in CORP-X, aggregate Kafka feeds Hadoop; the cluster auditor consumes all topics and emits audit counts]
68. Possible causes
Audit front-end fails to insert audit events into DB
● … but other tiers (e.g., CORP-Y) are correct
● … and no errors in logs
[Diagram: the audit front-end consumes the audit topic from CORP-X's aggregate Kafka and inserts into the Audit DB, which also receives inserts from CORP-Y]
69. Attempt to reproduce
● Emit counts to new test tier
[Diagram: a second cluster-auditor instance consumes all topics from CORP-X's aggregate Kafka and emits counts under tier test, alongside the existing tier CORP-X]
70. Attempt to reproduce
… fortunately worked:
● Emit counts to new test tier
● test tier counts were also sporadically off
[Same diagram]
71. … and debug
● Enabled select TRACE logs to log audit events before sending
● Audit counts were correct
● … and successfully emitted
[Same diagram]
72. … and debug
● Enabled select TRACE logs to log audit events before sending
● Audit counts were correct
● … and successfully emitted
● Verified from broker public access logs that audit event was sent
[Same diagram]
73. … and debug
● Enabled select TRACE logs to log audit events before sending
● Audit counts were correct
● … and successfully emitted
● Verified from broker public access logs that audit event was sent
● … but on closer look realized it was not the leader for that partition of the audit topic
[Same diagram]
74. … and debug
● Enabled select TRACE logs to log audit events before sending
● Audit counts were correct
● … and successfully emitted
● Verified from broker public access logs that audit event was sent
● … but on closer look realized it was not the leader for that partition of the audit topic
● So why did it not return NotLeaderForPartition?
[Same diagram]
75. That broker was part of another cluster!
[Diagram: the "audit" broker actually belonged to some other Kafka cluster, which had been siphoning the audit events]
76. … and we had a VIP misconfiguration
[Diagram: the VIP in front of CORP-X's aggregate Kafka contained a stray broker entry pointing into some other Kafka cluster]
77. So audit events leaked into the other cluster
● Auditor still uses the old producer
● Periodically refreshes metadata (via VIP) for the audit topic
● ⇒ sometimes fetches metadata from the other cluster
[Diagram: the auditor's metadata request for the audit topic goes through the VIP, can land on the stray broker, and returns metadata pointing at the other cluster]
78. So audit events leaked into the other cluster
● Auditor still uses the old producer
● Periodically refreshes metadata (via VIP) for the audit topic
● ⇒ sometimes fetches metadata from the other cluster
● and leaks audit events to that cluster until at least next metadata refresh
[Same diagram: audit counts get emitted to the other cluster]
79. Some takeaways
● Could have been worse if mirror-makers to CORP-X had been bounced
○ (Since mirror makers could have started siphoning actual data to the other cluster)
● Consider using round-robin DNS instead of VIPs
○ … which is also necessary for using per-IP connection limits
80. Data loss: the second incident
Prolonged period of data loss from our Kafka REST proxy
81. Data loss: the second incident
Prolonged period of data loss from our Kafka REST proxy
● Alerts fire that a broker in the tracking cluster had gone offline
● NOC engages SYSOPS to investigate
● NOC engages Feed SREs and Kafka SREs to investigate drop (not loss) in a subset of page views
● On investigation, Kafka SRE finds no problems with Kafka (excluding the down broker), but notes an overall drop in tracking messages starting shortly after the broker failure
● NOC engages Traffic SRE to investigate why their tracking events had stopped
● Traffic SRE says they don't see errors on their side, and adds that they use the Kafka REST proxy
● Kafka SRE finds no immediate errors in Kafka REST logs but bounces the service as a precautionary measure
● Tracking events return to normal (expected) counts after the bounce
87. Reproducing the issue
[Diagram: new producer internals. The Accumulator buffers batches for partitions 1…n; the Sender drains them and tracks in-flight requests. Broker A is the old leader for partition 1, Broker B the new leader]
New producer did not implement a request timeout
⇒ awaiting response
⇒ unaware of leader change until next metadata refresh
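The eventual fix (KIP-19) added a client-side request timeout; a hedged sketch of bounding the wait once that config existed (the broker address is a placeholder):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder address
// Fail an in-flight request after 30s instead of waiting indefinitely on a
// broker that is no longer the leader; the retry then picks up fresh metadata.
props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, "30000");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.ByteArraySerializer");
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.ByteArraySerializer");
KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props);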
93. The incident
Occurred a few days after upgrading to pick up quotas and SSL
[Timeline spanning April 5, June 3, August 18, and October 13: multi-port support (KAFKA-1809, KAFKA-1928), SSL (KAFKA-1690), and various quota patches landed between internal releases x25 and x38]
98. The incident
● Other observations
○ High CPU load on those brokers
○ Throughput degrades to ~ half the normal throughput
○ Tons of broken pipe exceptions in server logs
○ Application owners report socket timeouts in their logs
100. Remediation
● Controller moves did not help
● Firewall the affected brokers
● The above helped, but cluster fell over again after dropping the rules
● Suspect misbehaving clients on broker failure
○ … but x25 never exhibited this issue
sudo iptables -A INPUT -p tcp --dport <broker-port> -s <other-broker> -j ACCEPT
sudo iptables -A INPUT -p tcp --dport <broker-port> -j DROP
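(Rule order matters in the commands above: iptables -A appends and evaluation is first-match, so the per-broker ACCEPT entries must go in before the blanket DROP. Clients get cut off while inter-broker replication keeps flowing.)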
101. Remediation
Friday night ⇒ roll-back to x25 and debug later
… but SREs had to babysit the rollback
[Diagram: rolling downgrade across four brokers running x38]
102. Remediation
Friday night ⇒ roll-back to x25 and debug later
… but SREs had to babysit the rollback
[Diagram: step 1 of the rolling downgrade: move leaders off the target broker]
103. Remediation
Friday night ⇒ roll-back to x25 and debug later
… but SREs had to babysit the rollback
[Diagram: step 2: firewall the target broker]
104. Remediation
Friday night ⇒ roll-back to x25 and debug later
… but SREs had to babysit the rollback
[Diagram: step 3: downgrade the firewalled broker to x25; three brokers remain on x38]
105. Remediation
Friday night ⇒ roll-back to x25 and debug later
… but SREs had to babysit the rollback
[Diagram: step 4: move leaders back, then repeat for the next broker]
106. Attempts at reproducing the issue
● Test cluster
○ Tried killing controller
○ Multiple rolling bounces
○ Could not reproduce
● Upgraded the queuing cluster to x38 again
○ Could not reproduce
● So nothing…
108. Life-cycle of a Kafka request
[Diagram: the network layer's Acceptor hands new client connections to Processors; Processors read requests into the Request queue (queue-time); API handler threads process each request (local-time); long-poll requests wait in Purgatory (remote-time); the Quota manager holds the response if a quota is violated (quota-time); the response waits in the owning Processor's Response queue (response-queue-time) and is finally written back to the client (response-send-time)]
Total time = queue-time + local-time + remote-time + quota-time + response-queue-time + response-send-time
109. Investigating high request times
● First look for high local time
○ then high response send time
■ then high remote (purgatory) time → generally non-issue (but caveats described later)
● High request queue/response queue times are effects, not causes
111. How are fetch requests handled?
● Get physical offsets to be read from local log during response
● If fetch from follower (i.e., replica fetch):
○ If follower was out of ISR and just caught-up then expand ISR (ZooKeeper write)
○ Maybe satisfy eligible delayed produce requests (with acks -1)
● Else (i.e., consumer fetch):
○ Record/update byte-rate of this client
○ Throttle the request on quota violation
112. Could these cause high local times?
● Get physical offsets to be read from local log during response: should be fast
● If fetch from follower (i.e., replica fetch):
○ If follower was out of ISR and just caught up, then expand ISR (ZooKeeper write)
○ Maybe satisfy eligible delayed produce requests (with acks -1): not using acks -1
● Else (i.e., consumer fetch):
○ Record/update byte-rate of this client: should be fast… test this
○ Throttle the request on quota violation: delayed outside API thread
117. meanwhile in our queuing cluster…
due to climbing client-id counts
118. Rolling bounce of cluster forced the issue to recur on brokers that had high client-id metric counts
○ Used jmxterm to check per-client-id metric counts before experiment
○ Hooked up profiler to verify during incident
■ Generally avoid profiling/heapdumps in production due to interference
○ Did not see in earlier rolling bounce due to only a few client-id metrics at the time
119. How to fix high local times
● Optimize the request's handling, e.g.:
○ cached topic metadata as opposed to ZooKeeper reads (see KAFKA-901)
○ and KAFKA-1356
● Make it asynchronous
○ E.g., we will do this for StopReplica in KAFKA-1911
● Put it in a purgatory (usually if response depends on some condition); but be
aware of the caveats:
○ Higher memory pressure if request purgatory size grows
○ Expired requests are handled in purgatory expiration thread (which is good)
○ but satisfied requests are handled in the API thread of the satisfying request ⇒ if a request satisfies several delayed requests then local time can increase for the satisfying request
120. Monitor these closely!
● Request queue size
● Response queue sizes
● Request latencies:
○ Total time
○ Local time
○ Response send time
○ Remote time
● Request handler pool idle ratio
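A hedged sketch of pulling a few of these over JMX (broker host and JMX port are placeholders; the MBeans are the standard kafka.network request metrics, whose histogram attributes include 99thPercentile):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RequestTimeProbe {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi"); // placeholders
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            for (String request : new String[]{"Produce", "FetchConsumer", "FetchFollower"}) {
                for (String metric : new String[]{"TotalTimeMs", "RequestQueueTimeMs",
                        "LocalTimeMs", "RemoteTimeMs", "ResponseQueueTimeMs", "ResponseSendTimeMs"}) {
                    // e.g., kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce
                    ObjectName bean = new ObjectName(
                            "kafka.network:type=RequestMetrics,name=" + metric + ",request=" + request);
                    System.out.printf("%s %s p99=%s%n",
                            request, metric, mbsc.getAttribute(bean, "99thPercentile"));
                }
            }
        }
    }
}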
122. The first incident: new clients old clusters
[Diagram: a test cluster and a certification cluster, both on the old broker version, send metric events to a metrics cluster (old version)]
123. The first incident: new clients old clusters
[Same diagram: the test cluster is now on the new version; its metric events to the old-version metrics cluster now fail with:]
org.apache.kafka.common.protocol.types.SchemaException: Error reading field 'throttle_time_ms': java.nio.BufferUnderflowException
at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:73)
at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:397)
...
124. New clients old clusters: remediation
[Same diagram: test and certification clusters on the new version; metrics cluster still on the old version]
Set acks to zero
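In producer-config terms the workaround looks like this sketch (the standard acks setting; not the deck's code): with acks=0 the client never reads a produce response, so the old broker's response, which lacks the throttle_time_ms field, is never parsed.

// Temporary, while the destination cluster still runs the old version:
props.put(ProducerConfig.ACKS_CONFIG, "0"); // fire-and-forget: no response to mis-parse
// ... and once the metrics cluster is upgraded (next slide), restore:
props.put(ProducerConfig.ACKS_CONFIG, "1");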
125. New clients old clusters: remediation
[Same diagram: the metrics cluster upgraded to the new version as well]
Reset acks to 1
126. New clients old clusters: remediation
(BTW this just hit us again with the protocol changes in KIP-31/KIP-32)
KIP-35 would help a ton!
127. The second incident: new endpoints
{ "version":1,
"jmx_port":9999,
"timestamp":2233345666,
"host":"localhost",
“port”:9092
}
x14older broker versions
ZooKeeper
registration
{ "version”:2,
"jmx_port":9999,
"timestamp":2233345666,
"host":"localhost",
“port”:9092,
"endpoints": [
{"plaintext://localhost:
9092"}
]
}
x14
client
old
client
ignore endpoints v2 ⇒ use endpoints
128. The second incident: new endpoints
{ "version":1,
"jmx_port":9999,
"timestamp":2233345666,
"host":"localhost",
“port”:9092
}
x14older broker versions
ZooKeeper
registration
{ "version”:2,
"jmx_port":9999,
"timestamp":2233345666,
"host":"localhost",
“port”:9092,
"endpoints": [
{"plaintext://localhost:
9092"}
]
}
x36
{ "version”:2,
"jmx_port":9999,
"timestamp":2233345666,
"host":"localhost",
“port”:9092,
"endpoints": [
{"plaintext://localhost:
9092"}, {“ssl:
//localhost:9093”} ]
}
x14
client
old
client
java.lang.IllegalArgumentException: No enum constant
org.apache.kafka.common.protocol.SecurityProtocol.SSL
at java.lang.Enum.valueOf(Enum.java:238)
at org.apache.kafka.common.protocol.
SecurityProtocol.valueOf(SecurityProtocol.java:24)
133. Widespread FS corruption after power outage
● Mount settings at the time
○ type ext4 (rw,noatime,data=writeback,commit=120)
● Restarts were successful but brokers subsequently hit corruption
● Subsequent restarts also hit corruption in index files
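A plausible reading (ours, not stated on the slide): with data=writeback, ext4 can commit metadata before the data blocks it references, so after a power loss recently written files, index files especially, can come back containing stale or garbage bytes even though the journal replays cleanly. data=ordered avoids that at some throughput cost; an illustrative /etc/fstab entry (device and mount point are placeholders):

/dev/sda1  /export/kafka  ext4  rw,noatime,data=ordered,commit=120  0 0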
138. We are hiring! LinkedIn Data Infrastructure meetup
Software developers and Site Reliability Engineers at all levels
Streams infrastructure @ LinkedIn
● Kafka pub-sub ecosystem
● Stream processing platform built on Apache Samza
● Next-gen change capture technology (incubating)
Contact: Kartik Paramasivam
Where: LinkedIn campus, 2061 Stierlin Ct., Mountain View, CA
When: May 11 at 6:30 PM
Register: http://bit.ly/1Sv8ach