Presented at the inaugural Kafka Summit (2016), hosted by Confluent in San Francisco
Abstract:
Kafka is a backbone for various data pipelines and asynchronous messaging at LinkedIn and beyond. 2015 was an exciting year at LinkedIn in that we hit a new level of scale with Kafka: we now process more than 1 trillion published messages per day across nearly 1300 brokers. We ran into some interesting production issues at this scale, and I will dive into some of the most critical incidents that we encountered at LinkedIn in the past year:
Data loss: We have extremely stringent SLAs on latency and completeness that were violated on a few occasions. Some of these incidents were due to subtle configuration problems or even missing features.
Offset resets: As of early 2015, Kafka-based offset management was still a relatively new feature and we occasionally hit offset resets. Troubleshooting these incidents turned out to be extremely tricky and resulted in various fixes in offset management/log compaction as well as our monitoring.
Cluster unavailability due to high request/response latencies: Such incidents demonstrate how even subtle performance regressions and monitoring gaps can lead to an eventual cluster meltdown.
Power failures! What happens when an entire data center goes down? We experienced this first hand and it was not so pretty.
and more…
This talk will go over how we detected, investigated and remediated each of these issues and summarize some of the features in Kafka that we are working on that will help eliminate or mitigate such incidents in the future.
2. Kafkaesque
adjective Kaf·ka·esque ˌkäf-kə-ˈesk, ˌkaf-
: of, relating to, or suggestive of Franz Kafka or his writings; especially : having
a nightmarishly complex, bizarre, or illogical quality
Merriam-Webster
7. What are offset rewinds?
[Diagram: a partition's log as an offset line. Purged messages (left) and yet-to-arrive messages (right) are invalid offsets; in between lie the valid offsets]
8. What are offset rewinds?
If a consumer gets an OffsetOutOfRangeException, it resets its position according to auto.offset.reset:
[Diagram: same offset line. auto.offset.reset ← earliest moves the consumer to the start of the valid range; auto.offset.reset ← latest moves it to the end]
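To make the reset policy concrete, here is a minimal sketch using the Java consumer (the broker address and group name are placeholders, not LinkedIn's):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ResetPolicyExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "mirror-maker");
        // On OffsetOutOfRangeException: "earliest" rewinds to the start of the
        // valid range (re-consumption, possible duplicates); "latest" jumps to
        // the end (possible message loss).
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            // subscribe and poll as usual
        }
    }
}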
9. What are offset rewinds… and why do they matter?
[Diagram: Hadoop → push job → Kafka (CORP) → Mirror Maker → Kafka (PROD) → Stork → email campaigns]
10. What are offset rewinds… and why do they matter?
[Same pipeline diagram, with an offset rewind marked at the mirroring stage; real-life incident courtesy of xkcd]
11. Offset rewinds: the first incident
Deployment: Deployed Multiproduct kafka-mirror-maker 0.1.13 to DCX by jkoshy
CRT Notifications <crt-notifications-noreply@linkedin.com> Fri, Jul 10, 2015 at 8:27 PM
Multiproduct 0.1.13 of kafka-mirror-maker has been Deployed to DCX by jkoshy
12. Offset rewinds: the first incident
Deployment: Deployed Multiproduct kafka-mirror-maker 0.1.13 to DCX by jkoshy
CRT Notifications <crt-notifications-noreply@linkedin.com> Fri, Jul 10, 2015 at 8:27 PM
Multiproduct 0.1.13 of kafka-mirror-maker has been Deployed to DCX by jkoshy on Wednesday, Jul 8, 2015 at 10:14 AM
(A notification received on Jul 10 for a deployment done on Jul 8: a two-day-old email, re-sent.)
13. What are offset rewinds… and why do they matter?
[Pipeline diagram again: Hadoop → push job → Kafka (CORP) → Mirror Maker → Kafka (PROD) → Stork → email campaigns, with a note on the consuming side: good practice to have some filtering logic here]
18. Offset rewinds: a typical cause
[Diagram: offset line with the consumer's position inside the valid range]
19. Offset rewinds: a typical cause
[Same diagram: the log has been truncated, shrinking the valid range]
Unclean leader election truncates the log
20. Offset rewinds: a typical cause
[Same diagram: the consumer's position now lies beyond the truncated log]
Unclean leader election truncates the log
… and consumer’s offset goes out of range
27. Offset management - a quick overview
[Diagram: the __consumer_offsets topic as an append-only log of commit messages keyed by (group, topic-partition): (mirror-maker, PageViewEvent-0) → 240, (mirror-maker, LoginEvent-8) → 456, (mirror-maker, LoginEvent-8) → 512, (mirror-maker, PageViewEvent-0) → 321]
New offset commits append to the topic
28. Offset management - a quick overview
[Same diagram, plus an in-memory offset cache holding the latest entry per key: (mirror-maker, PageViewEvent-0) → 321, (mirror-maker, LoginEvent-8) → 512]
New offset commits append to the topic
Maintain offset cache to serve offset fetch requests quickly
29. Offset management - a quick overview
[Same diagram]
New offset commits append to the topic
Maintain offset cache to serve offset fetch requests quickly
Purge old offsets via log compaction
30. Offset management - a quick overview
[Same diagram]
When a new broker becomes the leader (i.e., offset manager) it loads offsets into its cache
31. Offset management - a quick overview
[Same diagram, with the rebuilt cache: (mirror-maker, PageViewEvent-0) → 321, (mirror-maker, LoginEvent-8) → 512]
See this deck for more details
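To tie the diagrams to client code, a hedged sketch (the values mirror the diagram; this is not the deck's code) of a commit, which the broker appends to __consumer_offsets keyed by (group, topic, partition) before updating its offset cache:

import java.util.Collections;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

// Reusing the consumer from the earlier sketch: commit offset 321 for
// PageViewEvent-0. The offset manager appends this entry to __consumer_offsets,
// and its cache then serves 321 for this group's fetches of that partition.
consumer.commitSync(Collections.singletonMap(
        new TopicPartition("PageViewEvent", 0), new OffsetAndMetadata(321L)));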
32. Back to the incident…
2015/07/10 02:32:16.174 [ConsumerFetcherThread] [ConsumerFetcherThread-mirror-maker-9de01f48-0-287], Current offset 6811737 for partition [some-log_event,13] out of range; reset offset to 9581225
33. Back to the incident…
... <rebalance>
2015/07/10 02:08:14.252 [some-log_event,13], initOffset 9581205
... <rebalance>
2015/07/10 02:24:11.965 [some-log_event,13], initOffset 9581223
... <rebalance>
2015/07/10 02:32:16.131 [some-log_event,13], initOffset 6811737
...
2015/07/10 02:32:16.174 [ConsumerFetcherThread] [ConsumerFetcherThread-mirror-maker-9de01f48-0-287], Current offset 6811737 for partition [some-log_event,13] out of range; reset offset to 9581225
34. While debugging offset rewinds, do this first!
./bin/kafka-console-consumer.sh --topic __consumer_offsets --zookeeper <zookeeperConnect> --formatter 'kafka.coordinator.GroupMetadataManager$OffsetsMessageFormatter' --consumer.config config/consumer.properties
(must set exclude.internal.topics=false in consumer.properties)
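Each line of the dump decodes one commit message. The exact shape varies across versions, but on brokers of that era it looks roughly like this (illustrative values, not from the incident):

[mirror-maker,PageViewEvent,0]::[OffsetMetadata[321,NO_METADATA],CommitTime 1436493136174,ExpirationTime 1437097936174]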
36. So why did the offset manager return a stale offset?
Offset manager logs:
2015/07/10 02:31:57.941 ERROR [OffsetManager] [kafka-scheduler-1] [kafka-server] [] [Offset Manager on Broker 191]: Error in loading offsets from [__consumer_offsets,63]
java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.String
at kafka.server.OffsetManager$.kafka$server$OffsetManager$$readMessageValue(OffsetManager.scala:576)
37. So why did the offset manager return a stale offset?
Offset manager logs:
2015/07/10 02:31:57.941 ERROR [OffsetManager] [kafka-scheduler-1] [kafka-server] [] [Offset Manager on Broker 191]: Error in loading offsets from [__consumer_offsets,63]
java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.String
at kafka.server.OffsetManager$.kafka$server$OffsetManager$$readMessageValue(OffsetManager.scala:576)
[Diagram: the offsets topic's segments (old offsets … recent offsets); the stale entry (mirror-maker, some-log_event-13) → 6811737 was read from an old segment]
Leader moved and new offset manager hit KAFKA-2117 while loading offsets
38. … caused a ton of offset resets
2015/07/10 02:08:14.252 [some-log_event,13], initOffset 9581205
...
2015/07/10 02:24:11.965 [some-log_event,13], initOffset 9581223
...
2015/07/10 02:32:16.131 [some-log_event,13], initOffset 6811737
...
2015/07/10 02:32:16.174 [ConsumerFetcherThread] [ConsumerFetcherThread-mirror-maker-9de01f48-0-287], Current offset 6811737 for partition [some-log_event,13] out of range; reset offset to 9581225
[Diagram: the log for partition [some-log_event,13], marking offsets 846232 and 9581225; older messages have been purged]
39. … but why the duplicate email?
Deployment: Deployed Multiproduct kafka-mirror-maker 0.1.13 to DCX by jkoshy
CRT Notifications <crt-notifications-noreply@linkedin.com> Fri, Jul 10, 2015 at 8:27 PM
Multiproduct 0.1.13 of kafka-mirror-maker has been Deployed to DCX by jkoshy
40. … but why the duplicate email?
2015/07/10 02:08:15.524 [crt-event,12], initOffset 11464
...
2015/07/10 02:31:40.827 [crt-event,12], initOffset 11464
...
2015/07/10 02:32:17.739 [crt-event,12], initOffset 9539
...
Also from Jun 1
41. … but why the duplicate email?
2015/07/10 02:08:15.524 [crt-event,12], initOffset 11464
...
2015/07/10 02:31:40.827 [crt-event,12], initOffset 11464
...
2015/07/10 02:32:17.739 [crt-event,12], initOffset 9539
...
[Diagram: partition [crt-event,12]; valid offsets span 0 to 11464, so the stale offset 9539 was old… but still valid!]
43. Offset rewinds: the second incident
mirror makers got wedged → restarted → sent duplicate emails to (few) members
44. Offset rewinds: the second incident
Consumer logs
2015/04/29 17:22:48.952 <rebalance started>
...
2015/04/29 17:36:37.790 <rebalance ended> initOffset -1 (for various partitions)
45. Offset rewinds: the second incident
Consumer logs
2015/04/29 17:22:48.952 <rebalance started>
...
2015/04/29 17:36:37.790 <rebalance ended> initOffset -1 (for various partitions)
Broker (offset manager) logs
2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84]
...
2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!)
46. Offset rewinds: the second incident
Consumer logs
2015/04/29 17:22:48.952 <rebalance started>
...
2015/04/29 17:36:37.790 <rebalance ended> initOffset -1 (for various partitions)
Broker (offset manager) logs
2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84]
...
2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!)
⇒ log cleaner had failed a while ago…
but why did offset fetch return -1?
47. Offset management - a quick overview
How are stale offsets (for dead consumers) cleaned up?
[Diagram: offset cache entries: (dead-group, PageViewEvent-0) → 321 with a timestamp older than a week; (active-group, LoginEvent-8) → 512 with a recent timestamp. A cleanup task scans the cache against the __consumer_offsets topic]
48. Offset management - a quick overview
How are stale offsets (for dead consumers) cleaned up?
[Same diagram: the cleanup task appends tombstones for dead-group to __consumer_offsets and deletes its entry from the offset cache]
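The one-week horizon and the cleanup cadence in this example map to broker settings; a server.properties sketch (these are the standard offsets-retention knobs of that era; the values here are illustrative, not LinkedIn's):

offsets.retention.minutes=10080              # committed offsets older than 7 days become eligible for cleanup
offsets.retention.check.interval.ms=600000   # how often the cleanup task runs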
49. Back to the incident...
2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84]
...
2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!)
[Diagram: loading offsets from the topic (old offsets … recent offsets); entries from old segments, e.g., (mirror-maker, PageViewEvent-0) → 45 and (mirror-maker, LoginEvent-8) → 12, both with very old timestamps, are read first]
50. Back to the incident...
2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84]
...
2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!)
[Same diagram]
Cleanup task happened to run during the load
51. Back to the incident...
2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84]
...
2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!)
[Same diagram: the cleanup task appends tombstones for those groups to the end of the topic]
52. Back to the incident...
2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84]
...
2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!)
[Same diagram: the load then reads the recent entries: (mirror-maker, PageViewEvent-0) → 321 and (mirror-maker, LoginEvent-8) → 512, both with recent timestamps]
53. Back to the incident...
2015/04/29 17:18:46.143 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Loading offsets from [__consumer_offsets,84]
...
2015/04/29 17:36:35.228 INFO [OffsetManager] [kafka-scheduler-3] [kafka-server] [] [Offset Manager on Broker 517]: Finished loading offsets from [__consumer_offsets,84] in 1069085 milliseconds. (17 minutes!)
[Same diagram: last of all, the load reads the cleanup's tombstones, which wipe the just-loaded recent offsets from the cache ⇒ offset fetches return -1]
54. Root cause of this rewind
● Log cleaner had failed (separate bug)
○ ⇒ offsets topic grew big
○ ⇒ offset load on leader movement took a while
● Cache cleanup ran during the load
○ which appended tombstones
○ and overrode the most recent offsets
● (Fixed in KAFKA-2163)
55. Offset rewinds: wrapping it up
● Monitor log cleaner health
● If you suspect a rewind:
○ Check for unclean leader elections
○ Check for offset manager movement (i.e., __consumer_offsets partitions had leader changes)
○ Take a dump of the offsets topic
○ … stare long and hard at the logs (both consumer and offset manager)
● auto.offset.reset ← closest ?
● Better lag monitoring via Burrow
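Burrow infers consumer health from the committed offsets themselves rather than a single lag threshold; a sketch of polling it over HTTP (host, port, cluster, and group are placeholders, and the path follows Burrow's v2 API of that era, which may differ in later versions):

curl http://burrow-host:8000/v2/kafka/prod-cluster/consumer/mirror-maker/lag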
59. Data loss: detection (example 1)
[Diagram: in each production data center (PROD-A, PROD-B), producers → local Kafka → aggregate Kafka; the aggregates are mirrored into aggregate Kafka clusters in the corp data centers (CORP-X, CORP-Y), each feeding Hadoop. Audit counts are compared across tiers to detect loss]
60. Data loss: detection (example 1)
[Same multi-colo pipeline diagram]
61. Data loss: detection (example 2)
[Same multi-colo pipeline diagram]
62. Data loss? (The actual incident)
[Same multi-colo pipeline diagram]
63. Data loss or audit issue? (The actual incident)
[Same diagram: audit counts at every other tier check out (✔)]
Sporadic discrepancies in Kafka-aggregate-CORP-X counts for several topics
However, Hadoop-X tier is complete
64. Verified actual data completeness by recounting events in a few low-volume topics
… so definitely an audit-only issue
Likely caused by dropping audit events
65. Verified actual data completeness by recounting events in a few low-volume topics
… so definitely an audit-only issue
Possible sources of discrepancy:
● Cluster auditor
● Cluster itself (i.e., data loss in audit topic)
● Audit front-end
Likely caused by dropping audit events
67. Possible causes
Data loss in audit topic
● … but no unclean leader elections
● … and no data loss in sampled topics (counted manually)
[Diagram: in CORP-X, aggregate Kafka feeds Hadoop; the cluster auditor consumes all topics and emits audit counts]
68. Possible causes
Audit front-end fails to insert audit events into DB
● … but other tiers (e.g., CORP-Y) are correct
● … and no errors in logs
[Diagram: the audit front-end consumes the audit topic from CORP-X's aggregate Kafka and inserts into the Audit DB, which also receives inserts from CORP-Y]
69. Attempt to reproduce
● Emit counts to new test tier
[Diagram: a second cluster-auditor instance consumes all topics from CORP-X's aggregate Kafka and emits counts under tier test, alongside the existing tier CORP-X]
70. Attempt to reproduce
… fortunately worked:
● Emit counts to new test tier
● test tier counts were also sporadically off
[Same diagram]
71. … and debug
● Enabled select TRACE logs to log audit events before sending
● Audit counts were correct
● … and successfully emitted
[Same diagram]
72. … and debug
● Enabled select TRACE logs to log audit events before sending
● Audit counts were correct
● … and successfully emitted
● Verified from broker public access logs that audit event was sent
[Same diagram]
73. … and debug
● Enabled select TRACE logs to log audit events before sending
● Audit counts were correct
● … and successfully emitted
● Verified from broker public access logs that audit event was sent
● … but on closer look realized it was not the leader for that partition of the audit topic
[Same diagram]
74. … and debug
● Enabled select TRACE logs to log audit events before sending
● Audit counts were correct
● … and successfully emitted
● Verified from broker public access logs that audit event was sent
● … but on closer look realized it was not the leader for that partition of the audit topic
● So why did it not return NotLeaderForPartition?
[Same diagram]
75. That broker was part of another cluster!
[Diagram: the "audit" broker actually belonged to some other Kafka cluster, which had been siphoning the audit events]
76. … and we had a VIP misconfiguration
[Diagram: the VIP in front of CORP-X's aggregate Kafka contained a stray broker entry pointing into some other Kafka cluster]
77. So audit events leaked into the other cluster
● Auditor still uses the old producer
● Periodically refreshes metadata (via VIP) for the audit topic
● ⇒ sometimes fetches metadata from the other cluster
[Diagram: the auditor's metadata request for the audit topic goes through the VIP, can land on the stray broker, and returns metadata pointing at the other cluster]
78. So audit events leaked into the other cluster
● Auditor still uses the old producer
● Periodically refreshes metadata (via VIP) for the audit topic
● ⇒ sometimes fetches metadata from the other cluster
● and leaks audit events to that cluster until at least next metadata refresh
[Same diagram: audit counts get emitted to the other cluster]
79. Some takeaways
● Could have been worse if mirror-makers to CORP-X had been bounced
○ (Since mirror makers could have started siphoning actual data to the other cluster)
● Consider using round-robin DNS instead of VIPs
○ … which is also necessary for using per-IP connection limits
80. Data loss: the second incident
Prolonged period of data loss from our Kafka REST proxy
81. Data loss: the second incident
Prolonged period of data loss from our Kafka REST proxy
● Alerts fire that a broker in the tracking cluster had gone offline
● NOC engages SYSOPS to investigate
● NOC engages Feed SREs and Kafka SREs to investigate drop (not loss) in a subset of page views
● On investigation, Kafka SRE finds no problems with Kafka (excluding the down broker), but notes an overall drop in tracking messages starting shortly after the broker failure
● NOC engages Traffic SRE to investigate why their tracking events had stopped
● Traffic SRE says they don't see errors on their side, and adds that they use the Kafka REST proxy
● Kafka SRE finds no immediate errors in Kafka REST logs but bounces the service as a precautionary measure
● Tracking events return to normal (expected) counts after the bounce
87. Reproducing the issue
[Diagram: new producer internals. The Accumulator buffers batches for partitions 1…n; the Sender drains them and tracks in-flight requests. Broker A is the old leader for partition 1, Broker B the new leader]
New producer did not implement a request timeout
⇒ awaiting response
⇒ unaware of leader change until next metadata refresh
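The eventual fix (KIP-19) added a client-side request timeout; a hedged sketch of bounding the wait once that config existed (the broker address is a placeholder):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder address
// Fail an in-flight request after 30s instead of waiting indefinitely on a
// broker that is no longer the leader; the retry then picks up fresh metadata.
props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, "30000");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.ByteArraySerializer");
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.ByteArraySerializer");
KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props);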
93. The incident
Occurred a few days after upgrading to pick up quotas and SSL
[Timeline spanning April 5, June 3, August 18, and October 13: multi-port support (KAFKA-1809, KAFKA-1928), SSL (KAFKA-1690), and various quota patches landed between internal releases x25 and x38]
98. The incident
● Other observations
○ High CPU load on those brokers
○ Throughput degrades to ~ half the normal throughput
○ Tons of broken pipe exceptions in server logs
○ Application owners report socket timeouts in their logs
100. Remediation
● Controller moves did not help
● Firewall the affected brokers
● The above helped, but cluster fell over again after dropping the rules
● Suspect misbehaving clients on broker failure
○ … but x25 never exhibited this issue
sudo iptables -A INPUT -p tcp --dport <broker-port> -s <other-broker> -j ACCEPT
sudo iptables -A INPUT -p tcp --dport <broker-port> -j DROP
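(Rule order matters in the commands above: iptables -A appends and evaluation is first-match, so the per-broker ACCEPT entries must go in before the blanket DROP. Clients get cut off while inter-broker replication keeps flowing.)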
101. Remediation
Friday night ⇒ roll-back to x25 and debug later
… but SREs had to babysit the rollback
[Diagram: rolling downgrade across four brokers running x38]
102. Remediation
Friday night ⇒ roll-back to x25 and debug later
… but SREs had to babysit the rollback
[Diagram: step 1 of the rolling downgrade: move leaders off the target broker]
103. Remediation
Friday night ⇒ roll-back to x25 and debug later
… but SREs had to babysit the rollback
[Diagram: step 2: firewall the target broker]
104. Remediation
Friday night ⇒ roll-back to x25 and debug later
… but SREs had to babysit the rollback
[Diagram: step 3: downgrade the firewalled broker to x25; three brokers remain on x38]
105. Remediation
Friday night ⇒ roll-back to x25 and debug later
… but SREs had to babysit the rollback
[Diagram: step 4: move leaders back, then repeat for the next broker]
106. Attempts at reproducing the issue
● Test cluster
○ Tried killing controller
○ Multiple rolling bounces
○ Could not reproduce
● Upgraded the queuing cluster to x38 again
○ Could not reproduce
● So nothing…
108. Life-cycle of a Kafka request
[Diagram: the network layer's Acceptor hands new client connections to Processors; Processors read requests into the Request queue (queue-time); API handler threads process each request (local-time); long-poll requests wait in Purgatory (remote-time); the Quota manager holds the response if a quota is violated (quota-time); the response waits in the owning Processor's Response queue (response-queue-time) and is finally written back to the client (response-send-time)]
Total time = queue-time + local-time + remote-time + quota-time + response-queue-time + response-send-time
109. Investigating high request times
● First look for high local time
○ then high response send time
■ then high remote (purgatory) time → generally non-issue (but caveats described later)
● High request queue/response queue times are effects, not causes
111. How are fetch requests handled?
● Get physical offsets to be read from local log during response
● If fetch from follower (i.e., replica fetch):
○ If follower was out of ISR and just caught-up then expand ISR (ZooKeeper write)
○ Maybe satisfy eligible delayed produce requests (with acks -1)
● Else (i.e., consumer fetch):
○ Record/update byte-rate of this client
○ Throttle the request on quota violation
112. Could these cause high local times?
● Get physical offsets to be read from local log during response: should be fast
● If fetch from follower (i.e., replica fetch):
○ If follower was out of ISR and just caught up, then expand ISR (ZooKeeper write)
○ Maybe satisfy eligible delayed produce requests (with acks -1): not using acks -1
● Else (i.e., consumer fetch):
○ Record/update byte-rate of this client: should be fast… test this
○ Throttle the request on quota violation: delayed outside API thread
117. meanwhile in our queuing cluster…
due to climbing client-id counts
118. Rolling bounce of cluster forced the issue to recur on brokers that had high client-id metric counts
○ Used jmxterm to check per-client-id metric counts before experiment
○ Hooked up profiler to verify during incident
■ Generally avoid profiling/heapdumps in production due to interference
○ Did not see in earlier rolling bounce due to only a few client-id metrics at the time
119. How to fix high local times
● Optimize the request's handling, e.g.:
○ cached topic metadata as opposed to ZooKeeper reads (see KAFKA-901)
○ and KAFKA-1356
● Make it asynchronous
○ E.g., we will do this for StopReplica in KAFKA-1911
● Put it in a purgatory (usually if response depends on some condition); but be
aware of the caveats:
○ Higher memory pressure if request purgatory size grows
○ Expired requests are handled in purgatory expiration thread (which is good)
○ but satisfied requests are handled in the API thread of the satisfying request ⇒ if a request satisfies several delayed requests then local time can increase for the satisfying request
120. Monitor these closely!
● Request queue size
● Response queue sizes
● Request latencies:
○ Total time
○ Local time
○ Response send time
○ Remote time
● Request handler pool idle ratio
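A hedged sketch of pulling a few of these over JMX (broker host and JMX port are placeholders; the MBeans are the standard kafka.network request metrics, whose histogram attributes include 99thPercentile):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RequestTimeProbe {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi"); // placeholders
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            for (String request : new String[]{"Produce", "FetchConsumer", "FetchFollower"}) {
                for (String metric : new String[]{"TotalTimeMs", "RequestQueueTimeMs",
                        "LocalTimeMs", "RemoteTimeMs", "ResponseQueueTimeMs", "ResponseSendTimeMs"}) {
                    // e.g., kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce
                    ObjectName bean = new ObjectName(
                            "kafka.network:type=RequestMetrics,name=" + metric + ",request=" + request);
                    System.out.printf("%s %s p99=%s%n",
                            request, metric, mbsc.getAttribute(bean, "99thPercentile"));
                }
            }
        }
    }
}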
122. The first incident: new clients old clusters
[Diagram: a test cluster and a certification cluster, both on the old broker version, send metric events to a metrics cluster (old version)]
123. The first incident: new clients old clusters
[Same diagram: the test cluster is now on the new version; its metric events to the old-version metrics cluster now fail with:]
org.apache.kafka.common.protocol.types.SchemaException: Error reading field 'throttle_time_ms': java.nio.BufferUnderflowException
at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:73)
at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:397)
...
124. New clients old clusters: remediation
[Same diagram: test and certification clusters on the new version; metrics cluster still on the old version]
Set acks to zero
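In producer-config terms the workaround looks like this sketch (the standard acks setting; not the deck's code): with acks=0 the client never reads a produce response, so the old broker's response, which lacks the throttle_time_ms field, is never parsed.

// Temporary, while the destination cluster still runs the old version:
props.put(ProducerConfig.ACKS_CONFIG, "0"); // fire-and-forget: no response to mis-parse
// ... and once the metrics cluster is upgraded (next slide), restore:
props.put(ProducerConfig.ACKS_CONFIG, "1");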
125. New clients old clusters: remediation
[Same diagram: the metrics cluster upgraded to the new version as well]
Reset acks to 1
126. New clients old clusters: remediation
(BTW this just hit us again with the protocol changes in KIP-31/KIP-32)
KIP-35 would help a ton!
127. The second incident: new endpoints
{ "version":1,
"jmx_port":9999,
"timestamp":2233345666,
"host":"localhost",
“port”:9092
}
x14older broker versions
ZooKeeper
registration
{ "version”:2,
"jmx_port":9999,
"timestamp":2233345666,
"host":"localhost",
“port”:9092,
"endpoints": [
{"plaintext://localhost:
9092"}
]
}
x14
client
old
client
ignore endpoints v2 ⇒ use endpoints
128. The second incident: new endpoints
{ "version":1,
"jmx_port":9999,
"timestamp":2233345666,
"host":"localhost",
“port”:9092
}
x14older broker versions
ZooKeeper
registration
{ "version”:2,
"jmx_port":9999,
"timestamp":2233345666,
"host":"localhost",
“port”:9092,
"endpoints": [
{"plaintext://localhost:
9092"}
]
}
x36
{ "version”:2,
"jmx_port":9999,
"timestamp":2233345666,
"host":"localhost",
“port”:9092,
"endpoints": [
{"plaintext://localhost:
9092"}, {“ssl:
//localhost:9093”} ]
}
x14
client
old
client
java.lang.IllegalArgumentException: No enum constant
org.apache.kafka.common.protocol.SecurityProtocol.SSL
at java.lang.Enum.valueOf(Enum.java:238)
at org.apache.kafka.common.protocol.
SecurityProtocol.valueOf(SecurityProtocol.java:24)
133. Widespread FS corruption after power outage
● Mount settings at the time
○ type ext4 (rw,noatime,data=writeback,commit=120)
● Restarts were successful but brokers subsequently hit corruption
● Subsequent restarts also hit corruption in index files
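A plausible reading (ours, not stated on the slide): with data=writeback, ext4 can commit metadata before the data blocks it references, so after a power loss recently written files, index files especially, can come back containing stale or garbage bytes even though the journal replays cleanly. data=ordered avoids that at some throughput cost; an illustrative /etc/fstab entry (device and mount point are placeholders):

/dev/sda1  /export/kafka  ext4  rw,noatime,data=ordered,commit=120  0 0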
138. We are hiring! LinkedIn Data Infrastructure meetup
Software developers and Site Reliability Engineers at all levels
Streams infrastructure @ LinkedIn
● Kafka pub-sub ecosystem
● Stream processing platform built on Apache Samza
● Next-gen change capture technology (incubating)
Contact: Kartik Paramasivam
Where: LinkedIn campus, 2061 Stierlin Ct., Mountain View, CA
When: May 11 at 6:30 PM
Register: http://bit.ly/1Sv8ach