LINE is a messaging service with 200+ million active users. I will introduce why we feed 100+ billion daily messages into Kafka and how various systems, such as data sync, abuse detection, and analysis, depend on and leverage it. I will also introduce how we use dynamic tracing tools like SystemTap to inspect the broker's performance on a production system, which led me to fix KAFKA-4614.
Presented by Yuto Kawamura, LINE Corporation
Systems Track
4. LINE
— Messaging service
— 169 million active users¹ in countries with top market share like Japan, Taiwan and Thailand
— Many family services
  — News
  — Music
  — LIVE (Video streaming)

¹ As of June 2017. Sum of 4 countries: Japan, Taiwan, Thailand and Indonesia.
8. Cluster Scale
— 150+ billion messages /day
— 40+ TB incoming data /day
— 3.5+ million messages /sec at peak times
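(For reference, 150 billion messages per day works out to roughly 1.7 million messages per second on average, so the 3.5M+/sec peak is about twice the daily average.)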
9. Broker Servers
— CPU: Intel(R) Xeon(R) 2.20GHz x 40
— Memory: 256GiB
  — more memory, more caching (page cache)
— Network: 10Gbps
— Disk: HDD x 12, RAID 1+0
  — saves maintenance costs
— Number of servers: 30
10. New challenges - Being part of the infrastructure
— Higher traffic
— Multi-tenancy
— Requirement for delivery latency
— As a communication path with bot systems
— Much faster threat detection
12. KAFKA-4614 - Long GC pause harming broker performance, caused by mmap objects created for OffsetIndex
— Highlighted as a great improvement in Log Compaction's February 2017 edition²
— https://issues.apache.org/jira/browse/KAFKA-4614

² https://www.confluent.io/blog/log-compaction-highlights-in-the-apache-kafka-and-stream-processing-community-february-2017/
13. One day, we found response times of Produce requests getting unstable...
14. Looking into detailed system metrics...
— Found that a small amount of disk read was occurring during response time spikes.
— Interesting, because all our consumers are supposed to be caught up with the latest offset => all fetch requests should be served from the page cache.
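One way to sanity-check that assumption is Kafka's bundled consumer-group tool; a sketch below, where the broker address and group name are placeholders (older 0.10.x releases may also need the --new-consumer flag):

$ kafka-consumer-groups.sh --bootstrap-server BROKER:9092 \
    --describe --group CONSUMER_GROUP
# A LAG of ~0 on every partition means consumers fetch from the log tail,
# which should still be resident in the page cache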
15. Who is reading disk, and for what?
Tried reading the code, kept observing logs, periodically took jstack and jvisualvm snapshots... but no luck
16. Paradigm shift: Observing a lower layer - SystemTap
— A kernel-level dynamic tracing tool and scripting language
— Safe to run in production because of its low overhead
— If we ran strace or perf on production servers... !
17. Simple example: Counting syscalls:
$ stap -x PID -e '
global cnt
probe syscall.* {
  cnt[name] += 1
}
probe end {
  foreach (k in cnt)
    printf("%s called %d times\n", k, cnt[k])
}
'
^Cfcntl called 19 times
read called 3333 times
pselect6 called 1 times
sendto called 4929 times
...
24. Who's mmap?
[kafka]$ git grep mmap ...
class OffsetIndex ... {
  ...
  private[this] var mmap: MappedByteBuffer = {
    val newlyCreated = _file.createNewFile()
    val raf = new RandomAccessFile(_file, "rw")
    ...
    val idx = raf.getChannel.map(FileChannel.MapMode.READ_WRITE, 0, len)
25. Who is calling munmap?
Thu Dec 22 17:21:27 2016,093: tid = 126123, device = sdb1, inode = -1, size = 4096
...
tid = 126123
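A probe along the following lines can produce a trace like the one above; this is a sketch rather than the exact script from the talk. It logs every block I/O request issued from the Kafka process with its thread id (ino is -1 for requests with no associated file inode, e.g. filesystem metadata reads):

$ stap -x KAFKA_PID -e '
probe ioblock.request {
  if (pid() != target()) next  # only requests issued by the target process
  printf("%s: tid = %d, device = %s, inode = %d, size = %d\n",
         ctime(gettimeofday_s()), tid(), devname, ino, size)
}
'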
26. Finding munmap caller
jstack and grep the thread id³
# hex(126123) = 0x1ecab
jstack KAFKA_PID | grep nid=0x1ecab
"Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007ff278d0c800 nid=0x1ecab in Object.wait() [0x00007ff17da11000]
.... a GC-related thread?

³ The nid=0xXXXX entry in jstack output tells the "n"ative thread id
27. Visiting Javadoc of MappedByteBuffer
https://docs.oracle.com/javase/8/docs/api/java/nio/MappedByteBuffer.html
> A mapped byte buffer and the file mapping that it represents remain valid until the buffer itself is garbage-collected.
28. How does munmap cause disk read?
The log cleaner thread deletes a log segment whose retention period has expired.
29. How does munmap cause disk read?
OffsetIndex calls File.delete() on its index file, but the file physically remains, as the living mmap still holds an open reference to it.
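This "deleted but still mapped" state can be spotted with standard tooling, for example with lsof, where the pid and path below are placeholders and irrelevant columns are elided:

$ lsof -p KAFKA_PID | grep deleted
java ... mem REG ... /data/kafka-logs/topic-0/00000000000000000000.index (deleted)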
30. How does munmap cause disk read?
The MappedByteBuffer becomes garbage, but it might not be collected by GC for a while, as it is placed in a region which still has many living objects.
31. How does munmap cause disk read?
While the MappedByteBuffer survives several GC attempts and several hours elapse, the entry which holds the meta info of the index file is evicted from the buffer cache.
32. How does munmap cause disk read?
Finally, GC collects the region which holds the disposed MappedByteBuffer and calls munmap(2) through the MappedByteBuffer's cleaner.
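To confirm which thread actually issues munmap(2), a SystemTap probe like this sketch (not the original script from the talk) logs each call together with its thread id, which can then be matched against jstack output as on slide 26:

$ stap -x KAFKA_PID -e '
probe syscall.munmap {
  if (pid() != target()) next  # only munmap calls from the target process
  printf("%s: tid = %d, length = %d\n",
         ctime(gettimeofday_s()), tid(), length)
}
'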
33. How does munmap cause disk read?
The kernel realizes that the final reference to the file has been destroyed and attempts to perform physical deletion of the index file.
34. How does munmap cause disk read?
The XFS driver attempts to look up the inode entry for the index file in its cache, but can't find it => reads it from disk.
35. Confirming...
grep 'Total time for which' kafkaServer-gc.log | # one-liner for summing up GC time
2017-01-11T01:43 = 317.8821
2017-01-11T01:44 = 302.1132
2017-01-11T01:45 = 950.5807 # << !!!
2017-01-11T01:46 = 344.9449
2017-01-11T01:47 = 328.936
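The aggregation elided from the one-liner above could look like the following sketch, which assumes the JDK 8 log format shown in the tip below (field 11 is the stopped time in seconds) and sums STW time per minute in milliseconds:

grep 'Total time for which' kafkaServer-gc.log |
  awk '{ ms[substr($1, 1, 16)] += $11 * 1000 }
       END { for (m in ms) printf "%s = %.4f\n", m, ms[m] }' |
  sort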
Tip: You can enable the very useful "STW duration" logging with the option:
-XX:+PrintGCApplicationStoppedTime
...
2017-08-03T20:15:27.413+0900: 12109287.163: Total time for which application threads were stopped: 0.0186989 seconds, Stopping threads took: 0.0000489 seconds
39. Conclusion
— LINE uses Kafka as part of its fundamental microservice infrastructure, and its usage is increasing weekly
— Introduced advanced techniques to achieve deeper observability into Kafka
— However, Kafka is amazingly stable and performant for most cases, even with the defaults
— Try out today's techniques just in case you run into complicated issues :p
— Since Kafka 0.10.2.0, the broker's response times have become much faster and more stable