This document provides an overview of streams and tables in Apache Kafka. It begins by defining events, streams, and tables: streams record event history as an ordered sequence, while tables represent current state. It then shows how tables are created from streams through aggregation, and covers topics, partitions, processing with ksqlDB and Kafka Streams, and related concerns such as fault tolerance, elasticity, and capacity planning.
1.
Apache Kafka™ Fundamentals:
What Every Software Engineer Should
Know about Streams and Tables in Kafka
Kafka Meetup Zurich, Switzerland, Feb 2020
Michael G. Noll
Senior Technologist, Office of the CTO
@miguno
5.
An Event Streaming Platform
gives you three key functionalities:
Publish & Subscribe to Events
Store Events
Process & Analyze Events
6.
An Event
records the fact that something happened
A good was sold
An invoice was issued to Michael
Alice paid $200 to Bob on Feb 01, 2020 at 8:31am
A new customer registered
8.
Event Streaming Paradigm
Streams record history, even hundreds of years.
The New York Times uses Kafka as its source of truth, storing every article since 1851: normalized assets (images, articles, bylines, etc.) are denormalized into a “Content View”.
Highly scalable, durable, persistent, maintains order, fast (low latency).
https://www.confluent.io/blog/publishing-apache-kafka-new-york-times/
9.
Streams record history: “The ledger of sales.”
Tables represent state: “The sales totals.”
10.
Chess analogy:
1. e4 e5   2. Nf3 Nc6   3. Bc4 Bc5   4. d3 Nf6   5. Nbd2
Streams record history: “The sequence of moves.”
Tables represent state: “The state of the board.”
12.
The key to mutability is … the event.key!
                                              Stream    Table
Has unique key constraint?                    No        Yes
First event with key bob arrives              INSERT    INSERT
Another event with key bob arrives            INSERT    UPDATE
Event with key bob and value null arrives     INSERT    DELETE
Event with key null arrives                   INSERT    <ignored>
RDBMS analogy: A Stream is ~ a Table that has no unique key and is append-only.
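A minimal Kafka Streams sketch of this difference (the topic name "user-locations" is an assumption, not from the slides): reading the same topic as a stream keeps every event, while reading it as a table applies upsert/delete semantics per key.

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();

// As a stream: every event is appended, even repeated keys (INSERT semantics).
KStream<String, String> locationUpdates = builder.stream("user-locations");

// As a table: a later event with the same key overwrites the earlier one (UPDATE),
// and a null value deletes the key (DELETE, a "tombstone").
KTable<String, String> latestLocation = builder.table("user-locations");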
15.
Stream–Table Duality*
Streams record history, tables represent state.
An aggregation (like SUM() or COUNT()) turns a stream into a table; the table’s changes, in turn, form a stream again.
* See the academic paper “Streams and Tables: Two Sides of the Same Coin”, M. Sax et al., BIRTE ’18.
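A minimal Kafka Streams sketch of this duality (topic names and serdes are assumptions): a count turns the stream into a table, and the table’s change stream can be written back out as a stream.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

StreamsBuilder builder = new StreamsBuilder();

// Stream of individual sales events -> table of running counts per key.
KTable<String, Long> salesTotals = builder
    .stream("sales", Consumed.with(Serdes.String(), Serdes.String()))
    .groupByKey()
    .count();

// And back: every update to the totals is itself an event stream.
salesTotals.toStream().to("sales-totals-changes", Produced.with(Serdes.String(), Serdes.Long()));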
19.
A Topic
is where events are durably stored,
like a file in a distributed file system.
An unbounded sequence of serialized events, where each event is represented as an encoded key-value pair* or “Kafka message”.
20.
Storage formats
Clients serialize events to bytes when writing, and deserialize when reading.
The producer serializes and writes an event; the consumer reads and deserializes it.
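A hedged producer-side sketch (broker address and topic name are assumptions): the client is configured with serializers, so only bytes ever reach the brokers.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    // Key and value are serialized to bytes by the configured serializers before sending.
    producer.send(new ProducerRecord<>("user-locations", "alice", "Paris"));
}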
21.
Storage formats
Brokers take no part in this. All they see is bytes: bytes in, bytes out.
Being this “dumb” is actually really smart!
22.
Storage is partitioned
And ‘partitions’ are the most important concept in Kafka.
[Diagram: a topic’s storage split into partitions P1–P4]
23.
Partitions are spread across Kafka brokers
A broker can be leader and/or follower of 1+ partitions.
[Diagram: partitions P1–P4 replicated across several brokers; each partition has one leader and one or more followers]
24.
Partitions play a central role in Kafka
Topics are partitioned. Partitions enable scalability, elasticity, and fault tolerance.
Storage layer (brokers): data is stored in, replicated based on, and read from and written to partitions.
Processing layer (ksqlDB, KStreams, Kafka clients): data is ordered, joined, and processed based on partitions.
25.
Event producers determine event partitioning
through a partitioning function ƒ(event.key, event.value)*
[Diagram: an event is sent and appended to partition 1 of the topic’s partitions P1–P4]
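A simplified sketch of the idea behind ƒ(event.key, event.value); the real default partitioner hashes the serialized key (murmur2), so the same key always lands in the same partition as long as the partition count does not change. This is an illustration only, not Kafka’s actual implementation.

// Deterministic: the same serialized key always maps to the same partition.
static int partitionFor(byte[] serializedKey, int numPartitions) {
    return (java.util.Arrays.hashCode(serializedKey) & 0x7fffffff) % numPartitions;
}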
26.
Events with same key should be in same partition
to ensure proper ordering of related events
to ensure processing by consumers returns expected results
[Diagram: yellow events should always be stored in partition 3]
27.
Top causes of events with the same key ending up in different partitions
1. Topic configuration: you increased the number of partitions in a topic
2. Producer configuration: a producer uses a custom partitioner
→ Be careful in this situation!
29.
Topics
live in the Kafka storage layer, the ‘filesystem’ (Brokers).
Streams and Tables
live in the Kafka processing layer (ksqlDB, KStreams).
30.
Topics vs. Streams and Tables
Storage layer (Brokers): a Topic holds raw bytes.
Processing layer (ksqlDB, KStreams): reading the topic with a schema (serdes) yields an Event Stream, e.g. (alice, Paris), (bob, Sydney), (alice, Rome).
Adding an aggregation yields a Table, e.g. alice → 2, bob → 1.
31.
Kafka Processing
Data is processed per partition.
[Diagram: topic ‘payments’ with partitions P1–P4 is read via the network by application instances 1 and 2 of consumer group ‘fraud-detection-app’]
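A hedged consumer-side sketch (broker address is an assumption; topic and group names follow the diagram): all instances share one group.id, so Kafka assigns each instance a subset of the ‘payments’ partitions.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "fraud-detection-app");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(List.of("payments"));
    while (true) {
        // Each instance only receives events from the partitions assigned to it.
        for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
            System.out.printf("partition=%d key=%s value=%s%n",
                record.partition(), record.key(), record.value());
        }
    }
}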
32.
Kafka Processing
Stream tasks are the unit of parallelism.
[Diagram: partitions P1–P4 are read via the network by stream tasks 1–4, each with its own state, spread across application instances 1 and 2]
34.
Global Tables give complete data to every task
Great for joins without re-keying the input, or to broadcast info.
[Diagram: with a GlobalKTable, each of stream tasks 1–4 holds the full data of partitions P1–P4, i.e. 2 + 3 + 5 + 2 = 12 GB per task]
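A hedged Kafka Streams sketch (topic names and the key mapping are assumptions): joining a stream against a GlobalKTable needs no re-keying, because every task holds the complete table locally.

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;

StreamsBuilder builder = new StreamsBuilder();
GlobalKTable<String, String> products = builder.globalTable("products");
KStream<String, String> orders = builder.stream("orders");

KStream<String, String> enriched = orders.join(
    products,
    // Map each order to the table's key; here the order value is assumed to be the product id.
    (orderKey, orderValue) -> orderValue,
    // Combine the order with the matching product row.
    (orderValue, productRow) -> orderValue + " -> " + productRow);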
35.
Topics vs. Streams and Tables

Concept        | Schema          | Partitioned | Unbounded | Ordering Guarantee*** | Mutable | Single Value Per Key | Fault Tolerant
Storage layer
Topic          | No* (raw bytes) | Yes         | Yes       | Yes                   | No      | No                   | Yes
Processing layer
Stream         | Yes             | Yes         | Yes       | Yes                   | No      | No                   | Yes
Table          | Yes             | Yes         | No**      | No                    | Yes     | Yes                  | Yes
Global Table   | Yes             | No          | No**      | No                    | Yes     | Yes                  | Yes

* When using Schema Registry with Avro, a schema is technically available for the data in a topic.
** Generally speaking the answer is Yes but, in practice, tables are almost always bounded due to the finite key space.
*** Ordering per partition, not across the multiple partitions in a topic.
37.
Tables are materialized on local disk
so that processing is fast, and tables can be larger than RAM.
Enables large-scale state. Saves $$$ on cloud instances.
[Diagram: a stream task’s tables and global tables are materialized on local disk under /var (via RocksDB by default)]
But machines and containers can be lost! How can we make this fault-tolerant?
38. 38
streaming ‘restore’
via network
Stream Task 1
On machine B
Tables and other state are made fault-tolerant
by storing their data durably, remotely in Kafka topics
38
streaming ‘backup’
via network
table’s changelog topic
(log-compacted)
Kafka
Storage
ksqlDB,
Kafka Streams
On machine A
Stream Task 1
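A hedged sketch (store and topic names are assumptions): materializing an aggregation under a named state store; Kafka Streams backs such a store with a log-compacted changelog topic (typically named <application.id>-<store name>-changelog), which is what gets replayed during a restore.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

StreamsBuilder builder = new StreamsBuilder();

// Local RocksDB store "sales-totals", continuously backed up to its changelog topic.
KTable<String, Long> totals = builder
    .stream("sales", Consumed.with(Serdes.String(), Serdes.String()))
    .groupByKey()
    .count(Materialized.as("sales-totals"));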
39.
Standby Replicas speed up application recovery
App instances can optionally maintain copies of another instance’s table/state data to minimize failover times.
[Diagram: with num.standby.replicas = 0 (the default), app instances 1 and 2 each run only their own stream tasks; with num.standby.replicas = 1, app instance 1 additionally keeps a standby copy of stream task 2’s state, so failover from app instance 2 is fast]
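A hedged config sketch (application id and bootstrap servers are assumptions): standby replicas are enabled per Kafka Streams application via num.standby.replicas.

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

StreamsBuilder builder = new StreamsBuilder();
// ... define your topology here ...

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fraud-detection-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// Keep one warm copy of every task's state on another instance (default is 0).
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();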
41. 41
bob Zurich
alice Rome
bob Zurich
bob Sydney alice Romealice Bern alice Paris
Tables <-> topic log-compaction
A table’s underlying topic is compacted by default
to save Kafka storage space
to speed up failover and recovery for processing
41
Note: Compaction intentionally removes part of a table’s history. If you need the full
history and don’t have the historic data elsewhere, consider disabling compaction.
@miguno
42.
TL;DR for Log Compaction
Have a Stream? → Disable log compaction for its topic (= default)
Have a Table? → Enable log compaction for its topic (= default); disable only when needed, see previous slide
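A hedged sketch using the Kafka AdminClient (topic name, partition count, and replication factor are assumptions): creating a log-compacted topic, as you would for a topic that backs a table.

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

static void createCompactedTopic() throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    try (Admin admin = Admin.create(props)) {
        NewTopic tableTopic = new NewTopic("user-locations-table", 4, (short) 3)
            // cleanup.policy=compact keeps only the latest event per key.
            .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT));
        admin.createTopics(List.of(tableTopic)).all().get();
    }
}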
43.
Max processing parallelism = number of input partitions
[Diagram: a topic with partitions P1–P4 feeds application instances 1–4; application instances 5 and 6 sit idle]
→ Need higher parallelism? Increase the original topic’s partition count.
→ Higher parallelism for just one use case? Derive a new topic from the original with a higher partition count. Lower its retention to save storage.
44.
How to increase the number of partitions when needed
ksqlDB example: the statement below creates a new stream with the desired number of partitions.

CREATE STREAM products_repartitioned
  WITH (PARTITIONS=30) AS
  SELECT * FROM products
  EMIT CHANGES;
45.
‘Hot’ partitions can be problematic, often caused by
1. Events not evenly distributed across partitions
2. Events evenly distributed, but certain events take longer to process
Strategies to address hot partitions include (a sketch of 1a follows below)
1a. Ingress: find a better partitioning function ƒ(event.key) for producers
1b. Storage: re-partition the data into a new topic if you can’t change the original
2. Scale processing vertically, e.g. more powerful CPU instances
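A hedged sketch of strategy 1a (the class name and the hot key are hypothetical): a custom producer partitioner that spreads one known hot key over all partitions. Note that this deliberately gives up per-key ordering for that key, which is exactly the caution raised for custom partitioners earlier in the deck.

import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class HotKeyAwarePartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if ("hot-key".equals(key)) {
            // Spread the hot key across all partitions (ordering for this key is lost).
            return ThreadLocalRandom.current().nextInt(numPartitions);
        }
        // Everything else: deterministic hash of the serialized key (simplified).
        return (java.util.Arrays.hashCode(keyBytes) & 0x7fffffff) % numPartitions;
    }

    @Override public void configure(Map<String, ?> configs) {}
    @Override public void close() {}
}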
46.
How to re-partition your data when needed
ksqlDB example: the statement below creates a new stream with a changed number of partitions and a new field as the event.key (so that its data is now correctly co-partitioned for joining).

CREATE STREAM products_repartitioned
  WITH (PARTITIONS=42) AS
  SELECT * FROM products
  PARTITION BY product_id
  EMIT CHANGES;
47.
Capacity Planning and Sizing
Sorry, not covered here! I recommend:
ksqlDB Performance Tuning for Fun and Profit, by Nick Dearden
https://kafka-summit.org/sessions/ksql-performance-tuning-fun-profit/
ksqlDB documentation
https://docs.ksqldb.io/en/latest/operate-and-deploy/capacity-planning/
Kafka Streams documentation
https://docs.confluent.io/current/streams/sizing.html
48.
Where to go from here?
My blog series on Kafka Fundamentals, to learn more
https://www.confluent.io/blog/kafka-streams-tables-part-1-event-streaming
Confluent Cloud, the fastest & easiest way to get started with Kafka
https://www.confluent.io/confluent-cloud/
Confluent Platform, if you want to operate Kafka yourself
https://www.confluent.io/product/confluent-platform/