Kafka 102: Streams and Tables All the Way Down | Kafka Summit San Francisco 2019

Talk URL: https://kafka-summit.org/sessions/kafka-102-streams-tables-way/
Video recording: https://www.confluent.io/kafka-summit-san-francisco-2019/kafka-102-streams-and-tables-all-the-way-down

Abstract: Streams and Tables are the foundation of event streaming with Kafka, and they power nearly every conceivable use case, from payment processing to change data capture, from streaming ETL to real-time alerting for connected cars, and even the lowly WordCount example. Tables are something most of us are familiar with from the world of databases, whereas Streams are a rather new concept. Trying to leverage Kafka without understanding tables and streams is like building a rocket ship without understanding the laws of physics: a mission bound to fail. In this session for developers, operators, and architects alike, we take a deep dive into these two fundamental primitives of Kafka’s data model. We discuss how streams and tables, including global tables, relate to each other and to topics, partitioning, compaction, and serialization (Kafka’s storage layer), and how they interplay to process data, react to data changes, and manage state in an elastic, scalable, fault-tolerant manner (Kafka’s compute layer). Developers will better understand how to use streams and tables to build event-driven applications with Kafka Streams and KSQL, and we answer questions such as “How can I query my tables?” and “What is data co-partitioning, and how does it affect my join?”. Operators will better understand how these applications run in production, with questions such as “How do I scale my application?” and “When my application crashes, how will it recover its state?”. At a higher level, we explore how Kafka uses streams and tables to turn the Database inside-out and put it back together.

Slide 1: Kafka 102: Streams and Tables All the Way Down
Kafka Summit San Francisco, Sep 2019
Michael G. Noll, Technologist, Office of the CTO, Confluent (@miguno)

Slide 2: Streams and Tables: A First Look

Slides 3-4: [diagram] Streams and Tables within Event Streaming

Slide 5: An Event Streaming Platform gives you three key functionalities: Publish & Subscribe to Events, Store Events, and Process & Analyze Events.

Slide 6: An Event records the fact that something happened: a good was sold, an invoice was issued, a payment was made, a new customer registered.

Slide 7: A Stream records history as a sequence of Events.
Slide 8: The Event Streaming Paradigm: highly scalable, durable, persistent, maintains order, fast (low latency). Example: at The New York Times, Kafka is the source of truth and stores every article since 1851; normalized assets (images, articles, bylines, etc.) are denormalized into a “Content View”. Streams record history, even hundreds of years. See https://www.confluent.io/blog/publishing-apache-kafka-new-york-times/

Slide 9: Streams record history (“the ledger of sales”); Tables represent state (“the sales totals”).

Slide 10: A chess analogy: 1. e4 e5  2. Nf3 Nc6  3. Bc4 Bc5  4. d3 Nf6  5. Nbd2. Streams record history (“the sequence of moves”); Tables represent state (“the state of the board”).

Slide 11: Streams = INSERT only: immutable, append-only. Tables = INSERT, UPDATE, DELETE: mutable; the row key (event.key) identifies which row.

Slide 12: The key to mutability is ... the event.key!

                                            Stream    Table
  Has unique key constraint?                No        Yes
  First event with key 'alice' arrives      INSERT    INSERT
  Another event with key 'alice' arrives    INSERT    UPDATE
  Event with key 'alice', value == null     INSERT    DELETE
  Event with key == null arrives            INSERT    <ignored>

RDBMS analogy: a Stream is ~ a Table that has no unique key and is append-only.
Slide 13: Creating a table from a stream or topic. [diagram]
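The deck shows this step only as a diagram. For illustration, a minimal Kafka Streams sketch of the same idea (topic names, serdes, and the store name are assumptions, not from the slides):

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Materialized;

    class TableFromTopicExample {
        static Topology buildTopology() {
            StreamsBuilder builder = new StreamsBuilder();

            // Read a topic directly as a table: each event is an upsert on its key.
            KTable<String, String> locations =
                builder.table("user-locations",
                              Consumed.with(Serdes.String(), Serdes.String()));

            // Equivalently, read a topic as a stream first and reduce it to the
            // latest value per key, which yields the same kind of table.
            // (A second topic is used here because one topology may not register
            // the same topic as two sources.)
            KTable<String, String> latest =
                builder.stream("user-locations-copy",
                               Consumed.with(Serdes.String(), Serdes.String()))
                       .groupByKey()
                       .reduce((oldValue, newValue) -> newValue,
                               Materialized.as("latest-location-store"));

            return builder.build();
        }
    }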
Slide 14: [diagram] A Stream of facts (alice→Berlin, bob→Lima, alice→Rome, alice→Paris, bob→Sydney) and the Table of dimensions built from it: turn the stream 90° and you get a table holding the latest value per key.

Slide 15: The Stream-Table Duality: an aggregation (like SUM or COUNT) turns a stream into a table, and a table’s changes turn it back into a stream. Streams record history; Tables represent state. (See “Streams and Tables: Two Sides of the Same Coin”, M. Sax et al., BIRTE ’18.)

Slides 16-17: Aggregating a stream (COUNT example). [diagrams]
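These two slides are diagrams; a minimal Kafka Streams sketch of such a COUNT aggregation (the topic name "page-views" is an assumption for illustration):

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.KTable;

    class CountExample {
        static Topology buildTopology() {
            StreamsBuilder builder = new StreamsBuilder();

            // Count events per key: the result is a table that is updated
            // continuously as new events arrive on the stream.
            KTable<String, Long> countsPerUser =
                builder.stream("page-views",
                               Consumed.with(Serdes.String(), Serdes.String()))
                       .groupByKey()
                       .count();

            return builder.build();
        }
    }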
Slide 18: Kafka Topics: the storage foundation of Streams and Tables.

Slide 19: Data storage of a Kafka topic is partitioned (P1, P2, P3, P4), which impacts data processing as we will see later.

Slide 20: Producers determine the target partition of an event through a partitioning function ƒ(event.key). [diagram: two producer clients; an event is sent and appended to partition 1]
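For keyed events, the Java producer's default partitioning function hashes the serialized key with murmur2 and maps the hash onto the topic's partition count; a sketch of that logic:

    import org.apache.kafka.common.utils.Utils;

    class DefaultPartitioningSketch {
        // Sketch of ƒ(event.key) as used by the default partitioner for
        // events that have a key (keyless events are spread differently,
        // e.g. round-robin across partitions).
        static int partitionFor(byte[] keyBytes, int numPartitions) {
            // Mask to a non-negative int, then take it modulo the
            // number of partitions.
            return (Utils.murmur2(keyBytes) & 0x7fffffff) % numPartitions;
        }
    }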
Slide 21: Events with the same key should be in the same partition, to ensure proper ordering of related events and to ensure that processing by Consumers returns the expected results. [diagram: yellow events should always be stored in partition 3]

Slide 22: Top causes for the same key landing in different partitions:
1. You increased/decreased the number of partitions.
2. A producer uses a custom partitioner.
→ Be careful in this situation! (A sketch of cause #2 follows below.)
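For reference, cause #2 means a producer ships its own implementation of the producer's Partitioner interface; a bare-bones sketch (the routing rule here is invented for illustration):

    import java.util.Map;
    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.common.Cluster;

    // If not every producer writing to the topic shares exactly this logic,
    // the same key can land in different partitions.
    public class FirstBytePartitioner implements Partitioner {
        @Override
        public int partition(String topic, Object key, byte[] keyBytes,
                             Object value, byte[] valueBytes, Cluster cluster) {
            int numPartitions = cluster.partitionsForTopic(topic).size();
            // Hypothetical rule: route by the first byte of the serialized key.
            return keyBytes == null ? 0 : (keyBytes[0] & 0xff) % numPartitions;
        }

        @Override public void close() {}
        @Override public void configure(Map<String, ?> configs) {}
    }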
Slide 23: Partitions play a central role in Kafka. Topics are partitioned; partitions enable scalability, elasticity, and fault-tolerance. Data is read from and written to, stored in, replicated, log-compacted, processed, and joined based on partitions, across the storage layer (Brokers) and the processing layer (KStreams, KSQL, etc.).

Slide 24: From Topics to Streams and Tables.

Slide 25: Topics live in the Kafka storage layer, the ‘filesystem’ (Brokers). Streams and Tables live in the Kafka processing layer (KStreams, KSQL).

Slide 26: Topics vs. Streams and Tables. In the storage layer (Brokers), a Topic holds raw bytes (00100 11101 11000 ...). In the processing layer (KSQL, KStreams), a Stream is the topic plus a schema (serdes), e.g. alice→Paris, bob→Sydney, alice→Rome; a Table additionally applies aggregation, e.g. alice→Rome, bob→Sydney.

Slide 27: Kafka Processing: data is processed per-partition. Each Stream Task reads one partition via the network and maintains its own processing state; the tasks are spread across the application instances.

Slide 28: Streams and Tables are partitioned, too, and so is their processing! A KTable/TABLE over partitions P1-P4 is split across Stream Tasks 1-4 (e.g. 2 GB, 3 GB, 5 GB, 2 GB of state per task).

Slide 29: Global Tables give the complete data to every task. Great for joins without re-keying the input, or to broadcast information. With a GlobalKTable, each task holds all partitions: 2 + 3 + 5 + 2 = 12 GB per task.

Slide 30: Tables are cached in ‘state stores’ on local disk, by default under /var, with RocksDB as the default storage engine. Tables and other state don’t need to fit into RAM; this enables large-scale state and saves $$$ on cloud instances. But: the ‘source of truth’ of a table is its underlying topic!
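In the Kafka Streams DSL, such a state store is named and materialized when the table is built; a small sketch (topic and store names are assumptions):

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Materialized;

    class MaterializedTableExample {
        static Topology buildTopology() {
            StreamsBuilder builder = new StreamsBuilder();

            // Materialize the table into a named local state store
            // (RocksDB by default); the underlying Kafka topic remains
            // the source of truth.
            KTable<String, String> totals =
                builder.table("order-totals",
                              Consumed.with(Serdes.String(), Serdes.String()),
                              Materialized.as("order-totals-store"));

            return builder.build();
        }
    }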
Slide 31: Tables and other state are always fault-tolerant, because they are backed by Kafka topics (cf. event sourcing). The table’s changelog topic (log-compacted) acts as a streaming backup via the network; when Stream Task 1 moves from machine A to machine B, its state is rebuilt there via a streaming restore from Kafka.

Slide 32: Standby Replicas speed up application recovery. App instances can optionally maintain copies of another instance’s local state stores to minimize failover times. With the default num.standby.replicas = 0, a failover must first restore Stream Task 2’s state from Kafka; with num.standby.replicas = 1, another instance already keeps a copy of that state, so failover is fast.
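Turning this on in a Kafka Streams application is a single configuration entry; a minimal sketch (the application id and broker address are placeholders):

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    class StandbyConfigExample {
        static Properties buildConfig() {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");         // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
            // Keep one warm copy of each task's state on another instance.
            props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
            return props;
        }
    }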
Slide 33: Elastic scaling: stream tasks are migrated, including their state (via Kafka). Example: Stream Tasks 2 and 4 move from App Instances 1 and 2 to newly added App Instances 3 and 4.

Slide 34: Tables <-> topic log compaction. A table’s underlying topic is compacted by default, to save Kafka storage space and to speed up failover and recovery for processing. Example: the log alice→Bern, alice→Paris, bob→Sydney, alice→Rome, bob→Zurich compacts down to the latest value per key: alice→Rome, bob→Zurich. Note: compaction intentionally removes part of a table’s history. If you need the full history and don’t have the historic data elsewhere, consider disabling compaction.

Slide 35: TL;DR for Log Compaction. Have a Stream? → Disable log compaction for its topic (= the default). Have a Table? → Enable log compaction for its topic (= the default); disable it only when needed, see the previous slide.
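If you manage table-backing topics yourself, the compaction policy is a per-topic config; a sketch using the Java AdminClient (broker address, topic name, and sizing are placeholders):

    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.common.config.TopicConfig;

    public class CreateCompactedTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                // A table-backing topic: compacted, so only the latest value
                // per key survives once compaction has run.
                NewTopic tableTopic = new NewTopic("user-locations-table", 4, (short) 3)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                    TopicConfig.CLEANUP_POLICY_COMPACT));
                admin.createTopics(Collections.singleton(tableTopic)).all().get();
            }
        }
    }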
Slide 36: Topics vs. Streams and Tables.

                    Schema          Partitioned  Unbounded  Ordering  Mutable  Unique key  Fault-tolerant
  Storage layer
    Topic           No (raw bytes)  Yes          Yes        Yes       No       No          Yes
  Processing layer
    Stream          Yes             Yes          Yes        Yes       No       No          Yes
    Table           Yes             Yes          No*        No        Yes      Yes         Yes
    Global Table    Yes             No           No*        No        Yes      Yes         Yes

  *Generally speaking the answer is Yes but, in practice, tables are almost always bounded due to a finite key space.

Slide 37: Working with Streams and Tables.

Slide 38: Max processing parallelism = number of input partitions. With a 4-partition topic, Application Instances 1-4 each get a partition while Instances 5 and 6 sit idle. → Need higher parallelism? Increase the original topic’s partition count. → Higher parallelism for just one use case? Derive a new topic from the original with a higher partition count, and lower its retention to save storage.

Slide 39: How to increase the number of partitions when needed. KSQL example: the statement below creates a new stream with the desired number of partitions.

  CREATE STREAM products_repartitioned
    WITH (PARTITIONS=30) AS
    SELECT * FROM products;
Slide 40: ‘Hot’ partitions can be problematic, often caused by:
1. Events not evenly distributed across partitions.
2. Events evenly distributed, but certain events taking longer to process.
Strategies to address hot partitions include:
1a. Ingress: find a better partitioning function ƒ(event.key) for the producers.
1b. Storage: re-partition the data into a new topic if you can’t change the original.
2. Scale processing vertically, e.g. more powerful CPU instances.

Slide 41: Joining Streams and Tables: the data must be ‘co-partitioned’. [diagram: Stream + Table → Join Output (Stream)]

Slide 42: Joining Streams and Tables: the data must be ‘co-partitioned’. Example: (alice, Paris) from the stream’s P2 has a matching entry for alice (female) in the table’s P2.

Slide 43 (Scenario 2): Data is looked up in the same partition number. Here the key ‘alice’ exists in multiple table partitions, but the entry in P2 (female) is used, because the stream-side event comes from the stream’s partition P2.

Slide 44 (Scenario 3): Data is looked up in the same partition number. Here the key ‘alice’ exists only in the table’s P1 != P2, so there is no match: the join yields null.

Slide 45: Data co-partitioning requirements in detail:
1. Same keying scheme for both input sides.
2. Same number of partitions.
3. Same partitioning function ƒ(event.key).
A Kafka Streams sketch of such a join follows below. Further reading on joining streams and tables:
https://www.confluent.io/kafka-summit-sf18/zen-and-the-art-of-streaming-joins
https://docs.confluent.io/current/ksql/docs/developer-guide/partition-data.html
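A minimal Kafka Streams sketch of such a co-partitioned stream-table join (topic names and value types are assumptions; both topics must satisfy the three requirements above):

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;

    class StreamTableJoinExample {
        static Topology buildTopology() {
            StreamsBuilder builder = new StreamsBuilder();

            // Both topics must be keyed the same way and have the same
            // number of partitions (co-partitioning).
            KStream<String, String> locations =
                builder.stream("user-locations",
                               Consumed.with(Serdes.String(), Serdes.String()));
            KTable<String, String> genders =
                builder.table("user-genders",
                              Consumed.with(Serdes.String(), Serdes.String()));

            // For each stream event, the table value for the same key is
            // looked up in the same partition number, as on slides 42-44.
            KStream<String, String> joined =
                locations.leftJoin(genders,
                    (location, gender) -> location + "/" + gender);

            joined.to("locations-with-gender");
            return builder.build();
        }
    }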
Slide 46: Why is that so? Because of how input data is mapped to stream tasks: Stream Task 2 reads the stream topic’s P2 and the table topic’s P2 via the network into its processing state, so lookups only ever see the same partition number on both sides.

Slide 47: How to re-partition your data when needed. KSQL example: the statement below creates a new stream with a changed number of partitions and a new field as the event.key, so that its data is then correctly co-partitioned for joining.

  CREATE STREAM products_repartitioned
    WITH (PARTITIONS=42) AS
    SELECT * FROM products
    PARTITION BY product_id;

Slide 48: Joining Streams and Global Tables: no need to worry about co-partitioning! That’s because each stream task has the full data from all of the table’s partitions: 2 + 3 + 5 + 2 = 12 GB.
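The Kafka Streams counterpart is a join against a GlobalKTable, where a key mapper can also re-key on the fly; a sketch (topic names and the lookup rule are invented for illustration):

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.GlobalKTable;
    import org.apache.kafka.streams.kstream.KStream;

    class GlobalTableJoinExample {
        static Topology buildTopology() {
            StreamsBuilder builder = new StreamsBuilder();

            KStream<String, String> orders =
                builder.stream("orders",
                               Consumed.with(Serdes.String(), Serdes.String()));

            // Every task holds ALL partitions of this table locally, so the
            // stream side needs no re-keying or co-partitioning.
            GlobalKTable<String, String> products =
                builder.globalTable("products",
                                    Consumed.with(Serdes.String(), Serdes.String()));

            KStream<String, String> enriched =
                orders.join(products,
                    // Derive the lookup key from the stream event (here we
                    // pretend the order value IS the product id).
                    (orderId, orderValue) -> orderValue,
                    (orderValue, productInfo) -> orderValue + "/" + productInfo);

            enriched.to("orders-enriched");
            return builder.build();
        }
    }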
Slide 49: Capacity Planning and Sizing: sorry, not covered here! I recommend “KSQL Performance Tuning for Fun and Profit” by Nick Dearden, October 1, 2019, 2:50-3:30 pm, Stream Processing track. https://kafka-summit.org/sessions/ksql-performance-tuning-fun-profit/

Slide 50: Streams and Tables in KSQL: process event streams to create new, continuously updated streams or tables. Push query example:

  CREATE TABLE OrderTotals AS
    SELECT * FROM ... EMIT CHANGES

Slide 51: Streams and Tables in KSQL: query tables similar to a relational database. Pull query example (new feature):

  SELECT * FROM OrderTotals WHERE region = ‘Europe’

Slide 52: Query tables from other apps with push or pull queries. Other applications (Java, Go, Python, etc.) can directly query tables via the network (KSQL REST API), e.g. SELECT * FROM OrderTotals WHERE region = ‘Europe’.
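As a rough illustration of that last point, a Java client could issue the pull query over HTTP against the KSQL server's /query REST endpoint (the server address, the endpoint details, and the table are assumptions based on the KSQL REST API of that era):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class PullQueryExample {
        public static void main(String[] args) throws Exception {
            // Assumes a KSQL server on localhost:8088 and the OrderTotals
            // table from slide 51.
            String body = "{\"ksql\":\"SELECT * FROM OrderTotals WHERE region = 'Europe';\","
                        + "\"streamsProperties\":{}}";
            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8088/query"))
                .header("Content-Type", "application/vnd.ksql.v1+json; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        }
    }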
Slide 53: Streaming import/export of Tables: KSQL integrates with (and controls) Kafka Connect. New feature:

  CREATE SOURCE CONNECTOR my-postgres-jdbc WITH (
    connector.class = "io.confluent.connect.jdbc.JdbcSourceConnector",
    connection.url = "jdbc:postgresql://dbserver:5432/my-db",
    ...);

Slide 54: KSQL example use case: creating an event-driven dashboard from a customer database. Kafka Connect streams change events from the customers table into a Stream; aggregations over it are computed in real time into a Table; the continuously updating results are streamed out to an Elasticsearch table (index) that feeds the dashboard.

Slide 55: THANK YOU
@miguno / michael@confluent.io
cnfl.io/meetups  cnfl.io/blog  cnfl.io/slack