PostgreSQL is an open source relational database. Kafka is an open source log-based messaging system. Because both systems are powerful and flexible, they’re devouring whole categories of infrastructure. And they’re even better together.
In this talk, you’ll learn about commit logs and how that fundamental data structure underlies both PostgreSQL and Kafka. We’ll use that basis to understand what Kafka is, what advantages it has over traditional messaging systems, and why it’s perfect for modeling database tables as streams. From there, we’ll introduce the concept of change data capture (CDC) and run a live demo of Bottled Water, an open source CDC pipeline, watching INSERT, UPDATE, and DELETE operations in PostgreSQL stream into Kafka. We’ll wrap up with a discussion of use cases for this pipeline: messaging between systems with transactional guarantees, transmitting database changes to a data warehouse, and stream processing.
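As a rough illustration of the commit-log idea described above, here is a minimal in-memory sketch. It is not how PostgreSQL's WAL or Kafka is implemented; the `CommitLog` and `Consumer` classes and the event fields (`op`, `id`, `amount`) are hypothetical names chosen for the example. It only shows the two properties the talk relies on: records are appended in order and addressed by offset, and each reader tracks its own position, so a late reader can replay the full history.

```python
class CommitLog:
    """Append-only list of records; a record's offset is its index."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset assigned to this record

    def read(self, offset, max_records=100):
        # Reading never removes anything; records stay in the log.
        return self._records[offset:offset + max_records]


class Consumer:
    """Each consumer remembers its own offset, independent of other readers."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        batch = self.log.read(self.offset)
        self.offset += len(batch)
        return batch


log = CommitLog()
log.append({"op": "insert", "id": 1, "amount": 10})
log.append({"op": "update", "id": 1, "amount": 25})

warehouse = Consumer(log)
first_batch = warehouse.poll()    # sees the insert and the update

log.append({"op": "delete", "id": 1})
second_batch = warehouse.poll()   # sees only the new delete

late_consumer = Consumer(log)
replayed = late_consumer.poll()   # starts at offset 0 and sees all three events
```

Because reads do not consume records, adding a new downstream system later is just a matter of starting another consumer at offset 0.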
Recap
Commit logs: what are they?
Write-ahead logging (WAL)
Commit logs as a data store
Demo: change data capture
Use cases
Also See…
• Blog post on Simple’s CDC pipeline
• https://www.simple.com/engineering
• Bottled Water: https://github.com/confluentinc/bottledwater-pg
• Debezium (CDC to Kafka from Postgres, MySQL, or MongoDB)
• http://debezium.io/
• https://wecode.wepay.com/posts/streaming-databases-in-realtime-with-mysql-debezium-kafka
• https://www.confluent.io/kafka-summit-sf17/
• Martin Kleppmann, Making Sense of Stream Processing eBook
Deduplication
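CREATE TABLE deduped (LIKE pgkafka_txservice_transactions);
-- Keep only the most recently ingested copy of each WAL position (pg_lsn).
-- Note: SELECT * also returns the helper column rn; in practice, list the
-- table's columns explicitly so the INSERT matches deduped.
INSERT INTO deduped SELECT * FROM (
  SELECT *, ROW_NUMBER()
    OVER (PARTITION BY pg_lsn ORDER BY ingestion_timestamp DESC) AS rn
  FROM pgkafka_txservice_transactions
) AS ranked WHERE rn = 1;
DROP TABLE pgkafka_txservice_transactions;
ALTER TABLE deduped RENAME TO pgkafka_txservice_transactions;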
Amazon Redshift
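The deduplication rule on this slide (one row per `pg_lsn`, keeping the copy with the latest `ingestion_timestamp`) can be sketched outside the warehouse as follows. This is an illustration only: the function name `dedupe_by_lsn` and the sample rows are hypothetical, and `pg_lsn` is simplified to an integer, whereas real PostgreSQL LSNs are printed in a form like `0/16B3748`.

```python
def dedupe_by_lsn(rows):
    """Keep the most recently ingested row for each pg_lsn."""
    latest = {}
    for row in rows:
        kept = latest.get(row["pg_lsn"])
        if kept is None or row["ingestion_timestamp"] > kept["ingestion_timestamp"]:
            latest[row["pg_lsn"]] = row
    # Return rows in log order so downstream steps see changes in sequence.
    return sorted(latest.values(), key=lambda r: r["pg_lsn"])


# Hypothetical sample: the change at pg_lsn 100 was ingested twice.
rows = [
    {"pg_lsn": 100, "ingestion_timestamp": 1, "transaction_id": "a", "amount": 5},
    {"pg_lsn": 100, "ingestion_timestamp": 2, "transaction_id": "a", "amount": 5},
    {"pg_lsn": 101, "ingestion_timestamp": 2, "transaction_id": "b", "amount": 9},
]
deduped = dedupe_by_lsn(rows)
```

Keying on `pg_lsn` works because each change has a single position in the log, so two rows with the same `pg_lsn` describe the same change.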
View of Current State
CREATE VIEW current_txservice_transactions AS
SELECT transaction_id, amount
FROM (
  SELECT *,
    ROW_NUMBER()
      OVER (PARTITION BY transaction_id
            ORDER BY pg_lsn, pg_operation) AS n,
    COUNT(*)
      OVER (PARTITION BY transaction_id ROWS BETWEEN
            UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS c
  FROM pgkafka_txservice_transactions
) AS ranked
-- n = c selects the last change recorded for each transaction_id
WHERE n = c
  AND pg_operation <> 'delete';
Amazon Redshift
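The view's logic (take the last change per `transaction_id`, ordered by `pg_lsn` then `pg_operation`, and hide rows whose last change is a delete) can be sketched in plain Python. The function name `current_state` and the sample rows are hypothetical, and `pg_lsn` is again simplified to an integer for the example.

```python
def current_state(rows):
    """Return {transaction_id: amount} for rows whose latest change is not a delete."""
    last = {}
    for row in rows:
        key = row["transaction_id"]
        order = (row["pg_lsn"], row["pg_operation"])
        prev = last.get(key)
        # Same ordering as the view: pg_lsn first, then pg_operation.
        if prev is None or order > (prev["pg_lsn"], prev["pg_operation"]):
            last[key] = row
    return {
        key: row["amount"]
        for key, row in last.items()
        if row["pg_operation"] != "delete"
    }


# Hypothetical change history: "a" is inserted then updated, "b" is inserted then deleted.
changes = [
    {"transaction_id": "a", "pg_lsn": 1, "pg_operation": "insert", "amount": 5},
    {"transaction_id": "a", "pg_lsn": 2, "pg_operation": "update", "amount": 8},
    {"transaction_id": "b", "pg_lsn": 3, "pg_operation": "insert", "amount": 9},
    {"transaction_id": "b", "pg_lsn": 4, "pg_operation": "delete", "amount": None},
]
state = current_state(changes)
```

The full change history stays in the underlying table; the current state is just a derived view over it.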