PostgreSQL + Kafka: The Delight of Change Data Capture

Jeff Klukas - Data Engineer at Simple
1

2
Overview
Commit logs: what are they?
Write-ahead logging (WAL)
Commit logs as a data store
Demo: change data capture
Use cases

3
Commit Logs
https://www.confluent.io/blog/hands-free-kafka-replication-a-lesson-in-operational-simplicity/

5
Commit Logs
Ordered, Immutable, Durable
In practice, old logs can be deleted or archived
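The ordered and immutable properties, plus the caveat about deleting old logs, can be sketched as a toy append-only structure. This is only an in-memory illustration: `CommitLog`, `truncate_before`, and the sample records are hypothetical names, not part of Kafka or PostgreSQL, and durability (flushing to disk) is not modeled here.

```python
from dataclasses import dataclass, field


@dataclass
class CommitLog:
    """Toy append-only log (hypothetical; not a Kafka or PostgreSQL API)."""
    base_offset: int = 0                          # offset of the oldest retained record
    records: list = field(default_factory=list)

    def append(self, record) -> int:
        """Add a record at the end and return its offset (ordered)."""
        self.records.append(record)
        return self.base_offset + len(self.records) - 1

    def read(self, offset: int):
        """Return the record at an offset; stored records are never rewritten (immutable)."""
        if offset < self.base_offset:
            raise KeyError(f"offset {offset} was deleted or archived")
        return self.records[offset - self.base_offset]

    def truncate_before(self, offset: int) -> None:
        """Drop old records; offsets of the remaining records do not change."""
        drop = max(0, min(offset - self.base_offset, len(self.records)))
        self.records = self.records[drop:]
        self.base_offset += drop


log = CommitLog()
offsets = [log.append(r) for r in ("insert", "update", "delete")]
log.truncate_before(1)   # the oldest record is removed
latest = log.read(2)     # later offsets are still valid
```

Readers that track their own offset can resume where they left off, as long as that offset has not been truncated away.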

7
“WAL's central concept is that changes to data files (where tables and indexes reside) must be written only after those changes have been logged, that is, after log records describing the changes have been flushed to permanent storage.”
– https://www.postgresql.org/docs/current/static/wal-intro.html
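The ordering rule in this quote (log first, data second) can be illustrated with a toy key-value store. This is a rough sketch under simplifying assumptions, not how PostgreSQL implements WAL: `TinyStore`, the JSON-lines log format, and the temporary file path are all made up for illustration.

```python
import json
import os
import tempfile


class TinyStore:
    """Toy key-value store that logs each change before applying it (hypothetical)."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.data = {}          # stands in for the data files
        self._replay()

    def _replay(self):
        """Rebuild state from the log, as crash recovery would."""
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path) as wal:
            for line in wal:
                change = json.loads(line)
                self.data[change["key"]] = change["value"]

    def put(self, key, value):
        # 1. Append the log record and force it to permanent storage.
        with open(self.wal_path, "a") as wal:
            wal.write(json.dumps({"key": key, "value": value}) + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        # 2. Only now apply the change to the data itself.
        self.data[key] = value


wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
store = TinyStore(wal_path)
store.put("56789", 20.0)
store.put("56789", 25.0)

# A fresh instance starts with no in-memory state and recovers it from the log.
recovered = TinyStore(wal_path)
```

Because every change reaches the log before the data is touched, replaying the log is enough to reconstruct the latest state after a crash.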

8
“Logical decoding is the process of extracting all persistent changes to a database's tables into a coherent, easy to understand format which can be interpreted without detailed knowledge of the database's internal state.”
– https://www.postgresql.org/docs/9.4/static/logicaldecoding-explanation.html

13
https://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/

14
INSERT INTO transactions
VALUES (56789, 20.00);
Bottled Water - Message Key
{ "transaction_id": { "int": 56789 } }
Bottled Water - Message Value
{
"transaction_id": {"int": 56789},
"amount": {"double": 20.00}
}

15
UPDATE transactions
SET amount = 25.00
WHERE transaction_id = 56789;
{
"transaction_id": {"int": 56789},
"amount": {"double": 25.00}
}

16
DELETE FROM transactions
WHERE transaction_id = 56789;
null
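The INSERT, UPDATE, and DELETE slides show one keyed message per row change, with a null value standing for a delete. A consumer could fold such messages back into the current table state roughly as follows. This sketch assumes the messages have already been decoded into Python dicts shaped like the JSON above (Avro decoding is omitted), and `apply_change` is a hypothetical helper, not part of Bottled Water.

```python
def apply_change(table, key, value):
    """Apply one change message to an in-memory copy of the table."""
    row_id = key["transaction_id"]["int"]
    if value is None:
        # A null value marks a delete for this key.
        table.pop(row_id, None)
    else:
        # Inserts and updates look the same: the value carries the full new row.
        # Each field is wrapped as {"type": value}, so unwrap the single entry.
        table[row_id] = {
            column: next(iter(wrapped.values()))
            for column, wrapped in value.items()
        }
    return table


key = {"transaction_id": {"int": 56789}}
messages = [
    (key, {"transaction_id": {"int": 56789}, "amount": {"double": 20.00}}),  # INSERT
    (key, {"transaction_id": {"int": 56789}, "amount": {"double": 25.00}}),  # UPDATE
]

table = {}
for message_key, message_value in messages:
    apply_change(table, message_key, message_value)
after_update = dict(table)

apply_change(table, key, None)  # DELETE arrives as a null value
```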

17
Use Cases
tx-service
tx-postgres

18
tx-service
tx-postgres
tx-pgkafka
Kafka topic: tx-pgkafka

19
tx-service
tx-postgres
tx-pgkafka
demux-service

20
tx-service
tx-postgres
tx-pgkafka
demux-service
Kafka topic: customers-table
Kafka topic: transactions-table
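The demux step in this diagram, one combined topic fanned out into per-table topics, might look roughly like the routing function below. The slides do not say how a message identifies its source table, so the `table` field is an assumption, as are the `demux` function and the `customer_id` column; only the output topic names come from the slide.

```python
from collections import defaultdict


def demux(messages):
    """Route messages from the combined topic to one output topic per table."""
    topics = defaultdict(list)
    for message in messages:
        # Assumption: the source table name is available on each message.
        topics[f"{message['table']}-table"].append(message["value"])
    return dict(topics)


# Hypothetical contents of the combined tx-pgkafka topic.
combined = [
    {"table": "transactions", "value": {"transaction_id": 56789, "amount": 20.00}},
    {"table": "customers", "value": {"customer_id": 1}},
    {"table": "transactions", "value": {"transaction_id": 56789, "amount": 25.00}},
]

routed = demux(combined)
```

Appending in arrival order keeps the relative order of changes within each table, which matters when a downstream consumer replays them.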

21
tx-service
tx-postgres
tx-pgkafka
demux-service
activity-service
activity-postgres
activity-pgkafka
Kafka topic: activity-pgkafka

22
tx-service
tx-postgres
tx-pgkafka
demux-service
activity-service
activity-postgres
activity-pgkafka
Amazon Redshift
(Data Warehouse)
Amazon S3
(Data Lake)
analytics-service

23
Change Data Capture

24
Messaging

25
Analytics

26
Recap
Commit logs: what are they?
Write-ahead logging (WAL)
Commit logs as a data store
Demo: change data capture
Use cases

27
Also See…
• Blog post on Simple’s CDC pipeline
• https://www.simple.com/engineering
• Bottled Water: https://github.com/confluentinc/bottledwater-pg
• Debezium (CDC to Kafka from Postgres, MySQL, or MongoDB)
• http://debezium.io/
• https://wecode.wepay.com/posts/streaming-databases-in-realtime-with-mysql-debezium-kafka
• https://www.confluent.io/kafka-summit-sf17/
• Martin Kleppmann, Making Sense of Stream Processing eBook

30
The Dual Write Problem
https://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/

31
Redshift Architecture
Amazon Redshift

33
Table Schema
CREATE TABLE pgkafka_txservice_transactions (
pg_lsn NUMERIC(20,0) ENCODE raw,
pg_txn_id BIGINT ENCODE lzo,
pg_operation CHAR(6) ENCODE bytedict,
pg_txn_timestamp TIMESTAMP ENCODE lzo,
ingestion_timestamp TIMESTAMP ENCODE lzo,
transaction_id INT ENCODE lzo,
amount NUMERIC(18,2) ENCODE lzo
)
DISTKEY (transaction_id)
SORTKEY (transaction_id, pg_lsn, pg_operation);

34
Deduplication
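CREATE TABLE deduped (LIKE pgkafka_txservice_transactions);
-- Keep only the most recently ingested copy of each WAL record (pg_lsn).
-- Columns are listed explicitly so the helper column n is not inserted.
INSERT INTO deduped
SELECT pg_lsn, pg_txn_id, pg_operation, pg_txn_timestamp,
ingestion_timestamp, transaction_id, amount
FROM (
SELECT *, ROW_NUMBER()
OVER (PARTITION BY pg_lsn ORDER BY ingestion_timestamp DESC) AS n
FROM pgkafka_txservice_transactions
) AS ranked
WHERE n = 1;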
DROP TABLE pgkafka_txservice_transactions;
ALTER TABLE deduped RENAME TO pgkafka_txservice_transactions;
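The deduplication rule, keeping one row per pg_lsn and preferring the most recent ingestion_timestamp, can also be checked outside the warehouse. The sketch below mirrors that logic on plain Python dicts; the `dedupe` function and the sample rows are illustrative only and use just a few of the table's columns.

```python
def dedupe(rows):
    """Keep one row per pg_lsn, preferring the latest ingestion_timestamp."""
    latest = {}
    for row in rows:
        kept = latest.get(row["pg_lsn"])
        if kept is None or row["ingestion_timestamp"] > kept["ingestion_timestamp"]:
            latest[row["pg_lsn"]] = row
    return sorted(latest.values(), key=lambda r: r["pg_lsn"])


# Made-up sample rows: the record at pg_lsn 100 was delivered twice.
rows = [
    {"pg_lsn": 100, "ingestion_timestamp": "2017-01-01 00:00:01", "amount": 20.00},
    {"pg_lsn": 100, "ingestion_timestamp": "2017-01-01 00:00:05", "amount": 20.00},
    {"pg_lsn": 200, "ingestion_timestamp": "2017-01-01 00:00:02", "amount": 25.00},
]

deduped = dedupe(rows)
```

Duplicates of this kind are what at-least-once delivery produces, so the same WAL position can legitimately show up more than once before this step runs.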

35
View of Current State
CREATE VIEW current_txservice_transactions AS
SELECT transaction_id, amount
FROM (
SELECT *, ROW_NUMBER()
OVER (PARTITION BY transaction_id
ORDER BY pg_lsn, pg_operation) AS n,
COUNT(*)
OVER (PARTITION BY transaction_id ROWS BETWEEN
UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS c
FROM pgkafka_txservice_transactions)
WHERE n = c
AND pg_operation <> 'delete';
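The view keeps, for each transaction_id, only the last change in (pg_lsn, pg_operation) order, and hides the row if that last change is a delete. The same rule is sketched below on plain Python dicts; `current_state`, the second transaction id 11111, and the sample values are hypothetical.

```python
def current_state(rows):
    """Return {transaction_id: amount} for rows whose latest change is not a delete."""
    last_change = {}
    # Later changes overwrite earlier ones for the same transaction_id.
    for row in sorted(rows, key=lambda r: (r["pg_lsn"], r["pg_operation"])):
        last_change[row["transaction_id"]] = row
    return {
        transaction_id: row["amount"]
        for transaction_id, row in last_change.items()
        if row["pg_operation"] != "delete"
    }


# Made-up change history for two transactions.
history = [
    {"pg_lsn": 100, "pg_operation": "insert", "transaction_id": 56789, "amount": 20.00},
    {"pg_lsn": 150, "pg_operation": "insert", "transaction_id": 11111, "amount": 5.00},
    {"pg_lsn": 200, "pg_operation": "update", "transaction_id": 56789, "amount": 25.00},
    {"pg_lsn": 300, "pg_operation": "delete", "transaction_id": 11111, "amount": None},
]

state = current_state(history)
```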
