WePay’s Use Cases

Agenda
What is WePay?
Why we built a realtime data warehousing solution
Debezium Architecture
  Debezium Introduction
  Detailed Architecture
  Uses for the pipeline
Kafka/BQ Connector
  Kafka Connect Introduction
  BigQuery Introduction
  Kafka-BigQuery Connector
  View Deduplication
Future Work
What is WePay?
Payments for Platforms
W E P A Y ’ S U S E C A S E
A platform is any online service that has merchants (that receive money) and clients (that give money). WePay manages the legal, compliance, and fraud risks associated with payment processing.
WePay’s Traditional ETL System
Why we built a realtime data warehousing solution.
Existing System
  Job scheduler-based.
  We do both complete and incremental dumps of relevant tables.
  Incremental dumps are based on modify/create time of the rows.

Problems
  Infrequent loads and high latency.
  Huge number of jobs involved (several for each table).
  Tables or rows with missing or inaccurate create or modify times caused issues.
  Incapable of dealing with hard deletes.
  MySQL replication latency can cause data quality issues.
  Periodic loads can cause MySQL timeouts.
  MySQL schema changes don’t propagate automatically.
Example Event
D E B E Z I U M A R C H I T E C T U R E
{
  "before": {
    "id": 1004,
    "first_name": "Anne",
    "last_name": "Kretchmar",
    "email": "annek@noanswer.org"
  },
  "after": {
    "id": 1004,
    "first_name": "Anne Marie",
    "last_name": "Kretchmar",
    "email": "annek@noanswer.org"
  },
  ...
  "source": {
    "name": "mysql-server-1",
    "server_id": 223344,
    "ts_sec": 1465581,
    "gtid": null,
    "file": "mysql-bin.000003",
    "pos": 484,
    "row": 0,
    "snapshot": false,
    "db": "inventory",
    "table": "customers"
  },
  "op": "u",
  "ts_ms": 1465581029523
}
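A consumer can rebuild table state from events like this by dispatching on the op field ("c" = create, "u" = update, "d" = delete, "r" = snapshot read). A minimal sketch in Python, assuming events are already deserialized into dicts shaped like the sample above:

```python
def apply_event(table, event):
    """Apply one Debezium change event to an in-memory table keyed by id."""
    op = event["op"]  # "c" = create, "u" = update, "r" = snapshot read, "d" = delete
    if op in ("c", "u", "r"):
        row = event["after"]          # the new row image
        table[row["id"]] = row        # upsert
    elif op == "d":
        table.pop(event["before"]["id"], None)  # hard delete by old key
    return table

# Applying the sample update event flips first_name to "Anne Marie":
table = {1004: {"id": 1004, "first_name": "Anne"}}
apply_event(table, {"op": "u",
                    "before": {"id": 1004, "first_name": "Anne"},
                    "after": {"id": 1004, "first_name": "Anne Marie"}})
```

Note that, unlike the traditional ETL system, hard deletes are handled naturally here: the "d" event carries the old key in "before".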
Some Finer Details

Snapshots
When you first start up Debezium, it takes a snapshot of the entire database. Snapshots only occur when the Debezium connector first starts up. We re-bootstrap our connectors often, so clients need to be able to handle the periodic message spikes that snapshots produce.

Database Definitions
Data definition language (DDL) statements, i.e. changes to the structure of tables, are stored in a separate Kafka topic. There is also the Kafka Schema Registry: our schema registry automatically updates for backwards-compatible changes.
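For reference, registering a Debezium MySQL connector might look like the sketch below; hostnames, credentials, and topic names are placeholders. The database.history.kafka.topic property names the separate Kafka topic where the DDL history described above is recorded, and snapshot.mode controls the initial-snapshot behavior:

```json
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql-server-1",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "<secret>",
    "database.server.id": "223344",
    "database.server.name": "mysql-server-1",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "dbhistory.inventory",
    "snapshot.mode": "initial"
  }
}
```

Property names follow the Debezium MySQL connector documentation of this era; check the docs for your version before copying.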
Uses for the Debezium Pipeline

Our Current Uses
  Data warehousing
  Aid in migration from monolithic database

Other Potential Uses
  Change auditing
  Data quality checks
  Downstream service integration
Features of Kafka-Connect-BigQuery
K A F K A C O N N E C T B I G Q U E R Y

Configurable Retry Logic
BigQuery has issues with transient errors. The connector retries on “retriable” errors instead of failing immediately.

Schema Updates
Automatically updates the BigQuery schema as necessary.

Optimistic Writes
Quicker writes; particularly useful during snapshots.

Partitioned Tables
Writes to daily partitions.
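A sketch of a sink configuration exercising these features; property names follow the connector’s early (1.x-era) README, and the project and dataset values are placeholders, so check the KCBQ repo for the options your version supports:

```properties
name=kcbq-sink
connector.class=com.wepay.kafka.connect.bigquery.BigQuerySinkConnector
topics=mysql-server-1.inventory.customers
# BigQuery destination (placeholders)
project=my-gcp-project
datasets=.*=debezium
# Schema updates: evolve the BigQuery table as the Kafka schema evolves
autoUpdateSchemas=true
# Retry logic for transient BigQuery errors
bigQueryRetry=3
bigQueryRetryWait=10000
```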
BigQuery View Compression and Deduplication
{
  "before": {
    "id": 1004,
    "first_name": "Anne",
    "last_name": "Kretchmar",
    "email": "annek@noanswer.org"
  },
  "after": {
    "id": 1004,
    "first_name": "Anne Marie",
    "last_name": "Kretchmar",
    "email": "annek@noanswer.org"
  },
  "source": {
    "name": "mysql-server-1",
    "server_id": 223344,
    "ts_sec": 1465581,
    "gtid": null,
    "file": "mysql-bin.000003",
    "pos": 484,
    "row": 0,
    "snapshot": false,
    "db": "inventory",
    "table": "customers"
  },
  "op": "u",
  "ts_ms": 1465581029523
}

The view collapses the change event above into just the latest row:

{
  "id": 1004,
  "first_name": "Anne Marie",
  "last_name": "Kretchmar",
  "email": "annek@noanswer.org"
}
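The deduplication the view performs (keep the newest row image per primary key, drop hard-deleted rows) can be sketched in Python; this is a minimal model of the logic, assuming events shaped like the sample above, not the actual BigQuery view definition:

```python
def latest_rows(events):
    """Collapse a change stream into the latest row per primary key.

    For each id, keep the "after" image of the newest event (by ts_ms);
    a delete ("op" == "d") removes the row entirely.
    """
    latest = {}  # id -> (ts_ms, row image or None)
    for e in events:
        key = (e["after"] or e["before"])["id"]
        if key not in latest or e["ts_ms"] > latest[key][0]:
            latest[key] = (e["ts_ms"], None if e["op"] == "d" else e["after"])
    return {k: row for k, (_, row) in latest.items() if row is not None}
```

In BigQuery the same idea is expressed as a view over the raw change-event table, so consumers query current state without reprocessing the stream.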
BigQuery View Compression and Deduplication
{
  "before": {
    "id": 1004,
    "first_name": "Anne",
    "last_name": "Kretchmar",
    "email": "annek@noanswer.org"
  },
  "after": {
    "id": 1004,
    "first_name": "Anne Marie",
    "last_name": "Kretchmar",
    "email": "annek@noanswer.org"
  },
  "source": {
    "name": "mysql-server-1",
    "server_id": 223344,
    "ts_sec": 1465581,
    "gtid": null,
    "file": "mysql-bin.000003",
    "pos": 484,
    "row": 0,
    "snapshot": false,
    "db": "inventory",
    "table": "customers"
  },
  "op": "u",
  "ts_ms": 1465581029523
}

{
  "id": 1004,
  "first_name": "Anne Marie",
  "last_name": "Kretchmar"
}
Open Issues
F U T U R E W O R K

Monolith Integration
The monolithic database still uses our traditional ETL system; it is unreasonable to re-bootstrap every time a new table needs to be added. (DBZ-175)

KCBQ Metrics
It is currently not possible to integrate metrics into the Kafka Connect framework. (KAFKA-2376)

Compatibility Checking
Confluent Schema Registry runs compatibility checks for Kafka messages, but an incompatible change is still possible at the MySQL layer. (DBZ-70)

Cluster Scaling
As we add more microservices, the two DBZ MySQL replicas and the single cluster may not be enough; eventually we will split into disparate clusters.
Additional Info
O T H E R

Blog Post
https://wecode.wepay.com/posts/streaming-databases-in-realtime-with-mysql-debezium-kafka

KCBQ GitHub
https://github.com/wepay/kafka-connect-bigquery