WePay’s Use Cases

Agenda
What is WePay?
Why we built a realtime data warehousing solution
Debezium Architecture
  Debezium Introduction
  Detailed Architecture
  Uses for the pipeline
Kafka/BQ Connector
  Kafka Connect Introduction
  BigQuery Introduction
  Kafka-BigQuery Connector
  View Deduplication
Future Work
What is WePay?
Payments for Platforms
W E P A Y ’ S U S E C A S E
A platform is any online service that has merchants (that receive money) and clients (that give money). WePay manages the legal, compliance, and fraud risks associated with payment processing.
WePay’s Traditional ETL System
Why we built a realtime data warehousing solution.
Existing System
  Job scheduler-based.
  We do both complete and incremental dumps of relevant tables.
  Incremental dumps are based on modify/create time of the rows.

Problems
  Infrequent loads and high latency.
  Huge number of jobs involved (several for each table).
  Tables or rows with missing or inaccurate create or modify times caused issues.
  Incapable of dealing with hard deletes.
  MySQL replication latency can cause data quality issues.
  Periodic loads can cause MySQL timeouts.
  MySQL schema changes don’t propagate automatically.
Example Event
D E B E Z I U M A R C H I T E C T U R E
{
  "before": {
    "id": 1004,
    "first_name": "Anne",
    "last_name": "Kretchmar",
    "email": "annek@noanswer.org"
  },
  "after": {
    "id": 1004,
    "first_name": "Anne Marie",
    "last_name": "Kretchmar",
    "email": "annek@noanswer.org"
  },
  ...
  "source": {
    "name": "mysql-server-1",
    "server_id": 223344,
    "ts_sec": 1465581,
    "gtid": null,
    "file": "mysql-bin.000003",
    "pos": 484,
    "row": 0,
    "snapshot": false,
    "db": "inventory",
    "table": "customers"
  },
  "op": "u",
  "ts_ms": 1465581029523
}
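A consumer can rebuild table state from events like this by dispatching on the op field ("c" = create, "u" = update, "d" = delete, "r" = snapshot read). A minimal sketch in Python, assuming events are already deserialized into dicts shaped like the sample above:

```python
def apply_event(table, event):
    """Apply one Debezium change event to an in-memory table keyed by id."""
    op = event["op"]  # "c" = create, "u" = update, "r" = snapshot read, "d" = delete
    if op in ("c", "u", "r"):
        row = event["after"]          # the new row image
        table[row["id"]] = row        # upsert
    elif op == "d":
        table.pop(event["before"]["id"], None)  # hard delete by old key
    return table

# Applying the sample update event flips first_name to "Anne Marie":
table = {1004: {"id": 1004, "first_name": "Anne"}}
apply_event(table, {"op": "u",
                    "before": {"id": 1004, "first_name": "Anne"},
                    "after": {"id": 1004, "first_name": "Anne Marie"}})
```

Note that, unlike the traditional ETL system, hard deletes are handled naturally here: the "d" event carries the old key in "before".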
Some Finer Details

Snapshots
When you first start up Debezium, it takes a snapshot of the entire database. Snapshots only occur when the Debezium connector first starts up. We re-bootstrap our connectors often, so clients need to be able to handle the periodic message spikes that snapshots produce.

Database Definitions
Data definition language (DDL) statements, i.e. changes to the structure of tables, are stored in a separate Kafka topic. There is also the Kafka Schema Registry: our schema registry automatically updates for backwards-compatible changes.
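For reference, registering a Debezium MySQL connector might look like the sketch below; hostnames, credentials, and topic names are placeholders. The database.history.kafka.topic property names the separate Kafka topic where the DDL history described above is recorded, and snapshot.mode controls the initial-snapshot behavior:

```json
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql-server-1",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "<secret>",
    "database.server.id": "223344",
    "database.server.name": "mysql-server-1",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "dbhistory.inventory",
    "snapshot.mode": "initial"
  }
}
```

Property names follow the Debezium MySQL connector documentation of this era; check the docs for your version before copying.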
Uses for the Debezium Pipeline

Our Current Uses
  Data warehousing
  Aid in migration from monolithic database

Other Potential Uses
  Change auditing
  Data quality checks
  Downstream service integration
Features of Kafka-Connect-BigQuery
K A F K A C O N N E C T B I G Q U E R Y

Configurable Retry Logic
BigQuery has issues with transient errors. The connector retries on “retriable” errors instead of failing immediately.

Schema Updates
Automatically updates the BigQuery schema as necessary.

Optimistic Writes
Quicker writes; particularly useful during snapshots.

Partitioned Tables
Writes to daily partitions.
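A sketch of a sink configuration exercising these features; property names follow the connector’s early (1.x-era) README, and the project and dataset values are placeholders, so check the KCBQ repo for the options your version supports:

```properties
name=kcbq-sink
connector.class=com.wepay.kafka.connect.bigquery.BigQuerySinkConnector
topics=mysql-server-1.inventory.customers
# BigQuery destination (placeholders)
project=my-gcp-project
datasets=.*=debezium
# Schema updates: evolve the BigQuery table as the Kafka schema evolves
autoUpdateSchemas=true
# Retry logic for transient BigQuery errors
bigQueryRetry=3
bigQueryRetryWait=10000
```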
BigQuery View Compression and Deduplication
{
  "before": {
    "id": 1004,
    "first_name": "Anne",
    "last_name": "Kretchmar",
    "email": "annek@noanswer.org"
  },
  "after": {
    "id": 1004,
    "first_name": "Anne Marie",
    "last_name": "Kretchmar",
    "email": "annek@noanswer.org"
  },
  "source": {
    "name": "mysql-server-1",
    "server_id": 223344,
    "ts_sec": 1465581,
    "gtid": null,
    "file": "mysql-bin.000003",
    "pos": 484,
    "row": 0,
    "snapshot": false,
    "db": "inventory",
    "table": "customers"
  },
  "op": "u",
  "ts_ms": 1465581029523
}

The view collapses the change event above into just the latest row:

{
  "id": 1004,
  "first_name": "Anne Marie",
  "last_name": "Kretchmar",
  "email": "annek@noanswer.org"
}
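The deduplication the view performs (keep the newest row image per primary key, drop hard-deleted rows) can be sketched in Python; this is a minimal model of the logic, assuming events shaped like the sample above, not the actual BigQuery view definition:

```python
def latest_rows(events):
    """Collapse a change stream into the latest row per primary key.

    For each id, keep the "after" image of the newest event (by ts_ms);
    a delete ("op" == "d") removes the row entirely.
    """
    latest = {}  # id -> (ts_ms, row image or None)
    for e in events:
        key = (e["after"] or e["before"])["id"]
        if key not in latest or e["ts_ms"] > latest[key][0]:
            latest[key] = (e["ts_ms"], None if e["op"] == "d" else e["after"])
    return {k: row for k, (_, row) in latest.items() if row is not None}
```

In BigQuery the same idea is expressed as a view over the raw change-event table, so consumers query current state without reprocessing the stream.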
BigQuery View Compression and Deduplication
{
  "before": {
    "id": 1004,
    "first_name": "Anne",
    "last_name": "Kretchmar",
    "email": "annek@noanswer.org"
  },
  "after": {
    "id": 1004,
    "first_name": "Anne Marie",
    "last_name": "Kretchmar",
    "email": "annek@noanswer.org"
  },
  "source": {
    "name": "mysql-server-1",
    "server_id": 223344,
    "ts_sec": 1465581,
    "gtid": null,
    "file": "mysql-bin.000003",
    "pos": 484,
    "row": 0,
    "snapshot": false,
    "db": "inventory",
    "table": "customers"
  },
  "op": "u",
  "ts_ms": 1465581029523
}

{
  "id": 1004,
  "first_name": "Anne Marie",
  "last_name": "Kretchmar"
}
Open Issues
F U T U R E W O R K

Monolith Integration
The monolithic database still uses our traditional ETL system; it is unreasonable to re-bootstrap every time a new table needs to be added. (DBZ-175)

KCBQ Metrics
It is currently not possible to integrate metrics into the Kafka Connect framework. (KAFKA-2376)

Compatibility Checking
Confluent Schema Registry runs compatibility checks for Kafka messages, but an incompatible change is still possible at the MySQL layer. (DBZ-70)

Cluster Scaling
As we add more microservices, the two DBZ MySQL replicas and the single cluster may not be enough; eventually we will split into disparate clusters.
Additional Info
O T H E R

Blog Post
https://wecode.wepay.com/posts/streaming-databases-in-realtime-with-mysql-debezium-kafka

KCBQ GitHub
https://github.com/wepay/kafka-connect-bigquery