This document provides an overview of mypipe, a tool for buffering and consuming MySQL changes via Kafka. Mypipe reads MySQL binary logs and publishes the events to Kafka using Avro serialization. It includes features like transaction support, ALTER table handling, and the ability to bootstrap MySQL tables to Kafka. The architecture involves MySQL binary log consumers that parse events and enrich them with metadata, and Kafka producers that serialize the events. The tool is designed to be modular and configurable. Future work includes improving snapshot support and integrating more with Kafka.
mypipe: Buffering and consuming MySQL changes via Kafka
1. mypipe: Buffering and consuming MySQL changes via Kafka
with
-=[ Scala - Avro - Akka ]=-
Hisham Mardam-Bey
Github: mardambey
Twitter: @codewarrior
2. Overview
● Who is this guy? + Quick Mate1 Intro
● Quick Tech Intro
● Motivation and History
● Features
● Design and Architecture
● Practical applications and usages
● System diagram
● Future work
● Q&A
3. Who is this guy?
● Linux and OpenBSD user and developer
since 1996
● Started out with C followed by Ruby
● Working with the JVM since 2007
● “Lately” building and running distributed
systems, and doing Scala
Github: mardambey
Twitter: @codewarrior
4. Mate1: quick intro
● Online dating, since 2003, based in Montreal
● Initially a team of 3, around 30 now
● Engineering team has 12 geeks / geekettes
○ Always looking for talent!
● We own and run our own hardware
○ fun!
○ mostly…
https://github.com/mate1
5. Super Quick Tech Intro
● MySQL: relational database
● Avro: data serialization system
● Kafka: publish-subscribe messaging
rethought as a distributed commit log
● Akka: toolkit and runtime simplifying the
construction of concurrent and distributed
applications
● Actors: universal primitives of concurrent
computation using message passing
● Schema repo / registry: holds versioned
Avro schemas
6. Motivation
● Initially, wanted:
○ MySQL triggers outside the DB
○ MySQL fan-in or fan-out replication (data cubes)
○ MySQL to “Hadoop”
● And then:
○ Cache or data store consistency with DB
○ Direct integration with big-data systems
○ Data schema evolution support
○ Turning MySQL inside out
■ Bootstrapping downstream data systems
7. History
● 2010: Custom Perl scripts to parse binlogs
● 2011/2012: Guzzler
○ Written in Scala, uses mysqlbinlog command
○ Simple to start with, difficult to maintain and control
● 2014: Enter mypipe!
○ Initial prototyping begins
8. Feature Overview (1/2)
● Emulates MySQL slave via binary log
○ Writes MySQL events to Kafka
● Uses Avro to serialize and deserialize data
○ Generically via a common schema for all tables
○ Specifically via per-table schema
● Modular by design
○ State saving / loading (files, MySQL, ZK, etc.)
○ Error handling
○ Event filtering
○ Connection sources
9. Feature Overview (2/2)
● Transaction and ALTER TABLE support
○ Includes transaction information within events
○ Refreshes schema as needed
● Can publish to any downstream system
○ Currently, we have Kafka
○ Initially, we started with Cassandra for the prototype
● Can bootstrap a MySQL table into Kafka
○ Transforms entire table into Kafka events
○ Useful with Kafka log compaction
● Configurable
○ Kafka topic names
○ Whitelist / blacklist support
● Console consumer, Dockerized dev env
10. Project Structure
● mypipe-api: API for MySQL binlogs
● mypipe-avro: binary protocol, mutation
serialization and deserialization
● mypipe-producers: push data downstream
● mypipe-kafka: Serializer & Decoder
implementations
● mypipe-runner: pipes and console tools
● mypipe-snapshotter: import MySQL tables
(beta)
11. MySQL Binary Logging
● Foundation of MySQL replication
● Statement or Row based
● Represents a journal / change log of data
● Allows applications to spy / tune in on
MySQL changes
12. MySQLBinaryLogConsumer
● Uses behavior from an abstract class
● Modular design: in this case, uses config-based implementations
● Uses HOCON for ease and availability
case class MySQLBinaryLogConsumer(config: Config)
extends AbstractMySQLBinaryLogConsumer
with ConfigBasedConnectionSource
with ConfigBasedErrorHandlingBehaviour
with ConfigBasedEventSkippingBehaviour
with CacheableTableMapBehaviour
13. AbstractMySQLBinaryLogConsumer
● Maintains connection to MySQL
● Primarily handles
○ TABLE_MAP
○ QUERY (BEGIN, COMMIT, ROLLBACK, ALTER)
○ XID
○ Mutations (INSERT, UPDATE, DELETE)
● Provides an enriched binary log API
○ Looks up table metadata and includes it
○ Scala-friendly case class and Option-driven(*) API for speaking MySQL binlogs (sketched below)
(*) constant work in progress (=
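A purely illustrative sketch, not mypipe's actual API: the kind of enriched, case-class-driven handler interface this slide describes, where callbacks receive resolved table and transaction context instead of raw binlog bytes (all names here are hypothetical):
import java.util.UUID
trait BinaryLogEventHandler {
  def onTableMap(tableId: Long, db: String, table: String): Unit // TABLE_MAP
  def onBegin(txid: UUID): Unit                                  // QUERY: BEGIN
  def onCommit(txid: UUID): Unit                                 // XID / COMMIT
  def onRollback(txid: UUID): Unit                               // QUERY: ROLLBACK
  def onAlter(db: String, table: String, sql: String): Unit      // QUERY: ALTER
  def onMutation(mutation: Any): Unit                            // INSERT / UPDATE / DELETE mutations (next slide)
}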
14. TABLE_MAP and table metadata
● Provides table metadata
○ Precedes mutation events
○ But no column names!
● MySQLMetadataManager
○ One actor per database
○ Uses “information_schema”
○ Determines column metadata and primary key (see the sketch below)
● TableCache
○ Wraps metadata actor providing a cache
○ Refreshes tables “when needed”
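A minimal sketch of how column names and the primary key can be looked up in information_schema over JDBC; this is illustrative, not MySQLMetadataManager's actual code, and the simplified ColumnInfo type is hypothetical:
import java.sql.DriverManager
import scala.collection.mutable.ListBuffer
case class ColumnInfo(name: String, dataType: String, isPrimaryKey: Boolean)
def columnsFor(jdbcUrl: String, user: String, pass: String, db: String, table: String): List[ColumnInfo] = {
  val conn = DriverManager.getConnection(jdbcUrl, user, pass)
  try {
    val stmt = conn.prepareStatement(
      """SELECT column_name, data_type, column_key
        |FROM information_schema.columns
        |WHERE table_schema = ? AND table_name = ?
        |ORDER BY ordinal_position""".stripMargin)
    stmt.setString(1, db)
    stmt.setString(2, table)
    val rs = stmt.executeQuery()
    val cols = ListBuffer.empty[ColumnInfo]
    while (rs.next()) {
      // column_key is "PRI" for columns that are part of the primary key
      cols += ColumnInfo(rs.getString("column_name"), rs.getString("data_type"), rs.getString("column_key") == "PRI")
    }
    cols.toList
  } finally conn.close()
}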
15. Mutations
case class ColumnMetadata(name: String, colType: ColumnType.EnumVal, isPrimaryKey: Boolean)
case class PrimaryKey(columns: List[ColumnMetadata])
case class Column(metadata: ColumnMetadata, value: java.io.Serializable)
case class Table(id: Long, name: String, db: String, columns: List[ColumnMetadata], primaryKey: Option[PrimaryKey])
case class Row(table: Table, columns: Map[String, Column])
case class InsertMutation(timestamp: Long, table: Table, rows: List[Row], txid: UUID)
case class UpdateMutation(timestamp: Long, table: Table, rows: List[(Row, Row)], txid: UUID)
case class DeleteMutation(timestamp: Long, table: Table, rows: List[Row], txid: UUID)
● Fully enriched with table metadata
● Contain column types, data and txid
● Mutations can be serialized to and deserialized from Avro (usage sketch below)
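A small usage sketch against the case classes above; mypipe presumably exposes a common mutation supertype, but since it is not shown here the match is on Any:
def describe(event: Any): String = event match {
  case InsertMutation(_, table, rows, txid) =>
    s"INSERT into ${table.db}.${table.name}: ${rows.size} row(s), tx $txid"
  case UpdateMutation(_, table, rows, txid) =>
    s"UPDATE on ${table.db}.${table.name}: ${rows.size} row pair(s), tx $txid"
  case DeleteMutation(_, table, rows, txid) =>
    s"DELETE from ${table.db}.${table.name}: ${rows.size} row(s), tx $txid"
}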
16. Kafka Producers
● Two modes of operation:
○ Generic Avro beans
○ Specific Avro beans
● Producers decoupled from SerDE
○ Recently started supporting Kafka serializers and
decoders
○ Currently we only support: http://schemarepo.org/
○ Very soon we'll be able to integrate with systems such as Confluent Platform's schema registry
17. Kafka Message Format
-------------------
| MAGIC | 1 byte  |
|-----------------|
| MTYPE | 1 byte  |
|-----------------|
| SCMID | N bytes |
|-----------------|
| DATA  | N bytes |
-------------------
● MAGIC: magic byte, for protocol version
● MTYPE: mutation type, a single byte
○ indicating insert (0x1), update (0x2), or delete (0x3)
● SCMID: Avro schema ID, N bytes
● DATA: the actual mutation data as N bytes (an encoder sketch follows)
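A minimal encoder sketch for the layout above (illustrative, not mypipe's actual serializer); the schema ID width depends on the schema repository in use, so it is passed as raw bytes here:
import java.nio.ByteBuffer
def encode(magic: Byte, mutationType: Byte, schemaId: Array[Byte], avroData: Array[Byte]): Array[Byte] = {
  val buf = ByteBuffer.allocate(2 + schemaId.length + avroData.length)
  buf.put(magic)        // MAGIC: protocol version
  buf.put(mutationType) // MTYPE: 0x1 insert, 0x2 update, 0x3 delete
  buf.put(schemaId)     // SCMID: Avro schema ID from the schema repo
  buf.put(avroData)     // DATA: the Avro-serialized mutation
  buf.array()
}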
18. Generic Message Format
● 3 Avro beans
○ InsertMutation, DeleteMutation, UpdateMutation
○ Hold data for new and old columns (for updates)
○ Groups data by type into Avro maps
{
"name": "old_integers",
"type": {"type": "map", "values": "int"}
},
{
"name": "new_integers",
"type": {"type": "map", "values": "int"}
},
{
"name": "old_strings",
"type": {"type": "map", "values": "string"}
},
{
"name": "new_strings",
"type": {"type": "map", "values": "string"}
} ...
19. Specific Message Format
● Requires 3 Avro beans per table
○ Insert, Update, Delete
○ Specific fields can be used in the schema
{
"name": "UserInsert",
"fields": [
{
"name": "id",
"type": ["null", "int"]
},
{
"name": "username",
"type": ["null", "string"]
},
{
"name": "login_date",
"type": ["null", "long"]
},...
]
},
20. ALTER TABLE support
● ALTER TABLE queries are intercepted
○ Producers can handle this event specifically
● Kafka serializer and deserializer
○ They inspect Avro beans and refresh the schema if needed (sketch below)
● Avro evolution rules must be respected
○ Or mypipe can’t properly encode / decode data
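A hedged sketch of the refresh idea, not mypipe's code: the decoder keys its Avro reader schema off the schema ID embedded in each message, so a schema registered after an ALTER is fetched from the repository the first time it is seen (fetchSchema stands in for a schema-repo lookup and is hypothetical):
import org.apache.avro.Schema
import scala.collection.mutable
val schemaCache = mutable.Map.empty[Short, Schema]
// Unknown schema IDs trigger a repository fetch; known ones come from the cache.
def readerSchemaFor(schemaId: Short, fetchSchema: Short => Schema): Schema =
  schemaCache.getOrElseUpdate(schemaId, fetchSchema(schemaId))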
21. Pipes
● Join consumers to producers
● Use configurable, time-based checkpointing and flushing (sketched below)
○ File based, MySQL based, ZK based, Kafka based
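A rough sketch of time-based checkpointing using Akka's scheduler (illustrative only; all names are hypothetical): on each tick the pipe flushes the producer, then persists the consumer's binlog position so a restart resumes from the last checkpoint:
import akka.actor.ActorSystem
import scala.concurrent.duration._
case class BinlogPosition(file: String, offset: Long)
class Pipe(interval: FiniteDuration,
           currentPosition: () => BinlogPosition,  // reported by the binlog consumer
           flushProducer: () => Unit,              // push buffered events to Kafka
           saveCheckpoint: BinlogPosition => Unit) // file / MySQL / ZK / Kafka backed
          (implicit system: ActorSystem) {
  import system.dispatcher
  system.scheduler.schedule(interval, interval) {
    flushProducer()                    // flush first...
    saveCheckpoint(currentPosition())  // ...then record how far we got
  }
}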
25. Practical Applications
● Cache coherence
● Change logging and auditing
● MySQL to:
○ HDFS
○ Cassandra
○ Spark
● Once the Confluent Schema Registry is integrated
○ Kafka Connect
○ KStreams
● Other reactive applications
○ Real-time notifications
26. System Diagram
[Diagram: MySQL binary logs are read by MySQL BinaryLog Consumers (alongside a Select Consumer reading from MySQL directly); Pipe 1, Pipe 2, … Pipe N connect the consumers to Kafka Producers, which publish per-table topics (db1_tbl1, db1_tbl2, db2_tbl1, db2_tbl2) to Kafka using the Schema Registry; Event Consumers such as Hadoop, Cassandra, and Dashboards consume from Kafka for users.]
27. Future Work
● Finish MySQL -> Kafka snapshot support
● Move to Kafka 0.10
● MySQL global transaction identifier (GTID)
support
● Publish to Maven
● More tests; we have a good amount, but you can't have enough!