mypipe: Buffering and consuming MySQL changes via Kafka
with -=[ Scala - Avro - Akka ]=-
Hisham Mardam-Bey
Github: mardambey
Twitter: @codewarrior
Overview
● Who is this guy? + Quick Mate1 Intro
● Quick Tech Intro
● Motivation and History
● Features
● Design and Architecture
● Practical applications and usages
● System diagram
● Future work
● Q&A
Who is this guy?
● Linux and OpenBSD user and developer
since 1996
● Started out with C followed by Ruby
● Working with the JVM since 2007
● “Lately” building and running distributed
systems, and doing Scala
Mate1: quick intro
● Online dating, since 2003, based in Montreal
● Initially a team of 3, around 30 now
● Engineering team has 12 geeks / geekettes
○ Always looking for talent!
● We own and run our own hardware
○ fun!
○ mostly…
https://github.com/mate1
Super Quick Tech Intro
● MySQL: relational database
● Avro: data serialization system
● Kafka: publish-subscribe messaging
rethought as a distributed commit log
● Akka: toolkit and runtime simplifying the
construction of concurrent and distributed
applications
● Actors: universal primitives of concurrent
computation using message passing
● Schema repo / registry: holds versioned
Avro schemas
Motivation
● Initially, wanted:
○ MySQL triggers outside the DB
○ MySQL fan-in or fan-out replication (data cubes)
○ MySQL to “Hadoop”
● And then:
○ Cache or data store consistency with DB
○ Direct integration with big-data systems
○ Data schema evolution support
○ Turning MySQL inside out
■ Bootstrapping downstream data systems
History
● 2010: Custom Perl scripts to parse binlogs
● 2011/2012: Guzzler
○ Written in Scala, uses mysqlbinlog command
○ Simple to start with, difficult to maintain and control
● 2014: Enter mypipe!
○ Initial prototyping begins
Feature Overview (1/2)
● Emulates MySQL slave via binary log
○ Writes MySQL events to Kafka
● Uses Avro to serialize and deserialize data
○ Generically via a common schema for all tables
○ Specifically via per-table schema
● Modular by design
○ State saving / loading (files, MySQL, ZK, etc.)
○ Error handling
○ Event filtering
○ Connection sources
Feature Overview (2/2)
● Transaction and ALTER TABLE support
○ Includes transaction information within events
○ Refreshes schema as needed
● Can publish to any downstream system
○ Currently, we have Kafka
○ Initially, we started with Cassandra for the prototype
● Can bootstrap a MySQL table into Kafka
○ Transforms entire table into Kafka events
○ Useful with Kafka log compaction
● Configurable
○ Kafka topic names
○ whitelist / blacklist support
● Console consumer, Dockerized dev env
Project Structure
● mypipe-api: API for MySQL binlogs
● mypipe-avro: binary protocol, mutation
serialization and deserialization
● mypipe-producers: push data downstream
● mypipe-kafka: Serializer & Decoder
implementations
● mypipe-runner: pipes and console tools
● mypipe-snapshotter: import MySQL tables
(beta)
MySQL Binary Logging
● Foundation of MySQL replication
● Statement or Row based
● Represents a journal / change log of data
● Allows applications to spy / tune in on
MySQL changes
MySQLBinaryLogConsumer
● Uses behavior from abstract class
● Modular design, in this case, uses config
based implementations
● Uses HOCON configuration for ease of use and availability
case class MySQLBinaryLogConsumer(config: Config)
  extends AbstractMySQLBinaryLogConsumer
  with ConfigBasedConnectionSource
  with ConfigBasedErrorHandlingBehaviour
  with ConfigBasedEventSkippingBehaviour
  with CacheableTableMapBehaviour
AbstractMySQLBinaryLogConsumer
● Maintains connection to MySQL
● Primarily handles
○ TABLE_MAP
○ QUERY (BEGIN, COMMIT, ROLLBACK, ALTER)
○ XID
○ Mutations (INSERT, UPDATE, DELETE)
● Provides an enriched binary log API
○ Looks up table metadata and includes it
○ Scala friendly case class and option-driven(*) API for
speaking MySQL binlogs
(*) constant work in progress (=
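To make the event handling above concrete, here is a minimal sketch of how BEGIN, COMMIT / XID and ROLLBACK events could bracket mutations into transactions. The event types and the `committed` function are hypothetical and simplified for illustration; they are not mypipe's actual classes.

```scala
object TxSketch {
  // Hypothetical, simplified event model (an assumption, not mypipe's API).
  sealed trait Event
  case object Begin extends Event
  case object Commit extends Event // also stands in for an XID event
  case object Rollback extends Event
  final case class Mutation(description: String) extends Event

  /** Folds a stream of events into the mutations that were actually committed. */
  def committed(events: List[Event]): List[Mutation] = {
    // State: (mutations buffered in the open transaction, mutations already committed)
    val start = (Vector.empty[Mutation], Vector.empty[Mutation])
    val (_, done) = events.foldLeft(start) {
      case ((_, done), Begin)         => (Vector.empty, done)       // new transaction
      case ((buf, done), m: Mutation) => (buf :+ m, done)           // buffer until commit
      case ((buf, done), Commit)      => (Vector.empty, done ++ buf) // publish the buffer
      case ((_, done), Rollback)      => (Vector.empty, done)       // discard the buffer
    }
    done.toList
  }
}
```

Buffering until the commit marker is one way to make sure downstream systems only ever see changes that the database actually kept.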
TABLE_MAP and table metadata
● Provides table metadata
○ Precedes mutation events
○ But no column names!
● MySQLMetadataManager
○ One actor per database
○ Uses “information_schema”
○ Determines column metadata and primary key
● TableCache
○ Wraps metadata actor providing a cache
○ Refreshes tables “when needed”
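The caching idea can be sketched roughly as follows. This is an illustration only: `TableMeta`, `TableCache` and the `lookup` function (standing in for the "information_schema" query done by the metadata actor) are assumed names, and the real TableCache talks to an actor rather than calling a plain function.

```scala
object TableCacheSketch {
  // Simplified metadata; the real Table case class carries more fields.
  final case class TableMeta(id: Long, name: String, columns: List[String])

  /** Caches metadata by table id; `lookup` is a hypothetical stand-in for the metadata query. */
  final class TableCache(lookup: (Long, String) => TableMeta) {
    private val cache = scala.collection.mutable.Map.empty[Long, TableMeta]

    /** Returns cached metadata, fetching it the first time a table id is seen. */
    def get(id: Long, name: String): TableMeta =
      cache.getOrElseUpdate(id, lookup(id, name))

    /** Drops an entry so the next `get` re-fetches it, for example after an ALTER TABLE. */
    def refresh(id: Long): Unit = { cache.remove(id); () }
  }
}
```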
Mutations
case class ColumnMetadata(name: String, colType: ColumnType.EnumVal, isPrimaryKey: Boolean)
case class PrimaryKey(columns: List[ColumnMetadata])
case class Column(metadata: ColumnMetadata, value: java.io.Serializable)
case class Table(id: Long, name: String, db: String, columns: List[ColumnMetadata], primaryKey: Option[PrimaryKey])
case class Row(table: Table, columns: Map[String, Column])
case class InsertMutation(timestamp: Long, table: Table, rows: List[Row], txid: UUID)
case class UpdateMutation(timestamp: Long, table: Table, rows: List[(Row, Row)], txid: UUID)
case class DeleteMutation(timestamp: Long, table: Table, rows: List[Row], txid: UUID)
● Fully enriched with table metadata
● Contain column types, data and txid
● Mutations can be serialized and deserialized
from and to Avro
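As a rough illustration of why updates carry (old, new) row pairs, the sketch below computes which columns changed in an update. It uses simplified stand-ins for the case classes above (no table metadata or column types), so the names and shapes here are assumptions rather than mypipe's API.

```scala
import java.util.UUID

object MutationSketch {
  // Simplified stand-ins: a row is just column name -> value.
  final case class Row(columns: Map[String, Any])
  final case class UpdateMutation(table: String, rows: List[(Row, Row)], txid: UUID)

  /** Names of columns whose value differs between the old and the new row. */
  def changed(old: Row, updated: Row): Set[String] =
    updated.columns.collect {
      case (name, value) if !old.columns.get(name).contains(value) => name
    }.toSet

  /** Changed column names for every (old, new) pair in an update. */
  def changedColumns(m: UpdateMutation): List[Set[String]] =
    m.rows.map { case (old, updated) => changed(old, updated) }
}
```

A consumer doing cache invalidation or auditing could use such a diff to react only to the columns it cares about.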
Kafka Producers
● Two modes of operation:
○ Generic Avro beans
○ Specific Avro beans
● Producers decoupled from SerDE
○ Recently started supporting Kafka serializers and
decoders
○ Currently we only support: http://schemarepo.org/
○ Very soon we can integrate with systems such as
Confluent Platform’s schema registry.
Kafka Message Format
-----------------
| MAGIC | 1 byte |
|-----------------|
| MTYPE | 1 byte |
|-----------------|
| SCMID | N bytes |
|-----------------|
| DATA | N bytes |
-----------------
● MAGIC: magic byte, for protocol version
● MTYPE: mutation type, a single byte indicating insert (0x1), update (0x2), or delete (0x3)
● SCMID: Avro schema ID, N bytes
● DATA: the actual mutation data as N bytes
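The framing above could be encoded and decoded along these lines. This is a sketch under stated assumptions: the slide only says the schema ID is "N bytes", so the 2-byte big-endian schema ID, the magic value 0x0 and the `Frame` type are all choices made for illustration.

```scala
import java.nio.ByteBuffer

object FramingSketch {
  val Magic: Byte = 0x0 // assumed protocol version value

  // schemaId as a Short (2 bytes) is an assumption; the slide says "N bytes".
  final case class Frame(mutationType: Byte, schemaId: Short, data: Array[Byte])

  /** Lays out MAGIC, MTYPE, SCMID and DATA in that order. */
  def encode(f: Frame): Array[Byte] = {
    val buf = ByteBuffer.allocate(1 + 1 + 2 + f.data.length)
    buf.put(Magic).put(f.mutationType).putShort(f.schemaId).put(f.data)
    buf.array()
  }

  /** Reads the header fields back and treats the remaining bytes as DATA. */
  def decode(bytes: Array[Byte]): Frame = {
    val buf = ByteBuffer.wrap(bytes)
    require(buf.get() == Magic, "unexpected magic byte")
    val mutationType = buf.get()
    val schemaId = buf.getShort()
    val data = new Array[Byte](buf.remaining())
    buf.get(data)
    Frame(mutationType, schemaId, data)
  }
}
```

Putting the schema ID ahead of the payload lets a consumer fetch the right Avro schema before it attempts to decode DATA.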
Generic Message Format
● 3 Avro beans
○ InsertMutation, DeleteMutation, UpdateMutation
○ Hold data for new and old columns (for updates)
○ Groups data by type into Avro maps
{
"name": "old_integers",
"type": {"type": "map", "values": "int"}
},
{
"name": "new_integers",
"type": {"type": "map", "values": "int"}
},
{
"name": "old_strings",
"type": {"type": "map", "values": "string"}
},
{
"name": "new_strings",
"type": {"type": "map", "values": "string"}
} ...
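Grouping column data by type, as the generic schema does, might look like the sketch below. It only handles integers and strings, and the `Grouped` type and `group` function are hypothetical names used for illustration.

```scala
object GenericGroupingSketch {
  // One map per value type, mirroring the "integers" / "strings" Avro maps above.
  final case class Grouped(integers: Map[String, Int], strings: Map[String, String])

  /** Splits a row's columns into per-type maps keyed by column name. */
  def group(columns: Map[String, Any]): Grouped = {
    val ints = columns.collect { case (name, v: Int) => name -> v }
    val strs = columns.collect { case (name, v: String) => name -> v }
    Grouped(ints, strs)
  }
}
```

Because the maps are keyed by column name, a single schema can describe rows from any table, at the cost of losing per-table field definitions.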
Specific Message Format
● Requires 3 Avro beans per table
○ Insert, Update, Delete
○ Specific fields can be used in the schema
{
  "name": "UserInsert",
  "fields": [
    {
      "name": "id",
      "type": ["null", "int"]
    },
    {
      "name": "username",
      "type": ["null", "string"]
    },
    {
      "name": "login_date",
      "type": ["null", "long"]
    },...
  ]
},
ALTER table support
● ALTER table queries intercepted
○ Producers can handle this event specifically
● Kafka serializer and deserializer
○ They inspect Avro beans and refresh schema if
needed
● Avro evolution rules must be respected
○ Or mypipe can’t properly encode / decode data
Pipes
● Join consumers to producers
● Use configurable time based checkpointing
and flushing
○ File based, MySQL based, ZK based, Kafka based
schema-repo-client = "mypipe.avro.schema.SchemaRepo"
consumers {
  localhost {
    # database "host:port:user:pass" array
    source = "localhost:3306:mypipe:mypipe"
  }
}
producers {
  stdout {
    class = "mypipe.kafka.producer.stdout.StdoutProducer"
  }
  kafka-generic {
    class = "mypipe.kafka.producer.KafkaMutationGenericAvroProducer"
  }
}
pipes {
  stdout {
    consumers = ["localhost"]
    producer { stdout {} }
    binlog-position-repo {
      #class="mypipe.api.repo.ConfigurableMySQLBasedBinaryLogPositionRepository"
      class = "mypipe.api.repo.ConfigurableFileBasedBinaryLogPositionRepository"
      config {
        file-prefix = "stdout-00" # required if binlog-position-repo is specified
        data-dir = "/tmp/mypipe/data"
      }
    }
  }
  kafka-generic {
    enabled = true
    consumers = ["localhost"]
    producer {
      kafka-generic {
        metadata-brokers = "localhost:9092"
      }
    }
  }
}
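The consumer `source` setting in the configuration packs four fields into one "host:port:user:pass" string. A minimal sketch of parsing it is shown below; the `Source` type and `parse` function are hypothetical and not mypipe's own parser.

```scala
import scala.util.Try

object SourceSketch {
  final case class Source(host: String, port: Int, user: String, password: String)

  /** Parses "host:port:user:pass"; the limit of 4 keeps any colons inside the password. */
  def parse(s: String): Option[Source] = s.split(":", 4) match {
    case Array(host, port, user, password) =>
      // A non-numeric port yields None instead of throwing.
      Try(port.toInt).toOption.map(p => Source(host, p, user, password))
    case _ => None
  }
}
```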
Practical Applications
● Cache coherence
● Change logging and auditing
● MySQL to:
○ HDFS
○ Cassandra
○ Spark
● Once Confluent Schema Registry is integrated:
○ Kafka Connect
○ KStreams
● Other reactive applications
○ Real-time notifications
System Diagram
[Diagram: Pipe 1, Pipe 2 ... Pipe N, each pairing a MySQL BinaryLog Consumer with a Kafka Producer; also shown: MySQL binary logs, a Select Consumer, a Schema Registry, Kafka topics db1_tbl1, db1_tbl2, db2_tbl1 and db2_tbl2, and Event Consumers, Hadoop, Cassandra, Dashboards and Users]
Future Work
● Finish MySQL -> Kafka snapshot support
● Move to Kafka 0.10
● MySQL global transaction identifier (GTID)
support
● Publish to Maven
● More tests: we have a good amount, but you can never have enough!
Fin!
That’s all folks (=
Thanks!
Questions?
https://github.com/mardambey/mypipe

%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 

mypipe: Buffering and consuming MySQL changes via Kafka

  • 1. mypipe: Buffering and consuming MySQL changes via Kafka with -=[ Scala - Avro - Akka ]=- Hisham Mardam-Bey Github: mardambey Twitter: @codewarrior
  • 2. Overview ● Who is this guy? + Quick Mate1 Intro ● Quick Tech Intro ● Motivation and History ● Features ● Design and Architecture ● Practical applications and usages ● System diagram ● Future work ● Q&A
  • 3. Who is this guy? ● Linux and OpenBSD user and developer since 1996 ● Started out with C followed by Ruby ● Working with the JVM since 2007 ● “Lately” building and running distributed systems, and doing Scala Github: mardambey Twitter: @codewarrior
  • 4. Mate1: quick intro ● Online dating, since 2003, based in Montreal ● Initially a team of 3, around 30 now ● Engineering team has 12 geeks / geekettes ○ Always looking for talent! ● We own and run our own hardware ○ fun! ○ mostly… https://github.com/mate1
  • 5. Super Quick Tech Intro ● MySQL: relational database ● Avro: data serialization system ● Kafka: publish-subscribe messaging rethought as a distributed commit log ● Akka: toolkit and runtime simplifying the construction of concurrent and distributed applications ● Actors: universal primitives of concurrent computation using message passing ● Schema repo / registry: holds versioned Avro schemas
  • 6. Motivation ● Initially, wanted: ○ MySQL triggers outside the DB ○ MySQL fan-in or fan-out replication (data cubes) ○ MySQL to “Hadoop” ● And then: ○ Cache or data store consistency with DB ○ Direct integration with big-data systems ○ Data schema evolution support ○ Turning MySQL inside out ■ Bootstrapping downstream data systems
  • 7. History ● 2010: Custom Perl scripts to parse binlogs ● 2011/2012: Guzzler ○ Written in Scala, uses mysqlbinlog command ○ Simple to start with, difficult to maintain and control ● 2014: Enter mypipe! ○ Initial prototyping begins
  • 8. Feature Overview (1/2) ● Emulates MySQL slave via binary log ○ Writes MySQL events to Kafka ● Uses Avro to serialize and deserialize data ○ Generically via a common schema for all tables ○ Specifically via per-table schema ● Modular by design ○ State saving / loading (files, MySQL, ZK, etc.) ○ Error handling ○ Event filtering ○ Connection sources
  • 9. Feature Overview (2/2) ● Transaction and ALTER TABLE support ○ Includes transaction information within events ○ Refreshes schema as needed ● Can publish to any downstream system ○ Currently, we have Kafka ○ Initially, we started with Cassandra for the prototype ● Can bootstrap a MySQL table into Kafka ○ Transforms entire table into Kafka events ○ Useful with Kafka log compaction ● Configurable ○ Kafka topic names ○ whitelist / blacklist support ● Console consumer, Dockerized dev env
  • 10. Project Structure ● mypipe-api: API for MySQL binlogs ● mypipe-avro: binary protocol, mutation serialization and deserialization ● mypipe-producers: push data downstream ● mypipe-kafka: Serializer & Decoder implementations ● mypipe-runner: pipes and console tools ● mypipe-snapshotter: import MySQL tables (beta)
  • 11. MySQL Binary Logging ● Foundation of MySQL replication ● Statement or Row based ● Represents a journal / change log of data ● Allows applications to spy / tune in on MySQL changes
  • 12. MySQLBinaryLogConsumer ● Uses behavior from abstract class ● Modular design, in this case, uses config based implementations ● Uses Hocon for ease and availability case class MySQLBinaryLogConsumer(config: Config) extends AbstractMySQLBinaryLogConsumer with ConfigBasedConnectionSource with ConfigBasedErrorHandlingBehaviour with ConfigBasedEventSkippingBehaviour with CacheableTableMapBehaviour
  • 13. AbstractMySQLBinaryLogConsumer ● Maintains connection to MySQL ● Primarily handles ○ TABLE_MAP ○ QUERY (BEGIN, COMMIT, ROLLBACK, ALTER) ○ XID ○ Mutations (INSERT, UPDATE, DELETE) ● Provides an enriched binary log API ○ Looks up table metadata and includes it ○ Scala friendly case class and option-driven(*) API for speaking MySQL binlogs (*) constant work in progress (=
  • 14. TABLE_MAP and table metadata ● Provides table metadata ○ Precedes mutation events ○ But no column names! ● MySQLMetadataManager ○ One actor per database ○ Uses “information_schema” ○ Determines column metadata and primary key ● TableCache ○ Wraps metadata actor providing a cache ○ Refreshes tables “when needed”
  • 15. Mutations case class ColumnMetadata(name: String, colType: ColumnType.EnumVal, isPrimaryKey: Boolean) case class PrimaryKey(columns: List[ColumnMetadata]) case class Column(metadata: ColumnMetadata, value: java.io.Serializable) case class Table(id: Long, name: String, db: String, columns: List[ColumnMetadata], primaryKey: Option[PrimaryKey]) case class Row(table: Table, columns: Map[String, Column]) case class InsertMutation(timestamp: Long, table: Table, rows: List[Row], txid: UUID) case class UpdateMutation(timestamp: Long, table: Table, rows: List[(Row, Row)], txid: UUID) case class DeleteMutation(timestamp: Long, table: Table, rows: List[Row], txid: UUID) ● Fully enriched with table metadata ● Contain column types, data and txid ● Mutations can be serialized to and deserialized from Avro
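The mutation types on the slide above lend themselves to pattern matching in a consumer. The following is a minimal sketch, not mypipe's actual code: the case classes are simplified stand-ins (column metadata is dropped, values are plain `Any`), and the `Mutation` trait, `changedColumns`, and `describe` are hypothetical names introduced for illustration. It shows how the `(Row, Row)` pairs of an update could be read as an old and a new row image.

```scala
import java.util.UUID

object MutationSketch {
  // Simplified stand-ins for the slide's case classes: column metadata is
  // omitted and the sealed trait is an assumption made for this example.
  final case class Table(id: Long, name: String, db: String)
  final case class Row(table: Table, columns: Map[String, Any])

  sealed trait Mutation { def table: Table; def txid: UUID }
  final case class InsertMutation(timestamp: Long, table: Table, rows: List[Row], txid: UUID) extends Mutation
  final case class UpdateMutation(timestamp: Long, table: Table, rows: List[(Row, Row)], txid: UUID) extends Mutation
  final case class DeleteMutation(timestamp: Long, table: Table, rows: List[Row], txid: UUID) extends Mutation

  // Names of columns whose value differs between the old and new row images.
  def changedColumns(old: Row, updated: Row): Set[String] =
    updated.columns.collect {
      case (name, value) if !old.columns.get(name).contains(value) => name
    }.toSet

  // One-line, human-readable summary of a mutation.
  def describe(m: Mutation): String = m match {
    case InsertMutation(_, t, rows, _) =>
      s"insert ${rows.size} row(s) into ${t.db}.${t.name}"
    case UpdateMutation(_, t, rows, _) =>
      val changed = rows.flatMap { case (o, n) => changedColumns(o, n) }.toSet
      s"update ${rows.size} row(s) in ${t.db}.${t.name}, columns: ${changed.toList.sorted.mkString(", ")}"
    case DeleteMutation(_, t, rows, _) =>
      s"delete ${rows.size} row(s) from ${t.db}.${t.name}"
  }
}
```

A cache-coherence consumer, for example, could use something like `changedColumns` to decide whether an update touches any cached field at all.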
  • 16. Kafka Producers ● Two modes of operation: ○ Generic Avro beans ○ Specific Avro beans ● Producers decoupled from SerDE ○ Recently started supporting Kafka serializers and decoders ○ Currently we only support: http://schemarepo.org/ ○ Very soon we can integrate with systems such as Confluent Platform’s schema registry.
  • 17. Kafka Message Format | MAGIC: 1 byte | MTYPE: 1 byte | SCMID: N bytes | DATA: N bytes | ● MAGIC: magic byte, for protocol version ● MTYPE: mutation type, a single byte ○ indicating insert (0x1), update (0x2), or delete (0x3) ● SCMID: Avro schema ID, N bytes ● DATA: the actual mutation data as N bytes
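The framing on the slide above can be sketched with a plain `ByteBuffer`. This is an illustration under stated assumptions rather than mypipe's serializer: the slide gives the schema ID as "N bytes", so the 2-byte big-endian short used here is an assumption, as is the magic byte value `0x0`. The names `MutationFrame`, `Frame`, `encode`, and `decode` are hypothetical.

```scala
import java.nio.ByteBuffer

object MutationFrame {
  // Assumed value: the slide does not state which magic byte is used.
  val Magic: Byte = 0x0
  // Mutation type bytes as listed on the slide.
  val Insert: Byte = 0x1
  val Update: Byte = 0x2
  val Delete: Byte = 0x3

  final case class Frame(magic: Byte, mutationType: Byte, schemaId: Short, data: Array[Byte])

  // Layout: MAGIC (1 byte) | MTYPE (1 byte) | SCMID (assumed 2 bytes) | DATA (rest).
  def encode(mutationType: Byte, schemaId: Short, data: Array[Byte]): Array[Byte] = {
    val buf = ByteBuffer.allocate(1 + 1 + 2 + data.length)
    buf.put(Magic).put(mutationType).putShort(schemaId).put(data)
    buf.array()
  }

  // Reads the header fields in order, then treats the remaining bytes as the payload.
  def decode(bytes: Array[Byte]): Frame = {
    val buf = ByteBuffer.wrap(bytes)
    val magic = buf.get()
    val mutationType = buf.get()
    val schemaId = buf.getShort()
    val data = new Array[Byte](buf.remaining())
    buf.get(data)
    Frame(magic, mutationType, schemaId, data)
  }
}
```

Carrying the schema ID in every message lets a consumer fetch the matching Avro schema from the schema repository before decoding `DATA`.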
  • 18. Generic Message Format 3 Avro beans ○ InsertMutation, DeleteMutation, UpdateMutation ○ Hold data for new and old columns (for updates) ○ Groups data by type into Avro maps { "name": "old_integers", "type": {"type": "map", "values": "int"} }, { "name": "new_integers", "type": {"type": "map", "values": "int"} }, { "name": "old_strings", "type": {"type": "map", "values": "string"} }, { "name": "new_strings", "type": {"type": "map", "values": "string"} } ...
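Grouping column data by type, as the generic format above does, can be sketched as follows. This is a simplified illustration: only the integer and string maps from the slide's schema excerpt are modelled, values of any other type are dropped, and `GenericGrouping`, `Grouped`, and `group` are hypothetical names rather than mypipe's API.

```scala
object GenericGrouping {
  // Mirrors the per-type Avro map fields in the slide's schema excerpt
  // (e.g. "new_integers", "new_strings"); other types are not modelled here.
  final case class Grouped(integers: Map[String, Int], strings: Map[String, String])

  // Splits one row's column values into per-type maps keyed by column name.
  def group(columns: Map[String, Any]): Grouped = {
    val integers = columns.collect { case (name, v: Int) => name -> v }
    val strings = columns.collect { case (name, v: String) => name -> v }
    Grouped(integers, strings)
  }
}
```

For an update, the same grouping would be applied twice, once to the old row image and once to the new one, to fill the `old_*` and `new_*` maps.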
  • 19. Specific Message Format Requires 3 Avro beans per table ○ Insert, Update, Delete ○ Specific fields can be used in the schema { "name": "UserInsert", "fields": [ { "name": "id", "type": ["null", "int"] }, { "name": "username", "type": ["null", "string"] }, { "name": "login_date", "type": ["null", "long"] },... ] },
  • 20. ALTER table support ● ALTER table queries intercepted ○ Producers can handle this event specifically ● Kafka serializer and deserializer ○ They inspect Avro beans and refresh schema if needed ● Avro evolution rules must be respected ○ Or mypipe can’t properly encode / decode data
  • 21. Pipes ● Join consumers to producers ● Use configurable time based checkpointing and flushing ○ File based, MySQL based, ZK based, Kafka based
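Time-based checkpointing, as mentioned on the slide above, can be sketched like this. It is a minimal sketch under assumptions, not mypipe's implementation: `BinlogPosition`, `TimedCheckpointer`, and `maybeSave` are hypothetical names, the `save` function stands in for whichever position repository (file, MySQL, ZK, Kafka) is configured, and timestamps are passed in by the caller so the logic stays testable.

```scala
object CheckpointSketch {
  // Hypothetical binlog position: binlog file name plus byte offset.
  final case class BinlogPosition(file: String, offset: Long)

  // Calls `save` at most once per `intervalMillis`, based on caller-supplied time.
  final class TimedCheckpointer(intervalMillis: Long, save: BinlogPosition => Unit) {
    private var lastSavedAt: Option[Long] = None

    // Returns true when the position was persisted on this call.
    def maybeSave(position: BinlogPosition, nowMillis: Long): Boolean = {
      // Due on the first call, or once the interval has elapsed since the last save.
      val due = lastSavedAt.forall(t => nowMillis - t >= intervalMillis)
      if (due) {
        save(position)
        lastSavedAt = Some(nowMillis)
      }
      due
    }
  }
}
```

A longer interval means fewer writes to the position store, at the cost of replaying more binlog events after a restart.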
  • 22. schema-repo-client = "mypipe.avro.schema.SchemaRepo" consumers { localhost { # database "host:port:user:pass" array source = "localhost:3306:mypipe:mypipe" } } producers { stdout { class = "mypipe.kafka.producer.stdout.StdoutProducer" } kafka-generic { class = "mypipe.kafka.producer.KafkaMutationGenericAvroProducer" } }
  • 23. pipes { stdout { consumers = ["localhost"] producer { stdout {} } binlog-position-repo { #class="mypipe.api.repo.ConfigurableMySQLBasedBinaryLogPositionRepository" class = "mypipe.api.repo.ConfigurableFileBasedBinaryLogPositionRepository" config { file-prefix = "stdout-00" # required if binlog-position-repo is specified data-dir = "/tmp/mypipe/data" } } }
  • 24. kafka-generic { enabled = true consumers = ["localhost"] producer { kafka-generic { metadata-brokers = "localhost:9092" } } }
  • 25. Practical Applications ● Cache coherence ● Change logging and auditing ● MySQL to: ○ HDFS ○ Cassandra ○ Spark ● Once Confluent Schema Registry integrated ○ Kafka Connect ○ KStreams ● Other reactive applications ○ Real-time notifications
  • 26. System Diagram [figure; labels: MySQL, Binary Logs, Select Consumer; Pipe 1, Pipe 2, Pipe N, each with a MySQL BinaryLog Consumer and a Kafka Producer; Schema Registry; Kafka with topics db1_tbl1, db1_tbl2, db2_tbl1, db2_tbl2; Event Consumers, Hadoop, Cassandra, Dashboards, Users]
  • 27. Future Work ● Finish MySQL -> Kafka snapshot support ● Move to Kafka 0.10 ● MySQL global transaction identifier (GTID) support ● Publish to Maven ● More tests, we have a good amount, but you can’t have enough!
  • 28. Fin! That’s all folks (= Thanks! Questions? https://github.com/mardambey/mypipe