Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0
3. ©Continuent 2014
Introducing Continuent
3
• The leading provider of clustering and replication for open source DBMS
• Our Product: Continuent Tungsten
• Clustering - commercial-grade HA, performance scaling, and data management for MySQL
• Replication - flexible, high-performance data movement
4. ©Continuent 2014
Quick Continuent Facts
• The largest Tungsten installation processes over 700 million transactions daily on 225 terabytes of data
• Tungsten Replicator was application of the year at the 2011 MySQL User Conference
• A wide variety of topologies including MySQL, Oracle, Vertica, and MongoDB are in production now
• MySQL to Hadoop deployments are now in progress with multiple customers
4
7. ©Continuent 2014
What Is Hadoop, Exactly?
7
a. A distributed file system
b. A method of processing massive quantities of data in parallel
c. The Cutting family's stuffed elephant
d. All of the above
9. ©Continuent 2014
Typical MySQL to Hadoop Use Case
9
[Diagram: a transaction-processing MySQL database feeding a Hadoop cluster with Hive analytics, raising the questions: initial load? latency? app changes? materialized views? changes? app load?]
10. ©Continuent 2014
Traditional Hadoop Deployments
• Data Analytics
• Single databases
• Collective databases
• Databases and external information
• Non-structured data
• Long term datastores and archiving
10
11. ©Continuent 2014
Future Hadoop Deployments
• Online analytics
• Real-time queries and caching
• Fully heterogeneous deployments
11
13. ©Continuent 2014
Comparing Methods in Detail
13
                         Manual via CSV              Sqoop                          Tungsten Replicator
Process                  Manual/scripted             Manual/scripted                Fully automated
Incremental loading      Possible with DDL changes   Requires DDL changes           Fully supported
Latency                  Full-load                   Intermittent                   Real-time
Extraction requirements  Full table scan             Full and partial table scans   Low-impact binlog scan
15. ©Continuent 2014
What is Tungsten Replicator?
15
A real-time, high-performance, open source database replication engine

GPLv2 license - 100% open source
Download from https://code.google.com/p/tungsten-replicator/
Annual support subscription available from Continuent
"Golden Gate without the Price Tag"®
16. ©Continuent 2014
Tungsten Replicator Overview
16
[Diagram: on the master, the replicator extracts transactions from the DBMS logs into the THL (transaction history log: transactions + metadata); the slave replicator reads the THL and applies the transactions to the target.]
17. ©Continuent 2014
Tungsten Replicator 3.0 Hadoop
17
• Extract from MySQL or Oracle
• Base Hadoop plus commercial distributions: Cloudera, HortonWorks, Amazon EMR, IBM
• Provision using Sqoop or parallel extraction
• Automatic replication of incremental changes
• Transformation to preferred HDFS formats
• Schema generation for Hive
• Tools for generating materialized views
18. ©Continuent 2014
Basic MySQL to Hadoop Replication
18
[Diagram: the Tungsten master replicator extracts from the MySQL binlog (binlog_format=row) and applies master-side filters: pkey (fill in primary-key info), colnames (fill in column names), cdc (add update type and schema/table info), source (add source DBMS), and replicate (subset the tables to be replicated). The Tungsten slave replicator writes CSV files and loads the raw CSV into HDFS on the Hadoop cluster (e.g., via LOAD DATA to Hive); the data is then accessed via Hive.]
19. ©Continuent 2014
Hadoop Data Loading - Gory Details
19
[Diagram: transactions from the master reach the slave replicator, which writes data to CSV files; a JavaScript load script (e.g., hadoop.js) loads the files into staging "tables" using the hadoop command. MapReduce jobs then merge the staging data into base tables and materialized views, with Hive table definitions generated along the way.]
21. ©Continuent 2014
JavaScript Batch Loader
• Simple, flexible batch loader
• prepare() - when we go online
• begin() - start of batch
• apply() - write the events
• commit() - commit the events
• release() - when we go offline
21
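As a rough sketch, a custom load script implements the five entry points listed above. The method names come from the slide; the `loaded` array, the `csvinfo.file` field, and the commented `hadoop fs -put` call are illustrative assumptions, not the exact runtime API of the real hadoop.js:

```javascript
// Minimal sketch of a Tungsten JavaScript batch load script (illustrative only;
// the real script uses runtime helpers supplied by the replicator).
var loaded = [];

function prepare() {
  // Called when the replicator goes online: set up directories/connections.
  loaded = [];
}

function begin() {
  // Called at the start of each batch of transactions.
}

function apply(csvinfo) {
  // Called once per staged CSV file: push it into HDFS, e.g. (hypothetical)
  //   runtime.exec('hadoop fs -put ' + csvinfo.file + ' /user/tungsten/staging/');
  loaded.push(csvinfo.file);
}

function commit() {
  // Called after all files in the batch have been applied.
}

function release() {
  // Called when the replicator goes offline: clean up resources.
}
```

The replicator drives this lifecycle itself; the script only has to say how a CSV file gets into Hadoop.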
23. ©Continuent 2014
Generating Staging Table Schema
23
$ ddlscan -template ddl-mysql-hive-0.10-staging.vm \
    -user tungsten -pass secret \
    -url jdbc:mysql:thin://logos1:3306/sales -db sales
...
DROP TABLE IF EXISTS sales.stage_xxx_sales;

CREATE EXTERNAL TABLE sales.stage_xxx_sales
(
  tungsten_opcode STRING,
  tungsten_seqno INT,
  tungsten_row_id INT,
  id INT,
  salesman STRING,
  planet STRING,
  value DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' ESCAPED BY '\\'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION '/user/tungsten/staging/sales/sales';
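The three tungsten_* columns carry change metadata for the merge step. As an illustration (the exact row values are hypothetical; an UPDATE is typically staged as a delete row followed by an insert row), staged changes for one sale might look like:

```javascript
// Hypothetical staging rows for sales.stage_xxx_sales. Each row is
// [tungsten_opcode, tungsten_seqno, tungsten_row_id, id, salesman, planet, value].
var stagedRows = [
  ['I', 100, 1, 42, 'zaphod', 'magrathea', 10.0],  // original INSERT
  ['D', 101, 1, 42, null, null, null],             // UPDATE staged as a delete...
  ['I', 101, 2, 42, 'zaphod', 'magrathea', 12.5]   // ...plus a re-insert with the new value
];

// Sorting by (id, tungsten_seqno, tungsten_row_id) puts the latest version
// of each key last, which is exactly what the materialization step relies on.
var latest = stagedRows[stagedRows.length - 1];
```

A reader scanning the staging table therefore sees the full change history; the base table only ever holds the final version per key.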
24. ©Continuent 2014
Generating Base Table Schema
$ ddlscan -template ddl-mysql-hive-0.10.vm -user tungsten \
    -pass secret -url jdbc:mysql:thin://logos1:3306/sales -db sales
...
DROP TABLE IF EXISTS sales.sales;

CREATE TABLE sales.sales
(
  id INT,
  salesman STRING,
  planet STRING,
  value DOUBLE );
24
25. ©Continuent 2014
Creating a Materialized View in Theory
25
[Diagram: logs #1..#N feed a MAP phase that sorts rows by key(s) and transaction order; the REDUCE phase emits the last row per key, provided it is not a delete.]
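A minimal sketch of that reduce step, under the stated semantics (rows arrive sorted by key and transaction order; keep the last row per key unless it is a delete). The `materialize` function and its row shape are illustrative, not the replicator's actual code:

```javascript
// Given staging rows sorted by (key, seqno, row_id), emit the last surviving
// version of each key; keys whose final opcode is 'D' disappear entirely.
function materialize(sortedRows, keyOf) {
  var lastByKey = {};              // key -> last row seen for that key
  sortedRows.forEach(function (row) {
    lastByKey[keyOf(row)] = row;   // later rows overwrite earlier ones
  });
  var result = [];
  Object.keys(lastByKey).forEach(function (k) {
    var row = lastByKey[k];
    if (row.opcode !== 'D') {      // drop keys that end in a delete
      result.push(row);
    }
  });
  return result;
}
```

This is the same "last writer wins" idea the tungsten-reduce script applies inside Hive below.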
26. ©Continuent 2014
MapReduce
26
MAP input (two file splits):
  Acme,2013,4.75
  Spitze,2013,25.00
  Acme,2013,55.25
  Excelsior,2013,1.00
  Spitze,2013,5.00

  Spitze,2014,60.00
  Spitze,2014,9.50
  Acme,2014,1.00
  Acme,2014,4.00
  Excelsior,2014,1.00
  Excelsior,2014,9.00

MAP output (values grouped by company):
  Acme,(4.75,55.25)
  Spitze,(25.00,5.00)
  Excelsior,(1.00)

  Spitze,(60.00,9.50)
  Acme,(1.00,4.00)
  Excelsior,(1.00,9.00)

REDUCE output (summed):
  Acme,65.00
  Excelsior,11.00
  Spitze,99.50

Equivalent to: SELECT COMPANY,SUM(VALUE) FROM ... WHERE ... GROUP BY COMPANY
27. ©Continuent 2014
Creating a Materialized View in Hive
$ hive
...
hive> ADD FILE /home/rhodges/github/continuent-tools-hadoop/bin/tungsten-reduce;
hive> FROM (                          -- MAP: distribute and sort by key
    SELECT sales.*
    FROM sales.stage_xxx_sales sales
    DISTRIBUTE BY id
    SORT BY id,tungsten_seqno,tungsten_row_id
  ) map1
  INSERT OVERWRITE TABLE sales.sales  -- REDUCE: emit final row per key
  SELECT TRANSFORM(
    tungsten_opcode,tungsten_seqno,tungsten_row_id,id,
    salesman,planet,value)
  USING 'perl tungsten-reduce -k id -c tungsten_opcode,tungsten_seqno,tungsten_row_id,id,salesman,planet,value'
  AS id INT,salesman STRING,planet STRING,value DOUBLE;
27
28. ©Continuent 2014
Comparing MySQL and Hadoop Data
$ export TUNGSTEN_EXT_LIBS=/usr/lib/hive/lib
...
$ /opt/continuent/tungsten/bristlecone/bin/dc \
    -url1 jdbc:mysql:thin://logos1:3306/sales \
    -user1 tungsten -password1 secret \
    -url2 jdbc:hive2://localhost:10000 \
    -user2 'tungsten' -password2 'secret' -schema sales \
    -table sales -verbose -keys id \
    -driver org.apache.hive.jdbc.HiveDriver
22:33:08,093 INFO DC - Data comparison utility
...
22:33:24,526 INFO Tables compare OK
28
29. ©Continuent 2014
Doing it all at once
$ git clone \
    https://github.com/continuent/continuent-tools-hadoop.git

$ cd continuent-tools-hadoop

$ bin/load-reduce-check \
    -U jdbc:mysql:thin://logos1:3306/sales \
    -s sales --verbose
29
32. ©Continuent 2014
MySQL to Hadoop Fan-In Architecture
32
[Diagram: three MySQL masters (m1, m2, m3), each with row-based replication (RBR) enabled and its own master replicator; the corresponding slave replicators fan in to a single Hadoop cluster (many nodes), alongside conventional MySQL slaves.]
33. ©Continuent 2014
Integration with Provisioning
33
[Diagram: as in the basic topology, the Tungsten master extracts from the MySQL binlog (binlog_format=row) and the Tungsten slave loads CSV files into the Hadoop cluster for access via Hive; Sqoop or another ETL tool performs the initial provisioning run directly from MySQL into Hadoop.]
34. ©Continuent 2014
On-Demand Provisioning via Parallel Extract
34
[Diagram: identical to the basic replication topology - master-side filters (pkey, colnames, cdc, source, replicate, plus other filters as needed), binlog_format=row, and the slave replicator loading CSV into HDFS (e.g., via LOAD DATA to Hive) for access via Hive - except that for provisioning the master extracts directly from the MySQL tables rather than from the binlog.]
35. ©Continuent 2014
Tungsten Replicator Roadmap
35
• Parallel CSV file loading
• Partition loaded data by commit time
• Data formats and tools to support additional Hadoop clients as well as HBase
• Replication out of Hadoop
• Integration with emerging real-time analytics based on HDFS (Impala, Spark/Shark, Stinger, ...)
37. ©Continuent 2014
Where Is Everything?
37
• Tungsten Replicator 3.0 builds are now available on code.google.com:
  http://code.google.com/p/tungsten-replicator/
• Replicator 3.0 documentation is available on the Continuent website:
  http://docs.continuent.com/tungsten-replicator-3.0/deployment-hadoop.html
• Tungsten Hadoop tools are available on GitHub:
  https://github.com/continuent/continuent-tools-hadoop
Contact Continuent for support
38. ©Continuent 2014
Commercial Terms
• Replicator features are open source (GPLv2)
• Investment Elements
• POC / Development (Walk Away Option)
• Production Deployment
• Annual Support Subscription
• Governing Principles
• Annual Subscription Required
• More Upfront Investment - Less Annual Subscription
38
39. ©Continuent 2014
We Do Clustering Too!
39
Tungsten clusters combine off-the-shelf open source MySQL servers into data services with:

• 24x7 data access
• Scaling of load on replicas
• Simple management commands

...without app changes or data migration

[Diagram: an example deployment (GonzoPortal.com) in Amazon US West, with Apache/PHP clients reaching the cluster through Tungsten Connectors.]
40. ©Continuent 2014
In Conclusion: Tungsten Offers...
• Fully automated, real-time replication from MySQL into Hadoop
• Support for automatic transformation to HDFS data formats and creation of full materialized views
• Positions users to take advantage of evolving real-time features in Hadoop
40
41. ©Continuent 2014
Continuent Web Page:
http://www.continuent.com

Tungsten Replicator 3.0:
http://code.google.com/p/tungsten-replicator

Our Blogs:
http://scale-out-blog.blogspot.com
http://mcslp.wordpress.com
http://www.continuent.com/news/blogs
560 S. Winchester Blvd., Suite 500
San Jose, CA 95128
Tel +1 (866) 998-3642
Fax +1 (408) 668-1009
e-mail: sales@continuent.com