Seamless Replication and Disaster
Recovery for Apache Hive Warehouse
DataWorks Summit – San Jose
June 2018
Presenters
Sankar Hariappan
Apache Hive Committer
Staff Software Engineer, Hortonworks Inc
Anishek Agarwal
Apache Hive Committer
Engineering Manager, Hortonworks Inc
Agenda
• Background
• Design Goals
• Deep Dive
• Wrap-up
• Questions?
What Is Disaster Recovery and Backup & Restore?
• Disaster Recovery / Replication
• Replication is copying data from the Production Site to the Disaster Recovery Site.
• Disaster Recovery includes replication, but also incorporates failover to the Disaster Recovery Site in case of an outage, and failback to the original Production Site.
• The Disaster Recovery Site can be an on-premise or on-cloud cluster.
• Backup & Restore
• While Replication/Disaster Recovery protects against disasters, it can also propagate logical errors (e.g. accidental deletion or corruption of data) to the DR Site.
• To protect against accidental deletion of important Hive data files, customers need incremental/full backups (generally retained for 30 days) so they can restore to a previous point-in-time version.
[Diagram: offsite replication from the Production Site to the Disaster Recovery Site, with failback in the reverse direction. A weekly Backup & Restore timeline (Sunday through Sunday) shows a full backup followed by cumulative incremental backups, and an accidental deletion being recovered from backup.]
Why Do Enterprise Customers Care?
• Disaster Recovery (DR)
• To maintain business continuity, customers want replication, failover and failback capabilities across sites. It is also often a compliance requirement.
• Early-adopter verticals include financial services, insurance, healthcare, payment processing, telco, etc.
• Replication to Cloud
• Customers want to copy Hive tables to S3/WASB/ADLS and spin up a compute cluster.
• This enables a hybrid cloud deployment for our Enterprise customers.
• Backup & Restore of business-critical data
• Customers want to back up and restore critical Hive data.
The Hadoop data lake is becoming an integral part of the information architecture of data-driven organizations, and many business-critical applications are hosted on Hadoop infrastructure. High availability of this business-critical data across sites, or its Backup & Restore, is therefore critical.
Use Case Flow: Disaster Recovery of Hive
[Diagram: two on-premise data centers, (a) and (b), under centralized security and governance. Data set A is active in Data Center (a) and read-only in (b); data set B is active in (b) and read-only in (a). Scheduled policy (A) replicates at 2am, 10am and 6pm daily; scheduled policy (B) replicates at 2am daily. After failover, B becomes active in (a), changes to B', and is later re-synced back to (b).]
1. Data replication runs with the scheduled policies.
2. A disaster takes down Data Center (b).
3. Failover to Data Center (a); data set B is made active there.
4. The active data set B changes to B' in Data Center (a).
5. Data Center (b) comes back up.
6. Failback to Data Center (b); B' is made passive in Data Center (a) and re-synced to Data Center (b).
Falcon-driven Replication - Shortcomings
• Uses EXPORT-IMPORT semantics to replicate data.
• Transfers the complete state on every cycle.
• 4X copy problem.
• High resource usage.
• Rubber-banding issue.
• Depends on external tools such as Falcon/Oozie to manage the replication state.
Hive-driven Replication (HIVE-14841)
• Hive introduces REPL commands to support replication.
• Incremental replication - only the delta changes are copied.
• Reduces the number of copies.
• Point-in-time replication.
• Hive maintains the replication state.
• Additional support for other database objects, for example functions, constraints, etc.
Replication Modes (per Dataset)
• Master-Slave: unidirectional replication from a read-write master to a read-only slave.
• Master-Master: bidirectional replication between read-write masters; each master may also feed additional read-only slaves.
Master-Master replication works well if both masters replicate different datasets.
Replication Patterns
• Hub and Spoke pattern: one read-write master replicates to several read-only slaves.
• Relay pattern: the master replicates to a read-only slave, which in turn relays the data to a further read-only slave.
Failover
• The slave takes over the master's responsibilities instantaneously.
• Ensures business continuity with minimal data loss, as per the defined Recovery Point Objective (RPO).
• Virtually zero downtime, i.e. near-zero Recovery Time Objective (RTO).
[Diagram: unidirectional replication from a read-write master to a read-only slave; on failover, the slave becomes read-write.]
Failback - Requirements & Challenges
• The slave cluster usually has minimal processing capacity, which makes failback an important requirement.
• The original master should come back up with the latest data.
• Stale data that was never replicated to the slave must be removed.
• The delta of data loaded into the slave after failover must be replicated back in the reverse direction.
[Diagram: unidirectional replication from master to slave; on failback, the direction is reversed and the master becomes read-write again.]
Building Block - Event Log
[Diagram: a SQL command arrives at HiveServer2 over JDBC/ODBC; the Hive Metastore captures the resulting event and stores it in the events table of the Metastore RDBMS.]
• Every metadata and data change is captured as an event.
• Each event is self-contained, so the state of the involved object (metadata + data) can be recovered from it.
• Events are ordered by a global sequence number (the event id).
• Events are stored in the Metastore RDBMS.
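As a minimal illustration (the table and statements are hypothetical), each of the following statements is captured as its own event and assigned the next event id; the event type names in the comments follow the metastore notification event types:
-- assumes hive.metastore.dml.events=true so that the INSERT is logged
CREATE TABLE web_logs (ts STRING, url STRING);                 -- CREATE_TABLE event
ALTER TABLE web_logs ADD COLUMNS (user_agent STRING);          -- ALTER_TABLE event
INSERT INTO web_logs VALUES ('2018-06-01', '/index', 'curl');  -- INSERT event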
Change Management
• Introduced to allow point-in-time replication.
• Consider replicating the following batch of events (see the sketch after this list):
• Insert into a table
• Drop the table
• The inserted files are still needed after the drop in order to replay the insert.
• A trash-like directory (the CM directory) captures deleted files.
• A checksum is used to verify each file; if the original is missing or has changed, the file is looked up in the CM directory by checksum.
• Necessary for ordered replication - the state of the destination database corresponds to the state of the source some duration X earlier.
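A minimal sketch of that insert-then-drop case (the table is hypothetical):
-- On the source, between two replication cycles:
INSERT INTO sales VALUES (101, 'completed');  -- new data files written under the table location
DROP TABLE sales;                             -- the files would normally be deleted right away
When the next incremental cycle replays the INSERT event on the target, the original files are gone from the source warehouse; because they were retained in the CM directory, they can still be located by checksum and copied, so the events can be replayed in order.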
A Replication Cycle
• Each cycle of replication copies both data and metadata.
• Two-step process
• Dump operation
• Dumps information about the source warehouse by writing it to HDFS.
• Runs for a relatively short duration - roughly minutes to hours.
• Load operation
• Loads the information into the target warehouse by reading from HDFS (on the source warehouse).
• Runs for a relatively longer duration - roughly hours to days.
• Retryable because of idempotent behavior.
• Manually triggered - orchestration of replication is not provided by Hive itself.
Bootstrap Replication (First Replication Cycle)
• Bootstrapping
• Replicates the whole database/warehouse to the target.
• Not event based.
• Runs only once (per dataset).
• Dump operation
• Iterates through all objects: databases, tables, functions, etc.
• Concurrent operations are allowed - no locking at the source.
• Load operation
• The target is in a non-coherent state while loading, since the state transfer is not "point in time".
• Retryable and idempotent thanks to "replication checkpoints".
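A minimal sketch of a bootstrap cycle, assuming a database named sales_db (the dump location is illustrative):
-- On the source warehouse:
REPL DUMP sales_db;
-- returns the dump location (under hive.repl.rootdir) and the last event id covered by the dump
-- On the target warehouse:
REPL LOAD sales_db FROM '/apps/hive/repl/dump-1';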
Incremental Replication (All Subsequent Replication Cycles)
• Event-based replication.
• An event ID represents the "replicated state"; this replication metadata is stored in the target.
• Dump operation
• Dumps only the delta of changes since the last replicated state of the target.
• Load operation
• The target database is always in a coherent state, even if the load fails.
• Retryable and idempotent thanks to the "replicated state".
• Scheduling - time-based or on-demand (the latter not recommended).
• The first incremental replication cycle brings the target database to a coherent state.
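A minimal sketch of an incremental cycle for the same hypothetical sales_db (event ids and paths are illustrative):
-- On the target: find the last replicated state
REPL STATUS sales_db;                -- returns last_replicated_event_id, e.g. 1200
-- On the source: dump only the events after that id
REPL DUMP sales_db FROM 1200;
-- On the target: apply the delta
REPL LOAD sales_db FROM '/apps/hive/repl/dump-2';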
Event Based Replication
[Diagram: on the master cluster, REPL DUMP serializes batches of new events from the events table of the metastore RDBMS onto HDFS; on the slave cluster, REPL LOAD reads the repl dump directory, copies the data files with Distcp, and writes the objects into the target metastore RDBMS through the Metastore API.]
Other Challenges
• Optimize for large databases.
• Parallel dump of partitions.
• Dynamic DAG generation for the load operation.
• Parallel execution of the DAG.
• Add resiliency to replication operations.
• Exponential-backoff retries.
• Tagging of Distcp jobs.
• Cleanup jobs.
• Optimize for TDE zones with the same key on source and target.
• Data integrity - depends on Distcp (via file checksums).
REPL Commands
• REPL DUMP [database name] { WITH ('key1'='value1' {, 'key2'='value2'}) } => outputs the dump location
• REPL LOAD [database name] FROM [location] { WITH ('key1'='value1' {, 'key2'='value2'}) }
• REPL STATUS [database name] => outputs the last_replicated_event_id; run only on the DR (target) warehouse.
• REPL DUMP [database name] { FROM [last_replicated_event_id] } { TO [end_event_id] } { LIMIT [number of events] }
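For example, the FROM/TO/LIMIT form can bound the size of each incremental batch (event ids are illustrative):
REPL DUMP sales_db FROM 1200 LIMIT 500;   -- dump at most 500 events, starting after event id 1200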
Current Status
• Replicates
• Managed tables with partitions.
• Views.
• UDFs/UDAFs.
• Constraints.
• To cloud storage (Amazon S3).
• Wire encryption and TDE.
• Work In Progress
• ACID Table Replication (HIVE-18320).
• External tables.
Future Work
• Optimize Fail Back.
• Offline media for bootstrap of large databases.
• Replicate Column Statistics, Materialized Views etc.
• Backup & Restore capability.
• Limitations
• SQL Standard-based Authorization.
• Non-native tables.
ACID Tables - Introduction
• Hive managed tables that support insert/update/delete operations with ACID semantics are called ACID tables.
• Hive managed tables that support insert-only operations with ACID semantics are called MM (micro-managed) or insert-only ACID tables.
• Transaction Manager
• Guarantees well-defined semantics for concurrent operations and failures.
• All transactions run at Snapshot Isolation level (between Serializable and Repeatable Read).
• Streaming Ingest API
• Insert only.
• SQL MERGE
• Mix of insert/update/delete.
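A full-ACID CREATE TABLE example appears on the next slide; as a minimal sketch of the insert-only (MM) variant, assuming the standard 'transactional_properties' table property (table name is illustrative):
CREATE TABLE clickstream_mm (ts STRING, url STRING)
  STORED AS ORC
  TBLPROPERTIES ('transactional'='true',
                 'transactional_properties'='insert_only');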
ACID Tables - Design
• Transaction Manager
• Begin a transaction and obtain a transaction ID.
• The storage layer is enhanced to support an MVCC architecture.
• Each row is tagged with a unique, internal ROW__ID.
• Multiple versions of each row allow concurrent readers and writers.
• The result of each write is stored in a new delta file (delta_txnid_txnid_stmtid).
• Compaction/Cleaner
• Combines multiple delta files.
• Reduces overhead on the NameNode.
• Enables clean-up of aborted data.
[Table layout: ACID metadata columns (ROW__ID, the primary key) - original_transaction_id, bucket_id, row_id, current_transaction_id - followed by the user columns, e.g. col_1 a : INT and col_2 b : STRING.]
CREATE TABLE acidtbl (a INT, b STRING)
  CLUSTERED BY (a) INTO 1 BUCKETS
  STORED AS ORC
  TBLPROPERTIES ('transactional'='true');
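As a minimal illustration of the metadata columns, ROW__ID is a hidden struct column on ACID tables and can be selected explicitly (a sketch using the acidtbl defined above):
INSERT INTO acidtbl VALUES (1, 'first'), (2, 'second');
SELECT ROW__ID, a, b FROM acidtbl;
-- ROW__ID is a struct carrying the (original) transaction/write id, bucket id and row id of each row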
ACID Tables Replication - Design
• Per-table write ID (HIVE-18192)
• Associated with the transaction (txn_id) that writes into the table.
• Data file versioning uses the write_id.
• Delta files: delta_writeid_writeid_stmtid
• ROW__ID: {original_write_id, bucket_id, row_id, current_write_id}
• Snapshot isolation semantics use the write id.
• Enables optimized failback, since data files are consistent across warehouses.
• Optimizes compaction by not having to wait on all open transactions.
ACID Tables Replication - Design
• Bootstrapping
• Replicates a consistent snapshot of the database.
• Forcefully aborts transactions that were opened before the dump.
• The dump operation runs within a read transaction - it doesn't lock the database.
• Incremental Replication
• Additional events for the open, abort, and commit transaction operations.
• Optimizes network/disk usage.
• Only committed data is replicated.
• Compaction is driven at the target warehouse instead of copying compacted data files over the network.
Cloud Replication - Challenges
• Move is expensive
• Cloud file systems implement "move" as "copy".
• The atomic move/rename of data files from a temp directory to the warehouse location in the target is therefore costly.
• ACID/MM table replication avoids the rename by copying data directly to the warehouse location.
• Data integrity when copying data from the cloud
• Checksums are not consistent across all file systems.
Replication Time Estimates
• Network bandwidth
• Assume about 50% utilization of a 1 Gbps link: 0.5 Gbps = 0.5 * 1000 / 8 = 62.5 MB/s; regular hard disks sustain ~90-130 MB/s, so disk should not be the limiting factor.
• 1 Gbps at 50% ≈ 62.5 MB/s ≈ 5.2 TB/day
• 10 Gbps at 50% ≈ 625 MB/s ≈ 52 TB/day
• Data size
• 10 TB: ~2 days @ 1 Gbps, or ~5 hours @ 10 Gbps
• 100 TB: ~20 days @ 1 Gbps, or ~2 days @ 10 Gbps
• 1 PB: ~200 days @ 1 Gbps, or ~20 days @ 10 Gbps
Replication Orchestration
• DLM (Data Lifecycle Manager) is the orchestration engine built at Hortonworks that lets users easily set up replication between their clusters.
• Schedules and manages Hive replication policies.
• Automatically retries on failures.
• Sets resource usage limits.
• Figures out the right set of commands to run depending on whether the replication sites are on-premise or on-cloud.
• Replicates the Ranger policies associated with the source warehouse to the target.
Takeaways
• A data lake is becoming an integral part of the architecture of data-driven organizations, and many business-critical applications are hosted on Hadoop infrastructure.
• Availability of business-critical data across sites is critical.
• Disaster recovery and off-premise processing solutions are powered by the replication capabilities of Hive.
• Why Hive replication?
• Point-in-time incremental replication of Hive data and metadata.
• On-cloud replication capabilities.
• Seamless failover and failback support.
References: Hive Configurations for Replication
Hive Configuration | Recommendation | Description
hive.metastore.transactional.event.listeners | org.apache.hive.hcatalog.listener.DbNotificationListener | Enables event logging
hive.metastore.event.db.listener.timetolive | 86400s / RPO | Expiry time for the events logged in the metastore
hive.repl.rootdir | Any valid HDFS directory | Root directory used by REPL DUMP
hive.metastore.dml.events | true | Enables event generation for DML operations
hive.repl.cm.enabled | true | Enables change management to archive deleted data files
hive.repl.cm.retain | 24h / RPO | Expiry time for CM backed-up data files
hive.repl.cm.interval | 3600s | Interval at which expired data files in CM are looked up
hive.repl.cmrootdir | Any valid HDFS directory | Root directory for the Change Manager
hive.repl.replica.functions.root.dir | Any valid HDFS directory | Root directory for UDF/UDAF jars; configure on the target cluster
hive.repl.approx.max.load.tasks | 1000 / depends on the memory capacity of the target warehouse | Limits the number of execution tasks to control memory consumption; configure on the target cluster
hive.repl.partitions.dump.parallelism | 8 / depends on CPU usage | Number of threads used to dump partitions concurrently
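These settings normally live in hive-site.xml on the respective clusters. As a sketch, some of them can also be overridden per command through the WITH clause shown earlier; whether a given key may be passed this way depends on the configuration whitelist, and the values below are purely illustrative:
REPL DUMP sales_db WITH ('hive.repl.rootdir'='hdfs://prod-nn/apps/hive/repl');
REPL LOAD sales_db FROM 'hdfs://prod-nn/apps/hive/repl/dump-3'
  WITH ('hive.repl.approx.max.load.tasks'='1000');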
References: Apache Hive Documentation
https://cwiki.apache.org/confluence/display/Hive/Home
https://cwiki.apache.org/confluence/display/Hive/HiveReplicationv2Development
https://cwiki.apache.org/confluence/display/Hive/HiveReplicationDevelopment
https://cwiki.apache.org/confluence/display/Hive/Replication
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ImportExport
https://issues.apache.org/jira/browse/HIVE-14841
https://issues.apache.org/jira/browse/HIVE-18320