HIVE Data Migration
NHN Comico Data Laboratory
Bopyo Hong
2015-11-05
Overview
1. Concept
2. Export Data from Old Hadoop Cluster
3. Data Copy from Old Hadoop Cluster to New One
4. Import Data to New Hadoop Cluster
5. Reference sites
Concept
Migration of Hive data from the old Hadoop cluster to the new one
Hive (old version) -> Hive (new version)
Export Data from Old Hadoop Cluster - 1
1. Make a table list of the database you want to export
Hive does not support export/import at the database level, so each table has to be handled individually
$ hive --database test -e 'SHOW TABLES' | sed -e '1d' > table_list.txt
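table_list.txt then holds one table name per line; with hypothetical table names it would look like:
user_log
purchase_log
page_view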
2. Make export statements that are executed per partition
If you export a whole table (without using partitions) and the table is very big, the export
will take too long, so I recommend exporting partition by partition.
In this example, the partitions from 2015-10-01 to 2015-10-31 are exported
ex) the partition column is dt
/user/hive/warehouse/test.db/tablename/dt=2015-10-01
/user/hive/warehouse/test.db/tablename/dt=2015-10-02
/user/hive/warehouse/test.db/tablename/dt=2015-10-03
……………………………………..
/user/hive/warehouse/test.db/tablename/dt=2015-10-31
$ source make_cmd_export_by_partition.sh 20151001 20151101 > export_201510.q
Export Data from Old Hadoop Cluster - 2
$ vi make_cmd_export_by_partition.sh
#!/bin/bash
# Generate one EXPORT statement per table and per daily partition.
# Usage: make_cmd_export_by_partition.sh <start_yyyymmdd> <end_yyyymmdd, exclusive>
startdate=$1
enddate=$2
rundate="$startdate"
TBLS=(`cat table_list.txt`)
until [ "$rundate" == "$enddate" ]
do
  YEAR=${rundate:0:4}
  MONTH=${rundate:4:2}
  DAY=${rundate:6:2}
  DATE2=${YEAR}'-'${MONTH}'-'${DAY}
  for TBL in ${TBLS[@]}
  do
    echo "export table ${TBL} PARTITION (dt='$DATE2') to '/user/hadoop/temp_export_dir/${TBL}/dt=$DATE2';"
    echo
  done
  rundate=`date --date="$rundate +1 day" +%Y%m%d`
done
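For reference, with a hypothetical table name such as user_log, the generated export_201510.q contains statements such as:
export table user_log PARTITION (dt='2015-10-01') to '/user/hadoop/temp_export_dir/user_log/dt=2015-10-01';
export table user_log PARTITION (dt='2015-10-02') to '/user/hadoop/temp_export_dir/user_log/dt=2015-10-02';
...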
Export Data from Old Hadoop Cluster - 3
3. Export the tables of the test DB
$ source export_by_hive.sh export_201510.q
$ vi export_by_hive.sh
#!/bin/bash
echo "export start `date +'%Y-%m-%d %H:%M'`"
# Not all tables have the same partitions, so the option -hiveconf hive.cli.errors.ignore=true is needed to keep going past failed statements
hive -hiveconf hive.cli.errors.ignore=true --database test -f $1
echo "export end `date +'%Y-%m-%d %H:%M'`"
Data Copy from Old Hadoop Cluster to New One
Copy the data with the distcp command.
You should execute the distcp command on the new Hadoop cluster.
old cluster URL -> hftp://namenode_hostname:50070
new cluster URL -> hdfs://namenode_hostname:8020
$ source distcp_script.sh
$ vi distcp_script.sh
#!/bin/bash
echo "`date +'%Y-%m-%d %H:%M'` distcp start"
# The -pb option preserves block size; it is necessary to avoid errors caused by block size differences between the old and new clusters
hadoop distcp -pb hftp://old_host:50070/user/hadoop/temp_export_dir hdfs://new_host:8020/user/bopyo.hong/temp_import_dir
echo "`date +'%Y-%m-%d %H:%M'` distcp end"
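Before importing, it is worth confirming that the copy finished completely. A minimal check, assuming the same source and target paths as above, is to compare the total sizes reported on both clusters:
$ hadoop fs -du -s hftp://old_host:50070/user/hadoop/temp_export_dir
$ hadoop fs -du -s hdfs://new_host:8020/user/bopyo.hong/temp_import_dir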
Import Data to New Hadoop Cluster - 1
1. Make import statements that are executed per partition
$ source make_cmd_import_by_partition.sh 20151001 20151101 > import_201510.q
$ vi make_cmd_import_by_partition.sh
#!/bin/bash
# Generate one IMPORT statement per table and per daily partition.
# Usage: make_cmd_import_by_partition.sh <start_yyyymmdd> <end_yyyymmdd, exclusive>
TBLS=(`cat table_list.txt`)
startdate=$1
enddate=$2
rundate="$startdate"
until [ "$rundate" == "$enddate" ]
do
  YEAR=${rundate:0:4}
  MONTH=${rundate:4:2}
  DAY=${rundate:6:2}
  DATE2=${YEAR}'-'${MONTH}'-'${DAY}
  for TBL in ${TBLS[@]}
  do
    echo "import table ${TBL} PARTITION (dt='$DATE2') from '/user/bopyo.hong/temp_import_dir/${TBL}/dt=$DATE2';"
    echo
  done
  rundate=`date --date="$rundate +1 day" +%Y%m%d`
done
Import Data to New Hadoop Cluster - 2
2. Import the tables of the test DB
$ source import_by_hiveql.sh import_201510.q
$ vi import_by_hiveql.sh
#!/bin/bash
echo "import start `date +'%Y-%m-%d %H:%M'`"
# Not all tables have the same partitions, so the option -hiveconf hive.cli.errors.ignore=true is needed to keep going past failed statements
hive -hiveconf hive.cli.errors.ignore=true --database test -f $1
echo "import end `date +'%Y-%m-%d %H:%M'`"
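After the import finishes, you can spot-check a table on the new cluster, e.g. for a hypothetical table user_log:
$ hive --database test -e "SHOW PARTITIONS user_log"
$ hive --database test -e "SELECT COUNT(*) FROM user_log WHERE dt='2015-10-01'"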
Reference sites
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ImportExport
http://kickstarthadoop.blogspot.jp/2012/08/how-to-migrate-hive-table-from-one-hive.html
https://hadoop.apache.org/docs/r2.7.1/hadoop-distcp/DistCp.html#Command_Line_Options