Automate the complete big data workflow, from importing data into HDFS to exporting it back to an RDBMS such as MySQL, with Apache Sqoop.
Let me know if anything is required. Happy to help.
Ping me on Google: #bobrupakroy.
2. Import a subset of the table
To import a part of the table, type:
sqoop import --connect jdbc:mysql://localhost/db_1 --username root --password root --table student_details --where 'ID > 22' --target-dir subdata --split-by ID
Note: if we don’t use --split-by, Sqoop throws an error about the splitting criteria not being defined. Alternatively, we can use -m 1 (a single mapper) if we don’t want to split the data.
And whenever we run an import for a subset of data, we should use a different destination folder; otherwise Sqoop will throw an error because the destination folder already exists.
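To build intuition for what --split-by does, here is a minimal Python sketch of the idea behind it: Sqoop reads the minimum and maximum of the split column and divides that range among the mappers, each of which imports one slice. This is an illustrative simplification, not Sqoop’s actual code; the function name `split_ranges` is hypothetical.

```python
# Hypothetical sketch of how --split-by divides work among mappers:
# each mapper gets a contiguous slice of the split column's [min, max] range.
def split_ranges(min_id, max_id, num_mappers):
    """Return (lo, hi) inclusive bounds, one pair per mapper."""
    span = (max_id - min_id + 1) / num_mappers
    bounds = []
    for i in range(num_mappers):
        lo = min_id + round(i * span)
        # last mapper takes everything up to max_id
        hi = min_id + round((i + 1) * span) - 1 if i < num_mappers - 1 else max_id
        bounds.append((lo, hi))
    return bounds

# e.g. IDs 23..100 (our 'ID > 22' subset) split across 4 mappers
print(split_ranges(23, 100, 4))
```

With -m 1 there is only one slice, which is why no split column is needed in that case.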
Rupak Roy
3. Sqoop Job
In real life it is neither efficient nor practical to remember the last value each time we run Sqoop.
To overcome this we can use Sqoop’s metadata. Sqoop’s metadata retains records of all Sqoop events, which can be retrieved by creating a Sqoop job. A Sqoop job saves the records of the last successfully completed events from Sqoop’s metadata, and it automatically updates its last value each time an operation completes successfully.
4. #to create a sqoop job, type:
$sqoop job --create s1job -- import --connect jdbc:mysql://localhost/db_1 --username root --password root --table student_details --split-by ID --incremental append --check-column ID --last-value 33 --target-dir s1job_data
(note: there must be a space between -- and import)
#now to retrieve the job data
$sqoop job --show s1job
#let’s execute the sqoop using s1job(a sqoop job) :
$sqoop job --exec s1job
#repeat ( $sqoop job --exec s1job ) to import the updated data again.
Now, whenever there is an update in the table, we don’t need to remember the last value: --last-value is automatically updated in the Sqoop job after each successful operation.
We can view the updated --last-value with the same command ( $sqoop job --show s1job ).
#to view all the sqoop jobs
$sqoop job --list
#to delete a sqoop job
$sqoop job --delete s1job
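The incremental-append behaviour described above can be sketched in a few lines of Python. This is a conceptual model, not Sqoop’s implementation; the class `IncrementalJob` is hypothetical and stands in for the saved job plus its metadata.

```python
# Sketch of what a saved sqoop job does for --incremental append:
# remember the last imported value of the check column, and on each
# run fetch only rows whose check column exceeds it.
class IncrementalJob:
    def __init__(self, check_column, last_value):
        self.check_column = check_column
        self.last_value = last_value          # analogue of --last-value

    def run(self, table_rows):
        new_rows = [r for r in table_rows if r[self.check_column] > self.last_value]
        if new_rows:
            # like Sqoop, update the stored last value after a successful run
            self.last_value = max(r[self.check_column] for r in new_rows)
        return new_rows

job = IncrementalJob("ID", 33)
rows = [{"ID": 33}, {"ID": 44}, {"ID": 55}]
print(job.run(rows))     # imports IDs 44 and 55
print(job.last_value)    # now 55; a second run imports nothing new
print(job.run(rows))     # []
```

This is why repeating $sqoop job --exec s1job only brings over the rows added since the previous run.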
5. Export Data from HDFS to RDBMS(MySql)
We can export data from HDFS to an RDBMS in two common ways:
1) By updating existing data in the RDBMS
2) By adding new data to the RDBMS database
1) Updating existing data in the RDBMS
Let us create a file with new entries
$ vi newdata
55,Chris,ML
66,Alica,AZ
77,Ryan,FL
#now transfer the file into HDFS
$hadoop fs -put newdata new/newdata
#before updating the database, it is advisable to create a duplicate table:
CREATE TABLE student_details1 LIKE student_details;
INSERT INTO student_details1 SELECT * FROM student_details;
ALTER TABLE student_details1 ENABLE KEYS;
This is for practice purposes, so that we don’t have to go back and forth creating new tables over and over again in case of any mistakes.
SELECT * FROM student_details1;
6. #now use the sqoop command to transfer/export it to the RDBMS, such as MySQL:
$sqoop export --connect jdbc:mysql://localhost/db_1 --username root --password root --table student_details1 --export-dir /new/newdata --update-key ID
where --export-dir is the path of the file to export, and --update-key tells Sqoop to update the rows matching the key, i.e. ID in our case.
SELECT * FROM student_details1; #to see the change in the data
And we find that only the row with ID 55 got updated, because this is the update-existing-data mode: Sqoop only updates the rows matching the --update-key column.
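The update-only semantics can be sketched as follows. This is an illustrative Python model of the behaviour, not Sqoop’s code; the function `export_updateonly` and the sample column names are hypothetical.

```python
# Sketch of --update-key in the default update-only mode: rows whose key
# matches an existing row are updated; unmatched rows are silently skipped.
def export_updateonly(table, new_rows, key):
    for row in new_rows:
        for existing in table:
            if existing[key] == row[key]:
                existing.update(row)   # matched key -> UPDATE
                break                  # unmatched rows are not inserted

db = [{"ID": 55, "Name": "Zyan", "State": "NY"}]
export_updateonly(db, [{"ID": 55, "Name": "Chris", "State": "ML"},
                       {"ID": 66, "Name": "Alica", "State": "AZ"}], "ID")
print(db)   # only the ID 55 row changed; ID 66 was not inserted
```

This matches what we observed: the ID 55 row is updated, while the new IDs 66 and 77 are dropped.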
7. 2) By adding new data to the database
$sqoop export --connect jdbc:mysql://localhost/db_1 --username root --password root --table student_details1 --export-dir new/newdata --update-key ID --update-mode allowinsert
Now we can definitely see our
new data has been added
successfully.
Note: when we use this ALLOWINSERT mode, the table must have a primary key; otherwise, if there are duplicate IDs, like ID 55 for Zyan (from the existing database) and ID 55 for Chris (from our new file), it will create two rows with the same ID.
Adding a primary key to a table is optional and depends on the business needs; for example, one ID can purchase 3 or many products.
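For contrast with the update-only mode, here is a minimal Python sketch of allowinsert, which behaves like an upsert. Again this is a conceptual model, not Sqoop’s implementation; `export_allowinsert` is a hypothetical name.

```python
# Sketch of --update-mode allowinsert (upsert): matched keys are updated,
# and rows with unmatched keys are inserted as new records.
def export_allowinsert(table, new_rows, key):
    index = {r[key]: r for r in table}
    for row in new_rows:
        if row[key] in index:
            index[row[key]].update(row)   # key present -> UPDATE
        else:
            table.append(dict(row))       # key absent  -> INSERT
            index[row[key]] = table[-1]

db = [{"ID": 55, "Name": "Zyan"}]
export_allowinsert(db, [{"ID": 55, "Name": "Chris"},
                        {"ID": 66, "Name": "Alica"}], "ID")
print(db)   # ID 55 updated to Chris; ID 66 inserted as a new row
```

Without a primary key in the real table, the UPDATE match cannot be enforced by the database, which is why duplicate IDs can appear, as noted above.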
8. Next
Apache Hive, built for data summarization and query analysis.