This Sqoop presentation will help you learn what Sqoop is, why it is important, its key features, its architecture, how Sqoop import and export work, how Sqoop processes data, and finally how to work with Sqoop commands. Sqoop is a tool used to transfer bulk data between Hadoop and external data stores such as relational databases. This tutorial will help you understand how Sqoop can load data from a MySQL database into HDFS and process that data using Sqoop commands. Finally, you will learn how to export the table imported into HDFS back to the RDBMS. Now, let us get started and understand Sqoop in detail.
The following topics are covered in this Sqoop Hadoop presentation:
1. Need for Sqoop
2. What is Sqoop?
3. Sqoop features
4. Sqoop Architecture
5. Sqoop import
6. Sqoop export
7. Sqoop processing
8. Demo on Sqoop
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark Developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem, such as Hadoop 2.7, YARN, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create databases and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro schemas, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, its architecture, sources, sinks, channels, and configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, including creating, transforming, and querying DataFrames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training.
Need for Sqoop
Processing huge volumes of data requires loading data from diverse sources into Hadoop clusters. This process of loading data from heterogeneous sources comes with a set of challenges:
1. Maintaining data consistency
2. Ensuring efficient utilization of resources
3. Loading bulk data into Hadoop was not possible
4. Loading data using scripts was slow
Solution: Sqoop overcame all of these challenges of the traditional approach and can load bulk data from an RDBMS into Hadoop easily.
What is Sqoop?
Sqoop is a tool used to transfer bulk data between Hadoop and external data stores such as relational databases (MS SQL Server, MySQL). Data moves in both directions: an import brings data from the RDBMS into Hadoop, and an export sends data from Hadoop back to the RDBMS.
SQOOP = SQL + HADOOP
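As a quick illustration, a minimal Sqoop import looks like the sketch below. The host, database, user, and table names (dbserver, employees_db, sqoop_user, employees) are hypothetical placeholders, not values from this presentation.

# Minimal sketch: import a MySQL table into HDFS.
# -P prompts for the database password interactively.
sqoop import \
  --connect jdbc:mysql://dbserver/employees_db \
  --username sqoop_user \
  -P \
  --table employees \
  --target-dir /user/hadoop/employees

The matching sqoop export command reverses the direction; an example appears in the Sqoop processing section below.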
Sqoop Features
1. Parallel import/export: Sqoop uses the YARN framework to import and export data, which provides fault tolerance on top of parallelism.
2. Import results of SQL query: Sqoop allows us to import the result set returned by a SQL query into HDFS.
3. Connectors for all major RDBMS databases: Sqoop provides connectors for multiple relational database management systems (RDBMSs), such as MySQL and MS SQL Server.
4. Kerberos security integration: Sqoop supports the Kerberos computer network authentication protocol, which allows nodes communicating over a non-secure network to prove their identity to one another in a secure manner.
5. Full and incremental load: Sqoop can load a whole table or part of a table with a single command, so it supports both full and incremental loads. Two command sketches illustrating features 2 and 5 follow this list.
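The query import and incremental load features map directly to command-line options, as the hedged sketches below show. The connection string, column names, and values are illustrative assumptions.

# Sketch: import the results of a SQL query (feature 2). Sqoop requires the
# $CONDITIONS token in the WHERE clause so it can split the query across
# mappers; --split-by names the column used to partition the work.
sqoop import \
  --connect jdbc:mysql://dbserver/employees_db \
  --username sqoop_user -P \
  --query 'SELECT id, name, salary FROM employees WHERE $CONDITIONS' \
  --split-by id \
  --target-dir /user/hadoop/emp_query

# Sketch: an incremental load (feature 5) that appends only rows whose
# id column is greater than the last value imported previously.
sqoop import \
  --connect jdbc:mysql://dbserver/employees_db \
  --username sqoop_user -P \
  --table employees \
  --incremental append \
  --check-column id \
  --last-value 1000 \
  --target-dir /user/hadoop/employees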
Sqoop Processing
1. Sqoop runs in the Hadoop cluster.
2. It imports data from an RDBMS or NoSQL database into HDFS.
3. It uses mappers to slice the incoming data into multiple splits and loads the data into HDFS.
4. It exports data back into the RDBMS while making sure that the schema of the data in the database is maintained.
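To illustrate step 4, here is a minimal export sketch, again with hypothetical names. The target table must already exist in the database with a schema matching the HDFS data, which is how Sqoop keeps the database schema intact.

# Sketch: export HDFS data back into an existing RDBMS table.
# -m 4 runs the export with four parallel mappers.
sqoop export \
  --connect jdbc:mysql://dbserver/employees_db \
  --username sqoop_user -P \
  --table employees_copy \
  --export-dir /user/hadoop/employees \
  -m 4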