2. What is ETL?
Extract is the process of reading data from a source database.
Transform is the process of converting the extracted data from its
previous form into the form it needs to be in so that it can be loaded
into another database. Transformation is done by applying rules, using
lookup tables, or combining the data with other data.
Load is the process of writing the data into the target database.
4. Terms closely related to and managed by ETL processes
data migration
data management
data cleansing
data synchronization
data consolidation.
5. Different ETL tools
• Informatica PowerCenter
• Oracle ETL
• Ab Initio
• Pentaho Data Integration - Kettle Project (open-source ETL)
• SAS ETL Studio
• Cognos DecisionStream
• Business Objects Data Integrator (BODI)
• Microsoft SQL Server Integration Services (SSIS)
• Talend
6. Prerequisites
Talend Open Studio for Data Integration
◦ http://www.talend.com/download
VirtualBox
◦ https://www.virtualbox.org/wiki/Downloads
Hortonworks Sandbox VM
◦ http://hortonworks.com/products/hortonworks-sandbox/#install
13. Supported data input and output formats
• SQL
• MySQL
• PostgreSQL
• Sybase
• Teradata
• MSSQL
• Netezza
• Greenplum
• Access
• DB2
• Hive
• Pig
• HBase
• Sqoop
• MongoDB
• Riak
• Many more
14. What kinds of datasets can be loaded?
Talend Studio offers nearly comprehensive connectivity to:
Packaged applications (ERP, CRM, etc.), databases, mainframes, files, Web Services, and so on, to
address the growing disparity of sources.
Data warehouses, data marts, and OLAP applications - for analysis, reporting, dashboarding,
scorecarding, and so on.
Built-in advanced components for ETL, including string manipulation, Slowly Changing
Dimensions, automatic lookup handling, bulk load support, etc.
15. Tutorial overview
We will perform the following tasks in this assignment:
1. Load data from a database on your local machine into HDFS
2. Write a Hive query to do the analysis
3. Push the result of the Hive query to an HBase output component
16. Step 1
Use the row generator component (tRowGenerator) to simulate the rows in the database, creating a table with 3 columns: ID, name, and level.
17. Step 2
• Drag and drop the hdfsoutput component to the
surface and connect the major output of the row
generator to the hdfs.
• For hdfs component, double click on the HDFS
component in design area and just specify the
name node address, and the folder in your
machine to hold the file
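As an optional sanity check (assuming the job writes to the /usr/talend folder used in Step 3), you can list and inspect the file from a shell on the sandbox:
# List the target folder in HDFS (the folder configured in the HDFS output component).
hdfs dfs -ls /usr/talend
# Show the first few generated rows, whatever the generated file is named.
hdfs dfs -cat /usr/talend/* | head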
18. Step 3
After loading the data to HDFS, we can create an external Hive table named customers by logging in to the Hive shell and executing the following command. The field delimiter must match the field separator configured in the HDFS output component; ';' is used here as an example:
CREATE EXTERNAL TABLE customers (id INT, name STRING, level STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
STORED AS TEXTFILE
LOCATION '/usr/talend';
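A quick way to confirm that the external table sees the generated rows is to run a small query from the sandbox shell:
# Read a handful of rows through the new external table.
hive -e 'SELECT * FROM customers LIMIT 5;'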
19. Step 4
Create a flow to read the data from Hive: drop a tHiveRow component, pick the Hive version and the Thrift server IP/port, then write a Hive query as shown in the screen below (an illustrative query is also sketched after this step).
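The actual query only appears in the screenshot; a query of the following shape against the customers table is one illustrative possibility, and it can be tested from the sandbox shell before being pasted into the tHiveRow component:
# Hypothetical analysis query: count customers per level.
hive -e "SELECT level, COUNT(*) AS cnt FROM customers GROUP BY level;"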
20. Step 5
Click the Edit schema button and add one column with type Object; we will then parse the result and map it to our schema.
Click the Advanced settings tab and enable parsing of the query results, using the Object-type column we just created.
Drag the tParseRecordSet component onto the surface and connect the main output of the tHiveRow to it, then click Edit schema to do the necessary mapping and match the values as shown below.
22. Step 6
◦ Click Run to execute the job; the console will tell you whether it has connected to the Hive server successfully.
◦ Go to the Hive server and it will show that it has received one query and will execute it.
◦ You can see the results in the Talend Run console.
23. Step 7
Drag the HBase output component (tHBaseOutput) from the palette on the right and configure the ZooKeeper connection information (a quick connectivity check is sketched below).
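Before running the job, an optional check from the sandbox shell confirms that HBase (and hence the ZooKeeper quorum it uses) is reachable:
# Ask HBase for its cluster status; the request goes through ZooKeeper.
echo "status" | hbase shell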
24. Step 8
Running the job will produce the final output.
You can log in to the HBase shell and check that the data was inserted into HBase and that the table was also created by Talend (see the commands sketched below).
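For example, from the sandbox shell (the table name customers is only an assumption; use whatever table name you configured in the HBase output component):
# List the tables HBase knows about and scan the one written by the job.
echo "list" | hbase shell
echo "scan 'customers'" | hbase shell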