2. What is ETL?
Extract is the process of reading data from a source database.
Transform is the process of converting the extracted data from its
previous form into the form it needs to be in so that it can be loaded
into another database. Transformation is done by applying rules, using
lookup tables, or combining the data with other data.
Load is the process of writing the data into the target database.
4. Terms closely related to and managed by ETL processes
data migration
data management
data cleansing
data synchronization
data consolidation.
5. Different ETL tools
• Informatica PowerCenter
• Oracle ETL
• Ab Initio
• Pentaho Data Integration - Kettle Project (open-source ETL)
• SAS ETL Studio
• Cognos DecisionStream
• Business Objects Data Integrator (BODI)
• Microsoft SQL Server Integration Services (SSIS)
• Talend
6. Prerequisites
Talend Open Studio for Data Integration
◦ http://www.talend.com/download
VirtualBox
◦ https://www.virtualbox.org/wiki/Downloads
Hortonworks Sandbox VM
◦ http://hortonworks.com/products/hortonworks-sandbox/#install
13. Supported data input and output formats
• SQL
• MySQL
• PostgreSQL
• Sybase
• Teradata
• MSSQL
• Netezza
• Greenplum
• Access
• DB2
• Hive
• Pig
• HBase
• Sqoop
• MongoDB
• Riak
• Many more
14. What kinds of datasets can be loaded?
Talend Studio offers nearly comprehensive connectivity to:
Packaged applications (ERP, CRM, etc.), databases, mainframes, files, Web Services, and so on, to
address the growing disparity of sources.
Data warehouses, data marts, and OLAP applications - for analysis, reporting, dashboarding,
scorecarding, and so on.
Built-in advanced components for ETL, including string manipulation, Slowly Changing
Dimensions, automatic lookup handling, bulk load support, etc.
15. Tutorial overview
We will perform the following tasks in this assignment:
1. Load data from a database on your local machine into HDFS
2. Write a Hive query to do the analysis
3. Push the result of the Hive query to an HBase output component
16. Step 1
Use the row generator component (tRowGenerator) to simulate the rows in the database, creating a table with 3 columns: ID, name, and level.
17. Step 2
• Drag and drop the hdfsoutput component to the
surface and connect the major output of the row
generator to the hdfs.
• For hdfs component, double click on the HDFS
component in design area and just specify the
name node address, and the folder in your
machine to hold the file
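As an optional sanity check (assuming the job writes to the /usr/talend folder used in Step 3), you can list and inspect the file from a shell on the sandbox:
# List the target folder in HDFS (the folder configured in the HDFS output component).
hdfs dfs -ls /usr/talend
# Show the first few generated rows, whatever the generated file is named.
hdfs dfs -cat /usr/talend/* | head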
18. Step 3
After loading the data to HDFS, we can create an external Hive table named customers by logging in to the Hive shell and executing the following command. The field delimiter must match the field separator configured in the HDFS output component; ';' is used here as an example:
CREATE EXTERNAL TABLE customers (id INT, name STRING, level STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
STORED AS TEXTFILE
LOCATION '/usr/talend';
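A quick way to confirm that the external table sees the generated rows is to run a small query from the sandbox shell:
# Read a handful of rows through the new external table.
hive -e 'SELECT * FROM customers LIMIT 5;'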
19. Step 4
Create a flow to read the data from Hive: drop a tHiveRow component, pick the Hive version and the Thrift server IP/port, then write a Hive query as shown in the screen below (an illustrative query is also sketched after this step).
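The actual query only appears in the screenshot; a query of the following shape against the customers table is one illustrative possibility, and it can be tested from the sandbox shell before being pasted into the tHiveRow component:
# Hypothetical analysis query: count customers per level.
hive -e "SELECT level, COUNT(*) AS cnt FROM customers GROUP BY level;"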
20. Step 5
Click the Edit schema button and add one column with type Object; we will then parse the result and map it to our schema.
Click the Advanced settings tab and enable parsing of the query results, using the Object-type column we just created.
Drag the tParseRecordSet component onto the surface and connect the main output of the tHiveRow to it, then click Edit schema to do the necessary mapping and match the values as shown below.
22. Step 6
◦ Click Run to execute the job; the console will tell you whether it has connected to the Hive server successfully.
◦ Go to the Hive server and it will show that it has received one query and will execute it.
◦ You can see the results in the Talend Run console.
23. Step 7
Drag the HBase output component (tHBaseOutput) from the palette on the right and configure the ZooKeeper connection information (a quick connectivity check is sketched below).
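Before running the job, an optional check from the sandbox shell confirms that HBase (and hence the ZooKeeper quorum it uses) is reachable:
# Ask HBase for its cluster status; the request goes through ZooKeeper.
echo "status" | hbase shell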
24. Step 8
Running the job will produce the final output.
You can log in to the HBase shell and check that the data was inserted into HBase and that the table was also created by Talend (see the commands sketched below).
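For example, from the sandbox shell (the table name customers is only an assumption; use whatever table name you configured in the HBase output component):
# List the tables HBase knows about and scan the one written by the job.
echo "list" | hbase shell
echo "scan 'customers'" | hbase shell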