How much time, effort, and budget do organizations spend moving data around the enterprise? Unfortunately, quite a lot. These days, ETL developers spend most of their effort on the Extract (E) and Load (L), and less time on their craft: building Transformations (T). This changes in the new world of data integration. By offloading data from the RDBMS to Hadoop, with the ability to present it back to the relational database, data can be seamlessly integrated between different source and target systems. Transformations occur on the data offloaded to Hadoop, using the latest ETL technologies, or in the target database, with a standard ETL-on-RDBMS tool. In this session, we'll discuss how the new world of data integration shifts the focus to transforming data into insightful information by simplifying the data movement process.
Presented at Enkitec E4 2017.
4.
Gluent recognized by Gartner Inc.'s "Cool Vendors in Data Management, 2017" report:
Cool Vendors in Data Management, 2017
Adam M. Ronthal, Ted Friedman, Mark A. Beyer, Guido De Simoni, Merv
Adrian, Svetlana Sicular, Ehtisham Zaidi, Roxane Edjlali, 28 April 2017
The Gartner Cool Vendor Logo is a trademark and service mark of Gartner, Inc.,
and/or its affiliates, and is used herein with permission. All rights reserved.
Gartner does not endorse any vendor, product or service depicted in its research
publications, and does not advise technology users to select only those vendors
with the highest ratings or other designation. Gartner research publications
consist of the opinions of Gartner's research organization and should not be
construed as statements of fact. Gartner disclaims all warranties, expressed or
implied, with respect to this research, including any warranties of merchantability
or fitness for a particular purpose.
Download (no registration needed):
https://gluent.com/cool-vendor-2017/
18.
Typical ETL development - too much time spent on “E” and “L”
// Extract: read the source rows over JDBC (this declaration is
// reconstructed -- the original snippet begins mid-pipeline)
DataReader reader = new JdbcReader(getJdbcConnection(),
        "SELECT * FROM credit_balance")
    .setAutoCloseConnection(true);

// Transform: derive AvailableCredit, then drop the input fields
reader = new TransformingReader(reader)
    .add(new SetCalculatedField("AvailableCredit",
        "parseDouble(CreditLimit) - parseDouble(Balance)"))
    .add(new ExcludeFields("CreditLimit", "Balance"));

// Load: write the result to the target table over JDBC
DataWriter writer = new JdbcWriter(getJdbcConnection(), "dp_credit_balance")
    .setAutoCloseConnection(true);

JobTemplate.DEFAULT.transfer(reader, writer);
22. • Sharing data without moving data
• Metadata is stored about each “source” database
• Queries are passed through a metadata layer, which provides info about where the data lives and how to translate the query
• But…
• We must recode applications!
• Data remains in RDBMS silos and is limited by each database's CPU/storage constraints
What about data federation?
[Diagram: applications App 1 and App 2 accessing data sources Source 1, Source 2, … Source X, directly and through a federation layer]
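The metadata-layer routing described above can be sketched as a small catalog that records which source database owns each table, so applications ask the catalog instead of hard-coding a connection. A minimal, hypothetical sketch (the class, table, and source names are invented, not from any specific federation product):

```java
import java.util.HashMap;
import java.util.Map;

public class FederationCatalog {
    // Metadata about each "source" database: which system owns a table.
    private final Map<String, String> tableToSource = new HashMap<>();

    public void register(String table, String sourceDb) {
        tableToSource.put(table, sourceDb);
    }

    // Route a query: look up where the data lives; a real layer would
    // also translate the query into that source's SQL dialect.
    public String route(String table) {
        String source = tableToSource.get(table);
        if (source == null) {
            throw new IllegalArgumentException("No source registered for " + table);
        }
        return source;
    }

    public static void main(String[] args) {
        FederationCatalog catalog = new FederationCatalog();
        catalog.register("SALES", "oracle_dw");        // e.g. Source 1
        catalog.register("CLICKSTREAM", "mysql_web");  // e.g. Source 2
        // The application never hard-codes a source connection:
        System.out.println(catalog.route("SALES"));    // prints: oracle_dw
    }
}
```

The drawback noted on the slide still applies: every application must be recoded to go through the catalog, and the data itself never leaves its silo.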
24. • “Offload” - copy or move data from the relational database to Hadoop
• Why offload?
• Store data in a centralized location for access and sharing
• Use the distributed, parallel processing power of Hadoop for transformations
• Enable the use of "new world" technologies (Spark, Impala, etc.)
• Why Hadoop?
• We can now afford to keep a copy of all enterprise data for data sharing reasons!
Offload, not Extract
26. • Command-line client used to bulk-copy data from a relational database to HDFS over a JDBC connection
• Also works in reverse: HDFS to RDBMS
• Sqoop generates MapReduce jobs for the work
Bulk data load with Sqoop
sqoop import --connect jdbc:oracle:thin:@myserver:1521/MYDB1 \
  --username myuser \
  --null-string '' \
  --null-non-string '' \
  --target-dir=/user/hive/warehouse/ssh/sales \
  --append -m1 --fetch-size=5000 \
  --fields-terminated-by ',' --lines-terminated-by '\n' \
  --optionally-enclosed-by '"' --escaped-by '"' \
  --split-by TIME_ID \
  --query "SELECT * FROM SSH.SALES WHERE TIME_ID < DATE'1998-01-01' AND \$CONDITIONS"
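The delimiter and null-handling flags in a command like the one above govern how each extracted row is rendered as text. A rough Java sketch of that rendering logic (illustrative only, not Sqoop's actual implementation; the field values are made up):

```java
import java.util.Arrays;
import java.util.List;

public class DelimitedRow {
    // Render one record the way the flags above request: nulls become
    // empty strings (--null-string '' / --null-non-string ''), fields are
    // comma-separated (--fields-terminated-by ','), and any field holding
    // the delimiter or quote is enclosed in double quotes with embedded
    // quotes doubled (--optionally-enclosed-by '"' / --escaped-by '"').
    static String render(List<String> fields) {
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) row.append(',');
            String f = fields.get(i);
            if (f == null) f = "";
            if (f.contains(",") || f.contains("\"")) {
                row.append('"').append(f.replace("\"", "\"\"")).append('"');
            } else {
                row.append(f);
            }
        }
        return row.append('\n').toString();  // --lines-terminated-by '\n'
    }

    public static void main(String[] args) {
        System.out.print(render(Arrays.asList("1001", null, "ACME, Inc.", "19.99")));
        // prints: 1001,,"ACME, Inc.",19.99
    }
}
```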
27. • Gluent Advisor helps determine which tables and/or partitions are candidates for offload
• Offload options:
• Move or copy 100% of data
• Move “cold” partitions only
• Data can be kept in sync with an incremental offload
• Example: when a partition goes inactive (cold), it can be offloaded to Hadoop
• Data can be updated from RDBMS to Hadoop
Offload data with Gluent Data Platform
offload -x -t EDW.BALANCE_DETAIL_MONTHLY --less-than-value=2017-05-01
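The incremental-offload boundary in a command like the one above can be pictured as a simple date comparison over partition high values. A hypothetical Java sketch (the `eligible` helper, partition names, and dates are invented for illustration; this is not Gluent's implementation):

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class ColdPartitions {
    // Sketch of a boundary option like "--less-than-value=2017-05-01":
    // a partition is "cold" and eligible for offload when its high-value
    // bound does not exceed the boundary (all its rows fall below it).
    static List<String> eligible(Map<String, LocalDate> highValues, LocalDate boundary) {
        List<String> cold = new ArrayList<>();
        for (Map.Entry<String, LocalDate> p : highValues.entrySet()) {
            if (!p.getValue().isAfter(boundary)) {
                cold.add(p.getKey());
            }
        }
        Collections.sort(cold);
        return cold;
    }

    public static void main(String[] args) {
        Map<String, LocalDate> parts = Map.of(
                "P_2017_03", LocalDate.of(2017, 4, 1),   // rows < 2017-04-01
                "P_2017_04", LocalDate.of(2017, 5, 1),
                "P_2017_05", LocalDate.of(2017, 6, 1));
        // Partitions entirely below the 2017-05-01 boundary are candidates.
        System.out.println(eligible(parts, LocalDate.of(2017, 5, 1)));
        // prints: [P_2017_03, P_2017_04]
    }
}
```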
36. • Oracle SQL Connector for HDFS
• Uses an Oracle external table to access a delimited file or Hive table in Hadoop
• When queried, all data is read from Hadoop into Oracle, then processed
• Oracle to Hadoop over database links
• Create a database link from Oracle to Hadoop using the Impala (or Hive) ODBC gateway
• Pushes filters to Hadoop, but not grouping aggregates
• Big Data SQL
• Access Hive tables via an Oracle external table
• Predicate pushdown
Oracle SQL Connector for HDFS
43.
How to query the data? There are many options

Tool                          | Offload Data | Load Data | Query Data | Offload Query | Parallel Execution
Sqoop                         | Yes          | Yes       |            |               | Yes
Oracle Loader for Hadoop      |              | Yes       |            |               | Yes
Oracle SQL Connector for HDFS |              | Yes       | Yes        |               | Yes
ODBC Gateway                  |              | Yes       | Yes        | Yes           |
Big Data SQL                  |              | Yes       | Yes        | Yes           | Yes
Gluent Data Platform          | Yes          |           | Yes        | Yes           | Yes
Present