This point of view document presents the possible techniques for integrating SAP HANA and Hadoop, their pros and cons, and the scenarios in which each is recommended.
Integration of SAP HANA with Hadoop
Author – Ramkumar Rajendran
Author Biography
Ramkumar Rajendran
Ramkumar Rajendran is a Consultant at a leading firm with four years of
experience. He has specialized in various tools, including SAP HANA,
SAP BI, SAP BO (Xcelsius, Webi and IDT), Tableau, Lumira and Hadoop
Hive. He has worked on sentiment analysis of Twitter data and has been
involved in the integration of HANA and Hadoop. He has worked on
multiple implementation projects for various industry sectors.
Table of Contents
1 About this document
2 Introduction
   SAP HANA
   Hadoop
3 Combined Potential of HANA and Hadoop
4 Scenarios of Hadoop and HANA Integration
   Federated Data Query through Smart Data Access (SDA)
   Business Objects Data Services
   SQOOP
   JAVA Program
5 Summary
6 Reference Material
About this document
This document discusses the combined potential of the in-memory database
SAP HANA and the big data solution Hadoop, the various methods of integrating
the two technologies, and the scenarios in which each method is applicable.
SAP HANA specializes in real-time in-memory processing, while Hadoop is apt for massive
parallel processing. Integrating the two technologies combines the advantages of both.
Hadoop handles both structured and unstructured data from social media, machine logs,
etc., which can be combined with the transactional data present in HANA, resulting
in more mature business analysis.
This document has been prepared based upon SAP HANA SP6 and Hadoop CDH 4.5.
Introduction
SAP HANA
SAP HANA is an innovative in-memory database and data management platform,
specifically developed to take full advantage of the capabilities provided by modern
hardware to increase application performance. By keeping all relevant data in main
memory, data processing operations are significantly accelerated.
Design for scalability is a core SAP HANA principle. SAP HANA can be distributed across
multiple hosts to achieve scalability in terms of both data volume and user
concurrency. Unlike clusters, distributed HANA systems also distribute the data efficiently,
achieving high scaling without I/O locks.
The key performance indicators of SAP HANA appeal to many customers, and
thousands of deployments are in progress. SAP HANA has become the fastest-growing
product in SAP's 40+ year history.
Hadoop
Hadoop is an open source software project that enables the distributed processing of large
data sets across clusters of commodity servers. It is designed to scale up from a single
server to thousands of machines, with a very high degree of fault tolerance. Rather than
relying on high-end hardware, the resiliency of these clusters comes from the software’s
ability to detect and handle failures at the application layer.
Hadoop is known for its massive parallel processing capabilities on large datasets. It is also
scalable, cost-effective owing to its use of cheaper commodity processors, flexible and fault tolerant.
Combined Potential of HANA and Hadoop
Hadoop can store huge amounts of data. It is well suited to storing unstructured data,
is good at manipulating very large files, and is tolerant of hardware and software failures.
The main challenge with Hadoop, however, is getting information out of this huge data
store in real time.
HANA is well suited for processing data in real time, thanks to its in-memory technology.
By integrating Hadoop's massive parallel processing and HANA's in-memory computing
capabilities, the resulting solution would be capable of the following:
- Accommodating both structured and unstructured data.
- Providing cost-efficient data storage and processing for large data volumes.
- Performing complex information processing.
- Enabling heavily recursive algorithms, machine learning and queries that cannot be
  easily expressed in SQL.
- Archiving low-value data while keeping it available, though with slower access.
- Mining raw data that is either schema-less or whose schema changes over time.
Scenarios of Hadoop and HANA Integration
[Figure: the four integration techniques and their data-transfer mechanisms]
- Federated Data Query through Smart Data Access (SDA): no data loading; reporting tools on SAP HANA query Hadoop data in place via SDA.
- Business Objects Data Services (BODS): data loading from Hadoop to SAP HANA via a PULL mechanism.
- SQOOP: data loading from Hadoop to SAP HANA via a PUSH mechanism.
- Java program: data loading from Hadoop to SAP HANA via a PUSH or PULL mechanism.
Federated Data Query through Smart Data Access (SDA)
SAP HANA smart data access enables remote Hadoop data to be accessed as if it were stored
in local tables in SAP HANA, without loading the data into SAP HANA.
Not only does this capability provide operational and cost benefits, but most importantly it
supports the development and deployment of the next generation of analytical applications
which require the ability to access, synthesize and integrate data from multiple systems in
real-time regardless of where the data is located or what systems are generating it.
Specifically, in SAP HANA we can create virtual tables that point to remote tables in
Hadoop. Customers can then write SQL queries in SAP HANA that operate on these virtual
tables. The SAP HANA query processor optimizes these queries, executes the relevant
part of the query in the target database, returns the results to SAP HANA, and
completes the operation.
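A minimal sketch of this setup against Hive follows. The remote source name, DSN, credentials, schema and table names are illustrative assumptions; the exact adapter and remote-object path depend on the ODBC driver configured on the HANA host.

```sql
-- Register the Hadoop/Hive system as a remote source (DSN and credentials
-- are placeholders for an ODBC data source configured on the HANA server).
CREATE REMOTE SOURCE "HIVE_SRC" ADAPTER "hiveodbc"
  CONFIGURATION 'DSN=HIVE_DSN'
  WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=hive;password=secret';

-- Expose a Hive table as a virtual table in HANA; no data is copied.
CREATE VIRTUAL TABLE "MYSCHEMA"."V_WEBLOGS"
  AT "HIVE_SRC"."<NULL>"."default"."weblogs";

-- Reporting queries can then join Hadoop data with HANA data in place;
-- the aggregation below may be pushed down to Hive by the query processor.
SELECT c."REGION", COUNT(*) AS HITS
FROM "MYSCHEMA"."V_WEBLOGS" w
JOIN "MYSCHEMA"."CUSTOMERS" c ON w."CUSTOMER_ID" = c."ID"
GROUP BY c."REGION";
```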
Recommended Scenarios
Using SDA to access Hadoop from HANA means a federated query is fired on
Hadoop whenever a report is executed. This technique is recommended when the reporting
query generates a large intermediate result set at the Hadoop end: Smart Data Access
aggregates the dataset on Hadoop using Hadoop's own system resources, so only the end
results are transferred from Hadoop to HANA.
Advantages of this technique
- Real-time data access from Hadoop without actually having to load it into HANA.
- Helps in scenarios where the data residing in Hadoop is updated very frequently and
  data loading would make no sense.
- Queries can be optimized by pushing the processing down to Hadoop, which then returns
  only aggregated data.
Disadvantages of this technique
- Federated queries slow down when heavy processing needs to be done on the data at the
  Hadoop end.
- Data transformation is not possible while using Smart Data Access.
- With this technique the reporting query is also fired on Hadoop, which makes it
  critical for Hadoop to be up at all times. With multiple Hadoop systems, this risk
  grows accordingly.
- Data can only be extracted from Hive.
- Data access can happen only from Hadoop to HANA.
Business Objects Data Services
SAP Data Services delivers a single enterprise-class solution for data integration, data
quality, data profiling and text data processing. This technique involves a data PULL
mechanism from Hadoop to HANA, so the entire control lies with BODS.
This wide range of features helps to:
- Integrate, transform, improve and deliver trusted data from Hadoop to HANA.
- Provide development user interfaces, a metadata repository, a data connectivity
  layer, a run-time environment and a management console, enabling IT organizations
  to lower total cost of ownership and accelerate time to value.
- Enable IT organizations to maximize operational efficiency with a single solution to
  improve data quality and gain access to heterogeneous sources and applications.
Recommended Scenarios
Integrating HANA with Hadoop using BODS involves loading data on a schedule.
This can be utilized in scenarios where there is no requirement for real-time reporting,
but complex calculations are needed on large datasets. This technique proves very effective
in scenarios involving multiple Hadoop systems with a variety of unstructured data to be
processed on a large scale.
Advantages of this technique
- Unstructured data can be loaded from Hadoop to HANA with all transformations
  done during data loading.
- It is better suited to loading large datasets.
- BODS can be utilized to implement complex transformations while loading data from
  Hadoop to HANA.
- Performance of HANA can be improved by moving complex calculations to BODS.
- Its error-handling capabilities help in better support and maintenance.
- A data encryption function to encrypt sensitive data is one of the niche aspects of
  data loading through BODS.
- Centralized monitoring favors better IT support.
- Delta loads are also supported.
- Data transfer can happen in both directions.
Disadvantages of this technique
- Data present in Hadoop cannot be accessed on a real-time basis, since BODS loads data
  from Hadoop to HANA as a batch job.
SQOOP
SQOOP is a tool designed for efficiently transferring bulk data between Hadoop and
structured data stores such as Oracle, Microsoft SQL Server and SAP HANA. SQOOP can be used to import
data from external structured data stores into Hadoop Distributed File System or related
systems like Hive and HBase. Conversely, SQOOP can be used to extract data from Hadoop
and export it to external structured data stores such as relational databases and enterprise
data warehouses.
SQOOP provides a pluggable connector mechanism for optimal connectivity to external
systems. The SQOOP extension API provides a convenient framework for building new
connectors. New connectors can be dropped into SQOOP installations to provide
connectivity to various systems. SQOOP itself comes bundled with various connectors that
can be used for popular database and data warehousing systems.
By utilizing SQOOP, data transfer is automated through batch jobs, using native tools
for high-performance data transfer. SQOOP uses data store metadata to infer structure
definitions and utilizes Hadoop's MapReduce framework to transfer data in parallel,
which proves fruitful for huge amounts of data. It also provides an extension mechanism
to incorporate high-performance connectors for external systems.
For exporting data to external targets, SQOOP supports the functionality of staging tables,
which considerably improves the efficiency of data transfer and also insulates against
data corruption in times of failure.
This technique uses a PUSH mechanism to load data from Hadoop to HANA, so the entire
control lies with SQOOP on the Hadoop side.
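A sketch of such a PUSH-style export is shown below. The hostname, port, credentials, table and directory names are illustrative assumptions, and the HANA JDBC driver jar must be available on SQOOP's classpath.

```shell
# Export an HDFS/Hive warehouse directory into a HANA table,
# staging the rows first so a failed job cannot corrupt the target.
sqoop export \
  --connect "jdbc:sap://hanahost:30015/" \
  --driver com.sap.db.jdbc.Driver \
  --username SYSTEM --password secret \
  --table WEBLOGS \
  --staging-table WEBLOGS_STG --clear-staging-table \
  --export-dir /user/hive/warehouse/weblogs \
  --input-fields-terminated-by '\t' \
  -m 4
```

The `-m 4` option runs four parallel map tasks, which is how SQOOP exploits MapReduce for bulk transfers.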
Recommended Scenarios
SQOOP is a component in the Hadoop ecosystem that helps in data transfer from HDFS to
external databases and vice versa. This technique of integrating SAP HANA with Hadoop
involves periodic loading of data directly from the underlying Hadoop files into HANA tables.
SQOOP does not support any transformation while transferring data. Hence this technique
can be used in scenarios that require no real-time reporting and have readily formatted
source data needing no cleansing. It is also best suited for bulk data transfers, since
SQOOP uses Hadoop's underlying MapReduce framework to enable parallel data transfer.
Advantages of this technique
- It is better suited to loading bulk datasets.
- Data transfers can happen in both directions.
- It is open source and hence cost-effective.
Disadvantages of this technique
- Data present in Hadoop cannot be accessed on a real-time basis, since SQOOP loads
  data from Hadoop to HANA as a batch job.
- No cleansing or formatting of the data can be done with SQOOP.
JAVA Program
A Java program can be used to load data from Hadoop to HANA through JDBC connectivity.
This technique of HANA-Hadoop integration offers a very high level of customization in
terms of cleansing, transformation, refining, filtering, etc. Both PUSH and PULL
mechanisms can be implemented to transfer data from Hadoop to HANA, depending upon where
the program is installed and scheduled.
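A minimal sketch of such a loader follows, using plain JDBC batching. The connection URL, credentials and table names are illustrative assumptions; the SAP HANA JDBC driver (ngdbc.jar) would need to be on the classpath for the commented-out connection to work.

```java
// Sketch of a PULL-style loader: take delimited records extracted from Hadoop
// (e.g. a Hive query result) and batch-insert them into a HANA table via JDBC.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class HadoopToHanaLoader {

    // Builds the parameterized INSERT used for batching.
    static String buildInsert(String table, int columns) {
        StringBuilder sb = new StringBuilder("INSERT INTO " + table + " VALUES (");
        for (int i = 0; i < columns; i++) sb.append(i == 0 ? "?" : ", ?");
        return sb.append(")").toString();
    }

    // Inserts all rows in one JDBC batch: one round trip per batch, not per row.
    static void load(Connection conn, String table, List<String[]> rows) throws SQLException {
        String sql = buildInsert(table, rows.get(0).length);
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (String[] row : rows) {
                for (int i = 0; i < row.length; i++) ps.setString(i + 1, row[i]);
                ps.addBatch();
            }
            ps.executeBatch();
            conn.commit();
        }
    }

    public static void main(String[] args) {
        // Hypothetical HANA connection (requires ngdbc.jar on the classpath):
        // try (Connection conn = DriverManager.getConnection(
        //         "jdbc:sap://hanahost:30015/", "SYSTEM", "secret")) {
        //     load(conn, "\"SALES\".\"WEBLOGS\"", rowsReadFromHadoop);
        // }
        System.out.println(buildInsert("\"SALES\".\"WEBLOGS\"", 3));
        // prints: INSERT INTO "SALES"."WEBLOGS" VALUES (?, ?, ?)
    }
}
```

The same class could equally be scheduled on the Hadoop side (PUSH) or on a host near HANA (PULL); only where it runs changes, not the JDBC logic.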
Recommended Scenarios
This technique of data transfer from Hadoop to HANA is recommended in scenarios involving
very small data volumes. It offers developers a very high level of control, so they can
come up with a highly customized solution.
Advantages of this technique
- It offers customization to a great extent.
- Java is open source, and hence it would be a cost-effective solution.
- The Java program can be executed from the command line and does not require any
  additional setup to host.
Disadvantages of this technique
- It requires a high level of programming skill.
- Error tracking and debugging become difficult.
Summary
The integration of HANA with Hadoop enables customers to move data between Hive and
Hadoop's Distributed File System and SAP HANA. Hadoop is good at processing bulk data at
a very low cost. Hence, if a particular chunk of data is not of much value to users and
they do not access it often, storing it in HANA would be cost-prohibitive.
By combining SAP HANA and Hadoop together, customers get the power of instant access
with SAP HANA and infinite scale with Hadoop. This gives SAP users a broad range of
options for storing and analyzing new types of data and the ability to create applications
that can uncover new business opportunities from vast amounts of data that would not
have been previously possible.
Reference Material
http://blog.cloudera.com/blog/
https://www.brighttalk.com/webcast/9727/86361
http://scn.sap.com/community/developer-center/hana/blog/2014/01/27/exporting-and-importing-data-to-hana-with-hadoop-sqoop
http://www.saphana.com/docs/DOC-2934