Extending Your Data Infrastructure with Hadoop

Jonathan Seidman | Solutions Architect
Big Data TechCon
April 10, 2013

©2013 Cloudera, Inc. All Rights Reserved.
Who I Am
    • Solutions Architect, Partner Engineering Team.
    • Co-founder/organizer of Chicago Hadoop User Group and
      Chicago Big Data.
    • jseidman@cloudera.com
    • @jseidman
    • cloudera.com/careers




What I’ll be Talking About
    •   Big data challenges with current data integration approaches.
    •   How is Hadoop being leveraged with existing data infrastructures?
    •   Hadoop integration – the big picture.
    •   Deeper dive into tool categories.
          •   Data import/export
          •   Data Integration
          •   BI/Analytics
    •   Putting the pieces together.
    •   BI/Analytics with Hadoop.
    •   New approaches to data analysis with Hadoop.

What is Apache Hadoop?

Apache Hadoop is an open source platform for data storage and processing that is:
  • Scalable
  • Fault tolerant
  • Distributed

CORE HADOOP SYSTEM COMPONENTS
  • Hadoop Distributed File System (HDFS) – self-healing, high-bandwidth clustered storage.
  • MapReduce – distributed computing framework.

 Has the Flexibility to Store and Mine Any Type of Data
   • Ask questions across structured and unstructured data that were previously impossible to ask or solve.
   • Not bound by a single schema.
 Excels at Processing Complex Data
   • Scale-out architecture divides workloads across multiple nodes.
   • Flexible file system eliminates ETL bottlenecks.
 Scales Economically
   • Can be deployed on commodity hardware.
   • Open source platform guards against vendor lock-in.
Current Challenges
    Limitations of Existing Data Management Systems




The Transforming of Transformation

Enterprise Applications, OLTP, ODS → Extract/Transform/Load → Data Warehouse (Query, Transform) → Business Intelligence
Volume, Velocity, Variety Cause Capacity Problems

1  Slow Data Transformations = Missed ETL SLAs.
2  Slow Queries = Frustrated Business Users.

Enterprise Applications, OLTP → Extract/Transform/Load (1) → Data Warehouse (Query (2), Transform (1)) → Business Intelligence
Data Warehouse Optimization

Enterprise Applications, OLTP, ODS → ETL → Hadoop (Store, Transform, Query) → Data Warehouse (Query – high $/byte) → Business Intelligence
The Key Benefit: Agility/Flexibility

Schema-on-Write (RDBMS):
• Prescriptive Data Modeling:
    • Create static DB schema
    • Transform data into RDBMS
    • Query data in RDBMS format
• New columns must be added explicitly before new data can propagate into the system.
• Good for Known Unknowns (Repetition)

Schema-on-Read (Hadoop):
• Descriptive Data Modeling:
    • Copy data in its native format
    • Create schema + parser
    • Query data in its native format
• New data can start flowing any time and will appear retroactively once the schema/parser properly describes it.
• Good for Unknown Unknowns (Exploration)
Not Just Transformation
     Other Ways Hadoop is Being Leveraged




Data Archiving Before Hadoop

Data Warehouse → Tape Archive
Active Archiving with Hadoop

Data Warehouse → Hadoop
Offloading Analysis

Data Warehouse → Business Intelligence ← Hadoop
Exploratory Analysis

Developers, Business Users, Analysts → Hadoop ← Data Warehouse
The Common Themes?


     1   Offload expensive storage and processing
         to Hadoop
         • Complement, not replace


     2   Reduce strain on the data warehouse
         • Let it focus on what it was designed to do:
            • High speed queries on high value relational data
         • Increase ROI of existing relational stores



Economics: Return on Byte

Return on Byte (ROB) = Value of Data / Cost of Storing Data

• High ROB data
• Low ROB data (but still a ton of aggregate value)
Use Case: A Major Financial Institution

The Challenge:
• Current EDW at capacity; cannot support growing data depth and width.
• Performance issues in business critical apps; little room for innovation.

Before – data warehouse workload: Operational (44%), ELT Processing (42%), Analytics (11%).
After – data warehouse: Operational (50%), Analytics (50%); Hadoop: analytics processing and storage.

The Solution:
• Hadoop offloads data storage (S), processing (T) & some analytics (Q) from the EDW.
• EDW resources can now be focused on repeatable operational analytics.
• Month data scan in 4 secs vs. 4 hours.
Hadoop Integration
     Some Definitions




Data Integration
     • Process in which heterogeneous data from multiple sources is
       retrieved and transformed to provide a unified view.
     • ETL (Extract, transform and load) is a central component of DI.




ETL – The Wikipedia Definition
     •   Extract, transform and load (ETL) is a process in database
         usage and especially in data warehousing that involves:
           • Extracting data from outside sources
           • Transforming it to fit operational needs
           • Loading it into the end target (DB or data warehouse)




                   http://en.wikipedia.org/wiki/Extract,_transform,_load

BI – The Forrester Research Definition

     "Business Intelligence is a set of methodologies, processes,
     architectures, and technologies that transform raw data into
     meaningful and useful information used to enable more effective
     strategic, tactical, and operational insights and decision-making.”
     *




     * http://en.wikipedia.org/wiki/Business_intelligence

Hadoop Integration
     The Big Picture




[Diagram: the big picture – Data Warehouse/RDBMS and Streaming Data flow through Data Import/Export tools into Hadoop; Data Integration tools, BI/Analytics tools, and NoSQL systems integrate alongside the cluster.]
Example Use Case

• Online retailer.
• Customer, order data stored in data warehouse.
Example Use Case

• Now wants to leverage behavioral (non-transactional) data, e.g. products viewed on-line, to drive recommendations, etc.
So Where is This Data?
     • Record of page views is stored in session logs as users browse
       site.
     • So how do we get it out?


    [2002/11/27 18:58:28.294 -0600] "GET /products/view/952 HTTP/1.1" 200 701 "-"
    "Mozilla/5.0 (PlayBook; U; RIM Tablet OS 2.0.1; en-US) AppleWebKit/535.8+
    (KHTML, like Gecko) Version/7.2.0.1 Safari/535.8+" "age=63&gender=0&
    incomeCategory=4&session=51620033&user=-2118869394&region=9&userType=0"



Load Raw Logs into Data Warehouse?

• Very expensive to store.
• Difficult to model and process semi-structured data.
• Oh, and also, very expensive.

[Diagram: Web Servers → Logs → DWH]
ETL In/Into Data Warehouse?

• Time and resource intensive with larger log sizes.
• No archive of raw logs – potentially valuable data is thrown away.
    • How do you decide which fields have value?
• Still, some companies are doing things like this.

[Diagram: Web Servers → Logs → ETL → DWH]
Hadoop Integration
     Data Import/Export Tools




Data Import/Export Tools

[Diagram: Data Warehouse/RDBMS and Streaming Data flow through Data Import/Export tools into Hadoop.]
Flume in 2 Minutes
          Or, why you shouldn’t be using scripts for data movement.


     • Reliable, distributed, and available system for efficient
       collection, aggregation and movement of streaming data, e.g.
       logs.
     • Open-source, Apache project.




Flume in 2 Minutes

A Flume agent is a JVM process hosting components:

External Source (web server, Twitter, JMS, system logs, …) → Source → Channel → Sink → Destination

• Source – consumes events and forwards them to channels.
• Channel – stores events until consumed by sinks (file, memory, JDBC).
• Sink – removes events from the channel and puts them into the external destination.
Flume in 2 Minutes
     • Reliable – events are stored in channel until delivered to next stage.
     • Recoverable – events can be persisted to disk and recovered in the
       event of failure.

Flume Agent: Source → Channel → Sink → Destination
Flume in 2 Minutes
     • Supports multi-hop flows for more complex processing.
     • Also fan-out, fan-in.




Flume Agent (Source → Channel → Sink) → Flume Agent (Source → Channel → Sink)
Flume in 2 Minutes

     • Declarative
        • No coding required.
        • Configuration specifies
          how components are
          wired together.




Flume in 2 Minutes
     •   Similar systems:
           • Scribe
           • Chukwa




Sqoop Overview
     • Apache project designed to ease import and export of data
       between Hadoop and relational databases.
     • Provides functionality to do bulk imports and exports of data
       with HDFS, Hive and HBase.
     • Java based. Leverages MapReduce to transfer data in parallel.




Sqoop Overview
     • Uses a “connector” abstraction.
     • Two types of connectors
           • Standard connectors are JDBC based.
           • Direct connectors use native database interfaces to improve
             performance.
     •   Direct connectors are available for many open-source and
         commercial databases – MySQL, PostgreSQL, Oracle, SQL
         Server, Teradata, etc.
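
As an illustrative sketch (database, table, and paths here are hypothetical), a direct-mode import from MySQL might look like:

```shell
# Hypothetical connection string and table; --direct switches from the
# generic JDBC path to the native MySQL connector for faster transfers.
sqoop import \
  --connect jdbc:mysql://db.example.com/retail \
  --table orders \
  --direct \
  --target-dir /data/orders
```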

Sqoop Import Flow

1. Client runs the import; Sqoop collects metadata from the database.
2. Sqoop generates code and executes a MapReduce job.
3. Map tasks pull data from the database in parallel and write to Hadoop.
Sqoop Limitations
     Sqoop has some limitations, including:
     •   Poor support for security.
                     $ sqoop import --username scott --password tiger…
           •   Sqoop can read command line options from an option file, but this still
               has holes.
      • Error-prone syntax.
     • Tight coupling to JDBC model – not a good fit for non-RDBMS
       systems.
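
One partial mitigation (file name and connection details are hypothetical): keep credentials out of the command line with an options file, and use -P to prompt for the password instead of passing it as an argument:

```shell
# import-opts.txt would contain one option per line, e.g.:
#   import
#   --connect
#   jdbc:mysql://db.example.com/retail
#   --username
#   scott
#   -P
sqoop --options-file import-opts.txt --table orders
```

The options file still needs filesystem permissions to protect it, which is why the slide calls this approach incomplete.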


Fortunately…
     Sqoop 2 (incubating) will address many of these
     limitations:
     • Adds a web-based GUI.
     • Centralized configuration.
     • More flexible model.
     • Improved security model.




MapReduce For Transformation
     •   Standard interface is Java, but higher-level interfaces are
         commonly used:
            • Apache Hive – provides a SQL-like interface to data in Hadoop.
            • Apache Pig – dataflow language for declaring a sequence of
              transformations.
     •   Both Hive and Pig convert queries into MapReduce jobs and
         submit to Hadoop for execution.
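
As an example of the Hive approach (table and column names are hypothetical), a simple aggregation like this is compiled into MapReduce jobs behind the scenes:

```sql
-- Count page views per product; Hive plans and runs this as MapReduce.
SELECT product_id, COUNT(*) AS views
FROM page_views
GROUP BY product_id;
```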


Example Implementation with OSS Tools
     All the tools we need for moving and transforming data:
     •   Hadoop provides:
           • HDFS for storage
           • MapReduce for Processing
     •   Also components for process orchestration:
           •   Oozie, Azkaban
     •   And higher-level abstractions:
           •   Pig, Hive, etc.
Data Flow with OSS Tools

Web Servers → Raw Logs (Flume, etc.) → Hadoop (Transform) → Load (Sqoop, etc.)

Process orchestration: Oozie, etc.
Flume Configuration for Example Use Case

• Spooling source watches a directory for new files and moves them into channels. Renames files when processed.
• HDFS sink ingests files into HDFS.
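
The configuration itself was shown as an image in the original slide; a minimal sketch of an agent wiring a spooling directory source to an HDFS sink (agent name and paths are assumptions) might look like:

```
# Components of agent "a1"
a1.sources = src1
a1.channels = ch1
a1.sinks = snk1

# Spooling directory source: watches for new files, renames them once ingested
a1.sources.src1.type = spooldir
a1.sources.src1.spoolDir = /var/spool/weblogs
a1.sources.src1.channels = ch1

# File channel buffers events until the sink consumes them
a1.channels.ch1.type = file

# HDFS sink writes events into Hadoop
a1.sinks.snk1.type = hdfs
a1.sinks.snk1.hdfs.path = /data/weblogs/raw
a1.sinks.snk1.channel = ch1
```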
Pig Code for Example Use Case




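
The Pig script was shown as an image in the original slide; a hedged sketch of the kind of transformation it performs (field extraction and paths are assumptions based on the log and output formats shown nearby) might be:

```
-- Load raw log lines and emit timestamp|user|product for product views
raw   = LOAD '/data/weblogs/raw' USING TextLoader() AS (line:chararray);
views = FOREACH raw GENERATE
          REGEX_EXTRACT(line, '\\[(.+?)\\]', 1)           AS ts,
          REGEX_EXTRACT(line, 'user=(-?\\d+)', 1)         AS user,
          REGEX_EXTRACT(line, '/products/view/(\\d+)', 1) AS product;
hits  = FILTER views BY product IS NOT NULL;
STORE hits INTO '/data/weblogs/views' USING PigStorage('|');
```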
Importing Final Data into DWH
     Output from Pig script stored in HDFS:

        2012-09-16T23:03:16.294Z|1461333428|290
        2012-09-20T04:48:52.294Z|772136124|749
        2012-09-24T03:51:16.294Z|1144520081|222
        2012-09-24T12:29:40.294Z|628304774|407

     Moved into destination table with Sqoop:
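
The Sqoop command was shown as an image; a sketch (connection string and table name are hypothetical, the delimiter follows the output above) might look like:

```shell
sqoop export \
  --connect jdbc:mysql://dwh.example.com/retail \
  --table product_views \
  --export-dir /data/weblogs/views \
  --input-fields-terminated-by '|'
```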




But…
     •   Some DI services are not provided in this stack:
           • Metadata repository
           • Master Data Management
           • Data lineage
           • …




Also…
• …very low level:
    • Requires knowledgeable developers to implement transformations. Not a whole lot of these right now.

[Diagram: Hadoop Developers vs. Data Modelers, ETL Developers, etc.]
Hadoop Integration
     Data Integration Tools




Data Integration Tools




Pentaho
     • Existing BI tools extended to support Hadoop.
     • Provides data import/export, transformation, job orchestration,
       reporting, and analysis functionality.
      • Supports integration with HDFS, Hive and HBase.
     • Community and Enterprise Editions offered.




Pentaho
     • Primary component is Pentaho
       Data Integration (PDI), also
       known as Kettle.
     • PDI Provides a graphical drag-
       and-drop environment for
       defining ETL jobs, which interface
       with Java MapReduce to execute
       in-cluster transformations.


Pentaho/Cloudera Demo
     •   Ingest data into HDFS using Flume
     •   Pre-process the reference data
     •   Copy reference files into Hadoop
     •   Execute transformations in-cluster
     •   Load Hive
     •   Query Hive
     •   Discover, Analyze and Visualize

Pentaho MapReduce

Raw web log event:

96.239.76.17 - - [31/Dec/2000:14:11:59 -0800] "GET /rate?movie=1207&rating=4 HTTP/1.1" 200 7 "http://clouderamovies.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11" "USER=1"

Transformed, joined record:

5|Monty Python's Life of Brian|1979|5794|M|35-44|Salesman|53703|Madison|WI|2000|5|5|43295|20th|false|false|false|false|true|false|false|false|false|false|false|false|false|false|false|false|false|false
Structure → Analysis & Visualization

5|Monty Python's Life of Brian|1979|5794|M|35-44|Salesman|53703|Madison|WI|2000|5|5|43295|20th|false|false|false|false|true|false|false|false|false|false|false|false|false|false|false|false|false|false
...
Informatica

• Data import/export
• Metadata services
• Data lineage
• Transformation
• …
Informatica – Data Import

Access Data → Pre-Process → Ingest Data

• Sources: web servers; databases and data warehouses; message queues, email, social media; ERP, CRM; mainframe.
• PowerExchange accesses the data (batch, CDC, real-time); PowerCenter pre-processes it (e.g. filter, join, cleanse).
• Data is ingested into HDFS and Hive.
Informatica – Data Export

Extract Data → Post-Process → Deliver Data

• Data is extracted from HDFS; PowerCenter post-processes it (e.g. transform to target schema).
• PowerExchange delivers the data (batch, real-time) to web servers, databases and data warehouses, ERP, CRM, and mainframe systems.
Informatica Data Import/Export

1. Create ingest or extract mapping.
2. Create Hadoop connection.
3. Configure workflow.
4. Configure Hive properties.
Informatica – Data Transformation




Hadoop Integration
     Business Intelligence/Analytic Tools




Business Intelligence/Analytics Tools




Business Intelligence/Analytics Tools

[Diagram: BI/analytics tools connecting to relational databases, data warehouses, …]
ODBC Driver

• Most of these tools use the ODBC standard.
• Since Hive is a SQL-like system it’s a good fit for ODBC.
• Several vendors, including Cloudera, make ODBC drivers available for Hadoop.
• JDBC is also used by some products for Hive integration.

[Diagram: BI/Analytics Tools → ODBC driver → HiveQL → Hive Server → Hive]
Hive Integration

HiveServer1
• No support for concurrent queries. Requires running multiple HiveServers for multiple users.
• No support for security.
• The Thrift API in the Hive Server doesn’t support common JDBC/ODBC calls.

HiveServer2
• Adds support for concurrent queries. Can support multiple users.
• Adds security support with Kerberos.
• Better support for JDBC and ODBC.
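
For example, a client can connect to HiveServer2 over JDBC with the Beeline CLI (host, port, and user here are hypothetical):

```shell
beeline -u "jdbc:hive2://hiveserver.example.com:10000/default" -n analyst
```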
Still Some Limitations With This Model
     • Hive does not have full SQL support.
     • Dependent on Hive – data must be loaded in Hive to be
       available.
     • Queries are high-latency.




Hadoop Integration
     Next Generation BI/Analytics Tools




New “Hadoop Native” Tools

You can think of Hadoop as becoming a shared execution environment supporting new data analysis tools…

[Diagram: BI/Analytics tools and New Query Engines running on Hadoop alongside MapReduce.]
Hadoop Native Tools – Advantages
     •   New data analysis tools:
           • Designed and optimized for working with Hadoop data and large
             data sets.
           • Remove reliance on Hive for accessing data – can work with any
             data in Hadoop.
     •   New query engines:
           • Provide the ability to run low-latency queries against Hadoop data.
           • Make it possible to do ad hoc, exploratory analysis of data in
             Hadoop.
Datameer




Datameer




New Query Engines – Impala
     •   Fast, interactive queries on data stored in Hadoop (HDFS and HBase).
           •   But also designed to support long running queries.
     •   Uses familiar Hive Query Language and shares metastore.
     •   Tight integration with Hadoop.
           •   Reads common Hadoop file formats.
           •   Runs on Hadoop DataNodes.
     •   High Performance
           •   C++, not Java.
           •   Runtime code generation.
           •   Entirely re-designed execution engine bypasses MapReduce.
     •   Currently in beta, GA expected in April.
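     Because Impala speaks HiveQL and shares the Hive metastore, existing JDBC-based tooling can often be pointed at an impalad with little more than a URL change. The sketch below illustrates this (hostnames are hypothetical; 21050 is Impala's default JDBC port, and early releases required disabling SASL via ";auth=noSasl" on unsecured clusters – check the docs for your version):

```java
// Sketch: the same HiveQL statement addressed to Hive vs. Impala over JDBC.
// Hostnames are hypothetical; 21050 is Impala's default JDBC port.
public class ImpalaJdbcExample {

    static String hiveUrl(String host) {
        return "jdbc:hive2://" + host + ":10000/default";
    }

    // On an unsecured cluster, early Impala releases required ";auth=noSasl"
    // when connecting with the Hive JDBC driver.
    static String impalaUrl(String host) {
        return "jdbc:hive2://" + host + ":21050/;auth=noSasl";
    }

    public static void main(String[] args) {
        // One HiveQL statement: batch execution via Hive, interactive
        // execution via Impala, because both read the same tables from
        // the shared metastore.
        String query = "SELECT product, SUM(sales) FROM orders GROUP BY product";
        System.out.println(hiveUrl("gateway.example.com") + " -> " + query);
        System.out.println(impalaUrl("impalad01.example.com") + " -> " + query);
        // With a live cluster:
        //   Connection conn = DriverManager.getConnection(impalaUrl(host));
        //   conn.createStatement().executeQuery(query);
    }
}
```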

Impala Architecture
     [Diagram: a common Hive SQL/ODBC interface (SQL App → ODBC) and unified
     metadata and scheduling (Hive Metastore, YARN, HDFS NameNode, State Store)
     sit above a fully MPP, distributed layer in which every node runs a Query
     Planner, Query Coordinator, and Query Exec Engine alongside an HDFS
     DataNode and HBase]
Cloudera Impala Details
Client submits query through ODBC
     [Diagram: the SQL request flows from the SQL App through the ODBC driver
     to the Query Planner on one impalad]
Cloudera Impala Details
Planner turns request into a collection of plan fragments.
Coordinator initiates execution on remote impalads.
     [Diagram: plan fragments fan out from the coordinating impalad to the
     Query Exec Engines on every node – fully MPP, distributed]
Cloudera Impala Details
Impalads participating in query access local data in HDFS or HBase
     [Diagram: each participating impalad performs local direct reads from its
     co-located HDFS DataNode or HBase]
Cloudera Impala Details
Intermediate results are streamed between impalads.
Final results are streamed back to the client.
     [Diagram: in-memory transfers move intermediate results between impalads;
     the coordinator streams SQL results back through ODBC to the SQL App]
BI Example – Tableau with Impala




Development Challenges
     •   According to TDWI research*:
             • 28% of users feel software tools are few and immature.
             • 25% note the lack of metadata management.




         *TDWI Best Practices Report: Integrating Hadoop Into Business Intelligence and Data Warehousing, Philip Russom, TDWI Research:
         http://tdwi.org/research/2013/04/tdwi-best-practices-report-integrating-hadoop-into-business-intelligence-and-data-warehousing.aspx


The Cloudera Developer Kit
     •   The CDK is an open-source collection of libraries, tools, examples, and
         documentation targeted at simplifying the most common tasks when
         working with Hadoop.
     •   The first module released is the CDK Data module – APIs that drastically
         simplify working with datasets in Hadoop filesystems. The Data module
         provides:
           •   Automatic serialization and deserialization of Java POJOs as well as
               Avro Records.
           •   Automatic compression.
           •   File and directory layout and management.
           •   Automatic partitioning based on configurable functions.
           •   A metadata provider plugin interface to integrate with centralized
               metadata management systems.
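     To make the partitioning point concrete, here is an illustrative sketch of what "automatic partitioning based on configurable functions" means: a function maps each record to a directory path, e.g. by hashing a field into a fixed number of buckets. This shows the concept only – it is not the CDK Data API itself (see the GitHub repository for the real interfaces):

```java
// Illustrative sketch of hash-based partitioning – the concept behind the
// CDK Data module's "automatic partitioning", not the CDK API itself.
public class PartitionSketch {

    // A configurable partition function: bucket records by a hash of a field.
    // Records with the same key deterministically land in the same directory,
    // so queries restricted to a key can skip the other buckets entirely.
    static String partitionPath(String dataset, String key, int buckets) {
        int bucket = Math.floorMod(key.hashCode(), buckets);
        return dataset + "/bucket=" + bucket;
    }

    public static void main(String[] args) {
        // The same key always maps to the same partition directory:
        System.out.println(partitionPath("/data/users", "alice", 16));
        System.out.println(partitionPath("/data/users", "alice", 16));
        System.out.println(partitionPath("/data/users", "bob", 16));
    }
}
```

     A framework that owns this mapping can also own the file layout and metadata bullets above, which is why bundling them into one Data module removes so much boilerplate from Hadoop applications.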

Cloudera Developer Kit
     •   Source code, examples, documentation, etc.:
           •   https://github.com/cloudera/cdk




Questions?
     •   Or see me at the Cloudera booth – 11:00-1:00.





Distributed Data Analysis with Hadoop and R - OSCON 2011Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011
 
Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
 
Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011
 
Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010
 
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
 

Dernier

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 

Dernier (20)

DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 

Extending data infrastructure with Hadoop

  • 1. Extending Your Data Infrastructure with Hadoop. Jonathan Seidman | Solutions Architect. Big Data TechCon, April 10, 2013. ©2013 Cloudera, Inc. All Rights Reserved.
  • 2. Who I Am • Solutions Architect, Partner Engineering Team. • Co-founder/organizer of Chicago Hadoop User Group and Chicago Big Data. • jseidman@cloudera.com • @jseidman • cloudera.com/careers ©2013 Cloudera, Inc. All Rights 2 Reserved.
  • 3. What I’ll be Talking About • Big data challenges with current data integration approaches. • How is Hadoop being leveraged with existing data infrastructures? • Hadoop integration – the big picture. • Deeper dive into tool categories. • Data import/export • Data integration • BI/analytics • Putting the pieces together. • BI/analytics with Hadoop. • New approaches to data analysis with Hadoop. ©2013 Cloudera, Inc. All Rights Reserved.
  • 4. What is Apache Hadoop? Apache Hadoop is an open source platform for data storage and processing that is scalable, fault tolerant, and distributed. Core Hadoop system components: the Hadoop Distributed File System (HDFS), a self-healing, high-bandwidth clustered storage layer, and MapReduce, a distributed computing framework. Has the flexibility to store and mine any type of data: ask questions across structured and unstructured data that were previously impossible to ask or solve; not bound by a single schema. Excels at processing complex data: scale-out architecture divides workloads across multiple nodes; a flexible file system eliminates ETL bottlenecks. Scales economically: can be deployed on commodity hardware; an open source platform guards against vendor lock-in. ©2013 Cloudera, Inc. All Rights Reserved.
  • 5. Current Challenges Limitations of Existing Data Management Systems ©2013 Cloudera, Inc. All Rights 5 Reserved.
  • 6. The Transforming of Transformation (Diagram: Enterprise Applications, OLTP, and ODS sources feed Extract, Transform, and Load steps into the Data Warehouse, where a further Transform runs; Business Intelligence queries the warehouse.) ©2013 Cloudera, Inc. All Rights Reserved.
  • 7. Volume, Velocity, Variety Cause Capacity Problems: 1 Slow data transformations = missed ETL SLAs. 2 Slow queries = frustrated business users. (Diagram: the slide-6 pipeline from Enterprise Applications/OLTP through Extract-Transform-Load into the Data Warehouse and on to Business Intelligence, with the transform steps marked 1 and the BI query marked 2.) ©2013 Cloudera, Inc. All Rights Reserved.
  • 8. Data Warehouse Optimization (Diagram: Enterprise Applications, OLTP, and ODS feed Hadoop, which stores raw data and runs the ETL transform, and which can itself be queried; results load into the high-$/byte Data Warehouse, which Business Intelligence queries.) ©2013 Cloudera, Inc. All Rights Reserved.
  • 9. The Key Benefit: Agility/Flexibility Schema-on-Write (RDBMS): Schema-on-Read (Hadoop): • Prescriptive Data Modeling: • Descriptive Data Modeling: • Create static DB schema • Copy data in its native format • Transform data into RDBMS • Create schema + parser • Query data in RDBMS format • Query Data in its native format • New columns must be added explicitly • New data can start flowing any time and before new data can propagate into will appear retroactively once the the system. schema/parser properly describes it. • Good for Known Unknowns • Good for Unknown Unknowns (Repetition) (Exploration) ©2013 Cloudera, Inc. All Rights 9 Reserved.
  • 10. Not Just Transformation Other Ways Hadoop is Being Leveraged ©2013 Cloudera, Inc. All Rights 10 Reserved.
  • 11. Data Archiving Before Hadoop Data Tape Warehouse Archive ©2013 Cloudera, Inc. All Rights 11 Reserved.
  • 12. Active Archiving with Hadoop Data Hadoop Warehouse ©2013 Cloudera, Inc. All Rights 12 Reserved.
  • 13. Offloading Analysis Data Warehouse Business Intelligence Hadoop ©2013 Cloudera, Inc. All Rights 13 Reserved.
  • 14. Exploratory Analysis Developers Business Analysts Users Hadoop Data Warehouse ©2013 Cloudera, Inc. All Rights 14 Reserved.
  • 15. The Common Themes? 1 Offload expensive storage and processing to Hadoop • Complement, not replace 2 Reduce strain on the data warehouse • Let it focus on what it was designed to do: • High speed queries on high value relational data • Increase ROI of existing relational stores ©2013 Cloudera, Inc. All Rights 15 Reserved.
  • 16. Economics: Return on Byte. Return on Byte (ROB) = Value of Data / Cost of Storing Data. High ROB vs. low ROB (but still a ton of aggregate value). ©2013 Cloudera, Inc. All Rights Reserved.
  • 17. Use Case: A Major Financial Institution. The Challenge: • Current EDW at capacity; cannot support growing data depth and width. • Performance issues in business-critical apps; little room for innovation. The Solution: • Hadoop offloads data storage (S), processing (T) and some analytics (Q) from the EDW. • EDW resources can now be focused on repeatable operational analytics. • A month of data scanned in 4 secs vs. 4 hours. (Workload chart: the warehouse's roughly even operational/analytics split shifts as ELT processing and analytics storage move to Hadoop.) ©2013 Cloudera, Inc. All Rights Reserved.
  • 18. Hadoop Integration Some Definitions ©2013 Cloudera, Inc. All Rights 18 Reserved.
  • 19. Data Integration • Process in which heterogeneous data from multiple sources is retrieved and transformed to provide a unified view. • ETL (Extract, transform and load) is a central component of DI. ©2013 Cloudera, Inc. All Rights 19 Reserved.
  • 20. ETL – The Wikipedia Definition • Extract, transform and load (ETL) is a process in database usage and especially in data warehousing that involves: • Extracting data from outside sources • Transforming it to fit operational needs • Loading it into the end target (DB or data warehouse) http://en.wikipedia.org/wiki/Extract,_transform,_load ©2013 Cloudera, Inc. All Rights 20 Reserved.
  • 21. BI – The Forrester Research Definition "Business Intelligence is a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information used to enable more effective strategic, tactical, and operational insights and decision-making.” * * http://en.wikipedia.org/wiki/Business_intelligence ©2013 Cloudera, Inc. All Rights 21 Reserved.
  • 22. Hadoop Integration The Big Picture ©2013 Cloudera, Inc. All Rights 22 Reserved.
  • 23. BI/Analytics Tools Data Warehouse /RDBMS Streaming Data Data Import/Export Data Integration Tools NoSQL ©2013 Cloudera, Inc. All Rights 23 Reserved.
  • 24. Example Use Case ©2013 Cloudera, Inc. All Rights 24 Reserved.
  • 25. Example Use Case • Online retailer. • Customer, order data stored in data warehouse. ©2013 Cloudera, Inc. All Rights 25 Reserved.
  • 26. Example Use Case • Now wants to leverage behavioral (non- transactional) data, e.g. products viewed on-line to drive recommendations, etc. ©2013 Cloudera, Inc. All Rights 26 Reserved.
  • 27. So Where is This Data? • Record of page views is stored in session logs as users browse site. • So how do we get it out? [2002/11/27 18:58:28.294 -0600] "GET /products/view/952 HTTP/1.1" 200 701 "-" "Mozilla/5.0 (PlayBook; U; RIM Tablet OS 2.0.1; en-US) AppleWebKit/535.8+ (KHTML, like Gecko) Version/7.2.0.1 Safari/535.8+" ”age=63&gender=0& incomeCategory=4&session=51620033&user=-2118869394&region=9&userType=0" ©2013 Cloudera, Inc. All Rights 27 Reserved.
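To make this concrete, here is a sketch (not from the deck) of pulling the behavioral fields out of a session-log line like the one above with plain Python. The regex and field names are assumptions based on the sample line, and the user-agent string is shortened for brevity:

```python
import re
from urllib.parse import parse_qs

# The session-log line from the slide (user-agent string shortened here).
line = ('[2002/11/27 18:58:28.294 -0600] "GET /products/view/952 HTTP/1.1" '
        '200 701 "-" "Mozilla/5.0" '
        '"age=63&gender=0&incomeCategory=4&session=51620033'
        '&user=-2118869394&region=9&userType=0"')

# Timestamp, request, status, byte count, referrer, user agent, then the
# trailing key=value payload that carries the behavioral fields.
pattern = (r'\[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
           r'(?P<status>\d+) (?P<bytes>\d+) "[^"]*" "[^"]*" "(?P<payload>[^"]*)"')
fields = re.match(pattern, line).groupdict()
payload = {k: v[0] for k, v in parse_qs(fields['payload']).items()}

# A product-view line reduces to a (user, session, product id) tuple.
product_id = fields['path'].rsplit('/', 1)[-1]
event = (payload['user'], payload['session'], product_id)
```

Each page view collapses into a small tuple that the transformation stages later in the deck can aggregate.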
  • 28. Load Raw Logs into Data Warehouse? • Very expensive to store. • Difficult to model and process semi-structured Web Logs DWH data. Servers • Oh, and also, very expensive. ©2013 Cloudera, Inc. All Rights 28 Reserved.
  • 29. ETL In/Into Data Warehouse? • Time and resource intensive with larger log sizes. • No archive of raw logs – potentially valuable data is Logs ETL DWH Web thrown away. Servers • How do you decide which fields have value? • Still, some companies are doing things like this. ©2013 Cloudera, Inc. All Rights 29 Reserved.
  • 30. Hadoop Integration Data Import/Export Tools ©2013 Cloudera, Inc. All Rights 30 Reserved.
  • 31. BI/Analytics Tools Data Warehouse /RDBMS Streaming Data Data Import/Export Data Integration Tools NoSQL ©2013 Cloudera, Inc. All Rights 31 Reserved.
  • 32. Data Import/Export Tools Data Warehouse /RDBMS Streaming Data Data Import/Export ©2013 Cloudera, Inc. All Rights 32 Reserved.
  • 33. Flume in 2 Minutes Or, why you shouldn’t be using scripts for data movement. • Reliable, distributed, and available system for efficient collection, aggregation and movement of streaming data, e.g. logs. • Open-source, Apache project. ©2013 Cloudera, Inc. All Rights 33 Reserved.
  • 34. Flume in 2 Minutes A Flume Agent is a JVM process hosting components: the Source consumes events from an external source (web server, Twitter, JMS, system logs, …) and forwards them to channels; the Channel (file, memory, JDBC) stores events until consumed by sinks; the Sink removes events from the channel and puts them into an external destination. ©2013 Cloudera, Inc. All Rights Reserved.
  • 35. Flume in 2 Minutes • Reliable – events are stored in channel until delivered to next stage. • Recoverable – events can be persisted to disk and recovered in the event of failure. Flume Agent Source Channel Sink Destination ©2013 Cloudera, Inc. All Rights 35 Reserved.
  • 36. Flume in 2 Minutes • Supports multi-hop flows for more complex processing. • Also fan-out, fan-in. Flume Agent Flume Agent Sourc Channel Sink Sourc Channel Sink e e ©2013 Cloudera, Inc. All Rights 36 Reserved.
  • 37. Flume in 2 Minutes • Declarative • No coding required. • Configuration specifies how components are wired together. ©2013 Cloudera, Inc. All Rights 37 Reserved.
  • 38. Flume in 2 Minutes • Similar systems: • Scribe • Chukwa ©2013 Cloudera, Inc. All Rights 38 Reserved.
  • 39. Sqoop Overview • Apache project designed to ease import and export of data between Hadoop and relational databases. • Provides functionality to do bulk imports and exports of data with HDFS, Hive and HBase. • Java based. Leverages MapReduce to transfer data in parallel. ©2012 Cloudera, Inc. All Rights 39 Reserved.
  • 40. Sqoop Overview • Uses a “connector” abstraction. • Two types of connectors • Standard connectors are JDBC based. • Direct connectors use native database interfaces to improve performance. • Direct connectors are available for many open-source and commercial databases – MySQL, PostgreSQL, Oracle, SQL Server, Teradata, etc. ©2012 Cloudera, Inc. All Rights 40 Reserved.
  • 41. Sqoop Import Flow: the client runs the import; Sqoop collects metadata from the database, generates code, and executes a MapReduce job; the parallel map tasks pull data from the database and write it to Hadoop. ©2012 Cloudera, Inc. All Rights Reserved.
  • 42. Sqoop Limitations Sqoop has some limitations, including: • Poor support for security: $ sqoop import --username scott --password tiger… puts credentials on the command line. • Sqoop can read command line options from an option file, but this still has holes. • Error-prone syntax. • Tight coupling to the JDBC model – not a good fit for non-RDBMS systems. ©2012 Cloudera, Inc. All Rights Reserved.
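One way to keep credentials off the command line, as the slide above notes, is an options file read with `--options-file`; newer Sqoop 1.x releases can also read the password from a file with `--password-file`. A hypothetical sketch (the connection string, file names, and paths are placeholders, not values from the deck):

```
# import.options: one option or value per line
import
--connect
jdbc:mysql://dbhost/sales
--username
scott
--password-file
/home/scott/.sqoop-password
```

This would be invoked as something like `sqoop --options-file import.options --table orders`, keeping the password out of shell history, though the password file itself still needs tight permissions, hence the slide's "still has holes".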
  • 43. Fortunately… Sqoop 2 (incubating) will address many of these limitations: • Adds a web-based GUI. • Centralized configuration. • More flexible model. • Improved security model. ©2012 Cloudera, Inc. All Rights 43 Reserved.
  • 44. MapReduce For Transformation • Standard interface is Java, but higher-level interfaces are commonly used: • Apache Hive – provides an SQL like interface to data in Hadoop. • Apache Pig – declarative language providing functionality to declare a sequence of transformations. • Both Hive and Pig convert queries into MapReduce jobs and submit to Hadoop for execution. ©2013 Cloudera, Inc. All Rights 44 Reserved.
  • 45. Example Implementation with OSS Tools All the tools we need for moving and transforming data: • Hadoop provides: • HDFS for storage • MapReduce for Processing • Also components for process orchestration: • Oozie, Azkaban • And higher-level abstractions: • Pig, Hive, etc. ©2013 Cloudera, Inc. All Rights 45 Reserved.
  • 46. Data Flow with OSS Tools Transform Raw Logs Hadoop Load Web Servers Flume, etc. Sqoop, etc. Process Orchestration Oozie, etc. ©2013 Cloudera, Inc. All Rights 46 Reserved.
  • 47. Flume Configuration for Example Use Case • Spooling source watches directory for new files and moves into channels. Renames files when processed. • HDFS sink ingests files into HDFS. ©2013 Cloudera, Inc. All Rights 47 Reserved.
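The configuration itself isn't reproduced in the transcript; a minimal sketch of what such a flume.conf might look like follows. The agent name, directories, and HDFS path are hypothetical placeholders, not values from the deck:

```properties
# Hypothetical flume.conf: spooling-directory source -> file channel -> HDFS sink
agent1.sources = logsrc
agent1.channels = ch1
agent1.sinks = hdfssink

# Spooling source: watches a directory and renames files once processed.
agent1.sources.logsrc.type = spooldir
agent1.sources.logsrc.spoolDir = /var/log/web/spool
agent1.sources.logsrc.channels = ch1

# File channel: persists events to disk for recoverability.
agent1.channels.ch1.type = file
agent1.channels.ch1.checkpointDir = /var/flume/checkpoint
agent1.channels.ch1.dataDir = /var/flume/data

# HDFS sink: ingests the events into HDFS as plain text.
agent1.sinks.hdfssink.type = hdfs
agent1.sinks.hdfssink.channel = ch1
agent1.sinks.hdfssink.hdfs.path = /data/weblogs/raw
agent1.sinks.hdfssink.hdfs.fileType = DataStream
```

The agent would be started with something like `flume-ng agent -n agent1 -f flume.conf`; note the declarative style from slide 37: components are wired together entirely in configuration, with no coding required.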
  • 48. Pig Code for Example Use Case ©2013 Cloudera, Inc. All Rights 48 Reserved.
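The Pig code on this slide appears only as a screenshot, so it isn't in the transcript. As a rough stand-in (plain Python, not the original script, with sample values that merely echo the rows shown on slide 49), the transformation it performs might look like:

```python
# Hypothetical stand-in for the slide's Pig script: raw parsed page-view
# events in, pipe-delimited (timestamp, user, product) records out.
raw = [
    ('2012-09-16T23:03:16.294Z', '1461333428', '/products/view/290'),
    ('2012-09-20T04:48:52.294Z', '772136124', '/products/view/749'),
    ('2012-09-20T04:49:01.000Z', '772136124', '/home'),  # not a product view
]

# FILTER: keep only product views; GENERATE: project (timestamp, user, product).
records = [
    (ts, user, path.rsplit('/', 1)[-1])
    for ts, user, path in raw
    if path.startswith('/products/view/')
]

# Pipe-delimited lines, ready for a Sqoop export into the warehouse table.
lines = ['|'.join(r) for r in records]
```

In real Pig this would be a LOAD, a FILTER, a FOREACH … GENERATE, and a STORE; the point is only to show the shape of the in-cluster transformation.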
  • 49. Importing Final Data into DWH Output from Pig script stored in HDFS: 2012-09-16T23:03:16.294Z|1461333428|290 2012-09-20T04:48:52.294Z|772136124|749 2012-09-24T03:51:16.294Z|1144520081|222 2012-09-24T12:29:40.294Z|628304774|407 Moved into destination table with Sqoop: ©2013 Cloudera, Inc. All Rights 49 Reserved.
  • 50. But… • Some DI services are not provided in this stack: • Metadata repository • Master Data Management • Data lineage • … ©2013 Cloudera, Inc. All Rights 50 Reserved.
  • 51. Also… • …very low level: requires knowledgeable developers to implement transformations, and there are not a whole lot of these right now. (Diagram: the skills gap between existing data modelers/ETL developers and Hadoop developers.) ©2013 Cloudera, Inc. All Rights Reserved.
  • 52. Hadoop Integration Data Integration Tools ©2013 Cloudera, Inc. All Rights 52 Reserved.
  • 53. BI/Analytics Tools Data Warehouse /RDBMS Streaming Data Data Import/Export Data Integration Tools NoSQL ©2013 Cloudera, Inc. All Rights 53 Reserved.
  • 54. Data Integration Tools ©2013 Cloudera, Inc. All Rights 54 Reserved.
  • 55. Pentaho • Existing BI tools extended to support Hadoop. • Provides data import/export, transformation, job orchestration, reporting, and analysis functionality. • Supports integration with HDFS, Hive and HBase. • Community and Enterprise Editions offered. ©2012 Cloudera, Inc. All Rights Reserved.
  • 56. Pentaho • Primary component is Pentaho Data Integration (PDI), also known as Kettle. • PDI Provides a graphical drag- and-drop environment for defining ETL jobs, which interface with Java MapReduce to execute in-cluster transformations. ©2012 Cloudera, Inc. All Rights 56 Reserved.
  • 57. Pentaho/Cloudera Demo • Ingest data into HDFS using Flume • Pre-process the reference data • Copy reference files into Hadoop • Execute transformations in-cluster • Load Hive • Query Hive • Discover, Analyze and Visualize 57
  • 58. Pentaho MapReduce 96.239.76.17 - - [31/Dec/2000:14:11:59 -0800] "GET /rate?movie=1207&rating=4 HTTP/1.1" 200 7 "http://clouderamovies.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11" "USER=1" 5|Monty Python's Life of Brian|1979|5794|M|35-44|Salesman|53703|Madison|WI|2000|5|5|43295|20th|false|false|false|false|true|false|false|false|false|false|false|false|false|false|false|false|false|false 58
  • 59. Structure → Analysis & Visualization 5|Monty Python's Life of Brian|1979|5794|M|35-44|Salesman|53703|Madison|WI|2000|5|5|43295|20th|false|false|false|false|true|false|false|false|false|false|false|false|false|false|false|false|false|false ... 59
  • 60. Informatica • Data import/export • Metadata services • Data lineage • Transformation • … ©2013 Cloudera, Inc. All Rights Reserved.
  • 61. Informatica – Data Import: PowerExchange accesses source data (web servers, databases, data warehouses, message queues, email, social media, ERP, CRM, mainframe) in batch, CDC, or real-time mode; PowerCenter pre-processes it (e.g. filter, join, cleanse); the data is then ingested into HDFS or Hive. ©2013 Cloudera, Inc. All Rights Reserved.
  • 62. Informatica – Data Export: PowerCenter extracts data from HDFS and post-processes it (e.g. transforms to the target schema); PowerExchange delivers it to databases, data warehouses, ERP, CRM, or mainframe targets in batch or real-time mode. ©2013 Cloudera, Inc. All Rights Reserved.
  • 63. Informatica Data Import/Export 1. Create Ingest or Extract Mapping 2. Create Hadoop Connection 3. Configure Workflow 4. Configure Hive Properties ©2013 Cloudera, Inc. All Rights 63 Reserved.
  • 64. Informatica – Data Transformation ©2013 Cloudera, Inc. All Rights 64 Reserved.
  • 65. Hadoop Integration Business Intelligence/Analytic Tools ©2013 Cloudera, Inc. All Rights 65 Reserved.
  • 66. BI/Analytics Tools Data Warehouse /RDBMS Streaming Data Data Import/Export Data Integration Tools NoSQL ©2013 Cloudera, Inc. All Rights 66 Reserved.
  • 67. Business Intelligence/Analytics Tools ©2013 Cloudera, Inc. All Rights 67 Reserved.
  • 68. Business Intelligence/Analytics Tools Relational Data … Databases Warehouses ©2013 Cloudera, Inc. All Rights 68 Reserved.
  • 69. ODBC Driver • Most of these tools use the ODBC standard. • Since Hive is an SQL-like system, it's a good fit for ODBC. • Several vendors, including Cloudera, make ODBC drivers available for Hadoop. • JDBC is also used by some products for Hive integration. (Stack: BI/Analytics Tools → ODBC driver → HiveQL → Hive Server → Hive.) ©2013 Cloudera, Inc. All Rights Reserved.
  • 70. Hive Integration. HiveServer1: • No support for concurrent queries; requires running multiple HiveServers for multiple users. • No support for security. • The Thrift API in the Hive Server doesn't support common JDBC/ODBC calls. HiveServer2: • Adds support for concurrent queries; can support multiple users. • Adds security support with Kerberos. • Better support for JDBC and ODBC. ©2013 Cloudera, Inc. All Rights Reserved.
  • 71. Still Some Limitations With This Model • Hive does not have full SQL support. • Dependent on Hive – data must be loaded in Hive to be available. • Queries are high-latency. ©2013 Cloudera, Inc. All Rights 71 Reserved.
  • 72. Hadoop Integration Next Generation BI/Analytics Tools ©2013 Cloudera, Inc. All Rights 72 Reserved.
  • 73. New “Hadoop Native” Tools You can think of Hadoop as becoming a shared execution environment supporting new data analysis tools: BI/analytics tools sitting on top of new query engines alongside MapReduce. ©2013 Cloudera, Inc. All Rights Reserved.
  • 74. Hadoop Native Tools – Advantages • New data analysis tools: • Designed and optimized for working with Hadoop data and large data sets. • Remove reliance on Hive for accessing data – can work with any data in Hadoop. • New query engines: • Provide ability to do low latency queries against Hadoop data. • Make it possible to do ad-hoc, exploratory analysis of data in Hadoop. ©2013 Cloudera, Inc. All Rights 74 Reserved.
  • 75. Datameer ©2013 Cloudera, Inc. All Rights 75 Reserved.
  • 76. Datameer ©2013 Cloudera, Inc. All Rights 76 Reserved.
  • 77. New Query Engines – Impala • Fast, interactive queries on data stored in Hadoop (HDFS and HBase). • But also designed to support long running queries. • Uses familiar Hive Query Language and shares metastore. • Tight integration with Hadoop. • Reads common Hadoop file formats. • Runs on Hadoop DataNodes. • High Performance • C++, not Java. • Runtime code generation. • Entirely re-designed execution engine bypasses MapReduce. • Currently in beta, GA expected in April. Confidential. ©2012 Cloudera, Inc. All 77 Rights Reserved.
  • 78. Impala Architecture Common Hive SQL and interface Unified metadata and scheduler SQL App Hive State Metastore YARN HDFS NN Store ODBC Query Planner Query Planner Fully MPP Query Planner Query Coordinator Query Coordinator Distributed Query Coordinator Query Exec Engine Query Exec Engine Query Exec Engine HDFS DN HBase HDFS DN HBase HDFS DN HBase 78
  • 79. Cloudera Impala Details Client submits query through ODBC SQL App Hive State Metastore YARN HDFS NN Store ODBC SQL Request Query Planner Query Planner Query Planner Query Coordinator Query Coordinator Query Coordinator Query Exec Engine Query Exec Engine Query Exec Engine HDFS DN HBase HDFS DN HBase HDFS DN HBase Confidential. ©2012 Cloudera, Inc. All Rights Reserved.
  • 80. Cloudera Impala Details Planner turns request into collection of plan fragments. Coordinator initiates execution on remote impalad’s SQL App Hive State Metastore YARN HDFS NN Store ODBC Query Planner Query Planner Fully MPP Query Planner Query Coordinator Query Coordinator Distributed Query Coordinator Query Exec Engine Query Exec Engine Query Exec Engine HDFS DN HBase HDFS DN HBase HDFS DN HBase Confidential. ©2012 Cloudera, Inc. All Rights Reserved.
  • 81. Cloudera Impala Details Impalads participating in query access local data in HDFS or HBase SQL App Hive State Metastore YARN HDFS NN Store ODBC Query Planner Query Planner Query Planner Query Coordinator Query Coordinator Query Coordinator Query Exec Engine Query Exec Engine Query Exec Engine HDFS DN HBase HDFS DN HBase HDFS DN HBase Local Direct Reads Confidential. ©2012 Cloudera, Inc. All Rights Reserved.
  • 82. Cloudera Impala Details • Intermediate results are streamed between impalad's (in-memory transfers). • Final results are streamed back to the client. [architecture diagram]
  • 83. BI Example – Tableau with Impala [screenshot]
  • 84. Development Challenges • According to TDWI research*: • 28% of users feel software tools are few and immature. • And 25% note the lack of metadata management. *TDWI Best Practices Report: Integrating Hadoop Into Business Intelligence and Data Warehousing, Philip Russom, TDWI Research: http://tdwi.org/research/2013/04/tdwi-best-practices-report-integrating-hadoop-into-business-intelligence-and-data-warehousing.aspx
  • 85. The Cloudera Developer Kit • The CDK is an open-source collection of libraries, tools, examples, and documentation targeted at simplifying the most common tasks when working with Hadoop. • The first module released is the CDK Data module – APIs that drastically simplify working with datasets in Hadoop filesystems. The Data module provides: • Automatic serialization and deserialization of Java POJOs as well as Avro Records. • Automatic compression. • File and directory layout and management. • Automatic partitioning based on configurable functions. • A metadata provider plugin interface to integrate with centralized metadata management systems.
  • 86. Cloudera Developer Kit • Source code, examples, documentation, etc.: • https://github.com/cloudera/cdk
  • 87. Questions? • Or see me at the Cloudera booth – 11:00-1:00.

Editor's notes

  1. Open source system implemented (mostly) in Java. Provides reliable and scalable storage and processing of extremely large volumes of data (TB to PB) on commodity hardware. Two primary components, HDFS and MapReduce. Based on software originally developed at Google. An important aspect of Hadoop in the context of this talk is that it allows for inexpensive storage (<10% of other solutions) of any type of data, and places no constraints on how that data is processed. Allows companies to begin storing data that was previously thrown away. Hadoop has value as a stand-alone analytics system, but many if not most organizations will be looking to integrate it into existing data infrastructure.
  2. Current Architecture (build) In the beginning, there were enterprise applications backed by relational databases. These databases were optimized for processing transactions, or Online Transaction Processing (OLTP), which required high-speed transactional reading and writing. Given the valuable data in these databases, business users wanted to be able to query them in order to ask questions. They used Business Intelligence tools that provided features like reports, dashboards, scorecards, alerts, and more. But these queries put a tremendous burden on the OLTP systems, which were not optimized to be queried like this. So architects introduced another database, called a data warehouse – you may also hear about data marts or operational data stores (ODS) – that was optimized for answering user queries. The data warehouse was loaded with data from the source systems. Specialized tools Extracted the source data, applied some Transformations to it – such as parsing, cleansing, validating, matching, translating, encoding, sorting, or aggregating – and then Loaded it into the data warehouse. For short we call these ETL. As it matured, the data warehouse incorporated additional data sources. Since the data warehouse was typically a very powerful database, some organizations also began performing some transformation workloads right in the database, choosing to load raw data for speed and letting the database do the heavy lifting of transformations. This model is called ELT. Many organizations perform both ETL and ELT for data integration.
  3. Issues (build) As data volumes and business complexity grow, ETL and ELT processing is unable to keep up. Critical business windows are missed. Databases are designed to load and query data, not transform it. Transforming data in the database consumes valuable CPU, making queries run slower.
  4. Solution (build) Offload slow or complex ETL/ELT transformation workloads to Cloudera in order to meet SLAs. Cloudera processes raw data to feed the warehouse with high-value cleansed, conformed data. Reclaim valuable EDW capacity for the high-value data and query workloads it was designed for, accelerating query performance and enabling new projects. Gain the ability to query ALL your data through your existing BI tools and practices.
  5. Conventional databases are expensive to scale as data volumes grow. Therefore most organizations are unable to keep all the data they would like to query directly in the data warehouse. They have to archive the data to more affordable offline systems, such as a storage grid or tape backup. A typical strategy is to define a “time window” for data retention beyond which data is archived. Of course, this data is not in the warehouse so business users cannot benefit from it.
  6. Bank of America: A multinational bank saves millions by optimizing their EDW for analytics and reducing data storage costs by 99%. Background: A multinational bank has traditionally relied on a Teradata enterprise data warehouse for its data storage, processing and analytics. With the movement from in-person to online banking, the number of transactions and the data each transaction generates has ballooned. Challenge: The bank wanted to make effective use of all the data being generated, but their Teradata system quickly became maxed out. It could no longer handle current workloads and the bank's business-critical applications were hitting performance issues. The system was spending 44% of its resources on operational functions and 42% on ELT processing, leaving only 11% for analytics and discovery of ROI from new opportunities. The bank was forced to either expand the Teradata system, which would be very expensive; restrict user access to the system in order to lessen the workload; or offload raw data to tape backup and rely on small data samples and aggregations for analytics in order to reduce the data volume on Teradata. Solution: The bank deployed Cloudera to offload data processing, storage and some analytics from the Teradata system, allowing the EDW to focus on its real purpose: performing operational functions and analytics. Results: By offloading data processing and storage onto Cloudera, which runs on industry-standard hardware, the bank avoided spending millions to expand their Teradata infrastructure. Expensive CPU is no longer consumed by data processing, and storage costs are a mere 1% of what they were before. Meanwhile, data processing is 42% faster and data center power consumption has been reduced by 25%. The bank can now process 10TB of data every day.
  7. This is a simple example, but close to how a number of companies are using Hadoop now.
  8. Full history of users' browsing is stored in web logs. This is semi-structured data.
  9. Most companies aren’t going to store raw logs into their DWH because of expense and low value of much of the data. This goes back to the ROB discussion – This data might have value in aggregate, but may be very difficult to justify storing in the typical data warehouse.
  10. This is a very quick overview and glosses over much of the capabilities and functionality offered by Flume. This is describing 1.3 or “Flume NG”.
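Flume NG agents described in the note above are wired together in a properties file naming each agent's sources, channels, and sinks. A minimal sketch under assumed names — the agent name, log path, and HDFS directory are all illustrative:

```properties
# agent1 tails a web server log and delivers events to HDFS
# through an in-memory channel.
agent1.sources = logsrc
agent1.channels = memch
agent1.sinks = hdfssink

agent1.sources.logsrc.type = exec
agent1.sources.logsrc.command = tail -F /var/log/httpd/access_log
agent1.sources.logsrc.channels = memch

agent1.channels.memch.type = memory
agent1.channels.memch.capacity = 10000

agent1.sinks.hdfssink.type = hdfs
agent1.sinks.hdfssink.hdfs.path = /data/raw/weblogs
agent1.sinks.hdfssink.channel = memch
```

The agent would then be started with something like `flume-ng agent --conf conf --conf-file example.conf --name agent1`.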
  11. Client executes Sqoop job. Sqoop interrogates the DB for column names, types, etc. Based on the extracted metadata, Sqoop creates source code for a table class, and then kicks off an MR job. This table class can be used for processing on extracted records. Sqoop by default will guess at a column for splitting data for distribution across the cluster. This can also be specified by the client.
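The steps above correspond to a single command-line invocation. A minimal sketch, assuming a running cluster and a reachable database — the connection string, table, and split column are hypothetical:

```shell
# Import one table from MySQL into HDFS as a MapReduce job.
# Sqoop reads the table's metadata, generates the table class,
# and splits the import across mappers on the --split-by column.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table orders \
  --split-by order_id \
  --target-dir /data/raw/orders
```

Omitting `--split-by` leaves Sqoop to guess a split column (typically the primary key), as the note describes.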
  12. Should be emphasized that with this system we maintain the raw logs in Hadoop, allowing new transformations as needed.
  13. This works well, and is representative of how most companies are doing these types of tasks now.
  14. Very few database/ETL devs have Java, etc. backgrounds. Many organizations do have ETL and SQL developers, though, who are familiar with common tools such as Informatica.
  15. Pentaho also has integration with NoSQL DBs (Mongo, Cassandra, etc.)
  16. Pentaho orchestrates the entire flow. Ratings data is ingested via a PDI job. Reference data is pre-processed – combined, cleansed, etc. Reference data is then copied into HDFS. Pentaho MapReduce is then used to do extensive transformations – joins, aggregations, etc. – to create final data sets to drive analysis. Resulting data sets are loaded into Hive. Hive queries drive analysis and reporting. All processing, reporting, etc. in this example is performed in Hadoop.
  17. This provides an example of transforming raw input data into final records through the Pentaho UI.
  18. That output then drives a number of reports and visualizations.
  19. Not a promotion for Informatica, but an example of how the largest enterprise vendors are adapting their products for Hadoop. Also shows out-of-cluster transformations.
  20. Uses the same interface as existing PowerCenter. Transformations are converted to HQL. Existing Informatica jobs can be re-used with Hadoop. Also provides data profiling, data lineage, etc.
  21. Most of these tools integrate to existing data stores using the ODBC standard.
  22. MSTR and Tableau are tested and certified now with the Cloudera driver, but other standard ODBC based tools should also work, and more integrations will be supported soon.
  23. JDBC/ODBC support: The HiveServer1 Thrift API lacks support for asynchronous query execution, the ability to cancel running queries, and methods for retrieving information about the capabilities of the remote server.
  24. Performing queries in Hive is basically the equivalent of a full table scan in a standard database. Not a good fit with most BI tools.
  25. Showing a definite bias here, but Impala is available now in beta, soon to be GA, and supported by major BI and analytics vendors. Also the system that I'm familiar with. Systems like Impala provide important new capabilities for performing data analysis with Hadoop, so well worth covering in this context. According to TDWI, lack of real-time query capabilities is an obstacle to Hadoop adoption for many companies.
  26. Impalad’scomponsed of 3 components – planner, coordinator, and execution engine.State Store Daemon isn’t shown here, but maintains information on impala daemons running in system
  27. Queries get sent to a single impalad, which is different from the HiveServer architecture.
  28. Changes in CDH4 allow for short-circuit reads – this allows impalad's to read directly from the file system rather than going through DataNodes. Another change allows Impala to know which disk data blocks are on.
  29. Impala makes it more practical to perform analysis with popular BI tools. You can now do exploratory analysis and quickly generate reports and visualizations with common tools. Integration with MSTR, QlikView, Pentaho, etc.
  30. The data module provides logical abstractions on top of storage subsystems (e.g. HDFS) that let users think and operate in terms of records, datasets, and dataset repositories