Using Hadoop to Offload Data Warehouse Processing and More - Brad Anderson

© 2014 MapR Technologies
Agenda
• Data Warehouse Offload Use Case
• How Is This Achieved?
• Please Do More!

BIG DATA

Your Enterprise Data Warehouse (in reality)
[Figure labels: Analytics, ETL]

Data Warehouse Optimization
Source Data: Billing Systems
Current ETL Pipeline (Staging, Data Warehouse): Extract → Clean → Conform → Transform → Normalize → Present → Access
Proposed Hybrid Solution Pipeline (Hadoop, Data Warehouse): Extract → Clean → Conform → Transform → Normalize → Present → Access

Leveraging Big Data with Hadoop
FROM: RDBMS (DW)
• Only structured data
• $10K to $60K per TB
• Limited Analytics
• 70% cycles for ETL
TO: Hadoop + RDBMS (DW)
• Both structured and unstructured data
• 50x-100x cost savings: ~$333 per TB
• Claim 20-30% of your data warehouse space back
• Expanded analytics with MapReduce, NoSQL etc.
Data sources: Sensor Data, Web Logs
Hadoop: ETL + Long Term Storage
• No SPOF
• Fully protected
• Mirrored
DW: Query + Present

Results of TCO Evaluation
• CapEx: Cost avoidance for annual Data Warehouse adds
• Storage: 20x storage good for next 5 years
• Cost: 100x cost reduction
• Scale-out Architecture: New nodes can be added on the fly
• No Disruption: Hybrid solution ensures no change to upstream/downstream business systems
One-time Hadoop investment of ~$6.5M provides $33.9M cost savings

Solution Technology                     5 Year Contract
Existing Data Warehouse                 $67M
New Hybrid: Data Warehouse + Hadoop     $33M
Total Cost Savings                      $34M
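
As a quick check of the arithmetic on this slide, the sketch below recomputes the five-year savings from the two contract totals. The dollar figures come from the slide; the function and variable names are illustrative only. Because the inputs are rounded to whole millions, the result is $34M rather than the $33.9M quoted in the headline, presumably a rounding difference.

```python
# Five-year contract totals from the TCO slide, in millions of US dollars.
existing_dw_cost_m = 67   # Existing Data Warehouse
hybrid_cost_m = 33        # New Hybrid: Data Warehouse + Hadoop

def tco_savings(existing_m, hybrid_m):
    """Return (absolute savings in $M, savings as a fraction of the existing cost)."""
    savings_m = existing_m - hybrid_m
    return savings_m, savings_m / existing_m

savings_m, savings_fraction = tco_savings(existing_dw_cost_m, hybrid_cost_m)
print(f"Savings: ${savings_m}M ({savings_fraction:.0%} of the existing contract)")
```

On these rounded inputs the hybrid contract comes to roughly half the cost of the existing one.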

How is this Achieved?

Step 1: Admit You Have A Problem
EVERYTHING IS AWESOME!

Start Playing Around
• Dump some of your raw data into Hadoop
– Just use ‘cp’
• Convert your ETL SQL to HiveQL
– 90% unchanged
– 5% HiveQL semantics
– 5% Optimization
• Bulk Load Cleansed Data into EDW
– Use existing bulk loaders

What Changed?
[Figure: Traditional Architecture. Data blocks sit on SAN/NAS, separate from the functions that use them (RDBMS, App).]
[Figure: Distributed Computing. Many nodes, each pairing data with function.]

Business Reasons
• ETL Window
– 60 hours of load time… every day
– Embarrassingly Parallel
• Cost of EDW
– 20x, 50x, 100x reduction
• Complex analytics
– Compute is Essentially Free
– Some models / algorithms / queries don’t fit relational models
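
The "embarrassingly parallel" point can be illustrated in a few lines: when each input partition can be cleansed without looking at any other, the work spreads across workers with no coordination between them. This is only a single-machine sketch in Python, not Hadoop code. `cleanse_partition`, `cleanse_all`, and the sample partitions are hypothetical stand-ins for real ETL steps and source files.

```python
from concurrent.futures import ThreadPoolExecutor

def cleanse_partition(lines):
    """Hypothetical cleansing step: trim whitespace and drop empty records."""
    return [line.strip() for line in lines if line.strip()]

def cleanse_all(partitions, workers=4):
    """Cleanse every partition independently; results keep partition order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(cleanse_partition, partitions))

# Made-up partitions standing in for separate source files.
partitions = [
    ["  a,1 ", "", "b,2"],
    ["c,3", "   "],
]
cleaned = cleanse_all(partitions)
print(cleaned)
```

The threads here only show the shape of the work. On a cluster, each partition would be handled by a separate task on the node that holds the data, which is why adding nodes can shorten the load window.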

Easy Integration with the Enterprise
• NFS for file-based applications
• Hadoop APIs for Hadoop applications
• ODBC & JDBC for SQL-based applications
• Real-time applications
• Mission critical and SLA dependent applications

Interactive SQL-on-Hadoop: You Have Options!
Feature | Drill 1.0 | Hive 0.13 with Tez | Impala 1.x | Presto 0.56 | Shark 0.8 | Vertica
Latency | Low | Medium | Low | Low | Medium | Low
Files | Yes (all Hive file formats) | Yes (all Hive file formats) | Yes (Parquet, Sequence, …) | Yes (RC, Sequence, Text) | Yes (all Hive file formats) | Yes (all Hive file formats)
HBase/M7 | Yes | Yes | Various issues | No | Yes | No
Schema | Hive or schema-less | Hive | Hive | Hive | Hive | Proprietary or Hive
SQL support | ANSI SQL | HiveQL | HiveQL (subset) | ANSI SQL | HiveQL | ANSI SQL + advanced analytics
Client support | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC, ADO.NET, …
Large joins | Yes | Yes | No | No | No | Yes
Nested data | Yes | Limited | No | Limited | Limited | Limited
Hive UDFs | Yes | Yes | Limited | No | Yes | No
Transactions | No | No | No | No | No | Yes
Optimizer | Limited | Limited | Limited | Limited | Limited | Yes
Concurrency | Limited | Limited | Limited | Limited | Limited | Yes

Structured and Semi-structured - JOIN
trades.csv
ITT,11/01/2011,08:46:01.827,17.44,200,P,T,00,2323,N,C,,,
ITT,11/01/2011,09:04:01.185,17.29,250,P,T,00,2804,N,C,,,
ITT,11/01/2011,09:08:08.997,16.97,200,T,FT,00,2950,N,C,,,
ITT,11/01/2011,09:30:00.375,17.02,700,T,O X,00,5216,N,C,,,
ITT,11/01/2011,09:30:00.375,17.02,700,T,Q,00,5217,N,C,,X,
ITT,11/01/2011,09:30:30.160,16.95,100,P,F,00,9247,N,C,,,
ITT,11/01/2011,09:30:33.362,16.95,200,P,@,00,9590,N,C,,,
ITT,11/01/2011,09:30:33.362,16.98,400,P,@,00,9591,N,C,,,
ITT,11/01/2011,09:30:33.362,16.99,100,P,@,00,9592,N,C,,,
ITT,11/01/2011,09:30:33.366,16.99,800,P,@,00,9594,N,C,,,
equities.json
{
  "symbol" : "ITT",
  "exchange" : "NYSE",
  "company" : {
    "name" : "ITT Corporation",
    "country" : "United States"
  }
}

Structured and Semi-structured - JOIN
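-- Register the SerDe jars used to read the CSV and JSON data
ADD JAR /home/ec2-user/brad/csv-serde-1.1.2-0.11.0-all.jar;
ADD JAR /home/ec2-user/brad/json-serde-1.1.7.jar;
-- Total traded volume per company country, joining trades to equities on symbol
SELECT e.company.country, SUM(t.volume) AS total_volume
FROM trades t
INNER JOIN equities e
  ON t.symbol = e.symbol
GROUP BY e.company.country;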
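
To make the result of this query concrete, here is a small pure-Python sketch that performs the same inner join and aggregation on a few of the sample rows from the previous slide. It is not how Hive executes the query. It assumes the first CSV field is `symbol` and the fifth is `volume`; the slides do not show the table definitions, so that column mapping is a guess, and `volume_by_country` is a hypothetical helper name.

```python
import csv
import io
import json
from collections import defaultdict

# First three sample rows of trades.csv and the equities.json record.
TRADES_CSV = """\
ITT,11/01/2011,08:46:01.827,17.44,200,P,T,00,2323,N,C,,,
ITT,11/01/2011,09:04:01.185,17.29,250,P,T,00,2804,N,C,,,
ITT,11/01/2011,09:08:08.997,16.97,200,T,FT,00,2950,N,C,,,
"""
EQUITIES_JSON = """
{"symbol": "ITT", "exchange": "NYSE",
 "company": {"name": "ITT Corporation", "country": "United States"}}
"""

# Assumed column positions; the slides do not show the Hive table schema.
SYMBOL_COL, VOLUME_COL = 0, 4

def volume_by_country(trades_text, equities_text):
    """Inner-join trades to equities on symbol and sum volume per country."""
    equity = json.loads(equities_text)
    country_by_symbol = {equity["symbol"]: equity["company"]["country"]}
    totals = defaultdict(int)
    for row in csv.reader(io.StringIO(trades_text)):
        if not row:
            continue  # ignore blank lines
        country = country_by_symbol.get(row[SYMBOL_COL])
        if country is not None:  # inner join: skip trades with no matching equity
            totals[country] += int(row[VOLUME_COL])
    return dict(totals)

print(volume_by_country(TRADES_CSV, EQUITIES_JSON))
```

The nested lookup `equity["company"]["country"]` plays the role of `e.company.country` in the HiveQL: the semi-structured record is navigated directly instead of being flattened first.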

Please Do More.

Analytics + Operational Apps
Platform: Hadoop (MapR), real-time
Operational applications: Web application server, Mobile application server, Cloud services
Real-time and actionable analytics:
• Real-time ad targeting
• Customer 360 dashboard
• Data exploration (SQL)
• Real-time churn prevention
• Product/service optimization and personalization
Data:
• User profiles and state
• User interactions
• Real-time location data
• Web and mobile session state
• Comments/rankings

Financial Services
Platform: MapR Distribution for Hadoop (Analytics, Real-time Operational Applications)
• Fraud detection
• Personalized offers
• Fraud investigation tool (Fraud investigator)
• Fraud model
• Recommendations table
• Clickstream analysis
• Online transactions
• Interactive marketer

Waste & Recycling Leader—Architecture
Inputs: Geolocation data from each truck, sent to MapR
Processing:
• Real-time stream processing (Apache Storm)
• Batch processing (MapReduce)
• Shortest path graph algorithm (Titan)
Outputs:
• Online alerts
• Tax reduction reporting
• Route optimization

Please do more!
Q&A
@mapr maprtech
brad@mapr.com
MapR
maprtech
mapr-technologies
