Using Hadoop to Offload Data Warehouse Processing and More - Brad Anserson
2. © 2014 MapR Technologies 2
Agenda
• Data Warehouse Offload Use Case
• How Is This Achieved?
• Please Do More!
4.
Your Enterprise Data Warehouse (in reality)
[Diagram labels: Analytics, ETL]
5.
Data Warehouse Optimization
Source Data: Billing Systems
Current ETL Pipeline (Staging, Data Warehouse): Extract, Clean, Conform, Transform, Normalize, Present, Access
Proposed Hybrid Solution Pipeline (Hadoop, Data Warehouse): Extract, Clean, Conform, Transform, Normalize, Present, Access
6.
Leveraging Big Data with Hadoop
FROM: RDBMS (DW)
• Only structured data
• $10K to $60K per TB
• Limited Analytics
• 70% cycles for ETL
TO: Hadoop (ETL + Long Term Storage) + RDBMS (DW: Query + Present); sources include Sensor Data and Web Logs
• Both structured and unstructured data
• 50x-100x cost savings: ~$333 per TB
• Claim 20-30% of your data warehouse space back
• Expanded analytics with MapReduce, NoSQL etc.
Hadoop: ETL + Long Term Storage
• No SPOF
• Fully protected
• Mirrored
7.
Results of TCO Evaluation
CapEx: Cost avoidance for annual Data Warehouse adds
Storage: 20x storage good for next 5 years
Cost: 100x cost reduction
Scale-out Architecture: New nodes can be added on the fly
No Disruption: Hybrid solution ensures no change to upstream/downstream business systems
One-time Hadoop investment of ~$6.5M provides $33.9M cost savings
Solution Technology | 5 Year Contract
Existing Data Warehouse | $67M
New Hybrid: Data Warehouse + Hadoop | $33M
Total Cost Savings | $34M
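The savings figure follows from the two contract totals on this slide. A minimal Python sketch of that arithmetic, using only the rounded figures shown here (the function name is illustrative, not from the talk):

```python
def five_year_savings(existing_cost_m: float, hybrid_cost_m: float) -> float:
    """Return the 5-year savings in $M: existing contract minus hybrid contract."""
    return existing_cost_m - hybrid_cost_m


# Rounded figures from the TCO slide, in millions of dollars.
existing_dw = 67.0
hybrid_dw_hadoop = 33.0

savings = five_year_savings(existing_dw, hybrid_dw_hadoop)
print(f"5-year savings: ${savings:.0f}M")

# Savings per dollar of the one-time Hadoop investment (~$6.5M for $33.9M saved).
savings_per_dollar = 33.9 / 6.5
print(f"Savings per dollar invested: {savings_per_dollar:.1f}")
```

On these numbers the hybrid contract costs roughly half of the existing one, and each dollar of Hadoop investment returns a little over five dollars in savings.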
8.
How is this Achieved?
9.
Step 1: Admit You Have A Problem
EVERYTHING IS AWESOME!
10.
Start Playing Around
• Dump some of your raw data into Hadoop
– Just use ‘cp’
• Convert your ETL SQL to HiveQL
– 90% unchanged
– 5% HiveQL semantics
– 5% Optimization
• Bulk Load Cleansed Data into EDW
– Use existing bulk loaders
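The "just use cp" step works when the cluster is exposed as an ordinary file system, for example through the NFS access described later in the deck. A minimal Python sketch of the same copy, assuming a hypothetical mount point such as /mapr/my.cluster.com/raw (the path and the function name are illustrative, not taken from the talk):

```python
import shutil
from pathlib import Path


def dump_raw_file(source: Path, cluster_dir: Path) -> Path:
    """Copy one raw file into a cluster directory mounted as a normal file system.

    This is the programmatic equivalent of `cp source cluster_dir/`.
    """
    cluster_dir.mkdir(parents=True, exist_ok=True)
    # shutil.copy2 copies the file contents and preserves metadata where it can.
    return Path(shutil.copy2(source, cluster_dir / source.name))


# Example call (hypothetical paths; adjust to wherever the cluster is mounted):
# dump_raw_file(Path("trades.csv"), Path("/mapr/my.cluster.com/raw"))
```

No Hadoop-specific client is involved here; the copy goes through the regular file API, which is the point of the bullet.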
11.
What Changed?
[Diagram: Traditional Architecture. Apps and the RDBMS each run the function; the data sits separately on SAN/NAS.]
[Diagram: Distributed Computing. A grid of twelve nodes, each holding its own data and function.]
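The contrast in this slide is that a distributed cluster runs the function on each node, next to that node's slice of the data, and only small partial results are combined. A toy single-process Python sketch of that idea (the partitions and names are made up for illustration; a real cluster would run each call on a different node):

```python
from typing import Callable, Iterable, List


def run_where_data_lives(
    partitions: Iterable[List[int]],
    local_function: Callable[[List[int]], int],
) -> List[int]:
    """Apply local_function to each partition separately, as each node would."""
    return [local_function(partition) for partition in partitions]


# Three made-up data partitions, one per imaginary node.
partitions = [[200, 250, 200], [700, 700], [100, 200, 400]]

# Each "node" sums its own data; only the small partial sums are combined.
partial_sums = run_where_data_lives(partitions, sum)
total = sum(partial_sums)
print(partial_sums, total)
```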
12.
Business Reasons
• ETL Window
– 60 hours of load time… every day
– Embarrassingly Parallel
• Cost of EDW
– 20x, 50x, 100x reduction
• Complex analytics
– Compute is Essentially Free
– Some models / algorithms / queries don’t fit relational models
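If the load really is embarrassingly parallel, the 60 hours of work can be split across workers to fit the daily window. A back-of-the-envelope Python sketch, assuming perfect linear scaling (an idealization; real jobs carry overhead) and a 24-hour window; the function name is illustrative:

```python
import math


def min_parallel_workers(total_load_hours: float, window_hours: float) -> int:
    """Smallest number of equal workers that fits the load inside the window,
    assuming the work divides evenly and scales linearly."""
    return math.ceil(total_load_hours / window_hours)


workers = min_parallel_workers(60, 24)
print(workers)  # 3, since 60 / 3 = 20 hours fits in a 24-hour day
```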
13.
Easy Integration with the Enterprise
• Real-time applications
• NFS for file-based applications
• Hadoop APIs for Hadoop applications
• ODBC & JDBC for SQL-based applications
• Mission critical and SLA dependent applications
14.
Interactive SQL-on-Hadoop: You Have Options!
Feature | Drill 1.0 | Hive 0.13 with Tez | Impala 1.x | Presto 0.56 | Shark 0.8 | Vertica
Latency | Low | Medium | Low | Low | Medium | Low
Files | Yes (all Hive file formats) | Yes (all Hive file formats) | Yes (Parquet, Sequence, …) | Yes (RC, Sequence, Text) | Yes (all Hive file formats) | Yes (all Hive file formats)
HBase/M7 | Yes | Yes | Various issues | No | Yes | No
Schema | Hive or schema-less | Hive | Hive | Hive | Hive | Proprietary or Hive
SQL support | ANSI SQL | HiveQL | HiveQL (subset) | ANSI SQL | HiveQL | ANSI SQL + advanced analytics
Client support | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC, ADO.NET, …
Large joins | Yes | Yes | No | No | No | Yes
Nested data | Yes | Limited | No | Limited | Limited | Limited
Hive UDFs | Yes | Yes | Limited | No | Yes | No
Transactions | No | No | No | No | No | Yes
Optimizer | Limited | Limited | Limited | Limited | Limited | Yes
Concurrency | Limited | Limited | Limited | Limited | Limited | Yes
15.
Structured and Semi-structured - JOIN
trades.csv
ITT,11/01/2011,08:46:01.827,17.44,200,P,T,00,2323,N,C,,,
ITT,11/01/2011,09:04:01.185,17.29,250,P,T,00,2804,N,C,,,
ITT,11/01/2011,09:08:08.997,16.97,200,T,FT,00,2950,N,C,,,
ITT,11/01/2011,09:30:00.375,17.02,700,T,O X,00,5216,N,C,,,
ITT,11/01/2011,09:30:00.375,17.02,700,T,Q,00,5217,N,C,,X,
ITT,11/01/2011,09:30:30.160,16.95,100,P,F,00,9247,N,C,,,
ITT,11/01/2011,09:30:33.362,16.95,200,P,@,00,9590,N,C,,,
ITT,11/01/2011,09:30:33.362,16.98,400,P,@,00,9591,N,C,,,
ITT,11/01/2011,09:30:33.362,16.99,100,P,@,00,9592,N,C,,,
ITT,11/01/2011,09:30:33.366,16.99,800,P,@,00,9594,N,C,,,
equities.json
{
  "symbol" : "ITT",
  "exchange" : "NYSE",
  "company" : {
    "name" : "ITT Corporation",
    "country" : "United States"
  }
}
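16.
Structured and Semi-structured - JOIN
-- Register the CSV and JSON SerDe jars so Hive can read trades.csv and equities.json.
ADD JAR /home/ec2-user/brad/csv-serde-1.1.2-0.11.0-all.jar;
ADD JAR /home/ec2-user/brad/json-serde-1.1.7.jar;

-- Total traded volume per country: join trades (from trades.csv)
-- to equities (from equities.json) on the ticker symbol.
SELECT e.company.country, SUM(t.volume) AS total_volume
FROM trades t
INNER JOIN equities e
  ON t.symbol = e.symbol
GROUP BY e.company.country;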
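To make the query's result concrete, the same join and aggregation can be reproduced in plain Python on the sample data from the previous slide. This is a sketch, not how Hive executes the query. It uses only the first three trades rows and assumes the first CSV field is the symbol and the fifth is the volume, since the deck does not show the table definitions.

```python
import csv
import io
import json
from collections import defaultdict

# First three rows of trades.csv from the slide.
TRADES_CSV = """\
ITT,11/01/2011,08:46:01.827,17.44,200,P,T,00,2323,N,C,,,
ITT,11/01/2011,09:04:01.185,17.29,250,P,T,00,2804,N,C,,,
ITT,11/01/2011,09:08:08.997,16.97,200,T,FT,00,2950,N,C,,,
"""

# The single record of equities.json from the slide.
EQUITIES_JSON = """
{"symbol": "ITT", "exchange": "NYSE",
 "company": {"name": "ITT Corporation", "country": "United States"}}
"""


def volume_by_country(trades_csv: str, equities: list) -> dict:
    """Inner-join trades to equities on symbol and sum volume per country."""
    # Lookup table: symbol -> country (the e.company.country field).
    country_of = {e["symbol"]: e["company"]["country"] for e in equities}
    totals = defaultdict(int)
    for row in csv.reader(io.StringIO(trades_csv)):
        symbol, volume = row[0], int(row[4])  # assumed column positions
        if symbol in country_of:  # inner join: skip unmatched symbols
            totals[country_of[symbol]] += volume
    return dict(totals)


result = volume_by_country(TRADES_CSV, [json.loads(EQUITIES_JSON)])
print(result)
```

For these three rows the volumes 200, 250 and 200 all belong to ITT, so the only group is "United States" with a total of 650.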
17.
Please Do More.
18.
Analytics + Operational Apps
Operational applications: Web application server, Mobile application server, Cloud services
Real-time and actionable analytics: Real-time ad targeting, Customer 360 dashboard, Data exploration (SQL), Real-time churn prevention, Product/service optimization and personalization
Hadoop (MapR), real-time:
• User profiles and state
• User interactions
• Real-time location data
• Web and mobile session state
• Comments/rankings
19.
Financial Services
MapR Distribution for Hadoop: Real-time Operational Applications, Analytics
[Diagram labels: Online transactions, Fraud detection, Personalized offers, Clickstream analysis, Fraud model, Recommendations table, Fraud investigation tool, Fraud investigator, Interactive marketer]
20.
Waste & Recycling Leader: Architecture
Trucks send Geolocation data to MapR
[Diagram labels]
• Real-time stream processing (Apache Storm)
• Batch processing (MapReduce)
• Shortest path graph algorithm (Titan)
• Online alerts
• Tax reduction reporting
• Route optimization
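The slide names a shortest path graph algorithm run on Titan for route optimization but does not show it. As a generic illustration of the technique only, here is Dijkstra's algorithm in plain Python on a made-up road graph. It does not use Titan, and the stops and distances are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

Graph = Dict[str, List[Tuple[str, float]]]


def shortest_distances(graph: Graph, start: str) -> Dict[str, float]:
    """Dijkstra's algorithm: shortest distance from start to every reachable node.

    Edge weights must be non-negative.
    """
    distances = {start: 0.0}
    queue = [(0.0, start)]  # min-heap of (distance so far, node)
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > distances.get(node, float("inf")):
            continue  # stale entry; a shorter path was already found
        for neighbor, weight in graph.get(node, []):
            candidate = dist + weight
            if candidate < distances.get(neighbor, float("inf")):
                distances[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return distances


# Hypothetical stops and one-way distances in km.
roads: Graph = {
    "depot": [("stop_a", 4.0), ("stop_b", 1.0)],
    "stop_b": [("stop_a", 2.0), ("landfill", 6.0)],
    "stop_a": [("landfill", 3.0)],
}

print(shortest_distances(roads, "depot"))
```

In this toy graph the cheapest way to stop_a is through stop_b (1 + 2 = 3 km rather than 4 km direct), which in turn gives the shortest route to the landfill (6 km).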
22.
Please do more!
Q&A
@mapr maprtech
brad@mapr.com
MapR
maprtech
mapr-technologies
Editor's Notes
MapR’s innovations have also expanded the use cases that are possible with Hadoop. MapR supports the full Hadoop API set. It also supports NFS, so any file-based application can access the cluster with no changes or rewrites. It provides ODBC support, so any database application or SQL-based tool can access and manipulate data in a MapR cluster. It supports real-time streaming access, which greatly expands the applications that are possible with Hadoop beyond batch processing. Finally, the full HA, DR and data protection capabilities of MapR allow mission-critical apps to be deployed safely and allow administrators to meet stringent SLA targets.
Because only MapR can reliably run both operational and analytical applications on one platform/cluster, MapR enables a faster closed-loop process between operational applications and analytics. This means:
• Interactive marketers and algorithms can update the rules engines more quickly and provide more real-time targeting of offers and relevant content to consumers.
• Fraud models are kept more up to date with the latest patterns, to better detect anomalies and take action more quickly on bad actors.