Data warehouses have been the standard tool for analyzing data created by business operations. In recent years, increasing data volumes, new types of data formats, and emerging analytics technologies such as machine learning have given rise to modern data lakes. Connecting application databases, data warehouses, and data lakes using real-time data pipelines can significantly improve the time to action for business decisions. More: http://info.mapr.com/WB_MapR-StreamSets-Data-Warehouse-Modernization_Global_DG_17.08.16_RegistrationPage.html
11. Top challenges for the big data warehouse
What challenges does your company face when managing your big data flows?
➢ Ensuring the quality of the data (accuracy, completeness, consistency): 68%
➢ Complying with security and data privacy policies: 60%
➢ Keeping data flow pipelines operating effectively: 52%
➢ Building pipelines for getting data into the data store: 47%
➢ Upgrading big data infrastructure components (Kafka, Hadoop, etc.): 40%
➢ Adapting pipelines to meet new requirements: 32%
➢ We have no challenges: 1%
12. What’s the impact?
Does ‘bad data’ occasionally get into your data stores? Yes: 87%, No: 13%
Do you believe there is any ‘bad data’ in your data stores currently? Yes: 74%, No: 26%
In response, 53% change data flow pipelines at least several times a month.
13. New standards for data warehousing
(Diagram: data flows from Data Sources through Data Stores to Data Consumers, via ETL in the past and via Ingest/Analyze today.)
Past (ETL):
➢ Fixed-schema ETL for data warehouses
➢ Source data: structured, rigid transaction data
Emerging (Ingest):
➢ Explosion of data stores and fluid infrastructure
➢ Source data: predominantly multi-structured interaction data
➢ Data drift: structure, semantics, infrastructure
14. Solving Data Drift: Delayed and False Insights
(Diagram: custom-code, fixed-schema pipelines carry data from Data Sources to Data Stores and on to Data Consumers’ tools and applications; data drift produces poor data trust and quality, and delayed and false insights.)
15. Solving Data Drift: Trusted and Timely Insights
(Diagram: intent-driven, drift-handling pipelines carry data from Data Sources to Data Stores and on to Data Consumers’ tools and applications; data KPIs provide trusted, high-quality data and timely insights despite data drift.)
16. Think of dataflows as cyclical processes
➢ Build: development processes are far more complex and drawn out than they need to be.
➢ Execute: the economics of data have changed, giving way to a choice of execution and deployment options.
➢ Operate: architectures are constantly changing and have more stringent SLAs.
17. Build
➢ Not all developers are created equal.
➢ Integrations are abundant and unnecessarily rigid.
➢ Build-to-deploy takes far longer than necessary.
18. Execute
➢ Multiple deployment options exist, yet constraints limit making use of them.
➢ Mixed workloads are the norm: pipelines must handle both batch and streaming (see the sketch after this list).
➢ Scalability is a must, both today and into the future.
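As a rough Python sketch of the mixed-workload point (illustrative only, not StreamSets code): one processing function can serve both a finite batch source and an unbounded streaming source. The source, function, and field names here are invented for the example.

import json
import time
from typing import Iterable, Iterator

def sanitize(record: dict) -> dict:
    # Normalize keys the same way whether the record arrived in batch or stream.
    return {k.strip().lower(): v for k, v in record.items()}

def process(records: Iterable[dict]) -> Iterator[dict]:
    # The pipeline body is source-agnostic.
    for record in records:
        yield sanitize(record)

def batch_source(path: str) -> Iterator[dict]:
    # Finite source: a newline-delimited JSON file.
    with open(path) as f:
        for line in f:
            yield json.loads(line)

def stream_source(poll) -> Iterator[dict]:
    # Unbounded source: poll() returns the next record, or None when idle.
    while True:
        record = poll()
        if record is None:
            time.sleep(0.1)  # back off while the stream is quiet
            continue
        yield record

# The same pipeline serves both modes:
#   for out in process(batch_source("events.jsonl")): ...
#   for out in process(stream_source(consumer_poll)): ...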
19. Operate
➢ Increasingly, the business expects SLAs on the quality and timeliness of data.
➢ Architectures are constantly evolving, with new versions or new projects regularly being added.
➢ Data, and its structure, will inevitably change, causing widespread impact.
20. StreamSets Data Operations Platform
The platform spans the DEVELOP and OPERATE phases, with proactive (EVOLVE) and reactive (REMEDIATE) capabilities:
➢ EFFICIENCY: intent-driven flows, batch and streaming ingest, in-stream sanitization
➢ AGILITY: flexible deployment, exception handling, seamless evolution
➢ MAP: dataflow lineage, live data architecture
➢ MEASURE: any path, any time
➢ CONTROL: drift handling, stage and flow metrics, lineage and impact analysis
➢ MASTER: availability and accuracy, proactive remediation
Components: StreamSets Data Collector (standalone, cluster, cloud, and edge) and Dataflow Performance Manager.
22. StreamSets & MapR enable real-time streams
(Diagram: change data capture from multiple operational databases, covering both static insert-only data and frequently updated data, feeds event streaming with MapR-ES; transformations and stream processing drive real-time business intelligence, with data exploration using Drill.)
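To give a feel for the consuming end of such a pipeline, here is a hedged Python sketch that reads change-data-capture events from a MapR-ES topic via its Kafka-compatible API. The stream path /apps/cdc:orders, the event field names (op, data), and the use of a kafka-python-style client are assumptions for illustration, not details from the deck; a real MapR-ES deployment would use MapR’s Kafka client libraries.

import json
from kafka import KafkaConsumer  # kafka-python; MapR-ES is Kafka API compatible

# Hypothetical MapR-ES stream path and topic name.
consumer = KafkaConsumer(
    "/apps/cdc:orders",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Assumed CDC envelope: an operation type plus the changed row.
    op = event.get("op", "insert")
    row = event.get("data", {})
    if op in ("insert", "update"):
        print("upsert into analytics store:", row)
    elif op == "delete":
        print("delete from analytics store:", row)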
Now, let’s look at how these events are generally analyzed today. Many customers use batch-oriented analysis for critical business decisions.
History is repeating itself
Past: 70% of data warehouse projects used to fail. Fixed-schema ETL technology came along and automated what was previously a manual and brittle task.
Future: an explosion of big data apps, tools and techniques, tied to specific data stores that are fluid and multiplying. The inherent schema-centricity of legacy ETL tools prevents them from being used for extracting and loading (now called ingest) semi-structured data, so organizations have resorted to manually coded data ingest pipelines. Manually coded pipelines are unsustainable, but more importantly they fail due to drift.
You have been hearing about the business impact of big data applications for half a decade now. The commonality is in the source of data: while the previous decades of applications focused on transaction data, emerging use cases focus on event and interaction data. These sources are not just databases and apps, but logs, devices, and device data. Big data sources (e.g. systems, sensors) suffer from data drift: the unending, unpredictable and unannounced mutation of data caused by the operations, maintenance and modernization of data sources.
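To make data drift concrete, here is a minimal Python sketch of drift detection under an assumed baseline schema: each incoming record is compared against the expected fields and types, and any mutation is flagged instead of passing silently. The field names are invented for the example.

# Baseline schema: expected field names mapped to expected types.
baseline = {"user_id": int, "event": str, "ts": str}

def detect_drift(record: dict, schema: dict) -> list:
    findings = []
    for field, expected in schema.items():
        if field not in record:
            findings.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            findings.append(f"type change: {field} is {type(record[field]).__name__}")
    for field in record.keys() - schema.keys():
        findings.append(f"new field: {field}")
    return findings

print(detect_drift({"user_id": "42", "event": "login", "geo": "US"}, baseline))
# ['type change: user_id is str', 'missing field: ts', 'new field: geo']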
Today, data is delivered to data stores by writing low-level code against transport mechanisms such as Sqoop, Flume and Kafka. These hand-coded flows create big problems in data stores (a drift-tolerant alternative is sketched after this list):
Brittle: Data flows break frequently because low-level code can’t adapt when structure changes.
Opaque: These problems manifest themselves as surprises because there is no visibility into the health of data flows or the data being delivered.
Ad hoc: Data integrity corrodes as the meaning of the data changes without detection and new or changed fields do not get properly processed.
…with serious business implications
Poor business decisions get made based on incomplete, inaccurate or late data
Trust in the data is lost as these errors are discovered post hoc
Productivity and agility are sacrificed as data engineers and scientists spend all of their time fixing pipelines, doing janitorial work and forensics.
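As a minimal sketch of the drift-tolerant alternative mentioned above, assuming newline-delimited JSON input and invented field names: fields are mapped by name with defaults rather than by fixed position, and records that cannot be normalized are routed to an error stream for remediation instead of breaking the flow or corrupting the store.

import json

def normalize(record: dict) -> dict:
    # Map by field name with defaults, so a new or reordered field
    # does not break the pipeline the way positional parsing would.
    return {
        "user_id": int(record["user_id"]),
        "event": str(record.get("event", "unknown")),
        "ts": record.get("ts"),
    }

def ingest(lines, store, errors):
    for line in lines:
        try:
            store.append(normalize(json.loads(line)))
        except (ValueError, KeyError) as exc:
            # Route bad records aside instead of crashing the whole flow.
            errors.append({"raw": line, "reason": str(exc)})

good, bad = [], []
ingest(['{"user_id": "7", "event": "login"}', "not json"], good, bad)
print(len(good), len(bad))  # 1 1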
Key point: With the right approach, ingest can happen far more effectively and efficiently than before
Sub point 1: Not everyone is a developer
On one hand we’re extremely lucky: we’re in a market with a seemingly endless number of choices for solving our various data problems. The tricky part is that many of them are quite technical, requiring new skills or hard-to-find resources (i.e., personnel) to make use of them. While many people thrive as hardcore developers, many others do not, often simply because simplified tooling lets them complete a project faster. The point is that you should not be prevented from taking advantage of new technologies because you lack the skills, and adoption doesn’t need to take as long as it typically does.
Sub point 2: Integrations are abundant and unnecessarily rigid
Capturing all existing customer data (ERP, SFDC, etc.) into the data lake and mapping out a picture of the customer to feed other use cases. This feeds lead information to SFDC so sales reps can understand where the opportunities are.
Supply chain analytics is another use case with Cisco: they subcontract work out and use this information to manage the quality of the supply chain process. Fixing issues early in the manufacturing process from subcontractors lets them get in front of problems for the customer. This has saved millions of dollars in supply chain efficiency.
Global bank fraud costs $200B annually.
Zions Bank Fights Fraud, Gains Insights and Cuts Data Storage Costs with MapR
The Business
Zions Bank, based in Salt Lake City, Utah, is a subsidiary of Zions Bancorporation that operates more than 500 offices and 600 ATMs in 10 Western U.S. states. As a full-service bank, Zions offers commercial, installment and mortgage loans; trust services; foreign banking services; electronic and online banking services; automatic deposit and nationwide banking and transfer services; as well as checking and savings programs.
Challenge
“Being a financial institution, we have a bull’s-eye painted on our backs,” says Michael Fowkes, Zions Bank SVP Fraud Operations and Security Analytics. “Crooks want to steal money, and banks are often a target, so fraud protection is critical to our business. If fraud gets out of control, it eats into our profitability.”
The Zions Bank Fraud Operations and Security Analytics team maintains data stores, builds statistical models to detect fraud, and then uses these models to data mine and evaluate suspicious activity.
Zions has been refining their solution over the past 8-9 years. Fowkes explains that about eight years ago they found that when they loaded in large volumes of data, reporting performance degraded significantly.
“We always kept our eye out for new data stores. When it came time to refresh our data stores, we decided to go to Hadoop,” says Fowkes.
MapR Solution
Zions Bank chose MapR for its security features, NFS mountable file system, high availability, ease of management and its superior performance capabilities, which allow for a more efficient use of hardware and a better ROI.
The bank relies on MapR for a critical part of their security architecture. MapR helps Zions predict phishing behavior and payments fraud in real time and minimize their impact. With MapR, Zions can run more detailed analytics and forensics.
Benefits
The bank has seen multiple benefits from their MapR solution:
Cuts storage costs in half
Zions is seeing significant benefits from a storage perspective. With their other data stores, they had to hold on to source data sets so they would still have the original data. MapR eliminates the need to maintain those separate copies.
“When we cut over to MapR, we cut our expenses in half from a data storage perspective,” says Fowkes.
Cost effective to scale
Since MapR scales linearly, capacity planning is much easier. “We know that growth won’t be incredibly expensive like with distributed database platforms which charge per terabyte of storage. This can get quite expensive,” says Fowkes. “The others cost a lot more to scale. MapR allows us to scale at a reasonable price.”
Increases accuracy, speed and insights
Fowkes explains that before, when you created a statistical model, you had to use sample data. “MapR allows you to wrangle large amounts of data,” he says.
“You can use all of your data and create a more accurate model. This is also used in forensics so we have one place to research what happened.”
Two years of data add up to about 1.2 petabytes. Wrangling this amount of data used to be daunting. “In the past, it could take a full day. Now we can do a data query of two years of data in 30 minutes,” he says.
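For a sense of what such a query looks like, here is a hedged sketch using Drill’s REST endpoint (query.json). The host, table path, and column names are hypothetical, and the sketch assumes the Python requests library; nothing here is taken from Zions’ actual environment.

import requests

DRILL_URL = "http://drillbit.example.com:8047/query.json"  # hypothetical host

sql = """
SELECT account_id, COUNT(*) AS txn_count
FROM dfs.`/data/transactions`          -- hypothetical table path
WHERE txn_date >= '2014-01-01'         -- roughly two years of history
GROUP BY account_id
HAVING COUNT(*) > 1000
"""

# Drill's REST API accepts a JSON body with the query type and SQL text.
resp = requests.post(DRILL_URL, json={"queryType": "SQL", "query": sql})
resp.raise_for_status()
for row in resp.json().get("rows", []):
    print(row)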
Multiple uses for data stores
Centralizing data stores serves multiple uses, from data security to fraud detection to risk management to customer marketing. “We initially got into centralizing all of our data from an information security perspective. We then saw that we could use this same environment to help with fraud detection,” he says.
“Now that we have this data we know we can do more with it. Right now we’re working on a business project on the marketing side, completely outside of fraud and info security. It’s the same data to look at on the business side for customer analytics,” he says. “And our risk group leverages data that’s used in the system too. Having a more granular view of data, you get additional insights.”
Summary
MapR is enabling Zions Bank to improve its security infrastructure while reducing costs. They’ve been able to cut storage costs in half, scale their solution cost-effectively, make more efficient use of hardware, make statistical models more accurate, increase the performance and speed of high-volume data queries, generate deeper insights, and leverage their data stores across several aspects of the business.