Hadoop Data Lake vs classical Data Warehouse: How to utilize best of both worlds - by Kolja Roedel

Hadoop Data Lake & classical Data Warehouse: How to utilize best of both worlds
2018 • Hadoop Data Lake & classical DWH: Best of both worlds • © Copyright Woodmark Consulting AG • Kolja Rödel

Speaker & Agenda
1. Introduction to Hybrid Architectures
• Classical Data Warehouses
• Hadoop Data Lakes
• Bringing it all together
2. Use Cases on Hybrid Architectures
Kolja Rödel
Manager
E-mail: kolja.roedel@woodmark.de
Woodmark Consulting AG
Am Hochacker 4
85630 Grasbrunn / München

1. Introduction to Hybrid Architectures

Really 2 different things? Or just an implementation detail?
[Figure: side-by-side comparison of the two architectures]
Classical BI/DWH Architecture (< 10 Terabyte): Structured Data, Staging Area, Data Warehouse (database tables), Data Mart 1 … Data Mart n
Data Lake Architecture (> 10 Terabyte, on HDFS): Structured, Semi-structured and Unstructured Data, Landing Zone, Streaming, Data Lake, Archive, Reservoir, (Stream-)Export
Consumers: Use Case 1, Use Case 2, … Use Case n

Data Warehouses at the heart of the traditional BI landscape
• Def. Bill Inmon: “a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management’s decision-making process”
• Def. Ralph Kimball: “a copy of transaction data specifically structured for querying and reporting”
Architectural concept:
• Global view on heterogeneous & distributed data
• Layered architecture for distinct purposes
• Integration of source data for consistency:
Single Point of Truth
• Elaborate data model (3NF, Dimensional, Data
• Support of deeper analyses (like time-series)
• Aggregation of KPIs for efficient usage
• Preparation of application-specific data extracts
[Figure: layered DWH data flow. Stages: Data Sources, Staging Area, Core Data Warehouse, Data Marts, Reporting & Analysis. Steps between the stages: ETL: decouple; ETL: versionize, clean & integrate; ETL: transform; Query: filter.]

Hadoop Data Lakes to face new challenges
Challenges:
• Explosive data growth (3V: volume, velocity, variety)
• All data has potential value for future Use Cases: no selective archiving & “no” deleting!
• Data cleansing to extract business value
• Retain transparency through Data Governance
Architectural concept:
• Central platform for collecting and processing large volumes of multi-structured data
• Layered architecture
• Arrival of raw data (1:1 copy)
• Persistent storage of cleansed, normalized data
• Use Case oriented, filtered data contexts
• No strict data modelling, no transactions (ACID)
• Advanced processing like Streaming & Machine Learning
[Figure: Data Lake Architecture (> 10 Terabyte). Structured, Semi-structured and Unstructured Data; Landing Zone; Streaming; Data Lake; Archive (highly compressed, lower replication factor); Reservoir; (Stream-)Export]
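The three data lake layers named on this slide (arrival of raw data as a 1:1 copy, persistent storage of cleansed and normalized data, use-case-oriented filtered contexts) can be sketched in a few lines of Python. This is only an illustration: the pipe-delimited log format, the sample lines, and the function names `land`, `standardize` and `use_case_view` are assumptions, not part of the talk, and a real lake would persist each layer on HDFS instead of in memory.

```python
# Hypothetical raw log lines: "machine | level | value"
RAW_LOG = [
    " Press-1 | error | 3 ",
    "press-2|INFO|1.5",
    "garbage line",
    "press-1|Error|x",
]

def land(lines):
    """Landing zone: 1:1 copy of the raw input, nothing dropped or changed."""
    return list(lines)

def standardize(landed):
    """Cleansed, normalized layer: trim fields, unify casing, parse numbers."""
    records = []
    for line in landed:
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 3:
            continue  # unparsable here, but still available in the landing zone
        machine, level, value = parts
        try:
            number = float(value)
        except ValueError:
            continue  # non-numeric value: skipped in this layer only
        records.append(
            {"machine": machine.lower(), "level": level.upper(), "value": number}
        )
    return records

def use_case_view(records, level):
    """Use-case-oriented context: a filtered subset for one analysis."""
    return [r for r in records if r["level"] == level]

landed = land(RAW_LOG)
standardized = standardize(landed)
errors = use_case_view(standardized, "ERROR")
```

Nothing is deleted along the way: the lines that fail cleansing remain in the landed copy, so a later use case can still apply a different interpretation to them.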

Apart from implementation, we see a paradigm change:
Hadoop Data Lake
• Collect all (types of) data
• Data-Driven-Business
• Schema-On-Read
• Data as an enterprise asset
• Data is loaded in raw format to the Data Lake … and is selected and organized with respect to the Use Case
Classical BI/DWH
• Minimal storage allocation
• Hypothesis / Application-Driven-Business
• Schema-On-Write
• Data as a side product of processes
• Data is cleansed and integrated into a consistent schema before loading to the DWH … and analyses are executed directly on the DWH
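The difference between Schema-On-Write and Schema-On-Read can be illustrated with a small Python sketch. The sample events, the field names and the helper functions below are hypothetical and chosen only for illustration; the point is where the schema gets applied, not how a particular product implements it.

```python
import json

# Hypothetical raw events as they might arrive from a source system.
RAW_EVENTS = [
    '{"machine": "press-1", "temp_c": 71.5, "ts": "2018-03-01T10:00:00"}',
    '{"machine": "press-2", "temp_c": "n/a", "ts": "2018-03-01T10:00:05"}',
    '{"machine": "press-1", "temp_c": 73.0, "operator": "kr"}',
]

def validate(record):
    """Schema-on-write check: the record must match one fixed schema."""
    if set(record) != {"machine", "temp_c", "ts"}:
        raise ValueError("unexpected or missing fields")
    if not isinstance(record["temp_c"], float):
        raise ValueError("temp_c must be a float")
    return record

def load_schema_on_write(lines):
    """Warehouse style: validate before storing, reject what does not fit."""
    table, rejected = [], []
    for line in lines:
        try:
            table.append(validate(json.loads(line)))
        except ValueError:
            rejected.append(line)
    return table, rejected

def load_schema_on_read(lines):
    """Lake style: store the raw input unchanged."""
    return list(lines)

def read_temperatures(raw_store):
    """Apply a use-case-specific schema only at read time."""
    temps = []
    for line in raw_store:
        value = json.loads(line).get("temp_c")
        if isinstance(value, (int, float)):
            temps.append(float(value))
    return temps

table, rejected = load_schema_on_write(RAW_EVENTS)
lake = load_schema_on_read(RAW_EVENTS)
temperatures = read_temperatures(lake)
```

In this sketch the warehouse-style loader accepts only the first event, while the lake-style store keeps all three and lets the temperature use case pick the two readings it can interpret.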

A Hybrid Architecture unites the beneficial features:
Hybrid Architecture
• Processing of structured & unstructured data
• Parallel processing of large amounts of data in real-time
• Hypothesis- and Data-driven analyses
• Highly integrated core data
[Figure: Hadoop Data Lake and Data Warehouse combined into a Hybrid Architecture]

Hybrid BI & Big Data Reference Architecture
[Figure: reference architecture spanning the Data Lake and the DWH]
Zones and layers: Source Systems, Landing Zone, Raw Data, Archive, Standardized Data, Reservoir Data, Speed Layer, Business Data, Use Case Data, Interface Data, Portal
Processing steps: Ingestion, Provisioning, Standardizing, Integration, Customizing
Access levels: Raw Access, Early Access, Business Ready

Why still keep the Data Warehouse?
Protection of investments
• Proven technology for reliable results (e.g. reporting)
• Employees' skills
• Existing analyses, reports & applications
Query performance for relational Use Cases
• Indexes & Hints
• Mature optimizer
Data quality and stability
• Schema-on-write: deliberate data modelling
• Defined data types
• Transaction concept (all or nothing)
• Constraints: uniqueness (PK), reference (FK), required attributes (NOT NULL), …
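The quality guarantees listed above (defined data types, all-or-nothing transactions, PK/FK/NOT NULL constraints) can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch under stated assumptions: SQLite stands in for a real warehouse database, and the `product` and `sale` tables as well as the `load_batch` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks FKs only when enabled

# Schema-on-write: types and constraints are declared before any data arrives.
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE sale ("
    " id INTEGER PRIMARY KEY,"
    " product_id INTEGER NOT NULL REFERENCES product(id),"
    " amount REAL NOT NULL)"
)
with conn:
    conn.execute("INSERT INTO product (id, name) VALUES (1, 'Widget')")

def load_batch(rows):
    """Insert all rows or none; return True if the batch was committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany(
                "INSERT INTO sale (id, product_id, amount) VALUES (?, ?, ?)",
                rows,
            )
        return True
    except sqlite3.IntegrityError:
        return False

def sale_count():
    """Number of rows currently stored in the sale table."""
    return conn.execute("SELECT COUNT(*) FROM sale").fetchone()[0]
```

A batch such as `load_batch([(1, 1, 9.5), (2, 99, 5.0)])` references a product that does not exist, so the whole batch is rolled back and `sale_count()` stays at 0, including the first row, which was valid on its own.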

2. Use Cases on Hybrid Architectures

Typical IT Use Cases around a manufacturing plant:
• Long-term archiving of factory log data
• Reducing production errors via log analysis
• Product customization using a recommender
• Monthly report on sales numbers
• Production optimization through self-service analysis
• Customer satisfaction measurement through Sentiment Analysis
• Immediate alerting based on sensor streaming
• Master Data Management (MDM)
• Effective cross-selling driven by campaign management (360°)
• Support of customer service by predictive maintenance

Criteria for Use Case placement
• Data Variety
• Data Volume
• Data Velocity
• Response time
• Information Consistency
• Algorithmic Complexity

Bringing the Use Cases to the Reference Architecture
[Figure: the use cases placed on the reference architecture]
Zones and layers: Source Systems, Landing Zone, Raw Data, Archive, Stand. Data, Reservoir Data, Speed Layer, Business Data, Use Case Data, Interface Data, Portal
Use cases shown: Predictive maintenance, Archiving log data, Log analysis, Recommender, Monthly report, Self-service analysis, Sentiment Analysis, Sensor streaming, MDM, Campaign management

Conclusion: How to utilize best of a Hybrid Architecture
• Data Warehouses and Data Lakes follow different paradigms and have different strengths.
• They complement rather than replace each other.
• Hybrid Architectures make it possible to address various Use Cases:
Use Case                                       Component        Layer
Efficient Archiving                            Data Lake        Raw data archive
Streaming of sensor data                       Data Lake        Speed layer
Machine Learning                               Data Lake        Reservoir
Master Data Management                         Data Warehouse   Core layer
Self-Service Analysis and Standard Reporting   Data Warehouse   Portal applications on Datamarts

Questions, answers & discussion
Thanks for joining!
