This document discusses how Hadoop can be used to power a data lake and enhance traditional data warehousing approaches. It proposes a holistic data strategy with multiple layers: a landing area to store raw source data, a data lake to enrich and integrate data with light governance, a data science workspace for experimenting with new data, and a big data warehouse at the top level with fully governed and trusted data. Hadoop provides distributed storage and processing capabilities to support these layers. The document advocates a "polyglot" approach, using the right tools like Hadoop, relational databases, and cloud platforms depending on the specific workload and data type.
2. About Caserta Concepts
• Technology services company with expertise in data analysis:
• Big Data Solutions
• Data Warehousing
• Business Intelligence
• Core focus in the following industries:
• eCommerce / Retail / Marketing
• Financial Services / Insurance
• Healthcare / Higher Education
• Established in 2001:
• Increased growth year-over-year
• Industry recognized work force
• Strategy, Implementation
• Writing, Education, Mentoring
• Data Science & Analytics
• Data on the Cloud
• Data Interaction & Visualization
3. When it comes to building efficient Data
Solutions, we wrote the book!
• Time-tested proven solutions
• Staging, cleaning, integrating, delivering
• Traditional data warehousing and Big Data
warehouses
• Best practices to extract data from scattered
sources, munge and discover valuable
business information
• Sub-systems offered as project accelerators
• Comprehensive guidance to our clients to
build and populate big data solutions that
ensure quality and integrity
Authors, Innovators, Leaders
5. Hadoop and Your Data Warehouse
•Last 2 or 3 years have been more disruptive from a
data management perspective than the past 20!
•The advent of new technologies and modern data
engineering concepts has shaken traditional
concepts to their core
Proprietary Information
6. What is a Data Warehouse?
Good question!
In the traditional world – several competing, almost
religious, approaches to their design.
I think we can all agree:
•A central repository of integrated data from one or
more disparate sources
•Used for Reporting and Analysis
•Reliability, Trust, and Data Governance
7. Data Governance
A program consisting of:
• Metadata
• Security
• Data Quality
• Master Data Management
• Information Lifecycle Management (aka retention)
…with supporting processes, procedures and
organizational support
8. How Do You Build a Data Warehouse?
•Design – Top Down, Bottom Up
• Customer Interviews and requirements gathering
• Data Profiling
•Extract Transform Load data from source to data
warehouse
•Create Facts and Dimensions
•Put a BI tool on top
•Develop reports
•Data Governance
9. The Traditional Conversation
• Kimball Vs. Inmon
• Dimensional vs. 3rd normal form
• What hardware do we need (that will be ready in 6 months)
• Oracle vs SQL Server, Postgres or MySQL if we were brave
(and cheap)
• Which ETL tool should we BUY: Informatica, DataStage?
• Which BI tool should sit on top: Business Objects,
Cognos?
10. The New Conversation
• Do we need a Data Warehouse at all?
• If we do, does it need to be relational?
• Should we leverage Hadoop or NoSQL?
• Which platform and language are we going to code in?
• What bleeding-edge Apache project should we put in
production?
11. So Why Change?
New technologies are great and all… but what drives our
adoption of new technologies and techniques?
• Data has changed: semi-structured, unstructured, sparse
and evolving schemas
• Volumes have changed: GB to TB to PB workloads
• Cracks in the Armor of Traditional Data Warehousing
approach!
AND MOST IMPORTANTLY:
Companies that innovate and leverage their data win!
12. Cracks in the Armor
• Onboarding new data is difficult!
• Rigidity and Data Governance
• Disconnect from business requirement:
“Hey – I need to analyze some new source”
Conform and analyze the data
Load it into dimensional models
Build a semantic layer nobody is going to use
Create a dashboard we hope someone will notice
…and then you can have at it 3-6 months later to see if it has
value!
13. And then there is…
70% FAILURE RATE
• Semi-scientific analysis suggests the majority of data
analytics projects fail…
• And of those that don’t fail, only a fraction are deemed a
“success”; the others just finish!
• Data is just REALLY hard, especially without the right
strategy
What do we think the Data Governance failure rate is?
14. += Data Scientist
• New breed of data consumers
• They love the conformed clean warehouse data
• But they also are responsible for new insights
• Source data not yet modeled in the data warehouse
• New Data Sources
• Workloads not supported by traditional facts and
dimensions: network analysis, text analytics, and many
more…
15. Traditional Warehousing All Wrong?
NO!
The concept of a Data Warehouse is sound:
• Consolidating data from disparate source systems
• Clean and conformed reference data
• Clean and integrated business facts
• Data governance (a more pragmatic version)
We can be more successful by acknowledging the EDW
can’t solve all problems.
16. So what’s missing?
The Data Lake
A storage and processing layer for all data
• Store anything: source data, semi-structured,
unstructured, structured
• Keep it as long as needed
• Support a number of processing workloads
• Scale-out
..and here is where Hadoop
can help us!
17. Hadoop Powers the Data Lake
Hadoop Provides us:
• Distributed storage (HDFS)
• Resource management (YARN)
• Many workloads, not just MapReduce
18. ..but we need to think Holistic Data Strategy
[Pyramid, bottom to top: Landing Area – Source Data in “Full Fidelity”; Data Lake – Integrated Sandbox; Data Science Workspace; Big Data Warehouse]
• Data Governance is tunable and pragmatic
• Some analytics are suited for the Data Warehouse, while many are not
19. About those layers
[Same pyramid, annotated per tier:]
1. Landing Area – Source Data in “Full Fidelity”: raw machine data collection, collect everything. Metadata Catalog; ILM (who has access, how long do we “manage it”).
2. Data Lake – Integrated Sandbox: Metadata Catalog; ILM; Data Quality and Monitoring (monitoring of completeness of data).
3. Data Science Workspace: agile business insight through data-munging, machine learning, blending with external data, development of to-be BDW facts. Metadata Catalog; ILM (who has access, how long to “manage it”).
4. Big Data Warehouse: data is ready to be turned into information: organized, well defined, complete. Fully data governed (trusted); user community runs arbitrary queries and reporting.
The Hadoop Data Lake has different governance
demands at each tier.
Only the top tier of the pyramid is fully governed.
We refer to this as the Trusted tier of the Big Data
Warehouse.
20. Why we need “Tunable” Data Governance
•Dumping data into Hadoop with no repeatable
process, procedure, or data governance will create
a mess
• No Data Conformance
• No Master Data Management
• No Data Quality processes
• No Trust
…is the alternative to apply Data Governance too
rigidly?
21. Peeling back the layer…
Landing
•Source data in its full fidelity
•Programmatically Loaded
•Partitioned for data processing
•No governance other than catalog and ILM (Security
and Retention)
•Consumers: Data Scientists, ETL Processes,
Applications
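The “programmatically loaded, partitioned” landing layer can be sketched as a path convention. This is a minimal illustration, not a prescribed layout: the `/landing/<source>/<dataset>/` structure and Hive-style `year=/month=/day=` partition keys are assumptions.

```python
from datetime import date

def landing_path(source: str, dataset: str, load_date: date) -> str:
    """Build a partitioned landing path for raw source data.

    Layout (hypothetical, for illustration): /landing/<source>/<dataset>/
    with Hive-style date partitions so downstream ETL can process one
    day's load at a time without scanning the full history.
    """
    return (
        f"/landing/{source}/{dataset}/"
        f"year={load_date.year:04d}/month={load_date.month:02d}/day={load_date.day:02d}"
    )

print(landing_path("crm", "orders", date(2015, 6, 1)))
# /landing/crm/orders/year=2015/month=06/day=01
```

Because the path is generated by code rather than by hand, every load lands in a predictable place, which is the only “governance” this tier promises beyond catalog and ILM.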
22. Data Lake
•Enriched, lightly integrated
•Data is accessible via the Hive Metastore
• Either processed into tabular relations
• Or via Hive SerDes directly on the raw data
•Partitioned for data access
•Governance additionally includes a guarantee of
completeness
•Consumers: Data Scientists, ETL Processes,
Applications, Data Analysts
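The “Hive SerDe directly on raw data” option can be sketched as DDL generation: an external table projects a tabular schema over JSON files in the landing area without copying them. The table name, columns, and location below are hypothetical; the SerDe class is the JSON SerDe that ships with Hive’s HCatalog.

```python
def external_table_ddl(table: str, location: str, columns: dict) -> str:
    """Generate Hive DDL that exposes raw JSON files as a queryable table.

    `columns` maps column names to Hive types. The data stays in place;
    only the projection (schema-on-read) is registered in the Metastore.
    """
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
        "PARTITIONED BY (load_date STRING)\n"
        "ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'\n"
        f"LOCATION '{location}'"
    )

ddl = external_table_ddl(
    "lake.orders", "/landing/crm/orders",
    {"order_id": "BIGINT", "amount": "DOUBLE"},
)
print(ddl)
```

An ETL job would submit this DDL once, then add partitions as new landing-area loads arrive.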
23. Side Note – Unstructured Data
A structure must be extracted or applied in just about every
case imaginable before analysis can be performed.
Full data governance can only be applied to “structured”
data.
This can include materialized endpoints such as files or
tables OR projections such as a Hive table
Governed structured data must have:
• A known schema with metadata
• A known and certified lineage
• A monitored, quality-tested, managed process for ingestion
and transformation
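Those three requirements can act as a promotion gate into the trusted tier. The sketch below assumes a hypothetical catalog entry shape (`schema`, `lineage`, `quality_checks_passed`); the point is the checklist, not the field names.

```python
def may_promote_to_bdw(meta: dict) -> bool:
    """Gate a dataset's promotion into the trusted Big Data Warehouse tier.

    The three checks mirror the requirements above: a known schema with
    metadata, a certified lineage, and a monitored, quality-tested
    ingestion process. `meta` is a hypothetical catalog entry.
    """
    has_schema = bool(meta.get("schema"))
    certified = meta.get("lineage", {}).get("certified") is True
    monitored = meta.get("quality_checks_passed") is True
    return has_schema and certified and monitored

entry = {
    "schema": {"order_id": "BIGINT", "amount": "DOUBLE"},
    "lineage": {"source": "crm", "certified": True},
    "quality_checks_passed": True,
}
print(may_promote_to_bdw(entry))  # True
```

Anything failing the gate stays in the lake or workspace tiers, where governance is deliberately lighter.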
24. Data Science Workspace
•No barrier for onboarding and analysis of new data
•Blending of new data with entire Data Lake,
including the Big Data Warehouse
•No governance other than ILM
•Consumers: Data Scientists Only!
25. Big Data Warehouse
•Data is Fully Governed
•Data is Structured
•Partitioned/tuned for data access
•Governance includes a guarantee of completeness
and accuracy
•Consumers: Data Scientists, ETL Processes,
Applications, Data Analysts, and Business Users
26. The Refinery
[Diagram: “cool new data” enters the Landing Area, flows up through the Data Lake and Data Science Workspace, and new insights graduate into the BDW]
•The feedback loop between Data Science and Data
Warehouse is critical
•Successful work products of data science must graduate
into the appropriate layers of the Data Lake
27. So where does this Big Data Warehouse Live?
Per Martin Fowler (http://martinfowler.com):
“Polyglot Persistence - where any decent sized
enterprise will have a variety of different data
storage technologies for different kinds of data.
There will still be large amounts of it managed in
relational stores, but increasingly we'll be first asking
how we want to manipulate the data and only then
figuring out what technology is the best bet for it…”
Abridged Version: Use the right tool for the job!
28. Polyglot Warehouse
We promote the concept that the Big Data
Warehouse may live in one or more platforms
•Full Hadoop Solutions
•Hadoop plus MPP or Relational
Supplemental technologies:
•NoSQL: Columnar, Key value, Timeseries, Graph
•Search Technologies
29. Hadoop Data Warehouse
•Hadoop is the platform for the entire data lake
including the Big Data Warehouse
•Serves as the Data Lake and “Refinery”
•Query engines such as Hive and Impala provide SQL
support
30. Hadoop + Relational
•Hadoop is the platform for the Data Lake and
Refinery
•The Active Set is federated out into an MPP or
relational presentation layer
•Serves as a good model when there is an existing MPP
or relational Data Warehouse in place
31. On the Cloud
AWS and other cloud providers present a very
powerful design pattern:
•S3 serves as the storage layer for the Data Lake
•EMR (Elastic Hadoop) provides the Refinery, most
clusters can be ephemeral
•The Active Set is stored in Redshift, MPP, or
relational platforms
Replace a massive on-premise footprint with only a
handful of machines!
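The “ephemeral EMR” part of this pattern can be sketched as the request a pipeline would pass to boto3’s `run_job_flow`: the cluster runs one refinery step against S3 and shuts itself down. Cluster name, sizes, and S3 paths below are placeholders, and the dict is only built, not submitted.

```python
def ephemeral_emr_request(name: str, log_uri: str, script_s3: str) -> dict:
    """Build a run_job_flow-style request for an ephemeral refinery cluster.

    KeepJobFlowAliveWhenNoSteps=False is what makes the cluster ephemeral:
    once the refinery step finishes, EMR terminates the instances, so you
    pay only for the processing window while the data persists in S3.
    """
    return {
        "Name": name,
        "LogUri": log_uri,
        "Instances": {
            "InstanceCount": 5,                     # placeholder sizing
            "MasterInstanceType": "m3.xlarge",
            "SlaveInstanceType": "m3.xlarge",
            "KeepJobFlowAliveWhenNoSteps": False,   # terminate when done
        },
        "Steps": [{
            "Name": "refinery",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", script_s3],
            },
        }],
    }

req = ephemeral_emr_request(
    "nightly-refinery", "s3://example-logs/emr/",
    "s3://example-code/refine.py",
)
```

A scheduler would submit one of these per load window; the Active Set the step produces is then copied into Redshift for the presentation layer.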
32. In Summary
•The principles of Data Warehousing still
make sense
•Recognize gaps in feature/functionality of the
Relational Database, and traditional Data
Warehousing
•Believe in the Data Lake and accept Tunable
Governance
•Think Polyglot Warehouse and use the right
tool for the job
We focused our attention on building a single version of the truth.
We mainly applied data governance on the EDW itself and a few primary supporting systems, like MDM.
We had a fairly restrictive set of tools for using the EDW data (enterprise BI tools), so it was easier to GOVERN how the data would be used.