This document discusses how Hadoop can be used to power a data lake and enhance traditional data warehousing approaches. It proposes a holistic data strategy with multiple layers: a landing area to store raw source data, a data lake to enrich and integrate data with light governance, a data science workspace for experimenting with new data, and a big data warehouse at the top level with fully governed and trusted data. Hadoop provides distributed storage and processing capabilities to support these layers. The document advocates a "polyglot" approach, using the right tools like Hadoop, relational databases, and cloud platforms depending on the specific workload and data type.
2. About Caserta Concepts
• Technology services company with expertise in data analysis:
• Big Data Solutions
• Data Warehousing
• Business Intelligence
• Core focus in the following industries:
• eCommerce / Retail / Marketing
• Financial Services / Insurance
• Healthcare / Higher Education
• Established in 2001:
• Increased growth year-over-year
• Industry recognized work force
• Strategy, Implementation
• Writing, Education, Mentoring
• Data Science & Analytics
• Data on the Cloud
• Data Interaction & Visualization
3. When it comes to building efficient Data
Solutions, we wrote the book!
• Time-tested proven solutions
• Staging, cleaning, integrating, delivering
• Traditional data warehousing and Big Data
warehouses
• Best practices to extract data from scattered
sources, munge and discover valuable
business information
• Sub-systems offered as project accelerators
• Comprehensive guidance to our clients to
build and populate big data solutions that
ensure quality and integrity
Authors, Innovators, Leaders
5. Hadoop and Your Data Warehouse
•Last 2 or 3 years have been more disruptive from a
data management perspective than the past 20!
•The advent of new technologies and modern data
engineering concepts has shaken traditional
concepts to their core
Proprietary Information
6. What is a Data Warehouse?
Good question!
In the traditional world – several competing, almost
religious, approaches to their design.
I think we can all agree:
•A central repository of integrated data from one or
more disparate sources
•Used for Reporting and Analysis
•Reliability, Trust, and Data Governance
7. Data Governance
A program consisting of:
• Metadata
• Security
• Data Quality
• Master Data Management
• Information Lifecycle Management (aka retention)
…with supporting processes, procedures and
organizational support
8. How Do You Build a Data Warehouse?
•Design – Top Down, Bottom Up
• Customer Interviews and requirements gathering
• Data Profiling
•Extract Transform Load data from source to data
warehouse
•Create Facts and Dimensions
•Put a BI tool on top
•Develop reports
•Data Governance
9. The Traditional Conversation
• Kimball Vs. Inmon
• Dimensional vs. 3rd normal form
• What hardware do we need (that will be ready in 6 months)
• Oracle vs SQL Server, Postgres or MySQL if we were brave
(and cheap)
• Which ETL tool should we BUY: Informatica, DataStage?
• Which BI tool should sit on top: Business Objects,
Cognos?
10. The New Conversation
• Do we need a Data Warehouse at all?
• If we do, does it need to be relational?
• Should we leverage Hadoop or NoSQL?
• Which platform and language are we going to code in?
• What bleeding-edge Apache project should we put in
production?
11. So Why Change?
New technologies are great and all… but what drives our
adoption of new technologies and techniques?
• Data has changed: semi-structured, unstructured, sparse
and evolving schemas
• Volumes have changed: GB to TB to PB workloads
• Cracks in the Armor of Traditional Data Warehousing
approach!
AND MOST IMPORTANTLY:
Companies that innovate and leverage their data win!
12. Cracks in the Armor
• Onboarding new data is difficult!
• Rigidity and Data Governance
• Disconnect from business requirement:
“Hey – I need to analyze some new source”
Conform and analyze the data
Load it into dimensional models
Build a semantic layer nobody is going to use
Create a dashboard we hope someone will notice
…and then you can have at it 3-6 months later to see if it has
value!
13. And then there is…
70% FAILURE RATE
• Semi-scientific analysis suggests the majority of data
analytics projects fail…
• And of those that don’t fail, only a fraction are deemed a
“success”; the others just finish!
• Data is just REALLY hard, especially without the right
strategy
What do we think the Data Governance failure rate is?
14. += Data Scientist
• New breed of data consumers
• They love the conformed clean warehouse data
• But they also are responsible for new insights
• Source data not yet modeled in the data warehouse
• New Data Sources
• Workloads not supported by traditional facts and
dimensions: network analysis, text analytics, and many
more…
15. Traditional Warehousing All Wrong?
NO!
The concept of a Data Warehouse is sound:
• Consolidating data from disparate source systems
• Clean and conformed reference data
• Clean and integrated business facts
• Data governance (a more pragmatic version)
We can be more successful by acknowledging the EDW
can’t solve all problems.
16. So what’s missing?
The Data Lake
A storage and processing layer for all data
• Store anything: source data, semi-structured,
unstructured, structured
• Keep it as long as needed
• Support a number of processing workloads
• Scale-out
..and here is where Hadoop
can help us!
17. Hadoop Powers the Data Lake
Hadoop Provides us:
• Distributed storage (HDFS)
• Resource management (YARN)
• Many workloads, not just MapReduce
18. ..but we need to think Holistic Data Strategy
[Pyramid, bottom to top: Landing Area – Source Data in “Full Fidelity”; Data Lake – Integrated Sandbox; Data Science Workspace; Big Data Warehouse]
• Data Governance is tunable and pragmatic
• Some analytics are suited for the Data Warehouse, while many are not
19. About those layers
[Same pyramid, annotated per tier:]
1. Landing Area – Source Data in “Full Fidelity”: raw machine data collection, collect everything. Metadata Catalog; ILM (who has access, how long do we “manage it”).
2. Data Lake – Integrated Sandbox: Metadata Catalog; ILM; Data Quality and Monitoring (monitoring of completeness of data).
3. Data Science Workspace: agile business insight through data-munging, machine learning, blending with external data, development of to-be BDW facts. Metadata Catalog; ILM (who has access, how long to “manage it”).
4. Big Data Warehouse: data is ready to be turned into information: organized, well defined, complete. Fully data governed (trusted); user community runs arbitrary queries and reporting.
The Hadoop Data Lake has different governance
demands at each tier.
Only the top tier of the pyramid is fully governed.
We refer to this as the Trusted tier of the Big Data
Warehouse.
20. Why we need “Tunable” Data Governance
•Dumping data into Hadoop with no repeatable
process, procedure, or data governance will create
a mess
• No Data Conformance
• No Master Data Management
• No Data Quality processes
• No Trust
…is the alternative to apply Data Governance too
rigidly?
21. Peeling back the layer…
Landing
•Source data in its full fidelity
•Programmatically Loaded
•Partitioned for data processing
•No governance other than catalog and ILM (Security
and Retention)
•Consumers: Data Scientists, ETL Processes,
Applications
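The “programmatically loaded, partitioned” landing layer can be sketched as a path convention. This is a minimal illustration, not a prescribed layout: the `/landing/<source>/<dataset>/` structure and Hive-style `year=/month=/day=` partition keys are assumptions.

```python
from datetime import date

def landing_path(source: str, dataset: str, load_date: date) -> str:
    """Build a partitioned landing path for raw source data.

    Layout (hypothetical, for illustration): /landing/<source>/<dataset>/
    with Hive-style date partitions so downstream ETL can process one
    day's load at a time without scanning the full history.
    """
    return (
        f"/landing/{source}/{dataset}/"
        f"year={load_date.year:04d}/month={load_date.month:02d}/day={load_date.day:02d}"
    )

print(landing_path("crm", "orders", date(2015, 6, 1)))
# /landing/crm/orders/year=2015/month=06/day=01
```

Because the path is generated by code rather than by hand, every load lands in a predictable place, which is the only “governance” this tier promises beyond catalog and ILM.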
22. Data Lake
•Enriched, lightly integrated
•Data is accessible via the Hive Metastore
• Either processed into tabular relations
• Or via Hive SerDes directly on the raw data
•Partitioned for data access
•Governance additionally includes a guarantee of
completeness
•Consumers: Data Scientists, ETL Processes,
Applications, Data Analysts
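The “Hive SerDe directly on raw data” option can be sketched as DDL generation: an external table projects a tabular schema over JSON files in the landing area without copying them. The table name, columns, and location below are hypothetical; the SerDe class is the JSON SerDe that ships with Hive’s HCatalog.

```python
def external_table_ddl(table: str, location: str, columns: dict) -> str:
    """Generate Hive DDL that exposes raw JSON files as a queryable table.

    `columns` maps column names to Hive types. The data stays in place;
    only the projection (schema-on-read) is registered in the Metastore.
    """
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
        "PARTITIONED BY (load_date STRING)\n"
        "ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'\n"
        f"LOCATION '{location}'"
    )

ddl = external_table_ddl(
    "lake.orders", "/landing/crm/orders",
    {"order_id": "BIGINT", "amount": "DOUBLE"},
)
print(ddl)
```

An ETL job would submit this DDL once, then add partitions as new landing-area loads arrive.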
23. Side Note – Unstructured Data
A structure must be extracted or applied in just about every
case imaginable before analysis can be performed.
Full data governance can only be applied to “structured”
data.
This can include materialized endpoints such as files or
tables OR projections such as a Hive table
Governed structured data must have:
• A known schema with metadata
• A known and certified lineage
• A monitored, quality-tested, managed process for ingestion
and transformation
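Those three requirements can act as a promotion gate into the trusted tier. The sketch below assumes a hypothetical catalog entry shape (`schema`, `lineage`, `quality_checks_passed`); the point is the checklist, not the field names.

```python
def may_promote_to_bdw(meta: dict) -> bool:
    """Gate a dataset's promotion into the trusted Big Data Warehouse tier.

    The three checks mirror the requirements above: a known schema with
    metadata, a certified lineage, and a monitored, quality-tested
    ingestion process. `meta` is a hypothetical catalog entry.
    """
    has_schema = bool(meta.get("schema"))
    certified = meta.get("lineage", {}).get("certified") is True
    monitored = meta.get("quality_checks_passed") is True
    return has_schema and certified and monitored

entry = {
    "schema": {"order_id": "BIGINT", "amount": "DOUBLE"},
    "lineage": {"source": "crm", "certified": True},
    "quality_checks_passed": True,
}
print(may_promote_to_bdw(entry))  # True
```

Anything failing the gate stays in the lake or workspace tiers, where governance is deliberately lighter.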
24. Data Science Workspace
•No barrier for onboarding and analysis of new data
•Blending of new data with entire Data Lake,
including the Big Data Warehouse
•No governance other than ILM
•Consumers: Data Scientists Only!
25. Big Data Warehouse
•Data is Fully Governed
•Data is Structured
•Partitioned/tuned for data access
•Governance includes a guarantee of completeness
and accuracy
•Consumers: Data Scientists, ETL Processes,
Applications, Data Analysts, and Business Users
26. The Refinery
[Diagram: “cool new data” enters the Landing Area, flows up through the Data Lake and Data Science Workspace, and new insights graduate into the BDW]
•The feedback loop between Data Science and Data
Warehouse is critical
•Successful work products of data science must graduate
into the appropriate layers of the Data Lake
27. So where does this Big Data Warehouse Live?
Per Martin Fowler (http://martinfowler.com):
“Polyglot Persistence - where any decent sized
enterprise will have a variety of different data
storage technologies for different kinds of data.
There will still be large amounts of it managed in
relational stores, but increasingly we'll be first asking
how we want to manipulate the data and only then
figuring out what technology is the best bet for it…”
Abridged Version: Use the right tool for the job!
28. Polyglot Warehouse
We promote the concept that the Big Data
Warehouse may live in one or more platforms
•Full Hadoop Solutions
•Hadoop plus MPP or Relational
Supplemental technologies:
•NoSQL: Columnar, Key value, Timeseries, Graph
•Search Technologies
29. Hadoop Data Warehouse
•Hadoop is the platform for the entire data lake
including the Big Data Warehouse
•Serves as the Data Lake and “Refinery”
•Query engines such as Hive and Impala provide SQL
support
30. Hadoop + Relational
•Hadoop is the platform for the Data Lake and
Refinery
•The Active Set is federated out into an MPP or
relational presentation layer
•Serves as a good model when there is an existing MPP
or relational Data Warehouse in place
31. On the Cloud
AWS and other cloud providers present a very
powerful design pattern:
•S3 serves as the storage layer for the Data Lake
•EMR (Elastic Hadoop) provides the Refinery, most
clusters can be ephemeral
•The Active Set is stored in Redshift, MPP, or
relational platforms
Replace a massive on-premise footprint with only a
handful of machines!
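The “ephemeral EMR” part of this pattern can be sketched as the request a pipeline would pass to boto3’s `run_job_flow`: the cluster runs one refinery step against S3 and shuts itself down. Cluster name, sizes, and S3 paths below are placeholders, and the dict is only built, not submitted.

```python
def ephemeral_emr_request(name: str, log_uri: str, script_s3: str) -> dict:
    """Build a run_job_flow-style request for an ephemeral refinery cluster.

    KeepJobFlowAliveWhenNoSteps=False is what makes the cluster ephemeral:
    once the refinery step finishes, EMR terminates the instances, so you
    pay only for the processing window while the data persists in S3.
    """
    return {
        "Name": name,
        "LogUri": log_uri,
        "Instances": {
            "InstanceCount": 5,                     # placeholder sizing
            "MasterInstanceType": "m3.xlarge",
            "SlaveInstanceType": "m3.xlarge",
            "KeepJobFlowAliveWhenNoSteps": False,   # terminate when done
        },
        "Steps": [{
            "Name": "refinery",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", script_s3],
            },
        }],
    }

req = ephemeral_emr_request(
    "nightly-refinery", "s3://example-logs/emr/",
    "s3://example-code/refine.py",
)
```

A scheduler would submit one of these per load window; the Active Set the step produces is then copied into Redshift for the presentation layer.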
32. In Summary
•The principles of Data Warehousing still
make sense
•Recognize gaps in feature/functionality of the
Relational Database, and traditional Data
Warehousing
•Believe in the Data Lake and accept Tunable
Governance
•Think Polyglot Warehouse and use the right
tool for the job
We focused our attention on building a single version of the truth.
We mainly applied data governance on the EDW itself and a few primary supporting systems, like MDM.
We had a fairly restrictive set of tools for using the EDW data (enterprise BI tools), so it was easier to GOVERN how the data would be used.