BEST PRACTICES FOR
THE APACHE HADOOP
DATA WAREHOUSE
EDW 101 FOR HADOOP
PROFESSIONALS
RALPH KIMBALL / ELI COLLINS
MAY 2014
Best Practices for the Hadoop Data Warehouse
© Ralph Kimball, Cloudera, 2014
May 2014
The Enterprise Data Warehouse
Legacy
 More than 30 years, countless successful
installations, billions of dollars
 Fundamental architecture best practices
 Business user driven: simple, fast, relevant
 Best designs driven by actual data, not top-down
models
 Enterprise entities: dimensions, facts, and primary
keys
 Time variance: slowly changing dimensions
 Integration: conformed dimensions
 These best practices also apply to Hadoop
systems
Expose the Data as
Dimensions and Facts
 Dimensions are the enterprise’s fundamental
entities
 Dimensions are a strategic asset
separate from any given data source
 Dimensions need to be attached to each source
 Measurement EVENTS are 1-to-1 with
Fact Table RECORDS
 The GRAIN of a fact table is the physical
world’s description of the measurement event
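The grain rule above — one fact row per measurement event, foreign-keyed to dimension members — can be sketched as follows. The table and column names are illustrative, not from the deck:

```python
from dataclasses import dataclass
from datetime import datetime

# One row per dimension member, identified by a surrogate key.
@dataclass
class PatientDim:
    patient_sk: int      # surrogate primary key
    patient_nk: str      # durable natural key from the source
    name: str

# One row per measurement event; the grain here is
# "patient event during hospital stay".
@dataclass
class HospitalEventFact:
    patient_sk: int      # FK to PatientDim
    date_key: int        # FK to a date dimension
    event_time: datetime
    event_type: str

p = PatientDim(patient_sk=1, patient_nk="MRN-00042", name="Jane Doe")
f = HospitalEventFact(patient_sk=p.patient_sk, date_key=20140529,
                      event_time=datetime(2014, 5, 29, 10, 15),
                      event_type="procedure")
```

Every measurement event produces exactly one `HospitalEventFact` record; anything coarser or finer than the declared grain belongs in a different fact table.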
A Health Care Use Case
 Grain = Patient Event During Hospital Stay
Importing Raw Data into Hadoop
 Ingesting and transforming raw data from diverse
sources for analysis is where Hadoop shines
 What: Medical device data, doctors’ notes, nurses’ notes,
medications administered, procedures performed,
diagnoses, lab tests, X-rays, ultrasound exams, therapists’
reports, billing, ...
 From: Operational RDBMSs, enterprise data warehouse,
human entered logs, machine generated data files, special
systems, ...
 Use native ingest tools & 3rd party data integration
products
 Always retain original data in full fidelity
 Keep data files “as is” or use Hadoop native formats
 Opportunistically add data sources → Agile!
Importing Raw Data into Hadoop
 First step: get hospital procedures from billing
RDBMS, doctors’ notes from RDBMS, patient info
from DW, ...
 As well as X-rays from radiology system
$ sqoop import
--connect jdbc:oracle:thin:@db.server.com/BILLING
--table PROCEDURES
--target-dir /ingest/procedures/2014_05_29
$ hadoop fs -put /dcom_files/2014_05_29
hdfs://server.com/ingest/xrays/2014_05_29
$ sqoop import … /EMR … --table CLINICAL_NOTES
$ sqoop import … /CDR … --table PATIENT_INFO
Plan the Fact Table
 Second step: explore raw data immediately
before committing to physical data
transformations
 Third step: create queries on raw data that will be
the basis for extracts from each source at the correct
grain
> CREATE EXTERNAL TABLE procedures_raw(
date_key bigint,
event timestamp, …)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/demo/procedures';
Building the Fact Table
 Fourth step: Build up “native” table for facts using
special logic from extract queries created in step 3:
> CREATE TABLE hospital_events(…)
PARTITIONED BY (date_key) STORED AS PARQUET;
> INSERT INTO TABLE hospital_events
SELECT <special logic> FROM procedures_raw;
… SELECT <special logic> FROM patient_monitor_raw;
… SELECT <special logic> FROM clinical_notes_raw;
… SELECT <special logic> FROM device_17_raw;
… SELECT <special logic> FROM radiology_reports_raw;
… SELECT <special logic> FROM meds_administered_raw;
… and more
The Patient Dimension
 Primary key is a
“surrogate key”
 Durable identifier is
original “natural key”
 50 attributes typical
 Dimension is
instrumented for
episodic (slow)
changes
Manage Your Primary Keys
 “Natural” keys from source (often “un-natural”!)
 Poorly administered, overwritten, duplicated
 Awkward formats, implied semantic content
 Profoundly incompatible across data sources
 Replace or remap natural keys
 Enterprise dimension keys are surrogate keys
 Replace or remap in all dimension and fact tables
 Attach high value enterprise dimensions to every
source just by replacing the original natural keys
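A minimal sketch of surrogate-key assignment (the names here are hypothetical, not from the deck): each unseen natural key gets the next surrogate key, and the accumulated NK→SK pairs form the mapping table used to re-key every source.

```python
from itertools import count

def build_mapping(natural_keys, mapping=None):
    """Assign a surrogate key to each unseen natural key.

    Existing assignments are preserved, so the mapping table can be
    extended as new source rows arrive."""
    mapping = {} if mapping is None else mapping
    counter = count(start=len(mapping) + 1)
    for nk in natural_keys:
        if nk not in mapping:
            mapping[nk] = next(counter)
    return mapping

# Duplicated natural keys map to the same surrogate key.
m = build_mapping(["MRN-00042", "MRN-00017", "MRN-00042"])
# m == {"MRN-00042": 1, "MRN-00017": 2}
```

The surrogate keys carry no semantic content, which is exactly the point: the awkward, source-specific natural keys stay in the mapping table, not in the enterprise dimensions.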
Inserting Surrogate Keys in
Facts
 Re-write fact tables with dimension SKs
 Join the original facts (carrying NKs) to the
NK→SK mapping tables
 Insert the re-keyed rows into the target fact table
 Append deltas to the facts and mapping tables
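The join-and-insert flow on this slide might look like the following sketch; the field names are assumptions, not from the deck:

```python
def rekey_facts(fact_rows, mapping):
    """Replace the natural key in each fact row with its surrogate key,
    mirroring the join of original facts to the NK->SK mapping table."""
    rekeyed = []
    for row in fact_rows:
        out = dict(row)                         # leave the source row intact
        out["patient_sk"] = mapping[out.pop("patient_nk")]
        rekeyed.append(out)
    return rekeyed

mapping = {"MRN-00042": 1, "MRN-00017": 2}
facts = [{"patient_nk": "MRN-00042", "date_key": 20140529, "amount": 120.0}]
target = rekey_facts(facts, mapping)
# target == [{"date_key": 20140529, "amount": 120.0, "patient_sk": 1}]
```

Appending a delta is the same operation run over only the new fact rows, after the mapping table has been extended with any new natural keys.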
Track Time Variance
 Dimensional entities change slowly and
episodically
 EDW has responsibility to correctly represent
history
 Must provide for multiple historically time-stamped
versions of all dimension members
 SCDs: Slowly Changing Dimensions
 SCD Type 1: Overwrite dimension member, lose
history
 SCD Type 2: Add new time-stamped dimension
member record, track history
Options for Implementing SCD 2
 Re-import the dimension table each time
 Or, import and merge the delta
 Or, re-build the table in Hadoop
 Implement complex merges with an integrated
ETL tool, or in SQL via Impala or Hive
$ sqoop import
--table patient_info
--incremental lastmodified
--check-column SCD2_EFFECTIVE_DATETIME
--last-value "2014-05-29 01:01:01"
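The Type 2 merge itself (expire the current row, insert the new version) could be expressed in SQL roughly as below, shown here against SQLite purely for illustration; the Impala/Hive equivalent and the column names are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient_dim (
    patient_sk INTEGER PRIMARY KEY,
    patient_nk TEXT,
    state TEXT,
    effective_date TEXT,
    expiry_date TEXT          -- '9999-12-31' marks the current row
);
INSERT INTO patient_dim VALUES
    (1, 'MRN-00042', 'CA', '2014-01-01', '9999-12-31');
""")

def scd2_merge(conn, nk, new_state, as_of):
    # Expire the current version of this dimension member...
    conn.execute(
        "UPDATE patient_dim SET expiry_date = ? "
        "WHERE patient_nk = ? AND expiry_date = '9999-12-31'",
        (as_of, nk))
    # ...and add a new time-stamped version, preserving history.
    conn.execute(
        "INSERT INTO patient_dim (patient_nk, state, effective_date, expiry_date) "
        "VALUES (?, ?, ?, '9999-12-31')",
        (nk, new_state, as_of))

scd2_merge(conn, "MRN-00042", "OR", "2014-05-29")
rows = conn.execute(
    "SELECT state, expiry_date FROM patient_dim ORDER BY patient_sk").fetchall()
# rows == [('CA', '2014-05-29'), ('OR', '9999-12-31')]
```

On HDFS, where in-place UPDATE is not available, the same effect is achieved by re-writing the dimension partition with the expired and new rows.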
Integrate Data Sources at the BI
Layer
 If the dimensions of two sources are not
“conformed” then the sources cannot be
integrated
 Two dimensions are conformed if they share
attributes (fields) that have the same domains
and same content
 The integration payload: drill-across analyses
grouped on the shared conformed attributes
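If two sources carry a conformed attribute (same domain, same content), integration amounts to aggregating each source by that attribute and merging on it. A sketch with made-up data; none of these names come from the deck:

```python
from collections import defaultdict

def rollup(rows, dim_attr, measure):
    """Aggregate one source's facts by a conformed dimension attribute."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[dim_attr]] += r[measure]
    return dict(totals)

# Two independent sources sharing the conformed attribute "state".
billing = [{"state": "CA", "charges": 500.0}, {"state": "OR", "charges": 200.0}]
lab     = [{"state": "CA", "tests": 3}, {"state": "CA", "tests": 2}]

by_state_charges = rollup(billing, "state", "charges")
by_state_tests   = rollup(lab, "state", "tests")

# Drill-across: merge the per-source rollups on the conformed attribute.
report = {s: (by_state_charges.get(s, 0.0), by_state_tests.get(s, 0.0))
          for s in set(by_state_charges) | set(by_state_tests)}
# report["CA"] == (500.0, 5.0)
```

If "state" were coded differently in the two sources, the merge would silently misalign, which is why conforming the attribute domains comes first.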
Conforming Dimensions in
Hadoop
 Goal: combine diverse data sets in a single
analysis
 Conform operational and analytical schemas
via key dimensions (user, product, geo)
 Build and use mapping tables (à la SK handling)
> CREATE TABLE patient_tmp LIKE patient_dim;
> ALTER TABLE patient_tmp ADD COLUMNS (state_conf int);
> INSERT INTO TABLE patient_tmp (SELECT … );
> DROP TABLE patient_dim;
> ALTER TABLE patient_tmp RENAME TO patient_dim;
tedious!
Integrate Data Sources at the BI
Layer
 Traditional data warehouse personas
 Dimension manager – responsible for defining and
publishing the conformed dimension content
 Fact provider – owner and publisher of fact table,
attached to conformed dimensions
 New Hadoop personas
 “Robot” dimension manager – using auto schema
inference, pattern matching, similarity matching, …
What’s Easy and What’s
Challenging in Hadoop as of May
2014
 Easy
 Assembling/investigating radically diverse data
sources
 Scaling out to any size at any velocity
 Somewhat challenging
 Building extract logic for each diverse data source
 Updating and appending to existing HDFS files
(requires rewrite – straightforward but slow)
 Generating surrogate keys in a profoundly
distributed environment
 Stay tuned!
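One common workaround for the distributed key-generation problem (a sketch, not from the deck): give each partition or task a disjoint key range, so keys can be generated with no cross-node coordination.

```python
def partition_keys(partition_id, n, block_size=1_000_000):
    """Return n surrogate keys unique to this partition: each task owns
    the half-open range [partition_id*block_size, (partition_id+1)*block_size)."""
    assert n <= block_size, "partition exhausted its key block"
    base = partition_id * block_size
    return [base + i for i in range(n)]

# Two tasks generate keys independently...
k0 = partition_keys(0, 3)   # [0, 1, 2]
k1 = partition_keys(1, 3)   # [1000000, 1000001, 1000002]
# ...and their key ranges never collide.
```

The keys are sparse rather than consecutive, which is harmless for surrogate keys since they carry no semantic content.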
What Have We Accomplished
 Identified essential best practices from the EDW
world
 Business driven
 Dimensional approach
 Handling time variance with SCDs and surrogate
keys
 Integrating arbitrary sources with conformed
dimensions
 Shown examples of how to implement each best
practice in Hadoop
 Provided a realistic assessment of the current
state of Hadoop
The Kimball Group Resource
 www.kimballgroup.com
 Best-selling data warehouse books, including the
new 3rd edition of the classic “Toolkit”
 In depth data warehouse classes
taught by primary authors
 Dimensional modeling (Ralph/Margy)
 ETL architecture (Ralph/Bob)
 Dimensional design reviews and consulting
by Kimball Group principals
 White Papers
on Integration, Data Quality, and Big Data Analytics