SlideShare une entreprise Scribd logo
1  sur  19
BEST PRACTICES FOR
THE APACHE HADOOP
DATA WAREHOUSE
EDW 101 FOR HADOOP
PROFESSIONALS
RALPH KIMBALL / ELI COLLINS
MAY 2014
Best Practices for the Hadoop Data Warehouse
© Ralph Kimball, Cloudera, 2014
May 2014
The Enterprise Data Warehouse
Legacy
 More than 30 years, countless successful
installations, billions of dollars
 Fundamental architecture best practices
 Business user driven: simple, fast, relevant
 Best designs driven by actual data, not top down
models
 Enterprise entities: dimensions, facts, and primary
keys
 Time variance: slowly changing dimensions
 Integration: conformed dimensions
 These best practices also apply to Hadoop
systems
Expose the Data as
Dimensions and Facts
 Dimensions are the enterprise’s fundamental
entities
 Dimensions are a strategic asset
separate from any given data source
 Dimensions need to be attached to each source
 Measurement EVENTS are 1-to-1 with
Fact Table RECORDS
 The GRAIN of a fact table is the physical
world’s description of the measurement event
A Health Care Use Case
 Grain = Health Care Hospital
Events
Grain = Patient Event During Hospital Stay
Importing Raw Data into Hadoop
 Ingesting and transforming raw data from diverse
sources for analysis is where Hadoop shines
 What: Medical device data, doctors’ notes, nurse’s notes,
medications administered, procedures performed,
diagnoses, lab tests, X-rays, ultrasound exams, therapists’
reports, billing, ...
 From: Operational RDBMSs, enterprise data warehouse,
human entered logs, machine generated data files, special
systems, ...
 Use native ingest tools & 3rd party data integration
products
 Always retain original data in full fidelity
 Keep data files “as is” or use Hadoop native formats
 Opportunistically add data sources  Agile!
Importing Raw Data into Hadoop
 First step: get hospital procedures from billing
RDBMS, doctors notes from RDBMS, patient info
from DW, ...
 As well as X-rays from radiology system
$ sqoop import
--connect jdbc:oracle:thin:@db.server.com/BILLING
--table PROCEDURES
--target-dir /ingest/procedures/2014_05_29
$ hadoop fs –put /dcom_files/2014_05_29
hdfs://server.com/ingest/xrays/2014_05_29
$ sqoop import … /EMR … --table CLINICAL_NOTES
$ sqoop import … /CDR … --table PATIENT_INFO
Plan the Fact Table
 Third step: create queries on raw data that will be
basis for extracts from each source at the correct
grain
> CREATE EXTERNAL TABLE procedures_raw(
date_key bigint,
event timestamp, …)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘t’
LOCATION ‘/demo/procedures’;
 Second step: explore raw data immediately
before committing to physical data
transformations
Building the Fact Table
 Fourth step: Build up “native” table for facts using
special logic from extract queries created in step 3:
> CREATE TABLE hospital_events(…)
PARTITIONED BY date_key STORED AS PARQUET;
> INSERT INTO TABLE hospital_events
SELECT <special logic> FROM procedures_raw;
… SELECT <special logic> FROM patient_monitor_raw;
… SELECT <special logic> from clinical_notes_raw;
… SELECT <special logic> from device_17_raw;
… SELECT <special logic> from radiology_reports_raw;
… SELECT <special logic> from meds_adminstered_raw;
… and more
The Patient Dimension
 Primary key is a
“surrogate key”
 Durable identifier is
original “natural key”
 50 attributes typical
 Dimension is
instrumented for
episodic (slow)
changes
Manage Your Primary Keys
 “Natural” keys from source (often “un-natural”!)
 Poorly administered, overwritten, duplicated
 Awkward formats, implied semantic content
 Profoundly incompatible across data sources
 Replace or remap natural keys
 Enterprise dimension keys are surrogate keys
 Replace or remap in all dimension and fact tables
 Attach high value enterprise dimensions to every
source just by replacing the original natural keys
Inserting Surrogate Keys in
Facts
 Re-write fact tables with dimension SKs
NK
NK
NK
SK
SK
SK
NK
NK
NK
SKNK Join
Mapping tables
Original facts
SKNK
SKNK
SKNK
Insert
NK
NK
Append deltas
to facts and
mapping tables
Target Fact Table
Track Time Variance
 Dimensional entities change slowly and
episodically
 EDW has responsibility to correctly represent
history
 Must provide for multiple historically time
stamped versions of all dimension members
 SCDs: Slowly Changing Dimensions
 SCD Type 1: Overwrite dimension member, lose
history
 SCD Type 2: Add new time stamped dimension
member record, track history
Options for Implementing SCD 2
 Re-import the dimension table each time
 Or, import and merge the delta
 Or, re-build the table in Hadoop
 Implement complex merges with an integrated
ETL tool, or in SQL via Impala or Hive
$ sqoop import
--table patient_info
--incremental lastmodified
--check-column SCD2_EFFECTIVE_DATETIME
--last-value “2014-05-29 01:01:01”
Integrate Data Sources at the BI
Layer
 If the dimensions of two sources are not
“conformed” then the sources cannot be
integrated
 Two dimensions are conformed if they share
attributes (fields) that have the same domains
and same content
 The integration payload:
Conforming Dimensions in
Hadoop
 Goal: combine diverse data sets in a single
analysis
 Conform operational and analytical schemas
via key dimensions (user, product, geo)
 Build and use mapping tables (ala SK handling)
> CREATE TABLE patient_tmp LIKE patient_dim;
> ALTER TABLE patient_tmp ADD COLUMNS (state_conf int);
> INSERT INTO TABLE patient_tmp (SELECT … );
> DROP TABLE patient_dim;
> ALTER TABLE patient_tmp RENAME TO patient_dim;
tediou
s!
Integrate Data Sources at the BI
Layer
 Traditional data warehouse personas
 Dimension manager – responsible for defining and
publishing the conformed dimension content
 Fact provider – owner and publisher of fact table,
attached to conformed dimensions
 New Hadoop personas
 “Robot” dimension manager – using auto schema
inference, pattern matching, similarity matching, …
What’s Easy and What’s
Challenging in Hadoop as of May
2014
 Easy
 Assembling/investigating radically diverse data
sources
 Scaling out to any size at any velocity
 Somewhat challenging
 Building extract logic for each diverse data source
 Updating and appending to existing HDFS files
(requires rewrite – straightforward but slow)
 Generating surrogate keys in a profoundly
distributed environment
 Stay tuned! 
What Have We Accomplished
 Identified essential best practices from the EDW
world
 Business driven
 Dimensional approach
 Handling time variance with SCDs and surrogate
keys
 Integrating arbitrary sources with conformed
dimensions
 Shown examples of how to implement each best
practice in Hadoop
 Provided realistic assessment of current state of
The Kimball Group Resource
 www.kimballgroup.com
 Best selling data warehouse books
NEW BOOK! The Classic “Toolkit” 3rd Ed.
 In depth data warehouse classes
taught by primary authors
 Dimensional modeling (Ralph/Margy)
 ETL architecture (Ralph/Bob)
 Dimensional design reviews and consulting
by Kimball Group principals
 White Papers
on Integration, Data Quality, and Big Data Analytics

Contenu connexe

Tendances

Building a Dashboard in an hour with Power Pivot and Power BI
Building a Dashboard in an hour with Power Pivot and Power BIBuilding a Dashboard in an hour with Power Pivot and Power BI
Building a Dashboard in an hour with Power Pivot and Power BINR Computer Learning Center
 
GeoKettle: A powerful open source spatial ETL tool
GeoKettle: A powerful open source spatial ETL toolGeoKettle: A powerful open source spatial ETL tool
GeoKettle: A powerful open source spatial ETL toolThierry Badard
 
Azure Data Factory for Azure Data Week
Azure Data Factory for Azure Data WeekAzure Data Factory for Azure Data Week
Azure Data Factory for Azure Data WeekMark Kromer
 
Power BI measure and visualize project success
Power BI measure and visualize project successPower BI measure and visualize project success
Power BI measure and visualize project successFisnik Doko
 
Overview SQL Server 2019
Overview SQL Server 2019Overview SQL Server 2019
Overview SQL Server 2019Juan Fabian
 
Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)Michael Rys
 
ETL Using Informatica Power Center
ETL Using Informatica Power CenterETL Using Informatica Power Center
ETL Using Informatica Power CenterEdureka!
 
Summary introduction to data engineering
Summary introduction to data engineeringSummary introduction to data engineering
Summary introduction to data engineeringNovita Sari
 
35 power bi presentations
35 power bi presentations35 power bi presentations
35 power bi presentationsSean Brady
 
Why Use a Content Delivery Network (CDN)?
Why Use a Content Delivery Network (CDN)?Why Use a Content Delivery Network (CDN)?
Why Use a Content Delivery Network (CDN)?Medianova
 
Getting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksGetting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksKnoldus Inc.
 
Microsoft power bi
Microsoft power biMicrosoft power bi
Microsoft power bitechpro360
 
Integration Solution Patterns
Integration Solution Patterns Integration Solution Patterns
Integration Solution Patterns WSO2
 

Tendances (20)

Building a Dashboard in an hour with Power Pivot and Power BI
Building a Dashboard in an hour with Power Pivot and Power BIBuilding a Dashboard in an hour with Power Pivot and Power BI
Building a Dashboard in an hour with Power Pivot and Power BI
 
Streaming in Mule
Streaming in MuleStreaming in Mule
Streaming in Mule
 
Power BI visuals
Power BI visualsPower BI visuals
Power BI visuals
 
SQL Server on AWS
SQL Server on AWSSQL Server on AWS
SQL Server on AWS
 
GeoKettle: A powerful open source spatial ETL tool
GeoKettle: A powerful open source spatial ETL toolGeoKettle: A powerful open source spatial ETL tool
GeoKettle: A powerful open source spatial ETL tool
 
Athena & Glue
Athena & GlueAthena & Glue
Athena & Glue
 
Compute Services con AWS
Compute Services con AWSCompute Services con AWS
Compute Services con AWS
 
Azure Data Factory for Azure Data Week
Azure Data Factory for Azure Data WeekAzure Data Factory for Azure Data Week
Azure Data Factory for Azure Data Week
 
Power BI measure and visualize project success
Power BI measure and visualize project successPower BI measure and visualize project success
Power BI measure and visualize project success
 
Overview SQL Server 2019
Overview SQL Server 2019Overview SQL Server 2019
Overview SQL Server 2019
 
Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)
 
ETL Using Informatica Power Center
ETL Using Informatica Power CenterETL Using Informatica Power Center
ETL Using Informatica Power Center
 
Summary introduction to data engineering
Summary introduction to data engineeringSummary introduction to data engineering
Summary introduction to data engineering
 
35 power bi presentations
35 power bi presentations35 power bi presentations
35 power bi presentations
 
Why Use a Content Delivery Network (CDN)?
Why Use a Content Delivery Network (CDN)?Why Use a Content Delivery Network (CDN)?
Why Use a Content Delivery Network (CDN)?
 
Building-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWSBuilding-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWS
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
 
Getting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksGetting Started with Delta Lake on Databricks
Getting Started with Delta Lake on Databricks
 
Microsoft power bi
Microsoft power biMicrosoft power bi
Microsoft power bi
 
Integration Solution Patterns
Integration Solution Patterns Integration Solution Patterns
Integration Solution Patterns
 

En vedette

Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...
Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...
Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...Data Con LA
 
Accelerating Insight - Smart Data Lake Customer Success Stories
Accelerating Insight - Smart Data Lake Customer Success StoriesAccelerating Insight - Smart Data Lake Customer Success Stories
Accelerating Insight - Smart Data Lake Customer Success StoriesCambridge Semantics
 
Building enterprise advance analytics platform
Building enterprise advance analytics platformBuilding enterprise advance analytics platform
Building enterprise advance analytics platformHaoran Du
 
Big data it’s impact on the finance function
Big data it’s impact on the finance functionBig data it’s impact on the finance function
Big data it’s impact on the finance functionMike Davis
 
Building A Self Service Analytics Platform on Hadoop
Building A Self Service Analytics Platform on HadoopBuilding A Self Service Analytics Platform on Hadoop
Building A Self Service Analytics Platform on HadoopCraig Warman
 
Splunk Business Analytics
Splunk Business AnalyticsSplunk Business Analytics
Splunk Business AnalyticsCleverDATA
 
Data Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesData Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesIvo Andreev
 
Webinar | Using Big Data and Predictive Analytics to Empower Distribution and...
Webinar | Using Big Data and Predictive Analytics to Empower Distribution and...Webinar | Using Big Data and Predictive Analytics to Empower Distribution and...
Webinar | Using Big Data and Predictive Analytics to Empower Distribution and...NICSA
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 

En vedette (9)

Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...
Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...
Big Data Day LA 2015 - Data Lake - Re Birth of Enterprise Data Thinking by Ra...
 
Accelerating Insight - Smart Data Lake Customer Success Stories
Accelerating Insight - Smart Data Lake Customer Success StoriesAccelerating Insight - Smart Data Lake Customer Success Stories
Accelerating Insight - Smart Data Lake Customer Success Stories
 
Building enterprise advance analytics platform
Building enterprise advance analytics platformBuilding enterprise advance analytics platform
Building enterprise advance analytics platform
 
Big data it’s impact on the finance function
Big data it’s impact on the finance functionBig data it’s impact on the finance function
Big data it’s impact on the finance function
 
Building A Self Service Analytics Platform on Hadoop
Building A Self Service Analytics Platform on HadoopBuilding A Self Service Analytics Platform on Hadoop
Building A Self Service Analytics Platform on Hadoop
 
Splunk Business Analytics
Splunk Business AnalyticsSplunk Business Analytics
Splunk Business Analytics
 
Data Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesData Warehouse Design and Best Practices
Data Warehouse Design and Best Practices
 
Webinar | Using Big Data and Predictive Analytics to Empower Distribution and...
Webinar | Using Big Data and Predictive Analytics to Empower Distribution and...Webinar | Using Big Data and Predictive Analytics to Empower Distribution and...
Webinar | Using Big Data and Predictive Analytics to Empower Distribution and...
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 

Similaire à Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals

Planning your Next-Gen Change Data Capture (CDC) Architecture in 2019 - Strea...
Planning your Next-Gen Change Data Capture (CDC) Architecture in 2019 - Strea...Planning your Next-Gen Change Data Capture (CDC) Architecture in 2019 - Strea...
Planning your Next-Gen Change Data Capture (CDC) Architecture in 2019 - Strea...Impetus Technologies
 
Testing Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of HadoopTesting Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of HadoopRTTS
 
Luo june27 1150am_room230_a_v2
Luo june27 1150am_room230_a_v2Luo june27 1150am_room230_a_v2
Luo june27 1150am_room230_a_v2DataWorks Summit
 
The 3 T's - Using Hadoop to modernize with faster access to data and value
The 3 T's - Using Hadoop to modernize with faster access to data and valueThe 3 T's - Using Hadoop to modernize with faster access to data and value
The 3 T's - Using Hadoop to modernize with faster access to data and valueDataWorks Summit
 
From Relational Database Management to Big Data: Solutions for Data Migration...
From Relational Database Management to Big Data: Solutions for Data Migration...From Relational Database Management to Big Data: Solutions for Data Migration...
From Relational Database Management to Big Data: Solutions for Data Migration...Cognizant
 
Analysis of historical movie data by BHADRA
Analysis of historical movie data by BHADRAAnalysis of historical movie data by BHADRA
Analysis of historical movie data by BHADRABhadra Gowdra
 
Modernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APSModernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APSStéphane Fréchette
 
“Adoption DSpace 7 and 8 Challenges and Solutions from Real Migration Experie...
“Adoption DSpace 7 and 8 Challenges and Solutions from Real Migration Experie...“Adoption DSpace 7 and 8 Challenges and Solutions from Real Migration Experie...
“Adoption DSpace 7 and 8 Challenges and Solutions from Real Migration Experie...4Science
 
Best Practices and Lessons Learned on Our IBM Rational Insight Deployment
Best Practices and Lessons Learned on Our IBM Rational Insight DeploymentBest Practices and Lessons Learned on Our IBM Rational Insight Deployment
Best Practices and Lessons Learned on Our IBM Rational Insight DeploymentMarc Nehme
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?James Serra
 
Hd insight overview
Hd insight overviewHd insight overview
Hd insight overviewvhrocca
 
Introduction to Data Analyst Training
Introduction to Data Analyst TrainingIntroduction to Data Analyst Training
Introduction to Data Analyst TrainingCloudera, Inc.
 
AWS Summit 2011: Big Data Analytics in the AWS cloud
AWS Summit 2011: Big Data Analytics in the AWS cloudAWS Summit 2011: Big Data Analytics in the AWS cloud
AWS Summit 2011: Big Data Analytics in the AWS cloudAmazon Web Services
 
Accelerating Big Data Analytics
Accelerating Big Data AnalyticsAccelerating Big Data Analytics
Accelerating Big Data AnalyticsAttunity
 
Bridging the Big Data Gap in the Software-Driven World
Bridging the Big Data Gap in the Software-Driven WorldBridging the Big Data Gap in the Software-Driven World
Bridging the Big Data Gap in the Software-Driven WorldCA Technologies
 
Hadoop Summit San Jose 2014: Data Discovery on Hadoop
Hadoop Summit San Jose 2014: Data Discovery on Hadoop Hadoop Summit San Jose 2014: Data Discovery on Hadoop
Hadoop Summit San Jose 2014: Data Discovery on Hadoop Sumeet Singh
 
Data discoveryonhadoop@yahoo! hadoopsummit2014
Data discoveryonhadoop@yahoo! hadoopsummit2014Data discoveryonhadoop@yahoo! hadoopsummit2014
Data discoveryonhadoop@yahoo! hadoopsummit2014thiruvel
 

Similaire à Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals (20)

Planning your Next-Gen Change Data Capture (CDC) Architecture in 2019 - Strea...
Planning your Next-Gen Change Data Capture (CDC) Architecture in 2019 - Strea...Planning your Next-Gen Change Data Capture (CDC) Architecture in 2019 - Strea...
Planning your Next-Gen Change Data Capture (CDC) Architecture in 2019 - Strea...
 
Testing Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of HadoopTesting Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of Hadoop
 
Luo june27 1150am_room230_a_v2
Luo june27 1150am_room230_a_v2Luo june27 1150am_room230_a_v2
Luo june27 1150am_room230_a_v2
 
The 3 T's - Using Hadoop to modernize with faster access to data and value
The 3 T's - Using Hadoop to modernize with faster access to data and valueThe 3 T's - Using Hadoop to modernize with faster access to data and value
The 3 T's - Using Hadoop to modernize with faster access to data and value
 
From Relational Database Management to Big Data: Solutions for Data Migration...
From Relational Database Management to Big Data: Solutions for Data Migration...From Relational Database Management to Big Data: Solutions for Data Migration...
From Relational Database Management to Big Data: Solutions for Data Migration...
 
PRAFUL_HADOOP
PRAFUL_HADOOPPRAFUL_HADOOP
PRAFUL_HADOOP
 
Analysis of historical movie data by BHADRA
Analysis of historical movie data by BHADRAAnalysis of historical movie data by BHADRA
Analysis of historical movie data by BHADRA
 
Modernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APSModernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APS
 
“Adoption DSpace 7 and 8 Challenges and Solutions from Real Migration Experie...
“Adoption DSpace 7 and 8 Challenges and Solutions from Real Migration Experie...“Adoption DSpace 7 and 8 Challenges and Solutions from Real Migration Experie...
“Adoption DSpace 7 and 8 Challenges and Solutions from Real Migration Experie...
 
Best Practices and Lessons Learned on Our IBM Rational Insight Deployment
Best Practices and Lessons Learned on Our IBM Rational Insight DeploymentBest Practices and Lessons Learned on Our IBM Rational Insight Deployment
Best Practices and Lessons Learned on Our IBM Rational Insight Deployment
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?
 
Hd insight overview
Hd insight overviewHd insight overview
Hd insight overview
 
Introduction to Data Analyst Training
Introduction to Data Analyst TrainingIntroduction to Data Analyst Training
Introduction to Data Analyst Training
 
AWS Summit 2011: Big Data Analytics in the AWS cloud
AWS Summit 2011: Big Data Analytics in the AWS cloudAWS Summit 2011: Big Data Analytics in the AWS cloud
AWS Summit 2011: Big Data Analytics in the AWS cloud
 
Accelerating Big Data Analytics
Accelerating Big Data AnalyticsAccelerating Big Data Analytics
Accelerating Big Data Analytics
 
ITReady DW Day2
ITReady DW Day2ITReady DW Day2
ITReady DW Day2
 
Bridging the Big Data Gap in the Software-Driven World
Bridging the Big Data Gap in the Software-Driven WorldBridging the Big Data Gap in the Software-Driven World
Bridging the Big Data Gap in the Software-Driven World
 
Hadoop Summit San Jose 2014: Data Discovery on Hadoop
Hadoop Summit San Jose 2014: Data Discovery on Hadoop Hadoop Summit San Jose 2014: Data Discovery on Hadoop
Hadoop Summit San Jose 2014: Data Discovery on Hadoop
 
Data discoveryonhadoop@yahoo! hadoopsummit2014
Data discoveryonhadoop@yahoo! hadoopsummit2014Data discoveryonhadoop@yahoo! hadoopsummit2014
Data discoveryonhadoop@yahoo! hadoopsummit2014
 
Data-ware Housing
Data-ware HousingData-ware Housing
Data-ware Housing
 

Plus de Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

Plus de Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Dernier

[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 

Dernier (20)

[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 

Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals

  • 1. BEST PRACTICES FOR THE APACHE HADOOP DATA WAREHOUSE EDW 101 FOR HADOOP PROFESSIONALS RALPH KIMBALL / ELI COLLINS MAY 2014 Best Practices for the Hadoop Data Warehouse © Ralph Kimball, Cloudera, 2014 May 2014
  • 2. The Enterprise Data Warehouse Legacy  More than 30 years, countless successful installations, billions of dollars  Fundamental architecture best practices  Business user driven: simple, fast, relevant  Best designs driven by actual data, not top down models  Enterprise entities: dimensions, facts, and primary keys  Time variance: slowly changing dimensions  Integration: conformed dimensions  These best practices also apply to Hadoop systems
  • 3. Expose the Data as Dimensions and Facts  Dimensions are the enterprise’s fundamental entities  Dimensions are a strategic asset separate from any given data source  Dimensions need to be attached to each source  Measurement EVENTS are 1-to-1 with Fact Table RECORDS  The GRAIN of a fact table is the physical world’s description of the measurement event
  • 4. A Health Care Use Case  Grain = Health Care Hospital Events Grain = Patient Event During Hospital Stay
  • 5. Importing Raw Data into Hadoop  Ingesting and transforming raw data from diverse sources for analysis is where Hadoop shines  What: Medical device data, doctors’ notes, nurse’s notes, medications administered, procedures performed, diagnoses, lab tests, X-rays, ultrasound exams, therapists’ reports, billing, ...  From: Operational RDBMSs, enterprise data warehouse, human entered logs, machine generated data files, special systems, ...  Use native ingest tools & 3rd party data integration products  Always retain original data in full fidelity  Keep data files “as is” or use Hadoop native formats  Opportunistically add data sources  Agile!
  • 6. Importing Raw Data into Hadoop  First step: get hospital procedures from billing RDBMS, doctors notes from RDBMS, patient info from DW, ...  As well as X-rays from radiology system $ sqoop import --connect jdbc:oracle:thin:@db.server.com/BILLING --table PROCEDURES --target-dir /ingest/procedures/2014_05_29 $ hadoop fs –put /dcom_files/2014_05_29 hdfs://server.com/ingest/xrays/2014_05_29 $ sqoop import … /EMR … --table CLINICAL_NOTES $ sqoop import … /CDR … --table PATIENT_INFO
  • 7. Plan the Fact Table  Third step: create queries on raw data that will be basis for extracts from each source at the correct grain > CREATE EXTERNAL TABLE procedures_raw( date_key bigint, event timestamp, …) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘t’ LOCATION ‘/demo/procedures’;  Second step: explore raw data immediately before committing to physical data transformations
  • 8. Building the Fact Table  Fourth step: Build up “native” table for facts using special logic from extract queries created in step 3: > CREATE TABLE hospital_events(…) PARTITIONED BY date_key STORED AS PARQUET; > INSERT INTO TABLE hospital_events SELECT <special logic> FROM procedures_raw; … SELECT <special logic> FROM patient_monitor_raw; … SELECT <special logic> from clinical_notes_raw; … SELECT <special logic> from device_17_raw; … SELECT <special logic> from radiology_reports_raw; … SELECT <special logic> from meds_adminstered_raw; … and more
  • 9. The Patient Dimension  Primary key is a “surrogate key”  Durable identifier is original “natural key”  50 attributes typical  Dimension is instrumented for episodic (slow) changes
  • 10. Manage Your Primary Keys  “Natural” keys from source (often “un-natural”!)  Poorly administered, overwritten, duplicated  Awkward formats, implied semantic content  Profoundly incompatible across data sources  Replace or remap natural keys  Enterprise dimension keys are surrogate keys  Replace or remap in all dimension and fact tables  Attach high value enterprise dimensions to every source just by replacing the original natural keys
  • 11. Inserting Surrogate Keys in Facts  Re-write fact tables with dimension SKs NK NK NK SK SK SK NK NK NK SKNK Join Mapping tables Original facts SKNK SKNK SKNK Insert NK NK Append deltas to facts and mapping tables Target Fact Table
  • 12. Track Time Variance  Dimensional entities change slowly and episodically  EDW has responsibility to correctly represent history  Must provide for multiple historically time stamped versions of all dimension members  SCDs: Slowly Changing Dimensions  SCD Type 1: Overwrite dimension member, lose history  SCD Type 2: Add new time stamped dimension member record, track history
  • 13. Options for Implementing SCD 2  Re-import the dimension table each time  Or, import and merge the delta  Or, re-build the table in Hadoop  Implement complex merges with an integrated ETL tool, or in SQL via Impala or Hive $ sqoop import --table patient_info --incremental lastmodified --check-column SCD2_EFFECTIVE_DATETIME --last-value “2014-05-29 01:01:01”
  • 14. Integrate Data Sources at the BI Layer  If the dimensions of two sources are not “conformed” then the sources cannot be integrated  Two dimensions are conformed if they share attributes (fields) that have the same domains and same content  The integration payload:
  • 15. Conforming Dimensions in Hadoop  Goal: combine diverse data sets in a single analysis  Conform operational and analytical schemas via key dimensions (user, product, geo)  Build and use mapping tables (ala SK handling) > CREATE TABLE patient_tmp LIKE patient_dim; > ALTER TABLE patient_tmp ADD COLUMNS (state_conf int); > INSERT INTO TABLE patient_tmp (SELECT … ); > DROP TABLE patient_dim; > ALTER TABLE patient_tmp RENAME TO patient_dim; tediou s!
  • 16. Integrate Data Sources at the BI Layer  Traditional data warehouse personas  Dimension manager – responsible for defining and publishing the conformed dimension content  Fact provider – owner and publisher of fact table, attached to conformed dimensions  New Hadoop personas  “Robot” dimension manager – using auto schema inference, pattern matching, similarity matching, …
  • 17. What’s Easy and What’s Challenging in Hadoop as of May 2014  Easy  Assembling/investigating radically diverse data sources  Scaling out to any size at any velocity  Somewhat challenging  Building extract logic for each diverse data source  Updating and appending to existing HDFS files (requires rewrite – straightforward but slow)  Generating surrogate keys in a profoundly distributed environment  Stay tuned! 
  • 18. What Have We Accomplished  Identified essential best practices from the EDW world  Business driven  Dimensional approach  Handling time variance with SCDs and surrogate keys  Integrating arbitrary sources with conformed dimensions  Shown examples of how to implement each best practice in Hadoop  Provided realistic assessment of current state of
  • 19. The Kimball Group Resource  www.kimballgroup.com  Best selling data warehouse books NEW BOOK! The Classic “Toolkit” 3rd Ed.  In depth data warehouse classes taught by primary authors  Dimensional modeling (Ralph/Margy)  ETL architecture (Ralph/Bob)  Dimensional design reviews and consulting by Kimball Group principals  White Papers on Integration, Data Quality, and Big Data Analytics