The enormous legacy of EDW experience and best practices can be adapted to the unique capabilities of the Hadoop environment. In this webinar, presented in a point-counterpoint format, Dr. Kimball will describe standard data warehouse best practices, including the identification of dimensions and facts, managing primary keys, and handling slowly changing dimensions (SCDs) and conformed dimensions. Eli Collins, Chief Technologist at Cloudera, will describe how each of these practices can actually be implemented in Hadoop.
2. The Enterprise Data Warehouse Legacy
More than 30 years, countless successful installations, billions of dollars
Fundamental architecture best practices
Business user driven: simple, fast, relevant
Best designs driven by actual data, not top-down models
Enterprise entities: dimensions, facts, and primary keys
Time variance: slowly changing dimensions
Integration: conformed dimensions
These best practices also apply to Hadoop systems
3. Expose the Data as Dimensions and Facts
Dimensions are the enterprise’s fundamental entities
Dimensions are a strategic asset separate from any given data source
Dimensions need to be attached to each source
Measurement EVENTS are 1-to-1 with fact table RECORDS
The GRAIN of a fact table is the physical world’s description of the measurement event
4. A Health Care Use Case
Fact table: Health Care Hospital Events
Grain = Patient Event During Hospital Stay
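A minimal sketch of a fact table at this grain (all column names here are hypothetical, anticipating the hospital_events table built on slide 8):
> CREATE TABLE hospital_events (
    date_key bigint,        -- FK to the date dimension
    patient_sk bigint,      -- FK to the patient dimension (surrogate key)
    procedure_code string,  -- what happened during the event
    event_time timestamp,   -- when the measurement event occurred
    charge decimal(12,2));  -- numeric fact captured by the event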
5. Importing Raw Data into Hadoop
Ingesting and transforming raw data from diverse sources for analysis is where Hadoop shines
What: medical device data, doctors’ notes, nurses’ notes, medications administered, procedures performed, diagnoses, lab tests, X-rays, ultrasound exams, therapists’ reports, billing, ...
From: operational RDBMSs, enterprise data warehouse, human-entered logs, machine-generated data files, special systems, ...
Use native ingest tools & 3rd-party data integration products
Always retain original data in full fidelity
Keep data files “as is” or use Hadoop-native formats
Opportunistically add data sources (Agile!)
6. Importing Raw Data into Hadoop
First step: get hospital procedures from the billing RDBMS, doctors’ notes from the EMR RDBMS, patient info from the DW, ... as well as X-rays from the radiology system
$ sqoop import \
    --connect jdbc:oracle:thin:@db.server.com/BILLING \
    --table PROCEDURES \
    --target-dir /ingest/procedures/2014_05_29
$ hadoop fs -put /dcom_files/2014_05_29 \
    hdfs://server.com/ingest/xrays/2014_05_29
$ sqoop import … /EMR … --table CLINICAL_NOTES
$ sqoop import … /CDR … --table PATIENT_INFO
7. Plan the Fact Table
Second step: explore the raw data immediately, before committing to physical data transformations
Third step: create queries on the raw data that will be the basis for extracts from each source at the correct grain:
> CREATE EXTERNAL TABLE procedures_raw(
    date_key bigint,
    event timestamp, …)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/demo/procedures';
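A quick sketch of the second (exploration) step, assuming the procedures_raw table above:
> SELECT COUNT(*) FROM procedures_raw;                -- row count sanity check
> SELECT * FROM procedures_raw LIMIT 10;              -- eyeball a sample
> SELECT MIN(event), MAX(event) FROM procedures_raw;  -- confirm the time range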
8. Building the Fact Table
Fourth step: build up the “native” table for facts using special logic from the extract queries created in step 3:
> CREATE TABLE hospital_events(…)
  PARTITIONED BY (date_key bigint) STORED AS PARQUET;
> INSERT INTO TABLE hospital_events
  SELECT <special logic> FROM procedures_raw;
… SELECT <special logic> FROM patient_monitor_raw;
… SELECT <special logic> FROM clinical_notes_raw;
… SELECT <special logic> FROM device_17_raw;
… SELECT <special logic> FROM radiology_reports_raw;
… SELECT <special logic> FROM meds_administered_raw;
… and more
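One hypothetical instance of the <special logic> (the real extract queries depend on each source): a dynamic-partition insert that enforces the grain of one row per patient event.
> SET hive.exec.dynamic.partition.mode=nonstrict;
> INSERT INTO TABLE hospital_events PARTITION (date_key)
  SELECT patient_nk,          -- natural key, remapped to an SK later
         procedure_code,
         event AS event_time,
         date_key              -- dynamic partition column goes last
  FROM procedures_raw
  WHERE event IS NOT NULL;     -- drop malformed rows so the grain holds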
9. The Patient Dimension
Primary key is a “surrogate key”
Durable identifier is the original “natural key”
50 attributes typical
Dimension is instrumented for episodic (slow) changes
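A minimal sketch of such a dimension (hypothetical columns; the SCD 2 columns anticipate slides 12 and 13):
> CREATE TABLE patient_dim (
    patient_sk bigint,                  -- surrogate primary key
    patient_nk string,                  -- durable natural key, e.g. an MRN
    patient_name string,
    state string,                       -- … roughly 50 such attributes
    scd2_effective_datetime timestamp,  -- when this version became current
    scd2_expiry_datetime timestamp)     -- NULL while this version is current
  STORED AS PARQUET;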
10. Manage Your Primary Keys
“Natural” keys from the source (often “un-natural”!)
Poorly administered, overwritten, duplicated
Awkward formats, implied semantic content
Profoundly incompatible across data sources
Replace or remap natural keys
Enterprise dimension keys are surrogate keys
Replace or remap in all dimension and fact tables
Attach high-value enterprise dimensions to every source just by replacing the original natural keys
11. Inserting Surrogate Keys in Facts
Re-write fact tables with dimension SKs
[Diagram: original fact records carrying natural keys (NK) are joined to SK/NK mapping tables and inserted into the target fact table carrying surrogate keys (SK); deltas are appended to both the facts and the mapping tables]
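A hedged sketch of the join pictured above (the patient_key_map mapping table and all column names are hypothetical):
> INSERT INTO TABLE target_fact
  SELECT m.patient_sk,            -- SK replaces the natural key
         f.procedure_code, f.event_time, f.charge
  FROM original_facts f
  JOIN patient_key_map m          -- SK/NK mapping table
    ON f.patient_nk = m.patient_nk;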
12. Track Time Variance
Dimensional entities change slowly and episodically
The EDW has responsibility to correctly represent history
Must provide for multiple historically time-stamped versions of all dimension members
SCDs: Slowly Changing Dimensions
SCD Type 1: overwrite the dimension member, lose history
SCD Type 2: add a new time-stamped dimension member record, track history
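For illustration (hypothetical rows, reusing the patient_dim sketch from slide 9), Type 2 keeps every version of a patient under the same durable natural key:
> SELECT patient_sk, state, scd2_effective_datetime, scd2_expiry_datetime
  FROM patient_dim
  WHERE patient_nk = 'MRN-784'          -- durable natural key
  ORDER BY scd2_effective_datetime;
-- e.g., SK 1001 (state CA, expired 2014-05-28) and
--       SK 2417 (state TX, current: expiry is NULL)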
13. Options for Implementing SCD 2
Re-import the dimension table each time
Or, import and merge the delta
Or, re-build the table in Hadoop
Implement complex merges with an integrated ETL tool, or in SQL via Impala or Hive (see the sketch below)
$ sqoop import \
    --table patient_info \
    --incremental lastmodified \
    --check-column SCD2_EFFECTIVE_DATETIME \
    --last-value "2014-05-29 01:01:01"
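A hedged sketch of the merge-and-rebuild option in Hive SQL (patient_delta stands for the sqoop'ed delta; all names are assumptions): close out the superseded current rows, then append the delta as the new current versions.
> CREATE TABLE patient_dim_new LIKE patient_dim;
> INSERT INTO TABLE patient_dim_new
  SELECT d.patient_sk, d.patient_nk, d.patient_name, d.state,
         d.scd2_effective_datetime,
         CASE WHEN x.patient_nk IS NOT NULL
               AND d.scd2_expiry_datetime IS NULL
              THEN x.scd2_effective_datetime  -- expire the old current row
              ELSE d.scd2_expiry_datetime
         END AS scd2_expiry_datetime
  FROM patient_dim d
  LEFT JOIN patient_delta x ON d.patient_nk = x.patient_nk;
> INSERT INTO TABLE patient_dim_new
  SELECT * FROM patient_delta;          -- append the new current versions
> DROP TABLE patient_dim;
> ALTER TABLE patient_dim_new RENAME TO patient_dim;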
14. Integrate Data Sources at the BI Layer
If the dimensions of two sources are not “conformed,” then the sources cannot be integrated
Two dimensions are conformed if they share attributes (fields) that have the same domains and the same content
The integration payload: drill across separate sources, grouping the answer sets on the shared conformed attributes
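A sketch of that payload (both fact tables are hypothetical and attach to the conformed patient dimension): aggregate each source separately on the shared attribute, then merge the answer sets on it.
> SELECT b.state, b.total_charges, e.event_count
  FROM (SELECT d.state, SUM(f.charge) AS total_charges
        FROM billing_facts f
        JOIN patient_dim d ON f.patient_sk = d.patient_sk
        GROUP BY d.state) b
  JOIN (SELECT d.state, COUNT(*) AS event_count
        FROM hospital_events f
        JOIN patient_dim d ON f.patient_sk = d.patient_sk
        GROUP BY d.state) e
    ON b.state = e.state;        -- conformed attribute is the merge key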
15. Conforming Dimensions in Hadoop
Goal: combine diverse data sets in a single analysis
Conform operational and analytical schemas via key dimensions (user, product, geo)
Build and use mapping tables (a la SK handling)
> CREATE TABLE patient_tmp LIKE patient_dim;
> ALTER TABLE patient_tmp ADD COLUMNS (state_conf int);
> INSERT INTO TABLE patient_tmp SELECT … ;
> DROP TABLE patient_dim;
> ALTER TABLE patient_tmp RENAME TO patient_dim;
Tedious!
16. Integrate Data Sources at the BI Layer
Traditional data warehouse personas:
Dimension manager: responsible for defining and publishing the conformed dimension content
Fact provider: owner and publisher of a fact table, attached to conformed dimensions
New Hadoop personas:
“Robot” dimension manager: using auto schema inference, pattern matching, similarity matching, …
17. What’s Easy and What’s Challenging in Hadoop as of May 2014
Easy:
Assembling/investigating radically diverse data sources
Scaling out to any size at any velocity
Somewhat challenging:
Building extract logic for each diverse data source
Updating and appending to existing HDFS files (requires a rewrite: straightforward but slow)
Generating surrogate keys in a profoundly distributed environment (see the sketch below)
Stay tuned!
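One common workaround circa 2014, sketched under assumptions (the patient_key_map and new_patients tables are hypothetical): offset a windowed row number by the current maximum key. The global ORDER BY funnels all new keys through a single reducer, which is exactly why this is listed as challenging.
> INSERT INTO TABLE patient_key_map
  SELECT m.max_sk + ROW_NUMBER() OVER (ORDER BY n.patient_nk) AS patient_sk,
         n.patient_nk
  FROM new_patients n
  CROSS JOIN (SELECT MAX(patient_sk) AS max_sk
              FROM patient_key_map) m;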
18. What Have We Accomplished
Identified essential best practices from the EDW world:
Business driven
Dimensional approach
Handling time variance with SCDs and surrogate keys
Integrating arbitrary sources with conformed dimensions
Shown examples of how to implement each best practice in Hadoop
Provided a realistic assessment of the current state of Hadoop
19. The Kimball Group Resource
www.kimballgroup.com
Best-selling data warehouse books
NEW BOOK! The classic “Toolkit,” 3rd Ed.
In-depth data warehouse classes taught by the primary authors:
Dimensional modeling (Ralph/Margy)
ETL architecture (Ralph/Bob)
Dimensional design reviews and consulting by Kimball Group principals
White papers on Integration, Data Quality, and Big Data Analytics