The enormous legacy of EDW experience and best practices can be adapted to the unique capabilities of the Hadoop environment. In this webinar, presented in a point-counterpoint format, Dr. Kimball will describe standard data warehouse best practices, including the identification of dimensions and facts, managing primary keys, and handling slowly changing dimensions (SCDs) and conformed dimensions. Eli Collins, Chief Technologist at Cloudera, will describe how each of these practices can actually be implemented in Hadoop.
2. The Enterprise Data Warehouse Legacy
More than 30 years, countless successful installations, billions of dollars
Fundamental architecture best practices
Business user driven: simple, fast, relevant
Best designs driven by actual data, not top-down models
Enterprise entities: dimensions, facts, and primary keys
Time variance: slowly changing dimensions
Integration: conformed dimensions
These best practices also apply to Hadoop systems
3. Expose the Data as Dimensions and Facts
Dimensions are the enterprise’s fundamental entities
Dimensions are a strategic asset separate from any given data source
Dimensions need to be attached to each source
Measurement EVENTS are 1-to-1 with fact table RECORDS
The GRAIN of a fact table is the physical world’s description of the measurement event
4. A Health Care Use Case
Fact table: Health Care Hospital Events
Grain = Patient Event During Hospital Stay
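A minimal sketch of a fact table at this grain (all column names here are hypothetical, anticipating the hospital_events table built on slide 8):
> CREATE TABLE hospital_events (
    date_key bigint,        -- FK to the date dimension
    patient_sk bigint,      -- FK to the patient dimension (surrogate key)
    procedure_code string,  -- what happened during the event
    event_time timestamp,   -- when the measurement event occurred
    charge decimal(12,2));  -- numeric fact captured by the event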
5. Importing Raw Data into Hadoop
Ingesting and transforming raw data from diverse sources for analysis is where Hadoop shines
What: medical device data, doctors’ notes, nurses’ notes, medications administered, procedures performed, diagnoses, lab tests, X-rays, ultrasound exams, therapists’ reports, billing, ...
From: operational RDBMSs, enterprise data warehouse, human-entered logs, machine-generated data files, special systems, ...
Use native ingest tools & 3rd-party data integration products
Always retain original data in full fidelity
Keep data files “as is” or use Hadoop-native formats
Opportunistically add data sources (Agile!)
6. Importing Raw Data into Hadoop
First step: get hospital procedures from the billing RDBMS, doctors’ notes from the EMR RDBMS, patient info from the DW, ... as well as X-rays from the radiology system
$ sqoop import \
    --connect jdbc:oracle:thin:@db.server.com/BILLING \
    --table PROCEDURES \
    --target-dir /ingest/procedures/2014_05_29
$ hadoop fs -put /dcom_files/2014_05_29 \
    hdfs://server.com/ingest/xrays/2014_05_29
$ sqoop import … /EMR … --table CLINICAL_NOTES
$ sqoop import … /CDR … --table PATIENT_INFO
7. Plan the Fact Table
Second step: explore the raw data immediately, before committing to physical data transformations
Third step: create queries on the raw data that will be the basis for extracts from each source at the correct grain:
> CREATE EXTERNAL TABLE procedures_raw(
    date_key bigint,
    event timestamp, …)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/demo/procedures';
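A quick sketch of the second (exploration) step, assuming the procedures_raw table above:
> SELECT COUNT(*) FROM procedures_raw;                -- row count sanity check
> SELECT * FROM procedures_raw LIMIT 10;              -- eyeball a sample
> SELECT MIN(event), MAX(event) FROM procedures_raw;  -- confirm the time range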
8. Building the Fact Table
Fourth step: build up the “native” table for facts using special logic from the extract queries created in step 3:
> CREATE TABLE hospital_events(…)
  PARTITIONED BY (date_key bigint) STORED AS PARQUET;
> INSERT INTO TABLE hospital_events
  SELECT <special logic> FROM procedures_raw;
… SELECT <special logic> FROM patient_monitor_raw;
… SELECT <special logic> FROM clinical_notes_raw;
… SELECT <special logic> FROM device_17_raw;
… SELECT <special logic> FROM radiology_reports_raw;
… SELECT <special logic> FROM meds_administered_raw;
… and more
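One hypothetical instance of the <special logic> (the real extract queries depend on each source): a dynamic-partition insert that enforces the grain of one row per patient event.
> SET hive.exec.dynamic.partition.mode=nonstrict;
> INSERT INTO TABLE hospital_events PARTITION (date_key)
  SELECT patient_nk,          -- natural key, remapped to an SK later
         procedure_code,
         event AS event_time,
         date_key              -- dynamic partition column goes last
  FROM procedures_raw
  WHERE event IS NOT NULL;     -- drop malformed rows so the grain holds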
9. The Patient Dimension
Primary key is a “surrogate key”
Durable identifier is the original “natural key”
50 attributes typical
Dimension is instrumented for episodic (slow) changes
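A minimal sketch of such a dimension (hypothetical columns; the SCD 2 columns anticipate slides 12 and 13):
> CREATE TABLE patient_dim (
    patient_sk bigint,                  -- surrogate primary key
    patient_nk string,                  -- durable natural key, e.g. an MRN
    patient_name string,
    state string,                       -- … roughly 50 such attributes
    scd2_effective_datetime timestamp,  -- when this version became current
    scd2_expiry_datetime timestamp)     -- NULL while this version is current
  STORED AS PARQUET;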
10. Manage Your Primary Keys
“Natural” keys from the source (often “un-natural”!)
Poorly administered, overwritten, duplicated
Awkward formats, implied semantic content
Profoundly incompatible across data sources
Replace or remap natural keys
Enterprise dimension keys are surrogate keys
Replace or remap in all dimension and fact tables
Attach high-value enterprise dimensions to every source just by replacing the original natural keys
11. Inserting Surrogate Keys in Facts
Re-write fact tables with dimension SKs
[Diagram: original fact records carrying natural keys (NK) are joined to SK/NK mapping tables and inserted into the target fact table carrying surrogate keys (SK); deltas are appended to both the facts and the mapping tables]
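A hedged sketch of the join pictured above (the patient_key_map mapping table and all column names are hypothetical):
> INSERT INTO TABLE target_fact
  SELECT m.patient_sk,            -- SK replaces the natural key
         f.procedure_code, f.event_time, f.charge
  FROM original_facts f
  JOIN patient_key_map m          -- SK/NK mapping table
    ON f.patient_nk = m.patient_nk;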
12. Track Time Variance
Dimensional entities change slowly and episodically
The EDW has responsibility to correctly represent history
Must provide for multiple historically time-stamped versions of all dimension members
SCDs: Slowly Changing Dimensions
SCD Type 1: overwrite the dimension member, lose history
SCD Type 2: add a new time-stamped dimension member record, track history
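For illustration (hypothetical rows, reusing the patient_dim sketch from slide 9), Type 2 keeps every version of a patient under the same durable natural key:
> SELECT patient_sk, state, scd2_effective_datetime, scd2_expiry_datetime
  FROM patient_dim
  WHERE patient_nk = 'MRN-784'          -- durable natural key
  ORDER BY scd2_effective_datetime;
-- e.g., SK 1001 (state CA, expired 2014-05-28) and
--       SK 2417 (state TX, current: expiry is NULL)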
13. Options for Implementing SCD 2
Re-import the dimension table each time
Or, import and merge the delta
Or, re-build the table in Hadoop
Implement complex merges with an integrated ETL tool, or in SQL via Impala or Hive (see the sketch below)
$ sqoop import \
    --table patient_info \
    --incremental lastmodified \
    --check-column SCD2_EFFECTIVE_DATETIME \
    --last-value "2014-05-29 01:01:01"
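A hedged sketch of the merge-and-rebuild option in Hive SQL (patient_delta stands for the sqoop'ed delta; all names are assumptions): close out the superseded current rows, then append the delta as the new current versions.
> CREATE TABLE patient_dim_new LIKE patient_dim;
> INSERT INTO TABLE patient_dim_new
  SELECT d.patient_sk, d.patient_nk, d.patient_name, d.state,
         d.scd2_effective_datetime,
         CASE WHEN x.patient_nk IS NOT NULL
               AND d.scd2_expiry_datetime IS NULL
              THEN x.scd2_effective_datetime  -- expire the old current row
              ELSE d.scd2_expiry_datetime
         END AS scd2_expiry_datetime
  FROM patient_dim d
  LEFT JOIN patient_delta x ON d.patient_nk = x.patient_nk;
> INSERT INTO TABLE patient_dim_new
  SELECT * FROM patient_delta;          -- append the new current versions
> DROP TABLE patient_dim;
> ALTER TABLE patient_dim_new RENAME TO patient_dim;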
14. Integrate Data Sources at the BI Layer
If the dimensions of two sources are not “conformed,” then the sources cannot be integrated
Two dimensions are conformed if they share attributes (fields) that have the same domains and the same content
The integration payload: drill across separate sources, grouping the answer sets on the shared conformed attributes
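A sketch of that payload (both fact tables are hypothetical and attach to the conformed patient dimension): aggregate each source separately on the shared attribute, then merge the answer sets on it.
> SELECT b.state, b.total_charges, e.event_count
  FROM (SELECT d.state, SUM(f.charge) AS total_charges
        FROM billing_facts f
        JOIN patient_dim d ON f.patient_sk = d.patient_sk
        GROUP BY d.state) b
  JOIN (SELECT d.state, COUNT(*) AS event_count
        FROM hospital_events f
        JOIN patient_dim d ON f.patient_sk = d.patient_sk
        GROUP BY d.state) e
    ON b.state = e.state;        -- conformed attribute is the merge key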
15. Conforming Dimensions in Hadoop
Goal: combine diverse data sets in a single analysis
Conform operational and analytical schemas via key dimensions (user, product, geo)
Build and use mapping tables (a la SK handling)
> CREATE TABLE patient_tmp LIKE patient_dim;
> ALTER TABLE patient_tmp ADD COLUMNS (state_conf int);
> INSERT INTO TABLE patient_tmp SELECT … ;
> DROP TABLE patient_dim;
> ALTER TABLE patient_tmp RENAME TO patient_dim;
Tedious!
16. Integrate Data Sources at the BI Layer
Traditional data warehouse personas:
Dimension manager: responsible for defining and publishing the conformed dimension content
Fact provider: owner and publisher of a fact table, attached to conformed dimensions
New Hadoop personas:
“Robot” dimension manager: using auto schema inference, pattern matching, similarity matching, …
17. What’s Easy and What’s Challenging in Hadoop as of May 2014
Easy:
Assembling/investigating radically diverse data sources
Scaling out to any size at any velocity
Somewhat challenging:
Building extract logic for each diverse data source
Updating and appending to existing HDFS files (requires a rewrite: straightforward but slow)
Generating surrogate keys in a profoundly distributed environment (see the sketch below)
Stay tuned!
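One common workaround circa 2014, sketched under assumptions (the patient_key_map and new_patients tables are hypothetical): offset a windowed row number by the current maximum key. The global ORDER BY funnels all new keys through a single reducer, which is exactly why this is listed as challenging.
> INSERT INTO TABLE patient_key_map
  SELECT m.max_sk + ROW_NUMBER() OVER (ORDER BY n.patient_nk) AS patient_sk,
         n.patient_nk
  FROM new_patients n
  CROSS JOIN (SELECT MAX(patient_sk) AS max_sk
              FROM patient_key_map) m;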
18. What Have We Accomplished
Identified essential best practices from the EDW world:
Business driven
Dimensional approach
Handling time variance with SCDs and surrogate keys
Integrating arbitrary sources with conformed dimensions
Shown examples of how to implement each best practice in Hadoop
Provided a realistic assessment of the current state of Hadoop
19. The Kimball Group Resource
www.kimballgroup.com
Best-selling data warehouse books
NEW BOOK! The classic “Toolkit,” 3rd Ed.
In-depth data warehouse classes taught by the primary authors:
Dimensional modeling (Ralph/Margy)
ETL architecture (Ralph/Bob)
Dimensional design reviews and consulting by Kimball Group principals
White papers on Integration, Data Quality, and Big Data Analytics