This document provides an introduction to data warehousing fundamentals. It defines a data warehouse as an enterprise repository for subject-oriented, time-variant data used for decision support. It outlines the typical phases of a data warehousing project including strategy, definition, analysis, design, build, populate, test and evolution. It compares data warehouses to operational databases and data marts. Finally, it discusses extract, transform, load processes, possible ETL failures, and typical warehousing development tasks.
2. Definition of a Data Warehouse
• A data warehouse is an enterprise
structured repository of subject-oriented,
time-variant data used for information
retrieval and decision support. The data
warehouse stores atomic and summary
data.
3. Typical Data Warehousing Process
Phase I: STRATEGY
Identify business requirements.
Define objectives and purpose of DW.
Phase II: DEFINITION
Project scoping and planning, using a building-block
approach.
Phase III: ANALYSIS
Information requirements are defined.
Phase IV: DESIGN
Database structures to hold base data and
summaries are created. Translation
mechanisms are designed.
Phase V: BUILD AND DOCUMENT
The warehouse is built and documentation is
developed.
Phase VI: POPULATE, TEST, AND TRAIN
The warehouse is populated and tested. The users
are trained on the system and tools.
Phase VII: DISCOVERY AND EVOLUTION
The warehouse is monitored and adjustments are
applied, or future extensions are planned.
4. Data Warehouse Compared to OLTP
Property          OLTP                     Data Warehouse
Activities        Processes                Analysis
Response Time     Subseconds to seconds    Seconds to hours
Operations        DML                      Primarily read-only
Nature of Data    Current                  Snapshots over time
Data Organized    By application           By subject, time
Size              Small to large           Large to very large
Data Sources      Operational, internal    Operational, internal, external
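The "Operations" and "Nature of Data" rows can be illustrated with a small sketch. It uses Python's built-in sqlite3 module purely for convenience; the table names (account, account_snapshot), the dates, and the amounts are hypothetical and do not come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# OLTP-style table: one row per account, holding only the current state.
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO account VALUES (1, 100.0)")
# DML overwrites the current value; the previous balance is no longer stored.
conn.execute("UPDATE account SET balance = balance - 30.0 WHERE id = 1")

# Warehouse-style table: one row per account per snapshot date (time-variant).
conn.execute(
    "CREATE TABLE account_snapshot "
    "(snapshot_date TEXT, account_id INTEGER, balance REAL)"
)
conn.executemany(
    "INSERT INTO account_snapshot VALUES (?, ?, ?)",
    [("2024-01-31", 1, 100.0), ("2024-02-29", 1, 70.0)],
)

# The operational query sees only the current value.
current_balance = conn.execute(
    "SELECT balance FROM account WHERE id = 1"
).fetchone()[0]

# The warehouse query is read-only and sees the history over time.
history = conn.execute(
    "SELECT snapshot_date, balance FROM account_snapshot "
    "WHERE account_id = 1 ORDER BY snapshot_date"
).fetchall()

print(current_balance)
print(history)
```

The operational table answers "what is the balance now?", while the snapshot table can also answer "what was the balance at the end of each month?".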
5. Data Warehouse Compared with Data Mart
Property               Data Warehouse     Data Mart
Scope                  Enterprise         Department
Subjects               Multiple           Single-subject, line of business (LOB)
Data Source            Many               Few
Size (typical)         See notes below    See notes below
Implementation Time    Months to years    Months
8. Dependent Data Mart
[Figure: dependent data mart. Operational systems, flat files, and external data feed the data warehouse, which in turn feeds the data marts. Subject labels shown: Marketing, Sales, Finance, Human Resources.]
9. Purpose of an Enterprise Model
[Figure: enterprise model architecture. Stages: Extract, Transform/Load, Publish, Subscribe. Sources: operational systems, flat files, RDBMS, external data, server log files, clickstream, B2B, B2C. Components: staging areas, transformations, enterprise model (atomic data), federated data warehouse, dependent data marts, access layers, portal, metadata repository.]
10. Extract, Transform, Load (ETL) Processes
– Extract source data.
– Transform/clean data.
– Index and summarize.
– Load data into warehouse.
– Detect changes.
– Refresh data.
[Figure: operational systems feed the warehouse through ETL, which is implemented with programs, gateways, and tools.]
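The steps listed above can be sketched as a minimal pipeline. This is an illustration under stated assumptions, not a reference implementation: the sample CSV data, the sales and sales_by_region tables, and the function names are all hypothetical, and SQLite stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; a real one would come from an operational system.
SOURCE_CSV = """order_id,region,amount
1, east ,10.50
2,WEST,20.00
3,east,
4,West,5.25
"""


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform/clean: normalize region names, drop rows with no amount."""
    clean = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # reject incomplete rows instead of loading them
        clean.append(
            (int(row["order_id"]), row["region"].strip().lower(), float(amount))
        )
    return clean


def load(conn, rows):
    """Load, index, and summarize; safe to re-run when data is refreshed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    # INSERT OR REPLACE keeps a refresh from duplicating existing orders.
    conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_region ON sales (region)")
    conn.execute("DROP TABLE IF EXISTS sales_by_region")
    conn.execute(
        "CREATE TABLE sales_by_region AS "
        "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
    )


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
summary = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
print(summary)
```

Keeping extract, transform, and load as separate functions makes it easier to test each step on its own and to re-run only the load when the warehouse is refreshed.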
11. ETL Processes
– Must result in data that is relevant, useful, high-quality, accurate, and accessible
– Require a large proportion of warehouse development time and resources
[Figure: ETL cleans up, consolidates, and restructures data from operational systems so that the warehouse data is relevant, useful, high-quality, accurate, and accessible.]
12. Possible Reasons for ETL Failure
– A missing source file
– A system failure
– Inadequate metadata
– Poor mapping information
– Inadequate storage planning
– A source structural change
– No contingency plan
– Inadequate data validation
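Several of these failure modes can be caught before a load starts. The sketch below checks for three of them: a missing source file, a source structural change, and invalid data. The expected column list and the preflight function are hypothetical names chosen for illustration; a real project would take the expected structure from its source metadata.

```python
import csv
import os

# Hypothetical expected structure of the source file.
EXPECTED_COLUMNS = ["order_id", "region", "amount"]


def preflight(path, expected_columns=EXPECTED_COLUMNS):
    """Return a list of problems found; an empty list means the load may proceed."""
    # A missing source file
    if not os.path.exists(path):
        return [f"missing source file: {path}"]

    problems = []
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        # A source structural change
        if reader.fieldnames != expected_columns:
            return [f"source structure changed: {reader.fieldnames}"]
        # Inadequate data validation: check every amount before loading.
        # Data rows start on line 2 because line 1 is the header.
        for line_no, row in enumerate(reader, start=2):
            try:
                float(row["amount"])
            except (TypeError, ValueError):
                problems.append(
                    f"line {line_no}: invalid amount {row['amount']!r}"
                )
    return problems
```

A caller could run preflight on each source file and abort or fall back to a contingency plan whenever the returned list is not empty.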
13. Typical Warehousing Development Tasks
Source to staging
– Define source metadata
– Define staging area metadata
– Map source to staging area
– Deploy database structures
– Deploy mappings
– Extract data into staging tables
Staging to warehouse
– Define enterprise model (warehouse) metadata
– Map staging area to enterprise model
– Deploy database structures
– Deploy mappings
– Extract data into the enterprise model
Warehouse to data marts
– Define data mart metadata (cubes, dimensions)
– Map enterprise model to data marts
– Deploy database structures
– Deploy mappings
– Extract data into the data mart
Administration
– Refresh warehouse and data mart
– Maintain warehouse and data mart
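The source-to-staging tasks (define metadata, map, deploy structures, extract) can be sketched with the mapping held as plain data. Everything named here is hypothetical: the source columns CUST_NO and CUST_NM, the stg_customer table, and the two helper functions. SQLite is used only so that the example is self-contained.

```python
import sqlite3

# Hypothetical mapping metadata: staging column -> (source column, converter).
SOURCE_TO_STAGING = {
    "customer_id": ("CUST_NO", int),
    "customer_name": ("CUST_NM", str.strip),
}


def deploy_structure(conn, table, mapping):
    """Deploy database structures: create the staging table from the mapping."""
    columns = ", ".join(mapping)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({columns})")


def extract_into(conn, table, mapping, source_rows):
    """Extract data into the staging table, applying each column's converter."""
    placeholders = ", ".join("?" for _ in mapping)
    rows = [
        tuple(convert(row[source]) for source, convert in mapping.values())
        for row in source_rows
    ]
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)


conn = sqlite3.connect(":memory:")
deploy_structure(conn, "stg_customer", SOURCE_TO_STAGING)
extract_into(
    conn,
    "stg_customer",
    SOURCE_TO_STAGING,
    [{"CUST_NO": "7", "CUST_NM": " Ada "}],
)
staged = conn.execute(
    "SELECT customer_id, customer_name FROM stg_customer"
).fetchall()
print(staged)
```

Because the mapping is data rather than code, the same two helpers could in principle be reused for the staging-to-warehouse and warehouse-to-data-mart steps with different mapping dictionaries.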