2. Overview
• What is a data warehouse?
• Why a data warehouse?
• Data reconciliation – ETL process
• Data warehouse architectures
• Star schema – dimensional modeling
• Data analysis
3. What is a Data Warehouse?
• Defined in many different ways, but not rigorously.
– A decision support database that is maintained separately from
the organization’s operational database
– Supports information processing by providing a solid platform of
consolidated, historical data for analysis.
• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of
management’s decision-making process.”—W. H. Inmon
• Data warehousing:
– The process of constructing and using data warehouses
4. Data Warehouse—Subject-Oriented
• Organized around major subjects, such as
customer, product, sales
• Focusing on the modeling and analysis of data
for decision makers, not on daily operations or
transaction processing
• Provide a simple and concise view around
particular subject issues by excluding data that
are not useful in the decision support process
5. Data Warehouse—Integrated
• Constructed by integrating multiple,
heterogeneous data sources
– relational databases, flat files, on-line transaction
records
• Data cleaning and data integration techniques
are applied.
– Ensure consistency in naming conventions,
encoding structures, attribute measures, etc. among
different data sources
• E.g., Hotel price: currency, tax, breakfast covered, etc.
– When data is moved to the warehouse, it is
converted.
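The hotel price example can be sketched as a small conversion step. This is only an illustration: the exchange rates, the tax rate, and the target convention (USD, tax included) are assumptions, and the breakfast attribute is left out for brevity.

```python
from dataclasses import dataclass

# Hypothetical exchange rates and tax rate, chosen only for illustration.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.25}
TAX_RATE = 0.10


@dataclass
class HotelPrice:
    amount: float        # price as recorded by the source system
    currency: str        # source currency code
    tax_included: bool   # whether the source price already includes tax


def to_warehouse_price(p: HotelPrice) -> float:
    """Convert a source price to the assumed warehouse convention: USD, tax included."""
    usd = p.amount * RATES_TO_USD[p.currency]
    if not p.tax_included:
        usd *= 1 + TAX_RATE
    return round(usd, 2)


prices = [
    HotelPrice(100.0, "USD", tax_included=False),
    HotelPrice(88.0, "EUR", tax_included=True),
]
# Both source records describe the same price once the conventions agree.
normalized = [to_warehouse_price(p) for p in prices]
```

Two sources that look different (100 USD before tax, 88 EUR after tax) end up with the same value once they are converted to one convention, which is the point of the integration step.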
6. Data Warehouse—Time Variant
• The time horizon for the data warehouse is
significantly longer than that of operational
systems
– Operational database: current value data
– Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
– Contains an element of time, explicitly or implicitly
– But the key of operational data may or may not
contain “time element”
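A minimal sketch of the difference in key structure, assuming a hypothetical customer address record and plain Python dictionaries in place of real tables:

```python
from datetime import date

# Operational store: keyed by customer id only, so it holds just the current value.
operational = {}
operational["C1"] = "Boston"
operational["C1"] = "Denver"   # overwrites the previous address

# Warehouse store: the key includes a time element (here, a snapshot date).
warehouse = {}
warehouse[("C1", date(2023, 1, 1))] = "Boston"
warehouse[("C1", date(2024, 1, 1))] = "Denver"


def address_as_of(customer_id, as_of):
    """Return the most recent warehouse value on or before the given date."""
    candidates = [
        (snapshot, value)
        for (cid, snapshot), value in warehouse.items()
        if cid == customer_id and snapshot <= as_of
    ]
    return max(candidates)[1] if candidates else None
```

Because time is part of the key, the warehouse can answer "what was the value in mid-2023?", while the operational store can only report the current value.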
7. Data Warehouse—Nonvolatile
• A physically separate store of data transformed
from the operational environment
• Operational update of data does not occur in the
data warehouse environment
– Does not require transaction processing, recovery,
and concurrency control mechanisms
– Requires only two operations in data accessing:
• initial loading of data and access of data
8. Trends in Organisations that encourage
the need for data warehousing
• No single system of record
• Multiple systems are not synchronized
• Organisations want to analyse the activities in
a balanced way
• Customer relationship management
• Supplier relationship management
9. Need for Data Warehousing
• Integrated, company-wide view of high-quality
information (from different databases)
• Separation of operational and informational systems
and data (for improved performance)
10. Operational & Informational System
The need to separate operational and informational
systems is based on three primary factors:
• A data warehouse centralizes data that are scattered
throughout disparate operational systems and makes them
available for decision support applications
• A properly designed data warehouse adds value to data
by improving their quality
• A separate data warehouse eliminates much of the contention
for resources that results when informational applications are
confounded with operational processing
12. Data Reconciliation
• Typical operational data is:
– Transient – not historical
– Not normalised (perhaps due to denormalisation for
performance)
– Restricted in scope – not comprehensive
– Sometimes poor quality – inconsistencies and errors
• After ETL (Extract, Transform, Load), data
should be:
– Detailed – not summarized yet
– Historical – periodic
– Normalised – 3rd normal form or higher
– Comprehensive – enterprise-wide perspective
– Timely – data should be current enough to assist decision-making
– Quality controlled – accurate with full integrity
13. The ETL Process/ Data
Reconciliation Main Steps
• Capture/Extract
• Scrub or data cleansing
• Transform
• Load and Index
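The four steps can be sketched as a chain of small functions. This is an illustrative outline only: the field names (customer_id, name, amount_cents) and the cleansing rules are assumptions, and in-memory lists stand in for real source and target systems.

```python
def extract(source_rows):
    """Capture/extract: copy the rows of interest out of the source."""
    return [dict(row) for row in source_rows]


def scrub(rows):
    """Scrub: drop rows with a missing key and tidy obvious formatting errors."""
    cleaned = []
    for row in rows:
        if row.get("customer_id") is None:
            continue
        row["name"] = row["name"].strip().title()
        cleaned.append(row)
    return cleaned


def transform(rows):
    """Transform: convert to the warehouse format (here, cents to dollars)."""
    return [
        {"customer_id": r["customer_id"], "name": r["name"],
         "amount": r["amount_cents"] / 100}
        for r in rows
    ]


def load_and_index(rows):
    """Load and index: store the rows and build a lookup index on customer_id."""
    table = list(rows)
    index = {}
    for position, row in enumerate(table):
        index.setdefault(row["customer_id"], []).append(position)
    return table, index


source = [
    {"customer_id": 1, "name": "  alice smith ", "amount_cents": 1250},
    {"customer_id": None, "name": "unknown", "amount_cents": 500},
    {"customer_id": 2, "name": "BOB JONES", "amount_cents": 300},
]
table, index = load_and_index(transform(scrub(extract(source))))
```

The extract step copies each row, so scrubbing never modifies the operational source.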
14. Static extract = capturing a snapshot of the source data at a point
in time
Incremental extract = capturing changes that have occurred since the
last static extract
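A minimal sketch of the two extract types, assuming each source row carries a hypothetical modified_at value (a plain integer here) that records when it last changed:

```python
def static_extract(source):
    """Static extract: snapshot of all source rows at this point in time."""
    return [dict(row) for row in source]


def incremental_extract(source, last_extract_time):
    """Incremental extract: only rows changed since the previous extract."""
    return [dict(row) for row in source if row["modified_at"] > last_extract_time]


source = [
    {"id": 1, "qty": 5, "modified_at": 10},
    {"id": 2, "qty": 7, "modified_at": 20},
]
snapshot = static_extract(source)            # taken at time 20

# Later, one row changes and one row is added in the source.
source[0].update(qty=6, modified_at=30)
source.append({"id": 3, "qty": 1, "modified_at": 35})

changes = incremental_extract(source, last_extract_time=20)
```

Real systems often detect changes from logs or triggers rather than a timestamp column; the timestamp is used here only because it keeps the sketch short.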
16. Record-level:
– Selection – data partitioning
– Joining – data combining
– Aggregation – data summarization
Field-level:
– Single-field – from one field to one field
– Multi-field – from many fields to one, or one field to many
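The record-level and field-level functions can be illustrated on a few in-memory rows. All table contents, field names, and code mappings below are made up for the example.

```python
from collections import defaultdict

sales = [
    {"store": "S1", "product_id": 1, "amount": 10.0},
    {"store": "S1", "product_id": 2, "amount": 5.0},
    {"store": "S2", "product_id": 1, "amount": 7.5},
]
products = {1: "Widget", 2: "Gadget"}

# Record-level selection: keep only the rows that match a criterion.
s1_sales = [r for r in sales if r["store"] == "S1"]

# Record-level joining: combine each sales row with its product data.
joined = [{**r, "product_name": products[r["product_id"]]} for r in sales]

# Record-level aggregation: summarize amounts per store.
totals = defaultdict(float)
for r in sales:
    totals[r["store"]] += r["amount"]


# Field-level, single-field: one source field to one target field.
def code_to_name(code):
    return {"S1": "Downtown", "S2": "Airport"}[code]


# Field-level, multi-field: many fields to one, and one field to many.
def full_name(first, last):
    return f"{first} {last}"


def split_name(full):
    first, last = full.split(" ", 1)
    return first, last
```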
17. Refresh mode: bulk rewriting of target data at periodic intervals
Update mode: only changes in source data are written to data warehouse
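A rough sketch of the two load modes, using a dictionary keyed by a hypothetical id field as the target. For brevity the update overwrites by key; a warehouse that keeps history would instead append the changed rows.

```python
def refresh(source):
    """Refresh mode: rebuild the whole target from the current source."""
    return {row["id"]: dict(row) for row in source}


def update(target, changed_rows):
    """Update mode: write only the changed source rows into the target."""
    for row in changed_rows:
        target[row["id"]] = dict(row)
    return target


source = [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}]
target = refresh(source)                                    # bulk rewrite
update(target, [{"id": 2, "qty": 9}, {"id": 3, "qty": 1}])  # changes only
```

Refresh touches every row on each run, while update touches only what changed, which is why update mode is usually paired with incremental extracts.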
18. Data Warehouse Architectures
• Generic Two-Level Architecture
• Independent Data Mart
• Dependent Data Mart and Operational
Data Store
• Logical Data Mart and @ctive
Warehouse
20. Independent data mart
• Data marts: mini-warehouses, limited in scope
• Separate ETL for each independent data mart
• Data access complexity due to multiple data marts
21. Dependent data mart with operational data store
• ODS provides option for obtaining current data
• Single ETL for enterprise data warehouse (EDW)
• Dependent data marts loaded from EDW
22. ODS and data warehouse are one and the same
• Near real-time ETL for @ctive Data Warehouse
• Data marts are NOT separate databases, but logical views of the data warehouse
• Easier to create new data marts
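The idea of a data mart as a logical view can be sketched with Python's built-in sqlite3 module. The table, the view name, and the rows are assumptions for illustration; a production warehouse would use its own DBMS and schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("East", "Widget", 10.0), ("West", "Widget", 4.0), ("East", "Gadget", 6.0)],
)

# A logical data mart: a view over the warehouse table, not a separate database.
conn.execute(
    "CREATE VIEW east_sales_mart AS "
    "SELECT product, amount FROM sales WHERE region = 'East'"
)

east_rows = conn.execute(
    "SELECT product, amount FROM east_sales_mart ORDER BY product"
).fetchall()

# New warehouse rows are visible through the view without a separate load step.
conn.execute("INSERT INTO sales VALUES ('East', 'Widget', 2.5)")
east_total = conn.execute("SELECT SUM(amount) FROM east_sales_mart").fetchone()[0]
```

Creating another mart is just another CREATE VIEW statement, which is why new data marts are easier to add in this architecture.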
23. Data Characteristics
Status vs. Event Data
• Event – a database action (create/update/delete) that results from a transaction
[Figure labels: Status, Event, Status]
24. Data Characteristics
Transient vs. Periodic Data
• Transient data: changes to existing records are written over previous records, thus destroying the previous data content
• Periodic data: data are never physically altered or deleted once they have been added to the store
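A small sketch of the contrast, assuming a hypothetical account balance and plain Python structures in place of database tables:

```python
# Transient store: an update overwrites the previous record.
transient = {"C1": {"balance": 100}}
transient["C1"] = {"balance": 80}      # the old balance is gone

# Periodic store: append-only; each change adds a new timestamped record.
periodic = []


def record(store, key, value, timestamp):
    """Append a new version instead of altering or deleting an existing one."""
    store.append({"key": key, "value": value, "timestamp": timestamp})


record(periodic, "C1", {"balance": 100}, timestamp=1)
record(periodic, "C1", {"balance": 80}, timestamp=2)

history = [r["value"]["balance"] for r in periodic if r["key"] == "C1"]
```

Only the periodic store can still show that the balance was once 100.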
25. Star schema
• Fact tables contain factual or quantitative data
• 1:N relationship between dimension tables and fact tables
• Dimension tables are denormalized to maximize performance
• Dimension tables contain descriptions about the subjects of the business
Star Schema: Simple database design in which dimensional data are separated from
fact data. Excellent for queries, but bad for online transaction processing
26. Star schema example
Fact table provides statistics for sales broken
down by product, period and store dimensions
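Since the slide's diagram is not reproduced here, the following is a stand-in sketch of such a schema using Python's sqlite3 module. The table names, column names, and rows are assumptions, not the ones from the original figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE period  (period_id  INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE store   (store_id   INTEGER PRIMARY KEY, city TEXT);

-- Fact table: one row per product/period/store, holding the quantitative data.
CREATE TABLE sales_fact (
    product_id   INTEGER REFERENCES product(product_id),
    period_id    INTEGER REFERENCES period(period_id),
    store_id     INTEGER REFERENCES store(store_id),
    units_sold   INTEGER,
    dollars_sold REAL
);

INSERT INTO product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO period  VALUES (1, 2024, 1), (2, 2024, 2);
INSERT INTO store   VALUES (1, 'Boston'), (2, 'Denver');
INSERT INTO sales_fact VALUES
    (1, 1, 1, 10, 100.0),
    (1, 2, 1,  5,  50.0),
    (2, 1, 2,  3,  90.0),
    (1, 1, 2,  2,  20.0);
""")

# Sales for Q1 2024, broken down by product and store.
rows = conn.execute("""
    SELECT p.name, s.city, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN product p ON p.product_id = f.product_id
    JOIN store   s ON s.store_id   = f.store_id
    JOIN period  t ON t.period_id  = f.period_id
    WHERE t.year = 2024 AND t.quarter = 1
    GROUP BY p.name, s.city
    ORDER BY p.name, s.city
""").fetchall()
```

Each dimension row relates to many fact rows (the 1:N relationship), and a typical query joins the fact table to the dimensions it wants to filter or group by.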
28. On-Line Analytical Processing (OLAP)
• The use of a set of graphical tools that
provides users with multidimensional views of
their data and allows them to analyze the
data using simple windowing techniques
• Relational OLAP (ROLAP)
– Traditional relational representation
• Multidimensional OLAP (MOLAP)
– Cube structure
• OLAP Operations
– Cube slicing – come up with 2-D view of data
– Drill-down – going from summary to more
detailed views
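Slicing and drill-down can be sketched on a tiny in-memory cube. The dimensions (product, quarter, store) and the values are made up, and a dictionary stands in for a real MOLAP structure.

```python
from collections import defaultdict

# A tiny cube: (product, quarter, store) -> sales. Values are made up.
cube = {
    ("Widget", "Q1", "Boston"): 100, ("Widget", "Q1", "Denver"): 20,
    ("Widget", "Q2", "Boston"): 50,  ("Gadget", "Q1", "Denver"): 90,
}


def slice_by_store(cube, store):
    """Slice: fix one dimension to get a 2-D (product x quarter) view."""
    return {(p, q): v for (p, q, s), v in cube.items() if s == store}


def summarize(cube, dims):
    """Total the cube over the chosen dimension positions (0, 1, 2)."""
    out = defaultdict(int)
    for key, value in cube.items():
        out[tuple(key[d] for d in dims)] += value
    return dict(out)


boston = slice_by_store(cube, "Boston")
by_product = summarize(cube, [0])              # summary view
by_product_quarter = summarize(cube, [0, 1])   # drill-down: add the quarter level
```

Drill-down here simply means re-summarizing with one more dimension, so each product total is split into its per-quarter parts.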
29. Data Warehouse vs. Operational
DBMS
• OLTP (on-line transaction processing)
– Major task of traditional relational DBMS
– Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
• OLAP (on-line analytical processing)
– Major task of data warehouse system
– Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
– User and system orientation: customer vs. market
– Data contents: current, detailed vs. historical, consolidated
– Database design: ER + application vs. star + subject
– View: current, local vs. evolutionary, integrated
– Access patterns: update vs. read-only but complex queries
30. OLTP vs. OLAP
                    OLTP                          OLAP
users               clerk, IT professional        knowledge worker
function            day to day operations         decision support
DB design           application-oriented          subject-oriented
data                current, up-to-date,          historical, summarized,
                    detailed, flat relational,    multidimensional,
                    isolated                      integrated, consolidated
usage               repetitive                    ad-hoc
access              read/write,                   lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction     complex query
# records accessed  tens                          millions
# users             thousands                     hundreds
DB size             100MB-GB                      100GB-TB
metric              transaction throughput        query throughput, response
33. Data Warehouse Usage
• Three kinds of data warehouse applications
– Information processing
• supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
– Analytical processing
• multidimensional analysis of data warehouse data
• supports basic OLAP operations, slice-dice, drilling, pivoting
– Data mining
• knowledge discovery from hidden patterns
• supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools