The document provides an introduction to data warehousing. It defines a data warehouse as a subject-oriented, integrated, time-variant, and non-volatile collection of data used for organizational decision making. It describes key characteristics of a data warehouse such as maintaining historical data, facilitating analysis to improve understanding, and enabling better decision making. It also discusses dimensions, facts, ETL processes, and common data warehouse architectures like star schemas.
2. Data Warehouse
Maintain historic data
Analysis to get better understanding of business
Better Decision making
Definition: A data warehouse is a
subject-oriented
integrated
time-variant
non-volatile
collection of data that is used primarily in organizational
decision making.
-- Bill Inmon, Building the Data Warehouse, 1996
3. Subject Oriented
• A data warehouse is organized around subjects such as sales, product, and customer.
• It focuses on modeling and analysis of data for decision makers.
• It excludes data that is not useful in the decision support process.
4. Integrated
• A data warehouse is constructed by integrating multiple heterogeneous sources.
• Data preprocessing is applied to ensure consistency.
[Figure: RDBMS, legacy system, and flat file sources pass through data processing and data transformation into the data warehouse]
5. Non-volatile
• Mostly, data once recorded will not be updated.
• A data warehouse requires two operations in data accessing:
- Incremental loading of data
- Access of data
[Figure: load and access operations on the data warehouse]
6. Time Variant
• Provides information from a historical perspective, e.g. the past 5-10 years.
• Every key structure contains, either implicitly or explicitly, an element of time.
7. Why Data Warehouse?
Problem Statement:
• ABC Pvt Ltd is a company with branches at Mumbai, Delhi, Chennai and Bangalore.
• The Sales Manager wants a quarterly sales report across the branches.
• Each branch has a separate operational system where sales transactions are recorded.
8. Why Data Warehouse?
Branches: Mumbai, Delhi, Chennai, Bangalore
Sales Manager: Get the quarterly sales figure for each branch and manually calculate the sales figure across branches.
What if he needs a daily sales report across the branches?
11. Characteristics of Data Warehouse
Relational / Multidimensional database
Query and Analysis rather than transaction
Historical data from transactions
Consolidates Multiple data sources
Separates query load from transactions
Mostly non-volatile
Large amounts of data, on the order of terabytes
12. When we say large - we mean it!
• Terabytes -- 10^12 bytes: Yahoo! – 300 Terabytes and growing
• Petabytes -- 10^15 bytes: Geographic Information Systems
• Exabytes -- 10^18 bytes: National Medical Records
• Zettabytes -- 10^21 bytes: Weather images
• Yottabytes -- 10^24 bytes: Intelligence Agency Videos
13. OLTP Vs Data Warehouse (OLAP)
Aspect | OLTP | Data Warehouse (OLAP)
Indexes | Few | Many
Data | Normalized | Generally de-normalized
Joins | Many | Some
Derived data and aggregates | Rare | Common
15. ETL
ETL stands for Extract, Transform and Load
Data is distributed across different sources
– Flat files, Streaming Data, DB Systems, XML, JSON
Data can be in different formats
– CSV, Key Value Pairs
Different units and representation
– Country: IN or India
– Date: 20 Nov 2010 or 20101120
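The representation differences listed above are typically resolved by mapping every source value to one canonical form. The following is a minimal Python sketch, not a definitive implementation: the lookup table `COUNTRY_NAMES`, the list `DATE_FORMATS`, and both function names are assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical lookup: maps source spellings to one canonical country name.
COUNTRY_NAMES = {"IN": "India", "INDIA": "India"}

# Date layouts assumed to occur in the sources (e.g. "20 Nov 2010", "20101120").
DATE_FORMATS = ("%d %b %Y", "%Y%m%d")


def normalize_country(raw):
    """Return the canonical country name, or the trimmed input if unknown."""
    cleaned = raw.strip()
    return COUNTRY_NAMES.get(cleaned.upper(), cleaned)


def normalize_date(raw):
    """Try each known layout and return the date as an ISO string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # not this layout, try the next one
    raise ValueError(f"unrecognized date format: {raw!r}")


country_a = normalize_country("IN")
country_b = normalize_country("India")
date_a = normalize_date("20 Nov 2010")
date_b = normalize_date("20101120")
```

Both country spellings and both date layouts end up as the same value, so later joins and aggregations treat them as one.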
16. ETL Functions
Extract
– Collect data from different sources
– Parse data
– Remove unwanted data
Transform
– Project
– Generate Surrogate keys
– Encode data
– Join data from different sources
– Aggregate
Load
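The functions above can be sketched end to end in a few lines of Python. This is only an illustration under stated assumptions: the CSV content, the table names `product_dim` and `sales_fact`, and the use of an in-memory SQLite database as the target are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; a real source could be a file or a database.
RAW_CSV = """city,product,units,comment
Delhi,Black Cartridge,3,ok
Delhi,Black Cartridge,2,ok
Bangalore,Long Notebook,,missing units
Bangalore,Long Notebook,4,ok
"""


def extract(text):
    """Extract: parse the CSV and remove rows that lack a unit count."""
    rows = csv.DictReader(io.StringIO(text))
    return [r for r in rows if r["units"]]


def transform(rows):
    """Transform: project columns, generate surrogate keys, aggregate units."""
    product_keys = {}  # product name -> surrogate key
    totals = {}        # (city, product key) -> total units
    for r in rows:
        key = product_keys.setdefault(r["product"], len(product_keys) + 1)
        totals[(r["city"], key)] = totals.get((r["city"], key), 0) + int(r["units"])
    return product_keys, totals


def load(conn, product_keys, totals):
    """Load: write the dimension and fact rows into the target tables."""
    conn.execute("CREATE TABLE product_dim (prod_key INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE sales_fact (city TEXT, prod_key INTEGER, units INTEGER)")
    conn.executemany("INSERT INTO product_dim VALUES (?, ?)",
                     [(k, n) for n, k in product_keys.items()])
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
                     [(c, k, u) for (c, k), u in totals.items()])


conn = sqlite3.connect(":memory:")
load(conn, *transform(extract(RAW_CSV)))
loaded = conn.execute(
    "SELECT city, prod_key, units FROM sales_fact ORDER BY city"
).fetchall()
```

The row with no unit count is dropped during extraction, and the two Delhi rows are aggregated into one fact row before loading.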
17. ETL Steps
• The first step in the ETL process is mapping the data between the source systems and the target database.
• The second step is cleansing the source data in the staging area.
• The third step is transforming the cleansed source data.
• The fourth step is loading into the target system.
[Figures: data before ETL processing; data after ETL processing]
18. ETL Glossary
Mapping:
Defining relationship between source and target objects.
Cleansing:
The process of resolving inconsistencies in source data.
Transformation:
The process of manipulating data. Any manipulation beyond copying is a transformation. Examples include aggregating data and integrating data from multiple sources.
Staging Area:
A place where data is processed before entering the
warehouse.
19. Dimension
Categorizes the data. For example - time, location, etc.
A dimension can have one or more attributes. For example - day, week and month are attributes of the time dimension.
Role of dimensions in data warehousing:
- Slice and dice
- Filter by dimensions
20. Types of dimensions
• Conformed Dimension - A dimension that is shared across fact tables.
• Junk Dimension - A junk dimension is a convenient grouping of flags and indicators. For example, payment method, shipping method.
• Degenerate Dimension - A dimension key that has no attributes and hence does not have its own dimension table. For example, transaction number, invoice number. Values of these dimensions are mostly unique within a fact table.
• Role Playing Dimension - A role playing dimension is a dimension that plays different roles in fact tables depending on the context. For example, the Date dimension can be used for the ordered date, shipment date, and invoice date.
• Slowly Changing Dimensions - Dimensions that have data that changes slowly over time.
21. Types of Slowly Changing Dimension
• Type 1 - The Type 1 methodology overwrites old data with new data, and therefore does not track historical data at all.
• Type 2 - The Type 2 method tracks historical data by creating multiple records for a given natural key in the dimension table, each with a separate surrogate key.
• Type 3 - The Type 3 method tracks changes using separate columns. Whereas Type 2 has unlimited history preservation, Type 3 has limited history preservation, as it is limited to the number of columns we designate for storing historical data.
• Type 4 - The Type 4 method is usually referred to as using "history tables", where one table keeps the current data, and an additional table is used to keep a record of all changes.
Types 1, 2 and 3 are commonly used.
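The difference between Type 1 and Type 2 is easiest to see on a single changed attribute. The sketch below is illustrative only: the row layout and the columns `valid_from`, `valid_to`, and `is_current` are assumptions (a common convention, not something the slides specify), as are the function names.

```python
from datetime import date


def apply_type1(rows, customer_id, new_city):
    """Type 1: overwrite in place, so the old value is lost."""
    for row in rows:
        if row["customer_id"] == customer_id:
            row["city"] = new_city


def apply_type2(rows, customer_id, new_city, change_date):
    """Type 2: close the current row and add a new row with a new surrogate key."""
    for row in rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["is_current"] = False
            row["valid_to"] = change_date
    rows.append({
        "surrogate_key": max(r["surrogate_key"] for r in rows) + 1,
        "customer_id": customer_id,
        "city": new_city,
        "valid_from": change_date,
        "valid_to": None,
        "is_current": True,
    })


# Hypothetical customer who moves from Delhi to Mumbai.
type1_dim = [{"customer_id": "C1", "city": "Delhi"}]
apply_type1(type1_dim, "C1", "Mumbai")

type2_dim = [{
    "surrogate_key": 1, "customer_id": "C1", "city": "Delhi",
    "valid_from": date(2012, 1, 1), "valid_to": None, "is_current": True,
}]
apply_type2(type2_dim, "C1", "Mumbai", date(2012, 3, 1))
```

After the change, the Type 1 table holds only Mumbai, while the Type 2 table holds both the closed Delhi row and the current Mumbai row, so facts recorded before the move still join to the old city.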
22. Facts
Facts are values that can be examined and analyzed.
For example - Page Views, Unique Users, Pieces Sold, Profit.
Fact and measure are synonymous.
Types of facts:
– Additive - Measures that can be added across all dimensions.
– Non Additive - Measures that cannot be added across any dimension.
– Semi Additive - Measures that can be added across some dimensions but not across others.
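A small worked example may help separate the three kinds of fact. The data below is hypothetical: deposits stand in for an additive measure, account balances for a semi-additive one, and profit margin for a non-additive one.

```python
# Hypothetical daily snapshot rows: (day, account, deposits, closing_balance).
snapshots = [
    ("Mon", "A", 100, 500),
    ("Mon", "B", 50, 250),
    ("Tue", "A", 20, 520),
    ("Tue", "B", 0, 250),
]

# Additive: deposits can be summed across accounts and across days.
total_deposits = sum(dep for _, _, dep, _ in snapshots)

# Semi-additive: balances can be summed across accounts for one day...
tuesday_balance = sum(bal for day, _, _, bal in snapshots if day == "Tue")
# ...but summing them across days counts the same money more than once.
misleading_sum = sum(bal for _, _, _, bal in snapshots)

# Non-additive: a ratio such as profit margin cannot be summed at all.
sales = [(100, 10), (300, 90)]  # hypothetical (revenue, profit) pairs
margins = [profit / revenue for revenue, profit in sales]
# The overall margin must be recomputed from the additive parts.
overall_margin = sum(p for _, p in sales) / sum(r for r, _ in sales)
```

Here the bank holds 770 on Tuesday, not the 1520 obtained by adding balances over both days, and the overall margin is 25%, which is neither the sum nor the simple average of the individual margins in general.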
23. How to store data?
Facts and Dimensions:
1. Select the business process to model
2. Declare the grain of the business process
3. Choose the dimensions that apply to each fact table row
4. Identify the numeric facts that will populate each fact table row
24. Dimension Table
Contains attributes of dimensions, e.g. Month is an attribute of the Time dimension.
Can also have foreign keys to another dimension table
Usually identified by a unique integer primary key called a surrogate key
29. Snowflake Schema
An extension of star schema in which the dimension tables
are partly or fully normalized.
Dimension table hierarchies broken down into simpler
tables.
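Normalizing a dimension can be sketched as splitting a repeated attribute into its own table. The example below is a minimal illustration with hypothetical product rows and a hypothetical helper named `snowflake`; it is not tied to any particular database.

```python
# Star schema form: one denormalized product dimension; the category name repeats.
star_product_dim = [
    {"prod_id": 1, "name": "Long Notebook", "category": "General Stationery"},
    {"prod_id": 2, "name": "Short Notebook", "category": "General Stationery"},
    {"prod_id": 3, "name": "Black Cartridge", "category": "Ink & Toners"},
]


def snowflake(dim_rows):
    """Split the category attribute out into its own, smaller table."""
    category_ids = {}  # category name -> generated category key
    products = []
    for row in dim_rows:
        cat_id = category_ids.setdefault(row["category"], len(category_ids) + 1)
        products.append(
            {"prod_id": row["prod_id"], "name": row["name"], "category_id": cat_id}
        )
    categories = [
        {"category_id": cat_id, "category": name}
        for name, cat_id in category_ids.items()
    ]
    return products, categories


product_dim, category_dim = snowflake(star_product_dim)
```

The product table now carries only a category key, and each category name is stored once. The trade-off is one extra join whenever a query needs the category name.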
31. Fact Constellation Schema
• A fact constellation schema allows dimension tables to be shared between fact tables.
• This schema is used mainly for aggregate fact tables, or where we want to split a fact table for better comprehension. For example, separate fact tables for daily, weekly and monthly reporting requirements.
32. Fact Constellation Schema
In this example, the dimension tables for time, item, and location are shared between both the sales and shipping fact tables.
33. Operations on Data Warehouse
Drill Down
Roll up
Slice & Dice
Pivoting
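These operations can be illustrated on a handful of rows. The sketch below uses plain Python and hypothetical fact rows; the helper `aggregate` and the field names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical fact rows: (city, region, product, month, units).
facts = [
    ("Delhi", "North", "Black Cartridge", "Jan", 3),
    ("Delhi", "North", "Short Notebook", "Jan", 4),
    ("Delhi", "North", "Black Cartridge", "Feb", 16),
    ("Bangalore", "South", "Black Cartridge", "Jan", 3),
    ("Bangalore", "South", "Short Notebook", "Jan", 4),
]
FIELDS = ("city", "region", "product", "month", "units")
rows = [dict(zip(FIELDS, f)) for f in facts]


def aggregate(rows, *dims):
    """Sum units grouped by the given dimension attributes."""
    out = defaultdict(int)
    for r in rows:
        out[tuple(r[d] for d in dims)] += r["units"]
    return dict(out)


# Roll up: summarize at a coarser level (region instead of city).
roll_up = aggregate(rows, "region")
# Drill down: go back to a finer level (cities within each region).
drill_down = aggregate(rows, "region", "city")
# Slice: fix a single value on one dimension.
slice_jan = [r for r in rows if r["month"] == "Jan"]
# Dice: restrict two or more dimensions at once.
dice = [r for r in rows if r["month"] == "Jan" and r["city"] == "Delhi"]
# Pivot: regroup the same data along product and month axes.
pivot = aggregate(rows, "product", "month")
```

Roll up and drill down move along a dimension hierarchy, slice and dice select a sub-cube, and pivot changes which dimensions label the rows and columns of the report.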
38. Advantages of Data Warehouse
• One consistent data store for reporting, forecasting, and analysis
• Easier and timely access to data
• Scalability
• Trend analysis and detection
• Drill down analysis
39. Disadvantages of Data Warehouse
• Preparation may be time consuming.
• High associated cost
40. Case Study: Why Data Warehouse
• G2G Courier Pvt. Ltd. is an established brand in the courier industry. It has its own network in the main cities and has subcontracted rural areas across the country to various partners.
• The President of the company wants to look deep into the financial health of the company and different performance aspects.
41. Challenges
• Apart from G2G’s own transaction system, each partner has its own system, which makes the data very heterogeneous.
• Granularity of data in the various systems also differs, e.g. minute accuracy versus day accuracy.
• To analyze metrics like revenue and timely delivery across geographical locations and partners, we need a unified system.
42. “Looks like we are doing good in South, is there any scope of further improvement?”
“We are getting lot of complaints from the East, who exactly is the black sheep?”
43. Sales Information
Report: Revenue by region
Region | Revenue (lacs) | % Change
South | 41 | +8.1
North | 34 | +5.2
East | 25 | -6.8
West | 12 | +2.7
Report: Performance by partner
Partner | On Time Delivery Rate | No. of complaints
A | 100 % | 0
B | 98 % | 90
C | 60 % | 521
44. Case Study: Data Warehouse Design
• ABC Pvt Ltd is a new company which produces stationery products, with its production unit located at Ludhiana.
• They have sales units at Delhi and Bangalore.
• The President of the company wants sales information.
45. Sales Information
Report: The number of units sold.
113
Report: The number of units sold over time
January | February | March | April
14 | 41 | 33 | 25
Report: The number of items sold for each product with time
[Table: rows by Product (Black Cartridge, Long Notebook, Short Notebook); columns by month (Jan, Feb, Mar, Apr). Cell values in source order: 6, 17, 8, 6, 16, 6, 8, 25, 21]
46. Sales Information
Report: The number of items sold in each City for each product with time
[Table: rows by City (Delhi, Bangalore) and Item (Black Cartridge, Long Notebook, Short Notebook); columns by month (Jan, Feb, Mar, Apr). Cell values in source order: 3, 16, 6, 4, 16, 6, 3, 3, 10, 7, 3, 8, 4, 9, 15]
[Table: rows by City (Delhi, Bangalore) and Product Category (General Stationery, Ink & Toners); columns by month (Jan, Feb, Mar, Apr). Cell values in source order: 7, 12, 3, 7, 9, 10, 15, 8, 3, 7, 32]
47. Identify sales Facts & Dimensions
Facts – Units sold
Dimensions – Product, Time, Region.
Fact Table
City_ID | Prod_ID | Time_Id | Units
1 | 589 | 1 | 3
1 | 1218 | 1 | 4
2 | 589 | 1 | 3
2 | 1218 | 1 | 4
1 | 589 | 2 | 16
Time dimension table
Time_Id | Month
1 | January 2012
2 | February 2012
48. Identify sales Facts & Dimensions
Region Dimension Table
City_ID | City | Region | Country
1 | Delhi | North | India
2 | Bangalore | South | India
Product Dimension Tables
Prod_ID | Product_Name | Product_Category_ID
589 | Black Cartridge | 2
590 | Long Notebook | 1
288 | Short Notebook | 1
Product_Category_ID | Product_Category
1 | General Stationery
2 | Ink & Toners
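To see how the fact table and its dimension tables work together, the sketch below loads the fact, time, and region rows from the case study into an in-memory SQLite database and totals units by city and month. The lowercase table names and the choice of SQLite are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim   (time_id INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE region_dim (city_id INTEGER PRIMARY KEY, city TEXT,
                         region TEXT, country TEXT);
CREATE TABLE sales_fact (city_id INTEGER, prod_id INTEGER,
                         time_id INTEGER, units INTEGER);
""")

# Dimension rows as listed in the case study.
conn.executemany("INSERT INTO time_dim VALUES (?, ?)",
                 [(1, "January 2012"), (2, "February 2012")])
conn.executemany("INSERT INTO region_dim VALUES (?, ?, ?, ?)",
                 [(1, "Delhi", "North", "India"),
                  (2, "Bangalore", "South", "India")])

# Fact rows: (City_ID, Prod_ID, Time_Id, Units).
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                 [(1, 589, 1, 3), (1, 1218, 1, 4), (2, 589, 1, 3),
                  (2, 1218, 1, 4), (1, 589, 2, 16)])

# Join the fact table to two dimensions and aggregate the measure.
units_by_city_month = conn.execute("""
    SELECT r.city, t.month, SUM(f.units)
    FROM sales_fact AS f
    JOIN region_dim AS r ON r.city_id = f.city_id
    JOIN time_dim   AS t ON t.time_id = f.time_id
    GROUP BY r.city_id, t.time_id
    ORDER BY r.city_id, t.time_id
""").fetchall()
```

The fact table stores only keys and the measure; the descriptive labels in the report come from the dimension tables through the joins. Grouping by `r.region` instead of the city key would roll the same facts up to the region level.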