2. Shanu Sharma, CSE-ASET
TOPICS COVERED
Definition of data warehouse
Characteristics of data warehouse
Data mart
Components of data warehouse
Metadata
Applications of data warehouse
OLTP vs. data warehouse
CONCEPT OF DATA WAREHOUSE
Take all the data you already have in the organization,
clean and transform it, and then provide useful strategic
information.
DEFINITION OF DATA WAREHOUSE
Bill Inmon (1996), considered to be the father of data
warehousing, stated:
“A DW is a subject-oriented, integrated, non-volatile,
time-variant collection of data in favor of decision-making”.
Sean Kelly said that data in the data warehouse is:
“Separate, available, integrated, time-stamped, subject-oriented,
non-volatile, accessible”
1. SUBJECT ORIENTED DATA
In operational systems, data is stored by individual
applications or business processes, such as data about
individual orders, customers, etc.
For example, in the banking industry, data sets for savings or
checking accounts contain data about that particular
application.
But in a DW, data is stored by real-world business
objects or events (subjects), not by applications.
2. INTEGRATED DATA
Data in a DW comes from several operational systems.
Different data sets have different file formats.
Example: Data for the subject Account comes from 3 different
data sources.
So variations could exist, such as:
Naming conventions could be different.
Attributes for data items could be different.
For example, a savings account number could be 8 bytes long but only 6
bytes for checking accounts.
Before moving the data into the data warehouse,
you have to go through a process of
transformation, consolidation, and integration of
the source data.
Here are some of the items that would need
standardization:
Naming conventions
Codes
Data attributes
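The standardization items above can be sketched in a few lines of Python. This is only an illustration: the field names (`acct_no`, `ACCOUNT_NUM`, `type_code`), the code table, and the 8-character account width are assumptions, not taken from any real system.

```python
# Hypothetical records from two operational systems with different
# naming conventions, codes, and attribute lengths.
savings_rows = [{"acct_no": "12345678", "type_code": "S"}]
checking_rows = [{"ACCOUNT_NUM": "654321", "type_code": "CHK"}]

# Assumed mapping of source-specific codes to one standard code set.
STANDARD_CODES = {"S": "SAVINGS", "SAV": "SAVINGS", "C": "CHECKING", "CHK": "CHECKING"}
ACCOUNT_WIDTH = 8  # assumed standard length for account numbers


def standardize(row, account_field):
    """Return a record in the warehouse's standard layout."""
    return {
        "account_number": row[account_field].zfill(ACCOUNT_WIDTH),
        "account_type": STANDARD_CODES[row["type_code"]],
    }


integrated = (
    [standardize(r, "acct_no") for r in savings_rows]
    + [standardize(r, "ACCOUNT_NUM") for r in checking_rows]
)
```

After this step both sources share one field name, one code set, and one account-number length, which is what integration requires before loading.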
3. TIME VARIANT DATA
In operational systems, the stored data contains current
values.
For example, in a savings account system, the balance is the current
balance of the customer.
But the data in the DW is meant for analysis and decision
making.
Comparative analysis is one of the best techniques for
business performance evaluation.
Time is a critical factor for comparative analysis.
Every data structure in the DW contains a time element.
So the DW has to contain historical data as well as current
values.
Data is stored as snapshots over past and current
periods.
The time-variant nature of the data in a data warehouse:
Allows for analysis of the past
Relates information to the present
Enables forecasts for the future
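A minimal sketch of snapshot-based comparative analysis follows. The snapshot rows, dates, and balances are invented for illustration; only the idea of one time-stamped row per period comes from the slides.

```python
from datetime import date

# Hypothetical balance snapshots: one row per account per period end.
snapshots = [
    {"account_id": 23876, "as_of": date(2023, 1, 31), "balance": 1200.0},
    {"account_id": 23876, "as_of": date(2023, 2, 28), "balance": 1500.0},
    {"account_id": 23876, "as_of": date(2023, 3, 31), "balance": 1350.0},
]


def balance_changes(rows):
    """Compare each snapshot with the previous one (comparative analysis)."""
    ordered = sorted(rows, key=lambda r: r["as_of"])
    return [
        (later["as_of"], later["balance"] - earlier["balance"])
        for earlier, later in zip(ordered, ordered[1:])
    ]


changes = balance_changes(snapshots)
```

An operational system that keeps only the current balance could not answer this kind of period-over-period question.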
4. NON VOLATILE DATA
Data from operational systems is moved into the DW at
specific intervals.
Not every business transaction updates the DW.
Data in the DW is not deleted.
Nor is data changed by individual transactions.
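The non-volatile behaviour described above can be pictured as an append-only store that is loaded in batches. The `Warehouse` class below is a toy written for this illustration, not an API of any product.

```python
class Warehouse:
    """Toy append-only store: rows are added in batches, never updated or deleted."""

    def __init__(self):
        self._rows = []

    def load_batch(self, batch, load_date):
        # Each periodic load appends new time-stamped rows.
        for row in batch:
            self._rows.append({**row, "load_date": load_date})

    def rows(self):
        # Return copies so callers cannot modify stored data.
        return [dict(r) for r in self._rows]


dw = Warehouse()
dw.load_batch([{"account_id": 23876, "balance": 1200.0}], "2023-01-31")
dw.load_batch([{"account_id": 23876, "balance": 1500.0}], "2023-02-28")
```

Both loads remain visible afterwards; the second load adds a new row instead of overwriting the first.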
Subject Oriented: Organized along the lines of the subjects of the
corporation. Typical subjects are customer, product, vendor and
transaction.
Time-Variant: Every record in the data warehouse has some form of time
variancy attached to it.
Non-Volatile: Refers to the inability of data to be updated. Every
record in the data warehouse is time stamped in one form or another.
DATA GRANULARITY
Data granularity refers to the level of detail of the data in the data
warehouse.
The lower the level of detail, the finer the data granularity.
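To make granularity concrete, the sketch below rolls hypothetical daily sales rows up to monthly totals. The data and the `roll_up_to_month` helper are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical fine-grained data: one row per product per day.
daily_sales = [
    ("2023-01-05", "widget", 10),
    ("2023-01-20", "widget", 5),
    ("2023-02-02", "widget", 7),
]


def roll_up_to_month(rows):
    """Summarize daily rows into coarser monthly totals."""
    totals = defaultdict(int)
    for day, product, units in rows:
        month = day[:7]  # "YYYY-MM"
        totals[(month, product)] += units
    return dict(totals)


monthly_sales = roll_up_to_month(daily_sales)
```

The daily rows are the finer grain: monthly totals can be derived from them, but the daily detail cannot be recovered from the monthly summary.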
DATA WAREHOUSES AND DATA MARTS
In 1998, Bill Inmon stated:
“The single most important issue facing the IT manager this
year is whether to build the data warehouse first or the
data mart first”.
How are they different?
In any organization, there are basically two approaches
to managing data for analysis purposes.
1. Top Down Approach
The centralized data warehouse would feed the
dependent data marts that may be designed based on
a dimensional data model.
In this approach data in the data warehouse is stored at
the lowest level of granularity based on a normalized
data model.
Advantages:
An enterprise view of data
Not a union of disparate data marts
Centralized rules and control
Disadvantages:
Slow approach
High exposure to risk of failure
2. Bottom Up Approach
In this approach, data marts are created first to provide
analytical capability for specific business subjects, based on
a dimensional data model.
These data marts are then joined or unioned by conforming
the dimensions to create a DW.
Advantages:
Faster and easier implementation
Less risk of failure
Allows project team to learn and grow
Disadvantages:
Redundant data in every data mart.
Inconsistent data
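As a rough sketch of what conforming a dimension means in the bottom-up approach, the snippet below joins two hypothetical data marts once their customer keys are mapped to a shared form. The mart layouts and the `C-001` key format are invented for illustration.

```python
# Two hypothetical data marts, each with its own customer key format.
sales_mart = [{"cust": "C-001", "amount": 250.0}]
support_mart = [{"customer_id": 1, "tickets": 3}]


def conform_sales_key(key):
    """Map the sales mart's 'C-001' style key to the shared integer key."""
    return int(key.split("-")[1])


# Join the marts on the conformed customer key.
combined = {}
for row in sales_mart:
    combined.setdefault(conform_sales_key(row["cust"]), {})["amount"] = row["amount"]
for row in support_mart:
    combined.setdefault(row["customer_id"], {})["tickets"] = row["tickets"]
```

Without the conforming step the two marts would describe the same customer under different keys, which is one source of the inconsistency listed among the disadvantages.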
1. SOURCE DATA COMPONENT
Production Data
Comes from the various operational systems of the enterprise.
Internal Data
Such as private documents, customer profiles, departmental
databases, etc.
External Data
Statistical data produced by external agencies. Used for
comparing performance against other organizations.
Archived Data
In every operational system, old data is periodically stored
in archived files or on disk storage. This data is also required,
as the data warehouse keeps historical snapshots of data.
2. DATA STAGING COMPONENT
After data is extracted, it has to be prepared.
Data extracted from the sources needs to be changed,
converted and made ready in a suitable format.
Three major functions make the data ready:
Extract
Transform
Load
The staging area provides a place with a set of
functions to:
Clean
Change
Combine
Convert
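The three staging functions can be sketched as a tiny pipeline. Everything here (the row layout, the cleaning rules, the list used as the warehouse) is a simplifying assumption; real ETL tools do far more.

```python
def extract(source_rows):
    """Extract: read raw rows from a (hypothetical) operational source."""
    return list(source_rows)


def transform(rows):
    """Transform: clean and convert rows into the target format."""
    return [
        {"name": r["name"].strip().title(), "balance": float(r["balance"])}
        for r in rows
    ]


def load(rows, warehouse):
    """Load: append prepared rows to the warehouse store."""
    warehouse.extend(rows)
    return warehouse


warehouse_table = []
raw = [{"name": "  alice SMITH ", "balance": "1200.50"}]
load(transform(extract(raw)), warehouse_table)
```

Keeping the three steps as separate functions mirrors the staging area's role: raw data is changed and converted there before anything reaches the warehouse.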
Different techniques are used for extracting data from
different data sources.
Data transformation includes:
Data cleaning: correcting misspellings, resolving
conflicts, providing default values for missing data
elements, removing duplicates, etc.
Standardization of data: standardizing data types and field
lengths; semantic standardization such as resolving
synonyms and homonyms.
Sorting, merging, etc.
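A few of the transformation tasks listed above (spelling correction, default values, duplicate removal, sorting) might look like this in Python. The lookup table, the default value, and the sample rows are assumptions made for the example.

```python
# Assumed lookup of known misspellings; a real system would use a larger reference.
SPELLING_FIXES = {"Dehli": "Delhi", "Mumbay": "Mumbai"}
DEFAULT_CITY = "UNKNOWN"  # assumed default for missing values


def clean(rows):
    """Correct misspellings, fill defaults, and drop duplicate rows."""
    seen, cleaned = set(), []
    for row in rows:
        city = row.get("city") or DEFAULT_CITY
        city = SPELLING_FIXES.get(city, city)
        record = (row["customer_id"], city)
        if record not in seen:  # remove duplication
            seen.add(record)
            cleaned.append({"customer_id": record[0], "city": record[1]})
    return sorted(cleaned, key=lambda r: r["customer_id"])  # sorting step


rows = [
    {"customer_id": 2, "city": "Dehli"},
    {"customer_id": 1, "city": None},
    {"customer_id": 2, "city": "Delhi"},
]
cleaned_rows = clean(rows)
```

Note that the duplicate is only detected after the misspelling is corrected, which is why cleaning usually precedes de-duplication.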
3. DATA STORAGE COMPONENTS
Separate repository
Data structured for efficient processing
Updated after specific periods
Read-only
4. INFORMATION DELIVERY COMPONENT
It includes various methods of delivering information,
depending on the type of user. For example:
Ad hoc reports or predefined reports for novice and casual
users.
Statistical analysis for business analysts.
It also provides information to data mining applications.
METADATA COMPONENT
The metadata component is the data about the data in the data
warehouse.
Metadata in a data warehouse contains the answers to
questions about the data in the data warehouse.
It serves as a directory of the contents of the data
warehouse.
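One way to picture metadata as a directory is a simple lookup keyed by column. The entries below, including the source system names and descriptions, are hypothetical; only the table and column names echo the queries used later in these slides.

```python
# Hypothetical metadata entries describing warehouse tables and columns.
metadata = {
    "customer_table.age": {
        "data_type": "INTEGER",
        "source": "crm_system",  # assumed source system name
        "extraction_frequency": "daily",
        "description": "Customer age in years",
    },
    "tx_table.balance": {
        "data_type": "DECIMAL",
        "source": "core_banking",  # assumed source system name
        "extraction_frequency": "daily",
        "description": "Account balance after the transaction",
    },
}


def describe(column):
    """Answer 'what is this column and where does it come from?'"""
    entry = metadata[column]
    return f"{column}: {entry['description']} ({entry['data_type']}, from {entry['source']})"
```

Calling `describe("tx_table.balance")` returns a one-line answer to the kind of question a warehouse user might ask about the data.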
TYPES OF METADATA
Operational Metadata
Contains information about the operational data sources,
such as field lengths, data types, etc.
Extraction and Transformation Metadata
Contains information such as extraction frequencies, extraction methods, etc.
End-User Metadata
APPLICATION AREAS
Industry | Application
Finance | Credit card analysis
Insurance | Claims, fraud analysis
Telecommunication | Call record analysis
Transport | Logistics management
Consumer goods | Promotion analysis
Data service providers | Value-added data
Utilities | Power usage analysis
TYPICAL APPLICATIONS
The impact on the organization's core business is to
streamline operations and maximize profitability.
Fraud detection.
Profitability analysis.
Direct mail/database marketing.
Credit risk prediction.
Yield management.
Inventory management.
TYPICAL APPLICATIONS
Fraud detection
By observing data usage patterns.
People have typical purchase patterns.
Deviation from patterns.
Certain cities notorious for fraud.
Certain items bought by stolen cards.
Similar behavior for stolen phone cards.
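The idea of flagging deviations from a typical purchase pattern can be sketched with a simple standard-deviation rule. The purchase history and the threshold of 3 are assumptions; this is a toy rule, not a production fraud model.

```python
from statistics import mean, stdev

# Hypothetical purchase history for one card holder (amounts in one currency).
history = [42.0, 38.5, 51.0, 45.25, 40.0, 47.5]


def is_unusual(amount, past_amounts, threshold=3.0):
    """Flag a purchase that deviates strongly from the holder's usual pattern."""
    mu, sigma = mean(past_amounts), stdev(past_amounts)
    return abs(amount - mu) > threshold * sigma
```

With this history, a purchase of 900.0 is flagged while a purchase of 44.0 is not. A real system would combine many more signals, such as the city and item type mentioned above.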
TYPICAL APPLICATIONS
Profitability Analysis
Banks know if they are profitable or not.
They don't know which customers are profitable.
Typically more than 50% are NOT profitable.
But which ones?
Balance is not enough; transactional behavior is the key.
Restructure products and pricing strategies.
Lifetime profitability models (next 3-5 years).
TYPICAL APPLICATIONS
Direct mail marketing
Targeted marketing.
Offer a high-bandwidth package only to selected users, NOT to all.
Identify them from call detail records of web surfing.
Saves marketing expense, saving pennies.
Knowing your customers better.
TYPICAL APPLICATIONS
Credit risk prediction
Who should get a loan?
Qualitative decision making NOT subjective.
Different interest rates for different customers.
Do not subsidize bad customers at the expense of good ones.
TYPICAL APPLICATIONS
Yield Management
Works for fixed-inventory businesses.
Item prices vary for different customers.
Example: airlines, hotels, etc.
The price of (say) an air ticket depends on:
How far in advance was the ticket bought?
How many vacant seats were there?
How profitable is the customer?
Is the ticket one-way or return?
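The four pricing factors above could feed a rule like the toy function below. Every multiplier and cut-off in it is an invented assumption for illustration; actual yield management systems derive such values from historical data in the warehouse.

```python
def ticket_price(base_fare, days_in_advance, vacant_seats, total_seats,
                 profitable_customer, is_return):
    """Toy fare rule; every multiplier here is an illustrative assumption."""
    price = base_fare
    if days_in_advance < 7:
        price *= 1.5  # assumed: late bookings pay more
    load_factor = 1 - vacant_seats / total_seats
    price *= 1 + 0.5 * load_factor  # assumed: fuller flights cost more
    if profitable_customer:
        price *= 0.9  # assumed: discount for valuable customers
    if is_return:
        price *= 1.8  # assumed: return priced below two one-way tickets
    return round(price, 2)
```

For example, under these assumptions a one-way ticket bought 3 days ahead on a half-full flight costs more than the same ticket bought 30 days ahead on an empty one.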
RECENT APPLICATION
Agriculture Systems
Agricultural and related data collected for decades.
Decision making based on expert judgment.
Lack of integration results in underutilization.
What is required, in which amount and when?
DATA WAREHOUSE VS. OLTP
OLTP (Online Transaction Processing)
Select tx_date, balance from tx_table
Where account_ID = 23876;
DATA WAREHOUSE VS. OLTP
DWH
Select balance, age, sal, gender from
customer_table, tx_table
Where age between 30 and 40 and
Education = 'graduate' and
customer_table.CustID =
tx_table.Customer_ID;
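To try the two query styles side by side, the sketch below loads a few made-up rows into an in-memory SQLite database using Python's standard `sqlite3` module. The table and column names follow the queries above; the column types and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_table (CustID INTEGER PRIMARY KEY, age INTEGER,
                                 sal REAL, gender TEXT, Education TEXT);
    CREATE TABLE tx_table (tx_date TEXT, balance REAL,
                           account_ID INTEGER, Customer_ID INTEGER);
    INSERT INTO customer_table VALUES (1, 35, 50000, 'F', 'graduate'),
                                      (2, 52, 70000, 'M', 'graduate'),
                                      (3, 31, 40000, 'M', 'graduate');
    INSERT INTO tx_table VALUES ('2023-01-05', 1200, 23876, 1),
                                ('2023-01-09',  800, 23876, 1),
                                ('2023-01-07',  300, 11111, 3);
""")

# OLTP-style lookup: one account, few rows.
oltp_rows = conn.execute(
    "SELECT tx_date, balance FROM tx_table WHERE account_ID = ?", (23876,)
).fetchall()

# DWH-style analysis: join two tables and filter on non-key attributes.
dwh_rows = conn.execute("""
    SELECT balance, age, sal, gender
    FROM customer_table JOIN tx_table
      ON customer_table.CustID = tx_table.Customer_ID
    WHERE age BETWEEN 30 AND 40 AND Education = 'graduate'
""").fetchall()
```

The OLTP query touches a single table and pinpoints one account, while the DWH query spans both tables and selects every customer who matches a broad condition.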
DATA WAREHOUSE VS. OLTP
OLTP | DWH
Primary key used | Primary key NOT used
No concept of primary index | Primary index used
Few rows returned | Many rows returned
May use a single table | Uses multiple tables
High selectivity of query | Low selectivity of query
Indexing on primary key (unique) | Indexing on primary index (non-unique)
COMPARISON OF RESPONSE TIMES
On-line analytical processing (OLAP) queries must be
executed in a small number of seconds.
Often requires denormalization and/or sampling.
Complex query scripts and large list selections can
generally be executed in a small number of minutes.
Sophisticated clustering algorithms (e.g., data mining)
can generally be executed in a small number of hours
(even for hundreds of thousands of customers).
DATA WAREHOUSE FOR DECISION SUPPORT & OLAP
Putting information technology to work to help the
knowledge worker make faster and better
decisions:
Which of my customers are most likely to go to
the competition?
What product promotions have the biggest
impact on revenue?
How did the share price of software companies
correlate with profits over the last 10 years?