DWDM 2 (Data Warehouse)

Shanu Sharma, CSE-ASET
DATA WAREHOUSE: THE BUILDING BLOCKS

TOPICS COVERED
• Definition of data warehouse
• Characteristics of data warehouse
• Data mart
• Components of data warehouse
• Metadata
• Applications of data warehouse
• OLTP vs. data warehouse

CONCEPT OF DATA WAREHOUSE
Take all the data you already have in the organization,
clean and transform it, and then provide useful strategic
information.

DEFINITION OF DATA WAREHOUSE
Bill Inmon, considered to be the father of data warehousing, stated (1996):
• "A DW is a subject-oriented, integrated, non-volatile, time-variant collection of data in support of decision-making."
Sean Kelly said that the data in the data warehouse is:
• "Separate, available, integrated, time-stamped, subject-oriented, non-volatile, accessible."

CHARACTERISTICS OF DATA WAREHOUSE
• Subject oriented
• Integrated
• Time variant
• Non-volatile

1. SUBJECT ORIENTED DATA
• In operational systems, data is stored by individual applications or business processes, for example data about an individual order or customer.
• For example, in the banking industry, the data sets for savings or checking accounts each contain data about that particular application.
• In a DW, however, data is stored by real-world business subjects or events, not by applications.

In a DW, the subject is the organizing principle.
Subjects vary from enterprise to enterprise.
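
The difference between application-oriented and subject-oriented storage can be sketched in a few lines of Python. This is only an illustration: the application names, account numbers and balances are made-up assumptions, not data from any real system.

```python
from collections import defaultdict

# Operational view: each application keeps its own data set (hypothetical rows).
application_data = {
    "savings_app": [{"customer": "C1", "account": "S-01", "balance": 500.0}],
    "checking_app": [
        {"customer": "C1", "account": "K-07", "balance": 120.0},
        {"customer": "C2", "account": "K-09", "balance": 80.0},
    ],
}

# Warehouse view: the same rows reorganized around the subject "customer".
by_customer = defaultdict(list)
for application, rows in application_data.items():
    for row in rows:
        by_customer[row["customer"]].append(
            {"account": row["account"], "balance": row["balance"], "source": application}
        )

# A question about the subject can now be answered without knowing
# which application each account came from.
for customer, accounts in sorted(by_customer.items()):
    total = sum(account["balance"] for account in accounts)
    print(customer, len(accounts), total)
```

Grouping by customer is just one possible choice of subject; as noted above, the subjects themselves differ from one enterprise to another.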

2. INTEGRATED DATA
• Data in the DW comes from several operational systems.
• Different data sets have different file formats.
Example: data for the subject Account comes from three different data sources, so there can be variations such as:
  - Different naming conventions.
  - Different attributes for the same data item, e.g. a savings account number may be 8 bytes long while a checking account number is only 6 bytes.

• Before moving the data into the data warehouse, you have to go through a process of transformation, consolidation, and integration of the source data.
• Here are some of the items that would need standardization:
  - Naming conventions
  - Codes
  - Data attributes
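
The standardization items listed above (names, codes, attributes) can be illustrated with a small Python sketch. It is not a production routine: the three source systems, their field names (`acct_no`, `AccountNumber`, `acno`), the gender code values and the 8-character target width are all assumptions invented for the example.

```python
# Hypothetical records for the subject "Account" from three source systems.
# Field names, code values and widths are illustrative assumptions.
savings = [{"acct_no": "12345678", "cust_sex": "M"}]
checking = [{"AccountNumber": "654321", "gender": "1"}]
loans = [{"acno": "00998877", "sex": "female"}]

# Naming conventions: source field name -> warehouse field name, per source.
FIELD_MAP = {
    "savings": {"acct_no": "account_id", "cust_sex": "gender"},
    "checking": {"AccountNumber": "account_id", "gender": "gender"},
    "loans": {"acno": "account_id", "sex": "gender"},
}

# Codes: each source encodes gender differently; map all to one code set.
GENDER_CODES = {"m": "M", "male": "M", "0": "M", "f": "F", "female": "F", "1": "F"}

# Data attributes: assumed standard width of an account id in the warehouse.
ACCOUNT_ID_WIDTH = 8


def integrate(source_name, records):
    """Rename fields, unify codes and pad account ids to a common width."""
    mapping = FIELD_MAP[source_name]
    result = []
    for record in records:
        row = {mapping[name]: value for name, value in record.items()}
        row["account_id"] = row["account_id"].zfill(ACCOUNT_ID_WIDTH)
        row["gender"] = GENDER_CODES[row["gender"].lower()]
        row["source"] = source_name
        result.append(row)
    return result


accounts = (
    integrate("savings", savings)
    + integrate("checking", checking)
    + integrate("loans", loans)
)
for row in accounts:
    print(row)
```

After this step every row has the same field names, the same code set and the same account id width, regardless of which source it came from.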

3. TIME VARIANT DATA
• In operational systems, the stored data contains current values. For example, in a savings account system the balance is the customer's current balance.
• The data in the DW, however, is meant for analysis and decision making.
• Comparative analysis is one of the best techniques for business performance evaluation.
• Time is a critical factor for comparative analysis.
• Every data structure in the DW contains a time element.

• So the DW has to contain historical data as well as current values.
• Data is stored as snapshots over past and current periods.
The time-variant nature of the data in a data warehouse:
  - Allows for analysis of the past
  - Relates information to the present
  - Enables forecasts for the future
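
A minimal Python sketch of the snapshot idea follows. The account id, dates and balances are hypothetical, and the month-end load schedule is an assumption chosen only to keep the example short.

```python
from datetime import date

# Operational view: only the current balance per account is kept.
operational_balance = {"A-100": 0.0}

# Warehouse view: one row per account per snapshot date (append-only).
balance_snapshots = []


def take_snapshot(snapshot_date):
    """Copy the current operational balances into the warehouse with a date."""
    for account_id, balance in operational_balance.items():
        balance_snapshots.append(
            {"account_id": account_id, "snapshot_date": snapshot_date, "balance": balance}
        )


# Simulate three month-end loads with changing operational values.
for month_end, balance in [
    (date(2024, 1, 31), 1000.0),
    (date(2024, 2, 29), 1500.0),
    (date(2024, 3, 31), 1200.0),
]:
    operational_balance["A-100"] = balance  # overwritten in place
    take_snapshot(month_end)


def change_between(account_id, start, end):
    """Comparative analysis: balance change between two snapshot dates."""
    by_date = {
        row["snapshot_date"]: row["balance"]
        for row in balance_snapshots
        if row["account_id"] == account_id
    }
    return by_date[end] - by_date[start]


print(operational_balance["A-100"])  # only the latest value survives here
print(change_between("A-100", date(2024, 1, 31), date(2024, 3, 31)))
```

The operational dictionary can answer only "what is the balance now?", while the snapshot list, because every row carries a time element, can also answer "how did it change between two periods?".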

4. NON VOLATILE DATA
• Data from operational systems is moved into the DW at specific intervals.
• Not every business transaction updates the DW.
• Data is not deleted from the DW.
• Data is not changed by individual transactions.
Subject Oriented: Organized along the lines of the subjects of the corporation. Typical subjects are customer, product, vendor and transaction.
Time-Variant: Every record in the data warehouse has some form of time variance attached to it.
Non-Volatile: Refers to the inability of data to be updated. Every record in the data warehouse is time-stamped in one form or another.

DATA GRANULARITY
Data granularity refers to the level of detail of the data in the data warehouse.
The lower the level of detail (for example, individual transactions rather than monthly summaries), the finer the data granularity.
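
The effect of granularity can be shown with a short Python sketch that summarizes the same data at two grains. The transaction rows are hypothetical, and slicing an ISO date string to obtain the month is a shortcut assumed for this example.

```python
from collections import defaultdict

# Finest grain: one row per individual transaction (hypothetical data).
transactions = [
    {"account_id": "A-100", "date": "2024-01-05", "amount": 200.0},
    {"account_id": "A-100", "date": "2024-01-05", "amount": -50.0},
    {"account_id": "A-100", "date": "2024-01-20", "amount": 75.0},
    {"account_id": "A-100", "date": "2024-02-02", "amount": -25.0},
]


def summarize(rows, key_length):
    """Aggregate amounts by account and a date prefix.

    key_length=10 keeps the full date (daily grain);
    key_length=7 keeps only 'YYYY-MM' (monthly grain).
    """
    totals = defaultdict(float)
    for row in rows:
        totals[(row["account_id"], row["date"][:key_length])] += row["amount"]
    return dict(totals)


daily = summarize(transactions, 10)
monthly = summarize(transactions, 7)

# Row counts shrink as the grain gets coarser.
print(len(transactions), len(daily), len(monthly))
print(monthly)
```

Coarser grains need fewer rows but can no longer answer questions about individual transactions, which is the trade-off behind the choice of granularity.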

DATA WAREHOUSES AND DATA MARTS
• In 1998, Bill Inmon stated:
"The single most important issue facing the IT manager this year is whether to build the data warehouse first or the data mart first."
How are they different?

• In any organization, there are basically two approaches to managing data for analysis.
1. Top-Down Approach
In this approach, data in the centralized data warehouse is stored at the lowest level of granularity, based on a normalized data model.
The centralized data warehouse then feeds the dependent data marts, which may be designed based on a dimensional data model.

Advantages:
  - An enterprise view of data
  - Not a union of disparate data marts
  - Centralized rules and control
Disadvantages:
  - Slow approach
  - High exposure to risk of failure

2. Bottom-Up Approach
In this approach, data marts are created first to provide analytical capability for specific business subjects, based on a dimensional data model.
These data marts are then joined or unioned by conforming the dimensions to create a DW.
Advantages:
  - Faster and easier implementation
  - Less risk of failure
  - Allows the project team to learn and grow
Disadvantages:
  - Redundant data in every data mart
  - Inconsistent data

DW: BUILDING BLOCKS OR COMPONENTS

1. SOURCE DATA COMPONENT
• Production data
Comes from the various operational systems of the enterprise.
• Internal data
Private documents, customer profiles, departmental databases, etc.
• External data
Statistical data produced by external agencies, used for comparing performance against other organizations.
• Archived data
In every operational system, old data is periodically stored in archive files or on disk storage. This data is also required, because the data warehouse keeps historical snapshots of data.

2. DATA STAGING COMPONENT
After data is extracted from the sources, it has to be changed, converted, and made ready in a suitable format.
• Three major functions make the data ready:
  - Extract
  - Transform
  - Load
• The staging area provides a place, with a set of functions, to:
  - Clean
  - Change
  - Combine
  - Convert

Data extraction: different techniques are used for extracting data from different data sources.
Data transformation includes:
  - Data cleaning: correcting misspellings, resolving conflicts, providing default values for missing data elements, removing duplicates, etc.
  - Standardization of data: standardizing data types and field lengths, plus semantic standardization such as resolving synonyms and homonyms.
  - Sorting, merging, etc.
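
The transformation tasks above can be sketched as a single Python function. The customer rows, the spelling and synonym tables and the default value are all hypothetical; a real staging area would drive these from maintained reference data rather than hard-coded dictionaries.

```python
# Hypothetical extracted customer rows with typical quality problems.
extracted = [
    {"customer_id": 1, "city": "Dehli", "segment": "retail"},
    {"customer_id": 2, "city": "Mumbai", "segment": None},
    {"customer_id": 1, "city": "Delhi", "segment": "retail"},  # duplicate of id 1
    {"customer_id": 3, "city": "Bombay", "segment": "corporate"},
]

SPELLING_FIXES = {"Dehli": "Delhi"}  # correction of misspellings
SYNONYMS = {"Bombay": "Mumbai"}      # semantic standardization of synonyms
DEFAULT_SEGMENT = "unknown"          # default for missing data elements


def transform(rows):
    """Clean, standardize, de-duplicate and sort the extracted rows."""
    seen = set()
    cleaned = []
    for row in rows:
        if row["customer_id"] in seen:
            continue  # keep only the first occurrence of each customer
        seen.add(row["customer_id"])
        city = SPELLING_FIXES.get(row["city"], row["city"])
        city = SYNONYMS.get(city, city)
        segment = row["segment"] or DEFAULT_SEGMENT
        cleaned.append(
            {"customer_id": row["customer_id"], "city": city, "segment": segment}
        )
    return sorted(cleaned, key=lambda r: r["customer_id"])


for row in transform(extracted):
    print(row)
```

Keeping the first occurrence of a duplicate is one simple conflict-resolution rule assumed here; other rules (latest record wins, most complete record wins) are equally possible.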

Data loading: movement of the data into the data warehouse.

3. DATA STORAGE COMPONENT
• Separate repository
• Data structured for efficient processing
• Updated at specific intervals
• Read-only

4. INFORMATION DELIVERY COMPONENT
• It includes various methods of delivering information, depending on the type of user. For example:
  - Ad hoc or predefined reports for novice and casual users.
  - Statistical analysis for business analysts.
• It also provides information to data mining applications.

METADATA COMPONENT
• Metadata is the data about the data in the data warehouse.
• Metadata in a data warehouse contains the answers to questions about the data in the data warehouse.
• It serves as a directory of the contents of the data warehouse.

TYPES OF METADATA
• Operational metadata
Contains information about the operational data sources, such as field lengths and data types.
• Extraction and transformation metadata
Contains information such as extraction frequencies and extraction methods.
• End-user metadata
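
How metadata "answers questions about the data" can be sketched with a small Python catalog. Every name and value below is a hypothetical illustration; in particular, the single `end_user` entry is only an assumed example of a business-friendly description, since the slides do not describe that type further.

```python
# A hypothetical metadata catalog; every name and value below is illustrative.
metadata = {
    "operational": {
        "savings.acct_no": {"data_type": "CHAR", "length": 8},
        "checking.AccountNumber": {"data_type": "CHAR", "length": 6},
    },
    "extraction_transformation": {
        "savings": {"frequency": "daily", "method": "incremental"},
        "checking": {"frequency": "weekly", "method": "full refresh"},
    },
    "end_user": {
        "account_id": "Standardized account number used across all subjects",
    },
}


def describe_source_field(field):
    """Answer 'what type and length does this source field have?'"""
    entry = metadata["operational"][field]
    return f"{field}: {entry['data_type']}({entry['length']})"


def sources_extracted(frequency):
    """Answer 'which sources are extracted at this frequency?'"""
    return sorted(
        name
        for name, info in metadata["extraction_transformation"].items()
        if info["frequency"] == frequency
    )


print(describe_source_field("checking.AccountNumber"))
print(sources_extracted("daily"))
```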

TYPES & TYPICAL APPLICATIONS OF DWH

APPLICATION AREAS
Industry                 Application
Finance                  Credit card analysis
Insurance                Claims, fraud analysis
Telecommunication        Call record analysis
Transport                Logistics management
Consumer goods           Promotion analysis
Data service providers   Value-added data
Utilities                Power usage analysis

TYPICAL APPLICATIONS
The impact on an organization's core business is to streamline the business and maximize profitability.
• Fraud detection
• Profitability analysis
• Direct mail/database marketing
• Credit risk prediction
• Yield management
• Inventory management

Fraud detection
• Detected by observing data usage patterns.
• People have typical purchase patterns; fraud shows up as a deviation from these patterns.
• Certain cities are notorious for fraud.
• Certain items are typically bought with stolen cards.
• Similar behavior is seen with stolen phone cards.

Profitability Analysis
• Banks know whether they are profitable or not.
• They don't know which customers are profitable.
• Typically more than 50% of customers are NOT profitable, but the bank doesn't know which ones.
• Balance alone is not enough; transactional behavior is the key.
• Helps restructure products and pricing strategies.
• Lifetime profitability models (next 3-5 years).

Direct mail marketing
• Targeted marketing.
• Example: offering a high-bandwidth package NOT to all users, but only to those whose call detail records show web surfing.
• Saves marketing expense, saving pennies.
• Knowing your customers better.

Credit risk prediction
• Who should get a loan?
• Decisions are based on data, NOT on subjective judgment.
• Different interest rates for different customers.
• Do not subsidize bad customers at the expense of good ones.

Yield Management
• Works for fixed-inventory businesses.
• Item prices vary for different customers.
• Examples: airlines, hotels, etc.
• The price of (say) an air ticket depends on:
  - How far in advance was the ticket bought?
  - How many seats were vacant?
  - How profitable is the customer?
  - Is the ticket one-way or return?

RECENT APPLICATION
Agriculture Systems
• Agricultural and related data has been collected for decades.
• Decision making is based on expert judgment.
• Lack of integration results in underutilization of the data.
• Key questions: what is required, in what amount, and when?

DATA WAREHOUSE VS. OLTP
OLTP (Online Transaction Processing)
SELECT tx_date, balance FROM tx_table
WHERE account_ID = 23876;

DWH
SELECT balance, age, sal, gender
FROM customer_table, tx_table
WHERE age BETWEEN 30 AND 40
  AND Education = 'graduate'
  AND customer_table.CustID = tx_table.Customer_ID;

OLTP                               DWH
Primary key used                   Primary key NOT used
No concept of primary index        Primary index used
Few rows returned                  Many rows returned
May use a single table             Uses multiple tables
High selectivity of query          Low selectivity of query
Indexing on primary key (unique)   Indexing on primary index (non-unique)
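
The two query styles contrasted above can be tried out with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the table and column names follow the slide queries, but the column types, the `Customer_ID` link on `tx_table` and all sample rows are invented for the example, and a handful of rows can only hint at the "few rows vs. many rows" difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer_table (
        CustID INTEGER PRIMARY KEY, age INTEGER, sal REAL,
        gender TEXT, Education TEXT
    );
    CREATE TABLE tx_table (
        account_ID INTEGER, Customer_ID INTEGER, tx_date TEXT, balance REAL
    );
    """
)
# Hypothetical sample data.
conn.executemany(
    "INSERT INTO customer_table VALUES (?, ?, ?, ?, ?)",
    [
        (1, 35, 50000.0, "F", "graduate"),
        (2, 52, 80000.0, "M", "graduate"),
        (3, 31, 42000.0, "M", "graduate"),
        (4, 38, 30000.0, "F", "undergraduate"),
    ],
)
conn.executemany(
    "INSERT INTO tx_table VALUES (?, ?, ?, ?)",
    [
        (23876, 1, "2024-01-05", 1200.0),
        (23876, 1, "2024-02-05", 900.0),
        (40001, 2, "2024-01-09", 7000.0),
        (40002, 3, "2024-01-11", 350.0),
        (40002, 3, "2024-02-11", 410.0),
        (40003, 4, "2024-01-15", 95.0),
    ],
)

# OLTP style: look up one account by its identifier; one table, few rows.
oltp_rows = conn.execute(
    "SELECT tx_date, balance FROM tx_table WHERE account_ID = ?", (23876,)
).fetchall()

# DWH style: filter on non-key attributes and join two tables; many rows.
dwh_rows = conn.execute(
    """
    SELECT balance, age, sal, gender
    FROM customer_table
    JOIN tx_table ON customer_table.CustID = tx_table.Customer_ID
    WHERE age BETWEEN 30 AND 40 AND Education = 'graduate'
    """
).fetchall()

print(len(oltp_rows), len(dwh_rows))
```

The OLTP query touches only the rows of one account, while the DWH query selects on attributes shared by many customers, which is why its selectivity is low and its result set grows with the size of the warehouse.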

COMPARISON OF RESPONSE TIMES
• Online analytical processing (OLAP) queries must be executed in a small number of seconds.
  - This often requires denormalization and/or sampling.
• Complex query scripts and large list selections can generally be executed in a small number of minutes.
• Sophisticated clustering algorithms (e.g., data mining) can generally be executed in a small number of hours (even for hundreds of thousands of customers).

DATA WAREHOUSE FOR DECISION SUPPORT & OLAP
• Putting information technology to work to help the knowledge worker make faster and better decisions:
  - Which of my customers are most likely to go to the competition?
  - Which product promotions have the biggest impact on revenue?
  - How did the share price of software companies correlate with profits over the last 10 years?
