Data Warehouse: Basics
1. Data Warehouse
An Introduction
Lecture - 2
Dept of MCA, NIT, Durgapur. September 6, 2012 1
2. Data, Data everywhere yet ...
I can’t find the data I need
data is scattered over the network
many versions, subtle differences
I can’t get the data I need
need an expert to get the data
I can’t understand the data I found
available data poorly documented
I can’t use the data I found
results are unexpected
data needs to be transformed from one form to
another
3. What Do We Need?
A single, complete and consistent
store of data obtained from a variety
of different sources made available to
end users in a way they can
understand and use, in a Business
Context / Subject.
[Barry Devlin]
Leads towards Business Analysis
4. Subject Orientation
Organized around major subjects, such as
customer, product, sales.
Focusing on the modeling and analysis of data for
decision makers, not on daily operations or
transaction processing.
Provide a simple and concise view around
particular subject issues, by excluding data that are
not useful in the decision support process.
5. What Are the Analytical Needs?
Which are our lowest/highest margin customers?
Who are my customers and what products are they buying?
What is the most effective distribution channel?
What product promotions have the biggest impact on revenue?
Which customers are most likely to go to the competition?
What impact will new products/services have on revenue and margins?
6. Decision Support System
Used to manage and control business
Data is historical or point-in-time
Optimized for inquiry rather than update
Use of the system is loosely defined and can
be ad-hoc
Used by managers and end-users to
understand the business and make
judgements
7. Evolution of Decision Support
60’s: Batch reports
hard to find and analyze information
inflexible and expensive; every new request requires reprogramming
70’s: Terminal based DSS and EIS
80’s: Desktop data access and analysis tools
query tools, spreadsheets, GUIs
easy to use, but access only operational db
90’s: Data warehousing with integrated OLAP engines and
tools
To meet the analytical needs of the business.
8. What are the users saying...
Data should be integrated across the
enterprise
Summary data has real value to
the organization
Historical data holds the key to
understanding data over time
What-if capabilities are required
9. Need Separate Process?
A technique for assembling and
managing data from various sources
for the purpose of answering business
questions, thus enabling decisions that
were not previously possible.
A decision support database
maintained separately from the
organization’s operational database
10. Traditional RDBMS used for OLTP
Database Systems have been used traditionally
for OLTP
clerical data processing tasks
detailed, up to date data
structured repetitive tasks
read/update a few records
isolation, recovery and integrity are critical
Normalization is mandatory
We will call these operational databases
11. Decision Support Database
Defined in many different ways, but not
rigorously.
A decision support database that is
maintained separately from the
organization’s operational database
Support information processing by providing
a solid platform of consolidated, historical
data for analysis.
12. Some Common Terms
Operational databases: Operational databases are detail oriented
databases defined to meet the needs of sometimes very complex
processes in a company. This detailed view is reflected in the data
arrangement in the database. The data is highly normalized to avoid data
redundancy and complex maintenance.
OLTP: On-Line Transaction Processing (OLTP) describes the way data
is processed by an end user or a computer system. It is detail oriented,
highly repetitive with massive amounts of updates and changes of the
data by the end user. It is also very often described as the use of
computers to run the on-going operation of a business.
13. Some Common Terms
Cont…
Data warehouse: A data warehouse collects, organizes, and makes
data available for the purpose of analysis — to give management the
ability to access and analyze information about its business. This type
of data can be called "informational data". The systems used to work
with informational data are referred to as OLAP (On-Line Analytical
Processing).
We will call it the informational database.
14. Some Common Terms
Cont…
Operational versus informational databases
The major difference between operational and informational databases is the
update frequency:
1. On operational databases a high number of transactions take place every
hour. The database is always "up to date" and represents a snapshot of
the current business situation, commonly referred to as point-in-time
data.
2. Informational databases are usually stable over a period of time and
represent the situation at a specific point in time in the past; this is
known as historical data.
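The contrast between an always-current operational store and a stable, dated informational store can be sketched in a few lines. This is only an illustration under stated assumptions: the account names, balances, dates, and the helpers `apply_transaction`, `take_snapshot`, and `balance_as_of` are all hypothetical and do not come from the lecture.

```python
from datetime import date

# Operational store (hypothetical data): one current value per account,
# overwritten in place by every transaction.
operational = {"acct-1": 100, "acct-2": 250}

# Informational store: append-only rows, each tagged with the date of the snapshot.
warehouse = []

def apply_transaction(account, delta):
    """Update the operational store in place; the previous value is lost."""
    operational[account] += delta

def take_snapshot(as_of):
    """Copy the current operational state into the warehouse, stamped with as_of."""
    for account, balance in operational.items():
        warehouse.append({"as_of": as_of, "account": account, "balance": balance})

def balance_as_of(account, as_of):
    """Look up the historical balance recorded for a given snapshot date."""
    for row in warehouse:
        if row["account"] == account and row["as_of"] == as_of:
            return row["balance"]
    return None

take_snapshot(date(2012, 8, 31))
apply_transaction("acct-1", -40)
apply_transaction("acct-2", 10)
take_snapshot(date(2012, 9, 30))
```

After these calls the operational store only knows the latest balances, while the informational store can still answer what the balance was at the end of August.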
15. Some Common Terms
Cont…
OLAP: On-Line Analytical Processing (OLAP) is a category of software
technology that enables analysts, managers and executives to gain insight into
data through fast, consistent, interactive access to a wide variety of possible
views of information that has been transformed from raw data to reflect the real
dimensionality of the enterprise as understood by the user.
OLAP is implemented in a multi-user client/server mode and offers
consistently rapid response to queries, regardless of database size and
complexity. OLAP helps the user synthesize enterprise information through
comparative, personalized viewing, as well as through analysis of historical
and projected data in various "what-if" data model scenarios. This is achieved
through use of an OLAP Server.
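The idea of viewing the same data along different dimensions can be illustrated with a toy roll-up. This is a minimal sketch, not how an OLAP server is implemented: the fact rows use illustrative sales figures, and the `roll_up` helper is a hypothetical name introduced here.

```python
from collections import defaultdict

# Illustrative fact rows: two dimensions (product, quarter) and one measure (sales).
facts = [
    {"product": "XYZ", "quarter": "4Q96", "sales": 57},
    {"product": "XYZ", "quarter": "4Q97", "sales": 66},
    {"product": "ABC", "quarter": "4Q96", "sales": 29},
    {"product": "ABC", "quarter": "4Q97", "sales": 24},
    {"product": "PQR", "quarter": "4Q96", "sales": 115},
    {"product": "PQR", "quarter": "4Q97", "sales": 89},
]

def roll_up(rows, dimensions, measure="sales"):
    """Sum the measure for every distinct combination of the chosen dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)

by_product = roll_up(facts, ["product"])    # one view: sales per product
by_quarter = roll_up(facts, ["quarter"])    # another view: sales per quarter
grand_total = roll_up(facts, [])            # fully aggregated view
```

Each call produces a different view of the same facts. A real OLAP server aims to give such views quickly on far larger data, typically with precomputed aggregates rather than a scan like this one.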
16. OLTP vs. Data Warehouse
OLTP                                      Warehouse (OLAP)
Application Oriented                      Subject Oriented
Used to run business                      Used to analyze business
Clerical User                             Manager/Analyst
Detailed data                             Summarized and refined
Current up to date                        Snapshot data
Isolated Data                             Integrated Data
Repetitive access by small transactions   Ad-hoc access using large queries
Read/Update access                        Mostly read access (batch update)
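The two access patterns in the comparison above can be sketched side by side. The example below is an assumption-laden illustration: the `orders` table, its columns, and its rows are hypothetical, and both workloads run against one in-memory SQLite database purely for brevity, whereas the lecture's point is that the warehouse is kept separate from the operational database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, "
    "amount INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "alice", 120, "open"),
        (2, "bob", 80, "open"),
        (3, "alice", 200, "shipped"),
        (4, "carol", 50, "open"),
    ],
)

# OLTP style: a small, repetitive transaction that reads/updates a single record.
with conn:  # commits on success, rolls back on error
    conn.execute("UPDATE orders SET status = 'shipped' WHERE order_id = ?", (2,))
oltp_row = conn.execute(
    "SELECT status FROM orders WHERE order_id = ?", (2,)
).fetchone()

# Warehouse style: an ad-hoc, read-only query that scans and summarizes many records.
analysis = conn.execute(
    "SELECT customer, COUNT(*), SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY SUM(amount) DESC"
).fetchall()
```

The first statement touches one row by key; the second reads every row and returns refined, summarized data, which is the kind of load a decision support database is optimized for.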
17. Some Common Terms
Cont…
Metadata — a definition
Metadata is the kind of information that describes the data stored in a
database and includes such information as:
• A description of tables and fields in the data warehouse, including data
types and the range of acceptable values.
• A similar description of tables and fields in the source databases, with a
mapping of fields from the source to the warehouse.
• A description of how the data has been transformed, including formulae,
formatting, currency conversion, and time aggregation.
• Any other information that is needed to support and manage the operation
of the data warehouse.
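One way to picture such metadata is as a record per warehouse field that carries the description, the source mapping, and the transformation together. The sketch below is only illustrative: `FieldMetadata`, `load_value`, the `sales_fact` table, the valid range, and the divide-by-100 formula are all assumptions made for the example, not part of the lecture.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass(frozen=True)
class FieldMetadata:
    # Warehouse side: where the value lands, its type, and its acceptable range.
    warehouse_table: str
    warehouse_field: str
    data_type: type
    valid_range: Tuple[float, float]
    # Source side: the field this value is mapped from.
    source_table: str
    source_field: str
    # Transformation: a description for people plus the formula itself.
    transformation: str
    transform: Callable[[float], float]

# Hypothetical entry: the source stores hundredths of a currency unit.
amount_meta = FieldMetadata(
    warehouse_table="sales_fact",
    warehouse_field="amount",
    data_type=float,
    valid_range=(0.0, 1_000_000.0),
    source_table="Sales",
    source_field="AMOUNT",
    transformation="source stores hundredths of a currency unit; divide by 100",
    transform=lambda raw: raw / 100,
)

def load_value(meta: FieldMetadata, raw: float) -> float:
    """Apply the recorded transformation, then enforce the recorded type and range."""
    value = meta.data_type(meta.transform(raw))
    low, high = meta.valid_range
    if not low <= value <= high:
        raise ValueError(
            f"{meta.warehouse_table}.{meta.warehouse_field}: {value} outside [{low}, {high}]"
        )
    return value
```

For example, `load_value(amount_meta, 5000)` yields `50.0`, while a negative raw value is rejected because it falls outside the recorded range.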
18. Some Common Terms
Cont…
Data mart: A data mart contains a subset of corporate data that is of
value to a specific business unit, department, or set of users. This subset
consists of historical, summarized, and possibly detailed data captured
from transaction processing systems, or from an enterprise data
warehouse. It is important to realize that a data mart is defined by the
functional scope of its users, and not by the size of the data mart
database. Most data marts today involve less than 100 GB of data; some
are larger, and data marts are expected to grow rapidly in size as their
usage increases.
Data mining: Data mining is the process of extracting valid, useful,
previously unknown, and comprehensible information from data and using
it to make business decisions.
19. Problem in General Purpose SQL
Consider the following database schemas:
1. Product (P_ID, P_NAME, P_DESC);
2. Sales (R_NO, P_ID, Q_ID, AMOUNT);
3. Time (Q_ID, Q_DESC);
Suppose the organization needs to generate the following report:
Product   4Q96 Sales   4Q97 Sales
XYZ       57           66
ABC       29           24
PQR       115          89
20. Problem in SQL Cont…
The SQL needed to display the fourth-quarter 1996 sales may be as
follows:
SELECT Product.P_NAME, SUM(Sales.AMOUNT)
FROM Sales, Product, Time
WHERE . . . Time.Q_ID = '4Q96'
AND Product.P_NAME IN ('XYZ', 'ABC', 'PQR')
GROUP BY Product.P_NAME
If one expands the Time constraint to include both quarters, as follows:
WHERE . . . Time.Q_ID IN ('4Q96', '4Q97')
then the SUM expression adds up the sales from both quarters into a single
figure per product, which we do not want. A plain SELECT with GROUP BY offers
no direct way to return each quarter in its own column.
Hence a general-purpose SQL engine handles queries like the one above poorly.
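The behaviour described above can be reproduced with Python's built-in `sqlite3` module. This is a sketch under assumptions: the join conditions (elided on the slide) are assumed to be the obvious key joins, the rows are chosen to match the report figures, and the final query uses conditional aggregation, a common workaround that is not part of the slide and needs one hand-written expression per quarter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product (P_ID INTEGER PRIMARY KEY, P_NAME TEXT, P_DESC TEXT);
CREATE TABLE Time    (Q_ID TEXT PRIMARY KEY, Q_DESC TEXT);
CREATE TABLE Sales   (R_NO INTEGER PRIMARY KEY, P_ID INTEGER, Q_ID TEXT, AMOUNT INTEGER);
INSERT INTO Product VALUES (1, 'XYZ', ''), (2, 'ABC', ''), (3, 'PQR', '');
INSERT INTO Time VALUES ('4Q96', 'Fourth quarter 1996'), ('4Q97', 'Fourth quarter 1997');
INSERT INTO Sales (P_ID, Q_ID, AMOUNT) VALUES
    (1, '4Q96', 57),  (1, '4Q97', 66),
    (2, '4Q96', 29),  (2, '4Q97', 24),
    (3, '4Q96', 115), (3, '4Q97', 89);
""")

# Assumed join conditions; the slide elides them as ". . .".
JOINS = ("FROM Sales s JOIN Product p ON p.P_ID = s.P_ID "
         "JOIN Time t ON t.Q_ID = s.Q_ID")
TAIL = ("AND p.P_NAME IN ('XYZ', 'ABC', 'PQR') "
        "GROUP BY p.P_NAME ORDER BY p.P_NAME")

# One quarter: gives the first report column only.
one_quarter = conn.execute(
    f"SELECT p.P_NAME, SUM(s.AMOUNT) {JOINS} WHERE t.Q_ID = '4Q96' {TAIL}"
).fetchall()

# Both quarters via IN: the two quarters collapse into one total per product.
both_quarters = conn.execute(
    f"SELECT p.P_NAME, SUM(s.AMOUNT) {JOINS} "
    f"WHERE t.Q_ID IN ('4Q96', '4Q97') {TAIL}"
).fetchall()

# Workaround (not on the slide): one conditional SUM per quarter.
pivoted = conn.execute(
    "SELECT p.P_NAME, "
    "SUM(CASE WHEN t.Q_ID = '4Q96' THEN s.AMOUNT ELSE 0 END), "
    "SUM(CASE WHEN t.Q_ID = '4Q97' THEN s.AMOUNT ELSE 0 END) "
    f"{JOINS} WHERE t.Q_ID IN ('4Q96', '4Q97') {TAIL}"
).fetchall()
```

Here `both_quarters` holds a single combined total per product, while `pivoted` matches the desired report. The workaround has to be rewritten whenever the set of quarters changes, which is the kind of awkwardness that motivates dedicated analytical tools.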