Data Warehousing

                               DATA WAREHOUSE
       A data warehouse is the main repository of an organization's historical data, its corporate memory. For
example, an organization would use the information stored in its data warehouse to find out what day of the
week it sold the most widgets in May 1992, or how employee sick leave the week before the winter break
differed between California and New York from 2001 to 2005. In other words, the data warehouse contains the raw
material for management's decision support system. The critical factor leading to the use of a data warehouse is that
a data analyst can perform complex queries and analysis on the information without slowing down the operational
systems.

    While operational systems are optimized for simplicity and speed of modification (online transaction processing,
or OLTP) through heavy use of database normalization and an entity-relationship model, the data warehouse is
optimized for reporting and analysis (online analytical processing, or OLAP). Frequently, data in data warehouses is
heavily denormalised, summarised and/or stored in a dimension-based model, but this is not always required to
achieve acceptable query response times.
More formally, Bill Inmon (one of the earliest and most influential practitioners) defined a data warehouse as
follows:

Subject-oriented, meaning that the data in the database is organized so that all the data elements relating to the
same real-world event or object are linked together;

Time-variant, meaning that the changes to the data in the database are tracked and recorded so that reports can be
produced showing changes over time;

Non-volatile, meaning that data in the database is never over-written or deleted; once committed, the data is static
and read-only, retained for future reporting;

Integrated, meaning that the database contains data from most or all of an organization's operational applications,
and that this data is made consistent.

History of data warehousing
Data Warehouses became a distinct type of computer database during the late 1980s and early 1990s. They were
developed to meet a growing demand for management information and analysis that could not be met by operational
systems. Operational systems were unable to meet this need for a range of reasons:
         •   The processing load of reporting reduced the response time of the operational systems,
         •   The database designs of operational systems were not optimized for information analysis and
             reporting,
         •   Most organizations had more than one operational system, so company-wide reporting could not be
             supported from a single system, and
         •    Development of reports in operational systems often required writing specific computer programs,
              which was slow and expensive.
As a result, separate computer databases began to be built that were specifically designed to support management
information and analysis purposes. These data warehouses were able to bring in data from a range of different data
sources, such as mainframe computers and minicomputers, as well as personal computers and office automation
software such as spreadsheets, and integrate this information in a single place. This capability, coupled with
user-friendly reporting tools and freedom from operational impacts, has led to a growth of this type of computer system.
As technology improved (lower cost for more performance) and user requirements increased (faster data load cycle
times and more features), data warehouses have evolved through several fundamental stages:

Offline Operational Databases - Data warehouses in this initial stage are developed by simply copying the
database of an operational system to an off-line server where the processing load of reporting does not impact the
operational system's performance.

Offline Data Warehouse - Data warehouses in this stage of evolution are updated on a regular time cycle (usually
daily, weekly or monthly) from the operational systems, and the data is stored in an integrated, reporting-oriented
data structure.

Real Time Data Warehouse - Data warehouses at this stage are updated on a transaction or event basis, every time
an operational system performs a transaction (e.g. an order or a delivery or a booking etc.)

Integrated Data Warehouse - Data warehouses at this stage are used to generate activity or transactions that are
passed back into the operational systems for use in the daily activity of the organization.



DATA WAREHOUSE ARCHITECTURE

         The term data warehouse architecture is primarily used today to describe the overall structure of a Business
Intelligence system. Historical terms for such systems include decision support system (DSS) and management
information system (MIS).

The data warehouse architecture describes the overall system from various perspectives such as data, process, and
infrastructure needed to communicate the structure, function and interrelationships of each component. The
infrastructure or technology perspective details the various hardware and software products used to implement the
distinct components of the overall system. The data perspective typically diagrams the source and target data
structures and aids the user in understanding what data assets are available and how they are related. The process
perspective is primarily concerned with communicating the process and flow of data from the originating source
system through the process of loading the data warehouse, and often the process that client products use to access
and extract data from the warehouse.




                               DATA STORAGE METHODS
          In OLTP (online transaction processing) systems, relational database designs use the discipline of data
modeling and generally follow Codd's rules of data normalization in order to ensure absolute data integrity.
Complex information is broken down into its simplest structures (tables) in which all of the individual atomic-level
elements relate to each other and satisfy the normalization rules. Codd defined five increasingly stringent levels of
normalization, and typical OLTP systems achieve third normal form. Fully normalized OLTP database
designs often result in information from a single business transaction being stored in dozens to hundreds of tables.
Relational database managers are efficient at managing the relationships between tables, and such designs deliver
very fast insert/update performance because each transaction affects only a small amount of data.

OLTP databases are efficient because they typically deal only with the information around a single
transaction. In reporting and analysis, thousands to billions of transactions may need to be reassembled, imposing a
huge workload on the relational database. Given enough time the software can usually return the requested results,
but because of the negative performance impact on the machine and all of its hosted applications, data warehousing
professionals recommend that reporting databases be physically separated from the OLTP database.

In addition, data warehousing suggests that data be restructured and reformatted to facilitate query and analysis by
novice users. OLTP databases are designed to provide good performance for rigidly defined applications built by
programmers fluent in the constraints and conventions of the technology. Add in frequent enhancements, and to
many users a database becomes a collection of cryptic names and seemingly unrelated, obscure structures that store
data using incomprehensible coding schemes; all of these factors, while improving performance, complicate use by
untrained people. Lastly, the data warehouse needs to support high volumes of data gathered over extended periods
of time, is subject to complex queries, and needs to accommodate formats and definitions inherited from
independently designed packaged and legacy systems.

Designing the data warehouse's data architecture is the realm of the data warehouse architect. The goal of a
data warehouse is to bring data together from a variety of existing databases to support management and reporting
needs. The generally accepted principle is that data should be stored at its most elemental level, because this provides
the most useful and flexible basis for use in reporting and information analysis. However, because of differing
focuses on specific requirements, there are alternative methods for designing and implementing data warehouses.
There are two leading approaches to organizing the data in a data warehouse: the dimensional approach advocated
by Ralph Kimball and the normalized approach advocated by Bill Inmon. Whilst the dimensional approach is very
useful in data mart design, it can result in a rat's nest of long-term data integration and abstraction complications
when used in a data warehouse.




In the "dimensional" approach, transaction data is partitioned into either a measured "facts" which are generally
numeric data that captures specific values or "dimensions" which contain the reference information that gives each
transaction its context. As an example, a sales transaction would be broken up into facts such as the number of
products ordered, and the price paid, and dimensions such as date, customer, product, geographical location and
salesperson. The main advantages of a dimensional approach is that the data warehouse is easy for business staff
with limited information technology experience to understand and use. Also, because the data is pre-joined into the
dimensional form, the data warehouse tends to operate very quickly. The main disadvantage of the dimensional
approach is that it is quite difficult to add or change later if the company changes the way in which it does business.
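
To make the fact/dimension split concrete, here is a minimal sketch of a star schema in Python with SQLite. All
table and column names are invented for illustration; they are not taken from any particular product or methodology
text.

    import sqlite3

    conn = sqlite3.connect(":memory:")  # throwaway database for the sketch
    cur = conn.cursor()

    # Dimensions hold the reference data that gives each transaction its context.
    cur.execute("CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, "
                "calendar_date TEXT, day_of_week TEXT)")
    cur.execute("CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, "
                "product_name TEXT)")
    cur.execute("CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, "
                "customer_name TEXT, region TEXT)")

    # The fact table holds the numeric measures, keyed by the dimensions.
    cur.execute("""CREATE TABLE fact_sales (
        date_key     INTEGER REFERENCES dim_date(date_key),
        product_key  INTEGER REFERENCES dim_product(product_key),
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        quantity_ordered INTEGER,
        price_paid       REAL)""")

    # A typical analytical query: revenue by day of week for one region.
    cur.execute("""SELECT d.day_of_week, SUM(f.price_paid)
                   FROM fact_sales f
                   JOIN dim_date d     ON f.date_key = d.date_key
                   JOIN dim_customer c ON f.customer_key = c.customer_key
                   WHERE c.region = 'California'
                   GROUP BY d.day_of_week""")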
The "normalized" approach uses database normalization. In this method, the data in the data warehouse is stored in

obieefans.com                                             3
Data Warehousing
obieefans.com
third normal form. Tables are then grouped together by subject areas that reflect the general definition of the data
(customer, product, finance, etc.). The main advantage of this approach is that it is quite straightforward to add new
information into the database -- the primary disadvantage of this approach is that because of the number of tables
involved, it can be rather slow to produce information and reports. Furthermore, since the segregation of facts and
dimensions is not explicit in this type of data model, it is difficult for users to join the required data elements into
meaningful information without a precise understanding of the data structure.
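
To see why the normalized approach can be slower and harder for users, compare the two-join query in the
star-schema sketch above with the join chain the same question might need in a third-normal-form model. This is
only an illustrative sketch; the table names are invented.

    # Revenue by region in a hypothetical 3NF model: the measures and their
    # context are spread across many tables, so the query must walk the chain.
    query = """
    SELECT r.region_name, SUM(oi.unit_price * oi.quantity)
    FROM order_items oi
    JOIN orders o    ON oi.order_id = o.order_id
    JOIN customers c ON o.customer_id = c.customer_id
    JOIN addresses a ON c.address_id = a.address_id
    JOIN regions r   ON a.region_id = r.region_id
    GROUP BY r.region_name
    """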

Subject areas are just a method of organizing information and can be defined along any lines. The traditional
approach has subjects defined as the subjects or nouns within a problem space. For example, in a financial services
business, you might have customers, products and contracts. An alternative approach is to organize around the
business transactions, such as customer enrollment, sales and trades.
Advantages of using data warehouse
There are many advantages to using a data warehouse, some of them are:
           •    Enhances end-user access to a wide variety of data.
           •  Business decision makers can obtain various kinds of trend reports e.g. the item with the most sales
              in a particular area / country for the last two years.
A data warehouse can be a significant enabler of commercial business applications, most notably customer
relationship management (CRM).
Concerns in using data warehouses
           •    Extracting, cleaning and loading data is time consuming.
           •    Data warehousing project scope must be actively managed to deliver a release of defined content and
                value.
           •    Compatibility problems with systems already in place.
           •    Security could develop into a serious issue, especially if the data warehouse is web accessible.
           •    Data Storage design controversy warrants careful consideration and perhaps prototyping of the data
                warehouse solution for each project's environments.



                        HISTORY OF DATA WAREHOUSING

Data warehousing emerged for many different reasons as a result of advances in the field of information systems.
A vital discovery that propelled the development of data warehousing was the fundamental differences between
operational (transaction processing) systems and informational (decision support) systems. Operational systems run
in real time, whereas informational systems support decisions from a historical, point-in-time perspective. Below is a
comparison of the two.
                  Characteristic      Operational Systems (OLTP)               Informational Systems (OLAP)
                  Primary purpose     Run the business on a current basis      Support managerial decision making
                  Type of data        Real time, based on current data         Snapshots and predictions
                  Primary users       Clerks, salespersons, administrators     Managers, analysts, customers
                  Scope               Narrow, planned and simple updates       Broad, complex queries and analysis
                                      and queries
                  Design goal         Performance, throughput, availability    Ease of flexible access and use
                  Database concept    Complex                                  Simple
                  Normalization       High                                     Low
                  Time focus          Point in time                            Period of time
                  Volume              Many constant updates and queries        Periodic batch updates and queries
                                      on one or a few table rows               requiring many or all rows

Other aspects that also contributed to the need for data warehousing are:
• Improvements in database technology
  o The emergence of relational data models and relational database management systems (RDBMS)
• Advances in computer hardware
  o The abundant use of affordable storage and other architectures
• The importance of end-users in information systems
  o The development of interfaces allowing easier use of systems for end users
• Advances in middleware products
  o Enabled enterprise database connectivity across heterogeneous platforms



Data warehousing has evolved rapidly since its inception. Here is a timeline of data warehousing:
1970s – Operational systems (such as data processing systems) were not able to handle large and frequent requests
for data analyses. Data was stored in mainframe files and static databases, and requests were processed from
recorded tapes for specific queries and data gathering. This proved to be time consuming and inconvenient.

1980s – Real-time computer applications became decentralized. Relational models and database management
systems started emerging and becoming the wave. Retrieving data from operational databases was still a problem
because of “islands of data.”

1990s – Data warehousing emerged as a feasible solution to optimize and manipulate data, both internally and
externally, to allow businesses to make accurate decisions.
What is data warehousing?

After information technology took the world by storm, many revolutionary concepts were created to make it more
effective and helpful. During the nineties, as new technology was being born and becoming obsolete in no time,
there was a need for a concrete, foolproof idea that could make database administration more secure and reliable.
The concept of data warehousing was thus invented to help the business decision-making process. The working of
data warehousing and its applications has been a boon to information technology professionals all over the world. It
is very important for managers to understand the architecture of how it works and how it can be used as a tool to
improve performance. The concept has revolutionized business planning techniques.

Concept
Information processing and managing a database are two important components for any business to run smoothly.
Data warehousing is a concept in which the information systems are computerized. Since many applications run
simultaneously, there is a possibility that each individual process creates exclusive “secondary data” originating
from the source. Data warehouses are useful in tracking all this information down, analyzing it, and using the
analysis to improve performance. They offer a wide variety of options and are highly compatible with virtually all
working environments. They help the managers of companies to gauge the progress the company makes over a
period of time and to explore new ways to improve its growth. There are many open questions in business, and
these data warehouses are read-only, integrated databases that help to answer them. They are useful for structuring
operations and analyzing subject matter over a given time period.

The structure
As is the case with all computer applications, there are various steps that are involved in planning a data warehouse.
The need is analyzed, and most of the time the end user is taken into consideration; their input forms an invaluable
asset in building a customized database. The business requirements are analyzed and the “need” is discovered; that
then becomes the focus area. If, for example, a company wants to analyze all its records and use the research to
improve performance, a data warehouse allows the manager to focus on this area.

After the need is zeroed in on, a conceptual data model is designed. This model is then used as the basic structure
that companies follow to build a physical database design. A number of iterations, technical decisions and
prototypes are formulated, and then the systems development life cycle of design, development, implementation and
support begins.


Collection of data
The project team analyzes the various kinds of data that need to go into the database and also where they can find
all the information that can be used to build it. There are two different kinds of data: data that can be found
internally in the company and data that comes from outside sources. Another team of professionals works on the
creation of extraction programs that are used to collect all the information that is needed from a number of
databases, files or legacy systems. They identify these sources and then copy them onto a staging area outside the
database. They clean all the data, a step described as cleansing, and make sure that it does not contain any errors.
They then copy all the data into the data warehouse. This concept of data extraction from the source, and the
selection and transformation processes, have been unique benchmarks of this concept. This is very important for the
project to become successful. A lot of meticulous planning is involved in arriving at a step-by-step configuration of
all the data from the source to the data warehouse.
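
The extract, stage, cleanse, and load steps described above can be sketched as follows. This is a hedged illustration
only: the file name, table, and cleansing rules are all invented, and a real ETL tool would add scheduling, logging
and restartability.

    import csv
    import sqlite3

    warehouse = sqlite3.connect("warehouse.db")  # hypothetical target database
    warehouse.execute("CREATE TABLE IF NOT EXISTS customers "
                      "(customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT)")

    # Extract: copy rows from a source extract file into a staging area.
    staging = []
    with open("crm_extract.csv", newline="") as source:  # hypothetical source file
        staging.extend(csv.DictReader(source))

    # Cleanse: reject rows with missing keys and normalise obvious inconsistencies.
    clean = []
    for row in staging:
        if not row.get("customer_id"):
            continue  # erroneous rows are filtered out before loading
        region = (row.get("region") or "UNKNOWN").strip().upper()
        clean.append((int(row["customer_id"]), row.get("name"), region))

    # Load: copy the cleansed data into the warehouse.
    warehouse.executemany(
        "INSERT OR REPLACE INTO customers VALUES (?, ?, ?)", clean)
    warehouse.commit()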

Use of metadata


The whole process of extracting data and collecting it to make it an effective component of the operation requires
“metadata”. The transformation from an operational system to an analytical system is achieved only with maps of
metadata. The transformational metadata includes changes in names, changes in data, and the physical
characteristics that exist. It also includes the description of the data, its origin and its updates. Algorithms are used in
summarizing the data. Metadata provides a graphical user interface that helps non-technical end users, offering
richness in navigating and accessing the database. There is another form of metadata called operational metadata.
This forms the fundamental structure for accessing procedures and monitoring the growth of the data warehouse in
relation to the available storage space. It also records who is responsible for, and who may access, the data in the
warehouse and in operational systems.
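
A transformation map of the kind described can be as simple as one structured record per target column. The field
names below are invented for this sketch:

    transformation_map = {
        "target_table": "customers",
        "target_column": "region",
        "source_system": "CRM",
        "source_column": "cust_region_cd",
        "transformation": "strip whitespace, uppercase, decode region codes",
        "last_loaded": "2006-01-15T02:00:00",  # operational metadata
    }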


Data marts - specific data
In every database system there is a need for updating; some do it by the day and some by the minute. If a specific
department needs to monitor its own data in sync with the overall business process, it stores that data as a data mart.
Data marts are not as big as a data warehouse and are useful for storing the data and information of a specific
business module. The latest trend in data warehousing is to develop smaller data marts and then manage each of
them individually, later integrating them into the overall business structure.

Security and reliability
As with any information system, the trustworthiness of data is determined by the trustworthiness of the hardware,
software, and procedures that created it. The reliability and authenticity of the data and information extracted from
the warehouse will be a function of the reliability and authenticity of the warehouse and the various source systems
that it encompasses.

In data warehouse environments specifically, there needs to be a means to ensure the integrity of data, first by
having procedures to control the movement of data to the warehouse from operational systems, and second by
having controls to protect warehouse data from unauthorized changes. Data warehouse trustworthiness and security
are therefore contingent upon acquisition, transformation and access metadata and systems documentation.

Han and Kamber (2001) define a data warehouse as “A repository of information collected from multiple sources,
stored under a unified scheme, and which usually resides at a single site.”
In educational terms, all past information available in electronic format about a school or district such as budget,
payroll, student achievement and demographics is stored in one location where it can be accessed using a single set
of inquiry tools.

Below are some of the drivers that have initiated data warehousing.



• CRM (customer relationship management): There is a threat of losing customers due to poor quality and
sometimes for unknown reasons that nobody ever explored. Retaining existing customers has become the most
important feature of present-day business, and to facilitate good customer relationship management, companies are
investing a lot of money to find out the exact needs of the consumer. As a result of direct competition, the concept
of customer relationship management came to the forefront, and data warehousing techniques have helped this
cause enormously.

• Diminishing profit margins: Global competition has forced many companies that enjoyed generous profit margins
on their products to reduce their prices to remain competitive. Since the cost of goods sold remains constant,
companies need to manage their operations better to improve their operating margins. Data warehouses enable
management decision support for managing business operations.




• Deregulation: Ever-growing competition and diminishing profit margins have made companies explore new
possibilities to play the game better. A company develops in one direction and establishes a particular core
competency in the market; once it has its own speciality, it looks for new avenues into new markets with a
completely new set of possibilities. For a company venturing into a new core competency, deregulation is very
important, and data warehouses are used to provide the necessary information. Data warehousing is useful in
generating a cross-reference database that helps companies get into cross-selling; this is the single most effective
way that this can happen.

• The complete life cycle: The industry is very volatile; we come across a wide range of new products every day,
and they become obsolete in no time. The waiting time for the complete lifecycle often results in a heavy loss of
resources for the company. There was a need for a concept which would help in tracking all the volatile changes
and updating them by the minute. This allows companies to be extra safe in regard to all their products. The system
is useful in tracking all the changes and helps the business decision process a great deal. In that respect these are
also described as business intelligence systems.

• Merging of businesses: As described above, as a direct result of growing competition, companies join forces to
carve a niche in a particular market. This helps the companies work towards a common goal with twice the number
of resources. In such an event, there is a huge amount of data that has to be integrated, and this data might be on
different platforms and different operating systems. To have centralized authority over the data, it is important that
a business tool be created which is not only effective but also reliable. Data warehousing fits the need.

Relevance of Data Warehousing for organizations
Enterprises today, both nationally and globally, are in perpetual search of competitive advantage. An incontrovertible
axiom of business management is that information is the key to gaining this advantage. Within the explosion of data
are the clues management needs to define its market strategy, and data warehousing technology is a means of
discovering and unearthing these clues, enabling organizations to competitively position themselves within market
sectors. It is an increasingly popular and powerful concept of applying information technology to solving business
problems. Companies use data warehouses to store information for marketing, sales and manufacturing to help
managers get a feel for the data and run the business more effectively. Managers use sales data to improve
forecasting and planning for brands, product lines and business areas. Retail purchasing managers use warehouses to
track fast-moving lines and ensure an adequate supply of high-demand products. Financial analysts use warehouses
to manage currency and exchange exposures, oversee cash flow and monitor capital expenditures.

Data warehousing has become very popular among organizations seeking competitive advantage by getting strategic
information fast and easily (Adhikari, 1996). The reasons organizations have a data warehouse can be grouped into
four sections:




• Warehousing data outside the operational systems:
The primary concept of data warehousing is that the data stored for business analysis can most effectively be
accessed by separating it from the data in the operational systems. Many of the reasons for this separation have
evolved over the years. In earlier years, legacy systems archived data onto tapes as it became inactive, and many
analysis reports ran from these tapes or separate data sources in order to minimize the performance impact on the
operational systems.

• Integrating data from more than one operational system:
Data warehouses are more successful when data can be combined from more than one operational system. When
data needs to be brought together from more than one application, it is natural that the integration be done at a place
independent of the source applications. Before the evolution of structured data warehouses, analysts in many
instances would combine data extracted from more than one operational system into a single spreadsheet or a
database. The data warehouse may very effectively combine data from multiple source applications such as sales,
marketing, finance, and production.

• Data is mostly non-volatile:
Another key attribute of the data in a data warehouse system is that the data is brought to the warehouse after it has
become mostly non-volatile. This means that after the data is in the data warehouse, there are no modifications to be
made to this information.

• Data saved for longer periods than in transaction systems:
Data from most operational systems is archived after the data becomes inactive. For example, an order may become
inactive after a set period from the fulfillment of the order; or a bank account may become inactive after it has been
closed for a period of time. The primary reason for archiving the inactive data has been the performance of the
operational system. Large amounts of inactive data mixed with operational live data can significantly degrade the
performance of a transaction that is only processing the active data. Since the data warehouses are designed to be the
archives for the operational data, the data here is saved for a very long period.


Advantages of data warehouse:
There are several advantages to data warehousing. When companies have a problem that requires necessary changes
in how they transact business, they need both the information and the transaction processing to make a decision.

• Time reduction
"The warehouse has enabled employee to shift their time from collecting information to analyzing it and that helps
the company make better business decisions" A data warehouse turns raw information into a useful analytical tool
for business decision-making. Most companies want to get the information or transaction processing quickly in
order to make a decision-making. If companies are still using traditional online transaction processing systems, it
will take longer time to get the information that needed. As a result, the decision-making will be made longer, and
the companies will lose time and money. Data warehouse also makes the transaction processing easier.
• Efficiency
In order to minimize inconsistent reports and provide the capability for data sharing, companies should provide the
database technology that is required to write and maintain queries and reports. A data warehouse provides, in one
central repository, all the metrics necessary to support decision-making throughout the queries and reports. Queries
and reports make management processing efficient.

• Complete Documentation
A typical data warehouse objective is to store all of the information, including history. This objective comes with its
own challenges. Historical data is seldom kept on the operational systems; and, even if it is kept, rarely are three or
five years of history found in one file. This is one reason why companies need a data warehouse to store historical
data.

• Data Integration
Another primary goal for all data warehouses is to integrate data, because integration is a primary deficiency in
current decision support. Another reason to integrate data is that the data content in one file may be at a different
level of granularity than that in another file, or the same data in one file may be updated at a different time period
than that in another file.

Limitations:
Although data warehousing brings a lot of advantages to a corporation, there are some disadvantages that apply to
data warehouses.

• High Cost
Data warehouse systems are very expensive. According to Phil Blackwood, the average cost of a data warehouse
system is valued at $1.8 million. This puts data warehouse systems beyond the reach of small companies; as a result,
only big companies can afford them, which means that not all companies have a proper system for storing data and
transaction system databases. Furthermore, because small companies do not have data warehouses, they find it
difficult to store and organize their data and information, something they will need to do as the company grows.

• Complexity
Moreover, a data warehouse is a very complex system. The primary function of a data warehouse is to integrate all
the data and the transaction system databases. Because integrating the systems is complicated, data warehousing can
complicate business processes significantly; for example, a small change in a transaction processing system may
have major impacts on all the systems that depend on it. Adding, deleting, or changing data and transactions can be
time consuming, because the administrator needs to control and check the correctness of each change in order to
gauge its impact on other transactions. The complexity of a data warehouse can therefore deter companies from
making changes to data or transactions that are otherwise necessary.



Opportunities and Challenges for Data Warehousing
Data warehousing is facing tremendous opportunities and challenges which, to a great extent, decide the most
immediate developments and future trends. Behind all these probable happenings is the impact that the Internet has
upon ways of doing business and, consequently, upon data warehousing, a more and more important tool for
today's and tomorrow's organizations and enterprises. The opportunities and challenges for data warehousing are
mainly reflected in four aspects.

• Data Quality

Data warehousing has unearthed many previously hidden data-quality problems. Most companies that have
attempted data warehousing have discovered problems as they integrate information from different business units.
Data that was apparently adequate for operational systems has often proved to be inadequate for data warehouses
(Faden, 2000). On the other hand, the emergence of e-commerce has also opened up an entirely new source of
data-quality problems. As we all know, data may now be entered at a Web site directly by a customer, a business
partner, or, in some cases, by anyone who visits the site. Such users are more likely to make mistakes and, in most
cases, less likely to care if they do. All of this is “elevating data cleansing from an obscure, specialized technology
to a core requirement for data warehousing, customer-relationship management, and Web-based commerce”.

• Business Intelligence

The second challenge comes from the necessity of integrating data warehousing with business intelligence to
maximize profits and competency. We have been witnessing an ever-increasing demand to deploy data warehousing
structures and business intelligence. The primary purpose of the data warehouse is experiencing a shift from a focus
on data transformation into information to—most recently—transformation into intelligence.

Along this new development, people will expect more and more analytical functions of the data warehouse. The
customer profile will be extended with psychographic, behavioral and competitive-ownership information as
companies attempt to go beyond understanding a customer's preferences. In the end, data warehouses will be used
to automate actions based on business intelligence. One example is determining with which supplier an order should
be placed in order to achieve delivery as promised to the customer.

• E-business and the Internet

Besides the data quality problem we mentioned above, a more profound impact of this new trend on data
warehousing is in the nature of data warehousing itself.


On the surface, the rapidly expanding e-business has posed a threat to data warehouse practitioners. They may be
concerned that the Internet has surpassed data warehousing in terms of strategic importance to their company, or that
Internet development skills are more highly valued than those for data warehousing. They may feel that the Internet
and e-business have captured the hearts and minds of business executives, relegating data warehousing to ‘second
class citizen’ status. However, the opposite is true.



• Other trends

   While data warehousing is facing so many challenges and opportunities, it also brings opportunities to other
  fields. Some trends that have just started are as follows:
• More and more small-tier and middle-tier corporations are looking to build their own decision support systems.
• The reengineering of decision support systems more often than not ends up with an architecture that helps fuel
  the growth of those decision support systems.
• Advanced decision support architectures proliferate in response to companies' increasing demands to integrate
  their customer relationship management and e-business initiatives with their decision support systems.
• More organizations are starting to use data warehousing metadata standards, which allow the various decision
  support tools to share their data with one another.

Architectural Overview
 In concept the architecture required is relatively simple as can be seen from the diagram below:
 [Figure 1 - Simple Architecture: Source System(s) feed, via ETL, a central Transaction Repository; further ETL
 processes populate the Data Marts, which are accessed by the Reporting Tools.]
 However, this is a very simple design concept and does not reflect what it takes to implement a data warehousing
 solution. In the next section we look not only at these core components but also at the additional elements required
 to make it all work.


                  Components of the Enterprise Data Warehouse


obieefans.com                                            12
Data Warehousing
obieefans.com
The simple architecture diagram shown at the start of the document shows four core components of an enterprise
data warehouse. Real implementations, however, often have many more, depending on the circumstances. In this
section we look first at the core components and then at what other additional components might be needed.
The core components
The core components are those shown on the diagram in Figure 1 – Simple Architecture. They are the ones that are
most easily identified and described.

Source Systems
The first component of a data warehouse is the source systems, without which there would be no data. These provide
the input into the solution and will require detailed analysis early in any project. Important considerations in looking
at these systems include:

• Is this the master of the data you are looking for?
• Who owns/manages/maintains this system?
• Where is the source system in its lifecycle?
• What is the quality of the data in the system?
• What are the batch/backup/upgrade cycles on the system?
• Can we get access to it?

Source systems can broadly be categorised into five types:

On-line Transaction Processing (OLTP) Systems
These are the main operational systems of the business and will normally include financial systems, manufacturing
systems, and customer relationship management (CRM) systems. These systems will provide the core of any data
warehouse but, whilst a large part of the effort will be expended on loading these systems, it is the integration of the
other sources that provides the value.

Legacy Systems
Organisations will often have systems that are at the end of their life, or archives of de-commissioned systems. One
of the business case justifications for building a data warehouse may have been to remove these systems after the
critical data has been moved into the data warehouse. This sort of data often adds to the historical richness of a
solution.
Missing or Source-less Data
During analysis it is often the case that data is identified as required but no viable source exists for it, e.g. exchange
rates used on a given date or corporate calendar events; or the source is unusable for loading, such as a document; or
the answer is simply in someone's head. There is also data required for basic operation, such as descriptions of
codes.

This is therefore an important category, which is frequently forgotten during the initial design stages and then
requires a last-minute fix into the system, often achieved by direct manual changes to the data warehouse. The
downside of this approach is that it loses the tracking, control and auditability of the information added to the
warehouse. Our advice is therefore to create a system or systems that we call the Warehouse Support Application
(WSA). This is normally a number of simple data-entry type forms that can capture the data required. This is then
treated as another OLTP source and managed in the same way. Organisations are often concerned about how much
of this they will have to build. In reality it is a reflection of the level of good data capture in the existing business
processes and current systems. If these are good then there will be little or no WSA component to build, but if they
are poor then significant development will be required, and this should also raise a red flag about the readiness of
the organisation to undertake this type of build.

Transactional Repository (TR)
          The Transactional Repository is the store of the lowest level of data and thus defines the scope and size of
the database. The scope is defined by what tables are available in the data model and the size is defined by the
amount of data put into the model. Data that is loaded here will be clean, consistent, and time variant. The design of
the data model in this area is critical to the long-term success of the data warehouse: it determines the scope and the
cost of changes, and a poor model makes mistakes expensive and inevitably causes delays.

    As can be seen from the architecture diagram, the transaction repository sits at the heart of the system; it is the
point where all data is integrated and the point where history is held. If the model, once in production, is missing key
business information and cannot easily be extended when the requirements or the sources change, this will mean
significant rework. Avoiding this cost is a factor in the choice of design for this data model.

    In order to design the Transaction Repository there are three data modelling approaches that can be identified.
Each lends itself to different organisation types and each has its own advantages and disadvantages, although a
detailed discussion of these is outside the scope of this document.
       The three approaches are:

Enterprise Data Modelling (Bill Inmon)
    This is a data model that starts by using conventional relational modelling techniques and often will describe the
business in a conventional normalised database. There may then be a series of de-normalisations for performance
and to assist extraction into the
data marts.
       This approach is typically used by organisations that have a corporate-wide data model and strong central
control by a group such as a strategy team. These organisations will tend also to have more internally developed
systems rather than third party products.
Data Bus (Ralph Kimball)
The data model for this type of solution is normally made up of a series of star schemas that have evolved over time,
with dimensions becoming "conformed" as they are re-used. The transaction repository is made up of these base star
schemas and their associated dimensions. The data marts in the architecture will often just be views, either directly
onto these schemas or onto aggregates of them. This approach is particularly suitable for companies that have
evolved from a number of independent data marts and are growing into a more mature data warehouse environment.

Process Neutral Model

             A Process Neutral Data Model is a data model in which all embedded business rules have been removed.
If this is done correctly then, as business processes change, there should be little or no change required to the data
model. Business Intelligence solutions designed around such a model should therefore not be subject to limitations
as the business changes.

This is achieved both by making many relationships optional and of multiple cardinality, and by carefully making
sure the model is generic rather than reflecting only the views and needs of one or more specific business areas.
Although this sounds simple (and it is once you get used to it), in reality it takes a little while to fully understand and
to be able to achieve. This type of data model has been used by a number of very large organisations where it
combines some of the best features of both the data bus approach and enterprise data modelling. As with enterprise
data modelling it sets out to describe the entire business, but rather than normalise data it uses an approach that
embeds the metadata (or data about data) in the data model and often contains natural star schemas. This approach
is generally used by large corporations that have one or more of the following attributes: many legacy systems, a
number of systems as a result of business acquisitions, no central data model, or a rapidly changing corporate
environment.

Data Marts

           The data marts are areas of a database where the data is organised for user queries, reporting and analysis.
Just as with the design of the transaction repository, there are a number of design types for a data mart. The choice
depends on factors such as the design of the transaction repository and which tools are to be used to query the data
marts.
The most commonly used models are star schemas and snowflake schemas, where direct database access is made,
whilst data cubes are favoured by some tool vendors. It is also possible to have single-table solution sets if this meets
the business requirement. There is no need for all data marts to have the same design type; as they are user-facing, it
is important that they are fit for purpose for the user rather than what suits a purist architecture.
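
For instance, a data mart over the star schema sketched earlier could be nothing more than an aggregate view. This
assumes the sqlite3 connection and tables from that earlier sketch; the view name is invented.

    # Continuing the star-schema sketch: a daily sales mart as a view.
    cur.execute("""CREATE VIEW mart_daily_sales AS
        SELECT d.calendar_date, p.product_name, SUM(f.price_paid) AS revenue
        FROM fact_sales f
        JOIN dim_date d    ON f.date_key = d.date_key
        JOIN dim_product p ON f.product_key = p.product_key
        GROUP BY d.calendar_date, p.product_name""")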



Extract-Transform-Load (ETL) Tools
ETL tools are the backbone of the data warehouse, moving data from source to transaction repository and on to data
marts. They must deal with issues of performance of load for large volumes and with complex transformation of
data, in a repeatable, scheduled environment. These tools build the interfaces between components in the
architecture and will also often work with data cleansing elements to ensure that the most accurate data is available.
The need for a standard approach to ETL design within a project is paramount; developers will often create an
intricate and complicated solution where a simple one exists, often requiring little compromise. Any compromise in
the deliverable is usually accepted by the business once they understand that these simple approaches will save them
a great deal of cash in terms of the time taken to design, develop, test and ultimately support.

Analysis and Reporting Tools
Collecting all of the data into a single place and making it available is useless without the ability for users to access
the information. This is done with a set of analysis and reporting tools; any given data warehouse is likely to have
more than one tool. The types of tool fall broadly into four categories:

• Simple reporting tools that produce fixed or simple parameterised reports.
• Complex ad hoc query tools that allow users to build and specify their own queries.
• Statistical and data mining packages that allow users to delve into the information contained within the data.
• "What-if" tools that allow users to extract data and then modify it to role-play or simulate scenarios.

Additional Components

In addition to the core components a real data warehouse may require any or all of these components to deliver the
solution. The requirement to use a component should be considered by each programme on its own merits.

Literal Staging Area (LSA)
Occasionally the implementation of the data warehouse encounters environmental problems, particularly with
legacy systems (e.g. a mainframe system which is not easily accessible by applications and tools). In this case it
might be necessary to implement a Literal Staging Area, which creates a literal copy of the source system's content
but in a more convenient environment (e.g. moving mainframe data into an ODBC-accessible relational database).
This literal staging area then acts as a surrogate for the source system for use by the downstream ETL interfaces.

There are some important benefits associated with implementing an LSA:
• It will make the system more accessible to downstream ETL products.
• It creates a quick win for projects that have been trying to get data off, for example, a mainframe in a more
  laborious fashion.
• It is a good place to perform data quality profiling.
• It can be used as a point close to the source to perform data quality cleaning.
Transaction Repository Staging Area (TRS)
ETL loading will often need an area for intermediate data sets or working tables, somewhere which, for clarity and
ease of management, should not be in the same area as the main model. This area is used when bringing data from a
source system or its surrogate into the transaction repository.

Data Mart Staging Area (DMS)
As with the transaction repository staging area, there is a need for space between the transaction repository and the
data marts for intermediate data sets. This area provides that space.

Operational Data Store (ODS)
An operational data store is an area that is used to get data from a source and, if required, lightly aggregate it to
make it quickly available. This is required for certain types of reporting which need to be available in "real-time"
(updated within 15 minutes) or "near-time" (for example 15 to 60 minutes old). The ODS will not normally clean,
integrate, or fully aggregate data (as the data warehouse does), but it will provide rapid answers; the data will then
become available via the data warehouse once the cleaning, integration and aggregation have taken place in the next
batch cycle.

            Tools & Technology
The component diagrams above show all the areas and the elements needed. This translates into a significant list of
tools and technology that are required to build and operationally run a data warehouse solution. These include:
• Operating system
• Database
• Backup and Recovery
• Extract, Transform, Load (ETL)
• Data Quality Profiling
• Data Quality Cleansing
• Scheduling
• Analysis & Reporting
• Data Modelling
• Metadata Repository
• Source Code Control
• Issue Tracking
• Web based solution integration

The tools selected should operate together to cover all of these areas. The technology choices will also be influenced
by whether the organisation needs to operate a homogeneous (all systems of the same type) or heterogeneous
(systems may be of differing types) environment, and also by whether the solution is to be centralised or distributed.



 Operating System
       The server-side operating system is usually an easy decision, normally following the recommendation in the
organisation's information systems strategy. The operating system choice for enterprise data warehouses tends to be
a Unix/Linux variant, although some organisations do use Microsoft operating systems. It is not the purpose of this
paper to make any recommendation here, and the choice should be the result of the organisation's normal
procurement procedures.

 Database
The database falls into a very similar category to the operating system, in that for most organisations it is a given
from a select few including Oracle, Sybase, IBM DB2 or Microsoft SQL Server.

          Backup and Recovery
This may seem like an obvious requirement but it is often overlooked or slipped in at the end. From day one of
development there will be a need to back up and recover the databases from time to time. The backup poses a
number of issues:
• Ideally backups should be done whilst allowing the database to stay up.
• It is not uncommon for elements to be backed up during the day, as this is the point of least load on the system
  and it is often read-only at that point.
• It must handle large volumes of data.
• It must cope with both databases and source data in flat files.
The recovery has to deal with the related consequence of the above: recovery of large databases quickly to a point
in time.

Extract - Transform - Load (ETL)

The purpose of the extract, transform and load (ETL) software, to create interfaces, has been described above and is
at the core of the data warehouse. The market for such tools is constantly moving, with a trend for database vendors
to include this sort of technology in their core product. Some of the considerations for selection of an ETL tool
include:



• Ability to access source systems
• Ability to write to target systems
• Cost of development (it is noticeable that some of the easiest tools to deploy and operate are not easy to develop
  with)
• Cost of deployment (it is also noticeable that some of the easiest tools to develop with are not easy to deploy or
  operate)
• Integration with scheduling tools

Typically only one ETL tool is needed; however, it is common for specialist tools to be used from a source system
to a literal staging area as a way of overcoming a limitation in the main ETL tool.

Data Quality Profiling
Data profiling tools look at the data and identify issues with it. They do this by some of the following techniques:
• Looking at individual values in a column to check that they are valid
• Validating data types within a column
• Looking for rules about uniqueness or frequencies of certain values
• Validating primary and foreign key constraints
• Validating that data within a row is consistent
• Validating that data is consistent within a table
• Validating that data is consistent across tables

This is important both for analysts when examining the system and for developers when building the system. It will
also identify data quality cleansing rules that can be applied to the data before loading. It is worth noting that good
analysts will often do this without tools, especially if good analysis templates are available.
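
A few of these profiling checks can be sketched in Python against the hypothetical warehouse used in the earlier
examples. Real profiling tools are far more thorough; the table and column names here are invented.

    import sqlite3

    conn = sqlite3.connect("warehouse.db")  # hypothetical database

    def profile_column(table: str, column: str) -> dict:
        cur = conn.cursor()
        rows = cur.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        # Validity: how many values are missing?
        nulls = cur.execute(
            f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL").fetchone()[0]
        # Uniqueness: could this column be a key?
        distinct = cur.execute(
            f"SELECT COUNT(DISTINCT {column}) FROM {table}").fetchone()[0]
        # Frequency: what is the most common value?
        top = cur.execute(
            f"SELECT {column}, COUNT(*) FROM {table} "
            f"GROUP BY {column} ORDER BY 2 DESC LIMIT 1").fetchone()
        return {"rows": rows, "nulls": nulls, "distinct": distinct,
                "most_common": top}

    print(profile_column("customers", "region"))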


Data Quality Cleansing
This tool updates data to improve the overall data quality, often based on the output of the data quality profiling tool.
There are essentially two types of cleansing tool:

•   Rule-based cleansing: this performs updates on the data based on rules (e.g. make everything uppercase; replace
    two spaces with a single space, etc.). These rules can be very simple or quite complex depending on the tool used
    and the business requirement.

•   Heuristic cleansing: this performs cleansing by being given only an approximate method of solving the problem
    within the context of some goal, and then uses feedback from the effects of the solution to improve its own
    performance. This is commonly used for address-matching type problems.

An important consideration when implementing a cleansing tool is that the process should be performed as close as
possible to the source system. If it is performed further downstream, the same poor-quality data will be repeatedly
presented for cleansing.
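
A minimal rule-based cleansing sketch in Python, using the example rules above; a real tool would draw its rule set
from the data quality profiling output.

    import re

    # Each rule is a function from a raw value to a cleansed value,
    # applied in order.
    RULES = [
        str.upper,                         # make everything uppercase
        lambda v: re.sub(r"  +", " ", v),  # collapse repeated spaces to a single space
        str.strip,                         # drop leading/trailing whitespace
    ]

    def cleanse(value):
        for rule in RULES:
            value = rule(value)
        return value

    assert cleanse("  acme   products ") == "ACME PRODUCTS"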

Scheduling
      With backup, ETL and batch reporting runs, the data warehouse environment has a large number of jobs to be
scheduled (typically in the hundreds per day) with many dependencies, for example:

"The backup can only start at the end of the business day and provided that the source system has generated a flat
file; if the file does not exist then it must poll for thirty minutes to see if it arrives, otherwise notify an operator. The
data mart load cannot start until the transaction repository load is complete, but can then run six different data mart
loads in parallel."

This should be done via a scheduling tool that integrates into the environment.
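
The polling and dependency logic in the example can be sketched as follows in Python; run_job and notify_operator
are stand-ins for whatever interface the chosen scheduling tool actually provides.

    import os
    import time

    def run_job(name, parallel=False):
        print("running", name)    # stand-in for a real scheduler submission

    def notify_operator(message):
        print("ALERT:", message)  # stand-in for a real alerting hook

    def wait_for_file(path, timeout_minutes=30, poll_seconds=60):
        # Poll for the source system's flat file, giving up after the timeout.
        deadline = time.time() + timeout_minutes * 60
        while time.time() < deadline:
            if os.path.exists(path):
                return True
            time.sleep(poll_seconds)
        return False

    if wait_for_file("/landing/source_extract.dat"):
        run_job("backup")
        run_job("transaction_repository_load")
        # The six data mart loads depend on the repository load
        # but not on each other, so they can run in parallel.
        for mart in range(1, 7):
            run_job("data_mart_load_%d" % mart, parallel=True)
    else:
        notify_operator("source extract did not arrive within thirty minutes")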

Analysis & Reporting
The analysis and reporting tools are the users' main interface into the system. As has already been discussed, there
are four main types:
•   Simple reporting tools
•   Complex ad hoc query tools
•   Statistical and data mining packages
•   What-if tools
Whilst the market for such tools changes constantly, the recognised source of information is The OLAP Report2.

Data Modelling
       With all the data models that have been discussed, it is obvious that a tool in which to build data models is
required. This will allow designers to graphically manage data models and generate the code to create the database
objects. The tool should be capable of both logical and physical data modelling.

Metadata Repository
Metadata is data about data. In the case of the data warehouse this will include information about the sources,
targets, loading procedures, when those procedures were run, and information about what certain terms mean and
how they relate to the data in the database. The metadata required is defined in a subsequent section on
documentation; however, the information itself will need to be held somewhere. Most tools have some elements of a
metadata repository, but there is a need to identify what constitutes the entire repository by identifying which parts
are held in which tools.
2 The OLAP Report by Nigel Pendse and Richard Creeth is an independent research resource for organizations
buying and implementing OLAP applications.

Source Code Control
      Up to this point you will have noticed that we have steadfastly remained vendor independent, and we remain so
here. However, the issue of source control is one of the biggest impacts on a data warehouse. If the tools that you use
do not have version control, or if your tools do not integrate to allow version control across them, and your
organisation does not have a source code control tool, then download and use CVS: it is free, multi-platform and, in
our experience, can be made to work with most of the tools in the other categories. There are also Microsoft
Windows clients and web based tools available for CVS.

Issue Tracking
    In a similar vein to source code control, most projects do not deal with issue tracking well, the worst nightmare
being a spreadsheet that is mailed around once a week to collect updates. We again recommend that if a suitable tool
is not already available then you consider an open source tool such as Bugzilla.

 Web Based Solution Integration
     Running a programme such as the one described will bring much information together. It is important to bring
everything together in an accessible fashion. Fortunately web technologies provide an easy way to do this.

An ideal environment would allow communities to see some or all of the following via a secure web based interface:
•   Static reports
•   Parameterised reports
•   Web based reporting tools
•   Balanced Scorecards
•   Analysis
•   Documentation
•   Requirements Library
•   Business Terms Definitions
•   Schedules
•   Metadata Reports
•   Data Quality profiles
•   Data Quality rules
•   Data Quality Reports
•   Issue tracking
•   Source code
There are two similar but different technologies available to do this, depending on the corporate approach or
philosophy:
•   Portals: these provide personalised websites and make use of distributed applications to provide a collaborative
    workspace.
•   Wiki3: this provides a website that allows users to easily add and edit content and link to other web applications.

Both can be very effective in developing a common understanding of what the data warehouse does and how it
operates, which in turn leads to a more engaged user community and greater return on investment.
3 A wiki is a type of website that allows users to easily add and edit content and is especially suited for collaborative
writing. In essence, wiki is a simplification of the process of creating HTML web pages combined with a system that
records each individual change that occurs over time, so that at any time, a page can be reverted to any of its
previous states. A wiki system may also provide various tools that allow the user community to easily monitor the
constantly changing state of the wiki and discuss the issues that emerge in trying to achieve a general consensus
about wiki content.

Documentation Requirements

          Given the size and complexity of the Enterprise Data Warehouse, a core set of documentation is required,
which is described in the following section. If a structured project approach is adopted, these documents would be
produced as a natural byproduct; however, we would recommend the following set of documents as a minimum. To
facilitate this, at Data Management & Warehousing we have developed our own set of templates for this purpose.


Requirements Gathering
This is a document managed using a word-processor.
Timescales: At start of project 40 days effort plus on-going updates.
There are four sections to our requirement templates:


•   Facts: these are the key figures that a business requires. Often these will be associated with Key Performance
    Indicators (KPIs) and the information required to calculate them, i.e. the metrics required for running the
    company. An example of a fact might be the number of products sold in a store.
•   Dimensions: this is the information used to constrain or qualify the facts. An example of this might be the list of
    products, the date of a transaction or some attribute of the customer who purchased the product.
•   Queries: these are the typical questions that a user might want to ask, for example "How many cans of soft drink
    were sold to male customers on the 2nd February?" This uses information from the requirements sections on
    available facts and dimensions.
•   Non-functional: these are the requirements that do not directly relate to the data, such as when the system must
    be available to users, how often it needs to be refreshed, what quality metrics should be recorded about the data,
    who should be able to access it, etc.
Note that whilst an initial requirements document will come early in the project, it will undergo a number of versions
as the user community matures in both its use and understanding of the system and the data available to it.

Key Design Decisions
This is a document managed using a word-processor.
Timescales: 0.5 days effort as and when required.
This is a simple one or two page template used to record the design decisions that are made during the project. It
contains the issue, the proposed outcome, any counterarguments and why they were rejected, and the impact on the
various teams within the project. It is important because, given the long term nature of such projects, there is often a
revisionist element that queries why such decisions were made and spends time revisiting them.

Data Model
This is held in the data modelling tool's internal format.
Timescales: At start of project 20 days effort plus ongoing updates.
Both logical and physical data models will be required. The logical data model is an abstract representation of a set
of data entities and their relationships, usually including their key attributes. The logical data model is intended to
facilitate analysis of the function of the data design, and is not intended to be a full representation of the physical
database. It is typically produced early in system design, and it is frequently a precursor to the physical data model
that documents the actual implementation of the database.

In parallel with the gathering of requirements the data models for the transaction repository and the initial data marts
will be developed. These will be constantly maintained throughout the life of the solution.

Analysis
These are documents managed using a word-processor. The analysis phase of the project is broken down into three
main templates, each serving as a step in the progression of understanding required to build the system. During the
system analysis part of the project, the following three areas must be covered and documented:

Source System Analysis (SSA)
Timescales: 2-3 days effort per source system.
This is a simple high-level overview of each source system to understand its value as a potential source of business
information, and to clarify its ownership and longevity. This is normally done for all systems that are potential
sources. As the name implies, this looks at the "system" level and identifies "candidate" systems.

These documents are only updated at the start of each phase when candidate systems are being identified.

 Source Entity Analysis (SEA)
Timescales: 7-10 days effort per system.
This is a detailed look at the "candidate" systems, examining the data, the data quality issues, frequency of update,
access rights, etc. The output is a list of tables and fields that are required to populate the data warehouse. These
documents are updated at the start of each phase when candidate systems are being examined, and as part of the
impact analysis of any upgrade to a system that was used in a previous phase.

Target Oriented Analysis (TOA)
Timescales: 15-20 days effort for the Transaction Repository, 3-5 days effort for each data mart.
This is a document that describes the mappings and transformations that are required to populate a target object. It is
important that this is target focused, as a common failing is to look at the source and ask the question "Where do I
put all these bits of information?" rather than the correct question, which is "I need to populate this object; where do
I get the information from?"
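
One way to make a TOA concrete is to record the mapping as a specification keyed by the target object, as in this
Python sketch; all table and column names are hypothetical.

    # Target Oriented Analysis expressed as data: for each target column,
    # where does the value come from and what transformation applies?
    TOA_SPEC = {
        "target": "tr_sales_fact",
        "columns": {
            "sale_date":   {"source": "orders.order_dt",  "transform": "cast to DATE"},
            "product_key": {"source": "orders.prod_cd",   "transform": "lookup in product dimension"},
            "net_value":   {"source": "orders.gross_amt", "transform": "gross_amt - orders.tax_amt"},
        },
    }

    for col, spec in TOA_SPEC["columns"].items():
        print(f"{TOA_SPEC['target']}.{col} <- {spec['source']} ({spec['transform']})")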


Operations Guide
This is a document managed using a word-processor.
Timescales: 20 days effort towards the end of the development phase.
This document describes how to operate the system; it will include the schedule for running all the ETL jobs,
including dependencies on other jobs and external factors such as the backups or a source system. It will also
include instructions on how to recover from failure and what the escalation procedures for technical problem
resolution are. Other sections will include information on current sizing, predicted growth and key data inflection
points (e.g. year end, where there are a particularly large number of journal entries). It will also include the backup
and recovery plan, identifying what should be backed up and how to perform system recoveries from backup.

Security Model
This is a document managed using a word-processor.
Timescales: 10 days effort after the data model is complete, 5 days effort towards the development phase.
This document should identify who can access what data, when and where. This can be a complex issue, but the
above architecture can simplify it, as most access control needs to be concentrated around the data marts: nearly
everything else will only be visible to the ETL tools extracting and loading data.
Issue log
This is held in the issue logging system’s internal format.
Timescales: Daily as required.
As has already been identified the project will require an issue log that tracks issues during the development and
operation of the system.

Metadata
There are two key categories of metadata as discussed below:
Business Metadata
This is a document managed using a word-processor or a Portal or Wiki if available.
Business Definitions Catalogue4
Timescales: 20 days effort after the requirements are complete and ongoing maintenance.

This is a catalogue of business terms and their definitions. It is all about adding context to data, making meaning
explicit and providing definitions for business terms, data elements, acronyms and abbreviations. It will often include
information about who owns the definition and who maintains it and, where appropriate, what formula is required to
calculate it. Other useful elements will include synonyms, related terms and preferred terms. Typical examples can
include definitions of business terms such as "Net Sales Value" or "Average revenue per customer" as well as
definitions of hierarchies and common terms such as customer.

Technical Metadata
This is the information created by the system as it is running. It will either be held in server log files or in databases.

Server & Database availability
This includes all information about which servers and databases were available when, and serves two purposes:
firstly, monitoring and management of service level agreements (SLAs); and secondly, performance optimisation, to
fit the ETL into the available batch window and to ensure that users have good reporting performance.



ETL Information
This is all the information generated by the ETL process and will include items such as:
•   When was a mapping created or changed?
•   When was it last run?
•   How long did it run for?
•   Did it succeed or fail?
•   How many records were inserted, updated or deleted?
This information is again used to monitor the effective running and operation of the system, not only in failure but
also by identifying trends, such as mappings or transformations whose performance characteristics are changing.

Query Information
This gathers information about which queries the users are making. The information will include:
•   What are the queries that are being run?
•   Which tables do they access?
•   Which fields are being used?
•   How long do queries take to execute?
This information is used to optimise the user experience but also to remove redundant information that is no longer
being queried by users.
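
Most ETL tools record this information automatically, but the principle can be shown with a small sketch in Python;
the log table and job name are hypothetical.

    import sqlite3
    import time

    def run_with_metadata(job_name, job, db="metadata.db"):
        # Wrap an ETL job so that start time, duration, outcome and row count
        # are recorded as technical metadata.
        con = sqlite3.connect(db)
        con.execute("CREATE TABLE IF NOT EXISTS etl_log "
                    "(job TEXT, started REAL, seconds REAL, status TEXT, rows INTEGER)")
        started = time.time()
        try:
            rows = job()              # the job returns the number of rows processed
            status = "SUCCEEDED"
        except Exception:
            rows, status = 0, "FAILED"
        con.execute("INSERT INTO etl_log VALUES (?, ?, ?, ?, ?)",
                    (job_name, started, time.time() - started, status, rows))
        con.commit()
        con.close()

    run_with_metadata("customer_load", lambda: 1000)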

Some additional high-level guidelines
The following items are just some of the common issues that arise in delivering data warehouse solutions. Whilst not
exhaustive, they are some of the most important factors to consider:
Programme or project?
For data warehouse solutions to be successful (and financially viable), it is important for organisations to view the
development as a long term programme of work and examine how the work can be broken up into smaller
component projects for delivery. This enables many smaller quick wins at different stages of the programme whilst
retaining focus on the overall objective.
Examples of this approach may include the development of tactical independent data marts, a literal staging area to
facilitate reporting from a legacy system, or prioritisation of the development of particular reports which can
significantly help a particular business function. Most successful data warehouse programmes will have an
operational life in excess of ten years, with peaks and troughs in development.

The technology trap
At the outset of any data warehouse project organisations frequently fall into the trap of wanting to design the
largest, most complex and functionally all-inclusive solution. This will often tempt the technical teams to use the
latest, greatest technology promised by a vendor.
However, building a data warehouse is not about creating the biggest database or using the cleverest technology; it is
about putting lots of different, often well established, components together so that they can function successfully to
meet the organisation's data management requirements. It also requires sufficient design such that, when the next
enhancement or extension of the requirement comes along, there is a known and well understood business process
and technology path to meet that requirement.


 Vendor Selection
    This document presents a vendor-neutral view. However, it is important (and perhaps obvious) to note that the
products which an organisation chooses to buy will dramatically affect the design and development of the system. In
particular, most vendors are looking to spread their coverage in the market space. This means that two selected
products may have overlapping functionality, and therefore which product to use for a given piece of functionality
must be identified. It is also important to differentiate between strategic and tactical tools.
      The other major consideration is that this technology market space changes rapidly. The process whereby
vendors constantly add features similar to those of competing products means that few vendors will have a
significant long term advantage on features alone. Most features that you will require (rather than those that are
sometimes desired) will become available in market leading products during the lifetime of the programme, if they
are not there already.



   The rule of thumb when assessing products is therefore to follow the basic Gartner5-type "magic quadrant" of
"ability to execute" and "completeness of vision", and combine that with your organisation's view of the long term
relationship it has with the vendor and the fact that a series of rolling upgrades to the technology will be required
over the life of the programme.

 Development partners
     This is one of the thorniest issues for large organisations as they often have policies that outsource development
work to third parties and do not want to create internal teams.
      In practice the issue can be broken down, with programme management and business requirements being
sourced internally. Technical design authority is either an external domain expert who transitions the role to an
internal person, or an internal person if suitable skills already exist.
     It is then possible for individual development projects to be outsourced to development partners. In general the
market place has more contractors with this type of experience than permanent staff with specialist
domain/technology knowledge, and so some contractor base, either internally or at the development partner, is
almost inevitable. Ultimately it comes down to the individuals and how they come together as a team, regardless of
the supplier, and the best teams will be a blend of the best people.

obieefans.com                                            24
Data Warehousing
obieefans.com


The development and implementation sequence
    Data Warehousing on this scale requires a top down approach to requirements and a bottom up approach to the
build. In order to deliver a solution it is important to understand what is required of the reports, where that is sourced
from in the transaction repository and how in turn the transaction repository is populated from the source system.
Conversely the build must start at the bottom and build up through the transaction repository and on to the data
marts.
       Each build phase will look to either build up (i.e. add another level) or build out (i.e. add another source). This
approach means that the project manager can firstly be assured that the final destination will meet the users'
requirements, and that the build can be optimised by using different teams to build up in some areas whilst other
teams are building out the underlying levels. Using this model it is also possible to change direction after each
completed phase.

 Homogeneous & Heterogeneous Environments
        This architecture can be deployed using homogeneous or heterogeneous technologies. In a homogeneous
environment all the operating systems, databases and other components are built using the same technology, whilst a
heterogeneous solution would allow multiple technologies to be used, although it is usually advisable to limit this to
one technology per component.
For example using Oracle on UNIX everywhere would be a homogeneous environment, whilst using Sybase for the
transaction repository and all staging areas on a UNIX environment and Microsoft SQLServer on Microsoft
Windows for the data marts would be an example of a heterogeneous environment.
The trade-off between the two deployments is the cost of integration and managing additional skills in a
heterogeneous environment, compared with the suitability of a single product to fulfil all roles in a homogeneous
environment. There is obviously a spectrum of solutions between the two end points, such as the same operating
system but different databases.
Centralised vs. Distributed solutions
This architecture also supports deployment in either a centralised or a distributed mode. In a centralised solution all
the systems are held at a central data centre; this has the advantage of easy management but may result in a
performance impact where users remote from the central solution suffer problems over the network. Conversely, a
distributed solution provides local solutions, which may have a better performance profile for local users but might
be more difficult to administer and will suffer from capacity issues when loading the data. Once again there is a
spectrum of solutions and therefore there are degrees to which this can be applied. It is normal that centralised
solutions are associated with homogeneous environments whilst distributed environments are usually heterogeneous;
however, this need not always be the case.

Converting Data from Application Centric to User Centric
Systems such as ERP systems are effectively systems designed to pump data through a particular business process
(application-centric). A data warehouse is designed to look across systems (user-centric) to allow users to view the
data they need to perform their job.
As an example: raising a purchase order in the ERP system is optimised to get the purchase order from being raised,
through approval, to being sent out; whilst the data warehouse user may want to look at who is raising orders, the
average value, who approves them and how long they take to do the approval. Requirements should therefore reflect
the view of the data warehouse user and not what a single application can provide.


Analysis and Reporting Tool Usage
When buying licences etc. for the analysis and reporting tools, a common mistake is to buy many thousands of seats
for a given reporting tool; once the system is delivered, the number of users never rises to the original estimates. The
diagram below illustrates why this occurs:
Figure 5 - Analysis and Reporting Tool Usage
(The figure is a pyramid plotting flexibility in data access and complexity of tool against size of user community.
The tool tiers, from top to bottom, are data mining, ad hoc reporting tools, parameterised reporting and fixed
reporting, delivered through web based and desktop tools; the corresponding user tiers are senior analysts, business
analysts, business users, and customers, suppliers and researchers.)
What the diagram shows is that there is an inverse relationship between the degree of reporting flexibility required
by a user and the number of users requiring that access.
There will be very few people, typically business analysts and planners, at the top, but these individuals will need
tools that really allow them to manipulate and mine the data. At the next level down there will be a somewhat larger
group of users who require ad hoc reporting access; these people will normally be developing or improving reports
that get presented to management. The remaining, but largest, community of the user base will only have a
requirement to be presented with data in the form of pre-defined reports with varying degrees of inbuilt flexibility:
for instance, managers, sales staff or even suppliers and customers coming into the solution over the internet. This
broad community will also influence the choice of tool, which must reflect the skills of the users. Therefore no
individual tool will be perfect and it is a case of fitting the users and a selection of tools together to give the best
results.

Glossary

Data Warehouse: "A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of
data in support of management's decision making process." (Bill Inmon)
Design Pattern: A design pattern provides a generic approach, rather than a specific solution, for building a
particular system or systems.
Dimension Table: Dimension tables contain attributes that describe fact records in the fact table.
Distributed Solution: A system architecture where the system components are distributed over a number of sites to
provide local solutions.
DMS: Data Mart Staging, a component in the data warehouse architecture for staging data.
ERP: Enterprise Resource Planning, a business management system that integrates all facets of the business,
including planning, manufacturing, sales and marketing.
ETL: Extract, Transform and Load; the activities required to populate data warehouses and OLAP applications with
clean, consistent, integrated and properly summarised data. Also a component in the data warehouse architecture.
Fact Table: In an organisation, the "facts" are the key figures that a business requires. Within that organisation's
data mart, the fact table is the foundation from which everything else arises.
Heterogeneous System: An environment in which all or any of the operating systems, databases and other
components are built using different technologies and are then integrated by means of customised interfaces.
Heuristic Cleansing: Cleansing by means of an approximate method for solving a problem within the context of a
goal; heuristic cleansing then uses feedback from the effects of its solution to improve its own performance.
Homogeneous System: An environment in which the operating systems, databases and other components are built
using the same technology.
KDD: Key Design Decision, a project template.
KPI: Key Performance Indicator. KPIs help an organisation define and measure progress toward organisational
goals.
LSA: Literal Staging Area, a component in the data warehouse architecture in which data from a legacy system is
stored in a database in order to make it more readily accessible to downstream systems.
Middleware: Software that connects, or serves as the "glue" between, two otherwise separate applications.
Near-time: Refers to data being updated by means of batch processing at intervals of between 15 minutes and 1
hour (in contrast to "real-time" data, which needs to be updated within 15-minute intervals).
Normalisation: Database normalisation is a process of eliminating duplicated data in a relational database. The key
idea is to store data in one location, and provide links to it wherever needed.
ODS: Operational Data Store, a component in the data warehouse architecture that allows near-time reporting.
OLAP: On-Line Analytical Processing, a category of applications and technologies for collecting, managing,
processing and presenting multidimensional data for analysis and management purposes.
OLTP: On-Line Transaction Processing, a form of transaction processing conducted via a computer network.
Portal: A web site or service that offers a broad array of resources and services, such as e-mail, forums and search
engines.
Process Neutral Model: A data model in which all embedded business rules have been removed. If this is done
correctly then, as business processes change, there should be little or no change required to the data model; Business
Intelligence solutions designed around such a model should therefore not be subject to limitations as the business
changes.
Rule Based Cleansing: A data cleansing method which performs updates on the data based on rules.
SEA: Source Entity Analysis, an analysis template.
Snowflake Schema: A variant of the star schema with normalised dimension tables.
SSA: Source System Analysis, an analysis template.
Star Schema: A relational database schema for representing multidimensional data. The data is stored in a central
fact table, with one or more tables holding information on each dimension. Dimensions have levels, and all levels
are usually shown as columns in each dimension table.
TOA: Target Oriented Analysis, an analysis template.
TR: Transaction Repository, the collated, clean repository for the lowest level of data held by the organisation and a
component in the data warehouse architecture.
TRS: Transaction Repository Staging, a component in the data warehouse architecture used to stage data.
Wiki: A type of website, or the software needed to operate it, that allows users to easily add and edit content, and
that is particularly suited to collaborative content creation.
WSA: Warehouse Support Application, a component in the data warehouse architecture that supports missing data.

Designing the Star Schema Database


          Creating a star schema database is one of the most important, and sometimes the final, steps in creating a
data warehouse. Given how important this process is to our data warehouse, it is important to understand how we
move from a standard, on-line transaction processing (OLTP) system to a final star schema (which here we will call
an OLAP system).
This paper attempts to address some of the issues that have no doubt kept you awake at night. As you stared at the
ceiling, wondering how to build a data warehouse, questions began swirling in your mind:
•   What is a Data Warehouse? What is a Data Mart?
•   What is a Star Schema Database?
•   Why do I want/need a Star Schema Database?
•   The Star Schema looks very denormalized. Won't I get in trouble for that?
•   What do all these terms mean?
•   Should I repaint the ceiling?
These are certainly burning questions. This paper will attempt to answer these questions, and show you how to build
a star schema database to support decision support within your organization.

    Usually, you are bored with terminology at the end of a chapter, or buried in an appendix at the back of the book.
Here, however, I have the thrill of presenting some terms up front. The intent is not to bore you earlier than usual,
but to present a baseline off of which we can operate. The problem in data warehousing is that the terms are often
used loosely by different parties. The Data Warehousing Institute (http://www.dw-institute.com) has attempted to
standardize some terms and concepts. I will present my best understanding of the terms I will use throughout this
lecture. Please note, however, that I do not speak for the Data Warehousing Institute.

OLTP
OLTP stands for Online Transaction Processing. This is a standard, normalized database structure. OLTP is designed
for transactions, which means that inserts, updates, and deletes must be fast. Imagine a call center that takes orders.
Call takers are continually taking calls and entering orders that may contain numerous items. Each order and each
item must be inserted into a database. Since the performance of the database is critical, we want to maximize the
speed of inserts (and updates and deletes). To maximize performance, we typically try to hold as few records in the
database as possible.

OLAP and Star Schema
            OLAP stands for Online Analytical Processing. OLAP is a term that means many things to many people.
Here, we will use the term OLAP and Star Schema pretty much interchangeably. We will assume that a star schema
database is an OLAP system. This is not the same thing that Microsoft calls OLAP; they extend OLAP to mean the
cube structures built using their product, OLAP Services. Here, we will assume that any system of read-only,
historical, aggregated data is an OLAP system.
In addition, we will assume an OLAP/Star Schema can be the same thing as a data warehouse. It can be, although
often data warehouses have cube structures built on top of them to speed queries.


Data Warehouse and Data Mart
  Before you begin grumbling that I have taken two very different things and lumped them together, let me explain
that Data Warehouses and Data Marts are conceptually different – in scope. However, they are built using the exact
same methods and procedures, so I will define them together here, and then discuss the differences.
A data warehouse (or mart) is a way of storing data for later retrieval. This retrieval is almost always used to support
decision-making in the organization. That is why many data warehouses are considered to be DSS (Decision-
Support Systems). You will hear some people argue that not all data warehouses are DSS, and that’s fine. Some data
warehouses are merely archive copies of data. Still, the full benefit of taking the time to create a star schema, and
then possibly cube structures, is to speed the retrieval of data. In other words, it supports queries. These queries are
often across time. And why would anyone look at data across time? Perhaps they are looking for trends. And if they
are looking for trends, you can bet they are making decisions, such as how much raw material to order. Guess what:
that’s decision support!
Enough of the soap box. Both a data warehouse and a data mart are storage mechanisms for read-only, historical,
aggregated data. By read-only, we mean that the person looking at the data won’t be changing it. If a user wants to
look at the sales yesterday for a certain product, they should not have the ability to change that number. Of course, if
we know that number is wrong, we need to correct it, but more on that later.
The “historical” part may just be a few minutes old, but usually it is at least a day old. A data warehouse usually
holds data that goes back a certain period in time, such as five years. In contrast, standard OLTP systems usually
only hold data as long as it is “current” or active. An order table, for example, may move orders to an archive table
once they have been completed, shipped, and received by the customer.
When we say that data warehouses and data marts hold aggregated data, we need to stress that there are many levels
of aggregation in a typical data warehouse. In this section, on the star schema, we will just assume the “base” level
of aggregation: all the data in our data warehouse is aggregated to a certain point in time.
Let’s look at an example: we sell 2 products, dog food and cat food. Each day, we record sales of each product. At
the end of a couple of days, we might have data that looks like this:

                                 Quantity Sold
Date        Order Number         Dog Food        Cat Food
4/24/99     1                    5               2
            2                    3               0
            3                    2               6
            4                    2               2
            5                    3               3

4/25/99     1                    3               7
            2                    2               1
            3                    4               0
Table 1
Now, as you can see, there are several transactions. This is the data we would find in a standard OLTP system.
However, our data warehouse would usually not record this level of detail. Instead, we summarize, or aggregate, the
data to daily totals. Our records in the data warehouse might look something like this:

            Quantity Sold
Date        Dog Food      Cat Food
4/24/99     15            13
4/25/99     9             8
Table 2


You can see that we have reduced the number of records by aggregating the individual transaction records into daily
records that show the number of each product purchased each day.
We can certainly get from the OLTP system to what we see in the OLAP system just by running a query. However,
there are many reasons not to do this, as we will see later.
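
The step from Table 1 to Table 2 is a simple group-by to the daily grain, sketched here in Python:

    from collections import defaultdict

    # Transaction-level records from the OLTP system:
    # (date, dog food quantity, cat food quantity), as in Table 1.
    transactions = [
        ("4/24/99", 5, 2), ("4/24/99", 3, 0), ("4/24/99", 2, 6),
        ("4/24/99", 2, 2), ("4/24/99", 3, 3),
        ("4/25/99", 3, 7), ("4/25/99", 2, 1), ("4/25/99", 4, 0),
    ]

    # Aggregate to the daily grain, as the data warehouse would store it.
    daily = defaultdict(lambda: [0, 0])
    for date, dog, cat in transactions:
        daily[date][0] += dog
        daily[date][1] += cat

    for date, (dog, cat) in sorted(daily.items()):
        print(date, dog, cat)   # 4/24/99 15 13 and 4/25/99 9 8, matching Table 2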
Aggregations
There is no magic to the term “aggregations.” It simply means a summarized, additive value. The level of
aggregation in our star schema is open for debate. We will talk about this later. Just realize that almost every star
schema is aggregated to some base level, called the grain.
OLTP Systems
OLTP, or Online Transaction Processing, systems are standard, normalized databases. OLTP systems are optimized
for inserts, updates, and deletes; in other words, transactions. Transactions in this context can be thought of as the
entry, update, or deletion of a record or set of records.
OLTP systems achieve greater speed of transactions through a couple of means: they minimize repeated data, and
they limit the number of indexes. First, let’s examine the minimization of repeated data.
If we take the concept of an order, we usually think of an order header and then a series of detail records. The header
contains information such as an order number, a bill-to address, a ship-to address, a PO number, and other fields. An
order detail record is usually a product number, a product description, the quantity ordered, the unit price, the total
price, and other fields. Here is what an order might look like:




Figure 1
Now, the data behind this looks very different. If we had a flat structure, we would see the detail records looking
like this:

Order Number: 12345
Order Date: 4/24/99
Customer ID: 451
Customer Name: ACME Products
Customer Address: 123 Main Street
Customer City: Louisville
Customer State: KY
Customer Zip: 40202
Contact Name: Jane Doe
Contact Number: 502-555-1212
Product ID: A13J2
Product Name: Widget
Product Description: ¼" Brass Widget
Category: Brass Goods
SubCategory: Widgets
Product Price: $1.00
Quantity Ordered: 200
Etc…
Table 3
Notice, however, that for each detail, we are repeating a lot of information: the entire customer address, the contact
information, the product information, etc. We need all of this information for each detail record, but we don’t want
to have to enter the customer and product information for each record. Therefore, we use relational technology to tie
each detail to the header record, without having to repeat the header information in each detail record. The new
detail records might look like this:

Order Number      Product Number      Quantity Ordered
12473             A4R12J              200
Table 4
A simplified logical view of the tables might look something like this:




Figure 2
Notice that we do not have the extended cost for each record in the OrderDetail table. This is because we store as

little data as possible to speed inserts, updates, and deletes. Therefore, any number that can be calculated is
calculated and not stored.
We also minimize the number of indexes in an OLTP system. Indexes are important, of course, but they slow down
inserts, updates, and deletes. Therefore, we use just enough indexes to get by. Over-indexing can significantly
decrease performance.
Normalization
Database normalization is basically the process of removing repeated information. As we saw above, we do not want
to repeat the order header information in each order detail record. There are a number of rules in database
normalization, but we will not go through the entire process.
First and foremost, we want to remove repeated records in a table. For example, we don’t want an order table that
looks like this:




Figure 3
In this example, we will have to have some limit of order detail records in the Order table. If we add 20 repeated sets
of fields for detail records, we won’t be able to handle that order for 21 products. In addition, if an order just has one
product ordered, we still have all those fields wasting space.
So, the first thing we want to do is break those repeated fields into a separate table, and end up with this:




Figure 4
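
The resulting two-table design can be sketched as follows, using SQLite from Python, with the column list trimmed
for brevity:

    import sqlite3

    con = sqlite3.connect(":memory:")
    # Header information is stored once per order...
    con.execute("""CREATE TABLE OrderHeader (
        OrderNumber INTEGER PRIMARY KEY,
        OrderDate   TEXT,
        CustomerID  INTEGER)""")
    # ...and each ordered product becomes one detail row
    # pointing back at the header, so any number of products
    # can appear on one order without wasted fields.
    con.execute("""CREATE TABLE OrderDetail (
        OrderNumber     INTEGER REFERENCES OrderHeader(OrderNumber),
        ProductNumber   TEXT,
        QuantityOrdered INTEGER)""")
    con.execute("INSERT INTO OrderHeader VALUES (12345, '4/24/99', 451)")
    con.execute("INSERT INTO OrderDetail VALUES (12345, 'A13J2', 200)")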


Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomnelietumpap1
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxCarlos105
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...Nguyen Thanh Tu Collection
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYKayeClaireEstoconing
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
FILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinoFILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinojohnmickonozaleda
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptxiammrhaywood
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 

Recently uploaded (20)

What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choom
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
FILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinoFILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipino
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
 

Informatica and datawarehouse Material

  • 1. Data Warehousing obieefans.com DATA WAREHOUSE A data warehouse is the main repository of the organization's historical data, its corporate memory. For example, an organization would use the information that's stored in its data warehouse to find out what day of the week they sold the most widgets in May 1992, or how employee sick leave the week before the winter break differed between California and New York from 2001-2005. In other words, the data warehouse contains the raw material for management's decision support system. The critical factor leading to the use of a data warehouse is that a data analyst can perform complex queries and analysis on the information without slowing down the operational systems. While operational systems are optimized for simplicity and speed of modification (online transaction processing, or OLTP) through heavy use of database normalization and an entity-relationship model, the data warehouse is optimized for reporting and analysis (on line analytical processing, or OLAP). Frequently data in data warehouses is heavily denormalised, summarised and/or stored in a dimension-based model but this is not always required to achieve acceptable query response times. More formally, Bill Inmon (one of the earliest and most influential practitioners) defined a data warehouse as follows: Subject-oriented, meaning that the data in the database is organized so that all the data elements relating to the same real-world event or object are linked together; Time-variant, meaning that the changes to the data in the database are tracked and recorded so that reports can be produced showing changes over time; obieefans.com Non-volatile, meaning that data in the database is never over-written or deleted, once committed, the data is static, read-only, but retained for future reporting; Integrated, meaning that the database contains data from most or all of an organization's operational applications, and that this data is made consistent History of data warehousing Data Warehouses became a distinct type of computer database during the late 1980s and early 1990s. They were developed to meet a growing demand for management information and analysis that could not be met by operational systems. Operational systems were unable to meet this need for a range of reasons: • The processing load of reporting reduced the response time of the operational systems, • The database designs of operational systems were not optimized for information analysis and reporting, • Most organizations had more than one operational system, so company-wide reporting could not be supported from a single system, and • Development of reports in operational systems often required writing specific computer programs which was slow and expensive. As a result, separate computer databases began to be built that were specifically designed to support management information and analysis purposes. These data warehouses were able to bring in data from a range of different data obieefans.com 1
As technology improved (lower cost for more performance) and user requirements increased (faster data load cycle times and more features), data warehouses have evolved through several fundamental stages:

Offline Operational Databases - Data warehouses in this initial stage are developed by simply copying the database of an operational system to an off-line server, where the processing load of reporting does not impact the operational system's performance.

Offline Data Warehouse - Data warehouses in this stage of evolution are updated on a regular cycle (usually daily, weekly or monthly) from the operational systems, and the data is stored in an integrated, reporting-oriented data structure.

Real Time Data Warehouse - Data warehouses at this stage are updated on a transaction or event basis, every time an operational system performs a transaction (e.g. an order, a delivery or a booking).

Integrated Data Warehouse - Data warehouses at this stage are used to generate activity or transactions that are passed back into the operational systems for use in the daily activity of the organization.

DATA WAREHOUSE ARCHITECTURE
The term data warehouse architecture is primarily used today to describe the overall structure of a Business Intelligence system. Other historical terms include decision support systems (DSS) and management information systems (MIS). The data warehouse architecture describes the overall system from various perspectives, such as data, process and infrastructure, needed to communicate the structure, function and interrelationships of each component. The infrastructure or technology perspective details the various hardware and software products used to implement the distinct components of the overall system. The data perspective typically diagrams the source and target data structures and aids the user in understanding what data assets are available and how they are related. The process perspective is primarily concerned with communicating the process and flow of data from the originating source system, through the process of loading the data warehouse, and often the process that client products use to access and extract data from the warehouse.

DATA STORAGE METHODS
In OLTP (online transaction processing) systems, relational database designs use the discipline of data modeling and generally follow the Codd rules of data normalization in order to ensure absolute data integrity.
Complex information is broken down into its simplest structures (tables), where all of the individual atomic-level elements relate to each other and satisfy the normalization rules. Codd defines five increasingly stringent rules of normalization, and OLTP systems typically achieve third normal form. Fully normalized OLTP database designs often result in information from a single business transaction being stored in dozens to hundreds of tables. Relational database managers are efficient at managing the relationships between tables, and deliver very fast insert/update performance because only a small amount of data is affected in each relational transaction.

OLTP databases are efficient because they typically deal only with the information around a single transaction. In reporting and analysis, thousands to billions of transactions may need to be reassembled, imposing a huge workload on the relational database. Given enough time the software can usually return the requested results, but because of the negative performance impact on the machine and all of its hosted applications, data warehousing professionals recommend that reporting databases be physically separated from the OLTP database.

In addition, data warehousing suggests that data be restructured and reformatted to facilitate query and analysis by novice users. OLTP databases are designed to provide good performance for rigidly defined applications built by programmers fluent in the constraints and conventions of the technology. Add in frequent enhancements, and to many users a database is just a collection of cryptic names and seemingly unrelated, obscure structures that store data using incomprehensible coding schemes - all factors that, while improving performance, complicate use by untrained people. Lastly, the data warehouse needs to support high volumes of data gathered over extended periods of time, is subject to complex queries, and needs to accommodate formats and definitions inherited from independently designed packages and legacy systems.

Designing the data architecture that achieves all of this is the realm of Data Warehouse Architects. The goal of a data warehouse is to bring data together from a variety of existing databases to support management and reporting needs. The generally accepted principle is that data should be stored at its most elemental level, because this provides the most useful and flexible basis for reporting and information analysis. However, because of differing focus on specific requirements, there are alternative methods for designing and implementing data warehouses.

There are two leading approaches to organizing the data in a data warehouse: the dimensional approach advocated by Ralph Kimball, and the normalized approach advocated by Bill Inmon. Whilst the dimensional approach is very useful in data mart design, it can result in a rat's nest of long-term data integration and abstraction complications when used in a data warehouse.

In the "dimensional" approach, transaction data is partitioned into either measured "facts", which are generally numeric data that capture specific values, or "dimensions", which contain the reference information that gives each transaction its context. As an example, a sales transaction would be broken up into facts such as the number of products ordered and the price paid, and dimensions such as date, customer, product, geographical location and salesperson (a minimal sketch of such a schema follows below).
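To make the facts-and-dimensions split concrete, here is a minimal star schema sketch for the sales example above, written in Python against SQLite. All table and column names are illustrative assumptions, not taken from any particular product or from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions: the reference data that gives each transaction its context.
CREATE TABLE dim_date (
    date_key      INTEGER PRIMARY KEY,   -- e.g. 20240131
    calendar_date TEXT,
    day_of_week   TEXT
);
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,
    customer_name TEXT,
    region        TEXT
);
CREATE TABLE dim_product (
    product_key   INTEGER PRIMARY KEY,
    product_name  TEXT
);
-- The fact table: numeric measures plus foreign keys to the dimensions.
CREATE TABLE fact_sales (
    date_key      INTEGER REFERENCES dim_date(date_key),
    customer_key  INTEGER REFERENCES dim_customer(customer_key),
    product_key   INTEGER REFERENCES dim_product(product_key),
    units_ordered INTEGER,
    price_paid    REAL
);
""")

conn.execute("INSERT INTO dim_date VALUES (20240131, '2024-01-31', 'Wednesday')")
conn.execute("INSERT INTO dim_customer VALUES (1, 'Acme Ltd', 'West')")
conn.execute("INSERT INTO dim_product VALUES (1, 'Widget')")
conn.execute("INSERT INTO fact_sales VALUES (20240131, 1, 1, 10, 99.50)")

# A typical analytical query: units sold by day of week.
for row in conn.execute("""
        SELECT d.day_of_week, SUM(f.units_ordered)
        FROM fact_sales f JOIN dim_date d ON f.date_key = d.date_key
        GROUP BY d.day_of_week"""):
    print(row)   # ('Wednesday', 10)
```

Because the facts are pre-joined to their context through simple keys, analytical queries like the one at the end stay short and readable, which is exactly the advantage of the dimensional form discussed next.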
The main advantages of a dimensional approach are that the data warehouse is easy for business staff with limited information technology experience to understand and use, and that, because the data is pre-joined into the dimensional form, the data warehouse tends to operate very quickly. The main disadvantage of the dimensional approach is that it is quite difficult to change later if the company changes the way in which it does business.

The "normalized" approach uses database normalization. In this method, the data in the data warehouse is stored in
third normal form. Tables are then grouped together by subject areas that reflect the general definition of the data (customer, product, finance, etc.). The main advantage of this approach is that it is quite straightforward to add new information into the database; the primary disadvantage is that, because of the number of tables involved, it can be rather slow to produce information and reports. Furthermore, since the segregation of facts and dimensions is not explicit in this type of data model, it is difficult for users to join the required data elements into meaningful information without a precise understanding of the data structure.

Subject areas are just a method of organizing information and can be defined along any lines. The traditional approach has subjects defined as the subjects or nouns within a problem space. For example, in a financial services business, you might have customers, products and contracts. An alternative approach is to organize around the business transactions, such as customer enrollment, sales and trades.

Advantages of using a data warehouse
There are many advantages to using a data warehouse, among them:
• Enhanced end-user access to a wide variety of data.
• Business decision makers can obtain various kinds of trend reports, e.g. the item with the most sales in a particular area or country for the last two years.
A data warehouse can be a significant enabler of commercial business applications, most notably customer relationship management (CRM).

Concerns in using data warehouses
• Extracting, cleaning and loading data is time consuming.
• Data warehousing project scope must be actively managed to deliver a release of defined content and value.
• Compatibility problems with systems already in place.
• Security could develop into a serious issue, especially if the data warehouse is web accessible.
• Data storage design controversy warrants careful consideration, and perhaps prototyping of the data warehouse solution, for each project's environment.

HISTORY OF DATA WAREHOUSING
Data warehousing emerged for many different reasons as a result of advances in the field of information systems. A vital insight that propelled the development of data warehousing was the fundamental difference between operational (transaction processing) systems and informational (decision support) systems: operational systems run in real time, whereas informational systems support decisions about a historical point in time. Below is a comparison of the two.
Characteristic      Operational Systems (OLTP)                   Informational Systems (OLAP)
Primary Purpose     Run the business on a current basis          Support managerial decision making
Type of Data        Real time, based on current data             Snapshots and predictions
Primary Users       Clerks, salespersons, administrators         Managers, analysts, customers
Scope               Narrow, planned, simple updates and queries  Broad, complex queries and analysis
Design Goal         Performance: throughput, availability        Ease of flexible access and use
Database concept    Complex                                      Simple
Normalization       High                                         Low
Time-focus          Point in time                                Period of time
Volume              Many constant updates and queries on one     Periodic batch updates and queries
                    or a few table rows                          requiring many or all rows

Other aspects that also contributed to the need for data warehousing are:
• Improvements in database technology: the arrival of relational data models and relational database management systems (RDBMS)
• Advances in computer hardware: the abundant use of affordable storage and other architectures
• The importance of end-users in information systems: the development of interfaces allowing easier use of systems by end users
• Advances in middleware products: enabling enterprise database connectivity across heterogeneous platforms

Data warehousing has evolved rapidly since its inception. Here is a brief timeline:

1970s - Operational systems (such as data processing) were not able to handle large and frequent requests for data analysis. Data was stored in mainframe files and static databases, and a request was processed from recorded tapes for specific queries and data gathering. This proved time consuming and inconvenient.

1980s - Real-time computer applications became decentralized. Relational models and database management systems started emerging and becoming the wave. Retrieving data from operational databases was still a problem because of "islands of data."

1990s - Data warehousing emerged as a feasible solution to optimize and manipulate data, both internally and externally, to allow businesses to make accurate decisions.

What is data warehousing?
After information technology took the world by storm, many revolutionary concepts were created to make it more effective and helpful. During the nineties, as new technology was being born and becoming obsolete in no time, there was a need for a concrete, foolproof idea that could make database administration more secure and reliable. The concept of data warehousing was thus invented to support the business decision-making
process. The working of data warehousing and its applications has been a boon to information technology professionals all over the world. It is very important for managers to understand the architecture of how it works and how it can be used as a tool to improve performance. The concept has revolutionized business planning techniques.

Concept
Information processing and managing a database are two important components of any smoothly run business. Data warehousing is a concept in which the information systems are computerized. Since many applications run simultaneously, each individual process may create exclusive "secondary data" that originates from the source. Data warehouses are useful in tracking all this information down, analyzing it, and using the analysis to improve performance. They offer a wide variety of options and are highly compatible with virtually all working environments. They help the managers of companies to gauge the progress the company has made over a period of time, and also to explore new ways to improve its growth. There are many open questions in business, and these read-only, integrated databases help to answer them. They are useful for giving structure to operations and for analyzing subject matter over a given time period.

The structure
As is the case with all computer applications, there are various steps involved in planning a data warehouse. The need is analyzed; most of the time the end user is taken into consideration, and their input forms an invaluable asset in building a customized database. The business requirements are analyzed and the "need" is discovered; that then becomes the focus area. If a company wants to analyze all its records and use the research to improve performance, a data warehouse allows the manager to focus on this area. After the need is zeroed in on, a conceptual data model is designed. This model is then used as the basic structure that companies follow to build a physical database design, through a number of iterations, technical decisions and prototypes. Then the systems development life cycle of design, development, implementation and support begins.

Collection of data
The project team analyzes the kinds of data that need to go into the database, and also where all the information that can be used to build the database can be found. There are two different kinds of data: data that can be found internally in the company, and data that comes from other sources. Another team of professionals works on the creation of the extraction programs that are used to collect all the information needed from a number of databases, files or legacy systems. They identify these sources and then copy the data onto a staging area outside the database. They clean all the data, a step described as cleansing, to make sure it does not contain any errors, and then copy it all into the data warehouse. This concept of extracting data from the source, and the associated selection and transformation processes, have been unique benchmarks of data warehousing, and getting them right is essential for a project to be successful. A lot of meticulous planning is involved in arriving at a step-by-step configuration of all the data from the source to the data warehouse (a simplified sketch of this flow follows below).
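The extract-cleanse-load flow just described can be illustrated with a small Python sketch. Everything here is a hedged, minimal illustration: the source table, the quality rules and the staging structure are invented for the example, not taken from the text or from any ETL product.

```python
import sqlite3

# Hypothetical operational source: order rows, some with dirty values.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, "  Acme Ltd ", "100.50"),
                    (2, "Bravo Inc", "n/a"),     # unparseable amount
                    (3, None, "75.00")])         # missing customer name

# 1. Extract: copy rows into a staging area outside the source system.
staging = [{"order_id": r[0], "customer": r[1], "amount": r[2]}
           for r in source.execute("SELECT * FROM orders")]

# 2. Cleanse: trim names and reject rows that fail basic quality rules.
clean, rejected = [], []
for row in staging:
    try:
        row["customer"] = row["customer"].strip()
        row["amount"] = float(row["amount"])
        clean.append(row)
    except (AttributeError, TypeError, ValueError):
        rejected.append(row)   # kept aside for data-quality follow-up

# 3. Load: only cleansed rows reach the warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
warehouse.executemany("INSERT INTO orders VALUES (:order_id, :customer, :amount)", clean)

print(f"loaded {len(clean)} rows, rejected {len(rejected)}")   # loaded 1 rows, rejected 2
```

Real ETL tools add scheduling, restartability and volume handling on top of this, but the source-to-staging-to-warehouse shape of the flow is the same.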
Use of metadata
The whole process of extracting data and collecting it into an effective component of the operation requires "metadata". The transformation of an operational system into an analytical system is achieved only with maps of metadata. The transformation metadata includes the changes in names, the data changes and the physical characteristics that exist. It also includes the description of the data, its origin and its updates. Algorithms are used in summarizing the data. Metadata provides a graphical user interface that helps non-technical end users, offering richness in navigating and accessing the database. There is another form of metadata, called operational metadata. This forms the fundamental structure for accessing procedures and for monitoring the growth of the data warehouse in relation to the available storage space. It also records who is responsible for access to the data in the warehouse and in the operational systems. A small sketch of transformation metadata follows below.
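As a concrete (and entirely hypothetical) illustration of transformation metadata, the sketch below records a source-to-target map (renames, type conversions, code decoding) as data, then applies it. The field names and rules are invented for the example.

```python
# A tiny source-to-target metadata map: each entry records where a
# warehouse column comes from and how its value is transformed.
# All field names and rules here are hypothetical.
field_map = {
    "customer_name": {"source": "CUSTNM",  "transform": str.strip},
    "order_value":   {"source": "ORD_VAL", "transform": float},
    "status":        {"source": "STS",
                      "transform": {"A": "Active", "C": "Closed"}.get},
}

def apply_map(source_row: dict) -> dict:
    """Build a warehouse row purely by interpreting the metadata map."""
    return {target: spec["transform"](source_row[spec["source"]])
            for target, spec in field_map.items()}

print(apply_map({"CUSTNM": " Acme Ltd ", "ORD_VAL": "100.5", "STS": "A"}))
# {'customer_name': 'Acme Ltd', 'order_value': 100.5, 'status': 'Active'}
```

Driving the interface from a map like this, rather than hard-coding the logic, is what makes lineage auditable: the map itself documents the name changes, data changes and physical characteristics the text refers to.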
Data marts: specific data
Every database system needs regular updating; some are updated by the day and some by the minute. When a specific department needs to monitor its own data in sync with the overall business process, it stores that data as a data mart. Data marts are not as big as a data warehouse, and are useful for storing the data and information of a specific business module. The latest trend in data warehousing is to develop smaller data marts, manage each of them individually, and later integrate them into the overall business structure.

Security and reliability
As with any information system, the trustworthiness of data is determined by the trustworthiness of the hardware, software, and procedures that created it. The reliability and authenticity of the data and information extracted from the warehouse will be a function of the reliability and authenticity of the warehouse and the various source systems it encompasses. In data warehouse environments specifically, there needs to be a means to ensure the integrity of data: first by having procedures to control the movement of data to the warehouse from operational systems, and second by having controls to protect warehouse data from unauthorized changes. Data warehouse trustworthiness and security are therefore contingent upon acquisition, transformation and access metadata and systems documentation.

Han and Kamber (2001) define a data warehouse as "a repository of information collected from multiple sources, stored under a unified scheme, and which usually resides at a single site." In educational terms, all past information available in electronic format about a school or district, such as budget, payroll, student achievement and demographics, is stored in one location where it can be accessed using a single set of inquiry tools. The following are some of the drivers that led to data warehousing.
• CRM: Retaining existing customers is among the most important concerns of present-day business; there is a constant threat of losing customers to poor quality, and sometimes to unknown reasons that nobody ever explored. To facilitate good customer relationship management, companies invest heavily in finding out the exact needs of the consumer, and it was this direct competition that brought customer relationship management to the forefront. Data warehousing techniques have helped this cause enormously.

• Diminishing profit margins: Global competition has forced many companies that enjoyed generous profit margins on their products to reduce their prices to remain competitive. Since the cost of goods sold remains constant, companies need to manage their operations better to improve their operating margins. Data warehouses enable management decision support for managing business operations.

• Deregulation: Ever-growing competition and diminishing profit margins push companies to explore new possibilities. A company develops in one direction and establishes a particular core competency in the market; once it has its own speciality, it looks for new avenues into new markets with a completely new set of possibilities. For a company to venture into developing a new core competency, deregulation is very important, and data warehouses are used to provide the supporting information. Data warehousing is useful in generating a cross-reference database that helps companies get into cross-selling; this is the single most effective way that this can happen.

• The complete life cycle: The industry is very volatile; we come across a wide range of new products every day, which then become obsolete in no time. Waiting out the complete lifecycle often results in a heavy loss of company resources, so there was a need for a system that would track all the volatile changes and update them by the minute. This allows companies to be extra safe in regard to all their products; the system tracks all the changes and helps the business decision process a great deal. Such systems are also described as business intelligence systems in this respect.

• Merging of businesses: As a direct result of growing competition, companies join forces to carve a niche in a particular market. This helps the companies work towards a common goal with twice the number of resources. In such an event there is a huge amount of data that has to be integrated, possibly spread across different platforms and different operating systems. To have centralized authority over the data, a business tool has to be built that is not only effective but also reliable.
Data warehousing fits this need.

Relevance of data warehousing for organizations
Enterprises today, both nationally and globally, are in perpetual search of competitive advantage. An incontrovertible axiom of business management is that information is the key to gaining this advantage, and within the explosion of data are the clues management needs to define its market strategy. Data warehousing technology is a means of discovering and unearthing these clues, enabling organizations to position themselves competitively within market sectors. It is an increasingly popular and powerful
concept of applying information technology to solving business problems. Companies use data warehouses to store information for marketing, sales and manufacturing, helping managers get a feel for the data and run the business more effectively. Managers use sales data to improve forecasting and planning for brands, product lines and business areas. Retail purchasing managers use warehouses to track fast-moving lines and ensure an adequate supply of high-demand products. Financial analysts use warehouses to manage currency and exchange exposures, oversee cash flow and monitor capital expenditures. Data warehousing has become very popular among organizations seeking competitive advantage by getting strategic information fast and easily (Adhikari, 1996).

The reasons organizations build a data warehouse can be grouped into four areas:

• Warehousing data outside the operational systems: The primary concept of data warehousing is that the data stored for business analysis can be accessed most effectively by separating it from the data in the operational systems. Many of the reasons for this separation have evolved over the years. Previously, legacy systems archived data onto tapes as it became inactive, and many analysis reports ran from these tapes or data sources in order to minimize the performance impact on the operational systems.

• Integrating data from more than one operational system: Data warehouses are most successful when data can be combined from more than one operational system. When data needs to be brought together from more than one application, it is natural that this integration be done in a place independent of the source applications. Before the evolution of structured data warehouses, analysts would often combine data extracted from more than one operational system into a single spreadsheet or database. The data warehouse may very effectively combine data from multiple source applications such as sales, marketing, finance and production.

• Data is mostly non-volatile: Another key attribute of the data in a data warehouse system is that it is brought to the warehouse after it has become mostly non-volatile: once the data is in the data warehouse, no further modifications are made to it.

• Data saved for longer periods than in transaction systems: Data from most operational systems is archived after it becomes inactive. For example, an order may become inactive after a set period from its fulfillment, or a bank account may become inactive after it has been closed for a period of time. The primary reason for archiving inactive data has been the performance of the operational system: large amounts of inactive data mixed with operational live data can significantly degrade the performance of a transaction that is only processing the active data. Since data warehouses are designed to be the archive for the operational data, the data there is saved for a very long period. A sketch of this append-only, history-preserving style of loading follows below.
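The following minimal sketch illustrates what non-volatile, long-history storage means in practice: loads only append snapshot rows stamped with a load date, so nothing is ever updated in place. The table and column names are assumptions made up for the illustration.

```python
import sqlite3

wh = sqlite3.connect(":memory:")
# Append-only: balances are never updated in place; each batch load
# inserts a new snapshot keyed by its load date.
wh.execute("""CREATE TABLE account_balance (
                  account_id INTEGER,
                  balance    REAL,
                  load_date  TEXT)""")

def load_snapshot(rows, load_date):
    wh.executemany("INSERT INTO account_balance VALUES (?, ?, ?)",
                   [(acct, bal, load_date) for acct, bal in rows])

load_snapshot([(1, 100.0), (2, 50.0)], "2024-01-01")
load_snapshot([(1, 120.0), (2, 45.0)], "2024-02-01")   # only INSERTs, no UPDATEs

# History is retained, so time-variant questions stay answerable.
for row in wh.execute("""SELECT load_date, balance
                         FROM account_balance
                         WHERE account_id = 1
                         ORDER BY load_date"""):
    print(row)   # ('2024-01-01', 100.0) then ('2024-02-01', 120.0)
```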
Advantages of data warehousing
There are several advantages to data warehousing. When a problem requires changes to their transactions, companies need both the information and the transaction processing to make a decision.

• Time reduction: "The warehouse has enabled employees to shift their time from collecting information to analyzing it, and that helps the company make better business decisions." A data warehouse turns raw information into a useful analytical tool for business decision-making. Most companies want to get information or transaction processing quickly in order to make decisions. If a company is still using traditional online transaction processing systems, it takes longer to get the information it needs, so decisions take longer and the company loses time and money. A data warehouse also makes transaction processing easier.

• Efficiency: In order to minimize inconsistent reports and provide the capability for data sharing, companies should provide the database technology required to write and maintain queries and reports. A data warehouse provides, in one central repository, all the metrics necessary to support decision-making through queries and reports, which makes management processing efficient.

• Complete documentation: A typical data warehouse objective is to store all the information, including history. This objective comes with its own challenges: historical data is seldom kept on the operational systems, and even when it is kept, rarely are three or five years of history found in one file. This is one reason companies need a data warehouse to store historical data.

• Data integration: Another primary goal for all data warehouses is to integrate data, because its absence is a primary deficiency in current decision support. A further reason to integrate is that the data content in one file may be at a different level of granularity than that in another file, or the same data in one file may be updated at a different time period than in another file.

Limitations
Although the data warehouse brings many advantages to a corporation, there are some disadvantages.

• High cost: Data warehouse systems are expensive. According to Phil Blackwood, the average data warehouse system is valued at $1.8 million. This puts them out of reach of small companies, so only big companies can afford them; smaller companies may therefore lack a proper system for storing data and transaction databases, which makes it harder to organize their data as they grow.

• Complexity: Moreover, a data warehouse is a very complex system. Its primary function is to integrate all the data and the transaction system databases, and because integrating systems is complicated, data warehousing can
complicate business processes significantly. For example, a small change in one transaction processing system may have major impacts on all the transaction processing systems. Adding, deleting, or changing data and transactions can be time consuming, since the administrator needs to control and check the correctness of each change for its impact on other transactions. The complexity of a data warehouse can therefore deter companies from making changes to data or transactions that are necessary.

Opportunities and Challenges for Data Warehousing
Data warehousing faces tremendous opportunities and challenges, which to a great extent decide its most immediate developments and future trends. Behind all of these probable happenings is the impact the Internet has had upon ways of doing business and, consequently, upon data warehousing, a more and more important tool for today's and tomorrow's organizations and enterprises. The opportunities and challenges for data warehousing are mainly reflected in four aspects.

• Data Quality: Data warehousing has unearthed many previously hidden data-quality problems. Most companies have attempted data warehousing and discovered problems as they integrate information from different business units. Data that was apparently adequate for operational systems has often proved to be inadequate for data warehouses (Faden, 2000). On the other hand, the emergence of e-commerce has opened up an entirely new source of data-quality problems: data may now be entered at a Web site directly by a customer, a business partner or, in some cases, anyone who visits the site. They are more likely to make mistakes and, in most cases, less likely to care when they do. All of this is elevating data cleansing "from an obscure, specialized technology to a core requirement for data warehousing, customer-relationship management, and Web-based commerce" (a small profiling sketch follows after this list).

• Business Intelligence: The second challenge comes from the necessity of integrating data warehousing with business intelligence to maximize profits and competency. There is an ever-increasing demand to deploy data warehousing structures and business intelligence. The primary purpose of the data warehouse is experiencing a shift from a focus on transforming data into information to, most recently, transforming it into intelligence. Along this new development, people will expect more and more analytical function from the data warehouse. The customer profile will be extended with psychographic, behavioural and competitive-ownership information as companies attempt to go beyond understanding a customer's preference. In the end, data warehouses will be used to automate actions based on business intelligence; one example is to determine with which supplier an order should be placed in order to achieve delivery as promised to the customer.

• E-business and the Internet: Besides the data quality problem mentioned above, a more profound impact of this new trend on data warehousing lies in the nature of data warehousing itself.
On the surface, the rapidly expanding e-business has posed a threat to data warehouse practitioners. They may be concerned that the Internet has surpassed data warehousing in strategic importance to their company, or that Internet development skills are more highly valued than those for data warehousing. They may feel that the Internet and e-business have captured the hearts and minds of business executives, relegating data warehousing to 'second class citizen' status. However, the opposite is true.

• Other trends: While data warehousing faces many challenges and opportunities, it also creates opportunities for other fields. Some trends that have just started are as follows:
• More and more small-tier and middle-tier corporations are looking to build their own decision support systems.
• The reengineering of decision support systems more often than not ends up with an architecture that helps fuel the growth of those decision support systems.
• Advanced decision support architectures proliferate in response to companies' increasing demands to integrate their customer relationship management and e-business initiatives with their decision support systems.
• More organizations are starting to use data warehousing metadata standards, which allow the various decision support tools to share their data with one another.
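Picking up the profiling sketch promised in the Data Quality item above, here is a minimal, assumption-laden illustration of column profiling: counting nulls, distinct values and numeric-looking entries so that cleansing rules can be written. The sample rows and column names are invented for the example.

```python
# Minimal column profiling over a batch of hypothetical source rows.
rows = [
    {"customer": "Acme Ltd",  "country": "UK", "amount": "100.5"},
    {"customer": None,        "country": "UK", "amount": "75"},
    {"customer": "Bravo Inc", "country": "uk", "amount": "n/a"},
]

def profile(rows, column):
    values = [r[column] for r in rows]
    non_null = [v for v in values if v is not None]
    numeric = sum(1 for v in non_null
                  if isinstance(v, str) and v.replace(".", "", 1).isdigit())
    return {
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "numeric_looking": numeric,
    }

for col in ("customer", "country", "amount"):
    print(col, profile(rows, col))
# customer {'nulls': 1, 'distinct': 2, 'numeric_looking': 0}
# country  {'nulls': 0, 'distinct': 2, 'numeric_looking': 0}  <- 'UK' vs 'uk'
# amount   {'nulls': 0, 'distinct': 3, 'numeric_looking': 2}  <- one bad value
```

Profiles like these are typically run close to the source (for example in a literal staging area, discussed below) so that quality problems are caught before the data is integrated.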
Architectural Overview
In concept the architecture required is relatively simple, as can be seen from the diagram below:

    Source System(s) -> ETL -> Transaction Repository -> ETL -> Data Marts -> Reporting Tools

    Figure 1 - Simple Architecture

However this is a very simple design concept and does not reflect what it takes to implement a data warehousing solution. In the next section we look not only at these core components but at the additional elements required to make it all work.

Components of the Enterprise Data Warehouse
The simple architecture diagram shown at the start of the document shows four core components of an enterprise data warehouse. Real implementations, however, often have many more, depending on the circumstances. In this section we look first at the core components and then at what other additional components might be needed.

The core components
The core components are those shown on the diagram in Figure 1 - Simple Architecture. They are the ones that are most easily identified and described.

Source Systems
The first component of a data warehouse is the source systems, without which there would be no data. These provide the input into the solution and will require detailed analysis early in any project. Important considerations in looking at these systems include:
• Is this the master of the data you are looking for?
• Who owns/manages/maintains this system?
• Where is the source system in its lifecycle?
• What is the quality of the data in the system?
• What are the batch/backup/upgrade cycles on the system?
• Can we get access to it?
Source systems can broadly be categorised into five types.

On-line Transaction Processing (OLTP) Systems
These are the main operational systems of the business and will normally include financial systems, manufacturing systems, and customer relationship management (CRM) systems. These systems will provide the core of any data warehouse but, whilst a large part of the effort will be expended on loading these systems, it is the integration of the other sources that provides the value.

Legacy Systems
Organisations will often have systems that are at the end of their life, or archives of de-commissioned systems. One of the business case justifications for building a data warehouse may have been to remove these systems after the critical data has been moved into the data warehouse. This sort of data often adds to the historical richness of a solution.

Missing or Source-less Data
During the analysis it is often the case that data is identified as required but no viable source exists: for example, exchange rates used on a given date or corporate calendar events; a source that is unusable for loading, such as a document; or the answer is simply in someone's head. There is also data required for basic operation, such as descriptions of codes. This is therefore an important category, which is frequently forgotten during the initial design stages and then requires a last-minute fix into the system, often achieved by direct manual changes to the data warehouse. The downside of that approach is that it loses the tracking, control and auditability of the information added to the warehouse. Our advice is therefore to create a system, or systems, that we call the Warehouse Support Application (WSA). This is normally a number of simple data-entry forms that can capture the data required. It is then treated as another OLTP source and managed in the same way. Organisations are often concerned about how much of this they will have to build. In reality it is a reflection of the level of good data capture in the existing business processes and current systems: if these are good then there will be little or no WSA to build, but if they are poor then significant development will be required, and this should also raise a red flag about the readiness of the organisation to undertake this type of build.
Transactional Repository (TR)
The Transactional Repository is the store of the lowest level of data, and thus defines the scope and size of the database: the scope is defined by what tables are available in the data model, and the size by the amount of data put into the model. Data loaded here will be clean, consistent and time variant (a sketch of time-variant storage follows at the end of this section). The design of the data model in this area is critical to the long-term success of the data warehouse, as it determines the scope and the cost of changes, makes mistakes expensive, and inevitably causes delays. As can be seen from the architecture diagram, the transaction repository sits at the heart of the system; it is the point where all data is integrated and the point where history is held. If the model, once in production, is missing key business information and cannot easily be extended when the requirements or the sources change, this will mean significant rework; avoiding that cost is a factor in the choice of design for this data model.

In order to design the Transaction Repository, three data modelling approaches can be identified. Each lends itself to different organisation types and each has its own advantages and disadvantages, although a detailed discussion of these is outside the scope of this document. The three approaches are:

Enterprise Data Modelling (Bill Inmon)
This is a data model that starts by using conventional relational modelling techniques, and will often describe the business in a conventional normalised database. There may then be a series of de-normalisations for performance and to assist extraction into the data marts. This approach is typically used by organisations that have a corporate-wide data model and strong central control by a group such as a strategy team. These organisations tend to have more internally developed systems rather than third-party products.

Data Bus (Ralph Kimball)
The data model for this type of solution is normally made up of a series of star schemas that have evolved over time, with dimensions becoming "conformed" as they are re-used. The transaction repository is made up of these base star schemas and their associated dimensions. The data marts in the architecture will often just be views, either directly onto these schemas or onto aggregates of them. This approach is particularly suitable for companies that have evolved from a number of independent data marts and are growing into a more mature data warehouse environment.

Process Neutral Model
A Process Neutral Data Model is a data model in which all embedded business rules have been removed. If this is done correctly then, as business processes change, there should be little or no change required to the data model, and Business Intelligence solutions designed around such a model should not be subject to limitations as the business changes. This is achieved both by making many relationships optional with multiple cardinality, and by carefully making sure the model is generic rather than reflecting only the views and needs of one or more specific business areas. Although this sounds simple (and it is once you get used to it), in reality it takes a little while to fully understand and to be able to achieve. This type of data model has been used by a number of very large organisations, where it combines some of the best features of both the data bus approach and enterprise data modelling. As with enterprise data modelling it sets out to describe the entire business, but rather than normalise data it uses an approach that embeds the metadata (or data about data) in the data model, and it often contains natural star schemas. This approach is generally used by large corporations that have one or more of the following attributes: many legacy systems, a number of systems resulting from business acquisitions, no central data model, or a rapidly changing corporate environment.
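The "time variant" property of the transaction repository mentioned above is often implemented with effective-dated rows, in the style of what Kimball calls a type 2 slowly changing dimension; the text itself does not prescribe a mechanism, so this is only one plausible sketch, under invented names. Real repositories would add surrogate keys, batch ids and audit columns.

```python
import sqlite3

tr = sqlite3.connect(":memory:")
# Effective-dated history: a change closes the current row and opens a
# new one, so the repository can answer "as at" questions.
# A NULL valid_to marks the current row.
tr.execute("""CREATE TABLE customer_hist (
                  customer_id INTEGER,
                  region      TEXT,
                  valid_from  TEXT,
                  valid_to    TEXT)""")

def change_region(cust_id, new_region, as_of):
    # Close the current row, then open a new one effective from as_of.
    tr.execute("""UPDATE customer_hist SET valid_to = ?
                  WHERE customer_id = ? AND valid_to IS NULL""",
               (as_of, cust_id))
    tr.execute("INSERT INTO customer_hist VALUES (?, ?, ?, NULL)",
               (cust_id, new_region, as_of))

change_region(1, "North", "2023-01-01")
change_region(1, "West",  "2024-06-01")

# As-at query: which region was customer 1 in on 2024-01-15?
print(tr.execute("""SELECT region FROM customer_hist
                    WHERE customer_id = 1 AND valid_from <= ?
                      AND (valid_to IS NULL OR valid_to > ?)""",
                 ("2024-01-15", "2024-01-15")).fetchone())   # ('North',)
```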
Data Marts
The data marts are areas of a database where the data is organised for user queries, reporting and analysis. Just as with the design of the Transaction Repository, there are a number of design types for a data mart. The choice depends on factors such as the design of the transaction repository and which tools are to be used to query the data marts. The most commonly used models are star schemas and snowflake schemas where direct database access is made, whilst data cubes are favoured by some tool vendors. It is also possible to have single-table solution sets if this meets the business requirement. There is no need for all data marts to have the same design type: as they are user facing, it is important that they are fit for purpose for the user, not that they suit a purist architecture.

Extract-Transform-Load (ETL) Tools
ETL tools are the backbone of the data warehouse, moving data from source to transaction repository and on to data marts. They must deal with issues of performance of load for large volumes and with complex transformation of data, in a repeatable, scheduled environment. These tools build the interfaces between components in the architecture and will also often work with data cleansing elements to ensure that the most accurate data is available. The need for a standard approach to ETL design within a project is paramount. Developers will often create an intricate and complicated solution to a problem for which there is a simple one, often requiring little compromise. Any compromise in the deliverable is usually accepted by the business once they understand that these simple approaches will save them a great deal of cash in the time taken to design, develop, test and ultimately support the solution.

Analysis and Reporting Tools
Collecting all of the data into a single place and making it available is useless without the ability for users to access the information. This is done with a set of analysis and reporting tools; any given data warehouse is likely to have more than one tool. The types of tool can be classified in broadly four categories:
• Simple reporting tools that produce fixed or simply parameterised reports.
• Complex ad hoc query tools that allow users to build and specify their own queries.
• Statistical and data mining packages that allow users to delve into the information contained within the data.
• "What-if" tools that allow users to extract data and then modify it to role-play or simulate scenarios.
A minimal sketch of how a data mart can be derived from the transaction repository follows below.
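As a small illustration of the repository-to-mart step described above, the sketch below derives a user-facing aggregate mart as a view over detailed repository rows. The schema is an invented example, not a prescribed design.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Detailed, integrated rows as they might sit in the transaction repository.
db.execute("""CREATE TABLE tr_sales (
                  sale_date TEXT, region TEXT, product TEXT,
                  units INTEGER, value REAL)""")
db.executemany("INSERT INTO tr_sales VALUES (?, ?, ?, ?, ?)", [
    ("2024-01-05", "West", "Widget", 10, 99.5),
    ("2024-01-19", "West", "Widget",  4, 39.8),
    ("2024-01-07", "East", "Gadget",  2, 50.0),
])

# The data mart: a reporting-oriented aggregate, here simply a view.
db.execute("""CREATE VIEW mart_monthly_sales AS
              SELECT substr(sale_date, 1, 7) AS month,
                     region,
                     SUM(units) AS units,
                     SUM(value) AS value
              FROM tr_sales
              GROUP BY month, region""")

for row in db.execute("SELECT * FROM mart_monthly_sales ORDER BY region"):
    print(row)
# ('2024-01', 'East', 2, 50.0)
# ('2024-01', 'West', 14, 139.3)
```

In a larger environment the same mart might instead be materialised by an ETL job into its own star schema or cube; the view form simply keeps the sketch short.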
Analysis and Reporting Tools
Collecting all of the data into a single place and making it available is useless without the ability for users to access the information. This is done with a set of analysis and reporting tools; any given data warehouse is likely to have more than one tool. The tools fall broadly into four categories:
         •   Simple reporting tools that produce either fixed or simple parameterised reports.
         •   Complex ad hoc query tools that allow users to build and specify their own queries.
         •   Statistical and data mining packages that allow users to delve into the information contained within the data.
         •   "What-if" tools that allow users to extract data and then modify it to role-play or simulate scenarios.

Additional Components
In addition to the core components, a real data warehouse may require any or all of the following components to deliver the solution. The requirement to use a component should be considered by each programme on its own merits.

Literal Staging Area (LSA)
Occasionally the implementation of the data warehouse encounters environmental problems, particularly with legacy systems (e.g. a mainframe system which is not easily accessible by applications and tools). In this case it might be necessary to implement a Literal Staging Area, which creates a literal copy of the source system's content but in a more convenient environment (e.g. moving mainframe data into an ODBC-accessible relational database). This literal staging area then acts as a surrogate for the source system for use by the downstream ETL interfaces. There are some important benefits associated with implementing an LSA:
         •   It makes the system more accessible to downstream ETL products.
         •   It creates a quick win for projects that have been trying to get data off, for example, a mainframe in a more laborious fashion.
         •   It is a good place to perform data quality profiling.
         •   It can be used as a point close to the source at which to perform data quality cleansing.
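The defining property of an LSA is that it copies the source one for one, with no cleaning or transformation. A minimal sketch of that idea, with invented table names and SQLite standing in for both the legacy extract and the staging database:

    # A literal staging load: mirror the source structure and content exactly.
    import sqlite3

    source  = sqlite3.connect(":memory:")   # stand-in for the legacy extract
    staging = sqlite3.connect(":memory:")   # stand-in for the LSA database
    source.execute("CREATE TABLE account (account_no TEXT, balance REAL, branch TEXT)")
    source.execute("INSERT INTO account VALUES ('0001', 125.50, 'EAST')")

    # The LSA deliberately preserves the source structure so it can act as a
    # surrogate for the source system for downstream ETL interfaces.
    staging.execute("CREATE TABLE lsa_account (account_no TEXT, balance REAL, branch TEXT)")
    rows = source.execute("SELECT account_no, balance, branch FROM account").fetchall()
    staging.executemany("INSERT INTO lsa_account VALUES (?, ?, ?)", rows)
    staging.commit()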
Transaction Repository Staging Area (TRS)
ETL loading will often need an area in which to put intermediate data sets or working tables, somewhere which, for clarity and ease of management, should not be in the same area as the main model. This area is used when bringing data from a source system or its surrogate into the transaction repository.

Data Mart Staging Area (DMS)
As with the transaction repository staging area, there is a need for space between the transaction repository and the data marts for intermediate data sets. This area provides that space.

Operational Data Store (ODS)
An operational data store is an area that is used to get data from a source and, if required, lightly aggregate it to make it quickly available. This is required for certain types of reporting which need to be available in "real-time" (updated within 15 minutes) or "near-time" (for example 15 to 60 minutes old). The ODS will not normally clean, integrate or fully aggregate data (as the data warehouse does), but it will provide rapid answers; the data will then become available via the data warehouse once the cleaning, integration and aggregation has taken place in the next batch cycle.

Tools & Technology
The component diagrams above show all the areas and the elements needed. This translates into a significant list of tools and technology that are required to build and operationally run a data warehouse solution. These include:
         •   Operating system
         •   Database
         •   Backup and recovery
         •   Extract, Transform, Load (ETL)
         •   Data quality profiling
         •   Data quality cleansing
         •   Scheduling
         •   Analysis & reporting
         •   Data modelling
         •   Metadata repository
         •   Source code control
         •   Issue tracking
         •   Web-based solution integration
The tools selected should operate together to cover all of these areas. The technology choices will also be influenced by whether the organisation needs to operate a homogeneous (all systems of the same type) or heterogeneous (systems may be of differing types) environment, and by whether the solution is to be centralised or distributed.

Operating System
The server-side operating system is usually an easy decision, normally following the recommendation in the organisation's Information Systems strategy. The operating system choice for enterprise data warehouses tends to be a Unix/Linux variant, although some organisations do use Microsoft operating systems. It is not the purpose of this paper to make any recommendation, and the choice should be the result of the organisation's normal procurement procedures.

Database
The database falls into a very similar category to the operating system, in that for most organisations it is a given, from a select few including Oracle, Sybase, IBM DB2 or Microsoft SQL Server.

Backup and Recovery
This may seem like an obvious requirement, but it is often overlooked or slipped in at the end. From day one of development there will be a need to back up and recover the databases from time to time. The backup poses a number of issues:
         •   Ideally, backups should be done whilst allowing the database to stay up.
         •   It is not uncommon for elements to be backed up during the day, as this is the point of least load on the system and it is often read-only at that point.
         •   It must handle large volumes of data.
         •   It must cope with both databases and source data in flat files.
The recovery has to deal with the related consequence of the above: recovery of large databases, quickly, to a point in time.
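As a small illustration of an online ("hot") backup, the sketch below uses SQLite's backup API purely as a stand-in; a real warehouse would use the selected vendor's own hot-backup tooling, and the database content here is invented.

    # Copy a live database while it remains available to users.
    import sqlite3

    live = sqlite3.connect(":memory:")     # stand-in for the live database
    live.execute("CREATE TABLE t (x INT)")
    target = sqlite3.connect(":memory:")   # stand-in for the backup target
    with target:
        live.backup(target)                # source stays online during the copy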
Extract-Transform-Load (ETL)
The purpose of the extract, transform and load (ETL) software, to create interfaces, has been described above and is at the core of the data warehouse. The market for such tools is constantly moving, with a trend for database vendors to include this sort of technology in their core product. Some of the considerations for selecting an ETL tool include:
         •   Ability to access source systems.
         •   Ability to write to target systems.
         •   Cost of development (it is noticeable that some of the tools that are easy to deploy and operate are not easy to develop with).
         •   Cost of deployment (it is also noticeable that some of the easiest tools to develop with are not easy to deploy or operate).
         •   Integration with scheduling tools.
Typically only one ETL tool is needed; however, it is common for specialist tools to be used from a source system to a literal staging area as a way of overcoming a limitation in the main ETL tool.

Data Quality Profiling
Data profiling tools look at the data and identify issues with it, using some of the following techniques:
         •   Looking at individual values in a column to check that they are valid.
         •   Validating data types within a column.
         •   Looking for rules about uniqueness or the frequencies of certain values.
         •   Validating primary and foreign key constraints.
         •   Validating that data within a row is consistent.
         •   Validating that data is consistent within a table.
         •   Validating that data is consistent across tables.
This is important both for the analysts when examining the system and for the developers when building it. It will also identify data quality cleansing rules that can be applied to the data before loading. It is worth noting that good analysts will often do this without tools, especially if good analysis templates are available.

Data Quality Cleansing
This tool updates data to improve the overall data quality, often based on the output of the data quality profiling tool. There are essentially two types of cleansing tool:
         •   Rule-based cleansing: this performs updates on the data based on rules (e.g. make everything uppercase; replace two spaces with a single space). These rules can be very simple or quite complex, depending on the tool used and the business requirement.
         •   Heuristic cleansing: this performs cleansing by being given only an approximate method of solving the problem within the context of some goal, and then uses feedback from the effects of the solution to improve its own performance. This is commonly used for address-matching problems.
An important consideration when implementing a cleansing tool is that the process should be performed as closely as possible to the source system. If it is performed further downstream, data will be repeatedly presented for cleansing.
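To illustrate the flavour of profiling checks and a simple rule-based cleanse, here is a minimal sketch; the records, columns and rules are invented for the example and bear no relation to any particular tool.

    # Simple column profiling followed by a rule-based cleanse.
    import re

    records = [
        {"customer_id": "1001", "name": "Acme  Products", "postcode": "KY 40202"},
        {"customer_id": "1001", "name": "acme products",  "postcode": None},
    ]

    # Profiling: validity, uniqueness and completeness of columns.
    seen_ids = set()
    for r in records:
        if not r["customer_id"].isdigit():
            print("invalid customer_id:", r)
        if r["customer_id"] in seen_ids:
            print("duplicate customer_id:", r["customer_id"])
        seen_ids.add(r["customer_id"])
        if r["postcode"] is None:
            print("missing postcode for customer:", r["customer_id"])

    # Rule-based cleansing: collapse repeated spaces and standardise case.
    for r in records:
        r["name"] = re.sub(r"\s+", " ", r["name"]).upper()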
Scheduling
With backup, ETL and batch reporting runs, the data warehouse environment has a large number of jobs to be scheduled (typically in the hundreds per day) with many dependencies. For example: "The backup can only start at the end of the business day and provided that the source system has generated a flat file; if the file does not exist then the job must poll for thirty minutes to see if it arrives, otherwise notify an operator. The data mart load cannot start until the transaction repository load is complete, but may then run six different data mart loads in parallel." This should be done via a scheduling tool that integrates into the environment.

Analysis & Reporting
The analysis and reporting tools are the users' main interface into the system. As has already been discussed, there are four main types:
         •   Simple reporting tools
         •   Complex ad hoc query tools
         •   Statistical and data mining packages
         •   What-if tools
Whilst the market for such tools changes constantly, the recognised source of information is The OLAP Report, an independent research resource by Nigel Pendse and Richard Creeth for organisations buying and implementing OLAP applications.

Data Modelling
With all the data models that have been discussed, it is obvious that a tool in which to build data models is required. This will allow designers to graphically manage data models and generate the code to create the database objects. The tool should be capable of both logical and physical data modelling.

Metadata Repository
Metadata is data about data. In the case of the data warehouse this will include information about the sources, targets, loading procedures, when those procedures were run, and information about what certain terms mean and how they relate to the data in the database. The metadata required is defined in a subsequent section on documentation; the information itself will, however, need to be held somewhere. Most tools have some elements of a metadata repository, but there is a need to identify what constitutes the entire repository by identifying which parts are held in which tools.

Source Code Control
Up to this point we have steadfastly remained vendor independent, and we remain so here. However, the issue of source control has one of the biggest impacts on a data warehouse. If the tools that you use do not have version control, or do not integrate to allow version control across them, and your organisation does not have a source code control tool, then download and use CVS: it is free, multi-platform, and we have found it can be made to work with most of the tools in the other categories. There are also Microsoft Windows clients and web-based tools available for CVS.

Issue Tracking
In a similar vein to source code control, most projects do not deal with issue tracking well, the worst nightmare being a spreadsheet that is mailed around once a week for updates. We again recommend that, if a suitable tool is not already available, you consider an open source tool called Bugzilla.
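Returning to the scheduling example above, a toy sketch of the polling and dependency logic it describes follows; file paths, job names and mart names are invented, and a real environment would of course delegate this to a proper scheduling tool.

    # Poll for a source flat file, then run dependent loads.
    import os, time
    from concurrent.futures import ThreadPoolExecutor

    def wait_for_file(path, timeout_minutes=30, poll_seconds=60):
        # Give up after the timeout so an operator can be notified.
        deadline = time.time() + timeout_minutes * 60
        while time.time() < deadline:
            if os.path.exists(path):
                return True
            time.sleep(poll_seconds)
        return False

    def load_transaction_repository():
        print("loading transaction repository")

    def load_data_mart(name):
        print("loading data mart:", name)

    def run_nightly_batch():
        if not wait_for_file("/data/inbound/source_extract.dat"):
            raise RuntimeError("source file missing: notify operator")
        load_transaction_repository()              # marts depend on this step
        marts = ["sales", "stock", "finance", "hr", "risk", "marketing"]
        with ThreadPoolExecutor(max_workers=6) as pool:
            list(pool.map(load_data_mart, marts))  # six mart loads in parallel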
Web Based Solution Integration
Running a programme such as the one described will bring together a great deal of information, and it is important to make all of it accessible in one place. Fortunately, web technologies provide an easy way to do this. An ideal environment would allow communities to see some or all of the following via a secure web-based interface:
         •   Static reports
         •   Parameterised reports
         •   Web-based reporting tools
         •   Balanced scorecards
         •   Analysis
         •   Documentation
         •   Requirements library
         •   Business terms definitions
         •   Schedules
         •   Metadata reports
         •   Data quality profiles
         •   Data quality rules
         •   Data quality reports
         •   Issue tracking
         •   Source code
There are two similar but different technologies available to do this, depending on the corporate approach or philosophy:
         •   Portals: these provide personalised websites and make use of distributed applications to provide a collaborative workspace.
         •   Wikis: these provide a website that allows users to easily add and edit content and link to other web applications. In essence, a wiki simplifies the process of creating HTML web pages and combines it with a system that records each individual change over time, so that at any time a page can be reverted to any of its previous states. A wiki system may also provide tools that allow the user community to monitor the constantly changing state of the wiki and discuss the issues that emerge in trying to achieve a general consensus about its content.
Both can be very effective in developing a common understanding of what the data warehouse does and how it operates, which in turn leads to a more engaged user community and a greater return on investment.
Documentation Requirements
Given the size and complexity of the Enterprise Data Warehouse, a core set of documentation is required, which is described in the following section. If a structured project approach is adopted these documents will be produced as a natural by-product; however, we would recommend the following set of documents as a minimum. To facilitate this, at Data Management & Warehousing we have developed our own set of templates for this purpose.

Requirements Gathering
This is a document managed using a word processor. Timescales: 40 days' effort at the start of the project, plus ongoing updates. There are four sections to our requirements template:
         •   Facts: the key figures that a business requires. Often these will be associated with Key Performance Indicators (KPIs) and the information required to calculate them, i.e. the metrics required for running the company. An example of a fact might be the number of products sold in a store.
         •   Dimensions: the information used to constrain or qualify the facts, for example the list of products, the date of a transaction, or some attribute of the customer who purchased the product.
         •   Queries: the typical questions that a user might want to ask, for example "How many cans of soft drink were sold to male customers on the 2nd of February?". This uses information from the requirements sections on available facts and dimensions.
         •   Non-functional: the requirements that do not directly relate to the data, such as when the system must be available to users, how often it needs to be refreshed, what quality metrics should be recorded about the data, who should be able to access it, etc.
Note that whilst an initial requirements document will come early in the project, it will undergo a number of versions as the user community matures in its use and understanding of the system and the data available to it.
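To make the Queries section above concrete, here is a sketch of how such a question resolves into a fact-and-dimension query; the schema and data are invented purely for the example.

    # "How many cans of soft drink were sold to male customers on 2nd February?"
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, gender TEXT);
    CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, day TEXT);
    CREATE TABLE fact_sales   (product_key INT, customer_key INT, date_key INT, quantity INT);

    INSERT INTO dim_product  VALUES (1, 'Cola Can', 'Soft Drink');
    INSERT INTO dim_customer VALUES (1, 'M'), (2, 'F');
    INSERT INTO dim_date     VALUES (20230202, '2nd February');
    INSERT INTO fact_sales   VALUES (1, 1, 20230202, 24), (1, 2, 20230202, 12);
    """)

    total = db.execute("""
        SELECT SUM(f.quantity)
        FROM fact_sales f
        JOIN dim_product  p ON p.product_key  = f.product_key
        JOIN dim_customer c ON c.customer_key = f.customer_key
        JOIN dim_date     d ON d.date_key     = f.date_key
        WHERE p.category = 'Soft Drink' AND c.gender = 'M'
          AND d.day = '2nd February'
    """).fetchone()[0]
    print(total)   # -> 24: the fact constrained by three dimensions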
Key Design Decisions
This is a document managed using a word processor. Timescales: 0.5 days' effort as and when required. This is a simple one- or two-page template used to record the design decisions made during the project. It contains the issue, the proposed outcome, any counter-arguments and why they were rejected, and the impact on the various teams within the project. It is important because, given the long-term nature of such projects, there is often a revisionist element that queries why decisions were made and spends time revisiting them.

Data Model
This is held in the data modelling tool's internal format. Timescales: 20 days' effort at the start of the project, plus ongoing updates. Both logical and physical data models will be required. The logical data model is an abstract representation of a set of data entities and their relationships, usually including their key attributes. It is intended to facilitate analysis of the function of the data design, and is not intended to be a full representation of the physical database. It is typically produced early in system design, and is frequently a precursor to the physical data model that documents the actual implementation of the database. In parallel with the gathering of requirements, the data models for the transaction repository and the initial data marts will be developed. These will be constantly maintained throughout the life of the solution.

Analysis
These are documents managed using a word processor. The analysis phase of the project is broken down into three main templates, each serving as a step in the progression of understanding required to build the system. During the system analysis part of the project, the following three areas must be covered and documented:

Source System Analysis (SSA)
Timescales: 2-3 days' effort per source system. This is a simple high-level overview of each source system, to understand its value as a potential source of business information and to clarify its ownership and longevity. This is normally done for all systems that are potential sources; as the name implies, it looks at the "system" level and identifies "candidate" systems. These documents are only updated at the start of each phase, when candidate systems are being identified.

Source Entity Analysis (SEA)
Timescales: 7-10 days' effort per system. This is a detailed look at the "candidate" systems, examining the data, the data quality issues, frequency of update, access rights, etc. The output is a list of the tables and fields that are required to populate the data warehouse. These documents are updated at the start of each phase, when candidate systems are being examined, and as part of the impact analysis of any upgrade to a system that was used in a previous phase.

Target Oriented Analysis (TOA)
Timescales: 15-20 days' effort for the Transaction Repository, 3-5 days' effort for each data mart. This is a document that describes the mappings and transformations required to populate a target object. It is important that this is target-focused: a common failing is to look at the source and ask "Where do I put all these bits of information?", rather than the correct question, which is "I need to populate this object; where do I get the information from?"

Operations Guide
This is a document managed using a word processor. Timescales: 20 days' effort towards the end of the development phase. This document describes how to operate the system. It will include the schedule for running all the ETL jobs, including dependencies on other jobs and on external factors such as the backups or a source system. It will also include instructions on how to recover from failure and what the escalation procedures for technical problem resolution are. Other sections will include information on current sizing, predicted growth and key data inflection points (e.g. year end, when there is a particularly large number of journal entries). It will also include the backup and recovery plan, identifying what should be backed up and how to perform system recoveries from backup.

Security Model
This is a document managed using a word processor. Timescales: 10 days' effort after the data model is complete, 5 days' effort towards the development phase. This document should identify who can access what data, when and where. This can be a complex issue, but the architecture described above can simplify it, as most access control needs to be around the data marts; nearly everything else will only be visible to the ETL tools extracting and loading data.

Issue Log
This is held in the issue-logging system's internal format. Timescales: daily, as required. As has already been identified, the project will require an issue log that tracks issues during the development and operation of the system.
Metadata
There are two key categories of metadata, as discussed below.

Business Metadata
This is a document managed using a word processor, or a portal or wiki if available.

Business Definitions Catalogue
Timescales: 20 days' effort after the requirements are complete, plus ongoing maintenance. This is a catalogue of business terms and their definitions. It is all about adding context to data, making meaning explicit and providing definitions for business terms, data elements, acronyms and abbreviations. It will often include information about who owns the definition and who maintains it and, where appropriate, what formula is required to calculate it. Other useful elements include synonyms, related terms and preferred terms. Typical examples include definitions of business terms such as "Net Sales Value" or "Average Revenue per Customer", as well as definitions of hierarchies and common terms such as "customer".

Technical Metadata
This is the information created by the system as it is running. It will be held either in server log files or in databases.

Server & Database Availability
This includes all information about which servers and databases were available and when. It serves two purposes: firstly, monitoring and management of service level agreements (SLAs); and secondly, performance optimisation, to fit the ETL into the available batch window and to ensure that users have good reporting performance.

ETL Information
This is all the information generated by the ETL process, and will include items such as:
         •   When was a mapping created or changed?
         •   When was it last run, and how long did it run for?
         •   Did it succeed or fail?
         •   How many records were inserted, updated or deleted?
This information is again used to monitor the effective running and operation of the system, not only on failure but also by identifying trends, such as mappings or transformations whose performance characteristics are changing.

Query Information
This gathers information about the queries that users are making, and will include:
         •   Which queries are being run?
         •   Which tables do they access, and which fields are being used?
         •   How long do queries take to execute?
This information is used to optimise the user experience, and also to remove redundant information that is no longer being queried.
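As a small illustration of capturing the ETL technical metadata described above, the sketch below wraps a load so that its timing, row count and outcome are recorded; the audit table, job names and row counts are invented for the example.

    # Record timing, volume and outcome for each ETL run.
    import sqlite3, time

    meta = sqlite3.connect(":memory:")   # stand-in for the metadata repository
    meta.execute("""CREATE TABLE etl_run_log (
        job_name TEXT, started REAL, finished REAL,
        rows_processed INTEGER, status TEXT)""")

    def run_logged(job_name, job):
        # Wrap any load so that an audit row is always written.
        started = time.time()
        try:
            rows, status = job(), "SUCCESS"   # the job returns its row count
        except Exception:
            rows, status = 0, "FAILED"
        meta.execute("INSERT INTO etl_run_log VALUES (?, ?, ?, ?, ?)",
                     (job_name, started, time.time(), rows, status))
        meta.commit()

    run_logged("load_tr_customer", lambda: 42)   # stand-in for a real load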
Some additional high-level guidelines
The following items are just some of the common issues that arise in delivering data warehouse solutions. Whilst not exhaustive, they are some of the most important factors to consider.

Programme or project?
For data warehouse solutions to be successful (and financially viable), it is important for organisations to view the development as a long-term programme of work and to examine how the work can be broken up into smaller component projects for delivery. This enables many smaller quick wins at different stages of the programme whilst retaining focus on the overall objective. Examples of this approach may include the development of tactical independent data marts, a literal staging area to facilitate reporting from a legacy system, or prioritisation of the development of particular reports which can significantly help a particular business function. Most successful data warehouse programmes will have an operational life in excess of ten years, with peaks and troughs in development.

The technology trap
At the outset of any data warehouse project, organisations frequently fall into the trap of wanting to design the largest, most complex and functionally all-inclusive solution. This will often tempt the technical teams to use the latest, greatest technology promised by a vendor. However, building a data warehouse is not about creating the biggest database or using the cleverest technology; it is about putting lots of different, often well-established, components together so that they can function successfully to meet the organisation's data management requirements. It also requires sufficient design such that, when the next enhancement or extension of the requirement comes along, there is a known and well-understood business process and technology path to meet it.

Vendor selection
This document presents a vendor-neutral view. However, it is important (and perhaps obvious) to note that the products an organisation chooses to buy will dramatically affect the design and development of the system. In particular, most vendors are looking to spread their coverage in the market space. This means that two selected products may have overlapping functionality, in which case the product to use for a given piece of functionality must be identified. It is also important to differentiate between strategic and tactical tools. The other major consideration is that this technology market space changes rapidly. The process whereby vendors constantly add features similar to those of competing products means that few vendors will have a significant long-term advantage on features alone. Most features that you will require (rather than those that are sometimes desired) will become available in market-leading products during the lifetime of the programme, if they are not there already. The rule of thumb when assessing products is therefore to follow the basic Gartner-type magic quadrant of "ability to execute" and "completeness of vision", and to combine that with your organisation's view of the long-term relationship it has with the vendor and the fact that a series of rolling upgrades to the technology will be required over the life of the programme.

Development partners
This is one of the thorniest issues for large organisations, as they often have policies that outsource development work to third parties and do not want to create internal teams. In practice the issue can be broken down, with programme management and business requirements being sourced internally. The technical design authority is either an external domain expert who transitions to an internal person, or an internal person if suitable skills exist. It is then possible for individual development projects to be outsourced to development partners. In general the market place has more contractors with this type of experience than permanent staff with specialist domain/technology knowledge, so some contractor base, either internally or at the development partner, is almost inevitable. Ultimately it comes down to the individuals and how they come together as a team, regardless of the supplier; the best teams will be a blend of the best people.
The development and implementation sequence
Data warehousing on this scale requires a top-down approach to requirements and a bottom-up approach to the build. In order to deliver a solution it is important to understand what is required of the reports, where that is sourced from in the transaction repository, and how in turn the transaction repository is populated from the source systems. Conversely, the build must start at the bottom and work up through the transaction repository and on to the data marts. Each build phase will look either to build up (i.e. add another level) or to build out (i.e. add another source). This approach means that the project manager can be assured that the final destination will meet the users' requirements, and the build can be optimised by using different teams to build up in some areas whilst other teams build out the underlying levels. Using this model it is also possible to change direction after each completed phase.

Homogeneous & Heterogeneous Environments
This architecture can be deployed using homogeneous or heterogeneous technologies. In a homogeneous environment all the operating systems, databases and other components are built using the same technology, whilst a heterogeneous solution allows multiple technologies to be used, although it is usually advisable to limit this to one technology per component. For example, using Oracle on UNIX everywhere would be a homogeneous environment, whilst using Sybase on UNIX for the transaction repository and all staging areas, and Microsoft SQL Server on Microsoft Windows for the data marts, would be a heterogeneous environment. The trade-off between the two deployments is the cost of integration and of managing additional skills in a heterogeneous environment, compared with the suitability of a single product to fulfil all roles in a homogeneous one. There is obviously a spectrum of solutions between the two end points, such as the same operating system but different databases.

Centralised vs. Distributed solutions
This architecture also supports deployment in either a centralised or a distributed mode. In a centralised solution all the systems are held at a central data centre; this has the advantage of easy management, but may result in a performance impact where users that are remote from the central solution suffer problems over the network. Conversely, a distributed solution provides local systems, which may have a better performance profile for local users but might be more difficult to administer and will suffer from capacity issues when loading the data. Once again there is a spectrum of solutions, and degrees to which this can be applied. It is normal for centralised solutions to be associated with homogeneous environments and distributed solutions with heterogeneous ones, although this need not always be the case.

Converting Data from Application-Centric to User-Centric
Systems such as ERP systems are effectively designed to pump data through a particular business process (application-centric). A data warehouse is designed to look across systems (user-centric), allowing users to view the data they need to perform their job. As an example: raising a purchase order in the ERP system is optimised to get the purchase order from being raised, through approval, to being sent out.
The data warehouse user, by contrast, may want to look at who is raising orders, the average order value, who approves orders and how long the approval takes. Requirements should therefore reflect the view of the data warehouse user and not what a single application can provide.
Analysis and Reporting Tool Usage
When buying licences for the analysis and reporting tools, a common mistake is to request many thousands of seats for a given reporting tool; once delivered, the number of users never rises to the original estimates. The diagram below illustrates why this occurs:

[Figure 5 - Analysis and Reporting Tool Usage: a pyramid plotting flexibility in data access and complexity of tool (data mining, ad hoc reporting tools, parameterised reporting, fixed reporting; delivered via desktop and web-based tools) against the size of the user community (senior analysts, business analysts, business users, customers and suppliers, researchers)]

What the diagram shows is that there is a direct inverse relationship between the degree of reporting flexibility required by a user and the number of users requiring that access. There will be very few people, typically business analysts and planners, at the top, but these individuals will need tools that really allow them to manipulate and mine the data. At the next level down there will be a somewhat larger group of users who require ad hoc reporting access; these people will normally be developing or improving reports that get presented to management. The remaining, but largest, community of the user base will only have a requirement to be presented with data in the form of pre-defined reports with varying degrees of in-built flexibility: for instance, managers, sales staff, or even suppliers and customers coming into the solution over the internet. This broad community will also influence the choice of tool, to reflect the skills of the users. Therefore no individual tool will be perfect, and it is a case of fitting the users and a selection of tools together to give the best results.

Glossary
         •   Data Warehouse: "A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process." (Bill Inmon)
         •   Design Pattern: A design pattern provides a generic approach, rather than a specific solution, for building a particular system or systems.
         •   Dimension Table: Dimension tables contain attributes that describe fact records in the fact table.
         •   Distributed Solution: A system architecture in which the system components are distributed over a number of sites to provide local solutions.
         •   DMS: Data Mart Staging, a component in the data warehouse architecture for staging data.
         •   ERP: Enterprise Resource Planning, a business management system that integrates all facets of the business, including planning, manufacturing, sales and marketing.
         •   ETL: Extract, Transform and Load; the activities required to populate data warehouses and OLAP applications with clean, consistent, integrated and properly summarised data. Also a component in the data warehouse architecture.
         •   Fact Table: In an organisation, the "facts" are the key figures that the business requires. Within that organisation's data mart, the fact table is the foundation from which everything else arises.
         •   Heterogeneous System: An environment in which all or any of the operating systems, databases and other components are built using different technologies, and are then integrated by means of customised interfaces.
         •   Heuristic Cleansing: Cleansing by means of an approximate method for solving a problem within the context of a goal; heuristic cleansing then uses feedback from the effects of its solution to improve its own performance.
         •   Homogeneous System: An environment in which the operating systems, databases and other components are built using the same technology.
         •   KDD: Key Design Decision, a project template.
         •   KPI: Key Performance Indicator. KPIs help an organisation define and measure progress toward organisational goals.
         •   LSA: Literal Staging Area, in which data from a legacy system is stored in a database in order to make it more readily accessible to downstream systems. A component in the data warehouse architecture.
         •   Middleware: Software that connects, or serves as the "glue" between, two otherwise separate applications.
         •   Near-time: Refers to data being updated by means of batch processing at intervals of between 15 minutes and 1 hour (in contrast to "real-time" data, which needs to be updated within 15 minutes).
         •   Normalisation: Database normalisation is a process of eliminating duplicated data in a relational database. The key idea is to store data in one location and provide links to it wherever needed.
         •   ODS: Operational Data Store, a component in the data warehouse architecture that allows near-time reporting.
         •   OLAP: On-Line Analytical Processing. A category of applications and technologies for collecting, managing, processing and presenting multidimensional data for analysis and management purposes.
         •   OLTP: On-Line Transaction Processing, a form of transaction processing conducted via a computer network.
         •   Portal: A web site or service that offers a broad array of resources and services, such as e-mail, forums and search engines.
         •   Process Neutral Model: A data model in which all embedded business rules have been removed. If this is done correctly then, as business processes change, little or no change should be required to the data model, and Business Intelligence solutions designed around such a model should not be subject to limitations as the business changes.
         •   Rule-Based Cleansing: A data cleansing method that performs updates on the data based on rules.
         •   SEA: Source Entity Analysis, an analysis template.
         •   Snowflake Schema: A variant of the star schema with normalised dimension tables.
         •   SSA: Source System Analysis, an analysis template.
         •   Star Schema: A relational database schema for representing multidimensional data, in which the data is stored in a central fact table with one or more tables holding information on each dimension. Dimensions have levels, and all levels are usually shown as columns in each dimension table.
         •   TOA: Target Oriented Analysis, an analysis template.
         •   TR: Transactional Repository, the collated, clean repository for the lowest level of data held by the organisation and a component in the data warehouse architecture.
         •   TRS: Transaction Repository Staging, a component in the data warehouse architecture used to stage data.
         •   Wiki: A type of website, or the software needed to operate it, that allows users to easily add and edit content and is particularly suited to collaborative content creation.
         •   WSA: Warehouse Support Application, a component in the data warehouse architecture that supports missing data.

Designing the Star Schema Database
Creating a star schema database is one of the most important, and sometimes the final, steps in creating a data warehouse. Given how important this process is to our data warehouse, it is important to understand how we move from a standard, on-line transaction processing (OLTP) system to a final star schema (which here we will call an OLAP system). This paper attempts to address some of the issues that have no doubt kept you awake at night. As you stared at the ceiling, wondering how to build a data warehouse, questions began swirling in your mind:
         •   What is a Data Warehouse? What is a Data Mart?
         •   What is a Star Schema Database?
         •   Why do I want/need a Star Schema Database?
         •   The Star Schema looks very denormalized. Won't I get in trouble for that?
         •   What do all these terms mean?
         •   Should I repaint the ceiling?
These are certainly burning questions. This paper will attempt to answer them, and show you how to build a star schema database to support decision support within your organization. Usually you are bored with terminology at the end of a chapter, or it is buried in an appendix at the back of the book. Here, however, I have the thrill of presenting some terms up front. The intent is not to bore you earlier than usual, but to present a baseline off which we can operate. The problem in data warehousing is that terms are often used loosely by different parties. The Data Warehousing Institute (http://www.dw-institute.com) has attempted to standardize some terms and concepts; I will present my best understanding of the terms I will use throughout this lecture. Please note, however, that I do not speak for the Data Warehousing Institute.

OLTP
OLTP stands for Online Transaction Processing. This is a standard, normalized database structure. OLTP is designed for transactions, which means that inserts, updates, and deletes must be fast. Imagine a call center that takes orders. Call takers are continually taking calls and entering orders that may contain numerous items. Each order and each item must be inserted into a database. Since the performance of the database is critical, we want to maximize the speed of inserts (and updates and deletes). To maximize performance, we typically try to hold as few records in the database as possible.

OLAP and Star Schema
OLAP stands for Online Analytical Processing. OLAP is a term that means many things to many people. Here, we will use the terms OLAP and star schema pretty much interchangeably: we will assume that a star schema database is an OLAP system. This is not the same thing that Microsoft calls OLAP; they extend OLAP to mean the cube structures built using their product, OLAP Services. Here, we will assume that any system of read-only, historical, aggregated data is an OLAP system. In addition, we will assume an OLAP/star schema can be the same thing as a data warehouse. It can be, although often data warehouses have cube structures built on top of them to speed queries.
Data Warehouse and Data Mart
Before you begin grumbling that I have taken two very different things and lumped them together, let me explain: data warehouses and data marts are conceptually different, differing in scope. However, they are built using the exact same methods and procedures, so I will define them together here, and then discuss the differences.

A data warehouse (or mart) is a way of storing data for later retrieval. This retrieval is almost always used to support decision-making in the organization. That is why many data warehouses are considered to be DSS (Decision Support Systems). You will hear some people argue that not all data warehouses are DSS, and that's fine. Some data warehouses are merely archive copies of data. Still, the full benefit of taking the time to create a star schema, and then possibly cube structures, is to speed the retrieval of data. In other words, it supports queries. These queries are often across time. And why would anyone look at data across time? Perhaps they are looking for trends. And if they are looking for trends, you can bet they are making decisions, such as how much raw material to order. Guess what: that's decision support!

Enough of the soap box. Both a data warehouse and a data mart are storage mechanisms for read-only, historical, aggregated data. By read-only, we mean that the person looking at the data won't be changing it. If a user wants to look at the sales yesterday for a certain product, they should not have the ability to change that number. Of course, if we know that number is wrong, we need to correct it, but more on that later. The "historical" part may just be a few minutes old, but usually it is at least a day old. A data warehouse usually holds data that goes back a certain period in time, such as five years. In contrast, standard OLTP systems usually only hold data as long as it is "current" or active. An order table, for example, may move orders to an archive table once they have been completed, shipped, and received by the customer.

When we say that data warehouses and data marts hold aggregated data, we need to stress that there are many levels of aggregation in a typical data warehouse. In this section, on the star schema, we will just assume the "base" level of aggregation: all the data in our data warehouse is aggregated to a certain point in time. Let's look at an example: we sell two products, dog food and cat food. Each day, we record sales of each product. At the end of a couple of days, we might have data that looks like this:

    Quantity Sold
    Date       Order Number   Dog Food   Cat Food
    4/24/99    1              5          2
               2              3          0
               3              2          6
               4              2          2
               5              3          3
    4/25/99    1              3          7
               2              2          1
               3              4          0
    Table 1

Now, as you can see, there are several transactions. This is the data we would find in a standard OLTP system. However, our data warehouse would usually not record this level of detail. Instead, we summarize, or aggregate, the data to daily totals. Our records in the data warehouse might look something like this:

    Quantity Sold
    Date       Dog Food   Cat Food
    4/24/99    15         13
    4/25/99    9          8
    Table 2
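The move from Table 1 to Table 2 is just a grouped sum. A minimal sketch of that aggregation, using SQLite and the same invented dog food and cat food figures:

    # Summarize transaction-level OLTP rows to the warehouse's daily grain.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE order_detail (sale_date TEXT, order_no INT, dog_food INT, cat_food INT)")
    db.executemany("INSERT INTO order_detail VALUES (?, ?, ?, ?)", [
        ("1999-04-24", 1, 5, 2), ("1999-04-24", 2, 3, 0), ("1999-04-24", 3, 2, 6),
        ("1999-04-24", 4, 2, 2), ("1999-04-24", 5, 3, 3),
        ("1999-04-25", 1, 3, 7), ("1999-04-25", 2, 2, 1), ("1999-04-25", 3, 4, 0),
    ])

    # One row per day: the base level of aggregation (the grain).
    for row in db.execute("""
        SELECT sale_date, SUM(dog_food), SUM(cat_food)
        FROM order_detail GROUP BY sale_date ORDER BY sale_date"""):
        print(row)   # ('1999-04-24', 15, 13) then ('1999-04-25', 9, 8)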
You can see that we have reduced the number of records by aggregating the individual transaction records into daily records that show the number of each product purchased each day. We can certainly get from the OLTP system to what we see in the OLAP system just by running a query. However, there are many reasons not to do this, as we will see later.

Aggregations
There is no magic to the term "aggregations." It simply means a summarized, additive value. The level of aggregation in our star schema is open for debate; we will talk about this later. Just realize that almost every star schema is aggregated to some base level, called the grain.

OLTP Systems
OLTP, or Online Transaction Processing, systems are standard, normalized databases. OLTP systems are optimized for inserts, updates, and deletes; in other words, transactions. Transactions in this context can be thought of as the entry, update, or deletion of a record or set of records. OLTP systems achieve greater speed of transactions through a couple of means: they minimize repeated data, and they limit the number of indexes. First, let's examine the minimization of repeated data. If we take the concept of an order, we usually think of an order header and then a series of detail records. The header contains information such as an order number, a bill-to address, a ship-to address, a PO number, and other fields. An order detail record is usually a product number, a product description, the quantity ordered, the unit price, the total price, and other fields. Here is what an order might look like:

[Figure 1 - an example order, showing a header and a series of detail lines]

Now, the data behind this looks very different. If we had a flat structure, we would see each detail record looking like this:

    Order Number: 12345
    Order Date: 4/24/99
    Customer ID: 451
    Customer Name: ACME Products
    Customer Address: 123 Main Street
    Customer City: Louisville
    Customer State: KY
    Customer Zip: 40202
    Customer Contact Name: Jane Doe
    Contact Number: 502-555-1212
    Product ID: A13J2
    Product Name: Widget
    Product Description: ¼" Brass Widget
    Category: Brass Goods
    SubCategory: Widgets
    Product Price: $1.00
    Quantity Ordered: 200
    Etc...
    Table 3
Notice, however, that for each detail we are repeating a lot of information: the entire customer address, the contact information, the product information, and so on. We need all of this information for each detail record, but we don't want to have to enter the customer and product information for each record. Therefore, we use relational technology to tie each detail to the header record, without having to repeat the header information in each detail record. The new detail records might look like this:

    Order Number   Product Number   Quantity Ordered
    12473          A4R12J           200
    Table 4

A simplified logical view of the tables might look something like this:

[Figure 2 - a simplified logical view: an Order header table related to an OrderDetail table]

Notice that we do not have the extended cost for each record in the OrderDetail table. This is because we store as little data as possible to speed inserts, updates, and deletes. Therefore, any number that can be calculated is calculated and not stored.
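A minimal sketch of the normalized header/detail design that Figure 2 describes follows; the table and column names are invented, and note that there is deliberately no extended-cost column, since anything calculable is calculated rather than stored.

    # Header/detail order schema: details carry only keys and quantity.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE customer     (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE product      (product_id TEXT PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE order_header (order_no INTEGER PRIMARY KEY, order_date TEXT,
                               customer_id INTEGER REFERENCES customer(customer_id));
    CREATE TABLE order_detail (order_no INTEGER REFERENCES order_header(order_no),
                               product_id TEXT REFERENCES product(product_id),
                               quantity INTEGER);
    -- No extended cost column: quantity * price is derived at query time.
    """)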
We also minimize the number of indexes in an OLTP system. Indexes are important, of course, but they slow down inserts, updates, and deletes. Therefore, we use just enough indexes to get by. Over-indexing can significantly decrease performance.

Normalization
Database normalization is basically the process of removing repeated information. As we saw above, we do not want to repeat the order header information in each order detail record. There are a number of rules in database normalization, but we will not go through the entire process. First and foremost, we want to remove repeated records in a table. For example, we don't want an order table that looks like this:

[Figure 3 - an Order table with repeated sets of detail fields built into it]

In this example, we would have to set some limit on the number of order detail records held in the Order table. If we allow 20 repeated sets of fields for detail records, we won't be able to handle an order for 21 products. In addition, if an order has just one product ordered, all those extra fields waste space. So, the first thing we want to do is break those repeated fields into a separate table, and end up with this:

[Figure 4 - the repeated detail fields split out into a separate OrderDetail table]