2. Agenda
• Concept of Data warehousing
• Data integration and the extraction,
transformation, and load (ETL) process
• Data Warehouse Development
• Administration
• Issues
Gurpreet Singh, MGN646
3. Concept of Data warehousing
• A data warehouse is a pool of data produced
to support decision making.
• Data is usually structured to be available in a
form ready for OLAP, mining, querying,
reporting and other decision support
applications
4. Definition
“A data warehouse is simply a single, complete,
and consistent store of data obtained from a
variety of sources and made available to end
users in a way they can understand and use it in
a business context.”
-- Barry Devlin, IBM Consultant
5. Characteristics of data warehouse
• Subject oriented
• Integrated
• Time variant or time series
• Nonvolatile
• Multidimensional
• Client/server
• Real time
• Includes metadata
6. 1. Subject-Oriented
Data is categorized and stored by business subject
rather than by application
[Figure: the OLTP applications (Equity Plans, Shares, Savings, Insurance, Loans) correspond to one data warehouse subject, Customer financial information.]
8. 3. Time-Variant
Data is stored as a series of snapshots, each
representing a period of time
Time Data
Jan-97 January
Feb-97 February
Mar-97 March
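The snapshot idea above can be illustrated with a minimal Python sketch. The account balances and the `as_of` helper are hypothetical and exist only for illustration; the point is that each period keeps its own record, so change over time can be computed without overwriting anything.

```python
from datetime import date

# Hypothetical monthly snapshots of one account balance; values are illustrative.
snapshots = {
    date(1997, 1, 1): {"balance": 1200},
    date(1997, 2, 1): {"balance": 1350},
    date(1997, 3, 1): {"balance": 1100},
}

def as_of(period):
    """Return the snapshot stored for a period; other periods are untouched."""
    return snapshots[period]

# Because every period is kept, change over time can be computed directly.
change = as_of(date(1997, 3, 1))["balance"] - as_of(date(1997, 1, 1))["balance"]
print(change)
```

An operational system would typically hold only the current balance; keeping one entry per period is what makes this kind of comparison possible.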
9. 4. Nonvolatile
Typically, data in the data warehouse is not updated or deleted.
[Figure: operational systems support Insert, Update, Delete, and Read; the warehouse supports only Load and Read.]
10. 5. Client/Server
• A data warehouse uses the client/server
architecture to provide easy access for end
users.
11. 6. Real Time
• Data warehouses provide real-time, or active,
data access and analysis capabilities.
12. 7. Meta Data
• Metadata is data that describes other data.
• It provides information about a certain item’s
content.
• E.g., information about how long a
document is, who its author is, when the
document was written, and a short summary
of the document.
13. Continued…
• Structural metadata (data describing the
structure of data)
• Semantic metadata (data describing the
meaning of data)
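As a rough illustration of the two kinds of metadata, the sketch below describes a single warehouse column in Python. The table, column, unit, and source-system names are all hypothetical; they are not taken from the slides.

```python
# Hypothetical metadata for one warehouse column; all names are illustrative.
structural_metadata = {
    "table": "customer",
    "column": "balance",
    "type": "REAL",
    "nullable": False,
}
semantic_metadata = {
    "meaning": "Account balance at the end of the snapshot period",
    "unit": "USD",
    "source_system": "savings application",
}

def describe(structural, semantic):
    """Combine both kinds of metadata into one human-readable line."""
    return (
        f'{structural["table"]}.{structural["column"]} ({structural["type"]}): '
        f'{semantic["meaning"]} [{semantic["unit"]}]'
    )

summary = describe(structural_metadata, semantic_metadata)
print(summary)
```

The structural entry says how the data is stored, while the semantic entry says what the stored value means to a business user.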
14. Data Marts
• A data mart is usually smaller and focuses on a
particular subject or department.
• It typically consists of a single subject area.
• A data mart can be dependent or independent.
15. Data Warehouses Versus Data Marts
Property              Data Warehouse      Data Mart
Scope                 Enterprise          Department
Subject               Multiple            Single subject
Data source           Many                Few
Size (typical)        100 GB to >1 TB     <100 GB
Implementation time   Months to years     Months
16. Data Mart
– Dependent data mart
A subset that is created directly from a data
warehouse
– Independent data mart
A small data warehouse designed for a strategic
business unit or a department
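A dependent data mart can be sketched as a subset derived from a warehouse table. The example below uses Python's built-in `sqlite3` module with an in-memory database as a stand-in warehouse; the `sales` table, its rows, and the `marketing_mart` view are hypothetical. A view is only one possible way to do this, and real dependent marts are often separate physical stores loaded from the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the enterprise data warehouse
conn.execute("CREATE TABLE sales (department TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("marketing", "ads", 500.0),
        ("finance", "audit", 900.0),
        ("marketing", "events", 250.0),
    ],
)

# Dependent data mart: a single-department subset created from the warehouse table.
conn.execute(
    "CREATE VIEW marketing_mart AS "
    "SELECT product, amount FROM sales WHERE department = 'marketing'"
)
mart_rows = conn.execute(
    "SELECT product, amount FROM marketing_mart ORDER BY product"
).fetchall()
print(mart_rows)
```

An independent data mart, by contrast, would be loaded from source systems directly rather than derived from a warehouse table like this.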
19. Operational Data stores
• Provides a fairly recent form of customer information file.
• Contents are updated throughout the course
of business operations.
• Used for short-term decisions.
• Stores only recent information.
• Acts like short-term memory.
20. Enterprise Data warehouse
• Large scale data warehouse that is used across
the enterprise for decision support.
21. Data Warehousing Process
• Data Sources (from legacy systems and external
data providers)
• Data extraction (using commercial software
called ETL)
• Data loading
• Comprehensive database
• Metadata
• Middleware tools (enable access to data
warehouse)
22. Data Warehouse Architecture
• The data warehouse itself
• Data acquisition software
• Client software, which allows users to access
and analyze data from the warehouse.
23. Data Integration and the Extraction,
Transformation, and Load (ETL) Process
• A major purpose of a data warehouse is to
integrate data from multiple systems.
24. Continued…
• Various integration technologies enable data and metadata
integration:
I. Enterprise application integration (provides a vehicle
(software) for pushing data from source systems into the
data warehouse)
II. Enterprise information integration (real time data
integration from a variety of sources, mechanism for pulling
data from source systems)
25. Continued…
III. Extraction, transformation, and load (ETL): The ETL process is an
integral component of any data-centric project.
• It is often said to consume about 70% of the time in a data-centric project.
• The ETL process consists of extraction (reading data from one
or more databases), transformation (converting the extracted
data from its previous form into the form it needs to be in),
and load (putting the data into the data warehouse).
• The purpose of the ETL process is to load the warehouse with
integrated and cleansed data.
• Data can come from any source, such as flat files or Excel
spreadsheets.
26. Continued…
• Any data quality issues pertaining to the source files need to
be corrected before the data are loaded into the data
warehouse.
• The process of loading data into a data warehouse can be
performed through data transformation tools or using
programming languages
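The extract, transform, and load steps described above can be sketched in a general-purpose programming language. The Python example below is a minimal illustration, not a production ETL tool: the CSV text, the `customer` table, and the cleansing rule (reject rows whose balance is not numeric) are all assumptions made for the example, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data standing in for a flat file or spreadsheet export.
SOURCE_CSV = """customer_id,name,balance
1, Alice ,100.50
2,Bob,not_a_number
3,carol,75
"""

def extract(text):
    """Extract: read rows from the source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: tidy names and reject rows whose balance is not numeric."""
    clean = []
    for row in rows:
        try:
            balance = float(row["balance"])
        except ValueError:
            continue  # data quality issue: reject the row before loading
        clean.append((int(row["customer_id"]), row["name"].strip().title(), balance))
    return clean

def load(conn, rows):
    """Load: insert the cleansed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer "
        "(customer_id INTEGER, name TEXT, balance REAL)"
    )
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the warehouse
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute(
    "SELECT name, balance FROM customer ORDER BY customer_id"
).fetchall()
print(loaded)
```

Here the row with the non-numeric balance is dropped during transformation, so only cleansed rows reach the warehouse table. A real pipeline would more likely log or quarantine rejected rows than discard them silently.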
28. Data Warehouse Development
Benefits:
• End users can perform extensive analysis in
numerous ways.
• A consolidated view of corporate data is
possible.
• Better and more timely information.
• Data access is simplified.
30. Inmon Model
• EDW approach
• Emphasis on top-down development
• Inmon’s approach starts with an enterprise
data warehouse, creating data marts as
subsets if appropriate.
31. Kimball Model
• Data mart approach
• Emphasis on bottom-up development
• Kimball’s approach starts with data marts,
consolidating them into an EDW later if
appropriate.
32. Best Model
• There is no one-size-fits-all strategy for data
warehousing.
34. Similarities and differences between the Inmon and
Kimball data warehouse development approaches
• Similarities: Both methods can produce an enterprise data
warehouse and subset data marts.
• Differences: Inmon’s approach starts with an enterprise data
warehouse, creating data marts as subsets of that EDW if
appropriate. The focus is on proven, traditional methods and
technologies. Kimball’s approach starts with data marts, consolidating
them into an EDW later if appropriate. It focuses on creating a
useful end-user capability quickly.
35. Real Time Data Warehouse
• Real time data warehousing is the process of
loading and providing data via the data
warehouse as they become available.
• Also known as active data warehousing.
36. Concerns about real-time BI
• Not all data should be updated continuously
• May be cost prohibitive
• May also be infeasible
37. Example
• Egg plc (egg.com) is the world’s largest online
bank.
• It provides banking, insurance, investments
and mortgages to more than 3.6 million
customers through its internet site.
• In 1998, Egg selected Sun Microsystems to
create a reliable, scalable, secure
infrastructure to support its more than 2.5
million daily transactions.
38. Continued…
• In 2001, the system was upgraded.
• This new customer data warehouse used Sun,
Oracle and SAS software products.
• The system provides near real-time data access.
• It provides data warehouse and data mining
services to users.
• Hundreds of sales and marketing campaigns are
constructed using near real time data.
• Enables faster decision making about specific
customers.
39. Data Warehouse Administration
• Due to its huge size, a DW requires especially strong
monitoring in order to sustain its efficiency,
productivity and security.
• The successful administration and management of a
data warehouse entails skills and proficiency.
• A data warehouse administrator should be familiar
with high performance software, hardware and
networking technologies.
40. DW Scalability and Security
• Scalability
– The main issues pertaining to scalability:
• The amount of data in the warehouse
• How quickly the warehouse is expected to grow
• The number of concurrent users
• The complexity of user queries
– Good scalability means that the cost of queries and other
data-access functions grows (ideally) linearly with the size
of the warehouse
• Security
– Emphasis on security and privacy
41. Security concerns involved in building a
data warehouse.
1. Laws and regulations, in the U.S. and elsewhere,
require certain safeguards on databases that contain
the type of information typically found in a DW.
2. The large amount of valuable corporate data in a data
warehouse can make it an attractive target.
3. The need to allow a wide variety of unplanned
queries in a DW makes it impractical to restrict
end-user access to specific, carefully constrained screens,
which is one way to limit potential violations.
42. Effective security in a data warehouse should
focus on four main areas:
• Step 1. Establishing effective corporate and security
policies and procedures. An effective security policy should
start at the top and be communicated to everyone in the
organization.
• Step 2. Implementing logical security procedures and
techniques to restrict access. This includes user
authentication, access controls, and encryption.
• Step 3. Limiting physical access to the data center
environment.
• Step 4. Establishing an effective internal control review
process for security and privacy.
43. DIRECTV THRIVES WITH ACTIVE DATA
WAREHOUSING
• DIRECTV, which is known for its direct
broadcast satellite television service, has been
a regular contributor to the evolution of TV
with its advanced HD programming,
interactive features, digital video recording
services, and electronic program guides.
44. Problem
• DIRECTV faced the challenge of dealing with
high transactional data volumes created by
the large number of daily customer calls.
• Accommodating such a large data volume,
along with changing market conditions, was
one of the key challenges.
45. Solution
• DIRECTV used software solutions from Teradata and
GoldenGate to develop a product that
integrates its data assets in near real time
throughout the enterprise.
• The goal of the new data warehouse system
was to send fresh data to the call center at
least daily.