This document discusses data warehousing, including its definition, importance, components, strategies, ETL processes, and considerations for success and pitfalls. A data warehouse is a collection of integrated, subject-oriented, non-volatile data used for analysis. It allows more effective decision making through consolidated historical data from multiple sources. Key components include summarized and current detailed data, as well as transformation programs. Common strategies are enterprise-wide and data mart approaches. ETL processes extract, transform and load the data. Clean data and proper implementation, training and maintenance are important for success.
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
Data Warehouse Basic Guide
1. Data Warehouse
DefinitionDefinition
Importance of Data WarehouseImportance of Data Warehouse
Its ComponentsIts Components
Two Data Warehousing StrategiesTwo Data Warehousing Strategies
ETL ProcessesETL Processes
For a Successful WarehouseFor a Successful Warehouse
Data Warehouse PitfallsData Warehouse Pitfalls
2. Data Warehouse
A subject oriented, integrated, time-variant, non-volatile
collection of data in support of management decisions (Bill
Inmon)
Subject oriented -- data are organized around sales,
products, etc.
Integrated -- data are integrated to provide a
comprehensive view
Time variant -- historical data are maintained
Nonvolatile -- data are not updated by users
3. Limitations of Traditional
Databases
lack of on-line historical data
residing in different operational systems
extremely poor query performance
operational database designs not suited for
decision support
4. The Importance of Data
Warehousing
More cost – effective decision making
Increase quality and flexibility of enterprise analysis as
data warehouse contain accurate and reliable data
Ability to maintain better customer relationships
Unlimited analyses of enterprise information
5. Components of Data warehouse
Summarized data
Basically of two type: 1) Lightly (departmental information)
2) Highly (enterprise wide decision)
Current detail
Comes directly from operational system
But stored by subject area and represent entire organization not a department
System of record
Maintaining the source of record
Integration and transformation Programs
Programs that convert an application – specific data to enterprise data
6. Cont..
Performs many function like
Reformatting, recalculating
Adding time element
Identifying the default value
Summarizing and merging the data
Filling up the blank fields
Archives
Contain old data which hold some amount of significance to the organization
Used for trend analysis
Metadata
Control access and analysis of the data warehouse contents
To manage and control data warehouse creation and maintenance
7. Two Data Warehousing
Strategies
Enterprise-wide warehouse, top down, the
Inmon methodology
Data mart, bottom up, the Kimball
methodology
When properly executed, both result in an
enterprise-wide data warehouse
8. The Data Mart Strategy
The most common approach
Begins with a single mart and are added over
time for more subject areas
Relatively inexpensive and easy to implement
Can be used as a proof of concept for data
warehousing
Requires an overall integration plan
9. The Enterprise-wide Strategy
A comprehensive warehouse is built initially
An initial dependent data mart is built using a
subset of the data in the warehouse
Additional data marts are built using subsets of the
data in the warehouse
Like all complex projects, it is expensive, time
consuming, and prone to failure
When successful, it results in an integrated, scalable
warehouse
10. ETL Processes
Extraction, Transformation, and Loading Process
The “plumbing” work of data warehousing
Data are moved from source to target data
bases
A very costly, time consuming part of data
warehousing
11. Sample ETL Tools
Teradata Warehouse Builder from Teradata
DataStage from Ascential Software
SAS System from SAS Institute
Power Mart/Power Center from Informatica
Sagent Solution from Sagent Software
12. Reasons for “Dirty” Data
• Dummy Values
• Absence of Data
• Multipurpose Fields
• Inappropriate Use of Address Lines
• Violation of Business Rules
• Non-Unique Identifiers
• Data Integration Problems
13. I. Data Cleansing and
Extracting
Source systems contain “dirty data” that must be cleansed
ETL software contains rudimentary data cleansing capabilities
Specialized data cleansing software is often used. Important
for performing name and address correction and householding
functions
Leading data cleansing vendors include Vality (Integrity),
Harte-Hanks (Trillium), and Firstlogic (i.d.Centric)
14. Steps in Data Cleansing
Parsing
Correcting
Standardizing
Matching
Consolidating
15. Parsing
Parsing locates and identifies individual data
elements in the source files and then isolates
these data elements in the target files.
Examples include parsing the first, middle,
and last name; street number and street
name; and city and state.
16. Correcting
Corrects parsed individual data components
using sophisticated data algorithms and
secondary data sources.
Example include replacing a vanity address
and adding a zip code.
17. Standardizing
Standardizing applies conversion routines to
transform data into its preferred (and
consistent) format using both standard and
custom business rules.
Examples include adding a pre name,
replacing a nickname, and using a preferred
street name
18. Matching
Searching and matching records within and
across the parsed, corrected and standardized
data based on predefined business rules to
eliminate duplications.
Examples include identifying similar names
and addresses.
19. Consolidating
• Analyzing and identifying relationships between
matched records and consolidating/merging
them into ONE representation.
20. II. Data Transformation
Transforms the data in accordance with the
business rules and standards that have been
established
Example include: format changes,
deduplication, splitting up fields,
replacement of codes, derived values, and
aggregates
21. III. Data Loading
Data are physically moved to the data
warehouse
The loading takes place within a “load
window”
The trend is to near real time updates of the
data warehouse as the warehouse is
increasingly used for operational applications
22. For a Successful Warehouse
From day one establish that warehousing is a joint
user/builder project
Establish that maintaining data quality will be an
ONGOING joint user/builder responsibility
Train the users one step at a time
Consider doing a high level corporate data model in no
more than three weeks
Look closely at the data extracting, cleaning, and loading
tools
23. Cont..
Determine a plan to test the integrity of the data in the
warehouse
From the start get warehouse users in the habit of 'testing'
complex queries
Coordinate system roll-out with network administration
personnel
Implement a user accessible automated directory to information
stored in the warehouse
24. Data Warehouse Pitfalls
Many warehouse end users will be trained and never or
seldom apply their training
Large scale data warehousing can become an exercise in
data homogenizing
Loading information only because it is available
Providing no maintenance to the data warehouse