A data warehouse is a database designed for query and analysis rather than for transaction processing. An appropriate design leads to a scalable, balanced and flexible architecture that is capable of meeting both present and long-term future needs. This session covers a comparison of the main data warehouse architectures together with best practices for the logical and physical design that support staging, load and querying.
2. About me
Project Manager @
12 years professional experience
.NET Web Development MCPD
SQL Server 2012 (MCSA)
Business Interests
Web Development, SOA, Integration
Security Performance Optimization
Horizon2020, Open BIM, GIS, Mapping
Contact me
ivelin.andreev@icb.bg
www.linkedin.com/in/ivelin
www.slideshare.net/ivoandreev
3. About me
Senior Developer @
.NET Web Development MCPD
Business Interests
Web Development, WCF, Integration
SQL Server – Query Optimization and Tuning
Data Warehousing
Contact me
georgi.mishev@icb.bg
www.linkedin.com/in/georgimishev
5. Agenda
Why Data Warehouse
Main DW Architectures
Dimensional Modeling
Patterns & Practices
DW Maintenance
ETL Process
SSIS Demo
6. Lots of Data Everywhere
Can’t find data?
Data scattered over the network
Can’t get data?
Need an expert to get the data
Can’t understand data?
Data poorly documented
Can’t use data found?
Data needs to be transformed
7. Data Warehouse?
Def: Central repository where data are organized, cleansed and stored in a standardized format.
Integrated
Heterogeneous sources
Data cleansing and conversion ($, €, 元)
Focus on subject
e.g. Customer, Sale, Product
Time variant
Timestamp every key
Historical data (10+ years)
8. Different Problems - Different Solutions
Aspect | OLTP Database | Data Warehouse
Users | Customer | Knowledge worker
Design | Normalized, Data Integrity | Denormalized
Function | Daily operation | Decision making
Data | Current, Detailed | Historical, Aggregated
Usage | Real time | Ad-hoc
Access | Short R/W transactions | Complex R/O queries
Data accessed | Comparatively lower | Large Amounts
# Records | x100 | x1’000’000
# Users | x1’000 | x10
DB Size | x10 GB | x100GB-TB
10. B. Inmon Model
Top-Down Approach
Warehouse (3NF)
Data Mart OLAP (MD)
http://sqlschoolgr.files.wordpress.com/2012/03/clip_image003_thumb.png?w=640h=368
11. R. Kimball Model
Bottom-Up Approach
Data Marts (3NF or MD)
Warehouse OLAP (MD)
http://sqlschoolgr.files.wordpress.com/2012/03/clip_image005_thumb.png?w=640h=369
12. Data Vault (by Dan Linstedt)
Hubs
List of unique business keys
Links
Unique relationships between keys
Satellites
Hub and Link details and history
13. It is irrelevant which camp you belong to…
as long as you understand why!
14. Making Your Choice
• Kimball (MD)
+ Start small, scale big
+ Faster ROI
+ Analytical tools
- Low reusability
• Inmon (3NF)
+ Structured
+ Easy to maintain
+ Easier data mining
- Time-consuming to build
• Data Vault
Backend Data Warehouse
+ Multiple sources; Full history; Incremental build
- Up-front work; Long-term payoff; Many joins
16. Dimensions
Def: The object of BI interest
Keys
Surrogate key
Business key
Hierarchical attributes
Analysis and Drill Down
Member properties
Presentation labels
Auditing information (not for end users)
17. Slowly Changing Dimensions
Def: Scheme for recording changes over time
Type 1 - Overwrite
Type 2 – Multiple Records
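A minimal sketch of the two types, using Python's built-in sqlite3 as a stand-in for the warehouse database. The table DimCustomer, its columns (ValidFrom, ValidTo, IsCurrent) and both helper functions are hypothetical names chosen for illustration, not taken from the session.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE DimCustomer (
        CustomerKey INTEGER PRIMARY KEY,  -- surrogate key
        CustomerId  TEXT NOT NULL,        -- business key
        City        TEXT NOT NULL,
        ValidFrom   TEXT NOT NULL,
        ValidTo     TEXT,                 -- NULL while the row is current
        IsCurrent   INTEGER NOT NULL
    )""")
conn.execute(
    "INSERT INTO DimCustomer (CustomerId, City, ValidFrom, ValidTo, IsCurrent) "
    "VALUES ('C1', 'Sofia', '2013-01-01', NULL, 1)")


def scd_type1(conn, customer_id, city):
    """Type 1: overwrite the attribute in place; the old value is lost."""
    conn.execute(
        "UPDATE DimCustomer SET City = ? WHERE CustomerId = ? AND IsCurrent = 1",
        (city, customer_id))


def scd_type2(conn, customer_id, city, change_date):
    """Type 2: close the current row and insert a new row with a new surrogate key."""
    conn.execute(
        "UPDATE DimCustomer SET ValidTo = ?, IsCurrent = 0 "
        "WHERE CustomerId = ? AND IsCurrent = 1",
        (change_date, customer_id))
    conn.execute(
        "INSERT INTO DimCustomer (CustomerId, City, ValidFrom, ValidTo, IsCurrent) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, city, change_date))


scd_type1(conn, "C1", "Plovdiv")              # Sofia is overwritten, no history
scd_type2(conn, "C1", "Varna", "2014-06-01")  # Plovdiv is kept as a closed row
rows = conn.execute(
    "SELECT CustomerKey, City, ValidFrom, ValidTo, IsCurrent "
    "FROM DimCustomer ORDER BY CustomerKey").fetchall()
```

After both calls the dimension holds two rows for the same business key: the closed Plovdiv row and the current Varna row. Facts loaded before the change keep pointing at the old surrogate key.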
18. Facts
Def: Measurement of a business process
Keys
FKs to all dimension tables (in the star)
PK - Composite (usually) or Surrogate
Measures
Numeric columns that are of interest to the business
Additive, Non-additive, Semi-additive
Factless facts
Auditing information (optional)
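The difference between additive and semi-additive measures can be illustrated with a small sketch. The sample rows and names below are invented for illustration: a sales amount can be summed across every dimension, while an account balance can be summed across accounts but not across time.

```python
from collections import defaultdict

# Hypothetical sales facts: (date_key, product, amount). Amount is additive.
sales_facts = [(20140101, "X", 10), (20140102, "X", 15), (20140102, "Y", 5)]

# Hypothetical daily snapshot facts: (date_key, account, balance).
# Balance is semi-additive: summing over accounts is valid, over dates is not.
balance_facts = [
    (20140101, "A", 100), (20140101, "B", 50),
    (20140102, "A", 120), (20140102, "B", 40),
]

# Additive: a plain sum over all rows is meaningful.
total_sales = sum(amount for _, _, amount in sales_facts)

# Semi-additive: sum across accounts within each day...
balance_by_day = defaultdict(int)
for date_key, _, balance in balance_facts:
    balance_by_day[date_key] += balance

# ...but across time take the last snapshot instead of summing.
closing_balance = balance_by_day[max(balance_by_day)]

# Summing balances over both days double-counts the same money.
naive_sum = sum(balance for _, _, balance in balance_facts)
```

Non-additive measures such as ratios or percentages are usually stored as their additive components and computed at query time.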
20. Data Warehouse Pitfalls
Admit it is not as it seems to be
You need education
Find what is of business value
Rather than focus on performance
Spend a lot of time in Extract-Transform-Load
Homogenize data from different sources
Find (and resolve) problems in source systems
21. Prepare your Sources
Data integrity
Avoid redundancy
Data quality
Master data source
Data validation
Auditing
CreatedDate / CreatedBy
ChangedDate / ChangedBy
Nightly jobs
22. Dimension Design
Business key with non-clustered index
Include date (if dimension has history)
Surrogate key
The smallest possible integer
Clustered index
FK constraints
Do not enforce (WITH NOCHECK)
Document the relation
Faster load
Data validation
Task for the Source system
23. Conformed Dimensions
Def. Having the same meaning and content
when referred from multiple fact tables
Date Dimension
Best candidate for partitioning
Granularity
Do not store every hour, when reporting daily
Avoid surrogate keys
Saves lookup and joins
Integer representing date (yyyyMMdd, days after 1/1/1900)
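Both integer encodings mentioned above can be computed without a lookup table. A small sketch, where the function names are hypothetical and the epoch follows the 1/1/1900 convention from the slide:

```python
from datetime import date

EPOCH = date(1900, 1, 1)


def date_key_yyyymmdd(d):
    """Encode a date as a readable integer, e.g. 2014-03-15 becomes 20140315."""
    return d.year * 10000 + d.month * 100 + d.day


def date_key_days(d):
    """Encode a date as the number of days after 1/1/1900."""
    return (d - EPOCH).days


def date_from_yyyymmdd(key):
    """Decode a yyyyMMdd integer key back into a date."""
    return date(key // 10000, key // 100 % 100, key % 100)


key = date_key_yyyymmdd(date(2014, 3, 15))
```

Because the key is derived from the date itself, the fact load needs no lookup against the date dimension, and the yyyyMMdd form stays readable in ad-hoc queries.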
24. Pre-join Hierarchies
Recursive relationships
Fast drill and report
Pre-computed aggregations
Hierarchy Bridge
For each dimension row
1 association with self
1 row for each subordinate
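The bridge rows described above can be generated from a parent-child mapping. A minimal sketch, assuming the hierarchy has no cycles and every parent is itself a key in the mapping; build_bridge and the sample data are hypothetical:

```python
def build_bridge(parent_of):
    """Return (ancestor, descendant, depth) rows for a parent-child hierarchy.

    Every member gets one row with itself (depth 0) and one row for
    each ancestor above it, so each ancestor ends up with one row per
    subordinate at any level.
    """
    rows = []
    for member in parent_of:
        ancestor, depth = member, 0
        while ancestor is not None:
            rows.append((ancestor, member, depth))
            ancestor = parent_of[ancestor]  # walk one level up
            depth += 1
    return sorted(rows)


# Hypothetical organisation: CEO -> VP -> Dev
parent_of = {"CEO": None, "VP": "CEO", "Dev": "VP"}
bridge = build_bridge(parent_of)

# All subordinates of the CEO, at any depth, without a recursive query
ceo_subordinates = sorted(d for a, d, depth in bridge if a == "CEO" and depth > 0)
```

Joining facts to the bridge on the descendant column and filtering on the ancestor column rolls up a whole subtree with a plain join.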
25. Determine the Facts
The center of a Star schema
Identify subject areas
Identify key business events
Identify dimensions
Start from OLTP logical model
Identify historical requirements
Identify attributes
26. The Grain
Def: The level of detail of a fact table
What is the business objective?
Fine grain - behaviour and frequency analysis
Coarse grain - overall and trend analysis
Aggregates
DO NOT summarize prematurely
DO NOT mix detail and summary
DO use “summary tables”
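One way to follow these rules is to keep the fact table at its finest grain and derive a separate summary table from it. A sketch using sqlite3 as a stand-in database; FactSales, FactSalesDaily and the sample rows are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Detail fact table: one row per sale line (the fine grain)
conn.execute(
    "CREATE TABLE FactSales "
    "(DateKey INTEGER, ProductKey INTEGER, Quantity INTEGER, Amount REAL)")
conn.executemany("INSERT INTO FactSales VALUES (?, ?, ?, ?)", [
    (20140101, 1, 2, 20.0),
    (20140101, 1, 1, 10.0),
    (20140101, 2, 1, 7.5),
    (20140102, 1, 3, 30.0),
])

# Summary table at day/product grain, stored separately from the detail
conn.execute("""
    CREATE TABLE FactSalesDaily AS
    SELECT DateKey, ProductKey,
           SUM(Quantity) AS Quantity,
           SUM(Amount)   AS Amount
    FROM FactSales
    GROUP BY DateKey, ProductKey""")

daily = conn.execute(
    "SELECT DateKey, ProductKey, Quantity, Amount "
    "FROM FactSalesDaily ORDER BY DateKey, ProductKey").fetchall()
```

The detail rows stay available for behaviour and frequency analysis, while trend reports read the smaller summary table. The two grains never share a table.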
27. C3-PO is fluent in 6M forms of communication.
What about your customers?
28. Multinational DW
What parts need translation?
Where to store various language versions?
How to support future languages?
Dimensions
Add language attribute
Include text data in the dimension
Problem 1: The dimension key?
Replicate PK for every language
Fact.DimId = Dim.Id AND Dim.Lang=[Lang]
Problem 2: Storage = [Dim] x [Lang]
Sub-dimension with language attributes
TxtId | Attr1 | Attr2 | LangId
1 | large | Yes | En
2 | small | No | En
1 | stor | Ja | No
2 | liten | Nei | No
3 | … | … | …
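The language sub-dimension above can be queried by adding the language to the join condition. A simplified sketch using sqlite3, in which the fact table references TxtId directly; the table names DimText and FactSales, the amounts, and the function sales_by_label are assumptions made for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DimText (
        TxtId  INTEGER,
        LangId TEXT,
        Attr1  TEXT,
        Attr2  TEXT,
        PRIMARY KEY (TxtId, LangId)   -- same TxtId repeated per language
    );
    CREATE TABLE FactSales (TxtId INTEGER, Amount REAL);

    INSERT INTO DimText VALUES
        (1, 'En', 'large', 'Yes'), (2, 'En', 'small', 'No'),
        (1, 'No', 'stor',  'Ja'),  (2, 'No', 'liten', 'Nei');
    INSERT INTO FactSales VALUES (1, 100.0), (2, 40.0), (1, 60.0);
""")


def sales_by_label(conn, lang):
    """Total sales per text label, shown in the requested language."""
    return conn.execute("""
        SELECT t.Attr1, SUM(f.Amount)
        FROM FactSales f
        JOIN DimText t ON f.TxtId = t.TxtId AND t.LangId = ?
        GROUP BY t.TxtId, t.Attr1
        ORDER BY t.TxtId""", (lang,)).fetchall()


english = sales_by_label(conn, "En")
norwegian = sales_by_label(conn, "No")
```

The facts are stored once; only the labels change with the language parameter, and a future language means new rows in the text table rather than a schema change.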
31. Partitioning
Why
Faster index maintenance
Faster load
Faster queries
When
Tables 10GB+
How
Do not partition dimension tables
Partition by date (most analyses are time-based)
Eliminate partitions (WHERE [PartitionKey]=…)
Avoid split and merge of existing partitions
Can cause inefficient log generation
32. Columnstore Index
Non-clustered in SQL 2012
Clustered in SQL 2014
Pros
Better data compression
High performance on table scan
Clustered CSI Limitations
No other indexes allowed
Little advantage on seek operations
No XML, computed column or replication
34. Efficient Load Process
Use simple recovery model during data load
Staging
Avoid indexing
Populate in parallel
Maintain DW
Disable indexes on load
Rebuild manually after load
Automatic statistics updates slow down SQL Server
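The "disable indexes, load, rebuild" pattern can be sketched with sqlite3 as a stand-in. SQLite has no way to disable an index, so the sketch drops and recreates it; on SQL Server the corresponding statements would be ALTER INDEX ... DISABLE and ALTER INDEX ... REBUILD. The table, index name and bulk_load function are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE FactSales (DateKey INTEGER, ProductKey INTEGER, Amount REAL)")
conn.execute("CREATE INDEX IX_FactSales_DateKey ON FactSales (DateKey)")


def bulk_load(conn, rows):
    """Load rows without index maintenance, then rebuild the index once."""
    # 1. Remove the index so inserts do not update it row by row
    conn.execute("DROP INDEX IF EXISTS IX_FactSales_DateKey")
    # 2. Insert the whole batch in a single transaction
    with conn:
        conn.executemany("INSERT INTO FactSales VALUES (?, ?, ?)", rows)
    # 3. Rebuild the index once over the loaded data
    conn.execute("CREATE INDEX IX_FactSales_DateKey ON FactSales (DateKey)")


rows = [(20140101 + i % 28, i % 5, float(i)) for i in range(1000)]
bulk_load(conn, rows)

loaded = conn.execute("SELECT COUNT(*) FROM FactSales").fetchone()[0]
index_names = [name for (name,) in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")]
```

Building the index once over the full batch is typically cheaper than maintaining it for every inserted row, which is the reasoning behind the rule above.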
35. To SSIS, or not to SSIS?
Pros
Little to no coding
Extensive support of various data sources
Parallel execution of migration tasks
Better organization of the ETL process
Cons
Another way of thinking
Hidden options
A T-SQL developer would do it much faster
Auto-generated flows need optimization
Sometimes it simply does not work (e.g. Sort by GUID)
37. Takeaways
Books
The Data Warehouse Toolkit (3rd ed), Ralph Kimball
Implementing DW with Microsoft SQL Server 2012
Data Warehousing Fundamentals, Paulraj Ponniah
Articles
Best Practices in Data Warehouse (Hanover Research Council)
http://www.kimballgroup.com/category/design-tips/
http://sqlmag.com/business-intelligence
Resources
http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/
http://www.databaseanswers.org/data_models/index.htm