DecisionLab.Net
business intelligence is business performance
___________________________________________________________________________________________________________________________________________________________________________________

Data Vault:
What is it?
Where does it fit?
- SQL Saturday #249
_________________________________________________________________________________________________________________________________________________________________________________________________________________

DecisionLab.Net

http://www.decisionlab.net
http://blog.decisionlab.net

dupton@decisionlab.net
Carlsbad, California, USA
daniel upton
business intelligence analyst
certified scrum master


Reference: Database diagrams in this presentation are adaptations or expansions of those published in the article
“Data Warehouse Generation Algorithm Explained”, available at…
http://www.dwhautomation.org/data-warehouse-generation-algorithm-explained/

__________________________________________________________________________________________________________________________________________________________________________________

Page 2 of 24
Real-World BI-DW Implementation Questions:
As an ETL Developer who often waits weeks or months without capturing any source data
history while a new DW / DM target data model gets designed, would you instead like to –
without compromising the star/snowflake schema design – quickly and with some
automation, define and load a robust, history-tracking DW / staging repository?
As a DW Architect, have recent requests from project sponsors – requests that your DW
reporting / analytics (BI) solution also supply authoritative data into upcoming Master Data
Management or Enterprise Data Quality solutions – got you nervous about your planned or
(oops) already completed data transformations that were supposed to be scope-limited to
BI only? When you remind them of the initially-agreed scope …they shrug!
After six months of production data loads from a source system that does not track historic
data changes, you discover that your DW-loading logic is wrong, and of course your staging
area is overwritten with each cycle.

__________________________________________________________________________________________________________________________________________________________________________________

Page 3 of 24
Data Vault resolves each of the above challenges.
This session will demonstrate this claim while familiarizing you
with Data Vault design fundamentals, briefly explore its
potential for automation, and consider where it fits.

__________________________________________________________________________________________________________________________________________________________________________________

Page 4 of 24
List of Entering Assumptions:
Disk storage is sufficiently cheap
Automation of back-end DW development tasks is appealing.
Source data is in an RDBMS with at least a resemblance to 3rd normal form
Source data exists in disparate systems OR one system with poor data quality OR
with the inability to efficiently track historic data changes.
A non-volatile back-end data repository between operational systems and the BI-DW
presentation layer (eg. Star Schema) is desired.
Time-latency requirements do give us an ample time window to load the repository
and then again transform data from there into the presentation layer.
A proliferation, and substantial increase in the number, of tables is tolerable as long
as both the design and loading of the schema is straightforward and, to some extent,
automatable.

__________________________________________________________________________________________________________________________________________________________________________________

Page 5 of 24
High-Level Introduction to Data Vault Methodology:
We begin with a simple OLTP database design for sales transactions, plus a small excerpt of tables from ERP and CRM schemas.
For illustration purposes, I include a minimum of tables and fields. In the diagrams, ‘BK’ means business key, ‘FK’ means foreign
key. Refer to Diagram A below.
This OLTP schema uses no surrogate keys. If a client gets a new email address, or a product gets a new name, or a city’s remapping of boundary lines suddenly places an existing store in a new city, then for any given business key, new non-key values
overwrite old values, which are thereby lost. Of course, in order to preserve history, history-tracking surrogate keys are
commonly used by practitioners of both W.H. Inmon’s classic third-normal form (3NF) EDW design and Dr. Ralph Kimball’s
Star Schema method. Both of these methods, however, prescribe surrogate keys within the context of data transformations that also
include subjective interpretation (herein simply ‘subjective transformation’) in order to cleanse or enhance the data for the
purposes of integration to serve reporting or analytic needs. Data Vault purists claim that any such subjective transformation of
line-of-business data introduces distortion to it, thereby disqualifying the Data Warehouse / Mart as a system of record. Data
Vault, by contrast, provides a simple yet unique way to track historical changes from source data while eliminating most, or all,
subjective transformations such as data-quality filters, establishment of hierarchies, calculated fields, or target/goal values.
Analytics-driven, subjective transformations should still be applied for BI, but they are applied downstream of the Data
Vault EDW, as subsequent custom transformations for loads into data marts designed to analyze specific business processes.
Back upstream, the Data Vault accomplishes historic change-tracking using a generic table-deconstructing approach that I will
now describe. Before beginning, I recommend against too-quickly comparing this method to others, like star-schema design,
which serve different needs.
__________________________________________________________________________________________________________________________________________________________________________________

Page 6 of 24
Diagram A: Excerpts from three operational OLTP schemas (data sources for Data Vault)

__________________________________________________________________________________________________________________________________________________________________________________

Page 7 of 24
Diagram B: Sales Transaction Only

__________________________________________________________________________________________________________________________________________________________________________________

Page 8 of 24
Diagram C: Hubs and Satellite in Source B’s partially-designed Data Vault schema

__________________________________________________________________________________________________________________________________________________________________________________

Page 9 of 24
Fundamentally, Data Vault prescribes three types of tables: Hubs, Satellites, and Links. Let’s use Diagram B’s Client table as our example. Hubs
and Satellites have the following characteristics:

Hub Tables:
Define the granularity of an entity (eg. Client), and thus the granularity of non-key attributes (eg. name, description) of the entity.
Contain a new surrogate primary key (PK), as well as the source table’s business key, demoted from its PK role.
Contain no non-key attribute fields such as name, address, email, telephone.
Satellite Tables:
Contain all non-key fields (attributes), plus a set of date-stamp fields
Contain, as a Foreign Key (FK), the Hub’s PK, plus load date-time stamps.
Have a defining, dependent entity relationship to one, and only one, parent table.
Whether that parent table is a Hub or Link, the Satellite holds the non-key attribute fields from the parent table.
Although on initial loads, only one Satellite row will exist for each corresponding Hub row, whenever a non-key attribute changes
upstream in the OLTP schema (eg. a client’s email address changes, too-often accomplished with an over-write), a new row will be added
only to the Satellite, but not the Hub, which is why many Satellite rows will relate to one Hub row. So, in this fashion, historic changes
within source tables are gracefully tracked in the Data Vault.
Notice, in Diagram C that, among other tables, the Client_h_s Satellite table is dependent on the Client_h Hub table, but that, at this stage in our
design, the Client_h Hub is not yet related to Order_h Hub. When we add Links, those relationships will appear. But first, have a look at the
tables, the new location of existing fields, and the various added date-time stamps.
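The Hub and Satellite behaviour described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table names Client_h and Client_h_s come from Diagram C, but the column names (client_sk, client_bk, load_date, load_date_end), the sample client, and the load_client helper are hypothetical and not taken from the diagrams.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Client_h (
    client_sk INTEGER PRIMARY KEY,      -- new surrogate PK (assumed name)
    client_bk TEXT NOT NULL UNIQUE,     -- source business key, demoted from PK
    load_date TEXT NOT NULL
);
CREATE TABLE Client_h_s (
    client_sk     INTEGER NOT NULL REFERENCES Client_h (client_sk),
    load_date     TEXT NOT NULL,
    load_date_end TEXT,                 -- NULL marks the current row
    name          TEXT,
    email         TEXT,
    PRIMARY KEY (client_sk, load_date)
);
""")

def load_client(conn, client_bk, name, email, load_date):
    """Insert the Hub row once; add a Satellite row only when attributes change."""
    conn.execute(
        "INSERT OR IGNORE INTO Client_h (client_bk, load_date) VALUES (?, ?)",
        (client_bk, load_date),
    )
    (client_sk,) = conn.execute(
        "SELECT client_sk FROM Client_h WHERE client_bk = ?", (client_bk,)
    ).fetchone()
    current = conn.execute(
        "SELECT name, email FROM Client_h_s "
        "WHERE client_sk = ? AND load_date_end IS NULL",
        (client_sk,),
    ).fetchone()
    if current == (name, email):
        return  # nothing changed upstream, so no new Satellite row
    # Close the previous version (if any), then add the new one.
    conn.execute(
        "UPDATE Client_h_s SET load_date_end = ? "
        "WHERE client_sk = ? AND load_date_end IS NULL",
        (load_date, client_sk),
    )
    conn.execute(
        "INSERT INTO Client_h_s (client_sk, load_date, name, email) "
        "VALUES (?, ?, ?, ?)",
        (client_sk, load_date, name, email),
    )

# Three load cycles: initial load, an email change, then an unchanged reload.
load_client(conn, "C-100", "Ann Lee", "ann@old.example", "2013-09-01")
load_client(conn, "C-100", "Ann Lee", "ann@new.example", "2013-09-15")
load_client(conn, "C-100", "Ann Lee", "ann@new.example", "2013-09-16")

hub_rows = conn.execute("SELECT COUNT(*) FROM Client_h").fetchone()[0]
sat_rows = conn.execute("SELECT COUNT(*) FROM Client_h_s").fetchone()[0]
```

After the three loads, the Hub holds one row for the client while the Satellite holds one row per distinct version of the attributes, so the overwritten email address is still available.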

__________________________________________________________________________________________________________________________________________________________________________________

Page 10 of 24
Diagram D: Data Vault Schema w/ Link tables added: Complete for Source B only

__________________________________________________________________________________________________________________________________________________________________________________

Page 11 of 24
Link Tables:
See Diagram D
Links relate exactly two Hubs together.
Links contain, now as non-key values, the primary keys of the two Hubs, plus their own surrogate PK.
Peg-leg links are special links in that they only relate to one Hub. More on this later.
As with an ordinary association / join table, a Link is a child to both of the two Hubs it relates and, as such, it is able to gracefully handle the odd relative
changes in cardinality between the two tables, and it cleanly supports many-to-many relationships that are stored in the source system and which otherwise
either cause load-failing errors in the data-loading process or require ad-hoc, hard-coded data cleansing.
Unlike an ordinary association table, the Link table, with its own surrogate PK in conjunction with date-stamp fields in both Hubs, allows us to track historic
changes in the relationship itself between the two Hubs, and thus between their two directly-related OLTP source tables. Specifically, all loaded data that
conformed with the initial cardinality between tables would share the same Link table surrogate key, but if an unexpected, future source data change
caused a cardinality reversal (so that the one becomes the many, and vice versa), a new row, with a new surrogate key, is generated to
capture it, while the original surrogate key preserves the historical relationship. Slick!
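A minimal sketch of a Link, again using sqlite3, may help make the above concrete. The Link name Order_Client_l, all column names, and the load_link helper are assumptions for illustration; only Client_h and Order_h are named in the presentation. The point shown is that a relationship the source model did not anticipate (one order tied to a second client) simply becomes a new Link row with a new surrogate key, rather than a load failure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Client_h (client_sk INTEGER PRIMARY KEY, client_bk TEXT UNIQUE);
CREATE TABLE Order_h  (order_sk  INTEGER PRIMARY KEY, order_bk  TEXT UNIQUE);
CREATE TABLE Order_Client_l (
    order_client_sk INTEGER PRIMARY KEY,   -- the Link's own surrogate PK
    order_sk  INTEGER NOT NULL REFERENCES Order_h (order_sk),
    client_sk INTEGER NOT NULL REFERENCES Client_h (client_sk),
    load_date TEXT NOT NULL,
    UNIQUE (order_sk, client_sk)           -- one row per distinct relationship
);
INSERT INTO Client_h (client_bk) VALUES ('C-100'), ('C-200');
INSERT INTO Order_h  (order_bk)  VALUES ('O-1');
""")

def load_link(conn, order_bk, client_bk, load_date):
    """Record that an order and a client are related; repeats are ignored."""
    conn.execute(
        """
        INSERT OR IGNORE INTO Order_Client_l (order_sk, client_sk, load_date)
        SELECT o.order_sk, c.client_sk, ?
        FROM Order_h AS o, Client_h AS c
        WHERE o.order_bk = ? AND c.client_bk = ?
        """,
        (load_date, order_bk, client_bk),
    )

load_link(conn, "O-1", "C-100", "2013-09-01")
load_link(conn, "O-1", "C-100", "2013-09-02")  # same relationship: no new row
load_link(conn, "O-1", "C-200", "2013-09-03")  # order now also tied to C-200

link_rows = conn.execute(
    "SELECT order_client_sk, order_sk, client_sk FROM Order_Client_l "
    "ORDER BY order_client_sk"
).fetchall()
```

The Link ends up with two rows for the same order, each with its own surrogate key, and the earlier relationship is left untouched.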
Limits to Automated Logic for Data Vault Design
Note that the OLTP Details table was transformed not into a Hub-and-Satellite combination, but rather into a Link table, which seems valid insofar as Order
Details can be considered to simply be a direct relationship between an Order and a Product. This logic may or may not be fully automatable.
In a more sophisticated Data Vault schema than this one, we might go further by adding load_date and load_date_end date-stamp fields to Link tables,
too. As an (admittedly strange) example, the Order_Store_l Link table might conceivably get date-time stamp fields so that, in coordination with its surrogate
PK, an Order (perhaps for a long-running service) that, after the Order Date, gets re-credited to a different store can be efficiently tracked over time in this
way.
Of course, with this last enhancement, we’re probably crossing the line from ‘automatable’ to ‘custom’ Data Vault design.
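The date-stamped Link idea can be sketched as follows. Order_Store_l, load_date and load_date_end are named in the text above; the other column names, the two helper functions, and the sample keys and dates are hypothetical, and the Hub tables are omitted to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Order_Store_l (
    order_store_sk INTEGER PRIMARY KEY,  -- the Link's surrogate PK
    order_sk       INTEGER NOT NULL,     -- would reference the Order Hub
    store_sk       INTEGER NOT NULL,     -- would reference the Store Hub
    load_date      TEXT NOT NULL,
    load_date_end  TEXT                  -- NULL while the relationship is current
);
""")

def credit_order(conn, order_sk, store_sk, load_date):
    """End-date the order's current store link, then add the new one."""
    row = conn.execute(
        "SELECT store_sk FROM Order_Store_l "
        "WHERE order_sk = ? AND load_date_end IS NULL",
        (order_sk,),
    ).fetchone()
    if row is not None and row[0] == store_sk:
        return  # already credited to this store
    conn.execute(
        "UPDATE Order_Store_l SET load_date_end = ? "
        "WHERE order_sk = ? AND load_date_end IS NULL",
        (load_date, order_sk),
    )
    conn.execute(
        "INSERT INTO Order_Store_l (order_sk, store_sk, load_date) "
        "VALUES (?, ?, ?)",
        (order_sk, store_sk, load_date),
    )

def store_as_of(conn, order_sk, as_of):
    """Return the store credited with the order on a given date, if any."""
    row = conn.execute(
        "SELECT store_sk FROM Order_Store_l "
        "WHERE order_sk = ? AND load_date <= ? "
        "AND (load_date_end IS NULL OR load_date_end > ?)",
        (order_sk, as_of, as_of),
    ).fetchone()
    return row[0] if row else None

credit_order(conn, order_sk=1, store_sk=10, load_date="2013-01-05")
credit_order(conn, order_sk=1, store_sk=20, load_date="2013-06-01")
```

With ISO-formatted date strings, store_as_of(conn, 1, "2013-03-01") finds the original store and store_as_of(conn, 1, "2013-07-01") finds the re-credited one.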

__________________________________________________________________________________________________________________________________________________________________________________

Page 12 of 24
Summary: Steps in Basic Data Vault Automation Logic:
A. Buy or build application to automate Data Vault schema design and/or Data Vault ETL code.
B. Based on each OLTP table’s primary and foreign key structure, auto-tag each table as Hub, Satellite or Link
C. Human review, overruling certain automated quick-tags with enhancements.
D. Using either custom-built logic or purchased design-automation application, auto-generate Data Vault DDL.
E. Using same, auto-generate ETL code for loading Data Vault.

A: Buy or Build:
Roll your own with macros in ERwin, ER Studio, etc.
Buy: Consider BIReady, QUIPU, WhereScape Red
No silver bullets

__________________________________________________________________________________________________________________________________________________________________________________

Page 13 of 24
B. Auto-Tag: Hub, Satellite or Link -- Notice how simple (thus automatable) these rules are
Rules: A bottom-up process of identifying ‘non-Hubs’
o Satellite Auto-Tag: A table that no other table’s foreign key references, and whose own foreign-key field
also acts as its primary key, with the primary key containing no other fields.
o Peg-Leg Link Auto-Tag: Exactly the same as above, except that the primary key contains one or more additional fields. Such a table is a candidate to
become a peg-leg link.
o Link Auto-Tag: A table that no other table’s foreign key references, and which has more than one foreign key, with all
foreign keys collectively contained within the primary key. The primary key may include other fields, as well.
   - As a reminder, many, perhaps most, Links are created not directly from individual source tables, but rather from the direct
(existing or to-be-designed) relationships between source tables. In this case it does not matter whether the primary key is wider than
all the foreign keys together.
o Hub Auto-Tag: A table which does not fit one of the above rules is a Hub.
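The auto-tag rules above are simple enough to sketch in a few lines of Python. This is one possible reading, not the logic of any particular tool: the Table class, the auto_tag function, and the sample schema (including the Client_Extra and Client_Note tables) are hypothetical, and the Satellite and Peg-Leg rules are assumed to apply to tables with exactly one foreign key.

```python
from dataclasses import dataclass, field

@dataclass
class Table:
    name: str
    primary_key: frozenset
    # Each foreign key is a pair: (columns in this table, referenced table name).
    foreign_keys: list = field(default_factory=list)

def auto_tag(tables):
    """Tag each table as Hub, Satellite, Peg-Leg Link or Link."""
    referenced = {ref for t in tables for _, ref in t.foreign_keys}
    tags = {}
    for t in tables:
        fk_cols = [frozenset(cols) for cols, _ in t.foreign_keys]
        all_fk_cols = frozenset().union(*fk_cols) if fk_cols else frozenset()
        if t.name in referenced or not fk_cols:
            tags[t.name] = "Hub"            # referenced by others, or no FKs
        elif len(fk_cols) == 1 and fk_cols[0] == t.primary_key:
            tags[t.name] = "Satellite"      # the FK is the whole PK
        elif len(fk_cols) == 1 and fk_cols[0] < t.primary_key:
            tags[t.name] = "Peg-Leg Link"   # the FK plus extra PK fields
        elif len(fk_cols) > 1 and all_fk_cols <= t.primary_key:
            tags[t.name] = "Link"           # all FKs contained in the PK
        else:
            tags[t.name] = "Hub"            # fits none of the rules above
    return tags

schema = [
    Table("Client", frozenset({"client_id"})),
    Table("Product", frozenset({"product_id"})),
    Table("Orders", frozenset({"order_id"}), [({"client_id"}, "Client")]),
    Table("Details", frozenset({"order_id", "product_id"}),
          [({"order_id"}, "Orders"), ({"product_id"}, "Product")]),
    Table("Client_Extra", frozenset({"client_id"}), [({"client_id"}, "Client")]),
    Table("Client_Note", frozenset({"client_id", "note_no"}),
          [({"client_id"}, "Client")]),
]
tags = auto_tag(schema)
```

In this sample, Details is tagged as a Link, which matches the treatment of the order details table elsewhere in this presentation, and the tags remain subject to the human review in step C.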

C. Human Review: Overrule certain automated quick-tags with enhancements based on experience with the database, data and business.
See above section on ‘Limits to Automated Logic for Data Vault Design’
D. Generate DDL Code
E. Generate logic for ETL Code
F. Capture ETL code and set up scheduled Data Vault loads
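Step D can be illustrated with a small generator that turns a table description into Hub and Satellite DDL. This is a rough sketch, not the output of any of the tools named earlier: the hub_and_satellite_ddl function, the _h / _h_s naming, the column names, and the use of TEXT for the business key and date stamps are all assumptions. The generated statements are run against an in-memory SQLite database only to confirm they are valid.

```python
import sqlite3

def hub_and_satellite_ddl(table, business_key, attributes):
    """Return CREATE TABLE statements for a Hub and its Satellite."""
    hub, sat, sk = f"{table}_h", f"{table}_h_s", f"{table.lower()}_sk"
    attr_cols = "".join(f"    {col} {sql_type},\n" for col, sql_type in attributes)
    hub_ddl = (
        f"CREATE TABLE {hub} (\n"
        f"    {sk} INTEGER PRIMARY KEY,\n"
        f"    {business_key} TEXT NOT NULL UNIQUE,\n"
        f"    load_date TEXT NOT NULL\n"
        f");"
    )
    sat_ddl = (
        f"CREATE TABLE {sat} (\n"
        f"    {sk} INTEGER NOT NULL REFERENCES {hub} ({sk}),\n"
        f"    load_date TEXT NOT NULL,\n"
        f"    load_date_end TEXT,\n"
        f"{attr_cols}"
        f"    PRIMARY KEY ({sk}, load_date)\n"
        f");"
    )
    return [hub_ddl, sat_ddl]

statements = hub_and_satellite_ddl(
    "Client", "client_bk", [("name", "TEXT"), ("email", "TEXT")]
)

# Execute the generated DDL to check that it is well-formed.
conn = sqlite3.connect(":memory:")
for statement in statements:
    conn.execute(statement)
created = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
)]
```

A real generator would read the table description from the source database catalog and would also emit Link DDL; both are left out here to keep the sketch short.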
__________________________________________________________________________________________________________________________________________________________________________________

Page 14 of 24
Diagram E: Data Vault Schema: Complete for (excerpts of) Sources A through C

__________________________________________________________________________________________________________________________________________________________________________________

Page 15 of 24
Diagram F: Data Vault Schema - Integrated

Data Vault Spanning Multiple Operational Databases

__________________________________________________________________________________________________________________________________________________________________________________

Page 16 of 24
Diagram G: Summary of OLTP into Data Vault

OLTP Source

Data Vault

__________________________________________________________________________________________________________________________________________________________________________________

Page 17 of 24
In Diagram G…
…We note that the source schema’s seven tables just morphed into the Data Vault’s eighteen. When you consider that an
order detail record (a line item) is really just the association between an Order and a Product (albeit an association with plenty
of vital associated data), then it makes sense that the Link table Details_l was created. This Link table, whose sole purpose is to
relate the Orders_h and Products_h tables, of course, also needs a Details_l_s Satellite table to hold the show-stopper non-key
attributes, Quantity and Unit Price.
The Data Vault method does allow for some interpretation here. You might now be thinking, “Aha! So, we haven’t eliminated
all subjective interpretation!” Perhaps not, but what I’ll describe here is a pretty small, generic interpretation. Either way, in
this situation, it would not be patently wrong to design a Details_h Hub table (plus, of course, a Details_h_s Satellite), rather
than the Details_l Link. Added to that, if we use very simple Data-Vault design automation logic, which simply de-constructs all
tables into Hub and Satellite pairs, this is what we would get. However, keep in mind that if we did that, we would then have to
create not one, but two Link tables, specifically Order_Order_Details_l Link table and Product_Order_Details_l Link table to
connect our tables, and these tables would contain no attributes of apparent value. Therefore, we choose the design that
leaves us with a simpler, more efficient Data Vault design. By the way, this logic can easily be automated, but that’s beyond
the scope of this article.
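To make the chosen design concrete, here is a sketch of the Details_l Link with its Details_l_s Satellite, using sqlite3. The table names and the Quantity and Unit Price attributes come from the text above; the key column names, the sample rows, and the query are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Orders_h   (order_sk   INTEGER PRIMARY KEY, order_bk   TEXT UNIQUE);
CREATE TABLE Products_h (product_sk INTEGER PRIMARY KEY, product_bk TEXT UNIQUE);
CREATE TABLE Details_l (
    details_sk INTEGER PRIMARY KEY,      -- the Link's surrogate PK
    order_sk   INTEGER NOT NULL REFERENCES Orders_h (order_sk),
    product_sk INTEGER NOT NULL REFERENCES Products_h (product_sk),
    UNIQUE (order_sk, product_sk)
);
CREATE TABLE Details_l_s (               -- Satellite whose parent is the Link
    details_sk INTEGER NOT NULL REFERENCES Details_l (details_sk),
    load_date  TEXT NOT NULL,
    quantity   INTEGER,
    unit_price REAL,
    PRIMARY KEY (details_sk, load_date)
);
INSERT INTO Orders_h   VALUES (1, 'O-1');
INSERT INTO Products_h VALUES (1, 'P-7');
INSERT INTO Details_l  VALUES (1, 1, 1);
INSERT INTO Details_l_s VALUES (1, '2013-09-01', 2, 9.50);
INSERT INTO Details_l_s VALUES (1, '2013-09-04', 3, 9.50);  -- quantity corrected
""")

# Latest version of each line item, resolved back to business keys.
latest = conn.execute("""
    SELECT o.order_bk, p.product_bk, s.quantity, s.unit_price
    FROM Details_l AS l
    JOIN Orders_h    AS o ON o.order_sk   = l.order_sk
    JOIN Products_h  AS p ON p.product_sk = l.product_sk
    JOIN Details_l_s AS s ON s.details_sk = l.details_sk
    WHERE s.load_date = (SELECT MAX(load_date) FROM Details_l_s
                         WHERE details_sk = l.details_sk)
""").fetchone()
```

This design needs two tables for the line items (the Link and its Satellite), whereas the Hub-based alternative described above would need four: Details_h, Details_h_s, and the two extra Link tables.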

__________________________________________________________________________________________________________________________________________________________________________________

Page 18 of 24
Diagram H:

Data Vault Supplies Authoritative Enterprise Data
To Multiple Target Applications

Custom transformations with
subjective data
re-interpretation

__________________________________________________________________________________________________________________________________________________________________________________

Page 19 of 24
Diagram I

Data Vault Supplies Authoritative Enterprise Data
To Multiple Target Applications

Custom transformations with
subjective data
re-interpretation

“The Data Vault is the optimal
choice for modeling the EDW
in the DW 2.0 framework”
- W.H. Inmon
What about the 3NF EDW?

Data Vault: Staging or EDW?

__________________________________________________________________________________________________________________________________________________________________________________

Page 20 of 24
Diagram J

Data Vault Supplies Authoritative Enterprise Data
To Multiple Target Applications
What else?

If Data Vault provides an authoritative, non-volatile, history-tracking RDBMS repository
with robust yet forgiving referential integrity, while imposing little or no subjective data
reinterpretation from multiple operational systems, then it has benefits for target systems
beyond BI-DW:

Data Quality Management:
Data Vault gracefully and permanently stores the good, the bad and the ugly (outliers,
RFI violations, etc) and their improvement (or lack thereof) over time.

Master Data Management:
* A more appropriate MDM data source than Dimensional Data Marts.
* In closed-loop MDM: Data Vault feeds operational data to MDM, which then publishes
improved Master Data back for operational systems, which continue to automatically feed
Data Vault, so Data Vault captures MDM adoption levels across multiple operational systems.

(Diagram labels carried over from Diagram I: Custom transformations with subjective data re-interpretation; the W.H. Inmon quotation; What about the 3NF EDW?; Data Vault: Staging or EDW?)

__________________________________________________________________________________________________________________________________________________________________________________

Page 21 of 24
Review: Real-World Questions and Entering Assumptions:

Real-World BI-DW Implementation Questions:
ETL Developer waiting without capturing source data history while data model gets
designed. Instead, define and load a robust, history-tracking DW / staging repository?
Requests that BI-DW solution supply authoritative data into MDM or DQ solution. What
about ETL that was supposed to be BI-only?
Six months post-release, you discover that your DW-loading logic is wrong and significant
source data has been overwritten.

__________________________________________________________________________________________________________________________________________________________________________________

Page 22 of 24
Entering Assumptions:
Disk storage is sufficiently cheap
Automation of back-end DW development tasks is appealing.
Source data is in an RDBMS with at least a resemblance to 3rd normal form
Source data exists in disparate systems OR one system with poor data quality OR with the
inability to efficiently track historic data changes.
A non-volatile back-end data repository between operational systems and the BI-DW
presentation layer (eg. Star Schema) is desired.
Time-latency requirements do give us an ample time window to load the repository and
then again transform data from there into the presentation layer.
A proliferation, and substantial increase in the number, of tables is tolerable as long as both
the design and loading of the schema is straightforward and, to some extent, automatable.

Conclusion: Data Vault can indeed resolve these challenges.
__________________________________________________________________________________________________________________________________________________________________________________

Page 23 of 24
Thank you!

DecisionLab.Net
business intelligence is business performance
_____________________________________________________

daniel upton
business intelligence analyst
certified scrum master
dupton@decisionlab.net
http://www.linkedin.com/in/DanielUpton
http://www.slideshare.net/DanielUpton
__________________________________________________________________________________

__________________________________________________________________________________________________________________________________________________________________________________

Page 24 of 24

More Related Content

What's hot

Data warehousing unit 6.2
Data warehousing unit 6.2Data warehousing unit 6.2
Data warehousing unit 6.2WE-IT TUTORIALS
 
Management of Bi-Temporal Properties of Sql/Nosql Based Architectures – A Re...
Management of Bi-Temporal Properties of  Sql/Nosql Based Architectures – A Re...Management of Bi-Temporal Properties of  Sql/Nosql Based Architectures – A Re...
Management of Bi-Temporal Properties of Sql/Nosql Based Architectures – A Re...lyn kurian
 
Multidimensional Database Design & Architecture
Multidimensional Database Design & ArchitectureMultidimensional Database Design & Architecture
Multidimensional Database Design & Architecturehasanshan
 
BI-TEMPORAL IMPLEMENTATION IN RELATIONAL DATABASE MANAGEMENT SYSTEMS: MS SQ...
BI-TEMPORAL IMPLEMENTATION IN  RELATIONAL DATABASE  MANAGEMENT SYSTEMS: MS SQ...BI-TEMPORAL IMPLEMENTATION IN  RELATIONAL DATABASE  MANAGEMENT SYSTEMS: MS SQ...
BI-TEMPORAL IMPLEMENTATION IN RELATIONAL DATABASE MANAGEMENT SYSTEMS: MS SQ...lyn kurian
 
BI Architecture in support of data quality
BI Architecture in support of data qualityBI Architecture in support of data quality
BI Architecture in support of data qualityTom Breur
 
Data warehousing unit 4.2
Data warehousing unit 4.2Data warehousing unit 4.2
Data warehousing unit 4.2WE-IT TUTORIALS
 
Data Warehouse 101 - U W Guest Lecture
Data Warehouse 101 - U W Guest LectureData Warehouse 101 - U W Guest Lecture
Data Warehouse 101 - U W Guest LectureNicholas Goodman
 
Crystal xcelsius best practices and workflows for building enterprise solut...
Crystal xcelsius   best practices and workflows for building enterprise solut...Crystal xcelsius   best practices and workflows for building enterprise solut...
Crystal xcelsius best practices and workflows for building enterprise solut...Yogeeswar Reddy
 
Data warehousing in practice 2016
Data warehousing in practice 2016Data warehousing in practice 2016
Data warehousing in practice 2016Sjors Otten
 
An ontological approach to handle multidimensional schema evolution for data ...
An ontological approach to handle multidimensional schema evolution for data ...An ontological approach to handle multidimensional schema evolution for data ...
An ontological approach to handle multidimensional schema evolution for data ...ijdms
 
Machine-Learning: Customer Segmentation and Analysis.
Machine-Learning: Customer Segmentation and Analysis.Machine-Learning: Customer Segmentation and Analysis.
Machine-Learning: Customer Segmentation and Analysis.Siddhanth Chaurasiya
 
A COMPARATIVE STUDY ON BIG DATA HANDLING USING RELATIONAL AND NON-RELATIONAL ...
A COMPARATIVE STUDY ON BIG DATA HANDLING USING RELATIONAL AND NON-RELATIONAL ...A COMPARATIVE STUDY ON BIG DATA HANDLING USING RELATIONAL AND NON-RELATIONAL ...
A COMPARATIVE STUDY ON BIG DATA HANDLING USING RELATIONAL AND NON-RELATIONAL ...IJDKP
 

What's hot (14)

Star schema PPT
Star schema PPTStar schema PPT
Star schema PPT
 
Data warehousing unit 6.2
Data warehousing unit 6.2Data warehousing unit 6.2
Data warehousing unit 6.2
 
Management of Bi-Temporal Properties of Sql/Nosql Based Architectures – A Re...
Management of Bi-Temporal Properties of  Sql/Nosql Based Architectures – A Re...Management of Bi-Temporal Properties of  Sql/Nosql Based Architectures – A Re...
Management of Bi-Temporal Properties of Sql/Nosql Based Architectures – A Re...
 
Multidimensional Database Design & Architecture
Multidimensional Database Design & ArchitectureMultidimensional Database Design & Architecture
Multidimensional Database Design & Architecture
 
BI-TEMPORAL IMPLEMENTATION IN RELATIONAL DATABASE MANAGEMENT SYSTEMS: MS SQ...
BI-TEMPORAL IMPLEMENTATION IN  RELATIONAL DATABASE  MANAGEMENT SYSTEMS: MS SQ...BI-TEMPORAL IMPLEMENTATION IN  RELATIONAL DATABASE  MANAGEMENT SYSTEMS: MS SQ...
BI-TEMPORAL IMPLEMENTATION IN RELATIONAL DATABASE MANAGEMENT SYSTEMS: MS SQ...
 
BI Architecture in support of data quality
BI Architecture in support of data qualityBI Architecture in support of data quality
BI Architecture in support of data quality
 
Data warehousing unit 4.2
Data warehousing unit 4.2Data warehousing unit 4.2
Data warehousing unit 4.2
 
Data Warehouse 101 - U W Guest Lecture
Data Warehouse 101 - U W Guest LectureData Warehouse 101 - U W Guest Lecture
Data Warehouse 101 - U W Guest Lecture
 
Crystal xcelsius best practices and workflows for building enterprise solut...
Crystal xcelsius   best practices and workflows for building enterprise solut...Crystal xcelsius   best practices and workflows for building enterprise solut...
Crystal xcelsius best practices and workflows for building enterprise solut...
 
Data warehousing in practice 2016
Data warehousing in practice 2016Data warehousing in practice 2016
Data warehousing in practice 2016
 
Teradata sql-tuning-top-10
Teradata sql-tuning-top-10Teradata sql-tuning-top-10
Teradata sql-tuning-top-10
 
An ontological approach to handle multidimensional schema evolution for data ...
An ontological approach to handle multidimensional schema evolution for data ...An ontological approach to handle multidimensional schema evolution for data ...
An ontological approach to handle multidimensional schema evolution for data ...
 
Machine-Learning: Customer Segmentation and Analysis.
Machine-Learning: Customer Segmentation and Analysis.Machine-Learning: Customer Segmentation and Analysis.
Machine-Learning: Customer Segmentation and Analysis.
 
A COMPARATIVE STUDY ON BIG DATA HANDLING USING RELATIONAL AND NON-RELATIONAL ...
A COMPARATIVE STUDY ON BIG DATA HANDLING USING RELATIONAL AND NON-RELATIONAL ...A COMPARATIVE STUDY ON BIG DATA HANDLING USING RELATIONAL AND NON-RELATIONAL ...
A COMPARATIVE STUDY ON BIG DATA HANDLING USING RELATIONAL AND NON-RELATIONAL ...
 

Viewers also liked

Introduction to Data Vault Modeling
Introduction to Data Vault ModelingIntroduction to Data Vault Modeling
Introduction to Data Vault ModelingKent Graziano
 
Lean Data Warehouse via Data Vault
Lean Data Warehouse via Data VaultLean Data Warehouse via Data Vault
Lean Data Warehouse via Data VaultDaniel Upton
 
Agile BI via Data Vault and Modelstorming
Agile BI via Data Vault and ModelstormingAgile BI via Data Vault and Modelstorming
Agile BI via Data Vault and ModelstormingDaniel Upton
 
Data Vault: Data Warehouse Design Goes Agile
Data Vault: Data Warehouse Design Goes AgileData Vault: Data Warehouse Design Goes Agile
Data Vault: Data Warehouse Design Goes AgileDaniel Upton
 
Agile Data Engineering - Intro to Data Vault Modeling (2016)
Agile Data Engineering - Intro to Data Vault Modeling (2016)Agile Data Engineering - Intro to Data Vault Modeling (2016)
Agile Data Engineering - Intro to Data Vault Modeling (2016)Kent Graziano
 
Data warehousing in practice 2015
Data warehousing in practice 2015Data warehousing in practice 2015
Data warehousing in practice 2015Sjors Otten
 
Big data and digital ecosystem mark skilton jan 2014 v1
Big data and digital ecosystem mark skilton jan 2014 v1Big data and digital ecosystem mark skilton jan 2014 v1
Big data and digital ecosystem mark skilton jan 2014 v1Mark Skilton
 
Data Vault ReConnect Speed Presenting AM Part Two
Data Vault ReConnect Speed Presenting AM Part TwoData Vault ReConnect Speed Presenting AM Part Two
Data Vault ReConnect Speed Presenting AM Part TwoHans Hultgren
 
Roland bouman modern_data_warehouse_architectures_data_vault_and_anchor_model...
Roland bouman modern_data_warehouse_architectures_data_vault_and_anchor_model...Roland bouman modern_data_warehouse_architectures_data_vault_and_anchor_model...
Roland bouman modern_data_warehouse_architectures_data_vault_and_anchor_model...Roland Bouman
 
Experiences from a Data Vault Pilot Exploiting the Internet of Things
Experiences from a Data Vault Pilot Exploiting the Internet of ThingsExperiences from a Data Vault Pilot Exploiting the Internet of Things
Experiences from a Data Vault Pilot Exploiting the Internet of ThingsGuyVanderSande
 
Data Vault 2.0: Using MD5 Hashes for Change Data Capture
Data Vault 2.0: Using MD5 Hashes for Change Data CaptureData Vault 2.0: Using MD5 Hashes for Change Data Capture
Data Vault 2.0: Using MD5 Hashes for Change Data CaptureKent Graziano
 
Semantic Spend Classification
Semantic Spend ClassificationSemantic Spend Classification
Semantic Spend Classificationarivolit
 
Data Virtualization - Supernova
Data Virtualization - SupernovaData Virtualization - Supernova
Data Virtualization - SupernovaTorsten Glunde
 

Viewers also liked (14)

Introduction to Data Vault Modeling
Data Vault: What is it? Where does it fit? SQL Saturday #249

  • 1. DecisionLab.Net: business intelligence is business performance. Data Vault: What is it? Where does it fit? - SQL Saturday #249. DecisionLab.Net, http://www.decisionlab.net, http://blog.decisionlab.net, dupton@decisionlab.net, Carlsbad, California, USA
  • 2. Data Vault: What is it? Where does it fit? - SQL Saturday #249. daniel upton, business intelligence analyst, certified scrum master, DecisionLab.Net. Reference: Database diagrams in this presentation are adaptations or expansions of those published in the article “Data Warehouse Generation Algorithm Explained”, available at http://www.dwhautomation.org/data-warehouse-generation-algorithm-explained/ (Page 2 of 24)
  • 3. Real-World BI-DW Implementation Questions:
      - As an ETL Developer who often waits weeks or months without capturing any source data history while a new DW / DM target data model gets designed, would you instead like to quickly, and with some automation, define and load a robust, history-tracking DW / staging repository, without compromising the star/snowflake schema design?
      - As a DW Architect, have recent requests from project sponsors, namely that your DW reporting / analytics (BI) solution also supply authoritative data into upcoming Master Data Management or Enterprise Data Quality solutions, got you nervous about your planned or (oops) already completed data transformations that were supposed to be scope-limited to BI only? When you remind them of the initially agreed scope, they shrug!
      - After six months of production data loads from a source system that does not track historic data changes, you discover that your DW-loading logic is wrong, and of course your staging area is overwritten with each cycle.
  • 4. Data Vault resolves each of the above challenges. This session will demonstrate this claim while familiarizing you with Data Vault design fundamentals, briefly exploring its potential for automation, and considering where it fits.
  • 5. List of Entering Assumptions:
      - Disk storage is sufficiently cheap.
      - Automation of back-end DW development tasks is appealing.
      - Source data is in an RDBMS with at least a resemblance to 3rd normal form.
      - Source data exists in disparate systems OR in one system with poor data quality OR with the inability to efficiently track historic data changes.
      - A non-volatile back-end data repository between operational systems and the BI-DW presentation layer (e.g. Star Schema) is desired.
      - Time-latency requirements give us an ample time window to load the repository and then again transform data from there into the presentation layer.
      - A proliferation of tables, and a substantial increase in their number, is tolerable as long as both the design and loading of the schema are straightforward and, to some extent, automatable.
  • 6. High-Level Introduction to Data Vault Methodology:
      We begin with a simple OLTP database design for sales transactions, plus a small excerpt of tables from ERP and CRM schemas. For illustration purposes, I include a minimum of tables and fields. In the diagrams, ‘BK’ means business key and ‘FK’ means foreign key. Refer to Diagram A below.
      This OLTP schema uses no surrogate keys. If a client gets a new email address, or a product gets a new name, or a city’s remapping of boundary lines suddenly places an existing store in a new city, then for any given business key, new non-key values overwrite old values, which are thereby lost.
      Of course, in order to preserve history, history-tracking surrogate keys are commonly used by practitioners of both W.H. Inmon’s classic third-normal-form (3NF) EDW design and Dr. Ralph Kimball’s Star Schema method. Both of these methods, however, prescribe surrogate keys within the context of data transformations that also include subjective interpretation (herein simply ‘subjective transformation’) in order to cleanse or enhance the data for integration that serves reporting or analytic needs. Data Vault purists claim that any such subjective transformation of line-of-business data introduces distortion to it, thereby disqualifying the Data Warehouse / Mart as a system of record.
      Data Vault, by contrast, provides a simple yet unique way to track historical changes from source data while eliminating most, or all, subjective transformations such as data-quality filters, establishment of hierarchies, calculated fields, or target/goal values. Although analytics-driven, subjective transformations should still be applied for BI, they are applied downstream of the Data Vault EDW, as subsequent custom transformations for loads into data marts designed to analyze specific business processes. Back upstream, the Data Vault accomplishes historic change-tracking using a generic table-deconstructing approach that I will now describe.
      Before beginning, I recommend against too quickly comparing this method to others, like star-schema design, which serve different needs.
  • 7. Diagram A: Excerpts from three operational OLTP schemas (data sources for Data Vault)
  • 8. Diagram B: Sales Transaction Only
  • 9. Diagram C: Hubs and Satellites in Source B’s partially-designed Data Vault schema
  • 10. Fundamentally, Data Vault prescribes three types of tables: Hubs, Satellites, and Links. Let’s use Diagram B’s Client table as our example. Hubs and Satellites have the following characteristics:
      Hub Tables:
      - Define the granularity of an entity (e.g. Client), and thus the granularity of the entity’s non-key attributes (e.g. name, description).
      - Contain a new surrogate primary key (PK), as well as the source table’s business key, demoted from its PK role.
      - Contain no non-key attribute fields such as name, address, email, telephone.
      Satellite Tables:
      - Contain all non-key fields (attributes), plus a set of date-stamp fields.
      - Contain, as a Foreign Key (FK), the Hub’s PK, plus load date-time stamps.
      - Have a defining, dependent entity relationship to one, and only one, parent table. Whether that parent table is a Hub or a Link, the Satellite holds the non-key attribute fields from the parent table.
      - Although on initial loads only one Satellite row will exist for each corresponding Hub row, whenever a non-key attribute changes upstream in the OLTP schema (e.g. a client’s email address changes, too often accomplished with an overwrite), a new row will be added to the Satellite only, not to the Hub, which is why many Satellite rows will relate to one Hub row. In this fashion, historic changes within source tables are gracefully tracked in the Data Vault.
      Notice in Diagram C that, among other tables, the Client_h_s Satellite table is dependent on the Client_h Hub table, but that, at this stage in our design, the Client_h Hub is not yet related to the Order_h Hub. When we add Links, those relationships will appear. But first, have a look at the tables, the new location of existing fields, and the various added date-time stamps.
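The Hub-and-Satellite behavior described on this slide can be sketched in a few lines of Python using the standard-library `sqlite3` module. This is a minimal illustration under stated assumptions, not the presenter's implementation: the table names `Client_h` and `Client_h_s` come from the slides, but the column names (`client_sk`, `client_bk`, `load_date`, `load_date_end`, `name`, `email`), the `load_client` helper, and the sample data are hypothetical.

```python
import sqlite3


def create_schema(conn):
    """Create a hypothetical Client_h Hub and Client_h_s Satellite."""
    conn.executescript(
        """
        CREATE TABLE Client_h (
            client_sk  INTEGER PRIMARY KEY AUTOINCREMENT,  -- new surrogate PK
            client_bk  TEXT NOT NULL UNIQUE,               -- source business key
            load_date  TEXT NOT NULL
        );
        CREATE TABLE Client_h_s (
            client_sk      INTEGER NOT NULL REFERENCES Client_h (client_sk),
            load_date      TEXT NOT NULL,
            load_date_end  TEXT,                           -- NULL marks the current row
            name           TEXT,
            email          TEXT,
            PRIMARY KEY (client_sk, load_date)
        );
        """
    )


def load_client(conn, client_bk, name, email, load_date):
    """Load one source row: Hub row once, Satellite row per attribute change."""
    row = conn.execute(
        "SELECT client_sk FROM Client_h WHERE client_bk = ?", (client_bk,)
    ).fetchone()
    if row is None:
        cur = conn.execute(
            "INSERT INTO Client_h (client_bk, load_date) VALUES (?, ?)",
            (client_bk, load_date),
        )
        client_sk = cur.lastrowid
    else:
        client_sk = row[0]

    current = conn.execute(
        "SELECT name, email FROM Client_h_s "
        "WHERE client_sk = ? AND load_date_end IS NULL",
        (client_sk,),
    ).fetchone()
    if current == (name, email):
        return client_sk  # nothing changed, so no new Satellite row
    if current is not None:
        # Close the previous version instead of overwriting it.
        conn.execute(
            "UPDATE Client_h_s SET load_date_end = ? "
            "WHERE client_sk = ? AND load_date_end IS NULL",
            (load_date, client_sk),
        )
    conn.execute(
        "INSERT INTO Client_h_s (client_sk, load_date, load_date_end, name, email) "
        "VALUES (?, ?, NULL, ?, ?)",
        (client_sk, load_date, name, email),
    )
    return client_sk


conn = sqlite3.connect(":memory:")
create_schema(conn)
load_client(conn, "C-001", "Ann Lee", "ann@old.example", "2013-01-01")
load_client(conn, "C-001", "Ann Lee", "ann@new.example", "2013-06-01")
hubs = conn.execute("SELECT COUNT(*) FROM Client_h").fetchone()[0]
sats = conn.execute("SELECT COUNT(*) FROM Client_h_s").fetchone()[0]
print(hubs, sats)  # one Hub row, two Satellite rows
```

The email change produces a second Satellite row while the Hub row stays untouched, which is the many-Satellite-rows-to-one-Hub-row pattern the slide describes.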
  • 11. Diagram D: Data Vault Schema with Link tables added: Complete for Source B only
  • 12. Link Tables: See Diagram D.
      - Links relate exactly two Hubs together.
      - Links contain, now as non-key values, the primary keys of the two Hubs, plus their own surrogate PK.
      - Peg-leg links are special links in that they relate to only one Hub. More on this later.
      - As with an ordinary association / join table, a Link is a child of both Hubs it relates and, as such, it gracefully handles the odd relative changes in cardinality between the two tables and cleanly supports many-to-many relationships that are stored in the source system, which would otherwise either cause load-failing errors in the data-loading process or require ad-hoc, hard-coded data cleansing.
      - Unlike an ordinary association table, the Link table, with its own surrogate PK in conjunction with date-stamp fields in both Hubs, allows us to track historic changes in the relationship itself between the two Hubs, and thus between their two directly-related OLTP source tables. Specifically, all loaded data that conformed with the initial cardinality between tables would share the same Link table surrogate key, but if an unexpected future source-data change causes a cardinality reversal (so that the one becomes the many, and vice versa), a new row with a new surrogate key is generated to capture it, while the original surrogate key preserves the historical relationship. Slick!
      Limits to Automated Logic for Data Vault Design:
      Note that the OLTP Details table was transformed not into a Hub-and-Satellite combination, but rather into a Link table, which seems valid insofar as Order Details can be considered simply a direct relationship between an Order and a Product. This logic may or may not be fully automatable. In a more sophisticated Data Vault schema than this one, we might go further by adding load_date and load_date_end date-stamp fields to Link tables, too.
      As an (admittedly strange) example, the Order_Store_l Link table might conceivably get date-time stamp fields so that, in coordination with its surrogate PK, an Order (perhaps for a long-running service) that gets re-credited to a different store after the Order Date can be efficiently tracked over time. Of course, with this last enhancement, we’re probably crossing the line from ‘automatable’ to ‘custom’ Data Vault design.
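The Order_Store_l example can be made concrete with a small Python sketch using `sqlite3`. It assumes the enhanced design mentioned on this slide, where the Link itself carries `load_date` and `load_date_end` fields; the column names, the `credit_order_to_store` helper, and the sample keys are hypothetical, and the Hub tables are omitted so that the Hub surrogate keys appear as plain integers.

```python
import sqlite3


def create_link(conn):
    """Create a hypothetical Order_Store_l Link with date-stamp fields."""
    conn.execute(
        """
        CREATE TABLE Order_Store_l (
            order_store_sk  INTEGER PRIMARY KEY AUTOINCREMENT,  -- Link's own surrogate PK
            order_sk        INTEGER NOT NULL,                   -- Hub PK, held as a non-key value
            store_sk        INTEGER NOT NULL,                   -- Hub PK, held as a non-key value
            load_date       TEXT NOT NULL,
            load_date_end   TEXT                                -- NULL marks the current relationship
        )
        """
    )


def credit_order_to_store(conn, order_sk, store_sk, load_date):
    """Record which store an order is credited to, keeping earlier credits."""
    current = conn.execute(
        "SELECT order_store_sk, store_sk FROM Order_Store_l "
        "WHERE order_sk = ? AND load_date_end IS NULL",
        (order_sk,),
    ).fetchone()
    if current is not None and current[1] == store_sk:
        return current[0]  # relationship unchanged
    if current is not None:
        # End-date the old relationship rather than overwriting it.
        conn.execute(
            "UPDATE Order_Store_l SET load_date_end = ? WHERE order_store_sk = ?",
            (load_date, current[0]),
        )
    cur = conn.execute(
        "INSERT INTO Order_Store_l (order_sk, store_sk, load_date) VALUES (?, ?, ?)",
        (order_sk, store_sk, load_date),
    )
    return cur.lastrowid


link_conn = sqlite3.connect(":memory:")
create_link(link_conn)
first_sk = credit_order_to_store(link_conn, 1, 10, "2013-01-05")
second_sk = credit_order_to_store(link_conn, 1, 20, "2013-03-01")  # re-credited
history = link_conn.execute(
    "SELECT store_sk, load_date, load_date_end FROM Order_Store_l ORDER BY order_store_sk"
).fetchall()
print(history)
```

Each re-crediting yields a new Link row with a new surrogate key, while the end-dated row preserves the earlier relationship.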
  • 13. Summary: Steps in Basic Data Vault Automation Logic:
      A. Buy or build an application to automate Data Vault schema design and/or Data Vault ETL code.
      B. Based on each OLTP table’s primary and foreign key structure, auto-tag each table as Hub, Satellite or Link.
      C. Human review, overruling certain automated quick-tags with enhancements.
      D. Using either custom-built logic or a purchased design-automation application, auto-generate Data Vault DDL.
      E. Using the same, auto-generate ETL code for loading the Data Vault.
      A. Buy or Build:
      - Build: Roll your own with macros in ERwin, ER Studio, etc.
      - Buy: Consider BIReady, QUIPU, WhereScape Red.
      - No silver bullets.
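Step D (auto-generating Data Vault DDL) might look something like the following Python sketch for a table already tagged as a Hub. It is an illustration only: the `_h` and `_h_s` suffixes follow the slides, but the `generate_hub_satellite_ddl` function, the generic `TEXT` column types, the `load_date` / `load_date_end` column names, and the sample `Client` columns are assumptions. The commercial tools named above are not modeled here.

```python
import sqlite3


def generate_hub_satellite_ddl(table, business_keys, attributes):
    """Return (hub_ddl, satellite_ddl) strings for one source table.

    table         -- source table name, e.g. "Client"
    business_keys -- list of business-key column names
    attributes    -- list of non-key attribute column names
    """
    hub = f"{table}_h"
    sat = f"{table}_h_s"
    sk = f"{table.lower()}_sk"  # hypothetical surrogate-key naming rule
    bk_cols = "".join(f"    {c} TEXT NOT NULL,\n" for c in business_keys)
    attr_cols = "".join(f"    {c} TEXT,\n" for c in attributes)
    hub_ddl = (
        f"CREATE TABLE {hub} (\n"
        f"    {sk} INTEGER PRIMARY KEY,\n"
        f"{bk_cols}"
        f"    load_date TEXT NOT NULL,\n"
        f"    UNIQUE ({', '.join(business_keys)})\n"
        f");"
    )
    sat_ddl = (
        f"CREATE TABLE {sat} (\n"
        f"    {sk} INTEGER NOT NULL REFERENCES {hub} ({sk}),\n"
        f"    load_date TEXT NOT NULL,\n"
        f"    load_date_end TEXT,\n"
        f"{attr_cols}"
        f"    PRIMARY KEY ({sk}, load_date)\n"
        f");"
    )
    return hub_ddl, sat_ddl


hub_ddl, sat_ddl = generate_hub_satellite_ddl("Client", ["client_id"], ["name", "email"])

# Sanity check: the generated statements should at least parse and run.
ddl_conn = sqlite3.connect(":memory:")
ddl_conn.executescript(hub_ddl + "\n" + sat_ddl)
print(hub_ddl)
print(sat_ddl)
```

A real generator would read keys and column types from the source catalog rather than take them as arguments; passing them in keeps the sketch self-contained.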
  • 14. B. Auto-Tag: Hub, Satellite or Link. Notice how simple (and thus automatable) these rules are.
      Rules: a bottom-up process of identifying ‘non-Hubs’
      o Satellite Auto-Tag: A table that no other table’s foreign keys reference, whose own foreign-key field also acts as its primary key, and whose primary key contains no other fields.
      o Peg-Leg Link Auto-Tag: Exactly the same as above, except that the primary key contains one or more additional fields. Such a table is a candidate to become a peg-leg link.
      o Link Auto-Tag: A table that no other table’s foreign keys reference, and that has more than one foreign key, with all foreign keys collectively contained within the primary key. The primary key may include other fields as well. As a reminder, many, perhaps most, Links are created not directly from individual source tables, but rather from the direct (existing or to-be-designed) relationships between source tables. In this case it does not matter whether the primary key is wider than all the foreign keys together.
      o Hub Auto-Tag: A table that does not fit any of the above rules is a Hub.
      C. Human Review: Overrule certain automated quick-tags with enhancements based on experience with the database, data and business. See the above section on ‘Limits to Automated Logic for Data Vault Design’.
      D. Generate DDL code.
      E. Generate logic for ETL code.
      F. Capture ETL code and set up scheduled Data Vault loads.
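The auto-tag rules above translate fairly directly into code. The Python sketch below is one possible reading of them, not the presenter's tool: the `Table` structure, the `auto_tag` function, and the sample schema (including the `ClientProfile` and `OrderNote` tables) are hypothetical, and "its own foreign-key field also acting as its primary key" is interpreted as "exactly one foreign key whose columns equal the primary key".

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    name: str
    primary_key: tuple
    # Each foreign key is (tuple_of_columns, referenced_table_name).
    foreign_keys: list = field(default_factory=list)


def auto_tag(tables):
    """Tag each table as Satellite, Peg-Leg Link, Link, or (by default) Hub."""
    referenced = {ref for t in tables for _, ref in t.foreign_keys}
    tags = {}
    for t in tables:
        pk = set(t.primary_key)
        fk_cols = [set(cols) for cols, _ in t.foreign_keys]
        if t.name in referenced or not fk_cols:
            tags[t.name] = "Hub"            # every non-Hub rule needs no inbound FKs
        elif len(fk_cols) == 1 and fk_cols[0] == pk:
            tags[t.name] = "Satellite"      # the FK is the whole PK
        elif len(fk_cols) == 1 and fk_cols[0] < pk:
            tags[t.name] = "Peg-Leg Link"   # PK = FK plus additional fields
        elif len(fk_cols) > 1 and set().union(*fk_cols) <= pk:
            tags[t.name] = "Link"           # all FKs contained within the PK
        else:
            tags[t.name] = "Hub"            # fits none of the rules above
    return tags


# Hypothetical source schema, loosely modeled on the sales example.
schema = [
    Table("Client", ("client_id",)),
    Table("Store", ("store_id",)),
    Table("Product", ("product_id",)),
    Table("Orders", ("order_id",),
          [(("client_id",), "Client"), (("store_id",), "Store")]),
    Table("Details", ("order_id", "product_id"),
          [(("order_id",), "Orders"), (("product_id",), "Product")]),
    Table("ClientProfile", ("client_id",), [(("client_id",), "Client")]),
    Table("OrderNote", ("order_id", "note_no"), [(("order_id",), "Orders")]),
]
tags = auto_tag(schema)
for name, tag in tags.items():
    print(f"{name}: {tag}")
```

In this sample, Details is tagged as a Link, consistent with the earlier observation that the OLTP Details table becomes a Link rather than a Hub-and-Satellite pair. The output of such a function is only a starting point for the human review in step C.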
  • 15. Diagram E: Data Vault Schema: Complete for (excerpts of) Sources A through C
  • 16. Diagram F: Data Vault Schema - Integrated Data Vault Spanning Multiple Operational Databases
  • 17. Diagram G: Summary of OLTP into Data Vault (panels: OLTP Source; Data Vault)
  • 18. In Diagram G, we note that the source schema’s seven tables just morphed into the Data Vault’s eighteen. When you consider that an order detail record (a line item) is really just the association between an Order and a Product (albeit an association with plenty of vital associated data), it makes sense that the Link table Details_l was created. This Link table, whose sole purpose is to relate the Orders_h and Products_h tables, of course also needs a Details_l_s Satellite table to hold the show-stopper non-key attributes, Quantity and Unit Price.
      The Data Vault method does allow for some interpretation here. You might now be thinking, “Aha! So, we haven’t eliminated all subjective interpretation!” Perhaps not, but what I’ll describe here is a pretty small, generic interpretation. Either way, in this situation, it would not be patently wrong to design a Details_h Hub table (plus, of course, a Details_h_s Satellite) rather than the Details_l Link. Indeed, if we use very simple Data Vault design-automation logic, which simply deconstructs all tables into Hub and Satellite pairs, this is what we would get. However, keep in mind that if we did that, we would then have to create not one but two Link tables, specifically an Order_Order_Details_l Link table and a Product_Order_Details_l Link table, to connect our tables, and these tables would contain no attributes of apparent value. Therefore, we choose the design that leaves us with a simpler, more efficient Data Vault. By the way, this logic can easily be automated, but that’s beyond the scope of this article.
  • 19. Diagram H: Data Vault Supplies Authoritative Enterprise Data To Multiple Target Applications (diagram label: Custom transformations with subjective data re-interpretation)
  • 20. Diagram I: Data Vault Supplies Authoritative Enterprise Data To Multiple Target Applications (diagram label: Custom transformations with subjective data re-interpretation). Data Vault: Staging or EDW? What about the 3NF EDW? “The Data Vault is the optimal choice for modeling the EDW in the DW 2.0 framework” - W.H. Inmon
  • 21. Diagram J: Data Vault Supplies Authoritative Enterprise Data To Multiple Target Applications (diagram label: Custom transformations with subjective data re-interpretation). Data Vault: Staging or EDW? What about the 3NF EDW? “The Data Vault is the optimal choice for modeling the EDW in the DW 2.0 framework” - W.H. Inmon. What else? If Data Vault provides an authoritative, non-volatile, history-tracking RDBMS repository with robust yet forgiving referential integrity, while imposing little or no subjective data reinterpretation from multiple operational systems, then it has benefits for target systems beyond BI-DW. Data Quality Management: Data Vault gracefully and permanently stores the good, the bad and the ugly (outliers, referential integrity violations, etc.) and their improvement (or lack thereof) over time. Master Data Management: (1) Data Vault is a more appropriate MDM data source than dimensional data marts. (2) In closed-loop MDM, Data Vault feeds operational data to MDM, which then publishes improved master data back to the operational systems, which in turn continue to feed Data Vault automatically, so Data Vault captures MDM adoption levels across multiple operational systems.
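The data quality point above, that bad values and their later corrections are both kept, can be illustrated with an insert-only Satellite. This is a minimal sketch under stated assumptions: the Customers_h_s table, its columns, the "CRM" source label and the helper functions are all hypothetical and do not appear in the slides, and SQLite stands in for whatever RDBMS is actually used.

```python
import sqlite3

def build_satellite():
    """Create a hypothetical, insert-only Satellite for customer attributes."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Customers_h_s ("
        "  customers_h_id INTEGER NOT NULL,"   # key of the parent Hub (assumed)
        "  load_dts       TEXT NOT NULL,"      # when this version was loaded
        "  postal_code    TEXT,"
        "  record_source  TEXT NOT NULL,"
        "  PRIMARY KEY (customers_h_id, load_dts))")
    return conn

def load_version(conn, hub_id, load_dts, postal_code, source="CRM"):
    """Append a new row only when the value differs from the latest version.

    Existing rows are never updated or deleted, so earlier (possibly bad)
    values stay queryable alongside their corrections.
    """
    latest = conn.execute(
        "SELECT postal_code FROM Customers_h_s "
        "WHERE customers_h_id = ? ORDER BY load_dts DESC LIMIT 1",
        (hub_id,)).fetchone()
    if latest is None or latest[0] != postal_code:
        conn.execute(
            "INSERT INTO Customers_h_s "
            "(customers_h_id, load_dts, postal_code, record_source) "
            "VALUES (?, ?, ?, ?)",
            (hub_id, load_dts, postal_code, source))

def history(conn, hub_id):
    """Return every stored version for one Hub key, oldest first."""
    return conn.execute(
        "SELECT load_dts, postal_code FROM Customers_h_s "
        "WHERE customers_h_id = ? ORDER BY load_dts",
        (hub_id,)).fetchall()
```

Loading a truncated postal code, then the same value again, then a corrected one leaves two rows in the Satellite: the original bad value and its correction, each with its load date. That retained history is what a data quality process would measure.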
  • 22. Review: Real-World Questions and Entering Assumptions. Real-World BI-DW Implementation Questions: (1) An ETL Developer waits without capturing source data history while the data model gets designed. Instead, define and load a robust, history-tracking DW / staging repository? (2) Sponsors request that the BI-DW solution supply authoritative data into an MDM or DQ solution. What about ETL that was supposed to be BI-only? (3) Six months post-release, you discover that your DW-loading logic is wrong and significant source data has been overwritten.
  • 23. Entering Assumptions: (1) Disk storage is sufficiently cheap. (2) Automation of back-end DW development tasks is appealing. (3) Source data is in an RDBMS with at least a resemblance to 3rd normal form. (4) Source data exists in disparate systems, OR in one system with poor data quality, OR in a system unable to efficiently track historic data changes. (5) A non-volatile back-end data repository between the operational systems and the BI-DW presentation layer (e.g. Star Schema) is desired. (6) Time-latency requirements give us an ample time window to load the repository and then to transform data from there into the presentation layer. (7) A substantial increase in the number of tables is tolerable as long as both the design and the loading of the schema are straightforward and, to some extent, automatable. Conclusion: Data Vault can indeed resolve these challenges.
  • 24. Thank you! DecisionLab.Net: business intelligence is business performance. daniel upton, business intelligence analyst, certified scrum master. dupton@decisionlab.net http://www.linkedin.com/in/DanielUpton http://www.slideshare.net/DanielUpton