Data Warehouse (especially EDW) design needs to get Agile. This whitepaper introduces Data Vault to newcomers, and describes how it adds agility to DW best practices.
Data Vault: Data Warehouse Design Goes Agile
DecisionLab.Net
business intelligence is business performance
DecisionLab  http://www.decisionlab.net  dupton@decisionlab.net  direct 760.525.3268
http://blog.decisionlab.net  Carlsbad, California, USA
Whitepaper
Data Vault:
Data Warehouse Design Goes Agile
by daniel upton
data warehouse modeler and architect
certified scrum master
DecisionLab.Net
business intelligence is business performance
dupton@decisionlab.net
http://www.linkedin.com/in/DanielUpton
Without my (the writer's) explicit written permission in advance, the only permissible reproduction or copying of this written material is in the form of a
review or a brief reference to a specific concept herein, either of which must clearly specify this writing's title, author (me), and this web address:
http://www.slideshare.net/DanielUpton/lean-data-warehouse-via-data-vault . For permission to reproduce or copy any of this material other than what is
specified above, just email me at the above address.
Open Question: When we begin considering a new Data Warehouse initiative, how clear is the
scope, really?
If we intend to design Data Marts, and we have no specified need for a data warehouse either to become a system of record
or to support Master Data Management (MDM), then we may choose Dr. Ralph Kimball's Data Warehouse Bus
architecture, designing a library of conformed (standardized, re-usable) dimension and fact tables for deployment into a series
of purpose-built data marts. Under these requirements, we may have no specific need for an Inmon-style third-normal-form
(3NF) Enterprise Data Warehouse (EDW) in general, or for a Data Vault in particular. In other cases, however, because
data warehouse data sometimes outlives its corresponding source data inside a soon-to-retire application database, then, like
it or not, a data warehouse may, as Bill Inmon reminds us, assume a system-of-record role for its data. Because the Kimball
Bus architecture's tables are often not related via key fields, and in fact may not be populated at all until deployment from the
Bus into a specific-needs Data Mart, Kimball adherents rarely assert a system-of-record role for their solutions.
But suppose we do determine that our required solution either needs to assume a system-of-record role, or perhaps that
it must support Master Data Management. In that case, we may elect to design a fully functional EDW, rather than Kimball's DW
Bus, so that the EDW itself, and not just its dependent data marts, is a working, populated database. Now, knowing that the
creation of a classic EDW, with its requirement for an up-front, enterprise-wide design, is a challenge given today's
expectations for rapid delivery, some may be curious about how new design methodologies offer ways to accelerate EDW design.
Data Vault, a data warehouse modeling method with a substantial following in Denmark and a growing base in the U.S., offers
specific and important benefits.
In order to set expectations early about Data Vault, readers must understand that, somewhat unlike a traditional EDW, and
utterly unlike a star schema, a Data Vault (not to be confused with the Business Data Vault, which is not addressed in this
article) cannot serve as an efficient presentation layer appropriate for direct queries. Rather, it is more like a historic
enterprise data staging repository that, with additional downstream ETL, will support not only star schemas, reporting, and data
mining, but also master data management, data quality, and other enterprise data initiatives.
Data Vault Benefits:
Benefit #1: Allows for loading of a history-tracking DW with little or none of the typical extraction, transformation and
loading (ETL) transformations that, once they are finally figured out, would otherwise contain subjective interpretations
of the data and which purportedly enhance the data and prepare it for reporting or analytics.
o In my view, this is almost enough of a benefit all by itself. As such, in the introduction that follows, I will focus on
proving this point.
o Agile Win: Confidently loading a DW without having to already know the fine details of business rules and
requirements, and the resulting transformation requirements, means that loading of historical and incremental
data can be accomplished before the first target database design (3NF EDW or Data Mart) is complete.
Benefit #2: Insofar as Data Vault prescribes a very generic downstream 'de-constructing' of OLTP tables, these
de-constructing transformations can be automated, and so can the associated early-stage ETL into the Data Vault.
Since, as you'll soon see, Data Vault causes a substantial increase in the number of tables, this automation potential
is a substantial benefit.
o Agile Win: Automated initial design and loading, anyone?
Benefit #3: Due to Data Vault's generic design logic, its use of surrogate keys (more on this soon), and its prescription
to avoid subjective-interpretive transformations, it's reasonable to quickly load a Data Vault with just the needed
subset of tables.
o Agile Win: More frequent releases. Quickly design for, and load, only the data needed for the next release. Use
the same generic design to load other tables when those User Stories from the Product Backlog get placed into a
Sprint.
In the remainder of this article, I will provide a high-level introduction to Data Vault, with primary emphasis on how it achieves
Benefit #1.
High-Level Introduction to Data Vault Methodology:
We begin with a simple OLTP database design for clients purchasing products from a company's stores. For simplicity, I
include only a minimum of fields. In the diagrams, 'BK' means business key and 'FK' means foreign key. Refer to Diagram A
below.
As is common, this simple OLTP schema does not use surrogate keys. If a client gets a new email address, or a product gets a
new name, or a city's re-mapping of boundary lines suddenly places an existing store in a new city, new values would
overwrite the old values, which would then be lost. Of course, in order to preserve history, history-tracking surrogate keys are
commonly used by practitioners of both Bill Inmon's classic third-normal-form (3NF) EDW design and Dr. Ralph Kimball's
star-schema method, but both of these methods prescribe surrogate keys within the context of data transformations that also
include subjective interpretation (herein simply 'subjective transformation') in order to cleanse or purportedly enhance the
data for the purposes of integration, reporting, or analytics. Data Vault purists claim that any such subjective transformation
of line-of-business data introduces inappropriate distortion to it, thereby disqualifying the Data Warehouse as system of
record. Data Vault, importantly, provides a unique way to track historical changes in source data while eliminating most, or
all, subjective transformations such as field renaming, selective data-quality filters, establishment of hierarchies, calculated
fields, and target values. Although analytics-driven, subjective transformations can still be applied, they are applied
downstream of the Data Vault EDW, as subsequent transformations for loads into data marts designed to analyze specific
processes. Back upstream, the Data Vault accomplishes historic change-tracking using a generic table-deconstructing
approach that I will now describe. Before beginning, I recommend against too quickly comparing this method to others, like
star-schema design, which serve different needs.
Fundamentally, Data Vault prescribes three types of tables: Hubs, Satellites, and Links. The diagram's Client table is a good
example. Hubs work according to the following simplified description:
Hub Tables:
Define the granularity of an entity (e.g., product), and thus the granularity of non-key attributes (e.g., product description)
within the entity.
Contain a new surrogate primary key (PK), as well as the source table's business key, which is demoted from its PK role.
Satellite Tables:
Contain all non-key fields (attributes), plus a set of date-stamp fields.
Contain, as a foreign key (FK), the Hub's PK, plus the load date-time stamps.
Have a defining, dependent entity relationship to one, and only one, parent table.
Whether that parent table is a Hub or a Link, the Satellite holds the non-key fields from the parent table.
Although on initial loads only one Satellite row will exist for each corresponding Hub row, whenever a non-key
attribute changes (e.g., a client's email address changes) upstream in the OLTP schema (often accomplished up there with
a simple overwrite), a new row is added only to the Satellite, and not the Hub, which is why many Satellite rows can
relate to one Hub row. In this fashion, historic changes within source tables are gracefully tracked in the EDW.
Notice in Diagram B that, among other tables, the Client_h_s Satellite table is dependent on the Client_h Hub table, but that,
at this stage in our design, the Client_h Hub is not yet related to the Order_h Hub. When we add Links, those relationships will
appear. But first, have a look at the tables, the new locations of existing fields, and the various added date-time stamps.
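The insert-only Hub-and-Satellite mechanics described above can be sketched in a few lines of Python. This is an illustrative in-memory model, not real Data Vault ETL; the client_bk and email field names are my stand-ins for Diagram A's Client table, and the integer surrogate-key scheme is an assumption.

```python
from datetime import datetime

hub_client = {}   # Client_h: business key -> surrogate PK
sat_client = []   # Client_h_s: one row per attribute version
_next_sk = [1]

def load_client(client_bk, email, load_date):
    """Insert-only load: the Hub gets one row per business key;
    the Satellite gets a new row whenever an attribute changes."""
    # Hub: assign a surrogate key only the first time we see this business key.
    if client_bk not in hub_client:
        hub_client[client_bk] = _next_sk[0]
        _next_sk[0] += 1
    hub_sk = hub_client[client_bk]

    # Satellite: append a new row only if the attributes differ from the latest row.
    rows = [r for r in sat_client if r["hub_sk"] == hub_sk]
    latest = max(rows, key=lambda r: r["load_date"], default=None)
    if latest is None or latest["email"] != email:
        sat_client.append({"hub_sk": hub_sk, "email": email, "load_date": load_date})

# A source-side overwrite (a client's new email) arrives as an *added* Satellite row:
load_client("C-100", "ann@example.com", datetime(2013, 1, 1))
load_client("C-100", "ann@newmail.com", datetime(2013, 2, 1))
```

After the two loads, Client_h still holds one row for business key C-100, while Client_h_s holds two rows, so the old email survives rather than being overwritten.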
Link Tables:
Refer to Diagram C.
Relate exactly two Hub tables together.
Contain, now as non-key values, the primary keys of the two Hubs, plus the Link's own surrogate PK.
As with an ordinary association table, a Link is a child to two other tables and, as such, is able to gracefully handle
relative changes in cardinality between the two tables and, where necessary, can directly resolve many-to-many
relationships that might otherwise cause a show-stopper error in the data-loading process.
Unlike an ordinary association table, the Link table, with its own surrogate PK, is able to track historic changes in the
relationship itself between the two Hubs, and thus between their two directly-related OLTP source tables. Specifically,
all loaded data that conformed with the initial cardinality between tables shares the same Link table surrogate
key, but if an unexpected future source-data change causes a cardinality reversal (so that the one becomes
the many, and vice versa), a new row, with a new surrogate key, is generated to capture the new relationship, while the
original surrogate key preserves the historical one. Slick!
In a more sophisticated Data Vault schema than this one, we might go further by adding load_date and
load_date_end date-stamp fields to Link tables, too. As an (admittedly strange) example, the Order_Store_l Link table
might conceivably get date-time stamp fields so that, in coordination with its surrogate PK, an Order (perhaps for a
long-running service) that, after the Order Date, gets re-credited to a different store can be efficiently tracked over time
in this way.
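The Link behavior just described can also be sketched as illustrative in-memory Python. The Order_Store_l shape and the integer surrogate keys are my assumptions for the example; real Data Vault ETL would of course work against database tables.

```python
link_order_store = []   # Order_Store_l: one row per distinct Order/Store pairing
_next_link_sk = [1]

def load_order_store_link(order_sk, store_sk):
    """Insert-only load of a Link relating two Hub surrogate keys.
    An already-seen pairing is reused; a new pairing (e.g., an order
    re-credited to a different store) gets a new row and surrogate key,
    while the old row is retained to preserve the historical relationship."""
    for row in link_order_store:
        if row["order_sk"] == order_sk and row["store_sk"] == store_sk:
            return row["link_sk"]
    link_sk = _next_link_sk[0]
    _next_link_sk[0] += 1
    link_order_store.append(
        {"link_sk": link_sk, "order_sk": order_sk, "store_sk": store_sk}
    )
    return link_sk

# Order 7 is first credited to store 1, then re-credited to store 2:
load_order_store_link(7, 1)
load_order_store_link(7, 2)   # new Link row; the (7, 1) row survives as history
```

Both pairings now coexist in Order_Store_l under different surrogate keys, which is exactly the graceful change-tracking claimed above.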
Now we've added Link tables. After scanning Diagram C, go back and compare it with Diagram A and note the movement of
the various non-key attributes. Undoubtedly, you will also notice, and may be concerned, that the source schema's five tables
just morphed into the Data Vault's twelve. Importantly, note that Diagram A's Details table was transformed not into a
Hub-and-Satellite combination, but rather into a Link table. When you consider that an order detail record (a line item) is
really just the association between an Order and a Product (albeit an association with plenty of vital associated data), then it
makes sense that the Link table Details_l was created. This Link table, whose sole purpose is to relate the Orders_h and
Products_h tables, of course, also needs a Details_l_s Satellite table to hold the vital non-key attributes, Quantity
and Unit Price.
The Data Vault method does allow for some interpretation here. You might now be thinking, "Aha! So, we haven't eliminated
all subjective interpretation!" Perhaps not, but what I'll describe here is a pretty small, generic interpretation. Either way, in
this situation, it would not be patently wrong to design a Details_h Hub table (plus, of course, a Details_h_s Satellite), rather
than the Details_l Link. Added to that, if we use very simple Data Vault design-automation logic, which simply de-constructs
all tables into Hub-and-Satellite pairs, this is what we would get. However, keep in mind that if we did that, we would then
have to create not one, but two Link tables, specifically an Order_Order_Details_l Link table and a Product_Order_Details_l Link
table, to connect our tables, and these tables would contain no attributes of apparent value. Therefore, we choose the design
that leaves us with a simpler, more efficient Data Vault. By the way, this logic can easily be automated, but that's
beyond the scope of this article.
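The design choice above can be expressed as a simple heuristic, sketched below in Python. The schema-description format and the rule itself (a table with exactly two foreign keys and no business key of its own becomes a Link plus a Satellite; everything else becomes a Hub-and-Satellite pair) are my assumptions for illustration, not a formal rule of the Data Vault standard.

```python
def classify(table):
    """Decide the Data Vault shape for one source table.
    table = {"name": str, "fks": [referenced table names], "bk": has own business key}"""
    if len(table["fks"]) == 2 and not table["bk"]:
        return "link"           # e.g., Details -> Details_l (+ Details_l_s)
    return "hub_satellite"      # e.g., Client -> Client_h + Client_h_s

# A hypothetical fragment of Diagram A's schema:
schema = [
    {"name": "Client",  "fks": [],                   "bk": True},
    {"name": "Order",   "fks": ["Client"],           "bk": True},
    {"name": "Details", "fks": ["Order", "Product"], "bk": False},
]
plan = {t["name"]: classify(t) for t in schema}
```

Under this heuristic, Details becomes a Link while Client and Order become Hub-and-Satellite pairs, matching the design chosen in the text and avoiding the two extra value-free Link tables.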
Conclusion:
Our discussion of Data Vault opened with the idea that an EDW should load and store historical data without applying any
transformations that contain subjective interpretation of data or business rules, because those interpretations, even if
appropriate for specific reporting or analytics, do modify line-of-business data and therefore introduce distortions into
operational data. Those interpretive transformations should occur downstream, during ETL into presentation-layer tables.
Although Data Vault does, in fact, apply a specific set of generic 'de-construction' transformations, these transformations
contain little or no subjective interpretation of business rules. They do, however, allow it to (1) apply an appropriate level of
referential integrity to source data even where the source system may lack it now or in the future; (2) gracefully capture
historical data changes, within and between tables, without endangering the success of the data load; and (3) support loading
of data from a subset of source tables initially, and then loading, or not loading, other related source tables much later without
compromising the EDW's referential integrity.
Lastly, and very importantly, (4) Data Vault design and the associated Data Vault loading ETL, which is largely generic from one
data set to another, can be automated, and thus radically accelerated in development. Although the logic of this automation
flows from the simplicity of Data Vault design, a detailed automation discussion is beyond the scope of this article.
In closing, if we can automatically design and load a Data Warehouse (albeit not its presentation layer), it frees up brain cells
for the higher-order logic of designing the presentation layer and the intensive, custom ETL to load it. As I described here, all
of this can be accomplished simultaneously.
________________________________________________
daniel upton
dupton@decisionlab.net
DecisionLab.Net
business intelligence is business performance
DecisionLab.Net
Range of Services:
Business Intelligence Roadmapping, Feasibility Analysis
BI Project Estimation and Requirement Modelstorming
BI Staff Augmentation: Data Warehouse / Mart / Dashboard Design and Development
Daniel Upton
DecisionLab  http://www.decisionlab.net  dupton@decisionlab.net
Direct 760.525.3268  http://blog.decisionlab.net  Carlsbad, California, USA