This is the Data Vault Modeling and Methodology introduction that I presented at a Montreal event in September 2011. It gives an introduction to, and an overview of, the Data Vault components for Business Intelligence and Data Warehousing. I am Dan Linstedt, the author and inventor of the Data Vault Modeling and Methodology.
If you use the images anywhere in your presentations, please credit http://LearnDataVault.com as the source (me).
Thank you kindly,
Daniel Linstedt
2. Agenda
Introduction – why are you here?
What is a Data Vault? Where does it come from?
Star Schema, 3NF, and Data Vault pros and cons AS AN EDW solution
When is a Data Vault a good fit?
Benefits of Data Vault Modeling & Methodology
<BREAK>
When to NOT use a Data Vault
Fundamental Paradigm Shift
Business Keys & Business Processes
Technical Review
Query Performance (PIT & Bridge)
What wasn’t covered in this presentation…
3. A bit about me…
Author, Inventor, Speaker – and part time photographer…
25+ years in the IT industry
Worked in DoD, US Gov’t, Fortune 50, and so on…
Find out more about the Data Vault: http://www.youtube.com/LearnDataVault and http://LearnDataVault.com
Full profile on http://www.LinkedIn.com/dlinstedt
4. Why Are YOU Here?
Your Expectations? Your Questions? Your Background? Areas of Interest?
Biggest question: What are the top 3 pains your current EDW / BI solution is experiencing?
5. What is it? Where did it come from? Defining the Data Vault Space
6. Data Vault Time Line
Mid 60’s: Dimension & Fact Modeling presented by General Mills and Dartmouth University
E.F. Codd invented relational modeling
Early 70’s: Bill Inmon Began Discussing Data Warehousing
Mid 70’s: AC Nielsen Popularized Dimension & Fact Terms
1976: Dr Peter Chen Created E-R Diagramming
Chris Date and Hugh Darwen Maintained and Refined Modeling
Mid 80’s: Bill Inmon Popularizes Data Warehousing
Mid – Late 80’s: Dr Kimball Popularizes Star Schema
Late 80’s: Barry Devlin and Dr Kimball Release “Business Data Warehouse”
1990: Dan Linstedt Begins R&D on Data Vault Modeling
2000: Dan Linstedt releases first 5 articles on Data Vault Modeling
7. Data Vault Modeling… Took 10 years of Research and Design, including TESTING, to become flexible, consistent, and scalable
13. Complete with Best Practices for BI/DW
Business Keys Span / Cross Lines of Business
Functional Area: Sales, Contracts, Planning, Delivery, Finance, Operations, Procurement
18. Hub = List of Unique Business Keys
Link = List of Relationships, Associations
Satellites = Descriptive Data
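The three building blocks can be pictured as three small record types. This is only a minimal sketch and is not taken from the slides: the class names, field names, and sample values (`CUST-1001`, `PROD-77`, `Acme Ltd`) are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Hub:
    """One row in a Hub: a unique business key."""
    sqn: int
    business_key: str
    load_dts: date
    rec_source: str


@dataclass(frozen=True)
class Link:
    """One row in a Link: a relationship between two or more Hub rows."""
    sqn: int
    hub_sqns: tuple
    load_dts: date
    rec_source: str


@dataclass(frozen=True)
class Satellite:
    """One row in a Satellite: descriptive data about a Hub or Link row."""
    parent_sqn: int
    load_dts: date
    rec_source: str
    attributes: dict = field(default_factory=dict)


# Hypothetical sample rows: two business keys, one association, one detail row.
customer = Hub(1, "CUST-1001", date(2011, 9, 1), "SALES")
product = Hub(2, "PROD-77", date(2011, 9, 1), "SALES")
purchase = Link(1, (customer.sqn, product.sqn), date(2011, 9, 1), "SALES")
customer_details = Satellite(customer.sqn, date(2011, 9, 1), "SALES",
                             {"name": "Acme Ltd"})
```

Note that the business key lives only in the Hub, the association only in the Link, and the descriptive attributes only in the Satellite.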
19. Colorized Perspective…
Diagram labels: Data Vault; 3rd NF & Star Schema (separation)
Business Keys = HUB; Associations = LINK; Details = Satellite
The Data Vault uniquely separates the Business Keys (Hubs) from the Associations (Links) and both of these from the Details that describe them and provide context (Satellites).
(Colors Concept Originated By: Hans Hultgren)
20. Star Schemas, 3NF, Data Vault: Pros & Cons
Defining the Data Vault Space
Why NOT use Star Schemas as an EDW?
Why NOT use 3NF as an EDW?
Why NOT use Data Vault as a Data Delivery Model?
21. Star Schema Pros/Cons as an EDW
PROS: Good for multi-dimensional analysis; Subject oriented answers; Excellent for aggregation points; Rapid development / deployment; Great for some historical storage
CONS: Not cross-business functional; Use of junk / helper tables; Trouble with VLDW; Unable to provide integrated enterprise information; Can’t handle ODS or exploration warehouse requirements; Trouble with data explosion in near-real-time environments; Trouble with updates to type 2 dimension primary keys; Trouble with late arriving data in dimensions to support real-time arriving transactions; Not granular enough information to support real-time data integration
22. 3NF Pros/Cons as an EDW
PROS: Many to many linkages; Handle lots of information; Tightly integrated information; Highly structured; Conducive to near-real time loads; Relatively easy to extend
CONS: Time driven PK issues; Parent-child complexities; Cascading change impacts; Difficult to load; Not conducive to BI tools; Not conducive to drill-down; Difficult to architect for an enterprise; Not conducive to spiral/scope controlled implementation; Physical design usually doesn’t follow business processes
23. Data Vault Pros/Cons as an EDW
PROS: Supports near-real time and batch feeds; Supports functional business linking; Extensible / flexible; Provides rapid build / delivery of star schemas; Supports VLDB / VLDW; Designed for EDW; Supports data mining and AI; Provides granular detail; Incrementally built
CONS: Not conducive to OLAP processing; Requires business analysis to be firm; Introduces many join operations
24. Analogy: The Porsche, the SUV and the Big Rig
Which would you use to win a race? Which would you use to move a house? Would you adapt the truck and enter a race with Porsches and expect to win?
25. A Quick Look at Methodology Issues
Business Rule Processing, Lack of Agility, and Future proofing your new solution
34. Re-Engineering
Diagram labels: Business Rules; Data Flow (Mapping); Current Sources: Sales, Finance; Source Join; Customer; Customer Transactions; Customer Purchases; ** NEW SYSTEM ** = IMPACT!!
35. Federated Star Schema Inhibiting Agility
Chart: Effort & Cost (Low to High) over Time, from Start through Data Mart 1, Data Mart 2, and Data Mart 3; Maintenance Cycle Begins.
Changing and adjusting conformed dimensions causes an exponential rise in the cost curve over time.
RESULT: Business builds their own Data Marts!
The main driver for this is the maintenance costs, and re-engineering of the existing system which occurs for each new “federated/conformed” effort. This increases delivery time, difficulty, and maintenance costs.
41. Auditable
The business rules are moved closer to the business, improving IT reaction time, reducing cost and minimizing impacts to the enterprise data warehouse (EDW)
42. NO Re-Engineering
Diagram labels: Current Sources: Sales, Finance; Stage Copy (one per source); Data Vault: Hub Customer, Hub Acct, Hub Product, Link Transaction; Customer; Customer Transactions; Customer Purchases
** NEW SYSTEM **: NO IMPACT!!! NO RE-ENGINEERING! (IMPACT!! is marked outside the Data Vault)
43. Progressive Agility and Responsiveness of IT
Chart: Effort & Cost (Low to High) over Time: Start; Initial DV Build Out; Foundational Base Built; New Functional Areas Added; Maintenance Cycle Begins.
Re-Engineering does NOT occur with a Data Vault Model. This keeps costs down, and maintenance easy. It also reduces complexity of the existing architecture.
44. What’s Wrong With the OLD METHODOLOGY? Using Star Schemas as your Data Warehouse leads to….
45. Dimensionitis
DimensionItis: Incurable Disease, the symptoms are the creation of new dimensions because the cost and time to conform existing dimensions with new attributes rises beyond the business ability to pay…
Business Says: Avoid the re-engineering costs, just “copy” the dimensions and create a new one for OUR department… What can it hurt?
46. Deformed Dimensions
Deformity: The URGE to continue “slamming data” into an existing conformed dimension until it simply cannot sustain any further changes, the result: a deformed dimension and a HUGE re-engineering cost / nightmare.
Business Wants a Change! Business said: Just add that to the existing Dimension, it will be easy right?
Business Change V1, V2, V3: each one needs a Complex Load.
Cost per change: 90 days, $125k; 120 days, $200k; 180 days, $275k
Re-Engineering the Load Processes EACH TIME!
47. Silo Building / IT Non-Agility
Business Says: Take the dimension you have, copy it, and change it… This should be cheap, and easy right?
Business Change: To Modify Existing Star = 180 days, $275k
First Star: Customer_ID, Customer_Name, Customer_Addr, Customer_Addr1, Customer_City, Customer_State, Customer_Zip, Customer_Phone; Fact_ABC, Fact_DEF, Fact_PDQ, Fact_MYFACT
SALES, FINANCE, MARKETING (copied and extended dimensions): Customer_ID, Customer_Name, Customer_Addr, Customer_Addr1, Customer_City, Customer_State, Customer_Zip, Customer_Phone, Customer_Tag, Customer_Score, Customer_Region, Customer_Stats, Customer_Phone, Customer_Type
The departments say:
We built our own because IT costs too much…
We built our own because IT took too long…
We built our own because we needed customized dimension data…
49. What are the top business obstacles in your data warehouse today?
50. Poor Agility; Inconsistent Answer Sets; Needs Accountability; Demands Auditability; Desires IT Transparency. Are you feeling Pinned Down?
51. What are the top technology obstacles in your data warehouse today?
52. Complex Systems; Real-Time Data Arrival; Unimaginable Data Growth; Master Data Alignment; Bad Data Quality; Late Delivery/Over Budget. Are your systems CRUMBLING?
54. Projects Cancelled & Restarted; Re-engineering required to absorb new systems; Complexity drives maintenance cost Sky high; Disparate Silo Solutions provide inaccurate answers! Severe lack of Accountability
57. What is it? It’s a simple Easy-to-use Plan To build your valuable Data Warehouse!
58. What’s the Value? Painless Auditability; Understandable Standards; Rapid Adaptability; Simple Build-out; Uncomplicated Design; Effortless Scalability. Pursue Your Goals!
59. Why Bother With Something New? Old Chinese proverb: 'Unless you change direction, you're apt to end up where you're headed.'
60. What Are the Issues?
This is NOT what you want happening to your project!
Business… Changes Frequently; Needs Accountability; Demands Auditability; Has No Visibility; Wants More Control
IT…. Takes Too Long; Is Over-budget; Too Complex; Can’t Sustain Growth
THE GAP!!
61. What Are the Foundational Keys? Flexibility, Scalability, Productivity
76. Case In Point: Result of scalability was to produce a Data Vault model that scaled to 3 Petabytes in size, and is still growing today!
77. Key: Scalability in Team Size. You should be able to SCALE your TEAM as well! With the Data Vault methodology, you can: Scale your team when desired, at different points in the project!
78. Case In Point: (Dutch Tax Authority) Result of scalability was to increase ETL developers for each new source system, and reassign them when the system was completely loaded to the Data Vault
88. The Competing Bid? The competition bid this with 15 people and 3 months to completion, at a cost of $250k! (they bid a Very complex system) Our total cost? $30k and 2 weeks!
89. Results? Changing the direction of the river takes less effort than stopping the flow of water
90. When NOT to use the Data Vault Model & Methodology
91. When NOT to Use the Data Vault
You have: a small set of point solution requirements; a very short time-frame for delivery; to use the data one-time, then throw it away; a single source system, single source application; a single business analyst in the entire company
You do NOT have: audit requirements forcing you to keep history; multiple data center consolidation efforts; near-real-time to worry about; massive batch data to integrate; external data feeds outside your control; requirements to do trend analysis of all your data; pain that forces you to reengineer every time you ask for a change to your current data warehousing systems
92. Fundamental Paradigm Shift. Exploring differences in the architecture, implementation, and process design.
93. It’s Not Just a Data Model… Model, Methodology, SUCCESS!
94. Different From ANYTHING ELSE!
The Business Rules go after the Data Warehouse! Data is interpreted on the way OUT!
Hold on… We do distinguish between HARD and SOFT business rules…
OK, now tell me WHY this is important?
95. EDW: The Old Way of Loading
Corporate Fraud Accountability: Title XI consists of seven sections. Section 1101 recommends a name for this title as “Corporate Fraud Accountability Act of 2002”. It identifies corporate fraud and records tampering as criminal offenses and joins those offenses to specific penalties. It also revises sentencing guidelines and strengthens their penalties. This enables the SEC to temporarily freeze large or unusual payments.
Diagram labels: Source 1, Source 2, Source 3; Staging; Business Rules Change Data!; HR Mart, Sales Mart, Finance Mart
Are changes to data ON THE WAY IN to the EDW equivalent to records tampering?
96. EDW: The New Compliant Way
Implement a Raw Data Vault Data Warehouse
Move the business rules “downstream”
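A minimal sketch of what "raw on the way in, rules on the way out" could look like. Everything named here is an assumption for illustration: `raw_vault` is a plain list standing in for raw Data Vault tables, and `standardize_state` is a made-up soft business rule. The slides do not prescribe any of this code.

```python
import copy

raw_vault = []  # hypothetical stand-in for raw Data Vault tables


def load_raw(record, rec_source, load_dts):
    """Store the source record exactly as it arrived, plus load metadata."""
    raw_vault.append({
        "payload": copy.deepcopy(record),
        "rec_source": rec_source,
        "load_dts": load_dts,
    })


def standardize_state(payload):
    """A hypothetical soft business rule, applied on the way OUT."""
    out = dict(payload)
    out["state"] = out["state"].strip().upper()
    return out


def deliver(rule):
    """Build a downstream (data mart) view without touching the raw rows."""
    return [rule(row["payload"]) for row in raw_vault]


load_raw({"customer": "C1", "state": " co "}, "SALES", "2011-09-01")
mart_rows = deliver(standardize_state)
```

The raw row keeps its original, untouched value, so it stays auditable, and the rule can be changed later and the mart rebuilt without reloading the source.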
101. Link Structures
Link_Product_Supplier: LPS_SQN, PRODUCT_SQN, SUPPLIER_SQN, LPS_LOAD_DTS, LPS_REC_SOURCE, LPS_ENCR_KEY
Link_Customer_Account_Employee: LCAE_SQN, CUSTOMER_SQN, ACCOUNT_SQN, EMPLOYEE_SQN, LCAE_LOAD_DTS, LCAE_REC_SOURCE
Link Structure:
SEQUENCE
<HUB KEY SQN 1> <HUB KEY SQN 2> <HUB KEY SQN N> (Unique Index)
{LAST SEEN DATE} {CONFIDENCE} {STRENGTH} (Optional, Dynamic Link)
<LOAD DATE>
<RECORD SOURCE>
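As a rough illustration of the Link structure and its unique index, here is a sketch using Python's built-in `sqlite3` module. The column names follow the Link_Product_Supplier layout on the slide (LPS_ENCR_KEY is left out for brevity). The index name `ux_lps_hub_keys`, the `load_link` helper, and the sample values are assumptions, not part of the original material.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE link_product_supplier (
        lps_sqn        INTEGER PRIMARY KEY,   -- surrogate sequence
        product_sqn    INTEGER NOT NULL,      -- hub key SQN 1
        supplier_sqn   INTEGER NOT NULL,      -- hub key SQN 2
        lps_load_dts   TEXT    NOT NULL,
        lps_rec_source TEXT    NOT NULL
    )
""")
# The unique index covers the combination of hub keys, so each
# relationship is stored only once.
conn.execute("""
    CREATE UNIQUE INDEX ux_lps_hub_keys
    ON link_product_supplier (product_sqn, supplier_sqn)
""")


def count_rows():
    return conn.execute(
        "SELECT COUNT(*) FROM link_product_supplier").fetchone()[0]


def load_link(product_sqn, supplier_sqn, load_dts, rec_source):
    """Insert the relationship once; return True if a new row was added."""
    before = count_rows()
    conn.execute(
        "INSERT OR IGNORE INTO link_product_supplier "
        "(product_sqn, supplier_sqn, lps_load_dts, lps_rec_source) "
        "VALUES (?, ?, ?, ?)",
        (product_sqn, supplier_sqn, load_dts, rec_source),
    )
    return count_rows() == before + 1


first = load_link(10, 20, "2011-09-01", "PROCUREMENT")
repeat = load_link(10, 20, "2011-09-02", "PROCUREMENT")  # same hub keys
other = load_link(10, 21, "2011-09-02", "PROCUREMENT")   # new relationship
```

Re-delivering the same product/supplier pair on a later load leaves the table unchanged. Only a genuinely new combination of hub keys adds a row.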
102. Satellites Split By Source System
SAT_FINANCE_CUST: PARENT SEQUENCE, LOAD DATE, <LOAD-END-DATE>, <RECORD-SOURCE>, Contact Name, Contact Email, Contact Phone Number
SAT_CONTRACTS_CUST: PARENT SEQUENCE, LOAD DATE, <LOAD-END-DATE>, <RECORD-SOURCE>, First Name, Last Name, Guardian Full Name, Co-Signer Full Name, Phone Number, Address, City, State/Province, Zip Code
SAT_SALES_CUST: PARENT SEQUENCE, LOAD DATE, <LOAD-END-DATE>, <RECORD-SOURCE>, Name, Phone Number, Best time of day to reach, Do Not Call Flag
Satellite Structure:
PARENT SEQUENCE, LOAD DATE (Primary Key)
<LOAD-END-DATE>
<RECORD-SOURCE>
{user defined descriptive data} {or temporal based timelines}
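To make the Satellite structure concrete, here is a small sketch with Python's built-in `sqlite3`. The primary key of parent sequence plus load date comes from the slide. The table name `sat_cust_contact`, the `add_version` helper, and the choice to close the previous row by setting its load-end-date to the new row's load date are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sat_cust_contact (
        parent_sqn   INTEGER NOT NULL,   -- sequence of the parent Hub row
        load_dts     TEXT    NOT NULL,
        load_end_dts TEXT,               -- NULL marks the current row
        rec_source   TEXT    NOT NULL,
        name         TEXT,
        phone_number TEXT,
        PRIMARY KEY (parent_sqn, load_dts)
    )
""")


def add_version(parent_sqn, load_dts, rec_source, name, phone_number):
    """End-date the current row (if any), then insert the new version."""
    conn.execute(
        "UPDATE sat_cust_contact SET load_end_dts = ? "
        "WHERE parent_sqn = ? AND load_end_dts IS NULL",
        (load_dts, parent_sqn),
    )
    conn.execute(
        "INSERT INTO sat_cust_contact VALUES (?, ?, NULL, ?, ?, ?)",
        (parent_sqn, load_dts, rec_source, name, phone_number),
    )


add_version(1, "2011-01-01", "SALES", "Pat Example", "555-0100")
add_version(1, "2011-06-01", "SALES", "Pat Example", "555-0199")

history = conn.execute(
    "SELECT load_dts, load_end_dts, phone_number FROM sat_cust_contact "
    "WHERE parent_sqn = 1 ORDER BY load_dts"
).fetchall()
```

Each change becomes a new row rather than an overwrite, so the full history of the descriptive data stays queryable.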
104. History Teaches Us…
If we model for ONE relationship in the EDW, we BREAK the others!
The EDW is designed to handle TODAY’S relationship; as soon as history is loaded, it breaks the model!
Today: Portfolio (1) to Customer (M), modeled directly between Hub Portfolio and Hub Customer
5 years from now: Portfolio (M) to Customer (M): breaks the model (X)
10 years ago: Portfolio (M) to Customer (1): breaks the model (X)
This situation forces re-engineering of the model, load routines, and queries!
105. History Teaches Us…
If we model with a LINK table, we can handle ALL the requirements!
Hub Portfolio (1) to (M) LNK Cust-Port (M) to (1) Hub Customer
Today: Portfolio (1) to Customer (M)
5 years from now: Portfolio (M) to Customer (M)
10 years ago: Portfolio (M) to Customer (1)
This design is flexible, handles past, present, and future relationship changes with NO RE-ENGINEERING!
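The point about the Link table can be checked with a few lines of Python. This is a sketch under assumptions: the Link is reduced to a set of (customer, portfolio) key pairs, and `cardinality` is a hypothetical helper that reports the relationship as portfolio-side:customer-side.

```python
from collections import defaultdict


def cardinality(pairs):
    """Describe the relationship as 'portfolio side : customer side'."""
    customers_per_portfolio = defaultdict(set)
    portfolios_per_customer = defaultdict(set)
    for customer, portfolio in pairs:
        customers_per_portfolio[portfolio].add(customer)
        portfolios_per_customer[customer].add(portfolio)
    # Some customer holds several portfolios -> "M" on the portfolio side.
    portfolio_side = "M" if any(
        len(p) > 1 for p in portfolios_per_customer.values()) else "1"
    # Some portfolio has several customers -> "M" on the customer side.
    customer_side = "M" if any(
        len(c) > 1 for c in customers_per_portfolio.values()) else "1"
    return f"{portfolio_side}:{customer_side}"


# Each set holds rows of the same Link table: (customer key, portfolio key).
today = {("C1", "P1"), ("C2", "P1")}          # one portfolio, many customers
ten_years_ago = {("C1", "P1"), ("C1", "P2")}  # many portfolios, one customer
five_years_from_now = today | ten_years_ago   # many to many

shapes = [cardinality(today),
          cardinality(ten_years_ago),
          cardinality(five_years_from_now)]
```

All three shapes fit in the same structure of key pairs. Only the rows differ, which is why no re-engineering is needed when the relationship changes.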
106. Applying the Data Vault to Global DW2.0
Diagram: Hubs, Links, and Satellites spread across regions, labeled: Manufacturing EDW in China; Planning in Brazil; Base EDW Created in Corporate Financials in USA
109. Purpose Of PIT & Bridge
To reduce the number of joins, and to reduce the amount of data being queried for a given range of time. These two together allow “direct table match”, as well as table elimination, to occur in the queries.
These tables are not necessary for the entire model; only when:
Massive amounts of data are found
Large numbers of Satellites surround a Hub or Link
Large query across multiple Hubs & Links is necessary
Real-time-data is flowing in, uninterrupted
What are they? Snapshot tables – specifically built for query speed
110. PIT Table Architecture
Satellite: Point In Time
Primary Key: PARENT SEQUENCE, LOAD DATE
{Satellite 1 Load Date} {Satellite 2 Load Date} {Satellite 3 Load Date} {…} {Satellite N Load Date}
Diagram labels: Hub Customer, Hub Order, Hub Product, each with Satellites (Sat 1 to Sat 4) and a PIT Sat; Link Line Item; Satellite Line Item
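One way to read the PIT structure: for a given parent sequence and snapshot date, the PIT row records which load date of each surrounding Satellite was in effect. The sketch below assumes ISO-formatted date strings (so plain string comparison orders them correctly) and a hypothetical `pit_row` helper. It is an illustration, not the author's load routine.

```python
from bisect import bisect_right


def pit_row(parent_sqn, snapshot_dts, satellites):
    """Build one PIT row: the latest load date per satellite at snapshot time.

    `satellites` maps a satellite name to {parent_sqn: [load dates]}.
    """
    row = {"parent_sqn": parent_sqn, "load_dts": snapshot_dts}
    for sat_name, load_dates_by_parent in satellites.items():
        dates = sorted(load_dates_by_parent.get(parent_sqn, []))
        # Index of the first load date strictly after the snapshot.
        i = bisect_right(dates, snapshot_dts)
        row[sat_name + "_load_dts"] = dates[i - 1] if i else None
    return row


# Hypothetical satellite load dates for the hub row with sequence 1.
satellites = {
    "sat1": {1: ["2011-01-01", "2011-03-01"]},
    "sat2": {1: ["2011-02-15"]},
}

february = pit_row(1, "2011-02-01", satellites)
march = pit_row(1, "2011-03-01", satellites)
```

With the load dates resolved ahead of time, a query can join each Satellite on an exact (parent sequence, load date) match instead of searching date ranges.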
112. Bridge Table Architecture
Satellite: Bridge
Primary Key: UNIQUE SEQUENCE, LOAD DATE
{Hub 1 Sequence #} {Hub 2 Sequence #} {Hub 3 Sequence #}
{Link 1 Sequence #} {Link 2 Sequence #} {…} {Link N Sequence #}
{Hub 1 Business Key} {Hub 2 Business Key} {…} {Hub N Business Key}
Diagram labels: Hub Parts, Hub Seller, Hub Product, two Links, Satellites (Sat 1 to Sat 4), Bridge
113. Bridge Table Data Example
Bridge Table: Seller by Product by Part (LOAD_DTS is the Snapshot Date)
SQN  LOAD_DTS    SELL_SQN  SELL_ID  PROD_SQN  PROD_NUM    PART_SQN  PART_NUM
1    08-01-2000  15        NY*1     2756      ABC-123-9K  525       JK*2*4
2    09-01-2000  16        CO*2     42654     DEF-847-0L  324       MN*5-2
3    10-01-2000  16        CO*2     482374    PPA-252-2A  9938      DD*2*3
4    11-01-2000  24        AZ*2     525222    UIF-525-88  7         UF*9*0
5    12-01-2000  99        NM*5     81        DAN-347-7F  16        KI*9-2
6    01-01-2001  99        NM*5     81        DAN-347-7F  24        DL*0-5
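A Bridge row like the first one in the example could be produced by walking the Links once and storing the result. The sketch below is an assumption-heavy illustration: the dictionaries stand in for Hub tables, the two lists stand in for a seller-to-product Link and a product-to-part Link (the slides do not name the Links), and `build_bridge` is a hypothetical helper.

```python
# Hub stand-ins: sequence number -> business key (values from example row 1).
hub_seller = {15: "NY*1"}
hub_product = {2756: "ABC-123-9K"}
hub_part = {525: "JK*2*4"}

# Link stand-ins (assumed shape): pairs of hub sequence numbers.
link_seller_product = [(15, 2756)]
link_product_part = [(2756, 525)]


def build_bridge(snapshot_dts):
    """Pre-join the two Links and copy the business keys into each row."""
    rows = []
    for sell_sqn, prod_sqn in link_seller_product:
        for linked_prod_sqn, part_sqn in link_product_part:
            if linked_prod_sqn != prod_sqn:
                continue  # this part belongs to a different product
            rows.append({
                "sqn": len(rows) + 1,
                "load_dts": snapshot_dts,
                "sell_sqn": sell_sqn,
                "sell_id": hub_seller[sell_sqn],
                "prod_sqn": prod_sqn,
                "prod_num": hub_product[prod_sqn],
                "part_sqn": part_sqn,
                "part_num": hub_part[part_sqn],
            })
    return rows


bridge = build_bridge("08-01-2000")
```

Queries that need seller, product, and part together can then read the snapshot directly instead of joining three Hubs and two Links each time.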
114. What WASN’T Covered
ETL Automation
ETL Implementation
SQL Query Logic
Balanced MPP design
Data Vault Modeling on Appliances
Deep Dive on Structures (Hubs, Links, Satellites)
What happens when you break the rules?
Project management, Risk management & mitigation, methodology & approach
Automation: Automated DV modeling, Automated ETL production
Change Management
Temporal Data Modeling Concerns…
And so on…
117. The Experts Say…
“The Data Vault is the optimal choice for modeling the EDW in the DW 2.0 framework.” Bill Inmon
“The Data Vault is foundationally strong and exceptionally scalable architecture.” Stephen Brobst
“The Data Vault is a technique which some industry experts have predicted may spark a revolution as the next big thing in data modeling for enterprise warehousing....” Doug Laney
118. More Notables…
“This enables organizations to take control of their data warehousing destiny, supporting better and more relevant data warehouses in less time than before.” Howard Dresner
“[The Data Vault] captures a practical body of knowledge for data warehouse development which both agile and traditional practitioners will benefit from..” Scott Ambler
119. Where To Learn More
The Technical Modeling Book: http://LearnDataVault.com
The Discussion Forums & events: http://LinkedIn.com – Data Vault Discussions
Contact me: http://DanLinstedt.com (web site), DanLinstedt@gmail.com (email)
World wide User Group (Free): http://dvusergroup.com
Editor's notes
Before we begin exploring how the Data Vault can help you, or even defining what a Data Vault is, we need to first understand some of the business problems that may be causing you heartburn on a daily basis.
Everything from poor agility to a lack of IT Transparency plagues today’s data warehouses. I can’t begin to tell you how much pain these businesses are suffering as a result of these problems. Inconsistent answer sets, lack of accountability, and inadequate auditability all play a part in data warehouses that are currently on the brink of falling apart. But it’s not just business issues; there are technical ones to cope with as well.
There are always technology obstacles that we face in any data warehousing project. So the question is: what kinds of problems have you seen in your journey? Do they haunt you today?
Complexity drives high cost, resulting in unnecessary late delivery schedules and unsustainable business logic in the integration channels. Real-time data is flooding our data warehouses; has your architecture fallen down on the job? Unstructured data and legal requirements for auditability are bringing huge data volumes. Master Data Alignment is missing from our data warehouses, as they are split in disparate systems all over the world. Bad data quality is covered up through the transformation layers on the way IN to your EDW. Data warehouses grow so large and become so difficult to maintain that IT teams are often delivering late, and beyond original costs. The foundations of your data warehouse are probably crumbling under sheer weight and pressure.
Disparate data marts, unmatched answer sets, geographical problems, and worse… Projects are under fire from a number of areas. Let’s take a look at what happens when a data warehouse project reaches the brick wall head-on, at 90 miles an hour.
I think this says it all…. Projects cancelled and restarted, Re-Engineering required to absorb changes, high complexity making it difficult to upgrade, change, and keep up at the speed of business. Disparate silo solutions screaming for consolidation, and of course a lack of accountability on BOTH sides of the fence… All signs of an ailing BI solution on the brink of being shut down.
We have got to keep focus on the prize. Business still wants a BI system backed by an enterprise EDW. IT still wants a manageable system that will grow and change without major re-engineering. There is a better way, and I can help you with it.
The Data Vault model is really just another name for “Common foundational architecture and design”. It’s based on 10 years of research and design work, followed by 10 years of implementation best practices. It is architected to help you solve the problems!
Put quite simply: It’s an easy-to-use architecture and plan, a guide-book for building a repeatable, consistent, and scalable data warehouse system. So just what is the value of the Data Vault?
The Data Vault model and methodology provide: Painless Auditability, Understandable Standards, Rapid Adaptability, Simple Build-out, Uncomplicated Design, and Effortless Scalability. Go after your goals, build a wildly successful data warehouse just like I have.
Beginning: 5 advanced ETL developers. By the 1st month, they had 5 advanced and 15 basic/intro. By the 6th month, they had 5 advanced but 50 basic. By the end of the 8th month they went to production with 10 MF sources, and their team size was 12 people (5 advanced, 7 basic for support).
You’re not the first, nor will you be the last one to use it. Some of the world’s biggest companies are implementing Data Vaults, from Daimler Motors to Lockheed Martin to the Department of Defense. JPMorgan and Chase used the Data Vault model to merge 3 companies in 90 days!