DATA VAULT 2.0: Big Data Meets Data Warehousing
DEAN HALLMAN
WIRESOFT, LLC
DATA WAREHOUSING VS BIG DATA
• Does Big Data replace Data Warehousing? Or do I need both?
• What’s the difference:
• Between the data flowing into a data warehouse vs big data tools?
• Between the ingestion processes and infrastructure?
• Data Lakes arrived with Big Data, so are they useful in Data Warehousing?
• How should I model my data in EDW?
• 3NF, Star Schema, same as my operational data stores?
• Data Vault 2.0
• Graph Databases
• What architecture allows both to coexist effectively?
[Figure: reference architecture diagram. Labels: Clients, Internet, External Data Sources, Data Source Landing; Impressions (Big Data), clickstream (SerDe); Core Business Services (x3), Operational Data Stores; Data Lake (Schema-on-Read); Big Data Toolchain: Batch (SerDe), Streaming (Kafka), Streaming Analytics, Batch Analytics (Hadoop); Enterprise Data Warehouse (Schema-on-Write): CDC/snapshot, Staging Vault, Raw Vault, Business Vault, Information Mart; ETL, ELT; BI Tools, Monitoring, Discovery, Audit.]
THE DATA MODEL
DATA VAULT 2.0
COMMON FOUNDATIONAL WAREHOUSE ARCHITECTURE
• “The Data Vault Model is a detail oriented, historical tracking and uniquely linked
set of normalized tables that support one or more functional areas of business. It is a
hybrid approach encompassing the best of breed between 3rd normal form (3NF)
and star schema. The design is flexible, scalable, consistent and adaptable to the
needs of the enterprise” -- Dan Linstedt, Creator of Data Vault
• Data loaded as-is from sources, no edits or cleanup
• Append-only to afford highest performance
• Agile & agnostic to changes in the operational store’s data model
• Essentially, a prescription for Layered Graph to Relational Mapping
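One way to read "Layered Graph to Relational Mapping" is that graph nodes become hub tables, edges become link tables, and node or edge properties become satellite tables that are only ever appended to. The sketch below illustrates that reading with an in-memory SQLite database. The table names, column names, and the `load_aircraft` helper are hypothetical and chosen only for this illustration; the sample values come from the aircraft example later in the deck.

```python
import sqlite3

# Hypothetical table and column names, chosen only for this illustration.
DDL = """
CREATE TABLE hub_aircraft (tailno TEXT PRIMARY KEY, load_date TEXT, record_source TEXT);
CREATE TABLE hub_airline (name TEXT PRIMARY KEY, load_date TEXT, record_source TEXT);
CREATE TABLE link_serviced_by (
    tailno TEXT, airline TEXT, load_date TEXT, record_source TEXT,
    PRIMARY KEY (tailno, airline));
CREATE TABLE sat_aircraft_base (
    tailno TEXT, load_date TEXT, record_source TEXT, model TEXT,
    PRIMARY KEY (tailno, load_date));
"""


def build_vault():
    """Create an empty in-memory vault with two hubs, one link, one satellite."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    return conn


def load_aircraft(conn, tailno, model, airline, load_date, source):
    """Append one source row; INSERT OR IGNORE never updates existing rows."""
    conn.execute("INSERT OR IGNORE INTO hub_aircraft VALUES (?, ?, ?)",
                 (tailno, load_date, source))
    conn.execute("INSERT OR IGNORE INTO hub_airline VALUES (?, ?, ?)",
                 (airline, load_date, source))
    conn.execute("INSERT OR IGNORE INTO link_serviced_by VALUES (?, ?, ?, ?)",
                 (tailno, airline, load_date, source))
    conn.execute("INSERT OR IGNORE INTO sat_aircraft_base VALUES (?, ?, ?, ?)",
                 (tailno, load_date, source, model))


def count(conn, table):
    """Return the number of rows in one of the vault tables."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

Replaying the same source row leaves every table unchanged, while a later load date adds a new satellite row and leaves the hub and link rows untouched. That is the append-only behaviour the bullets describe, at least in this simplified form.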
DATA WAREHOUSING & DATA VAULT 2.0
• 60’s, 70’s, 80’s:
• E.F. Codd => 3NF
• Bill Inmon invents Data Warehousing concept
• Dr. Ralph Kimball popularizes Star Schema design
• 90’s, 00’s:
• Dan Linstedt creates Data Vault Model @ DOD
• 2014:
• Dan Linstedt introduces Data Vault 2.0
Source: “What are Graph Databases and Why should I care?“, by Dave Bechberger of Expero
SOLVE BY STAR SCHEMA ?
RELATIONAL VS GRAPH DATABASES
• Enterprise Grade
• Well-worn path
• SQL has been relatively stagnant vs programming languages
GRAPH DATA MODEL
Source: https://neo4j.com/developer/graph-database/
GRAPH DATABASE VS DATA VAULT
Flight
Tabs: Base | Dest | Forecast
Record Source   LoadDate     Depart   Gate
LGA             2018-10-11   1:25PM   B27
CAE             2018-10-24   3:30PM   A14
SFO             2018-09-06   8:55PM   G19
RDU             2018-08-12   4:45PM   C22

SERVICED_BY
Record Source: Airport CAE
Load Date: 2018-11-17
Source Id: 20181117-32-983

Aircraft
Tabs: Base | Service | FAA | NTSB
Record Source   LoadDate     Model   Tailno
United          2017-02-11   767     1477
Delta           2015-11-04   A6      2381
Alaska          2013-08-28   747     8312
Frontier        2016-07-19   182     1438
Record Source: United Airlines
Load Date: 2018-01-17
Source Id: 2412c

SERVICED_BY
Tabs: Base | Dest | Manifest
Record Source   LoadDate     Begin        End
United          2017-02-11   2017-04-23   2017-09-23
Delta           2015-11-04   2015-12-01   2017-04-22
Alaska          2013-08-28   2013-09-14   2016-05-04
Frontier        2016-07-19   2016-08-02   2018-04-11
Record Source: United Airlines
Load Date: 2018-09-17

Legend: Hubs, Links, Satellites, Tab
• Organizations which design systems ...
are constrained to produce designs
which are copies of the communication
structures of these organizations
- Mel Conway
FLIGHT
Tabs: Base | Dest | Forecast
Record Source   LoadDate     Depart   Gate
LGA             2018-10-11   1:25PM   B27
CAE             2018-10-24   3:30PM   A14

FLIGHT
Record Source: Airport CAE
Load Date: 2018-11-17
Source Id: 20181117-32-983

Aircraft
Tabs: Base | Service | FAA | NTSB
Record Source   LoadDate     Model   Tailno
United          2017-02-11   767     1477
Delta           2015-11-04   A6      2381
Alaska          2013-08-28   747     8312
Frontier        2016-07-19   182     1438
Record Source: United Airlines
Load Date: 2018-01-17
Source Id: 2412c

Airport
Tabs: Base | Dest | Manifest
Record Source   LoadDate     Begin        End
United          2017-02-11   2017-04-23   2017-09-23
Delta           2015-11-04   2015-12-01   2017-04-22
Alaska          2013-08-28   2013-09-14   2016-05-04
Frontier        2016-07-19   2016-08-02   2018-04-11
Record Source: United Airlines
Load Date: 2018-09-17

Airline
Tabs: Base | Service | FAA | NTSB
Record Source   LoadDate     Model   Tailno
United          2017-02-11   767     1477
Delta           2015-11-04   A6      2381
Record Source: United Airlines
Load Date: 2018-01-17
Source Id: 2412c

Legend: Hubs, Links, Satellites, Tab
Source: https://www.wherescape.com/solutions/project-types/data-vault-automation/
• Modeled after self-organizing networks
• A Business Key identifies a key concept in the business.
• Business keys have a business meaning
• They are unique and have a very low propensity to change
• Business keys change only when the business changes
• Enables (forces) cross-source modeling
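Because a business key is stable and shared across systems, Data Vault 2.0 loads commonly derive a hub's key by hashing the business key, so that two sources describing the same thing arrive at the same hub row. The sketch below shows the idea. The normalization rules (trim, upper-case, `||` delimiter), the choice of MD5, and the function name `hub_hash_key` are assumptions made for illustration, not rules taken from the slides.

```python
import hashlib


def hub_hash_key(*business_key_parts, delimiter="||"):
    """Derive a deterministic hub key from one or more business key parts.

    Assumed normalization: cast to text, trim whitespace, upper-case, then
    join composite keys with a delimiter before hashing.
    """
    normalized = delimiter.join(
        str(part).strip().upper() for part in business_key_parts
    )
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()
```

With this scheme, a tail number reported as `"1477"` by one source and `" 1477 "` by another hashes to the same key, which is what lets satellites from different sources hang off a single hub.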
Source: http://www.di.univr.it/documenti/OccorrenzaIns/matdid/matdid232240.pdf
DATA VAULT 2.0 MODELING:
HUBS, LINKS & SATELLITES
@wiresoft/Pathfinder
THE DATA
Impressions vs Business Data
ENTERPRISE DATA SILOS
Small Data | Large Data | Big Data
Describes the user base
Describes the Enterprise
Describes the Product
Instance Grain
Transaction Grain
Audit Grain
Impression Grain
Big Data
Enterprise Data Warehouse
Operational Data Stores
Impression Analytics
Business Analytics
External Data Sources
DATA GRANULARITY FUNNEL
DATA INGESTION
ETL vs ELT vs SerDe
• Beware the Turing tar-pit, in which
everything is possible, but nothing
of interest is easy
- Alan Perlis
DATA CLASSIFICATION MATRIX: DECLARATIVE VS INTERPRETIVE
[Figure: matrix with columns Declarative and Interpretive. Labels: Hadoop, RDBMS, Web Events, Media Player.]
DATA WAREHOUSING
• Deep Topic
• 60’s, 70’s, 80’s:
• E.F. Codd => 3NF
• Bill Inmon invents Data Warehousing concept
• Dr. Ralph Kimball popularizes Star Schema design
• 90’s, 00’s:
• Dan Linstedt creates Data Vault Model @ DOD
• 2014:
• Dan Linstedt introduces Data Vault 2.0
• Data Warehouse vs Operational Data Stores
• Data Warehouse as Version Control System
• MapReduce, 2004, Google, by Jeffrey Dean and Sanjay Ghemawat: “MapReduce: Simplified Data Processing on Large Clusters”; GFS
• Nutch 2005, Hadoop 2006, 2007 - Doug Cutting
• What exactly is “Big Data”?
BIG DATA
UNSTRUCTURED USER EXPERIENCE
[Figure: diagram with labels Client, User, Interpreter, Analysis; annotated "lossy".]
STRUCTURED USER EXPERIENCE
[Figure: diagram with labels Client, User, Time Series Event Record, Analysis; annotated "lossless".]
ETL OR SERDE?
[Figure: diagram with labels User, Client, Serializer, Eventlog.e, Internet, Kafka/Kinesis, S3, Hadoop, Deserializer, Eventlog.d, Time Series Event Record, Analysis, Single Source (Version Locked).]
ETL vs ELT (SerDe)
Source: https://www.ironsidegroup.com/2015/03/01/etl-vs-elt-whats-the-big-difference/
Schema-on-Write
Schema-on-Read
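The contrast between the two can be sketched in a few lines. With schema-on-write, a record is checked against a declared schema before it is stored. With schema-on-read, raw text is stored as-is and a schema is applied only when the data is queried. The field names (`user_id`, `event`) and both helper functions are hypothetical and serve only to illustrate the distinction.

```python
import json

# Hypothetical schema for an event record: field name -> required type.
REQUIRED = {"user_id": int, "event": str}


def write_validated(store, record):
    """Schema-on-write: reject records that do not fit before storing them."""
    for field, expected_type in REQUIRED.items():
        if not isinstance(record.get(field), expected_type):
            raise ValueError(f"field {field!r} missing or not {expected_type.__name__}")
    store.append(record)


def read_with_schema(raw_lines):
    """Schema-on-read: raw lines were stored unchecked; interpret them now."""
    for line in raw_lines:
        try:
            doc = json.loads(line)
        except json.JSONDecodeError:
            continue  # unparseable lines are skipped at read time
        if isinstance(doc, dict) and "event" in doc:
            yield {"user_id": doc.get("user_id"), "event": doc["event"]}
```

In the first function, bad data fails loudly at load time. In the second, ingestion never fails, but every reader has to decide what to do with lines that do not fit.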
OTHER CHALLENGES
• Satellites must be loaded chronologically
• Time-based scheduling vs data-availability scheduling
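The chronological-loading constraint follows from how satellite change detection typically works: a new row is appended only if its attributes differ from the most recent row for the same key, so "most recent" must really be the latest load. The sketch below assumes a hash-of-attributes comparison (often called a hash diff) and an in-memory dict standing in for the satellite table. The key `FLIGHT-1` used in the usage note and the function names are hypothetical.

```python
import hashlib


def hash_diff(attrs):
    """Hash the attribute values in a stable order to detect changes cheaply."""
    payload = "||".join(f"{name}={attrs[name]}" for name in sorted(attrs))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def load_satellite(satellite, incoming):
    """Append changed rows to `satellite` in load-date order.

    satellite: dict mapping key -> list of (load_date, hash_diff, attrs)
    incoming:  iterable of (key, load_date, attrs); ISO dates sort as text
    """
    for key, load_date, attrs in sorted(incoming, key=lambda row: row[1]):
        history = satellite.setdefault(key, [])
        if history and history[-1][0] >= load_date:
            # An older batch arriving late would corrupt the change history.
            raise ValueError(f"out-of-order load for {key}: {load_date}")
        diff = hash_diff(attrs)
        if not history or history[-1][1] != diff:
            history.append((load_date, diff, attrs))
    return satellite
```

For example, loading gate `B27` on 2018-10-11 and gate `A14` on 2018-10-24 for `FLIGHT-1` yields two rows, a repeat of `A14` on a later date adds nothing, and a batch dated before the latest stored row is rejected. Scheduling loads on data availability rather than on a clock is one way to avoid hitting that rejection.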
QUESTIONS?
• Contact:
• Dean Hallman
• rdhallman@gmail.com
• LinkedIn: https://www.linkedin.com/in/dean-hallman/

Data Vault 2.0: Big Data Meets Data Warehousing

  • 1. DATA VAULT 2.0: Big Data Meets Data Warehousing DEAN HALLMAN WIRESOFT, LLC
  • 2. DATA WAREHOUSING VS BIG DATA • Does Big Data replace Data Warehousing? Or do I need both? • What’s the difference: • Between the data flowing into a data warehouse vs big data tools? • Between the ingestion processes and infrastructure? • Data Lakes arrived with Big Data, so are they useful in Data Warehousing? • How should I model my data in EDW? • 3NF, Star Schema, same as my operational data stores? • Data Vault 2.0 • Graph Databases • What is an architecture that allows both to co-exists effectively?
  • 3. Impressions (Big Data) Core Business Services Core Business Services Core Business Services Operational Data Stores D A T A L A K E Enterprise Data Warehouse CDC, snapshot Internet External Data Sources Big Data Toolchain Batch (SerDe) StagingVault RawVault BusinessVault InformationMart Streaming (Kafka) Streaming Analytics Batch Analytics (Hadoop) Schema-on-Read Schema-on-Write Data Source Landing Clients ETL ELT 𝛾 𝛾 𝛾 𝛾 BI Tools Monitoring Discovery Audit clickstream (SerDe) ETL ETL
  • 4. Impressions (Big Data) Core Business Services Core Business Services Core Business Services Operational Data Stores D A T A L A K E Enterprise Data Warehouse CDC, snapshot Internet External Data Sources Big Data Toolchain Batch (SerDe) StagingVault RawVault BusinessVault InformationMart Streaming (Kafka) Streaming Analytics Batch Analytics (Hadoop) Schema-on-Read Schema-on-Write Data Source Landing Clients ETL ELT 𝛾 𝛾 𝛾 𝛾 BI Tools Monitoring Discovery Audit clickstream (SerDe) ETL ETL THE DATA MODEL
  • 5. DATA VAULT 2.0 COMMON FOUNDATIONAL WAREHOUSE ARCHITECTURE • “The Data Vault Model is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise” -- Dan Linstedt, Creator of Data Vault • Data loaded as-is from sources, no edits or cleanup • Append-only to afford highest performance • Agile & agnostic to changes in the operational store’s data model • Essentially, a prescription for Layered Graph to Relational Mapping
  • 6. DATA WAREHOUSING & DATA VAULT 2.0 • 60’s, 70’s, 80’s • E.F. Codd => 3NF • Bill Inmon invents Data Warehousing concept • Dr. Ralph Kimball popularizes Star Schema design • 90’s, 00’s: • Dan Linstedt creates Data Vault Model @ DOD • 2014: • Dan Introduces Data Vault 2.0
  • 7.
  • 8. Source: “What are Graph Databases and Why should I care?”, by Dave Bechberger of Expero
  • 9. SOLVE BY STAR SCHEMA ?
  • 10. RELATIONAL VS GRAPH DATABASES • Enterprise Grade • Well-worn path • SQL has been relatively stagnant vs programming languages
  • 11. GRAPH DATA MODEL Source: https://neo4j.com/developer/graph-database/
  • 12. GRAPH DATABASE VS DATA VAULT
  • 13. GRAPH DATABASE VS DATA VAULT
  • 14. Flight: Base, Dest, Forecast | Record Source, LoadDate, Depart, Gate: LGA, 2018-10-11, 1:25PM, B27; CAE, 2018-10-24, 3:30PM, A14; SFO, 2018-09-06, 8:55PM, G19; RDU, 2018-08-12, 4:45PM, C22 | SERVICED_BY | Record Source: Airport CAE; Load Date: 2018-11-17; Source Id: 20181117-32-983 | Aircraft: Base, Service, FAA, NTSB | Record Source, LoadDate, Model, Tailno: United, 2017-02-11, 767, 1477; Delta, 2015-11-04, A6, 2381; Alaska, 2013-08-28, 747, 8312; Frontier, 2016-07-19, 182, 1438 | Record Source: United Airlines; Load Date: 2018-01-17; Source Id: 2412c | SERVICED_BY: Base, Dest, Manifest | Record Source, LoadDate, Begin, End: United, 2017-02-11, 2017-04-23, 2017-09-23; Delta, 2015-11-04, 2015-12-01, 2017-04-22; Alaska, 2013-08-28, 2013-09-14, 2016-05-04; Frontier, 2016-07-19, 2016-08-02, 2018-04-11 | Record Source: United Airlines; Load Date: 2018-09-17 | Legend: Hubs, Links, Satellites, Tab
  • 15. • Organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations - Mel Conway
  • 16. FLIGHT: Base, Dest, Forecast | Record Source, LoadDate, Depart, Gate: LGA, 2018-10-11, 1:25PM, B27; CAE, 2018-10-24, 3:30PM, A14 | FLIGHT | Record Source: Airport CAE; Load Date: 2018-11-17; Source Id: 20181117-32-983 | Aircraft: Base, Service, FAA, NTSB | Record Source, LoadDate, Model, Tailno: United, 2017-02-11, 767, 1477; Delta, 2015-11-04, A6, 2381; Alaska, 2013-08-28, 747, 8312; Frontier, 2016-07-19, 182, 1438 | Record Source: United Airlines; Load Date: 2018-01-17; Source Id: 2412c | Airport: Base, Dest, Manifest | Record Source, LoadDate, Begin, End: United, 2017-02-11, 2017-04-23, 2017-09-23; Delta, 2015-11-04, 2015-12-01, 2017-04-22; Alaska, 2013-08-28, 2013-09-14, 2016-05-04; Frontier, 2016-07-19, 2016-08-02, 2018-04-11 | Record Source: United Airlines; Load Date: 2018-09-17 | Airline: Base, Service, FAA, NTSB | Record Source, LoadDate, Model, Tailno: United, 2017-02-11, 767, 1477; Delta, 2015-11-04, A6, 2381 | Record Source: United Airlines; Load Date: 2018-01-17; Source Id: 2412c | Legend: Hubs, Links, Satellites, Tab
  • 18. • Modeled after self-organizing networks • A Business Key identifies a key concept in the business. • Business keys have a business meaning • They are unique and have a very low propensity to change • Business keys change only when the business changes • Enables (forces) cross-source modeling Source: http://www.di.univr.it/documenti/OccorrenzaIns/matdid/matdid232240.pdf
  • 19.
  • 22. DATA VAULT 2.0 MODELING: HUBS, LINKS & SATELLITES
  • 24. [Architecture diagram repeated from slide 3] Section title: THE DATA: Impressions vs Business Data
  • 25. ENTERPRISE DATA SILOS • Small Data • Large Data • Big Data • Describes the user base • Describes the Enterprise • Describes the Product
  • 26. Instance Grain Transaction Grain Audit Grain Impression Grain Big Data Enterprise Data Warehouse Operational Data Stores Impression Analytics Business Analytics External Data Sources DATA GRANULARITY FUNNEL
  • 27. [Architecture diagram repeated from slide 3] Section title: DATA INGESTION: ETL vs ELT vs SerDe
  • 28. ETL VS ELT VS SerDe • Beware the Turing tar-pit, in which everything is possible, but nothing of interest is easy - Alan Perlis
  • 29. DATA CLASSIFICATION MATRIX: DECLARATIVE VS INTERPRETIVE • Declarative • Interpretive • Hadoop • RDBMS • Web Events • Media Player
  • 30. DATA WAREHOUSING • Deep Topic • 60’s, 70’s, 80’s • E.F. Codd => 3NF • Bill Inmon invents Data Warehousing concept • Dr. Ralph Kimball popularizes Star Schema design • 90’s, 00’s: • Dan Linstedt creates Data Vault Model @ DOD • 2014: • Dan introduces Data Vault 2.0 • Data Warehouse vs Operational Data Stores • Data Warehouse as Version Control System BIG DATA • MapReduce, 2004, Google, by Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, GFS • Nutch 2005, Hadoop 2006, 2007 - Doug Cutting • What exactly is “Big Data”?
  • 33. ETL OR SERDE? [Diagram. Labeled components: Client; User; Internet; Serializer; Kafka/Kinesis; Eventlog.e; Eventlog.d; Deserializer; S3; Hadoop; Time Series; Event Record; Analysis; Single Source (Version Locked); annotations L e, L d, L m, L p]
  • 35. OTHER CHALLENGES • Satellites must be loaded chronologically • Time-based scheduling vs data-availability scheduling
  • 36. QUESTIONS? • Contact: • Dean Hallman • rdhallman@gmail.com • LinkedIn: https://www.linkedin.com/in/dean-hallman/
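
The hub, link and satellite structures from slide 22, together with the append-only rule on slide 5 and the chronological-loading constraint on slide 35, can be sketched in Python as follows. This is a minimal illustration, not the deck's own implementation: the class names, the `hash_key` helper (MD5 over normalized business keys) and the change-detection rule are assumptions. The sample values are borrowed from the slide 14 example, while the later load dates and the `767-300` value are made up for the demonstration.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


def hash_key(*business_keys: str) -> str:
    """Derive a deterministic key from one or more business keys (assumed scheme)."""
    normalized = "|".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class HubRow:
    """One business key, e.g. an aircraft tail number."""
    hub_key: str
    business_key: str
    record_source: str
    load_date: date


@dataclass(frozen=True)
class LinkRow:
    """A relationship between two or more hubs, e.g. SERVICED_BY."""
    link_key: str
    hub_keys: tuple
    record_source: str
    load_date: date


@dataclass(frozen=True)
class SatelliteRow:
    """Descriptive attributes of a hub or link as of a load date."""
    parent_key: str
    load_date: date
    record_source: str
    attributes: tuple  # (name, value) pairs


def load_satellite(existing, incoming):
    """Append incoming rows in load-date order; never update or delete."""
    result = list(existing)
    latest = {}
    for row in sorted(result, key=lambda r: r.load_date):
        latest[row.parent_key] = row
    for row in sorted(incoming, key=lambda r: r.load_date):
        current = latest.get(row.parent_key)
        if current is not None and row.load_date <= current.load_date:
            raise ValueError("satellite rows must be loaded chronologically")
        # Only a change in attributes produces a new satellite row.
        if current is None or current.attributes != row.attributes:
            result.append(row)
            latest[row.parent_key] = row
    return result


aircraft = HubRow(hash_key("1477"), "1477", "United", date(2017, 2, 11))
airport = HubRow(hash_key("CAE"), "CAE", "Airport CAE", date(2018, 11, 17))
serviced_by = LinkRow(
    hash_key("1477", "CAE"),
    (aircraft.hub_key, airport.hub_key),
    "United Airlines",
    date(2018, 1, 17),
)

v1 = SatelliteRow(aircraft.hub_key, date(2017, 2, 11), "United", (("model", "767"),))
v2 = SatelliteRow(aircraft.hub_key, date(2017, 3, 1), "United", (("model", "767"),))
v3 = SatelliteRow(aircraft.hub_key, date(2017, 4, 1), "United", (("model", "767-300"),))

# Rows may arrive out of order within a batch; the loader sorts them first.
history = load_satellite([], [v3, v1, v2])
```

Here `history` keeps `v1` and `v3` and skips `v2`, because `v2` carries no attribute change. Feeding `v2` again after `v3` has been loaded raises an error, which is one way to surface the scheduling problem slide 35 raises: a late-arriving batch cannot simply be appended.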

Editor's notes

  1. No single answer, but convention over configuration has won the day. Data Warehousing: 60’s, 70’s, 80’s: E.F. Codd => 3NF; Bill Inmon invents Data Warehousing concept; Dr. Ralph Kimball popularizes Star Schema design. 90’s, 00’s: Dan Linstedt creates Data Vault Model @ DOD. 2014: Dan introduces Data Vault 2.0. Data Warehouse vs Operational Data Stores. Data Warehouse as Version Control System. Big Data: MapReduce, 2004, Google, by Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, GFS. Nutch 2005, Hadoop 2006, 2007 - Doug Cutting. What exactly is “Big Data”?
  2. * 30 seconds * Slow down, cover this slide thoroughly * Lambda architecture
  3. * 30 seconds * Slow down, cover this slide thoroughly * Lambda architecture
  4. Data Warehouse vs Operational Data Stores Data Warehouse as Version Control System
  5. Same SQL abstraction level
  6. Graph databases - single property namespace - could impose naming convention as a namespace
  7. Hubs work to counter Conway’s law
  8. 3-way relationship - hypergraph / hyperedge
  9. Provide context and instance data for the hub-link relationship
  10. Encode the relationship between the nouns of your system
  11. * 30 seconds * Slow down, cover this slide thoroughly * Lambda architecture
  12. * 30 seconds * Slow down, cover this slide thoroughly * Lambda architecture
  13. Too close to the forest, forget to see the trees. Is the business intelligence scattered out in the field or centralized in the back office? Actors in the system are intelligent? Learn language, conjugate verbs, form new sentences.
  14. Serializer/Deserializer: Reusable package to be imported into a Lambda. Test suite that ensures Serializer / Deserializer agree on before/after result.
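
The round-trip check described in note 14 might look like the following Python sketch. It is only an illustration under stated assumptions: the `ImpressionEvent` fields, the JSON wire format and the `SCHEMA_VERSION` tag are hypothetical and not taken from the deck. The point is that the serializer and deserializer live in one shared, version-locked package, and a test asserts that an event survives the trip unchanged.

```python
import json
from dataclasses import asdict, dataclass

SCHEMA_VERSION = 1  # hypothetical version tag shared by both sides


@dataclass(frozen=True)
class ImpressionEvent:
    """Hypothetical client-side event; field names are illustrative only."""
    user_id: str
    event_type: str
    timestamp_ms: int


def serialize(event: ImpressionEvent) -> bytes:
    """Encode an event as versioned JSON bytes."""
    payload = {"v": SCHEMA_VERSION, **asdict(event)}
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def deserialize(raw: bytes) -> ImpressionEvent:
    """Decode bytes produced by serialize(); reject unknown versions."""
    payload = json.loads(raw.decode("utf-8"))
    version = payload.pop("v")
    if version != SCHEMA_VERSION:
        raise ValueError(f"unsupported schema version: {version}")
    return ImpressionEvent(**payload)


def round_trips(event: ImpressionEvent) -> bool:
    """True when the deserializer reproduces exactly what was serialized."""
    return deserialize(serialize(event)) == event


sample = ImpressionEvent(user_id="u1", event_type="play", timestamp_ms=1539264300000)
```

Running `round_trips(sample)` in the package's test suite gives both the producing client and the consuming job the same guarantee, since each imports the same code rather than maintaining its own parser.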