The document discusses the differences between data warehousing and big data, how Data Vault 2.0 provides a common foundation for both, and how to model data using the Data Vault approach with hubs, links, and satellites. It also covers challenges like loading satellites chronologically and different data ingestion methods like ETL, ELT, and SerDe.
2. DATA WAREHOUSING VS BIG DATA
• Does Big Data replace Data Warehousing? Or do I need both?
• What’s the difference:
• Between the data flowing into a data warehouse vs big data tools?
• Between the ingestion processes and infrastructure?
• Data Lakes arrived with Big Data, so are they useful in Data Warehousing?
• How should I model my data in EDW?
• 3NF, Star Schema, same as my operational data stores?
• Data Vault 2.0
• Graph Databases
• What is an architecture that allows both to co-exist effectively?
5. DATA VAULT 2.0
COMMON FOUNDATIONAL WAREHOUSE ARCHITECTURE
• “The Data Vault Model is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise” -- Dan Linstedt, Creator of Data Vault
• Data loaded as-is from sources, no edits or cleanup
• Append-only to afford highest performance
• Agile & agnostic to changes in the operational store’s data model
• Essentially, a prescription for Layered Graph to Relational Mapping
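The graph-to-relational idea in the last bullet can be sketched in a few lines of Python: hubs act as nodes, links as edges, and satellites carry the descriptive attributes, with every table append-only. This is a minimal sketch, not Data Vault's reference implementation. The class names, method names, and the choice of tail number and airport code as business keys are assumptions made for illustration, loosely borrowed from the airline example in this deck.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Hub:
    """A node: one row per business key."""
    name: str
    business_key: str
    load_date: date
    record_source: str


@dataclass(frozen=True)
class Link:
    """An edge: relates two or more hub business keys."""
    name: str
    hub_keys: tuple
    load_date: date
    record_source: str


@dataclass(frozen=True)
class Satellite:
    """Descriptive attributes of a hub or link, stored as received."""
    parent: str
    parent_key: str
    load_date: date
    record_source: str
    attributes: tuple  # sorted (name, value) pairs


class Vault:
    """Append-only store: rows are added, never updated or deleted."""

    def __init__(self):
        self.hubs, self.links, self.satellites = [], [], []

    def load_hub(self, name, business_key, load_date, record_source):
        # A business key is inserted once; reloading it is a no-op.
        known = {(h.name, h.business_key) for h in self.hubs}
        if (name, business_key) not in known:
            self.hubs.append(Hub(name, business_key, load_date, record_source))

    def load_link(self, name, hub_keys, load_date, record_source):
        # A relationship is also inserted once per combination of keys.
        known = {(ln.name, ln.hub_keys) for ln in self.links}
        if (name, tuple(hub_keys)) not in known:
            self.links.append(Link(name, tuple(hub_keys), load_date, record_source))

    def load_satellite(self, parent, parent_key, load_date, record_source, **attrs):
        # Attributes are appended as-is: no edits, no cleanup.
        self.satellites.append(
            Satellite(parent, parent_key, load_date, record_source,
                      tuple(sorted(attrs.items())))
        )


# Illustrative values only (hypothetical keys and relationships).
vault = Vault()
vault.load_hub("Aircraft", "1477", date(2018, 1, 17), "United Airlines")
vault.load_hub("Airport", "CAE", date(2018, 11, 17), "Airport CAE")
vault.load_link("SERVICED_BY", ("1477", "CAE"), date(2018, 11, 17), "Airport CAE")
vault.load_satellite("Aircraft", "1477", date(2017, 2, 11), "United", Model="767")
vault.load_hub("Aircraft", "1477", date(2018, 9, 17), "United Airlines")  # no-op
```

Because nothing is ever updated in place, a change in the operational store's data model only adds new satellites or links; existing rows stay untouched.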
6. DATA WAREHOUSING & DATA VAULT 2.0
• 60’s, 70’s, 80’s
• E.F. Codd => 3NF
• Bill Inmon invents Data Warehousing concept
• Dr. Ralph Kimball popularizes Star Schema design
• 90’s, 00’s:
• Dan Linstedt creates Data Vault Model @ DOD
• 2014:
• Dan Linstedt introduces Data Vault 2.0
14. Flight
Tabs: Base, Dest, Forecast
Record Source   LoadDate     Depart   Gate
LGA             2018-10-11   1:25PM   B27
CAE             2018-10-24   3:30PM   A14
SFO             2018-09-06   8:55PM   G19
RDU             2018-08-12   4:45PM   C22
SERVICED_BY
Record Source: Airport CAE
Load Date: 2018-11-17
Source Id: 20181117-32-983
Aircraft
Tabs: Base, Service, FAA, NTSB
Record Source   LoadDate     Model   Tailno
United          2017-02-11   767     1477
Delta           2015-11-04   A6      2381
Alaska          2013-08-28   747     8312
Frontier        2016-07-19   182     1438
Record Source: United Airlines
Load Date: 2018-01-17
Source Id: 2412c
SERVICED_BY
Tabs: Base, Dest, Manifest
Record Source   LoadDate     Begin        End
United          2017-02-11   2017-04-23   2017-09-23
Delta           2015-11-04   2015-12-01   2017-04-22
Alaska          2013-08-28   2013-09-14   2016-05-04
Frontier        2016-07-19   2016-08-02   2018-04-11
Record Source: United Airlines
Load Date: 2018-09-17
Legend: Hubs, Links, Satellites Tab
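Every satellite row above carries a LoadDate, which is what makes chronological loading matter: if a loader only appends a row when the attributes differ from the latest stored row, batches have to be applied in load-date order or the history comes out wrong. The following is a minimal sketch of that idea under stated assumptions. The function names, the tuple layout `(key, load_date, attributes)`, and the flight key `"FL-1"` are hypothetical and do not come from the slides.

```python
from datetime import date


def load_satellite(existing, incoming):
    """Append incoming rows in load-date order, skipping unchanged attributes.

    `existing` is assumed to already be in chronological order.
    Each row is a (key, load_date, attributes) tuple.
    """
    latest = {}
    for key, _, attrs in existing:
        latest[key] = attrs  # last row seen per key is the current state
    for key, load_date, attrs in sorted(incoming, key=lambda r: r[1]):
        if latest.get(key) != attrs:
            existing.append((key, load_date, attrs))
            latest[key] = attrs
    return existing


def as_of(rows, key, when):
    """Return the most recent row for `key` loaded on or before `when`."""
    candidates = [r for r in rows if r[0] == key and r[1] <= when]
    return max(candidates, key=lambda r: r[1], default=None)


# The batch arrives out of order; the loader sorts it before appending.
incoming = [
    ("FL-1", date(2018, 10, 13), {"Gate": "B29"}),
    ("FL-1", date(2018, 10, 11), {"Gate": "B27"}),
    ("FL-1", date(2018, 10, 12), {"Gate": "B27"}),  # unchanged, skipped
]
rows = load_satellite([], incoming)
current = as_of(rows, "FL-1", date(2018, 10, 12))
```

Sorting inside the loader handles disorder within one batch only; rows that arrive late in a later batch would still need separate handling, which this sketch does not attempt.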
15. • “Organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations” -- Mel Conway
16. FLIGHT
Tabs: Base, Dest, Forecast
Record Source   LoadDate     Depart   Gate
LGA             2018-10-11   1:25PM   B27
CAE             2018-10-24   3:30PM   A14
FLIGHT
Record Source: Airport CAE
Load Date: 2018-11-17
Source Id: 20181117-32-983
Aircraft
Tabs: Base, Service, FAA, NTSB
Record Source   LoadDate     Model   Tailno
United          2017-02-11   767     1477
Delta           2015-11-04   A6      2381
Alaska          2013-08-28   747     8312
Frontier        2016-07-19   182     1438
Record Source: United Airlines
Load Date: 2018-01-17
Source Id: 2412c
Airport
Tabs: Base, Dest, Manifest
Record Source   LoadDate     Begin        End
United          2017-02-11   2017-04-23   2017-09-23
Delta           2015-11-04   2015-12-01   2017-04-22
Alaska          2013-08-28   2013-09-14   2016-05-04
Frontier        2016-07-19   2016-08-02   2018-04-11
Record Source: United Airlines
Load Date: 2018-09-17
Airline
Tabs: Base, Service, FAA, NTSB
Record Source   LoadDate     Model   Tailno
United          2017-02-11   767     1477
Delta           2015-11-04   A6      2381
Record Source: United Airlines
Load Date: 2018-01-17
Source Id: 2412c
Legend: Hubs, Links, Satellites Tab
18. • Modeled after self-organizing networks
• A Business Key identifies a key concept in business.
• They have a business meaning
• They are unique and have a very low propensity to change
• Business keys change only when the business changes
• Enables (forces) cross-source modeling
Source: http://www.di.univr.it/documenti/OccorrenzaIns/matdid/matdid232240.pdf
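Cross-source modeling follows from the business key: when two source systems describe the same thing, they should land on the same hub row. A minimal sketch of that behaviour follows. It assumes the hub key is derived by hashing a normalized business key, which is a common Data Vault 2.0 practice but is not stated on this slide. The function names, the MD5 choice, the trim-and-uppercase normalization, and the sample key and source names are all hypothetical.

```python
import hashlib


def hub_hash_key(business_key: str) -> str:
    """Derive a deterministic hub key from a business key (assumed scheme)."""
    normalized = business_key.strip().upper()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def load_hub(hub: dict, business_key: str, record_source: str, load_date: str) -> str:
    """Insert the business key if it is new; return its hub key either way."""
    key = hub_hash_key(business_key)
    # setdefault keeps the first-seen row, so the hub stays append-only.
    hub.setdefault(key, {
        "business_key": business_key.strip().upper(),
        "record_source": record_source,
        "load_date": load_date,
    })
    return key


aircraft_hub = {}
# Two hypothetical sources spell the same tail number slightly differently.
k1 = load_hub(aircraft_hub, "N1477", "airline_ops", "2018-01-17")
k2 = load_hub(aircraft_hub, " n1477 ", "maintenance", "2018-09-17")
# Both calls resolve to the same hub row; satellites from either source
# can now hang off that one key.
```

The design choice here is that the key depends only on the business key, never on the source system, so each source can be loaded independently and still converge on one hub row.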
30. DATA WAREHOUSING
• Deep Topic
• 60’s, 70’s, 80’s
• E.F. Codd => 3NF
• Bill Inmon invents Data Warehousing concept
• Dr. Ralph Kimball popularizes Star Schema design
• 90’s, 00’s:
• Dan Linstedt creates Data Vault Model @ DOD
• 2014:
• Dan Linstedt introduces Data Vault 2.0
• Data Warehouse vs Operational Data Stores
• Data Warehouse as Version Control System
BIG DATA
• MapReduce, 2004, Google: Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”; GFS
• Nutch 2005, Hadoop 2006, 2007 - Doug Cutting
• What exactly is “Big Data”?
33. ETL OR SERDE?
[Diagram. Recoverable labels: Internet, Client, User, Serializer, Eventlog.e, Eventlog.d, Kafka/Kinesis, Deserializer, Single Source (Version Locked), S3, Hadoop, Time Series, Event Record, Analysis]
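The diagram's layout did not survive extraction, so the following is only one plausible reading of its labels: a serializer and deserializer generated from a single, version-locked definition, with event records written to a log on one side and read back on the other. A minimal sketch under that assumption follows. The `EventRecord` fields, the JSON envelope, and the `SCHEMA_VERSION` constant are hypothetical, and the transport (Kafka/Kinesis) and sinks (S3, Hadoop) are left out.

```python
import json
from dataclasses import dataclass, asdict

SCHEMA_VERSION = 1  # hypothetical: one definition shared by both sides


@dataclass(frozen=True)
class EventRecord:
    event_id: str
    event_type: str
    occurred_at: str
    payload: dict


def serialize(event: EventRecord) -> bytes:
    """Wrap the event in a versioned envelope and encode it as JSON bytes."""
    envelope = {"schema_version": SCHEMA_VERSION, "event": asdict(event)}
    return json.dumps(envelope, sort_keys=True).encode("utf-8")


def deserialize(data: bytes) -> EventRecord:
    """Decode an envelope, rejecting any version other than the locked one."""
    envelope = json.loads(data.decode("utf-8"))
    version = envelope.get("schema_version")
    if version != SCHEMA_VERSION:
        raise ValueError(f"unsupported schema version: {version!r}")
    return EventRecord(**envelope["event"])


original = EventRecord(
    event_id="evt-001",
    event_type="gate_change",
    occurred_at="2018-10-11T13:25:00Z",
    payload={"gate": "B27"},
)
wire_bytes = serialize(original)
restored = deserialize(wire_bytes)
```

Compared with ETL, the record is not transformed on the way in; the same definition that wrote it is used to read it, and a version mismatch fails loudly instead of silently misreading fields.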