1. The difference between a Data Lake and a Data Vault is the difference between a stethoscope and a radar
• A Data Lake reinforces what you already know
• A Data Lake provides weak support for strategic decisions
• Data Lakes encourage a silo mentality
• Data Lakes can show the ‘what’
• Data Vaults help with the ‘why’
• Data Lakes enable drill down
• Data Vaults encourage drill across
Data Lake vs Vault Summary
3. What do we do?
Signal Processing or Data Processing?
• Signals start conversations
• Signals move boardrooms
• Signals release IT expenditure
• Signal variety, reliability and context are key business drivers
• Data Processing ends conversations!
4. Signal Processing is the customer of Data Integration & Warehousing
[Diagram labels: Signal Processing; Business Intelligence; Artificial Intelligence; Reporting; Analytics; Spreadsheets; Dashboards]
5. Sales are down but why?
There are many interpretations of reality:
• Website broken
• Marketing budget cut
• Campaign poor
• Product price uncompetitive
• New product release
• Company trashed by Trump
• Fashion victim
• Delivery delays and/or cost
• Recession
6. Signal Processing at Scale
• The Cloud is one massive signal processor, with effectively limitless compute power and storage
• The role of Data Integration in the cloud is the organisation of data sets for both efficient and effective signal processing
• Data Lakes & Vaults have emerged as key cloud integration patterns
8. Data Lake Evolution
• 2006: AWS launches
• 2008: Yahoo open sources Hadoop
• 2009: Cloudera forms
• 2009: AWS Elastic MapReduce
• 2010 (October): Apache Hive release
• 2010 (October): James Dixon, CTO of Pentaho, coined the term Data Lake
• 2011: Hortonworks forms
• 2012: AWS announces Amazon Redshift
• 2014: European on-premise Data Lake projects take off
• 2015: Snowflake released on AWS
• 2015: Hive and Presto released on AWS
• 2017: AWS Athena released
9. Data Lake Signals are Isolated
• Data Lakes encourage detailed analysis of a very narrow field
• Thinking across separate data sources is difficult and inconsistent
• A silo mentality can emerge
• Data Scientists spend their time hunting for the data lake ontology
• Weak support for strategic decisions
• Too easy to make bad decisions on limited data
10. Data Lake Warning
The danger with Data Lakes is that they encourage decisions based upon what can be easily measured
11. Data Lakes are Good for
• Starting EDW projects
• Persistent staging areas
• Feedstock for Data Vaults
• Tactical Analysis
• DWH flexibility
• API Calls/Gateway
• Unstructured log analysis
• Operational Monitoring
12. Data Vault Evolution
• 1990s: Conceived by Dan Linstedt
• 2000: DV 1.0 Released into public domain
• 2014: DV 2.0 Announced
13. Data Vault Trends
• Strong tools are emerging for source-centric modelling and model population
• The need for business-centric modelling
• Patterns emerging for automation of documentation, validation and reconciliation
• New Data Warehouse databases complement Data Vaults
• GDPR and PII are driving the need for ontologies
• S3/Athena as a Data Vault?
14. Data Vaults are Good for
• EDW projects
• Strategic Analysis
• Feedstock for Cubes and Models
15. Data Vault Signals are related through business context
Sales are down and here is the business context
• Broadens the field of vision and the scope of questions
• Increases the variety, quality and strength of signal channels
• Different business perspectives are supported in a consistent analysis framework
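A minimal sketch of what "drill across" could look like, assuming the signal channels share a business key (here a hypothetical product code and month). The figures and the names sales_units, marketing_spend, delivery_days and drill_across are invented for illustration and do not come from the slides; a real Data Vault would relate these channels through its modelled business keys rather than Python dicts.

```python
# Illustrative data only: each dict is one "signal channel", keyed by a
# shared business key of (product code, month). All values are made up.
sales_units = {("P100", "2018-01"): 120, ("P100", "2018-02"): 80}
marketing_spend = {("P100", "2018-01"): 5000, ("P100", "2018-02"): 1500}
delivery_days = {("P100", "2018-01"): 2, ("P100", "2018-02"): 6}


def drill_across(channels):
    """Combine named signal channels into one row per business key.

    A key missing from a channel yields None for that channel, so gaps
    stay visible instead of silently dropping the row.
    """
    # Collect every business key seen in any channel.
    keys = sorted(set().union(*(c.keys() for c in channels.values())))
    return [
        {
            "product": product,
            "month": month,
            **{name: ch.get((product, month)) for name, ch in channels.items()},
        }
        for product, month in keys
    ]


rows = drill_across({
    "sales_units": sales_units,
    "marketing_spend": marketing_spend,
    "delivery_days": delivery_days,
})
for row in rows:
    print(row)
```

Read side by side, the rows put the fall in sales next to the marketing cut and the longer delivery time for the same product and month, which is the kind of business context the slide describes. Drilling down into sales_units alone would show only the fall.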
In the pub, signals open conversations
Signals move boardrooms, not data
How the board consumes our data integration projects determines their success or failure
We should sell signals, not technology
Flying blind
Yield Curves
Human task
The board can’t take action if it is blind to obvious signals
10 years since Yahoo open sourced Hadoop
Which came first, James Dixon or Hive?
Up until Hive, Hadoop was hard; it separated compute from storage without analysis
4 years since first data lake iteration…poor