So, you want to build a Data Lake?
The Basics of Data Lakes, Key Considerations, and Lessons Learned
David P. Moore
12/15/2020
Agenda
• Introduction
• What is a Data Lake?
• Architecture and Design
• Governance and Support
• Lessons Learned
• What’s Next?
About Me…
• Sr. Software Developer at CarMax since 2019
  - Consultant with CapTech for 3+ years
  - Before that worked at Capital One in a variety of roles including Developer, Data Modeler, Tech Lead
• Have worked on 3 data lake implementations at 3 different companies using 3 different technologies
• 20+ years in data and software dev, with a passion for continuous improvement
• Two fun facts:
  - I have a black belt in Silkisondan Karate
  - I love to play guitar and listen to music
What is a Data Lake?
First a little data history lesson…
Data warehouse and proprietary ETL and database tools
• 1990s to mid-2000s – Data Warehouse popularized
  - Ralph Kimball – Star Schema, Data Marts
  - Bill Inmon – EDW
• SMP Database Systems (Oracle, SQL Server, Sybase)
• ETL Tools (Informatica, Ab Initio, Talend, etc.)
• MPP Database Systems (Teradata, Netezza, Greenplum, etc.)
  - ELT, 3NF
Open-source, big data and the cloud…
• 2003, 2004 – Google File System and Google MapReduce papers published
• 2006 – Hadoop started by Doug Cutting and Mike Cafarella
• 2008 onward – Companies like Cloudera, Hortonworks, and MapR form to package and distribute open-source Hadoop
• 2006 – AWS launched, followed by Google in 2008 and Azure in 2010
• 2010 – Apache Spark, started by Matei Zaharia at UC Berkeley in 2009, open-sourced
• 2013 – Databricks launched, offering Spark as a Service
• 2019 – Delta Lake released by Databricks
What is Big Data?
• Big Data is a term used to describe massive volumes of data that can
flood a business daily
• This data can be either structured or unstructured, but ultimately
the datasets are so large that they cannot be processed on a single
machine in a reasonable amount of time
• 3 V’s, popularized by Doug Laney from Gartner: Volume, Variety, Velocity
What is a Data Lake?
• “A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files.”
• “A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).”
Source: https://en.wikipedia.org/wiki/Data_lake
James Dixon of Pentaho:
“If you think of a datamart as a store of bottled water –
cleansed and packaged and structured for easy consumption –
the data lake is a large body of water in a more natural state.
The contents of the data lake stream in from a source to fill the
lake, and various users of the lake can come to examine, dive
in, or take samples.”
Data Warehouse vs. Data Lake
Attribute               Data Warehouse                 Data Lake
Data Format             Structured                     Structured, Semi-structured, Unstructured
Data Schema / Modeling  Schema-on-Write                Schema-on-Read
Relative Cost           $$$                            $
Flexibility             Less agile                     Highly agile
Performance             Tuned for fast query response  General purpose access, slower responses
Data Quality            High quality, curated data     Lower quality, raw data
Target Users            Business Analysts              Data Scientists
Typical Use Cases       Reporting, Visualizations      Predictive Analytics, Machine Learning
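To make the Schema-on-Write vs. Schema-on-Read distinction concrete, here is a minimal sketch in plain Python. It assumes raw JSON lines are stored untouched and that the reader supplies its own schema at query time. The sample records, the `SCHEMA` mapping, and `read_with_schema` are all hypothetical names used only for illustration.

```python
import json

# Hypothetical raw events as they might land in the lake: nothing is enforced on write.
RAW_LINES = [
    '{"id": 1, "amount": "19.99", "channel": "web"}',
    '{"id": 2, "amount": "5.00"}',
    '{"id": "3", "amount": "7.25", "extra": true}',
]

# Schema chosen by the reader: column name -> (converter, default when missing).
SCHEMA = {
    "id": (int, None),
    "amount": (float, 0.0),
    "channel": (str, "unknown"),
}

def read_with_schema(lines, schema):
    """Parse raw JSON lines and project each record onto the reader's schema."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for column, (convert, default) in schema.items():
            value = record.get(column, default)
            row[column] = convert(value) if value is not None else None
        rows.append(row)
    return rows

rows = read_with_schema(RAW_LINES, SCHEMA)
```

Fields the schema does not name (such as `extra`) are simply ignored, and a different consumer could read the same raw lines with a different schema. A schema-on-write system would instead reject or reshape these records before storing them.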
What is Delta Lake?
“Delta Lake is an open source storage layer that brings reliability to data lakes.
Delta Lake provides ACID transactions, scalable metadata handling, and unifies
streaming and batch data processing. Delta Lake runs on top of your existing data
lake and is fully compatible with Apache Spark APIs.”
https://docs.delta.io/latest/delta-faq.html
Created by Databricks, and open sourced and contributed to the Linux Foundation as an
open standard, Delta Lake is a technology layer compatible with Apache Spark that
adds some database-like features to a data lake.
The cloud has enabled a massive
transformation in data capabilities
• Going from on-premises data centers, where provisioning new
hardware took weeks or months, to being able to scale up within
minutes
• Decoupling of compute from storage allows for flexible scalability and
optimizing costs
Architecture and Design
Data Lake Architectural Considerations
• Scalability
• Flexibility
• Security
• Availability
• Supportability
Cloud vs. On-Premises?
Cloud:
• Flexibility & Agility
• Scalability
• Op-ex cost model
• No data center
• Lack of control of data
• Depending on workload, costs can be higher
On-Premises:
• Slower Time to Market
• Limit of Scalability
• Cap-ex cost model
• Full control over data
• Depending on workload, costs could be lower
Data Lake Architecture
Primary Components
Storage
Processing Engine
Orchestration Engine
User Access Tools
Data Catalog
Data Lake Environments
Environments: DEV, TEST, PRODUCTION
• As in any traditional systems development, having multiple environments for developing and testing code is necessary.
• Changes to each subsequent environment should be made via automation
• Pre-prod environments need to be kept in sync with prod (refresh process)
Data Lake Zones
• Landing
• Raw (Bronze)
• Clean/Valid (Silver)
• Refined (Gold)
• Secure
• Sandbox
Data lakes are typically divided into separate zones, with data going through a refining process as it progresses from one zone to the next.
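The zone progression can be sketched as a folder convention plus a promotion order. This is only an illustration: the lowercase folder names, the `load_date=` partition folder, and the helper functions are assumptions, not something the deck prescribes. The Secure and Sandbox zones are left out because they sit beside the refining chain rather than in it.

```python
from pathlib import PurePosixPath

# Refining order taken from the zone list; folder names are illustrative assumptions.
ZONE_ORDER = ["landing", "raw", "clean", "refined"]

def zone_path(root, zone, dataset, load_date):
    """Build the folder that holds one load of a dataset inside a zone."""
    if zone not in ZONE_ORDER:
        raise ValueError(f"unknown zone: {zone}")
    return PurePosixPath(root) / zone / dataset / f"load_date={load_date}"

def next_zone(zone):
    """Return the zone data is promoted to, or None at the end of the chain."""
    index = ZONE_ORDER.index(zone)
    return ZONE_ORDER[index + 1] if index + 1 < len(ZONE_ORDER) else None
```

For example, `zone_path("/lake", "raw", "orders", "2020-12-15")` gives `/lake/raw/orders/load_date=2020-12-15`, and `next_zone("raw")` returns `"clean"`.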
Data Lake Storage paradigms
The Data Lake has two primary storage paradigms for accessing and dealing
with its data:
Hierarchical File System
  - Typically based on HDFS
  - Data organized into Files and Folders
  - N-levels deep
  - Modeled on the POSIX file system standard
Database
  - Typically based on Hive
  - Data is organized into Databases and Tables
  - 2-levels deep
  - Compatible with SQL-based access
Most Data Lake systems use both at the same time, where the Database layer sits
on top of the File System. This can cause confusion for users.
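One way to picture how the two paradigms relate: the database layer is essentially a lookup from a two-level name to a folder in the file system. The sketch below uses a plain dictionary as a stand-in for a metastore. The table names, paths, and `resolve_table` are hypothetical.

```python
# Hypothetical in-memory stand-in for a metastore: (database, table) -> folder location.
METASTORE = {
    ("sales", "orders"): "/lake/refined/sales/orders",
    ("sales", "customers"): "/lake/refined/sales/customers",
}

def resolve_table(qualified_name, metastore=METASTORE):
    """Translate a two-level 'database.table' name into its folder on the file system."""
    database, _, table = qualified_name.partition(".")
    if not database or not table or "." in table:
        raise ValueError("expected exactly two levels: database.table")
    try:
        return metastore[(database, table)]
    except KeyError:
        raise LookupError(f"table not registered: {qualified_name}") from None
```

The same data can therefore be reached by path (`/lake/refined/sales/orders`) or by name (`sales.orders`), which is a likely source of the user confusion mentioned above.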
Storage Design Decisions
• Datasets in a data lake are typically defined at a folder level
instead of at the file level.
• At the top level there is typically a folder structure that aligns
with the zones
• There are two primary types of data to consider:
  - Event/Fact data (Clicks, Transactions, Sensor readings, etc.)
  - Reference/Master/Dimension data (Customer, Product, etc.)
• Reference/Dimension data requires thinking about how to store
history of changes:
1. Snapshots
2. Deltas
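The two history options can be compared with a small sketch. A snapshot stores the full state of the reference data at a point in time, while a delta stores only what changed since the previous state. The customer records and function names below are made up for illustration.

```python
def compute_delta(previous, current):
    """Compare two snapshots keyed by business key and return only the changes."""
    delta = {"inserted": {}, "updated": {}, "deleted": []}
    for key, row in current.items():
        if key not in previous:
            delta["inserted"][key] = row
        elif previous[key] != row:
            delta["updated"][key] = row
    delta["deleted"] = sorted(k for k in previous if k not in current)
    return delta

def apply_delta(previous, delta):
    """Rebuild the newer snapshot from the older snapshot plus a delta."""
    rebuilt = {k: v for k, v in previous.items() if k not in delta["deleted"]}
    rebuilt.update(delta["inserted"])
    rebuilt.update(delta["updated"])
    return rebuilt

# Hypothetical customer dimension on two consecutive days.
monday = {"C1": {"name": "Ann", "tier": "gold"}, "C2": {"name": "Bob", "tier": "silver"}}
tuesday = {"C1": {"name": "Ann", "tier": "platinum"}, "C3": {"name": "Cy", "tier": "bronze"}}
delta = compute_delta(monday, tuesday)
```

Snapshots are simpler to query as of a given date but repeat unchanged rows. Deltas are compact but need to be replayed in order to rebuild a given state.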
File formats and compression
An important design choice is what file format to use in the Lake as well as whether to compress the data
  - For the Landing/Raw zone, the convention is to preserve the data in whatever format it arrived in.
  - For subsequent zones, it makes sense to conform to a standard format that is designed for data lakes and includes schema information
  - Parquet is popular for analytics (Columnar) with Snappy Compression
  - Delta Lake uses Parquet with additional metadata
  - ORC is an alternative columnar format popular on Hadoop
  - Avro is row-based and popular for streaming (Kafka)
  - Avoid CSV or plain text formats where possible
  - Consider whether the format is splittable for parallel processing
  - Gzip-compressed files are not splittable, and CSV with multi-line records may not be either
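To show why columnar formats suit analytics, the sketch below pivots row-oriented records into one list per column using plain Python. It is a conceptual illustration only and does not reflect how Parquet or ORC encode data on disk. The sample orders are hypothetical and are assumed to share the same fields.

```python
# Hypothetical row-oriented records (the shape a row-based format would store).
rows = [
    {"order_id": 1, "customer": "C1", "amount": 19.99},
    {"order_id": 2, "customer": "C2", "amount": 5.00},
    {"order_id": 3, "customer": "C1", "amount": 7.25},
]

def to_columnar(records):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columnar(rows)
# An aggregate over one column only touches that column's values.
total_amount = sum(columns["amount"])
```

A query such as "total order amount" reads one list instead of every field of every row, and values of the same type stored together tend to compress well.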
Data Ingestion Choices
Ingestion is the process of getting data into the lake. When designing ingestion systems, there are many options and choices that need to be made, such as:
ETL frameworks:
• GUI-based
• Code-based
• Notebooks
• Metadata-driven
Frequency:
• Batch
  - Weekly
  - Daily
  - Hourly
• Micro batch
  - Every N minutes
• Streaming / Real time
Push vs. Pull:
• Push – systems send their data to the lake
• Pull – the lake initiates extracts
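A metadata-driven approach can be sketched as a list of source definitions that a generic scheduler reads, rather than one hand-written job per source. The `SourceConfig` fields, the sample sources, and `pulls_due` below are hypothetical and only illustrate how the frequency and push/pull choices might be captured as data.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SourceConfig:
    name: str
    mode: str                     # "push" or "pull"
    frequency: str                # "daily", "hourly", "micro_batch", or "streaming"
    target_zone: str = "landing"

# Hypothetical source definitions; in practice these might live in a config table or file.
SOURCES = [
    SourceConfig("crm_customers", mode="pull", frequency="daily"),
    SourceConfig("web_clicks", mode="push", frequency="streaming"),
    SourceConfig("pos_transactions", mode="pull", frequency="hourly"),
]

def pulls_due(sources, frequency):
    """Return the pull-based sources a scheduler should extract at this frequency."""
    return [s.name for s in sources if s.mode == "pull" and s.frequency == frequency]
```

Adding a new source then means adding a row of metadata rather than writing new pipeline code. Push-based sources never appear in `pulls_due` because the source system, not the lake, initiates the transfer.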
Data Catalog
The data catalog is a central part of managing the lake
and should have features such as:
• Dataset definitions
• Fields/column definitions
• Tags: Owner, Classification, PII
• Subject Matter Experts (SMEs)
Modern catalog tools also provide features such as:
• Crowdsourcing of metadata and gamification
• Automated annotation
Some examples:
• Alation
• Lumada Data Catalog
• IBM Watson Knowledge Catalog
• AWS Glue
• Azure Data Catalog / Purview
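As a rough picture of what a catalog entry holds, the sketch below models the features listed above (dataset and column definitions, owner, classification, PII tags, SMEs) as a small data structure. It is not the data model of any of the products named. All field names and sample entries are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    description: str
    owner: str
    classification: str
    columns: dict = field(default_factory=dict)      # column name -> definition
    pii_columns: set = field(default_factory=set)    # columns tagged as PII
    smes: list = field(default_factory=list)         # subject matter experts

# Hypothetical catalog contents.
CATALOG = [
    CatalogEntry("sales.orders", "One row per order", owner="sales-data",
                 classification="internal",
                 columns={"order_id": "Order key", "amount": "Order total"}),
    CatalogEntry("sales.customers", "One row per customer", owner="crm-team",
                 classification="confidential",
                 columns={"customer_id": "Customer key", "email": "Contact email"},
                 pii_columns={"email"}, smes=["jane.doe"]),
]

def datasets_with_pii(catalog):
    """List the datasets that have at least one column tagged as PII."""
    return [entry.name for entry in catalog if entry.pii_columns]
```

Tags like these are what let governance questions ("where do we hold PII?") be answered by a query instead of by asking around.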
Hive Metastore
• Most data lakes that are Hadoop-based or Spark-based rely on a
metadata catalog called the Hive Metastore
• It is important to consider how this should be provisioned and
managed
• The metastore is a relational database and supports a variety of
DBMS types including both open source (PostgreSQL, MySQL) and
closed (Oracle, MS SQL Server)
• Some configurations allow for an external metastore that can be shared by workspaces (e.g., Databricks)
Data Lake Consuming Systems
The lake will most likely host
multiple consuming systems
including:
• Data Warehouses
• Data Marts
• Operational Data Stores
• Feature Stores
• Data products or applications
  - Dashboards
  - Alerts/Notifications
  - Automated Actions
  - Datasets
Designing and architecting for data
consumption will require answering
questions such as:
• Will systems pull data from the
lake, or will data be pushed?
• How will these systems access the
data?
• How will systems be notified that
data is available?
• What environments will these
systems use for developing and
testing?
• What APIs will be used? (JDBC, ODBC, REST, SFTP)
Example: Modern Data Warehouse in Azure
https://docs.microsoft.com/en-us/azure/architecture/solution-ideas/articles/modern-data-warehouse
Example: Data Lake on AWS
https://aws.amazon.com/solutions/implementations/data-lake-solution/
Governance and Support
Keeping the Lake Secure
• Network security controls
• Role-Based Access Controls (RBAC)
• Encryption
  - Transparent Data Encryption
  - Explicit Encryption
• Row level and column level access
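Role-based access with row-level and column-level restrictions can be sketched as a policy per role: a set of visible columns plus an optional row predicate. Real platforms enforce this in the engine or storage layer. The roles, policies, and `read_as` function here are hypothetical.

```python
# Hypothetical role definitions: visible columns plus an optional row predicate.
ROLE_POLICIES = {
    "analyst": {"columns": {"order_id", "region", "amount"},
                "row_filter": lambda row: row["region"] == "EAST"},
    "support": {"columns": {"order_id", "region"},
                "row_filter": None},
}

def read_as(role, rows, policies=ROLE_POLICIES):
    """Return only the rows and columns the given role is allowed to see."""
    if role not in policies:
        raise PermissionError(f"role has no access: {role}")
    policy = policies[role]
    row_filter = policy["row_filter"] or (lambda row: True)
    return [{k: v for k, v in row.items() if k in policy["columns"]}
            for row in rows if row_filter(row)]

# Sample data with a sensitive column that no role above may read.
ORDERS = [
    {"order_id": 1, "region": "EAST", "amount": 10.0, "card_number": "4111..."},
    {"order_id": 2, "region": "WEST", "amount": 20.0, "card_number": "5500..."},
]
```

Here the analyst sees only EAST rows, support sees all rows but no amounts, and `card_number` is never returned because no policy lists it.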
Keeping the Lake Available
• Service Level Agreements
  - RPO and RTO
• Backups
  - Data
  - Configuration
  - Secrets
• Version Control
• Resource Locks
• Geo-Redundancy
• Automation
What’s your disaster recovery plan?
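RPO (Recovery Point Objective) bounds how much data may be lost, and RTO (Recovery Time Objective) bounds how long the service may be down. A disaster recovery drill can be checked against both with simple date arithmetic. The timestamps and targets below are invented for illustration.

```python
from datetime import datetime, timedelta

def meets_rpo(last_backup, failure_time, rpo):
    """Data at risk is everything written since the last good backup."""
    return failure_time - last_backup <= rpo

def meets_rto(failure_time, restored_time, rto):
    """Downtime is the span between the failure and service being restored."""
    return restored_time - failure_time <= rto

# Hypothetical incident timeline.
last_backup = datetime(2020, 12, 15, 2, 0)
failure = datetime(2020, 12, 15, 9, 30)
restored = datetime(2020, 12, 15, 11, 0)
```

With a nightly backup, this incident would meet a 24-hour RPO but not a 4-hour one, which is the kind of gap that should drive backup frequency.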
Access Patterns and Roles
The Lake needs to support several different types of access patterns:
1. System Access
  - Platform systems
  - Applications
2. Business User Access
  - Data Analysts
  - Data Scientists
3. Technology User Access
  - Support Access
  - Developer Access
Each of these groups needs to have different access rights appropriate to the role.
Regulations and Policies impacting the Lake
External Regulations
• GDPR
• CCPA
• HIPAA
• PCI
Internal Policies
• PII and Privacy
• Information Classification
Some regulations, such as GDPR and CCPA, require customer data to be disclosed and/or deleted on request. This requires careful design, because one customer's data may be spread across many datasets and zones.
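A deletion request is easier to honor when the lake records which datasets carry a customer key. The sketch below assumes such a mapping exists (for example, derived from catalog tags) and removes one customer's rows from each listed dataset. The dataset contents, the mapping, and `erase_customer` are hypothetical, and a real lake would also need to handle backups and immutable files.

```python
def erase_customer(datasets, customer_key_columns, customer_id):
    """Remove one customer's rows from every dataset that carries a customer key.

    Returns the number of rows removed per dataset, usable as an audit record.
    """
    removed = {}
    for name, key_column in customer_key_columns.items():
        before = len(datasets[name])
        datasets[name] = [r for r in datasets[name] if r[key_column] != customer_id]
        removed[name] = before - len(datasets[name])
    return removed

# Hypothetical datasets; "products" holds no customer data and is left untouched.
datasets = {
    "customers": [{"customer_id": "C1"}, {"customer_id": "C2"}],
    "orders": [{"order_id": 1, "cust": "C1"}, {"order_id": 2, "cust": "C1"},
               {"order_id": 3, "cust": "C2"}],
    "products": [{"sku": "A"}],
}
customer_key_columns = {"customers": "customer_id", "orders": "cust"}
audit = erase_customer(datasets, customer_key_columns, "C1")
```

Note that the key column is named differently in each dataset, which is why the mapping has to be maintained deliberately rather than guessed at request time.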
User Support
• Data Catalog
• Access to Data and Tools
• Training
• Sandbox Provisioning
• Help & Support
Technical Exploration and Tool Selection
• Explore and select tools and technologies
• Minimize number of tools
• Choose best of breed
• Consider Total Cost of Ownership (TCO)
• Select compatible technologies
Performance Tuning
• CPUs/Cores
• Memory
• Parallelism
• Skew
• Caching
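Of these, skew is often the least intuitive: one "hot" key ends up with far more rows than the others, so one task does most of the work while the rest sit idle. The sketch below measures skew on a list of keys and shows key salting as one common mitigation. The store keys, the bucket count, and both function names are assumptions.

```python
from collections import Counter

def skew_ratio(keys):
    """Ratio of the largest key group to the mean group size (1.0 means perfectly even)."""
    counts = Counter(keys)
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean

def salt_key(key, row_number, buckets):
    """Spread one hot key over several sub-keys by appending a salt suffix."""
    return f"{key}#{row_number % buckets}"

# Hypothetical key distribution: store_1 dominates.
keys = ["store_1"] * 8 + ["store_2", "store_3"]
salted = [salt_key(k, i, 4) if k == "store_1" else k for i, k in enumerate(keys)]
```

Salting lowers the ratio because the hot key is split into several smaller groups. The cost is that any later join or aggregation has to account for the salt, for example by aggregating per salted key first and then combining the partial results.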
Lessons Learned
1. Managing Environments is Hard
2. Automate Everything
3. Don’t rush to fill the lake, you might wind up with a swamp
4. Know your data
5. Pick a high value use case and demonstrate value quickly
6. Minimize complexity
7. Make sure you have backups
8. Enable self-service
9. But set limits and controls on user space
10. Try out different options, but settle on a single solution
What’s Next?
Machine Learning and AI
The Data Lake should not be an end in itself, but instead should be an enabler of new ways of using data for the benefit of the business and its customers.
Machine Learning and Artificial intelligence hold much
promise and potential to leverage big data to create
innovative data products.
Some newer capabilities that are critical to this include:
• Feature Stores – Systems for storing and managing
“features” used by machine learning pipelines or models
• Model Registries – Systems for storing, managing and
operationalizing predictive models
The Lakehouse
https://databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html
Technologies like Delta Lake have enabled the combining of the data lake
and data warehouse, simplifying the data architecture
• Lakehouse concept introduced by Databricks
The Event Streaming Platform
Championed by Confluent (creators of Kafka)
this enterprise architecture pattern uses a
hub-and-spoke model where systems stream
events to a hub, which can be read by other
systems.
• Enables real-time event driven systems
• Simplifies point to point dependencies
• Complements Data Lakes, Data Warehouses and other systems
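The hub-and-spoke idea can be sketched with an in-memory stand-in for the hub: producers append events to a topic, and each consumer reads at its own pace by tracking its own offset. This is not the Kafka API. The `EventHub` class, topic name, and consumer names are hypothetical.

```python
from collections import defaultdict

class EventHub:
    """In-memory stand-in for an event hub: append-only topics with per-consumer offsets."""

    def __init__(self):
        self._topics = defaultdict(list)
        self._offsets = defaultdict(int)   # (consumer, topic) -> next index to read

    def publish(self, topic, event):
        """Append an event to the end of a topic."""
        self._topics[topic].append(event)

    def poll(self, consumer, topic):
        """Return events this consumer has not seen yet and advance its offset."""
        start = self._offsets[(consumer, topic)]
        events = self._topics[topic][start:]
        self._offsets[(consumer, topic)] = start + len(events)
        return events

hub = EventHub()
hub.publish("orders", {"order_id": 1})
hub.publish("orders", {"order_id": 2})
lake_batch = hub.poll("data_lake", "orders")     # the lake ingests the events
alerts_batch = hub.poll("alerting", "orders")    # another system reads the same events
```

The producer publishes once and knows nothing about its consumers, which is how the pattern replaces point-to-point dependencies. The data lake is just one more consumer of the hub.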
Further reading
The Enterprise Big Data Lake
Alex Gorelik
Questions?

Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 

So You Want to Build a Data Lake?

  • 1. So, you want to build a Data Lake? The Basics of Data Lakes, Key Considerations, and Lessons Learned David P. Moore 12/15/2020
  • 2. Agenda • Introduction • What is a Data Lake? • Architecture and Design • Governance and Support • Lessons Learned • What’s Next?
  • 3. About Me… • Sr. Software Developer at CarMax since 2019,  Consultant with CapTech for 3+ years  Before that worked at Capital One in a variety of roles including Developer, Data Modeler, Tech Lead • Have worked on 3 data lake implementations at 3 different companies using 3 different technologies • 20+ years in data and software dev, with a passion for continuous improvement • Two Fun facts:  I have a black belt in Silkisondan Karate  I love to play guitar and listen to music
  • 4. What is a Data Lake?
  • 5. First a little data history lesson… Data warehouse and proprietary ETL and database tools • 1990’s to mid 2000’s – Data Warehouse Popularized  Ralph Kimball – Star Schema, Data Marts  Bill Inmon - EDW • SMP Database Systems (Oracle, SQL Server, Sybase) • ETL Tools (Informatica, Ab Initio, Talend, etc) • MPP Database Systems (Teradata, Netezza, Greenplum, etc)  ELT, 3NF
  • 6. Open-source, big data and the cloud… • 2003, 2004 – Google File System and MapReduce papers published • 2006 – Hadoop started by Doug Cutting and Mike Cafarella • 2008 - Companies like Cloudera, Hortonworks, MapR form to package and distribute open-source Hadoop • 2006 – AWS launched, followed by Google in 2008 and Azure in 2010 • 2010 – Apache Spark started by Matei Zaharia • 2013 – Databricks launched offering Spark as a Service • 2019 – Delta Lake released by Databricks
  • 7. What is Big Data? • Big Data is a term used to describe massive volumes of data that can flood a business daily • This data can be either structured or unstructured, but ultimately the datasets are so large that they cannot be processed on a single machine in a reasonable amount of time • 3 V’s, popularized by Doug Laney from Gartner: Volume Variety Velocity
  • 8. What is a Data Lake? • “A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files.” “A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).” Source: https://en.wikipedia.org/wiki/Data_lake James Dixon of Pentaho: “If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
  • 9. Data Warehouse vs. Data Lake
      Attribute | Data Warehouse | Data Lake
      Data Format | Structured | Structured, Semi-structured, Unstructured
      Data Schema / Modeling | Schema-on-Write | Schema-on-Read
      Relative Cost | $$$ | $
      Flexibility | Less agile | Highly agile
      Performance | Tuned for fast query response | General purpose access, slower responses
      Data Quality | High quality, curated data | Lower quality, raw data
      Target Users | Business Analysts | Data Scientists
      Typical Use Cases | Reporting, Visualizations | Predictive Analytics, Machine Learning
  • 10. What is Delta Lake? “Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.” https://docs.delta.io/latest/delta-faq.html Created by Databricks, and open sourced and contributed to the Linux Foundation as an open standard, Delta Lake is a technology layer compatible with Apache Spark that adds some database-like features to a data lake.
  • 11. The cloud has enabled a massive transformation in data capabilities • Going from on-premises data centers, where provisioning new hardware took weeks or months, to being able to scale up within minutes • Decoupling of compute from storage allows for flexible scalability and optimizing costs
  • 13. Data Lake Architectural Considerations Scalability Flexibility Security Availability Supportability
  • 14. Cloud vs. On-Premises? Cloud: • Flexibility & Agility • Scalability • Op-ex cost model • No data center • Lack of control of data • Depending on workload, costs can be higher On-Premises: • Slower Time to Market • Limit of Scalability • Cap-ex cost model • Full control over data • Depending on workload, costs could be lower
  • 15. Data Lake Architecture Primary Components Storage Processing Engine Orchestration Engine User Access Tools Data Catalog
  • 16. Data Lake Environments DEV TEST PRODUCTION • As in any traditional systems development, having multiple environments for developing and testing code is necessary. • Changes to each subsequent environment should be made via automation • Pre-prod environments need to be kept in sync with prod Refresh Process
  • 17. Data Lake Zones Landing Raw (Bronze) Clean/Valid (Silver) Refined (Gold) Secure Sandbox Data Lakes typically are divided into separate zones with data going through a refining process as it progresses from one zone to the next. Progression
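The zone progression above can be sketched as a small path helper. This is only an illustration: the lowercase folder names, the `source`/`dataset` levels, and the function names are assumptions, and the Secure and Sandbox zones are left out because the slide does not say where they sit in the progression.

```python
from pathlib import PurePosixPath
from typing import Optional

# Zone order taken from the slide; the lowercase folder names are assumptions.
ZONES = ["landing", "raw", "clean", "refined"]


def dataset_path(root: str, zone: str, source: str, dataset: str) -> PurePosixPath:
    """Build the folder that holds one dataset inside one zone."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return PurePosixPath(root) / zone / source / dataset


def next_zone(zone: str) -> Optional[str]:
    """Return the zone data is promoted to next, or None for the last zone."""
    position = ZONES.index(zone)
    return ZONES[position + 1] if position + 1 < len(ZONES) else None
```

A promotion job could call `next_zone("raw")` to decide where its cleaned output belongs, keeping the folder layout consistent across pipelines.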
  • 18. Data Lake Storage paradigms The Data Lake has two primary storage paradigms for accessing and dealing with its data: Hierarchical File System  Typically based on HDFS  Data organized into Files and Folders  N-levels deep  Based on Posix file system standard Database  Typically based on Hive  Data is organized into Databases and Tables  2-levels deep  Compatible with SQL-based access Most Data Lake systems use both at the same time, where the Database layer sits on top of the File System. This can cause confusion for users.
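One way to picture the database layer sitting on top of the file system is to map a two-level `database.table` name to the folder that backs it. The sketch below follows the common Hive layout of `<warehouse>/<database>.db/<table>` with `key=value` partition folders; the function name is hypothetical, and it ignores special cases such as the `default` database and external tables stored elsewhere.

```python
from pathlib import PurePosixPath
from typing import Dict, Optional


def table_location(
    warehouse_dir: str,
    database: str,
    table: str,
    partition: Optional[Dict[str, str]] = None,
) -> PurePosixPath:
    """Return the folder behind a database table, plus any partition folders."""
    path = PurePosixPath(warehouse_dir) / f"{database}.db" / table
    # Each partition column adds one folder level named key=value.
    for key, value in (partition or {}).items():
        path = path / f"{key}={value}"
    return path
```

Seeing both names side by side can reduce the confusion mentioned above: a SQL user works with `sales.orders`, while a file-system user sees a folder tree several levels deep.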
  • 19. Storage Design Decisions • Datasets in a data lake are typically defined at a folder level instead of at the file level. • At the top level there is typically a folder structure that aligns with the zones • There are two primary types of data to consider:  Event/Fact data (Clicks, Transactions, Sensor readings, etc)  Reference/Master/Dimension data (Customer, Product, etc) • Reference/Dimension data requires thinking about how to store history of changes: 1. Snapshots 2. Deltas
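The two history options for reference data can be compared with a small sketch. It assumes each snapshot is a dictionary keyed by a business key (for example a customer id); the change-record shape and the function names are illustrative, not a standard.

```python
from typing import Dict, List, Optional, Tuple

Change = Tuple[str, int, Optional[dict]]


def compute_delta(previous: Dict[int, dict], current: Dict[int, dict]) -> List[Change]:
    """Compare two full snapshots and return only what changed."""
    changes: List[Change] = []
    for key, row in current.items():
        if key not in previous:
            changes.append(("insert", key, row))
        elif previous[key] != row:
            changes.append(("update", key, row))
    for key in previous:
        if key not in current:
            changes.append(("delete", key, None))
    return changes


def apply_delta(snapshot: Dict[int, dict], changes: List[Change]) -> Dict[int, dict]:
    """Rebuild the next snapshot from an earlier one plus its delta."""
    result = dict(snapshot)
    for operation, key, row in changes:
        if operation == "delete":
            result.pop(key, None)
        else:
            result[key] = row
    return result
```

Snapshots are simple to query as of a given date but repeat unchanged rows every day; deltas are compact but need to be replayed, as `apply_delta` does, to reconstruct a point in time.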
  • 20. File formats and compression An important design choice is which file format to use in the Lake, as well as whether to compress the data  For the Landing/Raw zone, the convention is to preserve the data in whatever format it arrived in.  For subsequent zones, it makes sense to conform to a standard format that is designed for data lakes and includes schema information  Parquet (columnar) with Snappy compression is popular for analytics  Delta Lake uses Parquet with additional metadata  ORC is an alternative columnar format popular on Hadoop  Avro is row-based and popular for streaming (Kafka)  Avoid CSV or plain text formats where possible  Consider whether the format is splittable for parallel processing  CSV and Gzip may not be splittable formats
  • 21. Data Ingestion Choices Ingestion is the process of getting data into the lake. When designing ingestion systems, there are many options and choices that need to be made, such as: ETL frameworks: • GUI-Based • Code-based • Notebooks • Metadata-driven Frequency: • Batch  Weekly  Daily  Hourly • Micro batch  Every N minutes • Streaming / Real time Push vs. Pull: • Push – systems send their data to the lake • Pull – The lake initiates extracts
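As an illustration of the metadata-driven option, the sketch below keeps one generic ingest routine and describes each dataset with a small spec. The `IngestSpec` fields, the two supported formats, and the returned summary are all assumptions made for the example; a real framework would also write the rows to storage.

```python
import csv
import io
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class IngestSpec:
    dataset: str       # name of the dataset in the lake
    fmt: str           # "csv" or "json_lines" in this sketch
    target_zone: str   # zone the data lands in


def _read_csv(text: str) -> List[dict]:
    return list(csv.DictReader(io.StringIO(text)))


def _read_json_lines(text: str) -> List[dict]:
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# One reader per format; adding a format means adding one entry here.
READERS: Dict[str, Callable[[str], List[dict]]] = {
    "csv": _read_csv,
    "json_lines": _read_json_lines,
}


def ingest(spec: IngestSpec, payload: str) -> dict:
    """Parse a payload as the spec describes and summarize the result."""
    rows = READERS[spec.fmt](payload)
    return {"dataset": spec.dataset, "zone": spec.target_zone, "row_count": len(rows)}
```

The appeal of this style is that onboarding a new source becomes a matter of adding a spec rather than writing a new pipeline.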
  • 22. Data Catalog The data catalog is a central part of managing the lake and should have features such as: • Dataset definitions • Fields/column definitions • Tags: Owner, Classification, PII • Subject Matter Experts (SMEs) Modern catalog tools also provide features such as: • Crowdsourcing of metadata and gamification • Automated annotation Some examples: Alation Lumada Data Catalog IBM Watson Knowledge Catalog AWS Glue Azure Data Catalog / Purview
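The catalog features listed above can be pictured as a simple record per dataset. This is a minimal sketch with hypothetical class and field names; it does not reflect the data model of any of the named catalog products.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Column:
    name: str
    description: str
    pii: bool = False   # tag used to drive masking or access reviews


@dataclass
class CatalogEntry:
    dataset: str
    description: str
    owner: str
    classification: str                       # for example "internal"
    smes: List[str] = field(default_factory=list)
    columns: List[Column] = field(default_factory=list)

    def pii_columns(self) -> List[str]:
        """List the columns tagged as personally identifiable information."""
        return [column.name for column in self.columns if column.pii]
```

With tags captured this way, a question such as "which datasets hold PII?" becomes a query over catalog entries rather than a manual review.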
  • 23. Hive Metastore • Most data lakes that are Hadoop-based or Spark-based rely on a metadata catalog called the Hive Metastore • It is important to consider how this should be provisioned and managed • The metastore is a relational database and supports a variety of DBMS types including both open source (PostgreSQL, MySQL) and closed (Oracle, MS SQL Server) • Some configurations allow for an external metastore that can be shared by workspaces (i.e. Databricks)
  • 24. Data Lake Consuming Systems The lake will most likely host multiple consuming systems including: • Data Warehouses • Data Marts • Operational Data Stores • Feature Stores • Data products or applications  Dashboards  Alerts/Notifications  Automated Actions  Datasets Designing and architecting for data consumption will require answering questions such as: • Will systems pull data from the lake, or will data be pushed? • How will these systems access the data? • How will systems be notified that data is available? • What environments will these systems use for developing and testing? • What APIs will be used? (JDBC, ODBC, REST, SFTP)
  • 25. Example: Modern Data Warehouse in Azure https://docs.microsoft.com/en-us/azure/architecture/solution-ideas/articles/modern-data-warehouse
  • 28. Keeping the Lake Secure • Network security controls • Role-Based Access Controls (RBAC) • Encryption  Transparent Data Encryption  Explicit Encryption • Row level and column level access
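Row-level and column-level access can be thought of as a filter plus a projection applied per role. The sketch below is an in-memory illustration only; the role names, the `Policy` class, and the sample columns are assumptions, and real platforms enforce this in the query engine or storage layer rather than in application code.

```python
from typing import Callable, Dict, List, Set


class Policy:
    """Which columns a role may see and which rows it may read."""

    def __init__(self, columns: Set[str], row_filter: Callable[[dict], bool]):
        self.columns = columns
        self.row_filter = row_filter


# Hypothetical roles for illustration.
POLICIES: Dict[str, Policy] = {
    "east_analyst": Policy({"order_id", "region", "amount"}, lambda row: row["region"] == "east"),
    "auditor": Policy({"order_id", "region", "amount", "card_number"}, lambda row: True),
}


def read_as(role: str, rows: List[dict]) -> List[dict]:
    """Return only the rows and columns the role's policy allows."""
    policy = POLICIES[role]
    return [
        {name: value for name, value in row.items() if name in policy.columns}
        for row in rows
        if policy.row_filter(row)
    ]
```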
  • 29. Keeping the Lake Available • Service Level Agreements  RPO and RTO • Backups  Data  Configuration  Secrets • Version Control • Resource Locks • Geo-Redundancy • Automation What’s your disaster recovery plan?
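A quick way to sanity-check a backup schedule against the SLA is to compare it with the Recovery Point Objective (how much data loss is tolerable, measured in time) and the Recovery Time Objective (how long a restore may take). The sketch assumes periodic backups with nothing captured in between; the function names are illustrative.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """A failure just before the next backup loses almost one full interval."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True when the worst-case loss fits inside the Recovery Point Objective."""
    return worst_case_data_loss(backup_interval) <= rpo


def meets_rto(restore_duration: timedelta, rto: timedelta) -> bool:
    """True when a measured restore fits inside the Recovery Time Objective."""
    return restore_duration <= rto
```

For example, nightly backups cannot satisfy a four-hour RPO, no matter how quickly they can be restored.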
  • 30. Access Patterns and Roles The Lake needs to support several different types of access patterns: 1. System Access  Platform systems  Applications 2. Business User Access  Data Analysts  Data Scientists 3. Technology User Access  Support Access  Developer Access Each of these groups needs different access rights appropriate to its role.
  • 31. Regulations and Policies impacting the Lake External Regulations • GDPR • CCPA • HIPAA • PCI Internal Policies • PII and Privacy • Information Classification Some regulations such as GDPR and CCPA require customer data to be disclosed and/or deleted. This requires careful design.
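Part of the careful design is being able to find and remove one customer's records everywhere they were copied. The sketch below assumes every dataset carries the same `customer_id` key and models the lake as a dictionary of row lists; both are simplifications, and in a file-based lake the reassignment stands in for rewriting the affected files.

```python
from typing import Dict, List


def erase_customer(
    datasets: Dict[str, List[dict]],
    customer_id: str,
    key: str = "customer_id",
) -> Dict[str, int]:
    """Remove one customer's rows from every dataset; return rows removed per dataset."""
    removed: Dict[str, int] = {}
    for name, rows in datasets.items():
        kept = [row for row in rows if row.get(key) != customer_id]
        removed[name] = len(rows) - len(kept)
        datasets[name] = kept  # stands in for rewriting the dataset's files
    return removed
```

The returned counts double as an audit trail showing which zones and datasets held the customer's data.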
  • 32. User Support • Data Catalog • Access to Data and Tools • Training • Sandbox Provisioning • Help & Support
  • 33. Technical Exploration and Tool Selection • Explore and select tools and technologies • Minimize number of tools • Choose best of breed • Consider Total Cost of Ownership (TCO) • Select compatible technologies
  • 34. Performance Tuning • CPUs/Cores • Memory • Parallelism • Skew • Caching
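Skew is the least self-explanatory item on this list, so here is a small sketch of how it arises and of one common mitigation, key salting. It assumes rows are assigned to partitions by hashing a key; the CRC32 hash, the `#` salt separator, and the function names are choices made for the example, not how any particular engine works internally.

```python
import zlib
from collections import Counter
from typing import List


def partition_of(key: str, num_partitions: int) -> int:
    """Assign a key to a partition with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(keys: List[str], num_partitions: int) -> List[int]:
    """Count how many rows land in each partition."""
    counts = Counter(partition_of(key, num_partitions) for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_ratio(sizes: List[int]) -> float:
    """Largest partition divided by the mean; 1.0 means perfectly even."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean if mean else 0.0


def salted(keys: List[str], salt_buckets: int) -> List[str]:
    """Append a rotating suffix so one hot key spreads over several partitions."""
    return [f"{key}#{i % salt_buckets}" for i, key in enumerate(keys)]
```

When one key dominates, all of its rows hash to the same partition and a single task does most of the work. Salting changes the key, so any aggregation then needs a second step that strips the salt and combines the partial results.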
  • 36. Lessons Learned 1. Managing environments is hard 2. Automate everything 3. Don’t rush to fill the lake, you might wind up with a swamp 4. Know your data 5. Pick a high value use case and demonstrate value quickly 6. Minimize complexity 7. Make sure you have backups 8. Enable self-service 9. But set limits and controls on user space 10. Try out different options, but settle on a single solution
  • 38. Machine Learning and AI The Data Lake should not be an end in itself, but instead should be an enabler of new ways of using data for the benefit of the business and its customers. Machine Learning and Artificial Intelligence hold much promise and potential to leverage big data to create innovative data products. Some newer capabilities that are critical to this include: • Feature Stores – Systems for storing and managing “features” used by machine learning pipelines or models • Model Registries – Systems for storing, managing and operationalizing predictive models
  • 39. The Lakehouse https://databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html Technologies like Delta Lake have enabled the combining of the data lake and data warehouse, simplifying the data architecture • Lakehouse concept introduced by Databricks
  • 40. The Event Streaming Platform Championed by Confluent (creators of Kafka), this enterprise architecture pattern uses a hub-and-spoke model where systems stream events to a hub, which can be read by other systems. • Enables real-time event-driven systems • Simplifies point-to-point dependencies • Complements Data Lakes, Data Warehouses and other systems
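The hub-and-spoke idea can be sketched with an in-memory stand-in: producers append events to a topic on the hub, and each consumer reads from its own position. The `EventHub` class and its methods are hypothetical and are not the Kafka API; they only illustrate why producers and consumers no longer need to know about each other.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


class EventHub:
    """In-memory hub: one append-only log per topic, one offset per consumer."""

    def __init__(self) -> None:
        self._topics: Dict[str, List[Any]] = defaultdict(list)
        self._offsets: Dict[Tuple[str, str], int] = defaultdict(int)

    def publish(self, topic: str, event: Any) -> None:
        """A producing system appends an event to the topic's log."""
        self._topics[topic].append(event)

    def poll(self, consumer: str, topic: str) -> List[Any]:
        """Return events this consumer has not seen yet and advance its offset."""
        log = self._topics[topic]
        start = self._offsets[(consumer, topic)]
        self._offsets[(consumer, topic)] = len(log)
        return log[start:]
```

Because each consumer keeps its own offset, the data lake can read the same order events as an alerting service without either one affecting the other.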
  • 41. Further reading The Enterprise Big Data Lake Alex Gorelik