1. So, you want to build a Data Lake?
The Basics of Data Lakes, Key Considerations, and Lessons Learned
David P. Moore
12/15/2020
2. Agenda
• Introduction
• What is a Data Lake?
• Architecture and Design
• Governance and Support
• Lessons Learned
• What’s Next?
3. About Me…
• Sr. Software Developer at CarMax since 2019, Consultant with CapTech for 3+ years
  Before that, worked at Capital One in a variety of roles including Developer, Data Modeler, and Tech Lead
• Have worked on 3 data lake implementations at 3 different companies using 3 different technologies
• 20+ years in data and software development, with a passion for continuous improvement
• Two fun facts:
  I have a black belt in Silkisondan Karate
  I love to play guitar and listen to music
5. First, a little data history lesson…
Data warehouses and proprietary ETL and database tools
• 1990s to mid-2000s – the data warehouse is popularized
  Ralph Kimball – Star Schema, Data Marts
  Bill Inmon – EDW
• SMP database systems (Oracle, SQL Server, Sybase)
• ETL tools (Informatica, Ab Initio, Talend, etc.)
• MPP database systems (Teradata, Netezza, Greenplum, etc.)
  ELT, 3NF
6. Open-source software, big data, and the cloud…
• 2003, 2004 – Google File System and Google MapReduce papers published
• 2006 – Hadoop started by Doug Cutting and Mike Cafarella
• 2006 – AWS launched, followed by Google in 2008 and Azure in 2010
• 2008 onward – Companies like Cloudera, MapR, and Hortonworks form to package and distribute open-source Hadoop
• 2010 – Apache Spark started by Matei Zaharia
• 2013 – Databricks launched, offering Spark as a service
• 2019 – Delta Lake released by Databricks
7. What is Big Data?
• Big Data is a term used to describe massive volumes of data that can
flood a business daily
• This data can be either structured or unstructured, but ultimately
the datasets are so large that they cannot be processed on a single
machine in a reasonable amount of time
• The 3 V’s, popularized by Doug Laney of Gartner: Volume, Variety, Velocity
8. What is a Data Lake?
• “A data lake is a system or repository of data stored in its natural/raw
format, usually object blobs or files.”
“A data lake is usually a single store of all enterprise data including raw copies
of source system data and transformed data used for tasks such as reporting,
visualization, advanced analytics and machine learning. A data lake can
include structured data from relational databases (rows and columns), semi-
structured data (CSV, logs, XML, JSON), unstructured data (emails,
documents, PDFs) and binary data (images, audio, video).”
Source: https://en.wikipedia.org/wiki/Data_lake
James Dixon of Pentaho:
“If you think of a datamart as a store of bottled water –
cleansed and packaged and structured for easy consumption –
the data lake is a large body of water in a more natural state.
The contents of the data lake stream in from a source to fill the
lake, and various users of the lake can come to examine, dive
in, or take samples.”
9. Data Warehouse vs. Data Lake

                         Data Warehouse                   Data Lake
Data Format              Structured                       Structured, semi-structured, unstructured
Data Schema / Modeling   Schema-on-Write                  Schema-on-Read
Relative Cost            $$$                              $
Flexibility              Less agile                       Highly agile
Performance              Tuned for fast query response    General-purpose access, slower responses
Data Quality             High-quality, curated data       Lower-quality, raw data
Target Users             Business Analysts                Data Scientists
Typical Use Cases        Reporting, Visualizations        Predictive Analytics, Machine Learning
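To make the Schema-on-Write vs. Schema-on-Read distinction concrete, here is a minimal PySpark sketch of schema-on-read; the path is hypothetical. Nothing validated these JSON files when they were written; a schema only appears when they are read:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Schema-on-read: no schema was enforced at write time;
# Spark infers one now, at read time, by sampling the files
df = spark.read.json("/lake/raw/events/")  # hypothetical path
df.printSchema()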
10. What is Delta Lake?
“Delta Lake is an open source storage layer that brings reliability to data lakes.
Delta Lake provides ACID transactions, scalable metadata handling, and unifies
streaming and batch data processing. Delta Lake runs on top of your existing data
lake and is fully compatible with Apache Spark APIs.”
https://docs.delta.io/latest/delta-faq.html
Created by Databricks, then open sourced and contributed to the Linux Foundation as an open standard, Delta Lake is a technology layer compatible with Apache Spark that adds database-like features to a data lake.
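A minimal sketch of trying this out with PySpark and the delta-spark package (the path is hypothetical; the two config keys below are the standard ones the Delta Lake docs call for):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("delta-demo")
         # Register Delta Lake's SQL extension and catalog with Spark
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

df = spark.range(100)  # toy data
# A Delta table is just Parquet files plus a _delta_log transaction log,
# which is what provides the ACID guarantees
df.write.format("delta").mode("overwrite").save("/lake/demo/numbers")
spark.read.format("delta").load("/lake/demo/numbers").show(5)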
11. The cloud has enabled a massive
transformation in data capabilities
• Going from on-premises data centers, where provisioning new
hardware took weeks or months, to being able to scale up within
minutes
• Decoupling compute from storage allows for flexible scaling and cost optimization
14. Cloud vs. On-Premises?
Cloud:
• Flexibility & agility
• Scalability
• Op-ex cost model
• No data center
• Lack of control of data
• Depending on workload, costs can be higher
On-Premises:
• Slower time to market
• Limits on scalability
• Cap-ex cost model
• Full control over data
• Depending on workload, costs could be lower
16. Data Lake Environments
[Diagram: DEV → TEST → PRODUCTION, with a refresh process flowing from prod back to the pre-prod environments]
• As in any traditional systems development, having multiple environments for developing and testing code is necessary
• Changes to each subsequent environment should be promoted via automation
• Pre-prod environments need to be kept in sync with prod
17. Data Lake Zones
Data lakes are typically divided into separate zones, with data going through a refining process as it progresses from one zone to the next:
• Landing
• Raw (Bronze)
• Clean/Valid (Silver)
• Refined (Gold)
• Secure
• Sandbox
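As a hedged illustration of one hop in that progression, here is a minimal PySpark sketch promoting a Raw (Bronze) Delta table to a Clean (Silver) one; the paths, key, and columns are hypothetical:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # assumes a Delta-enabled session as above

raw = spark.read.format("delta").load("/lake/bronze/orders")
clean = (raw
         .dropDuplicates(["order_id"])            # de-duplicate on the business key
         .filter(F.col("order_ts").isNotNull())   # drop rows failing a validity check
         .withColumn("amount", F.col("amount").cast("decimal(12,2)")))
clean.write.format("delta").mode("overwrite").save("/lake/silver/orders")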
18. Data Lake Storage Paradigms
The data lake has two primary storage paradigms for accessing and dealing with its data:
Hierarchical File System
  Typically based on HDFS
  Data organized into files and folders
  N levels deep
  Based on the POSIX file system standard
Database
  Typically based on Hive
  Data organized into databases and tables
  2 levels deep
  Compatible with SQL-based access
Most data lake systems use both at the same time, with the database layer sitting on top of the file system. This can cause confusion for users.
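A small sketch of the two paradigms addressing the same data side by side (the database, table, and path names are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# File-system paradigm: address the data by its path
df_files = spark.read.parquet("/lake/silver/orders/")

# Database paradigm: register the same location as a table, then use SQL
spark.sql("CREATE DATABASE IF NOT EXISTS silver")
spark.sql("CREATE TABLE IF NOT EXISTS silver.orders "
          "USING PARQUET LOCATION '/lake/silver/orders/'")
df_sql = spark.sql("SELECT COUNT(*) AS row_count FROM silver.orders")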
19. Storage Design Decisions
• Datasets in a data lake are typically defined at a folder level
instead of at the file level.
• At the top level there is typically a folder structure that aligns
with the zones
• There are two primary types of data to consider:
Event/Fact data (Clicks, Transactions, Sensor readings, etc)
Reference/Master/Dimension data (Customer, Product, etc)
• Reference/Dimension data requires thinking about how to store
history of changes:
1. Snapshots
2. Deltas
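A hedged sketch of option 1, the snapshot approach (names are hypothetical): each load appends a full copy of the dimension under a dated partition, so any historical version can be read back by filtering on the partition column.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

customers = spark.read.parquet("/lake/landing/crm/customers/")  # today's extract
(customers
 .withColumn("snapshot_date", F.lit("2020-12-15"))  # stamp the load date
 .write.partitionBy("snapshot_date")
 .mode("append")
 .parquet("/lake/raw/crm/customers/"))

The deltas alternative stores only the rows that changed in each load, which is cheaper to store but requires reassembling the full state at read time.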
20. File Formats and Compression
An important design choice is what file format to use in the lake, as well as whether to compress the data:
• For the Landing/Raw zone, the convention is to preserve the data in whatever format it arrived in
• For subsequent zones, it makes sense to conform to a standard format designed for data lakes that embeds schema information
  Parquet (columnar) is popular for analytics, usually with Snappy compression
  Delta Lake uses Parquet with additional metadata
  ORC is an alternative columnar format popular on Hadoop
  Avro is a row-based format popular for streaming (Kafka)
• Avoid CSV or plain-text formats where possible
• Consider whether the format is splittable for parallel processing
  Gzip-compressed CSV, for example, is not splittable
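A minimal sketch of conforming as-arrived raw data to Snappy-compressed Parquet (paths are hypothetical; Snappy is also Spark's default Parquet codec):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.json("/lake/raw/events/")  # preserved in its arrival format
# Columnar, compressed, and splittable: a sensible default for later zones
df.write.option("compression", "snappy").parquet("/lake/silver/events/")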
21. Data Ingestion Choices
Ingestion is the process of getting data into the lake. When designing ingestion systems, there are many options and choices to make, such as:
ETL frameworks:
• GUI-based
• Code-based
• Notebooks
• Metadata-driven
Frequency:
• Batch (weekly, daily, hourly)
• Micro-batch (every N minutes)
• Streaming / real time
Push vs. Pull:
• Push – systems send their data to the lake
• Pull – the lake initiates extracts
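As a hedged sketch of the metadata-driven, batch, pull style, a small driver loop reads a list of source definitions and lands each one (the source list, URLs, and paths are hypothetical; assumes the JDBC drivers are on the Spark classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Metadata-driven: sources are data, not code; onboarding a feed means adding a row
sources = [
    {"name": "orders",    "url": "jdbc:postgresql://src-db/sales", "table": "orders"},
    {"name": "customers", "url": "jdbc:postgresql://src-db/crm",   "table": "customers"},
]
for src in sources:
    df = (spark.read.format("jdbc")        # pull: the lake initiates the extract
          .option("url", src["url"])
          .option("dbtable", src["table"])
          .load())
    df.write.mode("overwrite").parquet(f"/lake/landing/{src['name']}/")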
22. Data Catalog
The data catalog is a central part of managing the lake
and should have features such as:
• Dataset definitions
• Field/column definitions
• Tags: Owner, Classification, PII
• Subject Matter Experts (SMEs)
Modern catalog tools also provide features such as:
• Crowdsourcing of metadata and gamification
• Automated annotation
Some examples: Alation, Lumada Data Catalog, IBM Watson Knowledge Catalog, AWS Glue, Azure Data Catalog / Purview
23. Hive Metastore
• Most data lakes that are Hadoop-based or Spark-based rely on a
metadata catalog called the Hive Metastore
• It is important to consider how this should be provisioned and
managed
• The metastore is a relational database and supports a variety of
DBMS types including both open source (PostgreSQL, MySQL) and
closed (Oracle, MS SQL Server)
• Some configurations allow for an external metastore that can be shared across workspaces (e.g., Databricks)
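A hedged sketch of pointing Spark at a shared external metastore; the connection details are hypothetical, and the javax.jdo.option.* keys are the standard Hive metastore connection properties, passed here through Spark's spark.hadoop.* passthrough:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("external-metastore")
         .config("spark.hadoop.javax.jdo.option.ConnectionURL",
                 "jdbc:postgresql://metastore-db:5432/metastore")
         .config("spark.hadoop.javax.jdo.option.ConnectionDriverName",
                 "org.postgresql.Driver")
         .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hive")
         # In practice, pull the password from a secret store, not source code
         .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "<secret>")
         .enableHiveSupport()
         .getOrCreate())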
24. Data Lake Consuming Systems
The lake will most likely feed multiple consuming systems, including:
• Data Warehouses
• Data Marts
• Operational Data Stores
• Feature Stores
• Data products or applications
Dashboards
Alerts/Notifications
Automated Actions
Datasets
Designing and architecting for data
consumption will require answering
questions such as:
• Will systems pull data from the
lake, or will data be pushed?
• How will these systems access the
data?
• How will systems be notified that
data is available?
• What environments will these
systems use for developing and
testing?
• What APIs will be used? (JDBC, ODBC, REST, SFTP)
25. Example: Modern Data Warehouse
in Azure
https://docs.microsoft.com/en-us/azure/architecture/solution-ideas/articles/modern-data-warehouse
28. Keeping the Lake Secure
• Network security controls
• Role-Based Access Controls (RBAC)
• Encryption
Transparent Data Encryption
Explicit Encryption
• Row-level and column-level access controls
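As one hedged illustration of column-level control (pseudonymization rather than true encryption; names are hypothetical), a sensitive column can be replaced with a one-way hash before data leaves the secure zone:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/lake/secure/customers/")
# Analysts can still join on the token without ever seeing the raw value
masked = df.withColumn("ssn_token", F.sha2(F.col("ssn"), 256)).drop("ssn")
masked.write.mode("overwrite").parquet("/lake/silver/customers/")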
29. Keeping the Lake Available
• Service Level Agreements
RPO (Recovery Point Objective) and RTO (Recovery Time Objective)
• Backups
Data
Configuration
Secrets
• Version Control
• Resource Locks
• Geo-Redundancy
• Automation
What’s your disaster recovery plan?
30. Access Patterns and Roles
The Lake needs to support several different types of access patterns:
1. System Access
Platform systems
Applications
2. Business User Access
Data Analysts
Data Scientists
3. Technology User Access
Support Access
Developer Access
Each of these groups needs different access rights appropriate to its role.
31. Regulations and Policies
impacting the Lake
External Regulations
• GDPR
• CCPA
• HIPAA
• PCI
Internal Policies
• PII and Privacy
• Information Classification
Some regulations, such as GDPR and CCPA, require that customer data be disclosed and/or deleted on request. This requires careful design.
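A hedged sketch of servicing a deletion ("right to be forgotten") request on a Delta table, where ACID DELETE support makes this tractable (the table and key are hypothetical; assumes a Delta-enabled session):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Delta rewrites the affected data files transactionally
spark.sql("DELETE FROM silver.customers WHERE customer_id = 'C12345'")
# Superseded file versions linger for time travel until vacuumed;
# VACUUM physically purges them after the retention window (168 hours = 7 days)
spark.sql("VACUUM silver.customers RETAIN 168 HOURS")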
32. User Support
• Data Catalog
• Access to Data and Tools
• Training
• Sandbox Provisioning
• Help & Support
33. Technical Exploration and Tool
Selection
• Explore and select tools and technologies
• Minimize number of tools
• Choose best of breed
• Consider Total Cost of Ownership (TCO)
• Select compatible technologies
36. Lessons Learned
1. Managing environments is hard
2. Automate everything
3. Don’t rush to fill the lake, you might wind up with a swamp
4. Know your data
5. Pick a high value use case and demonstrate value quickly
6. Minimize complexity
7. Make sure you have backups
8. Enable self-service
9. But set limits and controls on user space
10. Try out different options, but settle on a single solution
38. Machine Learning and AI
The Data Lake should not be an end in itself, but instead should be an enabler of new ways of using data for the benefit of the business and its customers.
Machine Learning and Artificial Intelligence hold much
promise and potential to leverage big data to create
innovative data products.
Some newer capabilities that are critical to this include:
• Feature Stores – Systems for storing and managing
“features” used by machine learning pipelines or models
• Model Registries – Systems for storing, managing and
operationalizing predictive models
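The slide names no specific tool, but as one hedged example, MLflow (also from Databricks) offers an open-source model registry. A minimal sketch, assuming an MLflow tracking server with a database-backed registry is configured:

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a toy model to have something to register
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

# Log the model as an artifact of a tracked run...
with mlflow.start_run() as run:
    mlflow.sklearn.log_model(model, "model")

# ...then promote it into the registry under a managed, versioned name
mlflow.register_model(f"runs:/{run.info.run_id}/model", "iris_classifier")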
40. The Event Streaming Platform
Championed by Confluent (creators of Kafka), this enterprise architecture pattern uses a
hub-and-spoke model where systems stream
events to a hub, which can be read by other
systems.
• Enables real-time, event-driven systems
• Reduces point-to-point dependencies
• Complements data lakes, data warehouses, and other systems
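As a hedged sketch of how the hub complements the lake, Spark Structured Streaming can subscribe to a Kafka topic and land events continuously as a Bronze table (broker, topic, and paths are hypothetical; assumes the Kafka and Delta packages are on the Spark classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Subscribe to an event topic on the hub
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "customer-events")
          .load())

# Continuously land the raw payloads into the lake's Bronze zone
(events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
 .writeStream.format("delta")
 .option("checkpointLocation", "/lake/_checkpoints/customer-events")
 .start("/lake/bronze/customer_events"))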