The document discusses improving data quality in a data lake. It describes three levels (L1-L3) of data lake maturity:
L1 stores data in an object store in a basic format such as CSV files. This already provides strong performance, cost efficiency, connectivity, and developer experience.
L2 adds optimized table formats (Delta Lake, Apache Hudi, Apache Iceberg) that maintain metadata and transaction logs to enable schema enforcement, data versioning, and transaction isolation.
L3 adds data version control systems such as lakeFS that extend the object store with Git-like source control operations. This allows instantly reverting bad data, developing data in isolation, and simplifying data reproducibility. lakeFS was demonstrated as an example solution.
L1: Basic Data Lake
Object stores are awesome in terms of:
• Performance
• Cost
• Connectivity
• Developer Experience
Performance:
• 3,500 PUT requests per second per prefix
• 5,500 GET requests per second per prefix
• Scales to this limit automatically, and overall capacity is limitless
• Eleven 9's of durability
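Because these limits apply per prefix, key layout directly determines achievable throughput. A minimal boto3 sketch of a date-partitioned layout, where each dt=YYYY-MM-DD/ prefix gets its own request-rate budget (bucket and key names here are hypothetical):

    import datetime

    import boto3

    s3 = boto3.client("s3")
    today = datetime.date.today().isoformat()

    # Writing under a per-day prefix gives each partition its own
    # 3,500 PUT/s and 5,500 GET/s budget.
    s3.put_object(
        Bucket="my-data-lake",                   # hypothetical bucket
        Key=f"events/dt={today}/part-0001.csv",  # one prefix per day
        Body=b"user_id,event\n42,click\n",
    )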
Cost:
• Storage: $0.023 per GB, vs. $0.10 for RDS or $0.12 for EBS
• Network: $5 per million PUT requests, $0.40 per million GET requests; $0 for data transfer in, $0.09 per GB for data transfer out
• ~5-8x cheaper than block storage
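A back-of-the-envelope check of the ~5-8x claim, using only the per-GB prices quoted above; the 10 TB dataset size is an arbitrary example:

    # Monthly storage cost from the per-GB prices quoted above.
    S3_PER_GB, RDS_PER_GB, EBS_PER_GB = 0.023, 0.10, 0.12

    gb = 10 * 1024  # 10 TB, chosen arbitrarily
    print(f"S3:  ${S3_PER_GB * gb:,.2f}/month")   # ~$235.52
    print(f"RDS: ${RDS_PER_GB * gb:,.2f}/month")  # ~$1,024.00 (~4.3x S3)
    print(f"EBS: ${EBS_PER_GB * gb:,.2f}/month")  # ~$1,228.80 (~5.2x S3)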
Connectivity & Developer Experience:
• Mature client SDKs
• Strong consistency (2020)
• AWS Storage Lens (2020)
• Feature-rich (events, permissions, inventories, replication, ...)

From Oleg Lvovitch's 2021 re:Invent talk "Building a data lake on Amazon S3" and Matt Sidley's "Deep dive on Amazon S3".
How do we improve upon this?
[Diagram (L1 recap): ML, BI, and data-intensive APIs read date-separated .csv files directly from the object store.]
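For reference, the L1 picture above amounts to query engines reading the raw files directly. A minimal PySpark sketch (bucket and paths are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("l1-basic-lake").getOrCreate()

    # Each day lands under its own dt= prefix; a glob reads them all.
    df = spark.read.option("header", "true").csv("s3a://my-data-lake/events/dt=*/")
    df.groupBy("event").count().show()  # a simple BI-style aggregate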
L2: Modern Table Formats
The idea: maintain object transaction logs as metadata stored alongside the data, which query engines use to provide:
- Transaction isolation
- Data versioning
- Schema enforcement
- Performance improvements

Implementations:
- Delta Lake
- Apache Hudi
- Apache Iceberg

[Diagram: ML, BI, and data-intensive APIs query Parquet data plus metadata and a transaction log in the object store.]
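As a concrete sketch of the transaction-log idea, the snippet below shows Delta Lake's schema enforcement on write and versioned (time-travel) reads. It assumes Spark with the delta-spark package available; the table path and column names are hypothetical.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("delta-demo")
        .config("spark.sql.extensions",
                "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    path = "s3a://my-data-lake/events_delta"  # hypothetical table location

    # Initial write: Delta records the schema in its transaction log.
    df = spark.createDataFrame([(42, "click")], ["user_id", "event"])
    df.write.format("delta").mode("overwrite").save(path)

    # Schema enforcement: an append with a mismatched schema is rejected
    # instead of silently corrupting the table.
    bad = spark.createDataFrame([("oops",)], ["wrong_column"])
    try:
        bad.write.format("delta").mode("append").save(path)
    except Exception as err:
        print("rejected by schema enforcement:", type(err).__name__)

    # Data versioning: time-travel back to the first version of the table.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)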
L3: Data Version Control
The idea: extend the available object store operations with Git-like source control in order to:
- Revert bad data instantly
- Expose new data atomically (cross-collection)
- Develop in isolation
- Simplify data reproducibility

Implementations:
- lakeFS
- Nessie

[Diagram: ML, BI, and data-intensive APIs work against data repositories supporting commit, merge, branch, and revert operations on top of the object store.]
L3: Data Version Control
Best practices and the lakeFS operations that implement them:
- Identify and fix data errors instantly:
  $ lakectl revert main^1
- Develop new data assets in isolation:
  $ lakectl branch create my-branch
- Update datasets atomically:
  $ lakectl merge my-branch main
- Reproduce jobs and pipelines easily:
  $ spark.read.parquet('s3://my-repo/<commit_id>')
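A minimal PySpark sketch of the reproducibility pattern above. It assumes a lakeFS deployment reached through its S3-compatible gateway (fs.s3a.endpoint pointed at the lakeFS server), where a repository appears as a bucket and a branch name or commit ID is the first path element; the repo, paths, and commit ID are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lakefs-repro").getOrCreate()

    # A branch reference moves as new commits land on it.
    live_df = spark.read.parquet("s3a://my-repo/main/events/")

    # A commit ID is immutable: reading it always returns exactly the same
    # data, which is what makes jobs and pipelines reproducible.
    commit_id = "abc123..."  # hypothetical; e.g. taken from `lakectl log`
    frozen_df = spark.read.parquet(f"s3a://my-repo/{commit_id}/events/")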
lakeFS Core Principles
- Format-agnostic: works with all data formats out of the box.
- Scale: the Graveler data model supports exabyte-size datasets.
- Infrastructure: integrates with any tool that can talk to object stores.