The document discusses improving data quality in a data lake. It describes three levels (L1-L3) of data lake maturity:
L1 stores data in an object store in a basic format such as CSV files. This already provides strong performance, cost efficiency, connectivity, and developer experience.
L2 adds optimized table formats (Delta Lake, Apache Hudi, Apache Iceberg) that maintain metadata and transaction logs to enable schema enforcement, data versioning, and transaction isolation.
L3 adds data version control systems such as lakeFS that extend the object store with Git-like source control operations. This allows instantly reverting bad data, developing data in isolation, and simplifying data reproducibility. lakeFS was demonstrated as an example solution.
L1: Basic Data Lake
Object stores are awesome in terms of:
• Performance
• Cost
• Connectivity
• Developer Experience
Performance:
• 3,500 PUT requests per second per prefix
• 5,500 GET requests per second per prefix
• Scales to this limit automatically, and overall capacity is limitless
• Eleven 9's of durability
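Because these limits apply per prefix, key layout directly determines achievable throughput. A minimal boto3 sketch of a date-partitioned layout, where each dt=YYYY-MM-DD/ prefix gets its own request-rate budget (bucket and key names here are hypothetical):

    import datetime

    import boto3

    s3 = boto3.client("s3")
    today = datetime.date.today().isoformat()

    # Writing under a per-day prefix gives each partition its own
    # 3,500 PUT/s and 5,500 GET/s budget.
    s3.put_object(
        Bucket="my-data-lake",                   # hypothetical bucket
        Key=f"events/dt={today}/part-0001.csv",  # one prefix per day
        Body=b"user_id,event\n42,click\n",
    )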
Cost:
• Storage: $0.023 per GB, vs. $0.10 for RDS or $0.12 for EBS
• Network: $5 per million PUT requests, $0.40 per million GET requests; $0 for data transfer in, $0.09 per GB for data transfer out
• ~5-8x cheaper than block storage
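A back-of-the-envelope check of the ~5-8x claim, using only the per-GB prices quoted above; the 10 TB dataset size is an arbitrary example:

    # Monthly storage cost from the per-GB prices quoted above.
    S3_PER_GB, RDS_PER_GB, EBS_PER_GB = 0.023, 0.10, 0.12

    gb = 10 * 1024  # 10 TB, chosen arbitrarily
    print(f"S3:  ${S3_PER_GB * gb:,.2f}/month")   # ~$235.52
    print(f"RDS: ${RDS_PER_GB * gb:,.2f}/month")  # ~$1,024.00 (~4.3x S3)
    print(f"EBS: ${EBS_PER_GB * gb:,.2f}/month")  # ~$1,228.80 (~5.2x S3)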
Connectivity & Developer Experience:
• Mature client SDKs
• Strong consistency (2020)
• AWS Storage Lens (2020)
• Feature-rich (events, permissions, inventories, replication, ...)

From Oleg Lvovitch's 2021 re:Invent talk "Building a data lake on Amazon S3" and Matt Sidley's "Deep dive on Amazon S3".
How do we improve upon this?
[Diagram (L1 recap): ML, BI, and data-intensive APIs read date-separated .csv files directly from the object store.]
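For reference, the L1 picture above amounts to query engines reading the raw files directly. A minimal PySpark sketch (bucket and paths are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("l1-basic-lake").getOrCreate()

    # Each day lands under its own dt= prefix; a glob reads them all.
    df = spark.read.option("header", "true").csv("s3a://my-data-lake/events/dt=*/")
    df.groupBy("event").count().show()  # a simple BI-style aggregate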
L2: Modern Table Formats
The idea: maintain object transaction logs as metadata stored alongside the data, which query engines use to provide:
- Transaction isolation
- Data versioning
- Schema enforcement
- Performance improvements

Implementations:
- Delta Lake
- Apache Hudi
- Apache Iceberg

[Diagram: ML, BI, and data-intensive APIs query Parquet data plus metadata and a transaction log in the object store.]
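As a concrete sketch of the transaction-log idea, the snippet below shows Delta Lake's schema enforcement on write and versioned (time-travel) reads. It assumes Spark with the delta-spark package available; the table path and column names are hypothetical.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("delta-demo")
        .config("spark.sql.extensions",
                "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    path = "s3a://my-data-lake/events_delta"  # hypothetical table location

    # Initial write: Delta records the schema in its transaction log.
    df = spark.createDataFrame([(42, "click")], ["user_id", "event"])
    df.write.format("delta").mode("overwrite").save(path)

    # Schema enforcement: an append with a mismatched schema is rejected
    # instead of silently corrupting the table.
    bad = spark.createDataFrame([("oops",)], ["wrong_column"])
    try:
        bad.write.format("delta").mode("append").save(path)
    except Exception as err:
        print("rejected by schema enforcement:", type(err).__name__)

    # Data versioning: time-travel back to the first version of the table.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)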
L3: Data Version Control
The idea: extend the available object store operations with Git-like source control in order to:
- Revert bad data instantly
- Expose new data atomically (cross-collection)
- Develop in isolation
- Simplify data reproducibility

Implementations:
- lakeFS
- Nessie

[Diagram: ML, BI, and data-intensive APIs work against data repositories supporting commit, merge, branch, and revert operations on top of the object store.]
L3: Data Version Control
Best practices and the lakeFS operations that implement them:
- Identify and fix data errors instantly:
  $ lakectl revert main^1
- Develop new data assets in isolation:
  $ lakectl branch create my-branch
- Update datasets atomically:
  $ lakectl merge my-branch main
- Reproduce jobs and pipelines easily:
  $ spark.read.parquet('s3://my-repo/<commit_id>')
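A minimal PySpark sketch of the reproducibility pattern above. It assumes a lakeFS deployment reached through its S3-compatible gateway (fs.s3a.endpoint pointed at the lakeFS server), where a repository appears as a bucket and a branch name or commit ID is the first path element; the repo, paths, and commit ID are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lakefs-repro").getOrCreate()

    # A branch reference moves as new commits land on it.
    live_df = spark.read.parquet("s3a://my-repo/main/events/")

    # A commit ID is immutable: reading it always returns exactly the same
    # data, which is what makes jobs and pipelines reproducible.
    commit_id = "abc123..."  # hypothetical; e.g. taken from `lakectl log`
    frozen_df = spark.read.parquet(f"s3a://my-repo/{commit_id}/events/")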
lakeFS Core Principles
- Format-agnostic: works with all data formats out of the box.
- Scale: the Graveler data model supports exabyte-size datasets.
- Infrastructure: integrates with any tool that can talk to object stores.