Ensuring Data Quality In a
Data Lake with
By Paul Singman
@datawhisp
DevOps & Drinks
Feb 11, 2022
Agenda
1. L1: Data Lake
- Why they excel at Performance, Cost, Dev Ex, Integrations
2. L2: + Optimized Table Formats
- Delta, Hudi, Iceberg
3. L3: + Data Version Control
- lakeFS
4. lakeFS + Delta Demo!
L1: Basic Data Lake
[Diagram: Object Store; Date-separated .csv files; ML, BI, Data-Intensive APIs]
L1: Basic Data Lake
are awesome in terms of
• Performance
• Cost
• Connectivity
• Developer Experience
Performance:
• Up to 3.5k PUT requests per second per prefix
• Up to 5.5k GET requests per second per prefix
• Scales to these limits automatically; there is no limit on the number of prefixes, so overall capacity is effectively limitless
• "something like 11 '9's of availability" (S3's eleven-nines design target is for durability)
From Oleg Lvovitch's 2021 re:Invent talk "Building a data lake on Amazon S3" and Matt Sidley's "Deep dive on Amazon S3"
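A minimal sketch of what the per-prefix limits imply, assuming load is spread evenly across one key prefix per day. The `events/dt=YYYY-MM-DD/` layout and the helper names are hypothetical, chosen only for illustration; the 3.5k and 5.5k figures are the ones quoted on the slide.

```python
from datetime import date, timedelta

# Per-prefix request rates quoted on the slide.
PUT_PER_SEC_PER_PREFIX = 3_500
GET_PER_SEC_PER_PREFIX = 5_500


def daily_prefixes(table: str, start: date, days: int) -> list[str]:
    """Build one key prefix per day (hypothetical date-separated layout)."""
    return [
        f"{table}/dt={(start + timedelta(days=i)).isoformat()}/"
        for i in range(days)
    ]


def aggregate_rate(prefix_count: int, per_prefix_rate: int) -> int:
    """Upper bound on requests/second if load is spread evenly over prefixes."""
    return prefix_count * per_prefix_rate


prefixes = daily_prefixes("events", date(2022, 2, 1), 7)
print(prefixes[0])                                             # first daily prefix
print(aggregate_rate(len(prefixes), GET_PER_SEC_PER_PREFIX))   # GET ceiling over 7 prefixes
print(aggregate_rate(len(prefixes), PUT_PER_SEC_PER_PREFIX))   # PUT ceiling over 7 prefixes
```

The ceiling grows linearly with the number of prefixes that actually receive traffic, which is why a date-separated layout only helps when reads and writes are not all concentrated on a single day.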
L1: Basic Data Lake
Cost:
• Storage: $0.023 per GB-month vs. $0.10 for RDS or $0.12 for EBS
• Network:
  • $5 per million PUT requests, $0.40 per million GET requests
  • $0 for data transfer in, $0.09 per GB for data transfer out
• ~5-8x cheaper than block storage
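To make the price list concrete, here is a rough monthly estimate using only the figures quoted on the slide. The workload numbers (1 TB stored, 2 million PUTs, 10 million GETs, 100 GB transferred out) and the function name are made-up assumptions; real bills depend on region, storage class, and tiering.

```python
# Prices quoted on the slide (USD), treated here as illustrative constants.
S3_STORAGE_PER_GB_MONTH = 0.023
EBS_STORAGE_PER_GB_MONTH = 0.12
PUT_PER_MILLION = 5.00
GET_PER_MILLION = 0.40
TRANSFER_OUT_PER_GB = 0.09


def s3_monthly_cost(stored_gb, put_requests, get_requests, transfer_out_gb):
    """Sum storage, request, and egress charges for one month."""
    storage = stored_gb * S3_STORAGE_PER_GB_MONTH
    requests = (
        put_requests / 1_000_000 * PUT_PER_MILLION
        + get_requests / 1_000_000 * GET_PER_MILLION
    )
    egress = transfer_out_gb * TRANSFER_OUT_PER_GB  # transfer in is $0
    return storage + requests + egress


# Hypothetical workload: 1 TB stored, 2M PUTs, 10M GETs, 100 GB out.
cost = s3_monthly_cost(1000, 2_000_000, 10_000_000, 100)
ebs_storage_only = 1000 * EBS_STORAGE_PER_GB_MONTH

print(f"S3 total:         ${cost:.2f}")
print(f"EBS storage only: ${ebs_storage_only:.2f}")
```

For this workload the S3 storage charge is $23, requests add $14, and egress adds $9, so even with request and transfer charges included the total stays well under the block-storage charge for the same volume.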
L1: Basic Data Lake
Connectivity and Developer Experience:
• Mature client SDKs
• Strong Consistency (2020)
• AWS Storage Lens (2020)
• Feature-rich (events, permissions, inventories, replication...)
L1: Basic Data Lake
How do we improve upon this?
L2: Modern Table Formats
The idea: To maintain object transaction logs as metadata stored alongside the data, which query engines make use of to provide:
- Transaction isolation
- Data Versioning
- Schema enforcement
- Performance improvements
Implementations:
- Delta Lake
- Apache Hudi
- Apache Iceberg
[Diagram: ML, BI, Data-Intensive APIs; Object Store; parquet metadata; transaction log]
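The transaction-log idea can be illustrated with a toy model. This is a deliberately simplified sketch, not the on-disk format of Delta Lake, Hudi, or Iceberg: the `TableLog` class, its method names, and the file names are all hypothetical. Each commit records which data files were added and removed, and a reader replays the log to find the files that make up a given version.

```python
from dataclasses import dataclass, field


@dataclass
class TableLog:
    """Toy append-only transaction log; each entry is one committed version."""

    entries: list = field(default_factory=list)

    def commit(self, add=(), remove=()):
        """Record, as one atomic entry, the files a transaction added and removed."""
        self.entries.append({"add": list(add), "remove": list(remove)})
        return len(self.entries) - 1  # version number of this commit

    def snapshot(self, version=None):
        """Replay the log up to `version` and return the set of live data files."""
        if version is None:
            version = len(self.entries) - 1
        live = set()
        for entry in self.entries[: version + 1]:
            live -= set(entry["remove"])
            live |= set(entry["add"])
        return live


log = TableLog()
v0 = log.commit(add=["part-000.parquet", "part-001.parquet"])
v1 = log.commit(add=["part-002.parquet"], remove=["part-000.parquet"])

print(sorted(log.snapshot(v0)))  # files visible at the first version
print(sorted(log.snapshot()))    # files visible at the latest version
```

Because readers only ever see files named by a committed log entry, a half-finished write is invisible (isolation), and any earlier version can be reconstructed by replaying a shorter prefix of the log (versioning).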
L3: Data Version Control
The idea: To extend the available object store operations with Git-like source control operations, in order to:
- Revert bad data instantly
- Expose new data atomically (cross-collection)
- Develop in isolation
- Simplify data reproducibility
Implementations:
- lakeFS
- Nessie
[Diagram: ML, BI, Data-Intensive APIs; Object Store; Data Repos w/ commit, merge, branch, revert operations]
Best Practice and lakeFS Solution:
- Revert bad data instantly
  Best Practice: Identify and fix data errors instantly
  lakeFS Solution: $ lakectl revert main^1
- Expose new data atomically (cross-collection)
  Best Practice: Update datasets atomically
  lakeFS Solution: $ lakectl merge my-branch main
- Develop in isolation
  Best Practice: Develop new data assets in isolation
  lakeFS Solution: $ lakectl branch create my-branch
- Simplify data reproducibility
  Best Practice: Reproduce jobs and pipelines easily
  lakeFS Solution: $ spark.read.parquet('s3://my-repo/<commit_id>')
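The branch, merge, revert, and read-by-commit operations can be modeled with a small in-memory toy. This is only a sketch of the semantics, not how lakeFS or Nessie is implemented: the `ToyRepo` class, its method names, and the file paths are hypothetical, merges here let the source win without conflict detection, and revert is modeled as a new commit restoring the parent's snapshot.

```python
class ToyRepo:
    """In-memory model: every commit is an immutable {path: content} snapshot."""

    def __init__(self):
        self.commits = [{}]          # commit id = list index; commit 0 is empty
        self.parents = [None]
        self.branches = {"main": 0}  # branch name -> commit id

    def branch_create(self, name, source="main"):
        # Zero-copy: the new branch just points at the source's commit.
        self.branches[name] = self.branches[source]

    def commit(self, branch, changes):
        head = self.branches[branch]
        snapshot = dict(self.commits[head])
        snapshot.update(changes)
        self.commits.append(snapshot)
        self.parents.append(head)
        self.branches[branch] = len(self.commits) - 1
        return self.branches[branch]

    def merge(self, source, dest):
        # Simplification: source changes win; deletions and conflicts are ignored.
        return self.commit(dest, self.commits[self.branches[source]])

    def revert_last(self, branch):
        # Modeled as a new commit whose content equals the parent's snapshot.
        head = self.branches[branch]
        parent = self.parents[head]
        if parent is None:
            raise ValueError("nothing to revert")
        self.commits.append(dict(self.commits[parent]))
        self.parents.append(head)
        self.branches[branch] = len(self.commits) - 1
        return self.branches[branch]

    def read(self, ref, path):
        # `ref` is either a branch name or a commit id.
        commit_id = self.branches.get(ref, ref)
        return self.commits[commit_id].get(path)


repo = ToyRepo()
repo.commit("main", {"sales/2022-02-10.csv": "ok"})

repo.branch_create("my-branch")                       # develop in isolation
repo.commit("my-branch", {"sales/2022-02-11.csv": "new"})
isolated = repo.read("main", "sales/2022-02-11.csv")  # main is unaffected

merged = repo.merge("my-branch", "main")              # expose new data atomically
after_merge = repo.read("main", "sales/2022-02-11.csv")

repo.revert_last("main")                              # revert bad data
after_revert = repo.read("main", "sales/2022-02-11.csv")

by_commit = repo.read(merged, "sales/2022-02-11.csv") # reproduce from a commit id
print(isolated, after_merge, after_revert, by_commit)
```

Reading by commit id keeps returning the same data even after the branch has moved on, which is the property that makes a job reproducible.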
Demo
core principles
Format-agnostic: Works with all data formats out of the box
Scale: Graveler data model supports exabyte-size datasets
Infrastructure: Integrates with any tool that can talk to object stores
Graveler Data Model
THANK YOU!
http:// .io

Ensuring Quality in Data Lakes (D&D Meetup Feb 22)

  • 1. Ensuring Data Quality In a Data Lake with By Paul Singman @datawhisp DevOps & Drinks Feb 11, 2022
  • 2. Agenda 1. L1 DataLake - Why they excel at Performance, Cost, Dev Ex, Integrations 2. L2: + Optimized Table Formats - Delta, Hudi, Iceberg 3. L3: + Data Version Control - lakeFS 4. lakeFS + Delta Demo!
  • 4. L1: Basic Data Lake Object Store
  • 5. L1: Basic Data Lake Object Store Date-separated .csv files
  • 6. L1: Basic Data Lake Object Store Date-separated .csv files ML BI Data-Intensive APIs
  • 7. are awesome in terms of L1: Basic Data Lake
  • 8. are awesome in terms of • Performance • Cost • Connectivity • Developer Experience L1: Basic Data Lake
  • 9. are awesome in terms of • Performance • Cost • Connectivity • Developer Experience • Achieve 3.5k PUT requests per second per prefix • 5.5k GET requests per second per prefix • Auto-scales to this limit, and overall capacity is limitless • "something like 11 '9's of availability" From Oleg Lvovitch's 2021 re:Invent talk "Building a data lake on Amazon S3" and Matt Sidley's "Deep dive on Amazon S3" L1: Basic Data Lake
  • 10. are awesome in terms of • Performance • Cost • Connectivity • Developer Experience • Storage: $.023 per GB vs $.10 for RDS or $.12 for EBS • Network: • $5 per million PUT, $.40 per million GET requests, • $0 data transfer in, $.09 per GB for data transfer out • ~5-8x cheaper than block storage From Oleg Lvovitch's 2021 re:Invent talk "Building a data lake on Amazon S3" and Matt Sidley's "Deep dive on Amazon S3" L1: Basic Data Lake
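As a rough illustration of how the list prices on the cost slide combine, here is a minimal Python sketch. The workload figures are hypothetical, the storage price is assumed to be per GB per month, and real pricing varies by region and storage tier.

```python
# Prices as quoted on the slide (USD). Assumption: storage is billed per GB-month.
STORAGE_PER_GB_MONTH = 0.023
PUT_PER_MILLION = 5.00
GET_PER_MILLION = 0.40
TRANSFER_OUT_PER_GB = 0.09  # transfer in is $0 on the slide, so it is omitted


def monthly_cost(stored_gb, put_requests, get_requests, egress_gb):
    """Estimate a monthly object-store bill from the slide's list prices."""
    storage = stored_gb * STORAGE_PER_GB_MONTH
    requests = (put_requests / 1e6) * PUT_PER_MILLION \
        + (get_requests / 1e6) * GET_PER_MILLION
    egress = egress_gb * TRANSFER_OUT_PER_GB
    return storage + requests + egress


# Hypothetical workload: 10 TB stored, 2M PUTs, 50M GETs, 100 GB transferred out.
estimate = monthly_cost(10_000, 2_000_000, 50_000_000, 100)
print(f"${estimate:,.2f}")
```

For this made-up workload, storage dominates the bill (about $230 of roughly $269), which is why the per-GB storage price is the headline number in the comparison with RDS and EBS.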
  • 11. are awesome in terms of • Performance • Cost • Connectivity • Developer Experience From Oleg Lvovitch's 2021 re:Invent talk "Building a data lake on Amazon S3" and Matt Sidley's "Deep dive on Amazon S3" • Mature client SDKs • Strong Consistency (2020) • AWS Storage Lens (2020) • Feature-rich (events, permissions, inventories, replication...) L1: Basic Data Lake
  • 12. are awesome in terms of From Oleg Lvovitch's 2021 re:Invent talk "Building a data lake on Amazon S3" and Matt Sidley's "Deep dive on Amazon S3" • Mature client SDKs • Strong Consistency (2020) • AWS Storage Lens (2020) • Feature-rich (events, permissions, inventories, replication...) L1: Basic Data Lake
  • 13. are awesome in terms of • Performance • Cost • Connectivity • Developer Experience From Oleg Lvovitch's 2021 re:Invent talk "Building a data lake on Amazon S3" and Matt Sidley's "Deep dive on Amazon S3" L1: Basic Data Lake
  • 14. are awesome in terms of • Performance • Cost • Connectivity • Developer Experience From Oleg Lvovitch's 2021 re:Invent talk "Building a data lake on Amazon S3" and Matt Sidley's "Deep dive on Amazon S3" L1: Basic Data Lake
  • 15. How do we improve upon this? L1: Basic Data Lake Object Store Date-separated .csv files ML BI Data-Intensive APIs
  • 16. L2: Modern Table Formats The idea: To maintain object transaction logs as metadata stored alongside the data that query engines make use of to provide:
  • 17. L2: Modern Table Formats The idea: To maintain object transaction logs as metadata stored alongside the data that query engines make use of to provide: - Transaction isolation - Data Versioning - Schema enforcement - Performance improvements
  • 18. L2: Modern Table Formats The idea: To maintain object transaction logs as metadata stored alongside the data that query engines make use of to provide: - Transaction isolation - Data Versioning - Schema enforcement - Performance improvements Implementations: - Delta Lake - Apache Hudi - Apache Iceberg
  • 19. L2: Modern Table Formats The idea: To maintain object transaction logs as metadata stored alongside the data that query engines make use of to provide: - Transaction isolation - Data Versioning - Schema enforcement - Performance improvements Implementations: - Delta Lake - Apache Hudi - Apache Iceberg parquet metadata transaction log ML BI Data-Intensive APIs Object Store
  • 20. L2: Modern Table Formats The idea: To maintain object transaction logs as metadata stored alongside the data that query engines make use of to provide: - Transaction isolation - Data Versioning - Schema enforcement - Performance improvements Implementations: - Delta Lake - Apache Hudi - Apache Iceberg ML BI Data-Intensive APIs Object Store parquet metadata transaction log
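The transaction-log idea on the L2 slides can be sketched with a toy in-memory model. This is an illustration only, not the on-disk format of Delta Lake, Hudi, or Iceberg; the class, method, and file names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    """One log entry: data files added and removed by a single transaction."""
    added: set = field(default_factory=set)
    removed: set = field(default_factory=set)


class ToyTableLog:
    """Toy append-only transaction log kept alongside the data files."""

    def __init__(self):
        self.commits = []

    def commit(self, added=(), removed=()):
        # A transaction's changes become visible only once its entry is appended.
        self.commits.append(Commit(set(added), set(removed)))
        return len(self.commits) - 1  # version number of the new commit

    def snapshot(self, version=None):
        """Replay the log up to `version` to get the live set of data files."""
        if version is None:
            version = len(self.commits) - 1
        files = set()
        for entry in self.commits[: version + 1]:
            files -= entry.removed
            files |= entry.added
        return files


log = ToyTableLog()
v0 = log.commit(added=["part-000.parquet"])
v1 = log.commit(added=["part-001.parquet"])
# A rewrite: part-000 is replaced by part-002 in one atomic log entry.
v2 = log.commit(added=["part-002.parquet"], removed=["part-000.parquet"])

print(sorted(log.snapshot()))    # current table state
print(sorted(log.snapshot(v1)))  # reading an earlier version of the table
```

Because readers resolve the file list through the log rather than by listing the object store, a half-written rewrite is never visible, and older versions stay readable as long as their files are retained.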
  • 21. L3: Data Version Control
  • 22. L3: Data Version Control The idea: To extend available object store operations with git source control to:
  • 23. L3: Data Version Control The idea: To extend available object store operations with git source control to: - Revert bad data instantly - Expose new data atomically (cross-collection) - Develop in isolation - Simplify data reproducibility
  • 24. L3: Data Version Control The idea: To extend available object store operations with git source control to: Implementations: - lakeFS - Nessie - Revert bad data instantly - Expose new data atomically (cross-collection) - Develop in isolation - Simplify data reproducibility
  • 25. L3: Data Version Control The idea: To extend available object store operations with git source control to: Implementations: - lakeFS - Nessie - Revert bad data instantly - Expose new data atomically (cross-collection) - Develop in isolation - Simplify data reproducibility ML BI Data-Intensive APIs Object Store parquet metadata transaction log
  • 26. L3: Data Version Control The idea: To extend available object store operations with git source control to: Implementations: - lakeFS - Nessie - Revert bad data instantly - Expose new data atomically (cross-collection) - Develop in isolation - Simplify data reproducibility ML BI Data-Intensive APIs Object Store Data Repos w/ commit, merge, branch, revert operations
  • 27. Best Practice lakeFS Solution Identify and fix data errors instantly Develop new data assets in isolation Reproduce jobs and pipelines easily Update datasets atomically $ lakectl branch create my-branch $ lakectl merge my-branch main $ lakectl revert main^1 $ spark.read.parquet('s3://my-repo/<commit_id>') - Revert bad data instantly - Expose new data atomically (cross-collection) - Develop in isolation - Simplify data reproducibility L3: Data Version Control
  • 28. Best Practice lakeFS Solution Identify and fix data errors instantly Develop new data assets in isolation Reproduce jobs and pipelines easily Update datasets atomically $ lakectl branch create my-branch $ lakectl merge my-branch main $ lakectl revert main^1 $ spark.read.parquet('s3://my-repo/<commit_id>') - Revert bad data instantly - Expose new data atomically (cross-collection) - Develop in isolation - Simplify data reproducibility lakeFS Solution $ lakectl revert main^1 L3: Data Version Control
  • 29. Best Practice lakeFS Solution Identify and fix data errors instantly Develop new data assets in isolation Reproduce jobs and pipelines easily Update datasets atomically $ lakectl branch create my-branch $ lakectl merge my-branch main $ lakectl revert main^1 $ spark.read.parquet('s3://my-repo/<commit_id>') - Revert bad data instantly - Expose new data atomically (cross-collection) - Develop in isolation - Simplify data reproducibility lakeFS Solution $ lakectl revert main^1 $ lakectl merge my-branch main L3: Data Version Control
  • 30. Best Practice lakeFS Solution Identify and fix data errors instantly Develop new data assets in isolation Reproduce jobs and pipelines easily Update datasets atomically $ lakectl branch create my-branch $ lakectl merge my-branch main $ lakectl revert main^1 $ spark.read.parquet('s3://my-repo/<commit_id>') - Revert bad data instantly - Expose new data atomically (cross-collection) - Develop in isolation - Simplify data reproducibility lakeFS Solution $ lakectl revert main^1 $ lakectl merge my-branch main $ lakectl branch create my-branch L3: Data Version Control
  • 31. Best Practice lakeFS Solution Identify and fix data errors instantly Develop new data assets in isolation Reproduce jobs and pipelines easily Update datasets atomically $ lakectl branch create my-branch $ lakectl merge my-branch main $ lakectl revert main^1 $ spark.read.parquet('s3://my-repo/<commit_id>') - Revert bad data instantly - Expose new data atomically (cross-collection) - Develop in isolation - Simplify data reproducibility lakeFS Solution $ lakectl revert main^1 $ lakectl merge my-branch main $ lakectl branch create my-branch $ spark.read.parquet('s3://my-repo/<commit_id>') L3: Data Version Control
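The reproducibility point on the last L3 slide rests on addressing data by ref: the path segment after the repository name can be a branch or a commit ID. A minimal Python sketch of that addressing follows; `lakefs_uri` is a hypothetical helper (not part of lakeFS), and the repository, branch, commit ID, and table path are made-up examples. The URI scheme to use in practice depends on how Spark is configured.

```python
def lakefs_uri(repo, ref, path="", scheme="s3"):
    """Build an object-store style URI whose first path segment is a ref.

    `ref` may be a branch name or a commit ID. Hypothetical helper for
    illustration; it only formats a string and talks to no server.
    """
    parts = [repo, ref]
    if path:
        parts.append(path.strip("/"))
    return f"{scheme}://" + "/".join(parts)


# Branch head: moves as new commits land, suited to isolated development.
dev = lakefs_uri("my-repo", "my-branch", "events/")

# Commit ID: fixed, so a job that reads it sees the same input on every run.
pinned = lakefs_uri("my-repo", "a1b2c3d", "events/")

print(dev)
print(pinned)
# With a Spark session configured against the repository, either string
# could be passed to spark.read.parquet(...), as on the slide.
```

Recording the commit ID a job read from, rather than the branch name, is what makes a later rerun comparable to the original.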
  • 32. Demo
  • 33. core principles Format-agnostic: Works with all data formats out of the box. Scale: Graveler data model supports exabyte-size datasets. Infrastructure: Integrates with any tool that can talk to object stores.