Making Data Timelier and More Reliable
with Lakehouse Technology
Matei Zaharia
Databricks and Stanford University
Talk Outline
§ Many problems with data analytics today stem from the complex data
architectures we use
§ New “Lakehouse” technologies can remove this complexity by enabling
fast data warehousing, streaming & ML directly on data lake storage
The biggest challenges with data today:
data quality and staleness
Data Analyst Survey
60% reported data quality as top challenge
86% of analysts had to use stale data, with
41% using data that is >2 months old
90% regularly had unreliable data sources
Data Scientist Survey
[Chart: figures of 75%, 51% and 42%; the bar labels are not recoverable]
Getting high-quality, timely data is hard…
but it’s partly a problem of our own making!
The Evolution of
Data Management
1980s: Data Warehouses
§ ETL data directly from operational
database systems
§ Purpose-built for SQL analytics & BI:
schemas, indexes, caching, etc
§ Powerful management features such as
ACID transactions and time travel
[Diagram: Operational Data → ETL → Data Warehouses → BI & Reports]
2010s: New Problems for Data Warehouses
§ Could not support rapidly growing
unstructured and semi-structured data:
time series, logs, images, documents, etc
§ High cost to store large datasets
§ No support for data science & ML
2010s: Data Lakes
§ Low-cost storage to hold all raw data
(e.g. Amazon S3, HDFS)
▪ $12/TB/month for S3 infrequent tier!
§ ETL jobs then load specific data into
warehouses, possibly for further ELT
§ Directly readable in ML libraries (e.g.
TensorFlow) due to open file format
[Diagram: structured, semi-structured & unstructured data lands in the data lake; ETL and data-preparation jobs load subsets into data warehouses and a real-time database, which serve BI and reports, while data science and machine learning read the lake directly]
Problems with Today’s Data Lakes
Cheap to store all the data, but system architecture is much more complex!
Data reliability suffers:
§ Multiple storage systems with different
semantics, SQL dialects, etc
§ Extra ETL steps that can go wrong
Timeliness suffers:
§ Extra ETL steps before data is available
in data warehouses
Summary
At least some of the problems in modern data architectures are due to
unnecessary system complexity
§ We wanted low-cost storage for large historical data, but we
designed separate storage systems (data lakes) for that
§ Now we need to sync data across systems all the time!
What if we didn’t need to have all these different data systems?
Lakehouse Technology
New techniques to provide data warehousing features directly on
data lake storage
§ Retain existing open file formats (e.g. Apache Parquet, ORC)
§ Add management and performance features on top
(transactions, data versioning, indexes, etc)
§ Can also help eliminate other data systems, e.g. message queues
Key parts: metadata layers such as Delta Lake (from Databricks)
and Apache Iceberg (from Netflix) + new engine designs
Lakehouse Vision
Data lake storage for all data
Single platform for every use case
Management features (transactions, versioning, etc)
[Diagram: one Lakehouse layer serving streaming analytics, BI, data science and machine learning over structured, semi-structured & unstructured data]
Key Technologies Enabling Lakehouse
1. Metadata layers for data lakes: add transactions, versioning & more
2. New query engine designs: great SQL performance on data lake
storage systems and file formats
3. Optimized access for data science & ML
Metadata Layers for Data Lakes
§ A data lake is normally just a collection of files
§ Metadata layers keep track of which files are part of a table to enable
richer management features such as transactions
▪ Clients can then still access the underlying files at high speed
§ Implemented in multiple systems: Delta Lake and Apache Iceberg keep their
metadata in the object store itself, while systems such as Hive ACID keep it
in a database
Example: Basic Data Lake
The “events” table is just a set of files: file1.parquet, file2.parquet, file3.parquet
Query: delete all events data about customer #17
The engine rewrites file1.parquet → file1b.parquet and file3.parquet → file3b.parquet,
then deletes file1.parquet and file3.parquet
Problem: What if a query reads the table while the delete is running?
Problem: What if the query doing the delete fails partway through?
Example with Delta Lake
The “events” table consists of file1.parquet, file2.parquet and file3.parquet,
plus a _delta_log directory (v1.parquet, v2.parquet, …) that tracks which files
are part of each version of the table (e.g. v2 = file1, file2, file3)
Query: delete all events data about customer #17
The engine rewrites file1.parquet → file1b.parquet and file3.parquet → file3b.parquet,
then atomically adds a new log file _delta_log/v3.parquet recording
v3 = file1b, file2, file3b
Clients now always read a consistent table version!
• If a client reads v2 of the log, it sees file1, file2, file3 (no delete)
• If a client reads v3 of the log, it sees file1b, file2, file3b (all deleted)
See our VLDB 2020 paper for details
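To make the commit protocol concrete, here is a toy sketch in plain Python. It is not the real Delta Lake implementation (which encodes actions as JSON plus Parquet checkpoints); it only illustrates the atomically-added log file and the version replay described above, with invented file and function names:

import json, os

def commit(log_dir, version, actions):
    # Atomically create the log file for `version`; O_EXCL makes creation
    # fail if another writer already committed this version, which is what
    # serializes commits and keeps versions atomic.
    path = os.path.join(log_dir, f"v{version}.json")
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    with os.fdopen(fd, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")

def snapshot(log_dir, version):
    # Replay log files v1..version to find the files in that table version.
    files = set()
    for v in range(1, version + 1):
        with open(os.path.join(log_dir, f"v{v}.json")) as f:
            for line in f:
                action = json.loads(line)
                if action["op"] == "add":
                    files.add(action["file"])
                elif action["op"] == "remove":
                    files.discard(action["file"])
    return files

# The delete example above, replayed against this toy log:
# commit(d, 3, [{"op": "add", "file": "file1b.parquet"},
#               {"op": "add", "file": "file3b.parquet"},
#               {"op": "remove", "file": "file1.parquet"},
#               {"op": "remove", "file": "file3.parquet"}])
# snapshot(d, 2) -> {file1, file2, file3}
# snapshot(d, 3) -> {file1b, file2, file3b}

Because a commit either creates its log file or fails entirely, readers only ever see complete versions, so both failure cases from the previous slide disappear.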
Other Management Features with Delta Lake
§ Time travel to an old table version (a PySpark equivalent follows this slide):
SELECT * FROM my_table
  TIMESTAMP AS OF "2020-05-01"
§ Zero-copy CLONE by forking the log:
CREATE TABLE my_table_dev
  SHALLOW CLONE my_table
§ DESCRIBE HISTORY
§ INSERT, UPSERT, DELETE & MERGE
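The same time travel is available from Python through Delta Lake's documented reader options; a small sketch (the table path /data/my_table is a placeholder):

# Read an old snapshot by version number
df_v2 = (spark.read.format("delta")
         .option("versionAsOf", 2)
         .load("/data/my_table"))

# Or by timestamp, matching the SQL example above
df_may = (spark.read.format("delta")
          .option("timestampAsOf", "2020-05-01")
          .load("/data/my_table"))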
Other Management Features with Delta Lake
§ Streaming I/O: treat a table as a stream of changes to remove the need for
message buses like Kafka (a usage sketch follows this list):
spark.readStream
  .format("delta")
  .table("events")
§ Schema enforcement & evolution
§ Expectations for data quality:
CREATE TABLE orders (
  product_id INTEGER NOT NULL,
  quantity INTEGER CHECK(quantity > 0),
  list_price DECIMAL CHECK(list_price > 0),
  discount_price DECIMAL
    CHECK(discount_price > 0 AND
          discount_price <= list_price)
);
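For instance, a downstream job can consume one Delta table as a stream and write its results to another, with no message bus in between. A sketch using standard Structured Streaming calls; the table, column and path names are illustrative:

# Treat the "events" table as an unbounded stream of changes
raw = spark.readStream.format("delta").table("events")

# Any DataFrame transformation works on the stream
cleaned = raw.filter("event_type IS NOT NULL")

# Write continuously to a second Delta table; the checkpoint directory
# lets the job resume exactly where it left off after a failure
(cleaned.writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/events_cleaned")
    .start("/data/events_cleaned"))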
Adoption
§ Used by thousands of companies to process exabytes of data/day
§ Grew from zero to ~50% of the Databricks workload in 3 years
§ Largest deployments: exabyte tables and 1000s of users
Available Connectors
[Slide of connector logos, grouped into systems to store data in, ingest from, and query from]
Key Technologies Enabling Lakehouse
1. Metadata layers for data lakes: add transactions, versioning & more
2. New query engine designs: great SQL performance on data lake
storage systems and file formats
3. Optimized access for data science & ML
The Challenge
§ Most data warehouses have full control over the data storage system
and query engine, so they design them together
§ The key idea in a Lakehouse is to store data in open storage formats
(e.g. Parquet) for direct access from many systems
§ How can we get great performance with these standard, open formats?
Enabling Lakehouse Performance
Even with a fixed, directly-accessible storage format, four optimizations
can enable great SQL performance:
§ Caching hot data, possibly in a different format
§ Auxiliary data structures like statistics and indexes
§ Data layout optimizations to minimize I/O
§ Vectorized execution engines for modern CPUs
New query engines such as Databricks Delta Engine use these ideas
Optimization 1: Caching
§ Most query workloads have concentrated accesses on “hot” data
▪ Data warehouses use SSD and memory caches to improve performance
§ The same techniques work in a Lakehouse if we have a metadata layer
such as Delta Lake to correctly maintain the cache
▪ Caches can even hold data in a faster format (e.g. decompressed)
§ Example: SSD cache in
Databricks Delta Engine
[Chart: values read per second per core (millions), comparing Parquet on S3, Parquet on SSD, and the Delta Engine cache]
Optimization 2: Auxiliary Data Structures
§ Even if the base data is in Parquet, we can build many other data
structures to speed up queries and maintain them transactionally
▪ Inspired by the literature on databases for “raw” data formats
§ Example: min/max statistics on Parquet files for data skipping, updated
transactionally with the Delta table log:
file1.parquet: year min 2018, max 2019; uid min 12000, max 23000
file2.parquet: year min 2018, max 2020; uid min 12000, max 14000
file3.parquet: year min 2020, max 2020; uid min 23000, max 25000
Query: SELECT * FROM events WHERE year=2020 AND uid=24000
Only file3.parquet's ranges admit both predicates, so file1 and file2 are
skipped without being read
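The pruning decision itself is simple. A pure-Python sketch of the logic for the query above (an illustration only, not engine code):

# Per-file min/max ranges, as stored alongside the Delta table log
file_stats = {
    "file1.parquet": {"year": (2018, 2019), "uid": (12000, 23000)},
    "file2.parquet": {"year": (2018, 2020), "uid": (12000, 14000)},
    "file3.parquet": {"year": (2020, 2020), "uid": (23000, 25000)},
}

def may_contain(stats, column, value):
    # A file can only match if the value falls inside its min/max range
    lo, hi = stats[column]
    return lo <= value <= hi

# WHERE year = 2020 AND uid = 24000: read only files whose ranges admit both
to_read = [f for f, s in file_stats.items()
           if may_contain(s, "year", 2020) and may_contain(s, "uid", 24000)]
print(to_read)  # ['file3.parquet']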
Optimization 2: Auxiliary Data Structures
§ Even if the base data is in Parquet, we can build many other data
structures to speed up queries and maintain them transactionally
▪ Inspired by the literature on databases for “raw” data formats
§ Example: indexes over Parquet files
[Diagram: a tree index over file1.parquet, file2.parquet and file3.parquet
serving the query SELECT * FROM events WHERE type = "DELETE_ACCOUNT"]
Optimization 3: Data Layout
§ Query execution time primarily depends on amount of data accessed
§ Even with a fixed storage format such as Parquet, we can optimize the
data layout within tables to reduce execution time
§ Example: sorting a table for fast access:
file1.parquet: uid = 0…1000
file2.parquet: uid = 1001…2000
file3.parquet: uid = 2001…3000
file4.parquet: uid = 3001…4000
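On Databricks, such a clustered layout can be produced with Delta Lake's OPTIMIZE command; a short sketch issued through PySpark (the table and column names follow the examples above):

# Rewrite the table's files so rows are clustered by uid
# (OPTIMIZE ... ZORDER BY is Delta Lake syntax on Databricks)
spark.sql("OPTIMIZE events ZORDER BY (uid)")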
Optimization 3: Data Layout
§ Query execution time primarily depends on amount of data accessed
§ Even with a fixed storage format such as Parquet, we can optimize the
data layout within tables to reduce execution time
§ Example: Z-ordering for multi-dimensional access
[Diagram: a Z-order curve interleaving dimension 1 and dimension 2]
Data files skipped when filtering on each column:
Filter on col1: 99% (sort by col1) vs 67% (Z-order by col1-4)
Filter on col2: 0% (sort by col1) vs 60% (Z-order by col1-4)
Filter on col3: 0% (sort by col1) vs 47% (Z-order by col1-4)
Filter on col4: 0% (sort by col1) vs 44% (Z-order by col1-4)
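The core trick behind Z-ordering is bit interleaving: build a single sort key that weaves together the bits of several columns, so sorting by that key clusters the data along all dimensions at once. A minimal illustrative sketch in Python (not the production implementation, which also handles non-integer types):

def z_order_key(values, bits=16):
    # Interleave bit `bit` of every value: the key cycles through the
    # dimensions at each bit position (a Morton code)
    key = 0
    for bit in range(bits):
        for i, v in enumerate(values):
            key |= ((v >> bit) & 1) << (bit * len(values) + i)
    return key

# Sorting rows by this key keeps nearby (col1..col4) combinations in the
# same files, so min/max skipping works for filters on any of the columns
rows = [(3, 7, 1, 4), (1, 2, 8, 0), (6, 6, 2, 9), (2, 1, 5, 3)]
rows.sort(key=z_order_key)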
Optimization 4: Vectorized Execution
§ Modern data warehouses optimize CPU time by using vector (SIMD)
instructions on modern CPUs, e.g., AVX512
§ Many of these optimizations can also be applied over Parquet
§ Databricks Delta Engine: ~10x faster
than Java-based engines
TPC-DS benchmark time (s):
Delta Engine: 3,220
Apache Spark 3.0: 25,544
Presto 230: 35,861
Putting These Optimizations Together
§ Given that (1) most reads are from a cache, (2) I/O cost is the key factor
for non-cached data, and (3) CPU time can be optimized via SIMD…
§ Lakehouse engines can offer similar performance to DWs!
[Chart: TPC-DS 30 TB benchmark time (s) for DW1, DW2, DW3 and Delta Engine]
Key Technologies Enabling Lakehouse
1. Metadata layers for data lakes: add transactions, versioning & more
2. New query engine designs: great SQL performance on data lake
storage systems and file formats
3. Optimized access for data science & ML
ML over a Data Warehouse is Painful
§ Unlike SQL workloads, ML workloads need to process large amounts of
data with non-SQL code (e.g. TensorFlow, XGBoost, etc)
§ SQL over JDBC/ODBC interface is too slow for this at scale
§ Export data to a data lake? → adds a third ETL step and more staleness!
§ Maintain production datasets in both DW & lake? → even more complex
ML over a Lakehouse
§ Direct access to data files without overloading the SQL frontend
(e.g., just run a GPU cluster to do deep learning on S3 data)
▪ ML frameworks already support reading Parquet! (see the sketch after this list)
§ New declarative APIs for ML data prep enable further optimization
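As a concrete illustration, any Parquet file in the lake can be loaded directly into Python tooling without going through a SQL frontend. A sketch using pyarrow (the S3 path is hypothetical, and reading s3:// URIs assumes an installed filesystem layer such as s3fs):

import pyarrow.parquet as pq

# Read one table file straight from object storage into memory
table = pq.read_table("s3://my-bucket/events/part-0001.parquet")

# Hand the data to any Python ML library as a pandas DataFrame
df = table.to_pandas()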
Example: Spark’s Declarative DataFrame API
Users write DataFrame code in Python, R or Java
users = spark.table("users")
buyers = users[users.kind == "buyer"]
train_set = (buyers["start_date", "zip", "quantity"]
             .fillna(0))
Example: Spark's Declarative DataFrame API
Users write DataFrame code in Python, R or Java:
users = spark.table("users")
buyers = users[users.kind == "buyer"]
train_set = (buyers["start_date", "zip", "quantity"]
             .fillna(0))
...
model.fit(train_set)
This builds a lazily evaluated query plan:
users → SELECT(kind = "buyer") → PROJECT(start_date, zip, …) → PROJECT(NULL → 0)
which the engine executes with optimizations using the cache, statistics, indexes, etc.
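Because evaluation is lazy, the plan can be inspected before any data is read; for example:

# Prints the optimized physical plan, including filters pushed down
# toward the Parquet scan
train_set.explain()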
ML over Lakehouse: Management Features
Lakehouse systems’ management features also make ML easier!
§ Use time travel for data versioning and reproducible experiments
§ Use transactions to reliably update tables
§ Always access the latest data from streaming I/O
Example: organizations using Delta Lake as an ML “feature store”
Summary
Lakehouse systems combine the benefits of data warehouses & lakes
§ Management features via metadata
layers (transactions, CLONE, etc)
§ Performance via new query engines
§ Direct access via open file formats
§ Low cost, equal to that of cloud storage
[Diagram: the same Lakehouse stack serving streaming analytics, BI, data science and machine learning over structured, semi-structured & unstructured data]
Result: simplify data architectures to
improve both reliability & timeliness
Before and After Lakehouse
Typical Architecture with Many Data Systems:
[Diagram: input flows through a message queue and chained ETL jobs into Parquet tables 1-3 in a cloud object store, with further ETL jobs loading two separate data warehouses; streaming analytics, data scientists and BI users each read from different systems]
Lakehouse Architecture: All Data in Object Store:
[Diagram: input flows through ETL jobs into Delta Lake tables 1-3 in a single cloud object store, which directly serves streaming analytics, data scientists and BI users]
Fewer copies of the data, fewer ETL
steps, no divergence & faster results!
Learn More
Download and learn Delta Lake at delta.io
View free content from our conferences at spark-summit.org