Data Time Travel by Delta Time Machine
Burak Yavuz | Software Engineer
Vini Jaiswal | Customer Success Engineer
Who are we?
Burak Yavuz
● Software Engineer @ Databricks
“We make your streams come true”
● Apache Spark Committer
● MS in Management Science & Engineering - Stanford University
● BS in Mechanical Engineering - Bogazici University, Turkey
Vini Jaiswal
● Customer Success Engineer @ Databricks
“Making Customers Successful with their data and ML/AI use cases”
● Data Science Lead - Citi | Data Intern - Southwest Airlines
● MS in Information Technology & Management - UTDallas
● BS in Electrical Engineering - Rajiv Gandhi Technology University, India
Agenda
Intro to Time Travel
Time Travel Use Cases
▪ Data Archiving
▪ Rollbacks
▪ Governance
▪ Reproducing ML experiments
Solving with Delta
Demo - Riding the time machine
Introduction to Time Travel
What might time travel look like?
Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
1926
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1926-12-31'
-0.09
1972
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1972-12-31'
0.02
1880
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1880-12-31'
-0.18
2018
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '2018-12-31'
0.82
Time Travel Use Cases
▪ Data Archiving
▪ Governance
▪ Rollbacks
▪ Reproduce Experiments
Data Archiving
● Changes to data need to be stored and be retrievable for regulatory reasons
● May need to store data for many years (7+)
Governance
● What if records need to be forgotten in response to a Data Subject request?
● At the same time, how do you stay in compliance with international regulations?
[Diagram: Flights (JSON, events per second, via Kinesis), Planes (CSV, slowly changing, on S3), and Weather (JSON, a new dump on S3 every 5 minutes) feed a "Delays per airplane" result.]
Rollbacks
[Diagram: Flights (JSON, events per second, via Event Hubs), Planes (CSV, slowly changing, on Blob storage), and Weather (JSON, a new dump on Blob storage every 5 minutes) feed a "Delays per airplane" result.]
● What if a new job is deployed that accidentally specifies .mode("overwrite")?
● Result: all historic data is gone.
Reproduce Experiments
● Reproducibility is the cornerstone of all scientific inquiry
● In order for a machine learning model to be improved, a data scientist
must first reproduce the results of the model.
Solving with Delta
For more info, check out "Diving Into Delta Lake: Unpacking the Transaction Log", Wednesday (Nov 11), 15:00 GMT.
Transaction Protocol
▪ Serializable ACID Writes
▪ Snapshot Isolation
▪ Scalability to billions of partitions or files
▪ Incremental processing
Computing Delta’s State
000000.json
000001.json
000002.json
000003.json
000004.json
000005.json
000006.json
000007.json
listFrom version 0; cache version 7
Each commit file contains a set of actions:
● Update Metadata – name, schema, partitioning, etc.
● Add File – adds a file (with optional statistics)
● Remove File – removes a file
● Set Transaction – records an idempotent txn id
● Change Protocol – upgrades the version of the txn protocol
Result: Current Metadata, List of Files, List of Txns, Version
Table = Result of a set of actions
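The idea that a table is the result of a set of actions can be sketched in a few lines of Python. This is only an illustration, not Delta's implementation: the `TableState` class, the `replay` function, and the tuple encoding of actions are hypothetical, and real commit files are JSON with more fields per action.

```python
from dataclasses import dataclass, field


@dataclass
class TableState:
    """State rebuilt by replaying the log (hypothetical structure)."""
    version: int = -1
    metadata: dict = field(default_factory=dict)
    files: set = field(default_factory=set)
    txns: dict = field(default_factory=dict)
    protocol: int = 1


def replay(commits):
    """Apply each commit's actions in version order and return the final state."""
    state = TableState()
    for version, actions in enumerate(commits):
        for kind, payload in actions:
            if kind == "metaData":      # Update Metadata
                state.metadata = payload
            elif kind == "add":         # Add File
                state.files.add(payload)
            elif kind == "remove":      # Remove File
                state.files.discard(payload)
            elif kind == "txn":         # Set Transaction
                app_id, txn_version = payload
                state.txns[app_id] = txn_version
            elif kind == "protocol":    # Change Protocol
                state.protocol = payload
        state.version = version
    return state


# Three made-up commits: create the table, append a file, rewrite a file.
commits = [
    [("protocol", 1), ("metaData", {"name": "my_table"}), ("add", "part-0.parquet")],
    [("add", "part-1.parquet"), ("txn", ("stream-a", 0))],
    [("remove", "part-0.parquet"), ("add", "part-2.parquet")],
]
state = replay(commits)
print(state.version, sorted(state.files))
```

Replaying only the first N commits yields the table as of version N - 1, which is the basis for time travel by version.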
Computing Delta’s State
000000.json
...
000007.json
000008.json
000009.json
000010.json
000010.checkpoint.parquet
000011.json
000012.json
listFrom version 0; cache version 12
Computing Delta’s State
000010.checkpoint.parquet
000011.json
000012.json
000013.json
000014.json
listFrom version 10; cache version 14
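A rough sketch of why the checkpoint helps: once a checkpoint exists, a reader only needs the latest checkpoint plus the JSON commits after it. The function name `files_to_read` and the shortened file names are assumptions made for illustration and mirror the abbreviated names on the slides, not Delta's real naming.

```python
import re

# Matches the shortened names used on the slides, e.g. "000011.json".
_LOG_FILE = re.compile(r"^(\d+)\.(json|checkpoint\.parquet)$")


def files_to_read(listing):
    """Return (latest checkpoint or None, JSON commits newer than it)."""
    checkpoints, commits = {}, {}
    for name in listing:
        match = _LOG_FILE.match(name)
        if not match:
            continue  # ignore anything that is not a log file
        version = int(match.group(1))
        target = commits if match.group(2) == "json" else checkpoints
        target[version] = name
    start = max(checkpoints) if checkpoints else -1
    later = [commits[v] for v in sorted(commits) if v > start]
    return checkpoints.get(start), later


listing = [
    "000009.json", "000010.json", "000010.checkpoint.parquet",
    "000011.json", "000012.json", "000013.json", "000014.json",
]
checkpoint, later_commits = files_to_read(listing)
print(checkpoint, later_commits)
```

With no checkpoint in the listing, the function falls back to reading every commit from version 0.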
Time Travelling by version
SELECT * FROM my_table VERSION AS OF 1071;
SELECT * FROM my_table@v1071 -- no backticks to specify @
spark.read.option("versionAsOf", 1071).load("/some/path")
spark.read.load("/some/path@v1071")
deltaLog.getSnapshotAt(1071)
Time Travelling by timestamp
SELECT * FROM my_table TIMESTAMP AS OF '1492-10-28';
SELECT * FROM my_table@14921028000000000 -- yyyyMMddHHmmssSSS
spark.read.option("timestampAsOf", "1492-10-28").load("/some/path")
spark.read.load("/some/path@14921028000000000")
deltaLog.getSnapshotAt(1071)
Time Travelling by timestamp
Commit timestamps come from storage system modification timestamps
001070.json   375-01-01
001071.json   1453-05-29
001072.json   1923-10-29
001073.json   1920-04-23
Time Travelling by timestamp
Timestamps can be out of order. We adjust by adding 1 millisecond to the previous commit’s timestamp.
Commit        Storage timestamp   Adjusted timestamp
001070.json   375-01-01           375-01-01
001071.json   1453-05-29          1453-05-29
001072.json   1923-10-29          1923-10-29
001073.json   1920-04-23          1923-10-29 00:00:00.001
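The adjustment can be sketched as a single pass over the commits in version order. This is a minimal illustration under the rule stated on the slide; the function name `adjust_timestamps` and the list-of-tuples input are assumptions, not Delta's API.

```python
from datetime import datetime, timedelta


def adjust_timestamps(commits):
    """Make commit timestamps strictly increasing in version order.

    `commits` is a list of (version, timestamp) pairs sorted by version.
    A timestamp that does not advance past the previous commit is replaced
    by the previous (adjusted) timestamp plus one millisecond.
    """
    adjusted = []
    previous = None
    for version, ts in commits:
        if previous is not None and ts <= previous:
            ts = previous + timedelta(milliseconds=1)
        adjusted.append((version, ts))
        previous = ts
    return adjusted


raw_commits = [
    (1070, datetime(375, 1, 1)),
    (1071, datetime(1453, 5, 29)),
    (1072, datetime(1923, 10, 29)),
    (1073, datetime(1920, 4, 23)),  # out of order
]
adjusted_commits = adjust_timestamps(raw_commits)
print(adjusted_commits[-1])
```

Only commit 1073 changes: its timestamp becomes 1923-10-29 00:00:00.001.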
Time Travelling by timestamp
Price is right rules: Pick the closest commit whose timestamp doesn’t exceed the user’s timestamp.
001070.json   375-01-01
001071.json   1453-05-29   ← user timestamp 1492-10-28 resolves here
001072.json   1923-10-29
001073.json   1923-10-29 00:00:00.001
deltaLog.getSnapshotAt(1071)
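The "price is right" lookup amounts to a binary search over the adjusted timestamps. The sketch below assumes the commits are already sorted with strictly increasing timestamps; `version_as_of` is a hypothetical helper, not part of Delta's API.

```python
from bisect import bisect_right
from datetime import datetime


def version_as_of(commits, user_ts):
    """Return the latest version whose timestamp does not exceed user_ts."""
    timestamps = [ts for _, ts in commits]
    index = bisect_right(timestamps, user_ts)
    if index == 0:
        raise ValueError("timestamp is before the earliest available commit")
    return commits[index - 1][0]


commit_log = [
    (1070, datetime(375, 1, 1)),
    (1071, datetime(1453, 5, 29)),
    (1072, datetime(1923, 10, 29)),
    (1073, datetime(1923, 10, 29, 0, 0, 0, 1000)),
]
resolved = version_as_of(commit_log, datetime(1492, 10, 28))
print(resolved)  # 1071
```

A timestamp later than every commit resolves to the newest version in this sketch; a real system may instead reject timestamps beyond the latest commit.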
Back to the Use Cases
Data Archiving
● Changes to data need to be stored and be retrievable for regulatory reasons
○ Should you be storing changes (CDC) or the latest snapshot?
● May need to store data for many years (7+)
○ How do you make it cost efficient?
What might time travel look like?
1926
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1926-12-31'
-0.09
1972
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1972-12-31'
0.02
1880
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '1880-12-31'
-0.18
2018
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
TIMESTAMP AS OF '2018-12-31'
0.82
Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
Is this really a Time Travel problem?
Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
1926
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
WHERE year = '1926'
-0.09
1972
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
WHERE year = '1972'
0.02
1880
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
WHERE year = '1880'
-0.18
2018
SELECT AVG(TEMPERATURE) AS TEMP
FROM global_temperatures
WHERE year = '2018'
0.82
Better to save data by year and query with a predicate instead of using time travel.
Slowly Changing Dimensions (SCD)
- Type 1: Only keep latest data
Before the change:
First Name   Last Name   Date of Birth        City          Last Updated
Henrik       Larsson     September 20, 1971   Helsingborg   2012
After the change:
First Name   Last Name   Date of Birth        City          Last Updated
Henrik       Larsson     September 20, 1971   Barcelona     2020
To access older data, you need to perform Time Travel. Is this the ideal way to store data for my use case?
Problems with SCD Type 1 + Time Travel
● Trade-off between data recency, query performance, and storage
costs
○ Data recency requires many frequent updates
○ Better query performance requires regular compaction of the data
○ The two above lead to many copies of the data
○ Many copies of the data lead to prohibitive storage costs
● Time Travel requires older copies of the data to exist
Slowly Changing Dimensions (SCD)
- Type 2: Insert row for each change
Before the change:
First Name   Last Name   Date of Birth        City          Last Updated   Latest
Henrik       Larsson     September 20, 1971   Helsingborg   2012           Y
After the change:
First Name   Last Name   Date of Birth        City          Last Updated   Latest
Henrik       Larsson     September 20, 1971   Helsingborg   2012           N
Henrik       Larsson     September 20, 1971   Barcelona     2020           Y
To access older data, you simply write a WHERE query. A VIEW can help show only the latest state of the data at any given point.
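As a rough illustration of SCD Type 2, the sketch below keeps every version of a row in plain Python lists of dicts. The helper names (`apply_change`, `latest_view`, `city_as_of`) and the column keys are hypothetical; in practice this would be a MERGE against the table plus a VIEW over the rows flagged as latest.

```python
def apply_change(rows, key, new_city, updated):
    """Flag the current row for `key` as not latest and append the new version."""
    result, current = [], None
    for row in rows:
        if (row["first_name"], row["last_name"]) == key and row["latest"]:
            current = row
            result.append({**row, "latest": False})
        else:
            result.append(row)
    if current is None:
        raise KeyError(key)
    result.append({**current, "city": new_city, "last_updated": updated, "latest": True})
    return result


def latest_view(rows):
    """Equivalent of a VIEW that shows only the latest state."""
    return [row for row in rows if row["latest"]]


def city_as_of(rows, key, year):
    """Equivalent of a WHERE query for the state as of a given year."""
    matches = [
        row for row in rows
        if (row["first_name"], row["last_name"]) == key and row["last_updated"] <= year
    ]
    return max(matches, key=lambda row: row["last_updated"])["city"] if matches else None


players = [{
    "first_name": "Henrik", "last_name": "Larsson",
    "date_of_birth": "1971-09-20", "city": "Helsingborg",
    "last_updated": 2012, "latest": True,
}]
players = apply_change(players, ("Henrik", "Larsson"), "Barcelona", 2020)
print(latest_view(players)[0]["city"], city_as_of(players, ("Henrik", "Larsson"), 2015))
```

Older states are answered from the table itself, so no older copies of the data files need to be retained for these queries.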
Governance
DESCRIBE HISTORY my_table
Rollbacks
● Undoing work (restoring an old version of the table)
RESTORE my_table TO TIMESTAMP AS OF '2020-11-10'
● Replaying Structured Streaming Pipelines
RESTORE target_table TO TIMESTAMP AS OF '2020-11-10'
spark.readStream.format("delta")
.option("startingTimestamp", "2020-11-10")
.load(path)
// fix logic
.writeStream
.option("checkpointLocation", "<new_location>")
.table("target_table")
Rollbacks
Roll back accidental bad writes:
INSERT INTO my_table
  SELECT * FROM my_table
  TIMESTAMP AS OF date_sub(current_date(), 1)
Fix incorrect updates as follows:
MERGE INTO my_table target
  USING my_table TIMESTAMP AS OF date_sub(current_date(), 1) source
  ON source.userId = target.userId
  WHEN MATCHED THEN UPDATE SET *
Reproduce Experiments
● Use Time Travel to ensure all experiments run on the same snapshot
of the table
○ SELECT * FROM my_table VERSION AS OF 1071;
○ SELECT * FROM my_table@v1071
● Archive a blessed snapshot using CLONE
○ CREATE TABLE my_table_xmas CLONE my_table VERSION AS OF 1071
Reproduce Experiments & reports with MLflow
Reproduce Experiments & reports
SELECT count(*) FROM events
TIMESTAMP AS OF timestamp
SELECT count(*) FROM events
VERSION AS OF version
spark.read.format("delta").option("timestampAsOf", timestamp_string).load("/events/")
Time Series Analytics
To find out how many new customers were added over the last week:
SELECT
count(distinct userId) - (
SELECT count(distinct userId)
FROM my_table
TIMESTAMP AS OF date_sub(current_date(), 7))
FROM my_table
DEMO - Riding the time machine
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.

Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 

Data Time Travel by Delta Time Machine

  • 1. Data Time Travel by Delta Time Machine Burak Yavuz | Software Engineer Vini Jaiswal | Customer Success Engineer
  • 2. Who are we? Burak Yavuz: ● Software Engineer @ Databricks “We make your streams come true” ● Apache Spark Committer ● MS in Management Science & Engineering - Stanford University ● BS in Mechanical Engineering - Bogazici University, Turkey. Vini Jaiswal: ● Customer Success Engineer @ Databricks “Making Customers Successful with their data and ML/AI use cases” ● Data Science Lead - Citi | Data Intern - Southwest Airlines ● MS in Information Technology & Management - UTDallas ● BS in Electrical Engineering - Rajiv Gandhi Technology University, India
  • 3. Agenda ● Intro to Time Travel ● Time Travel Use Cases ▪ Data Archiving ▪ Rollbacks ▪ Governance ▪ Reproducing ML experiments ● Solving with Delta ● Demo - Riding the time machine
  • 5. What might time travel look like? Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
      1880: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1880-12-31' → -0.18
      1926: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1926-12-31' → -0.09
      1972: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1972-12-31' → 0.02
      2018: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '2018-12-31' → 0.82
  • 7. Data Archiving ● Changes to data need to be stored and be retrievable for regulatory reasons ● May need to store data for many years (7+)
  • 8. Governance ● What if records need to be forgotten in response to a Data Subject request? ● And, at the same time, how do you stay in compliance with international regulations? [Diagram: sources Flights (JSON, events per second, Kinesis), Planes (CSV, slow changing, S3), Weather (JSON, a new dump on S3 every 5 minutes); tables Flights, Planes, Weather; output Delays per airplane]
  • 9. Rollbacks ● What if a new job is deployed that accidentally specifies .mode("overwrite")? All historic data is gone. [Diagram: sources Flights (JSON, events per second, Event Hubs), Planes (CSV, slow changing, Blob), Weather (JSON, a new dump on Blob every 5 minutes); tables Flights, Planes, Weather; a new job with .mode("overwrite") writes Delays per airplane]
  • 10. Reproduce Experiments ● Reproducibility is the cornerstone of all scientific inquiry ● In order for a machine learning model to be improved, a data scientist must first reproduce the results of the model.
  • 12. For more info check out Diving Into Delta Lake: Unpacking the Transaction Log Wednesday (Nov 11) 15:00 GMT
  • 13. Transaction Protocol ▪ Serializable ACID Writes ▪ Snapshot Isolation ▪ Scalability to billions of partitions or files ▪ Incremental processing
  • 15. Table = Result of a set of actions ● Update Metadata: name, schema, partitioning, etc. ● Add File: adds a file (with optional statistics) ● Remove File: removes a file ● Set Transaction: records an idempotent txn id ● Change Protocol: upgrades the version of the txn protocol. Result: Current Metadata, List of Files, List of Txns, Version
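
The idea that a table is the result of replaying a set of actions can be illustrated with a small sketch. This is a simplified model and not Delta Lake's implementation: the `TableState` class, the `replay` function, and the dictionary layout of each action are assumptions made for illustration, loosely named after the five action types on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class TableState:
    """Hypothetical container for the 'Result' listed on the slide."""
    version: int = -1
    metadata: dict = field(default_factory=dict)
    protocol: dict = field(default_factory=dict)
    files: set = field(default_factory=set)
    txns: dict = field(default_factory=dict)


def replay(commits):
    """Apply every action of every commit, in order, to an empty state."""
    state = TableState()
    for version, actions in enumerate(commits):
        for action in actions:
            # Each action is assumed to be a single-key dict: {kind: payload}.
            kind, payload = next(iter(action.items()))
            if kind == "metaData":        # Update Metadata
                state.metadata = payload
            elif kind == "add":           # Add File
                state.files.add(payload["path"])
            elif kind == "remove":        # Remove File
                state.files.discard(payload["path"])
            elif kind == "txn":           # Set Transaction
                state.txns[payload["appId"]] = payload["version"]
            elif kind == "protocol":      # Change Protocol
                state.protocol = payload
        state.version = version
    return state


# Three illustrative commits (made-up file names and ids).
commits = [
    [{"protocol": {"minReaderVersion": 1}},
     {"metaData": {"name": "my_table"}},
     {"add": {"path": "part-0.parquet"}}],
    [{"add": {"path": "part-1.parquet"}},
     {"txn": {"appId": "stream-a", "version": 7}}],
    [{"remove": {"path": "part-0.parquet"}},
     {"add": {"path": "part-2.parquet"}}],
]

latest = replay(commits)
as_of_version_1 = replay(commits[:2])  # replaying a prefix of the log gives an older snapshot
```

Replaying only a prefix of the commit list is what makes time travel by version possible in this model: `as_of_version_1` still lists `part-0.parquet`, while `latest` does not.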
  • 18. Time Travelling by version
      SELECT * FROM my_table VERSION AS OF 1071;
      SELECT * FROM my_table@v1071 -- no backticks to specify @
      spark.read.option("versionAsOf", 1071).load("/some/path")
      spark.read.load("/some/path@v1071")
      deltaLog.getSnapshotAt(1071)
  • 19. Time Travelling by timestamp
      SELECT * FROM my_table TIMESTAMP AS OF '1492-10-28';
      SELECT * FROM my_table@14921028000000000 -- yyyyMMddHHmmssSSS
      spark.read.option("timestampAsOf", "1492-10-28").load("/some/path")
      spark.read.load("/some/path@14921028000000000")
      deltaLog.getSnapshotAt(1071)
  • 20. Time Travelling by timestamp Commit timestamps come from storage system modification timestamps. 001070.json → 375-01-01; 001071.json → 1453-05-29; 001072.json → 1923-10-29; 001073.json → 1920-04-23
  • 21. Time Travelling by timestamp Timestamps can be out of order. We adjust by adding 1 millisecond to the previous commit’s timestamp. Before: 001070.json → 375-01-01; 001071.json → 1453-05-29; 001072.json → 1923-10-29; 001073.json → 1920-04-23. After: 001070.json → 375-01-01; 001071.json → 1453-05-29; 001072.json → 1923-10-29; 001073.json → 1923-10-29 00:00:00.001
  • 22. Time Travelling by timestamp Price is right rules: pick the closest commit whose timestamp doesn’t exceed the user’s timestamp. Commits: 001070.json → 375-01-01; 001071.json → 1453-05-29; 001072.json → 1923-10-29; 001073.json → 1923-10-29 00:00:00.001. User’s timestamp: 1492-10-28, which resolves to deltaLog.getSnapshotAt(1071)
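
The two rules from the slides above (bump out-of-order timestamps by 1 millisecond, then pick the latest commit that does not exceed the requested timestamp) can be sketched in a few lines. This is an illustrative sketch, not Delta Lake's code: the function names `monotonize` and `version_as_of` are hypothetical, and it assumes that a timestamp equal to the previous one is also bumped.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def monotonize(commits):
    """Return (version, timestamp) pairs with strictly increasing timestamps.

    A commit whose timestamp is not later than the previous (adjusted) one
    is moved to the previous timestamp plus 1 millisecond.
    """
    adjusted = []
    for version, ts in commits:
        if adjusted and ts <= adjusted[-1][1]:
            ts = adjusted[-1][1] + timedelta(milliseconds=1)
        adjusted.append((version, ts))
    return adjusted


def version_as_of(commits, user_ts):
    """'Price is right': latest commit whose timestamp does not exceed user_ts."""
    adjusted = monotonize(commits)
    timestamps = [ts for _, ts in adjusted]
    idx = bisect_right(timestamps, user_ts)
    if idx == 0:
        raise ValueError("requested timestamp is before the earliest commit")
    return adjusted[idx - 1][0]


# The commits from the slides, with storage modification timestamps.
commits = [
    (1070, datetime(375, 1, 1)),
    (1071, datetime(1453, 5, 29)),
    (1072, datetime(1923, 10, 29)),
    (1073, datetime(1920, 4, 23)),  # out of order
]

resolved = version_as_of(commits, datetime(1492, 10, 28))
```

With the slide's data, `resolved` is 1071, matching `deltaLog.getSnapshotAt(1071)`. Keeping the adjusted timestamps strictly increasing is what allows a binary search here.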
  • 23. Back to the Use Cases
  • 24. Data Archiving ● Changes to data need to be stored and be retrievable for regulatory reasons ○ Should you be storing changes (CDC) or the latest snapshot? ● May need to store data for many years (7+) ○ How do you make it cost efficient?
  • 25. What might time travel look like? Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
      1880: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1880-12-31' → -0.18
      1926: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1926-12-31' → -0.09
      1972: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1972-12-31' → 0.02
      2018: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '2018-12-31' → 0.82
  • 26. Is this really a Time Travel problem? Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
      1880: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1880-12-31' → -0.18
      1926: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1926-12-31' → -0.09
      1972: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '1972-12-31' → 0.02
      2018: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures TIMESTAMP AS OF '2018-12-31' → 0.82
  • 27. Is this really a Time Travel problem? Source: NASA (https://climate.nasa.gov/vital-signs/global-temperature/)
      1880: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures WHERE year = '1880' → -0.18
      1926: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures WHERE year = '1926' → -0.09
      1972: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures WHERE year = '1972' → 0.02
      2018: SELECT AVG(TEMPERATURE) AS TEMP FROM global_temperatures WHERE year = '2018' → 0.82
      Better to save data by year and query with a predicate instead of using time travel.
  • 28. Slowly Changing Dimensions (SCD) - Type 1: Only keep latest data
      Before:
      First Name | Last Name | Date of Birth | City | Last Updated
      Henrik | Larsson | September 20, 1971 | Helsingborg | 2012
      After:
      First Name | Last Name | Date of Birth | City | Last Updated
      Henrik | Larsson | September 20, 1971 | Barcelona | 2020
      To access older data, you need to perform Time Travel. Is this the ideal way to store data for my use case?
  • 29. Problems with SCD Type 1 + Time Travel ● Trade-off between data recency, query performance, and storage costs ○ Data recency requires many frequent updates ○ Better query performance requires regular compaction of the data ○ The two above lead to many copies of the data ○ Many copies of the data lead to prohibitive storage costs ● Time Travel requires older copies of the data to exist
  • 30. Slowly Changing Dimensions (SCD) - Type 2: Insert row for each change
      Before:
      First Name | Last Name | Date of Birth | City | Last Updated | Latest
      Henrik | Larsson | September 20, 1971 | Helsingborg | 2012 | Y
      After:
      First Name | Last Name | Date of Birth | City | Last Updated | Latest
      Henrik | Larsson | September 20, 1971 | Helsingborg | 2012 | N
      Henrik | Larsson | September 20, 1971 | Barcelona | 2020 | Y
      To access older data, you simply write a WHERE query. A VIEW can help show only the latest state of the data at any given point.
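
A minimal sketch of the SCD Type 2 pattern, using Python's built-in `sqlite3` module rather than Delta so it runs anywhere. The table name `players`, the view name `players_latest`, and the helper `record_change` are assumptions for illustration, and the Date of Birth column is omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE players (
    first_name   TEXT,
    last_name    TEXT,
    city         TEXT,
    last_updated INTEGER,
    latest       TEXT
);
INSERT INTO players VALUES ('Henrik', 'Larsson', 'Helsingborg', 2012, 'Y');

-- The view hides history and shows only the current row per person.
CREATE VIEW players_latest AS
SELECT first_name, last_name, city, last_updated
FROM players
WHERE latest = 'Y';
""")


def record_change(conn, first_name, last_name, city, year):
    """Type 2 update: retire the current row, then insert the new one."""
    with conn:  # both statements commit together or not at all
        conn.execute(
            "UPDATE players SET latest = 'N' "
            "WHERE first_name = ? AND last_name = ? AND latest = 'Y'",
            (first_name, last_name),
        )
        conn.execute(
            "INSERT INTO players VALUES (?, ?, ?, ?, 'Y')",
            (first_name, last_name, city, year),
        )


record_change(conn, "Henrik", "Larsson", "Barcelona", 2020)

# Current state comes from the view.
latest_city = conn.execute("SELECT city FROM players_latest").fetchone()[0]

# Historical state is a plain WHERE query; no time travel is needed.
city_in_2015 = conn.execute(
    "SELECT city FROM players WHERE last_updated <= ? "
    "ORDER BY last_updated DESC LIMIT 1",
    (2015,),
).fetchone()[0]
```

Because every change is kept as a row, older copies of the data files do not have to be retained to answer historical questions.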
  • 32. Rollbacks
      ● Undoing work (restoring an old version of the table)
      RESTORE my_table TO TIMESTAMP AS OF '2020-11-10'
      ● Replaying Structured Streaming Pipelines
      RESTORE target_table TO TIMESTAMP AS OF '2020-11-10'
      spark.readStream.format("delta")
        .option("startingTimestamp", "2020-11-10")
        .load(path)
        // fix logic
        .writeStream
        .option("checkpointLocation", "<new_location>")
        .table("target_table")
  • 33. Rollbacks
      Roll back accidental bad writes:
      INSERT INTO my_table SELECT * FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
      Fix incorrect updates as follows:
      MERGE INTO my_table target USING my_table TIMESTAMP AS OF date_sub(current_date(), 1) source ON source.userId = target.userId WHEN MATCHED THEN UPDATE SET *
  • 34. Reproduce Experiments ● Use Time Travel to ensure all experiments run on the same snapshot of the table ○ SELECT * FROM my_table VERSION AS OF 1071; ○ SELECT * FROM my_table@v1071 ● Archive a blessed snapshot using CLONE ○ CREATE TABLE my_table_xmas CLONE my_table VERSION AS OF 1071
  • 35. Reproduce Experiments & reports with MLflow
  • 36. Reproduce Experiments & reports
      SELECT count(*) FROM events TIMESTAMP AS OF timestamp
      SELECT count(*) FROM events VERSION AS OF version
      spark.read.format("delta").option("timestampAsOf", timestamp_string).load("/events/")
  • 37. Time Series Analytics If you want to find out how many new customers were added over the last week:
      SELECT count(distinct userId) - (SELECT count(distinct userId) FROM my_table TIMESTAMP AS OF date_sub(current_date(), 7)) FROM my_table
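
The arithmetic behind the Time Series Analytics query can be checked on plain Python lists standing in for the two snapshots. The function names and user ids below are made up for illustration. The sketch also shows that subtracting two distinct counts gives the net change, which equals the number of newly added users only if no users were removed in between.

```python
def net_change(current_ids, week_ago_ids):
    """Mirror of the slide's query: distinct count now minus distinct count a week ago."""
    return len(set(current_ids)) - len(set(week_ago_ids))


def newly_added(current_ids, week_ago_ids):
    """Users present now that were absent in the snapshot from a week ago."""
    return set(current_ids) - set(week_ago_ids)


# Hypothetical userId columns from the two snapshots of my_table.
week_ago = ["u1", "u2", "u3"]
today = ["u2", "u3", "u4", "u5"]  # u1 was deleted; u4 and u5 were added

change = net_change(today, week_ago)
added = newly_added(today, week_ago)
```

Here `change` is 1 while `added` contains two users, so when deletions are possible a set difference between the snapshots is the safer measure of new customers.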
  • 38. DEMO - Riding the time machine
  • 39. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.