Rollback a table to an earlier version
PERFORMANCE OPTIMIZATIONS
TIME TRAVEL
View table details
Delete old files with Vacuum
Clone a Delta Lake table
Interoperability with Python / DataFrames
Run SQL queries from Python
Modify data retention settings for Delta Lake table
-- RESTORE requires Delta Lake version 0.7.0+ & DBR 7.4+.
RESTORE tableName VERSION AS OF 0
RESTORE tableName TIMESTAMP AS OF "2020-12-18"
Delta Lake is an open source storage layer that brings ACID
transactions to Apache Spark™ and big data workloads.
delta.io | Documentation | GitHub | Delta Lake on Databricks
WITH SPARK SQL
UPDATE tableName SET event = 'click' WHERE event = 'clk'
DELETE FROM tableName WHERE date < '2017-01-01'
MERGE INTO logs
USING newDedupedLogs
ON logs.uniqueId = newDedupedLogs.uniqueId
WHEN NOT MATCHED
THEN INSERT *
-- Add "Not null" constraint:
ALTER TABLE tableName CHANGE COLUMN col_name SET NOT NULL
-- Add "Check" constraint:
ALTER TABLE tableName
ADD CONSTRAINT dateWithinRange CHECK (date > "1900-01-01")
-- Drop constraint:
ALTER TABLE tableName DROP CONSTRAINT dateWithinRange
ALTER TABLE tableName ADD COLUMNS (
col_name data_type
[FIRST|AFTER colA_name])
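For instance, a minimal sketch of this template run from Python (the "owner" column and the "name" column it follows are hypothetical):
spark.sql("ALTER TABLE tableName ADD COLUMNS (owner STRING AFTER name)")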
MERGE INTO target
USING updates
ON target.Id = updates.Id
WHEN MATCHED AND target.delete_flag = "true" THEN
DELETE
WHEN MATCHED THEN
UPDATE SET * -- star notation means all columns
WHEN NOT MATCHED THEN
INSERT (date, Id, data) -- or, use INSERT *
VALUES (date, Id, data)
INSERT INTO tableName VALUES
(8003, "Kim Jones", "2020-12-18", 3.875),
(8004, "Tim Jones", "2020-12-20", 3.750);
-- Insert using SELECT statement
INSERT INTO tableName SELECT * FROM sourceTable
-- Atomically replace all data in table with new values
INSERT OVERWRITE loan_by_state_delta VALUES (...)
DELTA LAKE DDL/DML: UPDATE, DELETE, INSERT, ALTER TABLE
Update rows that match a predicate condition
Delete rows that match a predicate condition
Insert values directly into table
Upsert (update + insert) using MERGE
Alter table schema — add columns
Insert with Deduplication using MERGE
Alter table — add constraint
DESCRIBE DETAIL tableName
DESCRIBE FORMATTED tableName
-- logRetentionDuration -> how long transaction log history
-- is kept; deletedFileRetentionDuration -> how long ago a file
-- must have been deleted before being a candidate for VACUUM.
ALTER TABLE tableName
SET TBLPROPERTIES(
delta.logRetentionDuration = "interval 30 days",
delta.deletedFileRetentionDuration = "interval 7 days"
);
SHOW TBLPROPERTIES tableName;
spark.sql("SELECT * FROM tableName")
spark.sql("SELECT * FROM delta.`/path/to/delta_table`")
# Read name-based table from Hive metastore into DataFrame
df = spark.table("tableName")
# Read path-based table into DataFrame
df = spark.read.format("delta").load("/path/to/delta_table")
-- Deep clones copy data from source, shallow clones don't.
CREATE TABLE [dbName.] targetName
[SHALLOW | DEEP] CLONE sourceName [VERSION AS OF 0]
[LOCATION "path/to/table"]
-- specify location only for path-based tables
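A concrete instance of the clone template run from Python (table names hypothetical), creating a zero-copy shallow clone of version 0 of a source table:
spark.sql("CREATE TABLE eventsDev SHALLOW CLONE events VERSION AS OF 0")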
VACUUM tableName [RETAIN num HOURS] [DRY RUN]
UTILITY METHODS
*Databricks Delta Lake feature
OPTIMIZE tableName
[ZORDER BY (colNameA, colNameB)]
*Databricks Delta Lake feature
ALTER TABLE [table_name | delta.`path/to/delta_table`]
SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)
*Databricks Delta Lake feature
CACHE SELECT * FROM tableName
-- or:
CACHE SELECT colA, colB FROM tableName WHERE colNameA > 0
Compact data files with Optimize and Z-Order
Auto-optimize tables
Cache frequently queried data in Delta Cache
DESCRIBE HISTORY tableName
SELECT * FROM tableName VERSION AS OF 12
EXCEPT ALL SELECT * FROM tableName VERSION AS OF 11
SELECT * FROM tableName VERSION AS OF 0
SELECT * FROM tableName@v0 -- equivalent to VERSION AS OF 0
SELECT * FROM tableName TIMESTAMP AS OF "2020-12-18"
View transaction log (aka Delta Log)
Query historical versions of Delta Lake tables
Find changes between 2 versions of table
-- A managed database is saved in the Hive metastore.
-- The default database is named "default".
DROP DATABASE IF EXISTS dbName;
CREATE DATABASE dbName;
USE dbName -- lets you refer to tables as tableName
-- instead of dbName.tableName every time
/* You can refer to Delta Tables by table name, or by
path. Table name is the preferred way, since named tables
are managed in the Hive Metastore (i.e., when you DROP a
named table, the data is dropped also — not the case for
path-based tables.) */
SELECT * FROM [dbName.] tableName
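As a sketch of the comment above (database, table name, and path are hypothetical), both referencing styles from Python:
# by name (managed in the Hive metastore)
spark.sql("SELECT * FROM dbName.events")
# by path (note backticks); dropping a named table deletes its data,
# while a path-based table's underlying files stay in place
spark.sql("SELECT * FROM delta.`/path/to/delta_table`")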
CREATE TABLE [dbName.] tableName
USING DELTA
AS SELECT * FROM tableName | parquet.`path/to/data`
[LOCATION `/path/to/table`]
-- using location = unmanaged table
-- by table name
CONVERT TO DELTA [dbName.]tableName
[PARTITIONED BY (col_name1 col_type1, col_name2 col_type2)]
-- path-based tables
CONVERT TO DELTA parquet.`/path/to/table` -- note backticks
[PARTITIONED BY (col_name1 col_type1, col_name2 col_type2)]
SELECT * FROM delta.`path/to/delta_table` -- note backticks
CREATE TABLE [dbName.] tableName (
id INT [NOT NULL],
name STRING,
date DATE,
int_rate FLOAT)
USING DELTA
[PARTITIONED BY (date)] -- optional
COPY INTO [dbName.] targetTable
FROM (SELECT * FROM "/path/to/table")
FILEFORMAT = DELTA -- or CSV, Parquet, ORC, JSON, etc.
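A concrete sketch of COPY INTO run from Python (target table "loans" and source path are hypothetical). Retries are idempotent because files that were already loaded are skipped:
spark.sql("""
  COPY INTO loans
  FROM '/path/to/new_loan_data'
  FILEFORMAT = PARQUET
""")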
CREATE AND QUERY DELTA TABLES
Create and use managed database
Query Delta Lake table by table name (preferred)
Query Delta Lake table by path
Convert Parquet table to Delta Lake format in place
Create table, define schema explicitly with SQL DDL
Create Delta Lake table as SELECT * with no upfront
schema definition
Copy new data into Delta Lake table (with idempotent retries)
TIME TRAVEL (CONTINUED)
spark.sql("SELECT * FROM tableName")
spark.sql("SELECT * FROM delta.`/path/to/delta_table`")
spark.sql("DESCRIBE HISTORY tableName")
# vacuum files older than the default retention period (7 days)
deltaTable.vacuum()
# vacuum files not required by versions more than 100 hours old
deltaTable.vacuum(100)
deltaTable.clone(target="/path/to/delta_table/",
isShallow=True, replace=True)
spark.sql("SELECT * FROM tableName")
spark.sql("SELECT * FROM delta.`/path/to/delta_table`")
UTILITY METHODS
WITH PYTHON
Convert Parquet table to Delta Lake format in place
Run Spark SQL queries in Python
Compact old files with Vacuum
Clone a Delta Lake table
Get DataFrame representation of a Delta Lake table
Run SQL queries on Delta Lake tables
fullHistoryDF = deltaTable.history()
# choose only one option: versionAsOf or timestampAsOf
df = (spark.read.format("delta")
  .option("versionAsOf", 0)
  # .option("timestampAsOf", "2020-12-18")
  .load("/path/to/delta_table"))
TIME TRAVEL
View transaction log (aka Delta Log)
Query historical versions of Delta Lake tables
PERFORMANCE OPTIMIZATIONS
*Databricks Delta Lake feature
spark.sql("OPTIMIZE tableName [ZORDER BY (colA, colB)]")
*Databricks Delta Lake feature. For existing tables:
spark.sql("ALTER TABLE [table_name | delta.`path/to/delta_table`]
  SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)")
To enable auto-optimize for all new Delta Lake tables:
spark.sql("SET spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite = true")
*Databricks Delta Lake feature
spark.sql("CACHE SELECT * FROM tableName")
# or:
spark.sql("CACHE SELECT colA, colB FROM tableName WHERE colNameA > 0")
Compact data files with Optimize and Z-Order
Auto-optimize tables
Cache frequently queried data in Delta Cache
WORKING WITH DELTA TABLES
# A DeltaTable is the entry point for interacting with
# tables programmatically in Python — for example, to
# perform updates or deletes.
from delta.tables import *
deltaTable = DeltaTable.forName(spark, "tableName")
deltaTable = DeltaTable.forPath(spark, "/path/to/delta_table")
CONVERT PARQUET TO DELTA LAKE
deltaTable = DeltaTable.convertToDelta(spark,
"parquet.`/path/to/parquet_table`")
partitionedDeltaTable = DeltaTable.convertToDelta(spark,
"parquet.`/path/to/parquet_table`", "part int")
df1 = spark.read.format("delta").load("/path/to/delta_table")
df2 = spark.read.format("delta").option("versionAsOf", 2).load("/path/to/delta_table")
df1.exceptAll(df2).show()
deltaTable.restoreToVersion(0)
deltaTable.restoreToTimestamp('2020-12-01')
Find changes between 2 versions of a table
Rollback a table by version or timestamp
df = spark.createDataFrame(pdf)
# where pdf is a pandas DF
# then save DataFrame in Delta Lake format as shown below
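A minimal end-to-end sketch (column names and path hypothetical) going from pandas to a Delta table and back:
import pandas as pd
pdf = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
df = spark.createDataFrame(pdf)  # pandas -> Spark
df.write.format("delta").save("/tmp/demo_delta_table")  # save as Delta
df2 = spark.read.format("delta").load("/tmp/demo_delta_table")  # read back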
# read by path
df = (spark.read.format("parquet"|"csv"|"json"|etc.)
.load("/path/to/data"))
# read table from Hive metastore
df = spark.table("events")
# by path or by table name
df = (spark.readStream
.format("delta")
.schema(schema)
.table("events") | .load("/delta/events")
)
streamingQuery = (
df.writeStream.format("delta")
.outputMode("append"|"update"|"complete")
.option("checkpointLocation", "/path/to/checkpoints")
.trigger(once=True|processingTime="10 seconds")
.table("events") | .start("/delta/events")
)
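A concrete sketch choosing one alternative from each streaming template above (table name, checkpoint path, sink path, and trigger interval are hypothetical): read the "events" Delta table as a stream and append it to a Delta path.
df = (spark.readStream
  .format("delta")
  .table("events"))  # Delta table as streaming source
streamingQuery = (df.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/tmp/checkpoints/events_copy")
  .trigger(processingTime="10 seconds")
  .start("/tmp/delta/events_copy"))  # Delta path as sink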
(df.write.format("delta")
.mode("append"|"overwrite")
.partitionBy("date") # optional
.option("mergeSchema", "true") # option - evolve schema
.saveAsTable("events") | .save("/path/to/delta_table")
)
READS AND WRITES WITH DELTA LAKE
Read data from pandas DataFrame
Read data using Apache Spark™
Save DataFrame in Delta Lake format
Streaming reads (Delta table as streaming source)
Streaming writes (Delta table as a sink)
# predicate using SQL formatted string
deltaTable.delete("date < '2017-01-01'")
# predicate using Spark SQL functions
deltaTable.delete(col("date") < "2017-01-01")
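The col and lit helpers used in these predicates (and in the update examples below) come from pyspark.sql.functions:
from pyspark.sql.functions import col, lit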
# Available options for merges [see documentation for details]:
# .whenMatchedUpdate(...) | .whenMatchedUpdateAll(...) |
# .whenNotMatchedInsert(...) | .whenMatchedDelete(...)
(deltaTable.alias("target").merge(
source = updatesDF.alias("updates"),
condition = "target.eventId = updates.eventId")
.whenMatchedUpdateAll()
.whenNotMatchedInsert(
values = {
"date": "updates.date",
"eventId": "updates.eventId",
"data": "updates.data",
"count": "1"
}
).execute()
)
(deltaTable.alias("logs").merge(
newDedupedLogs.alias("newDedupedLogs"),
"logs.uniqueId = newDedupedLogs.uniqueId")
.whenNotMatchedInsertAll()
.execute()
)
# predicate using SQL formatted string
deltaTable.update(condition = "eventType = 'clk'",
set = { "eventType": "'click'" } )
# predicate using Spark SQL functions
deltaTable.update(condition = col("eventType") == "clk",
set = { "eventType": lit("click") } )
DELTA LAKE DDL/DML: UPDATES, DELETES, INSERTS, MERGES
Delete rows that match a predicate condition
Update rows that match a predicate condition
Upsert (update + insert) using MERGE
Insert with Deduplication using MERGE
df = deltaTable.toDF()