Time travel is now possible with Delta Lake! We will uncover how Delta Lake makes time travel possible and why it matters to you. Through presentation, notebooks, and code, we will showcase several common applications and how they can improve your modern data engineering pipelines. Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™. It provides snapshot isolation for concurrent reads and writes; enables efficient upserts, deletes, and immediate rollbacks; and allows background file optimization through compaction and Z-Order partitioning, achieving up to 100x performance improvements. In this presentation you will learn: what challenges Delta Lake solves, how Delta Lake works under the hood, and applications of the new Delta time travel capability.
5. Complexities Spark Solves
Other Spark Challenges:
Concurrency
• Multiple readers and writers
• Ensuring atomic transactions, consistency, and isolation
Updates & Rollbacks
• GDPR user delete requests or other upserts
• Data rollback or snapshots for audits
The Small Files Problem
• Performance degradation
• Complex cleanup that often incurs downtime
Complex Data
• Diverse data formats (JSON, Avro, binary, …)
• Data can be dirty, late, or out-of-order
Complex Systems
• Diverse storage systems (Kafka, Azure Storage, Event Hubs, SQL DW, …)
• System failures
Complex Workloads
• Combining streaming with interactive queries
• Machine learning
Solved
6. Delta Table = Parquet + Transaction Log + Indexes/Stats
Delta Table:
• Versioned Parquet Files
• Indexes & Stats
• Delta Log
- Reliable Data Lakes at Scale
ACID Transaction Guarantees
• Atomic, Consistent, Isolated, Durable
Versioned parquet files with transaction log
• Snapshot isolation for multiple concurrent read/writes
• Immediate rollback capabilities
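The guarantees above all rest on the transaction log. As a toy, in-memory analogue (illustrative names only, not the real Delta Lake implementation): each commit atomically appends a list of add/remove file actions, and replaying the log up to any version reconstructs the exact snapshot of the table at that version — which is what gives readers a consistent view and makes rollback trivial.

```python
# Toy analogue of the Delta transaction log (illustrative only; not the
# real Delta Lake implementation). Each commit is an ordered list of
# "add" / "remove" file actions; replaying the log up to version N
# yields the exact set of data files visible at that version.

class ToyDeltaLog:
    def __init__(self):
        self.commits = []  # commits[v] = list of (action, filename)

    def commit(self, actions):
        """Atomically append one commit; returns the new version number."""
        self.commits.append(list(actions))
        return len(self.commits) - 1

    def snapshot(self, version=None):
        """Set of data files visible at `version` (default: latest)."""
        if version is None:
            version = len(self.commits) - 1
        files = set()
        for actions in self.commits[: version + 1]:
            for action, name in actions:
                if action == "add":
                    files.add(name)
                elif action == "remove":
                    files.discard(name)
        return files

log = ToyDeltaLog()
v0 = log.commit([("add", "part-0.parquet")])
v1 = log.commit([("add", "part-1.parquet")])
# A compaction rewrites the two small files into one, in a single commit:
v2 = log.commit([("remove", "part-0.parquet"),
                 ("remove", "part-1.parquet"),
                 ("add", "part-compacted.parquet")])

print(log.snapshot(v1))  # {'part-0.parquet', 'part-1.parquet'}
print(log.snapshot())    # {'part-compacted.parquet'}
```

Because the underlying Parquet files are immutable and only the log advances, a reader pinned to version 1 is unaffected by the concurrent compaction — that is the snapshot isolation the slide refers to.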
Efficient Upserts (Updates+Inserts) with MERGE command
• GDPR DSR requests
• Change Data Capture
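The core semantics of a MERGE upsert can be sketched in plain Python (a toy model of matched-update / not-matched-insert behavior, not the real engine):

```python
# Toy sketch of MERGE (upsert) semantics: source rows that match the
# target on the key are updated, non-matching rows are inserted.
# Illustrative only -- real Delta Lake does this transactionally at scale.

def merge(target, source, key):
    """Upsert `source` rows into `target`; both are lists of dicts."""
    by_key = {row[key]: dict(row) for row in target}
    for row in source:
        if row[key] in by_key:
            by_key[row[key]].update(row)   # WHEN MATCHED THEN UPDATE
        else:
            by_key[row[key]] = dict(row)   # WHEN NOT MATCHED THEN INSERT
    return sorted(by_key.values(), key=lambda r: r[key])

users = [{"id": 1, "email": "a@x.com"}, {"id": 2, "email": "b@x.com"}]
updates = [{"id": 2, "email": "b@y.com"},   # update existing row
           {"id": 3, "email": "c@x.com"}]   # insert new row
merged = merge(users, updates, key="id")
print(merged)
```

A GDPR delete request is the same pattern with a matched-delete branch instead of an update; change data capture feeds the `source` side from an upstream change stream.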
Time Travel
8. Delta Table = Parquet + Transaction Log + Indexes/Stats
Delta Table:
• Versioned Parquet Files
• Indexes & Stats
• Delta Log
- Time Travel
Applications Include:
• Audit Data Changes
• Data reproducibility
• Data pipeline debugging
• Immediate rollback capabilities
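Conceptually, time travel is just reading an older snapshot, and a rollback re-commits the contents of an earlier version as the new latest version. A toy sketch (illustrative names; each write stores a full snapshot, whereas real Delta Lake replays its transaction log, but the read semantics are the same):

```python
# Toy sketch of time travel and rollback over a versioned table.

class VersionedTable:
    def __init__(self):
        self.versions = []  # versions[v] = rows at version v

    def write(self, rows):
        self.versions.append(list(rows))
        return len(self.versions) - 1

    def read(self, version_as_of=None):
        """Read the latest snapshot, or an older one ("VERSION AS OF")."""
        v = len(self.versions) - 1 if version_as_of is None else version_as_of
        return list(self.versions[v])

    def rollback(self, version):
        """Restore an earlier version by committing it as a new one."""
        return self.write(self.versions[version])

t = VersionedTable()
t.write([1, 2, 3])          # version 0
t.write([1, 2, 3, 4, 5])    # version 1: a bad pipeline run appended junk
t.rollback(0)               # version 2 == contents of version 0

print(t.read(version_as_of=1))  # [1, 2, 3, 4, 5]
print(t.read())                 # [1, 2, 3]
```

Note that the rollback is itself a new version, so the bad run at version 1 remains available for debugging and audit — nothing is destroyed.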
9. - Time Travel, Audit Applications
Audit Data Changes
• A history of all operations is recorded for auditing
• Audit operation types, userIds, clusterIds, notebookIds, timestamps and versions
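A minimal sketch of the kind of audit record the table history captures per commit, and a query over it (field names here are examples, not the exact schema Delta Lake exposes):

```python
# Illustrative audit-history records, modeled on the per-commit metadata
# a Delta table's history records (field names are examples).
import datetime

history = [
    {"version": 0, "operation": "WRITE",  "userId": "alice",
     "clusterId": "c-01", "notebookId": "nb-7",
     "timestamp": datetime.datetime(2019, 6, 1, 9, 0)},
    {"version": 1, "operation": "MERGE",  "userId": "bob",
     "clusterId": "c-02", "notebookId": "nb-9",
     "timestamp": datetime.datetime(2019, 6, 2, 14, 30)},
    {"version": 2, "operation": "DELETE", "userId": "alice",
     "clusterId": "c-01", "notebookId": "nb-7",
     "timestamp": datetime.datetime(2019, 6, 3, 8, 15)},
]

def audit(history, operation):
    """Who performed a given operation type, at which version, and when?"""
    return [(h["userId"], h["version"], h["timestamp"])
            for h in history if h["operation"] == operation]

print(audit(history, "DELETE"))
```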
10. - Time Travel, Data Reproducibility
Data reproducibility
Reproduce query results and reports
• Go back to the exact same data that was used to train an
ML model version in the past.
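The key habit for reproducibility is recording the table version alongside the trained model, so the exact training data can be re-read later even after the table has changed. Sketched here with a plain dict of snapshots standing in for versioned Delta reads (all names illustrative):

```python
# Sketch: record the data version with the model so training is
# reproducible. `snapshots` stands in for versioned Delta table reads.

snapshots = {0: [("x", 1.0), ("y", 2.0)],
             1: [("x", 1.0), ("y", 2.5), ("z", 3.0)]}  # table grew later

def train(rows):
    """Stand-in "model": just the mean of the feature values."""
    return sum(v for _, v in rows) / len(rows)

# At training time, pin and record the version that was read:
trained_version = 0
model = {"weights": train(snapshots[trained_version]),
         "data_version": trained_version}

# Months later, reproduce the run from the recorded version,
# even though the table has since moved to version 1:
replayed = train(snapshots[model["data_version"]])
assert replayed == model["weights"]
print(model)  # {'weights': 1.5, 'data_version': 0}
```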
13. Fast, easy, and collaborative Apache Spark™-based analytics platform
Built with your needs in mind
Role-based access controls
Effortless autoscaling
Live collaboration
Enterprise-grade SLAs
Best-in-class notebooks
Simple job scheduling
Seamlessly integrated with the Azure Portfolio
Increase productivity
Build on a secure, trusted cloud
Scale without limits
Azure Databricks – Introduction
14. Sensors and IoT
(unstructured)
Ingest Store Process Serve
Cosmos DB Apps
Azure Data Lake Storage
Logs (unstructured)
Azure Data Factory
Azure Databricks
Media (unstructured)
Files (unstructured)
Business/custom apps
(structured)
Azure SQL Data
Warehouse
Power BI
Azure Event Hub
Azure IoT Hub
Kafka
Delta Format
Raw Format
+
Azure Databricks – Delta Lake at Scale on Azure
15. Azure Databricks – Delta Lake at Scale on Azure
Azure Data Factory
Polybase
Azure SQL Data
Warehouse
Azure Event Hub
Azure IoT Hub
Kafka
Raw Format
Step 1
Load raw data to Azure
Data Lake Storage
Step 2
Use Azure Databricks to
1. Combine streaming and
batch
2. Save data as Delta format
Delta Format
(Bronze Table)
Cosmos DB Apps
Step 3
Use Azure Databricks to
1. Join, enrich, clean, transform data
2. Develop, train, and score ML models
with Azure ML + MLFlow
Azure Data Lake Storage
Delta Format
(Silver Table)
+
Delta Format
(Gold Table)
Step 4
Load data into serving layers like
1. SQL Data Warehouse for
enterprise BI scenarios.
2. Cosmos DB for real-time Apps
Power BI
Sensors and IoT
(unstructured)
Logs (unstructured)
Media (unstructured)
Files (unstructured)
Business/custom apps
(structured)
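The four steps above form a bronze → silver → gold ("medallion") flow: land raw data unchanged, clean and transform it, then aggregate it for serving. A toy end-to-end sketch with pure-Python stand-ins for the Delta tables (names and data illustrative):

```python
# Toy sketch of the Step 1-4 flow: raw events land as-is in a bronze
# table, are cleaned into silver, and aggregated into gold for serving
# (e.g. to SQL DW / Power BI). Pure-Python stand-ins for Delta tables.

raw_events = [
    {"device": "sensor-1", "temp": "21.5"},
    {"device": "sensor-2", "temp": "bad"},    # dirty record
    {"device": "sensor-1", "temp": "22.1"},
]

# Step 2: land raw data unchanged (bronze table)
bronze = list(raw_events)

# Step 3: clean and transform (silver table) -- drop unparsable readings
silver = []
for e in bronze:
    try:
        silver.append({"device": e["device"], "temp": float(e["temp"])})
    except ValueError:
        pass  # skip/quarantine dirty rows

# Step 3 cont.: aggregate for serving (gold table) -- avg temp per device
gold = {}
for e in silver:
    gold.setdefault(e["device"], []).append(e["temp"])
gold = {d: round(sum(v) / len(v), 2) for d, v in gold.items()}

# Step 4: `gold` would be loaded into SQL DW / Cosmos DB for serving
print(gold)  # {'sensor-1': 21.8}
```

Keeping bronze unmodified means the silver and gold tables can always be rebuilt from scratch when cleaning logic changes — the same replayability that the transaction log provides at the file level.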