Time travel is now possible with Delta Lake! We will uncover how Delta Lake makes time travel possible and why it matters to you. Through presentation, notebooks, and code, we will showcase several common applications and how they can improve your modern data engineering pipelines. Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™. It provides snapshot isolation for concurrent reads and writes; enables efficient upserts, deletes, and immediate rollbacks; and allows background file optimization through compaction and Z-Order partitioning, achieving up to 100x performance improvements. In this presentation you will learn: what challenges Delta Lake solves, how Delta Lake works under the hood, and applications of the new Delta time travel capability.
5. Complexities Spark Solves
Other Spark Challenges:
Concurrency
• Multiple readers and writers
• Ensuring atomic transactions, consistency, and isolation
Updates & Rollbacks
• GDPR user delete requests or other upserts
• Data rollback or snapshots for audits
The Small Files Problem
• Performance degradation
• Complex cleanup that often incurs downtime
Complex Data
• Diverse data formats (JSON, Avro, binary, …)
• Data can be dirty, late, or out-of-order
Complex Systems
• Diverse storage systems (Kafka, Azure Storage, Event Hubs, SQL DW, …)
• System failures
Complex Workloads
• Combining streaming with interactive queries
• Machine learning
Solved
6. Delta Table = Parquet + Transaction Log + Indexes/Stats
Delta Table:
• Versioned Parquet Files
• Indexes & Stats
• Delta Log
- Reliable Data Lakes at Scale
ACID Transaction Guarantees
• Atomic, Consistent, Isolated, Durable
Versioned parquet files with transaction log
• Snapshot isolation for multiple concurrent read/writes
• Immediate rollback capabilities
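The guarantees above all rest on the transaction log. As a toy, in-memory analogue (illustrative names only, not the real Delta Lake implementation): each commit atomically appends a list of add/remove file actions, and replaying the log up to any version reconstructs the exact snapshot of the table at that version — which is what gives readers a consistent view and makes rollback trivial.

```python
# Toy analogue of the Delta transaction log (illustrative only; not the
# real Delta Lake implementation). Each commit is an ordered list of
# "add" / "remove" file actions; replaying the log up to version N
# yields the exact set of data files visible at that version.

class ToyDeltaLog:
    def __init__(self):
        self.commits = []  # commits[v] = list of (action, filename)

    def commit(self, actions):
        """Atomically append one commit; returns the new version number."""
        self.commits.append(list(actions))
        return len(self.commits) - 1

    def snapshot(self, version=None):
        """Set of data files visible at `version` (default: latest)."""
        if version is None:
            version = len(self.commits) - 1
        files = set()
        for actions in self.commits[: version + 1]:
            for action, name in actions:
                if action == "add":
                    files.add(name)
                elif action == "remove":
                    files.discard(name)
        return files

log = ToyDeltaLog()
v0 = log.commit([("add", "part-0.parquet")])
v1 = log.commit([("add", "part-1.parquet")])
# A compaction rewrites the two small files into one, in a single commit:
v2 = log.commit([("remove", "part-0.parquet"),
                 ("remove", "part-1.parquet"),
                 ("add", "part-compacted.parquet")])

print(log.snapshot(v1))  # {'part-0.parquet', 'part-1.parquet'}
print(log.snapshot())    # {'part-compacted.parquet'}
```

Because the underlying Parquet files are immutable and only the log advances, a reader pinned to version 1 is unaffected by the concurrent compaction — that is the snapshot isolation the slide refers to.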
Efficient Upserts (Updates+Inserts) with MERGE command
• GDPR DSR requests
• Change Data Capture
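The core semantics of a MERGE upsert can be sketched in plain Python (a toy model of matched-update / not-matched-insert behavior, not the real engine):

```python
# Toy sketch of MERGE (upsert) semantics: source rows that match the
# target on the key are updated, non-matching rows are inserted.
# Illustrative only -- real Delta Lake does this transactionally at scale.

def merge(target, source, key):
    """Upsert `source` rows into `target`; both are lists of dicts."""
    by_key = {row[key]: dict(row) for row in target}
    for row in source:
        if row[key] in by_key:
            by_key[row[key]].update(row)   # WHEN MATCHED THEN UPDATE
        else:
            by_key[row[key]] = dict(row)   # WHEN NOT MATCHED THEN INSERT
    return sorted(by_key.values(), key=lambda r: r[key])

users = [{"id": 1, "email": "a@x.com"}, {"id": 2, "email": "b@x.com"}]
updates = [{"id": 2, "email": "b@y.com"},   # update existing row
           {"id": 3, "email": "c@x.com"}]   # insert new row
merged = merge(users, updates, key="id")
print(merged)
```

A GDPR delete request is the same pattern with a matched-delete branch instead of an update; change data capture feeds the `source` side from an upstream change stream.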
Time Travel
8. Delta Table = Parquet + Transaction Log + Indexes/Stats
Delta Table:
• Versioned Parquet Files
• Indexes & Stats
• Delta Log
- Time Travel
Applications Include:
• Audit Data Changes
• Data reproducibility
• Data pipeline debugging
• Immediate rollback capabilities
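Conceptually, time travel is just reading an older snapshot, and a rollback re-commits the contents of an earlier version as the new latest version. A toy sketch (illustrative names; each write stores a full snapshot, whereas real Delta Lake replays its transaction log, but the read semantics are the same):

```python
# Toy sketch of time travel and rollback over a versioned table.

class VersionedTable:
    def __init__(self):
        self.versions = []  # versions[v] = rows at version v

    def write(self, rows):
        self.versions.append(list(rows))
        return len(self.versions) - 1

    def read(self, version_as_of=None):
        """Read the latest snapshot, or an older one ("VERSION AS OF")."""
        v = len(self.versions) - 1 if version_as_of is None else version_as_of
        return list(self.versions[v])

    def rollback(self, version):
        """Restore an earlier version by committing it as a new one."""
        return self.write(self.versions[version])

t = VersionedTable()
t.write([1, 2, 3])          # version 0
t.write([1, 2, 3, 4, 5])    # version 1: a bad pipeline run appended junk
t.rollback(0)               # version 2 == contents of version 0

print(t.read(version_as_of=1))  # [1, 2, 3, 4, 5]
print(t.read())                 # [1, 2, 3]
```

Note that the rollback is itself a new version, so the bad run at version 1 remains available for debugging and audit — nothing is destroyed.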
9. - Time Travel, Audit Applications
Audit Data Changes
• A history of all operations is recorded for auditing
• Audit operation types, userIds, clusterIds, notebookIds, timestamps and versions
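A minimal sketch of the kind of audit record the table history captures per commit, and a query over it (field names here are examples, not the exact schema Delta Lake exposes):

```python
# Illustrative audit-history records, modeled on the per-commit metadata
# a Delta table's history records (field names are examples).
import datetime

history = [
    {"version": 0, "operation": "WRITE",  "userId": "alice",
     "clusterId": "c-01", "notebookId": "nb-7",
     "timestamp": datetime.datetime(2019, 6, 1, 9, 0)},
    {"version": 1, "operation": "MERGE",  "userId": "bob",
     "clusterId": "c-02", "notebookId": "nb-9",
     "timestamp": datetime.datetime(2019, 6, 2, 14, 30)},
    {"version": 2, "operation": "DELETE", "userId": "alice",
     "clusterId": "c-01", "notebookId": "nb-7",
     "timestamp": datetime.datetime(2019, 6, 3, 8, 15)},
]

def audit(history, operation):
    """Who performed a given operation type, at which version, and when?"""
    return [(h["userId"], h["version"], h["timestamp"])
            for h in history if h["operation"] == operation]

print(audit(history, "DELETE"))
```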
10. - Time Travel, Data Reproducibility
Data reproducibility
Reproduce query results and reports
• Go back to the exact same data that was used to train an
ML model version in the past.
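The key habit for reproducibility is recording the table version alongside the trained model, so the exact training data can be re-read later even after the table has changed. Sketched here with a plain dict of snapshots standing in for versioned Delta reads (all names illustrative):

```python
# Sketch: record the data version with the model so training is
# reproducible. `snapshots` stands in for versioned Delta table reads.

snapshots = {0: [("x", 1.0), ("y", 2.0)],
             1: [("x", 1.0), ("y", 2.5), ("z", 3.0)]}  # table grew later

def train(rows):
    """Stand-in "model": just the mean of the feature values."""
    return sum(v for _, v in rows) / len(rows)

# At training time, pin and record the version that was read:
trained_version = 0
model = {"weights": train(snapshots[trained_version]),
         "data_version": trained_version}

# Months later, reproduce the run from the recorded version,
# even though the table has since moved to version 1:
replayed = train(snapshots[model["data_version"]])
assert replayed == model["weights"]
print(model)  # {'weights': 1.5, 'data_version': 0}
```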
13. Fast, easy, and collaborative Apache Spark™-based analytics platform
Built with your needs in mind
Role-based access controls
Effortless autoscaling
Live collaboration
Enterprise-grade SLAs
Best-in-class notebooks
Simple job scheduling
Seamlessly integrated with the Azure Portfolio
Increase productivity
Build on a secure, trusted cloud
Scale without limits
Azure Databricks – Introduction
14. Sensors and IoT
(unstructured)
Ingest Store Process Serve
Cosmos DB Apps
Azure Data Lake Storage
Logs (unstructured)
Azure Data Factory
Azure Databricks
Media (unstructured)
Files (unstructured)
Business/custom apps
(structured)
Azure SQL Data
Warehouse
Power BI
Azure Event Hub
Azure IoT Hub
Kafka
Delta Format
Raw Format
+
Azure Databricks – Delta Lake at Scale on Azure
15. Azure Databricks – Delta Lake at Scale on Azure
Azure Data Factory
Polybase
Azure SQL Data
Warehouse
Azure Event Hub
Azure IoT Hub
Kafka
Raw Format
Step 1
Load raw data to Azure
Data Lake Storage
Step 2
Use Azure Databricks to
1. Combine streaming and
batch
2. Save data as Delta format
Delta Format
(Bronze Table)
Cosmos DB Apps
Step 3
Use Azure Databricks to
1. Join, enrich, clean, transform data
2. Develop, train, and score ML models
with Azure ML + MLFlow
Azure Data Lake Storage
Delta Format
(Silver Table)
+
Delta Format
(Gold Table)
Step 4
Load data into serving layers like
1. SQL Data Warehouse for
enterprise BI scenarios.
2. Cosmos DB for real-time Apps
Power BI
Sensors and IoT
(unstructured)
Logs (unstructured)
Media (unstructured)
Files (unstructured)
Business/custom apps
(structured)
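The four steps above form a bronze → silver → gold ("medallion") flow: land raw data unchanged, clean and transform it, then aggregate it for serving. A toy end-to-end sketch with pure-Python stand-ins for the Delta tables (names and data illustrative):

```python
# Toy sketch of the Step 1-4 flow: raw events land as-is in a bronze
# table, are cleaned into silver, and aggregated into gold for serving
# (e.g. to SQL DW / Power BI). Pure-Python stand-ins for Delta tables.

raw_events = [
    {"device": "sensor-1", "temp": "21.5"},
    {"device": "sensor-2", "temp": "bad"},    # dirty record
    {"device": "sensor-1", "temp": "22.1"},
]

# Step 2: land raw data unchanged (bronze table)
bronze = list(raw_events)

# Step 3: clean and transform (silver table) -- drop unparsable readings
silver = []
for e in bronze:
    try:
        silver.append({"device": e["device"], "temp": float(e["temp"])})
    except ValueError:
        pass  # skip/quarantine dirty rows

# Step 3 cont.: aggregate for serving (gold table) -- avg temp per device
gold = {}
for e in silver:
    gold.setdefault(e["device"], []).append(e["temp"])
gold = {d: round(sum(v) / len(v), 2) for d, v in gold.items()}

# Step 4: `gold` would be loaded into SQL DW / Cosmos DB for serving
print(gold)  # {'sensor-1': 21.8}
```

Keeping bronze unmodified means the silver and gold tables can always be rebuilt from scratch when cleaning logic changes — the same replayability that the transaction log provides at the file level.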