Delta Lake is an open source data management system that provides ACID transactions and schema enforcement for data stored in object storage. The delta-rs project makes Delta tables usable outside the JVM ecosystem by providing Rust and Python bindings. It can access Delta tables on the local filesystem, S3, and Azure Data Lake Storage; read and metadata operations are production ready, while write support is still maturing. Future work includes improving writer support and adding bindings for additional languages.
Growing the Delta Ecosystem to Rust and Python with Delta-RS
1. 1
Growing the Delta Lake ecosystem
to Rust, Python, and more.
introducing delta-rs
R. Tyler Croy
tech.scribd.com
github.com/rtyler
rtyler@brokenco.de
2. 2
whoami(1)
● Long-time free and open source developer
● delta-rs contributor
● Director of Platform Engineering at Scribd
○ Core Platform
○ Data Engineering
○ Data Operations
5. 5
Delta Lake basics
● Three major components:
○ Parquet files
○ Object storage
○ Transaction logs
● Well-defined transaction log semantics in the
protocol document
● Very little "magic"
github.com/delta-io/delta/blob/master/PROTOCOL.md
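To make the "very little magic" point concrete, here is a minimal sketch of replaying a transaction log to find the live Parquet files. It assumes only what the protocol document describes: commits are newline-delimited JSON files named by zero-padded version, containing `add` and `remove` actions. The helper names (`write_commit`, `active_files`, `demo`) are hypothetical, checkpoints are ignored, and this is not how delta-rs itself is implemented.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def write_commit(log_dir: Path, version: int, actions: list[dict]) -> None:
    """Write one commit as newline-delimited JSON, named by zero-padded version."""
    path = log_dir / f"{version:020d}.json"
    path.write_text("\n".join(json.dumps(a) for a in actions) + "\n")


def active_files(log_dir: Path) -> set[str]:
    """Replay commits in version order and return the live Parquet paths."""
    live: set[str] = set()
    # Zero-padded names sort lexicographically in version order.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


def demo() -> set[str]:
    """Build a tiny two-commit log in a temp directory and replay it."""
    with tempfile.TemporaryDirectory() as tmp:
        log_dir = Path(tmp) / "_delta_log"
        log_dir.mkdir()
        write_commit(log_dir, 0, [
            {"add": {"path": "part-0.parquet"}},
            {"add": {"path": "part-1.parquet"}},
        ])
        write_commit(log_dir, 1, [
            {"remove": {"path": "part-0.parquet"}},
            {"add": {"path": "part-2.parquet"}},
        ])
        return active_files(log_dir)
```

Calling `demo()` leaves `part-1.parquet` and `part-2.parquet` as the table's current files, since the second commit removes `part-0.parquet`.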
7. 7
Why delta-rs was needed
● Not everything needs a Spark cluster 😀
○ Many workloads need "fractional compute" resources.
● Data ingestion is a key area where Scribd needed a better cost/performance ratio.
○ Some portion of writes into Delta Lake aren't results of "big data" processing.
● Hybrid workloads may need "offline" and "online" data.
brokenco.de/2021/04/27/why-delta-lake.html
8. 8
delta-rs
Extending Delta Lake outside the JVM ecosystem
● Provides Rust and Python bindings for working with Delta tables
● Supports local filesystem, S3, and Azure Data Lake Storage (Gen 2)
● Production ready:
○ Read and metadata operations
● Still cookin':
○ Support for writing to the Delta transaction log
○ S3 multi-writer support work in progress
○ Ruby bindings in their infancy
github.com/delta-io/delta-rs
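Writing to the transaction log comes down to creating the next numbered commit file in a way that only one writer can win. The sketch below shows that idea on a local filesystem, where an exclusive create provides the mutual exclusion. The function names are hypothetical, and the retry loop is deliberately naive: a real writer would re-read the log and check for conflicts before retrying. This is an illustration of the concept, not the delta-rs writer.

```python
from __future__ import annotations

import json
import os
import tempfile
from pathlib import Path


def try_commit(log_dir: Path, version: int, actions: list[dict]) -> bool:
    """Try to create the commit file for `version`; return False if it already exists."""
    path = log_dir / f"{version:020d}.json"
    try:
        # O_EXCL makes creation fail if another writer already claimed this version.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        for action in actions:
            handle.write(json.dumps(action) + "\n")
    return True


def commit_with_retry(log_dir: Path, next_version: int,
                      actions: list[dict], max_attempts: int = 5) -> int:
    """Claim the first free version at or after `next_version` and return it."""
    version = next_version
    for _ in range(max_attempts):
        if try_commit(log_dir, version, actions):
            return version
        # Another writer won this version. A real writer would reload the log
        # and verify its changes still apply before moving on (assumption).
        version += 1
    raise RuntimeError("could not commit after repeated conflicts")


def demo() -> tuple[int, int]:
    """Two writers both believe version 0 is next; only one can have it."""
    with tempfile.TemporaryDirectory() as tmp:
        log_dir = Path(tmp) / "_delta_log"
        log_dir.mkdir()
        first = commit_with_retry(log_dir, 0, [{"add": {"path": "a.parquet"}}])
        second = commit_with_retry(log_dir, 0, [{"add": {"path": "b.parquet"}}])
        return first, second
```

An object store without an equivalent "create only if absent" operation needs some external coordination to get the same guarantee, which is one way to read the multi-writer item above.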
9. 9
The power of Rust
● Correctness and speed are important to Delta users
● No runtime allows for very easy embedding
● Opens up possibilities for Delta Lake in:
○ NodeJS
○ Ruby
○ Python
○ Golang
○ etc
● It's so hot right now
14. 14
What you can do right now
● Access Delta tables in: AWS S3, Azure Data Lake, Local filesystem
● Read tables
○ By partitions
○ With checkpoints
○ With stream table updates
○ Write to the transaction log
○ Vacuum
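Reading "by partitions" can be decided from the log alone, because each `add` action in the protocol records the partition values of its file as strings. A minimal sketch, using hand-written sample actions and a hypothetical helper name, of selecting only the files that match a partition filter:

```python
from __future__ import annotations


def files_for_partition(add_actions: list[dict], wanted: dict[str, str]) -> list[str]:
    """Return paths of files whose partitionValues match every key in `wanted`."""
    matches = []
    for action in add_actions:
        values = action.get("partitionValues", {})
        if all(values.get(key) == value for key, value in wanted.items()):
            matches.append(action["path"])
    return matches


# Sample data for illustration only; paths and columns are made up.
SAMPLE_ADDS = [
    {"path": "date=2021-05-01/part-0.parquet", "partitionValues": {"date": "2021-05-01"}},
    {"path": "date=2021-05-02/part-0.parquet", "partitionValues": {"date": "2021-05-02"}},
    {"path": "date=2021-05-02/part-1.parquet", "partitionValues": {"date": "2021-05-02"}},
]

selected = files_for_partition(SAMPLE_ADDS, {"date": "2021-05-02"})
```

Only the selected Parquet files then need to be opened, which is what makes partition reads cheap.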
15. 15
What you can't quite do yet
● Write parquet directly
● Create checkpoints
● Execute an OPTIMIZE command
○ Focusing on bin-packing
○ Not on z-ordering yet
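For readers unfamiliar with bin-packing in the OPTIMIZE sense: the goal is to group many small files into sets that can each be rewritten as one file near a target size. The following is a minimal sketch using a greedy first-fit-decreasing heuristic. The heuristic, the function name, and the sizes are assumptions chosen for illustration, not the planned delta-rs implementation.

```python
from __future__ import annotations


def plan_bins(file_sizes: dict[str, int], target_size: int) -> list[list[str]]:
    """Group small files into bins whose combined size stays within target_size."""
    bins: list[list] = []  # each bin is [total_bytes, [paths]]
    # Largest first; ties broken by path so the plan is deterministic.
    ordered = sorted(file_sizes.items(), key=lambda item: (-item[1], item[0]))
    for path, size in ordered:
        if size >= target_size:
            continue  # already big enough, leave it alone
        for current in bins:
            if current[0] + size <= target_size:
                current[0] += size
                current[1].append(path)
                break
        else:
            bins.append([size, [path]])
    # A bin holding a single file gains nothing from being rewritten.
    return [paths for _, paths in bins if len(paths) > 1]


plan = plan_bins({"a": 60, "b": 50, "c": 40, "d": 30, "e": 120}, target_size=100)
```

Here `a` and `c` would be compacted together, as would `b` and `d`, while `e` is left untouched because it already exceeds the target.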
16. 16
Available on
Python
pip install deltalake pandas
Rust
[dependencies]
deltalake = "*"
scribd.com
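Once the Python package is installed, reading a table looks roughly like the sketch below. The wrapper function and the table location are hypothetical; `DeltaTable`, `version()`, `file_uris()`, and `to_pandas()` follow recent releases of the `deltalake` package and may be named differently in older versions.

```python
def read_delta_table(table_uri: str):
    """Load a Delta table into a pandas DataFrame and print basic metadata."""
    # Imported lazily so this module still loads where deltalake is not installed.
    from deltalake import DeltaTable

    table = DeltaTable(table_uri)
    print("version:", table.version())
    print("data files:", len(table.file_uris()))
    return table.to_pandas()


# Example call, assuming a table exists at this (hypothetical) location:
# frame = read_delta_table("./my-delta-table")
```

No Spark session or JVM is involved; the table is read in-process through the Rust library.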
18. 18
kafka-delta-ingest
Rapidly ingesting structured data from Apache Kafka into Delta Lake
● Intended to provide a high speed bridge between Apache Kafka
streams and Delta Lake
● Initially targeting mapping JSON messages into Delta table rows
● Heavily dependent on delta-rs
○ Driving significant writer-based improvements
● Not intended to do any stream transformation or manipulation
github.com/delta-io/kafka-delta-ingest
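The core loop of such a bridge can be pictured as: decode each JSON message into a row, buffer rows, and hand a batch to the writer once it is large enough. Below is a minimal sketch of that shape, with the Kafka consumer replaced by a list of byte strings and the Delta writer replaced by a callback. The class name, `max_rows`, and the handling of bad messages are all assumptions for illustration, not the kafka-delta-ingest implementation.

```python
from __future__ import annotations

import json
from typing import Callable


class JsonBatcher:
    """Buffer decoded JSON rows and flush them in fixed-size batches."""

    def __init__(self, max_rows: int, flush: Callable[[list[dict]], None]) -> None:
        self.max_rows = max_rows
        self.flush = flush
        self.rows: list[dict] = []
        self.rejected: list[bytes] = []

    def push(self, raw: bytes) -> None:
        """Decode one message; set aside anything that is not a JSON object."""
        try:
            row = json.loads(raw)
        except ValueError:  # covers JSONDecodeError and bad encodings
            self.rejected.append(raw)
            return
        if not isinstance(row, dict):
            self.rejected.append(raw)
            return
        self.rows.append(row)
        if len(self.rows) >= self.max_rows:
            self.flush_now()

    def flush_now(self) -> None:
        """Hand any buffered rows to the writer callback and reset the buffer."""
        if self.rows:
            self.flush(self.rows)
            self.rows = []


def demo() -> tuple[list[list[dict]], int]:
    """Feed four messages through a batcher that flushes every two rows."""
    batches: list[list[dict]] = []
    batcher = JsonBatcher(max_rows=2, flush=batches.append)
    for message in [b'{"id": 1}', b"not json", b'{"id": 2}', b'{"id": 3}']:
        batcher.push(message)
    batcher.flush_now()  # flush the final partial batch
    return batches, len(batcher.rejected)
```

Note that no transformation happens between decode and flush, which matches the stated intent of leaving stream manipulation to other tools.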
19. 19
Problems to solve
● Kafka topics can have variable throughput volume
● Auto-scaling is important for data timeliness
● Spark is a lot of overhead for writing data from one
socket to another
20. 20
delta-rs community
Your name here!
● Active channels in the Delta Slack workspace, join on delta.io
○ #delta-rs
○ #kafka-delta-ingest
● Lots of good first issues for anyone who wants to learn Delta Lake or Rust
● Notable contributions
○ Extensive Python binding support and bug fixes from Florian Valeye (@fvaleye)
○ Async IO and Azure storage backend from Ben Sully (@sd2k)
○ Safe concurrent writer work with S3 and DynamoDB from Mykhailo Osypov (@mosyp)
○ Parquet crate write support by Neville Dipale (@nevi-me)
github.com/delta-io/delta-rs