The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski

Data storage made fast and easy

The Problem
• We focus on persistent storage of massive data
• Plethora of complex formats across many applications

- Genomics (FastQ, BAM, VCF, CRAM, etc.), LiDAR (LAS, LAZ), Databases (proprietary formats, Parquet), …

• Every format is associated with a library responsible for

- Backend support (POSIX, HDFS, AWS S3, …), parallel IO, compression, other filters, …

• Downstream computations (e.g., Linear Algebra) typically work on vectors and arrays

• Two common problems:
- redundant software engineering for high performance (parallel IO, compression, etc.)
- expensive conversion to arrays for downstream computations

What is Array Data?
1) Slicing
2) Compression
Goals

Applications
• Genomics
• Time Series
• Tabular
• LiDAR
• Imaging
Image source: NYU’s Center for Urban Science and Progress

Storage Module vs. DBMS
• Storage Module
- IO
- Compression
- Access / Slicing
- APIs to higher level modules
- Other filters (e.g., encryption)

• DBMS
- Query language
- Query parser
- Query optimizer
- Query executor

• A storage module can be integrated with other data science tools as well, without ODBC/JDBC

What is TileDB?
Architecture
TileDB is a storage module for a novel
multi-dimensional array data format

TileDB History
Stavros, Jake, Tyler, Seth
2015 TileDB research project kicks off at
2016 VLDB paper on TileDB
2017 TileDB, Inc. is incorporated backed by
2018 - We are hiring!

The TileDB Format 
Physical Organization
my_array
    __<uuid>_<timestamp>
        a1.tdb
        a2.tdb
        a2_var.tdb
        __fragment_metadata.tdb
    __array_schema.tdb
    __lock.tdb

my_array
    __<uuid>_<timestamp>
        a1.tdb
        a2.tdb
        a2_var.tdb
        __coords.tdb
    __array_schema.tdb
    __lock.tdb
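The layout above suggests that each write lands in its own `__<uuid>_<timestamp>` directory inside the array directory. A minimal sketch of discovering those fragment directories in write order, using only the Python standard library: the regular expression, `list_fragments`, and `make_demo_array` are hypothetical helpers written for illustration (they are not part of the TileDB API), and the assumption that the UUID is hexadecimal and the timestamp is an integer is ours.

```python
import re
import tempfile
import uuid
from pathlib import Path

# Assumed naming pattern, taken from the slide: __<uuid>_<timestamp>
FRAGMENT_RE = re.compile(r"^__(?P<uuid>[0-9a-f]+)_(?P<timestamp>\d+)$")


def list_fragments(array_dir):
    """Return (timestamp, directory name) pairs for every fragment, oldest first."""
    fragments = []
    for entry in Path(array_dir).iterdir():
        match = FRAGMENT_RE.match(entry.name)
        if entry.is_dir() and match:
            fragments.append((int(match.group("timestamp")), entry.name))
    return sorted(fragments)


def make_demo_array(root, timestamps):
    """Create an empty directory tree shaped like the slide's my_array (illustration only)."""
    array_dir = Path(root) / "my_array"
    array_dir.mkdir()
    for name in ("__array_schema.tdb", "__lock.tdb"):
        (array_dir / name).touch()
    for ts in timestamps:
        fragment = array_dir / f"__{uuid.uuid4().hex}_{ts}"
        fragment.mkdir()
        for name in ("a1.tdb", "a2.tdb", "a2_var.tdb", "__fragment_metadata.tdb"):
            (fragment / name).touch()
    return array_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for ts, name in list_fragments(make_demo_array(tmp, [30, 10, 20])):
            print(ts, name)
```

The files here are empty placeholders; only the directory structure mirrors the slide.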

Updates
LSM-tree-like updates and consolidation

my_array
    __<uuid>_<timestamp1>
        a1.tdb
        a2.tdb
        a2_var.tdb
    __<uuid>_<timestamp2>
        a1.tdb
        a2.tdb
        a2_var.tdb
    __<uuid>_<timestamp3>
        a1.tdb
        a2.tdb
        a2_var.tdb
        __coords.tdb
    __array_schema.tdb
    __lock.tdb
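The update scheme on this slide can be illustrated with a toy in-memory model: every write appends an immutable, timestamped fragment, reads let the newest fragment win, and consolidation merges fragments into one. This is a sketch of the general LSM-tree-like idea only; `ToyArray` and its methods are hypothetical names, and nothing here reflects how TileDB actually stores or merges fragments on disk.

```python
class ToyArray:
    """In-memory model: each write is an immutable, timestamped fragment."""

    def __init__(self):
        self.fragments = []  # list of (timestamp, {coords: value})

    def write(self, timestamp, cells):
        # Writes never modify earlier fragments; they only append a new one.
        self.fragments.append((timestamp, dict(cells)))

    def read(self, coords):
        # Scan newest to oldest, so the most recent write to a cell wins.
        for _, cells in sorted(self.fragments, key=lambda f: f[0], reverse=True):
            if coords in cells:
                return cells[coords]
        return None

    def consolidate(self):
        # Merge all fragments, oldest first, so newer values overwrite older ones.
        if not self.fragments:
            return
        merged = {}
        for _, cells in sorted(self.fragments, key=lambda f: f[0]):
            merged.update(cells)
        newest = max(ts for ts, _ in self.fragments)
        self.fragments = [(newest, merged)]


if __name__ == "__main__":
    arr = ToyArray()
    arr.write(1, {(0, 0): 1, (0, 1): 2})
    arr.write(2, {(0, 1): 20})      # overwrites one cell
    arr.write(3, {(5, 5): 99})      # adds a new cell
    print(arr.read((0, 1)), len(arr.fragments))
    arr.consolidate()
    print(arr.read((0, 1)), len(arr.fragments))
```

Because fragments are never rewritten in place, concurrent writers do not need to coordinate on existing files, which is one reason this style of update suits object stores.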

Filters
• Binary data across an attribute is organized into tiles

- Tile: atomic unit of IO

- Chunk: atomic unit of filtering; each chunk fits in L1 cache

• Each chunk of a tile passes through the filters in sequence (Filter 1, then Filter 2)

• Supported filters:

- Compression (gzip, zstd, …)

- Byte/Bit Shuffle

- Encryption

- Delta encoding

- Bit-width reduction
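The chunked filter pipeline described on this slide can be sketched with the standard library: a tile is split into chunks, and each chunk goes through delta encoding followed by compression. The chunk size, the choice of zlib, the 64-bit integer encoding, and all function names are assumptions made for illustration; this is not TileDB's implementation.

```python
import struct
import zlib

# Assumed chunk size in values (4096 * 8 bytes = 32 KiB);
# the slide only says that a chunk fits in L1 cache.
CHUNK_VALUES = 4096


def delta_encode(values):
    """Keep the first value, then store differences between neighbours."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def filter_tile(values):
    """Split a tile into chunks; run each chunk through delta encoding, then zlib."""
    chunks = []
    for start in range(0, len(values), CHUNK_VALUES):
        chunk = delta_encode(values[start:start + CHUNK_VALUES])
        raw = struct.pack(f"<{len(chunk)}q", *chunk)  # little-endian int64
        chunks.append(zlib.compress(raw))
    return chunks


def unfilter_tile(chunks):
    """Reverse the pipeline: filters are undone in the opposite order."""
    values = []
    for blob in chunks:
        raw = zlib.decompress(blob)
        deltas = struct.unpack(f"<{len(raw) // 8}q", raw)
        values.extend(delta_decode(deltas))
    return values


if __name__ == "__main__":
    tile = list(range(0, 30000, 3))
    filtered = filter_tile(tile)
    print(len(filtered), "chunks,", sum(len(c) for c in filtered), "bytes")
    assert unfilter_tile(filtered) == tile
```

Since every chunk is filtered independently, chunks can be processed in any order, which is what makes per-chunk parallelism possible.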

Cloud
• TileDB works great on AWS S3

- Just use s3://bucket-name/path/to/array instead of my_array

- No concept of directories, natural use of / in the URI

- aws s3 sync just works

- LSM-tree-based updates are an excellent fit for such an object store

• Adding Azure, Google Cloud and Alibaba Cloud soon

TileDB Parallelism
• Fully multi-threaded via Intel TBB

• TileDB does not rely on an external engine for parallelism (e.g., Dask)

• Thread-/Process-safety, no need for locking, multiple reader/writer model

• Parallel IO (good use of S3 multipart upload and byte range requests)

• Parallel filters

• Parallel sorting

• Parallel slicing
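As a rough analogy for the "parallel filters" bullet, the sketch below compresses independent chunks on a thread pool. TileDB itself does this in its C++ core with Intel TBB; the Python version here only illustrates the idea, and the function names, chunk size, and worker count are assumptions. It relies on CPython's `zlib` typically releasing the GIL while compressing, so threads can overlap.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def split_chunks(data, chunk_size):
    """Cut a byte string into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def compress_parallel(data, chunk_size=64 * 1024, workers=4):
    """Compress each chunk on a thread pool; map() returns results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.compress, split_chunks(data, chunk_size)))


def decompress_parallel(chunks, workers=4):
    """Decompress chunks concurrently and reassemble them in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, chunks))


if __name__ == "__main__":
    data = bytes(range(256)) * 2048  # 512 KiB of sample data
    chunks = compress_parallel(data)
    assert decompress_parallel(chunks) == data
    print(len(chunks), "chunks compressed")
```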

APIs and Integration
• Lightweight interfaces between the TileDB C library and the high-level APIs

• Zero-copying wherever possible

• Predicate push-down

• Effective partitioning (especially for sparse arrays)

ND arrays
Sparse arrays
Compression/Filters
Parallel IO
Parallelism
S3 support
Updates
Zarr
APIs
LSM-tree-like chunk-based chunk-based file-based
SWMR pushed to app pushed to app
multiple multiple only Python multiple
pushed to app Blosc / pushed to app pushed to app
open-source closed-source open-source pushed to app

• In-memory columnar format

• DataFrames, limited ND array support

• Designed for fast in-memory operations

• Rich datatype support, complex objects

• Persistence through virtual memory mapping or delegated to external on-disk formats

• TileDB integration with Apache Arrow is on our roadmap!

TileDB Value to
• Manage dense and sparse data persistence using a single API

• Get the most from your modern hardware! Concurrent IO, parallel
compression, accelerated encryption and more

• Easily interface with multiple different storage backends (including
cloud storage) and get performance with little to no code changes

• Common format that can be leveraged by “big data” / SQL
platforms and Python, R, Julia, … ecosystems

Thank You
We are hiring!
tiledb.workable.com
careers@tiledb.io
https://github.com/TileDB-Inc
pip install tiledb
