Data storage made fast and easy
The Problem
• We focus on persistent storage of massive data
• Plethora of complex formats across many applications

- Genomics (FastQ, BAM, VCF, CRAM, etc.), LiDAR (LAS, LAZ), Databases (proprietary formats, Parquet), …

• Every format is associated with a library responsible for

- Backend support (POSIX, HDFS, AWS S3, …), parallel IO, compression, other filters, …

• Downstream computations (e.g., Linear Algebra) typically work on vectors and arrays

• Two common problems:
- redundant software engineering for high performance (parallel IO, compression, etc.)
- expensive conversion to arrays for downstream computations
What is Array Data?
Goals:
1) Slicing
2) Compression
Applications
Genomics, Time Series, Tabular, LiDAR, Imaging
Source: NYU’s Center for Urban Science and Progress
Storage Module vs. DBMS
Storage Module
• IO
• Compression
• Access / Slicing
• APIs to higher level modules
• Other filters (e.g., encryption)
DBMS
• Query language
• Query parser
• Query optimizer
• Query executor
A storage module can be integrated with other data science tools as well, without requiring ODBC/JDBC
What is TileDB?
Architecture
TileDB is a storage module for a novel
multi-dimensional array data format
TileDB History
Team: Stavros, Jake, Tyler, Seth
2015 TileDB research project kicks off at
2016 VLDB paper on TileDB
2017 TileDB, Inc. is incorporated backed by
2018 - We are hiring!
The TileDB Format

Physical Organization
my_array
    __<uuid>_<timestamp>
        a1.tdb
        a2.tdb
        a2_var.tdb
        __fragment_metadata.tdb
    __array_schema.tdb
    __lock.tdb

my_array
    __<uuid>_<timestamp>
        a1.tdb
        a2.tdb
        a2_var.tdb
        __coords.tdb
        __fragment_metadata.tdb
    __array_schema.tdb
    __lock.tdb
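The two layouts above differ only in the extra __coords.tdb file. A plausible reading, stated here as an assumption rather than something the slide spells out, is that the first layout stores a dense fragment (coordinates implied by cell order) and the second a sparse fragment (coordinates stored explicitly). The toy sketch below illustrates that difference in plain Python; the function names and the 4x4 domain are hypothetical and none of this is TileDB code.

```python
# Hypothetical illustration, not TileDB code: two ways to lay out the
# same 4x4 array with a single attribute "a1".

ROWS, COLS = 4, 4


def dense_fragment(cells):
    """Store every cell in row-major order; coordinates are implicit."""
    a1 = [0] * (ROWS * COLS)  # 0 stands in for an "empty" fill value
    for (r, c), value in cells.items():
        a1[r * COLS + c] = value
    return {"a1": a1}


def sparse_fragment(cells):
    """Store only the non-empty cells, plus their explicit coordinates."""
    coords = sorted(cells)  # row-major order of the non-empty cells
    return {"__coords": coords, "a1": [cells[rc] for rc in coords]}


cells = {(0, 1): 7, (2, 3): 9, (3, 0): 4}
dense = dense_fragment(cells)
sparse = sparse_fragment(cells)
print(len(dense["a1"]), len(sparse["a1"]))
```

With only three non-empty cells, the dense layout still materializes all 16 positions, while the sparse layout stores three values and three coordinate pairs.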
The TileDB Format

Updates
my_array
    __<uuid>_<timestamp1>
        a1.tdb
        a2.tdb
        a2_var.tdb
        __fragment_metadata.tdb
    __<uuid>_<timestamp2>
        a1.tdb
        a2.tdb
        a2_var.tdb
        __fragment_metadata.tdb
    __<uuid>_<timestamp3>
        a1.tdb
        a2.tdb
        a2_var.tdb
        __fragment_metadata.tdb
        __coords.tdb
    __array_schema.tdb
    __lock.tdb
LSM-tree-like updates and consolidation
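The update scheme on this slide can be sketched as a toy model: every write produces a new immutable, timestamped fragment, reads resolve overlaps in favor of the newest fragment, and consolidation merges fragments into one. The class below is a minimal sketch under those assumptions; FragmentedArray and its methods are hypothetical names, and the real on-disk format and read algorithm are more involved.

```python
import itertools


class FragmentedArray:
    """Toy 1D array where each write becomes an immutable, timestamped fragment."""

    def __init__(self):
        self.fragments = []  # list of (timestamp, {index: value})
        self._clock = itertools.count(1)  # stand-in for real timestamps

    def write(self, cells):
        # A write never modifies older fragments; it only appends a new one.
        self.fragments.append((next(self._clock), dict(cells)))

    def read(self, index):
        # Scan from newest to oldest so the most recent write wins.
        for _, cells in sorted(self.fragments, key=lambda f: f[0], reverse=True):
            if index in cells:
                return cells[index]
        return None

    def consolidate(self):
        # Merge all fragments into one, applying them oldest first.
        if not self.fragments:
            return
        merged = {}
        for _, cells in sorted(self.fragments, key=lambda f: f[0]):
            merged.update(cells)
        newest = max(ts for ts, _ in self.fragments)
        self.fragments = [(newest, merged)]


arr = FragmentedArray()
arr.write({0: 1, 1: 2, 2: 3})   # initial write
arr.write({1: 20})              # update of one cell
arr.write({2: 30, 5: 50})       # update plus a new cell
before = [arr.read(i) for i in (0, 1, 2, 5)]
arr.consolidate()
after = [arr.read(i) for i in (0, 1, 2, 5)]
```

Reads return the same values before and after consolidation; only the number of fragments that must be scanned changes.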
The TileDB Format

Filters
Binary data across an attribute
• Tile: atomic unit of IO
• Chunk: atomic unit of filtering; each chunk fits in L1 cache
• Each chunk passes through the filter pipeline (Filter 1, Filter 2)
Filters
• Compression (gzip, zstd, …)
• Byte/Bit Shuffle
• Encryption
• Delta encoding
• Bit-width reduction
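A filter pipeline of this kind can be sketched with the Python standard library: a tile is split into chunks, and each chunk goes through delta encoding followed by compression, with the filters undone in reverse order on read. This is a minimal sketch, not TileDB's implementation; the chunk size, the choice of zlib, the 64-bit value encoding, and all function names are assumptions made for illustration.

```python
import struct
import zlib

CHUNK_VALUES = 4096  # hypothetical chunk size in values, not a TileDB default


def delta_encode(values):
    """Replace each value with its difference from the previous one."""
    prev, out = 0, []
    for v in values:
        out.append(v - prev)
        prev = v
    return out


def delta_decode(deltas):
    """Undo delta_encode with a running sum."""
    total, out = 0, []
    for d in deltas:
        total += d
        out.append(total)
    return out


def filter_chunk(values):
    """Forward pipeline for one chunk: delta encoding, then compression."""
    deltas = delta_encode(values)
    raw = struct.pack(f"<{len(deltas)}q", *deltas)  # little-endian int64
    return zlib.compress(raw)


def unfilter_chunk(blob):
    """Reverse pipeline: decompress, then undo the delta encoding."""
    raw = zlib.decompress(blob)
    deltas = struct.unpack(f"<{len(raw) // 8}q", raw)
    return delta_decode(deltas)


def write_tile(values):
    """Split a tile into chunks and filter each chunk independently."""
    return [filter_chunk(values[i:i + CHUNK_VALUES])
            for i in range(0, len(values), CHUNK_VALUES)]


def read_tile(chunks):
    out = []
    for blob in chunks:
        out.extend(unfilter_chunk(blob))
    return out


tile = list(range(0, 100_000, 5))  # 20,000 slowly increasing integers
stored = write_tile(tile)
assert read_tile(stored) == tile
```

Because every chunk is filtered on its own, chunks can be processed independently, which is what makes per-chunk parallel filtering possible.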
The TileDB Format

Cloud
• TileDB works great on AWS S3

- Just use s3://bucket-name/path/to/array instead of my_array

- No concept of directories, natural use of / in the URI

- aws s3 sync just works

- LSM-tree-based updates excellent fit for such an object store

• Adding Azure, Google Cloud and Alibaba Cloud soon
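The point about directories can be illustrated with a small sketch: on an object store, a file inside an array is just an object whose key is the array path joined with the file path using "/". The helper below is hypothetical (object_key is not a TileDB or AWS function) and uses only the standard library to split an s3:// URI into a bucket and a key.

```python
from urllib.parse import urlparse


def object_key(array_uri, *parts):
    """Return (bucket, key) for a file inside an array stored at an s3:// URI."""
    parsed = urlparse(array_uri)
    if parsed.scheme != "s3":
        raise ValueError(f"expected an s3:// URI, got {array_uri!r}")
    prefix = parsed.path.strip("/")
    # "Directories" are only a naming convention: segments joined by "/".
    key = "/".join([prefix, *parts]) if prefix else "/".join(parts)
    return parsed.netloc, key


uri = "s3://bucket-name/path/to/array"
schema = object_key(uri, "__array_schema.tdb")
attribute = object_key(uri, "__<uuid>_<timestamp>", "a1.tdb")
print(schema)
print(attribute)
```

Under this view, writing a new fragment only adds new objects under a new key prefix and never rewrites existing ones, which is one way to read the slide's claim that LSM-tree-like updates suit an object store.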
TileDB Parallelism
• Fully multi-threaded via Intel TBB

• TileDB does not rely on an external engine for parallelism (e.g., Dask)

• Thread-/Process-safety, no need for locking, multiple reader/writer model

• Parallel IO (good use of S3 multipart upload and byte range requests)

• Parallel filters

• Parallel sorting

• Parallel slicing
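As an analogy for parallel filters, the sketch below compresses independent chunks concurrently with a thread pool from the Python standard library. TileDB itself does this in its native library via Intel TBB, as the slide states; the code here is only an illustration of the idea, and the function names, chunk size, and worker count are assumptions.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_chunks(chunks, max_workers=4):
    """Compress independent chunks concurrently, preserving their order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(zlib.compress, chunks))


def decompress_chunks(blobs, max_workers=4):
    """Decompress chunks concurrently, preserving their order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(zlib.decompress, blobs))


data = bytes(range(256)) * 4096  # 1 MiB of sample bytes
chunk_size = 64 * 1024           # hypothetical chunk size
chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
blobs = compress_chunks(chunks)
assert b"".join(decompress_chunks(blobs)) == data
```

Threads are a reasonable fit here because CPython's zlib module generally releases the global interpreter lock while compressing, so chunks can be processed on several cores at once.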
APIs and Integration
• Lightweight interfaces between the TileDB C library and high-level (HL) APIs

• Zero-copying wherever possible

• Predicate push-down

• Effective partitioning (especially for sparse arrays)
Features compared: ND arrays, Sparse arrays, Compression/Filters, Parallel IO, Parallelism, S3 support, Updates, APIs
Formats compared include Zarr
LSM-tree-like chunk-based chunk-based file-based
SWMR pushed to app pushed to app
multiple multiple only Python multiple
pushed to app Blosc / pushed to app pushed to app
open-source closed-source open-source pushed to app
Apache Arrow
• In-memory columnar format

• DataFrames, limited ND array support

• Designed for fast in-memory operations

• Rich datatype support, complex objects

• Persistence through virtual memory mapping or delegated to external on-disk formats

• TileDB integration with Apache Arrow is on our roadmap!
TileDB Value to
• Manage dense and sparse data persistence using a single API

• Get the most from your modern hardware! Concurrent IO, parallel compression, accelerated encryption and more

• Easily interface with multiple different storage backends (including cloud storage) and get performance with little to no code changes

• Common format that can be leveraged by “big data” / SQL platforms and Python, R, Julia, … ecosystems
Thank You
We are hiring!
tiledb.workable.com
careers@tiledb.io
https://github.com/TileDB-Inc
pip install tiledb

The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
