Petabytes, Exabytes, and Beyond: Managing Delta Lakes for Interactive Queries at Scale

Data production continues to scale up, and the techniques for managing it need to scale too. Pipelines that process petabytes per day in turn create data lakes with exabytes of historical data. At Databricks, we help our customers turn these data lakes into gold mines of valuable information using Apache Spark. This talk covers techniques to optimize access to these data lakes using Delta Lake, including range partitioning, file-based data skipping, multi-dimensional clustering, and read-optimized files. We'll cover sample implementations and see examples of querying petabytes of data in seconds, not hours. We'll also discuss tradeoffs that data engineers deal with every day, such as read speed vs. write throughput, managing storage costs, and duplicating data to support multiple query profiles, as well as combining batch with streaming to achieve the desired query performance. After this session, you will have new ideas for managing truly massive Delta Lakes.

Chris Hoshino-Fish, Databricks
Petabytes, Exabytes, and Beyond
Managing Delta Lakes for Interactive Queries at Scale
#UnifiedDataAnalytics #SparkAISummit

Range Partitioning
• Evenly balanced partitions
• Tune partition and cluster size
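
To make "evenly balanced partitions" concrete, here is a minimal sketch in plain Python, not the Spark or Delta implementation. It assumes partition boundaries are picked at evenly spaced quantiles of a sample of the key column; the function names `range_boundaries` and `assign_partition` are hypothetical.

```python
import bisect

def range_boundaries(sample, num_partitions):
    """Pick num_partitions - 1 split points at evenly spaced quantiles of the sample."""
    ordered = sorted(sample)
    return [ordered[len(ordered) * i // num_partitions] for i in range(1, num_partitions)]

def assign_partition(value, boundaries):
    """Return the index of the range partition that holds value."""
    return bisect.bisect_right(boundaries, value)

# Skewed keys: most values are small, a few are large.
keys = [1, 1, 2, 2, 3, 3, 4, 5, 8, 13, 21, 34, 55, 89, 144, 233]
boundaries = range_boundaries(keys, 4)

# Count how many keys land in each of the four partitions.
counts = [0] * 4
for k in keys:
    counts[assign_partition(k, boundaries)] += 1
```

Fixed-width ranges over the same keys would put almost everything into the first partition; quantile-based boundaries keep the row counts close to equal even though the key distribution is skewed.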

ZOrder Indexing
• Multidimensional clustering
• Maps multiple columns to 1-dimensional binary space
• Effectiveness falls off after 3-5 columns
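
The mapping from several columns to a one-dimensional binary space can be illustrated with bit interleaving (a Morton code). This is a sketch of the general technique under the assumption that each column has already been reduced to a small non-negative integer; it is not the code Delta Lake runs, and `z_value` is a hypothetical helper.

```python
def z_value(coords, bits=8):
    """Interleave the low `bits` bits of each non-negative integer coordinate."""
    z = 0
    dims = len(coords)
    for bit in range(bits):
        for d, c in enumerate(coords):
            # Bit `bit` of dimension `d` goes to position bit * dims + d.
            z |= ((c >> bit) & 1) << (bit * dims + d)
    return z

# x = 3 (binary 011) and y = 5 (binary 101) interleave to binary 100111.
example = z_value((3, 5), bits=3)

# Sorting rows by z_value keeps rows that are close in both columns near each other.
points = [(1, 1), (0, 1), (1, 0), (0, 0)]
ordered = sorted(points, key=z_value)
```

Each added column takes a share of the high-order bits of the combined value, so the locality preserved per column shrinks as columns are added. That is one way to read the slide's note that effectiveness falls off after 3-5 columns.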

Data Skipping
• Collects metadata about files
• Uses metadata to improve query plans
• Combined with ZOrder, can reduce the data that must be read by 90% or more
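
As a rough illustration of how file-level metadata lets a query skip files, the sketch below assumes the collected metadata is a minimum and maximum value per file for the filtered column. The tuple layout, file names, and `files_to_read` are hypothetical and chosen only for the example.

```python
def files_to_read(stats, lo, hi):
    """Keep only files whose [min, max] range can overlap the query range [lo, hi]."""
    return [path for path, min_value, max_value in stats
            if max_value >= lo and min_value <= hi]

# (path, min, max) for one column, as it might look after clustering on that column.
stats = [
    ("part-0", 0, 99),
    ("part-1", 100, 199),
    ("part-2", 200, 299),
    ("part-3", 300, 399),
]

# A filter such as "value BETWEEN 120 AND 180" only needs one of the four files.
selected = files_to_read(stats, 120, 180)
```

The pruning only works well when each file covers a narrow value range, which is why the slide pairs data skipping with ZOrder: clustering narrows the per-file ranges on several columns at once.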

Tuning File Size
• Smaller files: data skipping can be more effective
• Write/read cost: performance and dollars
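
The tradeoff can be sketched with a toy model: rows sorted by id are cut into fixed-size files, and a range query reads every file whose id range overlaps the filter. The numbers and the `scan_cost` helper are illustrative assumptions, not measurements.

```python
def scan_cost(num_rows, rows_per_file, lo, hi):
    """Return (files_read, rows_read) for a range query over sorted row ids 0..num_rows-1."""
    files_read = 0
    rows_read = 0
    for start in range(0, num_rows, rows_per_file):
        end = min(start + rows_per_file, num_rows) - 1  # this file covers ids [start, end]
        if end >= lo and start <= hi:
            files_read += 1
            rows_read += end - start + 1
    return files_read, rows_read

# Same 1,000 rows and same query (ids 290..320), two file sizes.
large = scan_cost(1000, 250, 290, 320)  # table stored as 4 files
small = scan_cost(1000, 50, 290, 320)   # table stored as 20 files
```

In this model the smaller files cut the rows scanned from 250 to 100, but the table now has five times as many files to write, track, and open, which is the write/read cost side of the slide.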

Multiple Query Profiles
• Assess common query patterns
• Understand end-users’ needs
• If possible, map secondary queries to use primary index
• Create second table with different ZOrder
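
One way to picture serving several query profiles from duplicated tables is a small router that sends each query to the copy whose clustering columns best match its filters. This is a hypothetical sketch: the table names, column names, and `pick_table` are invented for illustration and do not come from the talk.

```python
def pick_table(filter_columns, tables):
    """Choose the table whose clustering columns overlap most with the query's filter columns."""
    wanted = set(filter_columns)
    return max(tables, key=lambda name: len(set(tables[name]) & wanted))

# Two copies of the same data, each clustered for a different query profile.
tables = {
    "events_by_time_user": ["event_time", "user_id"],
    "events_by_device": ["device_id", "region"],
}

primary = pick_table(["user_id", "event_time"], tables)
secondary = pick_table(["device_id"], tables)
```

When no table matches, `max` falls back to the first table in the dictionary, which mirrors the slide's advice to map secondary queries onto the primary index where possible before paying for a second copy.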

Write Throughput vs. Query Speed
• More frequent writes -> new data is unoptimized
• Trade off performance for recency
• ZOrder is incremental: it only creates a new ZCube after a threshold of new data
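
The threshold behavior can be modeled with a toy class. This is an assumption-heavy sketch of the idea on the slide, not how Delta Lake implements ZCubes: the class name, the byte threshold, and the bookkeeping are all hypothetical.

```python
class IncrementalOptimizer:
    """Toy model: newly written files stay unclustered until enough data accumulates."""

    def __init__(self, threshold_bytes):
        self.threshold_bytes = threshold_bytes
        self.pending = []  # sizes of newly written files that are not yet clustered
        self.cubes = []    # each cube is a list of file sizes clustered together

    def write(self, file_bytes):
        self.pending.append(file_bytes)

    def optimize(self):
        """Cluster pending files into a new cube only if the threshold has been reached."""
        if sum(self.pending) < self.threshold_bytes:
            return False
        self.cubes.append(self.pending)
        self.pending = []
        return True

opt = IncrementalOptimizer(threshold_bytes=100)
opt.write(40)
first_run = opt.optimize()   # below the threshold, nothing is clustered
opt.write(70)
second_run = opt.optimize()  # 110 bytes pending, so a new cube is created
```

Between optimize runs that clear the threshold, queries touching recent data read the unclustered pending files, which is the recency versus performance tradeoff in the bullets above.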

Streaming vs. Batch
• Streaming processes data incrementally, which reduces compute load, improves stability, and can reduce latency
• RocksDB StateStore to scale up state management
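
The difference between recomputing in batch and processing incrementally can be shown with a small count aggregation. This plain-Python sketch keeps its state in an in-memory `Counter`; in a real streaming job that role is played by a state store such as the RocksDB-backed one the slide mentions. The class and function names are hypothetical.

```python
from collections import Counter

def batch_counts(all_events):
    """Batch style: rescan the full history every time a fresh result is needed."""
    return Counter(all_events)

class StreamingCounts:
    """Streaming style: keep running state and fold in only the new micro-batch."""

    def __init__(self):
        self.state = Counter()

    def process(self, micro_batch):
        self.state.update(micro_batch)
        return self.state

history = ["click", "view", "click"]
new_events = ["view", "purchase"]

stream = StreamingCounts()
stream.process(history)
incremental = stream.process(new_events)   # touches only the 2 new events
recomputed = batch_counts(history + new_events)  # touches all 5 events
```

Both paths arrive at the same counts, but the streaming path does work proportional to the new data rather than to the whole history, at the price of having to manage state between micro-batches.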
