SlideShare une entreprise Scribd logo
1  sur  35
Télécharger pour lire hors ligne
Best Practices for building Robust Data Platform
with Apache Spark & Delta
Vini Jaiswal
Spark+AI Summit - June 2020
https://www.linkedin.com/in/vinijaiswal/
▪ Data Strategy
Optimizing the cost to drive Business value
▪ Performance and tuning with Delta Lake & Apache Spark
▪ Governance and security controls
Bringing it all together - A reference architecture
Agenda
Data Strategy
Data Challenges
Data Warehouse limits the potential of
intelligence
Data Volume is growing rapidly
More Variety of data -> Different
applications
Need for faster processing and scalability
Data silos limits innovation
Promise of the Data Lake
1. Collect
Everything
2. Store it all in
the Data Lake
🔥
🔥🔥
3. Data
Science &
Machine
Learning
🔥
🔥
Usual Data Lake
Garbage
In
Garbage
Out
Garbage Stored
Ideal Data Lake with
Ideal data lakes with
No atomicity
No quality enforcement
No consistency /
isolation
✗ Reliability - High Quality Data
● Schema Enforcement
● ACID Transactions
● Time Travel
● Open Standards, Open Source
● Powered by
● Unifies Streaming / Batch
Usual Data Lake
References: https://youtu.be/qtCxNSmTejk
Getting the Data Right
Audience Segmentation
CSV,
JSON, TXT…
Data
Types
Business-level
Aggregates
Filtered, Cleaned
Augmented
Raw
Ingestion
Bronze Silver Gold
Table
Categorization
Align with
Business
Outcomes
Is my data use
case worthy?
Is my data ready
for Analytics / ML?
Optimizing the Cost to Drive Business Value
Best Practices for Cluster Sizing & Selection
1. Selection of Instance Types
a. Workload type
b. Use cases
2. Selection of node size
a. Observe Metrics
b. Tweak workloads
Best Practices for Cluster Sizing & Selection
1. Selection of Instance Types
a. Workload type
b. Use cases
2. Selection of node size
a. Observe Metrics
b. Tweak workloads
Workload Type AWS
Type
Azure
Type
Recommended Use Case
Memory Optimized r5 Dsv2 Memory-intensive applications
Use Case: ML workload with data caching
Compute Optimized c5 Fsv2 Structured Streaming, Distributed Analytics, Data
Science Applications
Use Case: ETL with full file scans and no data reuse
Storage Optimized i3 Lsv2 Use cases that require higher disk throughput and IO
Use Case: Analytics - Storage Optimized i3 class with
Delta IO Cache
Selection of Instance Types
Reference for Azure Type:: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes
Reference for AWS Type::https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html
Best Practices for Cluster Sizing & Selection
1. Selection of Instance Types
a. Workload type
b. Use cases
2. Selection of node size
a. Observe Metrics
b. Tweak workloads
Selection of node size
Rule of thumb
1. Fewer big instances > more small instances
a. (larger heap = larger GC)
b. Multiple executors per machine
2. Size based on the number of tasks initially, tweak later
a. Run the job with a small cluster to get idea of # of tasks
b. Observe Cluster metrics for CPU, memory and network utilization
Observe Spark UI & tweak the workloads
Fully cached with room to spare?
> decrease instances
Almost completely cached?
> Increase cluster size
Not even close to cached?
> Consider instance with SSD
instead of EBS or use R class
Reference: https://docs.databricks.com/delta/optimizations/delta-cache.html#delta-and-rdd-cache-comparison
Observe Ganglia Metrics & tweak the workloads
○ Are we compute bound?
○ Are we network bound?
○ Are we spilling a ton?
Performance and Tuning with
Delta Lake & Apache Spark
Performance Symptoms
Look for these 4 symptoms
Shuffle
Spill
Skew
Small Files
Can I make
Spark application run faster?
Use broadcast join
Review Join order
I found Shuffle, now what?
Query completion time
28 Minutes
Sort Merge Join
rows
output:
2,509,189,31
3
Before
1.8 Minutes
rows
output:
1023
After
Reference: https://spark.apache.org/docs/latest/sql-
performance-tuning.html#broadcast-hint-for-sql-queries
● Increase Shuffle Partitions
(for this example: 48)
● Reduce the number of cores
spark.executor.cores < total
cores per worker
● Larger cluster - faster disk
SSDs
Shuffle Partitions = 16
I found Spill, now what?
set spark.sql.shuffle.partitions=48
More spill you can remove, larger
the impact!
Symptom
● Ganglia CPU usage becomes low for long time after
initial high usage
● Task duration -> Significant difference in max than
75% and 25% values
● Input Size/Records
What to do?
● Use broadcast join
● Use Skew Join
● Filter out large keys/salt keys and set
up multiple reduce steps
● Explicitly repartition the data on a
different field
I found Skew, now what?
Reference: https://docs.databricks.com/delta/join-performance/skew-join.html
Adaptive Query Execution
Reduced manual effort of tuning spark.sql.shuffle.partitions
By default it is turned off, Set spark.sql.adaptive.enabled=true
Dynamically change sort-merge join into broadcast-hash join
▪ Dynamically optimizing skew joins
*Available in DBR 7.x/Spark 3.0
Upstream
● Fix the upstream application building tons of files
● Use a seperate tool to compact them before
processing with Spark
Changes in Spark Application
● Write your own compaction job
● Delta solves this problem!
I found a lot of small files, now what?
Achieving Performance with
Compaction
● Improves the Read
Performance
● Solves Small Files problem
Reference: https://docs.delta.io/latest/best-practices.html#compact-files
● Optimizes Apache Spark partition
● Maximizes the throughput of data being
written
● Compacts files for partitions
Auto Optimize
Auto Optimize consists of two complementary features:
Optimized Writes and Auto Compaction.
Reference: https://docs.databricks.com/delta/optimizations/auto-optimize.html#auto-optimize
Reference:https://docs.databricks.com/delta/optimizations/file-mgmt.html#z-ordering-multi-dimensional-clustering
Z-order Sorting
0 1 2 3 4 5 6 7
0
1
2
3
4
5
6
7
Z-Ordering
A technique to colocate related information in the same set of files
● Safely skips more data
● Faster queries
Governance & Security Controls
Data Governance with Delta Lake
Create retention policy to age out and
erase raw data that may contain
personal information
High Level Aggregates
(e.g. # of users that took an action)
Historical Data Repository
● Easy to navigate
● Pseudonymization
Data Lake
Satisfy Compliance requests using
UPDATE / DELETE commands
Create tables that don't contain
personal data
Reference: https://www.youtube.com/watch?v=tCPslvUjG1w
Audit & Monitoring
▪ Use cluster tags for chargeback
▪ Audit logs
▪ Monitor Databricks DBU usage
▪ Delta Transactional Logs
Governance - The Who/What/Where
Perform standard extraction,
transformation and loading
tasks (ETL) and apply best
coding practices including
source control, unit test, and
automation
drives product innovation
with state-of-the-art
Machine Learning models
applied to big data
Improves business process
through providing standardized
and ad-hoc business analysis.
Acts as intermediary between
Analytics and Business team
Performs automated jobs based
on Data Engineering configs.
Data Scientist Data Engineer Data/Business
Analyst Automated Jobs
Many players in the Org. Managing Access, roles and responsibilities, as well as managing usage is a must.
Business Unit
Serving
Operations
& Security
Data Science & MLIngest
OrchestrationCI/CD
Bringing it together - A reference pipeline
APIs
Jobs
Models
Notebooks
Dashboards
ML Runtime
Delta Pipelines
BLOB
DB/DW
Streaming
Massively scalable data cleansing & transformation
ETL/Data
Processing
Bronze
Silver
Gold
Execution
Databricks Runtime
Reliability & Performance
Optimized Spark
Clusters
Storage
Business Unit
Serving
Operations
& Security
Data Science & MLIngest
OrchestrationCI/CD
Bringing it together - A reference pipeline
APIs
Jobs
Models
Notebooks
Dashboards
ML Runtime
Delta Pipelines
BLOB
DB/DW
Streaming
Massively scalable data cleansing & transformation
ETL/Data
Processing
Bronze
Silver
Gold
Execution
Databricks Runtime
Reliability & Performance
Optimized Spark
Clusters
Storage
Data Strategy
Cost Optimization &
Performance Tuning
Business Value
Security
THANK YOU!!!
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.

Contenu connexe

Tendances

AWS Lake Formation Deep Dive
AWS Lake Formation Deep DiveAWS Lake Formation Deep Dive
AWS Lake Formation Deep DiveCobus Bernard
 
Building Data Intensive Analytic Application on Top of Delta Lakes
Building Data Intensive Analytic Application on Top of Delta LakesBuilding Data Intensive Analytic Application on Top of Delta Lakes
Building Data Intensive Analytic Application on Top of Delta LakesDatabricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks FundamentalsDalibor Wijas
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
 
Data platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptxData platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptxCalvinSim10
 
Accelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderAccelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderDatabricks
 
Getting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksGetting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksKnoldus Inc.
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBill Liu
 
Databricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With DataDatabricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With DataDatabricks
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseDatabricks
 
Building a Data Science as a Service Platform in Azure with Databricks
Building a Data Science as a Service Platform in Azure with DatabricksBuilding a Data Science as a Service Platform in Azure with Databricks
Building a Data Science as a Service Platform in Azure with DatabricksDatabricks
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...DataScienceConferenc1
 
Introducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseIntroducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseJames Serra
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudMark Kromer
 
Azure Data Factory Data Flow
Azure Data Factory Data FlowAzure Data Factory Data Flow
Azure Data Factory Data FlowMark Kromer
 
Containerized Stream Engine to Build Modern Delta Lake
Containerized Stream Engine to Build Modern Delta LakeContainerized Stream Engine to Build Modern Delta Lake
Containerized Stream Engine to Build Modern Delta LakeDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Building Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta LakeBuilding Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta LakeDatabricks
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardParis Data Engineers !
 

Tendances (20)

AWS Lake Formation Deep Dive
AWS Lake Formation Deep DiveAWS Lake Formation Deep Dive
AWS Lake Formation Deep Dive
 
Building Data Intensive Analytic Application on Top of Delta Lakes
Building Data Intensive Analytic Application on Top of Delta LakesBuilding Data Intensive Analytic Application on Top of Delta Lakes
Building Data Intensive Analytic Application on Top of Delta Lakes
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Data platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptxData platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptx
 
Accelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderAccelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks Autoloader
 
Getting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksGetting Started with Delta Lake on Databricks
Getting Started with Delta Lake on Databricks
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
 
Databricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With DataDatabricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With Data
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Building a Data Science as a Service Platform in Azure with Databricks
Building a Data Science as a Service Platform in Azure with DatabricksBuilding a Data Science as a Service Platform in Azure with Databricks
Building a Data Science as a Service Platform in Azure with Databricks
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
 
Introducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseIntroducing Azure SQL Data Warehouse
Introducing Azure SQL Data Warehouse
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the Cloud
 
Azure Data Factory Data Flow
Azure Data Factory Data FlowAzure Data Factory Data Flow
Azure Data Factory Data Flow
 
Containerized Stream Engine to Build Modern Delta Lake
Containerized Stream Engine to Build Modern Delta LakeContainerized Stream Engine to Build Modern Delta Lake
Containerized Stream Engine to Build Modern Delta Lake
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Building Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta LakeBuilding Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta Lake
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
 

Similaire à Best Practices for Building Robust Data Platform with Apache Spark and Delta

Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayC4Media
 
ADV Slides: Comparing the Enterprise Analytic Solutions
ADV Slides: Comparing the Enterprise Analytic SolutionsADV Slides: Comparing the Enterprise Analytic Solutions
ADV Slides: Comparing the Enterprise Analytic SolutionsDATAVERSITY
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...javier ramirez
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMark Kromer
 
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...Cisco DevNet
 
Estimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics PlatformEstimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics PlatformDATAVERSITY
 
Sql server-performance-hafi
Sql server-performance-hafiSql server-performance-hafi
Sql server-performance-hafizabi-babi
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Bhupesh Bansal
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop User Group
 
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevMigration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevAltinity Ltd
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudAmazon Web Services
 
Ssis Best Practices Israel Bi U Ser Group Itay Braun
Ssis Best Practices   Israel Bi U Ser Group   Itay BraunSsis Best Practices   Israel Bi U Ser Group   Itay Braun
Ssis Best Practices Israel Bi U Ser Group Itay Braunsqlserver.co.il
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native PlatformSunil Govindan
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native PlatformSunil Govindan
 
Building a High Performance Analytics Platform
Building a High Performance Analytics PlatformBuilding a High Performance Analytics Platform
Building a High Performance Analytics PlatformSantanu Dey
 
Infrastructure Considerations : Design : "webops"
Infrastructure Considerations : Design : "webops"Infrastructure Considerations : Design : "webops"
Infrastructure Considerations : Design : "webops"Piyush Kumar
 
Black Friday and Cyber Monday- Best Practices for Your E-Commerce Database
Black Friday and Cyber Monday- Best Practices for Your E-Commerce DatabaseBlack Friday and Cyber Monday- Best Practices for Your E-Commerce Database
Black Friday and Cyber Monday- Best Practices for Your E-Commerce DatabaseTim Vaillancourt
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 

Similaire à Best Practices for Building Robust Data Platform with Apache Spark and Delta (20)

Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 
11g R2
11g R211g R2
11g R2
 
ADV Slides: Comparing the Enterprise Analytic Solutions
ADV Slides: Comparing the Enterprise Analytic SolutionsADV Slides: Comparing the Enterprise Analytic Solutions
ADV Slides: Comparing the Enterprise Analytic Solutions
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data Analytics
 
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
 
Estimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics PlatformEstimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics Platform
 
Sql server-performance-hafi
Sql server-performance-hafiSql server-performance-hafi
Sql server-performance-hafi
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevMigration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
 
Ssis Best Practices Israel Bi U Ser Group Itay Braun
Ssis Best Practices   Israel Bi U Ser Group   Itay BraunSsis Best Practices   Israel Bi U Ser Group   Itay Braun
Ssis Best Practices Israel Bi U Ser Group Itay Braun
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
 
Building a High Performance Analytics Platform
Building a High Performance Analytics PlatformBuilding a High Performance Analytics Platform
Building a High Performance Analytics Platform
 
Infrastructure Considerations : Design : "webops"
Infrastructure Considerations : Design : "webops"Infrastructure Considerations : Design : "webops"
Infrastructure Considerations : Design : "webops"
 
Black Friday and Cyber Monday- Best Practices for Your E-Commerce Database
Black Friday and Cyber Monday- Best Practices for Your E-Commerce DatabaseBlack Friday and Cyber Monday- Best Practices for Your E-Commerce Database
Black Friday and Cyber Monday- Best Practices for Your E-Commerce Database
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
Serverless SQL
Serverless SQLServerless SQL
Serverless SQL
 

Plus de Databricks

Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionDatabricks
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityDatabricks
 
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + FugueIntuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + FugueDatabricks
 

Plus de Databricks (20)

Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and Quality
 
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + FugueIntuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
 

Dernier

CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxolyaivanovalion
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...amitlee9823
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
ELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxolyaivanovalion
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 

Dernier (20)

Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
ELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 

Best Practices for Building Robust Data Platform with Apache Spark and Delta

  • 1. Best Practices for building Robust Data Platform with Apache Spark & Delta Vini Jaiswal Spark+AI Summit - June 2020 https://www.linkedin.com/in/vinijaiswal/
  • 2. ▪ Data Strategy Optimizing the cost to drive Business value ▪ Performance and tuning with Delta Lake & Apache Spark ▪ Governance and security controls Bringing it all together - A reference architecture Agenda
  • 4. Data Challenges Data Warehouse limits the potential of intelligence Data Volume is growing rapidly More Variety of data -> Different applications Need for faster processing and scalability Data silos limits innovation Promise of the Data Lake 1. Collect Everything 2. Store it all in the Data Lake 🔥 🔥🔥 3. Data Science & Machine Learning 🔥 🔥
  • 7. Ideal data lakes with No atomicity No quality enforcement No consistency / isolation ✗ Reliability - High Quality Data ● Schema Enforcement ● ACID Transactions ● Time Travel ● Open Standards, Open Source ● Powered by ● Unifies Streaming / Batch Usual Data Lake References: https://youtu.be/qtCxNSmTejk
  • 8. Getting the Data Right Audience Segmentation CSV, JSON, TXT… Data Types Business-level Aggregates Filtered, Cleaned Augmented Raw Ingestion Bronze Silver Gold Table Categorization Align with Business Outcomes Is my data use case worthy? Is my data ready for Analytics / ML?
  • 9. Optimizing the Cost to Drive Business Value
  • 10. Best Practices for Cluster Sizing & Selection 1. Selection of Instance Types a. Workload type b. Use cases 2. Selection of node size a. Observe Metrics b. Tweak workloads
  • 11. Best Practices for Cluster Sizing & Selection 1. Selection of Instance Types a. Workload type b. Use cases 2. Selection of node size a. Observe Metrics b. Tweak workloads
  • 12. Workload Type AWS Type Azure Type Recommended Use Case Memory Optimized r5 Dsv2 Memory-intensive applications Use Case: ML workload with data caching Compute Optimized c5 Fsv2 Structured Streaming, Distributed Analytics, Data Science Applications Use Case: ETL with full file scans and no data reuse Storage Optimized i3 Lsv2 Use cases that require higher disk throughput and IO Use Case: Analytics - Storage Optimized i3 class with Delta IO Cache Selection of Instance Types Reference for Azure Type:: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes Reference for AWS Type::https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html
  • 13. Best Practices for Cluster Sizing & Selection 1. Selection of Instance Types a. Workload type b. Use cases 2. Selection of node size a. Observe Metrics b. Tweak workloads
  • 14. Selection of node size Rule of thumb 1. Fewer big instances > more small instances a. (larger heap = larger GC) b. Multiple executors per machine 2. Size based on the number of tasks initially, tweak later a. Run the job with a small cluster to get idea of # of tasks b. Observe Cluster metrics for CPU, memory and network utilization
  • 15. Observe Spark UI & tweak the workloads Fully cached with room to spare? > decrease instances Almost completely cached? > Increase cluster size Not even close to cached? > Consider instance with SSD instead of EBS or use R class Reference: https://docs.databricks.com/delta/optimizations/delta-cache.html#delta-and-rdd-cache-comparison
  • 16. Observe Ganglia Metrics & tweak the workloads ○ Are we compute bound? ○ Are we network bound? ○ Are we spilling a ton?
  • 17. Performance and Tuning with Delta Lake & Apache Spark
  • 18. Performance Symptoms Look for these 4 symptoms Shuffle Spill Skew Small Files Can I make Spark application run faster?
  • 19. Use broadcast join Review Join order I found Shuffle, now what? Query completion time 28 Minutes Sort Merge Join rows output: 2,509,189,31 3 Before 1.8 Minutes rows output: 1023 After Reference: https://spark.apache.org/docs/latest/sql- performance-tuning.html#broadcast-hint-for-sql-queries
  • 20. ● Increase Shuffle Partitions (for this example: 48) ● Reduce the number of cores spark.executor.cores < total cores per worker ● Larger cluster - faster disk SSDs Shuffle Partitions = 16 I found Spill, now what? set spark.sql.shuffle.partitions=48 More spill you can remove, larger the impact!
  • 21. Symptom ● Ganglia CPU usage becomes low for long time after initial high usage ● Task duration -> Significant difference in max than 75% and 25% values ● Input Size/Records What to do? ● Use broadcast join ● Use Skew Join ● Filter out large keys/salt keys and set up multiple reduce steps ● Explicitly repartition the data on a different field I found Skew, now what? Reference: https://docs.databricks.com/delta/join-performance/skew-join.html
  • 22. Adaptive Query Execution Reduced manual effort of tuning spark.sql.shuffle.partitions By default it is turned off, Set spark.sql.adaptive.enabled=true Dynamically change sort-merge join into broadcast-hash join ▪ Dynamically optimizing skew joins *Available in DBR 7.x/Spark 3.0
  • 23. Upstream ● Fix the upstream application building tons of files ● Use a seperate tool to compact them before processing with Spark Changes in Spark Application ● Write your own compaction job ● Delta solves this problem! I found a lot of small files, now what?
  • 25. Compaction ● Improves the Read Performance ● Solves Small Files problem Reference: https://docs.delta.io/latest/best-practices.html#compact-files
  • 26. ● Optimizes Apache Spark partition ● Maximizes the throughput of data being written ● Compacts files for partitions Auto Optimize Auto Optimize consists of two complementary features: Optimized Writes and Auto Compaction. Reference: https://docs.databricks.com/delta/optimizations/auto-optimize.html#auto-optimize
  • 27. Reference:https://docs.databricks.com/delta/optimizations/file-mgmt.html#z-ordering-multi-dimensional-clustering Z-order Sorting 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 Z-Ordering A technique to colocate related information in the same set of files ● Safely skips more data ● Faster queries
  • 29. Data Governance with Delta Lake Create retention policy to age out and erase raw data that may contain personal information High Level Aggregates (e.g. # of users that took an action) Historical Data Repository ● Easy to navigate ● Pseudonymization Data Lake Satisfy Compliance requests using UPDATE / DELETE commands Create tables that don't contain personal data Reference: https://www.youtube.com/watch?v=tCPslvUjG1w
  • 30. Audit & Monitoring ▪ Use cluster tags for chargeback ▪ Audit logs ▪ Monitor Databricks DBU usage ▪ Delta Transactional Logs
  • 31. Governance - The Who/What/Where Perform standard extraction, transformation and loading tasks (ETL) and apply best coding practices including source control, unit test, and automation drives product innovation with state-of-the-art Machine Learning models applied to big data Improves business process through providing standardized and ad-hoc business analysis. Acts as intermediary between Analytics and Business team Performs automated jobs based on Data Engineering configs. Data Scientist Data Engineer Data/Business Analyst Automated Jobs Many players in the Org. Managing Access, roles and responsibilities, as well as managing usage is a must.
  • 32. Business Unit Serving Operations & Security Data Science & MLIngest OrchestrationCI/CD Bringing it together - A reference pipeline APIs Jobs Models Notebooks Dashboards ML Runtime Delta Pipelines BLOB DB/DW Streaming Massively scalable data cleansing & transformation ETL/Data Processing Bronze Silver Gold Execution Databricks Runtime Reliability & Performance Optimized Spark Clusters Storage
  • 33. Business Unit Serving Operations & Security Data Science & MLIngest OrchestrationCI/CD Bringing it together - A reference pipeline APIs Jobs Models Notebooks Dashboards ML Runtime Delta Pipelines BLOB DB/DW Streaming Massively scalable data cleansing & transformation ETL/Data Processing Bronze Silver Gold Execution Databricks Runtime Reliability & Performance Optimized Spark Clusters Storage Data Strategy Cost Optimization & Performance Tuning Business Value Security
  • 35. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.