Lessons From the Field:
Applying Best Practices to Your
Apache Spark™ Applications
Silvio Fiorito
Spark Summit Europe, 2017
#EUdev5
About Databricks
TEAM: Started the Spark project (now Apache Spark) at UC Berkeley in 2009
PRODUCT: Unified Analytics Platform
MISSION: Making Big Data Simple
About Me
• Silvio Fiorito
– silvio@databricks.com
– @granturing
• Resident Solutions Architect @ Databricks
• Spark developer, trainer, consultant - using Spark since ~v0.6
• Prior to that - application security, cyber analytics, forensics, etc.
Outline
• Speeding Up File Loading & Partition Discovery
• Optimizing File Storage & Layout
• Identifying Bottlenecks In Your Queries
“Why is it taking so long to ‘load’ my DataFrame?”
“Loading” DataFrames
Spark is lazily executed, right? So why are jobs running even though there’s no action?
Datasource API
• Maps the “spark.read.load” command to the underlying data source (Parquet, CSV, ORC, JSON, etc.)
• For file sources, infers the schema from the files
– Kicks off a Spark job to parallelize
– Needs a file listing first
• Needs basic statistics for query planning (file size, partitions, etc.)
Listing Files
InMemoryFileIndex
• Discovers partitions and lists files, using a Spark job if needed (spark.sql.sources.parallelPartitionDiscovery.threshold, default 32)
• FileStatusCache caches file status (250MB default)
• Maps Hive-style partition paths into columns
• Handles partition pruning based on query filters
date=2017-01-01
date=2017-01-02
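The Hive-style layout above can be illustrated with a small sketch of what InMemoryFileIndex does: parse `key=value` path segments into column values, then prune whole directories against a query filter. Plain Python standing in for Spark's internals; the paths and the filter are hypothetical.

```python
# Sketch: Hive-style partition paths become column values, and a query
# filter prunes whole directories before any data file is read.
# (Pure Python stand-in for Spark's InMemoryFileIndex; paths are made up.)

def partition_values(path):
    """Parse 'key=value' segments out of a partition path."""
    return {
        k: v
        for seg in path.strip("/").split("/")
        if "=" in seg
        for k, v in [seg.split("=", 1)]
    }

paths = [
    "/data/events/date=2017-01-01",
    "/data/events/date=2017-01-02",
    "/data/events/date=2017-02-01",
]

# A filter like WHERE date >= '2017-01-02' prunes at the directory level:
pruned = [p for p in paths if partition_values(p)["date"] >= "2017-01-02"]
print(pruned)
```

Because the partition value lives in the path, pruning happens before listing or reading any file contents.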
Partition Discovery
1,823 partition paths to index
Partition Pruning
We only care about 4 out of 1,823 partitions!
Dealing With Many Partitions
Option 1:
• If you know exactly which partitions you want
• If you need to use files rather than Datasource Tables
• Specify the “basePath” option and load specific partition paths
• InMemoryFileIndex “pre-filters” on those paths
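A minimal sketch of the pattern: build the explicit partition paths you want so Spark only lists those directories. The path construction is plain Python; the read call is shown in a comment, and the base path and dates are hypothetical.

```python
# Build explicit partition paths so Spark only lists the directories you
# actually need, instead of indexing all 1,823 partitions.
# (Base path and dates are hypothetical.)

base = "/data/events"
wanted = ["2017-01-01", "2017-01-02", "2017-01-03", "2017-01-04"]
paths = [f"{base}/date={d}" for d in wanted]

# With a SparkSession this would be roughly:
#   df = spark.read.option("basePath", base).parquet(*paths)
# "basePath" keeps 'date' as a partition column in the resulting DataFrame
# even though only specific partition directories were loaded.
print(paths)
```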
Dealing With Many Partitions
~2 seconds vs ~43 seconds
Dealing With Many Partitions
Option 2:
• Use Datasource Tables with Spark 2.1+
• Partitions managed in the Hive metastore
• Partition pruning at the logical planning stage
• See “Scalable Partition Handling for Cloud-Native Architectures in Apache Spark 2.1”
Dealing With Many Partitions
~0.29 seconds vs ~43 seconds
Using Datasource Tables
Using SQL - External (or Unmanaged) Tables
• Created over existing data
• Partition discovery runs once
• Use “saveAsTable” or “insertInto” to add new partitions
Using Datasource Tables
Using SQL - Managed Tables
• Spark SQL manages the metadata and underlying files
• Use “saveAsTable” or “insertInto” to add new partitions
Using Datasource Tables
Using the DataFrame API
• Use “saveAsTable” or “insertInto” to add new partitions
• Works for both managed and unmanaged tables
Datasource Tables vs Files

FILES
• Partition discovery at each DataFrame creation
• Schema inferred from files
• Slower job startup time
• Only file-size statistics available
• DataFrame API only (SQL via temp views)

TABLES
• Managed, more scalable partition handling
• Schema stored in the metastore
• Faster job startup
• Additional statistics available to Catalyst (e.g. CBO)
• Use SQL or the DataFrame API
Reading CSV & JSON Files
Schema Inference
• Runs over the whole dataset
• Convenient, but SLOW... delays job startup
• Even worse with a poor file layout
– Large GZIP files
– Lots of small files
Reading CSV & JSON Files
Avoiding Schema Inference
• Easiest - use Tables!
• Otherwise, if the data is consistent, infer on a subset of files
• Save the schema for reuse (e.g. scheduled batch jobs)
• JSON - adjust samplingRatio (default 1.0)
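The save-for-reuse pattern can be sketched as follows: infer once, persist the schema as JSON, and load it on later runs. This sketch uses plain `json` on a hand-written schema dict; with PySpark the equivalent calls are `df.schema.json()` to serialize and `StructType.fromJson(json.loads(s))` to rebuild.

```python
# Save an inferred schema once and reuse it, so scheduled batch jobs skip
# inference entirely. Plain-json sketch; with PySpark you'd serialize with
# df.schema.json() and rebuild with StructType.fromJson(json.loads(s)).
import json
import os
import tempfile

schema = {  # stand-in for a schema inferred from a small sample of files
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": True, "metadata": {}},
        {"name": "date", "type": "string", "nullable": True, "metadata": {}},
    ],
}

path = os.path.join(tempfile.gettempdir(), "events_schema.json")
with open(path, "w") as f:  # first run: infer on a subset, then save
    json.dump(schema, f)

with open(path) as f:  # later runs: load instead of re-inferring
    reloaded = json.load(f)

print(reloaded == schema)
```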
Better Schema Inference
“What format, compression, and partitioning scheme should I use?”
Managing & Optimizing File Output
File Types
• Prefer columnar over text for analytical queries
• Parquet
– Column pruning
– Predicate pushdown
• CSV/JSON
– Must parse the whole row
– No column pruning
– No predicate pushdown
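The row-vs-columnar difference can be shown with a toy example: a row format must parse every field of every record, while a columnar layout lets a query read only the columns it needs. Pure Python; real formats like Parquet additionally push predicates down using per-chunk statistics.

```python
# Toy illustration of column pruning: row-oriented text must touch every
# field; a columnar layout reads just the requested column.
import csv
import io

rows = [("1", "alice", "2017-01-01"), ("2", "bob", "2017-01-02")]

# Row-oriented (CSV): extracting one column still parses the whole line.
text = "\n".join(",".join(r) for r in rows)
dates_from_csv = [rec[2] for rec in csv.reader(io.StringIO(text))]

# Column-oriented: each column stored contiguously; read only "date".
columns = {
    "id": [r[0] for r in rows],
    "name": [r[1] for r in rows],
    "date": [r[2] for r in rows],
}
dates_from_columnar = columns["date"]  # "id" and "name" never touched

print(dates_from_csv, dates_from_columnar)
```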
Managing & Optimizing File Output
Compression
• Prefer splittable compression
– LZ4, BZip2, LZO, etc.
• Parquet + Snappy or GZIP (splittable due to row groups)
– Snappy is the default in Spark 2.2+
• AVOID large GZIP text files
– Not splittable
– GC issues with wide tables
• More: “Why You Should Care about Data Layout in the Filesystem”
Managing & Optimizing File Output

PARTITIONING
• Coarse-grained filtering of input files
• Avoid over-partitioning (small files) and overly nested partitions

BUCKETING
• Writes data already hash-partitioned
• Good for joins or aggregations (avoids the shuffle)
• Must be a Datasource table
Partitioning & Bucketing
Writing Partitioned & Bucketed Data
• Each task writes one file per partition and bucket
• If data is randomly distributed across tasks... lots of small files!
• Coalesce?
– Doesn’t guarantee colocation of partition/bucket values
– Reduces parallelization of the final stage
• Repartition - incurs a shuffle but results in cleaner files
• Compaction - run a job periodically to generate clean files
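The small-files arithmetic above can be simulated: with random distribution, each write task emits a file for every distinct partition value it sees, so the file count approaches tasks × partition values; repartitioning by the partition column routes each value to one task. Pure-Python simulation; the task and record counts are hypothetical.

```python
# Why random distribution creates small files: each task writes one file
# per distinct partition value it sees (files ~= tasks x values).
# Repartitioning by the partition column collapses that to one task per
# value. (Simulation only; numbers are made up.)
import random

random.seed(0)
values = [f"date=2017-01-{d:02d}" for d in range(1, 11)]  # 10 partitions
records = [random.choice(values) for _ in range(10_000)]
n_tasks = 200

# Randomly distributed input: record i lands on task i % n_tasks, so most
# (task, partition-value) pairs occur, each producing a file.
files_random = {(i % n_tasks, v) for i, v in enumerate(records)}

# After df.repartition("date"): all records for a value share one task.
files_repartitioned = {(hash(v) % n_tasks, v) for v in set(records)}

print(len(files_random), "files vs", len(files_repartitioned))
```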
Managing File Sizes
What If Partitions Are Too Big?
• Spark 2.2 - spark.sql.files.maxRecordsPerFile (disabled by default)
• The write task is responsible for splitting records across files
– WARNING: not parallelized
• If you need parallelization, repartition with an additional key
– Use hash+mod to manage the number of files
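The hash+mod trick can be sketched like this: derive a bounded synthetic key from a high-cardinality column, then repartition on (partition column, synthetic key) so each output partition is written by at most N parallel tasks. Pure Python, with `zlib.crc32` standing in for Spark's hash function; in Spark the expression would be along the lines of `pmod(hash(col("user_id")), N)`.

```python
# hash+mod: cap the number of files per output partition while keeping
# the write parallelized. zlib.crc32 stands in for Spark's hash();
# the user-id column is hypothetical.
import zlib

N_FILES_PER_PARTITION = 8

def bucket_key(value: str) -> int:
    """Stable hash+mod of a high-cardinality value into N buckets."""
    return zlib.crc32(value.encode()) % N_FILES_PER_PARTITION

user_ids = [f"user-{i}" for i in range(1000)]
buckets = {bucket_key(u) for u in user_ids}
print(sorted(buckets))  # bounded set of bucket keys
```

Repartitioning on this extra key spreads the write across up to N tasks per partition value, instead of funneling everything through one task.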
“How can I optimize my query?”
Managing Shuffle Partitions
Spark SQL Shuffle Partitions
• Defaults to 200; used in shuffle operations
– groupBy, repartition, join, window
• Depending on your data volume, might be too small or too large
• Override with the conf spark.sql.shuffle.partitions
– One value for the whole query
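One rough way to pick a value, sketched below: aim for a target per-partition shuffle size (128MB here is an assumed target, not a Spark default), and divide your shuffle volume by it. The volumes are hypothetical; remember the single value applies to every shuffle in the query.

```python
# Rough sizing heuristic for spark.sql.shuffle.partitions: shuffle volume
# divided by a target per-partition size. Sketch only; 128MB target and
# the job volumes are assumptions, not Spark defaults.

def shuffle_partitions(shuffle_bytes: int, target_bytes: int = 128 * 1024**2) -> int:
    return max(1, round(shuffle_bytes / target_bytes))

small_job = shuffle_partitions(2 * 1024**3)  # ~2GB shuffled
large_job = shuffle_partitions(1 * 1024**4)  # ~1TB shuffled
print(small_job, large_job)  # the default of 200 fits neither well

# With a SparkSession:
#   spark.conf.set("spark.sql.shuffle.partitions", large_job)
```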
Managing Shuffle Partitions
Adaptive Execution
• Introduced in Spark 2.0
• Sets shuffle partitions based on the size of the shuffle output
• spark.sql.adaptive.enabled (default false)
• spark.sql.adaptive.shuffle.targetPostShuffleInputSize (default 64MB)
• Still in development - SPARK-9850
Unions
Understanding Unions
• Each DataFrame in a union runs independently until the shuffle
• Self-union - the DataFrame is read N times (once per union branch)
• Alternatives (depending on the use case)
– explode or flatMap
– persist the root DataFrame (read once from storage)
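The self-union pitfall can be demonstrated by analogy with a counting loader: each union branch re-runs its lineage from the source, so unioning a DataFrame with itself reads storage twice, while persisting first means one read. This is a pure-Python analogy, not Spark code.

```python
# Analogy for the self-union pitfall: each branch of a union re-executes
# its lineage, so a self-union reads the source once per branch.
# (Counting sketch, not Spark.)

reads = {"count": 0}

def load():
    """Stands in for spark.read...; counts trips to storage."""
    reads["count"] += 1
    return [1, 2, 3]

# Self-union without persist: the source is read once per branch.
union = load() + load()
reads_without_persist = reads["count"]

# "Persist": materialize once, reuse the cached result for both branches.
reads["count"] = 0
cached = load()
union = cached + cached
reads_with_persist = reads["count"]

print(reads_without_persist, "reads vs", reads_with_persist)
```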
Optimizing Query Execution
Cost-Based Optimizer
• Added in Spark 2.2
• Collects and uses per-column statistics for query planning
• Requires Datasource Tables
• See “Cost Based Optimizer in Apache Spark 2.2”
Computing Statistics for CBO
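The code screenshot for this slide isn't preserved. Computing statistics for the CBO is done with ANALYZE TABLE statements; the table and column names below are hypothetical, and with a live SparkSession each string would be passed to `spark.sql(stmt)`.

```python
# ANALYZE TABLE statements that feed the CBO (Spark 2.2+). The table and
# column names are hypothetical; with a SparkSession, run spark.sql(stmt).

table = "store_sales"
stmts = [
    # Table-level stats (row count, size):
    f"ANALYZE TABLE {table} COMPUTE STATISTICS",
    # Per-column stats (distinct count, min/max, etc.) for the CBO:
    f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS ss_quantity, ss_item_sk",
]
for s in stmts:
    print(s)
```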
TPC-DS Query 25 Without CBO
TPC-DS Query 25 With CBO
https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html
Data Skipping Index
• Added in Databricks Runtime 3.0
• Uses file-level stats to skip files based on filters
• Requires Datasource Tables
• Opt-in, no code changes
• Data Skipping Index Docs
UNIFIED ANALYTICS PLATFORM
Try Apache Spark in Databricks!
• Collaborative cloud environment
• Free version (Community Edition)
DATABRICKS RUNTIME 3.3
• Apache Spark - optimized for the cloud
• Caching and optimization layer - DBIO
• Enterprise security - DBES
Try for free today: databricks.com
Spark: The Definitive Guide (Early Release)
• http://go.databricks.com/definitive-guide-apache-spark
• Blog post on the preview of “Apache Spark: The Definitive Guide”
Thank you
Q&A
silvio@databricks.com
@granturing

More Related Content

What's hot

What's hot (20)

Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 

Similar to Lessons from the Field: Applying Best Practices to Your Apache Spark Applications with Silvio Fiorito

Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Databricks
 
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
Databricks
 

Similar to Lessons from the Field: Applying Best Practices to Your Apache Spark Applications with Silvio Fiorito (20)

Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
 
[262] netflix 빅데이터 플랫폼
[262] netflix 빅데이터 플랫폼[262] netflix 빅데이터 플랫폼
[262] netflix 빅데이터 플랫폼
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 
Apache Spark in Industry
Apache Spark in IndustryApache Spark in Industry
Apache Spark in Industry
 
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big DataDataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
 
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
Tachyon: An Open Source Memory-Centric Distributed Storage System
Tachyon: An Open Source Memory-Centric Distributed Storage SystemTachyon: An Open Source Memory-Centric Distributed Storage System
Tachyon: An Open Source Memory-Centric Distributed Storage System
 
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
Some key value stores using log-structure
Some key value stores using log-structureSome key value stores using log-structure
Some key value stores using log-structure
 
Productionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanProductionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan Chan
 
Spark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni Schiefer
 

More from Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Recently uploaded

Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Bertram Ludäscher
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
wsppdmt
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
nirzagarg
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
cnajjemba
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
q6pzkpark
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Abortion pills in Riyadh +966572737505 get cytotec
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 

Recently uploaded (20)

Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
 
Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
SR-101-01012024-EN.docx  Federal Constitution  of the Swiss ConfederationSR-101-01012024-EN.docx  Federal Constitution  of the Swiss Confederation
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 

Lessons from the Field: Applying Best Practices to Your Apache Spark Applications with Silvio Fiorito

Dealing With Many Partitions
14
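The pre-filtering that "basePath" plus explicit partition paths gives you can be sketched in plain Python. This is a toy model of how InMemoryFileIndex maps Hive-style `date=` paths to column values and prunes them against a filter; the paths, dates, and directory names are invented for illustration.

```python
# Toy model of Hive-style partition path pruning (paths/dates invented).
from datetime import date, timedelta

def partition_value(path):
    """Extract the value from a '.../date=YYYY-MM-DD' partition path."""
    return path.rstrip("/").rsplit("date=", 1)[1]

# 1,823 daily partitions, matching the example in the slides
base = "/data/events"
start = date(2012, 1, 1)
all_paths = [f"{base}/date={start + timedelta(days=i)}" for i in range(1823)]

# The query only touches 4 specific days
wanted = {"2012-01-05", "2012-01-06", "2012-01-07", "2012-01-08"}
pruned = [p for p in all_paths if partition_value(p) in wanted]

print(len(all_paths), len(pruned))  # 1823 4
```

In a real job the equivalent (hedged, names invented) would be passing only the pruned directories to the reader, e.g. `spark.read.option("basePath", "/data/events").parquet(*pruned)`, so only those four directories ever get listed while `date` still surfaces as a column.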
Dealing With Many Partitions
~2 seconds vs ~43 seconds
15
Dealing With Many Partitions
Option 2:
• Use Datasource Tables with Spark 2.1+
• Partitions managed in the Hive metastore
• Partition pruning at the logical planning stage
• See: Scalable Partition Handling for Cloud-Native Architecture in Apache Spark 2.1
16
Dealing With Many Partitions
~0.29 seconds vs ~43 seconds
17
Using Datasource Tables
Using SQL - External (or Unmanaged) Tables
• Create over existing data
• Partition discovery runs once
• Use “saveAsTable” or “insertInto” to add new partitions
18
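Why metastore-backed tables plan so much faster can be modeled in a few lines. This is a toy, in-memory stand-in for the metastore (all table, column, and path names invented): partitions are registered once at write time, so pruning becomes a lookup instead of a filesystem listing.

```python
# Toy model of metastore-managed partitions (all names invented).
# Real-world counterpart (Spark SQL, schema/paths invented):
#   CREATE TABLE events (id BIGINT, date STRING) USING parquet
#   PARTITIONED BY (date) LOCATION '/warehouse/events'
metastore = {}  # table -> {partition_value: location}

def add_partition(table, value, location):
    """What saveAsTable/insertInto effectively does: register the partition."""
    metastore.setdefault(table, {})[value] = location

def prune(table, wanted_values):
    """Planning-time pruning: a dictionary lookup, no file listing needed."""
    parts = metastore.get(table, {})
    return [loc for value, loc in parts.items() if value in wanted_values]

for day in ("2017-01-01", "2017-01-02", "2017-01-03"):
    add_partition("events", day, f"/warehouse/events/date={day}")

locations = prune("events", {"2017-01-02"})
print(locations)  # ['/warehouse/events/date=2017-01-02']
```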
Using Datasource Tables
Using SQL - Managed Tables
• Spark SQL manages metadata and underlying files
• Use “saveAsTable” or “insertInto” to add new partitions
19
Using Datasource Tables
Using DataFrame API
• Use “saveAsTable” or “insertInto” to add new partitions
Managed Table
20
Using Datasource Tables
Using DataFrame API
• Use “saveAsTable” or “insertInto” to add new partitions
Managed Table
Unmanaged Table
21
Datasource Tables vs Files
FILES
• Partition discovery at each DataFrame creation
• Infer schema from files
• Slower job startup time
• Only file-size statistics available
• Only DataFrame API (SQL with temp views)
TABLES
• Managed, more scalable partition handling
• Schema in metastore
• Faster job startup
• Additional statistics available to Catalyst (e.g. CBO)
• Use SQL or DataFrame API
22
Reading CSV & JSON Files
Schema Inference
• Runs on the whole dataset
• Convenient, but SLOW... delays job startup
• Even worse with poor file layout
– Large GZIP files
– Lots of small files
23
Reading CSV & JSON Files
Avoiding Schema Inference
• Easiest - use tables!
• Otherwise... if data is consistent, infer on a subset of files
• Save the schema for reuse (e.g. scheduled batch jobs)
• JSON - adjust samplingRatio (default 1.0)
24
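The "infer once, save the schema" pattern can be sketched as follows. This pure-Python stand-in infers crude column types from a sample and round-trips the result through JSON; with PySpark the hedged equivalent would be persisting `df.schema.json()` and rebuilding it via `StructType.fromJson`. Column names and values here are invented.

```python
# Sketch of "infer once, reuse everywhere" (column names/values invented).
import json

def infer_schema(header, sample_row):
    """Crude per-column type inference from one sampled row."""
    def infer(value):
        for cast, name in ((int, "integer"), (float, "double")):
            try:
                cast(value)
                return name
            except ValueError:
                pass
        return "string"
    return {col: infer(val) for col, val in zip(header, sample_row)}

schema = infer_schema(["id", "amount", "city"], ["42", "9.99", "London"])
saved = json.dumps(schema)       # persist once, next to the batch job
reloaded = json.loads(saved)     # later runs load this and skip inference
print(reloaded)  # {'id': 'integer', 'amount': 'double', 'city': 'string'}
```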
“What format, compression, and partitioning scheme should I use?”
28
Managing & Optimizing File Output
File Types
• Prefer columnar over text for analytical queries
• Parquet
– Column pruning
– Predicate pushdown
• CSV/JSON
– Must parse whole row
– No column pruning
– No predicate pushdown
29
Managing & Optimizing File Output
Compression
• Prefer splittable codecs
– LZ4, BZip2, LZO, etc.
• Parquet + Snappy or GZIP (splittable due to row groups)
– Snappy is the default in Spark 2.2+
• AVOID - large GZIP text files
– Not splittable
– GC issues with wide tables
• More - Why You Should Care about Data Layout in the Filesystem
30
Managing & Optimizing File Output
PARTITIONING
• Coarse-grained filtering of input files
• Avoid over-partitioning (small files) and overly nested partitions
BUCKETING
• Write data already hash-partitioned
• Good for joins or aggregations (avoids shuffle)
• Must be a Datasource table
31
Partitioning & Bucketing
Writing Partitioned & Bucketed Data
• Each task writes a file per partition and bucket
• If data is randomly distributed across tasks... lots of small files!
• Coalesce?
– Doesn’t guarantee colocation of partition/bucket values
– Reduces parallelization of the final stage
• Repartition - incurs a shuffle but results in cleaner files
• Compaction - run a job periodically to generate clean files
32
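The small-files problem above can be simulated directly. This toy model (task counts, bucket counts, and keys all invented) counts output files under the rule "each task writes one file per bucket value it holds": randomly distributed rows produce roughly tasks × buckets files, while repartitioning on the bucketing key first collapses that to about one file per bucket.

```python
# Toy simulation of bucketed-write file counts (all numbers invented).
import random

random.seed(0)
num_tasks, num_buckets = 8, 4
rows = [random.randrange(1000) for _ in range(10_000)]

def bucket(key):
    """Stand-in for the bucketing hash: bucket id for a key."""
    return hash(key) % num_buckets

# Random distribution: rows dealt round-robin across write tasks
random_tasks = [rows[i::num_tasks] for i in range(num_tasks)]
files_random = sum(len({bucket(k) for k in task}) for task in random_tasks)

# After repartitioning on the bucketing key, rows are colocated by bucket
shuffled = {}
for k in rows:
    shuffled.setdefault(bucket(k), []).append(k)
files_shuffled = len(shuffled)

# Far more files before the repartition than after
print(files_random > files_shuffled, files_shuffled)  # True 4
```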
Managing File Sizes
What If Partitions Are Too Big?
• Spark 2.2 - spark.sql.files.maxRecordsPerFile (default disabled)
• Write task is responsible for splitting records across files
– WARNING: not parallelized
• If you need parallelization - repartition with an additional key
– Use hash+mod to manage the number of files
34
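The hash+mod trick can be sketched like this: derive a salt as `hash(key) % files_per_partition`, then repartition on the partition column plus the salt so each output partition gets a bounded number of files. The column names (`id`, `date`, `salt`) and the target of 16 files are invented for illustration.

```python
# Sketch of bounding files per partition with a hash+mod salt
# (target file count and key names invented).
files_per_partition = 16

def salt(record_id):
    """Salt column: caps distinct file keys per partition at files_per_partition."""
    return hash(record_id) % files_per_partition

# However many distinct ids there are, the salt takes at most 16 values
salts = {salt(i) for i in range(100_000)}
print(len(salts))  # 16

# Hedged PySpark counterpart (pmod/hash are Spark SQL functions; columns invented):
#   df.withColumn("salt", expr("pmod(hash(id), 16)")) \
#     .repartition("date", "salt") \
#     .write.partitionBy("date").parquet(path)
```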
“How can I optimize my query?”
35
Managing Shuffle Partitions
Spark SQL Shuffle Partitions
• Defaults to 200, used in shuffle operations
– groupBy, repartition, join, window
• Depending on your data volume, might be too small or too large
• Override with conf - spark.sql.shuffle.partitions
– One value for the whole query
37
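A common way to pick a value is to size partitions against expected shuffle volume. The helper below is a rule-of-thumb sketch, not anything from Spark itself, and the 128 MB per-partition target is an assumption you should tune for your workload.

```python
# Rule-of-thumb helper (the 128 MB target is an assumption, not a Spark default):
# choose spark.sql.shuffle.partitions from the expected shuffle volume.
def shuffle_partitions(shuffle_bytes, target_bytes=128 * 1024 * 1024):
    """Ceiling of shuffle_bytes / target_bytes, with a floor of 1 partition."""
    return max(1, -(-shuffle_bytes // target_bytes))

print(shuffle_partitions(50 * 1024**3))  # 400 partitions for ~50 GB of shuffle

# Then set it once for the query:
#   spark.conf.set("spark.sql.shuffle.partitions", 400)
```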
Managing Shuffle Partitions
Adaptive Execution
• Introduced in Spark 2.0
• Sets shuffle partitions based on the size of the shuffle output
• spark.sql.adaptive.enabled (default false)
• spark.sql.adaptive.shuffle.targetPostShuffleInputSize (default 64MB)
• Still in development - SPARK-9850
39
Unions
Understanding Unions
• Each DataFrame in a union runs independently until the shuffle
• Self-union - the DataFrame is read N times (once per branch of the union)
• Alternatives (depends on use case)
– explode or flatMap
– persist the root DataFrame (read once from storage)
40
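The self-union pitfall and the flatMap alternative can be illustrated with plain lists. In this toy model (the counter and sample data are invented), `scan_source` stands in for reading from storage: the union approach scans it once per branch, while a flatMap-style expansion emits both derived rows in a single pass over the source.

```python
# Toy illustration of self-union vs flatMap (counter and data invented).
reads = {"count": 0}

def scan_source():
    """Stand-in for reading the source from storage; counts each scan."""
    reads["count"] += 1
    return [1, 2, 3]

# Self-union: two branches over the same source -> source scanned twice
union_result = [x * 10 for x in scan_source()] + [x * 100 for x in scan_source()]
assert reads["count"] == 2

# flatMap alternative: one pass emits both output rows per input record
reads["count"] = 0
flat_result = [y for x in scan_source() for y in (x * 10, x * 100)]

print(sorted(union_result) == sorted(flat_result), reads["count"])  # True 1
```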
Optimizing Query Execution
Cost Based Optimizer
• Added in Spark 2.2
• Collects and uses per-column stats for query planning
• Requires Datasource Tables
• See: Cost Based Optimizer in Apache Spark 2.2
42
TPC-DS Query 25 Without CBO
44
TPC-DS Query 25 With CBO
45
https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html
Data Skipping Index
• Added in Databricks Runtime 3.0
• Uses file-level stats to skip files based on filters
• Requires Datasource Tables
• Opt-in, no code changes
• See: Data Skipping Index Docs
46
Try Apache Spark in Databricks!
UNIFIED ANALYTICS PLATFORM
• Collaborative cloud environment
• Free version (community edition)
DATABRICKS RUNTIME 3.3
• Apache Spark - optimized for the cloud
• Caching and optimization layer - DBIO
• Enterprise security - DBES
Try for free today. databricks.com
48
Spark: The Definitive Guide (Early Release)
• http://go.databricks.com/definitive-guide-apache-spark
• Blog Post on Preview of Apache Spark: The Definitive Guide
49