Control dataset partitioning and cache to
optimize performance in Spark
Christophe Préaud & Florian Fauvarque
2
Who are we?
Christophe Préaud
Big data and distributed computing enthusiast
Christophe is a data engineer at Kelkoo Group, in charge of maintaining and evolving the big data
technology stack, developing Spark applications and providing Spark support to other teams.
Florian Fauvarque
Open-source enthusiast who loves neat, clean code and, more generally, good software
craftsmanship practices
Florian is a software engineer at Kelkoo Group, in charge of developing Spark applications that
produce analyses and product feeds for affiliate web sites.
This presentation is also available at https://aquilae.eu/snowcamp2019-spark
3
The global data-driven marketing platform that
connects consumers to products
22 countries
International presence
20 years
of ecommerce experience
4 price comparison sites
7
We are hiring!
Over 30 roles in the company
Roles in Grenoble:
• Java/Scala Developers
• Front-End Developers
• Data Scientists
• Internships
8
• 2 billion logs written per day
• 60 TB in HDFS
• 15 servers in our prod YARN cluster: 1.73 TB memory, 520 vcores
• 3300 jobs executed every day
KelkooGroup – Some numbers
9
Spark is a unified processing engine that can analyze big data using SQL, machine learning, graph
processing or real-time stream analysis: http://spark.apache.org
What is Apache Spark?
11
• Task
• Slot
• Shuffle
Spark glossary
12
• Narrow transformation (e.g. coalesce, filter, map, …)
Spark glossary
13
• Wide transformation (e.g. repartition, distinct, groupBy, ...)
Spark glossary
14
1. Partitions
2. Cache
3. Profiling
15
• What does it mean to partition data?
• To divide a single dataset into smaller manageable chunks
• →A Partition is a small piece of the total dataset
• How do the DataFrameReaders decide how to partition data?
• It depends on the reader (CSV, Parquet, ORC, ...)
• Task / Partition relationship:
• A typical Task processes a single Partition
• →The number of Partitions determines the number of Tasks needed to process the dataset
What is a partition in Spark?
16
During the first part of this presentation, we will focus mainly on...
• The number of Partitions my data is divided into
• The number of Slots I have for parallel execution
The goal is to maximize Slots usage, i.e. ensure as much as possible that
each Slot is processing a Task
What is a partition in Spark?
17
• 4 executors
• 2 cores / executor
• College Scorecards (source: catalog.data.gov) make it easier for
students to search for a college that is a good fit for them. They
can use the College Scorecard to find out more about a college's
affordability and value so they can make more informed decisions
about which college to attend.
Configuration for demo
→ 8 Slots in total
18
Partition tuning: reading a file
numPartitions: 1 → 3 min 24 (3.3 min)
19
Partition tuning: reading a file
numPartitions: 9 → 38 s / 42 s
20
Why 9 partitions?
• File size is 1.04 GB
• Max partition size is 128 MB (the default for spark.sql.files.maxPartitionBytes)
• 1.04 × 1024 / 128 = 8.32 → rounded up to 9 partitions
Partition tuning: reading a file
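The arithmetic on this slide can be checked with a small helper (plain Python; a simplification, since Spark's actual splitting also takes spark.sql.files.openCostInBytes and the default parallelism into account):

```python
import math

def estimated_partitions(file_size_bytes, max_partition_bytes=128 * 1024 * 1024):
    """Rough estimate of the number of read partitions: the file is cut
    into chunks of at most maxPartitionBytes bytes."""
    return max(1, math.ceil(file_size_bytes / max_partition_bytes))

# 1.04 GB file with the default 128 MB max partition size: 8.32, rounded up
print(estimated_partitions(int(1.04 * 1024 ** 3)))  # → 9
```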
21
Partition tuning: reading a file
• As a rule of thumb, it is advised that the number of Partitions be a multiple of the
number of Slots, so that every Slot is kept busy (i.e. assigned a Task) throughout the
processing
• With 9 Partitions and 8 Slots, we under-utilize the cluster: 7 Slots are assigned 1 Task
each and 1 Slot is assigned 2 Tasks, so during the second wave 7 of the 8 Slots sit idle
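Slot usage for a given partition count can be quantified with a toy helper (hypothetical, not a Spark API; it assumes one Task per Partition and uniform task durations):

```python
import math

def task_waves(num_partitions, num_slots):
    """Return (scheduling waves, average slot utilization)."""
    waves = math.ceil(num_partitions / num_slots)
    utilization = num_partitions / (waves * num_slots)
    return waves, utilization

print(task_waves(9, 8))  # → (2, 0.5625): the second wave runs almost empty
print(task_waves(8, 8))  # → (1, 1.0): every slot busy the whole time
```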
22
Partition tuning: reading a file
repartition(8) → numPartitions: 8 (14 s / 15 s; total: 32 s)
23
Partition tuning: reading a file
spark.sql.files.maxPartitionBytes:
The maximum number of bytes to pack into a single partition when reading files.
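For example, in PySpark the setting could be changed before the read is planned (a sketch; `spark` is an existing SparkSession and the file path is a placeholder):

```python
# Value is in bytes; here: 32 MB per partition instead of the default 128 MB
spark.conf.set("spark.sql.files.maxPartitionBytes", 32 * 1024 * 1024)

df = spark.read.csv("/data/college-scorecard.csv", header=True)
print(df.rdd.getNumPartitions())
```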
320 → numPartitions: 8; 20 s / 22 s
24
Partition tuning: reading a file
128 → numPartitions: 8; 45 s / 49 s
25
Partition tuning: repartition and coalesce
coalesce(4) vs repartition(4)
26
Partition tuning: repartition and coalesce
27
Partition tuning: repartition and coalesce
28
Partition tuning: repartition and coalesce
29
Partition tuning: repartition and coalesce
coalesce:
• Performs better: no shuffle
• Records are not evenly distributed across all partitions → risk of a skewed
dataset (i.e. a few partitions containing most of the data)
repartition:
• Extra cost because of the shuffle operation
• Ensures uniform distribution of the records across all partitions → Slot
usage will be optimal
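The trade-off can be illustrated with a toy model in plain Python (not Spark: coalesce is modeled as merging whole partitions without a shuffle, repartition as a record-level round-robin redistribution):

```python
def coalesce_sim(partitions, target):
    """Merge whole input partitions into `target` buckets, like coalesce:
    no shuffle, so uneven inputs stay uneven."""
    out = [[] for _ in range(target)]
    for i, part in enumerate(partitions):
        out[i % target].extend(part)
    return out

def repartition_sim(partitions, target):
    """Redistribute individual records round-robin, like a shuffle:
    output partitions end up (nearly) evenly sized."""
    out = [[] for _ in range(target)]
    records = (rec for part in partitions for rec in part)
    for i, rec in enumerate(records):
        out[i % target].append(rec)
    return out

skewed = [[0] * 100, [1], [2], [3]]  # 4 partitions, one holds most records
print([len(p) for p in coalesce_sim(skewed, 2)])     # → [101, 2]
print([len(p) for p in repartition_sim(skewed, 2)])  # → [52, 51]
```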
30
Partition tuning: writing a file
numPartitions: 19 → 39 s
31
Partition tuning: writing a file
coalesce(1) → 3 min 57 (3.9 min)
32
Partition tuning: writing a file
repartition(1) → 2 min 18 (1.8 min + 22 s)
33
Partition tuning: repartition or coalesce?
• If your dataset is skewed: use repartition
• If you want more partitions: use repartition
• If you want to drastically reduce the number of partitions (e.g. numPartitions = 1): use
repartition
• If your dataset is well balanced (i.e. not skewed) and you want fewer partitions (but
not drastically fewer, i.e. not fewer than the number of Slots): use coalesce
• If in doubt: use repartition
34
spark.sql.files.maxRecordsPerFile:
Maximum number of records to write out to a single file. If this value is
zero or negative, there is no limit.
Partition tuning: writing a file
35
Partition tuning: writing a file
The record count is checked for each partition (not for the whole dataset) while the partition is being written – when the
count reaches the threshold, the current file is closed and a new one is started.
for each partition {
  recordsInFile = 0
  for each record {
    if (recordsInFile >= maxRecordsPerFile) {  // e.g. 15000
      closeFile()
      openNewFile()
      recordsInFile = 0
    }
    writeRecordInFile(record)
    recordsInFile = recordsInFile + 1
  }
}
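The same roll-over logic as runnable Python (a sketch; `write_partition` is a hypothetical stand-in for Spark's per-partition file writer):

```python
def write_partition(records, max_records_per_file):
    """Split one partition's records into files of at most
    max_records_per_file records each, mirroring spark.sql.files.maxRecordsPerFile."""
    files, current = [], []
    for rec in records:
        if len(current) >= max_records_per_file:
            files.append(current)  # close the full file...
            current = []           # ...and open a new one
        current.append(rec)
    if current:                    # never fewer than one file per non-empty partition
        files.append(current)
    return files

print([len(f) for f in write_partition(range(35_000), 15_000)])  # → [15000, 15000, 5000]
```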
36
Partition tuning: writing a file
There cannot be fewer than one file per partition.
37
Wide transformation: The data required to compute the records in a single Partition may reside in many
Partitions of the parent Dataset (i.e. it triggers a shuffle operation)
Partition tuning: wide transformation
45 s / 1 min 32
38
spark.sql.shuffle.partitions:
The default number of partitions to use when shuffling data for joins or aggregations.
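The default is 200, which is often too many for small datasets; the value can be tuned per job (PySpark sketch, `spark` being an existing SparkSession):

```python
# Align the number of post-shuffle partitions with the number of Slots (8 here)
spark.conf.set("spark.sql.shuffle.partitions", 8)
```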
Partition tuning: wide transformation
39
Partition tuning: wide transformation
spark.sql.shuffle.partitions = 8 → 28 s / 1 min 17
40
1. Partitions
2. Cache
3. Profiling
41
When to use cache
• When a Dataset is re-used multiple times
• To recover quickly from a node failure
• Data scientist: training data in an iterative loop 👍
• Data analyst: most of the time no, caching only hides that the data is not organized properly 👎
• Data engineer: usually no, but it depends on the case. Benchmark before going to prod ❔
42
When to use cache
7 sec
43
When to use cache
1 min
41 sec
44
How to cache a dataset in Spark
Cache strategy: Storage Level
• NONE: no cache
• MEMORY_ONLY:
• data is cached non-serialized in memory
• if there is not enough memory: data is evicted, and rebuilt from the source when needed
• DISK_ONLY: data is serialized and stored on disk
• MEMORY_AND_DISK:
• data is cached non-serialized in memory
• if there is not enough memory: data is serialized and stored on disk
• OFF_HEAP: data is serialized and stored off-heap with Alluxio (formerly Tachyon)
45
How to cache a dataset in Spark
Cache strategy: Storage Level
• _SER suffix:
• always serialize the data in memory
• saves space, but with a serialization penalty
• _2 suffix:
• replicate each partition on 2 cluster nodes
• improves recovery time in case of node failure
NONE
DISK_ONLY DISK_ONLY_2
MEMORY_ONLY MEMORY_ONLY_2 MEMORY_ONLY_SER MEMORY_ONLY_SER_2
MEMORY_AND_DISK MEMORY_AND_DISK_2 MEMORY_AND_DISK_SER MEMORY_AND_DISK_SER_2
OFF_HEAP
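The space/CPU trade-off behind the _SER levels can be felt outside Spark with plain pickle (an analogy only; Spark uses its own serializers and size accounting):

```python
import pickle
import sys

rows = [{"id": i, "name": f"user{i}"} for i in range(1000)]

# Rough footprint of the live objects (shallow sizes only)
object_bytes = sys.getsizeof(rows) + sum(sys.getsizeof(r) for r in rows)
serialized = pickle.dumps(rows)  # the "_SER" representation

print(object_bytes, len(serialized))
# The serialized form is much smaller, but every read needs a
# pickle.loads() pass: that is the serialization penalty.
assert pickle.loads(serialized) == rows
```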
46
How to cache a dataset in Spark
Cache strategy: Storage Level
• .cache() is an alias for .persist(MEMORY_AND_DISK) on Datasets (MEMORY_ONLY on RDDs)
• Caching is lazy: trigger it with an action such as .count()
47
Broadcast variable
Useful for sharing small, immutable data with all executors
48
Broadcast variable
• spark.sql.autoBroadcastJoinThreshold: automatically optimizes join queries
when the size of one side of the join is below the threshold (default 10 MB)
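Both knobs sketched in PySpark (`spark`, `large_df`, `small_df` and the join key are placeholders):

```python
from pyspark.sql.functions import broadcast

# Raise the auto-broadcast threshold to 50 MB (value in bytes; -1 disables it)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

# Or force a broadcast join explicitly with a hint
result = large_df.join(broadcast(small_df), "product_id")
```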
1. Partitions
2. Cache
3. Profiling
50
How to Profile a Spark App?
https://github.com/criteo/babar
54
Questions?
55
Resources
• Spark official documentation: https://spark.apache.org/docs/latest/tuning.html
• Mastering Apache Spark by Jacek Laskowski: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
• Apache Spark - Best Practices and Tuning by Umberto Griffo:
https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/
• High Performance Spark by Holden Karau and Rachel Warren, O'Reilly
Control dataset partitioning and cache to optimize performances in Spark

Contenu connexe

Tendances

Less is More: 2X Storage Efficiency with HDFS Erasure Coding
Less is More: 2X Storage Efficiency with HDFS Erasure CodingLess is More: 2X Storage Efficiency with HDFS Erasure Coding
Less is More: 2X Storage Efficiency with HDFS Erasure CodingZhe Zhang
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Julien Le Dem
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewKonstantin V. Shvachko
 
(Julien le dem) parquet
(Julien le dem)   parquet(Julien le dem)   parquet
(Julien le dem) parquetNAVER D2
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemRutvik Bapat
 
Meethadoop
MeethadoopMeethadoop
MeethadoopIIIT-H
 
Snapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File SystemSnapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File SystemBhavesh Padharia
 
RAID: High-Performance, Reliable Secondary Storage
RAID: High-Performance, Reliable Secondary StorageRAID: High-Performance, Reliable Secondary Storage
RAID: High-Performance, Reliable Secondary StorageUğur Tılıkoğlu
 
Bigdata processing with Spark - part II
Bigdata processing with Spark - part IIBigdata processing with Spark - part II
Bigdata processing with Spark - part IIArjen de Vries
 
Bigdata processing with Spark
Bigdata processing with SparkBigdata processing with Spark
Bigdata processing with SparkArjen de Vries
 

Tendances (20)

Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
 
Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)
 
Less is More: 2X Storage Efficiency with HDFS Erasure Coding
Less is More: 2X Storage Efficiency with HDFS Erasure CodingLess is More: 2X Storage Efficiency with HDFS Erasure Coding
Less is More: 2X Storage Efficiency with HDFS Erasure Coding
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
 
Chapter13
Chapter13Chapter13
Chapter13
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology Overview
 
(Julien le dem) parquet
(Julien le dem)   parquet(Julien le dem)   parquet
(Julien le dem) parquet
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
HDFS Erasure Coding in Action
HDFS Erasure Coding in Action HDFS Erasure Coding in Action
HDFS Erasure Coding in Action
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 
Snapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File SystemSnapshot in Hadoop Distributed File System
Snapshot in Hadoop Distributed File System
 
HDF5 I/O Performance
HDF5 I/O PerformanceHDF5 I/O Performance
HDF5 I/O Performance
 
RAID: High-Performance, Reliable Secondary Storage
RAID: High-Performance, Reliable Secondary StorageRAID: High-Performance, Reliable Secondary Storage
RAID: High-Performance, Reliable Secondary Storage
 
Bigdata processing with Spark - part II
Bigdata processing with Spark - part IIBigdata processing with Spark - part II
Bigdata processing with Spark - part II
 
An Introduction to Hadoop
An Introduction to HadoopAn Introduction to Hadoop
An Introduction to Hadoop
 
getFamiliarWithHadoop
getFamiliarWithHadoopgetFamiliarWithHadoop
getFamiliarWithHadoop
 
Bigdata processing with Spark
Bigdata processing with SparkBigdata processing with Spark
Bigdata processing with Spark
 

Similaire à Control dataset partitioning and cache to optimize performances in Spark

Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Codemotion
 
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Mark Kromer
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudDatabricks
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudRose Toomey
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hoodAdarsh Pannu
 
Presentation by TachyonNexus & Intel at Strata Singapore 2015
Presentation by TachyonNexus & Intel at Strata Singapore 2015Presentation by TachyonNexus & Intel at Strata Singapore 2015
Presentation by TachyonNexus & Intel at Strata Singapore 2015Tachyon Nexus, Inc.
 
Tachyon: An Open Source Memory-Centric Distributed Storage System
Tachyon: An Open Source Memory-Centric Distributed Storage SystemTachyon: An Open Source Memory-Centric Distributed Storage System
Tachyon: An Open Source Memory-Centric Distributed Storage SystemTachyon Nexus, Inc.
 
Managing Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDBManaging Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDBJason Terpko
 
Memory Management Strategies - III.pdf
Memory Management Strategies - III.pdfMemory Management Strategies - III.pdf
Memory Management Strategies - III.pdfHarika Pudugosula
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable PythonTravis Oliphant
 
OVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxOVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxAishg4
 
Tachyon Presentation at AMPCamp 6 (November, 2015)
Tachyon Presentation at AMPCamp 6 (November, 2015)Tachyon Presentation at AMPCamp 6 (November, 2015)
Tachyon Presentation at AMPCamp 6 (November, 2015)Tachyon Nexus, Inc.
 
Managing data and operation distribution in MongoDB
Managing data and operation distribution in MongoDBManaging data and operation distribution in MongoDB
Managing data and operation distribution in MongoDBAntonios Giannopoulos
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Databricks
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Spark Summit
 
Storage talk
Storage talkStorage talk
Storage talkchristkv
 
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...Databricks
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Databricks
 

Similaire à Control dataset partitioning and cache to optimize performances in Spark (20)

Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
 
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
Presentation by TachyonNexus & Intel at Strata Singapore 2015
Presentation by TachyonNexus & Intel at Strata Singapore 2015Presentation by TachyonNexus & Intel at Strata Singapore 2015
Presentation by TachyonNexus & Intel at Strata Singapore 2015
 
Tachyon: An Open Source Memory-Centric Distributed Storage System
Tachyon: An Open Source Memory-Centric Distributed Storage SystemTachyon: An Open Source Memory-Centric Distributed Storage System
Tachyon: An Open Source Memory-Centric Distributed Storage System
 
Managing Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDBManaging Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDB
 
Memory Management Strategies - III.pdf
Memory Management Strategies - III.pdfMemory Management Strategies - III.pdf
Memory Management Strategies - III.pdf
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable Python
 
OVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxOVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptx
 
Tachyon Presentation at AMPCamp 6 (November, 2015)
Tachyon Presentation at AMPCamp 6 (November, 2015)Tachyon Presentation at AMPCamp 6 (November, 2015)
Tachyon Presentation at AMPCamp 6 (November, 2015)
 
Managing data and operation distribution in MongoDB
Managing data and operation distribution in MongoDBManaging data and operation distribution in MongoDB
Managing data and operation distribution in MongoDB
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
 
Storage talk
Storage talkStorage talk
Storage talk
 
Vmfs
VmfsVmfs
Vmfs
 
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat...
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
 

Dernier

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 

Dernier (20)

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 

Control dataset partitioning and cache to optimize performances in Spark

  • 1. Control dataset partitioning and cache to optimize performances in Spark Christophe Préaud & Florian Fauvarque
  • 2. 2 Who are we? Christophe Préaud Big data and distributed computing enthusiast Christophe is data engineer at Kelkoo Group, in charge of the maintenance and evolution of the big data technology stack, the development of Spark applications and the Spark support to other teams. Florian Fauvarque Opensource enthusiast, who loves neat and clean code, and more generally good software craftmanship practices Florian is software engineer at Kelkoo Group, in charge of the development of Spark applications to produce analysis and products feeds for affiliate web sites. This presentation is also available at https://aquilae.eu/snowcamp2019-spark
  • 3. 3 The global data-driven marketing platform that connects consumers to products 22 countries International presence 20 years of ecommerce experience 4 price comparison sites
  • 4. 7 We are hiring! Over 30 roles in the company Roles in Grenoble: • Java/Scala Developers • Front-End Developers • Data Scientists • Internships
  • 5. 8 • 2 billion logs written per day • 60 TB in HDFS • 15 servers in our prod YARN cluster: 1.73 TB memory, 520 vcores • 3300 jobs executed every day KelkooGroup – Some numbers
  • 6. 9 Spark is a unified processing engine that can analyze big data using SQL, machine learning, graph processing or real-time stream analysis: http://spark.apache.org What is Apache Spark?
  • 7. 11 • Task • Slot • Shuffle Spark glossary
  • 8. 12 • Narrow transformation (ex: coalesce, filter, map, …) Spark glossary
  • 9. 13 • Wide transformation (ex: repartition, distinct, groupBy, ...) Spark glossary
  • 11. 15 • What does it mean to partition data? • To divide a single dataset into smaller, manageable chunks • → A Partition is a small piece of the total dataset • How do the DataFrameReaders decide how to partition data? • It depends on the reader (CSV, Parquet, ORC, ...) • Task / Partition relationship: • A typical Task processes a single Partition • → The number of Partitions determines the number of Tasks needed to process the dataset What is a partition in Spark?
  • 12. 16 During the first part of this presentation, we will focus mainly on... • The number of Partitions my data is divided into • The number of Slots I have for parallel execution The goal is to maximize Slots usage, i.e. ensure as much as possible that each Slot is processing a Task What is a partition in Spark?
  • 13. 17 • 4 executors • 2 cores / executor • College Scorecards (source: catalog.data.gov) make it easier for students to search for a college that is a good fit for them. They can use the College Scorecard to find out more about a college's affordability and value so they can make more informed decisions about which college to attend. Configuration for demo 8
  • 14. 18 Partition tuning: reading a file 3.3 min numPartitions: 1 3 min 24
  • 15. 19 Partition tuning: reading a file 38 s numPartitions: 9 42 s
  • 16. 20 Why 9 partitions? • File size is 1.04 GB • Max partition size is 128 MB • 1.04 * 1024 / 128 = 8.32 Partition tuning: reading a file
  • 17. 21 Partition tuning: reading a file • As a rule of thumb, it is advised that the number of Partitions be a multiple of the number of Slots, so that every Slot is kept busy (i.e. assigned a Task) during the processing • With 9 Partitions and 8 Slots, we are under-utilizing 7 of the 8 Slots (7 Slots will be assigned 1 Task, 1 Slot will be assigned 2 Tasks)
  • 18. 22 Partition tuning: reading a file 14 s 15 s numPartitions: 8 32 s repartition(8)
  • 19. 23 Partition tuning: reading a file spark.sql.files.maxPartitionBytes: The maximum number of bytes to pack into a single partition when reading files. 20 s 320 numPartitions: 8 22 s
  • 20. 24 Partition tuning: reading a file 45 s 128 numPartitions: 8 49 s
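The partition counts in the two runs above (9 with the default 128 MB, 8 with 320 MB) can be reproduced with a small model of Spark's file-splitting logic. This is a simplification of `FilePartition.maxSplitBytes`: it ignores the per-file open cost that Spark adds to the total size, and assumes 8 slots of default parallelism as in the demo.

```python
import math

MB = 1024 * 1024

def num_partitions(total_bytes, max_partition_bytes,
                   default_parallelism=8, open_cost_bytes=4 * MB):
    """Approximate Spark's file-splitting logic for file-based sources."""
    # Spark caps the split size at maxPartitionBytes, but also lowers it
    # toward totalBytes / defaultParallelism so every core gets work.
    bytes_per_core = total_bytes / default_parallelism
    max_split = min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))
    return math.ceil(total_bytes / max_split)

file_size = int(1.04 * 1024 * MB)  # the 1.04 GB demo file

# Default maxPartitionBytes = 128 MB -> 9 partitions, as in the demo
print(num_partitions(file_size, 128 * MB))
# Raised to 320 MB -> split size is capped by totalBytes / parallelism
# (~133 MB), giving 8 partitions, one per slot
print(num_partitions(file_size, 320 * MB))
```

This also explains why raising `spark.sql.files.maxPartitionBytes` well above 133 MB still yields 8 partitions here: the per-core cap wins.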
  • 21. 25 Partition tuning: repartition and coalesce repartition(4)coalesce(4)
  • 25. 29 Partition tuning: repartition and coalesce • coalesce: performs better (no shuffle), but records are not evenly distributed across all partitions → risk of a skewed dataset (i.e. a few partitions containing most of the data) • repartition: extra cost because of the shuffle operation, but ensures uniform distribution of the records across all partitions → Slot usage will be optimal
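The trade-off above can be illustrated with a simplified pure-Python model (not Spark's actual implementation): coalesce only merges whole input partitions locally, so skew survives, while repartition redistributes every record (round-robin here; Spark actually uses a random/hash distribution).

```python
def coalesce(partitions, n):
    # Merge whole input partitions into n output partitions, no shuffle.
    # (Spark's real grouping also considers data locality.)
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i % n].extend(part)
    return out

def repartition(partitions, n):
    # Shuffle every record, distributing them evenly across n partitions.
    out = [[] for _ in range(n)]
    i = 0
    for part in partitions:
        for record in part:
            out[i % n].append(record)
            i += 1
    return out

# 8 skewed input partitions: two big ones, six tiny ones.
parts = [list(range(1000)), list(range(1000))] + [[0]] * 6
print([len(p) for p in coalesce(parts, 4)])     # uneven: the skew survives
print([len(p) for p in repartition(parts, 4)])  # even: ~500 records each
```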
  • 26. 30 Partition tuning: writing a file numPartitions: 19 39 s
  • 27. 31 Partition tuning: writing a file 3.9 min coalesce(1) 3 min 57
  • 28. 32 Partition tuning: writing a file 1.8 min 22 s repartition(1) 2 min 18
  • 29. 33 Partition tuning: repartition or coalesce? • If your dataset is skewed: use repartition • If you want more partitions: use repartition • If you want to drastically reduce the number of partitions (e.g. numPartitions = 1): use repartition • If your dataset is well balanced (i.e. not skewed) and you want fewer partitions (but not drastically fewer, i.e. not fewer than the number of Slots): use coalesce • If in doubt: use repartition
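The checklist above can be condensed into an illustrative helper (a mnemonic only, not a Spark API; the function name and parameters are ours):

```python
def choose_operator(skewed, current_parts, target_parts, num_slots=8):
    """Pick repartition or coalesce per the rules of thumb above."""
    if skewed:                        # rebalance a skewed dataset
        return "repartition"
    if target_parts > current_parts:  # only repartition can add partitions
        return "repartition"
    if target_parts < num_slots:      # drastic reduction, e.g. down to 1
        return "repartition"
    return "coalesce"                 # mild, balanced reduction

print(choose_operator(skewed=True, current_parts=16, target_parts=8))
print(choose_operator(skewed=False, current_parts=16, target_parts=1))
print(choose_operator(skewed=False, current_parts=16, target_parts=8))
```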
  • 30. 34 spark.sql.files.maxRecordsPerFile: Maximum number of records to write out to a single file. If this value is zero or negative, there is no limit. Partition tuning: writing a file
  • 31. 35 Partition tuning: writing a file The number of records is checked for each partition (not for the whole dataset) while the partition is being written – when it goes over the threshold, a new file is created. for each partition { for each record { numRecords ++ if (numRecords > 15000) { closeFile() openNewFile() numRecords = 1 } writeRecordInFile() } }
  • 32. 36 Partition tuning: writing a file There cannot be less than one file per partition.
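The per-partition file-rolling logic sketched above can be modeled in a few lines of runnable Python (a simulation of the behavior, not Spark's writer code): each partition is handled independently and always produces at least one file.

```python
def write_partition(records, max_records_per_file=15000):
    """Return the record count of each output file for one partition."""
    files, count = [], 0
    for _ in records:
        if count >= max_records_per_file:
            files.append(count)  # "close" the full file, start a new one
            count = 0
        count += 1               # "write" the record into the current file
    if count > 0:
        files.append(count)      # the last, possibly partial, file
    return files

# One partition of 40 000 records, limit 15 000 -> three files
print(write_partition(range(40000)))
# Two partitions of 20 000 records each -> two files per partition,
# four files in total: partitions never share a file
print([write_partition(range(20000)) for _ in range(2)])
```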
  • 33. 37 Wide transformation: The data required to compute the records in a single Partition may reside in many Partitions of the parent Dataset (i.e. it triggers a shuffle operation) Partition tuning: wide transformation 45 s 1 min 32
  • 34. 38 spark.sql.shuffle.partitions: The default number of partitions to use when shuffling data for joins or aggregations. Partition tuning: wide transformation
  • 35. 39 Partition tuning: wide transformation 28 s 1 min 17 8
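A sketch of how the demo above would apply this setting (assuming a running PySpark session named `spark`):

```python
# Hypothetical PySpark session: align the post-shuffle partition count
# with the 8 available slots instead of the default of 200.
spark.conf.set("spark.sql.shuffle.partitions", "8")
```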
  • 37. 41 When to use cache • When a Dataset is re-used multiple times • To recover quickly from a node failure • data scientist: training data in an iterative loop 👍 • data analyst: most of the time no – caching only hides that the data is not organized properly 👎 • data engineer: usually no, but it depends on the case. Benchmark before going to prod ❔
  • 39. 43 When to use cache 1 min 41 sec
  • 40. 44 How to cache a dataset in Spark Cache strategy: Storage Level • NONE: no cache • MEMORY_ONLY: • data cached non-serialized in memory • If there is not enough memory: data is evicted and, when needed, rebuilt from source • DISK_ONLY: data is serialized and stored on disk • MEMORY_AND_DISK: • data cached non-serialized in memory • If there is not enough memory: data is serialized and stored on disk • OFF_HEAP: data is serialized and stored off-heap with Alluxio (formerly Tachyon)
  • 41. 45 How to cache a dataset in Spark Cache strategy: Storage Level • _SER suffix: • Always serializes the data in memory • Saves space, but with a serialization penalty • _2 suffix: • Replicates each partition on 2 cluster nodes • Improves recovery time in case of node failure NONE DISK_ONLY DISK_ONLY_2 MEMORY_ONLY MEMORY_ONLY_2 MEMORY_ONLY_SER MEMORY_ONLY_SER_2 MEMORY_AND_DISK MEMORY_AND_DISK_2 MEMORY_AND_DISK_SER MEMORY_AND_DISK_SER_2 OFF_HEAP
  • 42. 46 How to cache a dataset in Spark Cache strategy: Storage Level • .cache() is an alias for .persist(MEMORY_AND_DISK) (for an RDD: MEMORY_ONLY) • Lazy: materialized by an action such as .count()
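The lazy semantics above can be demonstrated with a pure-Python analogy (the `Dataset` class here is ours, not Spark's): a transformation is recomputed for every action unless the result is cached, and `cache()` itself does nothing until the first action materializes it.

```python
class Dataset:
    """Toy model of a lazy, optionally cached dataset."""
    def __init__(self, compute):
        self._compute = compute   # the lazy transformation
        self._cached = None
        self._use_cache = False

    def cache(self):              # lazy, like Spark's .cache()
        self._use_cache = True
        return self

    def count(self):              # an action: triggers the computation
        if self._use_cache:
            if self._cached is None:
                self._cached = self._compute()
            return len(self._cached)
        return len(self._compute())

calls = []
ds = Dataset(lambda: calls.append(1) or list(range(10)))
ds.count(); ds.count()
print(len(calls))  # 2: recomputed for every action
calls.clear()
ds.cache()
ds.count(); ds.count()
print(len(calls))  # 1: computed once, then served from the cache
```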
  • 43. 47 Broadcast variable Useful to share small immutable data
  • 44. 48 Broadcast variable • spark.sql.autoBroadcastJoinThreshold: automatically optimizes join queries when the size of one side of the join is below the threshold (default 10 MB)
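Why broadcasting the small side helps can be shown with a simplified model of a broadcast hash join (pure Python, not Spark code): the small table is copied to every partition as a lookup table, so the big side is joined locally with no shuffle.

```python
def broadcast_join(big_partitions, small_table):
    """Join each partition of the big side against a broadcast lookup table."""
    small = dict(small_table)  # the "broadcast" copy, built once per executor
    return [
        [(k, v, small[k]) for k, v in part if k in small]  # local, map-side
        for part in big_partitions
    ]

big = [[(1, "a"), (2, "b")], [(2, "c"), (3, "d")]]  # 2 partitions
small = [(1, "x"), (2, "y")]                        # fits under the threshold
print(broadcast_join(big, small))
# [[(1, 'a', 'x'), (2, 'b', 'y')], [(2, 'c', 'y')]]
```

Note that key 3 is dropped (inner-join semantics) without any partition ever exchanging data with another.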
  • 46. 50 How to Profile a Spark App ?
  • 47. 51 How to Profile a Spark App ?
  • 48. 52 How to Profile a Spark App ?
  • 49. 53 How to Profile a Spark App ? https://github.com/criteo/babar
  • 51. 55 Resources • Spark official documentation: https://spark.apache.org/docs/latest/tuning.html • Mastering Apache Spark by Jacek Laskowski: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/ • Apache Spark - Best Practices and Tuning by Umberto Griffo: https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/ • High Performance Spark by Rachel Warren and Holden Karau, O'Reilly
