Slide 2: Who are we?
Christophe Préaud
Big data and distributed computing enthusiast.
Christophe is a data engineer at Kelkoo Group, in charge of maintaining and evolving the big data technology stack, developing Spark applications, and providing Spark support to other teams.
Florian Fauvarque
Open-source enthusiast who loves neat, clean code and, more generally, good software craftsmanship practices.
Florian is a software engineer at Kelkoo Group, in charge of developing Spark applications that produce analyses and product feeds for affiliate web sites.
This presentation is also available at https://aquilae.eu/snowcamp2019-spark
Slide 3: Kelkoo Group
The global data-driven marketing platform that connects consumers to products.
• 22 countries: international presence
• 20 years of e-commerce experience
• 4 price comparison sites
Slide 4: We are hiring!
Over 30 roles in the company
Roles in Grenoble:
• Java/Scala Developers
• Front-End Developers
• Data Scientists
• Internships
Slide 5: Kelkoo Group – Some numbers
• 2 billion logs written per day
• 60 TB in HDFS
• 15 servers in our prod YARN cluster: 1.73 TB memory, 520 vcores
• 3,300 jobs executed every day
Slide 6: What is Apache Spark?
Spark is a unified processing engine that can analyze big data using SQL, machine learning, graph processing, or real-time stream analysis: http://spark.apache.org
Slide 11: What is a partition in Spark?
• What does it mean to partition data?
• To divide a single dataset into smaller, manageable chunks
• → A Partition is a small piece of the total dataset
• How do the DataFrameReaders decide how to partition data?
• It depends on the reader (CSV, Parquet, ORC, ...)
• Task / Partition relationship:
• A typical Task processes a single Partition
• → The number of Partitions determines the number of Tasks needed to process the dataset (see the sketch below)
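For illustration, a minimal spark-shell sketch (the CSV path is a hypothetical example) that reads a file and checks how many partitions the reader produced:

// In spark-shell, `spark` (a SparkSession) is already available.
// The CSV reader splits the input into partitions; each partition
// will typically be processed by one task.
val df = spark.read.option("header", "true").csv("/data/college-scorecard.csv")
println(s"numPartitions: ${df.rdd.getNumPartitions}")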
Slide 12: What is a partition in Spark?
During the first part of this presentation, we will focus mainly on...
• the number of Partitions my data is divided into
• the number of Slots I have for parallel execution
The goal is to maximize Slot usage, i.e. to ensure as much as possible that each Slot is processing a Task.
Slide 13: Configuration for demo
• 4 executors
• 2 cores / executor
• → 8 Slots for parallel execution
• Dataset: College Scorecard (source: catalog.data.gov). College Scorecards make it easier for students to search for a college that is a good fit for them. They can use the College Scorecard to find out more about a college's affordability and value so they can make more informed decisions about which college to attend.
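A sketch of how such a session could be configured (the app name is an illustrative assumption, not from the slides):

import org.apache.spark.sql.SparkSession

// Request 4 executors with 2 cores each, i.e. 4 x 2 = 8 Slots
// available for parallel task execution (the demo configuration).
val spark = SparkSession.builder()
  .appName("partition-tuning-demo")
  .config("spark.executor.instances", "4")
  .config("spark.executor.cores", "2")
  .getOrCreate()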
Slide 16: Partition tuning: reading a file
Why 9 partitions?
• File size is 1.04 GB
• Max partition size is 128 MB (the spark.sql.files.maxPartitionBytes default)
• 1.04 * 1024 / 128 = 8.32, which rounds up to 9 partitions
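The same arithmetic, spelled out in Scala (this is the slide's back-of-the-envelope estimate; Spark's actual split logic also accounts for a small per-file open cost):

// Default spark.sql.files.maxPartitionBytes: 128 MB.
val maxPartitionBytes = 128L * 1024 * 1024
val fileSizeBytes = (1.04 * 1024 * 1024 * 1024).toLong // 1.04 GB

// 8.32 partitions are needed, so Spark rounds up to 9.
val numPartitions = math.ceil(fileSizeBytes.toDouble / maxPartitionBytes).toInt
println(numPartitions) // 9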
Slide 17: Partition tuning: reading a file
• As a rule of thumb, it is advised that the number of Partitions be a multiple of the number of Slots, so that every Slot is used (i.e. assigned a Task) throughout the processing.
• With 9 Partitions and 8 Slots, we are under-utilizing 7 of the 8 Slots: 7 Slots are assigned 1 Task and 1 Slot is assigned 2 Tasks, so 7 Slots sit idle while the 9th Task runs.
Slide 19: Partition tuning: reading a file
spark.sql.files.maxPartitionBytes: the maximum number of bytes to pack into a single partition when reading files.
[Demo screenshots: run times of 20 s and 22 s; numPartitions: 8]
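A sketch of raising this limit so the 1.04 GB file is read as 8 partitions, one per Slot (the 140 MB value and the path are illustrative choices; assumes spark-shell):

// 1.04 GB / 8 Slots ≈ 134 MB per partition, so raise the limit
// slightly above the 128 MB default (illustrative value).
spark.conf.set("spark.sql.files.maxPartitionBytes", 140L * 1024 * 1024)

val df = spark.read.option("header", "true").csv("/data/college-scorecard.csv")
println(s"numPartitions: ${df.rdd.getNumPartitions}") // 8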
Slide 25: Partition tuning: repartition and coalesce
coalesce:
• Performs better: no shuffle
• Records are not evenly distributed across all partitions → risk of a skewed dataset (i.e. a few partitions containing most of the data)
repartition:
• Extra cost because of the shuffle operation
• Ensures uniform distribution of the records across all partitions → Slot usage will be optimal
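A minimal sketch contrasting the two calls (assumes a Dataset `df`, e.g. the one read above):

// coalesce: merges existing partitions without a shuffle; it can only
// lower the partition count and may leave partitions unbalanced.
val coalesced = df.coalesce(4)

// repartition: triggers a full shuffle; it redistributes records evenly
// and can either raise or lower the partition count.
val repartitioned = df.repartition(16)

println(coalesced.rdd.getNumPartitions)     // 4
println(repartitioned.rdd.getNumPartitions) // 16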
Slide 29: Partition tuning: repartition or coalesce?
• If your dataset is skewed: use repartition
• If you want more partitions: use repartition (coalesce can only reduce the number)
• If you want to drastically reduce the number of partitions (e.g. numPartitions = 1): use repartition (see the sketch after this list)
• If your dataset is well balanced (i.e. not skewed) and you want fewer partitions (but not drastically fewer, i.e. not fewer than the number of Slots): use coalesce
• If in doubt: use repartition
Slide 31: Partition tuning: writing a file
The number of records is checked for each partition (not for the whole dataset) while the partition is being written: when it goes over the threshold, a new file is created.

for each partition {
  numRecords = 0
  for each record {
    numRecords++
    if (numRecords > 15000) {
      closeFile()
      openNewFile()
      numRecords = 1 // count the current record against the new file
    }
    writeRecordInFile()
  }
}
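This threshold maps to the DataFrameWriter's maxRecordsPerFile option (available since Spark 2.2); the 15,000 value, output path, and format below are just illustrative choices:

// Cap each output file at 15,000 records while writing every partition.
df.write
  .option("maxRecordsPerFile", 15000)
  .parquet("/output/college-scorecard")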
Slide 33: Partition tuning: wide transformation
Wide transformation: the data required to compute the records in a single Partition may reside in many Partitions of the parent Dataset (i.e. it triggers a shuffle operation).
[Demo screenshots: run times of 45 s and 1 min]
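Shuffles produced by wide transformations create spark.sql.shuffle.partitions output partitions (200 by default), which is the usual knob here. A sketch, assuming spark-shell and a hypothetical `state` column in `df`:

// Match the shuffle output to the 8 available Slots.
spark.conf.set("spark.sql.shuffle.partitions", 8L)

// groupBy is a wide transformation: rows sharing a key may live in many
// partitions of the parent Dataset, so a shuffle is required.
val counts = df.groupBy("state").count()
println(counts.rdd.getNumPartitions) // 8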
Slide 37: When to use cache
• When you re-use a Dataset multiple times
• To recover quickly from a node failure
• Data scientist: training data in an iterative loop (see the sketch below) 👍
• Data analyst: most of the time no; caching hides the fact that the data is not organized properly 👎
• Data engineer: usually no, but it depends on the case. Benchmark before going to prod ❔
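A sketch of the iterative-loop case (the path and the loop body are hypothetical placeholders; assumes spark-shell):

// The training data is scanned once per iteration, so caching pays off.
val training = spark.read.parquet("/data/training").cache()
training.count() // materialize the cache (caching is lazy, see slide 42)

for (iteration <- 1 to 10) {
  // ... fit / evaluate a model against `training` here (placeholder) ...
}

training.unpersist() // release the cached blocks when done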
Slide 40: How to cache a dataset in Spark
Cache strategy: Storage Level
• NONE: no cache
• MEMORY_ONLY:
• data is cached non-serialized in memory
• if there is not enough memory, data is evicted and rebuilt from source when needed
• DISK_ONLY: data is serialized and stored on disk
• MEMORY_AND_DISK:
• data is cached non-serialized in memory
• if there is not enough memory, data is serialized and stored on disk
• OFF_HEAP: data is serialized and stored off-heap, e.g. with Alluxio (formerly Tachyon)
Slide 41: How to cache a dataset in Spark
Cache strategy: Storage Level
• _SER suffix:
• always serialize the data in memory
• saves space, at the cost of a serialization penalty
• _2 suffix:
• replicate each partition on 2 cluster nodes
• improves recovery time after a node failure
Available levels:
• NONE
• DISK_ONLY, DISK_ONLY_2
• MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_ONLY_SER, MEMORY_ONLY_SER_2
• MEMORY_AND_DISK, MEMORY_AND_DISK_2, MEMORY_AND_DISK_SER, MEMORY_AND_DISK_SER_2
• OFF_HEAP
Slide 42: How to cache a dataset in Spark
Cache strategy: Storage Level
• .cache() is an alias for .persist(): MEMORY_AND_DISK on a Dataset/DataFrame, MEMORY_ONLY on an RDD
• Caching is lazy: trigger it with an action, e.g. .count()
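Putting it together, a minimal sketch (assumes a Dataset `df`):

import org.apache.spark.storage.StorageLevel

// On a Dataset, .cache() is equivalent to .persist(StorageLevel.MEMORY_AND_DISK);
// an explicit level can be chosen with .persist(...) instead.
val cached = df.persist(StorageLevel.MEMORY_ONLY_SER)

// Lazy: nothing is stored until an action runs.
cached.count()     // materializes the cache

cached.unpersist() // drop the cached blocks when no longer needed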
Slide 51: Resources
• Spark official documentation: https://spark.apache.org/docs/latest/tuning.html
• Mastering Apache Spark by Jacek Laskowski: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
• Apache Spark - Best Practices and Tuning by Umberto Griffo: https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/
• High Performance Spark by Rachel Warren and Holden Karau (O'Reilly)