Slide 2: Who are we?
Christophe Préaud
Big data and distributed computing enthusiast.
Christophe is a data engineer at Kelkoo Group, in charge of maintaining and evolving the big data technology stack, developing Spark applications, and providing Spark support to other teams.
Florian Fauvarque
Open-source enthusiast who loves neat, clean code and, more generally, good software craftsmanship practices.
Florian is a software engineer at Kelkoo Group, in charge of developing Spark applications that produce analyses and product feeds for affiliate web sites.
This presentation is also available at https://aquilae.eu/snowcamp2019-spark
Slide 3: Kelkoo Group
The global data-driven marketing platform that connects consumers to products.
• 22 countries: international presence
• 20 years of e-commerce experience
• 4 price comparison sites
Slide 4: We are hiring!
Over 30 roles in the company
Roles in Grenoble:
• Java/Scala Developers
• Front-End Developers
• Data Scientists
• Internships
Slide 5: Kelkoo Group – Some numbers
• 2 billion logs written per day
• 60 TB in HDFS
• 15 servers in our prod YARN cluster: 1.73 TB memory, 520 vcores
• 3,300 jobs executed every day
Slide 6: What is Apache Spark?
Spark is a unified processing engine that can analyze big data using SQL, machine learning, graph processing, or real-time stream analysis: http://spark.apache.org
Slide 11: What is a partition in Spark?
• What does it mean to partition data?
• To divide a single dataset into smaller, manageable chunks
• → A Partition is a small piece of the total dataset
• How do the DataFrameReaders decide how to partition data?
• It depends on the reader (CSV, Parquet, ORC, ...)
• Task / Partition relationship:
• A typical Task processes a single Partition
• → The number of Partitions determines the number of Tasks needed to process the dataset (see the sketch below)
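For illustration, a minimal spark-shell sketch (the CSV path is a hypothetical example) that reads a file and checks how many partitions the reader produced:

// In spark-shell, `spark` (a SparkSession) is already available.
// The CSV reader splits the input into partitions; each partition
// will typically be processed by one task.
val df = spark.read.option("header", "true").csv("/data/college-scorecard.csv")
println(s"numPartitions: ${df.rdd.getNumPartitions}")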
Slide 12: What is a partition in Spark?
During the first part of this presentation, we will focus mainly on...
• the number of Partitions my data is divided into
• the number of Slots I have for parallel execution
The goal is to maximize Slot usage, i.e. to ensure as much as possible that each Slot is processing a Task.
Slide 13: Configuration for demo
• 4 executors
• 2 cores / executor
• → 8 Slots for parallel execution
• Dataset: College Scorecard (source: catalog.data.gov). College Scorecards make it easier for students to search for a college that is a good fit for them. They can use the College Scorecard to find out more about a college's affordability and value so they can make more informed decisions about which college to attend.
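A sketch of how such a session could be configured (the app name is an illustrative assumption, not from the slides):

import org.apache.spark.sql.SparkSession

// Request 4 executors with 2 cores each, i.e. 4 x 2 = 8 Slots
// available for parallel task execution (the demo configuration).
val spark = SparkSession.builder()
  .appName("partition-tuning-demo")
  .config("spark.executor.instances", "4")
  .config("spark.executor.cores", "2")
  .getOrCreate()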
Slide 16: Partition tuning: reading a file
Why 9 partitions?
• File size is 1.04 GB
• Max partition size is 128 MB (the spark.sql.files.maxPartitionBytes default)
• 1.04 * 1024 / 128 = 8.32, which rounds up to 9 partitions
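The same arithmetic, spelled out in Scala (this is the slide's back-of-the-envelope estimate; Spark's actual split logic also accounts for a small per-file open cost):

// Default spark.sql.files.maxPartitionBytes: 128 MB.
val maxPartitionBytes = 128L * 1024 * 1024
val fileSizeBytes = (1.04 * 1024 * 1024 * 1024).toLong // 1.04 GB

// 8.32 partitions are needed, so Spark rounds up to 9.
val numPartitions = math.ceil(fileSizeBytes.toDouble / maxPartitionBytes).toInt
println(numPartitions) // 9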
Slide 17: Partition tuning: reading a file
• As a rule of thumb, it is advised that the number of Partitions be a multiple of the number of Slots, so that every Slot is used (i.e. assigned a Task) throughout the processing.
• With 9 Partitions and 8 Slots, we are under-utilizing 7 of the 8 Slots: 7 Slots are assigned 1 Task and 1 Slot is assigned 2 Tasks, so 7 Slots sit idle while the 9th Task runs.
Slide 19: Partition tuning: reading a file
spark.sql.files.maxPartitionBytes: the maximum number of bytes to pack into a single partition when reading files.
[Demo screenshots: run times of 20 s and 22 s; numPartitions: 8]
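A sketch of raising this limit so the 1.04 GB file is read as 8 partitions, one per Slot (the 140 MB value and the path are illustrative choices; assumes spark-shell):

// 1.04 GB / 8 Slots ≈ 134 MB per partition, so raise the limit
// slightly above the 128 MB default (illustrative value).
spark.conf.set("spark.sql.files.maxPartitionBytes", 140L * 1024 * 1024)

val df = spark.read.option("header", "true").csv("/data/college-scorecard.csv")
println(s"numPartitions: ${df.rdd.getNumPartitions}") // 8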
Slide 25: Partition tuning: repartition and coalesce
coalesce:
• Performs better: no shuffle
• Records are not evenly distributed across all partitions → risk of a skewed dataset (i.e. a few partitions containing most of the data)
repartition:
• Extra cost because of the shuffle operation
• Ensures uniform distribution of the records across all partitions → Slot usage will be optimal
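A minimal sketch contrasting the two calls (assumes a Dataset `df`, e.g. the one read above):

// coalesce: merges existing partitions without a shuffle; it can only
// lower the partition count and may leave partitions unbalanced.
val coalesced = df.coalesce(4)

// repartition: triggers a full shuffle; it redistributes records evenly
// and can either raise or lower the partition count.
val repartitioned = df.repartition(16)

println(coalesced.rdd.getNumPartitions)     // 4
println(repartitioned.rdd.getNumPartitions) // 16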
Slide 29: Partition tuning: repartition or coalesce?
• If your dataset is skewed: use repartition
• If you want more partitions: use repartition (coalesce can only reduce the number)
• If you want to drastically reduce the number of partitions (e.g. numPartitions = 1): use repartition (see the sketch after this list)
• If your dataset is well balanced (i.e. not skewed) and you want fewer partitions (but not drastically fewer, i.e. not fewer than the number of Slots): use coalesce
• If in doubt: use repartition
Slide 31: Partition tuning: writing a file
The number of records is checked for each partition (not for the whole dataset) while the partition is being written: when it goes over the threshold, a new file is created.

for each partition {
  numRecords = 0
  for each record {
    numRecords++
    if (numRecords > 15000) {
      closeFile()
      openNewFile()
      numRecords = 1 // count the current record against the new file
    }
    writeRecordInFile()
  }
}
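This threshold maps to the DataFrameWriter's maxRecordsPerFile option (available since Spark 2.2); the 15,000 value, output path, and format below are just illustrative choices:

// Cap each output file at 15,000 records while writing every partition.
df.write
  .option("maxRecordsPerFile", 15000)
  .parquet("/output/college-scorecard")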
Slide 33: Partition tuning: wide transformation
Wide transformation: the data required to compute the records in a single Partition may reside in many Partitions of the parent Dataset (i.e. it triggers a shuffle operation).
[Demo screenshots: run times of 45 s and 1 min]
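Shuffles produced by wide transformations create spark.sql.shuffle.partitions output partitions (200 by default), which is the usual knob here. A sketch, assuming spark-shell and a hypothetical `state` column in `df`:

// Match the shuffle output to the 8 available Slots.
spark.conf.set("spark.sql.shuffle.partitions", 8L)

// groupBy is a wide transformation: rows sharing a key may live in many
// partitions of the parent Dataset, so a shuffle is required.
val counts = df.groupBy("state").count()
println(counts.rdd.getNumPartitions) // 8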
Slide 37: When to use cache
• When you re-use a Dataset multiple times
• To recover quickly from a node failure
• Data scientist: training data in an iterative loop (see the sketch below) 👍
• Data analyst: most of the time no; caching hides the fact that the data is not organized properly 👎
• Data engineer: usually no, but it depends on the case. Benchmark before going to prod ❔
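A sketch of the iterative-loop case (the path and the loop body are hypothetical placeholders; assumes spark-shell):

// The training data is scanned once per iteration, so caching pays off.
val training = spark.read.parquet("/data/training").cache()
training.count() // materialize the cache (caching is lazy, see slide 42)

for (iteration <- 1 to 10) {
  // ... fit / evaluate a model against `training` here (placeholder) ...
}

training.unpersist() // release the cached blocks when done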
Slide 40: How to cache a dataset in Spark
Cache strategy: Storage Level
• NONE: no cache
• MEMORY_ONLY:
• data is cached non-serialized in memory
• if there is not enough memory, data is evicted and rebuilt from source when needed
• DISK_ONLY: data is serialized and stored on disk
• MEMORY_AND_DISK:
• data is cached non-serialized in memory
• if there is not enough memory, data is serialized and stored on disk
• OFF_HEAP: data is serialized and stored off-heap, e.g. with Alluxio (formerly Tachyon)
Slide 41: How to cache a dataset in Spark
Cache strategy: Storage Level
• _SER suffix:
• always serialize the data in memory
• saves space, at the cost of a serialization penalty
• _2 suffix:
• replicate each partition on 2 cluster nodes
• improves recovery time after a node failure
Available levels:
• NONE
• DISK_ONLY, DISK_ONLY_2
• MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_ONLY_SER, MEMORY_ONLY_SER_2
• MEMORY_AND_DISK, MEMORY_AND_DISK_2, MEMORY_AND_DISK_SER, MEMORY_AND_DISK_SER_2
• OFF_HEAP
Slide 42: How to cache a dataset in Spark
Cache strategy: Storage Level
• .cache() is an alias for .persist(): MEMORY_AND_DISK on a Dataset/DataFrame, MEMORY_ONLY on an RDD
• Caching is lazy: trigger it with an action, e.g. .count()
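Putting it together, a minimal sketch (assumes a Dataset `df`):

import org.apache.spark.storage.StorageLevel

// On a Dataset, .cache() is equivalent to .persist(StorageLevel.MEMORY_AND_DISK);
// an explicit level can be chosen with .persist(...) instead.
val cached = df.persist(StorageLevel.MEMORY_ONLY_SER)

// Lazy: nothing is stored until an action runs.
cached.count()     // materializes the cache

cached.unpersist() // drop the cached blocks when no longer needed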
Slide 51: Resources
• Spark official documentation: https://spark.apache.org/docs/latest/tuning.html
• Mastering Apache Spark by Jacek Laskowski: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
• Apache Spark - Best Practices and Tuning by Umberto Griffo: https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/
• High Performance Spark by Rachel Warren and Holden Karau (O'Reilly)