Spark
Understanding & Performance Issues
Key-points of Spark
• A better implementation of the MapReduce paradigm
• Handles batch, iterative and real-time applications within a single framework
• Most computations map into many map and reduce steps with dependencies among them
• Spark's RDD programming model captures these dependencies as a DAG
Spark Goals
• Generality: diverse workloads, operators, job sizes
• Low latency: sub-second
• Fault tolerance: faults shouldn't be a special case
• Simplicity: offer a high-level API without boilerplate code
Programming Point of View
• High-level API (accessible to data scientists)
• Native integration with Java, Python and Scala
• Thanks to the flexible programming model, applications can rewrite the way shuffling or aggregation is done
• Optionally, applications can choose to keep datasets in memory
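A minimal Scala sketch of these points, assuming a local SparkContext; the input path, key extraction and partition count are illustrative, not taken from the slides:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object ApiSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("api-sketch").setMaster("local[*]"))

    // High-level API: a few lines express a distributed aggregation.
    val pairs = sc.textFile("hdfs:///data/events")          // illustrative path
      .map(line => (line.split(",")(0), 1))                 // (key, 1) pairs

    // The application can control how the aggregation/shuffle is partitioned...
    val counts = pairs.reduceByKey(new HashPartitioner(8), _ + _)

    // ...and optionally keep the dataset in memory for reuse.
    counts.cache()
    println(counts.count())

    sc.stop()
  }
}
```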
Engineering Point of View
• Uses RPCs for task dispatching and scheduling
• Uses a thread pool for task execution (rather than the pool of JVM processes that Hadoop uses)
• The two points above enable Spark to schedule tasks in milliseconds, whereas MapReduce scheduling takes seconds or minutes on busy clusters
• Supports checkpoint-based recovery (like Hadoop) plus lineage-based recovery (much faster)
• Spark can cache the data to be processed
• Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads
• Benefit: applications are isolated from each other, on both the scheduling side (each driver schedules its own tasks) and the executor side (tasks from different applications run in different JVMs)
• Disadvantage: data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system
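A hedged sketch of how an application typically gets its own long-lived executors; the property keys are standard Spark configuration names, while the values and application name are only illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Each SparkContext gets its own executor JVMs, which stay up for the whole
// application and run its tasks as threads inside a thread pool.
val conf = new SparkConf()
  .setAppName("engineering-sketch")
  .set("spark.executor.instances", "4")   // executor processes (YARN mode)
  .set("spark.executor.cores", "4")       // task threads per executor
  .set("spark.executor.memory", "4g")     // heap of each executor JVM

val sc = new SparkContext(conf)
// RDDs built on this context are private to this application; sharing them with
// another SparkContext requires writing to external storage such as HDFS.
sc.stop()
```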
Spark Jargon (1/2)
• Driver: the program/process responsible for running the job over the Spark engine
• Executor: the process responsible for executing a task
• Master: the machine where the Driver runs
• Slave/Worker: the machine where the Executor program runs
Spark's Master/Slave Architecture
Spark Jargon (2/2)
• Job: a parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect)
• Stages: each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce)
• Tasks: each stage has some tasks, one per partition; one task is executed on one partition of data on one executor
• DAG: stands for Directed Acyclic Graph; in the present context it is a DAG of operators
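A small sketch (data and keys invented for illustration) of how these terms map onto code: one action spawns one job, the wide dependency splits it into two stages, and each stage runs one task per partition. `toDebugString` prints the resulting operator DAG:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("jargon-sketch").setMaster("local[2]"))

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 4)

val totals = pairs
  .map { case (k, v) => (k, v * 2) }   // narrow dependency: same stage
  .reduceByKey(_ + _)                  // wide dependency: shuffle, new stage

println(totals.toDebugString)          // lineage / DAG with the stage boundary
println(totals.partitions.length)      // number of tasks in the final stage
totals.collect()                       // the action that spawns the job

sc.stop()
```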
RDDs
• Resilient Distributed Datasets are the primary abstraction in Spark: a fault-tolerant collection of elements that can be operated on in parallel.
• There are currently two types:
1. Parallelized collections: take an existing Scala collection and run functions on it in parallel.
2. Hadoop datasets: run functions on each record of a file in HDFS or any other storage supported by Hadoop.
• RDDs support two types of operations, transformations and actions:
1. Transformations are lazy operations on an RDD that create one or many new RDDs, e.g. map, filter, reduceByKey, join, randomSplit.
2. Actions are computed immediately. They consist of running all the previous transformations in order to get back an actual result. In other words, an RDD operation that returns a value of any type other than RDD[T] is an action. (Actions are synchronous.)
• An RDD can be persisted to disk or cached in memory.
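A sketch of both RDD types and both operation types, assuming a local SparkContext; the HDFS path and filter condition are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

// 1. Parallelized collection: an existing Scala collection split into partitions.
val numbers = sc.parallelize(1 to 1000000)

// 2. Hadoop dataset: one record per line of a file in HDFS (illustrative path).
val logs = sc.textFile("hdfs:///data/access.log")

// Transformations are lazy: nothing has executed yet.
val errors = logs.filter(_.contains("ERROR"))

// An RDD can be kept in memory (or allowed to spill to local disk).
errors.persist(StorageLevel.MEMORY_AND_DISK)

// Actions run the whole lineage and return a value other than an RDD.
val errorCount = errors.count()
val total      = numbers.reduce(_ + _)

sc.stop()
```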
Transformations
• There are two kinds of transformations:
1. Narrow transformations: each output partition depends on data from a single input partition only, e.g. map, filter.
• Spark groups narrow transformations into one stage, which is called pipelining.
2. Wide/shuffle transformations: e.g. groupByKey and reduceByKey. The data required to compute the records in a single partition may exist in many partitions of the parent RDD.
• All of the tuples with the same key must end up in the same partition, processed by the same task.
• To satisfy these operations, Spark must execute an RDD shuffle, which transfers data across the cluster and results in a new stage with a new set of partitions.
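A sketch of the two kinds (toy data, invented partition counts): the map and filter are pipelined into a single stage, while reduceByKey shuffles into a new stage with its own partitions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("stage-sketch").setMaster("local[*]"))

val words = sc.parallelize(Seq("Spark", "shuffle", "spark", "RDD"), 4)

// Narrow transformations: each output partition depends on one input partition,
// so Spark pipelines them inside a single stage.
val cleaned = words.map(_.toLowerCase).filter(_.nonEmpty)

// Wide transformation: records with the same key may live in many parent
// partitions, so Spark shuffles and starts a new stage with new partitions.
val counts = cleaned.map(w => (w, 1)).reduceByKey(_ + _, numPartitions = 16)

counts.collect()
sc.stop()
```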
Transformations | Actions
Transformations:
• map( function )
• filter( function )
• flatMap( function )
• sample( withReplacement, fraction, [seed] )
• union( otherDataset )
• distinct( [numTasks] )
• groupByKey( [numTasks] )
• reduceByKey( function, [numTasks] )
• sortByKey( [ascending], [numTasks] )
• join( otherDataset, [numTasks] ) etc.

Actions:
• reduce( function )
• collect()
• count()
• first()
• take( n )
• takeSample( .. )
• saveAsTextFile( path )
• saveAsSequenceFile( path )
• countByKey()
• foreach( function ) etc.
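A compact sketch chaining several of the operators listed above (the dataset and output path are invented for the example):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("operators-sketch").setMaster("local[*]"))

val words = sc.parallelize(Seq("spark", "shuffle", "spark", "rdd", "dag"))

val counts = words
  .map(w => (w, 1))                // transformation
  .reduceByKey(_ + _)              // transformation (wide)
  .sortByKey(ascending = true)     // transformation (wide)

counts.collect()                                   // action: Array[(String, Int)]
counts.count()                                     // action: number of distinct words
counts.saveAsTextFile("hdfs:///tmp/word-counts")   // action: illustrative output path

sc.stop()
```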
RDD Shuffle
• Shuffling is the process of redistributing data across partitions (aka repartitioning), which may or may not move data across JVM processes or even over the wire (between executors on separate machines).
• "This typically involves copying data across executors and machines, making the shuffle a complex and costly operation." - from Spark's documentation
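A sketch of explicitly triggering (and avoiding) a shuffle; the partition counts are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("shuffle-sketch").setMaster("local[*]"))

val data = sc.parallelize(1 to 1000000, 8)

// repartition always shuffles: records may move between executor JVMs
// and, in a cluster, over the network between machines.
val wider = data.repartition(64)

// coalesce with shuffle = false only merges co-located partitions,
// so it reduces parallelism without a full shuffle.
val narrower = wider.coalesce(16, shuffle = false)

println(narrower.partitions.length)   // 16
sc.stop()
```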
Spark's System Layers
Different Deployment Modes
• Spark standalone
• Spark on YARN
• Spark on Mesos
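The deployment mode is selected through the master URL when the context is configured; a sketch with Spark 2.x-style URLs (host names and ports are placeholders):

```scala
import org.apache.spark.SparkConf

val standalone = new SparkConf().setMaster("spark://master-host:7077")   // Spark standalone
val onYarn     = new SparkConf().setMaster("yarn")                       // Spark on YARN
val onMesos    = new SparkConf().setMaster("mesos://mesos-master:5050")  // Spark on Mesos
val localDev   = new SparkConf().setMaster("local[*]")                   // single-JVM testing
```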
Common Performance Issues
• Adequate parallelism / partitioning: smaller, more numerous partitions allow work to be distributed among more workers, while larger, fewer partitions let work be done in bigger chunks, which may finish more quickly (as long as all workers are kept busy) thanks to reduced overhead; see the sketch after this list.
• Re-partitioning on the fly: each execution stage may have a different optimal degree of parallelism, and the data shuffles between stages are opportunities to adjust the partitioning accordingly.
• Wrong ordering of transformations: shuffling more data than necessary.
• Data layout: OO languages add a layer of abstraction, but this increases memory overhead. Furthermore, these frameworks run on top of the JVM, whose garbage collector is known to be sensitive to memory layout and access patterns.
• Task placement: co-allocating heterogeneous tasks has the potential to create unexpected performance issues.
• Load balancing: assuming applications execute stages sequentially, any imbalance among a stage's tasks leads to resource idleness.
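A sketch of tuning parallelism and re-partitioning between stages, as referenced in the first two bullets above; the configuration value, path and partition counts are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf()
    .setAppName("partitioning-sketch")
    .setMaster("local[*]")
    .set("spark.default.parallelism", "64"))   // baseline parallelism for shuffles

// Ask for enough input partitions to keep all workers busy.
val records = sc.textFile("hdfs:///data/records", minPartitions = 128)

// The shuffle between stages is the natural point to change the degree of
// parallelism: here the reduce stage uses fewer, larger partitions.
val aggregated = records
  .map(line => (line.take(8), 1))
  .reduceByKey(_ + _, numPartitions = 32)

aggregated.count()
sc.stop()
```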
Other issues (1/2): Too many shuffle files
• It has been observed that the bottleneck Spark currently faces is specific to the existing implementation of how shuffle files are defined.
• Each map creates one shuffle file per reducer, so with 5000 maps and 1024 reducers we end up with over 5 million shuffle files in total.
• This can lead to:
1. Poor performance due to communication via sockets
2. Suffering from random I/O
Solutions
• Unsuccessful:
• Extra processing stage
• TritonSort (tries to bottleneck every resource at the same time)
• Optimizations from a static point of view, or only when the structure of the data is known
• Successful:
1. Shuffle file consolidation - proposed by A. Davidson et al., "Optimizing Shuffle Performance in Spark"
2. RDMA in Spark - proposed by W. Rahman et al., "Accelerating Spark with RDMA for Big Data Processing: Early Experiences"
Other issues (2/2): Data shuffle blocks
• It is not feasible to gather all the shuffle data before it is consumed, because:
1. The data transfers would take a long time to complete
2. A large amount of memory and local storage would be needed to cache it
• So a producer-consumer model of shuffling/reducing is adopted:
1. This creates a complex all-to-all communication pattern that puts a significant burden on the networking infrastructure.
2. CPUs can block waiting for a missing shuffle block, while memory utilization can explode due to the accumulation of many shuffle blocks.
Evaluation tools
• Spark monitoring web UI (offers a precise event timeline, DAG visualisation and other monitoring tools)
• sar to report IOPS for I/O usage (provided as part of the sysstat package)
• iostat
• htop
• free
Evaluation applications / Benchmarks
• SparkBench (benchmark suite) - by IBM, "SparkBench: A Comprehensive Benchmarking Suite for In-Memory Data Analytic Platform Spark"
• GroupBy Test (commonly used Spark benchmark) - used by "Accelerating Spark with RDMA for Big Data Processing: Early Experiences"
• Twidd (application) - used by "Diagnosing Performance Bottlenecks in Massive Data Parallel Programs"
• Elcat (application) - used by "Diagnosing Performance Bottlenecks in Massive Data Parallel Programs"
• PageRank (application) - used by "Diagnosing Performance Bottlenecks in Massive Data Parallel Programs"
• BDBench (benchmark)
• TPC-DS (benchmark)
Blocked Time Analysis
• Issues:
1. Per-task utilization cannot be measured in Spark because all tasks run in a single process.
2. Instrumentation should be light in terms of memory.
3. Instrumentation shouldn't add to job time.
4. Logging needed to be added in HDFS.
Usage of Memory
• Execution: memory used for shuffles, sorts and aggregations
• Storage: memory used to cache data that will be reused later
• 1st approach: static assignment
• 2nd approach: unified memory (when memory is contended, it is storage that spills to disk)
• 3rd approach: dynamic assignment across tasks/cores (each task is assigned 1/N of the memory) -> helps with stragglers
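A hedged sketch of how the unified-memory split is exposed to applications; the keys are standard Spark 1.6+ configuration names, and the values are illustrative:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Fraction of the JVM heap shared by execution (shuffles, sorts, aggregations)
  // and storage (cached data).
  .set("spark.memory.fraction", "0.6")
  // Portion of that shared region below which cached blocks are not evicted
  // in favor of execution.
  .set("spark.memory.storageFraction", "0.5")
```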
References
• https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
• "SparkBench: A Comprehensive Benchmarking Suite for In-Memory Data Analytic Platform Spark" (IBM)
• W. Rahman et al., "Accelerating Spark with RDMA for Big Data Processing: Early Experiences"
• "Diagnosing Performance Bottlenecks in Massive Data Parallel Programs"
• A. Davidson et al., "Optimizing Shuffle Performance in Spark"
