Spark
Understanding & Performance Issues
Key-points of Spark
• A better implementation of the MapReduce paradigm
• Handles batch, iterative and real-time applications within a single framework
• Most computations map into many map and reduce steps with dependencies among them
• Spark's RDD programming model captures these dependencies as a DAG
Spark Goals
• Generality: diverse workloads, operators, job sizes
• Low latency: sub-second
• Fault tolerance: faults shouldn't be a special case
• Simplicity: offer a high-level API without boilerplate code
Programming Point of View
• High-level API (accessible to data scientists)
• Native integration with Java, Python and Scala
• Thanks to the flexible programming model, applications can rewrite the way shuffling or aggregation is done
• Optionally, applications can choose to keep datasets in memory
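A minimal Scala sketch of these points, assuming a local SparkContext; the input path, key extraction and partition count are illustrative, not taken from the slides:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object ApiSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("api-sketch").setMaster("local[*]"))

    // High-level API: a few lines express a distributed aggregation.
    val pairs = sc.textFile("hdfs:///data/events")          // illustrative path
      .map(line => (line.split(",")(0), 1))                 // (key, 1) pairs

    // The application can control how the aggregation/shuffle is partitioned...
    val counts = pairs.reduceByKey(new HashPartitioner(8), _ + _)

    // ...and optionally keep the dataset in memory for reuse.
    counts.cache()
    println(counts.count())

    sc.stop()
  }
}
```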
Engineering Point of View
• Uses RPCs for task dispatching and scheduling
• Uses a thread pool for task execution (rather than the pool of JVM processes that Hadoop uses)
• The two points above enable Spark to schedule tasks in milliseconds, whereas MapReduce scheduling takes seconds or minutes on busy clusters
• Supports checkpoint-based recovery (like Hadoop) plus lineage-based recovery (much faster)
• Spark can cache the data to be processed
• Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads
• Benefit: applications are isolated from each other, on both the scheduling side (each driver schedules its own tasks) and the executor side (tasks from different applications run in different JVMs)
• Disadvantage: data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system
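A hedged sketch of how an application typically gets its own long-lived executors; the property keys are standard Spark configuration names, while the values and application name are only illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Each SparkContext gets its own executor JVMs, which stay up for the whole
// application and run its tasks as threads inside a thread pool.
val conf = new SparkConf()
  .setAppName("engineering-sketch")
  .set("spark.executor.instances", "4")   // executor processes (YARN mode)
  .set("spark.executor.cores", "4")       // task threads per executor
  .set("spark.executor.memory", "4g")     // heap of each executor JVM

val sc = new SparkContext(conf)
// RDDs built on this context are private to this application; sharing them with
// another SparkContext requires writing to external storage such as HDFS.
sc.stop()
```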
Spark Jargon (1/2)
• Driver: the program/process responsible for running the job over the Spark engine
• Executor: the process responsible for executing a task
• Master: the machine where the Driver runs
• Slave/Worker: the machine where the Executor program runs
Spark's Master/Slave Architecture
Spark Jargon (2/2)
• Job: a parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect)
• Stages: each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce)
• Tasks: each stage has some tasks, one per partition; one task is executed on one partition of data on one executor
• DAG: stands for Directed Acyclic Graph; in the present context it is a DAG of operators
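A small sketch (data and keys invented for illustration) of how these terms map onto code: one action spawns one job, the wide dependency splits it into two stages, and each stage runs one task per partition. `toDebugString` prints the resulting operator DAG:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("jargon-sketch").setMaster("local[2]"))

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 4)

val totals = pairs
  .map { case (k, v) => (k, v * 2) }   // narrow dependency: same stage
  .reduceByKey(_ + _)                  // wide dependency: shuffle, new stage

println(totals.toDebugString)          // lineage / DAG with the stage boundary
println(totals.partitions.length)      // number of tasks in the final stage
totals.collect()                       // the action that spawns the job

sc.stop()
```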
RDDs
• Resilient Distributed Datasets are the primary abstraction in Spark: a fault-tolerant collection of elements that can be operated on in parallel.
• There are currently two types:
1. Parallelized collections: take an existing Scala collection and run functions on it in parallel.
2. Hadoop datasets: run functions on each record of a file in HDFS or any other storage supported by Hadoop.
• RDDs support two types of operations, transformations and actions:
1. Transformations are lazy operations on an RDD that create one or many new RDDs, e.g. map, filter, reduceByKey, join, randomSplit.
2. Actions are computed immediately. They consist of running all the previous transformations in order to get back an actual result. In other words, an RDD operation that returns a value of any type other than RDD[T] is an action. (Actions are synchronous.)
• An RDD can be persisted to disk or cached in memory.
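A sketch of both RDD types and both operation types, assuming a local SparkContext; the HDFS path and filter condition are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

// 1. Parallelized collection: an existing Scala collection split into partitions.
val numbers = sc.parallelize(1 to 1000000)

// 2. Hadoop dataset: one record per line of a file in HDFS (illustrative path).
val logs = sc.textFile("hdfs:///data/access.log")

// Transformations are lazy: nothing has executed yet.
val errors = logs.filter(_.contains("ERROR"))

// An RDD can be kept in memory (or allowed to spill to local disk).
errors.persist(StorageLevel.MEMORY_AND_DISK)

// Actions run the whole lineage and return a value other than an RDD.
val errorCount = errors.count()
val total      = numbers.reduce(_ + _)

sc.stop()
```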
Transformations
• There are two kinds of transformations:
1. Narrow transformations: each output partition depends on data from a single input partition only, e.g. map, filter.
• Spark groups narrow transformations into one stage, which is called pipelining.
2. Wide/shuffle transformations: e.g. groupByKey and reduceByKey. The data required to compute the records in a single partition may exist in many partitions of the parent RDD.
• All of the tuples with the same key must end up in the same partition, processed by the same task.
• To satisfy these operations, Spark must execute an RDD shuffle, which transfers data across the cluster and results in a new stage with a new set of partitions.
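A sketch of the two kinds (toy data, invented partition counts): the map and filter are pipelined into a single stage, while reduceByKey shuffles into a new stage with its own partitions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("stage-sketch").setMaster("local[*]"))

val words = sc.parallelize(Seq("Spark", "shuffle", "spark", "RDD"), 4)

// Narrow transformations: each output partition depends on one input partition,
// so Spark pipelines them inside a single stage.
val cleaned = words.map(_.toLowerCase).filter(_.nonEmpty)

// Wide transformation: records with the same key may live in many parent
// partitions, so Spark shuffles and starts a new stage with new partitions.
val counts = cleaned.map(w => (w, 1)).reduceByKey(_ + _, numPartitions = 16)

counts.collect()
sc.stop()
```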
Transformations | Actions
Transformations:
• map( function )
• filter( function )
• flatMap( function )
• sample( withReplacement, fraction, [seed] )
• union( otherDataset )
• distinct( [numTasks] )
• groupByKey( [numTasks] )
• reduceByKey( function, [numTasks] )
• sortByKey( [ascending], [numTasks] )
• join( otherDataset, [numTasks] ) etc.

Actions:
• reduce( function )
• collect()
• count()
• first()
• take( n )
• takeSample( .. )
• saveAsTextFile( path )
• saveAsSequenceFile( path )
• countByKey()
• foreach( function ) etc.
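A compact sketch chaining several of the operators listed above (the dataset and output path are invented for the example):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("operators-sketch").setMaster("local[*]"))

val words = sc.parallelize(Seq("spark", "shuffle", "spark", "rdd", "dag"))

val counts = words
  .map(w => (w, 1))                // transformation
  .reduceByKey(_ + _)              // transformation (wide)
  .sortByKey(ascending = true)     // transformation (wide)

counts.collect()                                   // action: Array[(String, Int)]
counts.count()                                     // action: number of distinct words
counts.saveAsTextFile("hdfs:///tmp/word-counts")   // action: illustrative output path

sc.stop()
```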
RDD Shuffle
• Shuffling is the process of redistributing data across partitions (aka repartitioning), which may or may not move data across JVM processes or even over the wire (between executors on separate machines).
• "This typically involves copying data across executors and machines, making the shuffle a complex and costly operation." - from Spark's documentation
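A sketch of explicitly triggering (and avoiding) a shuffle; the partition counts are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("shuffle-sketch").setMaster("local[*]"))

val data = sc.parallelize(1 to 1000000, 8)

// repartition always shuffles: records may move between executor JVMs
// and, in a cluster, over the network between machines.
val wider = data.repartition(64)

// coalesce with shuffle = false only merges co-located partitions,
// so it reduces parallelism without a full shuffle.
val narrower = wider.coalesce(16, shuffle = false)

println(narrower.partitions.length)   // 16
sc.stop()
```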
Spark's System Layers
Different Deployment Modes
• Spark standalone
• Spark on YARN
• Spark on Mesos
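The deployment mode is selected through the master URL when the context is configured; a sketch with Spark 2.x-style URLs (host names and ports are placeholders):

```scala
import org.apache.spark.SparkConf

val standalone = new SparkConf().setMaster("spark://master-host:7077")   // Spark standalone
val onYarn     = new SparkConf().setMaster("yarn")                       // Spark on YARN
val onMesos    = new SparkConf().setMaster("mesos://mesos-master:5050")  // Spark on Mesos
val localDev   = new SparkConf().setMaster("local[*]")                   // single-JVM testing
```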
Common Performance Issues
• Adequate parallelism / partitioning: smaller, more numerous partitions allow work to be distributed among more workers, while larger, fewer partitions let work be done in bigger chunks, which may finish more quickly (as long as all workers are kept busy) thanks to reduced overhead; see the sketch after this list.
• Re-partitioning on the fly: each execution stage may have a different optimal degree of parallelism, and the data shuffles between stages are opportunities to adjust the partitioning accordingly.
• Wrong ordering of transformations: shuffling more data than necessary.
• Data layout: OO languages add a layer of abstraction, but this increases memory overhead. Furthermore, these frameworks run on top of the JVM, whose garbage collector is known to be sensitive to memory layout and access patterns.
• Task placement: co-allocating heterogeneous tasks has the potential to create unexpected performance issues.
• Load balancing: assuming applications execute stages sequentially, any imbalance among a stage's tasks leads to resource idleness.
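A sketch of tuning parallelism and re-partitioning between stages, as referenced in the first two bullets above; the configuration value, path and partition counts are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf()
    .setAppName("partitioning-sketch")
    .setMaster("local[*]")
    .set("spark.default.parallelism", "64"))   // baseline parallelism for shuffles

// Ask for enough input partitions to keep all workers busy.
val records = sc.textFile("hdfs:///data/records", minPartitions = 128)

// The shuffle between stages is the natural point to change the degree of
// parallelism: here the reduce stage uses fewer, larger partitions.
val aggregated = records
  .map(line => (line.take(8), 1))
  .reduceByKey(_ + _, numPartitions = 32)

aggregated.count()
sc.stop()
```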
Other issues (1/2): Too many shuffle files
• It has been observed that the bottleneck Spark currently faces is specific to the existing implementation of how shuffle files are defined.
• Each map creates one shuffle file per reducer, so with 5000 maps and 1024 reducers we end up with over 5 million shuffle files in total.
• This can lead to:
1. Poor performance due to communication via sockets
2. Suffering from random I/O
Solutions
• Unsuccessful:
• Extra processing stage
• TritonSort (tries to bottleneck every resource at the same time)
• Optimizations from a static point of view, or only when the structure of the data is known
• Successful:
1. Shuffle file consolidation - proposed by A. Davidson et al., "Optimizing Shuffle Performance in Spark"
2. RDMA in Spark - proposed by W. Rahman et al., "Accelerating Spark with RDMA for Big Data Processing: Early Experiences"
Other issues (2/2): Data shuffle blocks
• It is not feasible to gather all the shuffle data before it is consumed, because:
1. The data transfers would take a long time to complete
2. A large amount of memory and local storage would be needed to cache it
• So a producer-consumer model of shuffling/reducing is adopted:
1. This creates a complex all-to-all communication pattern that puts a significant burden on the networking infrastructure.
2. CPUs can block waiting for a missing shuffle block, while memory utilization can explode due to the accumulation of many shuffle blocks.
Evaluation tools
• Spark monitoring web UI (offers a precise event timeline, DAG visualisation and other monitoring tools)
• sar to report IOPS for I/O usage (provided as part of the sysstat package)
• iostat
• htop
• free
Evaluation applications / Benchmarks
• SparkBench (benchmark suite) - by IBM, "SparkBench: A Comprehensive Benchmarking Suite for In-Memory Data Analytic Platform Spark"
• GroupBy Test (commonly used Spark benchmark) - used by "Accelerating Spark with RDMA for Big Data Processing: Early Experiences"
• Twidd (application) - used by "Diagnosing Performance Bottlenecks in Massive Data Parallel Programs"
• Elcat (application) - used by "Diagnosing Performance Bottlenecks in Massive Data Parallel Programs"
• PageRank (application) - used by "Diagnosing Performance Bottlenecks in Massive Data Parallel Programs"
• BDBench (benchmark)
• TPC-DS (benchmark)
Blocked Time Analysis
• Issues:
1. Per-task utilization cannot be measured in Spark because all tasks run in a single process.
2. Instrumentation should be light in terms of memory.
3. Instrumentation shouldn't add to job time.
4. Logging needed to be added in HDFS.
Usage of Memory
• Execution: memory used for shuffles, sorts and aggregations
• Storage: memory used to cache data that will be reused later
• 1st approach: static assignment
• 2nd approach: unified memory (when memory is contended, it is storage that spills to disk)
• 3rd approach: dynamic assignment across tasks/cores (each task is assigned 1/N of the memory) -> helps with stragglers
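A hedged sketch of how the unified-memory split is exposed to applications; the keys are standard Spark 1.6+ configuration names, and the values are illustrative:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Fraction of the JVM heap shared by execution (shuffles, sorts, aggregations)
  // and storage (cached data).
  .set("spark.memory.fraction", "0.6")
  // Portion of that shared region below which cached blocks are not evicted
  // in favor of execution.
  .set("spark.memory.storageFraction", "0.5")
```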
References
• https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
• "SparkBench: A Comprehensive Benchmarking Suite for In-Memory Data Analytic Platform Spark" (IBM)
• W. Rahman et al., "Accelerating Spark with RDMA for Big Data Processing: Early Experiences"
• "Diagnosing Performance Bottlenecks in Massive Data Parallel Programs"
• A. Davidson et al., "Optimizing Shuffle Performance in Spark"
