Top 5 Mistakes when writing Spark applications
Mark Grover | @mark_grover | Software Engineer
Ted Malaska | @TedMalaska | Principal Solutions Architect
tiny.cloudera.com/spark-mistakes
About the book
• @hadooparchbook
• hadooparchitecturebook.com
• github.com/hadooparchitecturebook
• slideshare.com/hadooparchbook
Mistakes people make when using Spark
Mistakes we made when using Spark
Mistake # 1
# Executors, cores, memory !?!
• 6 Nodes
• 16 cores each
• 64 GB of RAM each
Decisions, decisions, decisions
• Number of executors (--num-executors)
• Cores for each executor (--executor-cores)
• Memory for each executor (--executor-memory)
• 6 nodes
• 16 cores each
• 64 GB of RAM
Spark Architecture recap
Answer #1 – Most granular
• Make the executors as small as possible
• 1 core each
• Total of 16 x 6 = 96 cores
• 96 executors
• 64/16 = 4 GB per executor (per node)
Why?
• You lose the benefit of running multiple tasks in the same JVM
Answer #2 – Least granular
• 6 executors
• 64 GB memory each
• 16 cores each
Why?
• Need to leave some memory overhead for OS/Hadoop daemons
Answer #3 – with overhead
• 6 executors
• 63 GB memory each
• 15 cores each
Spark on YARN – Memory usage
• --executor-memory controls the heap size
• Need some overhead (controlled by spark.yarn.executor.memoryOverhead) for off-heap memory
• Default is max(384 MB, 0.07 * spark.executor.memory)
YARN AM needs a core: Client mode
YARN AM needs a core: Cluster mode
HDFS Throughput
• 15 cores per executor can lead to bad HDFS I/O throughput
• Best to keep it to at most 5 cores per executor
Calculations
• 5 cores per executor
– For max HDFS throughput
• Cluster has 6 * 15 = 90 cores in total (after taking out Hadoop/YARN daemon cores)
• 90 cores / 5 cores per executor = 18 executors
• 1 executor for the AM => 17 executors
• Each node has 3 executors
• 63 GB / 3 = 21 GB; 21 x (1 - 0.07) ≈ 19 GB (accounting for off-heap overhead)
Correct answer
• 17 executors
• 19 GB memory each
• 5 cores each
* Not etched in stone
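As an illustrative sketch (not from the deck), these numbers map onto the spark-submit flags from the earlier slide roughly like this; the deploy mode and application jar are placeholders:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 17 \
  --executor-cores 5 \
  --executor-memory 19G \
  my-app.jar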
Read more
• From a great blog post on this topic by Sandy Ryza:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
Mistake # 2
Application failure
15/04/16 14:13:03 WARN scheduler.TaskSetManager: Lost task 19.0 in stage 6.0 (TID 120, 10.215.149.47):
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
  at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828)
  at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123)
  at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
  at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517)
  at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:432)
  at org.apache.spark.storage.BlockManager.get(BlockManager.scala:618)
  at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:146)
  at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
Why?
• No Spark shuffle block can be greater than
2 GB
Ok, what’s a shuffle block again?
• In MapReduce terminology: for each mapper-reducer pair, it's the file the mapper writes to local disk and the reducer then reads
In other words
Each yellow arrow in this diagram represents a shuffle block.
Wait! What!?! This is Big Data stuff, no?
• Yeah! Nope!
• Spark uses ByteBuffer as the abstraction for storing blocks
val buf = ByteBuffer.allocate(length.toInt)
• ByteBuffer is limited by Integer.MAX_VALUE (2 GB)!
Once again
• No Spark shuffle block can be greater than
2 GB
Spark SQL
• Especially problematic for Spark SQL
• Default number of partitions to use when doing shuffles is 200
– Such a low partition count leads to large shuffle blocks
Umm, ok, so what can I do?
1. Increase the number of partitions
– thereby reducing the average partition size
2. Get rid of skew in your data
– more on that later
Umm, how exactly?
• In Spark SQL, increase the value of spark.sql.shuffle.partitions
• In regular Spark applications, use rdd.repartition() or rdd.coalesce()
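A minimal sketch of both options, assuming a Spark 1.x sqlContext and an existing rdd are in scope; the partition counts are arbitrary:

// Spark SQL: raise the shuffle partition count from the default of 200
sqlContext.setConf("spark.sql.shuffle.partitions", "400")

// Core RDD API
val morePartitions  = rdd.repartition(400)  // full shuffle; can increase or decrease partitions
val fewerPartitions = rdd.coalesce(100)     // narrow dependency; only decreases partitions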
But, how many partitions should I have?
• Rule of thumb is around 128 MB per partition
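Applying that rule of thumb, a quick back-of-the-envelope sketch (the 1 TB figure is just an example):

// ~1 TB of shuffle data at ~128 MB per partition => ~8192 partitions
val bytesPerPartition = 128L * 1024 * 1024
val totalShuffleBytes = 1024L * 1024 * 1024 * 1024
val numPartitions     = math.ceil(totalShuffleBytes.toDouble / bytesPerPartition).toInt  // 8192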
But!
• Spark uses a different data structure for shuffle bookkeeping when the number of partitions is less than 2000 vs. more than 2000
Don’t believe me?
• In MapStatus.scala
def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
  if (uncompressedSizes.length > 2000) {
    HighlyCompressedMapStatus(loc, uncompressedSizes)
  } else {
    new CompressedMapStatus(loc, uncompressedSizes)
  }
}
Ok, so what are you saying?
• If your number of partitions is less than 2000, but close to it, bump it up to slightly more than 2000
Can you summarize, please?
• Don’t have too big partitions
– Your job will fail due to the 2 GB limit
• Don’t have too few partitions
– Your job will be slow, not making use of parallelism
• Rule of thumb: ~128 MB per partition
• If #partitions < 2000, but close, bump to just > 2000
Mistake # 3
Slow jobs on Join/Shuffle
• Your dataset takes 20 seconds to run over with a map job, but takes 4 hours when joined or shuffled. What’s wrong?
Skew and Cartesian
Mistake - Skew
[Diagram: many single-threaded tasks all finishing at about the same time - work evenly distributed across the cluster, the holy grail of distributed systems]
Mistake - Skew
[Diagram: the same job with skew - one single-threaded task runs far longer than the rest]
What about skew? Because that is a thing.
Mistake – Skew : Answers
• Salting
• Isolation Salting
• Isolation Map Joins
Mistake – Skew : Salting
• Normal Key: “Foo”
• Salted Key: “Foo” + random.nextInt(saltFactor)
Managing Parallelism
Mistake – Skew: Salting
Add Example Slide
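A minimal sketch of salting a key, assuming an RDD of (key, value) pairs named pairs; the saltFactor value is arbitrary:

import scala.util.Random

val saltFactor = 100
// ("Foo", 1) becomes something like ("Foo_42", 1),
// spreading one hot key over up to 100 reduce tasks
val saltedPairs = pairs.map { case (key, value) =>
  (key + "_" + Random.nextInt(saltFactor), value)
}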
Mistake – Skew : Salting
• Two-stage aggregation (see the sketch after the flow below)
– Stage one does the operation on the salted keys
– Stage two does the operation on the unsalted key results
[Flow: Data Source → Map (convert to salted key & value tuple) → Reduce by salted key → Map (convert results to key & value tuple) → Reduce by key → Results]
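A minimal two-stage aggregation sketch for a sum, reusing the saltedPairs RDD from the earlier sketch (values assumed numeric); this only works for associative, commutative operations such as sums and counts:

// Stage 1: reduce on the salted key
val stageOne = saltedPairs.reduceByKey(_ + _)

// Stage 2: strip the salt, then reduce on the original key
val stageTwo = stageOne
  .map { case (saltedKey, value) =>
    (saltedKey.substring(0, saltedKey.lastIndexOf("_")), value)
  }
  .reduceByKey(_ + _)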
Mistake – Skew : Isolated Salting
• Second stage only required for the isolated keys (see the sketch after the flow below)
[Flow: Data Source → Map (convert to key & value; isolate the skewed keys and convert them to salted key & value tuples) → Reduce by key & salted key → Filter isolated keys from salted keys → Map (convert results to key & value tuple) → Reduce by key → Union to results]
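One way to sketch isolated salting, assuming the set of hot ("isolated") keys is already known and small; here it is hard-coded, in practice it might come from sampling:

val hotKeys = sc.broadcast(Set("Foo", "Bar"))     // hypothetical skewed keys
val salted = pairs.map { case (key, value) =>
  if (hotKeys.value.contains(key))
    ((key, Random.nextInt(saltFactor)), value)    // salt only the hot keys
  else
    ((key, 0), value)                             // everything else keeps a constant salt
}
val firstPass = salted.reduceByKey(_ + _)

// The second stage is only required for the isolated keys
val isolated = firstPass
  .filter { case ((key, _), _) => hotKeys.value.contains(key) }
  .map    { case ((key, _), value) => (key, value) }
  .reduceByKey(_ + _)
val normal = firstPass
  .filter { case ((key, _), _) => !hotKeys.value.contains(key) }
  .map    { case ((key, _), value) => (key, value) }
val results = normal.union(isolated)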
Mistake – Skew : Isolated Map Join
• Filter out the isolated keys and use a map join/aggregate on those
• Use a normal reduce on the rest of the data
• This can remove a large amount of data being shuffled
• (See the sketch after the flow below)
[Flow: Data Source → Filter normal keys from isolated keys → Reduce by normal key, and Map join for isolated keys → Union to results]
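A rough sketch of the isolated map join, assuming two pair RDDs named left and right where only a few keys are skewed and the right-side rows for those keys fit in memory:

val hotKeys  = Set("Foo", "Bar")                  // hypothetical skewed keys
val hotRight = sc.broadcast(
  right.filter { case (k, _) => hotKeys.contains(k) }.collectAsMap())

// Normal shuffle join for the well-behaved keys
val normalJoined = left.filter { case (k, _) => !hotKeys.contains(k) }
  .join(right.filter { case (k, _) => !hotKeys.contains(k) })

// Map-side join for the isolated keys - their rows are never shuffled
val hotJoined = left.filter { case (k, _) => hotKeys.contains(k) }
  .flatMap { case (k, v) => hotRight.value.get(k).map(r => (k, (v, r))) }

val results = normalJoined.union(hotJoined)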
Managing Parallelism
Cartesian Join
[Diagram: each map task writes a shuffle temp file for every reduce task; with a Cartesian join the amount of shuffled data explodes - 10x, 100x, 1000x, 10000x, 100000x, 1000000x or more]
Managing Parallelism
• To fight Cartesian Join
– Nested Structures
– Windowing
– Skip Steps
Mistake # 4
Out of luck?
• Do you ever run out of memory?
• Do you ever have more than 20 stages?
• Is your driver doing a lot of work?
Mistake – DAG Management
• Shuffles are to be avoided
• ReduceByKey over GroupByKey
• TreeReduce over Reduce
• Use Complex Types
Mistake – DAG Management: Shuffles
• Map-side reduce if possible
• Think about partitioning/bucketing ahead of time
• Do as much as possible with a single shuffle
• Only send what you have to send
• Avoid skew and Cartesians
ReduceByKey over GroupByKey
• ReduceByKey can do almost anything that GroupByKey can do
• Aggregations
• Windowing
• Use memory
• But you have more control
• ReduceByKey has a fixed limit on memory requirements
• GroupByKey is unbounded and depends on the data
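The classic word-count contrast, as a sketch (words is assumed to be an RDD[String]):

val pairs = words.map(word => (word, 1))

// reduceByKey combines values map-side before the shuffle - per-key memory is bounded
val goodCounts = pairs.reduceByKey(_ + _)

// groupByKey ships every single value across the network and buffers them per key -
// the buffer size is unbounded and depends entirely on the data
val badCounts = pairs.groupByKey().mapValues(_.sum)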
TreeReduce over Reduce
• TreeReduce & Reduce both return a result to the driver
• TreeReduce does more work on the executors
• Reduce brings everything back to the driver
[Diagram: with Reduce, all four partitions send 100% of their results directly to the driver; with TreeReduce, the four partitions' 25% shares are partially combined on the executors before reaching the driver]
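A sketch of the two calls side by side (numbers is assumed to be an RDD[Int]):

// reduce: every partition's partial result is sent straight to the driver
val totalWithReduce = numbers.reduce(_ + _)

// treeReduce: partials are first merged on the executors over 'depth' rounds,
// so the driver only receives a few already-combined values
val totalWithTreeReduce = numbers.treeReduce(_ + _, depth = 2)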
Complex Types
• Top N List
• Multiple types of Aggregations
• Windowing operations
• All in one pass
Complex Types
• Think outside the box: use objects to reduce by
• (Make something simple)
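For example, a per-key top-3 list in a single pass, sketched with aggregateByKey so the per-key state stays small (scores is assumed to be an RDD[(String, Int)]):

val n = 3
val topNPerKey = scores.aggregateByKey(List.empty[Int])(
  (acc, value) => (value :: acc).sorted(Ordering[Int].reverse).take(n),  // within a partition
  (a, b)       => (a ++ b).sorted(Ordering[Int].reverse).take(n)         // merging partitions
)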
Mistake # 5
Ever seen this?
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
  at org.apache.spark.util.collection.OpenHashSet.org$apache$spark$util$collection$OpenHashSet$$hashcode(OpenHashSet.scala:261)
  at org.apache.spark.util.collection.OpenHashSet$mcI$sp.getPos$mcI$sp(OpenHashSet.scala:165)
  at org.apache.spark.util.collection.OpenHashSet$mcI$sp.contains$mcI$sp(OpenHashSet.scala:102)
  at org.apache.spark.util.SizeEstimator$$anonfun$visitArray$2.apply$mcVI$sp(SizeEstimator.scala:214)
  at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
  at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:210)
  at ...
But!
• I already included Guava in my app’s Maven dependencies?
Ah!
• My Guava version doesn’t match Spark’s Guava version!
Shading
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.2</version>
  ...
  <relocations>
    <relocation>
      <pattern>com.google.protobuf</pattern>
      <shadedPattern>com.company.my.protobuf</shadedPattern>
    </relocation>
  </relocations>
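For the Guava conflict described above, the same kind of relocation entry would use com.google.common as the pattern and something like com.company.my.guava (an arbitrary package name) as the shadedPattern, so the application's Guava classes no longer collide with the version Spark ships.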
Summary
5 Mistakes
• Size up your executors right
• 2 GB limit on Spark shuffle blocks
• The evil thing about skew and Cartesians
• Learn to manage your DAG, yo!
• Do shady stuff, don’t let classpath leaks mess you up
THANK YOU.
tiny.cloudera.com/spark-mistakes
Mark Grover | @mark_grover
Ted Malaska | @TedMalaska
