“10 things I wish I'd known before using Spark in production!”
Himanshu Arora
Lead Data Engineer, NeoLynk France
h.arora@neolynk.fr
@him_aro
Nitya Nand Yadav
Data Engineer, NeoLynk France
n.yadav@neolynk.fr
@nityany
What we are going to cover...
1. RDD vs DataFrame vs DataSet
2. Data Serialisation Formats
3. Storage formats
4. Broadcast join
5. Hardware tuning
6. Level of parallelism
7. GC tuning
8. Common errors
9. Data skew
10. Data locality
1/10 - RDD vs DataFrames vs DataSets
● RDD - Resilient Distributed Dataset
➔ Main abstraction of Spark.
➔ Low-level transformations, actions and control at the partition level.
➔ Suited to unstructured data such as media streams and text streams.
➔ Manipulate data with functional programming constructs.
➔ No built-in query optimization.
● DataFrame
➔ High level abstractions, rich semantics.
➔ Like a big distributed SQL table.
➔ High level expressions (aggregation, average, sum, sql queries).
➔ Performance and optimizations (predicate pushdown, QBO, CBO...).
➔ No compile-time type checking: errors surface at runtime.
● DataSet
➔ A collection of strongly-typed JVM objects, dictated by a case class you define in Scala or a class in Java.
➔ DataFrame = Dataset[Row].
➔ Performance and optimisations.
➔ Type-safety at compile time.
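The difference between the typed and untyped APIs can be sketched as follows. This is a minimal illustration, not code from the talk: the `Car` case class, the object name and the local-mode session are all assumptions made for the example.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type, used only for this illustration.
final case class Car(make: String, model: String, engineSize: Double)

object TypedVsUntyped {
  // Typed API: a misspelled field (e.g. _.engineSiz) is rejected by the compiler.
  def bigEnginesTyped(cars: Dataset[Car]): Dataset[Car] =
    cars.filter(_.engineSize > 1.5)

  // Untyped API: a misspelled column name compiles and only fails at runtime
  // with an AnalysisException.
  def bigEnginesUntyped(cars: DataFrame): DataFrame =
    cars.filter(cars("engineSize") > 1.5)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("typed-vs-untyped")
      .master("local[*]") // local mode, for the sketch only
      .getOrCreate()
    import spark.implicits._

    val cars: Dataset[Car] = Seq(
      Car("Renault", "Clio", 1.2),
      Car("Peugeot", "208", 1.6)
    ).toDS()

    bigEnginesTyped(cars).show()
    bigEnginesUntyped(cars.toDF()).show()
    spark.stop()
  }
}
```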
2/10 - Data Serialisation Format
➔ Data is shuffled between executors in serialized form.
➔ RDDs cached or persisted to disk are serialized too.
➔ Default serialization format of Spark: Java serialization (slow & large).
➔ Better use: Kryo serialization.
➔ Kryo: faster and more compact (up to 10x).
➔ DataFrames/Datasets use Tungsten serialization (even better than Kryo).
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val sparkConf: SparkConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // register your own custom classes with Kryo before the session is created;
  // changes made to the conf afterwards are not picked up
  .registerKryoClasses(Array(classOf[MyCustomeClass]))

val sparkSession: SparkSession = SparkSession
  .builder()
  .config(sparkConf)
  .getOrCreate()
3/10 - Storage Formats
➔ Avoid text formats such as plain text, JSON and CSV where possible.
➔ Use compressed binary formats instead.
➔ Popular choices: Apache Parquet, Apache Avro, ORC etc.
➔ The use case dictates the choice.
3/10 - Storage Formats: Apache Parquet & Avro
➔ Both are binary formats.
➔ Both are splittable.
➔ Parquet is columnar; Avro is row-based.
➔ Parquet: higher compression rates than row-based formats.
➔ Parquet suits read-heavy workloads; Avro suits write-heavy workloads.
➔ The schema is stored in the files themselves.
➔ Avro: better support for schema evolution.
// Parquet
val sparkConf: SparkConf = new SparkConf()
  .set("spark.sql.parquet.compression.codec", "snappy")
val dataframe = sparkSession.read.parquet("s3a://....")
dataframe.write.parquet("s3a://....")

// Avro (separate snippet)
// .avro(...) needs `import com.databricks.spark.avro._`; with the built-in
// spark-avro module (Spark 2.4+), use .format("avro").load(...) / .save(...) instead
val sparkConf: SparkConf = new SparkConf()
  .set("spark.sql.avro.compression.codec", "snappy")
val dataframe = sparkSession.read.avro("s3a://....")
dataframe.write.avro("s3a://....")
3/10 - Benchmark: using Avro instead of JSON
4/10 - Broadcast Join
import org.apache.spark.sql.functions.broadcast

// Spark automatically broadcasts small DataFrames (max. 10 MB by default)
val sparkConf: SparkConf = new SparkConf()
  .set("spark.sql.autoBroadcastJoinThreshold", "2147483648") // 2 GB, in bytes
  .set("spark.sql.broadcastTimeout", "900") // default 300 secs

/*
Error raised when a broadcast does not finish within the timeout:
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
*/

// Force a broadcast; "id" is a hypothetical join column, replace it with your own key
val result = bigDataFrame.join(broadcast(smallDataFrame), Seq("id"))
5/10 - Hardware Tuning
Know Your Cluster
● Number of nodes
● Cores per node
● RAM per node
● Cluster Manager (Yarn, Mesos …)
Let’s assume:
● 5 nodes
● 16 cores each
● 64GB RAM per node
● Yarn as RM
● Spark in client mode
5/10 - Hardware Tuning / Scenario #1 (Small executors)
--num-executors 80      # 16 cores x 5 nodes
--executor-cores 1
--executor-memory 4G    # 64 GB / 16 executors per node
➔ No multiple tasks running in the same JVM, so broadcast variables, accumulators… are not shared.
➔ Risk of running out of memory when computing a partition.
5/10 - Hardware Tuning / Scenario #2 (Large executors)
--num-executors 5
--executor-cores 16
--executor-memory 64G
➔ Very long garbage collection pauses.
➔ Poor performance with HDFS (handling many concurrent threads).
5/10 - Hardware Tuning / Scenario #3 (Right Balance)
--executor-cores 5
--num-executors 14      # 15 usable cores per node / 5 cores per executor = 3 executors per node; 3 x 5 nodes - 1 = 14
--executor-memory 18G   # 64 GB / 3 executors per node, minus 10% overhead
➔ The recommended number of concurrent threads for HDFS is 5.
➔ Always leave one core per node for the Yarn daemons.
➔ Always leave one executor for the Yarn ApplicationMaster.
➔ Off-heap memory overhead for Yarn = 10% of executor memory.
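The arithmetic behind the "right balance" scenario can be written down as a small function. This is a sketch of the rule of thumb only: the object, the function name and its parameters are hypothetical, and the defaults (5 cores per executor, 10% overhead) are taken from the bullets above. For the assumed cluster it yields 5 cores, 14 executors and 19 GB per executor; the 18 GB on the slide is a slightly more conservative figure.

```scala
object ExecutorSizing {
  // Result of the sizing rule of thumb; memory is in whole gigabytes.
  final case class Sizing(executorCores: Int, numExecutors: Int, executorMemoryGb: Int)

  def size(
      nodes: Int,
      coresPerNode: Int,
      ramPerNodeGb: Int,
      coresPerExecutor: Int = 5,
      overheadFraction: Double = 0.10
  ): Sizing = {
    val usableCores = coresPerNode - 1                    // one core per node for the Yarn daemons
    val executorsPerNode = usableCores / coresPerExecutor // integer division
    val numExecutors = nodes * executorsPerNode - 1       // one slot for the ApplicationMaster
    val memoryPerSlotGb = ramPerNodeGb.toDouble / executorsPerNode
    val heapGb = math.floor(memoryPerSlotGb * (1 - overheadFraction)).toInt
    Sizing(coresPerExecutor, numExecutors, heapGb)
  }

  def main(args: Array[String]): Unit = {
    // 5 nodes, 16 cores and 64 GB each, as assumed in the slides.
    println(size(nodes = 5, coresPerNode = 16, ramPerNodeGb = 64)) // prints Sizing(5,14,19)
  }
}
```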
5/10 - Benchmark
➔ Hardware tuning.
➔ Moving from Java serializer to Kryo.
6/10 - Level of parallelism/partitions
➔ The maximum size of a partition is limited by the available memory of an executor.
➔ Increasing the partition count leaves each partition with less data.
➔ Spark cannot split some compressed files (e.g. gzip) and creates only 1 partition for them, so repartition yourself.
rdd = sc.textFile('demo.gz')
rdd = rdd.repartition(100)
7/10 - GC Tuning
➔ Quick wins when using a large JVM heap to avoid long GC pauses.
spark.executor.extraJavaOptions: -XX:+UseG1GC -XX:+AlwaysPreTouch -XX:+UseLargePages -XX:+UseTLAB -XX:+ResizeTLAB
// if creating too many objects in the driver (e.g. with collect()),
// which is not a very good idea anyway
spark.driver.extraJavaOptions: -XX:+UseG1GC -XX:+AlwaysPreTouch -XX:+UseLargePages -XX:+UseTLAB -XX:+ResizeTLAB
8/10 - Knock knock… Who’s there?… An error :(
Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used.
Possible causes:
➔ Not enough executor memory.
➔ Too many executor cores (implies too much parallelism).
➔ Not enough Spark partitions.
➔ Data skew (let’s talk about that later…).
Possible solutions:
➔ Increase executor memory.
➔ Reduce the number of executor cores.
➔ Increase the number of Spark partitions.
➔ Persist in memory and disk (or just disk) with serialization.
➔ Use off-heap memory for caching.
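Two of these remedies, more partitions and serialized persistence that can spill to disk, can be sketched as below. The object and function names, the partition count and the local-mode session are assumptions made for the illustration; the right values depend on your data and cluster.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object SerializedPersist {
  // More, smaller partitions plus a serialized storage level that spills to disk
  // instead of failing when the data does not fit in memory.
  def cacheCompactly[T](rdd: RDD[T], partitions: Int): RDD[T] =
    rdd.repartition(partitions).persist(StorageLevel.MEMORY_AND_DISK_SER)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("serialized-persist")
      .master("local[*]") // local mode, for the sketch only
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

    val numbers = spark.sparkContext.parallelize(1 to 1000000)
    val cached = cacheCompactly(numbers, partitions = 200)

    println(cached.count())
    println(cached.getNumPartitions)
    println(cached.getStorageLevel.description)
    spark.stop()
  }
}
```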
8/10 - Knock knock… Who’s there?… An error :(
19/01/31 21:03:13 INFO DAGScheduler: Host lost: ip-172-29-149-243.eu-west-1.compute.internal (epoch 16)
19/01/31 21:03:13 INFO BlockManagerMasterEndpoint: Trying to remove executors on host ip-172-29-149-243.eu-west-1.compute.internal from BlockManagerMaster.
19/01/31 21:03:13 INFO BlockManagerMaster: Removed executors on host ip-172-29-149-243.eu-west-1.compute.internal successfully.
{
  "Name": "ScaleInContainerPendingRatio",
  "Description": "Scale in on ContainerPendingRatio",
  "Action": {
    "SimpleScalingPolicyConfiguration": {
      "AdjustmentType": "CHANGE_IN_CAPACITY",
      "ScalingAdjustment": -1,
      "CoolDown": 300
    }
  },
  "Trigger": {
    "CloudWatchAlarmDefinition": {
      "ComparisonOperator": "LESS_THAN_OR_EQUAL",
      "EvaluationPeriods": 3,
      "MetricName": "ContainerPendingRatio",
      "Namespace": "AWS/ElasticMapReduce",
      "Dimensions": [
        {
          "Value": "$${emr.clusterId}",
          "Key": "JobFlowId"
        }
      ],
      "Period": 300,
      "Statistic": "AVERAGE",
      "Threshold": 0,
      "Unit": "COUNT"
    }
  }
}
ContainerPendingRatio = containers pending / containers allocated
Is this really my cluster...?
{
  "Name": "ScaleInMemoryPercentage",
  "Description": "Scale in on YARNMemoryAvailablePercentage",
  "Action": {
    "SimpleScalingPolicyConfiguration": {
      "AdjustmentType": "CHANGE_IN_CAPACITY",
      "ScalingAdjustment": -2,
      "CoolDown": 300
    }
  },
  "Trigger": {
    "CloudWatchAlarmDefinition": {
      "ComparisonOperator": "GREATER_THAN",
      "EvaluationPeriods": 3,
      "MetricName": "YARNMemoryAvailablePercentage",
      "Namespace": "AWS/ElasticMapReduce",
      "Dimensions": [
        {
          "Key": "JobFlowId",
          "Value": "$${emr.clusterId}"
        }
      ],
      "Period": 300,
      "Statistic": "AVERAGE",
      "Threshold": 95.0,
      "Unit": "PERCENT"
    }
  }
}
9/10 - Data Skew
➔ A condition in which data is not uniformly distributed across partitions.
➔ Shows up during joins, aggregations etc.
➔ E.g. joining on a column containing many nulls.
➔ Might cause java.lang.OutOfMemoryError: Java heap space.
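One way to check for skew before a join is to count rows per join key and per physical partition. The sketch below is an assumption-laden illustration, not the speakers' tooling: the object and function names are hypothetical, and the `make`/`model` columns simply mirror the join used in the slides.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, desc, spark_partition_id}

object SkewCheck {
  // Rows per join key, largest first: a few dominant keys (or nulls) indicate skew.
  def rowsPerKey(df: DataFrame, keys: Seq[String]): DataFrame =
    df.groupBy(keys.map(k => col(k)): _*).count().orderBy(desc("count"))

  // Rows per physical partition, largest first.
  def rowsPerPartition(df: DataFrame): DataFrame =
    df.groupBy(spark_partition_id().as("partition")).count().orderBy(desc("count"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("skew-check")
      .master("local[*]") // local mode, for the sketch only
      .getOrCreate()
    import spark.implicits._

    val cars = Seq(
      ("Renault", "Clio"), ("Renault", "Clio"), ("Renault", "Clio"), ("Peugeot", "208")
    ).toDF("make", "model")

    rowsPerKey(cars, Seq("make", "model")).show()
    rowsPerPartition(cars).show()
    spark.stop()
  }
}
```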
df1.join(df2, Seq("make", "model"))
9/10 - Data Skew: possible solutions
➔ Repartition your data by key (RDD) or by column (DataFrame) so that it is evenly distributed.
➔ Use non-skewed column(s) for the join.
➔ Replace null values of the join column with NULL_X (X is a random number).
➔ Salting.
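The NULL_X replacement can be sketched as a column expression. This is one possible reading of the bullet, under stated assumptions: the object and function names are hypothetical, the key column is assumed to be a string, and no real key is assumed to start with `NULL_`. Because the replaced values match nothing on the other side, an outer join still returns those rows unmatched, but they no longer hash to a single partition.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat, lit, rand, when}

object NullSalting {
  // Replace null join keys with NULL_0 .. NULL_(n-1), chosen at random per row.
  def saltNullKeys(df: DataFrame, keyColumn: String, n: Int): DataFrame = {
    val suffix = (rand() * n).cast("int").cast("string")
    df.withColumn(
      keyColumn,
      when(col(keyColumn).isNull, concat(lit("NULL_"), suffix)).otherwise(col(keyColumn))
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("null-salting")
      .master("local[*]") // local mode, for the sketch only
      .getOrCreate()
    import spark.implicits._

    val cars = Seq[Option[String]](Some("Renault"), None, None, None).toDF("make")
    saltNullKeys(cars, "make", n = 10).show()
    spark.stop()
  }
}
```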
df1.join(df2, Seq("make", "model", "engine_size"))
9/10 - Impossible to find a repartitioning key for even data distribution?
Let’s sprinkle some SALT on data skew!
Salting key = Actual partition key + Random fake key
(where the fake key takes a value between 1 and N, with N being the level of distribution/partitions)
➔ Join DFs: create a salt column on the bigger DF and broadcast the smaller one (with an additional column containing 1 to N).
➔ If both are too big to broadcast: salt one and iteratively broadcast the other.
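A salted join along these lines could look like the sketch below. It is an illustration under assumptions, not the speakers' implementation: the object, function and column names are hypothetical, and "containing 1 to N" is read here as replicating each row of the smaller side once per salt value. Whether Spark then broadcasts the replicated side is left to the planner.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, explode, lit, rand}

object SaltedJoin {
  def saltedJoin(big: DataFrame, small: DataFrame, keys: Seq[String], n: Int): DataFrame = {
    // Bigger side: each row gets one random salt in 1..n.
    val saltedBig = big.withColumn("salt", (rand() * n).cast("int") + 1)
    // Smaller side: each row is replicated n times, once per salt value.
    val replicatedSmall =
      small.withColumn("salt", explode(array((1 to n).map(i => lit(i)): _*)))
    // Joining on key + salt spreads one hot key over up to n partitions.
    saltedBig.join(replicatedSmall, keys :+ "salt").drop("salt")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("salted-join")
      .master("local[*]") // local mode, for the sketch only
      .getOrCreate()
    import spark.implicits._

    val big = Seq(
      ("Renault", "Clio", 1), ("Renault", "Clio", 2), ("Renault", "Clio", 3), ("Peugeot", "208", 4)
    ).toDF("make", "model", "id")
    val small = Seq(("Renault", "Clio", "B"), ("Peugeot", "208", "B"))
      .toDF("make", "model", "segment")

    saltedJoin(big, small, Seq("make", "model"), n = 4).show()
    spark.stop()
  }
}
```

Each row of the bigger side still matches exactly one row of the smaller side, so the result is the same as the unsalted join; only the distribution of the work changes.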
10/10 - Data Locality
➔ Why is it important?
val sparkSession = SparkSession
  .builder()
  .appName("spark-app")
  .config("spark.locality.wait", "60s") // default 3s
  .config("spark.locality.wait.node", "0") // set to 0 to skip node-local
  .config("spark.locality.wait.process", "10s")
  .config("spark.locality.wait.rack", "30s")
  .getOrCreate()
Thank you... very much!
Slides: http://tiny.cc/tiq56y
@him_aro @nityany

Contenu connexe

Tendances

Debugging & Tuning in Spark
Debugging & Tuning in SparkDebugging & Tuning in Spark
Debugging & Tuning in SparkShiao-An Yuan
 
Out of the box replication in postgres 9.4(pg confus)
Out of the box replication in postgres 9.4(pg confus)Out of the box replication in postgres 9.4(pg confus)
Out of the box replication in postgres 9.4(pg confus)Denish Patel
 
Streaming replication in practice
Streaming replication in practiceStreaming replication in practice
Streaming replication in practiceAlexey Lesovsky
 
Ceph Performance and Sizing Guide
Ceph Performance and Sizing GuideCeph Performance and Sizing Guide
Ceph Performance and Sizing GuideJose De La Rosa
 
Out of the box replication in postgres 9.4
Out of the box replication in postgres 9.4Out of the box replication in postgres 9.4
Out of the box replication in postgres 9.4Denish Patel
 
HBaseCon 2013: OpenTSDB at Box
HBaseCon 2013: OpenTSDB at BoxHBaseCon 2013: OpenTSDB at Box
HBaseCon 2013: OpenTSDB at BoxCloudera, Inc.
 
GlusterFS As an Object Storage
GlusterFS As an Object StorageGlusterFS As an Object Storage
GlusterFS As an Object StorageKeisuke Takahashi
 
Advanced Postgres Monitoring
Advanced Postgres MonitoringAdvanced Postgres Monitoring
Advanced Postgres MonitoringDenish Patel
 
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...ScaleGrid.io
 
PostgreSQL Streaming Replication Cheatsheet
PostgreSQL Streaming Replication CheatsheetPostgreSQL Streaming Replication Cheatsheet
PostgreSQL Streaming Replication CheatsheetAlexey Lesovsky
 
Your 1st Ceph cluster
Your 1st Ceph clusterYour 1st Ceph cluster
Your 1st Ceph clusterMirantis
 
Red Hat Enterprise Linux OpenStack Platform on Inktank Ceph Enterprise
Red Hat Enterprise Linux OpenStack Platform on Inktank Ceph EnterpriseRed Hat Enterprise Linux OpenStack Platform on Inktank Ceph Enterprise
Red Hat Enterprise Linux OpenStack Platform on Inktank Ceph EnterpriseRed_Hat_Storage
 
Advanced Apache Cassandra Operations with JMX
Advanced Apache Cassandra Operations with JMXAdvanced Apache Cassandra Operations with JMX
Advanced Apache Cassandra Operations with JMXzznate
 
PGConf.ASIA 2019 - High Availability, 10 Seconds Failover - Lucky Haryadi
PGConf.ASIA 2019 - High Availability, 10 Seconds Failover - Lucky HaryadiPGConf.ASIA 2019 - High Availability, 10 Seconds Failover - Lucky Haryadi
PGConf.ASIA 2019 - High Availability, 10 Seconds Failover - Lucky HaryadiEqunix Business Solutions
 
Cassandra Community Webinar | In Case of Emergency Break Glass
Cassandra Community Webinar | In Case of Emergency Break GlassCassandra Community Webinar | In Case of Emergency Break Glass
Cassandra Community Webinar | In Case of Emergency Break GlassDataStax
 

Tendances (19)

Debugging & Tuning in Spark
Debugging & Tuning in SparkDebugging & Tuning in Spark
Debugging & Tuning in Spark
 
Out of the box replication in postgres 9.4(pg confus)
Out of the box replication in postgres 9.4(pg confus)Out of the box replication in postgres 9.4(pg confus)
Out of the box replication in postgres 9.4(pg confus)
 
Streaming replication in practice
Streaming replication in practiceStreaming replication in practice
Streaming replication in practice
 
Ceph Performance and Sizing Guide
Ceph Performance and Sizing GuideCeph Performance and Sizing Guide
Ceph Performance and Sizing Guide
 
Out of the box replication in postgres 9.4
Out of the box replication in postgres 9.4Out of the box replication in postgres 9.4
Out of the box replication in postgres 9.4
 
HBaseCon 2013: OpenTSDB at Box
HBaseCon 2013: OpenTSDB at BoxHBaseCon 2013: OpenTSDB at Box
HBaseCon 2013: OpenTSDB at Box
 
GlusterFS As an Object Storage
GlusterFS As an Object StorageGlusterFS As an Object Storage
GlusterFS As an Object Storage
 
Advanced Postgres Monitoring
Advanced Postgres MonitoringAdvanced Postgres Monitoring
Advanced Postgres Monitoring
 
Shootout at the PAAS Corral
Shootout at the PAAS CorralShootout at the PAAS Corral
Shootout at the PAAS Corral
 
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...
 
PostgreSQL Streaming Replication Cheatsheet
PostgreSQL Streaming Replication CheatsheetPostgreSQL Streaming Replication Cheatsheet
PostgreSQL Streaming Replication Cheatsheet
 
Your 1st Ceph cluster
Your 1st Ceph clusterYour 1st Ceph cluster
Your 1st Ceph cluster
 
Red Hat Enterprise Linux OpenStack Platform on Inktank Ceph Enterprise
Red Hat Enterprise Linux OpenStack Platform on Inktank Ceph EnterpriseRed Hat Enterprise Linux OpenStack Platform on Inktank Ceph Enterprise
Red Hat Enterprise Linux OpenStack Platform on Inktank Ceph Enterprise
 
Advanced Apache Cassandra Operations with JMX
Advanced Apache Cassandra Operations with JMXAdvanced Apache Cassandra Operations with JMX
Advanced Apache Cassandra Operations with JMX
 
PGConf.ASIA 2019 - High Availability, 10 Seconds Failover - Lucky Haryadi
PGConf.ASIA 2019 - High Availability, 10 Seconds Failover - Lucky HaryadiPGConf.ASIA 2019 - High Availability, 10 Seconds Failover - Lucky Haryadi
PGConf.ASIA 2019 - High Availability, 10 Seconds Failover - Lucky Haryadi
 
7 Ways To Crash Postgres
7 Ways To Crash Postgres7 Ways To Crash Postgres
7 Ways To Crash Postgres
 
PostgreSQL and RAM usage
PostgreSQL and RAM usagePostgreSQL and RAM usage
PostgreSQL and RAM usage
 
Cassandra Community Webinar | In Case of Emergency Break Glass
Cassandra Community Webinar | In Case of Emergency Break GlassCassandra Community Webinar | In Case of Emergency Break Glass
Cassandra Community Webinar | In Case of Emergency Break Glass
 
Ceph issue 해결 사례
Ceph issue 해결 사례Ceph issue 해결 사례
Ceph issue 해결 사례
 

Similaire à 10 things i wish i'd known before using spark in production

Building Apache Cassandra clusters for massive scale
Building Apache Cassandra clusters for massive scaleBuilding Apache Cassandra clusters for massive scale
Building Apache Cassandra clusters for massive scaleAlex Thompson
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Databricks
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudRevolution Analytics
 
16 artifacts to capture when there is a production problem
16 artifacts to capture when there is a production problem16 artifacts to capture when there is a production problem
16 artifacts to capture when there is a production problemTier1 app
 
Spark Performance Tuning .pdf
Spark Performance Tuning .pdfSpark Performance Tuning .pdf
Spark Performance Tuning .pdfAmit Raj
 
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...Gianmario Spacagna
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPDatabricks
 
DevoxxUK: Optimizating Application Performance on Kubernetes
DevoxxUK: Optimizating Application Performance on KubernetesDevoxxUK: Optimizating Application Performance on Kubernetes
DevoxxUK: Optimizating Application Performance on KubernetesDinakar Guniguntala
 
Tuning parallelcodeonsolaris005
Tuning parallelcodeonsolaris005Tuning parallelcodeonsolaris005
Tuning parallelcodeonsolaris005dflexer
 
import rdma: zero-copy networking with RDMA and Python
import rdma: zero-copy networking with RDMA and Pythonimport rdma: zero-copy networking with RDMA and Python
import rdma: zero-copy networking with RDMA and Pythongroveronline
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...Databricks
 
‘16 artifacts’ to capture when there is a production problem
‘16 artifacts’ to capture when there is a production problem‘16 artifacts’ to capture when there is a production problem
‘16 artifacts’ to capture when there is a production problemTier1 app
 
DUG'20: 12 - DAOS in Lenovo’s HPC Innovation Center
DUG'20: 12 - DAOS in Lenovo’s HPC Innovation CenterDUG'20: 12 - DAOS in Lenovo’s HPC Innovation Center
DUG'20: 12 - DAOS in Lenovo’s HPC Innovation CenterAndrey Kudryavtsev
 
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016Zabbix
 
Puppet at Opera Sofware - PuppetCamp Oslo 2013
Puppet at Opera Sofware - PuppetCamp Oslo 2013Puppet at Opera Sofware - PuppetCamp Oslo 2013
Puppet at Opera Sofware - PuppetCamp Oslo 2013Cosimo Streppone
 

Similaire à 10 things i wish i'd known before using spark in production (20)

Building Apache Cassandra clusters for massive scale
Building Apache Cassandra clusters for massive scaleBuilding Apache Cassandra clusters for massive scale
Building Apache Cassandra clusters for massive scale
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
 
20160908 hivemall meetup
20160908 hivemall meetup20160908 hivemall meetup
20160908 hivemall meetup
 
GR740 User day
GR740 User dayGR740 User day
GR740 User day
 
16 artifacts to capture when there is a production problem
16 artifacts to capture when there is a production problem16 artifacts to capture when there is a production problem
16 artifacts to capture when there is a production problem
 
Spark Performance Tuning .pdf
Spark Performance Tuning .pdfSpark Performance Tuning .pdf
Spark Performance Tuning .pdf
 
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
 
DevoxxUK: Optimizating Application Performance on Kubernetes
DevoxxUK: Optimizating Application Performance on KubernetesDevoxxUK: Optimizating Application Performance on Kubernetes
DevoxxUK: Optimizating Application Performance on Kubernetes
 
Tuning parallelcodeonsolaris005
Tuning parallelcodeonsolaris005Tuning parallelcodeonsolaris005
Tuning parallelcodeonsolaris005
 
import rdma: zero-copy networking with RDMA and Python
import rdma: zero-copy networking with RDMA and Pythonimport rdma: zero-copy networking with RDMA and Python
import rdma: zero-copy networking with RDMA and Python
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 
‘16 artifacts’ to capture when there is a production problem
‘16 artifacts’ to capture when there is a production problem‘16 artifacts’ to capture when there is a production problem
‘16 artifacts’ to capture when there is a production problem
 
DUG'20: 12 - DAOS in Lenovo’s HPC Innovation Center
DUG'20: 12 - DAOS in Lenovo’s HPC Innovation CenterDUG'20: 12 - DAOS in Lenovo’s HPC Innovation Center
DUG'20: 12 - DAOS in Lenovo’s HPC Innovation Center
 
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
 
Evolution of Spark APIs
Evolution of Spark APIsEvolution of Spark APIs
Evolution of Spark APIs
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
 
The Accidental DBA
The Accidental DBAThe Accidental DBA
The Accidental DBA
 
Puppet at Opera Sofware - PuppetCamp Oslo 2013
Puppet at Opera Sofware - PuppetCamp Oslo 2013Puppet at Opera Sofware - PuppetCamp Oslo 2013
Puppet at Opera Sofware - PuppetCamp Oslo 2013
 

Plus de Paris Data Engineers !

Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardParis Data Engineers !
 
REX : pourquoi et comment développer son propre scheduler
REX : pourquoi et comment développer son propre schedulerREX : pourquoi et comment développer son propre scheduler
REX : pourquoi et comment développer son propre schedulerParis Data Engineers !
 
Utilisation de MLflow pour le cycle de vie des projet Machine learning
Utilisation de MLflow pour le cycle de vie des projet Machine learningUtilisation de MLflow pour le cycle de vie des projet Machine learning
Utilisation de MLflow pour le cycle de vie des projet Machine learningParis Data Engineers !
 
Change Data Capture with Data Collector @OVH
Change Data Capture with Data Collector @OVHChange Data Capture with Data Collector @OVH
Change Data Capture with Data Collector @OVHParis Data Engineers !
 
Building highly reliable data pipeline @datadog par Quentin François
Building highly reliable data pipeline @datadog par Quentin FrançoisBuilding highly reliable data pipeline @datadog par Quentin François
Building highly reliable data pipeline @datadog par Quentin FrançoisParis Data Engineers !
 
Scala pour le Data Engineering par Jonathan Winandy
Scala pour le Data Engineering par Jonathan WinandyScala pour le Data Engineering par Jonathan Winandy
Scala pour le Data Engineering par Jonathan WinandyParis Data Engineers !
 

Plus de Paris Data Engineers ! (11)

Spark tools by Jonathan Winandy
Spark tools by Jonathan WinandySpark tools by Jonathan Winandy
Spark tools by Jonathan Winandy
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
 
SCIO : Apache Beam API
SCIO : Apache Beam APISCIO : Apache Beam API
SCIO : Apache Beam API
 
Apache Beam de A à Z
 Apache Beam de A à Z Apache Beam de A à Z
Apache Beam de A à Z
 
REX : pourquoi et comment développer son propre scheduler
REX : pourquoi et comment développer son propre schedulerREX : pourquoi et comment développer son propre scheduler
REX : pourquoi et comment développer son propre scheduler
 
Deeplearning in production
Deeplearning in productionDeeplearning in production
Deeplearning in production
 
Utilisation de MLflow pour le cycle de vie des projet Machine learning
Utilisation de MLflow pour le cycle de vie des projet Machine learningUtilisation de MLflow pour le cycle de vie des projet Machine learning
Utilisation de MLflow pour le cycle de vie des projet Machine learning
 
Introduction à Apache Pulsar
 Introduction à Apache Pulsar Introduction à Apache Pulsar
Introduction à Apache Pulsar
 
Change Data Capture with Data Collector @OVH
Change Data Capture with Data Collector @OVHChange Data Capture with Data Collector @OVH
Change Data Capture with Data Collector @OVH
 
Building highly reliable data pipeline @datadog par Quentin François
Building highly reliable data pipeline @datadog par Quentin FrançoisBuilding highly reliable data pipeline @datadog par Quentin François
Building highly reliable data pipeline @datadog par Quentin François
 
Scala pour le Data Engineering par Jonathan Winandy
Scala pour le Data Engineering par Jonathan WinandyScala pour le Data Engineering par Jonathan Winandy
Scala pour le Data Engineering par Jonathan Winandy
 

Dernier

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Presentation on how to chat with PDF using ChatGPT code interpreter

10 things I wish I'd known before using Spark in production

  • 1. “10 things I wish I'd known before using Spark in production!”
  • 2. Himanshu Arora Lead Data Engineer, NeoLynk France h.arora@neolynk.fr @him_aro Nitya Nand Yadav Data Engineer, NeoLynk France n.yadav@neolynk.fr @nityany
  • 3. Partner. Follow the latest news from our JVM, PHP, JS and Agile tribes on our social networks: JVM
  • 4.
  • 5. What we are going to cover...
      1. RDD vs DataFrame vs DataSet
      2. Data Serialisation Formats
      3. Storage formats
      4. Broadcast join
      5. Hardware tuning
      6. Level of parallelism
      7. GC tuning
      8. Common errors
      9. Data skew
      10. Data locality
  • 6. 1/10 - RDD vs DataFrames vs DataSets
  • 7. ● RDD - Resilient Distributed Dataset
      ➔ Main abstraction of Spark.
      ➔ Low-level transformations, actions and control at the partition level.
      ➔ Suited to unstructured data such as media streams and text streams.
      ➔ Manipulate data with functional programming constructs.
      ➔ No built-in optimization.
  • 8. ● DataFrame
      ➔ High-level abstraction with rich semantics.
      ➔ Like a big distributed SQL table.
      ➔ High-level expressions (aggregation, average, sum, SQL queries).
      ➔ Performance and optimizations (predicate pushdown, QBO, CBO...).
      ➔ No compile-time type checking; errors surface at runtime.
  • 9. ● DataSet
      ➔ A collection of strongly-typed JVM objects, dictated by a case class you define in Scala or a class in Java.
      ➔ DataFrame = DataSet[Row].
      ➔ Performance and optimisations.
      ➔ Type safety at compile time.
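The compile-time versus runtime difference between DataSet and DataFrame can be sketched as follows. This is a minimal illustration, not code from the talk: the `Trade` case class, the object name and the local-mode session are assumptions made for the example.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type, used only for illustration.
final case class Trade(symbol: String, quantity: Long)

object TypedVsUntyped {
  // Typed API: a misspelled field (e.g. _.quantty) is rejected by the compiler.
  def largeTrades(trades: Dataset[Trade]): Dataset[Trade] =
    trades.filter(_.quantity > 100L)

  // Untyped API: a misspelled column name (e.g. "quantty") compiles fine
  // and only fails when the query is analysed at runtime.
  def largeTradesUntyped(trades: DataFrame): DataFrame =
    trades.filter(trades("quantity") > 100L)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // local mode, for the example only
      .appName("typed-vs-untyped")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(Trade("AAA", 50L), Trade("BBB", 500L)).toDS()
    largeTrades(ds).show()
    largeTradesUntyped(ds.toDF()).show()
    spark.stop()
  }
}
```

Both functions go through the same optimizer; the typed lambda is opaque to it, so the column-based filter is usually the easier one for Spark to push down.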
  • 10. 2/10 - Data Serialisation Format
      ➔ Data is shuffled between executors in serialized form.
      ➔ RDDs cached or persisted to disk are serialized too.
      ➔ Spark's default serialization format: Java serialization (slow and large).
      ➔ Better: Kryo serialisation.
      ➔ Kryo: faster and more compact (up to 10x).
      ➔ DataFrames/DataSets use Tungsten serialization (even better than Kryo).
  • 11. 2/10 - Data Serialisation Format
      import org.apache.spark.SparkConf
      import org.apache.spark.sql.SparkSession
      val sparkConf: SparkConf = new SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        // register your own custom classes with Kryo, before the session is created
        .registerKryoClasses(Array(classOf[MyCustomClass]))
      val sparkSession: SparkSession = SparkSession
        .builder()
        .config(sparkConf)
        .getOrCreate()
  • 12. 3/10 - Storage Formats
  • 13. 3/10 - Storage Formats
      ➔ Avoid text, JSON and CSV where possible.
      ➔ Use compressed binary formats instead.
      ➔ Popular choices: Apache Parquet, Apache Avro, ORC, etc.
      ➔ The use case dictates the choice.
  • 14. 3/10 - Storage Formats: Apache Parquet & Avro
      ➔ Binary formats.
      ➔ Splittable.
      ➔ Parquet is columnar; Avro is row-based.
      ➔ Parquet: higher compression ratios than row-based formats.
      ➔ Parquet suits read-heavy workloads; Avro suits write-heavy workloads.
      ➔ The schema is stored in the files themselves.
      ➔ Avro: better support for schema evolution.
  • 15. 3/10 - Storage Formats
      // Parquet
      val parquetConf: SparkConf = new SparkConf()
        .set("spark.sql.parquet.compression.codec", "snappy")
      val parquetDf = sparkSession.read.parquet("s3a://....")
      parquetDf.write.parquet("s3a://....")
      // Avro (.avro(...) assumes the spark-avro package implicits are imported)
      val avroConf: SparkConf = new SparkConf()
        .set("spark.sql.avro.compression.codec", "snappy")
      val avroDf = sparkSession.read.avro("s3a://....")
      avroDf.write.avro("s3a://....")
  • 16. 3/10 - Benchmark: using Avro instead of JSON
  • 17. 4/10 - Broadcast Join
  • 18. 4/10 - Broadcast Join
      import org.apache.spark.sql.functions.broadcast
      // Spark automatically broadcasts small DataFrames (max. 10 MB by default)
      val sparkConf: SparkConf = new SparkConf()
        .set("spark.sql.autoBroadcastJoinThreshold", "2147483648") // in bytes (2 GB)
        .set("spark.sql.broadcastTimeout", "900") // in seconds; default is 300
      /* Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
         java.util.concurrent.TimeoutException: Futures timed out after [300 seconds] */
      // force a broadcast
      val result = bigDataFrame.join(broadcast(smallDataFrame))
  • 19. 5/10 - Hardware Tuning
      Know your cluster:
      ● Number of nodes
      ● Cores per node
      ● RAM per node
      ● Cluster manager (YARN, Mesos…)
      Let's assume:
      ● 5 nodes
      ● 16 cores each
      ● 64 GB RAM each
      ● YARN as resource manager
      ● Spark in client mode
  • 20. 5/10 - Hardware Tuning / Scenario #1 (Small executors)
      --num-executors = 80 // 16 cores x 5 nodes
      --executor-cores = 1
      --executor-memory = 4GB // 64 GB / 16 executors per node
      ➔ No multiple tasks running in the same JVM (so no sharing of broadcast variables, accumulators…).
      ➔ Risk of running out of memory when computing a partition.
  • 21. 5/10 - Hardware Tuning / Scenario #2 (Large executors)
      --num-executors = 5
      --executor-cores = 16
      --executor-memory = 64GB
      ➔ Very long garbage collection pauses.
      ➔ Poor performance with HDFS (handling many concurrent threads).
  • 22. 5/10 - Hardware Tuning / Scenario #3 (Right balance)
      --executor-cores = 5
      --num-executors = 14 // 15 usable cores per node / 5 cores per executor = 3 executors per node; 3 x 5 nodes - 1 = 14
      --executor-memory = 18GB // 64 GB / 3 executors per node, minus ~10% overhead
      ➔ The recommended number of concurrent threads for HDFS is 5.
      ➔ Always leave one core per node for the YARN daemons.
      ➔ Always leave one executor slot for the YARN ApplicationMaster.
      ➔ Off-heap memory overhead for YARN = 10% of executor memory.
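The arithmetic behind scenario #3 can be written out as a small helper. This is only a sketch of the rules of thumb listed on the slide; the object, case class and parameter names are hypothetical, and the 5-cores-per-executor and 10% overhead defaults are the figures from the slide rather than anything Spark enforces.

```scala
object ExecutorSizing {
  // Hypothetical result type for the sketch.
  final case class Plan(executorsPerNode: Int, numExecutors: Int, executorMemoryGb: Int)

  def plan(nodes: Int,
           coresPerNode: Int,
           ramPerNodeGb: Int,
           coresPerExecutor: Int = 5,
           overheadFraction: Double = 0.10): Plan = {
    // Leave one core per node for the YARN daemons.
    val usableCores = coresPerNode - 1
    val executorsPerNode = usableCores / coresPerExecutor
    // Leave one executor slot for the YARN ApplicationMaster.
    val numExecutors = executorsPerNode * nodes - 1
    // Split node memory between executors, then subtract the off-heap overhead.
    val memoryPerExecutorGb = ramPerNodeGb.toDouble / executorsPerNode
    val heapGb = math.floor(memoryPerExecutorGb * (1.0 - overheadFraction)).toInt
    Plan(executorsPerNode, numExecutors, heapGb)
  }

  def main(args: Array[String]): Unit =
    println(plan(nodes = 5, coresPerNode = 16, ramPerNodeGb = 64))
}
```

For the cluster assumed in the talk, this yields 3 executors per node and 14 executors in total. The memory figure it computes is an upper bound; the slide's 18 GB leaves some extra headroom below it, for example for the operating system.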
  • 23. 5/10 - Benchmark
      ➔ Hardware tuning.
      ➔ Moving from the Java serializer to Kryo.
  • 24. 6/10 - Level of parallelism/partitions
      rdd = sc.textFile('demo.zip')
      rdd = rdd.repartition(100)
      ➔ The maximum size of a partition is limited by the memory available to an executor.
      ➔ Increasing the partition count leaves less data in each partition.
      ➔ Spark cannot split compressed files (e.g. zip) and creates only one partition for them, so repartition explicitly.
  • 25. 7/10 - GC Tuning
      ➔ Quick wins to avoid long GC pauses when using a large JVM heap.
      spark.executor.extraJavaOptions: -XX:+UseG1GC -XX:+AlwaysPreTouch -XX:+UseLargePages -XX:+UseTLAB -XX:+ResizeTLAB
      // if creating too many objects in the driver (e.g. collect()),
      // which is not a very good idea though
      spark.driver.extraJavaOptions: -XX:+UseG1GC -XX:+AlwaysPreTouch -XX:+UseLargePages -XX:+UseTLAB -XX:+ResizeTLAB
  • 26. 8/10 - Knock knock… Who’s there?… An error :(
      Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used.
  • 27. 8/10 - Knock knock… Who’s there?… An error :(
      Possible causes:
      ➔ Not enough executor memory.
      ➔ Too many executor cores (implies too much parallelism).
      ➔ Not enough Spark partitions.
      ➔ Data skew (let’s talk about that later…).
      Possible fixes:
      ➔ Increase executor memory.
      ➔ Reduce the number of executor cores.
      ➔ Increase the number of Spark partitions.
      ➔ Persist in memory and disk (or just disk) with serialization.
      ➔ Use off-heap memory for caching.
  • 28. 8/10 - Knock knock… Who’s there?… An error :(
      19/01/31 21:03:13 INFO DAGScheduler: Host lost: ip-172-29-149-243.eu-west-1.compute.internal (epoch 16)
      19/01/31 21:03:13 INFO BlockManagerMasterEndpoint: Trying to remove executors on host ip-172-29-149-243.eu-west-1.compute.internal from BlockManagerMaster.
      19/01/31 21:03:13 INFO BlockManagerMaster: Removed executors on host ip-172-29-149-243.eu-west-1.compute.internal successfully.
  • 29. Containers pending / containers allocated
      {
        "Name": "ScaleInContainerPendingRatio",
        "Description": "Scale in on ContainerPendingRatio",
        "Action": {
          "SimpleScalingPolicyConfiguration": {
            "AdjustmentType": "CHANGE_IN_CAPACITY",
            "ScalingAdjustment": -1,
            "CoolDown": 300
          }
        },
        "Trigger": {
          "CloudWatchAlarmDefinition": {
            "ComparisonOperator": "LESS_THAN_OR_EQUAL",
            "EvaluationPeriods": 3,
            "MetricName": "ContainerPendingRatio",
            "Namespace": "AWS/ElasticMapReduce",
            "Dimensions": [ { "Value": "$${emr.clusterId}", "Key": "JobFlowId" } ],
            "Period": 300,
            "Statistic": "AVERAGE",
            "Threshold": 0,
            "Unit": "COUNT"
          }
        }
      }
  • 30. Is this really my cluster...?
  • 31. {
        "Name": "ScaleInMemoryPercentage",
        "Description": "Scale in on YARNMemoryAvailablePercentage",
        "Action": {
          "SimpleScalingPolicyConfiguration": {
            "AdjustmentType": "CHANGE_IN_CAPACITY",
            "ScalingAdjustment": -2,
            "CoolDown": 300
          }
        },
        "Trigger": {
          "CloudWatchAlarmDefinition": {
            "ComparisonOperator": "GREATER_THAN",
            "EvaluationPeriods": 3,
            "MetricName": "YARNMemoryAvailablePercentage",
            "Namespace": "AWS/ElasticMapReduce",
            "Dimensions": [ { "Key": "JobFlowId", "Value": "$${emr.clusterId}" } ],
            "Period": 300,
            "Statistic": "AVERAGE",
            "Threshold": 95.0,
            "Unit": "PERCENT"
          }
        }
      }
  • 32. 9/10 - Data Skew
      ➔ A condition in which data is not uniformly distributed across partitions.
      ➔ Shows up during joins, aggregations, etc.
      ➔ E.g. joining on a column containing many nulls.
      ➔ Might cause java.lang.OutOfMemoryError: Java heap space.
  • 34. 9/10 - Data Skew: possible solutions
      ➔ Repartition your data by key (RDD) or column (DataFrame) so that it is evenly distributed.
      ➔ Use non-skewed column(s) for the join.
      ➔ Replace null values in the join column with NULL_X (where X is a random number).
      ➔ Salting.
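The NULL_X replacement could look roughly like the following. This is a sketch under assumptions: the object and function names are hypothetical, the key column is assumed to be string-typed, and the `NULL_` prefix is assumed not to collide with any real key value.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat, floor, lit, rand, when}

object NullKeySalting {
  // Replaces null values in a string key column with NULL_0 .. NULL_(buckets-1),
  // so that rows with a null key no longer all land in the same partition.
  def spreadNullKeys(df: DataFrame, keyCol: String, buckets: Int): DataFrame =
    df.withColumn(
      keyCol,
      when(col(keyCol).isNull,
        concat(lit("NULL_"), floor(rand() * buckets).cast("string")))
        .otherwise(col(keyCol))
    )
}
```

Since the generated values match nothing on the other side of the join, these rows behave like the original nulls in the result; after an outer join they can be mapped back to null if the downstream logic expects that.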
  • 36. Let’s sprinkle some SALT on data skew!
  • 37. 9/10 - Impossible to find a repartitioning key that distributes the data evenly?
      Salting key = actual partition key + random fake key (where the fake key takes a value between 1 and N, N being the level of distribution/number of partitions)
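The effect of the salting key can be illustrated without Spark, using plain Scala collections. This is a sketch only: the object and function names are hypothetical, and the `_` separator between key and salt is an arbitrary choice.

```scala
import scala.util.Random

object SaltingSketch {
  // Salting key = actual key + random fake key in [1, n].
  def salt(key: String, n: Int, rng: Random): String =
    s"${key}_${rng.nextInt(n) + 1}"

  // Number of records per key, i.e. how much data each key would pull together.
  def countByKey(keys: Seq[String]): Map[String, Int] =
    keys.groupBy(identity).map { case (k, v) => k -> v.size }

  def main(args: Array[String]): Unit = {
    val rng = new Random(42)
    // One "hot" key dominates the dataset.
    val skewed = Seq.fill(1000)("hot") ++ Seq.fill(10)("cold")
    val salted = skewed.map(k => salt(k, 8, rng))

    println(s"largest group before salting: ${countByKey(skewed).values.max}")
    println(s"largest group after salting:  ${countByKey(salted).values.max}")
  }
}
```

With N = 8, the single hot key becomes up to eight distinct keys, so the largest group shrinks to roughly an eighth of its original size.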
  • 38. ➔ Joining DataFrames: create a salt column on the bigger DataFrame and broadcast the smaller one (with an additional column containing 1 to N).
      ➔ If both are too big to broadcast: salt one and iteratively broadcast the other.
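A salted join along these lines might be sketched as below. This is one possible reading of the slide, not the speakers' code: the object name, the `salt` column name (assumed not to exist in either input) and the single-column join key are assumptions.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, broadcast, explode, floor, lit, rand}

object SaltedJoin {
  def join(big: DataFrame, small: DataFrame, key: String, n: Int): DataFrame = {
    // Bigger side: one random salt in [1, n] per row.
    val saltedBig = big.withColumn("salt", (floor(rand() * n) + 1).cast("int"))

    // Smaller side: replicate every row n times, once per salt value 1..n,
    // so that each salted row on the big side still finds its match.
    val saltValues = array((1 to n).map(i => lit(i)): _*)
    val saltedSmall = small.withColumn("salt", explode(saltValues))

    // Join on the original key plus the salt, then drop the helper column.
    saltedBig.join(broadcast(saltedSmall), Seq(key, "salt")).drop("salt")
  }
}
```

The smaller side grows by a factor of N, so N is a trade-off between how evenly the hot keys are spread and how much data has to be broadcast.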
  • 39. 10/10 - Data Locality
      ➔ Why is it important?
  • 40. 10/10 - Data Locality
      val sparkSession = SparkSession
        .builder()
        .appName("spark-app")
        .config("spark.locality.wait", "60s") // default 3s
        .config("spark.locality.wait.node", "0") // set to 0 to skip node-local
        .config("spark.locality.wait.process", "10s")
        .config("spark.locality.wait.rack", "30s")
        .getOrCreate()
  • 42. Thank you very much! Slides: http://tiny.cc/tiq56y @him_aro @nityany