This document provides tips and best practices for debugging and tuning Spark applications. It discusses Spark concepts like RDDs, transformations, actions, and the DAG execution model. It then gives recommendations for improving correctness, reducing the overhead of parallelism, avoiding data skew, and tuning configurations such as storage level, number of partitions, executor resources, and joins. Common failures are analyzed along with their causes and fixes. Overall, it emphasizes the importance of tuning partitioning, avoiding shuffles when possible, and using the right configurations to optimize Spark jobs.
17. Prefer reduceByKey() over groupByKey()
● reduceByKey() combines output
before shuffling the data
● Also consider aggregateByKey()
● Use groupByKey() only if you really
know what you are doing
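The difference is easiest to see with a local sketch (plain Scala collections, not a running Spark job): reduceByKey() pre-aggregates each map-side partition before its records cross the network, while groupByKey() ships every record as-is.

```scala
// Local analogy of map-side combine. The partition data below is made up
// for illustration; Spark itself does this per task before the shuffle.
object CombineSketch {
  // Two hypothetical map-side partitions of (word, 1) pairs.
  val partitions: Seq[Seq[(String, Int)]] = Seq(
    Seq(("a", 1), ("b", 1), ("a", 1), ("a", 1)),
    Seq(("b", 1), ("a", 1), ("b", 1))
  )

  // groupByKey-style: every record is shuffled unchanged.
  val shuffledWithoutCombine: Int = partitions.map(_.size).sum  // 7 records

  // reduceByKey-style: combine within each partition first, then shuffle.
  val combined: Seq[Seq[(String, Int)]] =
    partitions.map(_.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }.toSeq)
  val shuffledWithCombine: Int = combined.map(_.size).sum       // 4 records

  // The reduce-side merge produces the same totals either way.
  val totals: Map[String, Int] =
    combined.flatten.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
}
```

With skewed keys the gap between the two shuffle sizes grows much larger, which is why groupByKey() is the classic cause of shuffle-side OOMs.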
21. Join
● partitionBy()
● repartitionAndSortWithinPartitions()
● spark.sql.autoBroadcastJoinThreshold (default 10 MB)
● Join it manually by mapPartitions()
○ Broadcast small RDD
■ http://stackoverflow.com/a/17690254/406803
○ Query data from database
■ https://groups.google.com/a/lists.datastax.com/d/topic/spark-connector-user/63ILfPqPRYI/discussion
22. Broadcast Small RDD
val smallRdd = ...
val largeRdd = ...
// Collect the small RDD as a map on the driver and broadcast it to every executor.
val smallBroadcast = sc.broadcast(smallRdd.collectAsMap())
// Map-side join: only keys present in the small map are kept; no shuffle of largeRdd.
val joined = largeRdd.mapPartitions(iter => {
  val m = smallBroadcast.value
  for {
    (k, v) <- iter
    if m.contains(k)
  } yield (k, (v, m(k)))  // m(k) is safe here: guarded by m.contains(k)
}, preservesPartitioning = true)
23. Query Data from Cassandra
val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "127.0.0.1")
val connector = CassandraConnector(conf)
val joined = rdd.mapPartitions(iter => {
  connector.withSessionDo(session => {
    // Prepare the statement once per partition, then look up each key.
    val stmt = session.prepare("SELECT value FROM table WHERE key=?")
    iter.map {
      case (k, v) => (k, (v, session.execute(stmt.bind(k)).one()))
    }
  })
})
28. java.io.IOException: No space left on device
● SPARK_WORKER_DIR
● SPARK_LOCAL_DIRS, spark.local.dir
● Shuffle files
○ Only deleted after the RDD object has been garbage-collected
29. Other Tips
● Event logs
○ spark.eventLog.enabled=true
○ ${SPARK_HOME}/sbin/start-history-server.sh
30. Partitions
● Rule of thumb: ~128 MB per partition
● If #partitions is just under 2000, bump it to just over 2000 (Spark switches to a much more compact shuffle map status above 2000 partitions)
● Increase #partitions by repartition()
● Decrease #partitions by coalesce()
● spark.sql.shuffle.partitions (default 200)
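The two rules above can be folded into a small helper (illustrative only, not a Spark API; the 1900 "close enough" cutoff is an arbitrary assumption):

```scala
// Suggest a partition count from total input size: target ~128 MB per
// partition, and if the result lands just under 2000, push it past 2000
// so Spark uses its compact shuffle map status.
object PartitionRule {
  def suggested(totalBytes: Long, targetBytes: Long = 128L << 20): Long = {
    val bySize = math.max(1L, math.ceil(totalBytes.toDouble / targetBytes).toLong)
    if (bySize > 1900 && bySize <= 2000) 2001L else bySize  // 1900 is a made-up threshold
  }
}
```

For example, a 1 TB input works out to 8192 partitions of ~128 MB each, while a 250 GB input (exactly 2000 partitions by size) gets bumped to 2001.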
http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications
31. Executors, Cores, Memory!?
● 32 nodes
● 16 cores each
● 64 GB of RAM each
● If you have an application that needs 32 cores, what is the
correct setting?
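One reasonable answer (a sketch, not the only correct one): avoid both extremes. Thirty-two one-core executors pay JVM and broadcast overhead thirty-two times, while very fat executors suffer in GC and HDFS throughput. A middle ground splits the 32 cores into a handful of medium executors, leaving each 64 GB node headroom for the OS, daemons, and off-heap overhead; the exact memory figure below is an assumption, not a prescription.

```scala
import org.apache.spark.SparkConf

// 8 executors x 4 cores = 32 cores total, each executor well within a
// 16-core / 64 GB node. "12g" is an illustrative value chosen to leave
// room for memory overhead and other processes on the node.
val conf = new SparkConf()
  .set("spark.executor.instances", "8")
  .set("spark.executor.cores", "4")
  .set("spark.executor.memory", "12g")
```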
32. Why Is Spark Debugging / Tuning Hard?
● Distributed
● Lazy evaluation
● Hard to benchmark
● Spark is sensitive to configuration
33. Conclusion
● When in doubt, repartition!
● Avoid shuffle if you can
● Choose a reasonable partition count
● Premature optimization is the root of all evil -- Donald Knuth
34. Reference
● Tuning and Debugging in Apache Spark
● Top 5 Mistakes to Avoid When Writing Apache Spark Applications
● How-to: Tune Your Apache Spark Jobs (Part 1)
● How-to: Tune Your Apache Spark Jobs (Part 2)