One of the common requests we receive from customers at Qubole is help with debugging slow Spark applications. This is usually done by trial and error, which takes time and requires running clusters beyond normal usage (read: wasted resources). Moreover, it doesn't tell us where to look for further improvements. We at Qubole are working to make this process more self-serve. Towards this goal we have built Sparklens (open source: https://github.com/qubole/sparklens), a tool based on the Spark event listener framework.
From a single run of an application, Sparklens provides insights into that application's scalability limits. In this talk we will cover what Sparklens does and the theory behind it. We will discuss how the structure of a Spark application places important constraints on its scalability, how to find these structural constraints, and how to use them as a guide in solving performance and scalability problems of Spark applications.
This talk will help the audience answer the following questions about their Spark applications: 1) Will the application run faster with more executors? 2) How will cluster utilization change as the number of executors changes? 3) What is the absolute minimum time the application will take, even with infinite executors? 4) What is the expected wall-clock time once the most important structural limits of the application are fixed? Sparklens makes the ROI of additional executors extremely obvious for a given application, and needs just a single run to determine how the application will behave with different executor counts. Specifically, it will help managers take the correct side of the trade-off between spending developer time optimising applications and spending money on compute bills.
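Questions 1–3 above can be illustrated with a toy scaling model (this is not Sparklens' actual estimator; the times below are made-up numbers): driver work is serial, while executor work divides across the available cores.

```python
# Toy model (NOT Sparklens' real estimator) of wall-clock time vs core count,
# assuming perfect task packing and no stage barriers.
def estimated_wall_clock(driver_time, total_task_time, cores):
    # Driver work cannot be parallelized; executor work splits across cores.
    return driver_time + total_task_time / cores

# Illustrative run: 100s of driver time, 3600s of total executor task time.
for cores in (4, 16, 64, 256):
    print(cores, estimated_wall_clock(100.0, 3600.0, cores))
```

Even in this idealized model the curve flattens quickly: past a point, extra executors only shave off a sliver of the executor term, while the serial driver term stays fixed. That diminishing-return shape is what makes the ROI of additional executors visible from a single run.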
10. Critical Path: Limit to scalability
[Timeline diagram: Driver work followed by Stages 1–3 scheduled across Cores 1–4 over time]
Share your thoughts @qubole #Sparklens #SAISDev17
11. Ideal Application Time
[Timeline diagram: the same Driver and Stage work repacked across Cores 1–4 with no idle core time]
12. Latency vs Compute vs Developer
[Diagram comparing three metrics: Critical Path Time, Wall Clock Time, and Ideal* Application Time]
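The two bounds can be computed from per-stage task times. The sketch below uses one simple reading of the slide's metrics, with illustrative numbers: ideal time assumes perfect packing onto a fixed number of cores, while critical path assumes infinite cores but respects stage barriers; measured wall clock is always at least both.

```python
# Illustrative sketch of the two lower bounds on wall-clock time
# (simplified reading of the metrics; numbers are made up).
stages = [[30.0, 40.0, 50.0], [10.0, 10.0, 80.0]]  # task times (s) per stage
driver_time = 20.0
cores = 3

total_task_time = sum(sum(s) for s in stages)
# Ideal Application Time: fixed cores, perfect packing, no barriers.
ideal_time = driver_time + total_task_time / cores
# Critical Path Time: infinite cores, but each stage waits for its slowest task.
critical_path = driver_time + sum(max(s) for s in stages)
print(ideal_time, critical_path)
```

If measured wall clock is far above both bounds, there is room to improve scheduling or executor count; if it is close to the critical path, only restructuring the application (shrinking the slowest tasks) can help further.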
13. Structure of a Spark application
• A Spark application is either executing in the driver or in parallel in the executors
• A child stage is not executed until all parent stages are complete
• A stage is not complete until all tasks of the stage are complete
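The third constraint — a stage is gated by its slowest task — is why task skew destroys both latency and utilization. A small sketch with illustrative numbers:

```python
# Sketch of the "stage completes only when all tasks complete" constraint
# (illustrative numbers, not from a real run).
task_times = [5.0] * 99 + [300.0]   # 99 quick tasks and one skewed task
stage_time = max(task_times)        # the barrier: slowest task sets stage time
busy_time = sum(task_times)
# Fraction of core-time actually doing work if one core per task is held
# for the whole stage duration.
avg_utilization = busy_time / (len(task_times) * stage_time)
print(stage_time, avg_utilization)
```

Here one straggler stretches the stage to 300s while average utilization collapses below 3% — the same shape as the "compute wasted on executor side" numbers Sparklens reports.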
16. Observations & Actions
• The application had too many stages (697)
• The Critical Path Time was 3X the Ideal Application Time
• Instead of letting Spark write to the Hive table, the code was doing serial writes to each partition, in a loop
• We changed the code to let Spark write to the partitions in parallel
21. Observations & Actions
• 85% of the time was spent in a single stage with a very low number of tasks
• 91% of compute was wasted on the executor side
• Found that repartition(10) was called somewhere in the code, resulting in only 10 tasks; removed it
• Also increased spark.sql.shuffle.partitions from the default 200 to 800
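Why a hard-coded repartition(10) hurts can be seen with a back-of-the-envelope wave calculation (illustrative numbers, not from the actual application): with only 10 partitions, at most 10 cores can work on the stage no matter how large the cluster is.

```python
# Toy wave calculation (not Sparklens output): same total work, different
# partition counts, on an 80-core cluster.
import math

def stage_wall_time(num_tasks, task_time, total_cores):
    # Tasks run in waves of at most `total_cores` concurrent tasks.
    waves = math.ceil(num_tasks / total_cores)
    return waves * task_time

work = 8000.0   # seconds of total compute in the stage (illustrative)
cores = 80
for partitions in (10, 200, 800):
    per_task = work / partitions
    print(partitions, stage_wall_time(partitions, per_task, cores))
```

With 10 partitions the stage takes 800s and leaves 70 of 80 cores idle; at 200 or 800 partitions the same work finishes in 120s or 100s. This is the intuition behind both removing the repartition(10) and raising spark.sql.shuffle.partitions.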
23. Using Sparklens
For inline processing, add the following extra command line options to spark-submit:
--packages qubole:sparklens:0.2.0-s_2.11
--conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener
For old event log files (history server):
--packages qubole:sparklens:0.2.0-s_2.11 --class com.qubole.sparklens.app.ReporterApp dummy-arg <eventLogFile> source=history
For special Sparklens output files (very small files with all the relevant data):
--packages qubole:sparklens:0.2.0-s_2.11 --class com.qubole.sparklens.app.ReporterApp dummy-arg <sparklensOutputFile>
https://github.com/qubole/sparklens