The Evaluation of TPC-H on Spark and Spark SQL in ALOJA was conducted at the Big Data Lab as part of a master's degree in Management Information Systems at the Johann Wolfgang Goethe University in Frankfurt, Germany. Parts of the analysis were carried out in close collaboration with the Barcelona Supercomputing Center.
The goal of this research was to integrate a TPC-H benchmark for Spark Scala into ALOJA, an open-source, public platform for automated and cost-efficient benchmarking, and to evaluate the runtime of Spark Scala, with or without the Hive Metastore, compared to Spark SQL. The evaluation also covers several file formats with different compression codecs applied to the underlying data, and their impact on performance. The results differ noticeably between the two benchmarks. Further investigation attempts to detect possible bottlenecks and other irregularities, and examines the physical plans to explain the behavior of Spark’s engine. Our experiments show, among other things, that (1) Spark Scala performs better for queries with heavy expression calculation, and (2) Spark SQL is the better choice for queries with strong data access locality combined with heavyweight parallel execution. Each API therefore has its advantages and disadvantages.
Surprisingly, our findings are spread across both Spark SQL and Spark Scala. Contrary to our expectations, Spark Scala did not outperform Spark SQL in all aspects. The results support the idea that Spark applies optimizations differently in its core and in its Spark SQL extension. The API on top of Spark provides extra information about the underlying structured data, which is probably used to perform additional optimizations.
In conclusion, our research demonstrates that the two APIs generate different query execution plans, which goes hand in hand with similar discoveries about inefficient joins, and it underlines the importance of our benchmark for identifying disparities and bottlenecks.
Speaker
Raphael Radowitz, Quality Specialist, SAP Labs Korea
Spark SQL Beats Spark Scala by 30% for Some Queries
1. AN EVALUATION OF TPC-H ON SPARK & SPARK SQL IN ALOJA
M.SC. RAPHAEL RADOWITZ @DATAWORKS SUMMIT, BERLIN 19TH APRIL 2018
FRANKFURT BIG DATA LAB @GOETHE UNIVERSITY
2. AGENDA
Motivation & Research Objectives
Spark
Ecosystem
Data Access
ALOJA & TPC-H
Spark SQL with or without Hive Metastore
File Formats
Correlation Analysis
Query Analysis
Summary
Thursday, April 19, 2018 2
3. SPARK SCALA & SPARK SQL
Do you want to improve your Apache Spark performance?
4. QUESTIONS ADDRESSED IN THIS SESSION
1. Should I use Spark Scala or Spark SQL?
2. Does Hive Metastore have an impact on the performance?
3. Should I consider a certain File Format?
Master thesis: “Evaluation of TPC-H on Spark & Spark SQL in ALOJA”
5. OUTCOME OF THE PERFORMANCE EVALUATION
1. Up to 30% performance increase by switching between Spark Scala & Spark SQL
2. Hive Metastore produces an overhead
3. File format and compression increase performance
Parquet with Snappy compression is the best choice
Performance Evaluation conducted on Spark 2.1.1
6. MOTIVATION & RESEARCH OBJECTIVES
Absence of a comprehensive performance evaluation of Spark SQL compared to Spark Scala
Investigating the performance impact of Spark SQL and Spark Scala
Investigating the influence of Hive’s Metastore on performance
Detecting possible bottlenecks in terms of runtime
Impact of various alternate file formats with different applied compressions
Implement a Spark Scala TPC-H benchmark within ALOJA
Benchmark is publicly accessible on GitHub
7. ALOJA
Benchmark platform to characterize cost-effectiveness of Big Data deployments
https://aloja.bsc.es/
https://github.com/Aloja/aloja
Collaboration with the Barcelona Supercomputing Center (BSC)
Nicolas Poggi
Alejandro Montero
8. TPC-H BENCHMARK
Popular decision support benchmark
Composed of eight tables of different sizes
22 complex business oriented ad-hoc queries
10. DATA ACCESS
Data access from Spark on HDFS
With or without Metastore
Data File Formats: Text, ORC & Parquet
Dataset API
11. FILE FORMATS
Text
ORC & Parquet with standard compression
GZIP and ZLIB
ORC with Snappy compression
Parquet with Snappy compression
12. FILE FORMATS
[Figure: Spark Scala file formats with Snappy compression on cluster with 1TB]
13. FILE FORMATS
Parquet is up to 50% faster than text
Standard compressions – GZIP and ZLIB
Parquet is up to 16% faster than ORC
Snappy compression (faster than standard compression)
On average, Parquet with Snappy is 10% faster than ORC with Snappy compression
Only common compression
14. TAKEAWAY
File formats and compression benefit the performance of all queries and both benchmarks equally
ORC & Parquet perform overall best with Snappy
Parquet with Snappy compression is the best choice
18. TAKEAWAY
Spark Scala does not outperform Spark SQL
Spark Scala and Spark SQL process queries differently
Are the applied optimization rules the same?
Hive Metastore does not improve the performance, but creates a minor overhead
Possibility to improve performance by simply switching API
19. WHAT TO DO?
1. Is there a pattern?
When to use Spark Scala?
When to use Spark SQL?
2. What are the root causes?
20. QUERY ANALYSIS
2 approaches to investigate the performance differences identified:
1. Correlation analysis based on the Choke Point Analysis
2. Investigation of the Execution Plan
21. CHOKE POINT ANALYSIS
Classifying each TPC-H benchmark query in 6 categories, each rated Low/Medium/High:
Aggregation Performance
Join Performance
Data Access Locality
Expression Calculation
Correlated Subqueries
Parallel Execution
The correlation analysis is based on this classification
* P. Boncz, T. Neumann, and O. Erling, “TPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark,” in Performance Characterization and Benchmarking, 2013, pp. 61–76
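A minimal sketch of how such a correlation analysis could be set up, in plain Python. The slides do not say which correlation coefficient was used, so Pearson is an assumption here. The query names (QA to QE), their ratings, and the runtime differences are hypothetical placeholders, not results from the benchmark.

```python
from math import sqrt

# Assumption: ratings are mapped to an ordinal scale for the correlation.
LEVEL = {"Low": 0, "Medium": 1, "High": 2}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical placeholder data: for each query, its rating in ONE choke
# point category and a signed runtime difference in percent
# (positive = Spark Scala faster, negative = Spark SQL faster).
queries = {
    "QA": ("High", 12.0),
    "QB": ("Medium", 5.0),
    "QC": ("Low", -3.0),
    "QD": ("High", 9.0),
    "QE": ("Low", -6.0),
}

levels = [LEVEL[rating] for rating, _ in queries.values()]
diffs = [diff for _, diff in queries.values()]

# A value close to +1 would suggest that a high rating in this category
# goes together with an advantage for Spark Scala.
r = pearson(levels, diffs)
print(f"correlation: {r:.2f}")
```

Repeating this per category gives one coefficient for each of the six choke points, which is one way to look for the pattern asked about on the "WHAT TO DO?" slide.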
23. SPARK SCALA – HIGH EXPRESSION CALCULATION
24. SPARK SQL – DATA ACCESS LOCALITY & PARALLEL EXECUTION
25. TAKEAWAY
Spark Scala performs better in case of heavy Expression Calculation
Spark SQL is the better choice in case of strong Data Access Locality in combination with heavyweight Parallel Execution
26. EXECUTION PLAN ANALYSIS
Execution Plan Analysis revealed that different optimizations are applied
Spark SQL and Spark Scala have different physical plans
Queries Q4, Q5, Q11 and Q19 exemplify the most substantial Execution Plan variations:
Different Joins
Different Join order
Different Join build side
Missing filters
Missing projection
Not explicitly defined, but applied for one API and not the other.
27. QUERY ANALYSIS – Q11
TPC-H query Q11 demonstrates bad performance for Spark Scala
Performance differences can be tracked down to different applied joins
Wrong build side for joins
QUERY 11
Spark Scala                 Spark SQL
1 x BroadcastHash           4 x BroadcastHash
2 x SortMerge
1 x BroadcastNestedLoop
Bad performance             Good performance

Join Type                   Complexity
BroadcastHash               O(N)
SortMerge                   O(N log N), if not sorted
BroadcastNestedLoop         O(N²)
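The complexity gap between the join types can be illustrated with a small single-machine sketch in plain Python. This is not Spark's implementation: it ignores broadcasting and partitioning, leaves out the sort-merge join, and only counts the work done by a nested-loop join and a hash join on hypothetical toy tables. It also shows why the build side matters: the hash table holds one entry per key on the build side.

```python
from collections import defaultdict


def nested_loop_join(left, right):
    """Compare every left row with every right row: O(N * M) comparisons."""
    comparisons = 0
    matches = []
    for lkey, lval in left:
        for rkey, rval in right:
            comparisons += 1
            if lkey == rkey:
                matches.append((lkey, lval, rval))
    return matches, comparisons


def hash_join(build, probe):
    """Hash the build side once, then do one lookup per probe row: O(N + M)."""
    table = defaultdict(list)
    for key, val in build:
        table[key].append(val)
    matches = []
    lookups = 0
    for key, pval in probe:
        lookups += 1
        for bval in table.get(key, []):
            matches.append((key, bval, pval))
    # len(table) is the number of distinct keys held in memory.
    return matches, lookups, len(table)


# Hypothetical toy tables: 10 rows joined against 1000 rows with unique keys.
small = [(k, f"s{k}") for k in range(10)]
large = [(k, f"l{k}") for k in range(1000)]

nl_matches, nl_comparisons = nested_loop_join(small, large)
hj_matches, hj_lookups, hj_table_size = hash_join(build=small, probe=large)

# Same inputs, but the large table is used as the build side.
_, _, wrong_side_table_size = hash_join(build=large, probe=small)

print(nl_comparisons, hj_lookups, hj_table_size, wrong_side_table_size)
```

Both joins return the same matching rows, but the nested-loop join performs one comparison per row pair, while the hash join performs one lookup per probe row. Building on the large table instead of the small one inflates the hash table from 10 to 1000 entries in this toy setup.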
28. SUMMARY
Up to 30% performance increase by simply switching API
Parquet with Snappy is best
Spark APIs can be intermixed seamlessly, but differences in the execution plan mean there is no guarantee for best performance
Different optimization rules are applied
Spark SQL uses the Catalyst Optimizer
29. THANK YOU
RAPHAEL RADOWITZ @DATAWORKS SUMMIT, BERLIN 19TH APRIL 2018
M.SC. Raphael Radowitz
Contact Details
Phone: +82 (0) 10 9174 3788
Email: rradowitz@outlook.de