AN EVALUATION OF TPC-H
ON SPARK & SPARK SQL IN ALOJA
M.SC. RAPHAEL RADOWITZ @DATAWORKS SUMMIT, BERLIN 19TH APRIL 2018
FRANKFURT BIG DATA LAB @GOETHE UNIVERSITY
AGENDA
 Motivation & Research Objectives
 Spark
 Ecosystem
 Data Access
 ALOJA & TPC-H
 Spark SQL with or without Hive Metastore
 File Formats
 Correlation Analysis
 Query Analysis
 Summary
Thursday, April 19, 2018 2
SPARK SCALA & SPARK SQL
Do you want to improve your Apache Spark performance?
QUESTIONS ADDRESSED IN THIS SESSION
1. Should I use Spark Scala or Spark SQL?
2. Does Hive Metastore have an impact on the performance?
3. Should I consider a certain File Format?
 Master thesis: “Evaluation of TPC-H on Spark & Spark SQL in ALOJA”
OUTCOME OF THE PERFORMANCE EVALUATION
1. Up to 30% performance improvement by switching between Spark Scala and Spark SQL
2. Hive Metastore produces an overhead
3. File format and compression improve performance
 Parquet with Snappy compression is the best choice
 Performance Evaluation conducted on Spark 2.1.1
MOTIVATION & RESEARCH OBJECTIVES
 No comprehensive performance evaluation of Spark SQL compared to Spark Scala was available
 Investigate the performance impact of Spark SQL and Spark Scala
 Investigate the influence of Hive’s Metastore on performance
 Detect possible runtime bottlenecks
 Assess the impact of alternative file formats with different compressions applied
 Implement a Spark Scala TPC-H benchmark within ALOJA
  Benchmark is publicly accessible on GitHub
ALOJA
 Benchmark platform to characterize cost-effectiveness of Big Data deployments
  https://aloja.bsc.es/
  https://github.com/Aloja/aloja
 Collaboration with the Barcelona Supercomputing Center (BSC)
  Nicolas Poggi
  Alejandro Montero
TPC-H BENCHMARK
 Popular decision support benchmark
 Composed of eight tables of different sizes
 22 complex, business-oriented ad-hoc queries
SPARK ECOSYSTEM / INTERFACES
https://pages.databricks.com/rs/094-YMS-629/images/SparkSQLSigmod2015.pdf
DATA ACCESS
 Data access from Spark on HDFS
 With or without Metastore
 Data File Formats: Text, ORC & Parquet
 Dataset API
FILE FORMATS
 Text
 ORC & Parquet with standard compression
 GZIP and ZLIB
 ORC with Snappy compression
 Parquet with Snappy compression
FILE FORMATS
[Figure: Spark Scala, file formats with Snappy compression, on cluster with 1 TB]
FILE FORMATS
 Parquet is up to 50% faster than text
 Standard compressions – GZIP and ZLIB
  Parquet is up to 16% faster than ORC
 Snappy compression (faster than standard compression)
  On average, Parquet with Snappy is 10% faster than ORC with Snappy
  Snappy is the only compression common to both formats
TAKEAWAY
 File format and compression benefit the performance of all queries and of both benchmarks equally
 ORC & Parquet perform best overall with Snappy
 Parquet with Snappy compression is the best choice
TPC-H BENCHMARK RESULTS
Query   Spark Scala (sec)   Spark SQL (sec)   Difference (%)
Q2      78                  83                7%
Q4      73                  100               26%
Q5      126                 99                27%
Q7      111                 94                18%
Q8      99                  83                20%
Q11     83                  68                21%
Q14     54                  64                15%
Q15     69                  80                14%
Q18     103                 123               16%
Q19     60                  80                25%
Q21     262                 221               18%
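
The Difference (%) column can be approximated from the two runtime columns. The slide does not state the formula, so the sketch below assumes it is the absolute gap divided by the Spark SQL runtime. That assumption matches the published percentages to within roughly one point (the runtimes shown are rounded to whole seconds). The names `RUNTIMES`, `relative_difference` and `faster_api` are illustrative and not part of the benchmark code.

```python
# Runtimes in seconds as listed in the table above: (Spark Scala, Spark SQL).
RUNTIMES = {
    "Q2": (78, 83), "Q4": (73, 100), "Q5": (126, 99), "Q7": (111, 94),
    "Q8": (99, 83), "Q11": (83, 68), "Q14": (54, 64), "Q15": (69, 80),
    "Q18": (103, 123), "Q19": (60, 80), "Q21": (262, 221),
}


def relative_difference(scala_sec, sql_sec):
    """Absolute runtime gap as a percentage of the Spark SQL runtime (assumed formula)."""
    return abs(scala_sec - sql_sec) / sql_sec * 100


def faster_api(scala_sec, sql_sec):
    """Name of the API with the lower runtime."""
    return "Spark Scala" if scala_sec < sql_sec else "Spark SQL"


if __name__ == "__main__":
    for query, (scala_sec, sql_sec) in RUNTIMES.items():
        diff = relative_difference(scala_sec, sql_sec)
        winner = faster_api(scala_sec, sql_sec)
        print(f"{query:>4}: {winner:<11} faster, about {diff:.0f}% difference")
```

Across the eleven queries listed, Spark Scala has the lower runtime on six (Q2, Q4, Q14, Q15, Q18, Q19) and Spark SQL on five (Q5, Q7, Q8, Q11, Q21).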
TAKEAWAY
 Spark Scala does not outperform Spark SQL
 Spark Scala and Spark SQL process queries
differently
 Are the applied optimization rules the same?
 Hive Metastore does not improve the performance,
but creates a minor overhead
 Possibility to improve performance by simply
switching API
WHAT TO DO?
1. Is there a pattern?
 When to use Spark Scala?
 When to use Spark SQL?
2. What are the root causes?
QUERY ANALYSIS
 Two approaches to investigating the identified performance differences:
1. Correlation analysis based on the Choke Point Analysis
2. Investigation of the Execution Plan
CHOKE POINT ANALYSIS
 Each TPC-H benchmark query is rated Low/Medium/High in six categories:
 Aggregation Performance
 Join Performance
 Data Access Locality
 Expression Calculation
 Correlated Subqueries
 Parallel Execution
 The correlation analysis is based on this
classification
* P. Boncz, T. Neumann, and O. Erling, “TPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark,” in Performance Characterization and Benchmarking, 2013, pp. 61–76.
CORRELATION ANALYSIS
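
A correlation analysis of this kind can be sketched as follows. This is a minimal illustration in plain Python. It assumes the Low/Medium/High ratings are encoded as 1/2/3 and correlated with a per-query runtime difference using Pearson's coefficient. The slides do not say which coefficient or encoding was used, and the sample values are made up purely to exercise the function; they are not results from the thesis.

```python
from math import sqrt

# Assumed numeric encoding of the choke point ratings.
LEVELS = {"Low": 1, "Medium": 2, "High": 3}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical ratings and runtime differences, NOT taken from the thesis.
    ratings = ["Low", "Medium", "High", "High", "Low"]
    runtime_diff_sec = [-5.0, 2.0, 12.0, 9.0, -3.0]
    coefficient = pearson([LEVELS[level] for level in ratings], runtime_diff_sec)
    print(f"r = {coefficient:.2f}")
```

A coefficient near +1 or -1 for one category would suggest that the category is associated with one API being faster; a value near 0 would suggest no linear relationship.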
SPARK SCALA – HIGH EXPRESSION CALCULATION
SPARK SQL – DATA ACCESS LOCALITY & PARALLEL EXECUTION
TAKEAWAY
 Spark Scala performs better in case of heavy
Expression Calculation
 Spark SQL is the better choice in case of
strong Data Access Locality in combination
with heavyweight Parallel Execution
EXECUTION PLAN ANALYSIS
 Execution plan analysis revealed that different optimizations are applied
 Spark SQL and Spark Scala have different physical plans
 Queries Q4, Q5, Q11 and Q19 show the most substantial execution plan variations:
  Different joins
  Different join order
  Different join build side
  Missing filters
  Missing projections
(Not explicitly defined, but applied by one API and not by the other.)
QUERY ANALYSIS – Q11
 TPC-H query Q11 performs poorly with Spark Scala
 The performance difference can be traced back to the different joins applied
 Wrong build side for joins
QUERY 11
Spark Scala               Spark SQL
1 x BroadcastHash         4 x BroadcastHash
2 x SortMerge
1 x BroadcastNestedLoop
Bad performance           Good performance

Join Type                 Complexity
BroadcastHash             O(N)
SortMerge                 O(N log N), if not sorted
BroadcastNestedLoop       O(N²)
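
To make the complexity column concrete, here is a toy sketch in plain Python (not Spark code). It counts row comparisons for a nested-loop join and hash probes for a hash join over the same inputs. The function names and sample rows are hypothetical. Spark's actual join operators work on partitioned and broadcast data and are considerably more involved.

```python
def nested_loop_join(left, right):
    """Compare every left row with every right row: len(left) * len(right) comparisons."""
    comparisons = 0
    result = []
    for lkey, lval in left:
        for rkey, rval in right:
            comparisons += 1
            if lkey == rkey:
                result.append((lkey, lval, rval))
    return result, comparisons


def hash_join(left, right):
    """Build a hash table on the right (build) side, then probe it once per left row."""
    table = {}
    for rkey, rval in right:  # build phase: one pass over the build side
        table.setdefault(rkey, []).append(rval)
    probes = 0
    result = []
    for lkey, lval in left:  # probe phase: one lookup per left row
        probes += 1
        for rval in table.get(lkey, []):
            result.append((lkey, lval, rval))
    return result, probes


if __name__ == "__main__":
    left = [(i, f"left-{i}") for i in range(1000)]
    right = [(i, f"right-{i}") for i in range(1000)]
    nl_rows, nl_cost = nested_loop_join(left, right)
    hash_rows, hash_cost = hash_join(left, right)
    print("nested loop comparisons:", nl_cost)
    print("hash join probes:", hash_cost)
```

In this toy model the hash table is built from the `right` input, so choosing the smaller input as the build side keeps the table small. That is one way to read the "wrong build side" remark above.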
SUMMARY
 Up to 30% performance improvement by simply switching API
 Parquet with Snappy is best
 Spark APIs can be intermixed seamlessly, but
  there are differences in the execution plan
  there is no guarantee of best performance
 Different optimization rules are applied
 Spark SQL uses the Catalyst Optimizer
THANK YOU
RAPHAEL RADOWITZ @DATAWORKS SUMMIT, BERLIN 19TH APRIL 2018
M.SC. Raphael Radowitz
Contact Detail
Phone: +82 (0) 10 9174 3788
Email: rradowitz@outlook.de

 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 

Dernier (20)

Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 

Spark SQL Beats Spark Scala by 30% for Some Queries

  • 1. AN EVALUATION OF TPC-H ON SPARK & SPARK SQL IN ALOJA M.SC. RAPHAEL RADOWITZ @DATAWORKS SUMMIT, BERLIN 19TH APRIL 2018 FRANKFURT BIG DATA LAB @GOETHE UNIVERSITY
  • 2. AGENDA
      - Motivation & Research Objectives
      - Spark
      - Ecosystem
      - Data Access
      - ALOJA & TPC-H
      - Spark SQL with or without Hive Metastore
      - File Formats
      - Correlation Analysis
      - Query Analysis
      - Summary
      Thursday, April 19, 2018
  • 3. SPARK SCALA & SPARK SQL
      Do you want to improve your Apache Spark performance?
  • 4. QUESTIONS ADDRESSED IN THIS SESSION
      1. Should I use Spark Scala or Spark SQL?
      2. Does the Hive Metastore have an impact on performance?
      3. Should I consider a certain file format?
      - Master thesis: “Evaluation of TPC-H on Spark & Spark SQL in ALOJA”
  • 5. OUTCOME OF THE PERFORMANCE EVALUATION
      1. Up to 30% performance increase by switching between Spark Scala & Spark SQL
      2. The Hive Metastore produces an overhead
      3. File format and compression increase performance
      - Parquet with Snappy compression is the best choice
      - Performance evaluation conducted on Spark 2.1.1
  • 6. MOTIVATION & RESEARCH OBJECTIVES
      - Absence of a comprehensive performance evaluation of Spark SQL compared to Spark Scala
      - Investigate the performance impact of Spark SQL and Spark Scala
      - Investigate the influence of Hive's Metastore on performance
      - Attempt to detect possible runtime bottlenecks
      - Assess the impact of alternative file formats with different compressions applied
      - Implement a Spark Scala TPC-H benchmark within ALOJA
      - Benchmark is publicly accessible on GitHub
  • 7. ALOJA
      - Benchmark platform to characterize the cost-effectiveness of Big Data deployments
      - https://aloja.bsc.es/
      - https://github.com/Aloja/aloja
      - Collaboration with the Barcelona Supercomputing Center (BSC)
      - Nicolas Poggi
      - Alejandro Montero
  • 8. TPC-H BENCHMARK
      - Popular decision support benchmark
      - Composed of eight tables of different sizes
      - 22 complex, business-oriented ad-hoc queries
  • 9. SPARK ECOSYSTEM / INTERFACES
      [Figure; source: https://pages.databricks.com/rs/094-YMS-629/images/SparkSQLSigmod2015.pdf]
  • 10. DATA ACCESS
      - Data access from Spark on HDFS
      - With or without Metastore
      - Data file formats: Text, ORC & Parquet
      - Dataset API
  • 11. FILE FORMATS
      - Text
      - ORC & Parquet with standard compression
      - GZIP and ZLIB
      - ORC with Snappy compression
      - Parquet with Snappy compression
  • 12. FILE FORMATS
      [Figure: Spark Scala file formats with Snappy compression on cluster with 1 TB]
  • 13. FILE FORMATS
      - Parquet is up to 50% faster than text
      - Standard compressions: GZIP and ZLIB
      - Parquet is up to 16% faster than ORC
      - Snappy compression (faster than standard compression)
      - On average, Parquet with Snappy is 10% faster than ORC with Snappy compression
      - Only common compression
  • 14. TAKEAWAY
      - File formats and compression benefit the performance of all queries and both benchmarks equally
      - ORC & Parquet perform best overall with Snappy
      - Parquet with Snappy compression is the best choice
  • 17. TPC-H BENCHMARK RESULTS
      Query   Spark Scala (sec)   Spark SQL (sec)   Difference (%)
      Q2       78                  83                 7%
      Q4       73                 100                26%
      Q5      126                  99                27%
      Q7      111                  94                18%
      Q8       99                  83                20%
      Q11      83                  68                21%
      Q14      54                  64                15%
      Q15      69                  80                14%
      Q18     103                 123                16%
      Q19      60                  80                25%
      Q21     262                 221                18%
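
The slide does not say how the Difference column was derived. As a rough sketch only, the snippet below assumes it is the absolute runtime gap expressed as a percentage of the Spark SQL runtime. With the runtimes as printed, that reading lands within roughly one to two percentage points of every stated value, but the formula is an assumption and not something the deck confirms. The function names are hypothetical.

```python
# Runtimes in seconds copied from the results slide:
# query -> (Spark Scala, Spark SQL, difference in % as stated on the slide).
RESULTS = {
    "Q2": (78, 83, 7),
    "Q4": (73, 100, 26),
    "Q5": (126, 99, 27),
    "Q7": (111, 94, 18),
    "Q8": (99, 83, 20),
    "Q11": (83, 68, 21),
    "Q14": (54, 64, 15),
    "Q15": (69, 80, 14),
    "Q18": (103, 123, 16),
    "Q19": (60, 80, 25),
    "Q21": (262, 221, 18),
}


def relative_difference(scala_sec, sql_sec):
    """ASSUMED formula: absolute gap as a percentage of the Spark SQL runtime."""
    return abs(scala_sec - sql_sec) / sql_sec * 100.0


def faster_api(scala_sec, sql_sec):
    """Name the API with the lower runtime for a query."""
    return "Spark Scala" if scala_sec < sql_sec else "Spark SQL"


if __name__ == "__main__":
    for query, (scala_sec, sql_sec, stated) in RESULTS.items():
        computed = relative_difference(scala_sec, sql_sec)
        print(f"{query}: {faster_api(scala_sec, sql_sec)} faster, "
              f"computed {computed:.1f}%, slide says {stated}%")
```

Listing the faster API per query makes the later takeaway concrete: neither API wins across the board, so the gain comes from choosing per query.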
  • 18. TAKEAWAY
      - Spark Scala does not outperform Spark SQL
      - Spark Scala and Spark SQL process queries differently
      - Are the applied optimization rules the same?
      - The Hive Metastore does not improve performance, but creates a minor overhead
      - Performance can be improved by simply switching API
  • 19. WHAT TO DO?
      1. Is there a pattern?
          - When to use Spark Scala?
          - When to use Spark SQL?
      2. What are the root causes?
  • 20. QUERY ANALYSIS
      - Two approaches to investigate the identified performance differences:
          1. Correlation analysis based on the Choke Point Analysis
          2. Investigation of the execution plan
  • 21. CHOKE POINT ANALYSIS
      - Each TPC-H benchmark query is rated Low/Medium/High in 6 categories:
          - Aggregation Performance
          - Join Performance
          - Data Access Locality
          - Expression Calculation
          - Correlated Subqueries
          - Parallel Execution
      - The correlation analysis is based on this classification
      * P. Boncz, T. Neumann, and O. Erling, “TPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark,” in Performance Characterization and Benchmarking, 2013, pp. 61–76
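
One way such a correlation analysis could be set up is sketched below: the Low/Medium/High ratings of one category are encoded as ordinals and correlated with the per-query runtime gap between the two APIs. The deck does not say which coefficient was used, so Pearson is an assumption here, and the ratings and gaps in the example are invented purely for illustration; they are not the classification from the cited paper or the measured results.

```python
from math import sqrt

# Ordinal encoding of the Low/Medium/High choke-point ratings.
LEVEL = {"Low": 1, "Medium": 2, "High": 3}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# HYPOTHETICAL ratings for one category and HYPOTHETICAL runtime gaps,
# invented for illustration only. gap_sec > 0 means Spark Scala was
# faster than Spark SQL by that many seconds.
ratings = ["Low", "Medium", "High", "High", "Low"]
gap_sec = [-10.0, 2.0, 15.0, 20.0, -5.0]

r = pearson([LEVEL[level] for level in ratings], gap_sec)
print(f"correlation between rating and runtime gap: {r:.2f}")
```

A coefficient near +1 for a category would suggest Spark Scala gains as that choke point gets heavier, and a value near -1 would point toward Spark SQL; with only a handful of queries, any such value should be read as a hint rather than proof.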
  • 23. SPARK SCALA – HIGH EXPRESSION CALCULATION
  • 24. SPARK SQL – DATA ACCESS LOCALITY & PARALLEL EXECUTION
  • 25. TAKEAWAY
      - Spark Scala performs better in case of heavy Expression Calculation
      - Spark SQL is the better choice in case of strong Data Access Locality combined with heavyweight Parallel Execution
  • 26. EXECUTION PLAN ANALYSIS
      - Execution plan analysis revealed that different optimizations were applied
      - Spark SQL and Spark Scala produce different physical plans
      - Queries Q4, Q5, Q11 and Q19 exemplify the most substantial execution plan variations:
          - Different joins
          - Different join order
          - Different join build side
          - Missing filters
          - Missing projection
      - Note: not explicitly defined, but applied for one API and not the other.
  • 27. QUERY ANALYSIS – Q11
      - TPC-H query Q11 demonstrates bad performance for Spark Scala
      - Performance differences can be traced to the different joins applied
      - Wrong build side for joins

      QUERY 11: Spark Scala         QUERY 11: Spark SQL
      1 x BroadcastHash             4 x BroadcastHash
      2 x SortMerge
      1 x BroadcastNestedLoop
      Bad performance               Good performance

      Join Type             Complexity
      BroadcastHash         O(N)
      SortMerge             O(N log N), if not sorted
      BroadcastNestedLoop   O(N²)
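
To make the complexity table more tangible, here is a minimal single-process sketch of the three join strategies in plain Python. It is not Spark's implementation (there is no broadcasting or partitioning), and the function names and sample rows are hypothetical; it only illustrates why a hash-based join scales better than a nested loop, and why the choice of build side matters.

```python
from collections import defaultdict


def nested_loop_join(left, right):
    """Compare every left row with every right row: O(N * M) comparisons."""
    return [(lk, lv, rv) for lk, lv in left for rk, rv in right if lk == rk]


def hash_join(left, right):
    """Build a hash table on `right`, then probe it with `left`.

    Expected O(N + M) time. The build side is held in memory, so it
    should be the smaller input (assumed here to be `right`).
    """
    table = defaultdict(list)
    for key, value in right:
        table[key].append(value)
    return [(key, lv, rv) for key, lv in left for rv in table.get(key, [])]


def sort_merge_join(left, right):
    """Sort both inputs by key, then merge them in a single pass.

    The sorts dominate with O(N log N); the merge itself is linear
    apart from emitting matches for duplicate keys.
    """
    left = sorted(left, key=lambda row: row[0])
    right = sorted(right, key=lambda row: row[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key, then pair it
            # with every left row that has the same key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], right[k][1]) for k in range(j, j_end))
                i += 1
            j = j_end
    return out


# Hypothetical sample rows of the form (join_key, payload).
left_rows = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right_rows = [(2, "x"), (3, "y"), (1, "z"), (2, "w")]

results = [
    sorted(join(left_rows, right_rows))
    for join in (nested_loop_join, hash_join, sort_merge_join)
]
print(results[0])
```

All three functions return the same rows; only the amount of work differs. In a real Spark job, the physical plan printed by `explain()` on a DataFrame shows which join operators were actually chosen.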
  • 28. SUMMARY
      - Up to 30% performance increase by simply switching API
      - Parquet with Snappy is best
      - Spark APIs can be intermixed seamlessly, but
          - differences in the execution plan
          - no guarantee of best performance
      - Different optimization rules are applied
      - Spark SQL uses the Catalyst Optimizer
  • 29. THANK YOU RAPHAEL RADOWITZ @DATAWORKS SUMMIT, BERLIN 19TH APRIL 2018 M.SC. Raphael Radowitz Contact Detail Phone: +82 (0) 10 9174 3788 Email: rradowitz@outlook.de