Kazuaki Ishizaki
IBM Research – Tokyo
@kiszk
In-Memory Storage
Evolution in Apache Spark
#UnifiedAnalytics #SparkAISummit
About Me – Kazuaki Ishizaki
• Researcher at IBM Research in compiler optimizations
• Working on the IBM Java virtual machine for over 20 years
– In particular, just-in-time compiler
• Committer of Apache Spark (SQL package) since 2018
• ACM Distinguished Member
• Homepage: http://ibm.biz/ishizaki
GitHub: https://github.com/kiszk Twitter: @kiszk
https://slideshare.net/ishizaki
In-Memory Storage Evolution in Apache Spark / Kazuaki Ishizaki
Why In-Memory Storage?
• In-memory storage is mandatory for high performance
• In-memory columnar storage is necessary to
– Support Parquet, a first-class-citizen column format
– Achieve better compression rate for table cache
(Figure: the same table in row format vs. column format. The row format places the fields of Row 0, Row 1, and Row 2 adjacently in memory; the column format places the values of Column x, Column y, and Column z adjacently.)
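As a minimal sketch of the two layouts in the figure (plain Python, with the figure's sample values), the same three-row table can be serialized row by row or column by column; the columnar form keeps same-typed values adjacent, which is what enables the better compression rate mentioned above:

```python
import struct
import zlib

# The three-row table from the figure: (x: int, y: float, z: str).
rows = [(1, 2.0, "Spark"), (2, 1.9, "AI"), (3, 5000.0, "Summit")]

# Row format: the fields of each row are adjacent in memory.
row_bytes = b"".join(struct.pack("<id", x, y) + z.encode() for x, y, z in rows)

# Column format: all values of one column are adjacent in memory.
xs, ys, zs = zip(*rows)
col_bytes = (struct.pack("<3i", *xs)     # column x: ints together
             + struct.pack("<3d", *ys)   # column y: doubles together
             + "".join(zs).encode())     # column z: strings together

# Same information either way; the columnar layout groups same-typed
# values, which generally compresses better and suits vectorized access.
print(len(zlib.compress(row_bytes)), len(zlib.compress(col_bytes)))
```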
What I Will Talk about
• Columnar storage is used to improve performance for
– table cache, Parquet, ORC, and Arrow
• Columnar storage from Spark 2.3
– improves performance of PySpark with Pandas UDF using
Arrow
– can be connected to other external columnar storages through
the public class “ColumnVector”
How Columnar Storage is Used
• Table cache
• Parquet
• ORC
• Pandas UDF
df = ...
df.cache()
df1 = df.selectExpr("y + 1.2")

df = spark.read.parquet("c")
df1 = df.selectExpr("y + 1.2")

df = spark.read.format("orc").load("c")
df1 = df.selectExpr("y + 1.2")

@pandas_udf('double')
def plus(v):
    return v + 1.2
df1 = df.withColumn('yy', plus(df.y))
Performance among Spark Versions
• DataFrame table cache from Spark 2.0 to Spark 2.4
(Chart: relative elapsed time of df.filter("i % 16 == 0").count for Spark 2.0, 2.3, and 2.4; shorter is better.)
How This Improvement is Achieved
• Structure of columnar storage
• Generated code to access columnar storage
Outline
• Introduction
• Deep dive into columnar storage
• Deep dive into generated code of columnar storage
• Next steps
In-Memory Storage Evolution (1/2)
Spark version timeline:
• to 1.3 – RDD table cache: Java objects
• 1.4 to 1.6 – table cache: own memory layout by Project Tungsten
• 2.0 to 2.2 – Parquet vectorized reader: own memory layout, but a different class from the table cache
In-Memory Storage Evolution (2/2)
Spark version timeline (continued):
• 2.3 – Pandas UDF with Arrow; ColumnVector becomes a public class
• 2.4 – ORC vectorized reader
The ColumnVector class becomes public from Spark 2.3. Table cache, Parquet, ORC, and Arrow use the common ColumnVector class.
Implementation in Spark 1.4 to 1.6
• Table cache uses a CachedBatch class that is not accessed directly
from generated code
case class CachedBatch(
buffers: Array[Array[Byte]],
stats: Row)
(Diagram: column values such as 2.0 and 1.9 serialized into CachedBatch.buffers.)
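A rough Python sketch of the CachedBatch idea (hypothetical helper names, not Spark's actual code): each column is serialized into its own byte buffer, so generated code cannot read a value directly and must deserialize the buffer first:

```python
import struct
from typing import List, NamedTuple

class CachedBatch(NamedTuple):
    """Per-column serialized buffers, as in Spark 1.4 to 1.6 (sketch)."""
    buffers: List[bytes]

def write_float_column(values):
    # Serialize a float column into one opaque byte buffer.
    return struct.pack(f"<{len(values)}d", *values)

def read_float_column(buf):
    # Generated code cannot index into the buffer; it must deserialize.
    return list(struct.unpack(f"<{len(buf) // 8}d", buf))

batch = CachedBatch(buffers=[write_float_column([2.0, 1.9])])
print(read_float_column(batch.buffers[0]))  # [2.0, 1.9]
```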
Implementation in Spark 2.0
• Parquet uses a ColumnVector class with well-defined methods
that can be called from generated code
public abstract class ColumnVector {
float getFloat(…) …
UTF8String getUTF8String(…) …
…
}
public final class OnHeapColumnVector
extends ColumnVector {
private byte[] byteData;
…
private float[] floatData;
…
}
(Diagram: Parquet data is copied into a ColumnVector held by a ColumnarBatch.)
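The accessor idea can be sketched in Python (hypothetical names mirroring the Java classes above): an abstract get_float API over typed per-column arrays, so calling code never sees the storage layout:

```python
from abc import ABC, abstractmethod
import array

class ColumnVector(ABC):
    """Abstract accessor over one column of in-memory data (sketch)."""
    @abstractmethod
    def get_float(self, row_id: int) -> float: ...

class OnHeapColumnVector(ColumnVector):
    def __init__(self, floats):
        # One typed on-heap array per type, like Spark's floatData field.
        self.float_data = array.array("d", floats)

    def get_float(self, row_id: int) -> float:
        return self.float_data[row_id]

col = OnHeapColumnVector([2.0, 1.9])
print(col.get_float(1))  # 1.9
```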
Implementation in Spark 2.3
• Table cache, Parquet, and Arrow also use ColumnVector
• ColumnVector becomes a public class to define APIs
/**
 * An interface representing in-memory columnar data in Spark. This interface defines the main APIs
 * to access the data, as well as their batched versions. The batched versions are considered to be
 * faster and preferable whenever possible.
 */
@Evolving
public abstract class ColumnVector … {
float getFloat(…) …
UTF8String getUTF8String(…) …
…
}
https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java
public final class OnHeapColumnVector
extends ColumnVector {
// Array for each type.
private byte[] byteData;
…
private float[] floatData;
…
}
public final class ArrowColumnVector
extends ColumnVector {
…
}
(Users of ColumnVector.java: table cache, Parquet vectorized reader, and Pandas UDF with Arrow.)
ColumnVector for Your Columnar Storage
• Developers can write their own class that extends ColumnVector to
support a new columnar storage or to exchange data with other
formats
(Diagram: a columnar data source exposed to Spark through MyColumnarClass extends ColumnVector.)
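For illustration, a hypothetical MyColumnarClass in Python (all names invented here): it wraps a raw buffer produced by an external columnar source and exposes it through a ColumnVector-style accessor without copying the data into another layout:

```python
import struct

class MyColumnarClass:
    """ColumnVector-style view over an external float64 buffer (sketch)."""
    def __init__(self, buf: bytes):
        self.buf = buf  # raw little-endian doubles from an external source

    def get_float(self, row_id: int) -> float:
        # Read element row_id in place; no per-row conversion or copy.
        return struct.unpack_from("<d", self.buf, row_id * 8)[0]

external_buffer = struct.pack("<2d", 2.0, 1.9)  # stand-in external data
col = MyColumnarClass(external_buffer)
print(col.get_float(0), col.get_float(1))  # 2.0 1.9
```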
Implementation in Spark 2.4
• ORC also uses ColumnVector
(Users of ColumnVector.java now also include ORC: table cache, Parquet and ORC vectorized readers, and Pandas UDF with Arrow.)
Outline
• Introduction
• Deep dive into columnar storage
• Deep dive into generated code of columnar storage
• Next steps
How Is a Spark Program Executed?
• A Spark program is translated into Java code to be executed
Spark program:
df = ...
df.cache()
df1 = df.selectExpr("y + 1.2")

Catalyst translates it into Java code that runs on the Java virtual machine:
while (rowIterator.hasNext()) {
  Row row = rowIterator.next();
  …
}

Source: Armbrust et al., “Spark SQL: Relational Data Processing in Spark,” SIGMOD ’15
Access Columnar Storage (before 2.0)
• Although columnar storage is used, generated code gets data
from row storage, so data conversion is required
while (rowIterator.hasNext()) {
  Row row = rowIterator.next();
  float y = row.getFloat(1);
  float f = y + 1.2f;
  …
}
df1 = df.selectExpr("y + 1.2")
(Diagram: Catalyst converts the CachedBatch columnar storage into row storage before the generated code reads each row.)
Access Columnar Storage (from 2.0)
• When columnar storage is used, reading a data element directly
accesses the columnar storage
– Removed the copy for Parquet in 2.0 and for the table cache in 2.3
ColumnVector column1 = …
int i = 0;
while (i < numRows) {
  float y = column1.getFloat(i);
  float f = y + 1.2f;
  …
  i++;
}
df1 = df.selectExpr("y + 1.2")
(Diagram: the generated code reads 2.0 (i = 0) and 1.9 (i = 1) directly from the ColumnVector, producing f = 3.2 and 3.1.)
Access Columnar Storage (from 2.3)
• Generate this pattern for all cases that use ColumnVector
• Use a for-loop to encourage compiler optimizations
– The HotSpot compiler applies loop optimizations to a well-formed loop
ColumnVector column1 = …
for (int i = 0; i < numRows; i++) {
  float y = column1.getFloat(i);
  float f = y + 1.2f;
  …
}
df1 = df.selectExpr("y + 1.2")
(Diagram: Catalyst generates the for-loop, which reads 2.0 (i = 0) and 1.9 (i = 1) directly from the ColumnVector, producing f = 3.2 and 3.1.)
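The effect of the counted for-loop can be sketched in Python (a simplified stand-in, not the actual generated Java): a loop with a single induction variable that reads element i straight from the column, the shape that lets a JIT compiler hoist bounds checks, unroll, and vectorize:

```python
import array

column1 = array.array("f", [2.0, 1.9])   # stands in for ColumnVector
num_rows = len(column1)
out = array.array("f", [0.0] * num_rows)

for i in range(num_rows):     # well-formed counted loop
    y = column1[i]            # column1.getFloat(i)
    out[i] = y + 1.2          # "y + 1.2" from selectExpr
```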
How Columnar Storage Is Used in PySpark
• Share data between the columnar storages of Spark and Pandas
– No serialization and deserialization
– 3-100x performance improvements
Details in the talk “Apache Arrow and Pandas UDF on Apache Spark” by Takuya Ueshin
Source: “Introducing Pandas UDF for PySpark”, Databricks blog
@pandas_udf('double')
def plus(v):
    return v + 1.2
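The semantics can be mimicked in plain Python (no pyspark or pandas here; a list stands in for the Arrow-backed pandas.Series): the UDF receives a whole column batch and returns a whole batch, so data crosses the JVM-Python boundary once per batch rather than once per row:

```python
def plus(v):
    # v is a whole batch of doubles (a pandas.Series in real Pandas UDFs).
    return [x + 1.2 for x in v]

batch = [2.0, 1.9]      # one Arrow batch of column y (stand-in)
result = plus(batch)    # vectorized: one call for the whole batch
print(result)
```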
Outline
• Introduction
• Deep dive into columnar storage
• Deep dive into generated code of columnar storage
• Next steps
Next Steps
• Short-term
– support an array type in ColumnVector for table cache
– support additional external columnar storage
• Middle-term
– exploit SIMD instructions to process multiple rows in a column
in generated code
• Extension of SPARK-25728 (Tungsten IR)
Integrate Spark with Others
• Frameworks: DL/ML frameworks
• SPARK-24579
• SPARK-26413
• Resources: GPU, FPGA, ..
• SPARK-27396
• SAIS2019: “Apache Arrow-Based
Unified Data Sharing and
Transferring Format Among
CPU and Accelerators”
(Image: GPU and FPGA accelerators, from rapids.ai)
Takeaway
• Columnar storage is used to improve performance for
– table cache, Parquet, ORC, and Arrow
• Columnar storage from Spark 2.3
– improves performance of PySpark with Pandas UDF using
Arrow
– can be connected to other external columnar storages through
the public class “ColumnVector”
Thanks to the Spark Community
• Especially @andrewor14, @bryanCutler, @cloud-fan,
@dongjoon-hyun, @gatorsmile, @hvanhovell, @mgaido91,
@ueshin, @viirya

In-Memory Evolution in Apache Spark

  • 1. Kazuaki Ishizaki IBM Research – Tokyo @kiszk In-Memory Storage Evolution in Apache Spark #UnifiedAnalytics #SparkAISummit
  • 2. About Me – Kazuaki Ishizaki • Researcher at IBM Research in compiler optimizations • Working for IBM Java virtual machine over 20 years – In particular, just-in-time compiler • Committer of Apache Spark (SQL package) from 2018 • ACM Distinguished Member • Homepage: http://ibm.biz/ishizaki b: https://github.com/kiszk wit: @kiszk https://slideshare.net/ishizaki 2In-Memory Storage Evolution in Apache Spark / Kazuaki Ishizaki #UnifiedAnalytics #SparkAISummit
  • 3. Why is In-Memory Storage? • In-memory storage is mandatory for high performance • In-memory columnar storage is necessary to – Support first-class citizen column format Parquet – Achieve better compression rate for table cache 3In-Memory Storage Evolution in Apache Spark / Kazuaki Ishizaki #UnifiedAnalytics #SparkAISummit memory address memory address SummitAISpark 5000.01.92.0 321Summit AI Spark 5000.0 1.9 2.0 3 2 1 Row format Column format Row 0 Row 1 Row 2 Column x Column y Column z
  • 4. What I Will Talk about • Columnar storage is used to improve performance for – table cache, Parquet, ORC, and Arrow • Columnar storage from Spark 2.3 – improves performance of PySpark with Pandas UDF using Arrow – can be connected with external other columnar storages by using a public class “ColumnVector” 4#UnifiedAnalytics #SparkAISummit
  • 5. How Columnar Storage is Used • Table cache ORC • Pandas UDF Parquet 5In-Memory Storage Evolution in Apache Spark / Kazuaki Ishizaki #UnifiedAnalytics #SparkAISummit df = ... df.cache df1 = df.selectExpr(“y + 1.2”) df = spark.read.parquet(“c”) df1 = df.selectExpr(“y + 1.2”) df = spark.read.format(“orc”).load(“c”) df1 = df.selectExpr(“y + 1.2”) @pandas_udf(‘double’) def plus(v): return v + 1.2 df1 = df.withColumn(‘yy’, plus(df.y))
  • 6. Performance among Spark Versions • DataFrame table cache from Spark 2.0 to Spark 2.4 6In-Memory Storage Evolution in Apache Spark / Kazuaki Ishizaki #UnifiedAnalytics #SparkAISummit 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Spark 2.0 Spark 2.3 Spark 2.4 Performance comparison among different Spark versions Relative elapsed time shorter is better df.filter(“i % 16 == 0").count
  • 7. How This Improvement is Achieved • Structure of columnar storage • Generated code to access columnar storage 7In-Memory Storage Evolution in Apache Spark / Kazuaki Ishizaki #UnifiedAnalytics #SparkAISummit
  • 8. Outline • Introduction • Deep dive into columnar storage • Deep dive into generated code of columnar storage • Next steps 8In-Memory Storage Evolution in Apache Spark / Kazuaki Ishizaki #UnifiedAnalytics #SparkAISummit
  • 9. In-Memory Storage Evolution (1/2) 9In-Memory Storage Evolution in Apache Spark / Kazuaki Ishizaki #UnifiedAnalytics #SparkAISummit AI| Spark| Spark AI Table cache 2.0 1.9 2.0 1.9 Spark AI Parquet vectorized reader 2.0 1.9 1.4 to 1.6 RDD table cache to 1.3 2.0 to 2.2 RDD table cache : Java objects Table cache : Own memory layout by Project Tungsten for table cache Parquet : Own memory layout, but different class from table cacheSpark version
•	10. In-Memory Storage Evolution (2/2) • Spark 2.3: ColumnVector becomes a public class; table cache, Parquet vectorized reader, and Pandas UDF with Arrow use it • Spark 2.4: ORC vectorized reader also uses it • From Spark 2.3, table cache, Parquet, ORC, and Arrow all share the common ColumnVector class
•	11. Implementation in Spark 1.4 to 1.6 • Table cache uses CachedBatch, which is not accessed directly by generated code: case class CachedBatch(buffers: Array[Array[Byte]], stats: Row)
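Why generated code cannot read CachedBatch directly can be sketched in plain Python (field names follow the slide; the compression scheme here is illustrative): each column lives in an opaque byte buffer, so reading even one value means decompressing and unpacking the whole column first.

```python
# Hypothetical sketch of a CachedBatch-like structure (not Spark's code):
# each column is an opaque, compressed byte buffer.
import struct
import zlib

floats = [2.0, 1.9, 5000.0]
buffer_y = zlib.compress(struct.pack(f"{len(floats)}f", *floats))

cached_batch = {"buffers": [buffer_y], "stats": {"numRows": len(floats)}}

# Generated code cannot index into buffer_y; the whole column must be
# decompressed and unpacked before any single value is usable.
raw = zlib.decompress(cached_batch["buffers"][0])
values = list(struct.unpack(f"{len(floats)}f", raw))
```

This decode step is the overhead that the later ColumnVector design, with directly addressable primitive arrays, removes from the hot path.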
•	12. Implementation in Spark 2.0 • Parquet uses the ColumnVector class, which has well-defined methods that can be called from generated code: public abstract class ColumnVector { float getFloat(…) … UTF8String getUTF8String(…) … … } public final class OnHeapColumnVector extends ColumnVector { private byte[] byteData; … private float[] floatData; … } [Diagram: Parquet data is copied into a ColumnVector inside a ColumnarBatch]
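The OnHeapColumnVector idea on this slide can be sketched in plain Python (class and method names are illustrative, not Spark's Java API): one contiguous primitive array per type, plus typed getters that generated code can call directly without any decoding.

```python
# Minimal sketch of an on-heap column vector (illustrative, not Spark code).
from array import array

class SketchColumnVector:
    """One primitive array per type; typed getters read it directly."""
    def __init__(self, floats):
        # array('f') is a contiguous buffer of 32-bit floats, analogous to
        # Java's float[] floatData in OnHeapColumnVector.
        self.float_data = array("f", floats)

    def get_float(self, row_id):
        return self.float_data[row_id]

col = SketchColumnVector([2.0, 1.9, 5000.0])
```

A call like `col.get_float(i)` is a plain array index: no row object, no buffer decode, which is exactly what makes it callable from a tight generated loop.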
•	13. Implementation in Spark 2.3 • Table cache, Parquet, and Arrow also use ColumnVector • ColumnVector becomes a public class to define the APIs (ColumnVector.java: https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java): /** An interface representing in-memory columnar data in Spark. This interface defines the main APIs to access the data, as well as their batched versions. The batched versions are considered to be faster and preferable whenever possible. */ @Evolving public abstract class ColumnVector … { float getFloat(…) … UTF8String getUTF8String(…) … … } • OnHeapColumnVector (table cache, Parquet vectorized reader) and ArrowColumnVector (Pandas UDF with Arrow) extend ColumnVector
•	14. ColumnVector for Your Columnar Storage • Developers can write their own class, which extends ColumnVector, to support a new columnar storage or to exchange data with other formats: MyColumnarClass extends ColumnVector
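The slide's MyColumnarClass can be sketched in plain Python (class and method names are illustrative; the real API is Spark's Java ColumnVector class): an adapter exposes an external columnar source through the same typed getters Spark's generated code expects.

```python
# Hypothetical sketch of adapting an external columnar source to a
# ColumnVector-style API (names are illustrative, not Spark's).
from abc import ABC, abstractmethod

class ColumnVectorLike(ABC):
    """Stand-in for the abstract ColumnVector API."""
    @abstractmethod
    def get_float(self, row_id):
        ...

class MyColumnarClass(ColumnVectorLike):
    """Adapter: exposes an external column (here a plain list) via getters."""
    def __init__(self, external_column):
        self._col = external_column  # data owned by the external source

    def get_float(self, row_id):
        return float(self._col[row_id])

vec = MyColumnarClass([2.0, 1.9, 5000.0])
```

Because only the getter contract matters, the external data never needs to be copied into Spark's own buffers to be readable by the generated loop.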
•	15. Implementation in Spark 2.4 • ORC also uses ColumnVector • OnHeapColumnVector now serves the table cache and the Parquet and ORC vectorized readers; ArrowColumnVector serves Pandas UDF with Arrow (see ColumnVector.java for the API)
•	16. Outline • Introduction • Deep dive into columnar storage • Deep dive into generated code of columnar storage • Next steps
•	17. How Is a Spark Program Executed? • A Spark program is translated by Catalyst into Java code, which is executed on the Java virtual machine: df = ... df.cache df1 = df.selectExpr(“y + 1.2”) → while (rowIterator.hasNext()) { Row row = rowIterator.next(); … } (Source: Michael et al., Spark SQL: Relational Data Processing in Spark, SIGMOD’15)
•	18. Access Columnar Storage (before 2.0) • Although columnar storage (CachedBatch) is used, generated code gets data from row storage, so data conversion from columnar to row storage is required: df1 = df.selectExpr(“y + 1.2”) → while (rowIterator.hasNext()) { Row row = rowIterator.next(); float y = row.getFloat(1); float f = y + 1.2f; … }
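The pre-2.0 pattern can be sketched in plain Python (illustrative, not actual generated code): the columnar data is first materialized as row objects, and the per-row loop then reads fields from those rows, paying for the conversion up front.

```python
# Sketch of the pre-2.0 access pattern: columnar storage -> rows -> loop.
columns = {"x": [1, 2, 3], "y": [2.0, 1.9, 5000.0]}

# Data conversion: build one row tuple per record from the columnar storage.
rows = list(zip(columns["x"], columns["y"]))

result = []
for row in rows:            # while (rowIterator.hasNext()) { ... }
    y = row[1]              # row.getFloat(1)
    result.append(y + 1.2)  # y + 1.2
```

Every record is touched twice: once to build the row, once to read it, and the intermediate row objects are pure overhead for this query.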
•	19. Access Columnar Storage (from 2.0) • When columnar storage is used, reading data elements directly accesses the columnar storage – Removed the copy for Parquet in 2.0 and for the table cache in 2.3: df1 = df.selectExpr(“y + 1.2”) → ColumnVector column1 = …; int i = 0; while (i < numRows) { float y = column1.getFloat(i); float f = y + 1.2f; …; i++; }
•	20. Access Columnar Storage (from 2.3) • Generate this pattern for all cases regarding ColumnVector • Use a for-loop to encourage compiler optimizations – The HotSpot compiler applies loop optimizations to a well-formed loop: df1 = df.selectExpr(“y + 1.2”) → ColumnVector column1 = …; for (int i = 0; i < numRows; i++) { float y = column1.getFloat(i); float f = y + 1.2f; … }
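The Spark 2.3 pattern can be sketched in plain Python (illustrative only; in Spark this is generated Java, where the counted for-loop shape is what lets the HotSpot JIT apply loop optimizations): a plain indexed loop reads each value straight out of the column array, with no row objects in between.

```python
# Sketch of the from-2.3 access pattern: a counted loop over the column.
from array import array

column1 = array("f", [2.0, 1.9, 5000.0])  # stands in for ColumnVector
num_rows = len(column1)

result = array("f", [0.0] * num_rows)
for i in range(num_rows):        # for (int i = 0; i < numRows; i++)
    y = column1[i]               # column1.getFloat(i)
    result[i] = y + 1.2          # y + 1.2
```

Compared with the row-based sketch, there is no conversion step and no per-row allocation; the loop body is just an array read, an add, and an array write.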
•	21. How Columnar Storage is Used in PySpark • Share data between the columnar storages of Spark and Pandas – No serialization and deserialization – 3-100x performance improvements: @pandas_udf(‘double’) def plus(v): return v + 1.2 • Details in “Apache Arrow and Pandas UDF on Apache Spark” by Takuya Ueshin (Source: “Introducing Pandas UDF for PySpark”, Databricks blog)
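Why a Pandas UDF is fast can be sketched with plain Python lists standing in for pandas Series and Arrow buffers (illustrative only): the UDF is invoked once per column batch, not once per row, so the Python-call overhead is paid per batch rather than per record.

```python
# Sketch of vectorized (Pandas-UDF-style) evaluation over a column batch.
def plus(v):
    # Receives a whole column and returns a whole column, like the slide's
    # pandas_udf body `return v + 1.2` (a list stands in for a Series).
    return [x + 1.2 for x in v]

batch = [2.0, 1.9, 5000.0]   # one columnar batch, as Arrow would hand over
yy = plus(batch)             # a single Python call for the whole batch
```

With Arrow underneath, the batch itself crosses the JVM/Python boundary as shared columnar buffers, which is where the "no serialization and deserialization" claim comes from.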
•	22. Outline • Introduction • Deep dive into columnar storage • Deep dive into generated code of columnar storage • Next steps
•	23. Next Steps • Short-term – Support an array type in ColumnVector for the table cache – Support additional external columnar storages • Middle-term – Exploit SIMD instructions to process multiple rows in a column in generated code • Extension of SPARK-25728 (Tungsten IR)
•	24. Integrate Spark with Others • Frameworks: DL/ML frameworks • SPARK-24579 • SPARK-26413 • Resources: GPU, FPGA, … • SPARK-27396 • SAIS2019: “Apache Arrow-Based Unified Data Sharing and Transferring Format Among CPU and Accelerators” [Image: GPU/FPGA illustration, from rapids.ai]
•	25. Takeaway • Columnar storage is used to improve performance for – table cache, Parquet, ORC, and Arrow • Columnar storage from Spark 2.3 – improves performance of PySpark with Pandas UDF using Arrow – can be connected with other external columnar storages by using the public class “ColumnVector”
•	26. Thanks to the Spark Community • Especially @andrewor14, @bryanCutler, @cloud-fan, @dongjoon-hyun, @gatorsmile, @hvanhovell, @mgaido91, @ueshin, @viirya