Kazuaki Ishizaki
IBM Research – Tokyo
@kiszk
In-Memory Storage
Evolution in Apache Spark
#UnifiedAnalytics #SparkAISummit
About Me – Kazuaki Ishizaki
• Researcher at IBM Research in compiler optimizations
• Working on the IBM Java virtual machine for over 20 years
– In particular, just-in-time compiler
• Committer of Apache Spark (SQL package) since 2018
• ACM Distinguished Member
• Homepage: http://ibm.biz/ishizaki
GitHub: https://github.com/kiszk Twitter: @kiszk
https://slideshare.net/ishizaki
In-Memory Storage Evolution in Apache Spark / Kazuaki Ishizaki
Why In-Memory Storage?
• In-memory storage is mandatory for high performance
• In-memory columnar storage is necessary to
– Support Parquet, a first-class-citizen column format
– Achieve better compression rate for table cache
(Figure: the same table in row format vs. column format. The row format places the fields of Row 0, Row 1, and Row 2 adjacently in memory; the column format places the values of Column x, Column y, and Column z adjacently.)
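As a minimal sketch of the two layouts in the figure (plain Python, with the figure's sample values), the same three-row table can be serialized row by row or column by column; the columnar form keeps same-typed values adjacent, which is what enables the better compression rate mentioned above:

```python
import struct
import zlib

# The three-row table from the figure: (x: int, y: float, z: str).
rows = [(1, 2.0, "Spark"), (2, 1.9, "AI"), (3, 5000.0, "Summit")]

# Row format: the fields of each row are adjacent in memory.
row_bytes = b"".join(struct.pack("<id", x, y) + z.encode() for x, y, z in rows)

# Column format: all values of one column are adjacent in memory.
xs, ys, zs = zip(*rows)
col_bytes = (struct.pack("<3i", *xs)     # column x: ints together
             + struct.pack("<3d", *ys)   # column y: doubles together
             + "".join(zs).encode())     # column z: strings together

# Same information either way; the columnar layout groups same-typed
# values, which generally compresses better and suits vectorized access.
print(len(zlib.compress(row_bytes)), len(zlib.compress(col_bytes)))
```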
What I Will Talk about
• Columnar storage is used to improve performance for
– table cache, Parquet, ORC, and Arrow
• Columnar storage from Spark 2.3
– improves performance of PySpark with Pandas UDF using
Arrow
– can be connected to other external columnar storages through
the public class “ColumnVector”
How Columnar Storage is Used
• Table cache
• Parquet
• ORC
• Pandas UDF
df = ...
df.cache()
df1 = df.selectExpr("y + 1.2")

df = spark.read.parquet("c")
df1 = df.selectExpr("y + 1.2")

df = spark.read.format("orc").load("c")
df1 = df.selectExpr("y + 1.2")

@pandas_udf('double')
def plus(v):
    return v + 1.2
df1 = df.withColumn('yy', plus(df.y))
Performance among Spark Versions
• DataFrame table cache from Spark 2.0 to Spark 2.4
(Chart: relative elapsed time of df.filter("i % 16 == 0").count for Spark 2.0, 2.3, and 2.4; shorter is better.)
How This Improvement is Achieved
• Structure of columnar storage
• Generated code to access columnar storage
Outline
• Introduction
• Deep dive into columnar storage
• Deep dive into generated code of columnar storage
• Next steps
In-Memory Storage Evolution (1/2)
Spark version timeline:
• to 1.3 – RDD table cache: Java objects
• 1.4 to 1.6 – table cache: own memory layout by Project Tungsten
• 2.0 to 2.2 – Parquet vectorized reader: own memory layout, but a different class from the table cache
In-Memory Storage Evolution (2/2)
Spark version timeline (continued):
• 2.3 – Pandas UDF with Arrow; ColumnVector becomes a public class
• 2.4 – ORC vectorized reader
The ColumnVector class becomes public from Spark 2.3. Table cache, Parquet, ORC, and Arrow use the common ColumnVector class.
Implementation in Spark 1.4 to 1.6
• Table cache uses a CachedBatch class that is not accessed directly
from generated code
case class CachedBatch(
buffers: Array[Array[Byte]],
stats: Row)
(Diagram: column values such as 2.0 and 1.9 serialized into CachedBatch.buffers.)
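A rough Python sketch of the CachedBatch idea (hypothetical helper names, not Spark's actual code): each column is serialized into its own byte buffer, so generated code cannot read a value directly and must deserialize the buffer first:

```python
import struct
from typing import List, NamedTuple

class CachedBatch(NamedTuple):
    """Per-column serialized buffers, as in Spark 1.4 to 1.6 (sketch)."""
    buffers: List[bytes]

def write_float_column(values):
    # Serialize a float column into one opaque byte buffer.
    return struct.pack(f"<{len(values)}d", *values)

def read_float_column(buf):
    # Generated code cannot index into the buffer; it must deserialize.
    return list(struct.unpack(f"<{len(buf) // 8}d", buf))

batch = CachedBatch(buffers=[write_float_column([2.0, 1.9])])
print(read_float_column(batch.buffers[0]))  # [2.0, 1.9]
```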
Implementation in Spark 2.0
• Parquet uses a ColumnVector class with well-defined methods
that can be called from generated code
public abstract class ColumnVector {
float getFloat(…) …
UTF8String getUTF8String(…) …
…
}
public final class OnHeapColumnVector
extends ColumnVector {
private byte[] byteData;
…
private float[] floatData;
…
}
(Diagram: Parquet data is copied into a ColumnVector held by a ColumnarBatch.)
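The accessor idea can be sketched in Python (hypothetical names mirroring the Java classes above): an abstract get_float API over typed per-column arrays, so calling code never sees the storage layout:

```python
from abc import ABC, abstractmethod
import array

class ColumnVector(ABC):
    """Abstract accessor over one column of in-memory data (sketch)."""
    @abstractmethod
    def get_float(self, row_id: int) -> float: ...

class OnHeapColumnVector(ColumnVector):
    def __init__(self, floats):
        # One typed on-heap array per type, like Spark's floatData field.
        self.float_data = array.array("d", floats)

    def get_float(self, row_id: int) -> float:
        return self.float_data[row_id]

col = OnHeapColumnVector([2.0, 1.9])
print(col.get_float(1))  # 1.9
```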
Implementation in Spark 2.3
• Table cache, Parquet, and Arrow also use ColumnVector
• ColumnVector becomes a public class to define APIs
/**
 * An interface representing in-memory columnar data in Spark. This interface defines the main APIs
 * to access the data, as well as their batched versions. The batched versions are considered to be
 * faster and preferable whenever possible.
 */
@Evolving
public abstract class ColumnVector … {
float getFloat(…) …
UTF8String getUTF8String(…) …
…
}
https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java
public final class OnHeapColumnVector
extends ColumnVector {
// Array for each type.
private byte[] byteData;
…
private float[] floatData;
…
}
public final class ArrowColumnVector
extends ColumnVector {
…
}
(Users of ColumnVector.java: table cache, Parquet vectorized reader, and Pandas UDF with Arrow.)
ColumnVector for Your Columnar Storage
• Developers can write their own class that extends ColumnVector to
support a new columnar storage or to exchange data with other
formats
(Diagram: a columnar data source exposed to Spark through MyColumnarClass extends ColumnVector.)
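For illustration, a hypothetical MyColumnarClass in Python (all names invented here): it wraps a raw buffer produced by an external columnar source and exposes it through a ColumnVector-style accessor without copying the data into another layout:

```python
import struct

class MyColumnarClass:
    """ColumnVector-style view over an external float64 buffer (sketch)."""
    def __init__(self, buf: bytes):
        self.buf = buf  # raw little-endian doubles from an external source

    def get_float(self, row_id: int) -> float:
        # Read element row_id in place; no per-row conversion or copy.
        return struct.unpack_from("<d", self.buf, row_id * 8)[0]

external_buffer = struct.pack("<2d", 2.0, 1.9)  # stand-in external data
col = MyColumnarClass(external_buffer)
print(col.get_float(0), col.get_float(1))  # 2.0 1.9
```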
Implementation in Spark 2.4
• ORC also uses ColumnVector
(Users of ColumnVector.java now also include ORC: table cache, Parquet and ORC vectorized readers, and Pandas UDF with Arrow.)
Outline
• Introduction
• Deep dive into columnar storage
• Deep dive into generated code of columnar storage
• Next steps
How Is a Spark Program Executed?
• A Spark program is translated into Java code to be executed
Spark program:
df = ...
df.cache()
df1 = df.selectExpr("y + 1.2")

Catalyst translates it into Java code that runs on the Java virtual machine:
while (rowIterator.hasNext()) {
  Row row = rowIterator.next();
  …
}

Source: Armbrust et al., “Spark SQL: Relational Data Processing in Spark,” SIGMOD ’15
Access Columnar Storage (before 2.0)
• Although columnar storage is used, generated code gets data
from row storage, so data conversion is required
while (rowIterator.hasNext()) {
  Row row = rowIterator.next();
  float y = row.getFloat(1);
  float f = y + 1.2f;
  …
}
df1 = df.selectExpr("y + 1.2")
(Diagram: Catalyst converts the CachedBatch columnar storage into row storage before the generated code reads each row.)
Access Columnar Storage (from 2.0)
• When columnar storage is used, reading a data element directly
accesses the columnar storage
– Removed the copy for Parquet in 2.0 and for the table cache in 2.3
ColumnVector column1 = …
int i = 0;
while (i < numRows) {
  float y = column1.getFloat(i);
  float f = y + 1.2f;
  …
  i++;
}
df1 = df.selectExpr("y + 1.2")
(Diagram: the generated code reads 2.0 (i = 0) and 1.9 (i = 1) directly from the ColumnVector, producing f = 3.2 and 3.1.)
Access Columnar Storage (from 2.3)
• Generate this pattern for all cases that use ColumnVector
• Use a for-loop to encourage compiler optimizations
– The HotSpot compiler applies loop optimizations to a well-formed loop
ColumnVector column1 = …
for (int i = 0; i < numRows; i++) {
  float y = column1.getFloat(i);
  float f = y + 1.2f;
  …
}
df1 = df.selectExpr("y + 1.2")
(Diagram: Catalyst generates the for-loop, which reads 2.0 (i = 0) and 1.9 (i = 1) directly from the ColumnVector, producing f = 3.2 and 3.1.)
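The effect of the counted for-loop can be sketched in Python (a simplified stand-in, not the actual generated Java): a loop with a single induction variable that reads element i straight from the column, the shape that lets a JIT compiler hoist bounds checks, unroll, and vectorize:

```python
import array

column1 = array.array("f", [2.0, 1.9])   # stands in for ColumnVector
num_rows = len(column1)
out = array.array("f", [0.0] * num_rows)

for i in range(num_rows):     # well-formed counted loop
    y = column1[i]            # column1.getFloat(i)
    out[i] = y + 1.2          # "y + 1.2" from selectExpr
```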
How Columnar Storage Is Used in PySpark
• Share data between the columnar storages of Spark and Pandas
– No serialization and deserialization
– 3-100x performance improvements
Details in the talk “Apache Arrow and Pandas UDF on Apache Spark” by Takuya Ueshin
Source: “Introducing Pandas UDF for PySpark”, Databricks blog
@pandas_udf('double')
def plus(v):
    return v + 1.2
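The semantics can be mimicked in plain Python (no pyspark or pandas here; a list stands in for the Arrow-backed pandas.Series): the UDF receives a whole column batch and returns a whole batch, so data crosses the JVM-Python boundary once per batch rather than once per row:

```python
def plus(v):
    # v is a whole batch of doubles (a pandas.Series in real Pandas UDFs).
    return [x + 1.2 for x in v]

batch = [2.0, 1.9]      # one Arrow batch of column y (stand-in)
result = plus(batch)    # vectorized: one call for the whole batch
print(result)
```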
Outline
• Introduction
• Deep dive into columnar storage
• Deep dive into generated code of columnar storage
• Next steps
Next Steps
• Short-term
– support an array type in ColumnVector for table cache
– support additional external columnar storage
• Middle-term
– exploit SIMD instructions to process multiple rows in a column
in generated code
• Extension of SPARK-25728 (Tungsten IR)
Integrate Spark with Others
• Frameworks: DL/ML frameworks
• SPARK-24579
• SPARK-26413
• Resources: GPU, FPGA, ..
• SPARK-27396
• SAIS2019: “Apache Arrow-Based
Unified Data Sharing and
Transferring Format Among
CPU and Accelerators”
(Image: GPU and FPGA accelerators, from rapids.ai)
Takeaway
• Columnar storage is used to improve performance for
– table cache, Parquet, ORC, and Arrow
• Columnar storage from Spark 2.3
– improves performance of PySpark with Pandas UDF using
Arrow
– can be connected to other external columnar storages through
the public class “ColumnVector”
Thanks to the Spark Community
• Especially @andrewor14, @bryanCutler, @cloud-fan,
@dongjoon-hyun, @gatorsmile, @hvanhovell, @mgaido91,
@ueshin, @viirya

In-Memory Evolution in Apache Spark

  • 1. Kazuaki Ishizaki IBM Research – Tokyo @kiszk In-Memory Storage Evolution in Apache Spark #UnifiedAnalytics #SparkAISummit
  • 2. About Me – Kazuaki Ishizaki • Researcher at IBM Research in compiler optimizations • Working for IBM Java virtual machine over 20 years – In particular, just-in-time compiler • Committer of Apache Spark (SQL package) from 2018 • ACM Distinguished Member • Homepage: http://ibm.biz/ishizaki b: https://github.com/kiszk wit: @kiszk https://slideshare.net/ishizaki 2In-Memory Storage Evolution in Apache Spark / Kazuaki Ishizaki #UnifiedAnalytics #SparkAISummit
  • 3. Why is In-Memory Storage? • In-memory storage is mandatory for high performance • In-memory columnar storage is necessary to – Support first-class citizen column format Parquet – Achieve better compression rate for table cache 3In-Memory Storage Evolution in Apache Spark / Kazuaki Ishizaki #UnifiedAnalytics #SparkAISummit memory address memory address SummitAISpark 5000.01.92.0 321Summit AI Spark 5000.0 1.9 2.0 3 2 1 Row format Column format Row 0 Row 1 Row 2 Column x Column y Column z
  • 4. What I Will Talk about • Columnar storage is used to improve performance for – table cache, Parquet, ORC, and Arrow • Columnar storage from Spark 2.3 – improves performance of PySpark with Pandas UDF using Arrow – can be connected with external other columnar storages by using a public class “ColumnVector” 4#UnifiedAnalytics #SparkAISummit
  • 5. How Columnar Storage is Used • Table cache ORC • Pandas UDF Parquet 5In-Memory Storage Evolution in Apache Spark / Kazuaki Ishizaki #UnifiedAnalytics #SparkAISummit df = ... df.cache df1 = df.selectExpr(“y + 1.2”) df = spark.read.parquet(“c”) df1 = df.selectExpr(“y + 1.2”) df = spark.read.format(“orc”).load(“c”) df1 = df.selectExpr(“y + 1.2”) @pandas_udf(‘double’) def plus(v): return v + 1.2 df1 = df.withColumn(‘yy’, plus(df.y))
  • 6. Performance among Spark Versions • DataFrame table cache from Spark 2.0 to Spark 2.4 6In-Memory Storage Evolution in Apache Spark / Kazuaki Ishizaki #UnifiedAnalytics #SparkAISummit 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Spark 2.0 Spark 2.3 Spark 2.4 Performance comparison among different Spark versions Relative elapsed time shorter is better df.filter(“i % 16 == 0").count
  • 7. How This Improvement is Achieved • Structure of columnar storage • Generated code to access columnar storage 7In-Memory Storage Evolution in Apache Spark / Kazuaki Ishizaki #UnifiedAnalytics #SparkAISummit
  • 8. Outline • Introduction • Deep dive into columnar storage • Deep dive into generated code of columnar storage • Next steps 8In-Memory Storage Evolution in Apache Spark / Kazuaki Ishizaki #UnifiedAnalytics #SparkAISummit
  • 9. In-Memory Storage Evolution (1/2) 9In-Memory Storage Evolution in Apache Spark / Kazuaki Ishizaki #UnifiedAnalytics #SparkAISummit AI| Spark| Spark AI Table cache 2.0 1.9 2.0 1.9 Spark AI Parquet vectorized reader 2.0 1.9 1.4 to 1.6 RDD table cache to 1.3 2.0 to 2.2 RDD table cache : Java objects Table cache : Own memory layout by Project Tungsten for table cache Parquet : Own memory layout, but different class from table cacheSpark version
•	10. In-Memory Storage Evolution (2/2) • Spark 2.3: ColumnVector becomes a public class; table cache, Parquet vectorized reader, and Pandas UDF with Arrow use it • Spark 2.4: ORC vectorized reader also uses it • From Spark 2.3, table cache, Parquet, ORC, and Arrow all share the common ColumnVector class
•	11. Implementation in Spark 1.4 to 1.6 • Table cache uses CachedBatch, which is not accessed directly by generated code: case class CachedBatch(buffers: Array[Array[Byte]], stats: Row)
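Why generated code cannot read CachedBatch directly can be sketched in plain Python (field names follow the slide; the compression scheme here is illustrative): each column lives in an opaque byte buffer, so reading even one value means decompressing and unpacking the whole column first.

```python
# Hypothetical sketch of a CachedBatch-like structure (not Spark's code):
# each column is an opaque, compressed byte buffer.
import struct
import zlib

floats = [2.0, 1.9, 5000.0]
buffer_y = zlib.compress(struct.pack(f"{len(floats)}f", *floats))

cached_batch = {"buffers": [buffer_y], "stats": {"numRows": len(floats)}}

# Generated code cannot index into buffer_y; the whole column must be
# decompressed and unpacked before any single value is usable.
raw = zlib.decompress(cached_batch["buffers"][0])
values = list(struct.unpack(f"{len(floats)}f", raw))
```

This decode step is the overhead that the later ColumnVector design, with directly addressable primitive arrays, removes from the hot path.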
•	12. Implementation in Spark 2.0 • Parquet uses the ColumnVector class, which has well-defined methods that can be called from generated code: public abstract class ColumnVector { float getFloat(…) … UTF8String getUTF8String(…) … … } public final class OnHeapColumnVector extends ColumnVector { private byte[] byteData; … private float[] floatData; … } [Diagram: Parquet data is copied into a ColumnVector inside a ColumnarBatch]
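The OnHeapColumnVector idea on this slide can be sketched in plain Python (class and method names are illustrative, not Spark's Java API): one contiguous primitive array per type, plus typed getters that generated code can call directly without any decoding.

```python
# Minimal sketch of an on-heap column vector (illustrative, not Spark code).
from array import array

class SketchColumnVector:
    """One primitive array per type; typed getters read it directly."""
    def __init__(self, floats):
        # array('f') is a contiguous buffer of 32-bit floats, analogous to
        # Java's float[] floatData in OnHeapColumnVector.
        self.float_data = array("f", floats)

    def get_float(self, row_id):
        return self.float_data[row_id]

col = SketchColumnVector([2.0, 1.9, 5000.0])
```

A call like `col.get_float(i)` is a plain array index: no row object, no buffer decode, which is exactly what makes it callable from a tight generated loop.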
•	13. Implementation in Spark 2.3 • Table cache, Parquet, and Arrow also use ColumnVector • ColumnVector becomes a public class to define the APIs (ColumnVector.java: https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java): /** An interface representing in-memory columnar data in Spark. This interface defines the main APIs to access the data, as well as their batched versions. The batched versions are considered to be faster and preferable whenever possible. */ @Evolving public abstract class ColumnVector … { float getFloat(…) … UTF8String getUTF8String(…) … … } • OnHeapColumnVector (table cache, Parquet vectorized reader) and ArrowColumnVector (Pandas UDF with Arrow) extend ColumnVector
•	14. ColumnVector for Your Columnar Storage • Developers can write their own class, which extends ColumnVector, to support a new columnar storage or to exchange data with other formats: MyColumnarClass extends ColumnVector
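The slide's MyColumnarClass can be sketched in plain Python (class and method names are illustrative; the real API is Spark's Java ColumnVector class): an adapter exposes an external columnar source through the same typed getters Spark's generated code expects.

```python
# Hypothetical sketch of adapting an external columnar source to a
# ColumnVector-style API (names are illustrative, not Spark's).
from abc import ABC, abstractmethod

class ColumnVectorLike(ABC):
    """Stand-in for the abstract ColumnVector API."""
    @abstractmethod
    def get_float(self, row_id):
        ...

class MyColumnarClass(ColumnVectorLike):
    """Adapter: exposes an external column (here a plain list) via getters."""
    def __init__(self, external_column):
        self._col = external_column  # data owned by the external source

    def get_float(self, row_id):
        return float(self._col[row_id])

vec = MyColumnarClass([2.0, 1.9, 5000.0])
```

Because only the getter contract matters, the external data never needs to be copied into Spark's own buffers to be readable by the generated loop.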
•	15. Implementation in Spark 2.4 • ORC also uses ColumnVector • OnHeapColumnVector now serves the table cache and the Parquet and ORC vectorized readers; ArrowColumnVector serves Pandas UDF with Arrow (see ColumnVector.java for the API)
•	16. Outline • Introduction • Deep dive into columnar storage • Deep dive into generated code of columnar storage • Next steps
•	17. How Is a Spark Program Executed? • A Spark program is translated by Catalyst into Java code, which is executed on the Java virtual machine: df = ... df.cache df1 = df.selectExpr(“y + 1.2”) → while (rowIterator.hasNext()) { Row row = rowIterator.next(); … } (Source: Michael et al., Spark SQL: Relational Data Processing in Spark, SIGMOD’15)
•	18. Access Columnar Storage (before 2.0) • Although columnar storage (CachedBatch) is used, generated code gets data from row storage, so data conversion from columnar to row storage is required: df1 = df.selectExpr(“y + 1.2”) → while (rowIterator.hasNext()) { Row row = rowIterator.next(); float y = row.getFloat(1); float f = y + 1.2f; … }
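The pre-2.0 pattern can be sketched in plain Python (illustrative, not actual generated code): the columnar data is first materialized as row objects, and the per-row loop then reads fields from those rows, paying for the conversion up front.

```python
# Sketch of the pre-2.0 access pattern: columnar storage -> rows -> loop.
columns = {"x": [1, 2, 3], "y": [2.0, 1.9, 5000.0]}

# Data conversion: build one row tuple per record from the columnar storage.
rows = list(zip(columns["x"], columns["y"]))

result = []
for row in rows:            # while (rowIterator.hasNext()) { ... }
    y = row[1]              # row.getFloat(1)
    result.append(y + 1.2)  # y + 1.2
```

Every record is touched twice: once to build the row, once to read it, and the intermediate row objects are pure overhead for this query.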
•	19. Access Columnar Storage (from 2.0) • When columnar storage is used, reading data elements directly accesses the columnar storage – Removed the copy for Parquet in 2.0 and for the table cache in 2.3: df1 = df.selectExpr(“y + 1.2”) → ColumnVector column1 = …; int i = 0; while (i < numRows) { float y = column1.getFloat(i); float f = y + 1.2f; …; i++; }
•	20. Access Columnar Storage (from 2.3) • Generate this pattern for all cases regarding ColumnVector • Use a for-loop to encourage compiler optimizations – The HotSpot compiler applies loop optimizations to a well-formed loop: df1 = df.selectExpr(“y + 1.2”) → ColumnVector column1 = …; for (int i = 0; i < numRows; i++) { float y = column1.getFloat(i); float f = y + 1.2f; … }
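The Spark 2.3 pattern can be sketched in plain Python (illustrative only; in Spark this is generated Java, where the counted for-loop shape is what lets the HotSpot JIT apply loop optimizations): a plain indexed loop reads each value straight out of the column array, with no row objects in between.

```python
# Sketch of the from-2.3 access pattern: a counted loop over the column.
from array import array

column1 = array("f", [2.0, 1.9, 5000.0])  # stands in for ColumnVector
num_rows = len(column1)

result = array("f", [0.0] * num_rows)
for i in range(num_rows):        # for (int i = 0; i < numRows; i++)
    y = column1[i]               # column1.getFloat(i)
    result[i] = y + 1.2          # y + 1.2
```

Compared with the row-based sketch, there is no conversion step and no per-row allocation; the loop body is just an array read, an add, and an array write.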
•	21. How Columnar Storage is Used in PySpark • Share data between the columnar storages of Spark and Pandas – No serialization and deserialization – 3-100x performance improvements: @pandas_udf(‘double’) def plus(v): return v + 1.2 • Details in “Apache Arrow and Pandas UDF on Apache Spark” by Takuya Ueshin (Source: “Introducing Pandas UDF for PySpark”, Databricks blog)
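Why a Pandas UDF is fast can be sketched with plain Python lists standing in for pandas Series and Arrow buffers (illustrative only): the UDF is invoked once per column batch, not once per row, so the Python-call overhead is paid per batch rather than per record.

```python
# Sketch of vectorized (Pandas-UDF-style) evaluation over a column batch.
def plus(v):
    # Receives a whole column and returns a whole column, like the slide's
    # pandas_udf body `return v + 1.2` (a list stands in for a Series).
    return [x + 1.2 for x in v]

batch = [2.0, 1.9, 5000.0]   # one columnar batch, as Arrow would hand over
yy = plus(batch)             # a single Python call for the whole batch
```

With Arrow underneath, the batch itself crosses the JVM/Python boundary as shared columnar buffers, which is where the "no serialization and deserialization" claim comes from.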
•	22. Outline • Introduction • Deep dive into columnar storage • Deep dive into generated code of columnar storage • Next steps
•	23. Next Steps • Short-term – Support an array type in ColumnVector for the table cache – Support additional external columnar storages • Middle-term – Exploit SIMD instructions to process multiple rows in a column in generated code • Extension of SPARK-25728 (Tungsten IR)
•	24. Integrate Spark with Others • Frameworks: DL/ML frameworks • SPARK-24579 • SPARK-26413 • Resources: GPU, FPGA, … • SPARK-27396 • SAIS2019: “Apache Arrow-Based Unified Data Sharing and Transferring Format Among CPU and Accelerators” [Image: GPU/FPGA illustration, from rapids.ai]
•	25. Takeaway • Columnar storage is used to improve performance for – table cache, Parquet, ORC, and Arrow • Columnar storage from Spark 2.3 – improves performance of PySpark with Pandas UDF using Arrow – can be connected with other external columnar storages by using the public class “ColumnVector”
•	26. Thanks to the Spark Community • Especially @andrewor14, @bryanCutler, @cloud-fan, @dongjoon-hyun, @gatorsmile, @hvanhovell, @mgaido91, @ueshin, @viirya