Apache Spark already has vectorization optimizations in many operations, for instance its internal columnar format, Parquet/ORC vectorized reads, Pandas UDFs, etc. Vectorization greatly improves performance in general. This talk discusses the performance aspects of SparkR and introduces vectorization in SparkR with technical details. SparkR vectorization lets users keep their existing code as-is while gaining a speedup of up to several thousand percent when executing R native functions or converting a Spark DataFrame to/from an R data.frame.
15. Driver implementation
#UnifiedDataAnalytics #SparkAISummit
1. RBackend opens a server port and waits for connections.
2. R establishes the socket connections.
3. Each SparkR call sends serialized data over the socket and waits for the response.
4. RBackendHandler handles and processes the requests, and sends the results back row by row.
[Diagram: R process ↔ JVM backend communicating over sockets]
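The steps above can be sketched from the user side. A minimal, hedged illustration: every public SparkR API call ultimately becomes a serialized request over the RBackend socket, which SparkR's internal `callJMethod` helper exposes (internal API shown for illustration only, not for production use):

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(mtcars)

# A public call like nrow(df) is, underneath, a serialized request over
# the RBackend socket -- roughly equivalent to invoking the JVM method:
n <- SparkR:::callJMethod(df@sdf, "count")  # internal API, illustration only
print(n)
```

The `@sdf` slot holds the JVM-side Dataset reference that the backend resolves when it dispatches the call.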
16. createDataFrame and collect
[Diagram, without Arrow:
createDataFrame: R data.frame → list(list(…)) → parallelize(…), serialized and sent row by row → bytes to rows → Spark DataFrame
collect: Spark DataFrame → rows to Array(Array(…)) → sent row by row → bytes to lists → data.frame(…) → R data.frame]
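The round trip in the diagram corresponds to the two public APIs. A minimal sketch (assuming a running Spark session; `faithful` is a built-in R dataset):

```r
library(SparkR)
sparkR.session()

# R data.frame -> Spark DataFrame: without Arrow, each row is serialized
# and sent over the socket one at a time.
spark_df <- createDataFrame(faithful)

# Spark DataFrame -> R data.frame: rows are collected back row by row
# and reassembled with data.frame(...).
local_df <- collect(spark_df)
identical(dim(local_df), dim(faithful))  # same shape after the round trip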
17. Worker implementation
1. RRunner sends data and the serialized R function through a socket.
2. The R worker receives the serialized function and data.
3. The R worker deserializes the function and the data row by row.
4. The R worker executes the function and sends the results back to RRunner.
[Diagram: JVM (RRunner) ↔ R worker processes over a socket]
18. dapply and gapply
[Diagram, without Arrow: PhysicalOperator → rows passed row by row → RRunner → serialize row by row → R worker invokes the R function → deserialize row by row → rows returned row by row to PhysicalOperator]
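The row-by-row path above is what a plain `dapply` call exercises. A hedged usage sketch (assuming a running Spark session; `kpl` is an illustrative derived column):

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(mtcars)

# dapply runs an R native function over each partition. Without Arrow,
# every row crosses the JVM/R boundary one at a time, in both directions.
out_schema <- structType(structField("mpg", "double"),
                         structField("kpl", "double"))
out <- dapply(df, function(pdf) {
  data.frame(mpg = pdf$mpg, kpl = pdf$mpg * 0.425)  # miles/gal -> km/l
}, schema = out_schema)
head(collect(out))
```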
22. Interchangeable, no copy
Without Arrow:
- Each system has its own internal memory format
- 70-80% of computation wasted on (de)serialization
- Similar functionality implemented in multiple projects

With Arrow:
- All systems utilize the same memory format
- No overhead for cross-system communication
- Projects can share functionality (e.g., a Parquet-to-Arrow reader)
26. createDataFrame and collect
- Use Arrow to serialize/deserialize data
  - Streaming format for interprocess communication (IPC)
  - ArrowWriter and ArrowColumnVector
- Communicate between the JVM and the R process via socket
- createDataFrame in SQLContext.R → readArrowStreamFromFile in SQLUtils.scala
- collect in DataFrame.R → collectAsArrowToR in Dataset.scala
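From the user's perspective the Arrow path is a configuration toggle; the code stays the same. A hedged sketch — the config name shown is the Spark 3.0+ one (`spark.sql.execution.arrow.sparkr.enabled`); earlier versions guarded the optimization behind a different experimental flag:

```r
library(SparkR)
# Assumption: Spark 3.0+ config name; adjust for older versions.
sparkR.session(sparkConfig = list(
  spark.sql.execution.arrow.sparkr.enabled = "true"
))

# With Arrow enabled, createDataFrame/collect exchange data as Arrow
# streaming-format batches instead of row-by-row serialization.
spark_df <- createDataFrame(faithful)
local_df <- collect(spark_df)
```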
27. createDataFrame and collect
[Diagram (recap of the row-by-row flow before Arrow):
createDataFrame: R data.frame → list(list(…)) → parallelize(…), serialized and sent row by row → bytes to rows → Spark DataFrame
collect: Spark DataFrame → rows to Array(Array(…)) → sent row by row → bytes to lists → data.frame(…) → R data.frame]
30. dapply and gapply
- Use Arrow to serialize/deserialize data
  - Streaming format for interprocess communication (IPC)
  - ArrowWriter and ArrowColumnVector
- Communicate between the JVM and the R worker via socket
- ArrowRRunner
- Physical operators for each R native function execution:
  - MapPartitionsInRWithArrowExec
  - FlatMapGroupsInRWithArrowExec
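A hedged usage sketch of the grouped path: `gapply` applies an R native function per group, which on the Arrow path is planned as FlatMapGroupsInRWithArrowExec and streams each group as Arrow record batches rather than row by row (config name is the Spark 3.0+ one, an assumption for older versions):

```r
library(SparkR)
sparkR.session(sparkConfig = list(
  spark.sql.execution.arrow.sparkr.enabled = "true"  # assumed Spark 3.0+ name
))

df <- createDataFrame(mtcars)

# Group rows by cylinder count and compute the mean mpg per group.
out <- gapply(
  df, "cyl",
  function(key, pdf) data.frame(cyl = key[[1]], avg_mpg = mean(pdf$mpg)),
  schema = structType(structField("cyl", "double"),
                      structField("avg_mpg", "double"))
)
collect(out)
```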
31. dapply and gapply
[Diagram (recap, without Arrow): PhysicalOperator → rows passed row by row → RRunner → serialize row by row → R worker invokes the R function → deserialize row by row → rows returned row by row to PhysicalOperator]
32. dapply and gapply
[Diagram, with Arrow: PhysicalOperator → groups of rows → ArrowRRunner → serialize as Arrow batches → R worker invokes the R function → deserialize Arrow batches → groups of rows returned to PhysicalOperator]
39. dapplyCollect and gapplyCollect
[Diagram: PhysicalOperator → group of rows → ArrowRRunner → serialize Arrow batches → R worker invokes the R function → deserialize Arrow batches → ???]
The output schema is unknown before eager execution.
40. dapplyCollect and gapplyCollect
[Diagram, with Arrow: ArrowRRunner invokes the R function and serializes/deserializes Arrow batches; the Arrow batches by-pass the PhysicalOperator, carried as a DataFrame<binary>, and are collected directly into an R data.frame on the driver]
By-pass with DataFrame<binary>
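This by-pass is why the collect variants need no schema argument from the user: the result is carried opaquely as binary and only materialized as an R data.frame on the driver. A hedged usage sketch (assuming a running Spark session with the Arrow optimization enabled):

```r
library(SparkR)
sparkR.session(sparkConfig = list(
  spark.sql.execution.arrow.sparkr.enabled = "true"  # assumed Spark 3.0+ name
))

df <- createDataFrame(mtcars)

# dapplyCollect takes no schema: the output schema is unknown before
# eager execution, so Arrow batches by-pass the physical operator as
# DataFrame<binary> and land directly in a local R data.frame.
local_df <- dapplyCollect(df, function(pdf) {
  pdf$kpl <- pdf$mpg * 0.425  # illustrative derived column
  pdf
})
head(local_df)
```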