Apache Spark already has vectorization optimizations in many operations, for instance its internal columnar format, Parquet/ORC vectorized reads, Pandas UDFs, etc. Vectorization greatly improves performance in general. This talk discusses the performance aspects of SparkR and introduces vectorization in SparkR with technical details. SparkR vectorization lets users keep their existing code as-is while gaining a speedup of up to several thousand percent when executing R native functions or converting a Spark DataFrame to/from an R data.frame.
15. Driver implementation
#UnifiedDataAnalytics #SparkAISummit
1. RBackend opens a server port and waits for connections.
2. R establishes the socket connections.
3. Each SparkR call sends serialized data over the socket and waits for the response.
4. RBackendHandler handles and processes the requests, and sends the results back row by row.
[Diagram: R process ↔ JVM backend communicating over sockets]
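The steps above can be sketched from the user side. A minimal, hedged illustration: every public SparkR API call ultimately becomes a serialized request over the RBackend socket, which SparkR's internal `callJMethod` helper exposes (internal API shown for illustration only, not for production use):

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(mtcars)

# A public call like nrow(df) is, underneath, a serialized request over
# the RBackend socket -- roughly equivalent to invoking the JVM method:
n <- SparkR:::callJMethod(df@sdf, "count")  # internal API, illustration only
print(n)
```

The `@sdf` slot holds the JVM-side Dataset reference that the backend resolves when it dispatches the call.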
16. createDataFrame and collect
[Diagram, without Arrow:
createDataFrame: R data.frame → list(list(…)) → parallelize(…), serialized and sent row by row → bytes to rows → Spark DataFrame
collect: Spark DataFrame → rows to Array(Array(…)) → sent row by row → bytes to lists → data.frame(…) → R data.frame]
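The round trip in the diagram corresponds to the two public APIs. A minimal sketch (assuming a running Spark session; `faithful` is a built-in R dataset):

```r
library(SparkR)
sparkR.session()

# R data.frame -> Spark DataFrame: without Arrow, each row is serialized
# and sent over the socket one at a time.
spark_df <- createDataFrame(faithful)

# Spark DataFrame -> R data.frame: rows are collected back row by row
# and reassembled with data.frame(...).
local_df <- collect(spark_df)
identical(dim(local_df), dim(faithful))  # same shape after the round trip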
17. Worker implementation
1. RRunner sends data and the serialized R function through a socket.
2. The R worker receives the serialized function and data.
3. The R worker deserializes the function and the data row by row.
4. The R worker executes the function and sends the results back to RRunner.
[Diagram: JVM (RRunner) ↔ R worker processes over a socket]
18. dapply and gapply
[Diagram, without Arrow: PhysicalOperator → rows passed row by row → RRunner → serialize row by row → R worker invokes the R function → deserialize row by row → rows returned row by row to PhysicalOperator]
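The row-by-row path above is what a plain `dapply` call exercises. A hedged usage sketch (assuming a running Spark session; `kpl` is an illustrative derived column):

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(mtcars)

# dapply runs an R native function over each partition. Without Arrow,
# every row crosses the JVM/R boundary one at a time, in both directions.
out_schema <- structType(structField("mpg", "double"),
                         structField("kpl", "double"))
out <- dapply(df, function(pdf) {
  data.frame(mpg = pdf$mpg, kpl = pdf$mpg * 0.425)  # miles/gal -> km/l
}, schema = out_schema)
head(collect(out))
```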
22. Interchangeable, no copy
Without Arrow:
- Each system has its own internal memory format
- 70-80% of computation wasted on (de)serialization
- Similar functionality implemented in multiple projects

With Arrow:
- All systems utilize the same memory format
- No overhead for cross-system communication
- Projects can share functionality (e.g., a Parquet-to-Arrow reader)
26. createDataFrame and collect
- Use Arrow to serialize/deserialize data
  - Streaming format for interprocess communication (IPC)
  - ArrowWriter and ArrowColumnVector
- Communicate between the JVM and the R process via socket
- createDataFrame in SQLContext.R → readArrowStreamFromFile in SQLUtils.scala
- collect in DataFrame.R → collectAsArrowToR in Dataset.scala
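From the user's perspective the Arrow path is a configuration toggle; the code stays the same. A hedged sketch — the config name shown is the Spark 3.0+ one (`spark.sql.execution.arrow.sparkr.enabled`); earlier versions guarded the optimization behind a different experimental flag:

```r
library(SparkR)
# Assumption: Spark 3.0+ config name; adjust for older versions.
sparkR.session(sparkConfig = list(
  spark.sql.execution.arrow.sparkr.enabled = "true"
))

# With Arrow enabled, createDataFrame/collect exchange data as Arrow
# streaming-format batches instead of row-by-row serialization.
spark_df <- createDataFrame(faithful)
local_df <- collect(spark_df)
```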
27. createDataFrame and collect
[Diagram (recap of the row-by-row flow before Arrow):
createDataFrame: R data.frame → list(list(…)) → parallelize(…), serialized and sent row by row → bytes to rows → Spark DataFrame
collect: Spark DataFrame → rows to Array(Array(…)) → sent row by row → bytes to lists → data.frame(…) → R data.frame]
30. dapply and gapply
- Use Arrow to serialize/deserialize data
  - Streaming format for interprocess communication (IPC)
  - ArrowWriter and ArrowColumnVector
- Communicate between the JVM and the R worker via socket
- ArrowRRunner
- Physical operators for each R native function execution:
  - MapPartitionsInRWithArrowExec
  - FlatMapGroupsInRWithArrowExec
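A hedged usage sketch of the grouped path: `gapply` applies an R native function per group, which on the Arrow path is planned as FlatMapGroupsInRWithArrowExec and streams each group as Arrow record batches rather than row by row (config name is the Spark 3.0+ one, an assumption for older versions):

```r
library(SparkR)
sparkR.session(sparkConfig = list(
  spark.sql.execution.arrow.sparkr.enabled = "true"  # assumed Spark 3.0+ name
))

df <- createDataFrame(mtcars)

# Group rows by cylinder count and compute the mean mpg per group.
out <- gapply(
  df, "cyl",
  function(key, pdf) data.frame(cyl = key[[1]], avg_mpg = mean(pdf$mpg)),
  schema = structType(structField("cyl", "double"),
                      structField("avg_mpg", "double"))
)
collect(out)
```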
31. dapply and gapply
[Diagram (recap, without Arrow): PhysicalOperator → rows passed row by row → RRunner → serialize row by row → R worker invokes the R function → deserialize row by row → rows returned row by row to PhysicalOperator]
32. dapply and gapply
[Diagram, with Arrow: PhysicalOperator → groups of rows → ArrowRRunner → serialize as Arrow batches → R worker invokes the R function → deserialize Arrow batches → groups of rows returned to PhysicalOperator]
39. dapplyCollect and gapplyCollect
[Diagram: PhysicalOperator → group of rows → ArrowRRunner → serialize Arrow batches → R worker invokes the R function → deserialize Arrow batches → ???]
The output schema is unknown before eager execution.
40. dapplyCollect and gapplyCollect
[Diagram, with Arrow: ArrowRRunner invokes the R function and serializes/deserializes Arrow batches; the Arrow batches by-pass the PhysicalOperator, carried as a DataFrame<binary>, and are collected directly into an R data.frame on the driver]
By-pass with DataFrame<binary>
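This by-pass is why the collect variants need no schema argument from the user: the result is carried opaquely as binary and only materialized as an R data.frame on the driver. A hedged usage sketch (assuming a running Spark session with the Arrow optimization enabled):

```r
library(SparkR)
sparkR.session(sparkConfig = list(
  spark.sql.execution.arrow.sparkr.enabled = "true"  # assumed Spark 3.0+ name
))

df <- createDataFrame(mtcars)

# dapplyCollect takes no schema: the output schema is unknown before
# eager execution, so Arrow batches by-pass the physical operator as
# DataFrame<binary> and land directly in a local R data.frame.
local_df <- dapplyCollect(df, function(pdf) {
  pdf$kpl <- pdf$mpg * 0.425  # illustrative derived column
  pdf
})
head(local_df)
```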