Script transformation is an important and growing use case for Apache Spark at Facebook. Spark's script transforms allow users to run custom scripts and binaries directly from SQL and serve as an important means of stitching Facebook's custom business logic into existing data pipelines.
Along with Spark SQL and UDFs, a growing number of our custom pipelines leverage Spark's script transform operator to run user-provided binaries for applications such as indexing, parallel training, and inference at scale. Spawning custom processes from the Spark executors introduces new challenges in production, including external resource allocation and management, structured data serialization, and external process monitoring.
In this session, we will talk about the improvements to Spark SQL (and the resource manager) to support running reliable and performant script transformation pipelines. This includes:
1) cgroup v2 containers for CPU, Memory and IO enforcement,
2) Transform jail for processes namespace management,
3) Support for complex types in Row format delimited SerDe,
4) Protocol Buffers for fast and efficient structured data serialization.
Finally, we will conclude by sharing our results, lessons learned, and future directions (e.g., resource over-subscription for transform pipelines).
2. Powering Custom Apps at Facebook using Spark Script Transformation
Abdulrahman Alfozan
Spark Summit Europe
3. Agenda
1. Intro to Spark Script Transforms
2. Spark Transforms at Facebook
3. Core Engine Improvements
4. Efficiency Analysis and Results
5. Transforms Execution Model
6. Future Plans
4. Spark at Facebook
2015: Small-scale experiments
2016: Few pipelines in production
2017: Running 60TB+ shuffle pipelines
2018: Full-production deployment; successor to Apache Hive at Facebook
2019: Scaling Spark; largest compute engine at Facebook by CPU
Reliability and efficiency are our top priorities
8. Script Transforms
SQL query:
SELECT
TRANSFORM (inputs)
USING 'script'
AS (outputs)
FROM src_tbl;
Query plan:
TableScan (src_tbl) -> ScriptTransformation (inputs, script, outputs)
Execution:
Input Table -> Spark -> inputs -> External Process -> outputs -> Spark -> Output Table
11. Why Script Transforms?
1. Flexibility:
Unlike UDFs, transforms allow unlimited use cases
2. Efficiency:
Most transformers are written in C++
Transforms provide custom data processing while relying on Spark for ETL, data partitioning, distributed execution, and fault tolerance.
e.g., Spark is optimized for ETL; PyTorch is optimized for model serving.
21. Core Engine Improvements
Operator
ScriptTransformationExec.scala
• Direct process invocation
• Class IOSchema to handle SerDe schema and config
• MonitorThread to track transform process progress
• Transform process error handling and surfacing
23. Core Engine Improvements
SerDe support
• SimpleSerDe.scala
Configurable properties:
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':'
LINES TERMINATED BY '\n'
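As an illustration of how a transformer might decode complex types under these delimiter settings, the sketch below parses a row such as 7,a|b|c,k1:v1|k2:v2 into a scalar, an array, and a map. It is a simplified example, not SimpleSerDe itself: the struct and function names are hypothetical, and escaping, NULL markers, and nested collections are deliberately left out.

```cpp
#include <map>
#include <string>
#include <vector>

// Delimiters matching the ROW FORMAT DELIMITED properties listed above.
struct Delimiters {
    char field = ',';
    char collectionItem = '|';
    char mapKey = ':';
};

// Split text on a single-character delimiter. Empty text is treated as
// an empty collection (an assumption of this sketch).
std::vector<std::string> split(const std::string& text, char delim) {
    std::vector<std::string> parts;
    if (text.empty()) return parts;
    std::string::size_type start = 0;
    while (true) {
        std::string::size_type end = text.find(delim, start);
        if (end == std::string::npos) {
            parts.push_back(text.substr(start));
            break;
        }
        parts.push_back(text.substr(start, end - start));
        start = end + 1;
    }
    return parts;
}

// Parse an ARRAY<STRING> column such as "a|b|c".
std::vector<std::string> parseArray(const std::string& column, const Delimiters& d) {
    return split(column, d.collectionItem);
}

// Parse a MAP<STRING,STRING> column such as "k1:v1|k2:v2".
// An entry without a key delimiter is stored with an empty value.
std::map<std::string, std::string> parseMap(const std::string& column, const Delimiters& d) {
    std::map<std::string, std::string> result;
    for (const std::string& entry : split(column, d.collectionItem)) {
        std::string::size_type pos = entry.find(d.mapKey);
        if (pos == std::string::npos) {
            result[entry] = "";
        } else {
            result[entry.substr(0, pos)] = entry.substr(pos + 1);
        }
    }
    return result;
}
```

Splitting the row with split(line, d.field) and then passing the second and third columns to parseArray and parseMap recovers the three values. Every value is re-parsed from text on each row, which is the kind of per-row work a binary format avoids.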
24. Core Engine Improvements
SerDe support
Development:
• Text-based
DelimitedJSONSerDe.scala
SimpleSerDe.scala
Production:
• Binary
?
27. Core Engine Improvements
Production SerDe Requirements
• Binary format
Text-based encoding is slow and less compact
• Zero-copy
Access to serialized data without parsing or unpacking
(see "Improving Facebook's performance on Android with FlatBuffers")
• Word-aligned data
Allows for SIMD optimizations
28. Binary SerDe Considerations
• LazyBinarySerDe (Apache Hive)
Neither zero-copy nor word-aligned; requires converters in Spark
• Protocol Buffers / Thrift
Not zero-copy; more suited for RPC
• FlatBuffers / Cap'n Proto
Require converters (to/from InternalRow) in Spark core
• Apache Arrow
A great future option
29. Binary SerDe Considerations
Chosen format: UnsafeRow
• Binary and word-aligned
• Zero-copy
• Already part of Spark core
• Available converters to/from InternalRow
31. UnsafeRow SerDe
SPARK-15962: Introduced UnsafeArrayData and UnsafeMapData
apache/spark/sql/catalyst/expressions/UnsafeArrayData.java
apache/spark/sql/catalyst/expressions/UnsafeMapData.java
32. UnsafeRow SerDe
UnsafeRow SerDe C++ library
SQL datatypes        C++ datatypes
INT                  int32_t
BIGINT               int64_t
BOOLEAN              bool
FLOAT                float
DOUBLE               double
STRING               unsaferow::String
ARRAY<INT>           unsaferow::List<int32_t>
MAP<INT,STRING>      unsaferow::Map<int32_t, unsaferow::String>
33. UnsafeRow SerDe
UnsafeRow SerDe C++ library
SQL Query:
SELECT
TRANSFORM (id INT)
ROW FORMAT SERDE 'UnsafeRowSerDe'
USING 'script'
AS (value BIGINT)
ROW FORMAT SERDE 'UnsafeRowSerDe'
FROM src_tbl;
C++ Transformer:
#include "spark/Transformer.h"
while (transformer.readRow(input)) {
  // data processing
  int32_t id = input->getID();
  // widen before multiplying so the BIGINT output cannot overflow
  output->setValue(static_cast<int64_t>(id) * id);
  // write output
  transformer.writeRow(output);
}
34. Core Engine Improvements
SerDe support summary
Development:
• Text-based
DelimitedJSONSerDe.scala
SimpleSerDe.scala
Production:
• Binary
UnsafeRowSerDe.scala
35. Core Engine Improvements
Aggregation and projection support (SQL)
SELECT
TRANSFORM (id, AVG(value) AS value_avg)
USING 'script'
AS (output)
FROM src_tbl
GROUP BY id;
41. Efficiency Analysis
SerDe overhead
• Text-based SerDe overhead is non-negligible, especially for complex types
• SerDe cost can be up to 70% of a pipeline's CPU resources
Solution: use an efficient binary SerDe
45. UnsafeRow SerDe Benchmark
Text-based SerDe vs. binary SerDe (UnsafeRow)
Transform pipelines end-to-end CPU savings: up to 4x
Complex types SerDe impacted the most
47. Transforms Execution Model
Resource Request
CPU cores per container: spark.executor.cores = 4
Memory per container: spark.executor.memory = 4GB + spark.transform.memory = 4GB
Diagram:
Spark Driver -> Cluster Manager: Resource Request (CPU cores = 4, Memory = 8GB)
Cluster Manager -> Node Managers: Launch Spark Executor
Each Executor runs Task 1 and Task 2, with external Process 1 and Process 2
50. Transforms Execution Model
Resource Control
• JVM memory limits: -Xms, -Xmx, and -Xss
• CPU threads: spark.executor.cores, spark.task.cpus
These limits are irrelevant when running an external process!
Solution: cgroup v2 containers
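To show what enforcing limits through cgroup v2 involves at the file level, here is a minimal sketch that writes the standard cgroup v2 interface files cpu.max, memory.max, and cgroup.procs. How the limits are actually applied in production is not described in these slides, so the function names, the cgroup directory, and the choice to derive the CPU quota from a core count are all assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>

// cpu.max expects "<quota> <period>" in microseconds. A quota of
// cores * period lets the group use up to `cores` full CPUs.
// 100000 us is the kernel's default period.
std::string formatCpuMax(unsigned cores, uint64_t periodUs = 100000) {
    assert(cores > 0);
    return std::to_string(cores * periodUs) + " " + std::to_string(periodUs);
}

// Write one value to a cgroup interface file. Returns false if the file
// cannot be opened or the write fails.
bool writeControlFile(const std::string& cgroupDir,
                      const std::string& name,
                      const std::string& value) {
    std::ofstream file(cgroupDir + "/" + name);
    if (!file) return false;
    file << value << '\n';
    file.flush();
    return static_cast<bool>(file);
}

// Apply CPU and memory limits to an existing cgroup, then move the
// transform process into it. `cgroupDir` is a hypothetical path such as
// /sys/fs/cgroup/spark/transform-1 that must already exist.
bool limitTransformProcess(const std::string& cgroupDir,
                           unsigned cores,
                           uint64_t memoryBytes,
                           long pid) {
    return writeControlFile(cgroupDir, "cpu.max", formatCpuMax(cores))
        && writeControlFile(cgroupDir, "memory.max", std::to_string(memoryBytes))
        && writeControlFile(cgroupDir, "cgroup.procs", std::to_string(pid));
}
```

With the numbers from the resource request example, a call such as limitTransformProcess(dir, 4, 4ULL << 30, pid) would cap the external process at 4 cores and 4 GiB, independent of the JVM flags. The cpu and memory controllers must be enabled in the parent's cgroup.subtree_control for these files to exist.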