Over the years, Facebook has used Hive as the primary query engine for our data engineers. Hive uses a SQL-like query language called HQL, and its list of built-in functions did not always satisfy our customers' requirements; as a result, an extensive set of custom User Defined Functions (UDFs) was developed over time. As we started migrating pipelines from Hive to Spark SQL, a number of custom UDFs turned out to be incompatible with Spark, and many others performed poorly. In this talk we will first take a deep dive into how Hive UDFs work with Spark. We will then share the challenges we overcame on the way to supporting 99.99% of the custom UDFs in Spark.
Speakers: Sergey Makagonov, Xin Yao
4. • UDFs are used to add custom code logic when built-in functions cannot achieve the desired result
User Defined Functions
SELECT
substr(description, 1, 100) AS first_100,
count(*) AS cnt
FROM tmp_table
GROUP BY 1;
5. • Regular user-defined functions (UDFs): work on a single row in a table and produce a single output value for one or more inputs (see the sketch after this slide)
• User-defined table functions (UDTFs): can return multiple rows of output for every row in a table
• User-defined aggregate functions (UDAFs): work on one or more rows in a table and produce a single output
Types of Hive functions
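To make the first category concrete, below is a minimal, hypothetical regular UDF written against Hive's classic UDF interface (Hive also offers the richer GenericUDF interface); the class name and logic are illustrative only.
import org.apache.hadoop.hive.ql.exec.UDF

// Hypothetical regular UDF: upper-cases the first `length` characters of a string.
// Hive calls evaluate once per input row and resolves the method by reflection.
class PrefixUpperUdf extends UDF {
  def evaluate(input: String, length: Int): String = {
    if (input == null) null
    else input.take(length).toUpperCase + input.drop(length)
  }
}
Once registered with CREATE TEMPORARY FUNCTION, it can be called from HQL just like the built-in substr in the query above.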
7. Types of Hive functions. UDTFs
SELECT id, idx
FROM dim_one_row
LATERAL VIEW
STACK(3, 1, 2, 3) tmp AS idx;
Input (dim_one_row):
id
123
Output:
id   idx
123  1
123  2
123  3
8. Types of Hive functions. UDAFs
SELECT
COLLECT_SET(id) AS all_ids
FROM dim_three_rows;
Input (dim_three_rows):
id
123
124
125
Output:
[123, 124, 125]
9. How Hive UDFs work in Spark
• Most Hive data types (Java types and derivatives of the ObjectInspector class) can be converted to Spark's data types, and vice versa
• Instances of Hive's GenericUDF, SimpleGenericUDAF and GenericUDTF are called via wrapper classes extending Spark's Expression, ImperativeAggregate and Generator classes respectively
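At the usage level (not the wrapper internals described above), a Hive UDF can be registered and invoked from a Hive-enabled SparkSession; the class and function names below are hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-udf-example")
  .enableHiveSupport()   // required so Spark can resolve and wrap Hive UDF classes
  .getOrCreate()

// Spark wraps the Hive class in its own expression when the function is called
spark.sql("CREATE TEMPORARY FUNCTION my_substr AS 'com.example.udf.MySubstrUDF'")
spark.sql("SELECT my_substr(description, 1, 100) AS first_100 FROM tmp_table").show()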
12. • Hive was the primary query engine until we started migrating jobs to Spark and Presto
• Over the course of several years, over a thousand custom User Defined Functions were built
• Hive queries that used UDFs accounted for over 70% of CPU time
• Supporting Hive UDFs in Spark is therefore important for the migration
UDFs at Facebook
13. • At the beginning of the Hive-to-Spark migration, the level of support for UDFs was unclear
Identifying Baseline
14. • Most UDFs were already covered with basic tests from the Hive days
• We also had a testing framework built for running those tests in Hive
UDFs testing framework
15. • The framework was extended further to allow running queries against Spark
• A temporary Scala file is created for each UDF class, containing code to run SQL queries using the DataFrame API
• A spark-shell subprocess is spawned to run the Scala file:
spark-shell --conf spark.ui.enabled=false … -i /tmp/spark-hive-udf-1139336654093084343.scala
• Output is parsed and compared to the expected result
UDFs testing framework
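A hedged sketch of what such a generated file might look like; the UDF class, function name, query, and output prefix are all hypothetical, not the framework's actual format.
// Hypothetical contents of /tmp/spark-hive-udf-<id>.scala, run via `spark-shell -i`.
// In spark-shell the `spark` SparkSession is already in scope.
spark.sql("CREATE TEMPORARY FUNCTION udf_under_test AS 'com.example.udf.UdfUnderTest'")
val rows = spark.sql("SELECT udf_under_test('some input') AS result").collect()
rows.foreach(r => println(s"TEST_OUTPUT\t${r.mkString("\t")}"))
System.exit(0)  // ensure the spark-shell subprocess terminates after the test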
16. • With test coverage in place, the baseline level of UDF support, by query count and CPU days, was identified: 58%
• Failed tests helped to identify the common issues
UDFs testing framework
18. • getRequiredJars and getRequiredFiles - functions to automatically include
additional resources required by this UDF.
• initialize(StructObjectInspector) in GenericUDTF - Spark SQL uses a
deprecated interface initialize(ObjectInspector[]) only.
• configure (GenericUDF, GenericUDTF, and GenericUDAFEvaluator) - a function to initialize
functions with MapredContext, which is inapplicable to Spark.
• close (GenericUDF and GenericUDAFEvaluator) is a function to release
associated resources. Spark SQL does not call this function when tasks finish.
• reset (GenericUDAFEvaluator) - a function to re-initialize aggregation for reusing the
same aggregation. Spark SQL currently does not support the reuse of aggregation.
• getWindowingEvaluator (GenericUDAFEvaluator) - a function to optimize
aggregation by evaluating an aggregate over a fixed window.
Unsupported APIs
20. getRequiredFiles and getRequiredJars
• functions to automatically include additional resources required by the UDF
• UDF code can assume that the file is present in the executor's working directory
21. Supporting required files/jars (SPARK-27543)
Driver - during initialization, for each UDF:
- Identify required files and jars
- Register files for distribution:
  SparkContext.addFile(…)
  SparkContext.addJar(…)
Executor - fetches files added to the SparkContext from the Driver; for each UDF:
- If the required file is in the working dir - do nothing (it was distributed)
- If the file is missing - try to create a symlink to its absolute path
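A minimal sketch of the driver-side step, assuming a handle to the UDF's GenericUDF instance; getRequiredFiles/getRequiredJars are the Hive APIs named above, while the helper itself is illustrative rather than the actual SPARK-27543 change.
import org.apache.spark.SparkContext
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF

// Register a UDF's required resources with the SparkContext so executors can fetch them.
def registerUdfResources(sc: SparkContext, udf: GenericUDF): Unit = {
  // Both methods may return null when the UDF needs no extra resources
  Option(udf.getRequiredFiles).getOrElse(Array.empty[String]).foreach(f => sc.addFile(f))
  Option(udf.getRequiredJars).getOrElse(Array.empty[String]).foreach(j => sc.addJar(j))
}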
22. • The majority of Hive UDFs are written without concurrency in mind
• Hive runs each task in a separate JVM process
• Spark runs a separate JVM process per Executor, and an Executor can run multiple tasks concurrently
UDFs and Thread Safety
(Diagram: one Executor JVM running Task 1 with UDF instance 1 and Task 2 with UDF instance 2 concurrently)
23. Thread-unsafe UDF Example
• Suppose we have 2 tasks and hence 2 instances of the UDF: "instance 1" and "instance 2"
• The evaluate method is called for each row, and both instances could pass the null check inside evaluate at the same time and start initializing a shared static mapping
• Once "instance 1" finishes initialization first, it calls evaluate for the next row
• If "instance 2" is still in the middle of initializing the mapping, it could overwrite the data that "instance 1" relies on, which could lead to data corruption or an exception (see the sketch after this slide)
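A minimal, hypothetical sketch of this pattern as a Scala Hive UDF; the companion-object field plays the role of a Java static field, and the mapping load is a stand-in for the real initialization.
import org.apache.hadoop.hive.ql.exec.UDF

object LookupUdf {
  // Shared by every task in the executor JVM, like a static field in a Java UDF
  var mapping: java.util.Map[String, String] = null
}

class LookupUdf extends UDF {
  def evaluate(key: String): String = {
    if (LookupUdf.mapping == null) {      // two tasks can both pass this check
      val m = new java.util.HashMap[String, String]()
      m.put("a", "1")                     // stand-in for an expensive load
      m.put("b", "2")
      LookupUdf.mapping = m               // the slower writer overwrites the other's map
    }
    LookupUdf.mapping.get(key)            // may observe a map that is being replaced
  }
}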
24. Approach 1: Introduce Synchronization
• Introduce locking (synchronization) on the UDF class when initializing the mapping (see the sketch after this slide)
Cons:
• Synchronization is computationally expensive
• Requires manual and accurate refactoring of code,
which does not scale for hundreds of UDFs
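Applied to the hypothetical LookupUdf above, the synchronized variant could look roughly like this; note that every row now pays for acquiring the lock.
// Illustrative Approach 1: lock on the shared holder while initializing the mapping.
class SynchronizedLookupUdf extends org.apache.hadoop.hive.ql.exec.UDF {
  def evaluate(key: String): String = {
    LookupUdf.synchronized {              // serialize the check-and-initialize step
      if (LookupUdf.mapping == null) {
        val m = new java.util.HashMap[String, String]()
        m.put("a", "1"); m.put("b", "2")  // stand-in for the expensive load
        LookupUdf.mapping = m
      }
    }
    LookupUdf.mapping.get(key)
  }
}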
25. Approach 2: Make the Field Non-static
• Turn the static variable into an instance variable (see the sketch after this slide)
Cons:
• Adds more pressure on memory (instances cannot share complex data)
Pros:
• Minimal changes in the code, which can also be codemodded for all other UDFs that use static fields of non-primitive types
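The same hypothetical UDF with the mapping turned into an instance variable; each task's UDF instance now builds and keeps its own copy.
// Illustrative Approach 2: no shared mutable state, at the cost of one map per instance.
class InstanceLookupUdf extends org.apache.hadoop.hive.ql.exec.UDF {
  private var mapping: java.util.Map[String, String] = null

  def evaluate(key: String): String = {
    if (mapping == null) {                // only this instance (and its task) sees the field
      val m = new java.util.HashMap[String, String]()
      m.put("a", "1"); m.put("b", "2")    // stand-in for the expensive load
      mapping = m
    }
    mapping.get(key)
  }
}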
26. • In Spark, UDF objects are initialized on the Driver, serialized, and later deserialized on the executors
• Some classes cannot be deserialized out of the box
• Example: Guava's ImmutableSet. Kryo can successfully serialize the objects on the driver, but fails to deserialize them on the executors
Kryo serialization/deserialization
27. • Catch serde issues by running the Hive UDF tests in cluster mode
• For commonly used classes, write custom serializers or import existing ones
• Mark problematic instance variables as transient (see the sketch after this slide)
Solving the Kryo serde problem
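A hedged sketch of the transient-field mitigation for a hypothetical UDF that holds a Guava ImmutableSet; the field is kept out of the serialized UDF and rebuilt lazily on each executor.
import org.apache.hadoop.hive.ql.exec.UDF
import com.google.common.collect.ImmutableSet

class AllowlistUdf extends UDF {
  // Not serialized with the UDF; rebuilt on first use on each executor
  @transient private var allowlist: ImmutableSet[String] = null

  def evaluate(value: String): Boolean = {
    if (allowlist == null) {
      allowlist = ImmutableSet.of("a", "b", "c")  // stand-in for loading from a resource
    }
    allowlist.contains(value)
  }
}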
28. • Hive UDFs don't support Spark's data types out of the box
• Similarly, Spark cannot work with Hive's object inspectors
• For each UDF call, Spark's data types are wrapped into Hive's object inspectors and Java types
• The same applies to results: Java types are converted back into Spark's data types
Hive UDFs performance
29. • This wrapping/unwrapping overhead can make a UDF take up to 2x the CPU time of a Spark-native implementation
• UDFs that work with complex types suffer the most
Hive UDFs performance
30. • UDFs account for 15% of the CPU time spent on Spark queries
• The most computationally expensive UDFs can be converted to Spark-native UDFs (see the sketch after this slide)
Hive UDFs performance
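A hedged sketch of the simplest form of a "Spark-native" replacement, re-implementing a trivial Hive UDF with Spark's own UDF registration so values stay in Spark's types with no ObjectInspector wrapping per call; the function name is hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("native-udf").getOrCreate()

// Registered directly with Spark: no Hive object inspectors involved at call time
spark.udf.register("first_n", (s: String, n: Int) => if (s == null) null else s.take(n))
spark.sql("SELECT first_n(description, 100) AS first_100 FROM tmp_table").show()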
33. Aggregation
Mapper 1 input:
id  value
1   100
1   200
2   400
3   100
Mapper 2 input:
id  value
1   300
2   200
2   300
Shuffle, then aggregation on the reducers:
Reducer 1 receives:
id  value
1   100
1   200
1   300
3   100
Reducer 2 receives:
id  value
2   400
2   200
2   300
Reducer 1 output:
id  max(value)
1   300
3   100
Reducer 2 output:
id  max(value)
2   400
34. 1. Every row needs to be shuffled over the network, which is a heavy operation.
2. Data skew: one reducer needs to process more data than the others if one key has many more rows.
   For example: key1 has 1 million rows, while other keys each have 10 rows on average.
What's the problem
35. Partial aggregation is a technique in which the system partially aggregates data on the mapper side before the shuffle, in order to reduce the shuffle size.
Partial Aggregation
37. Aggregation (the same shuffle-heavy plan as slide 33, repeated for comparison with partial aggregation below)
38. Partial Aggregation
Mapper 1 input:
id  value
1   100
1   200
2   400
3   100
Mapper 1 partial aggregation output:
id  partial_max(value)
1   200
2   400
3   100
Mapper 2 input:
id  value
1   300
2   200
2   300
Mapper 2 partial aggregation output:
id  partial_max(value)
1   300
2   300
Shuffle, then final aggregation on the reducers:
Reducer 1 receives:
id  partial_max(value)
1   200
1   300
3   100
Reducer 2 receives:
id  partial_max(value)
2   400
2   300
Reducer 1 output:
id  max(value)
1   300
3   100
Reducer 2 output:
id  max(value)
2   400
39. Aggregation vs Partial Aggregation
• Shuffle data (# rows): Aggregation shuffles all rows; Partial Aggregation shuffles a reduced number of rows
• Computation: with Aggregation, all aggregation happens on the reducer side; with Partial Aggregation, extra CPU is spent on the partial step, distributed across mappers and reducers
40. • Why partial aggregation is important:
• It impacts CPU and shuffle size
• It can help with data skew
Partial Aggregation is important
41. 1. Partial aggregation support is already in Spark
2. Fixed some issues to make it work with FB UDAFs
What we did
43. 1. Partial aggregation improved CPU by 20% and shuffle data size by 17%
2. However, we also observed some heavy pipelines regress by as much as 300%
FB Production Result
45. • Column Expansion
• Partial aggregation expands the number of columns on the mapper side, resulting in larger shuffle data: in the query below, the single value column turns into a separate partial-aggregate column for each aggregate function
SELECT
key, max(value), min(value), count(value), avg(value)
FROM table
GROUP BY key
When partial aggregation doesn’t work
47. • Query Shape
  • Column Expansion
• Data Distribution
  • No rows to aggregate on the mapper side
SELECT
key, max(value)
FROM table
GROUP BY key
When partial aggregation doesn’t work
48. Data Distribution
Mapper input (every id is unique):
id  value
1   100
2   200
3   400
4   100
Aggregation: the mapper shuffles all 4 rows as-is.
Partial Aggregation: the mapper first produces
id  partial_max(value)
1   100
2   200
3   400
4   100
and then still shuffles 4 rows: extra CPU with NO row reduction.
50. 1. The partial-aggregation performance of each UDAF function
2. Column Expansion
3. Row Reduction
Partial Aggregation Computation Cost Factors
51. • A computation cost-based optimizer for partial aggregation:
1. Use multiple features to calculate the computation cost of partial aggregation:
   1. input column number
   2. output column number
   3. computation cost of the UDAF's partial aggregation function
   4. …
2. Use the calculated computation cost to decide the configuration for partial aggregation (a sketch follows this slide)
How we solved the problem
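A purely illustrative sketch of such a cost heuristic; the features mirror the list above, but the names, weights and threshold are hypothetical and not Facebook's actual optimizer.
// Hypothetical cost model deciding whether to enable partial aggregation for a query.
case class PartialAggFeatures(
    inputColumns: Int,
    outputColumns: Int,
    udafPartialCostPerRow: Double)  // relative cost of the UDAF's partial-update path

def shouldEnablePartialAgg(f: PartialAggFeatures): Boolean = {
  // Column expansion: how much wider each shuffled row becomes after the partial step
  val columnExpansion = f.outputColumns.toDouble / math.max(f.inputColumns, 1)
  // Combine features into a single relative cost (weights made up for the sketch)
  val estimatedCost = 0.5 * columnExpansion + 0.5 * f.udafPartialCostPerRow
  estimatedCost < 1.5                 // enable partial aggregation only when cost is low
}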
52. 1. It improves efficiency across the board
2. However, there are still queries that do not get the most optimized partial aggregation configuration
Result
54. • It's hard to know the row reduction in advance
• It depends on the data distribution, which can differ from day to day
• The row reduction is also different for different GROUP BY keys
Row Reduction
55. • History-based tuning
• Use historical data from a query's previous runs to predict the best configuration for future runs
• A good fit for partial aggregation because it operates at the query level: we can try different configs and use the results to direct the config of future runs
Future work