Over the years, Facebook has used Hive as the primary query engine for our data engineers. Hive uses a SQL-like query language called HQL, and its list of built-in functions did not always satisfy our customers' requirements; as a result, an extensive set of custom User Defined Functions (UDFs) was developed over time. As we started migrating pipelines from Hive to Spark SQL, a number of custom UDFs turned out to be incompatible with Spark, and many others performed poorly. In this talk we will first take a deep dive into how Hive UDFs work with Spark. We will then share the challenges we overcame on the way to supporting 99.99% of the custom UDFs in Spark.
Speakers: Sergey Makagonov, Xin Yao
4. • UDFs are used to add custom code logic when built-in functions cannot achieve the desired result
User Defined Functions
SELECT
substr(description, 1, 100) AS first_100,
count(*) AS cnt
FROM tmp_table
GROUP BY 1;
5. • Regular user-defined functions (UDFs): work on a single row in a table and produce a single output value for one or more inputs (see the sketch after this slide)
• User-defined table functions (UDTFs): can return multiple rows of output for every row in a table
• User-defined aggregate functions (UDAFs): work on one or more rows in a table and produce a single output
Types of Hive functions
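To make the first category concrete, below is a minimal, hypothetical regular UDF written against Hive's classic UDF interface (Hive also offers the richer GenericUDF interface); the class name and logic are illustrative only.
import org.apache.hadoop.hive.ql.exec.UDF

// Hypothetical regular UDF: upper-cases the first `length` characters of a string.
// Hive calls evaluate once per input row and resolves the method by reflection.
class PrefixUpperUdf extends UDF {
  def evaluate(input: String, length: Int): String = {
    if (input == null) null
    else input.take(length).toUpperCase + input.drop(length)
  }
}
Once registered with CREATE TEMPORARY FUNCTION, it can be called from HQL just like the built-in substr in the query above.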
7. Types of Hive functions. UDTFs
SELECT id, idx
FROM dim_one_row
LATERAL VIEW
STACK(3, 1, 2, 3) tmp AS idx;
Input (dim_one_row):
id
123
Output:
id   idx
123  1
123  2
123  3
8. Types of Hive functions. UDAFs
SELECT
COLLECT_SET(id) AS all_ids
FROM dim_three_rows;
Input (dim_three_rows):
id
123
124
125
Output:
[123, 124, 125]
9. How Hive UDFs work in Spark
• Most Hive data types (Java types and derivatives of the ObjectInspector class) can be converted to Spark's data types, and vice versa
• Instances of Hive's GenericUDF, SimpleGenericUDAF and GenericUDTF are called via wrapper classes extending Spark's Expression, ImperativeAggregate and Generator classes respectively
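At the usage level (not the wrapper internals described above), a Hive UDF can be registered and invoked from a Hive-enabled SparkSession; the class and function names below are hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-udf-example")
  .enableHiveSupport()   // required so Spark can resolve and wrap Hive UDF classes
  .getOrCreate()

// Spark wraps the Hive class in its own expression when the function is called
spark.sql("CREATE TEMPORARY FUNCTION my_substr AS 'com.example.udf.MySubstrUDF'")
spark.sql("SELECT my_substr(description, 1, 100) AS first_100 FROM tmp_table").show()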
12. • Hive was the primary query engine until we started migrating jobs to Spark and Presto
• Over the course of several years, over a thousand custom User Defined Functions were built
• Hive queries that used UDFs accounted for over 70% of CPU time
• Supporting Hive UDFs in Spark is therefore important for the migration
UDFs at Facebook
13. • At the beginning of the Hive-to-Spark migration, the level of support for UDFs was unclear
Identifying Baseline
14. • Most UDFs were already covered with basic tests from the Hive days
• We also had a testing framework built for running those tests in Hive
UDFs testing framework
15. • The framework was extended further to allow running queries against Spark
• A temporary Scala file is created for each UDF class, containing code to run SQL queries using the DataFrame API
• A spark-shell subprocess is spawned to run the Scala file:
spark-shell --conf spark.ui.enabled=false … -i /tmp/spark-hive-udf-1139336654093084343.scala
• Output is parsed and compared to the expected result
UDFs testing framework
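A hedged sketch of what such a generated file might look like; the UDF class, function name, query, and output prefix are all hypothetical, not the framework's actual format.
// Hypothetical contents of /tmp/spark-hive-udf-<id>.scala, run via `spark-shell -i`.
// In spark-shell the `spark` SparkSession is already in scope.
spark.sql("CREATE TEMPORARY FUNCTION udf_under_test AS 'com.example.udf.UdfUnderTest'")
val rows = spark.sql("SELECT udf_under_test('some input') AS result").collect()
rows.foreach(r => println(s"TEST_OUTPUT\t${r.mkString("\t")}"))
System.exit(0)  // ensure the spark-shell subprocess terminates after the test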
16. • With test coverage in place, the baseline level of UDF support, by query count and CPU days, was identified: 58%
• Failed tests helped to identify the common issues
UDFs testing framework
18. • getRequiredJars and getRequiredFiles - functions to automatically include
additional resources required by this UDF.
• initialize(StructObjectInspector) in GenericUDTF - Spark SQL uses a
deprecated interface initialize(ObjectInspector[]) only.
• configure (GenericUDF, GenericUDTF, and GenericUDAFEvaluator) - a function to initialize
functions with MapredContext, which is inapplicable to Spark.
• close (GenericUDF and GenericUDAFEvaluator) is a function to release
associated resources. Spark SQL does not call this function when tasks finish.
• reset (GenericUDAFEvaluator) - a function to re-initialize aggregation for reusing the
same aggregation. Spark SQL currently does not support the reuse of aggregation.
• getWindowingEvaluator (GenericUDAFEvaluator) - a function to optimize
aggregation by evaluating an aggregate over a fixed window.
Unsupported APIs
20. getRequiredFiles and getRequiredJars
• functions to automatically include additional resources required by the UDF
• UDF code can assume that the file is present in the executor's working directory
21. Supporting required files/jars (SPARK-27543)
Driver - during initialization, for each UDF:
- Identify required files and jars
- Register files for distribution:
  SparkContext.addFile(…)
  SparkContext.addJar(…)
Executor - fetches files added to the SparkContext from the Driver; for each UDF:
- If the required file is in the working dir - do nothing (it was distributed)
- If the file is missing - try to create a symlink to its absolute path
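A minimal sketch of the driver-side step, assuming a handle to the UDF's GenericUDF instance; getRequiredFiles/getRequiredJars are the Hive APIs named above, while the helper itself is illustrative rather than the actual SPARK-27543 change.
import org.apache.spark.SparkContext
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF

// Register a UDF's required resources with the SparkContext so executors can fetch them.
def registerUdfResources(sc: SparkContext, udf: GenericUDF): Unit = {
  // Both methods may return null when the UDF needs no extra resources
  Option(udf.getRequiredFiles).getOrElse(Array.empty[String]).foreach(f => sc.addFile(f))
  Option(udf.getRequiredJars).getOrElse(Array.empty[String]).foreach(j => sc.addJar(j))
}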
22. • The majority of Hive UDFs are written without concurrency in mind
• Hive runs each task in a separate JVM process
• Spark runs a separate JVM process per Executor, and an Executor can run multiple tasks concurrently
UDFs and Thread Safety
(Diagram: one Executor JVM running Task 1 with UDF instance 1 and Task 2 with UDF instance 2 concurrently)
23. Thread-unsafe UDF Example
• Suppose we have 2 tasks and hence 2 instances of the UDF: "instance 1" and "instance 2"
• The evaluate method is called for each row, and both instances could pass the null check inside evaluate at the same time and start initializing a shared static mapping
• Once "instance 1" finishes initialization first, it calls evaluate for the next row
• If "instance 2" is still in the middle of initializing the mapping, it could overwrite the data that "instance 1" relies on, which could lead to data corruption or an exception (see the sketch after this slide)
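A minimal, hypothetical sketch of this pattern as a Scala Hive UDF; the companion-object field plays the role of a Java static field, and the mapping load is a stand-in for the real initialization.
import org.apache.hadoop.hive.ql.exec.UDF

object LookupUdf {
  // Shared by every task in the executor JVM, like a static field in a Java UDF
  var mapping: java.util.Map[String, String] = null
}

class LookupUdf extends UDF {
  def evaluate(key: String): String = {
    if (LookupUdf.mapping == null) {      // two tasks can both pass this check
      val m = new java.util.HashMap[String, String]()
      m.put("a", "1")                     // stand-in for an expensive load
      m.put("b", "2")
      LookupUdf.mapping = m               // the slower writer overwrites the other's map
    }
    LookupUdf.mapping.get(key)            // may observe a map that is being replaced
  }
}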
24. Approach 1: Introduce Synchronization
• Introduce locking (synchronization) on the UDF class when initializing the mapping (see the sketch after this slide)
Cons:
• Synchronization is computationally expensive
• Requires manual and accurate refactoring of code,
which does not scale for hundreds of UDFs
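Applied to the hypothetical LookupUdf above, the synchronized variant could look roughly like this; note that every row now pays for acquiring the lock.
// Illustrative Approach 1: lock on the shared holder while initializing the mapping.
class SynchronizedLookupUdf extends org.apache.hadoop.hive.ql.exec.UDF {
  def evaluate(key: String): String = {
    LookupUdf.synchronized {              // serialize the check-and-initialize step
      if (LookupUdf.mapping == null) {
        val m = new java.util.HashMap[String, String]()
        m.put("a", "1"); m.put("b", "2")  // stand-in for the expensive load
        LookupUdf.mapping = m
      }
    }
    LookupUdf.mapping.get(key)
  }
}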
25. Approach 2: Make the Field Non-static
• Turn the static variable into an instance variable (see the sketch after this slide)
Cons:
• Adds more pressure on memory (instances cannot share complex data)
Pros:
• Minimal changes in the code, which can also be codemodded for all other UDFs that use static fields of non-primitive types
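The same hypothetical UDF with the mapping turned into an instance variable; each task's UDF instance now builds and keeps its own copy.
// Illustrative Approach 2: no shared mutable state, at the cost of one map per instance.
class InstanceLookupUdf extends org.apache.hadoop.hive.ql.exec.UDF {
  private var mapping: java.util.Map[String, String] = null

  def evaluate(key: String): String = {
    if (mapping == null) {                // only this instance (and its task) sees the field
      val m = new java.util.HashMap[String, String]()
      m.put("a", "1"); m.put("b", "2")    // stand-in for the expensive load
      mapping = m
    }
    mapping.get(key)
  }
}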
26. • In Spark, UDF objects are initialized on the Driver, serialized, and later deserialized on the executors
• Some classes cannot be deserialized out of the box
• Example: Guava's ImmutableSet. Kryo can successfully serialize the objects on the driver, but fails to deserialize them on the executors
Kryo serialization/deserialization
27. • Catch serde issues by running the Hive UDF tests in cluster mode
• For commonly used classes, write custom serializers or import existing ones
• Mark problematic instance variables as transient (see the sketch after this slide)
Solving the Kryo serde problem
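A hedged sketch of the transient-field mitigation for a hypothetical UDF that holds a Guava ImmutableSet; the field is kept out of the serialized UDF and rebuilt lazily on each executor.
import org.apache.hadoop.hive.ql.exec.UDF
import com.google.common.collect.ImmutableSet

class AllowlistUdf extends UDF {
  // Not serialized with the UDF; rebuilt on first use on each executor
  @transient private var allowlist: ImmutableSet[String] = null

  def evaluate(value: String): Boolean = {
    if (allowlist == null) {
      allowlist = ImmutableSet.of("a", "b", "c")  // stand-in for loading from a resource
    }
    allowlist.contains(value)
  }
}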
28. • Hive UDFs don't support Spark's data types out of the box
• Similarly, Spark cannot work with Hive's object inspectors
• For each UDF call, Spark's data types are wrapped into Hive's object inspectors and Java types
• The same applies to results: Java types are converted back into Spark's data types
Hive UDFs performance
29. • This wrapping/unwrapping overhead can make a UDF take up to 2x the CPU time of a Spark-native implementation
• UDFs that work with complex types suffer the most
Hive UDFs performance
30. • UDFs account for 15% of the CPU time spent on Spark queries
• The most computationally expensive UDFs can be converted to Spark-native UDFs (see the sketch after this slide)
Hive UDFs performance
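A hedged sketch of the simplest form of a "Spark-native" replacement, re-implementing a trivial Hive UDF with Spark's own UDF registration so values stay in Spark's types with no ObjectInspector wrapping per call; the function name is hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("native-udf").getOrCreate()

// Registered directly with Spark: no Hive object inspectors involved at call time
spark.udf.register("first_n", (s: String, n: Int) => if (s == null) null else s.take(n))
spark.sql("SELECT first_n(description, 100) AS first_100 FROM tmp_table").show()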
33. Aggregation
Mapper 1 input:
id  value
1   100
1   200
2   400
3   100
Mapper 2 input:
id  value
1   300
2   200
2   300
Shuffle, then aggregation on the reducers:
Reducer 1 receives:
id  value
1   100
1   200
1   300
3   100
Reducer 2 receives:
id  value
2   400
2   200
2   300
Reducer 1 output:
id  max(value)
1   300
3   100
Reducer 2 output:
id  max(value)
2   400
34. 1. Every row needs to be shuffled over the network, which is a heavy operation.
2. Data skew: one reducer needs to process more data than the others if one key has many more rows.
   For example: key1 has 1 million rows, while other keys each have 10 rows on average.
What's the problem
35. Partial aggregation is a technique in which the system partially aggregates data on the mapper side before the shuffle, in order to reduce the shuffle size.
Partial Aggregation
37. Aggregation (the same shuffle-heavy plan as slide 33, repeated for comparison with partial aggregation below)
38. Partial Aggregation
Mapper 1 input:
id  value
1   100
1   200
2   400
3   100
Mapper 1 partial aggregation output:
id  partial_max(value)
1   200
2   400
3   100
Mapper 2 input:
id  value
1   300
2   200
2   300
Mapper 2 partial aggregation output:
id  partial_max(value)
1   300
2   300
Shuffle, then final aggregation on the reducers:
Reducer 1 receives:
id  partial_max(value)
1   200
1   300
3   100
Reducer 2 receives:
id  partial_max(value)
2   400
2   300
Reducer 1 output:
id  max(value)
1   300
3   100
Reducer 2 output:
id  max(value)
2   400
39. Aggregation vs Partial Aggregation
• Shuffle data (# rows): Aggregation shuffles all rows; Partial Aggregation shuffles a reduced number of rows
• Computation: with Aggregation, all aggregation happens on the reducer side; with Partial Aggregation, extra CPU is spent on the partial step, distributed across mappers and reducers
40. • Why partial aggregation is important:
• It impacts CPU and shuffle size
• It can help with data skew
Partial Aggregation is important
41. 1. Partial aggregation support is already in Spark
2. Fixed some issues to make it work with FB UDAFs
What we did
43. 1. Partial aggregation improved CPU by 20% and shuffle data size by 17%
2. However, we also observed some heavy pipelines regress by as much as 300%
FB Production Result
45. • Column Expansion
• Partial aggregation expands the number of columns on the mapper side, resulting in larger shuffle data: in the query below, the single value column turns into a separate partial-aggregate column for each aggregate function
SELECT
key, max(value), min(value), count(value), avg(value)
FROM table
GROUP BY key
When partial aggregation doesn’t work
47. • Query Shape
  • Column Expansion
• Data Distribution
  • No rows to aggregate on the mapper side
SELECT
key, max(value)
FROM table
GROUP BY key
When partial aggregation doesn’t work
48. Data Distribution
Mapper input (every id is unique):
id  value
1   100
2   200
3   400
4   100
Aggregation: the mapper shuffles all 4 rows as-is.
Partial Aggregation: the mapper first produces
id  partial_max(value)
1   100
2   200
3   400
4   100
and then still shuffles 4 rows: extra CPU with NO row reduction.
50. 1. The partial-aggregation performance of each UDAF function
2. Column Expansion
3. Row Reduction
Partial Aggregation Computation Cost Factors
51. • A computation cost-based optimizer for partial aggregation:
1. Use multiple features to calculate the computation cost of partial aggregation:
   1. input column number
   2. output column number
   3. computation cost of the UDAF's partial aggregation function
   4. …
2. Use the calculated computation cost to decide the configuration for partial aggregation (a sketch follows this slide)
How we solved the problem
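A purely illustrative sketch of such a cost heuristic; the features mirror the list above, but the names, weights and threshold are hypothetical and not Facebook's actual optimizer.
// Hypothetical cost model deciding whether to enable partial aggregation for a query.
case class PartialAggFeatures(
    inputColumns: Int,
    outputColumns: Int,
    udafPartialCostPerRow: Double)  // relative cost of the UDAF's partial-update path

def shouldEnablePartialAgg(f: PartialAggFeatures): Boolean = {
  // Column expansion: how much wider each shuffled row becomes after the partial step
  val columnExpansion = f.outputColumns.toDouble / math.max(f.inputColumns, 1)
  // Combine features into a single relative cost (weights made up for the sketch)
  val estimatedCost = 0.5 * columnExpansion + 0.5 * f.udafPartialCostPerRow
  estimatedCost < 1.5                 // enable partial aggregation only when cost is low
}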
52. 1. It improves efficiency across the board
2. However, there are still queries that do not get the most optimized partial aggregation configuration
Result
54. • It's hard to know the row reduction in advance
• It depends on the data distribution, which can differ from day to day
• The row reduction is also different for different GROUP BY keys
Row Reduction
55. • History-based tuning
• Use historical data from a query's previous runs to predict the best configuration for future runs
• A good fit for partial aggregation because it operates at the query level: we can try different configs and use the results to direct the config of future runs
Future work