Supporting Over a Thousand
Custom Hive User Defined
Functions
By Sergey Makagonov and Xin Yao
Facebook
• Introduction to User Defined Functions
• Hive UDFs at Facebook
• Major challenges and improvements
• Partial aggregations
Agenda
What Are “User Defined
Functions”?
• UDFs are used to add custom code logic when built-in
functions cannot achieve the desired result
User Defined Functions
SELECT
substr(description, 1, 100) AS first_100,
count(*) AS cnt
FROM tmp_table
GROUP BY 1;
• Regular user-defined functions (UDFs): work on a
single row in a table and for one or more inputs
produce a single output
• User-defined table functions (UDTFs): for every row
in a table can return multiple values as output
• Aggregate functions (UDAFs): work on one or more
rows in a table and produce a single output
Types of Hive functions
Types of Hive functions. Regular UDFs
SELECT FB_ARRAY_CONCAT(
arr1, arr2
) AS zipped
FROM dim_two_rows;
Input (arr1, arr2):
['a', 'b', 'c'] ['d', 'e', 'f']
['foo', 'bar'] ['baz', 'spam']
Output:
["a","b","c","d","e","f"]
["foo","bar","baz","spam"]
Types of Hive functions. UDTFs
SELECT id, idx
FROM dim_one_row
LATERAL VIEW
STACK(3, 1, 2, 3) tmp AS idx;
Input (id):
123
Output (id, idx):
123 1
123 2
123 3
Types of Hive functions. UDAFs
SELECT
COLLECT_SET(id) AS all_ids
FROM dim_three_rows;
Input (id):
123
124
125
Output:
[123, 124, 125]
How Hive UDFs work in Spark
• Most Hive data types (Java types and derivatives of the
ObjectInspector class) can be converted to
Spark's data types, and vice versa
• Instances of Hive’s GenericUDF,
SimpleGenericUDAF and GenericUDTF are
called via wrapper classes extending Spark’s
Expression, ImperativeAggregate and
Generator classes respectively
How Hive UDFs work in Spark
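For reference, a minimal Java sketch of what a Hive GenericUDF looks like (a hypothetical fb_str_len function, not one of the actual Facebook UDFs). This is the ObjectInspector-based API that Spark's wrapper Expression ends up calling:

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

public class StrLenUDF extends GenericUDF {
  private StringObjectInspector inputOI;

  @Override
  public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
    // Called once with inspectors describing the input types;
    // returns an inspector describing the output type.
    inputOI = (StringObjectInspector) args[0];
    return PrimitiveObjectInspectorFactory.javaIntObjectInspector;
  }

  @Override
  public Object evaluate(DeferredObject[] args) throws HiveException {
    // Called once per row with lazily evaluated arguments.
    String s = inputOI.getPrimitiveJavaObject(args[0].get());
    return s == null ? null : s.length();
  }

  @Override
  public String getDisplayString(String[] children) {
    return "fb_str_len(" + children[0] + ")";
  }
}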
UDFs at Facebook
• Hive was the primary query engine until we started to
migrate jobs to Spark and Presto
• Over the course of several years, over a thousand
custom User Defined Functions were built
• Hive queries that used UDFs accounted for over 70%
of CPU time
• Supporting Hive UDFs in Spark is important for
migration
UDFs at Facebook
• At the beginning of the Hive-to-Spark migration, the level
of support for UDFs was unclear
Identifying Baseline
• Most of the UDFs were already covered with basic tests during the Hive days
• We also had a testing framework built for running those tests in Hive
UDFs testing framework
• The framework was extended further to allow running queries against Spark
• A temporary Scala file is created for each UDF class, containing code to run SQL
queries using the DataFrame API
• A spark-shell subprocess is spawned to run the Scala file:
spark-shell --conf spark.ui.enabled=false … -i /tmp/spark-hive-udf-1139336654093084343.scala
• Output is parsed and compared to the expected result
UDFs testing framework
• With test coverage in place, baseline support of UDFs by query count and CPU days
was identified: 58%
• Failed tests helped to identify the common issues
UDFs testing framework
Major challenges
• getRequiredJars and getRequiredFiles - functions to automatically include
additional resources required by the UDF.
• initialize(StructObjectInspector) in GenericUDTF - Spark SQL uses only the
deprecated interface initialize(ObjectInspector[]).
• configure (GenericUDF, GenericUDTF, and GenericUDAFEvaluator) - a function to initialize
functions with MapredContext, which is inapplicable to Spark.
• close (GenericUDF and GenericUDAFEvaluator) - a function to release
associated resources. Spark SQL does not call this function when tasks finish.
• reset (GenericUDAFEvaluator) - a function to re-initialize aggregation for reusing the
same aggregation. Spark SQL currently does not support the reuse of aggregation.
• getWindowingEvaluator (GenericUDAFEvaluator) - a function to optimize
aggregation by evaluating an aggregate over a fixed window.
Unsupported APIs
getRequiredFiles and getRequiredJars
• functions to automatically include additional resources required by this UDF
• UDF code can assume that the file is present in the executor's working directory
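As an illustration, a hedged sketch of a hypothetical UDF that declares a required file and then reads it from the task's working directory (the class name and the file path are made up for the example):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

public class DictionaryLookupUDF extends GenericUDF {
  private StringObjectInspector wordOI;
  private Set<String> dictionary;

  @Override
  public String[] getRequiredFiles() {
    // Ask the engine to ship this resource to every task (hypothetical path).
    return new String[] {"/mnt/shared/udf_resources/dictionary.txt"};
  }

  @Override
  public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
    wordOI = (StringObjectInspector) args[0];
    return PrimitiveObjectInspectorFactory.javaBooleanObjectInspector;
  }

  @Override
  public Object evaluate(DeferredObject[] args) throws HiveException {
    if (dictionary == null) {
      try {
        // The UDF assumes the distributed file is present in the working directory.
        dictionary = new HashSet<>(Files.readAllLines(Paths.get("dictionary.txt")));
      } catch (IOException e) {
        throw new HiveException(e);
      }
    }
    String word = wordOI.getPrimitiveJavaObject(args[0].get());
    return word != null && dictionary.contains(word);
  }

  @Override
  public String getDisplayString(String[] children) {
    return "dictionary_lookup(" + children[0] + ")";
  }
}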
Supporting required files/jars (SPARK-27543)
Driver: during initialization, for each UDF:
- Identify required files and jars
- Register files for distribution: SparkContext.addFile(…), SparkContext.addJar(…)
Executor: fetches files added to the SparkContext from the Driver; for each UDF:
- If the required file is in the working dir – do nothing (it was already distributed)
- If the file is missing – try to create a symlink to its absolute path
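A rough Java sketch of the driver-side step described above, assuming the planner can hand us each Hive UDF instance (an illustration of the idea, not the actual SPARK-27543 patch):

import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.spark.api.java.JavaSparkContext;

final class HiveUdfResourceRegistrar {
  // During query planning on the Driver, register every resource a UDF declares
  // so that executors fetch it before running tasks.
  static void register(JavaSparkContext sc, GenericUDF udf) {
    String[] files = udf.getRequiredFiles();
    if (files != null) {
      for (String file : files) {
        sc.addFile(file);
      }
    }
    String[] jars = udf.getRequiredJars();
    if (jars != null) {
      for (String jar : jars) {
        sc.addJar(jar);
      }
    }
  }
}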
• The majority of Hive UDFs are written without concurrency in mind
• Hive runs each task in a separate JVM process
• Spark runs a separate JVM process for each Executor, and an Executor can run multiple tasks
concurrently
UDFs and Thread Safety
An Executor runs Task 1 (UDF instance 1) and Task 2 (UDF instance 2) concurrently in the same JVM
Thread-unsafe UDF Example
• Consider that we have 2 tasks and hence 2 instances
of the UDF: "instance 1" and "instance 2"
• The evaluate method is called for each row; both
instances could pass the null check inside evaluate
at the same time
• If "instance 1" finishes initialization first, it will call
evaluate for the next row
• If "instance 2" is still in the middle of initializing the
mapping, it could overwrite the data that "instance 1"
relied on, which could lead to data corruption or an
exception
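A minimal Java sketch of the pattern described above (a hypothetical UDF; the lazily initialized static field is what makes it thread-unsafe):

import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.hive.ql.exec.UDF;

public class CountryNameUDF extends UDF {
  // Static field: shared by every UDF instance running in the same Executor JVM.
  private static Map<String, String> mapping;

  public String evaluate(String countryCode) {
    if (mapping == null) {                 // both tasks can pass this null check concurrently
      mapping = new HashMap<>();           // the second task replaces a map the first task may already be using
      mapping.put("US", "United States");  // the first task can also observe a partially filled map
      mapping.put("CA", "Canada");
      // ... more expensive initialization ...
    }
    return mapping.get(countryCode);
  }
}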
Approach 1: Introduce Synchronization
• Introduce locking (synchronization) on the UDF
class when initializing the mapping
Cons:
• Synchronization is computationally expensive
• Requires manual, careful refactoring of code,
which does not scale to hundreds of UDFs
Approach 2: Make Field Non-static
• Turn the static variable into an instance variable
Cons:
• Adds more memory pressure (instances cannot
share complex data)
Pros:
• Minimal changes to the code, which can also be
codemodded for all other UDFs that use static fields
of non-primitive types
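The same hypothetical UDF after the codemod: the field becomes an instance variable, so each task's UDF instance builds and reads only its own copy (at the cost of duplicating the map per instance):

import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.hive.ql.exec.UDF;

public class CountryNameUDF extends UDF {
  // Instance field: no longer shared across tasks running in the same Executor JVM.
  private Map<String, String> mapping;

  public String evaluate(String countryCode) {
    if (mapping == null) {
      mapping = new HashMap<>();
      mapping.put("US", "United States");
      mapping.put("CA", "Canada");
      // ... more expensive initialization ...
    }
    return mapping.get(countryCode);
  }
}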
• In Spark, UDF objects are initialized on the Driver, serialized, and
later deserialized on executors
• Some classes cannot be deserialized out of the box
• Example: Guava's ImmutableSet. Kryo can successfully
serialize the objects on the driver, but fails to deserialize them
on executors
Kryo serialization/deserialization
• Catch serde issues by running Hive UDF tests in cluster mode
• For commonly used classes, write custom or import existing
serializers
• Mark problematic instance variables as transient
Solving Kryo serde problem
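A hedged sketch of the last option, for a hypothetical UDF holding a Guava ImmutableSet: marking the field transient keeps Kryo from shipping it from the driver, and the executor rebuilds it lazily instead:

import com.google.common.collect.ImmutableSet;
import org.apache.hadoop.hive.ql.exec.UDF;

public class IsAllowedCountryUDF extends UDF {
  // transient: Kryo skips this field when serializing the UDF on the driver,
  // so there is nothing to deserialize on the executor; it is rebuilt lazily there.
  private transient ImmutableSet<String> allowedCountries;

  public Boolean evaluate(String countryCode) {
    if (allowedCountries == null) {
      allowedCountries = ImmutableSet.of("US", "CA", "GB");
    }
    return countryCode != null && allowedCountries.contains(countryCode);
  }
}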
• Hive UDFs don’t support data types from Spark out of the box
• Similarly, Spark cannot work with Hive’s object inspectors
• For each UDF call, Spark's data types are wrapped into Hive's
inspectors and Java types
• Same for the results: Java types are converted back into Spark's data
types
Hive UDFs performance
• This wrapping/unwrapping overhead can lead to up to 2x the CPU time
spent in a UDF compared to a Spark-native implementation
• UDFs that work with complex types suffer the most
Hive UDFs performance
• UDFs account for 15% of CPU spent for Spark queries
• The most computationally expensive UDFs can be converted to
Spark-native UDFs
Hive UDFs performance
Partial Aggregation
SELECT id, max(value)
FROM table
GROUP BY id
Aggregation
Aggregation
Mapper 1 input (id, value): (1, 100), (1, 200), (2, 400), (3, 100)
Mapper 2 input (id, value): (1, 300), (2, 200), (2, 300)
Shuffle: Reducer 1 receives (1, 100), (1, 200), (1, 300), (3, 100); Reducer 2 receives (2, 400), (2, 200), (2, 300)
Reducer 1 output (id, max(value)): (1, 300), (3, 100)
Reducer 2 output (id, max(value)): (2, 400)
1. Every row needs to be shuffled over the network, which
is a heavy operation.
2. Data skew: one reducer needs to process more data than
the others if one key has more rows.
For example: key1 has 1 million rows, while other keys
each have 10 rows on average.
What's the problem
Partial aggregation is a technique in which the system partially
aggregates the data on the mapper side before the shuffle, in order
to reduce the shuffle size.
Partial Aggregation
SELECT id, max(value)
FROM table
GROUP BY id
Partial Aggregation
Partial Aggregation
Mapper 1 input (id, value): (1, 100), (1, 200), (2, 400), (3, 100); after partial aggregation (id, partial_max(value)): (1, 200), (2, 400), (3, 100)
Mapper 2 input (id, value): (1, 300), (2, 200), (2, 300); after partial aggregation (id, partial_max(value)): (1, 300), (2, 300)
Shuffle: Reducer 1 receives (1, 200), (1, 300), (3, 100); Reducer 2 receives (2, 400), (2, 300)
Final aggregation: Reducer 1 output (id, max(value)): (1, 300), (3, 100); Reducer 2 output (id, max(value)): (2, 400)
Aggregation vs Partial Aggregation
Shuffle data (# rows): aggregation shuffles all rows; partial aggregation shuffles a reduced number of rows.
Computation: with aggregation, all the work happens on the reducer side; with partial aggregation, extra CPU is spent on partial aggregation, distributed across mappers and reducers.
• Why partial aggregation is important:
• It impacts CPU and shuffle size
• It can help with data skew
Partial Aggregation is important
1. Partial aggregation support is already in Spark
2. Fixed some issues to make it work with FB UDAFs
What we did
Partial Aggregation
Production Result
1. Partial aggregation improved CPU by 20% and shuffle data
size by 17%
2. However, we also observed some heavy pipelines
regress by as much as 300%
FB Production Result
1. Query shape
2. Data distribution
What could go wrong?
• Column Expansion
• Partial aggregation expands the number of columns on the
Mapper side, resulting in larger shuffle data size
SELECT
key, max(value), min(value), count(value), avg(value)
FROM table
GROUP BY key
When partial aggregation doesn’t work
Column Expansion
Mapper input (id, value): (1, 100), (1, 200), (2, 400), (3, 100)
Aggregation: the mapper shuffles the rows as-is, 2 columns (id, value).
Partial Aggregation: the mapper shuffles 5 columns (id, p_max, p_min, p_count, p_avg): (1, 200, 100, 2, (300, 2)), (2, 400, 400, 1, (400, 1)), (3, 100, 100, 1, (100, 1)).
• Query shape: column expansion
• Data distribution: no rows to aggregate on the mapper side
SELECT
key, max(value)
FROM table
GROUP BY key
When partial aggregation doesn’t work
Data Distribution
Mapper input (id, value): (1, 100), (2, 200), (3, 400), (4, 100) (every group-by key appears only once)
Aggregation: the mapper shuffles 4 rows.
Partial Aggregation: the mapper computes (id, partial_max(value)): (1, 100), (2, 200), (3, 400), (4, 100) and still shuffles 4 rows: extra CPU with no row reduction.
Partial Aggregation: Computation Cost-based Optimization
1. Each UDAF function's partial aggregation performance
2. Column Expansion
3. Row Reduction
Partial Aggregation Computation Cost Factors
• Computation cost-based optimizer for partial aggregation
1. Use multiple features to calculate the computation cost of
partial aggregation:
1. input column number
2. output column number
3. computation cost of the UDAF partial aggregation function
4. …
2. Use the calculated computation cost to decide the
configuration for partial aggregation (a rough sketch follows below)
How we solved the problem
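A purely illustrative sketch of such a cost heuristic, using the features listed above; the names, the weighting, and the threshold are all hypothetical and not the actual implementation:

final class PartialAggCostModel {
  // Decide whether to enable partial aggregation for one aggregate,
  // based on a few of the features listed above.
  static boolean enablePartialAggregation(
      int inputColumns,             // feature: input column number
      int outputColumns,            // feature: output column number (column expansion)
      double udafPartialCostPerRow  // feature: cost of the UDAF's partial-aggregation function
  ) {
    double columnExpansion = (double) outputColumns / Math.max(inputColumns, 1);
    double estimatedCost = columnExpansion + udafPartialCostPerRow;
    return estimatedCost < 2.0;     // hypothetical threshold for illustration
  }
}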
1. It improves efficiency across the board
2. However, there are still queries that don't have the most
optimized partial aggregation configuration
Result
1. Each UDAF function's partial aggregation performance
2. Column Expansion
3. Row Reduction
Partial Aggregation Computation Cost Factors
• It's hard to know the row reduction ahead of time
• It depends on the data distribution, which might differ
from day to day
• For different group-by keys, the row reduction is different
Row Reduction
• History-based tuning
• Use the query's historical data to predict the best
configuration for future runs
• A good fit for partial aggregation because it operates at the
query level: it can try different configs and use the results to
direct the config of future runs
Future work
Recap
Questions?