Sunitha Kambhampati, IBM
How to extend Spark with
customized optimizations
#UnifiedAnalytics #SparkAISummit
Center for Open Source
Data and AI Technologies
IBM Watson West Building
505 Howard St.
San Francisco, CA
CODAIT aims to make AI solutions dramatically
easier to create, deploy, and manage in the
enterprise.
Relaunch of the IBM Spark Technology Center
(STC) to reflect expanded mission.
We contribute to foundational open source
software across the enterprise AI lifecycle.
36 open-source developers!
https://ibm.biz/BdzF6Q
Improving Enterprise AI Lifecycle in Open Source
CODAIT
codait.org
Agenda
•  Introduce Spark Extension Points API
•  Deep Dive into the details
–  What you can do
–  How to use it
–  What things you need to be aware of
•  Enhancements to the API
–  Why
–  Performance results
I want to extend Spark
•  Performance benefits
–  Support for informational referential integrity (RI)
constraints
–  Add Data Skipping Indexes
•  Enabling third-party applications
–  The application uses Spark but requires some additions or small changes to Spark
Problem
You have developed customizations to Spark.
How do you add them to your Spark cluster?
Possible Solutions
•  Option 1: Get the code merged to Apache Spark
–  Maybe it is application specific
–  Maybe it is a value add
–  Not something that can be merged into Spark
•  Option 2: Modify Spark code, fork it
–  Maintenance overhead
•  Extensible solution: Use Spark’s Extension Points
API
Spark Extension Points API
•  Added in Spark 2.2 in SPARK-18127
•  Pluggable & Extensible
•  Extend SparkSession with custom optimizations
•  Marked as Experimental API
–  relatively stable
–  has not seen any changes except the addition of more customization hooks
Query Execution
SQL Query / DataFrame / ML → Unresolved Logical Plan → Analyzer (Rules) → Analyzed Logical Plan
Query Execution
Parser → Unresolved Logical Plan → Analyzer (Rules) → Analyzed Logical Plan → Optimizer (Rules) → Optimized Logical Plan → SparkPlanner (Spark Strategies) → Physical Plan
Supported Customizations
Parser → Custom Parser
Analyzer → Custom Rules
Optimizer → Custom Rules
SparkPlanner → Custom Spark Strategies
Extensions API: At a High level
•  New SparkSessionExtensions Class
–  Methods to pass the customizations
–  Holds the customizations
•  Pass customizations to Spark
–  withExtensions method in SparkSession.builder
SparkSessionExtensions
•  @DeveloperApi
@Experimental
@InterfaceStability.Unstable
•  Inject Methods
–  Pass the custom user rules to
Spark
•  Build Methods
–  Pass the rules to Spark
components
–  Used by Spark Internals
Extension Hooks: Inject Methods
Parser → injectParser
Analyzer → injectResolutionRule, injectCheckRule, injectPostHocResolutionRule
Optimizer → injectOptimizerRule
SparkPlanner → injectPlannerStrategy
injectFunction (new in master, SPARK-25560)
Pass custom rules to SparkSession
•  Use ‘withExtensions’ in SparkSession.Builder
def withExtensions(f: SparkSessionExtensions => Unit): Builder
•  Or use the Spark configuration parameter spark.sql.extensions
–  Takes a class name that implements Function1[SparkSessionExtensions, Unit]
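For reference, a minimal sketch of the configuration-based wiring, assuming a hypothetical class com.example.MyExtensions and a no-op rule defined purely for illustration:

package com.example

import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical no-op optimizer rule, used only to show the wiring.
case class NoOpRule(spark: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

// The class named in spark.sql.extensions must implement
// Function1[SparkSessionExtensions, Unit] and have a no-arg constructor.
class MyExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectOptimizerRule(NoOpRule)
  }
}

It can then be enabled without changing application code, e.g. spark-submit --conf spark.sql.extensions=com.example.MyExtensions (class and package names are hypothetical).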
Deep Dive
Use Case #1
You want to add your own optimization rule to
Spark’s Catalyst Optimizer
Add your custom optimizer rule
•  Step 1: Implement your optimizer rule
case class GroupByPushDown(spark: SparkSession) extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
…. }}
•  Step 2: Create your ExtensionsBuilder function
type ExtensionsBuilder = SparkSessionExtensions => Unit
val f: ExtensionsBuilder = { e => e.injectOptimizerRule(GroupByPushDown)}
•  Step 3: Use the withExtensions method in SparkSession.builder to
create your custom SparkSession
val spark = SparkSession.builder().master(..).withExtensions(f).getOrCreate()
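As a concrete illustration of the Rule[LogicalPlan] shape (this is not the GroupByPushDown rule from this talk; it is a trivial sketch that removes filters whose condition is the literal TRUE):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Illustrative rule: drop Filter nodes whose condition is the literal TRUE.
case class RemoveTrivialFilter(spark: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Filter(Literal(true, _), child) => child
  }
}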
How does the rule get added?
•  Catalyst Optimizer
–  Rules are grouped in batches (i.e. RuleExecutor.Batch)
–  One of the fixed batches has a placeholder for custom optimizer rules
–  The Optimizer passes extendedOperatorOptimizationRules into that batch:
def extendedOperatorOptimizationRules: Seq[Rule[LogicalPlan]]
•  SparkSession stores the SparkSessionExtensions in the transient class variable extensions
•  The SparkOptimizer instance gets created during SessionState creation for the SparkSession
–  It overrides the extendedOperatorOptimizationRules method to include the customized rules
–  Check the optimizer method in BaseSessionStateBuilder
Things to Note
•  The rule gets added to a predefined batch
•  Batch here refers to RuleExecutor.Batch
•  In master, it is added to the following batches:
–  “Operator Optimization before Inferring Filters”
–  “Operator Optimization after Inferring Filters”
•  Check the defaultBatches method in the Optimizer class
Use Case #2
You want to add some parser extensions
Parser Customization
•  Step 1: Implement your parser customization
case class RIExtensionsParser(
spark: SparkSession,
delegate: ParserInterface) extends ParserInterface { …}
•  Step 2: Create your ExtensionsBuilder function
type ExtensionsBuilder = SparkSessionExtensions => Unit
val f: ExtensionsBuilder = { e => e.injectParser(RIExtensionsParser)}
•  Step 3: Use the withExtensions method in SparkSession.builder to
create your custom SparkSession
val spark = SparkSession.builder().master("…").withExtensions(f).getOrCreate()
How do the parser extensions work?
•  Customize the parser for any new syntax you want to support
•  Delegate the rest of the Spark SQL syntax to the SparkSqlParser
•  The sqlParser is created by calling buildParser on the extensions object in the SparkSession
–  See sqlParser in the BaseSessionStateBuilder class
–  The SparkSqlParser (default Spark parser) is passed in along with the SparkSession
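A minimal sketch of such a delegating parser, assuming the Spark 2.4 ParserInterface (the method set differs across Spark versions); the RI-specific grammar handling is only hinted at:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.{FunctionIdentifier, TableIdentifier}
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.parser.ParserInterface
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.types.{DataType, StructType}

// Sketch: handle custom syntax in parsePlan, delegate everything else.
case class RIExtensionsParser(
    spark: SparkSession,
    delegate: ParserInterface) extends ParserInterface {

  override def parsePlan(sqlText: String): LogicalPlan = {
    // A real implementation would try the custom RI grammar here and fall
    // back to the delegate for any statement it does not understand.
    delegate.parsePlan(sqlText)
  }

  override def parseExpression(sqlText: String): Expression = delegate.parseExpression(sqlText)
  override def parseTableIdentifier(sqlText: String): TableIdentifier = delegate.parseTableIdentifier(sqlText)
  override def parseFunctionIdentifier(sqlText: String): FunctionIdentifier = delegate.parseFunctionIdentifier(sqlText)
  override def parseTableSchema(sqlText: String): StructType = delegate.parseTableSchema(sqlText)
  override def parseDataType(sqlText: String): DataType = delegate.parseDataType(sqlText)
}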
Use Case #3
You want to add some specific checks in the
Analyzer
Analyzer Customizations
•  Analyzer rules: injectResolutionRule
•  Post-hoc resolution rules: injectPostHocResolutionRule
•  Check rules: injectCheckRule
Analyzer Rule Customization
•  Step 1: Implement your Analyzer rule
case class MyRIRule(spark: SparkSession) extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
…. }}
•  Step 2: Create your ExtensionsBuilder function
type ExtensionsBuilder = SparkSessionExtensions => Unit
val f: ExtensionsBuilder = { e => e.injectResolutionRule(MyRIRule)}
•  Step 3: Use the withExtensions method in SparkSession.builder to
create your custom SparkSession
val spark =
SparkSession.builder().master("..").withExtensions(f).getOrCreate()
How is the rule added to the Analyzer?
•  The Analyzer has rules in batches
–  A batch has a placeholder, extendedResolutionRules, to add custom rules
–  The “Post-Hoc Resolution” batch holds the postHocResolutionRules
•  SparkSession stores the SparkSessionExtensions in extensions
•  When the SessionState is created, the custom rules are passed to the Analyzer by overriding the following class member variables:
–  val extendedResolutionRules
–  val postHocResolutionRules
–  val extendedCheckRules
•  Check the BaseSessionStateBuilder.analyzer method
•  Check the HiveSessionStateBuilder.analyzer method
Things to Note
•  A custom resolution rule gets added at the end of the ‘Resolution’ batch
•  The check rules are called at the end of the checkAnalysis method, after all of Spark’s own checks are done
•  In the Analyzer.checkAnalysis method:
extendedCheckRules.foreach(_(plan))
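A minimal sketch of a check rule, using a hypothetical check (reject plans with no output columns) purely to show the LogicalPlan => Unit shape; a real RI check would inspect the plan more carefully:

import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical check rule: a function LogicalPlan => Unit that throws on violation.
def noEmptyOutputCheck(spark: SparkSession): LogicalPlan => Unit = { plan =>
  if (plan.output.isEmpty) {
    throw new IllegalStateException("Plan produces no output columns")
  }
}

val f: SparkSessionExtensions => Unit = { e => e.injectCheckRule(noEmptyOutputCheck) }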
Use Case #4
You want to add custom planning strategies
Add new physical plan strategy
•  Step 1: Implement your new physical plan Strategy class
case class IdxStrategy(spark: SparkSession) extends SparkStrategy {
override def apply(plan: LogicalPlan): Seq[SparkPlan] = { ….. }
}
•  Step 2: Create your ExtensionsBuilder function
type ExtensionsBuilder = SparkSessionExtensions => Unit
val f: ExtensionsBuilder = { e => e.injectPlannerStrategy(IdxStrategy)}
•  Step 3: Use the withExtensions method in SparkSession.builder to
create your custom SparkSession
val spark = SparkSession.builder().master(..).withExtensions(f).getOrCreate()
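A skeletal sketch of the strategy shape; the index-specific matching is hypothetical, and returning Nil for unmatched plans lets the planner fall through to the built-in strategies:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.{SparkPlan, SparkStrategy}

// Skeleton strategy: returning Nil tells the planner this strategy does not
// apply, so the remaining strategies are tried.
case class IdxStrategy(spark: SparkSession) extends SparkStrategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    // case SomeLogicalPatternYouHandle(...) => YourIndexScanExec(...) :: Nil  // hypothetical
    case _ => Nil
  }
}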
How does the strategy get added?
•  The SparkPlanner uses a Seq of SparkStrategy
–  Its strategies method has a placeholder, extraPlanningStrategies
•  SparkSession stores the SparkSessionExtensions in the transient class variable extensions
•  The SparkPlanner instance gets created during SessionState creation for the SparkSession
–  It overrides extraPlanningStrategies to include the custom strategies (buildPlannerStrategies)
–  Check the BaseSessionStateBuilder.planner method
–  Check the HiveSessionStateBuilder.planner method
Things to Note
•  Custom Strategies are tried after the strategies
defined in ExperimentalMethods, and before
the regular strategies
–  Check the SparkPlanner.strategies method
Use Case #5
You want to register custom functions in the
session catalog
Register Custom Function
•  Step 1: Create a FunctionDescription with your custom
function
type FunctionDescription =
(FunctionIdentifier, ExpressionInfo, FunctionBuilder)
def utf8strlen(x: String): Int = {..}
val f = udf(utf8strlen(_))
def builder(children: Seq[Expression]) =
f.apply(children.map(Column.apply) : _*).expr
val myfuncDesc = (FunctionIdentifier("utf8strlen"),
new ExpressionInfo("noclass", "utf8strlen"), builder)
Register Custom Function
•  Step 2: Create your ExtensionsBuilder function to inject
the new function
type ExtensionsBuilder = SparkSessionExtensions => Unit
val f: ExtensionsBuilder = { e => e.injectFunction(myfuncDesc)}
•  Step 3: Pass this function to withExtensions method on
SparkSession.builder and create your new SparkSession
val spark =
SparkSession.builder().master(..).withExtensions(f).getOrCreate()
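Once the session is built with these extensions, the injected function can be used from SQL (assuming the utf8strlen UDF from Step 1):

// Call the injected function through SQL.
spark.sql("SELECT utf8strlen('Spark')").show()

// It is also visible in the session catalog (as a temporary function).
spark.catalog.listFunctions().filter("name = 'utf8strlen'").show()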
How does Custom Function registration work?
•  SparkSessionExtensions keeps track of the injectedFunctions
•  During SessionCatalog creation, the injectedFunctions are registered in the functionRegistry
–  See the class variable BaseSessionStateBuilder.functionRegistry
–  See the method SimpleFunctionRegistry.registerFunction
Things to Note
•  Functions are registered in the same order in which injectFunction is called
•  There is no check during injection for whether a function with the same name already exists
•  A warning is raised if a function replaces an existing function
–  The check is based on a lowercase match of the function name
•  Use SparkSession.catalog.listFunctions to look up your function
•  The registered functions are temporary functions
•  See the SimpleFunctionRegistry.registerFunction method
How to exclude the optimizer rule
•  Spark v2.4 has a new SQL conf: spark.sql.optimizer.excludedRules
•  Specify the custom rule’s class name:
session.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.mycompany.spark.MyCustomRule")
Other ways to customize
•  ExperimentalMethods
–  Customize physical planning strategies
–  Customize optimizer rules
•  Use the SparkSession.experimental method (see the sketch below)
–  spark.experimental.extraStrategies
•  Added at the beginning of the strategies in SparkPlanner
–  spark.experimental.extraOptimizations
•  Added after all the batches in SparkOptimizer
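A quick sketch of this route, reusing the illustrative NoOpRule and IdxStrategy classes sketched earlier (both hypothetical):

// Mutable hooks on an existing SparkSession; no extensions class is needed.
spark.experimental.extraOptimizations = Seq(NoOpRule(spark))
spark.experimental.extraStrategies = Seq(IdxStrategy(spark))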
Things to Note
•  ExperimentalMethods
–  Rules are injected in a different location than with the Extension Points API
–  So use this only if it is advantageous for your use case
•  Recommendation: Use the Extension Points API
Proposed API Enhancements
SPARK-26249: API Enhancements
•  Motivation
–  Lack of fine-grained control over rule execution order
–  No way to add batches in a specific order
•  Add support to the extensions API to
–  Inject an optimizer rule in a specific order
–  Inject an optimizer batch
Inject Optimizer Rule in Order
•  Inject a rule after or before an existing rule in a given
existing batch in the Optimizer
def injectOptimizerRuleInOrder(
builder: RuleBuilder,
batchName: String,
ruleOrder: Order.Order,
existingRule: String): Unit
Inject Optimizer Batch
•  Inject a batch of optimizer rules
•  Specify the order where you want to inject the batch
def injectOptimizerBatch(
batchName: String,
maxIterations: Int,
existingBatchName: String,
order: Order.Value,
rules: Seq[RuleBuilder]): Unit
End to End Use Case
Use case: GroupBy Push Down Through Join
•  If the join is an RI join, heuristically push down Group By to the fact table
–  The input to the Group By remains the same before and after the join
–  The input to the join is reduced
–  Overall reduction of the execution time
Conditions (diagram annotations):
–  Aggregate functions are on fact-table columns
–  The grouping columns are a superset of the join columns
–  The joins are PK–FK joins
Group By Push Down Through Join
Execution plan transformation:
•  Query execution drops from 70 secs to 30 secs (1TB
TPC-DS setup), 2x improvement
select c_customer_sk, c_first_name, c_last_name, s_store_sk, s_store_name,
min(ss.ss_quantity) as store_sales_quantity
from store_sales ss, date_dim, customer, store
where d_date_sk = ss_sold_date_sk and
c_customer_sk = ss_customer_sk and
s_store_sk = ss_store_sk and
d_year between 2000 and 2002
group by c_customer_sk, c_first_name, c_last_name, s_store_sk, s_store_name
order by c_customer_sk, c_first_name, c_last_name, s_store_sk, s_store_name
limit 100;
Star schema (diagram): store_sales is the fact table, joined N:1 to the customer, date_dim, and store dimension tables.
Retrieve the minimum quantity of items that were sold between the years 2000 and 2002, grouped by customer and store information.
Optimized Query Plan: Explain
== Optimized Logical Plan ==
GlobalLimit 100
+- LocalLimit 100
+- Sort [c_customer_sk#52 ASC NULLS FIRST, c_first_name#60 ASC NULLS FIRST, c_last_name#61 ASC NULLS FIRST, s_store_sk#70 ASC NULLS FIRST, s_store_name#75 ASC NULLS FIRST], true
+- Project [c_customer_sk#52, c_first_name#60, c_last_name#61, s_store_sk#70, s_store_name#75, store_sales_quantity#0L]
+- Join Inner, (s_store_sk#70 = ss_store_sk#8)
:- Project [c_customer_sk#52, c_first_name#60, c_last_name#61, ss_store_sk#8, store_sales_quantity#0L]
: +- Join Inner, (c_customer_sk#52 = ss_customer_sk#4)
: :- Aggregate [ss_customer_sk#4, ss_store_sk#8], [ss_customer_sk#4, ss_store_sk#8, min(ss_quantity#11L) AS store_sales_quantity#0L]
: : +- Project [ss_customer_sk#4, ss_store_sk#8, ss_quantity#11L]
: : +- Join Inner, (d_date_sk#24 = ss_sold_date_sk#1)
: : :- Project [ss_sold_date_sk#1, ss_customer_sk#4, ss_store_sk#8, ss_quantity#11L]
: : : +- Filter ((isnotnull(ss_sold_date_sk#1) && isnotnull(ss_customer_sk#4)) && isnotnull(ss_store_sk#8))
: : : +- Relation[ss_sold_date_sk#1,ss_sold_time_sk#2,ss_item_sk#3,ss_customer_sk#4,ss_cdemo_sk#5,ss_hdemo_sk#6,ss_addr_sk#7,ss_store_sk#8,ss_promo_sk#9,ss_ticket_number#10L,ss_quantity#11L,ss_wholesale_cost#12,ss_list_price#13,ss_sales_price#14,ss_ext_discount_amt#15,ss_ext_sales_price#16,ss_ext_wholesale_cost#17,ss_ext_list_price#18,ss_ext_tax#19,ss_coupon_amt#20,ss_net_paid#21,ss_net_paid_inc_tax#22,ss_net_profit#23] parquet
: : +- Project [d_date_sk#24]
: : +- Filter (((isnotnull(d_year#30L) && (d_year#30L >= 2000)) && (d_year#30L <= 2002)) && isnotnull(d_date_sk#24))
: : +- Relation[d_date_sk#24,d_date_id#25,d_date#26,d_month_seq#27L,d_week_seq#28L,d_quarter_seq#29L,d_year#30L,d_dow#31L,d_moy#32L,d_dom#33L,d_qoy#34L,d_fy_year#35L,d_fy_quarter_seq#36L,d_fy_week_seq#37L,d_day_name#38,d_quarter_name#39,d_holiday#40,d_weekend#41,d_following_holiday#42,d_first_dom#43L,d_last_dom#44L,d_same_day_ly#45L,d_same_day_lq#46L,d_current_day#47,... 4 more fields] parquet
: +- Project [c_customer_sk#52, c_first_name#60, c_last_name#61]
: +- Filter isnotnull(c_customer_sk#52)
: +- Relation[c_customer_sk#52,c_customer_id#53,c_current_cdemo_sk#54,c_current_hdemo_sk#55,c_current_addr_sk#56,c_first_shipto_date_sk#57,c_first_sales_date_sk#58,c_salutation#59,c_first_name#60,c_last_name#61,c_preferred_cust_flag#62,c_birth_day#63L,c_birth_month#64L,c_birth_year#65L,c_birth_country#66,c_login#67,c_email_address#68,c_last_review_date#69L] parquet
+- Project [s_store_sk#70, s_store_name#75]
+- Filter isnotnull(s_store_sk#70)
+- Relation[s_store_sk#70,s_store_id#71,s_rec_start_date#72,s_rec_end_date#73,s_closed_date_sk#74,s_store_name#75,s_number_employees#76L,s_floor_space#77L,s_hours#78,s_manager#79,s_market_id#80L,s_geography_class#81,s_market_desc#82,s_market_manager#83,s_division_id#84L,s_division_name#85,s_company_id#86L,s_company_name#87,s_street_number#88,s_street_name#89,s_street_type#90,s_suite_number#91,s_city#92,s_county#93,... 5 more fields] parquet
The Group By (Aggregate) is pushed below the Join
Benefits of the Proposed Changes
•  Implemented new GroupByPushDown optimization
rule
–  Benefit from RI constraints
•  Used the Optimizer Customization
•  Injected using injectOptimizerRuleInOrder
e.injectOptimizerRuleInOrder(
  GroupByPushDown,
  "Operator Optimization before Inferring Filters",
  Order.after,
  "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate")
•  Achieved 2X performance improvements
Recap: How to Extend Spark
•  Use the Extension Points API
•  Five Extension Points
•  Adding a rule is a 3-step process
–  Implement your rule
–  Implement your wrapper function, using the right inject method
type ExtensionsBuilder = SparkSessionExtensions => Unit
–  Plug in the wrapper function via the withExtensions method in SparkSession.Builder
Resources
•  https://developer.ibm.com/code/2017/11/30/learn-extension-points-apache-spark-extend-spark-catalyst-optimizer/
•  https://rtahboub.github.io/blog/2018/writing-customized-parser/
•  https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/SparkSessionExtensionSuite.scala
•  https://issues.apache.org/jira/browse/SPARK-18127
•  https://issues.apache.org/jira/browse/SPARK-26249
•  http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
Thank you!
https://ibm.biz/Bd2GbF
Contenu connexe

Tendances

Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Databricks
 
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -Yoshiyasu SAEKI
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversScyllaDB
 
Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...Databricks
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkKazuaki Ishizaki
 
Spark SQL Join Improvement at Facebook
Spark SQL Join Improvement at FacebookSpark SQL Join Improvement at Facebook
Spark SQL Join Improvement at FacebookDatabricks
 
Parallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected WaysParallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected WaysDatabricks
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiDatabricks
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilDatabricks
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...Databricks
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architectureAdam Doyle
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Streaming SQL for Data Engineers: The Next Big Thing?
Streaming SQL for Data Engineers: The Next Big Thing?Streaming SQL for Data Engineers: The Next Big Thing?
Streaming SQL for Data Engineers: The Next Big Thing?Yaroslav Tkachenko
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code GenerationDatabricks
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDatabricks
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 

Tendances (20)

Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
 
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
 
Spark SQL Join Improvement at Facebook
Spark SQL Join Improvement at FacebookSpark SQL Join Improvement at Facebook
Spark SQL Join Improvement at Facebook
 
Parallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected WaysParallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected Ways
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Streaming SQL for Data Engineers: The Next Big Thing?
Streaming SQL for Data Engineers: The Next Big Thing?Streaming SQL for Data Engineers: The Next Big Thing?
Streaming SQL for Data Engineers: The Next Big Thing?
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code Generation
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 

Similaire à Extend Spark with Custom Optimizations

Non SharePoint Deployment
Non SharePoint DeploymentNon SharePoint Deployment
Non SharePoint DeploymentSparked
 
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and PluginsMonitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and PluginsDatabricks
 
Useful practices of creation automatic tests by using cucumber jvm
Useful practices of creation automatic tests by using cucumber jvmUseful practices of creation automatic tests by using cucumber jvm
Useful practices of creation automatic tests by using cucumber jvmAnton Shapin
 
Eclipse IDE, 2019.09, Java Development
Eclipse IDE, 2019.09, Java Development Eclipse IDE, 2019.09, Java Development
Eclipse IDE, 2019.09, Java Development Pei-Hsuan Hsieh
 
(ATS4-DEV07) How to Build a Custom Search Panel for Symyx Notebook
(ATS4-DEV07) How to Build a Custom Search Panel for Symyx Notebook(ATS4-DEV07) How to Build a Custom Search Panel for Symyx Notebook
(ATS4-DEV07) How to Build a Custom Search Panel for Symyx NotebookBIOVIA
 
Building Custom Adapters 3.7
Building Custom Adapters 3.7Building Custom Adapters 3.7
Building Custom Adapters 3.7StephenKardian
 
Understanding SharePoint Framework Extensions
Understanding SharePoint Framework ExtensionsUnderstanding SharePoint Framework Extensions
Understanding SharePoint Framework ExtensionsBIWUG
 
Logic apps and PowerApps - Integrate across your APIs
Logic apps and PowerApps - Integrate across your APIsLogic apps and PowerApps - Integrate across your APIs
Logic apps and PowerApps - Integrate across your APIsSriram Hariharan
 
Azure Resource Manager templates: Improve deployment time and reusability
Azure Resource Manager templates: Improve deployment time and reusabilityAzure Resource Manager templates: Improve deployment time and reusability
Azure Resource Manager templates: Improve deployment time and reusabilityStephane Lapointe
 
CTS2 Development Framework
CTS2 Development FrameworkCTS2 Development Framework
CTS2 Development Frameworkcts2framework
 
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...Databricks
 
China Science Challenge
China Science ChallengeChina Science Challenge
China Science Challengeremko caprio
 
SgCodeJam24 Workshop
SgCodeJam24 WorkshopSgCodeJam24 Workshop
SgCodeJam24 Workshopremko caprio
 
SplunkLive! Developer Session
SplunkLive! Developer SessionSplunkLive! Developer Session
SplunkLive! Developer SessionSplunk
 
Setting up your virtual infrastructure using FIWARE Lab Cloud
Setting up your virtual infrastructure using FIWARE Lab CloudSetting up your virtual infrastructure using FIWARE Lab Cloud
Setting up your virtual infrastructure using FIWARE Lab CloudFernando Lopez Aguilar
 
MWLUG 2015 - AD114 Take Your XPages Development to the Next Level
MWLUG 2015 - AD114 Take Your XPages Development to the Next LevelMWLUG 2015 - AD114 Take Your XPages Development to the Next Level
MWLUG 2015 - AD114 Take Your XPages Development to the Next Levelbalassaitis
 

Similaire à Extend Spark with Custom Optimizations (20)

Non SharePoint Deployment
Non SharePoint DeploymentNon SharePoint Deployment
Non SharePoint Deployment
 
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and PluginsMonitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
 
Useful practices of creation automatic tests by using cucumber jvm
Useful practices of creation automatic tests by using cucumber jvmUseful practices of creation automatic tests by using cucumber jvm
Useful practices of creation automatic tests by using cucumber jvm
 
Eclipse IDE, 2019.09, Java Development
Eclipse IDE, 2019.09, Java Development Eclipse IDE, 2019.09, Java Development
Eclipse IDE, 2019.09, Java Development
 
(ATS4-DEV07) How to Build a Custom Search Panel for Symyx Notebook
(ATS4-DEV07) How to Build a Custom Search Panel for Symyx Notebook(ATS4-DEV07) How to Build a Custom Search Panel for Symyx Notebook
(ATS4-DEV07) How to Build a Custom Search Panel for Symyx Notebook
 
Building Custom Adapters 3.7
Building Custom Adapters 3.7Building Custom Adapters 3.7
Building Custom Adapters 3.7
 
Understanding SharePoint Framework Extensions
Understanding SharePoint Framework ExtensionsUnderstanding SharePoint Framework Extensions
Understanding SharePoint Framework Extensions
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
 
Logic apps and PowerApps - Integrate across your APIs
Logic apps and PowerApps - Integrate across your APIsLogic apps and PowerApps - Integrate across your APIs
Logic apps and PowerApps - Integrate across your APIs
 
Azure Resource Manager templates: Improve deployment time and reusability
Azure Resource Manager templates: Improve deployment time and reusabilityAzure Resource Manager templates: Improve deployment time and reusability
Azure Resource Manager templates: Improve deployment time and reusability
 
slides.pptx
slides.pptxslides.pptx
slides.pptx
 
CTS2 Development Framework
CTS2 Development FrameworkCTS2 Development Framework
CTS2 Development Framework
 
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...
SparkOscope: Enabling Apache Spark Optimization through Cross Stack Monitorin...
 
China Science Challenge
China Science ChallengeChina Science Challenge
China Science Challenge
 
SgCodeJam24 Workshop
SgCodeJam24 WorkshopSgCodeJam24 Workshop
SgCodeJam24 Workshop
 
SplunkLive! Developer Session
SplunkLive! Developer SessionSplunkLive! Developer Session
SplunkLive! Developer Session
 
Setting up your virtual infrastructure using FIWARE Lab Cloud
Setting up your virtual infrastructure using FIWARE Lab CloudSetting up your virtual infrastructure using FIWARE Lab Cloud
Setting up your virtual infrastructure using FIWARE Lab Cloud
 
MWLUG 2015 - AD114 Take Your XPages Development to the Next Level
MWLUG 2015 - AD114 Take Your XPages Development to the Next LevelMWLUG 2015 - AD114 Take Your XPages Development to the Next Level
MWLUG 2015 - AD114 Take Your XPages Development to the Next Level
 
Apache DeltaSpike: The CDI Toolbox
Apache DeltaSpike: The CDI ToolboxApache DeltaSpike: The CDI Toolbox
Apache DeltaSpike: The CDI Toolbox
 
Apache DeltaSpike the CDI toolbox
Apache DeltaSpike the CDI toolboxApache DeltaSpike the CDI toolbox
Apache DeltaSpike the CDI toolbox
 

Plus de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Plus de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Dernier

定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 

Dernier (20)

定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 

Extend Spark with Custom Optimizations

  • 1. Sunitha Kambhampati, IBM How to extend Spark with customized optimizations #UnifiedAnalytics #SparkAISummit
  • 2. Center for Open Source Data and AI Technologies IBM Watson West Building 505 Howard St. San Francisco, CA CODAIT aims to make AI solutions dramatically easier to create, deploy, and manage in the enterprise. Relaunch of the IBM Spark Technology Center (STC) to reflect expanded mission. We contribute to foundational open source software across the enterprise AI lifecycle. 36 open-source developers! https://ibm.biz/BdzF6Q Improving Enterprise AI Lifecycle in Open Source CODAIT codait.org
  • 3. Agenda •  Introduce Spark Extension Points API •  Deep Dive into the details –  What you can do –  How to use it –  What things you need to be aware of •  Enhancements to the API –  Why –  Performance results
  • 4. I want to extend Spark •  Performance benefits –  Support for informational referential integrity (RI) constraints –  Add Data Skipping Indexes •  Enabling Third party applications –  Application uses Spark but it requires some additions or small changes to Spark
  • 5. Problem You have developed customizations to Spark. How do you add it to your Spark cluster?
  • 6. Possible Solutions •  Option 1: Get the code merged to Apache Spark –  Maybe it is application specific –  Maybe it is a value add –  Not something that can be merged into Spark •  Option 2: Modify Spark code, fork it –  Maintenance overhead •  Extensible solution: Use Spark’s Extension Points API
  • 7. Spark Extension Points API •  Added in Spark 2.2 in SPARK-18127 •  Pluggable & Extensible •  Extend SparkSession with custom optimizations •  Marked as Experimental API –  relatively stable –  has not seen any changes except addition of more customization
  • 9. Query Execution Parser Optimizer Unresolved Logical Plan Analyzed Logical Plan Optimized Logical Plan Physical Plan Analyzer Rules Rules SparkPlanner Spark Strategies
  • 10. Supported Customizations Parser OptimizerAnalyzer Rules Rules SparkPlanner Spark Strategies Custom Rules Custom Rules Custom Spark Strategies Custom Parser
  • 11. Extensions API: At a High level •  New SparkSessionExtensions Class –  Methods to pass the customizations –  Holds the customizations •  Pass customizations to Spark –  withExtensions method in SparkSession.builder
  • 12. SparkSessionExtensions •  @DeveloperApi @Experimental @InterfaceStability.Unstable •  Inject Methods –  Pass the custom user rules to Spark •  Build Methods –  Pass the rules to Spark components –  Used by Spark Internals
  • 13. Extension Hooks: Inject Methods Parser OptimizerAnalyzer SparkPlanner injectResolutionRule injectCheckRule injectPostHocResolutionRule injectOptimizerRule injectFunction injectPlannerStrategyinjectParser New in master, SPARK-25560
  • 14. Pass custom rules to SparkSession •  Use ‘withExtensions’ in SparkSession.Builder def withExtensions( f: SparkSessionExtensions => Unit): Builder •  Use the Spark configuration parameter –  spark.sql.extensions •  Takes a class name that implements Function1[SparkSessionExtensions, Unit]
  • 16. Use Case #1 You want to add your own optimization rule to Spark’s Catalyst Optimizer
  • 17. Add your custom optimizer rule •  Step 1: Implement your optimizer rule case class GroupByPushDown(spark: SparkSession) extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transform { …. }} •  Step 2: Create your ExtensionsBuilder function type ExtensionsBuilder = SparkSessionExtensions => Unit val f: ExtensionsBuilder = { e => e.injectOptimizerRule(GroupByPushDown)} •  Step 3: Use the withExtensions method in SparkSession.builder to create your custom SparkSession val spark = SparkSession.builder().master(..).withExtensions(f).getOrCreate()
  • 18. How does the rule get added? •  Catalyst Optimizer –  Rules are grouped in Batches (ie RuleExecutor.Batch) –  one of the fixed batch has a placeholder to add custom optimizer rules –  passes in the extendedOperatorOptimizationRules to the batch. def extendedOperatorOptimizationRules: Seq[Rule[LogicalPlan]] •  SparkSession stores the SparkSessionExtensions in transient class variable extensions •  The SparkOptimizer instance gets created during the SessionState creation for the SparkSession –  overrides the extendedOperatorOptimizationRules method to include the customized rules –  Check the optimizer method in BaseSessionStateBuilder
  • 19. Things to Note •  Rule gets added to a predefined batch •  Batch here refers to RuleExecutor.Batch •  In Master, it is to the following batches: –  “Operator Optimization before Inferring Filters” –  “Operator Optimization after Inferring Filters” •  Check the defaultBatches method in Optimizer class
  • 20. Use Case #2 You want to add some parser extensions
  • 21. Parser Customization •  Step 1: Implement your parser customization case class RIExtensionsParser( spark: SparkSession, delegate: ParserInterface) extends ParserInterface { …} •  Step 2: Create your ExtensionsBuilder function type ExtensionsBuilder = SparkSessionExtensions => Unit val f: ExtensionsBuilder = { e => e.injectParser(RIExtensionsParser)} •  Step 3: Use the withExtensions method in SparkSession.builder to create your custom SparkSession val spark = SparkSession.builder().master("…").withExtensions(f).getOrCreate()
  • 22. How do the parser extensions work? •  Customize the parser for any new syntax to support •  Delegate rest of the Spark SQL syntax to the SparkSqlParser •  sqlParser is created by calling the buildParser on the extensions object in the SparkSession –  See sqlParser in BaseSessionStateBuilder class –  SparkSqlParser (Default Spark Parser) is passed in along with the SparkSession
  • 23. Use Case #3 You want to add some specific checks in the Analyzer
  • 24. Analyzer Customizations •  Analyzer Rules injectResolutionRule •  PostHocResolutionRule injectPostHocResolutionRule •  CheckRules injectCheckRule
  • 25. Analyzer Rule Customization •  Step 1: Implement your Analyzer rule case class MyRIRule(spark: SparkSession) extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transform { …. }} •  Step 2: Create your ExtensionsBuilder function type ExtensionsBuilder = SparkSessionExtensions => Unit val f: ExtensionsBuilder = { e => e.injectResolutionRule(MyRIRule)} •  Step 3: Use the withExtensions method in SparkSession.builder to create your custom SparkSession val spark = SparkSession.builder().master("..").withExtensions(f).getOrCreate
  • 26. How is the rule added to the Analyzer? •  Analyzer has rules in batches –  Batch has a placeholder extendedResolutionRules to add custom rules –  Batch “Post-Hoc Resolution” for postHocResolutionRules •  SparkSession stores the SparkSessionExtensions in extensions •  When SessionState is created, the custom rules are passed to the Analyzer by overriding the following class member variables –  val extendedResolutionRules –  val postHocResolutionRules –  val extendedCheckRules •  Check the BaseSessionStateBuilder.analyzer method •  Check the HiveSessionStateBuilder.analyzer method
  • 27. Things to Note •  The custom resolution rule gets added at the end of the ‘Resolution’ batch •  The check rules are called at the end of the checkAnalysis method, after all of Spark’s built-in checks are done •  In the Analyzer.checkAnalysis method: extendedCheckRules.foreach(_(plan))
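Slide 24 lists injectCheckRule but no example is shown; below is a hypothetical sketch of a check rule. A check rule is a LogicalPlan => Unit: it rejects an analyzed plan by throwing and accepts it by returning normally. The NoConditionlessJoins name and the rejected pattern are illustrative assumptions, not part of the talk.

```scala
import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}

// Hypothetical check rule: reject plans that contain a join with no join
// condition. Throwing signals an error; returning normally accepts the plan.
object NoConditionlessJoins {
  val check: SparkSession => LogicalPlan => Unit = { _ => plan =>
    plan.foreach {
      case j: Join if j.condition.isEmpty =>
        throw new UnsupportedOperationException(
          "Joins without a join condition are not allowed in this session")
      case _ => // everything else passes
    }
  }

  // Inject it alongside resolution / post-hoc rules as needed
  val f: SparkSessionExtensions => Unit = { e => e.injectCheckRule(check) }
}
```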
  • 28. Use Case #4 You want to add custom planning strategies
  • 29. Add new physical plan strategy •  Step 1: Implement your new physical plan Strategy class case class IdxStrategy(spark: SparkSession) extends SparkStrategy { override def apply(plan: LogicalPlan): Seq[SparkPlan] = { ….. } } •  Step 2: Create your ExtensionsBuilder function type ExtensionsBuilder = SparkSessionExtensions => Unit val f: ExtensionsBuilder = { e => e.injectPlannerStrategy(IdxStrategy)} •  Step 3: Use the withExtensions method in SparkSession.builder to create your custom SparkSession val spark = SparkSession.builder().master(..).withExtensions(f).getOrCreate()
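A hypothetical skeleton of the elided IdxStrategy body: it recognizes nothing yet and returns Nil, so planning always falls through to Spark's built-in strategies. A real index strategy would match the logical scans it can serve from an index and return the corresponding physical operators.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.{SparkPlan, SparkStrategy}

// Hypothetical skeleton: returning Nil means "this strategy does not apply",
// letting the remaining strategies produce the physical plan.
case class IdxStrategy(spark: SparkSession) extends SparkStrategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    // case scan if indexCoversScan(scan) => MyIndexScanExec(...) :: Nil   // hypothetical
    case _ => Nil
  }
}
```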
  • 30. How does the strategy get added? •  SparkPlanner uses a Seq of SparkStrategy –  The strategies function has a placeholder, extraPlanningStrategies •  SparkSession stores the SparkSessionExtensions in the transient class variable extensions •  The SparkPlanner instance gets created during SessionState creation for the SparkSession –  It overrides extraPlanningStrategies to include the custom strategies (buildPlannerStrategies) –  Check the BaseSessionStateBuilder.planner method –  Check the HiveSessionStateBuilder.planner method
  • 31. Things to Note •  Custom Strategies are tried after the strategies defined in ExperimentalMethods, and before the regular strategies –  Check the SparkPlanner.strategies method
  • 32. Use Case #5 You want to register custom functions in the session catalog
  • 33. Register Custom Function •  Step 1: Create a FunctionDescription with your custom function type FunctionDescription = (FunctionIdentifier, ExpressionInfo, FunctionBuilder) def utf8strlen(x: String): Int = {..} val f = udf(utf8strlen(_)) def builder(children: Seq[Expression]) = f.apply(children.map(Column.apply) : _*).expr val myfuncDesc = (FunctionIdentifier("utf8strlen"), new ExpressionInfo("noclass", "utf8strlen"), builder)
  • 34. Register Custom Function •  Step 2: Create your ExtensionsBuilder function to inject the new function type ExtensionsBuilder = SparkSessionExtensions => Unit val f: ExtensionsBuilder = { e => e.injectFunction (myfuncDesc)} •  Step 3: Pass this function to withExtensions method on SparkSession.builder and create your new SparkSession val spark = SparkSession.builder().master(..).withExtensions(f).getOrCreate()
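The two slides above combined into one self-contained sketch. The utf8strlen body is an assumption (the slide elides it), and injectFunction requires a Spark build that includes SPARK-25560.

```scala
import org.apache.spark.sql.{Column, SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.FunctionIdentifier
import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionInfo}
import org.apache.spark.sql.functions.udf

object CustomFunctionExample {
  // Hypothetical UDF body: number of UTF-8 bytes in the input string
  def utf8strlen(x: String): Int =
    if (x == null) 0 else x.getBytes("UTF-8").length

  private val strlenUdf = udf(utf8strlen _)

  // Adapt the UDF to the FunctionBuilder shape (Seq[Expression] => Expression)
  private def builder(children: Seq[Expression]): Expression =
    strlenUdf(children.map(new Column(_)): _*).expr

  val myfuncDesc = (
    FunctionIdentifier("utf8strlen"),
    new ExpressionInfo("noclass", "utf8strlen"),
    builder _)

  val f: SparkSessionExtensions => Unit = { e => e.injectFunction(myfuncDesc) }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .withExtensions(f)
      .getOrCreate()
    spark.sql("SELECT utf8strlen('hello')").show()   // returns 5
    spark.stop()
  }
}
```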
  • 35. How does Custom Function registration work? •  SparkSessionExtensions keeps track of the injectedFunctions •  During SessionCatalog creation, the injectedFunctions are registered in the functionRegistry –  See the class variable BaseSessionStateBuilder.functionRegistry –  See the method SimpleFunctionRegistry.registerFunction
  • 36. Things to Note •  The function registration order is the same as the order in which injectFunction is called •  There is no check during injection for whether a function with the same name already exists •  A warning is raised when a function replaces an existing function –  The check is based on a lowercase match of the function name •  Use SparkSession.catalog.listFunctions to look up your function •  The registered functions are temporary functions •  See the SimpleFunctionRegistry.registerFunction method
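A quick way to verify the notes above, assuming a session `spark` built with the extensions from the previous sketch:

```scala
// The injected function shows up in the catalog as a temporary function
spark.catalog.listFunctions()
  .filter("name = 'utf8strlen'")
  .show(truncate = false)
```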
  • 37. How to exclude an optimizer rule •  Spark 2.4 adds a new SQL conf: spark.sql.optimizer.excludedRules •  Specify the custom rule’s class name: session.conf.set( "spark.sql.optimizer.excludedRules", "org.mycompany.spark.MyCustomRule")
  • 38. Other ways to customize •  ExperimentalMethods –  Customize Physical Planning Strategies –  Customize Optimizer Rules •  Use the SparkSession.experimental method –  spark.experimental.extraStrategies •  Added at the beginning of the strategies in SparkPlanner –  spark.experimental.extraOptimizations •  Added after all the batches in SparkOptimizer
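For comparison, a minimal sketch of the ExperimentalMethods route, reusing the hypothetical MyOptimizerRule and IdxStrategy classes from the earlier sketches; both fields are plain vars on an existing SparkSession:

```scala
// Hypothetical fragment: `spark` is an already-created SparkSession
spark.experimental.extraOptimizations = Seq(MyOptimizerRule(spark))
spark.experimental.extraStrategies    = Seq(IdxStrategy(spark))
```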
  • 39. Things to Note •  ExperimentalMethods –  Rules are injected at a different location than with the Extension Points API –  So use this only if it is advantageous for your use case •  Recommendation: Use the Extension Points API
  • 41. SPARK-26249: API Enhancements •  Motivation –  Lack of fine-grained control over rule execution order –  Add batches in a specific order •  Add support to the extensions API –  Inject an optimizer rule in a specific order –  Inject an optimizer batch
  • 42. Inject Optimizer Rule in Order •  Inject a rule after or before an existing rule in a given existing batch in the Optimizer def injectOptimizerRuleInOrder( builder: RuleBuilder, batchName: String, ruleOrder: Order.Order, existingRule: String): Unit
  • 43. Inject Optimizer Batch •  Inject a batch of optimizer rules •  Specify the order where you want to inject the batch def injectOptimizerBatch( batchName: String, maxIterations: Int, existingBatchName: String, order: Order.Value, rules: Seq[RuleBuilder]): Unit
  • 44. End to End Use Case
  • 45. Use case: GroupBy Push Down Through Join •  If the join is an RI join, heuristically push the Group By down to the fact table –  The input to the Group By remains the same before and after the join –  The input to the join is reduced –  Overall reduction of the execution time •  Applicability: the aggregate functions are on fact table columns, the grouping columns are a superset of the join columns, and the joins are PK–FK joins
  • 46. Group By Push Down Through Join — Execution plan transformation •  Query execution drops from 70 secs to 30 secs (1TB TPC-DS setup), a 2x improvement •  Query: retrieve the minimum quantity of items sold between 2000 and 2002, grouped by customer and store information •  Star schema: store_sales is the fact table, joined N:1 to customer, date_dim, and store

select c_customer_sk, c_first_name, c_last_name, s_store_sk, s_store_name,
       min(ss.ss_quantity) as store_sales_quantity
from store_sales ss, date_dim, customer, store
where d_date_sk = ss_sold_date_sk
  and c_customer_sk = ss_customer_sk
  and s_store_sk = ss_store_sk
  and d_year between 2000 and 2002
group by c_customer_sk, c_first_name, c_last_name, s_store_sk, s_store_name
order by c_customer_sk, c_first_name, c_last_name, s_store_sk, s_store_name
limit 100;
  • 47. Optimized Query Plan: Explain — the Group By (Aggregate) is pushed below the joins to customer and store (expression IDs and relation column lists elided):

== Optimized Logical Plan ==
GlobalLimit 100
+- LocalLimit 100
   +- Sort [c_customer_sk ASC NULLS FIRST, c_first_name ASC NULLS FIRST, c_last_name ASC NULLS FIRST, s_store_sk ASC NULLS FIRST, s_store_name ASC NULLS FIRST], true
      +- Project [c_customer_sk, c_first_name, c_last_name, s_store_sk, s_store_name, store_sales_quantity]
         +- Join Inner, (s_store_sk = ss_store_sk)
            :- Project [c_customer_sk, c_first_name, c_last_name, ss_store_sk, store_sales_quantity]
            :  +- Join Inner, (c_customer_sk = ss_customer_sk)
            :     :- Aggregate [ss_customer_sk, ss_store_sk], [ss_customer_sk, ss_store_sk, min(ss_quantity) AS store_sales_quantity]
            :     :  +- Project [ss_customer_sk, ss_store_sk, ss_quantity]
            :     :     +- Join Inner, (d_date_sk = ss_sold_date_sk)
            :     :        :- Project [ss_sold_date_sk, ss_customer_sk, ss_store_sk, ss_quantity]
            :     :        :  +- Filter ((isnotnull(ss_sold_date_sk) && isnotnull(ss_customer_sk)) && isnotnull(ss_store_sk))
            :     :        :     +- Relation[store_sales ...] parquet
            :     :        +- Project [d_date_sk]
            :     :           +- Filter (((isnotnull(d_year) && (d_year >= 2000)) && (d_year <= 2002)) && isnotnull(d_date_sk))
            :     :              +- Relation[date_dim ...] parquet
            :     +- Project [c_customer_sk, c_first_name, c_last_name]
            :        +- Filter isnotnull(c_customer_sk)
            :           +- Relation[customer ...] parquet
            +- Project [s_store_sk, s_store_name]
               +- Filter isnotnull(s_store_sk)
                  +- Relation[store ...] parquet

Group By is pushed below the Join
  • 48. Benefits of the Proposed Changes •  Implemented new GroupByPushDown optimization rule –  Benefit from RI constraints •  Used the Optimizer Customization •  Injected using injectOptimizerRuleInOrder e.injectOptimizerRuleInOrder( GroupByPushDown, "Operator Optimization before Inferring Filters", Order.after, "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate") •  Achieved 2X performance improvements
  • 49. Recap: How to Extend Spark •  Use the Extension Points API •  Five extension points •  Adding a rule is a 3-step process –  Implement your rule –  Implement your wrapper function and use the right inject method type ExtensionsBuilder = SparkSessionExtensions => Unit –  Plug in the wrapper function via the withExtensions method in SparkSession.Builder
  • 50. Resources •  https://developer.ibm.com/code/2017/11/30/learn-extension-points-apache-spark-extend-spark-catalyst-optimizer/ •  https://rtahboub.github.io/blog/2018/writing-customized-parser/ •  https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/SparkSessionExtensionSuite.scala •  https://issues.apache.org/jira/browse/SPARK-18127 •  https://issues.apache.org/jira/browse/SPARK-26249 •  http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf