Page1 © Hortonworks Inc. 2014
Cost-based query optimization in
Apache Hive
Julian Hyde
June 4th, 2014
Page2 © Hortonworks Inc. 2014
About me
Julian Hyde
Architect at Hortonworks
Open source:
• Founder & lead, Apache Optiq (query optimization framework)
• Founder & lead, Pentaho Mondrian (analysis engine)
• Committer, Apache Drill
• Contributor, Apache Hive
• Contributor, Cascading Lingual (SQL interface to Cascading)
Past:
• SQLstream (streaming SQL)
• Broadbase (data warehouse)
• Oracle (SQL kernel development)
Page3 © Hortonworks Inc. 2014
(Thanks to John Pullokkaran and Harish Butani for presentation content and actually doing the work.)
Page4 © Hortonworks Inc. 2014
Apache Hive
The original “SQL on Hadoop”
Undergoing extensive renovation
• Tez execution engine
• YARN execution environment
• Vectorized data representation
• Column-oriented data storage (ORC)
• ACID transactions
• SQL standards compliance
• SQL authorization model
• Cost-based query optimization (CBO): What? Why? How? When?
“Stinger Initiative”
Page5 © Hortonworks Inc. 2014
Incremental cutover to cost-based optimization
Release (date): remarks
Apache Hive 0.12 (October 2013)
• Rule-based optimizations
• No join reordering
• Main optimizations: predicate push-down & partition pruning
• Semantic info and operator tree tightly coupled
Apache Hive 0.13 (April 2014)
• “Old-style” JOIN & push-down conditions: … FROM t1, t2 WHERE …
• CBO just missed the deadline
HDP 2.1 (April 2014)
• Cost-based ordering of joins
• HIVE-6439 “Introduce CBO step in Semantic Analyzer”
• HIVE-5775 “Introduce Cost Based Optimizer in Hive”
Apache Hive 0.14 (?)
• CBO patches
• More rework of internals
• More cost-based features…
Page6 © Hortonworks Inc. 2014
Apache Optiq
(incubating)
Page7 © Hortonworks Inc. 2014
Apache Optiq
Apache incubator project since May, 2014
Query planning framework
• Extensible
• Usable standalone (JDBC) or embedded
Adoption
Lingual – SQL interface to Cascading
Apache Drill
Apache Hive
Adapters: Splunk, Spark, MongoDB, JDBC, CSV, Web tables, In-memory data
Page8 © Hortonworks Inc. 2014
Conventional DB architecture
Page9 © Hortonworks Inc. 2014
Optiq architecture
Page10 © Hortonworks Inc. 2014
Optiq – APIs and SPIs
Cost, statistics
RelOptCost
RelOptCostFactory
RelMetadataProvider
• RelMdColumnUniqueness
• RelMdDistinctRowCount
• RelMdSelectivity
SQL parser
SqlNode
SqlParser
SqlValidator
Transformation rules
RelOptRule
• MergeFilterRule
• PushAggregateThroughUnionRule
• RemoveCorrelationForScalarProjectRule
• 100+ more
Unification (materialized view)
Column trimming
Relational algebra
RelNode (operator)
• TableScan
• Filter
• Project
• Union
• Aggregate
• …
RelDataType (type)
RexNode (expression)
RelTrait (physical property)
• RelConvention (calling-convention)
• RelCollation (sortedness)
• TBD (bucketedness/distribution)
JDBC driver
Metadata
Schema
Table
Function
• TableFunction
• TableMacro
Page11 © Hortonworks Inc. 2014
Now… back to Hive
Page12 © Hortonworks Inc. 2014
CBO in Hive
Why cost-based optimization?
Ease of Use – Join Reordering
View Chaining
Ad hoc queries involving multiple views
Enables BI Tools as front ends to Hive
First version
Modest goal
Concrete results
Join re-ordering
Page13 © Hortonworks Inc. 2014
Query preparation – Hive 0.13
[Figure: pipeline with stages SQL parser, Semantic analyzer, Logical Optimizer, Physical Optimizer; intermediate forms Abstract Syntax Tree (AST), Annotated AST, Plan; input Hive SQL, output a Tuned Plan for Tez.]
Page14 © Hortonworks Inc. 2014
Query preparation – full CBO
[Figure: pipeline with stages SQL parser, Semantic analyzer, Translate to algebra, Optiq optimizer, Physical Optimizer; intermediate forms Abstract Syntax Tree (AST), Annotated AST, RelNode; input Hive SQL, output a Tuned Plan for Tez.]
Page15 © Hortonworks Inc. 2014
Query preparation – initial CBO
[Figure: pipeline with stages SQL parser, Semantic analyzer, Logical Optimizer, Physical Optimizer, plus a side path Translate to algebra → Optiq optimizer that yields an AST with optimized join-ordering; input Hive SQL, output a Tuned Plan for Tez.]
Page16 © Hortonworks Inc. 2014
Query Execution – The basics
SELECT R1.x
FROM R1
JOIN R2 ON R1.x = R2.x
JOIN R3 ON R1.x = R3.x AND R2.x = R3.x
WHERE R1.z > 10;
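[Figure: logical plan with project (π) and select (σ) above the joins of R1, R2 and R3, alongside the Hive operator tree: TS [R1], TS [R2], RS, RS, Shuffle Join, TS [R3], Map Join, Filter, FS.]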
Page17 © Hortonworks Inc. 2014
Query Optimization – Rule Based vs. Cost Based
[Figure: four alternative plans, each with project (π) and select (σ) above joins of R1, R2 and R3; the plans differ in the order in which the three relations are joined.]
Page18 © Hortonworks Inc. 2014
Introduction of CBO into Hive Planning
1. CBO enabled? If no: regular planning.
2. Generate plan without multi-way joins.
3. Can CBO handle the plan? A series of gating factors must hold to get a CBO plan:
• < 10 total join ops
• No outer joins
• No windowing, lateral views, or script op
If no: fall back to regular planning, as though CBO is disabled.
4. Pre-CBO optimizer:
• Predicate pushdown
• Partition pruning
• Column pruning
• Stats annotation
5. Column stats available? If no: fall back to regular planning.
6. Optiq-based planner: takes the Hive plan and produces a revised AST.
7. Regular planning route on the new AST, with CBO turned off.
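The gating factors on this slide can be read as a single decision function. The following is a minimal sketch in Python, not Hive code: the `QueryInfo` class, its field names and `choose_planning_route` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class QueryInfo:
    """Hypothetical summary of a query; the field names are illustrative only."""
    join_count: int
    has_outer_join: bool = False
    has_windowing: bool = False
    has_lateral_view: bool = False
    has_script_op: bool = False
    column_stats_available: bool = True


def choose_planning_route(cbo_enabled: bool, q: QueryInfo) -> str:
    """Return 'cbo' only if every gate passes; otherwise 'regular' (the fallback)."""
    if not cbo_enabled:
        return "regular"
    # Gate: can CBO handle the plan? (fewer than 10 joins, no unsupported operators)
    if q.join_count >= 10:
        return "regular"
    if q.has_outer_join or q.has_windowing or q.has_lateral_view or q.has_script_op:
        return "regular"
    # Gate: column statistics must be available.
    if not q.column_stats_available:
        return "regular"
    return "cbo"


print(choose_planning_route(True, QueryInfo(join_count=3)))                       # cbo
print(choose_planning_route(True, QueryInfo(join_count=3, has_outer_join=True)))  # regular
```

Every failed gate leads to the same fallback, which matches the slide's note that fallback behaves as though CBO is disabled.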
Page19 © Hortonworks Inc. 2014
Optiq Planner Process
Hive Plan → RelNode Converter (Hive Op → RelNode) and RexNode Converter (Hive Expr → RexNode) → RelNode Graph → Planner → Best RelNode Graph → AST Converter → Revised AST
Logical Plan: physical traits (table partitions/buckets); RedSink ops removed
1. Plan Graph
• Node for each node in the input plan
• Each node is a set of alternate sub-plans
• Set further divided into subsets, based on traits like sortedness
2. Rules
• Rule: specifies an operator sub-graph to match and logic to generate an equivalent ‘better’ sub-graph
• We only have join reordering rules
3. Cost Model
• RelNodes have cost (& cumulative cost)
• We only use cardinality for cost
4. Metadata Providers
• Used to plug in schema and cost formulas: selectivity, NDV calculations etc.
• We only added selectivity and NDV formulas; schema is only available at the node level
Rule Match Queue
• Add rule matches to the queue
• Apply rule match transformations to the plan graph
• Iterate for fixed iterations or until cost doesn’t change
• Match importance based on cost of RelNode and height
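The interplay of rules, cost model and rule match queue can be illustrated with a toy search loop. This is a simplified sketch, not Optiq's planner: it uses a plain FIFO queue instead of importance ordering, stops when the queue is empty or after a fixed number of iterations, and the names `optimize`, `swap_adjacent`, `toy_cost` and the row counts in `ROWS` are all made up for the example.

```python
from collections import deque
from typing import Callable, Hashable, Iterable, List

Plan = Hashable
Rule = Callable[[Plan], Iterable[Plan]]


def optimize(initial: Plan, rules: List[Rule], cost: Callable[[Plan], float],
             max_iterations: int = 100) -> Plan:
    """Explore plans equivalent to `initial` and return the cheapest one found."""
    seen = {initial}
    queue = deque([initial])      # plans whose rule matches are still pending
    best = initial
    for _ in range(max_iterations):
        if not queue:
            break                 # no matches left, so the best cost cannot change
        plan = queue.popleft()
        for rule in rules:
            for alt in rule(plan):
                if alt not in seen:
                    seen.add(alt)
                    queue.append(alt)
                    if cost(alt) < cost(best):
                        best = alt
    return best


# Toy setup: a plan is a left-deep join order, written as a tuple of table names.
ROWS = {"a": 1000, "b": 10, "c": 100}   # made-up row counts


def swap_adjacent(order):
    """Rule: yield every order obtained by swapping two neighbouring tables."""
    for i in range(len(order) - 1):
        yield order[:i] + (order[i + 1], order[i]) + order[i + 2:]


def toy_cost(order):
    """Sum of intermediate sizes, assuming each join keeps 1% of the cross product."""
    total, size = 0.0, ROWS[order[0]]
    for name in order[1:]:
        size = size * ROWS[name] / 100
        total += size
    return total


best = optimize(("a", "c", "b"), [swap_adjacent], toy_cost)
print(best, toy_cost(best))
```

As on the slide, cost here is driven only by cardinality; a fuller model would also account for CPU and I/O.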
Page20 © Hortonworks Inc. 2014
Join Reordering Rules
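1. Swap Join Rule: a ⋈ b = b ⋈ a
2. Push Join Through Join Rule: (a ⋈ b) ⋈ c = (a ⋈ c) ⋈ b
[Figure: what the rewritten tree "is really", given that the Optiq schema is position based.]
3. [Figure, label truncated in source ("So"): two join trees over a, b, c, d marked as not equal (≠).]
4. Pull Up Project above Join
[Figure: two equivalent join trees over a, b, c, d.]
5. Merge Projects
Added bonus: join permutations across sub-query blocks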
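The remark that the Optiq schema is position based explains why projects appear in these rules: swapping the inputs of a join changes the order of its output columns, so a projection is needed to restore the original order. A minimal sketch of that idea follows, assuming a join's output is the left input's columns followed by the right input's; the helper names `join_columns` and `swap_with_project` are hypothetical.

```python
def join_columns(left_cols, right_cols):
    """Output schema of a join in a position-based model: left columns, then right."""
    return list(left_cols) + list(right_cols)


def swap_with_project(left_cols, right_cols):
    """Swap the join inputs and return (swapped_schema, projection).

    projection[i] is the position in the swapped output that supplies
    column i of the original output, so applying it restores the order.
    """
    n_left, n_right = len(left_cols), len(right_cols)
    swapped = join_columns(right_cols, left_cols)
    projection = [n_right + i for i in range(n_left)] + list(range(n_right))
    return swapped, projection


a = ["a.x", "a.y"]
b = ["b.x"]
original = join_columns(a, b)               # ['a.x', 'a.y', 'b.x']
swapped, projection = swap_with_project(a, b)
restored = [swapped[i] for i in projection]

print(swapped)      # ['b.x', 'a.x', 'a.y']
print(projection)   # [1, 2, 0]
print(restored == original)   # True
```

Repeated swaps stack such projections on top of each other, which is one reason rules that pull projects above joins and merge adjacent projects are useful companions to the reordering rules.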
Page21 © Hortonworks Inc. 2014
Summary
Join re-ordering
Join cardinality is used for cost
All other operators are assumed to have tiny cost
Cardinality of filter, join, group-by is based on selectivity
Selectivity is computed based on number-of-distinct-values (NDV)
Table Stats and Column stats are required
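To make the link between NDV, selectivity and cardinality concrete, here is a small sketch that uses common textbook estimates: an equality filter keeps rows/NDV rows, and an equi-join produces |L|·|R| / max(NDV_L, NDV_R) rows. These formulas are an assumption for illustration; the slides do not state the exact formulas Hive uses, and all statistics below are made up.

```python
def filter_cardinality_eq(rows: float, ndv: int) -> float:
    """Rows surviving `col = constant`, assuming uniformly distributed values."""
    return rows / max(ndv, 1)


def join_cardinality(rows_left: float, rows_right: float,
                     ndv_left: float, ndv_right: float) -> float:
    """Rows produced by an equi-join on one key, using the max-NDV estimate."""
    return rows_left * rows_right / max(ndv_left, ndv_right, 1)


# Made-up statistics for a fact table and a date dimension.
sales_rows, sales_date_ndv = 1_000_000, 365
dates_rows, dates_key_ndv = 365, 365

unfiltered = join_cardinality(sales_rows, dates_rows, sales_date_ndv, dates_key_ndv)

# Restrict the dimension to one of 12 months before joining.
dates_in_month = filter_cardinality_eq(dates_rows, ndv=12)
# Assumption: the key's NDV cannot exceed the number of remaining rows.
filtered = join_cardinality(sales_rows, dates_in_month,
                            sales_date_ndv, min(dates_key_ndv, dates_in_month))

print(unfiltered)        # 1000000.0
print(round(filtered))   # 83333
```

Without table and column statistics none of these inputs exist, which is why the slide lists them as required.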
Current limitations
Only supports: filter, inner join, group-by, project, order-by, limit
Not all UDFs
Does not attempt all join permutations (e.g. bushy trees; 10-way joins or more)
May not work well for Bucket, SMB & Skew Joins
Page22 © Hortonworks Inc. 2014
TPC-DS Query 50
Joins the Store Sales and Store Returns fact tables.
Each of the fact tables is independently restricted by date.
Analysis is at Store grain, so this dimension is also joined in.
As specified, the query starts by joining the two fact tables.
select
s_store_name , .. other store details
,sum(case when (sr_returned_date_sk - ss_sold_date_sk <= 30 ) then 1 else 0 end) as `30 days`, …
from
store_sales ss,store_returns sr,store s ,date_dim d1 ,date_dim d2
where
d2.d_year = 2000 and d2.d_moy = 9
and ss.ss_ticket_number = sr.sr_ticket_number and ss.ss_item_sk = sr.sr_item_sk
and ss.ss_sold_date_sk = d1.d_date_sk and sr.sr_returned_date_sk = d2.d_date_sk
and ss.ss_customer_sk = sr.sr_customer_sk and ss.ss_store_sk = s.s_store_sk
group by store details
order by store details limit 100;
Join Graph
Page23 © Hortonworks Inc. 2014
TPC-DS Query 50
Specified
Join Tree
Non CBO Plan
CBO Plan
Page24 © Hortonworks Inc. 2014
TPC-DS Query 50
          Run 1   Run 2
Non CBO   53.1    53.4
CBO       22.5    21.9
• 1 year test
• > 10 mins for Non CBO
• CBO time was about the same
• Fact tables
• partitioned by Day,
• bucketed by Item
• Bucketing off
• Bucketing should help CBO plan.
• SR table much smaller. Better chance of Bucket Join in place of Shuffle Join.
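The run times in the table above translate into a speedup factor with a little arithmetic. The sketch below only divides the numbers shown on the slide; the slide does not state the unit, so none is assumed.

```python
non_cbo = [53.1, 53.4]   # Run 1, Run 2
cbo = [22.5, 21.9]       # Run 1, Run 2

# Speedup per run, and overall using the summed times of both runs.
per_run = [n / c for n, c in zip(non_cbo, cbo)]
overall = sum(non_cbo) / sum(cbo)

print([round(s, 2) for s in per_run])   # [2.36, 2.44]
print(round(overall, 2))                # 2.4
```

With only two runs per configuration, the figure is indicative rather than statistically robust.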
Join Ordering Cost Estimate
['d2', [[['store_sales', 'd1'], 'store_returns'], 'store']] 515074768.659
['d1', [[['store_sales', 'store'], 'store_returns'], 'd2']] 448155.355
…
['store_returns', 'd2'] 9938.93
['store_sales', 'store_returns'] 156727295.634
['d1', 'store_sales'] 123675664.449
Facts restricted to 3 months
Orderings considered by Planner
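The orderings listed above are nested pairs, so they can be read directly as Python lists. The sketch below compares only the two complete orderings shown on the slide (the elided rows are not filled in) and flattens the cheaper one into its table sequence; the helper name `leaves` is made up.

```python
# The two complete orderings shown on the slide, with their cost estimates.
full_orderings = [
    (['d2', [[['store_sales', 'd1'], 'store_returns'], 'store']], 515074768.659),
    (['d1', [[['store_sales', 'store'], 'store_returns'], 'd2']], 448155.355),
]


def leaves(tree):
    """Flatten a nested join tree into its tables, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [table for child in tree for table in leaves(child)]


best_tree, best_cost = min(full_orderings, key=lambda pair: pair[1])
worst_cost = max(cost for _, cost in full_orderings)

print(leaves(best_tree))
print(best_cost)
print(round(worst_cost / best_cost))   # ratio between the two estimates
```

The gap between the two estimates is about three orders of magnitude, which illustrates how strongly the estimated cost depends on join order.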
Page25 © Hortonworks Inc. 2014
TPC-DS Query 17
Joins the Store Sales, Store Returns and Catalog Sales fact tables.
Each of the fact tables is independently restricted by time.
Analysis is at Item and Store grain, so these dimensions are also joined in.
As specified, the query starts by joining the three fact tables.
select i_item_id
,i_item_desc
,s_state
,count(ss_quantity) as store_sales_quantitycount
,….
from store_sales ss ,store_returns sr, catalog_sales cs,
date_dim d1, date_dim d2, date_dim d3, store s, item i
where d1.d_quarter_name = '2000Q1'
and d1.d_date_sk = ss.ss_sold_date_sk
and i.i_item_sk = ss.ss_item_sk and …
group by i_item_id ,i_item_desc ,s_state
order by i_item_id ,i_item_desc, s_state
limit 100;
Page26 © Hortonworks Inc. 2014
TPC-DS Query 17
Specified
Join Tree
Non CBO Plan
CBO Plan
Page27 © Hortonworks Inc. 2014
TPC-DS Query 17
          Run 1    Run 2
Non CBO   100.71   127.53
CBO       50.9     44.52
• 1 year test
• > 10 mins for Non CBO
• CBO time was about the same
• Fact tables
• partitioned by Day,
• bucketed by Item
• Bucketing off
• Bucketing should help CBO plan.
• SR table much smaller. Better chance of Bucket Join in place of Shuffle Join.
Join Ordering Cost Estimate
['item', [[[[[['d2', 'store_returns'], 'store_sales'], 'catalog_sales'], 'd1'], 'd3'], 'store']] 3547898.061
…
['store_returns', 'd2'] 19224.71
['store_sales', 'store_returns'] 23057497.991
['d1', 'store_sales'] 26142.943
Facts restricted to 3 months
Orderings considered by Planner
Page28 © Hortonworks Inc. 2014
Next?
Outer joins
Scale to larger numbers of joins
Support all expressions (UDFs)
Join algorithm selection
Sortedness & distribution as a trait
Trait propagation
Better cost model
More statistics
Move all pre-planning and logical planning to Optiq
Use Optiq costs/statistics to help physical planning
Constant reduction & tree pruning
Rewrite query to use materialized view
Page29 © Hortonworks Inc. 2014
Thank you!
@julianhyde
http://hive.apache.org/
http://incubator.apache.org/projects/optiq.html

Contenu connexe

Tendances

Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Julian Hyde
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...Christian Tzolov
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache CalciteJordan Halterman
 
LLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveLLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveDataWorks Summit
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in SparkDatabricks
 
SQL for NoSQL and how Apache Calcite can help
SQL for NoSQL and how  Apache Calcite can helpSQL for NoSQL and how  Apache Calcite can help
SQL for NoSQL and how Apache Calcite can helpChristian Tzolov
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLDatabricks
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Julian Hyde
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
The evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityThe evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityJulian Hyde
 
Introduction to HiveQL
Introduction to HiveQLIntroduction to HiveQL
Introduction to HiveQLkristinferrier
 
Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Julian Hyde
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Stamatis Zampetakis
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and FutureDataWorks Summit
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkPatrick Wendell
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 

Tendances (20)

Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache Calcite
 
LLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveLLAP: long-lived execution in Hive
LLAP: long-lived execution in Hive
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
 
SQL for NoSQL and how Apache Calcite can help
SQL for NoSQL and how  Apache Calcite can helpSQL for NoSQL and how  Apache Calcite can help
SQL for NoSQL and how Apache Calcite can help
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQL
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
The evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityThe evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its Community
 
Introduction to HiveQL
Introduction to HiveQLIntroduction to HiveQL
Introduction to HiveQL
 
Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Hive tuning
Hive tuningHive tuning
Hive tuning
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 

Similaire à Cost-based query optimization in Apache Hive

Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Julian Hyde
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memoryJulian Hyde
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveXu Jiang
 
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Julian Hyde
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?DataWorks Summit
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoDataWorks Summit
 
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksGrega Kespret
 
Presentation v mware roi tco calculator
Presentation   v mware roi tco calculatorPresentation   v mware roi tco calculator
Presentation v mware roi tco calculatorsolarisyourep
 
2018 data warehouse features in spark
2018   data warehouse features in spark2018   data warehouse features in spark
2018 data warehouse features in sparkChester Chen
 
Querying Druid in SQL with Superset
Querying Druid in SQL with SupersetQuerying Druid in SQL with Superset
Querying Druid in SQL with SupersetDataWorks Summit
 
Apache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllApache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllMichael Mior
 
Use dependency injection to get Hadoop *out* of your application code
Use dependency injection to get Hadoop *out* of your application codeUse dependency injection to get Hadoop *out* of your application code
Use dependency injection to get Hadoop *out* of your application codeDataWorks Summit
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Spark Summit
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?DataWorks Summit
 
Maximizing Database Tuning in SAP SQL Anywhere
Maximizing Database Tuning in SAP SQL AnywhereMaximizing Database Tuning in SAP SQL Anywhere
Maximizing Database Tuning in SAP SQL AnywhereSAP Technology
 
Presto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performancePresto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performanceDataWorks Summit
 
Powerpivot web wordpress present
Powerpivot web wordpress presentPowerpivot web wordpress present
Powerpivot web wordpress presentMariAnne Woehrle
 
Accelerating query processing with materialized views in Apache Hive
Accelerating query processing with materialized views in Apache HiveAccelerating query processing with materialized views in Apache Hive
Accelerating query processing with materialized views in Apache HiveSahil Takiar
 
Big Data, Bigger Analytics
Big Data, Bigger AnalyticsBig Data, Bigger Analytics
Big Data, Bigger AnalyticsItzhak Kameli
 

Similaire à Cost-based query optimization in Apache Hive (20)

Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memory
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
 
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
 
Presentation v mware roi tco calculator
Presentation   v mware roi tco calculatorPresentation   v mware roi tco calculator
Presentation v mware roi tco calculator
 
2018 data warehouse features in spark
2018   data warehouse features in spark2018   data warehouse features in spark
2018 data warehouse features in spark
 
Querying Druid in SQL with Superset
Querying Druid in SQL with SupersetQuerying Druid in SQL with Superset
Querying Druid in SQL with Superset
 
Apache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllApache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them All
 
Use dependency injection to get Hadoop *out* of your application code
Use dependency injection to get Hadoop *out* of your application codeUse dependency injection to get Hadoop *out* of your application code
Use dependency injection to get Hadoop *out* of your application code
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?
 
Maximizing Database Tuning in SAP SQL Anywhere
Maximizing Database Tuning in SAP SQL AnywhereMaximizing Database Tuning in SAP SQL Anywhere
Maximizing Database Tuning in SAP SQL Anywhere
 
Presto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performancePresto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performance
 
Powerpivot web wordpress present
Powerpivot web wordpress presentPowerpivot web wordpress present
Powerpivot web wordpress present
 
Quark Virtualization Engine for Analytics
Quark Virtualization Engine for Analytics Quark Virtualization Engine for Analytics
Quark Virtualization Engine for Analytics
 
Accelerating query processing with materialized views in Apache Hive
Accelerating query processing with materialized views in Apache HiveAccelerating query processing with materialized views in Apache Hive
Accelerating query processing with materialized views in Apache Hive
 
Big Data, Bigger Analytics
Big Data, Bigger AnalyticsBig Data, Bigger Analytics
Big Data, Bigger Analytics
 

Plus de Julian Hyde

Building a semantic/metrics layer using Calcite
Building a semantic/metrics layer using CalciteBuilding a semantic/metrics layer using Calcite
Building a semantic/metrics layer using CalciteJulian Hyde
 
Cubing and Metrics in SQL, oh my!
Cubing and Metrics in SQL, oh my!Cubing and Metrics in SQL, oh my!
Cubing and Metrics in SQL, oh my!Julian Hyde
 
Adding measures to Calcite SQL
Adding measures to Calcite SQLAdding measures to Calcite SQL
Adding measures to Calcite SQLJulian Hyde
 
Morel, a data-parallel programming language
Morel, a data-parallel programming languageMorel, a data-parallel programming language
Morel, a data-parallel programming languageJulian Hyde
 
Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...Julian Hyde
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query LanguageJulian Hyde
 
What to expect when you're Incubating
What to expect when you're IncubatingWhat to expect when you're Incubating
What to expect when you're IncubatingJulian Hyde
 
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache CalciteOpen Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache CalciteJulian Hyde
 
Efficient spatial queries on vanilla databases
Efficient spatial queries on vanilla databasesEfficient spatial queries on vanilla databases
Efficient spatial queries on vanilla databasesJulian Hyde
 
Tactical data engineering
Tactical data engineeringTactical data engineering
Tactical data engineeringJulian Hyde
 
Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!Julian Hyde
 
Spatial query on vanilla databases
Spatial query on vanilla databasesSpatial query on vanilla databases
Spatial query on vanilla databasesJulian Hyde
 
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Julian Hyde
 
Lazy beats Smart and Fast
Lazy beats Smart and FastLazy beats Smart and Fast
Lazy beats Smart and FastJulian Hyde
 
Data profiling with Apache Calcite
Data profiling with Apache CalciteData profiling with Apache Calcite
Data profiling with Apache CalciteJulian Hyde
 
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteA smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteJulian Hyde
 
Data Profiling in Apache Calcite
Data Profiling in Apache CalciteData Profiling in Apache Calcite
Data Profiling in Apache CalciteJulian Hyde
 
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)Julian Hyde
 

Plus de Julian Hyde (20)

Building a semantic/metrics layer using Calcite
Building a semantic/metrics layer using CalciteBuilding a semantic/metrics layer using Calcite
Building a semantic/metrics layer using Calcite
 
Cubing and Metrics in SQL, oh my!
Cubing and Metrics in SQL, oh my!Cubing and Metrics in SQL, oh my!
Cubing and Metrics in SQL, oh my!
 

Cost-based query optimization in Apache Hive

  • 1. Page1 © Hortonworks Inc. 2014 Cost-based query optimization in Apache Hive. Julian Hyde, June 4th, 2014
  • 2. About me: Julian Hyde, Architect at Hortonworks. Open source: Founder & lead, Apache Optiq (query optimization framework); Founder & lead, Pentaho Mondrian (analysis engine); Committer, Apache Drill; Contributor, Apache Hive; Contributor, Cascading Lingual (SQL interface to Cascading). Past: SQLstream (streaming SQL); Broadbase (data warehouse); Oracle (SQL kernel development).
  • 3. (Thanks to John Pullokkaran and Harish Butani for presentation content and actually doing the work.)
  • 4. Apache Hive: the original “SQL on Hadoop”. Undergoing extensive renovation (the “Stinger Initiative”): Tez execution engine; YARN execution environment; vectorized data representation; column-oriented data storage (ORC); ACID transactions; SQL standards compliance; SQL authorization model; cost-based query optimization (CBO): What? Why? How? When?
  • 5. Incremental cutover to cost-based optimization
      Release | Date | Remarks
      Apache Hive 0.12 | October 2013 | Rule-based optimizations; no join reordering; main optimizations: predicate push-down & partition pruning; semantic info and operator tree tightly coupled
      Apache Hive 0.13 | April 2014 | “Old-style” JOIN & push-down conditions: … FROM t1, t2 WHERE …; CBO just missed the deadline
      HDP 2.1 | April 2014 | Cost-based ordering of joins; HIVE-6439 “Introduce CBO step in Semantic Analyzer”; HIVE-5775 “Introduce Cost Based Optimizer in Hive”
      Apache Hive 0.14 | ? | CBO patches; more rework of internals; more cost-based features…
  • 6. Apache Optiq (incubating)
  • 7. Apache Optiq: Apache incubator project since May, 2014. Query planning framework: extensible; usable standalone (JDBC) or embedded. Adoption: Lingual (SQL interface to Cascading), Apache Drill, Apache Hive. Adapters: Splunk, Spark, MongoDB, JDBC, CSV, Web tables, In-memory data.
  • 8. Conventional DB architecture
  • 9. Optiq architecture
  • 10. Optiq – APIs and SPIs. Cost, statistics: RelOptCost, RelOptCostFactory, RelMetadataProvider (RelMdColumnUniqueness, RelMdDistinctRowCount, RelMdSelectivity). SQL parser: SqlNode, SqlParser, SqlValidator. Transformation rules: RelOptRule (MergeFilterRule, PushAggregateThroughUnionRule, RemoveCorrelationForScalarProjectRule, 100+ more). Unification (materialized view). Column trimming. Relational algebra: RelNode (operator): TableScan, Filter, Project, Union, Aggregate, …; RelDataType (type); RexNode (expression); RelTrait (physical property): RelConvention (calling-convention), RelCollation (sortedness), TBD (bucketedness/distribution). JDBC driver. Metadata: Schema, Table, Function (TableFunction, TableMacro).
  • 11. Now… back to Hive
  • 12. CBO in Hive. Why cost-based optimization? Ease of use: join reordering; view chaining; ad hoc queries involving multiple views; enables BI tools as front ends to Hive. First version: modest goal; concrete results; join re-ordering.
  • 13. Query preparation – Hive 0.13. [Diagram] Stages: SQL parser, Semantic analyzer, Logical Optimizer, Physical Optimizer. Artifacts passed between stages: Hive SQL, Abstract Syntax Tree (AST), Annotated AST, Plan, Tuned Plan. Execution target: Tez.
  • 14. Query preparation – full CBO. [Diagram] Stages: SQL parser, Semantic analyzer, Translate to algebra, Optiq optimizer, Physical Optimizer. Artifacts passed between stages: Hive SQL, Abstract Syntax Tree (AST), Annotated AST, RelNode, Tuned Plan. Execution target: Tez.
  • 15. Query preparation – initial CBO. [Diagram] Stages: SQL parser, Semantic analyzer, Translate to algebra, Optiq optimizer, Logical Optimizer, Physical Optimizer. Artifacts passed between stages: Hive SQL, AST with optimized join-ordering, Tuned Plan. Execution target: Tez.
  • 16. Query Execution – The basics.
      SELECT R1.x
      FROM R1
      JOIN R2 ON R1.x = R2.x
      JOIN R3 ON R1.x = R3.x AND R2.x = R3.x
      WHERE R1.z > 10;
      [Diagram] Relational-algebra tree (π, σ and two joins over R1, R2, R3) and the corresponding operator plan: TS [R1], TS [R2], RS, RS, Shuffle Join, TS [R3], Map Join, Filter, FS.
  • 17. Query Optimization – Rule Based vs. Cost Based. [Diagram] Four alternative π/σ/join trees for the same query, with table orders (R1, R2, R3), (R1, R2, R3), (R1, R3, R2) and (R2, R3, R1).
  • 18. Introduction of CBO into Hive Planning. [Flowchart] A series of gating factors to get a CBO plan: cbo enabled?; generate plan w/o multi-way joins; can cbo handle the plan? (< 10 total Join Ops; no Outer Joins; no Windowing, Lateral Views, Script Op.); Pre CBO Optimizer (predicate pushdown, partition pruning, column pruning, stats annotation); col stats available? A “No” at any gate falls back to regular planning, as though cbo is disabled. Otherwise the Hive plan goes to the Optiq-based planner, which produces a revised AST; the regular planning route then runs on the new AST with CBO turned off.
  • 19. Optiq Planner Process. Flow: Hive Plan → converters (RelNode Converter: Hive Op → RelNode; RexNode Converter: Hive Expr → RexNode) → RelNode graph → Planner → best RelNode graph → AST Converter → revised AST. The converted plan is a logical plan: physical traits (table partitioning/buckets) are lost and RedSink ops are removed.
      1. Plan Graph: a node for each node in the input plan; each node is a set of alternate sub-plans; each set is further divided into subsets based on traits like sortedness.
      2. Rules: a rule specifies an operator sub-graph to match and logic to generate an equivalent ‘better’ sub-graph. We only have join reordering rules.
      3. Cost Model: RelNodes have cost (& cumulative cost). We only use cardinality for cost.
      4. Metadata Providers: used to plug in schema and cost formulas (selectivity, NDV calculations, etc.). We only added selectivity and NDV formulas; schema is only available at the node level.
      Rule Match Queue: add rule matches to the queue; apply rule match transformations to the plan graph; iterate for fixed iterations or until cost doesn’t change; match importance is based on cost of RelNode and height.
  • 20. Join Reordering Rules. 1. Swap Join Rule: a ⋈ b = b ⋈ a. 2. Push Join Through Join Rule [diagram: join trees over a, b, c]. 3. Optiq schema is position based, so a b c d ≠ a c d b [diagram]. 4. Pull Up Project above Join [diagram]. 5. Merge Projects. Added bonus: join permutations across sub-query blocks.
  • 21. Summary. Join re-ordering: join cardinality is used for cost; all other operators are assumed to have tiny cost; cardinality of filter, join, group-by is based on selectivity; selectivity is computed based on number-of-distinct-values (NDV); table stats and column stats are required. Current limitations: only supports filter, inner join, group-by, project, order-by, limit; not all UDFs; does not attempt all join permutations (e.g. bushy trees; 10-way joins or more); may not work well for Bucket, SMB & Skew Joins.
  • 22. TPC-DS Query 50. Joins the Store Sales and Store Returns fact tables. Each of the fact tables is independently restricted by date. Analysis is at Store grain, so this dimension is also joined in. As specified, the query starts by joining the 2 fact tables.
      select s_store_name, .. other store details,
        sum(case when (sr_returned_date_sk - ss_sold_date_sk <= 30) then 1 else 0 end) as `30 days`, …
      from store_sales ss, store_returns sr, store s, date_dim d1, date_dim d2
      where d2.d_year = 2000 and d2.d_moy = 9
        and ss.ss_ticket_number = sr.sr_ticket_number
        and ss.ss_item_sk = sr.sr_item_sk
        and ss.ss_sold_date_sk = d1.d_date_sk
        and sr.sr_returned_date_sk = d2.d_date_sk
        and ss.ss_customer_sk = sr.sr_customer_sk
        and ss.ss_store_sk = s.s_store_sk
      group by store details
      order by store details
      limit 100;
      [Diagram: Join Graph]
  • 23. TPC-DS Query 50. [Diagrams: Specified Join Tree; Non CBO Plan; CBO Plan]
  • 24. TPC-DS Query 50. Facts restricted to 3 months:
      Plan | Run 1 | Run 2
      Non CBO | 53.1 | 53.4
      CBO | 22.5 | 21.9
      1 year test: > 10 mins for Non CBO; CBO time was about the same. Fact tables: partitioned by Day, bucketed by Item. Bucketing off: bucketing should help the CBO plan; the SR table is much smaller, so there is a better chance of a Bucket Join in place of a Shuffle Join.
      Orderings considered by Planner:
      Join Ordering | Cost Estimate
      ['d2', [[['store_sales', 'd1'], 'store_returns'], 'store']] | 515074768.659
      ['d1', [[['store_sales', 'store'], 'store_returns'], 'd2']] | 448155.355
      …
      ['store_returns', 'd2'] | 9938.93
      ['store_sales', 'store_returns'] | 156727295.634
      ['d1', 'store_sales'] | 123675664.449
  • 25. TPC-DS Query 17. Joins the Store Sales, Store Returns and Catalog Sales fact tables. Each of the fact tables is independently restricted by time. Analysis is at Item and Store grain, so these dimensions are also joined in. As specified, the query starts by joining the 3 fact tables.
      select i_item_id, i_item_desc, s_state,
        count(ss_quantity) as store_sales_quantitycount, ….
      from store_sales ss, store_returns sr, catalog_sales cs,
        date_dim d1, date_dim d2, date_dim d3, store s, item i
      where d1.d_quarter_name = '2000Q1'
        and d1.d_date_sk = ss.ss_sold_date_sk
        and i.i_item_sk = ss.ss_item_sk
        and …
      group by i_item_id, i_item_desc, s_state
      order by i_item_id, i_item_desc, s_state
      limit 100;
  • 26. TPC-DS Query 17. [Diagrams: Specified Join Tree; Non CBO Plan; CBO Plan]
  • 27. TPC-DS Query 17. Facts restricted to 3 months:
      Plan | Run 1 | Run 2
      Non CBO | 100.71 | 127.53
      CBO | 50.9 | 44.52
      1 year test: > 10 mins for Non CBO; CBO time was about the same. Fact tables: partitioned by Day, bucketed by Item. Bucketing off: bucketing should help the CBO plan; the SR table is much smaller, so there is a better chance of a Bucket Join in place of a Shuffle Join.
      Orderings considered by Planner:
      Join Ordering | Cost Estimate
      ['item', [[[[[['d2', 'store_returns'], 'store_sales'], 'catalog_sales'], 'd1'], 'd3'], 'store']] | 3547898.061
      …
      ['store_returns', 'd2'] | 19224.71
      ['store_sales', 'store_returns'] | 23057497.991
      ['d1', 'store_sales'] | 26142.943
  • 28. Next? Outer joins; scale to larger numbers of joins; support all expressions (UDFs); join algorithm selection; sortedness & distribution as a trait; trait propagation; better cost model; more statistics; move all pre-planning and logical planning to Optiq; use Optiq costs/statistics to help physical planning; constant reduction & tree pruning; rewrite query to use materialized view.
  • 29. Thank you! @julianhyde http://hive.apache.org/ http://incubator.apache.org/projects/optiq.html
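
The summary slide says that join cardinality is the cost and that selectivity is derived from NDV. As a rough illustration only, the Python sketch below enumerates left-deep join orders for three tables and costs each order by the sum of its intermediate join cardinalities. The row counts, the NDV values, the `EDGE_NDV` table and every function name are hypothetical, and the estimate rows(L) × rows(R) / NDV is the common textbook formula, assumed here rather than taken from Hive's code.

```python
from itertools import permutations

# Hypothetical statistics, not figures from the talk.
ROWS = {"store_sales": 1_000_000, "store_returns": 100_000, "d2": 30}

# Hypothetical NDV of the join key for each pair of tables that share a
# join predicate. Pairs without an entry can only be cross-joined.
EDGE_NDV = {
    frozenset({"store_sales", "store_returns"}): 50_000,
    frozenset({"store_returns", "d2"}): 2_000,
}


def join_rows(joined, left_rows, right):
    """Estimate the row count of joining `right` to the tables in `joined`.

    Assumed textbook estimate: rows(L) * rows(R) / NDV for each join
    predicate that connects `right` to an already joined table.
    """
    divisor = 1
    for table in joined:
        divisor *= EDGE_NDV.get(frozenset({table, right}), 1)
    return left_rows * ROWS[right] / divisor


def plan_cost(order):
    """Cost of a left-deep plan: the sum of its join output cardinalities."""
    joined = [order[0]]
    rows = ROWS[order[0]]
    cost = 0
    for table in order[1:]:
        rows = join_rows(joined, rows, table)
        cost += rows
        joined.append(table)
    return cost


def best_order(tables):
    """Try every left-deep order and keep the cheapest one."""
    return min(permutations(tables), key=plan_cost)


if __name__ == "__main__":
    for order in permutations(ROWS):
        print(order, plan_cost(order))
    print("best:", best_order(list(ROWS)))
```

With these made-up numbers, joining the two fact tables first costs 2,030,000 rows, while joining `store_returns` to the filtered date dimension first costs 31,500. That is the same shape of result as the orderings table for Query 50, where ['store_returns', 'd2'] has by far the lowest estimate.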

Editor's notes

  1. Hive CBO didn’t quite make it into Apache Hive 0.13. This talk: What is CBO? Why are we putting it in Hive? How did we do it? When is it released? And what next?
  2. 0. Converters convert a Hive operator graph to an Optiq representation. In Optiq we have RelNodes and RexNodes in place of Operators and ExprNodes. The conversion creates a ‘logical’ plan: RedSinks are dropped, and physical traits like partitioning/bucketedness are lost. The Plan Graph is the central data structure of the Planner. There is a Node for each Node in the input Plan. A Node represents a Set of equivalent sub-graphs (plans). Each Set is further divided into Subsets based on traits; traits capture physical attributes like sortedness/bucketedness. Rules consist of a match graph template and an onMatch action. The action generates a ‘better’ equivalent plan, so rule match actions populate the Plan Graph Sets. Metadata Providers provide all metadata to the Planner: schema, but also cost formulas like selectivity and NDV calculations. RelNodes have cost; the cost model encapsulates cost calculations. The Rule Match Queue is a queue of rule matches. The Planner runs until the queue is empty or for a fixed number of iterations. Applying a RuleMatch adds to the Plan Graph and also adds new rule matches to the queue. RuleMatches are ordered by importance, which is based on RelNode cost and the distance of the Node from the root of the plan.
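
The planner loop described in this note can be sketched in a few lines of Python. This is a heavily simplified illustration, not Optiq's planner: it keeps one flat set of whole plans instead of per-node Sets and Subsets, ranks matches by plan cost only (the note also mentions distance from the root), and uses just two rules, a swap rule and a push-join-through-join rule. The cardinalities, the single selectivity constant and all names are hypothetical.

```python
import heapq
from itertools import count

# Hypothetical table cardinalities and one shared join selectivity of 1/100.
ROWS = {"a": 1000, "b": 10, "c": 100}
SELECTIVITY_DIVISOR = 100


def rows(plan):
    """Estimated row count; a plan is a table name or a (left, right) join."""
    if isinstance(plan, str):
        return ROWS[plan]
    left, right = plan
    return rows(left) * rows(right) / SELECTIVITY_DIVISOR


def cost(plan):
    """Cumulative cost: the sum of the output cardinalities of all joins."""
    if isinstance(plan, str):
        return 0
    left, right = plan
    return rows(plan) + cost(left) + cost(right)


def apply_rules(plan):
    """Yield the plans produced by matching each rule at the root of `plan`."""
    left, right = plan
    yield (right, left)                  # swap join rule
    if not isinstance(left, str):
        a, b = left
        yield ((a, right), b)            # push join through join rule


def rewrites(plan):
    """Yield every plan reachable by one rule application anywhere in `plan`."""
    if isinstance(plan, str):
        return
    yield from apply_rules(plan)
    left, right = plan
    for new_left in rewrites(left):
        yield (new_left, right)
    for new_right in rewrites(right):
        yield (left, new_right)


def optimize(initial, max_iterations=100):
    """Process the queue until it is empty or the iteration budget runs out."""
    seen = {initial}                     # equivalent plans found so far
    tie_breaker = count()                # keeps heap entries comparable
    queue = [(cost(initial), next(tie_breaker), initial)]
    best = initial
    for _ in range(max_iterations):
        if not queue:
            break
        _, _, plan = heapq.heappop(queue)    # cheapest plan is expanded first
        for new_plan in rewrites(plan):
            if new_plan in seen:
                continue
            seen.add(new_plan)
            heapq.heappush(queue, (cost(new_plan), next(tie_breaker), new_plan))
            if cost(new_plan) < cost(best):
                best = new_plan
    return best


if __name__ == "__main__":
    start = (("a", "b"), "c")
    best = optimize(start)
    print(start, cost(start))
    print(best, cost(best))
```

Starting from `(("a", "b"), "c")`, which costs 200 under these assumptions, the loop finds an ordering that joins `b` and `c` first and costs 110. Each applied rewrite both enlarges the set of known plans and pushes new candidates onto the queue, which is the behaviour the note attributes to RuleMatches.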