SlideShare a Scribd company logo
1 of 16
Putting the Sting in
Hive
Page 1
Alan F. Gates
@alanfgates
Stinger Overview
Page 2
•An initiative, not a project or product
•Includes changes to Hive and a new project Tez
•Two main goals
–Improve Hive performance 100x over Hive 0.10
–Extend Hive SQL to include features needed for
analytics
•Hive will support:
–BI tools connecting to Hadoop
–Analysts performing ad-hoc, interactive queries
–Still excellent at the large batch jobs it is used for today
© 2013 Hortonworks
Stinger Mileposts
Page 3
© 2013 Hortonworks
Stinger Phase 3
•Buffer Cache
•Cost Based
Optimizer
Stinger Phase 2
•YARN Resource Mgmnt
•Hive on Apache Tez
•Query Service
•Vectorized Operators
Stinger Phase 1
•Base Optimizations
•SQL Analytics
•ORCFile Format
1 2 Improve existing tools & preserve
investments
Enable Hive to support interactive
workloads
Released in
Hive 0.11
Current
Work
Roadmap
Hive Performance Gains in 0.11
Page 4
© 2013 Hortonworks
• Enable star joins by improving Hive’s map join (aka
broadcast join)
–Where possible do in single map only task
–When not possible push larger tables to separate tasks
• Collapse adjacent jobs where possible
–Hive has lots of M->MR type plans, collapse these to MR
–Collapse adjacent jobs on sufficiently similar keys when
feasible
–join followed by group
–join followed by order
–group followed by order
• Improvements in sort merge bucket (SMB) joins
Page ‹#›
© Hortonworks Inc. 2013
Before
Page 6
© Hortonworks Inc. 2013
After
Page 7
© Hortonworks Inc. 2013
Improvements in SMB Joins
• TPC-DS Query 82, Scale=200, 10 EC2 nodes (40 disks)
3257.692
2862.669
255.641
71.114
0
500
1000
1500
2000
2500
3000
3500
Query 82
Text
RCFile
Partitioned RCFile
Partitioned RCFile + Optimizations
Page 8
New Technologies in Hive
Page 9
© 2013 Hortonworks
• All covered in depth in other talks
– See Owen’s, Eric’s, and Jitendra’s talk ORC File & Vectorization at 4:25 today
• Tez – A new execution engine for relational tools such as Hive
– No need to use MapReduce, instead provides general DAG execution
– Data moved between tasks via socket, disk, or HDFS based on performance / re-
startability trade off
– Provides standing service to greatly reduce query start time
• ORCFile – A rewrite of RCFile
– Columnar
– Tightly integrated with Hive’s type model, including support for nested types
– Much better compression
– Supports projection and filter push down
• Vectorization – Rewriting operators to take advantage of modern
processors
– Based on work done in MonetDB
– Rewrite operators to radically reduce number of function calls, branch prediction
misses, and cache misses
© Hortonworks Inc. 2013
Standard Queries
Page 10
260
165
38
77
142
296
38 42
67
80
0
50
100
150
200
250
300
Query 27
Scale 200
Query 82
Scale 200
Query 27
Scale 1000
Query 82
Scale 1000
Query 27 Star Join
Query 82 Fact Table Join
Hive 0.10, RC File
Hive 0.11 CP, RC File
Hive 0.11 CP, ORC File
© Hortonworks Inc. 2013
Performance Trajectory
Page 11
1X
2X
12X
11X
21X
0X
5X
10X
15X
20X
25X
Hive 10
Text
Hive 10
RC
Hive 11
RC
Hive 11
ORC
Hive 11 CP
ORC, Tez…
Query 27 Speedup
1X
14X
44X
57X
78X
0X
10X
20X
30X
40X
50X
60X
70X
80X
90X
Hive 10
Text
Hive 10
RC
Hive 11
RC
Hive 11
ORC
Hive 11 CP
ORC, Tez
Query 82 Speedup
© Hortonworks Inc. 2013
Query 12 – Demonstrating MRR
Page 12
55 54
75
65
35 34
55
46
0
10
20
30
40
50
60
70
80
RC File
Scale 200
ORC File
Scale 200
RC File
Scale 1000
ORC File
Scale 1000
ElapsedTime(seconds)
Query 12 - MRR Optimization
Traditional
Map-Reduce
Tez Map
Reduce Reduce
Hive Performance Up Next
Page 13
© 2013 Hortonworks
• Push down start up time - even for queries that spend less than a
second running on the cluster, there is ~15 seconds of start up time
– Tez service will remove Hadoop startup issues
– Need to reduce time for the metadata access
– Need intelligent file caching so that hot tables can be kept in memory
• Keep working on the optimizer
– Y Smart work from Ohio State University
– Start using statistics to make intelligent decisions about how many mappers and
reducers to spawn – maybe in Hive, maybe in Tez
– Start using statistics to choose between competing plan options
• Buffer Cache
– Coordinate with HDFS team to determine caching strategy
Extending Hive SQL in 0.11
Page 14
© 2013 Hortonworks
• DECIMAL data type – for fixed precision calculation (e.g. currency)
• OVER clause
– PARTITION BY, ORDER BY, ROWS
BETWEEN/FOLLOWING/PRECEDING
– Works with existing aggregate functions
– New analytic and window functions added
– ROW_NUMBER, RANK, DENSE_RANK, LEAD, LAG, LEAD, FIRST_VALUE
, LAST_VALUE, NTILE, CUME_DIST, PERCENT_RANK
SELECT salesperson, AVG(salesprice) OVER
(PARTITION BY region ORDER BY date
ROWS BETWEEN 10 PRECEEDING AND 10 FOLLOWING)
FROM sales;
Extending Hive SQL Post 0.11
Page 15
© 2013 Hortonworks
• Subqueries in WHERE
– Non-correlated first
– [NOT] IN first, then extend to (in)equalities and EXISTS
• Datatype conformance – Hive has Java type model, add support for
SQL types:
– DATE
– CHAR() and VARCHAR()
– add precision and scale to decimal and float
– aliases for standard SQL types (BLOB = binary, CLOB = string, integer =
int, real/number = decimal)
• Security
– Add security checks to views, indices, functions, etc.
– Secure GRANT and REVOKE
Questions
Page 16
© 2013 Hortonworks

More Related Content

What's hot

Hive ACID Apache BigData 2016
Hive ACID Apache BigData 2016Hive ACID Apache BigData 2016
Hive ACID Apache BigData 2016alanfgates
 
A TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with PrestoA TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with PrestoYu Liu
 
Hive acid-updates-strata-sjc-feb-2015
Hive acid-updates-strata-sjc-feb-2015Hive acid-updates-strata-sjc-feb-2015
Hive acid-updates-strata-sjc-feb-2015alanfgates
 
Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014alanfgates
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem DataWorks Summit/Hadoop Summit
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloudgluent.
 
Llap: Locality is Dead
Llap: Locality is DeadLlap: Locality is Dead
Llap: Locality is Deadt3rmin4t0r
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_featuresAlberto Romero
 
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseApache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseDataWorks Summit/Hadoop Summit
 
HBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the ArtHBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the ArtMichael Stack
 
ApacheCon 2020 - Flink SQL in 2020: Time to show off!
ApacheCon 2020 - Flink SQL in 2020: Time to show off!ApacheCon 2020 - Flink SQL in 2020: Time to show off!
ApacheCon 2020 - Flink SQL in 2020: Time to show off!Timo Walther
 
Apache Ratis - In Search of a Usable Raft Library
Apache Ratis - In Search of a Usable Raft LibraryApache Ratis - In Search of a Usable Raft Library
Apache Ratis - In Search of a Usable Raft LibraryTsz-Wo (Nicholas) Sze
 
Hive - 1455: Cloud Storage
Hive - 1455: Cloud StorageHive - 1455: Cloud Storage
Hive - 1455: Cloud StorageHortonworks
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Data Con LA
 

What's hot (20)

Apache Hive on ACID
Apache Hive on ACIDApache Hive on ACID
Apache Hive on ACID
 
Hive ACID Apache BigData 2016
Hive ACID Apache BigData 2016Hive ACID Apache BigData 2016
Hive ACID Apache BigData 2016
 
A TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with PrestoA TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with Presto
 
Hive acid-updates-strata-sjc-feb-2015
Hive acid-updates-strata-sjc-feb-2015Hive acid-updates-strata-sjc-feb-2015
Hive acid-updates-strata-sjc-feb-2015
 
Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
 
Llap: Locality is Dead
Llap: Locality is DeadLlap: Locality is Dead
Llap: Locality is Dead
 
From Device to Data Center to Insights
From Device to Data Center to InsightsFrom Device to Data Center to Insights
From Device to Data Center to Insights
 
Evolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage SubsystemEvolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage Subsystem
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
 
HiveACIDPublic
HiveACIDPublicHiveACIDPublic
HiveACIDPublic
 
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseApache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
 
HBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the ArtHBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the Art
 
ApacheCon 2020 - Flink SQL in 2020: Time to show off!
ApacheCon 2020 - Flink SQL in 2020: Time to show off!ApacheCon 2020 - Flink SQL in 2020: Time to show off!
ApacheCon 2020 - Flink SQL in 2020: Time to show off!
 
Apache Ratis - In Search of a Usable Raft Library
Apache Ratis - In Search of a Usable Raft LibraryApache Ratis - In Search of a Usable Raft Library
Apache Ratis - In Search of a Usable Raft Library
 
Apache Hive ACID Project
Apache Hive ACID ProjectApache Hive ACID Project
Apache Hive ACID Project
 
Hive - 1455: Cloud Storage
Hive - 1455: Cloud StorageHive - 1455: Cloud Storage
Hive - 1455: Cloud Storage
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
 

Viewers also liked

Strata Stinger Talk October 2013
Strata Stinger Talk October 2013Strata Stinger Talk October 2013
Strata Stinger Talk October 2013alanfgates
 
Simply the best college best work
Simply the best   college best workSimply the best   college best work
Simply the best college best workCraig Skelly
 
Outline providing effectivefeedbacktoemployees (1)
Outline providing effectivefeedbacktoemployees (1)Outline providing effectivefeedbacktoemployees (1)
Outline providing effectivefeedbacktoemployees (1)Linnea Hanson
 
Hive2.0 big dataspain-nov-2016
Hive2.0 big dataspain-nov-2016Hive2.0 big dataspain-nov-2016
Hive2.0 big dataspain-nov-2016alanfgates
 
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016alanfgates
 
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceDataWorks Summit
 
Big data spain keynote nov 2016
Big data spain keynote nov 2016Big data spain keynote nov 2016
Big data spain keynote nov 2016alanfgates
 
Keynote apache bd-eu-nov-2016
Keynote apache bd-eu-nov-2016Keynote apache bd-eu-nov-2016
Keynote apache bd-eu-nov-2016alanfgates
 
Hortonworks apache training
Hortonworks apache trainingHortonworks apache training
Hortonworks apache trainingalanfgates
 
Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014alanfgates
 
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...DataWorks Summit/Hadoop Summit
 

Viewers also liked (17)

Strata Stinger Talk October 2013
Strata Stinger Talk October 2013Strata Stinger Talk October 2013
Strata Stinger Talk October 2013
 
Simply the best college best work
Simply the best   college best workSimply the best   college best work
Simply the best college best work
 
Bowling event
Bowling eventBowling event
Bowling event
 
Outline providing effectivefeedbacktoemployees (1)
Outline providing effectivefeedbacktoemployees (1)Outline providing effectivefeedbacktoemployees (1)
Outline providing effectivefeedbacktoemployees (1)
 
Hive2.0 big dataspain-nov-2016
Hive2.0 big dataspain-nov-2016Hive2.0 big dataspain-nov-2016
Hive2.0 big dataspain-nov-2016
 
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
 
Ninjutsu
NinjutsuNinjutsu
Ninjutsu
 
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
 
Big data spain keynote nov 2016
Big data spain keynote nov 2016Big data spain keynote nov 2016
Big data spain keynote nov 2016
 
Rpp reproduksi - copy (1)
Rpp reproduksi - copy (1)Rpp reproduksi - copy (1)
Rpp reproduksi - copy (1)
 
Keynote apache bd-eu-nov-2016
Keynote apache bd-eu-nov-2016Keynote apache bd-eu-nov-2016
Keynote apache bd-eu-nov-2016
 
Hortonworks apache training
Hortonworks apache trainingHortonworks apache training
Hortonworks apache training
 
Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014
 
Indexed Hive
Indexed HiveIndexed Hive
Indexed Hive
 
Brownian motion
Brownian motionBrownian motion
Brownian motion
 
Types dbms
Types dbmsTypes dbms
Types dbms
 
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
 

Similar to Stinger hadoop summit june 2013

April 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times Faster
April 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times FasterApril 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times Faster
April 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times FasterYahoo Developer Network
 
Stinger Initiative - Deep Dive
Stinger Initiative - Deep DiveStinger Initiative - Deep Dive
Stinger Initiative - Deep DiveHortonworks
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerhdhappy001
 
La big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixitLa big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixitData Con LA
 
Alan Gates, Hortonworks_Hadoop&SQL
Alan Gates, Hortonworks_Hadoop&SQLAlan Gates, Hortonworks_Hadoop&SQL
Alan Gates, Hortonworks_Hadoop&SQLThe Hive
 
Stinger Initiative: Leveraging Hive & Yarn for High-Performance/Interactive Q...
Stinger Initiative: Leveraging Hive & Yarn for High-Performance/Interactive Q...Stinger Initiative: Leveraging Hive & Yarn for High-Performance/Interactive Q...
Stinger Initiative: Leveraging Hive & Yarn for High-Performance/Interactive Q...Caserta
 
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...Global Business Events
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingTeddy Choi
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016StampedeCon
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3tcloudcomputing-tw
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksData Con LA
 
Hive big-data meetup
Hive big-data meetupHive big-data meetup
Hive big-data meetupRemus Rusanu
 
Architecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchArchitecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchHortonworks
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?DataWorks Summit
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoDataWorks Summit
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud ComputingFarzad Nozarian
 
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelt3rmin4t0r
 
A Scalable Data Transformation Framework using the Hadoop Ecosystem
A Scalable Data Transformation Framework using the Hadoop EcosystemA Scalable Data Transformation Framework using the Hadoop Ecosystem
A Scalable Data Transformation Framework using the Hadoop EcosystemSerendio Inc.
 

Similar to Stinger hadoop summit june 2013 (20)

April 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times Faster
April 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times FasterApril 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times Faster
April 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times Faster
 
Stinger Initiative - Deep Dive
Stinger Initiative - Deep DiveStinger Initiative - Deep Dive
Stinger Initiative - Deep Dive
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stinger
 
La big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixitLa big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixit
 
Alan Gates, Hortonworks_Hadoop&SQL
Alan Gates, Hortonworks_Hadoop&SQLAlan Gates, Hortonworks_Hadoop&SQL
Alan Gates, Hortonworks_Hadoop&SQL
 
Stinger Initiative: Leveraging Hive & Yarn for High-Performance/Interactive Q...
Stinger Initiative: Leveraging Hive & Yarn for High-Performance/Interactive Q...Stinger Initiative: Leveraging Hive & Yarn for High-Performance/Interactive Q...
Stinger Initiative: Leveraging Hive & Yarn for High-Performance/Interactive Q...
 
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point ...
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of Hortonworks
 
February 2014 HUG : Hive On Tez
February 2014 HUG : Hive On TezFebruary 2014 HUG : Hive On Tez
February 2014 HUG : Hive On Tez
 
Hive big-data meetup
Hive big-data meetupHive big-data meetup
Hive big-data meetup
 
Architecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchArchitecting the Future of Big Data and Search
Architecting the Future of Big Data and Search
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
 
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthel
 
A Scalable Data Transformation Framework using the Hadoop Ecosystem
A Scalable Data Transformation Framework using the Hadoop EcosystemA Scalable Data Transformation Framework using the Hadoop Ecosystem
A Scalable Data Transformation Framework using the Hadoop Ecosystem
 

Recently uploaded

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 

Recently uploaded (20)

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 

Stinger hadoop summit june 2013

  • 1. Putting the Sting in Hive Page 1 Alan F. Gates @alanfgates
  • 2. Stinger Overview Page 2 •An initiative, not a project or product •Includes changes to Hive and a new project Tez •Two main goals –Improve Hive performance 100x over Hive 0.10 –Extend Hive SQL to include features needed for analytics •Hive will support: –BI tools connecting to Hadoop –Analysts performing ad-hoc, interactive queries –Still excellent at the large batch jobs it is used for today © 2013 Hortonworks
  • 3. Stinger Mileposts Page 3 © 2013 Hortonworks Stinger Phase 3 •Buffer Cache •Cost Based Optimizer Stinger Phase 2 •YARN Resource Mgmnt •Hive on Apache Tez •Query Service •Vectorized Operators Stinger Phase 1 •Base Optimizations •SQL Analytics •ORCFile Format 1 2 Improve existing tools & preserve investments Enable Hive to support interactive workloads Released in Hive 0.11 Current Work Roadmap
  • 4. Hive Performance Gains in 0.11 Page 4 © 2013 Hortonworks • Enable star joins by improving Hive’s map join (aka broadcast join) –Where possible do in single map only task –When not possible push larger tables to separate tasks • Collapse adjacent jobs where possible –Hive has lots of M->MR type plans, collapse these to MR –Collapse adjacent jobs on sufficiently similar keys when feasible –join followed by group –join followed by order –group followed by order • Improvements in sort merge bucket (SMB) joins
  • 6. © Hortonworks Inc. 2013 Before Page 6
  • 7. © Hortonworks Inc. 2013 After Page 7
  • 8. © Hortonworks Inc. 2013 Improvements in SMB Joins • TPC-DS Query 82, Scale=200, 10 EC2 nodes (40 disks) 3257.692 2862.669 255.641 71.114 0 500 1000 1500 2000 2500 3000 3500 Query 82 Text RCFile Partitioned RCFile Partitioned RCFile + Optimizations Page 8
  • 9. New Technologies in Hive Page 9 © 2013 Hortonworks • All covered in depth in other talks – See Owen’s, Eric’s, and Jitendra’s talk ORC File & Vectorization at 4:25 today • Tez – A new execution engine for relational tools such as Hive – No need to use MapReduce, instead provides general DAG execution – Data moved between tasks via socket, disk, or HDFS based on performance / re- startability trade off – Provides standing service to greatly reduce query start time • ORCFile – A rewrite of RCFile – Columnar – Tightly integrated with Hive’s type model, including support for nested types – Much better compression – Supports projection and filter push down • Vectorization – Rewriting operators to take advantage of modern processors – Based on work done in MonetDB – Rewrite operators to radically reduce number of function calls, branch prediction misses, and cache misses
  • 10. © Hortonworks Inc. 2013 Standard Queries Page 10 260 165 38 77 142 296 38 42 67 80 0 50 100 150 200 250 300 Query 27 Scale 200 Query 82 Scale 200 Query 27 Scale 1000 Query 82 Scale 1000 Query 27 Star Join Query 82 Fact Table Join Hive 0.10, RC File Hive 0.11 CP, RC File Hive 0.11 CP, ORC File
  • 11. © Hortonworks Inc. 2013 Performance Trajectory Page 11 1X 2X 12X 11X 21X 0X 5X 10X 15X 20X 25X Hive 10 Text Hive 10 RC Hive 11 RC Hive 11 ORC Hive 11 CP ORC, Tez… Query 27 Speedup 1X 14X 44X 57X 78X 0X 10X 20X 30X 40X 50X 60X 70X 80X 90X Hive 10 Text Hive 10 RC Hive 11 RC Hive 11 ORC Hive 11 CP ORC, Tez Query 82 Speedup
  • 12. © Hortonworks Inc. 2013 Query 12 – Demonstrating MRR Page 12 55 54 75 65 35 34 55 46 0 10 20 30 40 50 60 70 80 RC File Scale 200 ORC File Scale 200 RC File Scale 1000 ORC File Scale 1000 ElapsedTime(seconds) Query 12 - MRR Optimization Traditional Map-Reduce Tez Map Reduce Reduce
  • 13. Hive Performance Up Next Page 13 © 2013 Hortonworks • Push down start up time - even for queries that spend less than a second running on the cluster, there is ~15 seconds of start up time – Tez service will remove Hadoop startup issues – Need to reduce time for the metadata access – Need intelligent file caching so that hot tables can be kept in memory • Keep working on the optimizer – Y Smart work from Ohio State University – Start using statistics to make intelligent decisions about how many mappers and reducers to spawn – maybe in Hive, maybe in Tez – Start using statistics to choose between competing plan options • Buffer Cache – Coordinate with HDFS team to determine caching strategy
  • 14. Extending Hive SQL in 0.11 Page 14 © 2013 Hortonworks • DECIMAL data type – for fixed precision calculation (e.g. currency) • OVER clause – PARTITION BY, ORDER BY, ROWS BETWEEN/FOLLOWING/PRECEDING – Works with existing aggregate functions – New analytic and window functions added – ROW_NUMBER, RANK, DENSE_RANK, LEAD, LAG, LEAD, FIRST_VALUE , LAST_VALUE, NTILE, CUME_DIST, PERCENT_RANK SELECT salesperson, AVG(salesprice) OVER (PARTITION BY region ORDER BY date ROWS BETWEEN 10 PRECEEDING AND 10 FOLLOWING) FROM sales;
  • 15. Extending Hive SQL Post 0.11 Page 15 © 2013 Hortonworks • Subqueries in WHERE – Non-correlated first – [NOT] IN first, then extend to (in)equalities and EXISTS • Datatype conformance – Hive has Java type model, add support for SQL types: – DATE – CHAR() and VARCHAR() – add precision and scale to decimal and float – aliases for standard SQL types (BLOB = binary, CLOB = string, integer = int, real/number = decimal) • Security – Add security checks to views, indices, functions, etc. – Secure GRANT and REVOKE

Editor's Notes

  1. Speedup (y-axis) as a ratio to Hive 10 Text. Bigger is better.
  2. Time (y-axis) in seconds. Smaller is better.