SlideShare une entreprise Scribd logo
1  sur  27
Page1 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
LLAP: Sub-Second Analytical Queries in Hive
Gunther Hagleitner
Page2 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Perf overview – a demo in a gif
Page3 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
4 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Hive 2 with LLAP: 5x Performance Boost at 10 TB Scale
0
2
4
6
8
10
12
14
16
18
0
500
1000
1500
2000
2500
Improvement(Factor)
Querytime,s
Hive 2 with LLAP: 5x Performance Improvement Across All Query Types
(10 TB, 10 Nodes (256GB/32 cores), TPC-DS Queries)
Hive 2 LLAP Improvement v Hive 1 (Factor)
5 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
0
5
10
15
20
25
30
35
40
45
50
0
50
100
150
200
250
Speedup(xFactor)
QueryTime(s)(LowerisBetter)
Hive 2 with LLAP averages 26x faster than Hive 1
Hive 1 / Tez Time (s) Hive 2 / LLAP Time(s) Speedup (x Factor)
Hive 2 with LLAP: 25+x Performance Boost: Interactive / 1TB Scale
6 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Apache Hive vs. Apache Impala at 10TB
 10TB scale on 10 identical
nodes.
 Hive and Impala showed
similar times on most
smaller queries.
 Hive scaled better,
completing larger queries
quicker with lower end-to-
end time for all queries
Highlights
1
10
100
1000
10000
Querytime,s
Impala 2.8 Hive 2.1
7 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Apache Hive vs. Presto on a partitioned 1TB dataset.
 Presto lacks basic
performance optimizations
like dynamic partition
pruning.
 On a real dataset / workload
Presto perform poorly
without full re-writes.
 Example: Query 55 without
re-writes = 185.17s, with re-
writes = 16s. LLAP = 1.37s.
Highlights
8 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Hive LLAP: Stable Performance under High Concurrency
4x Queries,
2.8x
Runtime
Difference
5x Queries,
4.6x
Runtime
Difference
Mark
Concurrent
Queries
Average
Runtime
5 7.76s
25 36.24s
100 102.89s
9 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Hive with LLAP: Linear Scale
Average and Maximum query times
Mixed SQL workload of simple and complex queries
1TB Dataset Size
0
20
40
60
80
100
120
8 16 32
Time(s)
Concurrent Users
Average Query Time by Concurrency, 1TB
Average Time: 8 Node Average Time: 16 Node
0
50
100
150
200
250
300
8 16 32
Time(s)
Concurrent Users
Maximum Query Time by Concurrency, 1TB
Max Time: 8 Node Max Time: 16 Node
Linear Scale:
Double Nodes,
Consistent Runtime
Page10 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Performance – cache on HDFS, 1Tb scale
20.2
18.3
17.6
16.2 16.2
16.8
14.5
11.3
0.00%
20.00%
40.00%
60.00%
80.00%
100.00%
0
5
10
15
20
25
24 26 28 30 32 34 36 38 40 42
Cachehitrate
Queryruntime,s
Cache size, Gb
Query runtime, after
unrelated workload
Query runtime, after
related workload
Metadata hit rate
Data hit rate
Page11 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Performance – cache on cloud storage
• Cloud storage, slower than
on-prem HDFS
• Large scan on 1Tb scale
• Local HD/SDD cache
• Massive improvement on slow
storage with little memory cost
0
50
100
150
200
250
300
350
400
Cold cache Hot cache
Runtime,s
Entire query
Scan portion
Page12 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Hive with LLAP on text data (CSV, logs, etc)
-
5.00
10.00
15.00
20.00
25.00
30.00
35.00
40.00
45.00
50.00
0
200
400
600
800
1000
1200
1400
Text: No Cache
Text: Cached
Improvement (Factor)
Page13 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Hive with LLAP: Architecture Overview
Deep
Storage
YARN Cluster
LLAP Daemon
Query
Executors
LLAP Daemon
Query
Executors
LLAP Daemon
Query
Executors
LLAP Daemon
Query
Executors
Query
Coordinators
Coord-
inator
Coord-
inator
Coord-
inator
HiveServer2
(Query
Endpoint)
ODBC /
JDBC
SQL
Queries
In-Memory Cache
(Shared Across All Users)
HDFS and
Compatible
S3 WASB Isilon
Page14 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
I/OExecution
LLAP: In-memory processing
Decoder
Distributed FS
Fragment
Hive
operator
Hive
operator
Vectorized
processin
g
col1 col2
Low-level pushdown
(e.g. filter)*
SSD/HD cache
Off-heap cache Compression
codec
Native data
vectors
Compact encoded data
Compressed data
col1
col2
Page15 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Parallel queries – priorities, preemption
• Lower-priority fragments can be preempted
• For example, a fragment can start running before its inputs are
ready, for better pipelining; such fragments may be preempted
• LLAP work queue examines the DAG parameters to give
preference to interactive (BI) queries
LLAP
QueueExecutor
Executor
Interactive
query map 1/3
…
Interactive
query map 3/3
Executor
Interactive
query map 2/3
Wide query
reduce waiting
Time
ClusterUtilization
Long-Running Query
Short-Running Query
Page16 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Sidebar: Relational data node
• LLAP can provide a "relational datanode" view of the data
• Standard interface to read tables directly from LLAP nodes
• Supports column-level security
• LLAP handles ACID, schema-evolution
• Optimized for fast access/scans
• Read horizontal table partitions in parallel on compute nodes
• Push projections, filters, aggregates into LLAP
• Utilizes cache and LLAP IO subsystem
Page17 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Example - SparkSQL integration
• Allowing direct access to Hive
table data from Spark breaks
security model for warehouses
secured using Ranger or SQL
Standard Authorization
• Other features may not work
correctly unless support added in
Spark
• Row/Cell level security (when
implemented), ACID, Schema
Evolution, …
• SparkSQL has interfaces for external
sources; they can be implemented to
access Hive data via HS2/LLAP
• SparkSQL Catalyst optimizer hooks
can be used to push processing into
LLAP (e.g. filters on the tables)
• Some optimizations, like dynamic partition
pruning, can be used by Hive even if SparkSQL
doesn't support them; if execution pushdown
is advanced enough
Page18 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Example - SparkSQL integration – execution flow
Page 18
HadoopRDD.compute()
LlapInputFormat.getRecordReader()
- Open socket for incoming data
- Send package/plan to LLAP
Check permissions
Generate splits w/LLAP locations
Return securely signed splits
HS2/Metastore
Spark Executor
HadoopRDD.getPartitions()
Get Hive splits
Cluster nodes
Verify splits
Run query plan
Send data back
LLAP
Partitions/
Splits
Request
splits
: :
var llapContext =
LlapContext.newInstance(
sparkContext, jdbcUrl)
var df: DataFrame =
llapContext.sql("select *
from tpch_text_5.region")
DataFrame for Hive/LLAP data
Page19 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
LLAP Availability
• First version shipped in Apache Hive 2.0
• Hive 2.0.1 contains important bug fixes to make it production ready
• Hive 2.1 has new features and further perf improvements
• HDP 2.5 includes beta version of LLAP
• HDP 2.6 includes GA version of Apache Hive 2.1 w/ LLAP
Page20 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Setup: Ambari
• Simple setup
• Integrated into Hive configuration page
• Launches HS2 instance and LLAP cluster
• Handles configuration, slider, security,
alerts and monitoring, …
• Requires Ambari 2.5.x
• Requires HDP 2.6.x stack (includes Apache
Hive 2.1 w/ LLAP)
Page21 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Setup: Cloud
• Spin up LLAP clusters in minutes
• Exposes HS2 endpoint
• JDBC/ODBC
• Zeppelin/Hive view also options
• RDS + S3 friendly
• Use HD cache for fast S3 access
• RDS for shared metastore instance
• HDP 2.6 available now!
Page22 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Monitoring
• Simple UI for monitoring
• JMX endpoint with more data
• Logs and jstack endpoints
Page23 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Monitoring
• HDP ships with number of default
Grafana dashboards for LLAP
• Show wealth of metrics for cache,
compute and scheduling
• Helps to identify bottlenecks and issues
• Helps with capacity planning
• Easy to extend or integrate with other
systems
Page24 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Watching queries – Tez UI integration
Page25 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Future work
• Fine-grained workload management
• Reservations & priorities
• Configurable guardrails
• Maximum permissible scan range, runtime, cost
• Integrated with CBO’s cost model
• Admin tools
• Identify hotspots & data layout issues
• Easy reporting on health, performance and utilization
Page26 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Summary
• LLAP is a new execution substrate for fast, concurrent
analytical workloads, harnessing Hive vectorized SQL
engine and efficient in-memory caching layer
• Provides secure relational view of the data through a
simple API
• Available in Hive 2, integrated with ecosystem (Ambari,
Cloud, Grafana, jmx, Hive & Tez views)
Page27 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Questions?
?
Interested? Stop by the Hortonworks booth to learn more

Contenu connexe

Tendances

Major advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL complianceMajor advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL complianceDataWorks Summit/Hadoop Summit
 
A TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with PrestoA TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with PrestoYu Liu
 
Unifying Stream, SWL and CEP for Declarative Stream Processing with Apache Flink
Unifying Stream, SWL and CEP for Declarative Stream Processing with Apache FlinkUnifying Stream, SWL and CEP for Declarative Stream Processing with Apache Flink
Unifying Stream, SWL and CEP for Declarative Stream Processing with Apache FlinkDataWorks Summit/Hadoop Summit
 
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)Chris Nauroth
 
Sub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scaleSub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scaleYifeng Jiang
 
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseApache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseDataWorks Summit/Hadoop Summit
 
Llap: Locality is Dead
Llap: Locality is DeadLlap: Locality is Dead
Llap: Locality is Deadt3rmin4t0r
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem DataWorks Summit/Hadoop Summit
 
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016alanfgates
 
The Future of Apache Ambari
The Future of Apache AmbariThe Future of Apache Ambari
The Future of Apache AmbariDataWorks Summit
 
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksOverview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksDataWorks Summit/Hadoop Summit
 
Practice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China MobilePractice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China MobileDataWorks Summit
 
Network for the Large-scale Hadoop cluster at Yahoo! JAPAN
Network for the Large-scale Hadoop cluster at Yahoo! JAPANNetwork for the Large-scale Hadoop cluster at Yahoo! JAPAN
Network for the Large-scale Hadoop cluster at Yahoo! JAPANDataWorks Summit/Hadoop Summit
 

Tendances (20)

Major advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL complianceMajor advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL compliance
 
A TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with PrestoA TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with Presto
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
Unifying Stream, SWL and CEP for Declarative Stream Processing with Apache Flink
Unifying Stream, SWL and CEP for Declarative Stream Processing with Apache FlinkUnifying Stream, SWL and CEP for Declarative Stream Processing with Apache Flink
Unifying Stream, SWL and CEP for Declarative Stream Processing with Apache Flink
 
Migrating pipelines into Docker
Migrating pipelines into DockerMigrating pipelines into Docker
Migrating pipelines into Docker
 
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Sub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scaleSub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scale
 
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseApache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
 
Llap: Locality is Dead
Llap: Locality is DeadLlap: Locality is Dead
Llap: Locality is Dead
 
From Device to Data Center to Insights
From Device to Data Center to InsightsFrom Device to Data Center to Insights
From Device to Data Center to Insights
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
 
The Future of Apache Ambari
The Future of Apache AmbariThe Future of Apache Ambari
The Future of Apache Ambari
 
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksOverview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
 
A Multi Colored YARN
A Multi Colored YARNA Multi Colored YARN
A Multi Colored YARN
 
Practice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China MobilePractice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China Mobile
 
Network for the Large-scale Hadoop cluster at Yahoo! JAPAN
Network for the Large-scale Hadoop cluster at Yahoo! JAPANNetwork for the Large-scale Hadoop cluster at Yahoo! JAPAN
Network for the Large-scale Hadoop cluster at Yahoo! JAPAN
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache Flink
 

Similaire à LLAP: Sub-Second Analytical Queries in Hive

Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_featuresAlberto Romero
 
LLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveLLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveDataWorks Summit
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseDataWorks Summit
 
Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Ashish Narasimham
 
Apache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, ScaleApache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, ScaleHortonworks
 
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善HortonworksJapan
 
What is New in Apache Hive 3.0?
What is New in Apache Hive 3.0?What is New in Apache Hive 3.0?
What is New in Apache Hive 3.0?DataWorks Summit
 
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019alanfgates
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksData Con LA
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?DataWorks Summit
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?DataWorks Summit
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoDataWorks Summit
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateDataWorks Summit
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinAlex Zeltov
 
Moving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudMoving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudDataWorks Summit/Hadoop Summit
 
Paris FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks PresentationParis FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks PresentationAbdelkrim Hadjidj
 

Similaire à LLAP: Sub-Second Analytical Queries in Hive (20)

LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
 
LLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveLLAP: long-lived execution in Hive
LLAP: long-lived execution in Hive
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
 
Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30
 
Hadoop 3 in a Nutshell
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
 
Apache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, ScaleApache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, Scale
 
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
What is New in Apache Hive 3.0?
What is New in Apache Hive 3.0?What is New in Apache Hive 3.0?
What is New in Apache Hive 3.0?
 
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of Hortonworks
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
 
Moving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudMoving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloud
 
Paris FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks PresentationParis FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks Presentation
 

Plus de DataWorks Summit/Hadoop Summit

Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerDataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformDataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLDataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...DataWorks Summit/Hadoop Summit
 

Plus de DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 

Dernier

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 

Dernier (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 

LLAP: Sub-Second Analytical Queries in Hive

  • 1. Page1 © Hortonworks Inc. 2011 – 2017. All Rights Reserved LLAP: Sub-Second Analytical Queries in Hive Gunther Hagleitner
  • 2. Page2 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Perf overview – a demo in a gif
  • 3. Page3 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
  • 4. 4 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Hive 2 with LLAP: 5x Performance Boost at 10 TB Scale 0 2 4 6 8 10 12 14 16 18 0 500 1000 1500 2000 2500 Improvement(Factor) Querytime,s Hive 2 with LLAP: 5x Performance Improvement Across All Query Types (10 TB, 10 Nodes (256GB/32 cores), TPC-DS Queries) Hive 2 LLAP Improvement v Hive 1 (Factor)
  • 5. 5 © Hortonworks Inc. 2011 – 2017. All Rights Reserved 0 5 10 15 20 25 30 35 40 45 50 0 50 100 150 200 250 Speedup(xFactor) QueryTime(s)(LowerisBetter) Hive 2 with LLAP averages 26x faster than Hive 1 Hive 1 / Tez Time (s) Hive 2 / LLAP Time(s) Speedup (x Factor) Hive 2 with LLAP: 25+x Performance Boost: Interactive / 1TB Scale
  • 6. 6 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Apache Hive vs. Apache Impala at 10TB  10TB scale on 10 identical nodes.  Hive and Impala showed similar times on most smaller queries.  Hive scaled better, completing larger queries quicker with lower end-to- end time for all queries Highlights 1 10 100 1000 10000 Querytime,s Impala 2.8 Hive 2.1
  • 7. 7 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Apache Hive vs. Presto on a partitioned 1TB dataset.  Presto lacks basic performance optimizations like dynamic partition pruning.  On a real dataset / workload Presto perform poorly without full re-writes.  Example: Query 55 without re-writes = 185.17s, with re- writes = 16s. LLAP = 1.37s. Highlights
  • 8. 8 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Hive LLAP: Stable Performance under High Concurrency 4x Queries, 2.8x Runtime Difference 5x Queries, 4.6x Runtime Difference Mark Concurrent Queries Average Runtime 5 7.76s 25 36.24s 100 102.89s
  • 9. 9 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Hive with LLAP: Linear Scale Average and Maximum query times Mixed SQL workload of simple and complex queries 1TB Dataset Size 0 20 40 60 80 100 120 8 16 32 Time(s) Concurrent Users Average Query Time by Concurrency, 1TB Average Time: 8 Node Average Time: 16 Node 0 50 100 150 200 250 300 8 16 32 Time(s) Concurrent Users Maximum Query Time by Concurrency, 1TB Max Time: 8 Node Max Time: 16 Node Linear Scale: Double Nodes, Consistent Runtime
  • 10. Page10 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Performance – cache on HDFS, 1Tb scale 20.2 18.3 17.6 16.2 16.2 16.8 14.5 11.3 0.00% 20.00% 40.00% 60.00% 80.00% 100.00% 0 5 10 15 20 25 24 26 28 30 32 34 36 38 40 42 Cachehitrate Queryruntime,s Cache size, Gb Query runtime, after unrelated workload Query runtime, after related workload Metadata hit rate Data hit rate
  • 11. Page11 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Performance – cache on cloud storage • Cloud storage, slower than on-prem HDFS • Large scan on 1Tb scale • Local HD/SDD cache • Massive improvement on slow storage with little memory cost 0 50 100 150 200 250 300 350 400 Cold cache Hot cache Runtime,s Entire query Scan portion
  • 12. Page12 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Hive with LLAP on text data (CSV, logs, etc) - 5.00 10.00 15.00 20.00 25.00 30.00 35.00 40.00 45.00 50.00 0 200 400 600 800 1000 1200 1400 Text: No Cache Text: Cached Improvement (Factor)
  • 13. Page13 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Hive with LLAP: Architecture Overview Deep Storage YARN Cluster LLAP Daemon Query Executors LLAP Daemon Query Executors LLAP Daemon Query Executors LLAP Daemon Query Executors Query Coordinators Coord- inator Coord- inator Coord- inator HiveServer2 (Query Endpoint) ODBC / JDBC SQL Queries In-Memory Cache (Shared Across All Users) HDFS and Compatible S3 WASB Isilon
  • 14. Page14 © Hortonworks Inc. 2011 – 2017. All Rights Reserved I/OExecution LLAP: In-memory processing Decoder Distributed FS Fragment Hive operator Hive operator Vectorized processin g col1 col2 Low-level pushdown (e.g. filter)* SSD/HD cache Off-heap cache Compression codec Native data vectors Compact encoded data Compressed data col1 col2
  • 15. Page15 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Parallel queries – priorities, preemption • Lower-priority fragments can be preempted • For example, a fragment can start running before its inputs are ready, for better pipelining; such fragments may be preempted • LLAP work queue examines the DAG parameters to give preference to interactive (BI) queries LLAP QueueExecutor Executor Interactive query map 1/3 … Interactive query map 3/3 Executor Interactive query map 2/3 Wide query reduce waiting Time ClusterUtilization Long-Running Query Short-Running Query
  • 16. Page16 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Sidebar: Relational data node • LLAP can provide a "relational datanode" view of the data • Standard interface to read tables directly from LLAP nodes • Supports column-level security • LLAP handles ACID, schema-evolution • Optimized for fast access/scans • Read horizontal table partitions in parallel on compute nodes • Push projections, filters, aggregates into LLAP • Utilizes cache and LLAP IO subsystem
  • 17. Page17 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Example - SparkSQL integration • Allowing direct access to Hive table data from Spark breaks security model for warehouses secured using Ranger or SQL Standard Authorization • Other features may not work correctly unless support added in Spark • Row/Cell level security (when implemented), ACID, Schema Evolution, … • SparkSQL has interfaces for external sources; they can be implemented to access Hive data via HS2/LLAP • SparkSQL Catalyst optimizer hooks can be used to push processing into LLAP (e.g. filters on the tables) • Some optimizations, like dynamic partition pruning, can be used by Hive even if SparkSQL doesn't support them; if execution pushdown is advanced enough
  • 18. Page18 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Example - SparkSQL integration – execution flow Page 18 HadoopRDD.compute() LlapInputFormat.getRecordReader() - Open socket for incoming data - Send package/plan to LLAP Check permissions Generate splits w/LLAP locations Return securely signed splits HS2/Metastore Spark Executor HadoopRDD.getPartitions() Get Hive splits Cluster nodes Verify splits Run query plan Send data back LLAP Partitions/ Splits Request splits : : var llapContext = LlapContext.newInstance( sparkContext, jdbcUrl) var df: DataFrame = llapContext.sql("select * from tpch_text_5.region") DataFrame for Hive/LLAP data
  • 19. Page19 © Hortonworks Inc. 2011 – 2017. All Rights Reserved LLAP Availability • First version shipped in Apache Hive 2.0 • Hive 2.0.1 contains important bug fixes to make it production ready • Hive 2.1 has new features and further perf improvements • HDP 2.5 includes beta version of LLAP • HDP 2.6 includes GA version of Apache Hive 2.1 w/ LLAP
  • 20. Page20 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Setup: Ambari • Simple setup • Integrated into Hive configuration page • Launches HS2 instance and LLAP cluster • Handles configuration, slider, security, alerts and monitoring, … • Requires Ambari 2.5.x • Requires HDP 2.6.x stack (includes Apache Hive 2.1 w/ LLAP)
  • 21. Page21 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Setup: Cloud • Spin up LLAP clusters in minutes • Exposes HS2 endpoint • JDBC/ODBC • Zeppelin/Hive view also options • RDS + S3 friendly • Use HD cache for fast S3 access • RDS for shared metastore instance • HDP 2.6 available now!
  • 22. Page22 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Monitoring • Simple UI for monitoring • JMX endpoint with more data • Logs and jstack endpoints
  • 23. Page23 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Monitoring • HDP ships with number of default Grafana dashboards for LLAP • Show wealth of metrics for cache, compute and scheduling • Helps to identify bottlenecks and issues • Helps with capacity planning • Easy to extend or integrate with other systems
  • 24. Page24 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Watching queries – Tez UI integration
  • 25. Page25 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Future work • Fine-grained workload management • Reservations & priorities • Configurable guardrails • Maximum permissible scan range, runtime, cost • Integrated with CBO’s cost model • Admin tools • Identify hotspots & data layout issues • Easy reporting on health, performance and utilization
  • 26. Page26 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Summary • LLAP is a new execution substrate for fast, concurrent analytical workloads, harnessing Hive vectorized SQL engine and efficient in-memory caching layer • Provides secure relational view of the data through a simple API • Available in Hive 2, integrated with ecosystem (Ambari, Cloud, Grafana, jmx, Hive & Tez views)
  • 27. Page27 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Questions? ? Interested? Stop by the Hortonworks booth to learn more

Notes de l'éditeur

  1. Introduce myself Set stage for demo
  2. Llap off -> 10s Llap on -> < 1s
  3. Observations: -> same hive, same interface (only ‘mode’ difference) -> huge speed up, esp significant when working online (tableau, ad hoc) -> always on (+ cache, memory) v on demand -> why containers? Throughput, shared cluster Rest of presentation: More details about performance and behavior, then technical details
  4. TPCDS (decision support: ad-hoc, reporting, iterative OLAP, data mining) TPCDS, 10TB, Hive 2 (llap) v Hive 1 (containers) (availability: hive 2.0+ has llap) Benchmark on 10 nodes (There are new techniques to bring down the long running queries, not included (optimizer v runtime)) Things to note: -> red line is factor improvement (< 15x, average 5x), tremendous improvement across the board -> analytical queries (ad hoc, drill down, etc) very fast (< 5s, 10s) -> heavy lifting still takes time (other ways to handle) Average is 5x
  5. This time: Choose interactive (adhoc, iterative) queries, smaller data set. Avg is 25x, Blue hive 2 times hard to see on this graph. All hive 2 queries are low seconds again. -> Good for working w/ data set.
  6. HDP 2.5 ORC/ZLib, CDH 5.8 Parquet/Snappy
  7. Hive version = Hive 2.1 Presto version = Presto 0.163
  8. Additional details: Queries are 20 interactive queries taken from TPC-DS using 1 TB of data from TPC-DS schema. Queries run at random using a jmeter test harness Average moves linearly System tries to handle short queries first, w/o starving longer ones
  9. -> cache only speed up on hdfs: llap benefits always there, cache size is variable -> HDFS - this is speed up over: locality + buffer cache -> unrelated workload: query stays the same -> Metadata cache is immediately useful, even if cache is small -> Cache size doesn’t have to be as big as whole data set (pushdown proj + filter, hot data set) -> 2x speed up when all is cached
  10. Story is different when there is no locality
  11. - Text is a horrible format for analytical engines: verbose, low information density, row major, inefficient serialization of data - Typically: ETL step to transform text data to columnar format - LLAP: use it directly, first access will transform and cache the data Cache can be HD or SD (large cache on slower medium still preferable) Steady state: Similar performance to ORC
  12. Start left: No change to clients, HS2 still enpoint for ODBC, JDBC Number of coordinators, each one handles one query at a time (limits concurrency), run in yarn These coordinators break the query into “fragments (part of the plan + part of the data)” and schedule the work out to the daemons Coos let HS2 know the status of the query execution LLAP daemons consist of 2 parts IO subsystem: Cache + elevator, cache is shared Executors: Multi threaded query execution Storage can be either local (hdfs) or remote (s3, wasb, etc) -> Perf: Daemons + coos run permanently, no spin up cost Daemons + coos run fixed set of code: JIT gains Executors are multi threaded, more use of cluster Cache avoids reload of data Multi-threaded IO
  13. Left side: Fragment (part of plan + part of data) Operators need to work on data: -> setup will initialize operators + IO -> IO scan (table, range, filters, projections) -> pool of column vectors (consumer/producer) Pushdown: I/O has metadata cache (column index, min/max, bloom) Will jump over unnecessary data Cache: Read through + coherent Trade-off: Encoding (decoder + cpu v space), dictionary + runlength Off heap (GC), ssd/hd Eviction: lrfu (don’t swap entire table for 1 scan) Cache locality (multiple v wait) Text conversion! Performance: Threading: No more ping/pong, independent IO/proc Multiple fragments in same process JIT Push down Metadata cache: Don’t need to read headers/footers/index Primary data cache: Don’t touch disk
  14. Scheduling: Each llap daemon has queue of fragments Coordinators submit these fragments These fragments can be pre-empted even after they start running Might be running but waiting for input Short running query comes in to interrupt long running (to keep system responsive)
  15. - Think distributed result set or relational datanode - Granular security
  16. Quick way to setup on premise Pre-emption is required for guaranteed startup Pick queue, #nodes, #concurrent queries Done! Under the cover computes: Memory sizes, cache size, #executors, jvm parameters, etc etc All of that can be fine tuned (later or on advanced page) Will use slider to spin up llap, check RM UI and monitoring pages for more info
  17. Hortonworks data cloud Get it on the amazon marketplace, check hwx.com for details HDC allows for transient + permanent clusters, multiple use cases, single view on data EDW-Analytics based on LLAP