We discuss the current state of LLAP (Live Long and Process), an engine for concurrent, sub-second execution of analytical queries in Hive 2.0. LLAP is a hybrid execution model that enables performance improvements within and across queries, such as caching of columnar data with cache coherence and intelligent eviction for disaggregated storage models (like S3, Isilon, and Azure), JIT-friendly operator pipelines, asynchronous I/O, data pre-fetching, and multi-threaded processing. LLAP features robust tolerance of machine and service failures, achieved by building on top of time-tested fault-tolerant subsystems, as well as a concurrency-directed design that achieves high utilization with low latency: resources are shared, per-query overheads are reduced, and the system can preempt lower-priority tasks without failing any query in flight. The talk also covers the novel deployment model required for hybrid execution. The elasticity demands of the system are served by a long-lived YARN service interacting with on-demand elastic containers, serving as a tightly integrated DAG-based framework for query execution. We discuss the current state of the project, performance numbers, deployment and usage strategy, and future work, including how LLAP fits into a unified secure DataFrame access layer.
Observations:
-> same Hive, same interface (only a 'mode' difference)
-> huge speedup, especially significant when working online (Tableau, ad hoc)
-> always-on (plus cache, memory) vs. on-demand
-> why keep containers? Throughput, shared cluster
Rest of presentation: More details about performance and behavior, then technical details
TPC-DS (decision support: ad hoc, reporting, iterative OLAP, data mining)
TPC-DS, 10 TB, Hive 2 (LLAP) vs. Hive 1 (containers)
(availability: Hive 2.0+ has LLAP)
Benchmark on 10 nodes
(There are new techniques to bring down the long-running queries, not included here (optimizer vs. runtime improvements))
Things to note:
-> red line is the factor of improvement (up to 15x, average 5x); tremendous improvement across the board
-> analytical queries (ad hoc, drill-down, etc.) are very fast (under 5-10s)
-> heavy lifting still takes time (there are other ways to handle it)
Average is 5x
This time: choose interactive (ad hoc, iterative) queries on a smaller data set.
Average is 25x; the blue Hive 2 times are hard to see on this graph.
All Hive 2 queries are in the low seconds again.
-> Good for working with the data set interactively.
HDP 2.5 ORC/ZLib, CDH 5.8 Parquet/Snappy
Hive version = Hive 2.1
Presto version = Presto 0.163
Additional details:
Queries are 20 interactive queries taken from TPC-DS, using 1 TB of data in the TPC-DS schema.
Queries are run in random order using a JMeter test harness.
Average response time moves linearly (with the number of concurrent queries).
The system tries to handle short queries first, without starving longer ones.
-> on HDFS, the cache is the only speedup shown: the other LLAP benefits are always there; cache size is variable
-> on HDFS, this is the speedup over locality + OS buffer cache
-> unrelated workload: query performance stays the same
-> metadata cache is immediately useful, even if the cache is small
-> cache size doesn't have to be as big as the whole data set (pushdown of projections + filters, hot data set)
-> 2x speedup when everything is cached
The story is different when there is no locality.
- Text is a terrible format for analytical engines: verbose, low information density, row-major, inefficient serialization of data
- Typically: an ETL step transforms text data to a columnar format
- LLAP: use text directly; the first access will transform and cache the data
Cache can be on HDD or SSD (a large cache on a slower medium is still preferable)
Steady state: similar performance to ORC
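As a sketch of that first-access path: row-major text is parsed once, transposed into column vectors, and served from the cache afterwards. The dict cache and function names below are illustrative, not LLAP's actual internals.

```python
# Read-through text-to-columnar conversion with caching.
# The cache structure and names are hypothetical, not LLAP's API.
import csv
import io

_column_cache = {}  # (source_id, chunk_id) -> {column name: list of values}

def read_columns(source_id, text, columns):
    """Return the requested columns, converting and caching on first access."""
    key = (source_id, 0)  # single chunk for simplicity
    if key not in _column_cache:
        rows = list(csv.DictReader(io.StringIO(text)))
        # Row-major text -> column-major vectors: one list per column.
        _column_cache[key] = {name: [r[name] for r in rows] for name in rows[0]}
    cached = _column_cache[key]
    return {c: cached[c] for c in columns}
```

On the second access the text is never parsed again; only the cached column vectors are touched, which is why steady-state performance approaches ORC.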
Start at the left: no change to clients; HS2 is still the endpoint for ODBC/JDBC
A number of coordinators, each handling one query at a time (this limits concurrency), run in YARN
These coordinators break the query into "fragments" (part of the plan + part of the data) and schedule the work out to the daemons
Coordinators let HS2 know the status of the query execution
LLAP daemons consist of 2 parts:
IO subsystem: cache + elevator; the cache is shared
Executors: multi-threaded query execution
Storage can be either local (HDFS) or remote (S3, WASB, etc.)
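The coordinator/daemon split above can be sketched as follows; the `Fragment` fields and the round-robin placement are illustrative assumptions, not Hive's actual scheduler.

```python
# A coordinator splitting one plan stage into fragments and assigning them
# to daemons round-robin. Names and placement policy are illustrative.
from dataclasses import dataclass
from itertools import cycle

@dataclass
class Fragment:
    plan_part: str  # e.g. the scan+filter stage of the plan
    data_part: str  # e.g. an HDFS split or S3 byte range
    daemon: str     # daemon chosen to execute this fragment

def schedule(plan_stage, splits, daemons):
    """Pair each data split with the plan stage and pick a daemon for it."""
    assign = cycle(daemons)
    return [Fragment(plan_stage, split, next(assign)) for split in splits]
```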
-> Perf:
Daemons + coordinators run permanently, so there is no spin-up cost
Daemons + coordinators run a fixed set of code: JIT gains
Executors are multi-threaded: better use of the cluster
Cache avoids reload of data
Multi-threaded IO
Left side: fragment (part of the plan + part of the data)
Operators need to work on data:
-> setup initializes operators + IO
-> IO scans (table, range, filters, projections)
-> pool of column vectors (producer/consumer between IO and executors)
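The column-vector pool behaves like a bounded producer/consumer queue between the IO thread and an executor. A minimal sketch (pool size and batch contents are illustrative):

```python
# IO thread produces column-vector batches; an executor consumes them
# through a bounded pool. Sizes and the "work" done are illustrative.
import queue
import threading

POOL_SIZE = 4
pool = queue.Queue(maxsize=POOL_SIZE)  # bounded: IO blocks when executors lag
DONE = object()                        # sentinel marking end of the scan

def io_reader(batches):
    for batch in batches:   # each batch stands in for one column vector
        pool.put(batch)     # blocks if the pool is full
    pool.put(DONE)

def executor(results):
    while True:
        batch = pool.get()
        if batch is DONE:
            break
        results.append(sum(batch))  # stand-in for the operator pipeline

batches = [[1, 2], [3, 4], [5]]
results = []
reader = threading.Thread(target=io_reader, args=(batches,))
reader.start()
executor(results)
reader.join()
```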
Pushdown:
I/O has metadata cache (column index, min/max, bloom)
Will jump over unnecessary data
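The jump-over behavior can be sketched with per-stripe min/max statistics, in the style of ORC predicate pushdown (the stripe layout here is illustrative):

```python
# Skip stripes using cached min/max metadata: only stripes whose value
# range can overlap the filter [lo, hi] are read.

def stripes_to_read(stripe_stats, lo, hi):
    """stripe_stats: list of (min, max) per stripe.
    Returns indices of stripes that may contain matching rows."""
    return [
        i for i, (mn, mx) in enumerate(stripe_stats)
        if mx >= lo and mn <= hi
    ]
```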
Cache:
Read-through + coherent
Trade-off: data is cached encoded (decode CPU cost vs. space); dictionary + run-length encoding
Off-heap (avoids GC pressure), SSD/HDD
Eviction: LRFU (don't swap out the entire cache for one table scan)
Cache locality (run on multiple nodes vs. wait for the cached one)
Text conversion!
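A minimal sketch of LRFU scoring, which blends recency and frequency so a single large scan cannot flush frequently used data; the decay constant and class shape are illustrative, not LLAP's implementation:

```python
# LRFU ("least recently/frequently used") cache sketch. Each entry keeps a
# combined recency/frequency score that decays over time; the entry with
# the lowest decayed score is evicted. LAMBDA is an illustrative constant.
import math

LAMBDA = 0.5  # closer to 1.0 -> behaves like LRU; closer to 0.0 -> like LFU

class LRFUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.time = 0
        self.entries = {}  # key -> (score, last_access_time)

    def _weight(self, age):
        return math.pow(0.5, LAMBDA * age)

    def access(self, key):
        self.time += 1
        score, last = self.entries.get(key, (0.0, self.time))
        # Decay the old combined score, then add this hit.
        score = score * self._weight(self.time - last) + 1.0
        if key not in self.entries and len(self.entries) >= self.capacity:
            # Evict the entry with the lowest decayed score.
            victim = min(
                self.entries,
                key=lambda k: self.entries[k][0]
                * self._weight(self.time - self.entries[k][1]),
            )
            del self.entries[victim]
        self.entries[key] = (score, self.time)
```

Under pure LRU, a one-off scan of cold data would evict the hot working set; here a frequently accessed key keeps a high score and survives.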
Performance:
Threading: no more ping-pong between stages; IO and processing are independent
Multiple fragments in same process
JIT
Push down
Metadata cache: Don’t need to read headers/footers/index
Primary data cache: Don’t touch disk
Scheduling:
Each LLAP daemon has a queue of fragments
Coordinators submit these fragments
These fragments can be preempted even after they start running
A fragment might be running but waiting for input
A short-running query can come in and interrupt a long-running one (to keep the system responsive)
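A sketch of preemption at the daemon queue: a higher-priority (short-query) fragment can displace a running lower-priority one, which is requeued rather than failed, so the long query still finishes later. Priorities and the slot model are illustrative:

```python
# A daemon's fragment queue with preemption. Lower priority number = runs
# sooner. Preempted fragments go back to the wait queue instead of failing.
import heapq

class DaemonQueue:
    def __init__(self, executor_slots):
        self.slots = executor_slots
        self.waiting = []  # min-heap of (priority, fragment)
        self.running = []  # list of (priority, fragment)

    def submit(self, priority, fragment):
        if len(self.running) < self.slots:
            self.running.append((priority, fragment))
        elif priority < max(p for p, _ in self.running):
            # Preempt the lowest-priority running fragment and requeue it,
            # so the query it belongs to is delayed, not failed.
            victim = max(self.running)
            self.running.remove(victim)
            heapq.heappush(self.waiting, victim)
            self.running.append((priority, fragment))
        else:
            heapq.heappush(self.waiting, (priority, fragment))
```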
- Think distributed result set or relational datanode
- Granular security
Quick way to set up on-premise
Pre-emption is required for guaranteed startup
Pick a queue, the number of nodes, and the number of concurrent queries
Done!
Under the cover it computes: memory sizes, cache size, number of executors, JVM parameters, etc.
All of that can be fine-tuned (later, or on the advanced page)
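To illustrate the kind of arithmetic performed under the cover, a hypothetical sizing helper; every formula and default below is an illustrative guess, not what the setup wizard actually computes:

```python
# Hypothetical daemon sizing: split container memory between executors and
# cache. Defaults (4 GB/executor, 2 GB headroom) are illustrative only.

def size_daemon(container_mem_mb, cores,
                per_executor_mem_mb=4096, headroom_mb=2048):
    executors = cores  # assume one executor per core
    executor_mem = executors * per_executor_mem_mb
    cache_mb = container_mem_mb - executor_mem - headroom_mb
    return {
        "executors": executors,
        "xmx_mb": executor_mem + headroom_mb,   # on-heap JVM size
        "cache_mb": max(cache_mb, 0),           # off-heap cache budget
    }
```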
Uses Apache Slider to spin up LLAP; check the RM UI and monitoring pages for more info
Hortonworks Data Cloud
Get it on the Amazon Marketplace; check hwx.com for details
HDC allows for transient + permanent clusters, multiple use cases, and a single view of the data
EDW analytics based on LLAP