SlideShare une entreprise Scribd logo
1  sur  64
Page1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
LLAP: Building Cloud-First BI
Sergey Shelukhin
Page2 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
LLAP: Building Cloud-First BI
• What is LLAP? Overview
• Cloud-first BI
• Efficient, scalable multi-user execution and caching
• Secure
• Universal (not just for Hive)
• Production-ready tools
• How to run LLAP
Page3 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Overview (AKA the old slides)
Page4 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
What is LLAP?
• Hybrid model combining daemons and containers for
fast, concurrent execution of analytical workloads
(e.g. Hive SQL queries)
• Concurrent queries without specialized YARN queue setup
• Multi-threaded execution of vectorized operator pipelines
• Asynchronous IO and efficient in-memory caching
• Relational view of the data available thru the API
• High performance scans, code pushdown, centralized security
• Not an "execution engine" (like Tez, MR, Spark)
• Not a storage substrate – reads from HDFS/S3/…
Node
LLAP Process
Cache
Query Fragment
HDFS
Query Fragment
Page5 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
LLAP and Hive
• Transparent to users, BI tools, etc. – HS2/JDBC is the access point
Deep
Storage
YARN Cluster
HiveServer2
(Query
Endpoint)
ODBC,
JDBC SQL
Queries
LLAP Daemon
Query
Executors
LLAP Daemon
Query
Executors
LLAP Daemon
Query
Executors
LLAP Daemon
Query
Executors
Query
Coordinators
Coord-
inator
Coord-
inator
Coord-
inator
In-Memory Cache
(Shared Across All Users)
HDFS and
Compatible
S3 WASB Isilon
DAGs
Page6 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
LLAP and Hive
• LLAP is Hive
• Supports all file formats that Hive does
• Hive Operators used for processing, same compiler, etc.
• ⇒ Automatic support for most new optimizations and features
• HS2 controls the concurrent queries (session pool)
• Each Query coordinated by a Tez AM; LLAP hosts Tez shuffle
• Can run in parallel with container-based jobs
• After perf testing, we recommend running most Hive workloads in LLAP
• In this new model, the cluster capacity is divided proportionally if needed
Page7 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
LLAP in a BI system - overview
• No split brain – a single platform for ETL…
• Fault-tolerant components and proven scalability (runs TPCDS 100 Tb)
• ANSI SQL, CB Optimizer, ACID transactions
• …and BI
• Vectorized engine for fast processing
• Intelligent scheduling of different workloads running together
• Caching, incl. on local disk (SSD), to avoid expensive reads
• Zero-ETL analytics with efficient caching of text data
• External access, one metadata store, unified security
Page8 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Efficient BI queries (AKA the new slides)
Page9 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
LLAP
Queue
Short overview – execution
• LLAP daemon has a number of executors
(think containers) that run "fragments"
• Fragments are parts of multiple parallel
workloads (e.g. Hive SQL queries)
• Work queue with pluggable priority
• Geared towards low latency queries over long-
running queries (by default)
• I/O is similar to containers – read/write to
HDFS, shuffle, other storages and formats
• Streaming output for data API
Executor
Q1 Map 1
Executor
External read
Executor
Q3 Reducer 3
Q1 Map 1
Q1 Map 1
Q3 Map 19
HDFS
Waiting for
shuffle inputs
HBase
LLAP
(shuffle input)
Spark
executor
Page10 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Efficiency for individual queries
• Eliminates container startup costs
• JIT optimizer has a chance to work (esp. for vectorization)
• Data sharing (hash join tables, etc.)
0 5 10 15 20 25
New containers
Cold LLAP
Warm LLAP (no cache, new user)
4th run of the query (no cache,…
Time, sec
Page11 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Individual queries – LLAP vs Hive 1 (x26 faster)
0
5
10
15
20
25
30
35
40
45
50
0
50
100
150
200
250
Speedup(xFactor)
QueryTime(s)(LowerisBetter)
Hive 1 / Tez Time (s) Hive 2 / LLAP Time(s) Speedup (x Factor)
Page12 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hive + LLAP vs Impala; TPCDS 10Tb on 10 nodes
• LLAP - 180 GB memory size, 32 GB cache
2 hours, 24 mins(*)
Total Time in HDP 2.6
2 hours, 32 min(*)
Total Time in CDH 5.11
99 / 99
Unmodified TPC-DS Queries
Complete In HDP 2.6 at 10TB
57 / 99(**)
Unmodified TPC-DS Queries
Complete In CDH 5.11 at 10TB
(*) Among queries which complete in both engines.
(**) 3 runtime failures, 39 compile failures
Page13 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hive + LLAP vs Impala (log scale; lower is better)
1
10
100
1000
10000
Query64
Query13
Query93
Query91
Query57
Query88
Query47
Query85
Query99
Query51
Query17
Query62
Query43
Query89
Query75
Query48
Query28
Query46
Query90
Query11
Query96
Query97
Query15
Query1
Query53
Query69
Query63
Query3
Query25
Query26
Query74
Query50
Query7
Query34
Query59
Query60
Query9
Query79
Query71
Query30
Query33
Query65
Query31
Query73
Query76
Query61
Query52
Query29
Query55
Query39
Query42
Query68
Query2
Query4
Query19
Query49
Query56
Runtime(s)
HDP 2.6 CDH 5.11
Page14 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Parallel queries – priorities, preemption
• Lower-priority fragments can be preempted
• For example, a fragment can start running before its inputs are
ready, for better pipelining; such fragments may be preempted
• LLAP work queue examines the DAG parameters to give
preference to interactive (BI) queries
LLAP
QueueExecutor
Executor
Interactive
query map 1/3
…
Interactive
query map 3/3
Executor
Interactive
query map 2/3
Wide query
reduce waiting
Time
ClusterUtilization
Long-Running Query
Short-Running Query
Page15 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Efficiency for parallel queries
4x Queries,
2.8x
Runtime
Difference
5x Queries,
4.6x
Runtime
Difference
Concurrent
Queries
Average
Runtime
5 7.76s
25 36.24s
100 102.89s
Page16 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Central workload management (WIP)
• Easier query SLAs and priorities
• Separate ETL and BI queries
• Controls for massively parallel workloads
• Manage resource allocations
• Rules for different queues and users –
guaranteed resources allocations, etc.
• Multiple "equal" queries – FCFS, equal
distribution, ...
Theory
Practice
Page17 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Central workload management (WIP)
• The scheduling can be more flexible than e.g. YARN
• Scheduler is aware of query characteristics, fragment types, etc.
• Fragments easy to pre-empt compared to containers
• Runaway query prevention, dynamic policy-based priorities
• Is it BI or ETL? We can guess from query runtime, data consumed, etc.
• Ability to "pause" long queries, only allowing leftover resources
• Ability to size a BI query allocation to e.g. always run in one wave
• Intelligent utilization of slack
• Queries get guaranteed fractions of the cluster, but can use empty space
Page18 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Caching
Page19 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Caching for BI workloads - basics
• Fine-grained (columnar), compact (dictionary, RLE encoded)
• Important due to projections over many wide EDW tables
• Prioritized – indexes are cached with higher priority
• Important to make use of PPD for BI query filters
• Off-heap (no extra GC), supports SSD
• Saves on cloud reads
• LRFU replacement policy avoids the damage from large scans
• Since cache is most critical for BI queries, esp. on the cloud
Page20 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Caching for BI workloads - basics
• On-prem, does not make large queries on ORC much faster
• On S3/Azure though, reads are much more slow
• Even disk cache is still much faster than FS reads
• Especially if text is involved
Page21 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
I/O threads
In-memory processing – columnar
SSD cache
- work in
progress
*
Off-heap cache
Compact encoded data
Distributed FS
Compressed data
Decoder: ORC,
Parquet(*)
col1
col2
Compression
codec
Read planner
Execution thread
Fragment
Hive
operator
Hive
operator
Vectorized
processin
g
col1 col2
Native data
vectors
Replacement
policy
Page22 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Up to 10x speed up with a cloud FS (100 Tb dataset, SSD)
0
1
2
3
4
5
6
7
8
9
10
0
500
1000
1500
2000
2500
3000
query71
query19
query73
query82
query68
query91
query12
query90
query18
query85
query27
query45
query13
query40
query48
query89
query79
query58
query66
query15
query42
query98
query26
query20
query46
query7
query17
query34
query96
query25
query32
query28
query21
query49
query87
query84
query50
query95
query97
query64
query55
query52
query93
query39
query76
query92
query94
query88
query3
query43
Speedup
Time,sec
Cold
With cache
Speedup
10x
• 1.5x overall
• 2.1x small
queries
Page23 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Caching for BI workloads – zero-ETL
• Cache supports columnar data and text
• ORC is cached natively
• Parquet runs in LLAP without cache; caching support WIP
• Zero-ETL analytics on CSV and JSON data with text caching
• Text is efficiently encoded in background; once cached, queries speed up
Page24 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
I/O threads
Execution thread
In-memory processing – text – the first query
Fragment
Hive
operator
Hive
operator
Vectorized
processin
g
col1 col2
SSD cache
Native data
vectors
Off-heap cache
Compact encoded data
Distributed FS
Compressed data
Decoder: ORC Read planner
Compression
codec
"Decoder": text
col1 col2
Encoder: ORC lite
Page25 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
I/O threads
Execution thread
In-memory processing – text – the first query
Fragment
Hive
operator
Hive
operator
Vectorized
processin
g
col1 col2
SSD cache
Native data
vectors
Off-heap cache
Compact encoded data
Distributed FS
Compressed data
Decoder: ORC Read planner
Compression
codec
"Decoder": text
col1 col2
Encoder: ORC lite
Page26 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
I/O threads
Execution thread
In-memory processing – text – the second query
Fragment
Hive
operator
Hive
operator
Vectorized
processin
g
col1 col2
SSD cache
Native data
vectors
Off-heap cache
Compact encoded data
Distributed FS
Compressed data
Decoder: ORC
col2
Read planner
Compression
codec
"Decoder": text
Encoder: ORC lite
Page27 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Up to 38x speed up with text cache + cloud FS
Page28 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Caching for BI workloads – usage patterns
• Usually, the data is loaded periodically (15-60 minutes)
• Hive ACID is a good fit for such data loads
• The most recent data is the most accessed
• Avoiding hotspots – locality is configurable
• Non-strict locality "auto"-creates more replicas
• Strict cache locality is also possible (for filly in memory/on-SSD data)
• Locality is based on consistent hashing with some fault tolerance
• Automatically coherent cache – unique-file based
• Better ACID support (WIP) – updates without invalidating old base files
Page29 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
External access
Page30 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
External access – relational view for everyone
• Hive-on-Tez and other DAG executors can use LLAP directly
• LLAP also provides a "relational datanode" view of the data
• Anyone (with access) can push the (approved) code in, from
complex query fragments to simple data reads
• E.g. a Spark DataFrame can be created with LlapInputFormat
• Gives the external services the access to
• Hive data: centralized, secure data access
• Ability to read all Hive table types, like ACID transactional tables
• Hive features: from column-level security, to LLAP columnar cache
Page31 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
SparkSQL+LLAP example
• Ranger for Hive support cell-level security and masking
• With SparkSQL utilizing LLAP, this can be used from Spark
• More at https://hortonworks.com/blog/row-column-level-control-apache-spark/
Page32 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
SparkSQL+LLAP example
• Ranger for Hive support cell-level security and masking
• With SparkSQL utilizing LLAP, this can be used from Spark
• More at https://hortonworks.com/blog/row-column-level-control-apache-spark/
Page33 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
SparkSQL+LLAP example
• Ranger for Hive support cell-level security and masking
• With SparkSQL utilizing LLAP, this can be used from Spark
• More at https://hortonworks.com/blog/row-column-level-control-apache-spark/
Page34 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Security
Page35 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
LLAP security – the EDW usage patterns
• Single, centrally administered LLAP cluster for all users
• For now, separate ad hoc clusters cannot use Ambari
• Use Ranger
• Hive SQL standard auth is an option; doAs is not recommended
• Hive session AMs and LLAP run as hive superuser; managed by HS2
• HS2 serves as a central coordinator for security
• Beeline and JDBC access; no CLI (requires client kinit)
• HS2 checks permissions, enforces Ranger policies
• Coordinates the usual Hadoop security dance (tokens for tasks, etc.)
Page36 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
LLAP daemon
Task Jo
b
FS
SM T TK
LLAP security – internals with Tez
HS2
SM T TK
SessionSession
FS
Tez AM
T FS
ZK
Paths w/ACLs
T TK
Jo
b
Ranger
Policies
FS
LLAP daemon
Task Jo
b
FS
SM T TK
FS
Jo
b
F💓
FS
FS T🔐
T Task
Jo
b
Page37 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
LLAP security – external (Spark) additions
FS
LLAP daemon
Task Jo
b
FS
SM T TK
Spark Client
FS Jo
b
FS
T
🔏 Specs
K
HS2
SM T TK
Session
FS
K
🔏 TaskT
K
K
Page38 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Integration and tools
Page39 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
LLAP in Ambari (and HDP)
• Ambari 2.5 + HDP 2.6 = LLAP GA
• The latest recommended update is HDP 2.6.1.0
• Do not use Tech Preview versions; use GA
• Separate version of Hive, "Hive Interactive"
• Ambari 3.0 – no more separate versions
• Enable "Interactive Query" in Hive tab
• A default configuration is chosen; more on that later
• No Ambari? Will cover this later
Page40 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
LLAP on the cloud
• LLAP is in HDC (Hortonworks Data Cloud)
• For quick, automated cluster deployments on AWS
• Also available on Azure HDInsight
• Details and links at the end!
Page41 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Tez UI – queries and LLAP integration
Page42 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Tez UI – query swimlane
Page43 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Tez UI – DAG swimlane
Page44 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Tez UI – LLAP counters
Page45 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Tez UI – query information and debug data
Page46 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Monitoring
• LLAP exposes a UI for
monitoring
• Also has jmx endpoint with
much more data, logs and
jstack endpoints as usual
• Aggregate monitoring UI is
work in progress
Page47 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Debugging
• JMX view (:15002/jmx) contains many detailed metrics
• E.g. "ExecutorsState" shows the running tasks and their state
• Log lines for most operations are annotated with task attempt #
• By default, stores separate log files per query
• Logs can be downloaded (yarn logs ...) even for a running LLAP app
• Log file contains the statements annotated for the query
• File name contains session application ID and DAG# - "dag_TTTT_MMM_N"
- e.g. dag_1490656001509_4602_1 for Tez AM application_1490656001509_4602
Page48 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Running LLAP
Page49 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Making the best use of LLAP - summary
• Java 8, G1GC; some kernel configs, esp. for TCP connections
• Ambari comes with reasonable Hive perf configuration
• May still need tweaking for specific workload; more so w/o Ambari
• SSD and text cache require "advanced config" until Ambari 3.0
• LLAP cluster sizing in a nutshell
• AM per query (+ 1-2), AMs on all nodes; 2Gb RAM per AM, rest to LLAP
• (Executor+IO thread) per core, 3-4Gb RAM per executor, rest to cache
• Without Ambari, see hive --service llap command + Slider
Page50 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Making the best use of LLAP – OS and Java
• Java 8 strongly recommended
• G1 GC recommended (e.g. --args " -XX:+UseG1GC -XX:+ResizeTLAB -XX:-ResizePLAB")
• Kernel settings
sysctl -w net.core.somaxconn=16384;
echo "never" > /sys/kernel/mm/transparent_hugepage/enabled
echo "never" > /sys/kernel/mm/transparent_hugepage/defrag
/etc/init.d/nscd restart
Page51 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Making the best use of LLAP – cluster sizing
1. Pick the number of parallel queries (not just sessions)
• AM per query + some constant slack
2. Pick executors per node, and io.threadpool count (# of cores per node)
3. Determine total memory size for LLAP - subtract the AM(s) from each NM
(YARN memory per node)
• 1 AM = 2 Gb; best to spread the AMs across each node
4. Determine the Xmx for LLAP; one executor (core) == 3-4Gb
5. Determine cache size - from the total, take out Xmx + ~3Gb (if lower, 20%)
• ~3Gb for Java overhead, shared overhead
6. Tweak based on workload?
Page52 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Cluster sizing example
• 6 node cluster – 100Gb RAM each, 24 cores, NM Size = 96Gb, 25
concurrent queries
• 28 AMs; total AM memory = 28*2 = 56
• (96*6 – 28*2)/6 = 86.66 => "Memory per daemon" = 85Gb
• "LLAP heap size" (Xmx) = 3Gb*24 = 72Gb
• "Cache size" = 85Gb – 72Gb – 3Gb ~= 10Gb
Page53 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Making the best use of LLAP – Hive settings
• Ambari has a lot of this configured by default
• The basics - use ORC; enable PPD, configure mapjoin size, etc.
• Enable vectorization and consider new vectorization features
• hive.vectorized.execution.*.enabled – see the configuration file documentation
• Enable LLAP split locality - hive.llap.client.consistent.splits
• hive.llap.task.scheduler.locality.delay to tweak strict/relaxed locality
• Consider disabling CBO for interactive queries (test your queries!)
• Use parallel compilation in HS2 - hive.driver.parallel.compilation
• Shuffle improvement - tez.am.am-rm.heartbeat.interval-ms.max=5000
Page54 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Making the best use of LLAP – new cache stuff
• Text cache (not turned on in Ambari until 3.0!)
• hive.llap.io.encode.enabled,
hive.vectorized.use.(row|vector).serde.deserialize
• SSD cache (not turned on in Ambari until 3.0!)
• hive.llap.io.allocator.mmap; hive.llap.io.allocator.mmap.path
• hive.llap.io.memory.size controls the total cache size (on disk)
• You have to disable YARN memory check for now until YARN 2.8
• Cache on cloud FS
• hive.orc.splits.allow.synthetic.fileid,
hive.llap.cache.allow.synthetic.fileid
Page55 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
LLAP without Ambari
• Requires Hive 2.X (2.2-2.3 are coming soon); ZK, YARN, Slider
• hive --service llap generates a slider package
• Run this as the correct user – slider paths are user-specific! kinit on secure cluster
• Specify a name, # of instances; memory, cache size, etc.; see --help
• Generates run.sh to start the cluster (in --output) directory
• Or use --startImmediately in newer versions
• Queries can be run from HS2 and CLI; basic configuration:
• hive.execution.mode=llap, hive.llap.execution.mode=all,
hive.llap.io.enabled=true, hive.llap.daemon.service.hosts=@<cluster
name>
Page56 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Summary
LLAP provides a
• unified, managed and secure cloud-ready EDW
solution for BI and ETL workloads via a
• fast execution substrate harnessing Hive
vectorized SQL engine and
• efficient in-memory caching layer for columnar
and text data
Page57 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Try Hive LLAP Today on-prem or in the Cloud
Hortonworks Data Platform 2.6
Powered by 100% open source Apache Hadoop
http://hortonworks.com/downloads/
Hortonworks Data Cloud
Easy HDP on Amazon Web Services
http://hortonworks.com/products/cloud/aws/
Microsoft Azure HDInsight
A cloud Spark and Hadoop service for your enterprise
http://azure.microsoft.com/en-us/services/hdinsight/
Page58 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Questions?
?
Interested? Stop by the Hortonworks booth to learn more
Page59 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Backup slides
Page60 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
LLAP daemon
SM TK
LLAP security – internals with Tez
HS2
SM TK
Session
ZK
Paths w/hive ACLs
TK
Ranger
Policies
FS
LLAP daemon
SM TK
Tez AM
FS
Page61 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
LLAP daemon
SM T TK
LLAP security – internals with Tez
HS2
SM T TK
SessionSession
FS
ZK
Paths w/hive ACLs
T TK
Ranger
Policies
FS
LLAP daemon
SM T TK
T🔐
FS
T🔐
Tez AM
T FS
FS T🔐
Page62 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
LLAP daemon
Task Jo
b
FS
SM T TK
LLAP security – internals with Tez
HS2
SM T TK
SessionSession
FS
Tez AM
T FS
ZK
Paths w/hive ACLs
T TK
Jo
b
Ranger
Policies
FS
T Task
T Task
LLAP daemon
Task
SM T TK
FSJo
b
FS
Page63 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
LLAP daemon
Task Jo
b
FS
SM T TK
LLAP security – internals with Tez
HS2
SM T TK
SessionSession
FS
Tez AM
T FS
ZK
Paths w/hive ACLs
T TK
Jo
b
Ranger
Policies
FS
Jo
b
F💓 LLAP daemon
Task Jo
b
FS
SM T TK
FS
Jo
b
F💓 FS
Jo
b
T
Page64 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Future work
• Centralized workload management
• Better tool support for cluster management, SSD cache, etc.
• Improvements to Spark-LLAP integration
• Parquet cache support
• Memory management, cache improvements, faster cluster
startup time, other performance features

Contenu connexe

Plus de DataWorks Summit

Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...DataWorks Summit
 
Applying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real ProblemsApplying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real ProblemsDataWorks Summit
 
Open Source, Open Data: Driving Innovation in Smart Cities
Open Source, Open Data: Driving Innovation in Smart CitiesOpen Source, Open Data: Driving Innovation in Smart Cities
Open Source, Open Data: Driving Innovation in Smart CitiesDataWorks Summit
 
Data Protection in Hybrid Enterprise Data Lake Environment
Data Protection in Hybrid Enterprise Data Lake EnvironmentData Protection in Hybrid Enterprise Data Lake Environment
Data Protection in Hybrid Enterprise Data Lake EnvironmentDataWorks Summit
 
Big Data Technologies in Support of a Medical School Data Science Institute
Big Data Technologies in Support of a Medical School Data Science InstituteBig Data Technologies in Support of a Medical School Data Science Institute
Big Data Technologies in Support of a Medical School Data Science InstituteDataWorks Summit
 
Hadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native EraHadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native EraDataWorks Summit
 
Free Servers to Build Big Data System on: Bing’s Approach
Free Servers to Build Big Data System on: Bing’s ApproachFree Servers to Build Big Data System on: Bing’s Approach
Free Servers to Build Big Data System on: Bing’s ApproachDataWorks Summit
 
IoFMT – Internet of Fleet Management Things
IoFMT – Internet of Fleet Management ThingsIoFMT – Internet of Fleet Management Things
IoFMT – Internet of Fleet Management ThingsDataWorks Summit
 

Plus de DataWorks Summit (20)

Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
 
Applying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real ProblemsApplying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real Problems
 
Open Source, Open Data: Driving Innovation in Smart Cities
Open Source, Open Data: Driving Innovation in Smart CitiesOpen Source, Open Data: Driving Innovation in Smart Cities
Open Source, Open Data: Driving Innovation in Smart Cities
 
Data Protection in Hybrid Enterprise Data Lake Environment
Data Protection in Hybrid Enterprise Data Lake EnvironmentData Protection in Hybrid Enterprise Data Lake Environment
Data Protection in Hybrid Enterprise Data Lake Environment
 
Big Data Technologies in Support of a Medical School Data Science Institute
Big Data Technologies in Support of a Medical School Data Science InstituteBig Data Technologies in Support of a Medical School Data Science Institute
Big Data Technologies in Support of a Medical School Data Science Institute
 
Hadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native EraHadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native Era
 
Free Servers to Build Big Data System on: Bing’s Approach
Free Servers to Build Big Data System on: Bing’s ApproachFree Servers to Build Big Data System on: Bing’s Approach
Free Servers to Build Big Data System on: Bing’s Approach
 
IoFMT – Internet of Fleet Management Things
IoFMT – Internet of Fleet Management ThingsIoFMT – Internet of Fleet Management Things
IoFMT – Internet of Fleet Management Things
 

Dernier

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 

Dernier (20)

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 

LLAP: Building Cloud-First BI

  • 1. Page1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved LLAP: Building Cloud-First BI Sergey Shelukhin
  • 2. Page2 © Hortonworks Inc. 2011 – 2015. All Rights Reserved LLAP: Building Cloud-First BI • What is LLAP? Overview • Cloud-first BI • Efficient, scalable multi-user execution and caching • Secure • Universal (not just for Hive) • Production-ready tools • How to run LLAP
  • 3. Page3 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Overview (AKA the old slides)
  • 4. Page4 © Hortonworks Inc. 2011 – 2015. All Rights Reserved What is LLAP? • Hybrid model combining daemons and containers for fast, concurrent execution of analytical workloads (e.g. Hive SQL queries) • Concurrent queries without specialized YARN queue setup • Multi-threaded execution of vectorized operator pipelines • Asynchronous IO and efficient in-memory caching • Relational view of the data available thru the API • High performance scans, code pushdown, centralized security • Not an "execution engine" (like Tez, MR, Spark) • Not a storage substrate – reads from HDFS/S3/… Node LLAP Process Cache Query Fragment HDFS Query Fragment
  • 5. Page5 © Hortonworks Inc. 2011 – 2015. All Rights Reserved LLAP and Hive • Transparent to users, BI tools, etc. – HS2/JDBC is the access point Deep Storage YARN Cluster HiveServer2 (Query Endpoint) ODBC, JDBC SQL Queries LLAP Daemon Query Executors LLAP Daemon Query Executors LLAP Daemon Query Executors LLAP Daemon Query Executors Query Coordinators Coord- inator Coord- inator Coord- inator In-Memory Cache (Shared Across All Users) HDFS and Compatible S3 WASB Isilon DAGs
  • 6. Page6 © Hortonworks Inc. 2011 – 2015. All Rights Reserved LLAP and Hive • LLAP is Hive • Supports all file formats that Hive does • Hive Operators used for processing, same compiler, etc. • ⇒ Automatic support for most new optimizations and features • HS2 controls the concurrent queries (session pool) • Each Query coordinated by a Tez AM; LLAP hosts Tez shuffle • Can run in parallel with container-based jobs • After perf testing, we recommend running most Hive workloads in LLAP • In this new model, the cluster capacity is divided proportionally if needed
  • 7. Page7 © Hortonworks Inc. 2011 – 2015. All Rights Reserved LLAP in a BI system - overview • No split brain – a single platform for ETL… • Fault-tolerant components and proven scalability (runs TPCDS 100 Tb) • ANSI SQL, CB Optimizer, ACID transactions • …and BI • Vectorized engine for fast processing • Intelligent scheduling of different workloads running together • Caching, incl. on local disk (SSD), to avoid expensive reads • Zero-ETL analytics with efficient caching of text data • External access, one metadata store, unified security
  • 8. Page8 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Efficient BI queries (AKA the new slides)
  • 9. Page9 © Hortonworks Inc. 2011 – 2015. All Rights Reserved LLAP Queue Short overview – execution • LLAP daemon has a number of executors (think containers) that run "fragments" • Fragments are parts of multiple parallel workloads (e.g. Hive SQL queries) • Work queue with pluggable priority • Geared towards low latency queries over long- running queries (by default) • I/O is similar to containers – read/write to HDFS, shuffle, other storages and formats • Streaming output for data API Executor Q1 Map 1 Executor External read Executor Q3 Reducer 3 Q1 Map 1 Q1 Map 1 Q3 Map 19 HDFS Waiting for shuffle inputs HBase LLAP (shuffle input) Spark executor
  • 10. Page10 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Efficiency for individual queries • Eliminates container startup costs • JIT optimizer has a chance to work (esp. for vectorization) • Data sharing (hash join tables, etc.) 0 5 10 15 20 25 New containers Cold LLAP Warm LLAP (no cache, new user) 4th run of the query (no cache,… Time, sec
  • 11. Page11 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Individual queries – LLAP vs Hive 1 (x26 faster) 0 5 10 15 20 25 30 35 40 45 50 0 50 100 150 200 250 Speedup(xFactor) QueryTime(s)(LowerisBetter) Hive 1 / Tez Time (s) Hive 2 / LLAP Time(s) Speedup (x Factor)
  • 12. Page12 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Hive + LLAP vs Impala; TPCDS 10Tb on 10 nodes • LLAP - 180 GB memory size, 32 GB cache 2 hours, 24 mins(*) Total Time in HDP 2.6 2 hours, 32 min(*) Total Time in CDH 5.11 99 / 99 Unmodified TPC-DS Queries Complete In HDP 2.6 at 10TB 57 / 99(**) Unmodified TPC-DS Queries Complete In CDH 5.11 at 10TB (*) Among queries which complete in both engines. (**) 3 runtime failures, 39 compile failures
  • 13. Page13 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Hive + LLAP vs Impala (log scale; lower is better) 1 10 100 1000 10000 Query64 Query13 Query93 Query91 Query57 Query88 Query47 Query85 Query99 Query51 Query17 Query62 Query43 Query89 Query75 Query48 Query28 Query46 Query90 Query11 Query96 Query97 Query15 Query1 Query53 Query69 Query63 Query3 Query25 Query26 Query74 Query50 Query7 Query34 Query59 Query60 Query9 Query79 Query71 Query30 Query33 Query65 Query31 Query73 Query76 Query61 Query52 Query29 Query55 Query39 Query42 Query68 Query2 Query4 Query19 Query49 Query56 Runtime(s) HDP 2.6 CDH 5.11
  • 14. Page14 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Parallel queries – priorities, preemption • Lower-priority fragments can be preempted • For example, a fragment can start running before its inputs are ready, for better pipelining; such fragments may be preempted • LLAP work queue examines the DAG parameters to give preference to interactive (BI) queries LLAP QueueExecutor Executor Interactive query map 1/3 … Interactive query map 3/3 Executor Interactive query map 2/3 Wide query reduce waiting Time ClusterUtilization Long-Running Query Short-Running Query
  • 15. Page15 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Efficiency for parallel queries 4x Queries, 2.8x Runtime Difference 5x Queries, 4.6x Runtime Difference Concurrent Queries Average Runtime 5 7.76s 25 36.24s 100 102.89s
  • 16. Page16 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Central workload management (WIP) • Easier query SLAs and priorities • Separate ETL and BI queries • Controls for massively parallel workloads • Manage resource allocations • Rules for different queues and users – guaranteed resources allocations, etc. • Multiple "equal" queries – FCFS, equal distribution, ... Theory Practice
  • 17. Page17 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Central workload management (WIP) • The scheduling can be more flexible than e.g. YARN • Scheduler is aware of query characteristics, fragment types, etc. • Fragments easy to pre-empt compared to containers • Runaway query prevention, dynamic policy-based priorities • Is it BI or ETL? We can guess from query runtime, data consumed, etc. • Ability to "pause" long queries, only allowing leftover resources • Ability to size a BI query allocation to e.g. always run in one wave • Intelligent utilization of slack • Queries get guaranteed fractions of the cluster, but can use empty space
  • 18. Page18 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Caching
  • 19. Page19 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Caching for BI workloads - basics • Fine-grained (columnar), compact (dictionary, RLE encoded) • Important due to projections over many wide EDW tables • Prioritized – indexes are cached with higher priority • Important to make use of PPD for BI query filters • Off-heap (no extra GC), supports SSD • Saves on cloud reads • LRFU replacement policy avoids the damage from large scans • Since cache is most critical for BI queries, esp. on the cloud
  • 20. Page20 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Caching for BI workloads - basics • On-prem, does not make large queries on ORC much faster • On S3/Azure though, reads are much more slow • Even disk cache is still much faster than FS reads • Especially if text is involved
  • 21. Page21 © Hortonworks Inc. 2011 – 2015. All Rights Reserved I/O threads In-memory processing – columnar SSD cache - work in progress * Off-heap cache Compact encoded data Distributed FS Compressed data Decoder: ORC, Parquet(*) col1 col2 Compression codec Read planner Execution thread Fragment Hive operator Hive operator Vectorized processin g col1 col2 Native data vectors Replacement policy
  • 22. Page22 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Up to 10x speed up with a cloud FS (100 Tb dataset, SSD) 0 1 2 3 4 5 6 7 8 9 10 0 500 1000 1500 2000 2500 3000 query71 query19 query73 query82 query68 query91 query12 query90 query18 query85 query27 query45 query13 query40 query48 query89 query79 query58 query66 query15 query42 query98 query26 query20 query46 query7 query17 query34 query96 query25 query32 query28 query21 query49 query87 query84 query50 query95 query97 query64 query55 query52 query93 query39 query76 query92 query94 query88 query3 query43 Speedup Time,sec Cold With cache Speedup 10x • 1.5x overall • 2.1x small queries
  • 23. Page23 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Caching for BI workloads – zero-ETL • Cache supports columnar data and text • ORC is cached natively • Parquet runs in LLAP without cache; caching support WIP • Zero-ETL analytics on CSV and JSON data with text caching • Text is efficiently encoded in background; once cached, queries speed up
  • 24. Page24 © Hortonworks Inc. 2011 – 2015. All Rights Reserved I/O threads Execution thread In-memory processing – text – the first query Fragment Hive operator Hive operator Vectorized processin g col1 col2 SSD cache Native data vectors Off-heap cache Compact encoded data Distributed FS Compressed data Decoder: ORC Read planner Compression codec "Decoder": text col1 col2 Encoder: ORC lite
  • 25. Page25 © Hortonworks Inc. 2011 – 2015. All Rights Reserved I/O threads Execution thread In-memory processing – text – the first query Fragment Hive operator Hive operator Vectorized processin g col1 col2 SSD cache Native data vectors Off-heap cache Compact encoded data Distributed FS Compressed data Decoder: ORC Read planner Compression codec "Decoder": text col1 col2 Encoder: ORC lite
  • 26. Page26 © Hortonworks Inc. 2011 – 2015. All Rights Reserved I/O threads Execution thread In-memory processing – text – the second query Fragment Hive operator Hive operator Vectorized processin g col1 col2 SSD cache Native data vectors Off-heap cache Compact encoded data Distributed FS Compressed data Decoder: ORC col2 Read planner Compression codec "Decoder": text Encoder: ORC lite
  • 27. Page27 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Up to 38x speed up with text cache + cloud FS
  • 28. Page28 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Caching for BI workloads – usage patterns • Usually, the data is loaded periodically (15-60 minutes) • Hive ACID is a good fit for such data loads • The most recent data is the most accessed • Avoiding hotspots – locality is configurable • Non-strict locality "auto"-creates more replicas • Strict cache locality is also possible (for filly in memory/on-SSD data) • Locality is based on consistent hashing with some fault tolerance • Automatically coherent cache – unique-file based • Better ACID support (WIP) – updates without invalidating old base files
  • 29. Page29 © Hortonworks Inc. 2011 – 2015. All Rights Reserved External access
  • 30. Page30 © Hortonworks Inc. 2011 – 2015. All Rights Reserved External access – relational view for everyone • Hive-on-Tez and other DAG executors can use LLAP directly • LLAP also provides a "relational datanode" view of the data • Anyone (with access) can push the (approved) code in, from complex query fragments to simple data reads • E.g. a Spark DataFrame can be created with LlapInputFormat • Gives the external services the access to • Hive data: centralized, secure data access • Ability to read all Hive table types, like ACID transactional tables • Hive features: from column-level security, to LLAP columnar cache
  • 31. Page31 © Hortonworks Inc. 2011 – 2015. All Rights Reserved SparkSQL+LLAP example • Ranger for Hive support cell-level security and masking • With SparkSQL utilizing LLAP, this can be used from Spark • More at https://hortonworks.com/blog/row-column-level-control-apache-spark/
  • 32. Page32 © Hortonworks Inc. 2011 – 2015. All Rights Reserved SparkSQL+LLAP example • Ranger for Hive support cell-level security and masking • With SparkSQL utilizing LLAP, this can be used from Spark • More at https://hortonworks.com/blog/row-column-level-control-apache-spark/
  • 33. Page33 © Hortonworks Inc. 2011 – 2015. All Rights Reserved SparkSQL+LLAP example • Ranger for Hive support cell-level security and masking • With SparkSQL utilizing LLAP, this can be used from Spark • More at https://hortonworks.com/blog/row-column-level-control-apache-spark/
  • 34. Page34 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Security
  • 35. Page35 © Hortonworks Inc. 2011 – 2015. All Rights Reserved LLAP security – the EDW usage patterns • Single, centrally administered LLAP cluster for all users • For now, separate ad hoc clusters cannot use Ambari • Use Ranger • Hive SQL standard auth is an option; doAs is not recommended • Hive session AMs and LLAP run as hive superuser; managed by HS2 • HS2 serves as a central coordinator for security • Beeline and JDBC access; no CLI (requires client kinit) • HS2 checks permissions, enforces Ranger policies • Coordinates the usual Hadoop security dance (tokens for tasks, etc.)
  • 36. Page36 © Hortonworks Inc. 2011 – 2015. All Rights Reserved LLAP daemon Task Jo b FS SM T TK LLAP security – internals with Tez HS2 SM T TK SessionSession FS Tez AM T FS ZK Paths w/ACLs T TK Jo b Ranger Policies FS LLAP daemon Task Jo b FS SM T TK FS Jo b F💓 FS FS T🔐 T Task Jo b
  • 37. Page37 © Hortonworks Inc. 2011 – 2015. All Rights Reserved LLAP security – external (Spark) additions FS LLAP daemon Task Jo b FS SM T TK Spark Client FS Jo b FS T 🔏 Specs K HS2 SM T TK Session FS K 🔏 TaskT K K
  • 38. Page38 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Integration and tools
  • 39. Page39 © Hortonworks Inc. 2011 – 2015. All Rights Reserved LLAP in Ambari (and HDP) • Ambari 2.5 + HDP 2.6 = LLAP GA • The latest recommended update is HDP 2.6.1.0 • Do not use Tech Preview versions; use GA • Separate version of Hive, "Hive Interactive" • Ambari 3.0 – no more separate versions • Enable "Interactive Query" in Hive tab • A default configuration is chosen; more on that later • No Ambari? Will cover this later
  • 40. Page40 © Hortonworks Inc. 2011 – 2015. All Rights Reserved LLAP on the cloud • LLAP is in HDC (Hortonworks Data Cloud) • For quick, automated cluster deployments on AWS • Also available on Azure HDInsight • Details and links at the end!
  • 41. Page41 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Tez UI – queries and LLAP integration
  • 42. Page42 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Tez UI – query swimlane
  • 43. Page43 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Tez UI – DAG swimlane
  • 44. Page44 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Tez UI – LLAP counters
  • 45. Page45 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Tez UI – query information and debug data
  • 46. Page46 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Monitoring • LLAP exposes a UI for monitoring • Also has jmx endpoint with much more data, logs and jstack endpoints as usual • Aggregate monitoring UI is work in progress
  • 47. Page47 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Debugging • JMX view (:15002/jmx) contains many detailed metrics • E.g. "ExecutorsState" shows the running tasks and their state • Log lines for most operations are annotated with task attempt # • By default, stores separate log files per query • Logs can be downloaded (yarn logs ...) even for a running LLAP app • Log file contains the statements annotated for the query • File name contains session application ID and DAG# - "dag_TTTT_MMM_N" - e.g. dag_1490656001509_4602_1 for Tez AM application_1490656001509_4602
  • 48. Page48 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Running LLAP
  • 49. Page49 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Making the best use of LLAP - summary • Java 8, G1GC; some kernel configs, esp. for TCP connections • Ambari comes with reasonable Hive perf configuration • May still need tweaking for specific workload; more so w/o Ambari • SSD and text cache require "advanced config" until Ambari 3.0 • LLAP cluster sizing in a nutshell • AM per query (+ 1-2), AMs on all nodes; 2Gb RAM per AM, rest to LLAP • (Executor+IO thread) per core, 3-4Gb RAM per executor, rest to cache • Without Ambari, see hive --service llap command + Slider
  • 50. Page50 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Making the best use of LLAP – OS and Java • Java 8 strongly recommended • G1 GC recommended (e.g. --args " -XX:+UseG1GC -XX:+ResizeTLAB -XX:-ResizePLAB") • Kernel settings sysctl -w net.core.somaxconn=16384; echo "never" > /sys/kernel/mm/transparent_hugepage/enabled echo "never" > /sys/kernel/mm/transparent_hugepage/defrag /etc/init.d/nscd restart
  • 51. Page51 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Making the best use of LLAP – cluster sizing 1. Pick the number of parallel queries (not just sessions) • AM per query + some constant slack 2. Pick executors per node, and io.threadpool count (# of cores per node) 3. Determine total memory size for LLAP - subtract the AM(s) from each NM (YARN memory per node) • 1 AM = 2 Gb; best to spread the AMs across each node 4. Determine the Xmx for LLAP; one executor (core) == 3-4Gb 5. Determine cache size - from the total, take out Xmx + ~3Gb (if lower, 20%) • ~3Gb for Java overhead, shared overhead 6. Tweak based on workload?
  • 52. Page52 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Cluster sizing example • 6 node cluster – 100Gb RAM each, 24 cores, NM Size = 96Gb, 25 concurrent queries • 28 AMs; total AM memory = 28*2 = 56 • (96*6 – 28*2)/6 = 86.66 => "Memory per daemon" = 85Gb • "LLAP heap size" (Xmx) = 3Gb*24 = 72Gb • "Cache size" = 85Gb – 72Gb – 3Gb ~= 10Gb
  • 53. Page53 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Making the best use of LLAP – Hive settings • Ambari has a lot of this configured by default • The basics - use ORC; enable PPD, configure mapjoin size, etc. • Enable vectorization and consider new vectorization features • hive.vectorized.execution.*.enabled – see the configuration file documentation • Enable LLAP split locality - hive.llap.client.consistent.splits • hive.llap.task.scheduler.locality.delay to tweak strict/relaxed locality • Consider disabling CBO for interactive queries (test your queries!) • Use parallel compilation in HS2 - hive.driver.parallel.compilation • Shuffle improvement - tez.am.am-rm.heartbeat.interval-ms.max=5000
  • 54. Page54 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Making the best use of LLAP – new cache stuff • Text cache (not turned on in Ambari until 3.0!) • hive.llap.io.encode.enabled, hive.vectorized.use.(row|vector).serde.deserialize • SSD cache (not turned on in Ambari until 3.0!) • hive.llap.io.allocator.mmap; hive.llap.io.allocator.mmap.path • hive.llap.io.memory.size controls the total cache size (on disk) • You have to disable YARN memory check for now until YARN 2.8 • Cache on cloud FS • hive.orc.splits.allow.synthetic.fileid, hive.llap.cache.allow.synthetic.fileid
  • 55. Page55 © Hortonworks Inc. 2011 – 2015. All Rights Reserved LLAP without Ambari • Requires Hive 2.X (2.2-2.3 are coming soon); ZK, YARN, Slider • hive --service llap generates a slider package • Run this as the correct user – slider paths are user-specific! kinit on secure cluster • Specify a name, # of instances; memory, cache size, etc.; see --help • Generates run.sh to start the cluster (in --output) directory • Or use --startImmediately in newer versions • Queries can be run from HS2 and CLI; basic configuration: • hive.execution.mode=llap, hive.llap.execution.mode=all, hive.llap.io.enabled=true, hive.llap.daemon.service.hosts=@<cluster name>
  • 56. Page56 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Summary LLAP provides a • unified, managed and secure cloud-ready EDW solution for BI and ETL workloads via a • fast execution substrate harnessing Hive vectorized SQL engine and • efficient in-memory caching layer for columnar and text data
  • 57. Page57 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Try Hive LLAP Today on-prem or in the Cloud Hortonworks Data Platform 2.6 Powered by 100% open source Apache Hadoop http://hortonworks.com/downloads/ Hortonworks Data Cloud Easy HDP on Amazon Web Services http://hortonworks.com/products/cloud/aws/ Microsoft Azure HDInsight A cloud Spark and Hadoop service for your enterprise http://azure.microsoft.com/en-us/services/hdinsight/
  • 58. Page58 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Questions? ? Interested? Stop by the Hortonworks booth to learn more
  • 59. Page59 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Backup slides
  • 60. Page60 © Hortonworks Inc. 2011 – 2015. All Rights Reserved LLAP daemon SM TK LLAP security – internals with Tez HS2 SM TK Session ZK Paths w/hive ACLs TK Ranger Policies FS LLAP daemon SM TK Tez AM FS
  • 61. Page61 © Hortonworks Inc. 2011 – 2015. All Rights Reserved LLAP daemon SM T TK LLAP security – internals with Tez HS2 SM T TK SessionSession FS ZK Paths w/hive ACLs T TK Ranger Policies FS LLAP daemon SM T TK T🔐 FS T🔐 Tez AM T FS FS T🔐
  • 62. Page62 © Hortonworks Inc. 2011 – 2015. All Rights Reserved LLAP daemon Task Jo b FS SM T TK LLAP security – internals with Tez HS2 SM T TK SessionSession FS Tez AM T FS ZK Paths w/hive ACLs T TK Jo b Ranger Policies FS T Task T Task LLAP daemon Task SM T TK FSJo b FS
  • 63. Page63 © Hortonworks Inc. 2011 – 2015. All Rights Reserved LLAP daemon Task Jo b FS SM T TK LLAP security – internals with Tez HS2 SM T TK SessionSession FS Tez AM T FS ZK Paths w/hive ACLs T TK Jo b Ranger Policies FS Jo b F💓 LLAP daemon Task Jo b FS SM T TK FS Jo b F💓 FS Jo b T
  • 64. Page64 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Future work • Centralized workload management • Better tool support for cluster management, SSD cache, etc. • Improvements to Spark-LLAP integration • Parquet cache support • Memory management, cache improvements, faster cluster startup time, other performance features

Notes de l'éditeur

  1. 6 months note
  2. 6 months note
  3. Need to mention column level security again
  4. ----- Meeting Notes (17/1/11 10:19) ----- Knox HS2 #2 - what's the story?
  5. ----- Meeting Notes (17/1/11 10:19) ----- Knox HS2 #2 - what's the story?
  6. ----- Meeting Notes (17/1/11 10:19) ----- Knox HS2 #2 - what's the story?
  7. ----- Meeting Notes (17/1/11 10:19) ----- Knox HS2 #2 - what's the story?
  8. ----- Meeting Notes (17/1/11 10:19) ----- Knox HS2 #2 - what's the story?