© Hortonworks Inc. 2011 – 2015. All Rights Reserved
1. Apache Hive Present and Future
Yifeng Jiang
Solutions Engineer, Hortonworks, Inc.
July 23, 2015
2. About Me
蒋 燚峰 (Yifeng Jiang)
• Solutions Engineer, Hortonworks, Inc.
• HBase book author
• Hobbies: hiking, watching movies
3. Agenda
• Apache Hive Present
• How Hive Achieved 100x Performance
• Sub-second Response
4. Hadoop for the Enterprise: Implement a Modern Data Architecture with HDP
Customer Momentum
• 430+ customers (as of March 31, 2015)
• 105 customers added in Q1 2015
Hortonworks Data Platform
• Completely open multi-tenant platform for any app and any data.
• A centralized architecture of consistent enterprise services for resource management, security, operations, and governance.
Partner for Customer Success
• Open source community leadership focused on enterprise needs
• Unrivaled world-class support
• Founded in 2011
• Original 24 architects, developers, and operators of Hadoop from Yahoo!
• 600+ employees
• 1100+ ecosystem partners
Apache Project Committers and PMC Members
Project     Committers   PMC Members
Hadoop      27           21
Pig         5            5
Hive        18           6
Tez         16           15
HBase       6            4
Phoenix     4            4
Accumulo    2            2
Storm       3            2
Slider      11           11
Falcon      5            3
Flume       1            1
Sqoop       1            1
Ambari      36           28
Oozie       3            2
Zookeeper   2            1
Knox        13           3
Ranger      11           n/a
TOTAL       164          109
5. Hortonworks Data Platform (HDP) 2.2 Stack
Hive: SQL on Hadoop
6. Apache Hive Present
Transaction, Security, Performance
7. Apache Hive: SQL on Hadoop
• OSS data warehouse built on top of Hadoop
• First Apache Hive release was in 2009
• Initial goal was to write MapReduce jobs in SQL
– Most queries ran from minutes to hours
– Primarily used for batch processing
8. Hive – Single tool for all SQL use cases
[Diagram: Hive SQL covers ETL / ELT, Batch Reports / Deep Analytics, and Interactive Analytics over the data sources below]
• OLTP, ERP, CRM systems
• Unstructured documents, emails
• Clickstream
• Server logs
• Sentiment, web data
• Sensor, machine data
• Geolocation
9. Hive Scales to Any Workload
Hive at Facebook
• 100+ PB of data under management
• 15+ TB of data loaded daily
• 60,000+ Hive queries per day
• More than 1,000 users per day
10. Transactions
Insert, Update and Delete SQL Statements
11. Transaction Use Cases
Reporting with Analytics (YES)
• Reporting on data with occasional updates
• Corrections to the fact tables, evolving dimension tables
• Low-concurrency updates, low TPS
Operational (OLTP) Database (NO)
• Small transactions, each doing single-line inserts
• High concurrency: hundreds to thousands of connections
[Diagram labels: OLTP, Replication, Hive, Analytics, Modifications, High Concurrency]
12. Deep Dive: Transaction
Transaction Support in Hive with ACID semantics
• Hive native support for INSERT, UPDATE, DELETE.
• Split into phases:
– Phase 1: Hive Streaming Ingest (append) [Done]
– Phase 2: INSERT / UPDATE / DELETE support [Done]
– Phase 3: BEGIN / COMMIT / ROLLBACK Txn [Next]
[Diagram: how tasks read an ORCFile while edits are applied]
1. Original File: the task reads the latest read-optimized ORCFile.
2. Edits Made: the task reads the ORCFile and merges the delta file with the edits.
3. Edits Merged: the task reads the updated ORCFile.
The Hive ACID Compactor periodically merges the delta files in the background.
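As a minimal sketch of what Phase 2 looks like in practice, the statements below create a transactional table and modify it. The table `customer_dim` and its columns are hypothetical; the settings are the ones commonly required for ACID tables in Hive 0.14 and may differ in other releases. The compactor itself is enabled on the metastore side (for example `hive.compactor.initiator.on`), which is not shown here.

```sql
-- Session settings commonly required for ACID tables (Hive 0.14).
SET hive.support.concurrency = true;
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Hypothetical dimension table. ACID tables must be bucketed,
-- stored as ORC, and flagged as transactional.
CREATE TABLE customer_dim (
  id   INT,
  name STRING,
  city STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ("transactional" = "true");

-- Each statement writes a new delta file next to the base ORC files.
INSERT INTO TABLE customer_dim VALUES (1, 'Alice', 'Shanghai'), (2, 'Bob', 'Tokyo');
UPDATE customer_dim SET city = 'Beijing' WHERE id = 1;
DELETE FROM customer_dim WHERE id = 2;
```

This matches the "Reporting with Analytics" use case: occasional corrections to a dimension table, not high-rate single-row OLTP traffic.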
13. Hive Compaction
[Diagram: minor / major compaction of delta files and read-optimized ORCFiles]
• Minor Compaction: 10% local
• Major Compaction: 10% global
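Compaction normally runs automatically, but it can also be requested by hand. In the Hive releases this sketch assumes (0.13 and later), a minor compaction rewrites a set of delta files into a single delta file, and a major compaction rewrites the deltas together with the base into a new base file. The table name `customer_dim` is hypothetical and stands for any unpartitioned ACID table; a partitioned table would also need a `PARTITION (...)` clause.

```sql
-- Queue a minor compaction: merge the delta files into one delta file.
ALTER TABLE customer_dim COMPACT 'minor';

-- Queue a major compaction: merge base and deltas into a new base file.
ALTER TABLE customer_dim COMPACT 'major';

-- List queued, running, and recently finished compaction requests.
SHOW COMPACTIONS;
```

The `ALTER TABLE ... COMPACT` statement only enqueues the request; the work is done by the compactor in the background.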
15. Ranger: Central Security Administration
Apache Ranger
• Security dashboard
• Centralizes administration of security policy
• Ensures consistent coverage across the entire Hadoop stack
16. Set Up Authorization Policy (Hive)
[Screenshot callouts]
• File-level access control, flexible definition
• Control permissions
17. How Hive Achieved 100x Performance
ORC, Tez, CBO, Vectorization
18. Need for Speed: The Stinger Initiative
Stinger: An Open Roadmap to improve Apache Hive’s performance 100x.
Launched: February 2013; Delivered: April 2014.
Delivered in 100% Apache Open Source.
SQL Engine (Vectorized SQL Engine) + Columnar Storage (ORCFile) + Distributed Execution (Apache Tez) = 100X
19. TPC-DS Benchmark at 30 Terabyte Scale
Sample of 50 queries from TPC-DS at 30 terabyte scale.
Average 52x query speedup, maximum 160x query speedup.
Total benchmark time decreased from 7.8 days to 9.3 hours.
The Cost-Based Optimizer added in Hive 0.14 gave an additional 2.5x speedup.
20. ORC File Format
Columnar Storage for Hive
21. ORCFile – Columnar Storage for Hive
• Columns stored separately
• Knows types
– Uses type-specific encoders
– Stores statistics (min, max, sum, count)
• Has lightweight index
– Skips over blocks of rows that don’t matter
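To illustrate how the lightweight index is used, here is a hedged sketch. The table `visits_orc` is hypothetical. With predicate pushdown enabled, the ORC reader can compare the filter against the stored min/max statistics and skip row groups that cannot match; the benefit is largest when the data is sorted or clustered on the filtered column.

```sql
-- Let Hive push filter predicates down to the ORC reader.
SET hive.optimize.index.filter = true;

-- Hypothetical table; the index properties shown are the usual defaults.
CREATE TABLE visits_orc (
  visit_id INT,
  store_id SMALLINT,
  visit_ts TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES (
  "orc.compress"         = "ZLIB",
  "orc.create.index"     = "true",
  "orc.row.index.stride" = "10000"   -- rows per index entry
);

-- Row groups whose min/max range for visit_id falls outside
-- 1000..2000 can be skipped without being decompressed.
SELECT COUNT(*) FROM visits_orc WHERE visit_id BETWEEN 1000 AND 2000;
```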
22. ORCFile – Columnar Storage for Hive
Large block size is ideal for MapReduce.
Columnar format enables high compression and high performance.
23. ORCFile – Create Table
• Defined at table or partition level
• Configurable compression codec
create table Addresses (
name string,
street string,
city string,
state string,
zip int
) stored as orc tblproperties ("orc.compress"="ZLIB");
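The slide's example sets the format at table level. A sketch of the partition-level variant follows, using a hypothetical partitioned table `page_views`. Note that `SET FILEFORMAT` changes metadata only: files already in the partition are not rewritten, so it is meant for partitions that will be (re)loaded as ORC.

```sql
-- Hypothetical text table partitioned by day.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (ds STRING)
STORED AS TEXTFILE;

ALTER TABLE page_views ADD PARTITION (ds = '2015-07-23');

-- Declare that this one partition holds ORC data.
ALTER TABLE page_views PARTITION (ds = '2015-07-23') SET FILEFORMAT ORC;

-- Make ORC the format for partitions created from now on.
ALTER TABLE page_views SET FILEFORMAT ORC;
```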
24. ORCFile – Convert Text to ORC
• Always use ORC
• One SQL statement converts text to ORC
-- Create text and ORC tables (the text table is comma-delimited to match the CSV file)
CREATE TABLE test_details_txt (visit_id INT, store_id SMALLINT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
CREATE TABLE test_details_orc (visit_id INT, store_id SMALLINT) STORED AS ORC;
-- Load the local CSV file into the text table
LOAD DATA LOCAL INPATH '/home/user/test_details.csv' INTO TABLE test_details_txt;
-- Copy into the ORC table; Hive rewrites the rows in ORC format
INSERT OVERWRITE TABLE test_details_orc SELECT * FROM test_details_txt;
26. MR vs. Tez Example
Hive - MR: four separate jobs, with I/O synchronization barriers between them
• Job 1 (Join a & b)
• Job 2 (Group by of a Join b)
• Job 3 (Group by of c)
• Job 4 (Join of S & R)
Hive - Tez: a single job
• Join a & b
• Group by of a Join b
• Group by of c
• Join of S & R
27. Tez – Introduction
• Distributed execution framework for data-processing applications
– Targets applications (frameworks), not end users
– Hive on Tez, Pig on Tez, Cascading on Tez, …
• Lessons learned from MapReduce
– Significant performance improvement
– Batch and interactive
– Petabyte scale
• Runs on YARN
– Utilizes cluster resources
28. Tez – Switch from MapReduce
• One command to switch from MapReduce to Tez
set hive.execution.engine=tez;
SELECT * FROM my_table;
• Set Tez as default engine on Hadoop 2
$ vi hive-site.xml
<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>
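One way to see the difference between the two engines is to compare the plans for the same query. The sketch below assumes two hypothetical tables `a` and `b` that share a `key` column. On MapReduce a join followed by a group-by is typically planned as several stages, each a separate job; on Tez the plan is a single DAG of vertices connected by edges.

```sql
-- Plan on MapReduce: look for multiple map-reduce stages.
SET hive.execution.engine = mr;
EXPLAIN
SELECT a.key, COUNT(*) AS cnt
FROM a JOIN b ON a.key = b.key
GROUP BY a.key;

-- Plan on Tez: look for one Tez stage listing its vertices and edges.
SET hive.execution.engine = tez;
EXPLAIN
SELECT a.key, COUNT(*) AS cnt
FROM a JOIN b ON a.key = b.key
GROUP BY a.key;
```

The exact EXPLAIN text varies between Hive versions, so treat the stage names as something to inspect rather than a fixed output.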
29. Cost Based Optimizer
Making the SQL smarter
30. Cost Based Optimizer in Hive
The Cost-Based Optimizer (CBO) creates an optimized execution plan using Hive table statistics.
Why cost-based optimization?
• Simple to use – e.g., adjusts join order automatically
• Reduces the need for SQL tuning
• An optimized plan leads to better cluster utilization
31. Performance Improvement – Query 17
Scale = 30TB
Input records ~186M
CBO   Elapsed Time (sec)   Elapsed Time   Intermediate Data (GB)   Output and Intermediate Records
OFF   10,683               ~3 hrs         5,017                    135,647,792,123
ON    1,284                ~20 mins       275                      8,543,232,360
32. CBO – Enable CBO
• Enable CBO before submitting a query
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
• Refresh statistics
ANALYZE TABLE my_table COMPUTE STATISTICS FOR COLUMNS;
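Because the optimizer can only be as good as its statistics, a typical workflow gathers both table-level and column-level statistics for every table in a join and then inspects the plan. The sketch below assumes two hypothetical unpartitioned tables, `store_sales` (fact) and `store` (dimension), with TPC-DS style column names, and Hive 0.14 syntax.

```sql
-- Table-level statistics (row count, size) and column-level statistics.
ANALYZE TABLE store_sales COMPUTE STATISTICS;
ANALYZE TABLE store_sales COMPUTE STATISTICS FOR COLUMNS;
ANALYZE TABLE store COMPUTE STATISTICS;
ANALYZE TABLE store COMPUTE STATISTICS FOR COLUMNS;

-- Check what the optimizer will see for one join column.
DESCRIBE FORMATTED store_sales ss_store_sk;

-- Inspect the join order and operators chosen with CBO enabled.
SET hive.cbo.enable = true;
EXPLAIN
SELECT s.s_store_name, SUM(ss.ss_net_paid) AS total_paid
FROM store_sales ss
JOIN store s ON ss.ss_store_sk = s.s_store_sk
GROUP BY s.s_store_name;
```

Statistics go stale as data changes, so the ANALYZE statements are usually rerun after large loads.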
33. Vectorized Query Execution
Process 1024 Rows at a Time
34. Vectorization – Vectorized SQL Engine
• Feature:
– Processes a block of 1024 rows instead of one row at a time
– Leverages modern hardware architecture
• Benefit:
– Up to 3x faster for big queries
– Reduces CPU time, utilizes cluster resources
35. Vectorization – Enable Vectorization
• Enable vectorized SQL engine
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
• Supports ORC only
• A few data types and features are not supported
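Since unsupported types or functions make Hive fall back to row-at-a-time execution, it helps to confirm that a given query is actually vectorized. A minimal sketch, reusing the `test_details_orc` table from the ORC conversion slide: in the Hive releases this assumes, vectorized stages are marked with a line such as `Execution mode: vectorized` in the EXPLAIN output, though the exact wording may differ by version.

```sql
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled = true;

-- Simple scan, filter, and aggregation over an ORC table.
-- Look for the vectorized marker on the map-side stage of the plan.
EXPLAIN
SELECT store_id, COUNT(*) AS visits
FROM test_details_orc
WHERE visit_id > 100
GROUP BY store_id;
```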
36. Hive on Tez: Conclusion
Hive on Tez delivers fast batch and interactive SQL today. But users need more speed!
• Scale: Proven at petabyte scale.
• SQL: The most comprehensive open-source SQL on Hadoop.
• Speed: More than 90 Hortonworks customers use Hive-on-Tez today for fast SQL.
Hortonworks Customer Support metrics as of Feb/2015
37. Sub-second Query Response
Solving Hive’s Top Performance Challenges
38. Next Stop: Stinger.next and Sub-Second SQL
The emergence of LLAP and Hive-on-Spark brings sub-second response within reach.
39. Apache Hive: Modern Architecture
[Diagram: architecture layers; the legend marks each component as Historical, Current, or In Development]
• Storage: Columnar Storage (ORCFile, Parquet); Unstructured Data (JSON, CSV, Text, Avro, Custom, Weblog)
• Engine: SQL Engines (Row Engine, Vector Engine)
• SQL: SQL Support (SQL:2011, Optimizer, HCatalog, HiveServer2)
• Cache: Block Cache (Linux Cache)
• Distributed Execution: Hadoop 1 (MapReduce), Hadoop 2 (Tez)
40. Apache Hive: Modern Architecture
[Diagram: architecture layers with the LLAP components added; the legend marks each component as Historical, Current, or In Development]
• Storage: Columnar Storage (ORCFile, Parquet); Unstructured Data (JSON, CSV, Text, Avro, Custom, Weblog)
• Engine: SQL Engines (Row Engine, Vector Engine)
• SQL: SQL Support (SQL:2011, Optimizer, HCatalog, HiveServer2)
• Cache: Block Cache (Linux Cache), Vector Cache
• Distributed Execution: Hadoop 1 (MapReduce), Hadoop 2 (Tez)
• Persistent Server: LLAP
41. HBase Metastore: Why?
700+ metastore queries to create an execution plan!
42. LLAP: What
LLAP = Live Long And Process
LLAP processes run on multiple nodes, accelerating Tez tasks.
[Diagram: a Hive query is sent to several nodes, each running an LLAP process. Labels: query fragment; LLAP in-memory columnar cache; LLAP process running a read task for a query; HDFS]
43. LLAP: Why?
• LLAP is a node-resident daemon process
– Low latency by reducing setup cost
• LLAP has an in-memory columnar data cache
– Hot data sits in memory, not HDFS
– Stores data in columnar format for vectorized processing
• Uses YARN for resource management
– Utilizes cluster resources
[Diagram: a node with an LLAP process running a task for a query (query fragment), the LLAP in-memory columnar cache, and HDFS]
44. Hive Sub-second Response
Hive Metadata (fast, scalable metadata catalog) + Persistent Server (LLAP) + SQL Engine (vectorized hash join) + Choice of Execution Engines (Tez) = Sub-Second
45. Key Takeaways
Hive Present and Future
46. Hive Present and Future
• Hive is the de facto standard for SQL on Hadoop
– One tool, batch and interactive processing
– One tool, all big data SQL use cases: ETL, reporting, BI and analytics
• Hive keeps evolving
– SQL:2011 Analytics support
– Enhanced transactions
– Sub-second query response
47. Try Hive Today
• Try Hive's latest features today
– Hive on Tez
– ORC file format
– CBO
– Vectorization
• Just a few lines of configuration/SQL changes
• Stay tuned for Hive evolution
48. Thank you
Yifeng Jiang, Solutions Engineer, Hortonworks
@uprush
Editor's notes
(Hay fever... Those who have heard of it, please raise your hands.)
Hortonworks has a singular focus: enabling Apache Hadoop as an enterprise data platform for any app and any data type.
We were founded in 2011 by 24 developers from Yahoo, where Hadoop was conceived to address data challenges at internet scale. What we now know as Hadoop really started in 2005, when a team at Yahoo was directed to build out a large-scale data storage and processing technology that would allow them to improve their most critical application, Search.
Their challenge was essentially two-fold. First they needed to capture and archive the contents of the internet, and then process the data so that users could search through it effectively and efficiently. Clearly, traditional approaches were impractical both technically (due to the size of the data) and commercially (due to the cost). The result was the Apache Hadoop project, which delivered large-scale storage (HDFS) and processing (MapReduce).
Today we have over 600 employees and have partnered with over 1000 companies who are the leaders in the data center.
We have also been very fortunate to achieve very significant customer adoption, with over 330 customers as of the end of 2014, spanning nearly every vertical.
Hortonworks was founded with the sole intent to make Hadoop an enterprise data platform. With YARN as its foundation, HDP delivers a centralized architecture with true multi-tenancy for data processing and shared services for Security, Governance and Operations to satisfy enterprise requirements, all deeply integrated and certified with leading datacenter technologies.
We are uniquely focused on this transformation of Hadoop and do our work completely in open source. This is all predicated on our leadership in the community, which enables us not only to best support users of Hadoop but also to uniquely represent customer requirements within this open, thriving community.
(With my Japanese ability...)
Vectorized query execution improves the performance of operations like scans, aggregations, filters and joins by performing them in batches of 1024 rows at once instead of a single row each time.
LLAP: Persistent servers cache vectors and start queries instantly. Pluggable integrations with Tez.
Vectorized Hash Join solves CPU boundedness for Hive on Tez.
Improved metadata catalog allows instant query planning and optimization for any engine.
• LLAP is a node-resident daemon process
– Low latency by reducing setup cost
– Multi-threaded engine that runs smaller tasks for a query, including reads, filters and some joins
– Uses regular Tez tasks for larger shuffles and other operators
• LLAP has an in-memory columnar data cache
– High-throughput I/O using an async I/O elevator with a dedicated thread and core per disk
– Low latency by providing data from the in-memory (off-heap) cache instead of going to HDFS
– Stores data in columnar format for vectorization irrespective of the underlying file type
– Security enforced across queries and users
• Uses YARN for resource management